pyclustering.cluster.silhouette.silhouette Class Reference

Represents the Silhouette method, which is used for interpretation and validation of consistency within clusters of data.

Public Member Functions

def __init__ (self, data, clusters, **kwargs)
 Initializes Silhouette method for analysis.
 
def process (self)
 Calculates Silhouette score for each object from input data.
 
def get_score (self)
 Returns Silhouette score for each object from input data.
 

Detailed Description

Represents the Silhouette method, which is used for interpretation and validation of consistency within clusters of data.

The silhouette value measures how similar an object is to its own cluster compared to other clusters. Be aware that the Silhouette method is applicable to the K algorithm family, such as K-Means, K-Medians, K-Medoids, X-Means, etc., and is not applicable to DBSCAN, OPTICS, CURE, etc. The Silhouette value is calculated using the following formula:

\[s\left ( i \right )=\frac{ b\left ( i \right ) - a\left ( i \right ) }{ \max\left \{ a\left ( i \right ), b\left ( i \right ) \right \}}\]

where $a\left ( i \right )$ is the average distance from object i to the objects in its own cluster, and $b\left ( i \right )$ is the average distance from object i to the objects in the nearest other cluster (the smallest such average among all other clusters).
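
To make the formula concrete, here is a minimal pure-Python sketch that evaluates s(i) for a tiny hand-made data set. It is not the library's implementation: the helper names (euclidean, silhouette_scores) are hypothetical, plain Euclidean distance is an assumption, a(i) is averaged over the other members of the cluster as in the standard definition, and points in single-member clusters are given a score of 0 by convention.

```python
import math


def euclidean(p, q):
    """Plain Euclidean distance between two points (an assumption of this sketch)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def silhouette_scores(data, clusters, distance=euclidean):
    """Return s(i) for every point, ordered by point index.

    data     -- list of points
    clusters -- list of clusters, each a list of indices into data
                (at least two non-empty clusters are required)
    """
    scores = [0.0] * len(data)
    for own_index, own in enumerate(clusters):
        for i in own:
            if len(own) == 1:
                # Convention used here: a lone point gets a score of 0.
                continue
            # a(i): average distance to the other members of the own cluster.
            a = sum(distance(data[i], data[j]) for j in own if j != i) / (len(own) - 1)
            # b(i): smallest average distance to the members of any other cluster.
            b = min(
                sum(distance(data[i], data[j]) for j in other) / len(other)
                for other_index, other in enumerate(clusters)
                if other_index != own_index and other
            )
            denominator = max(a, b)
            scores[i] = (b - a) / denominator if denominator > 0 else 0.0
    return scores


# Two well-separated pairs of points: every score is close to 1.
data = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
clusters = [[0, 1], [2, 3]]
scores = silhouette_scores(data, clusters)
```

For this symmetric layout a(i) = 1 for every point and b(i) is roughly 10.02, so each score comes out at about 0.90.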

Here is an example where the Silhouette score is calculated for a K-Means clustering result:

from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.cluster.kmeans import kmeans
from pyclustering.cluster.silhouette import silhouette
from pyclustering.samples.definitions import SIMPLE_SAMPLES
from pyclustering.utils import read_sample

# Read data 'SampleSimple3' from Simple Sample collection.
sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3)

# Prepare initial centers.
centers = kmeans_plusplus_initializer(sample, 4).initialize()

# Perform cluster analysis.
kmeans_instance = kmeans(sample, centers)
kmeans_instance.process()
clusters = kmeans_instance.get_clusters()

# Calculate Silhouette score.
score = silhouette(sample, clusters).process().get_score()

Let's cluster the same sample with the K-Means algorithm using different K values (2, 4, 6 and 8) and evaluate the clustering results using the Silhouette method.

from pyclustering.cluster.kmeans import kmeans
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.cluster.silhouette import silhouette
from pyclustering.samples.definitions import SIMPLE_SAMPLES
from pyclustering.utils import read_sample
import matplotlib.pyplot as plt


def get_score(sample, amount_clusters):
    # Prepare initial centers for K-Means algorithm.
    centers = kmeans_plusplus_initializer(sample, amount_clusters).initialize()
    # Perform cluster analysis.
    kmeans_instance = kmeans(sample, centers)
    kmeans_instance.process()
    clusters = kmeans_instance.get_clusters()
    # Calculate Silhouette score.
    return silhouette(sample, clusters).process().get_score()


def draw_score(figure, position, title, score):
    ax = figure.add_subplot(position)
    ax.bar(range(0, len(score)), score, width=0.7)
    ax.set_title(title)
    ax.set_xlim(0, len(score))
    ax.set_xticklabels([])
    ax.grid()


# Read data 'SampleSimple3' from Simple Sample collection.
sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3)

# Perform cluster analysis and estimation by Silhouette.
score_2 = get_score(sample, 2)  # K = 2 (amount of clusters).
score_4 = get_score(sample, 4)  # K = 4 - optimal.
score_6 = get_score(sample, 6)  # K = 6.
score_8 = get_score(sample, 8)  # K = 8.

# Visualize results.
figure = plt.figure()

# Visualize each result separately.
draw_score(figure, 221, 'K = 2', score_2)
draw_score(figure, 222, 'K = 4 (optimal)', score_4)
draw_score(figure, 223, 'K = 6', score_6)
draw_score(figure, 224, 'K = 8', score_8)

# Show a plot with visualized results.
plt.show()

The results produced by the Silhouette method are visualized below. According to the Silhouette method, K = 4 is the optimal number of clusters: the score of each point is close to 1.0, and the average score for K = 4 is the largest among the tested values of K.
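
The selection rule described above (prefer the K with the largest average score) can be sketched in a few lines. The helper names mean_score and best_k are hypothetical, and the per-point scores are made-up illustrative numbers, not output of the library; in practice each list would come from get_score() for the corresponding K.

```python
def mean_score(scores):
    """Average Silhouette score over all points."""
    return sum(scores) / len(scores)


def best_k(scores_by_k):
    """Return the K whose per-point scores have the largest average."""
    return max(scores_by_k, key=lambda k: mean_score(scores_by_k[k]))


# Illustrative values only (assumed for this sketch, not measured results).
scores_by_k = {
    2: [0.55, 0.60, 0.50, 0.58],
    4: [0.91, 0.88, 0.93, 0.90],
    6: [0.40, 0.75, 0.35, 0.70],
}

chosen = best_k(scores_by_k)
```

With these numbers the averages are highest for K = 4, so chosen is 4. Comparing averages is only one possible criterion; inspecting the per-point bars, as in the figure, also reveals individual points that fit their cluster poorly.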

Fig. 1. Silhouette scores for various K.

See also
kmeans, kmedoids, kmedians, xmeans, elbow

Definition at line 45 of file silhouette.py.

Constructor & Destructor Documentation

◆ __init__()

def pyclustering.cluster.silhouette.silhouette.__init__ (self, data, clusters, **kwargs)

Initializes Silhouette method for analysis.

Parameters
[in] data (array_like): Input data that was used for cluster analysis, presented as a list of points or as a distance matrix (defined by parameter 'data_type'; by default data is considered a list of points).
[in] clusters (list): Clusters that have been obtained after cluster analysis.
[in] **kwargs: Arbitrary keyword arguments (available arguments: 'metric', 'data_type', 'ccore').

Keyword Args:

  • metric (distance_metric): Metric that was used for cluster analysis and should be used for Silhouette score calculation (by default squared Euclidean distance).
  • data_type (string): Data type of input sample 'data' that is processed by the algorithm ('points', 'distance_matrix').
  • ccore (bool): If True then CCORE (C++ implementation of pyclustering library) is used (by default True).
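
As an illustration of the 'distance_matrix' data type, the sketch below builds a full pairwise distance matrix in pure Python. The helper names squared_euclidean and build_distance_matrix are hypothetical and not part of pyclustering; squared Euclidean distance is chosen only to mirror the documented default metric.

```python
def squared_euclidean(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def build_distance_matrix(points, distance=squared_euclidean):
    """Return a symmetric matrix where entry [i][j] is the distance
    between points[i] and points[j]."""
    return [[distance(p, q) for q in points] for p in points]


points = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
matrix = build_distance_matrix(points)
```

Based on the parameter description above, such a matrix would be passed as the 'data' argument together with data_type='distance_matrix', while 'clusters' would still hold the cluster assignments obtained from cluster analysis.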

Definition at line 144 of file silhouette.py.

Member Function Documentation

◆ get_score()

def pyclustering.cluster.silhouette.silhouette.get_score (self)

Returns Silhouette score for each object from input data.

See also
process

Definition at line 217 of file silhouette.py.

◆ process()

def pyclustering.cluster.silhouette.silhouette.process (self)

Calculates Silhouette score for each object from input data.

Returns
(silhouette) Instance of the method (self).

Definition at line 183 of file silhouette.py.


The documentation for this class was generated from the following file: