3 @brief Silhouette - method of interpretation and validation of consistency. 4 @details Implementation based on paper @cite article::cluster::silhouette::1. 6 @authors Andrei Novikov (pyclustering@yandex.ru) 8 @copyright GNU Public License 10 @cond GNU_PUBLIC_LICENSE 11 PyClustering is free software: you can redistribute it and/or modify 12 it under the terms of the GNU General Public License as published by 13 the Free Software Foundation, either version 3 of the License, or 14 (at your option) any later version. 16 PyClustering is distributed in the hope that it will be useful, 17 but WITHOUT ANY WARRANTY; without even the implied warranty of 18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 19 GNU General Public License for more details. 21 You should have received a copy of the GNU General Public License 22 along with this program. If not, see <http://www.gnu.org/licenses/>. 28 from enum
import IntEnum
39 from pyclustering.core.wrapper
import ccore_library
40 from pyclustering.core.metric_wrapper
import metric_wrapper
42 import pyclustering.core.silhouette_wrapper
as wrapper
47 @brief Represents Silhouette method that is used interpretation and validation of consistency. 48 @details The silhouette value is a measure of how similar an object is to its own cluster compared to other clusters. 49 Be aware that silhouette method is applicable for K algorithm family, such as K-Means, K-Medians, 50 K-Medoids, X-Means, etc., not not applicable for DBSCAN, OPTICS, CURE, etc. The Silhouette value is 51 calculated using following formula: 52 \f[s\left ( i \right )=\frac{ b\left ( i \right ) - a\left ( i \right ) }{ max\left \{ a\left ( i \right ), b\left ( i \right ) \right \}}\f] 53 where \f$a\left ( i \right )\f$ - is average distance from object i to objects in its own cluster, 54 \f$b\left ( i \right )\f$ - is average distance from object i to objects in the nearest cluster (the appropriate among other clusters). 56 Here is an example where Silhouette score is calculated for K-Means's clustering result: 58 from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer 59 from pyclustering.cluster.kmeans import kmeans 60 from pyclustering.cluster.silhouette import silhouette 62 from pyclustering.samples.definitions import SIMPLE_SAMPLES 63 from pyclustering.utils import read_sample 65 # Read data 'SampleSimple3' from Simple Sample collection. 66 sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3) 68 # Prepare initial centers 69 centers = kmeans_plusplus_initializer(sample, 4).initialize() 71 # Perform cluster analysis 72 kmeans_instance = kmeans(sample, centers) 73 kmeans_instance.process() 74 clusters = kmeans_instance.get_clusters() 76 # Calculate Silhouette score 77 score = silhouette(sample, clusters).process().get_score() 80 Let's perform clustering of the same sample by K-Means algorithm using different `K` values (2, 4, 6 and 8) and 81 estimate clustering results using Silhouette method. 83 from pyclustering.cluster.kmeans import kmeans 84 from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer 85 from pyclustering.cluster.silhouette import silhouette 87 from pyclustering.samples.definitions import SIMPLE_SAMPLES 88 from pyclustering.utils import read_sample 90 import matplotlib.pyplot as plt 92 def get_score(sample, amount_clusters): 93 # Prepare initial centers for K-Means algorithm. 94 centers = kmeans_plusplus_initializer(sample, amount_clusters).initialize() 96 # Perform cluster analysis. 97 kmeans_instance = kmeans(sample, centers) 98 kmeans_instance.process() 99 clusters = kmeans_instance.get_clusters() 101 # Calculate Silhouette score. 102 return silhouette(sample, clusters).process().get_score() 104 def draw_score(figure, position, title, score): 105 ax = figure.add_subplot(position) 106 ax.bar(range(0, len(score)), score, width=0.7) 108 ax.set_xlim(0, len(score)) 109 ax.set_xticklabels([]) 112 # Read data 'SampleSimple3' from Simple Sample collection. 113 sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3) 115 # Perform cluster analysis and estimation by Silhouette. 116 score_2 = get_score(sample, 2) # K = 2 (amount of clusters). 117 score_4 = get_score(sample, 4) # K = 4 - optimal. 118 score_6 = get_score(sample, 6) # K = 6. 119 score_8 = get_score(sample, 8) # K = 8. 122 figure = plt.figure() 124 # Visualize each result separately. 125 draw_score(figure, 221, 'K = 2', score_2) 126 draw_score(figure, 222, 'K = 4 (optimal)', score_4) 127 draw_score(figure, 223, 'K = 6', score_6) 128 draw_score(figure, 224, 'K = 8', score_8) 130 # Show a plot with visualized results. 134 There is visualized results that were done by Silhouette method. `K = 4` is the optimal amount of clusters in line 135 with Silhouette method because the score for each point is close to `1.0` and the average score for `K = 4` is 136 biggest value among others `K`. 138 @image html silhouette_score_for_various_K.png "Fig. 1. Silhouette scores for various K." 140 @see kmeans, kmedoids, kmedians, xmeans, elbow 146 @brief Initializes Silhouette method for analysis. 148 @param[in] data (array_like): Input data that was used for cluster analysis and that is presented as list of 149 points or distance matrix (defined by parameter 'data_type', by default data is considered as a list 151 @param[in] clusters (list): Clusters that have been obtained after cluster analysis. 152 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'metric'). 154 <b>Keyword Args:</b><br> 155 - metric (distance_metric): Metric that was used for cluster analysis and should be used for Silhouette 156 score calculation (by default Square Euclidean distance). 157 - data_type (string): Data type of input sample 'data' that is processed by the algorithm ('points', 'distance_matrix'). 158 - ccore (bool): If True then CCORE (C++ implementation of pyclustering library) is used (by default True). 164 self.
__data_type = kwargs.get(
'data_type',
'points')
166 if self.
__metric.get_type() != type_metric.USER_DEFINED:
171 self.
__score = [0.0] * len(data)
173 self.
__ccore = kwargs.get(
'ccore',
True)
and self.
__metric.get_type() != type_metric.USER_DEFINED
175 self.
__ccore = ccore_library.workable()
178 self.
__data = numpy.array(data)
185 @brief Calculates Silhouette score for each object from input data. 187 @return (silhouette) Instance of the method (self). 198 def __process_by_ccore(self):
200 @brief Performs processing using CCORE (C/C++ part of pyclustering library). 203 ccore_metric = metric_wrapper.create_instance(self.
__metric)
207 def __process_by_python(self):
209 @brief Performs processing using python code. 212 for index_cluster
in range(len(self.
__clusters)):
213 for index_point
in self.
__clusters[index_cluster]:
219 @brief Returns Silhouette score for each object from input data. 227 def __calculate_score(self, index_point, index_cluster):
229 @brief Calculates Silhouette score for the specific object defined by index_point. 231 @param[in] index_point (uint): Index point from input data for which Silhouette score should be calculated. 232 @param[in] index_cluster (uint): Index cluster to which the point belongs to. 234 @return (float) Silhouette score for the object. 240 difference = self.
__data[index_point]
245 return (b_score - a_score) / max(a_score, b_score)
248 def __calculate_within_cluster_score(self, index_cluster, difference):
250 @brief Calculates 'A' score for the specific object in cluster to which it belongs to. 252 @param[in] index_point (uint): Index point from input data for which 'A' score should be calculated. 253 @param[in] index_cluster (uint): Index cluster to which the point is belong to. 255 @return (float) 'A' score for the object. 262 return score / (len(self.
__clusters[index_cluster]) - 1)
265 def __calculate_cluster_score(self, index_cluster, difference):
267 @brief Calculates 'B*' score for the specific object for specific cluster. 269 @param[in] index_point (uint): Index point from input data for which 'B*' score should be calculated. 270 @param[in] index_cluster (uint): Index cluster to which the point is belong to. 272 @return (float) 'B*' score for the object for specific cluster. 277 return score / len(self.
__clusters[index_cluster])
280 def __caclulate_optimal_neighbor_cluster_score(self, index_cluster, difference):
282 @brief Calculates 'B' score for the specific object for the nearest cluster. 284 @param[in] index_point (uint): Index point from input data for which 'B' score should be calculated. 285 @param[in] index_cluster (uint): Index cluster to which the point is belong to. 287 @return (float) 'B' score for the object. 291 optimal_score = float(
'inf')
292 for index_neighbor_cluster
in range(len(self.
__clusters)):
293 if index_cluster != index_neighbor_cluster:
295 if candidate_score < optimal_score:
296 optimal_score = candidate_score
298 if optimal_score == float(
'inf'):
304 def __calculate_cluster_difference(self, index_cluster, difference):
306 @brief Calculates distance from each object in specified cluster to specified object. 308 @param[in] index_point (uint): Index point for which difference is calculated. 310 @return (list) Distance from specified object to each object from input data in specified cluster. 313 cluster_difference = 0.0
314 for index_point
in self.
__clusters[index_cluster]:
315 cluster_difference += difference[index_point]
317 return cluster_difference
320 def __calculate_dataset_difference(self, index_point):
322 @brief Calculate distance from each object to specified object. 324 @param[in] index_point (uint): Index point for which difference with other points is calculated. 326 @return (list) Distance to each object from input data from the specified. 330 if self.
__metric.get_type() != type_metric.USER_DEFINED:
335 return dataset_differences
338 def __verify_arguments(self):
340 @brief Verify input parameters for the algorithm and throw exception in case of incorrectness. 344 raise ValueError(
"Input data is empty (size: '%d')." % len(self.
__data))
347 raise ValueError(
"Input clusters are empty (size: '%d')." % len(self.
__clusters))
353 @brief Defines algorithms that can be used to find optimal number of cluster using Silhouette method. 355 @see silhouette_ksearch 370 @brief Returns algorithm type that corresponds to specified enumeration value. 372 @return (type) Algorithm type for cluster analysis. 375 if self == silhouette_ksearch_type.KMEANS:
377 elif self == silhouette_ksearch_type.KMEDIANS:
379 elif self == silhouette_ksearch_type.KMEDOIDS:
388 @brief Represent algorithm for searching optimal number of clusters using specified K-algorithm (K-Means, 389 K-Medians, K-Medoids) that is based on Silhouette method. 391 @details This algorithm uses average value of scores for estimation and applicable for clusters that are well 392 separated. Here is an example where clusters are well separated (sample 'Hepta'): 394 from pyclustering.cluster import cluster_visualizer 395 from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer 396 from pyclustering.cluster.kmeans import kmeans 397 from pyclustering.cluster.silhouette import silhouette_ksearch_type, silhouette_ksearch 398 from pyclustering.samples.definitions import FCPS_SAMPLES 399 from pyclustering.utils import read_sample 401 sample = read_sample(FCPS_SAMPLES.SAMPLE_HEPTA) 402 search_instance = silhouette_ksearch(sample, 2, 10, algorithm=silhouette_ksearch_type.KMEANS).process() 404 amount = search_instance.get_amount() 405 scores = search_instance.get_scores() 407 print("Scores: '%s'" % str(scores)) 409 initial_centers = kmeans_plusplus_initializer(sample, amount).initialize() 410 kmeans_instance = kmeans(sample, initial_centers).process() 412 clusters = kmeans_instance.get_clusters() 414 visualizer = cluster_visualizer() 415 visualizer.append_clusters(clusters, sample) 419 Obtained Silhouette scores for each K: 421 Scores: '{2: 0.418434, 3: 0.450906, 4: 0.534709, 5: 0.689970, 6: 0.588460, 7: 0.882674, 8: 0.804725, 9: 0.780189}' 424 K = 7 has the bigger average Silhouette score and it means that it is optimal amount of clusters: 425 @image html silhouette_ksearch_hepta.png "Silhouette ksearch's analysis with further K-Means clustering (sample 'Hepta')." 427 @see silhouette_ksearch_type 433 @brief Initialize Silhouette search algorithm to find out optimal amount of clusters. 435 @param[in] data (array_like): Input data that is used for searching optimal amount of clusters. 436 @param[in] kmin (uint): Minimum amount of clusters that might be allocated. Should be equal or greater than `2`. 437 @param[in] kmax (uint): Maximum amount of clusters that might be allocated. Should be equal or less than amount 438 of points in input data. 439 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: `algorithm`, `random_state`). 441 <b>Keyword Args:</b><br> 442 - algorithm (silhouette_ksearch_type): Defines algorithm that is used for searching optimal number of 443 clusters (by default K-Means). 444 - ccore (bool): If True then CCORE (C++ implementation of pyclustering library) is used (by default True). 451 self.
__algorithm = kwargs.get(
'algorithm', silhouette_ksearch_type.KMEANS)
461 self.
__ccore = kwargs.get(
'ccore',
True)
463 self.
__ccore = ccore_library.workable()
468 @brief Performs analysis to find optimal amount of clusters. 470 @see get_amount, get_score, get_scores 472 @return (silhouette_search) Itself instance (silhouette_search) 483 def __process_by_ccore(self):
485 @brief Performs processing using CCORE (C/C++ part of pyclustering library). 493 scores_list = results[2]
495 for i
in range(len(scores_list)):
499 def __process_by_python(self):
501 @brief Performs processing using python code. 508 if len(clusters) != k:
514 self.
__scores[k] = sum(score) / len(score)
523 @brief Returns optimal amount of clusters that has been found during analysis. 525 @return (uint) Optimal amount of clusters. 535 @brief Returns silhouette score that belongs to optimal amount of clusters (k). 537 @return (float) Score that belong to optimal amount of clusters. 539 @see process, get_scores 547 @brief Returns silhouette score for each K value (amount of clusters). 549 @return (dict) Silhouette score for each K value, where key is a K value and value is a silhouette score. 551 @see process, get_score 557 def __calculate_clusters(self, k):
559 @brief Performs cluster analysis using specified K value. 561 @param[in] k (uint): Amount of clusters that should be allocated. 563 @return (array_like) Allocated clusters. 568 return algorithm_type(self.
__data, initial_values).
process().get_clusters()
571 def __verify_arguments(self):
573 @brief Checks algorithm's arguments and if some of them is incorrect then exception is thrown. 577 raise ValueError(
"K max value '" + str(self.
__kmax) +
"' is bigger than amount of objects '" +
578 str(len(self.
__data)) +
"' in input data.")
581 raise ValueError(
"K min value '" + str(self.
__kmin) +
"' should be greater than 1 (impossible to provide " 582 "silhouette score for only one cluster).")
The module contains K-Means algorithm and other related services.
def __caclulate_optimal_neighbor_cluster_score(self, index_cluster, difference)
Calculates 'B' score for the specific object for the nearest cluster.
def __process_by_ccore(self)
Performs processing using CCORE (C/C++ part of pyclustering library).
Module provides various distance metrics - abstraction of the notion of distance in a metric space...
def process(self)
Calculates Silhouette score for each object from input data.
def __calculate_score(self, index_point, index_cluster)
Calculates Silhouette score for the specific object defined by index_point.
def process(self)
Performs analysis to find optimal amount of clusters.
def get_score(self)
Returns Silhouette score for each object from input data.
Distance metric performs distance calculation between two points in line with encapsulated function...
K-Means++ is an algorithm for choosing the initial centers for algorithms like K-Means or X-Means...
def __calculate_cluster_difference(self, index_cluster, difference)
Calculates distance from each object in specified cluster to specified object.
def __init__(self, data, clusters, kwargs)
Initializes Silhouette method for analysis.
def __process_by_python(self)
Performs processing using python code.
def __init__(self, data, kmin, kmax, kwargs)
Initialize Silhouette search algorithm to find out optimal amount of clusters.
def __calculate_within_cluster_score(self, index_cluster, difference)
Calculates 'A' score for the specific object in cluster to which it belongs to.
def __process_by_ccore(self)
Performs processing using CCORE (C/C++ part of pyclustering library).
def get_type(self)
Returns algorithm type that corresponds to specified enumeration value.
Defines algorithms that can be used to find optimal number of cluster using Silhouette method...
Collection of center initializers for algorithm that uses initial centers, for example, for K-Means or X-Means.
def __calculate_clusters(self, k)
Performs cluster analysis using specified K value.
def __process_by_python(self)
Performs processing using python code.
def get_scores(self)
Returns silhouette score for each K value (amount of clusters).
def __verify_arguments(self)
Checks algorithm's arguments and if some of them is incorrect then exception is thrown.
Represents Silhouette method that is used interpretation and validation of consistency.
def get_score(self)
Returns silhouette score that belongs to optimal amount of clusters (k).
Represent algorithm for searching optimal number of clusters using specified K-algorithm (K-Means...
def __verify_arguments(self)
Verify input parameters for the algorithm and throw exception in case of incorrectness.
Cluster analysis algorithm: K-Medoids.
def __calculate_dataset_difference(self, index_point)
Calculate distance from each object to specified object.
def __calculate_cluster_score(self, index_cluster, difference)
Calculates 'B*' score for the specific object for specific cluster.
def get_amount(self)
Returns optimal amount of clusters that has been found during analysis.