3 @brief Silhouette - method of interpretation and validation of consistency. 4 @details Implementation based on paper @cite article::cluster::silhouette::1. 6 @authors Andrei Novikov (pyclustering@yandex.ru) 8 @copyright GNU Public License 10 @cond GNU_PUBLIC_LICENSE 11 PyClustering is free software: you can redistribute it and/or modify 12 it under the terms of the GNU General Public License as published by 13 the Free Software Foundation, either version 3 of the License, or 14 (at your option) any later version. 16 PyClustering is distributed in the hope that it will be useful, 17 but WITHOUT ANY WARRANTY; without even the implied warranty of 18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 19 GNU General Public License for more details. 21 You should have received a copy of the GNU General Public License 22 along with this program. If not, see <http://www.gnu.org/licenses/>. 28 from enum
import IntEnum
39 from pyclustering.core.wrapper
import ccore_library
40 from pyclustering.core.metric_wrapper
import metric_wrapper
42 import pyclustering.core.silhouette_wrapper
as wrapper
47 @brief Represents Silhouette method that is used interpretation and validation of consistency. 48 @details The silhouette value is a measure of how similar an object is to its own cluster compared to other clusters. 49 Be aware that silhouette method is applicable for K algorithm family, such as K-Means, K-Medians, 50 K-Medoids, X-Means, etc., not not applicable for DBSCAN, OPTICS, CURE, etc. The Silhouette value is 51 calculated using following formula: 52 \f[s\left ( i \right )=\frac{ b\left ( i \right ) - a\left ( i \right ) }{ max\left \{ a\left ( i \right ), b\left ( i \right ) \right \}}\f] 53 where \f$a\left ( i \right )\f$ - is average distance from object i to objects in its own cluster, 54 \f$b\left ( i \right )\f$ - is average distance from object i to objects in the nearest cluster (the appropriate among other clusters). 56 Here is an example where Silhouette score is calculated for K-Means's clustering result: 58 from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer 59 from pyclustering.cluster.kmeans import kmeans 60 from pyclustering.cluster.silhouette import silhouette 62 from pyclustering.samples.definitions import SIMPLE_SAMPLES 63 from pyclustering.utils import read_sample 65 # Read data 'SampleSimple3' from Simple Sample collection. 66 sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3); 68 # Prepare initial centers 69 centers = kmeans_plusplus_initializer(sample, 4).initialize(); 71 # Perform cluster analysis 72 kmeans_instance = kmeans(sample, centers); 73 kmeans_instance.process(); 74 clusters = kmeans_instance.get_clusters(); 76 # Calculate Silhouette score 77 score = silhouette(sample, clusters).process().get_score() 80 @see kmeans, kmedoids, kmedians, xmeans, elbow 86 @brief Initializes Silhouette method for analysis. 88 @param[in] data (array_like): Input data that was used for cluster analysis. 89 @param[in] clusters (list): Cluster that have been obtained after cluster analysis. 90 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'metric'). 92 <b>Keyword Args:</b><br> 93 - metric (distance_metric): Metric that was used for cluster analysis and should be used for Silhouette 94 score calculation (by default Square Euclidean distance). 95 - ccore (bool): If True then CCORE (C++ implementation of pyclustering library) is used (by default True). 102 if self.
__metric.get_type() != type_metric.USER_DEFINED:
107 self.
__score = [0.0] * len(data)
109 self.
__ccore = kwargs.get(
'ccore',
True)
and self.
__metric.get_type() != type_metric.USER_DEFINED
111 self.
__ccore = ccore_library.workable()
114 self.
__data = numpy.array(data)
119 @brief Calculates Silhouette score for each object from input data. 121 @return (silhouette) Instance of the method (self). 132 def __process_by_ccore(self):
134 @brief Performs processing using CCORE (C/C++ part of pyclustering library). 137 ccore_metric = metric_wrapper.create_instance(self.
__metric)
141 def __process_by_python(self):
143 @brief Performs processing using python code. 146 for index_cluster
in range(len(self.
__clusters)):
147 for index_point
in self.
__clusters[index_cluster]:
153 @brief Returns Silhouette score for each object from input data. 161 def __calculate_score(self, index_point, index_cluster):
163 @brief Calculates Silhouette score for the specific object defined by index_point. 165 @param[in] index_point (uint): Index point from input data for which Silhouette score should be calculated. 166 @param[in] index_cluster (uint): Index cluster to which the point belongs to. 168 @return (float) Silhouette score for the object. 176 return (b_score - a_score) / max(a_score, b_score)
179 def __calculate_within_cluster_score(self, index_cluster, difference):
181 @brief Calculates 'A' score for the specific object in cluster to which it belongs to. 183 @param[in] index_point (uint): Index point from input data for which 'A' score should be calculated. 184 @param[in] index_cluster (uint): Index cluster to which the point is belong to. 186 @return (float) 'A' score for the object. 193 return score / (len(self.
__clusters[index_cluster]) - 1)
196 def __calculate_cluster_score(self, index_cluster, difference):
198 @brief Calculates 'B*' score for the specific object for specific cluster. 200 @param[in] index_point (uint): Index point from input data for which 'B*' score should be calculated. 201 @param[in] index_cluster (uint): Index cluster to which the point is belong to. 203 @return (float) 'B*' score for the object for specific cluster. 208 return score / len(self.
__clusters[index_cluster])
211 def __caclulate_optimal_neighbor_cluster_score(self, index_cluster, difference):
213 @brief Calculates 'B' score for the specific object for the nearest cluster. 215 @param[in] index_point (uint): Index point from input data for which 'B' score should be calculated. 216 @param[in] index_cluster (uint): Index cluster to which the point is belong to. 218 @return (float) 'B' score for the object. 222 optimal_score = float(
'inf')
223 for index_neighbor_cluster
in range(len(self.
__clusters)):
224 if index_cluster != index_neighbor_cluster:
226 if candidate_score < optimal_score:
227 optimal_score = candidate_score
229 if optimal_score == float(
'inf'):
235 def __calculate_cluster_difference(self, index_cluster, difference):
237 @brief Calculates distance from each object in specified cluster to specified object. 239 @param[in] index_point (uint): Index point for which difference is calculated. 241 @return (list) Distance from specified object to each object from input data in specified cluster. 244 cluster_difference = 0.0
245 for index_point
in self.
__clusters[index_cluster]:
246 cluster_difference += difference[index_point]
248 return cluster_difference
251 def __calculate_dataset_difference(self, index_point):
253 @brief Calculate distance from each object to specified object. 255 @param[in] index_point (uint): Index point for which difference with other points is calculated. 257 @return (list) Distance to each object from input data from the specified. 261 if self.
__metric.get_type() != type_metric.USER_DEFINED:
266 return dataset_differences
272 @brief Defines algorithms that can be used to find optimal number of cluster using Silhouette method. 274 @see silhouette_ksearch 289 @brief Returns algorithm type that corresponds to specified enumeration value. 291 @return (type) Algorithm type for cluster analysis. 294 if self == silhouette_ksearch_type.KMEANS:
296 elif self == silhouette_ksearch_type.KMEDIANS:
298 elif self == silhouette_ksearch_type.KMEDOIDS:
307 @brief Represent algorithm for searching optimal number of clusters using specified K-algorithm (K-Means, 308 K-Medians, K-Medoids) that is based on Silhouette method. 310 @details This algorithm uses average value of scores for estimation and applicable for clusters that are well 311 separated. Here is an example where clusters are well separated (sample 'Hepta'): 313 from pyclustering.cluster import cluster_visualizer 314 from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer 315 from pyclustering.cluster.kmeans import kmeans 316 from pyclustering.cluster.silhouette import silhouette_ksearch_type, silhouette_ksearch 317 from pyclustering.samples.definitions import FCPS_SAMPLES 318 from pyclustering.utils import read_sample 320 sample = read_sample(FCPS_SAMPLES.SAMPLE_HEPTA) 321 search_instance = silhouette_ksearch(sample, 2, 10, algorithm=silhouette_ksearch_type.KMEANS).process() 323 amount = search_instance.get_amount() 324 scores = search_instance.get_scores() 326 print("Scores: '%s'" % str(scores)) 328 initial_centers = kmeans_plusplus_initializer(sample, amount).initialize() 329 kmeans_instance = kmeans(sample, initial_centers).process() 331 clusters = kmeans_instance.get_clusters() 333 visualizer = cluster_visualizer() 334 visualizer.append_clusters(clusters, sample) 338 Obtained Silhouette scores for each K: 340 Scores: '{2: 0.418434, 3: 0.450906, 4: 0.534709, 5: 0.689970, 6: 0.588460, 7: 0.882674, 8: 0.804725, 9: 0.780189}' 343 K = 7 has the bigger average Silhouette score and it means that it is optimal amount of clusters: 344 @image html silhouette_ksearch_hepta.png "Silhouette ksearch's analysis with further K-Means clustering (sample 'Hepta')." 346 @see silhouette_ksearch_type 352 @brief Initialize Silhouette search algorithm to find out optimal amount of clusters. 354 @param[in] data (array_like): Input data that is used for searching optimal amount of clusters. 355 @param[in] kmin (uint): Amount of clusters from which search is performed. Should be equal or greater than 2. 356 @param[in] kmax (uint): Amount of clusters to which search is performed. Should be equal or less than amount of 357 points in input data. 358 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'algorithm'). 360 <b>Keyword Args:</b><br> 361 - algorithm (silhouette_ksearch_type): Defines algorithm that is used for searching optimal number of 362 clusters (by default K-Means). 363 - ccore (bool): If True then CCORE (C++ implementation of pyclustering library) is used (by default True). 370 self.
__algorithm = kwargs.get(
'algorithm', silhouette_ksearch_type.KMEANS)
379 self.
__ccore = kwargs.get(
'ccore',
True)
381 self.
__ccore = ccore_library.workable()
386 @brief Performs analysis to find optimal amount of clusters. 388 @see get_amount, get_score, get_scores 390 @return (silhouette_search) Itself instance (silhouette_search) 401 def __process_by_ccore(self):
403 @brief Performs processing using CCORE (C/C++ part of pyclustering library). 413 def __process_by_python(self):
415 @brief Performs processing using python code. 422 if len(clusters) != k:
428 self.
__scores[k] = sum(score) / len(score)
437 @brief Returns optimal amount of clusters that has been found during analysis. 439 @return (uint) Optimal amount of clusters. 449 @brief Returns silhouette score that belongs to optimal amount of clusters (k). 451 @return (float) Score that belong to optimal amount of clusters. 453 @see process, get_scores 461 @brief Returns silhouette score for each K value (amount of clusters). 463 @return (dict) Silhouette score for each K value, where key is a K value and value is a silhouette score. 465 @see process, get_score 471 def __calculate_clusters(self, k):
473 @brief Performs cluster analysis using specified K value. 475 @param[in] k (uint): Amount of clusters that should be allocated. 477 @return (array_like) Allocated clusters. 482 return algorithm_type(self.
__data, initial_values).
process().get_clusters()
485 def __verify_arguments(self):
487 @brief Checks algorithm's arguments and if some of them is incorrect then exception is thrown. 491 raise ValueError(
"K max value '" + str(self.
__kmax) +
"' is bigger than amount of objects '" +
492 str(len(self.
__data)) +
"' in input data.")
495 raise ValueError(
"K min value '" + str(self.
__kmin) +
"' should be greater than 1 (impossible to provide " 496 "silhouette score for only one cluster).")
Cluster analysis algorithm: K-Means.
def __caclulate_optimal_neighbor_cluster_score(self, index_cluster, difference)
Calculates 'B' score for the specific object for the nearest cluster.
def __process_by_ccore(self)
Performs processing using CCORE (C/C++ part of pyclustering library).
Module provides various distance metrics - abstraction of the notion of distance in a metric space...
def process(self)
Calculates Silhouette score for each object from input data.
def __calculate_score(self, index_point, index_cluster)
Calculates Silhouette score for the specific object defined by index_point.
def process(self)
Performs analysis to find optimal amount of clusters.
def get_score(self)
Returns Silhouette score for each object from input data.
Distance metric performs distance calculation between two points in line with encapsulated function...
K-Means++ is an algorithm for choosing the initial centers for algorithms like K-Means or X-Means...
def __calculate_cluster_difference(self, index_cluster, difference)
Calculates distance from each object in specified cluster to specified object.
def __init__(self, data, clusters, kwargs)
Initializes Silhouette method for analysis.
def __process_by_python(self)
Performs processing using python code.
def __init__(self, data, kmin, kmax, kwargs)
Initialize Silhouette search algorithm to find out optimal amount of clusters.
def __calculate_within_cluster_score(self, index_cluster, difference)
Calculates 'A' score for the specific object in cluster to which it belongs to.
def __process_by_ccore(self)
Performs processing using CCORE (C/C++ part of pyclustering library).
def get_type(self)
Returns algorithm type that corresponds to specified enumeration value.
Defines algorithms that can be used to find optimal number of cluster using Silhouette method...
Collection of center initializers for algorithm that uses initial centers, for example, for K-Means or X-Means.
def __calculate_clusters(self, k)
Performs cluster analysis using specified K value.
def __process_by_python(self)
Performs processing using python code.
def get_scores(self)
Returns silhouette score for each K value (amount of clusters).
def __verify_arguments(self)
Checks algorithm's arguments and if some of them is incorrect then exception is thrown.
Represents Silhouette method that is used interpretation and validation of consistency.
def get_score(self)
Returns silhouette score that belongs to optimal amount of clusters (k).
Represent algorithm for searching optimal number of clusters using specified K-algorithm (K-Means...
Cluster analysis algorithm: K-Medoids.
def __calculate_dataset_difference(self, index_point)
Calculate distance from each object to specified object.
def __calculate_cluster_score(self, index_cluster, difference)
Calculates 'B*' score for the specific object for specific cluster.
def get_amount(self)
Returns optimal amount of clusters that has been found during analysis.