3 @brief Silhouette - method of interpretation and validation of consistency. 4 @details Implementation based on paper @cite article::cluster::silhouette::1. 6 @authors Andrei Novikov (pyclustering@yandex.ru) 8 @copyright GNU Public License 10 @cond GNU_PUBLIC_LICENSE 11 PyClustering is free software: you can redistribute it and/or modify 12 it under the terms of the GNU General Public License as published by 13 the Free Software Foundation, either version 3 of the License, or 14 (at your option) any later version. 16 PyClustering is distributed in the hope that it will be useful, 17 but WITHOUT ANY WARRANTY; without even the implied warranty of 18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 19 GNU General Public License for more details. 21 You should have received a copy of the GNU General Public License 22 along with this program. If not, see <http://www.gnu.org/licenses/>. 42 @brief Represents Silhouette method that is used interpretation and validation of consistency. 43 @details The silhouette value is a measure of how similar an object is to its own cluster compared to other clusters. 44 Be aware that silhouette method is applicable for K algorithm family, such as K-Means, K-Medians, 45 K-Medoids, X-Means, etc., not not applicable for DBSCAN, OPTICS, CURE, etc. The Silhouette value is 46 calculated using following formula: 47 \f[s\left ( i \right )=\frac{ b\left ( i \right ) - a\left ( i \right ) }{ max\left \{ a\left ( i \right ), b\left ( i \right ) \right \}}\f] 48 where \f$a\left ( i \right )\f$ - is average distance from object i to objects in its own cluster, 49 \f$b\left ( i \right )\f$ - is average distance from object i to objects in the nearest cluster (the appropriate among other clusters). 51 Here is an example where Silhouette score is calculated for K-Means's clustering result: 53 from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer 54 from pyclustering.cluster.kmeans import kmeans 55 from pyclustering.cluster.silhouette import silhouette 57 from pyclustering.samples.definitions import SIMPLE_SAMPLES 58 from pyclustering.utils import read_sample 60 # Read data 'SampleSimple3' from Simple Sample collection. 61 sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3); 63 # Prepare initial centers 64 centers = kmeans_plusplus_initializer(sample, 4).initialize(); 66 # Perform cluster analysis 67 kmeans_instance = kmeans(sample, centers); 68 kmeans_instance.process(); 69 clusters = kmeans_instance.get_clusters(); 71 # Calculate Silhouette score 72 score = silhouette(sample, clusters).process().get_score() 75 @see kmeans, kmedoids, kmedians, xmeans, elbow 81 @brief Initializes Silhouette method for analysis. 83 @param[in] data (array_like): Input data that was used for cluster analysis. 84 @param[in] clusters (list): Cluster that have been obtained after cluster analysis. 85 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'metric'). 87 <b>Keyword Args:</b><br> 88 - metric (distance_metric): Metric that was used for cluster analysis and should be used for Silhouette 89 score calculation (by default Square Euclidean distance). 92 self.
__data = numpy.array(data)
96 if self.
__metric.get_type() != type_metric.USER_DEFINED:
101 self.
__score = [0.0] * len(data)
106 @brief Calculates Silhouette score for each object from input data. 108 @return (silhouette) Instance of the method (self). 111 for index_cluster
in range(len(self.
__clusters)):
112 for index_point
in self.
__clusters[index_cluster]:
120 @brief Returns Silhouette score for each object from input data. 128 def __calculate_score(self, index_point, index_cluster):
130 @brief Calculates Silhouette score for the specific object defined by index_point. 132 @param[in] index_point (uint): Index point from input data for which Silhouette score should be calculated. 133 @param[in] index_cluster (uint): Index cluster to which the point belongs to. 135 @return (float) Silhouette score for the object. 143 return (b_score - a_score) / max(a_score, b_score)
146 def __calculate_within_cluster_score(self, index_cluster, difference):
148 @brief Calculates 'A' score for the specific object in cluster to which it belongs to. 150 @param[in] index_point (uint): Index point from input data for which 'A' score should be calculated. 151 @param[in] index_cluster (uint): Index cluster to which the point is belong to. 153 @return (float) 'A' score for the object. 160 return score / (len(self.
__clusters[index_cluster]) - 1)
163 def __calculate_cluster_score(self, index_cluster, difference):
165 @brief Calculates 'B*' score for the specific object for specific cluster. 167 @param[in] index_point (uint): Index point from input data for which 'B*' score should be calculated. 168 @param[in] index_cluster (uint): Index cluster to which the point is belong to. 170 @return (float) 'B*' score for the object for specific cluster. 175 return score / len(self.
__clusters[index_cluster])
178 def __caclulate_optimal_neighbor_cluster_score(self, index_cluster, difference):
180 @brief Calculates 'B' score for the specific object for the nearest cluster. 182 @param[in] index_point (uint): Index point from input data for which 'B' score should be calculated. 183 @param[in] index_cluster (uint): Index cluster to which the point is belong to. 185 @return (float) 'B' score for the object. 189 optimal_score = float(
'inf')
190 for index_neighbor_cluster
in range(len(self.
__clusters)):
191 if index_cluster != index_neighbor_cluster:
193 if candidate_score < optimal_score:
194 optimal_score = candidate_score
199 def __calculate_cluster_difference(self, index_cluster, difference):
201 @brief Calculates distance from each object in specified cluster to specified object. 203 @param[in] index_point (uint): Index point for which difference is calculated. 205 @return (list) Distance from specified object to each object from input data in specified cluster. 208 cluster_difference = 0.0
209 for index_point
in self.
__clusters[index_cluster]:
210 cluster_difference += difference[index_point]
212 return cluster_difference
215 def __calculate_dataset_difference(self, index_point):
217 @brief Calculate distance from each object to specified object. 219 @param[in] index_point (uint): Index point for which difference with other points is calculated. 221 @return (list) Distance to each object from input data from the specified. 225 if self.
__metric.get_type() != type_metric.USER_DEFINED:
230 return dataset_differences
236 @brief Defines algorithms that can be used to find optimal number of cluster using Silhouette method. 238 @see silhouette_ksearch 253 @brief Returns algorithm type that corresponds to specified enumeration value. 255 @return (type) Algorithm type for cluster analysis. 258 if self == silhouette_ksearch_type.KMEANS:
260 elif self == silhouette_ksearch_type.KMEDIANS:
262 elif self == silhouette_ksearch_type.KMEDOIDS:
271 @brief Represent algorithm for searching optimal number of clusters using specified K-algorithm (K-Means, 272 K-Medians, K-Medoids) that is based on Silhouette method. 274 @details This algorithm uses average value of scores for estimation and applicable for clusters that are well 275 separated. Here is an example where clusters are well separated (sample 'Hepta'): 277 from pyclustering.cluster import cluster_visualizer 278 from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer 279 from pyclustering.cluster.kmeans import kmeans 280 from pyclustering.cluster.silhouette import silhouette_ksearch_type, silhouette_ksearch 281 from pyclustering.samples.definitions import FCPS_SAMPLES 282 from pyclustering.utils import read_sample 284 sample = read_sample(FCPS_SAMPLES.SAMPLE_HEPTA) 285 search_instance = silhouette_ksearch(sample, 2, 10, algorithm=silhouette_ksearch_type.KMEANS).process() 287 amount = search_instance.get_amount() 288 scores = search_instance.get_scores() 290 print("Scores: '%s'" % str(scores)) 292 initial_centers = kmeans_plusplus_initializer(sample, amount).initialize() 293 kmeans_instance = kmeans(sample, initial_centers).process() 295 clusters = kmeans_instance.get_clusters() 297 visualizer = cluster_visualizer() 298 visualizer.append_clusters(clusters, sample) 302 Obtained Silhouette scores for each K: 304 Scores: '{2: 0.418434, 3: 0.450906, 4: 0.534709, 5: 0.689970, 6: 0.588460, 7: 0.882674, 8: 0.804725, 9: 0.780189}' 307 K = 7 has the bigger average Silhouette score and it means that it is optimal amount of clusters: 308 @image html silhouette_ksearch_hepta.png "Silhouette ksearch's analysis with further K-Means clustering (sample 'Hepta')." 310 @see silhouette_ksearch_type 316 @brief Initialize Silhouette search algorithm to find out optimal amount of clusters. 318 @param[in] data (array_like): Input data that is used for searching optimal amount of clusters. 319 @param[in] kmin (uint): Amount of clusters from which search is performed. Should be equal or greater than 2. 320 @param[in] kmax (uint): Amount of clusters to which search is performed. Should be equal or less than amount of 321 points in input data. 322 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'algorithm'). 324 <b>Keyword Args:</b><br> 325 - algorithm (silhouette_ksearch_type): Defines algorithm that is used for searching optimal number of 326 clusters (by default K-Means). 333 self.
__algorithm = kwargs.get(
'algorithm', silhouette_ksearch_type.KMEANS)
345 @brief Performs analysis to find optimal amount of clusters. 347 @see get_amount, get_score, get_scores 349 @return (silhouette_search) Itself instance (silhouette_search) 356 if len(clusters) != k:
362 self.
__scores[k] = sum(score) / len(score)
373 @brief Returns optimal amount of clusters that has been found during analysis. 375 @return (uint) Optimal amount of clusters. 385 @brief Returns silhouette score that belongs to optimal amount of clusters (k). 387 @return (float) Score that belong to optimal amount of clusters. 389 @see process, get_scores 397 @brief Returns silhouette score for each K value (amount of clusters). 399 @return (dict) Silhouette score for each K value, where key is a K value and value is a silhouette score. 401 @see process, get_score 407 def __calculate_clusters(self, k):
409 @brief Performs cluster analysis using specified K value. 411 @param[in] k (uint): Amount of clusters that should be allocated. 413 @return (array_like) Allocated clusters. 418 return algorithm_type(self.
__data, initial_values).
process().get_clusters()
421 def __verify_arguments(self):
423 @brief Checks algorithm's arguments and if some of them is incorrect then exception is thrown. 427 raise ValueError(
"K max value '" + str(self.
__kmax) +
"' is bigger than amount of objects '" +
428 str(len(self.
__data)) +
"' in input data.")
431 raise ValueError(
"K min value '" + str(self.
__kmin) +
"' should be greater than 1 (impossible to provide " 432 "silhiuette score for only one cluster).")
Cluster analysis algorithm: K-Means.
def __caclulate_optimal_neighbor_cluster_score(self, index_cluster, difference)
Calculates 'B' score for the specific object for the nearest cluster.
Module provides various distance metrics - abstraction of the notion of distance in a metric space...
def process(self)
Calculates Silhouette score for each object from input data.
def __calculate_score(self, index_point, index_cluster)
Calculates Silhouette score for the specific object defined by index_point.
def process(self)
Performs analysis to find optimal amount of clusters.
def get_score(self)
Returns Silhouette score for each object from input data.
Distance metric performs distance calculation between two points in line with encapsulated function...
K-Means++ is an algorithm for choosing the initial centers for algorithms like K-Means or X-Means...
def __calculate_cluster_difference(self, index_cluster, difference)
Calculates distance from each object in specified cluster to specified object.
def __init__(self, data, clusters, kwargs)
Initializes Silhouette method for analysis.
def __init__(self, data, kmin, kmax, kwargs)
Initialize Silhouette search algorithm to find out optimal amount of clusters.
def __calculate_within_cluster_score(self, index_cluster, difference)
Calculates 'A' score for the specific object in cluster to which it belongs to.
def get_type(self)
Returns algorithm type that corresponds to specified enumeration value.
Defines algorithms that can be used to find optimal number of cluster using Silhouette method...
Collection of center initializers for algorithm that uses initial centers, for example, for K-Means or X-Means.
def __calculate_clusters(self, k)
Performs cluster analysis using specified K value.
def get_scores(self)
Returns silhouette score for each K value (amount of clusters).
def __verify_arguments(self)
Checks algorithm's arguments and if some of them is incorrect then exception is thrown.
Represents Silhouette method that is used interpretation and validation of consistency.
def get_score(self)
Returns silhouette score that belongs to optimal amount of clusters (k).
Represent algorithm for searching optimal number of clusters using specified K-algorithm (K-Means...
Cluster analysis algorithm: K-Medoids (PAM - Partitioning Around Medoids).
def __calculate_dataset_difference(self, index_point)
Calculate distance from each object to specified object.
def __calculate_cluster_score(self, index_cluster, difference)
Calculates 'B*' score for the specific object for specific cluster.
def get_amount(self)
Returns optimal amount of clusters that has been found during analysis.