3 @brief Silhouette - method of interpretation and validation of consistency. 4 @details Implementation based on paper @cite article::cluster::silhouette::1. 6 @authors Andrei Novikov (pyclustering@yandex.ru) 8 @copyright GNU Public License 10 @cond GNU_PUBLIC_LICENSE 11 PyClustering is free software: you can redistribute it and/or modify 12 it under the terms of the GNU General Public License as published by 13 the Free Software Foundation, either version 3 of the License, or 14 (at your option) any later version. 16 PyClustering is distributed in the hope that it will be useful, 17 but WITHOUT ANY WARRANTY; without even the implied warranty of 18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 19 GNU General Public License for more details. 21 You should have received a copy of the GNU General Public License 22 along with this program. If not, see <http://www.gnu.org/licenses/>. 28 from enum
import IntEnum
39 from pyclustering.core.wrapper
import ccore_library
40 from pyclustering.core.metric_wrapper
import metric_wrapper
42 import pyclustering.core.silhouette_wrapper
as wrapper
47 @brief Represents Silhouette method that is used interpretation and validation of consistency. 48 @details The silhouette value is a measure of how similar an object is to its own cluster compared to other clusters. 49 Be aware that silhouette method is applicable for K algorithm family, such as K-Means, K-Medians, 50 K-Medoids, X-Means, etc., not not applicable for DBSCAN, OPTICS, CURE, etc. The Silhouette value is 51 calculated using following formula: 52 \f[s\left ( i \right )=\frac{ b\left ( i \right ) - a\left ( i \right ) }{ max\left \{ a\left ( i \right ), b\left ( i \right ) \right \}}\f] 53 where \f$a\left ( i \right )\f$ - is average distance from object i to objects in its own cluster, 54 \f$b\left ( i \right )\f$ - is average distance from object i to objects in the nearest cluster (the appropriate among other clusters). 56 Here is an example where Silhouette score is calculated for K-Means's clustering result: 58 from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer 59 from pyclustering.cluster.kmeans import kmeans 60 from pyclustering.cluster.silhouette import silhouette 62 from pyclustering.samples.definitions import SIMPLE_SAMPLES 63 from pyclustering.utils import read_sample 65 # Read data 'SampleSimple3' from Simple Sample collection. 66 sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3); 68 # Prepare initial centers 69 centers = kmeans_plusplus_initializer(sample, 4).initialize(); 71 # Perform cluster analysis 72 kmeans_instance = kmeans(sample, centers); 73 kmeans_instance.process(); 74 clusters = kmeans_instance.get_clusters(); 76 # Calculate Silhouette score 77 score = silhouette(sample, clusters).process().get_score() 80 @see kmeans, kmedoids, kmedians, xmeans, elbow 86 @brief Initializes Silhouette method for analysis. 88 @param[in] data (array_like): Input data that was used for cluster analysis and that is presented as list of 89 points or distance matrix (defined by parameter 'data_type', by default data is considered as a list 91 @param[in] clusters (list): Cluster that have been obtained after cluster analysis. 92 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'metric'). 94 <b>Keyword Args:</b><br> 95 - metric (distance_metric): Metric that was used for cluster analysis and should be used for Silhouette 96 score calculation (by default Square Euclidean distance). 97 - data_type (string): Data type of input sample 'data' that is processed by the algorithm ('points', 'distance_matrix'). 98 - ccore (bool): If True then CCORE (C++ implementation of pyclustering library) is used (by default True). 104 self.
__data_type = kwargs.get(
'data_type',
'points')
106 if self.
__metric.get_type() != type_metric.USER_DEFINED:
111 self.
__score = [0.0] * len(data)
113 self.
__ccore = kwargs.get(
'ccore',
True)
and self.
__metric.get_type() != type_metric.USER_DEFINED
115 self.
__ccore = ccore_library.workable()
118 self.
__data = numpy.array(data)
125 @brief Calculates Silhouette score for each object from input data. 127 @return (silhouette) Instance of the method (self). 138 def __process_by_ccore(self):
140 @brief Performs processing using CCORE (C/C++ part of pyclustering library). 143 ccore_metric = metric_wrapper.create_instance(self.
__metric)
147 def __process_by_python(self):
149 @brief Performs processing using python code. 152 for index_cluster
in range(len(self.
__clusters)):
153 for index_point
in self.
__clusters[index_cluster]:
159 @brief Returns Silhouette score for each object from input data. 167 def __calculate_score(self, index_point, index_cluster):
169 @brief Calculates Silhouette score for the specific object defined by index_point. 171 @param[in] index_point (uint): Index point from input data for which Silhouette score should be calculated. 172 @param[in] index_cluster (uint): Index cluster to which the point belongs to. 174 @return (float) Silhouette score for the object. 180 difference = self.
__data[index_point]
185 return (b_score - a_score) / max(a_score, b_score)
188 def __calculate_within_cluster_score(self, index_cluster, difference):
190 @brief Calculates 'A' score for the specific object in cluster to which it belongs to. 192 @param[in] index_point (uint): Index point from input data for which 'A' score should be calculated. 193 @param[in] index_cluster (uint): Index cluster to which the point is belong to. 195 @return (float) 'A' score for the object. 202 return score / (len(self.
__clusters[index_cluster]) - 1)
205 def __calculate_cluster_score(self, index_cluster, difference):
207 @brief Calculates 'B*' score for the specific object for specific cluster. 209 @param[in] index_point (uint): Index point from input data for which 'B*' score should be calculated. 210 @param[in] index_cluster (uint): Index cluster to which the point is belong to. 212 @return (float) 'B*' score for the object for specific cluster. 217 return score / len(self.
__clusters[index_cluster])
220 def __caclulate_optimal_neighbor_cluster_score(self, index_cluster, difference):
222 @brief Calculates 'B' score for the specific object for the nearest cluster. 224 @param[in] index_point (uint): Index point from input data for which 'B' score should be calculated. 225 @param[in] index_cluster (uint): Index cluster to which the point is belong to. 227 @return (float) 'B' score for the object. 231 optimal_score = float(
'inf')
232 for index_neighbor_cluster
in range(len(self.
__clusters)):
233 if index_cluster != index_neighbor_cluster:
235 if candidate_score < optimal_score:
236 optimal_score = candidate_score
238 if optimal_score == float(
'inf'):
244 def __calculate_cluster_difference(self, index_cluster, difference):
246 @brief Calculates distance from each object in specified cluster to specified object. 248 @param[in] index_point (uint): Index point for which difference is calculated. 250 @return (list) Distance from specified object to each object from input data in specified cluster. 253 cluster_difference = 0.0
254 for index_point
in self.
__clusters[index_cluster]:
255 cluster_difference += difference[index_point]
257 return cluster_difference
260 def __calculate_dataset_difference(self, index_point):
262 @brief Calculate distance from each object to specified object. 264 @param[in] index_point (uint): Index point for which difference with other points is calculated. 266 @return (list) Distance to each object from input data from the specified. 270 if self.
__metric.get_type() != type_metric.USER_DEFINED:
275 return dataset_differences
278 def __verify_arguments(self):
280 @brief Verify input parameters for the algorithm and throw exception in case of incorrectness. 284 raise ValueError(
"Input data is empty (size: '%d')." % len(self.
__data))
287 raise ValueError(
"Input clusters are empty (size: '%d')." % len(self.
__clusters))
293 @brief Defines algorithms that can be used to find optimal number of cluster using Silhouette method. 295 @see silhouette_ksearch 310 @brief Returns algorithm type that corresponds to specified enumeration value. 312 @return (type) Algorithm type for cluster analysis. 315 if self == silhouette_ksearch_type.KMEANS:
317 elif self == silhouette_ksearch_type.KMEDIANS:
319 elif self == silhouette_ksearch_type.KMEDOIDS:
328 @brief Represent algorithm for searching optimal number of clusters using specified K-algorithm (K-Means, 329 K-Medians, K-Medoids) that is based on Silhouette method. 331 @details This algorithm uses average value of scores for estimation and applicable for clusters that are well 332 separated. Here is an example where clusters are well separated (sample 'Hepta'): 334 from pyclustering.cluster import cluster_visualizer 335 from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer 336 from pyclustering.cluster.kmeans import kmeans 337 from pyclustering.cluster.silhouette import silhouette_ksearch_type, silhouette_ksearch 338 from pyclustering.samples.definitions import FCPS_SAMPLES 339 from pyclustering.utils import read_sample 341 sample = read_sample(FCPS_SAMPLES.SAMPLE_HEPTA) 342 search_instance = silhouette_ksearch(sample, 2, 10, algorithm=silhouette_ksearch_type.KMEANS).process() 344 amount = search_instance.get_amount() 345 scores = search_instance.get_scores() 347 print("Scores: '%s'" % str(scores)) 349 initial_centers = kmeans_plusplus_initializer(sample, amount).initialize() 350 kmeans_instance = kmeans(sample, initial_centers).process() 352 clusters = kmeans_instance.get_clusters() 354 visualizer = cluster_visualizer() 355 visualizer.append_clusters(clusters, sample) 359 Obtained Silhouette scores for each K: 361 Scores: '{2: 0.418434, 3: 0.450906, 4: 0.534709, 5: 0.689970, 6: 0.588460, 7: 0.882674, 8: 0.804725, 9: 0.780189}' 364 K = 7 has the bigger average Silhouette score and it means that it is optimal amount of clusters: 365 @image html silhouette_ksearch_hepta.png "Silhouette ksearch's analysis with further K-Means clustering (sample 'Hepta')." 367 @see silhouette_ksearch_type 373 @brief Initialize Silhouette search algorithm to find out optimal amount of clusters. 375 @param[in] data (array_like): Input data that is used for searching optimal amount of clusters. 376 @param[in] kmin (uint): Amount of clusters from which search is performed. Should be equal or greater than 2. 377 @param[in] kmax (uint): Amount of clusters to which search is performed. Should be equal or less than amount of 378 points in input data. 379 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'algorithm'). 381 <b>Keyword Args:</b><br> 382 - algorithm (silhouette_ksearch_type): Defines algorithm that is used for searching optimal number of 383 clusters (by default K-Means). 384 - ccore (bool): If True then CCORE (C++ implementation of pyclustering library) is used (by default True). 391 self.
__algorithm = kwargs.get(
'algorithm', silhouette_ksearch_type.KMEANS)
400 self.
__ccore = kwargs.get(
'ccore',
True)
402 self.
__ccore = ccore_library.workable()
407 @brief Performs analysis to find optimal amount of clusters. 409 @see get_amount, get_score, get_scores 411 @return (silhouette_search) Itself instance (silhouette_search) 422 def __process_by_ccore(self):
424 @brief Performs processing using CCORE (C/C++ part of pyclustering library). 434 def __process_by_python(self):
436 @brief Performs processing using python code. 443 if len(clusters) != k:
449 self.
__scores[k] = sum(score) / len(score)
458 @brief Returns optimal amount of clusters that has been found during analysis. 460 @return (uint) Optimal amount of clusters. 470 @brief Returns silhouette score that belongs to optimal amount of clusters (k). 472 @return (float) Score that belong to optimal amount of clusters. 474 @see process, get_scores 482 @brief Returns silhouette score for each K value (amount of clusters). 484 @return (dict) Silhouette score for each K value, where key is a K value and value is a silhouette score. 486 @see process, get_score 492 def __calculate_clusters(self, k):
494 @brief Performs cluster analysis using specified K value. 496 @param[in] k (uint): Amount of clusters that should be allocated. 498 @return (array_like) Allocated clusters. 503 return algorithm_type(self.
__data, initial_values).
process().get_clusters()
506 def __verify_arguments(self):
508 @brief Checks algorithm's arguments and if some of them is incorrect then exception is thrown. 512 raise ValueError(
"K max value '" + str(self.
__kmax) +
"' is bigger than amount of objects '" +
513 str(len(self.
__data)) +
"' in input data.")
516 raise ValueError(
"K min value '" + str(self.
__kmin) +
"' should be greater than 1 (impossible to provide " 517 "silhouette score for only one cluster).")
The module contains K-Means algorithm and other related services.
def __caclulate_optimal_neighbor_cluster_score(self, index_cluster, difference)
Calculates 'B' score for the specific object for the nearest cluster.
def __process_by_ccore(self)
Performs processing using CCORE (C/C++ part of pyclustering library).
Module provides various distance metrics - abstraction of the notion of distance in a metric space...
def process(self)
Calculates Silhouette score for each object from input data.
def __calculate_score(self, index_point, index_cluster)
Calculates Silhouette score for the specific object defined by index_point.
def process(self)
Performs analysis to find optimal amount of clusters.
def get_score(self)
Returns Silhouette score for each object from input data.
Distance metric performs distance calculation between two points in line with encapsulated function...
K-Means++ is an algorithm for choosing the initial centers for algorithms like K-Means or X-Means...
def __calculate_cluster_difference(self, index_cluster, difference)
Calculates distance from each object in specified cluster to specified object.
def __init__(self, data, clusters, kwargs)
Initializes Silhouette method for analysis.
def __process_by_python(self)
Performs processing using python code.
def __init__(self, data, kmin, kmax, kwargs)
Initialize Silhouette search algorithm to find out optimal amount of clusters.
def __calculate_within_cluster_score(self, index_cluster, difference)
Calculates 'A' score for the specific object in cluster to which it belongs to.
def __process_by_ccore(self)
Performs processing using CCORE (C/C++ part of pyclustering library).
def get_type(self)
Returns algorithm type that corresponds to specified enumeration value.
Defines algorithms that can be used to find optimal number of cluster using Silhouette method...
Collection of center initializers for algorithm that uses initial centers, for example, for K-Means or X-Means.
def __calculate_clusters(self, k)
Performs cluster analysis using specified K value.
def __process_by_python(self)
Performs processing using python code.
def get_scores(self)
Returns silhouette score for each K value (amount of clusters).
def __verify_arguments(self)
Checks algorithm's arguments and if some of them is incorrect then exception is thrown.
Represents Silhouette method that is used interpretation and validation of consistency.
def get_score(self)
Returns silhouette score that belongs to optimal amount of clusters (k).
Represent algorithm for searching optimal number of clusters using specified K-algorithm (K-Means...
def __verify_arguments(self)
Verify input parameters for the algorithm and throw exception in case of incorrectness.
Cluster analysis algorithm: K-Medoids.
def __calculate_dataset_difference(self, index_point)
Calculate distance from each object to specified object.
def __calculate_cluster_score(self, index_cluster, difference)
Calculates 'B*' score for the specific object for specific cluster.
def get_amount(self)
Returns optimal amount of clusters that has been found during analysis.