@brief Cluster analysis algorithm: X-Means
@details Implementation based on papers @cite article::xmeans::1, @cite article::xmeans::mndl

@authors Andrei Novikov (pyclustering@yandex.ru)
@copyright BSD-3-Clause
import copy
import numpy

from enum import IntEnum
from math import log

from pyclustering.cluster.encoder import type_encoding
from pyclustering.cluster.kmeans import kmeans
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer

from pyclustering.core.metric_wrapper import metric_wrapper
from pyclustering.core.wrapper import ccore_library

import pyclustering.core.xmeans_wrapper as wrapper

from pyclustering.utils import euclidean_distance_square, distance_metric, type_metric
@brief Enumeration of splitting types that can be used for cluster splitting in the X-Means algorithm.
BAYESIAN_INFORMATION_CRITERION = 0

MINIMUM_NOISELESS_DESCRIPTION_LENGTH = 1
@brief Class represents the clustering algorithm X-Means.
@details X-Means starts with the assumption of a minimum number of clusters and then dynamically increases it.
         X-Means uses the specified splitting criterion to control the process of splitting clusters.
         K-Means++ can be used to calculate the initial centers.

         The CCORE implementation of the algorithm uses a thread pool to parallelize the clustering process.

Here is an example of how to perform cluster analysis using the X-Means algorithm:
@code
from pyclustering.cluster import cluster_visualizer
from pyclustering.cluster.xmeans import xmeans
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.utils import read_sample
from pyclustering.samples.definitions import SIMPLE_SAMPLES

# Read sample 'simple3' from file.
sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3)
# Prepare initial centers - amount of initial centers defines amount of clusters from which X-Means will start analysis.
amount_initial_centers = 2
initial_centers = kmeans_plusplus_initializer(sample, amount_initial_centers).initialize()

# Create instance of X-Means algorithm. The algorithm will start analysis from 2 clusters, the maximum
# number of clusters that can be allocated is 20.
xmeans_instance = xmeans(sample, initial_centers, 20)
xmeans_instance.process()

# Extract clustering results: clusters and their centers.
clusters = xmeans_instance.get_clusters()
centers = xmeans_instance.get_centers()

# Print total sum of metric errors.
print("Total WCE:", xmeans_instance.get_total_wce())
100 # Visualize clustering results
101 visualizer = cluster_visualizer()
102 visualizer.append_clusters(clusters, sample)
103 visualizer.append_cluster(centers, None, marker='*', markersize=10)
Visualization of clustering results that were obtained using the code above, where the X-Means algorithm allocates four clusters.
@image html xmeans_clustering_simple3.png "Fig. 1. X-Means clustering results (data 'Simple3')."
By default the X-Means clustering algorithm uses the Bayesian Information Criterion (BIC) to approximate the correct
number of clusters. The following example uses another criterion, Minimum Noiseless Description Length (MNDL), to find
the optimal amount of clusters:
@code
from pyclustering.cluster import cluster_visualizer
from pyclustering.cluster.xmeans import xmeans, splitting_type
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.utils import read_sample
from pyclustering.samples.definitions import FCPS_SAMPLES

# Read sample 'Target'.
sample = read_sample(FCPS_SAMPLES.SAMPLE_TARGET)
# Prepare initial centers - amount of initial centers defines amount of clusters from which X-Means will start analysis.
random_seed = 1000
amount_initial_centers = 3
initial_centers = kmeans_plusplus_initializer(sample, amount_initial_centers, random_state=random_seed).initialize()
# Create instance of X-Means algorithm with MNDL splitting criterion.
xmeans_mndl = xmeans(sample, initial_centers, 20, criterion=splitting_type.MINIMUM_NOISELESS_DESCRIPTION_LENGTH, random_state=random_seed)
xmeans_mndl.process()

# Extract X-Means MNDL clustering results.
mndl_clusters = xmeans_mndl.get_clusters()
# Visualize clustering results.
visualizer = cluster_visualizer(titles=['X-Means with MNDL criterion'])
visualizer.append_clusters(mndl_clusters, sample)
visualizer.show()
@endcode
@image html xmeans_clustering_mndl_target.png "Fig. 2. X-Means MNDL clustering results (data 'Target')."
As in many other algorithms, it is possible to specify the metric that should be used for cluster analysis, for
example, the Chebyshev distance metric:
@code
from pyclustering.utils import distance_metric, type_metric

# Create instance of X-Means algorithm with Chebyshev distance metric.
chebyshev_metric = distance_metric(type_metric.CHEBYSHEV)
xmeans_instance = xmeans(sample, initial_centers, max_clusters_amount, metric=chebyshev_metric).process()
@endcode
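For reference, a complete, self-contained variant of that snippet. This is only a sketch: the `sample` points, the
number of initial centers and the `max_clusters_amount` value here are illustrative assumptions, not values from the
library's sample collections:

@code
from pyclustering.cluster.xmeans import xmeans
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.utils import distance_metric, type_metric

# Illustrative 2-D data: two loose groups of points.
sample = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]

initial_centers = kmeans_plusplus_initializer(sample, 2).initialize()
max_clusters_amount = 5

chebyshev_metric = distance_metric(type_metric.CHEBYSHEV)
xmeans_instance = xmeans(sample, initial_centers, max_clusters_amount, metric=chebyshev_metric).process()
print(xmeans_instance.get_clusters())
@endcode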
@see center_initializer
def __init__(self, data, initial_centers=None, kmax=20, tolerance=0.001, criterion=splitting_type.BAYESIAN_INFORMATION_CRITERION, ccore=True, **kwargs):
    @brief Constructor of clustering algorithm X-Means.

    @param[in] data (array_like): Input data that is presented as a list of points (objects); each point should be represented by a list or tuple.
    @param[in] initial_centers (list): Initial coordinates of cluster centers that are represented by a list: `[center1, center2, ...]`;
                if not specified then X-Means starts from a random center.
    @param[in] kmax (uint): Maximum number of clusters that can be allocated.
    @param[in] tolerance (double): Stop condition for each iteration: if the maximum change of cluster centers is less than tolerance, then the algorithm stops processing.
    @param[in] criterion (splitting_type): Type of splitting criterion (by default `splitting_type.BAYESIAN_INFORMATION_CRITERION`).
    @param[in] ccore (bool): Defines if the C++ pyclustering library should be used instead of the Python implementation.
    @param[in] **kwargs: Arbitrary keyword arguments (available arguments: `repeat`, `random_state`, `metric`, `alpha`, `beta`).
    <b>Keyword Args:</b><br>
        - repeat (uint): How many times K-Means should be run to improve parameters (by default `1`).
           Larger `repeat` values increase the probability of finding the global optimum.
        - random_state (int): Seed for random state (by default `None`; the current system time is used).
        - metric (distance_metric): Metric that is used for distance calculation between two points (by default
           the Euclidean square distance).
        - alpha (double): Parameter in the range [0.0, 1.0] for the alpha probabilistic bound \f$Q\left(\alpha\right)\f$.
           The parameter is used only in case of the MNDL splitting criterion; in all other cases this value is ignored.
        - beta (double): Parameter in the range [0.0, 1.0] for the beta probabilistic bound \f$Q\left(\beta\right)\f$.
           The parameter is used only in case of the MNDL splitting criterion; in all other cases this value is ignored.
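    A minimal sketch of how these keyword arguments are combined (here `sample` stands for any 2-D point list prepared
    as in the class examples; the concrete values are illustrative assumptions):

    @code
    # Run K-Means 5 times per refinement with a fixed seed and the default Euclidean square metric.
    xmeans_instance = xmeans(sample, kmax=20, repeat=5, random_state=1000)
    xmeans_instance.process()
    @endcode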
    self.__metric = copy.copy(kwargs.get('metric', distance_metric(type_metric.EUCLIDEAN_SQUARE)))
    if initial_centers is not None:
        self.__centers = numpy.array(initial_centers)
    self.__repeat = kwargs.get('repeat', 1)
    self.__random_state = kwargs.get('random_state', None)
    self.__alpha = kwargs.get('alpha', 0.9)
    self.__beta = kwargs.get('beta', 0.9)
    self.__ccore = ccore and self.__metric.get_type() != type_metric.USER_DEFINED
    if self.__ccore is True:
        self.__ccore = ccore_library.workable()
def process(self):
    @brief Performs cluster analysis in line with the rules of the X-Means algorithm.

    @return (xmeans) Returns itself (X-Means instance).
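    Since `process()` returns the X-Means instance itself, calls can be chained. A short sketch (assuming `sample` and
    `initial_centers` are prepared as in the class examples):

    @code
    clusters = xmeans(sample, initial_centers, 20).process().get_clusters()
    @endcode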
def __process_by_ccore(self):
    @brief Performs cluster analysis using CCORE (C/C++ part of pyclustering library).

    ccore_metric = metric_wrapper.create_instance(self.__metric)

    # The CCORE wrapper consumes the algorithm parameters together with ccore_metric.get_pointer()
    # and returns allocated clusters, centers and total WCE.
    result = wrapper.xmeans(self.__pointer_data, self.__centers, self.__kmax, self.__tolerance, self.__criterion,
                            self.__alpha, self.__beta, self.__repeat, self.__random_state,
                            ccore_metric.get_pointer())

    self.__clusters = result[0]
    self.__centers = result[1]
    self.__total_wce = result[2][0]
def __process_by_python(self):
    @brief Performs cluster analysis using python code.

    while len(self.__centers) <= self.__kmax:
        current_cluster_number = len(self.__centers)

        self.__clusters, self.__centers, _ = self.__improve_parameters(self.__centers)
        allocated_centers = self.__improve_structure(self.__clusters, self.__centers)

        # Stop when the improved structure does not allocate any new center.
        if current_cluster_number == len(allocated_centers):
            break

        self.__centers = allocated_centers

    self.__clusters, self.__centers, self.__total_wce = self.__improve_parameters(self.__centers)
def predict(self, points):
    @brief Calculates the closest cluster to each point.

    @param[in] points (array_like): Points for which closest clusters are calculated.

    @return (list) List of closest clusters for each point. Each cluster is denoted by index. Returns an empty
             collection if the 'process()' method was not called.

    An example of how to calculate (or predict) the closest cluster to specified points:
    @code
    from pyclustering.cluster.xmeans import xmeans
    from pyclustering.samples.definitions import SIMPLE_SAMPLES
    from pyclustering.utils import read_sample

    # Load list of points for cluster analysis.
    sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3)

    # Initial centers for sample 'Simple3'.
    initial_centers = [[0.2, 0.1], [4.0, 1.0], [2.0, 2.0], [2.3, 3.9]]

    # Create instance of X-Means algorithm with prepared centers.
    xmeans_instance = xmeans(sample, initial_centers)

    # Run cluster analysis.
    xmeans_instance.process()

    # Calculate the closest cluster to following two points.
    points = [[0.25, 0.2], [2.5, 4.0]]
    closest_clusters = xmeans_instance.predict(points)
    print(closest_clusters)
    @endcode
    nppoints = numpy.array(points)
    if len(self.__clusters) == 0:
        return []

    npcenters = numpy.array(self.__centers)

    differences = numpy.zeros((len(nppoints), len(npcenters)))
    for index_point in range(len(nppoints)):
        differences[index_point] = self.__metric(nppoints[index_point], npcenters)

    return numpy.argmin(differences, axis=1)
def get_clusters(self):
    @brief Returns list of allocated clusters; each cluster contains indexes of objects from the input data.

    @return (list) List of allocated clusters.
def get_centers(self):
    @brief Returns list of centers for allocated clusters.

    @return (list) List of centers for allocated clusters.
def get_cluster_encoding(self):
    @brief Returns clustering result representation type that indicates how clusters are encoded.

    @return (type_encoding) Clustering result representation.
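    The returned encoding can be fed to `pyclustering.cluster.encoder.cluster_encoder` to convert the index-list
    representation into, for example, per-point labels. A sketch (assuming `sample` and a processed `xmeans_instance`
    as in the class examples):

    @code
    from pyclustering.cluster.encoder import cluster_encoder, type_encoding

    encoding = xmeans_instance.get_cluster_encoding()
    encoder = cluster_encoder(encoding, xmeans_instance.get_clusters(), sample)

    # Convert clusters to a flat list of labels, one label per point.
    encoder.set_encoding(type_encoding.CLUSTER_INDEX_LABELING)
    labels = encoder.get_clusters()
    @endcode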
    return type_encoding.CLUSTER_INDEX_LIST_SEPARATION
def get_total_wce(self):
    @brief Returns the sum of Euclidean squared metric errors (SSE - Sum of Squared Errors).
    @details The sum of metric errors is calculated using the distance between each point and the center of its cluster:
             \f[error=\sum_{i=0}^{N}euclidean\_square\_distance(x_{i}, center(x_{i}))\f]
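    A sketch of how this quantity can be reproduced from the clustering results with numpy (assuming `sample`,
    `clusters` and `centers` obtained as in the class examples, with the default Euclidean square metric):

    @code
    import numpy

    data = numpy.array(sample)
    wce = sum(numpy.sum((data[cluster] - numpy.array(center)) ** 2)
              for cluster, center in zip(clusters, centers))
    @endcode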
def __search_optimial_parameters(self, local_data):
    @brief Splits data of the region into two clusters and tries to find the global optimum by running k-means clustering
           several times (defined by the 'repeat' argument).

    @param[in] local_data (list): Points of a region that should be split into two clusters.

    @return (tuple) List of allocated clusters, list of centers and total WCE (clusters, centers, wce).
    optimal_wce, optimal_centers, optimal_clusters = float('+inf'), None, None

    for _ in range(self.__repeat):
        # Pick the best of several K-Means++ candidates for the two initial centers.
        candidates = 5
        if len(local_data) < candidates:
            candidates = len(local_data)

        local_centers = kmeans_plusplus_initializer(local_data, 2, candidates, random_state=self.__random_state).initialize()

        kmeans_instance = kmeans(local_data, local_centers, tolerance=self.__tolerance, ccore=False, metric=self.__metric)
        kmeans_instance.process()

        local_wce = kmeans_instance.get_total_wce()
        if local_wce < optimal_wce:
            optimal_centers = kmeans_instance.get_centers()
            optimal_clusters = kmeans_instance.get_clusters()
            optimal_wce = local_wce

    return optimal_clusters, optimal_centers, optimal_wce
def __improve_parameters(self, centers, available_indexes=None):
    @brief Performs k-means clustering in the specified region.

    @param[in] centers (list): Cluster centers; if None then two centers are generated automatically using the center initialization method.
    @param[in] available_indexes (list): Indexes that define which points can be used for k-means clustering; if None then all points are used.

    @return (tuple) List of allocated clusters, list of centers and total WCE (clusters, centers, wce).

    if available_indexes and len(available_indexes) == 1:
        index_center = available_indexes[0]
        return [available_indexes], self.__pointer_data[index_center], 0.0

    local_data = self.__pointer_data
    if available_indexes:
        local_data = [self.__pointer_data[i] for i in available_indexes]

    local_centers = centers
    if centers is None:
        clusters, local_centers, local_wce = self.__search_optimial_parameters(local_data)
    else:
        kmeans_instance = kmeans(local_data, local_centers, tolerance=self.__tolerance, ccore=False, metric=self.__metric).process()

        local_wce = kmeans_instance.get_total_wce()
        local_centers = kmeans_instance.get_centers()
        clusters = kmeans_instance.get_clusters()

    if available_indexes:
        clusters = self.__local_to_global_clusters(clusters, available_indexes)

    return clusters, local_centers, local_wce
def __local_to_global_clusters(self, local_clusters, available_indexes):
    @brief Converts clusters in the local region defined by 'available_indexes' to global clusters.

    @param[in] local_clusters (list): Local clusters in specific region.
    @param[in] available_indexes (list): Map between local and global point's indexes.

    @return Global clusters.

    clusters = []
    for local_cluster in local_clusters:
        current_cluster = []
        for index_point in local_cluster:
            current_cluster.append(available_indexes[index_point])

        clusters.append(current_cluster)

    return clusters
def __improve_structure(self, clusters, centers):
    @brief Checks for the best structure: divides each cluster into two and checks for better results using the splitting criterion.

    @param[in] clusters (list): Clusters that have been allocated (each cluster contains indexes of points from data).
    @param[in] centers (list): Centers of clusters.

    @return (list) Allocated centers for clustering.

    allocated_centers = []
    amount_free_centers = self.__kmax - len(centers)

    for index_cluster in range(len(clusters)):
        # Split the cluster into two child clusters.
        (parent_child_clusters, parent_child_centers, _) = self.__improve_parameters(None, clusters[index_cluster])

        # If it is possible to split the current cluster.
        if len(parent_child_clusters) > 1:
            # Calculate splitting criterion for the parent cluster and for its two children.
            parent_scores = self.__splitting_criterion([clusters[index_cluster]], [centers[index_cluster]])
            child_scores = self.__splitting_criterion([parent_child_clusters[0], parent_child_clusters[1]], parent_child_centers)

            split_require = False

            # Reallocate number of centers (clusters) in line with scores.
            if self.__criterion == splitting_type.BAYESIAN_INFORMATION_CRITERION:
                # For BIC a higher score is better, so split when the children score higher.
                if parent_scores < child_scores:
                    split_require = True

            elif self.__criterion == splitting_type.MINIMUM_NOISELESS_DESCRIPTION_LENGTH:
                # For MNDL a lower score is better, so split when the children score lower.
                if parent_scores > child_scores:
                    split_require = True

            if (split_require is True) and (amount_free_centers > 0):
                allocated_centers.append(parent_child_centers[0])
                allocated_centers.append(parent_child_centers[1])

                amount_free_centers -= 1
            else:
                allocated_centers.append(centers[index_cluster])
        else:
            allocated_centers.append(centers[index_cluster])

    return allocated_centers
def __splitting_criterion(self, clusters, centers):
    @brief Calculates splitting criterion for input clusters.

    @param[in] clusters (list): Clusters for which splitting criterion should be calculated.
    @param[in] centers (list): Centers of the clusters.

    @return (double) Returns splitting criterion. Its interpretation depends on the criterion: for BIC a high value
             means that the current structure is better, for MNDL a low value does.

    @see __bayesian_information_criterion(clusters, centers)
    @see __minimum_noiseless_description_length(clusters, centers)

    if self.__criterion == splitting_type.BAYESIAN_INFORMATION_CRITERION:
        return self.__bayesian_information_criterion(clusters, centers)

    elif self.__criterion == splitting_type.MINIMUM_NOISELESS_DESCRIPTION_LENGTH:
        return self.__minimum_noiseless_description_length(clusters, centers)
def __minimum_noiseless_description_length(self, clusters, centers):
    @brief Calculates splitting criterion for input clusters using the minimum noiseless description length criterion.

    @param[in] clusters (list): Clusters for which splitting criterion should be calculated.
    @param[in] centers (list): Centers of the clusters.

    @return (double) Returns splitting criterion in line with the minimum noiseless description length criterion.
             A low value of the splitting criterion means that the current structure is better.

    @see __bayesian_information_criterion(clusters, centers)
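    In terms of the computation below (a sketch of the notation matching the code rather than quoting
    @cite article::xmeans::mndl: \f$W\f$ is the sum of average within-cluster distortions, \f$\sigma^{2}\f$ the pooled
    variance estimate, \f$K\f$ the number of clusters and \f$N\f$ the total number of points), the score is:

    \f[K_{w}=\left(1-\frac{K}{N}\right)\sigma^{2}\f]
    \f[K_{s_{a}}=\frac{2\alpha\sigma}{\sqrt{N}}\sqrt{\frac{\alpha^{2}\sigma^{2}}{N}+W-\frac{K_{w}}{2}}\f]
    \f[U_{Q_{a}}=W-K_{w}+\frac{2\alpha^{2}\sigma^{2}}{N}+K_{s_{a}}\f]
    \f[score=\frac{\sigma^{2}K}{N}+U_{Q_{a}}+\frac{\sigma^{2}\beta\sqrt{2K}}{N}\f]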
    W = 0.0
    K = len(clusters)
    N = 0.0

    sigma_square = 0.0

    alpha = self.__alpha
    alpha_square = alpha * alpha
    beta = self.__beta

    for index_cluster in range(0, len(clusters), 1):
        Ni = len(clusters[index_cluster])
        if Ni == 0:
            return float('inf')

        # Accumulate the within-cluster distortion of the current cluster.
        Wi = 0.0
        for index_object in clusters[index_cluster]:
            Wi += euclidean_distance_square(self.__pointer_data[index_object], centers[index_cluster])

        sigma_square += Wi
        W += Wi / Ni
        N += Ni

    if N - K <= 0:
        return float('inf')

    sigma_square /= (N - K)
    sigma = sigma_square ** 0.5

    Kw = (1.0 - K / N) * sigma_square
    Ksa = (2.0 * alpha * sigma / (N ** 0.5)) * (alpha_square * sigma_square / N + W - Kw / 2.0) ** 0.5
    UQa = W - Kw + 2.0 * alpha_square * sigma_square / N + Ksa

    score = sigma_square * K / N + UQa + sigma_square * beta * ((2.0 * K) ** 0.5) / N

    return score
def __bayesian_information_criterion(self, clusters, centers):
    @brief Calculates splitting criterion for input clusters using the Bayesian information criterion.

    @param[in] clusters (list): Clusters for which splitting criterion should be calculated.
    @param[in] centers (list): Centers of the clusters.

    @return (double) Splitting criterion in line with the Bayesian information criterion.
             A high value of the splitting criterion means that the current structure is better.

    @see __minimum_noiseless_description_length(clusters, centers)
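    As a sketch of what the code below computes (per cluster \f$j\f$ with \f$n_{j}\f$ points, \f$d\f$-dimensional data,
    \f$K\f$ clusters, \f$N\f$ points in total, pooled variance estimate \f$\hat{\sigma}^{2}\f$ and
    \f$p=(K-1)+dK+1\f$ free parameters):

    \f[L_{j}=n_{j}\log n_{j}-n_{j}\log N-\frac{n_{j}}{2}\log(2\pi)-\frac{n_{j}d}{2}\log\hat{\sigma}^{2}-\frac{n_{j}-K}{2}\f]
    \f[score_{j}=L_{j}-\frac{p}{2}\log N\f]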
    scores = [float('inf')] * len(clusters)
    dimension = len(self.__pointer_data[0])

    sigma_sqrt = 0.0
    K = len(clusters)
    N = 0.0

    for index_cluster in range(0, len(clusters), 1):
        for index_object in clusters[index_cluster]:
            sigma_sqrt += euclidean_distance_square(self.__pointer_data[index_object], centers[index_cluster])

        N += len(clusters[index_cluster])

    if N - K > 0:
        sigma_sqrt /= (N - K)
        p = (K - 1) + dimension * K + 1

        # In case of identical points the variance estimate can be zero.
        sigma_multiplier = 0.0
        if sigma_sqrt <= 0.0:
            sigma_multiplier = float('-inf')
        else:
            sigma_multiplier = dimension * 0.5 * log(sigma_sqrt)

        # Splitting criterion.
        for index_cluster in range(0, len(clusters), 1):
            n = len(clusters[index_cluster])

            L = n * log(n) - n * log(N) - n * 0.5 * log(2.0 * numpy.pi) - n * sigma_multiplier - (n - K) * 0.5

            # BIC calculation.
            scores[index_cluster] = L - p * 0.5 * log(N)

    return sum(scores)
def __verify_arguments(self):
    @brief Verifies input parameters for the algorithm and throws an exception in case of incorrectness.

    if len(self.__pointer_data) == 0:
        raise ValueError("Input data is empty (size: '%d')." % len(self.__pointer_data))

    if len(self.__centers) == 0:
        raise ValueError("Initial centers are empty (size: '%d')." % len(self.__centers))

    if self.__tolerance < 0:
        raise ValueError("Tolerance (current value: '%f') should be greater than or equal to 0." % self.__tolerance)

    if self.__repeat <= 0:
        raise ValueError("Repeat (current value: '%d') should be greater than 0." % self.__repeat)

    if self.__alpha < 0.0 or self.__alpha > 1.0:
        raise ValueError("Parameter for the probabilistic bound Q(alpha) should be in the range [0, 1] "
                         "(current value: '%f')." % self.__alpha)

    if self.__beta < 0.0 or self.__beta > 1.0:
        raise ValueError("Parameter for the probabilistic bound Q(beta) should be in the range [0, 1] "
                         "(current value: '%f')." % self.__beta)