"""!

@brief Cluster analysis algorithm: X-Means
@details Implementation based on paper @cite article::xmeans::1.

@authors Andrei Novikov (pyclustering@yandex.ru)
@copyright GNU Public License

@cond GNU_PUBLIC_LICENSE
    PyClustering is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    PyClustering is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program. If not, see <http://www.gnu.org/licenses/>.
@endcond

"""


import numpy
import random

from enum import IntEnum
from math import log

from pyclustering.cluster.encoder import type_encoding
from pyclustering.cluster.kmeans import kmeans
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer

from pyclustering.core.wrapper import ccore_library

import pyclustering.core.xmeans_wrapper as wrapper

from pyclustering.utils import euclidean_distance_square, euclidean_distance, distance_metric, type_metric
class splitting_type(IntEnum):
    """!
    @brief Enumeration of splitting types that can be used for cluster splitting in the X-Means algorithm.

    """

    ## Bayesian information criterion (BIC) is used as the splitting criterion.
    BAYESIAN_INFORMATION_CRITERION = 0

    ## Minimum noiseless description length (MNDL) is used as the splitting criterion.
    MINIMUM_NOISELESS_DESCRIPTION_LENGTH = 1
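The splitting criterion is selected through the `criterion` argument of the `xmeans` constructor shown below; BIC is the default. A minimal sketch of switching to MNDL (the data values are illustrative only):

    from pyclustering.cluster.xmeans import xmeans, splitting_type

    # Tiny inline 2-D dataset with two obvious groups (illustrative values).
    sample = [[0.1, 0.1], [0.2, 0.15], [0.15, 0.2],
              [3.0, 3.1], [3.2, 2.9], [2.9, 3.0]]

    # Select the MNDL splitting criterion instead of the default BIC.
    xmeans_mndl = xmeans(sample, [[0.1, 0.1], [3.0, 3.0]], 5,
                         criterion=splitting_type.MINIMUM_NOISELESS_DESCRIPTION_LENGTH)
    xmeans_mndl.process()
    print(xmeans_mndl.get_clusters())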
class xmeans:
    """!
    @brief Class represents the clustering algorithm X-Means.
    @details The X-Means clustering method starts with a minimum assumed number of clusters and then dynamically
              increases it. X-Means uses the specified splitting criterion to control the process of splitting
              clusters. The K-Means++ method can be used to calculate the initial centers.

              The CCORE implementation of the algorithm uses a thread pool to parallelize the clustering process.

    Here is an example of how to perform cluster analysis using the X-Means algorithm:
    @code
        from pyclustering.cluster import cluster_visualizer
        from pyclustering.cluster.xmeans import xmeans
        from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
        from pyclustering.utils import read_sample
        from pyclustering.samples.definitions import SIMPLE_SAMPLES

        # Read sample 'simple3' from file.
        sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3)

        # Prepare initial centers - the amount of initial centers defines the amount of clusters from which X-Means will start the analysis.
        amount_initial_centers = 2
        initial_centers = kmeans_plusplus_initializer(sample, amount_initial_centers).initialize()

        # Create instance of X-Means algorithm. The algorithm will start the analysis from 2 clusters; the maximum
        # number of clusters that can be allocated is 20.
        xmeans_instance = xmeans(sample, initial_centers, 20)
        xmeans_instance.process()

        # Extract clustering results: clusters and their centers.
        clusters = xmeans_instance.get_clusters()
        centers = xmeans_instance.get_centers()

        # Visualize clustering results.
        visualizer = cluster_visualizer()
        visualizer.append_clusters(clusters, sample)
        visualizer.append_cluster(centers, None, marker='*', markersize=10)
        visualizer.show()
    @endcode

    Visualization of the clustering results obtained by the code above, where X-Means allocates four clusters:
    @image html xmeans_clustering_simple3.png "Fig. 1. X-Means clustering results (data 'Simple3')."

    @see center_initializer

    """

    def __init__(self, data, initial_centers=None, kmax=20, tolerance=0.025, criterion=splitting_type.BAYESIAN_INFORMATION_CRITERION, ccore=True):
        """!
        @brief Constructor of the X-Means clustering algorithm.

        @param[in] data (list): Input data presented as a list of points (objects); each point should be represented by a list or tuple.
        @param[in] initial_centers (list): Initial coordinates of cluster centers represented as a list: [center1, center2, ...];
                    if it is not specified then X-Means starts from a single random center.
        @param[in] kmax (uint): Maximum number of clusters that can be allocated.
        @param[in] tolerance (double): Stop condition for each iteration: if the maximum change of the cluster centers is less than the tolerance, the algorithm stops processing.
        @param[in] criterion (splitting_type): Type of splitting criterion.
        @param[in] ccore (bool): Defines whether CCORE (the C++ pyclustering library) should be used instead of the Python code.

        """

        if initial_centers is not None:
            self.__centers = initial_centers[:]
        else:
            # No centers were provided - start from a single random center.
            self.__centers = [[random.random() for _ in range(len(data[0]))]]

        # Check whether the C++ core library is available on the current platform.
        self.__ccore = ccore_library.workable()
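A short usage sketch of the constructor defaults described above: when `initial_centers` is omitted, the search starts from a single random center, and `kmax`, `tolerance` and `ccore` can be tuned as needed (the parameter values below are illustrative only):

    from pyclustering.cluster.xmeans import xmeans, splitting_type
    from pyclustering.utils import read_sample
    from pyclustering.samples.definitions import SIMPLE_SAMPLES

    sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3)

    # Start from one random center, allow up to 10 clusters, use a tighter tolerance
    # and force the pure Python implementation.
    xmeans_instance = xmeans(sample, kmax=10, tolerance=0.001,
                             criterion=splitting_type.BAYESIAN_INFORMATION_CRITERION,
                             ccore=False)
    xmeans_instance.process()
    print(xmeans_instance.get_clusters())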
    def process(self):
        """!
        @brief Performs cluster analysis in line with the rules of the X-Means algorithm.

        @remark Results of clustering can be obtained using the corresponding get methods.

        """

        # Number of centers before the improvement steps of the current iteration ...
        current_cluster_number = len(self.__centers)

        # ... the algorithm stops once the structure-improvement step allocates no new centers.
        if current_cluster_number == len(allocated_centers):
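The two statements above come from the main X-Means loop. A minimal self-contained sketch of how such a loop can be organized, where `improve_parameters` and `improve_structure` stand in for the private helpers documented further below; this is an illustration, not the verbatim library code:

    # Illustrative sketch of an X-Means-style outer loop.
    def xmeans_outer_loop(initial_centers, kmax, improve_parameters, improve_structure):
        centers = initial_centers
        clusters = []

        while len(centers) <= kmax:
            current_cluster_number = len(centers)

            # Refine the current centers with k-means over the whole data set.
            clusters, centers = improve_parameters(centers)

            # Try to split each cluster in two; keep splits that improve the splitting criterion.
            allocated_centers = improve_structure(clusters, centers)

            # Stop when no new centers were allocated.
            if current_cluster_number == len(allocated_centers):
                break

            centers = allocated_centers

        return clusters, centers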
    def predict(self, points):
        """!
        @brief Calculates the closest cluster for each point.

        @param[in] points (array_like): Points for which the closest clusters are calculated.

        @return (list) List of closest clusters for each point. Each cluster is denoted by its index. An empty
                 collection is returned if the 'process()' method was not called.

        An example of how to calculate (or predict) the closest cluster for specified points:
        @code
            from pyclustering.cluster.xmeans import xmeans
            from pyclustering.samples.definitions import SIMPLE_SAMPLES
            from pyclustering.utils import read_sample

            # Load list of points for cluster analysis.
            sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3)

            # Initial centers for sample 'Simple3'.
            initial_centers = [[0.2, 0.1], [4.0, 1.0], [2.0, 2.0], [2.3, 3.9]]

            # Create instance of X-Means algorithm with prepared centers.
            xmeans_instance = xmeans(sample, initial_centers)

            # Run cluster analysis.
            xmeans_instance.process()

            # Calculate the closest cluster to the following two points.
            points = [[0.25, 0.2], [2.5, 4.0]]
            closest_clusters = xmeans_instance.predict(points)
            print(closest_clusters)
        @endcode

        """

        nppoints = numpy.array(points)

        metric = distance_metric(type_metric.EUCLIDEAN_SQUARE, numpy_usage=True)

        # Squared Euclidean distance from every input point to every cluster center.
        differences = numpy.zeros((len(nppoints), len(self.__centers)))
        for index_point in range(len(nppoints)):
            differences[index_point] = metric(nppoints[index_point], self.__centers)

        # Index of the nearest center for each point.
        return numpy.argmin(differences, axis=1)
    def get_clusters(self):
        """!
        @brief Returns the list of allocated clusters; each cluster contains indexes of objects from the input data.

        @return (list) List of allocated clusters.

        """

        return self.__clusters


    def get_centers(self):
        """!
        @brief Returns the list of centers of the allocated clusters.

        @return (list) List of centers of the allocated clusters.

        """

        return self.__centers


    def get_cluster_encoding(self):
        """!
        @brief Returns the clustering result representation type that indicates how clusters are encoded.

        @return (type_encoding) Clustering result representation.

        """

        return type_encoding.CLUSTER_INDEX_LIST_SEPARATION
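The returned encoding can be converted to another representation with pyclustering's `cluster_encoder`. A short sketch, assuming the standard `pyclustering.cluster.encoder` module is available:

    from pyclustering.cluster.encoder import cluster_encoder, type_encoding
    from pyclustering.cluster.xmeans import xmeans
    from pyclustering.utils import read_sample
    from pyclustering.samples.definitions import SIMPLE_SAMPLES

    sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3)
    xmeans_instance = xmeans(sample, [[0.2, 0.1], [4.0, 1.0]], 20)
    xmeans_instance.process()

    # X-Means reports clusters as lists of point indexes (CLUSTER_INDEX_LIST_SEPARATION).
    clusters = xmeans_instance.get_clusters()
    encoding = xmeans_instance.get_cluster_encoding()

    # Convert the result to a flat labeling: one cluster label per input point.
    encoder = cluster_encoder(encoding, clusters, sample)
    encoder.set_encoding(type_encoding.CLUSTER_INDEX_LABELING)
    labels = encoder.get_clusters()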
    def __improve_parameters(self, centers, available_indexes=None):
        """!
        @brief Performs k-means clustering in the specified region.

        @param[in] centers (list): Centers of clusters.
        @param[in] available_indexes (list): Indexes that define which points can be used for k-means clustering; if None, then all points are used.

        @return (list) List of allocated clusters, where each cluster contains indexes of objects from the input data.

        """

        # A region that consists of a single point cannot be split any further.
        if available_indexes and len(available_indexes) == 1:
            index_center = available_indexes[0]
            return [available_indexes], self.__pointer_data[index_center]

        # Extract the local portion of the data when a region is specified.
        if available_indexes:
            local_data = [self.__pointer_data[i] for i in available_indexes]
        else:
            local_data = self.__pointer_data

        local_centers = centers
        if centers is None:
            # No centers were provided - prepare two initial centers for the region using k-means++.
            local_centers = kmeans_plusplus_initializer(local_data, 2).initialize()

        kmeans_instance = kmeans(local_data, local_centers, tolerance=self.__tolerance, ccore=False)
        kmeans_instance.process()

        local_centers = kmeans_instance.get_centers()

        clusters = kmeans_instance.get_clusters()
        if available_indexes:
            # Map local point indexes back to global indexes of the input data.
            clusters = self.__local_to_global_clusters(clusters, available_indexes)

        return clusters, local_centers
    def __local_to_global_clusters(self, local_clusters, available_indexes):
        """!
        @brief Converts clusters in the local region defined by 'available_indexes' to global clusters.

        @param[in] local_clusters (list): Local clusters in the specific region.
        @param[in] available_indexes (list): Map between local and global point indexes.

        @return Global clusters.

        """

        clusters = []
        for local_cluster in local_clusters:
            current_cluster = []
            for index_point in local_cluster:
                # Replace the local index by the corresponding global index.
                current_cluster.append(available_indexes[index_point])

            clusters.append(current_cluster)

        return clusters
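For instance, with `available_indexes = [5, 7, 9]` the local cluster `[0, 2]` maps to the global cluster `[5, 9]`. A tiny standalone illustration of the same mapping (the function below is a copy of the logic above, not the private method itself):

    def local_to_global(local_clusters, available_indexes):
        """Map local point indexes back to their global indexes."""
        return [[available_indexes[i] for i in cluster] for cluster in local_clusters]

    # Two local clusters over a region containing global points 5, 7 and 9.
    print(local_to_global([[0, 2], [1]], [5, 7, 9]))   # [[5, 9], [7]]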
    def __improve_structure(self, clusters, centers):
        """!
        @brief Checks for the best structure: divides each cluster into two and keeps the division if it gives better results according to the splitting criterion.

        @param[in] clusters (list): Clusters that have been allocated (each cluster contains indexes of points from the data).
        @param[in] centers (list): Centers of the clusters.

        @return (list) Allocated centers for clustering.

        """

        allocated_centers = []
        amount_free_centers = self.__kmax - len(centers)

        for index_cluster in range(len(clusters)):
            # Split the cluster into two child clusters using local k-means.
            (parent_child_clusters, parent_child_centers) = self.__improve_parameters(None, clusters[index_cluster])

            # If the split succeeded, compare the parent structure with the child structure.
            if len(parent_child_clusters) > 1:
                parent_scores = self.__splitting_criterion([clusters[index_cluster]], [centers[index_cluster]])
                child_scores = self.__splitting_criterion([parent_child_clusters[0], parent_child_clusters[1]], parent_child_centers)

                split_require = False

                # Decide whether the split improves the structure according to the selected criterion.
                if self.__criterion == splitting_type.BAYESIAN_INFORMATION_CRITERION:
                    if parent_scores < child_scores:
                        split_require = True

                elif self.__criterion == splitting_type.MINIMUM_NOISELESS_DESCRIPTION_LENGTH:
                    # For MNDL a lower score is better, so the split is kept when the children score lower.
                    if parent_scores > child_scores:
                        split_require = True

                if (split_require is True) and (amount_free_centers > 0):
                    allocated_centers.append(parent_child_centers[0])
                    allocated_centers.append(parent_child_centers[1])

                    amount_free_centers -= 1
                else:
                    allocated_centers.append(centers[index_cluster])
            else:
                allocated_centers.append(centers[index_cluster])

        return allocated_centers
    def __splitting_criterion(self, clusters, centers):
        """!
        @brief Calculates the splitting criterion for the input clusters.

        @param[in] clusters (list): Clusters for which the splitting criterion should be calculated.
        @param[in] centers (list): Centers of the clusters.

        @return (double) Splitting criterion. For BIC a higher value means a better structure; for MNDL a lower value means a better structure.

        @see __bayesian_information_criterion(clusters, centers)
        @see __minimum_noiseless_description_length(clusters, centers)

        """

        if self.__criterion == splitting_type.BAYESIAN_INFORMATION_CRITERION:
            return self.__bayesian_information_criterion(clusters, centers)

        elif self.__criterion == splitting_type.MINIMUM_NOISELESS_DESCRIPTION_LENGTH:
            return self.__minimum_noiseless_description_length(clusters, centers)
    def __minimum_noiseless_description_length(self, clusters, centers):
        """!
        @brief Calculates the splitting criterion for the input clusters using the minimum noiseless description length (MNDL) criterion.

        @param[in] clusters (list): Clusters for which the splitting criterion should be calculated.
        @param[in] centers (list): Centers of the clusters.

        @return (double) Splitting criterion in line with the minimum noiseless description length criterion.
                 A low value of the splitting criterion means that the current structure is better.

        @see __bayesian_information_criterion(clusters, centers)

        """

        scores = float('inf')

        K = len(clusters)
        N = 0.0
        W = 0.0
        sigma_sqrt = 0.0

        # alpha and betta are fixed weighting coefficients of the MNDL criterion.

        for index_cluster in range(0, len(clusters), 1):
            Ni = len(clusters[index_cluster])
            if Ni == 0:
                return float('inf')

            # Sum of distances from the points of the cluster to its center.
            Wi = 0.0
            for index_object in clusters[index_cluster]:
                Wi += euclidean_distance(self.__pointer_data[index_object], centers[index_cluster])

            sigma_sqrt += Wi
            W += Wi / Ni
            N += Ni

        if N - K > 0:
            sigma_sqrt /= (N - K)
            sigma = sigma_sqrt ** 0.5

            Kw = (1.0 - K / N) * sigma_sqrt
            Ks = (2.0 * alpha * sigma / (N ** 0.5)) * ((alpha ** 2.0) * sigma_sqrt / N + W - Kw / 2.0) ** 0.5

            scores = sigma_sqrt * (2 * K) ** 0.5 * ((2 * K) ** 0.5 + betta) / N + W - sigma_sqrt + Ks + 2 * alpha ** 0.5 * sigma_sqrt / N

        return scores
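For readability, the score assembled in the last assignment above corresponds to the following expression, where \f$\sigma^2\f$ stands for `sigma_sqrt` and \f$\alpha\f$, \f$\beta\f$ (`betta`) and \f$W\f$ are the intermediate quantities computed in the method:

\f[
K_w = \left(1 - \frac{K}{N}\right)\sigma^2, \qquad
K_s = \frac{2\alpha\sigma}{\sqrt{N}}\sqrt{\frac{\alpha^2\sigma^2}{N} + W - \frac{K_w}{2}},
\f]
\f[
\mathrm{score} = \frac{\sigma^2\sqrt{2K}\left(\sqrt{2K} + \beta\right)}{N} + W - \sigma^2 + K_s + \frac{2\sqrt{\alpha}\,\sigma^2}{N}.
\f]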
    def __bayesian_information_criterion(self, clusters, centers):
        """!
        @brief Calculates the splitting criterion for the input clusters using the Bayesian information criterion (BIC).

        @param[in] clusters (list): Clusters for which the splitting criterion should be calculated.
        @param[in] centers (list): Centers of the clusters.

        @return (double) Splitting criterion in line with the Bayesian information criterion.
                 A high value of the splitting criterion means that the current structure is better.

        @see __minimum_noiseless_description_length(clusters, centers)

        """

        scores = [float('inf')] * len(clusters)
        dimension = len(self.__pointer_data[0])

        # Estimation of the noise variance in the data set.
        sigma_sqrt = 0.0
        K = len(clusters)
        N = 0.0

        for index_cluster in range(0, len(clusters), 1):
            for index_object in clusters[index_cluster]:
                sigma_sqrt += euclidean_distance_square(self.__pointer_data[index_object], centers[index_cluster])

            N += len(clusters[index_cluster])

        if N - K > 0:
            sigma_sqrt /= (N - K)
            p = (K - 1) + dimension * K + 1

            # sigma_sqrt can be zero when all points of a cluster coincide.
            sigma_multiplier = 0.0
            if sigma_sqrt <= 0.0:
                sigma_multiplier = float('-inf')
            else:
                sigma_multiplier = dimension * 0.5 * log(sigma_sqrt)

            # Log-likelihood and BIC score of each cluster.
            for index_cluster in range(0, len(clusters), 1):
                n = len(clusters[index_cluster])

                L = n * log(n) - n * log(N) - n * 0.5 * log(2.0 * numpy.pi) - n * sigma_multiplier - (n - K) * 0.5

                scores[index_cluster] = L - p * 0.5 * log(N)

        return sum(scores)
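Restating the loop above: for a cluster with \f$n\f$ points, \f$N\f$ points overall, \f$K\f$ clusters, data dimension \f$d\f$, pooled variance estimate \f$\sigma^2\f$ (`sigma_sqrt`) and \f$p = (K - 1) + dK + 1\f$ free parameters, the per-cluster score is

\f[
L = n \ln n - n \ln N - \frac{n}{2} \ln (2\pi) - \frac{nd}{2} \ln \sigma^2 - \frac{n - K}{2},
\qquad
\mathrm{score} = L - \frac{p}{2} \ln N,
\f]

and the criterion returned for the input clusters is the sum of these per-cluster scores.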