3 @brief Cluster analysis algorithm: X-Means 4 @details Implementation based on paper @cite article::xmeans::1. 6 @authors Andrei Novikov (pyclustering@yandex.ru) 8 @copyright GNU Public License 10 @cond GNU_PUBLIC_LICENSE 11 PyClustering is free software: you can redistribute it and/or modify 12 it under the terms of the GNU General Public License as published by 13 the Free Software Foundation, either version 3 of the License, or 14 (at your option) any later version. 16 PyClustering is distributed in the hope that it will be useful, 17 but WITHOUT ANY WARRANTY; without even the implied warranty of 18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 19 GNU General Public License for more details. 21 You should have received a copy of the GNU General Public License 22 along with this program. If not, see <http://www.gnu.org/licenses/>. 31 from enum
import IntEnum
39 from pyclustering.core.wrapper
import ccore_library
41 import pyclustering.core.xmeans_wrapper
as wrapper
43 from pyclustering.utils import euclidean_distance_square, euclidean_distance, distance_metric, type_metric
48 @brief Enumeration of splitting types that can be used as splitting creation of cluster in X-Means algorithm. 64 BAYESIAN_INFORMATION_CRITERION = 0
75 MINIMUM_NOISELESS_DESCRIPTION_LENGTH = 1
80 @brief Class represents clustering algorithm X-Means. 81 @details X-means clustering method starts with the assumption of having a minimum number of clusters, 82 and then dynamically increases them. X-means uses specified splitting criterion to control 83 the process of splitting clusters. Method K-Means++ can be used for calculation of initial centers. 85 CCORE implementation of the algorithm uses thread pool to parallelize the clustering process. 87 Here example how to perform cluster analysis using X-Means algorithm: 89 from pyclustering.cluster import cluster_visualizer 90 from pyclustering.cluster.xmeans import xmeans 91 from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer 92 from pyclustering.utils import read_sample 93 from pyclustering.samples.definitions import SIMPLE_SAMPLES 95 # Read sample 'simple3' from file. 96 sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3) 98 # Prepare initial centers - amount of initial centers defines amount of clusters from which X-Means will 100 amount_initial_centers = 2 101 initial_centers = kmeans_plusplus_initializer(sample, amount_initial_centers).initialize() 103 # Create instance of X-Means algorithm. The algorithm will start analysis from 2 clusters, the maximum 104 # number of clusters that can be allocated is 20. 105 xmeans_instance = xmeans(sample, initial_centers, 20) 106 xmeans_instance.process() 108 # Extract clustering results: clusters and their centers 109 clusters = xmeans_instance.get_clusters() 110 centers = xmeans_instance.get_centers() 112 # Print total sum of metric errors 113 print("Total WCE:", xmeans_instance.get_total_wce()) 115 # Visualize clustering results 116 visualizer = cluster_visualizer() 117 visualizer.append_clusters(clusters, sample) 118 visualizer.append_cluster(centers, None, marker='*', markersize=10) 122 Visualization of clustering results that were obtained using code above and where X-Means algorithm allocates four clusters. 123 @image html xmeans_clustering_simple3.png "Fig. 1. X-Means clustering results (data 'Simple3')." 125 @see center_initializer 129 def __init__(self, data, initial_centers=None, kmax=20, tolerance=0.025, criterion=splitting_type.BAYESIAN_INFORMATION_CRITERION, ccore=True, **kwargs):
131 @brief Constructor of clustering algorithm X-Means. 133 @param[in] data (list): Input data that is presented as list of points (objects), each point should be represented by list or tuple. 134 @param[in] initial_centers (list): Initial coordinates of centers of clusters that are represented by list: [center1, center2, ...], 135 if it is not specified then X-Means starts from the random center. 136 @param[in] kmax (uint): Maximum number of clusters that can be allocated. 137 @param[in] tolerance (double): Stop condition for each iteration: if maximum value of change of centers of clusters is less than tolerance than algorithm will stop processing. 138 @param[in] criterion (splitting_type): Type of splitting creation. 139 @param[in] ccore (bool): Defines should be CCORE (C++ pyclustering library) used instead of Python code or not. 140 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'repeat'). 142 <b>Keyword Args:</b><br> 143 - repeat (unit): How many times K-Means should be run to improve parameters (by default is 1). 144 With larger 'repeat' values suggesting higher probability of finding global optimum. 151 if initial_centers
is not None:
160 self.
__repeat = kwargs.get(
'repeat', 1)
164 self.
__ccore = ccore_library.workable()
171 @brief Performs cluster analysis in line with rules of X-Means algorithm. 173 @return (xmeans) Returns itself (X-Means instance). 189 def __process_by_ccore(self):
191 @brief Performs cluster analysis using CCORE (C/C++ part of pyclustering library). 201 def __process_by_python(self):
203 @brief Performs cluster analysis using python code. 209 current_cluster_number = len(self.
__centers)
214 if current_cluster_number == len(allocated_centers):
224 @brief Calculates the closest cluster to each point. 226 @param[in] points (array_like): Points for which closest clusters are calculated. 228 @return (list) List of closest clusters for each point. Each cluster is denoted by index. Return empty 229 collection if 'process()' method was not called. 231 An example how to calculate (or predict) the closest cluster to specified points. 233 from pyclustering.cluster.xmeans import xmeans 234 from pyclustering.samples.definitions import SIMPLE_SAMPLES 235 from pyclustering.utils import read_sample 237 # Load list of points for cluster analysis. 238 sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3) 240 # Initial centers for sample 'Simple3'. 241 initial_centers = [[0.2, 0.1], [4.0, 1.0], [2.0, 2.0], [2.3, 3.9]] 243 # Create instance of X-Means algorithm with prepared centers. 244 xmeans_instance = xmeans(sample, initial_centers) 246 # Run cluster analysis. 247 xmeans_instance.process() 249 # Calculate the closest cluster to following two points. 250 points = [[0.25, 0.2], [2.5, 4.0]] 251 closest_clusters = xmeans_instance.predict(points) 252 print(closest_clusters) 256 nppoints = numpy.array(points)
260 metric = distance_metric(type_metric.EUCLIDEAN_SQUARE, numpy_usage=
True)
262 differences = numpy.zeros((len(nppoints), len(self.
__centers)))
263 for index_point
in range(len(nppoints)):
264 differences[index_point] = metric(nppoints[index_point], self.
__centers)
266 return numpy.argmin(differences, axis=1)
271 @brief Returns list of allocated clusters, each cluster contains indexes of objects in list of data. 273 @return (list) List of allocated clusters. 286 @brief Returns list of centers for allocated clusters. 288 @return (list) List of centers for allocated clusters. 301 @brief Returns clustering result representation type that indicate how clusters are encoded. 303 @return (type_encoding) Clustering result representation. 309 return type_encoding.CLUSTER_INDEX_LIST_SEPARATION
314 @brief Returns sum of Euclidean Squared metric errors (SSE - Sum of Squared Errors). 315 @details Sum of metric errors is calculated using distance between point and its center: 316 \f[error=\sum_{i=0}^{N}euclidean_square_distance(x_{i}-center(x_{i}))\f] 326 def __search_optimial_parameters(self, local_data):
328 @brief Split data of the region into two cluster and tries to find global optimum by running k-means clustering 329 several times (defined by 'repeat' argument). 331 @param[in] local_data (list): Points of a region that should be split into two clusters. 333 @return (tuple) List of allocated clusters, list of centers and total WCE (clusters, centers, wce). 336 optimal_wce, optimal_centers, optimal_clusters = float(
'+inf'),
None,
None 340 if len(local_data) < candidates:
341 candidates = len(local_data)
345 kmeans_instance =
kmeans(local_data, local_centers, tolerance=self.
__tolerance, ccore=
False)
346 kmeans_instance.process()
348 local_wce = kmeans_instance.get_total_wce()
349 if local_wce < optimal_wce:
350 optimal_centers = kmeans_instance.get_centers()
351 optimal_clusters = kmeans_instance.get_clusters()
352 optimal_wce = local_wce
354 return optimal_clusters, optimal_centers, optimal_wce
357 def __improve_parameters(self, centers, available_indexes=None):
359 @brief Performs k-means clustering in the specified region. 361 @param[in] centers (list): Cluster centers, if None then automatically generated two centers using center initialization method. 362 @param[in] available_indexes (list): Indexes that defines which points can be used for k-means clustering, if None then all points are used. 364 @return (tuple) List of allocated clusters, list of centers and total WCE (clusters, centers, wce). 368 if available_indexes
and len(available_indexes) == 1:
369 index_center = available_indexes[0]
370 return [available_indexes], self.
__pointer_data[index_center], 0.0
373 if available_indexes:
376 local_centers = centers
382 local_wce = kmeans_instance.get_total_wce()
383 local_centers = kmeans_instance.get_centers()
384 clusters = kmeans_instance.get_clusters()
386 if available_indexes:
389 return clusters, local_centers, local_wce
392 def __local_to_global_clusters(self, local_clusters, available_indexes):
394 @brief Converts clusters in local region define by 'available_indexes' to global clusters. 396 @param[in] local_clusters (list): Local clusters in specific region. 397 @param[in] available_indexes (list): Map between local and global point's indexes. 399 @return Global clusters. 404 for local_cluster
in local_clusters:
406 for index_point
in local_cluster:
407 current_cluster.append(available_indexes[index_point])
409 clusters.append(current_cluster)
414 def __improve_structure(self, clusters, centers):
416 @brief Check for best structure: divides each cluster into two and checks for best results using splitting criterion. 418 @param[in] clusters (list): Clusters that have been allocated (each cluster contains indexes of points from data). 419 @param[in] centers (list): Centers of clusters. 421 @return (list) Allocated centers for clustering. 425 allocated_centers = []
426 amount_free_centers = self.
__kmax - len(centers)
428 for index_cluster
in range(len(clusters)):
430 (parent_child_clusters, parent_child_centers, _) = self.
__improve_parameters(
None, clusters[index_cluster])
433 if len(parent_child_clusters) > 1:
436 child_scores = self.
__splitting_criterion([parent_child_clusters[0], parent_child_clusters[1]], parent_child_centers)
438 split_require =
False 441 if self.
__criterion == splitting_type.BAYESIAN_INFORMATION_CRITERION:
442 if parent_scores < child_scores: split_require =
True 444 elif self.
__criterion == splitting_type.MINIMUM_NOISELESS_DESCRIPTION_LENGTH:
447 if parent_scores > child_scores: split_require =
True;
449 if (split_require
is True)
and (amount_free_centers > 0):
450 allocated_centers.append(parent_child_centers[0])
451 allocated_centers.append(parent_child_centers[1])
453 amount_free_centers -= 1
455 allocated_centers.append(centers[index_cluster])
458 allocated_centers.append(centers[index_cluster])
460 return allocated_centers
463 def __splitting_criterion(self, clusters, centers):
465 @brief Calculates splitting criterion for input clusters. 467 @param[in] clusters (list): Clusters for which splitting criterion should be calculated. 468 @param[in] centers (list): Centers of the clusters. 470 @return (double) Returns splitting criterion. High value of splitting cretion means that current structure is much better. 472 @see __bayesian_information_criterion(clusters, centers) 473 @see __minimum_noiseless_description_length(clusters, centers) 477 if self.
__criterion == splitting_type.BAYESIAN_INFORMATION_CRITERION:
480 elif self.
__criterion == splitting_type.MINIMUM_NOISELESS_DESCRIPTION_LENGTH:
487 def __minimum_noiseless_description_length(self, clusters, centers):
489 @brief Calculates splitting criterion for input clusters using minimum noiseless description length criterion. 491 @param[in] clusters (list): Clusters for which splitting criterion should be calculated. 492 @param[in] centers (list): Centers of the clusters. 494 @return (double) Returns splitting criterion in line with bayesian information criterion. 495 Low value of splitting cretion means that current structure is much better. 497 @see __bayesian_information_criterion(clusters, centers) 501 scores = float(
'inf')
512 for index_cluster
in range(0, len(clusters), 1):
513 Ni = len(clusters[index_cluster])
518 for index_object
in clusters[index_cluster]:
521 Wi += euclidean_distance(self.
__pointer_data[index_object], centers[index_cluster])
528 sigma_sqrt /= (N - K)
529 sigma = sigma_sqrt ** 0.5
531 Kw = (1.0 - K / N) * sigma_sqrt
532 Ks = ( 2.0 * alpha * sigma / (N ** 0.5) ) * ( (alpha ** 2.0) * sigma_sqrt / N + W - Kw / 2.0 ) ** 0.5
534 scores = sigma_sqrt * (2 * K)**0.5 * ((2 * K)**0.5 + betta) / N + W - sigma_sqrt + Ks + 2 * alpha**0.5 * sigma_sqrt / N
539 def __bayesian_information_criterion(self, clusters, centers):
541 @brief Calculates splitting criterion for input clusters using bayesian information criterion. 543 @param[in] clusters (list): Clusters for which splitting criterion should be calculated. 544 @param[in] centers (list): Centers of the clusters. 546 @return (double) Splitting criterion in line with bayesian information criterion. 547 High value of splitting criterion means that current structure is much better. 549 @see __minimum_noiseless_description_length(clusters, centers) 553 scores = [float(
'inf')] * len(clusters)
561 for index_cluster
in range(0, len(clusters), 1):
562 for index_object
in clusters[index_cluster]:
563 sigma_sqrt += euclidean_distance_square(self.
__pointer_data[index_object], centers[index_cluster])
565 N += len(clusters[index_cluster])
568 sigma_sqrt /= (N - K)
569 p = (K - 1) + dimension * K + 1
572 sigma_multiplier = 0.0
573 if sigma_sqrt <= 0.0:
574 sigma_multiplier = float(
'-inf')
576 sigma_multiplier = dimension * 0.5 * log(sigma_sqrt)
579 for index_cluster
in range(0, len(clusters), 1):
580 n = len(clusters[index_cluster])
582 L = n * log(n) - n * log(N) - n * 0.5 * log(2.0 * numpy.pi) - n * sigma_multiplier - (n - K) * 0.5
585 scores[index_cluster] = L - p * 0.5 * log(N)
590 def __verify_arguments(self):
592 @brief Verify input parameters for the algorithm and throw exception in case of incorrectness. 596 raise ValueError(
"Input data is empty (size: '%d')." % len(self.
__pointer_data))
599 raise ValueError(
"Initial centers are empty (size: '%d')." % len(self.
__pointer_data))
602 raise ValueError(
"Tolerance (current value: '%d') should be greater or equal to 0." %
606 raise ValueError(
"Repeat (current value: '%d') should be greater than 0." %
def __improve_structure(self, clusters, centers)
Check for best structure: divides each cluster into two and checks for best results using splitting c...
def __splitting_criterion(self, clusters, centers)
Calculates splitting criterion for input clusters.
def __minimum_noiseless_description_length(self, clusters, centers)
Calculates splitting criterion for input clusters using minimum noiseless description length criterio...
def get_cluster_encoding(self)
Returns clustering result representation type that indicate how clusters are encoded.
The module contains K-Means algorithm and other related services.
Utils that are used by modules of pyclustering.
def __process_by_python(self)
Performs cluster analysis using python code.
def process(self)
Performs cluster analysis in line with rules of X-Means algorithm.
Module for representing clustering results.
def get_total_wce(self)
Returns sum of Euclidean Squared metric errors (SSE - Sum of Squared Errors).
def __local_to_global_clusters(self, local_clusters, available_indexes)
Converts clusters in local region define by 'available_indexes' to global clusters.
def __bayesian_information_criterion(self, clusters, centers)
Calculates splitting criterion for input clusters using bayesian information criterion.
Class represents clustering algorithm X-Means.
K-Means++ is an algorithm for choosing the initial centers for algorithms like K-Means or X-Means...
def __verify_arguments(self)
Verify input parameters for the algorithm and throw exception in case of incorrectness.
Class implements K-Means clustering algorithm.
def get_clusters(self)
Returns list of allocated clusters, each cluster contains indexes of objects in list of data...
def get_centers(self)
Returns list of centers for allocated clusters.
def __search_optimial_parameters(self, local_data)
Split data of the region into two cluster and tries to find global optimum by running k-means cluster...
def __improve_parameters(self, centers, available_indexes=None)
Performs k-means clustering in the specified region.
Collection of center initializers for algorithm that uses initial centers, for example, for K-Means or X-Means.
def predict(self, points)
Calculates the closest cluster to each point.
def __process_by_ccore(self)
Performs cluster analysis using CCORE (C/C++ part of pyclustering library).
def __init__(self, data, initial_centers=None, kmax=20, tolerance=0.025, criterion=splitting_type.BAYESIAN_INFORMATION_CRITERION, ccore=True, kwargs)
Constructor of clustering algorithm X-Means.
Enumeration of splitting types that can be used as splitting creation of cluster in X-Means algorithm...