3 @brief Cluster analysis algorithm: X-Means 4 @details Implementation based on paper @cite article::xmeans::1. 6 @authors Andrei Novikov (pyclustering@yandex.ru) 8 @copyright GNU Public License 10 @cond GNU_PUBLIC_LICENSE 11 PyClustering is free software: you can redistribute it and/or modify 12 it under the terms of the GNU General Public License as published by 13 the Free Software Foundation, either version 3 of the License, or 14 (at your option) any later version. 16 PyClustering is distributed in the hope that it will be useful, 17 but WITHOUT ANY WARRANTY; without even the implied warranty of 18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 19 GNU General Public License for more details. 21 You should have received a copy of the GNU General Public License 22 along with this program. If not, see <http://www.gnu.org/licenses/>. 30 from enum
import IntEnum
38 from pyclustering.core.wrapper
import ccore_library
40 import pyclustering.core.xmeans_wrapper
as wrapper
42 from pyclustering.utils import euclidean_distance_square, euclidean_distance, distance_metric, type_metric
47 @brief Enumeration of splitting types that can be used as splitting creation of cluster in X-Means algorithm. 63 BAYESIAN_INFORMATION_CRITERION = 0
74 MINIMUM_NOISELESS_DESCRIPTION_LENGTH = 1
79 @brief Class represents clustering algorithm X-Means. 80 @details X-means clustering method starts with the assumption of having a minimum number of clusters, 81 and then dynamically increases them. X-means uses specified splitting criterion to control 82 the process of splitting clusters. Method K-Means++ can be used for calculation of initial centers. 84 CCORE implementation of the algorithm uses thread pool to parallelize the clustering process. 86 Here example how to perform cluster analysis using X-Means algorithm: 88 from pyclustering.cluster import cluster_visualizer 89 from pyclustering.cluster.xmeans import xmeans 90 from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer 91 from pyclustering.utils import read_sample 92 from pyclustering.samples.definitions import SIMPLE_SAMPLES 94 # Read sample 'simple3' from file. 95 sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3) 97 # Prepare initial centers - amount of initial centers defines amount of clusters from which X-Means will 99 amount_initial_centers = 2 100 initial_centers = kmeans_plusplus_initializer(sample, amount_initial_centers).initialize() 102 # Create instance of X-Means algorithm. The algorithm will start analysis from 2 clusters, the maximum 103 # number of clusters that can be allocated is 20. 104 xmeans_instance = xmeans(sample, initial_centers, 20) 105 xmeans_instance.process() 107 # Extract clustering results: clusters and their centers 108 clusters = xmeans_instance.get_clusters() 109 centers = xmeans_instance.get_centers() 111 # Print total sum of metric errors 112 print("Total WCE:", xmeans_instance.get_total_wce()) 114 # Visualize clustering results 115 visualizer = cluster_visualizer() 116 visualizer.append_clusters(clusters, sample) 117 visualizer.append_cluster(centers, None, marker='*', markersize=10) 121 Visualization of clustering results that were obtained using code above and where X-Means algorithm allocates four clusters. 122 @image html xmeans_clustering_simple3.png "Fig. 1. X-Means clustering results (data 'Simple3')." 124 @see center_initializer 128 def __init__(self, data, initial_centers=None, kmax=20, tolerance=0.025, criterion=splitting_type.BAYESIAN_INFORMATION_CRITERION, ccore=True, **kwargs):
130 @brief Constructor of clustering algorithm X-Means. 132 @param[in] data (list): Input data that is presented as list of points (objects), each point should be represented by list or tuple. 133 @param[in] initial_centers (list): Initial coordinates of centers of clusters that are represented by list: `[center1, center2, ...]`, 134 if it is not specified then X-Means starts from the random center. 135 @param[in] kmax (uint): Maximum number of clusters that can be allocated. 136 @param[in] tolerance (double): Stop condition for each iteration: if maximum value of change of centers of clusters is less than tolerance than algorithm will stop processing. 137 @param[in] criterion (splitting_type): Type of splitting creation. 138 @param[in] ccore (bool): Defines if C++ pyclustering library should be used instead of Python implementation. 139 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: `repeat`, `random_state`). 141 <b>Keyword Args:</b><br> 142 - repeat (unit): How many times K-Means should be run to improve parameters (by default is `1`). 143 With larger `repeat` values suggesting higher probability of finding global optimum. 144 - random_state (int): Seed for random state (by default is `None`, current system time is used). 152 if initial_centers
is not None:
161 self.
__repeat = kwargs.get(
'repeat', 1)
165 self.
__ccore = ccore_library.workable()
172 @brief Performs cluster analysis in line with rules of X-Means algorithm. 174 @return (xmeans) Returns itself (X-Means instance). 190 def __process_by_ccore(self):
192 @brief Performs cluster analysis using CCORE (C/C++ part of pyclustering library). 202 def __process_by_python(self):
204 @brief Performs cluster analysis using python code. 210 current_cluster_number = len(self.
__centers)
215 if current_cluster_number == len(allocated_centers):
225 @brief Calculates the closest cluster to each point. 227 @param[in] points (array_like): Points for which closest clusters are calculated. 229 @return (list) List of closest clusters for each point. Each cluster is denoted by index. Return empty 230 collection if 'process()' method was not called. 232 An example how to calculate (or predict) the closest cluster to specified points. 234 from pyclustering.cluster.xmeans import xmeans 235 from pyclustering.samples.definitions import SIMPLE_SAMPLES 236 from pyclustering.utils import read_sample 238 # Load list of points for cluster analysis. 239 sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3) 241 # Initial centers for sample 'Simple3'. 242 initial_centers = [[0.2, 0.1], [4.0, 1.0], [2.0, 2.0], [2.3, 3.9]] 244 # Create instance of X-Means algorithm with prepared centers. 245 xmeans_instance = xmeans(sample, initial_centers) 247 # Run cluster analysis. 248 xmeans_instance.process() 250 # Calculate the closest cluster to following two points. 251 points = [[0.25, 0.2], [2.5, 4.0]] 252 closest_clusters = xmeans_instance.predict(points) 253 print(closest_clusters) 257 nppoints = numpy.array(points)
261 metric = distance_metric(type_metric.EUCLIDEAN_SQUARE, numpy_usage=
True)
263 differences = numpy.zeros((len(nppoints), len(self.
__centers)))
264 for index_point
in range(len(nppoints)):
265 differences[index_point] = metric(nppoints[index_point], self.
__centers)
267 return numpy.argmin(differences, axis=1)
272 @brief Returns list of allocated clusters, each cluster contains indexes of objects in list of data. 274 @return (list) List of allocated clusters. 287 @brief Returns list of centers for allocated clusters. 289 @return (list) List of centers for allocated clusters. 302 @brief Returns clustering result representation type that indicate how clusters are encoded. 304 @return (type_encoding) Clustering result representation. 310 return type_encoding.CLUSTER_INDEX_LIST_SEPARATION
315 @brief Returns sum of Euclidean Squared metric errors (SSE - Sum of Squared Errors). 316 @details Sum of metric errors is calculated using distance between point and its center: 317 \f[error=\sum_{i=0}^{N}euclidean_square_distance(x_{i}-center(x_{i}))\f] 327 def __search_optimial_parameters(self, local_data):
329 @brief Split data of the region into two cluster and tries to find global optimum by running k-means clustering 330 several times (defined by 'repeat' argument). 332 @param[in] local_data (list): Points of a region that should be split into two clusters. 334 @return (tuple) List of allocated clusters, list of centers and total WCE (clusters, centers, wce). 337 optimal_wce, optimal_centers, optimal_clusters = float(
'+inf'),
None,
None 341 if len(local_data) < candidates:
342 candidates = len(local_data)
346 kmeans_instance =
kmeans(local_data, local_centers, tolerance=self.
__tolerance, ccore=
False)
347 kmeans_instance.process()
349 local_wce = kmeans_instance.get_total_wce()
350 if local_wce < optimal_wce:
351 optimal_centers = kmeans_instance.get_centers()
352 optimal_clusters = kmeans_instance.get_clusters()
353 optimal_wce = local_wce
355 return optimal_clusters, optimal_centers, optimal_wce
358 def __improve_parameters(self, centers, available_indexes=None):
360 @brief Performs k-means clustering in the specified region. 362 @param[in] centers (list): Cluster centers, if None then automatically generated two centers using center initialization method. 363 @param[in] available_indexes (list): Indexes that defines which points can be used for k-means clustering, if None then all points are used. 365 @return (tuple) List of allocated clusters, list of centers and total WCE (clusters, centers, wce). 369 if available_indexes
and len(available_indexes) == 1:
370 index_center = available_indexes[0]
371 return [available_indexes], self.
__pointer_data[index_center], 0.0
374 if available_indexes:
377 local_centers = centers
383 local_wce = kmeans_instance.get_total_wce()
384 local_centers = kmeans_instance.get_centers()
385 clusters = kmeans_instance.get_clusters()
387 if available_indexes:
390 return clusters, local_centers, local_wce
393 def __local_to_global_clusters(self, local_clusters, available_indexes):
395 @brief Converts clusters in local region define by 'available_indexes' to global clusters. 397 @param[in] local_clusters (list): Local clusters in specific region. 398 @param[in] available_indexes (list): Map between local and global point's indexes. 400 @return Global clusters. 405 for local_cluster
in local_clusters:
407 for index_point
in local_cluster:
408 current_cluster.append(available_indexes[index_point])
410 clusters.append(current_cluster)
415 def __improve_structure(self, clusters, centers):
417 @brief Check for best structure: divides each cluster into two and checks for best results using splitting criterion. 419 @param[in] clusters (list): Clusters that have been allocated (each cluster contains indexes of points from data). 420 @param[in] centers (list): Centers of clusters. 422 @return (list) Allocated centers for clustering. 426 allocated_centers = []
427 amount_free_centers = self.
__kmax - len(centers)
429 for index_cluster
in range(len(clusters)):
431 (parent_child_clusters, parent_child_centers, _) = self.
__improve_parameters(
None, clusters[index_cluster])
434 if len(parent_child_clusters) > 1:
437 child_scores = self.
__splitting_criterion([parent_child_clusters[0], parent_child_clusters[1]], parent_child_centers)
439 split_require =
False 442 if self.
__criterion == splitting_type.BAYESIAN_INFORMATION_CRITERION:
443 if parent_scores < child_scores: split_require =
True 445 elif self.
__criterion == splitting_type.MINIMUM_NOISELESS_DESCRIPTION_LENGTH:
448 if parent_scores > child_scores: split_require =
True;
450 if (split_require
is True)
and (amount_free_centers > 0):
451 allocated_centers.append(parent_child_centers[0])
452 allocated_centers.append(parent_child_centers[1])
454 amount_free_centers -= 1
456 allocated_centers.append(centers[index_cluster])
459 allocated_centers.append(centers[index_cluster])
461 return allocated_centers
464 def __splitting_criterion(self, clusters, centers):
466 @brief Calculates splitting criterion for input clusters. 468 @param[in] clusters (list): Clusters for which splitting criterion should be calculated. 469 @param[in] centers (list): Centers of the clusters. 471 @return (double) Returns splitting criterion. High value of splitting cretion means that current structure is much better. 473 @see __bayesian_information_criterion(clusters, centers) 474 @see __minimum_noiseless_description_length(clusters, centers) 478 if self.
__criterion == splitting_type.BAYESIAN_INFORMATION_CRITERION:
481 elif self.
__criterion == splitting_type.MINIMUM_NOISELESS_DESCRIPTION_LENGTH:
488 def __minimum_noiseless_description_length(self, clusters, centers):
490 @brief Calculates splitting criterion for input clusters using minimum noiseless description length criterion. 492 @param[in] clusters (list): Clusters for which splitting criterion should be calculated. 493 @param[in] centers (list): Centers of the clusters. 495 @return (double) Returns splitting criterion in line with bayesian information criterion. 496 Low value of splitting cretion means that current structure is much better. 498 @see __bayesian_information_criterion(clusters, centers) 502 scores = float(
'inf')
513 for index_cluster
in range(0, len(clusters), 1):
514 Ni = len(clusters[index_cluster])
519 for index_object
in clusters[index_cluster]:
522 Wi += euclidean_distance(self.
__pointer_data[index_object], centers[index_cluster])
529 sigma_sqrt /= (N - K)
530 sigma = sigma_sqrt ** 0.5
532 Kw = (1.0 - K / N) * sigma_sqrt
533 Ks = ( 2.0 * alpha * sigma / (N ** 0.5) ) * ( (alpha ** 2.0) * sigma_sqrt / N + W - Kw / 2.0 ) ** 0.5
535 scores = sigma_sqrt * (2 * K)**0.5 * ((2 * K)**0.5 + betta) / N + W - sigma_sqrt + Ks + 2 * alpha**0.5 * sigma_sqrt / N
540 def __bayesian_information_criterion(self, clusters, centers):
542 @brief Calculates splitting criterion for input clusters using bayesian information criterion. 544 @param[in] clusters (list): Clusters for which splitting criterion should be calculated. 545 @param[in] centers (list): Centers of the clusters. 547 @return (double) Splitting criterion in line with bayesian information criterion. 548 High value of splitting criterion means that current structure is much better. 550 @see __minimum_noiseless_description_length(clusters, centers) 554 scores = [float(
'inf')] * len(clusters)
562 for index_cluster
in range(0, len(clusters), 1):
563 for index_object
in clusters[index_cluster]:
564 sigma_sqrt += euclidean_distance_square(self.
__pointer_data[index_object], centers[index_cluster])
566 N += len(clusters[index_cluster])
569 sigma_sqrt /= (N - K)
570 p = (K - 1) + dimension * K + 1
573 sigma_multiplier = 0.0
574 if sigma_sqrt <= 0.0:
575 sigma_multiplier = float(
'-inf')
577 sigma_multiplier = dimension * 0.5 * log(sigma_sqrt)
580 for index_cluster
in range(0, len(clusters), 1):
581 n = len(clusters[index_cluster])
583 L = n * log(n) - n * log(N) - n * 0.5 * log(2.0 * numpy.pi) - n * sigma_multiplier - (n - K) * 0.5
586 scores[index_cluster] = L - p * 0.5 * log(N)
591 def __verify_arguments(self):
593 @brief Verify input parameters for the algorithm and throw exception in case of incorrectness. 597 raise ValueError(
"Input data is empty (size: '%d')." % len(self.
__pointer_data))
600 raise ValueError(
"Initial centers are empty (size: '%d')." % len(self.
__pointer_data))
603 raise ValueError(
"Tolerance (current value: '%d') should be greater or equal to 0." %
607 raise ValueError(
"Repeat (current value: '%d') should be greater than 0." %
def __improve_structure(self, clusters, centers)
Check for best structure: divides each cluster into two and checks for best results using splitting c...
def __splitting_criterion(self, clusters, centers)
Calculates splitting criterion for input clusters.
def __minimum_noiseless_description_length(self, clusters, centers)
Calculates splitting criterion for input clusters using minimum noiseless description length criterio...
def get_cluster_encoding(self)
Returns clustering result representation type that indicate how clusters are encoded.
The module contains K-Means algorithm and other related services.
Utils that are used by modules of pyclustering.
def __process_by_python(self)
Performs cluster analysis using python code.
def process(self)
Performs cluster analysis in line with rules of X-Means algorithm.
Module for representing clustering results.
def get_total_wce(self)
Returns sum of Euclidean Squared metric errors (SSE - Sum of Squared Errors).
def __local_to_global_clusters(self, local_clusters, available_indexes)
Converts clusters in local region define by 'available_indexes' to global clusters.
def __bayesian_information_criterion(self, clusters, centers)
Calculates splitting criterion for input clusters using bayesian information criterion.
Class represents clustering algorithm X-Means.
K-Means++ is an algorithm for choosing the initial centers for algorithms like K-Means or X-Means...
def __verify_arguments(self)
Verify input parameters for the algorithm and throw exception in case of incorrectness.
Class implements K-Means clustering algorithm.
def get_clusters(self)
Returns list of allocated clusters, each cluster contains indexes of objects in list of data...
def get_centers(self)
Returns list of centers for allocated clusters.
def __search_optimial_parameters(self, local_data)
Split data of the region into two cluster and tries to find global optimum by running k-means cluster...
def __improve_parameters(self, centers, available_indexes=None)
Performs k-means clustering in the specified region.
Collection of center initializers for algorithm that uses initial centers, for example, for K-Means or X-Means.
def predict(self, points)
Calculates the closest cluster to each point.
def __process_by_ccore(self)
Performs cluster analysis using CCORE (C/C++ part of pyclustering library).
def __init__(self, data, initial_centers=None, kmax=20, tolerance=0.025, criterion=splitting_type.BAYESIAN_INFORMATION_CRITERION, ccore=True, kwargs)
Constructor of clustering algorithm X-Means.
Enumeration of splitting types that can be used as splitting creation of cluster in X-Means algorithm...