3 @brief Cluster analysis algorithm: X-Means 4 @details Implementation based on paper @cite article::syncnet::1. 6 @authors Andrei Novikov (pyclustering@yandex.ru) 8 @copyright GNU Public License 10 @cond GNU_PUBLIC_LICENSE 11 PyClustering is free software: you can redistribute it and/or modify 12 it under the terms of the GNU General Public License as published by 13 the Free Software Foundation, either version 3 of the License, or 14 (at your option) any later version. 16 PyClustering is distributed in the hope that it will be useful, 17 but WITHOUT ANY WARRANTY; without even the implied warranty of 18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 19 GNU General Public License for more details. 21 You should have received a copy of the GNU General Public License 22 along with this program. If not, see <http://www.gnu.org/licenses/>. 31 from enum
import IntEnum
39 from pyclustering.core.wrapper
import ccore_library
41 import pyclustering.core.xmeans_wrapper
as wrapper
48 @brief Enumeration of splitting types that can be used as splitting creation of cluster in X-Means algorithm. 64 BAYESIAN_INFORMATION_CRITERION = 0
75 MINIMUM_NOISELESS_DESCRIPTION_LENGTH = 1
80 @brief Class represents clustering algorithm X-Means. 81 @details X-means clustering method starts with the assumption of having a minimum number of clusters, 82 and then dynamically increases them. X-means uses specified splitting criterion to control 83 the process of splitting clusters. Method K-Means++ can be used for calculation of initial centers. 85 CCORE option can be used to use the pyclustering core - C/C++ shared library for processing that significantly increases performance. 87 CCORE implementation of the algorithm uses thread pool to parallelize the clustering process. 89 Here example how to perform cluster analysis using X-Means algorithm: 91 from pyclustering.cluster import cluster_visualizer 92 from pyclustering.cluster.xmeans import xmeans 93 from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer 94 from pyclustering.utils import read_sample 95 from pyclustering.samples.definitions import SIMPLE_SAMPLES 97 # Read sample 'simple3' from file. 98 sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3) 100 # Prepare initial centers - amount of initial centers defines amount of clusters from which X-Means will 102 amount_initial_centers = 2 103 initial_centers = kmeans_plusplus_initializer(sample, amount_initial_centers).initialize() 105 # Create instance of X-Means algorithm. The algorithm will start analysis from 2 clusters, the maximum 106 # number of clusters that can be allocated is 20. 107 xmeans_instance = xmeans(sample, initial_centers, 20) 108 xmeans_instance.process() 110 # Extract clustering results: clusters and their centers 111 clusters = xmeans_instance.get_clusters() 112 centers = xmeans_instance.get_centers() 114 # Visualize clustering results 115 visualizer = cluster_visualizer() 116 visualizer.append_clusters(clusters, sample) 117 visualizer.append_cluster(centers, None, marker='*') 121 Visualization of clustering results that were obtained using code above and where X-Means algorithm allocates four clusters. 122 @image html xmeans_clustering_simple3.png "Fig. 1. X-Means clustering results (data 'Simple3')." 124 @see center_initializer 128 def __init__(self, data, initial_centers = None, kmax = 20, tolerance = 0.025, criterion = splitting_type.BAYESIAN_INFORMATION_CRITERION, ccore = True):
130 @brief Constructor of clustering algorithm X-Means. 132 @param[in] data (list): Input data that is presented as list of points (objects), each point should be represented by list or tuple. 133 @param[in] initial_centers (list): Initial coordinates of centers of clusters that are represented by list: [center1, center2, ...], 134 if it is not specified then X-Means starts from the random center. 135 @param[in] kmax (uint): Maximum number of clusters that can be allocated. 136 @param[in] tolerance (double): Stop condition for each iteration: if maximum value of change of centers of clusters is less than tolerance than algorithm will stop processing. 137 @param[in] criterion (splitting_type): Type of splitting creation. 138 @param[in] ccore (bool): Defines should be CCORE (C++ pyclustering library) used instead of Python code or not. 145 if initial_centers
is not None:
148 self.
__centers = [ [random.random()
for _
in range(len(data[0])) ] ]
156 self.
__ccore = ccore_library.workable()
161 @brief Performs cluster analysis in line with rules of X-Means algorithm. 163 @remark Results of clustering can be obtained using corresponding gets methods. 176 current_cluster_number = len(self.
__centers)
181 if current_cluster_number == len(allocated_centers):
192 @brief Returns list of allocated clusters, each cluster contains indexes of objects in list of data. 194 @return (list) List of allocated clusters. 206 @brief Returns list of centers for allocated clusters. 208 @return (list) List of centers for allocated clusters. 220 @brief Returns clustering result representation type that indicate how clusters are encoded. 222 @return (type_encoding) Clustering result representation. 228 return type_encoding.CLUSTER_INDEX_LIST_SEPARATION
231 def __improve_parameters(self, centers, available_indexes = None):
233 @brief Performs k-means clustering in the specified region. 235 @param[in] centers (list): Centers of clusters. 236 @param[in] available_indexes (list): Indexes that defines which points can be used for k-means clustering, if None - then all points are used. 238 @return (list) List of allocated clusters, each cluster contains indexes of objects in list of data. 242 if available_indexes
and len(available_indexes) == 1:
243 index_center = available_indexes[0]
247 if available_indexes:
248 local_data = [ self.
__pointer_data[i]
for i
in available_indexes ]
250 local_centers = centers
254 kmeans_instance =
kmeans(local_data, local_centers, tolerance=self.
__tolerance, ccore=
False)
255 kmeans_instance.process()
257 local_centers = kmeans_instance.get_centers()
259 clusters = kmeans_instance.get_clusters()
260 if available_indexes:
263 return clusters, local_centers
266 def __local_to_global_clusters(self, local_clusters, available_indexes):
268 @brief Converts clusters in local region define by 'available_indexes' to global clusters. 270 @param[in] local_clusters (list): Local clusters in specific region. 271 @param[in] available_indexes (list): Map between local and global point's indexes. 273 @return Global clusters. 278 for local_cluster
in local_clusters:
280 for index_point
in local_cluster:
281 current_cluster.append(available_indexes[index_point])
283 clusters.append(current_cluster)
288 def __improve_structure(self, clusters, centers):
290 @brief Check for best structure: divides each cluster into two and checks for best results using splitting criterion. 292 @param[in] clusters (list): Clusters that have been allocated (each cluster contains indexes of points from data). 293 @param[in] centers (list): Centers of clusters. 295 @return (list) Allocated centers for clustering. 299 allocated_centers = []
300 amount_free_centers = self.
__kmax - len(centers)
302 for index_cluster
in range(len(clusters)):
304 (parent_child_clusters, parent_child_centers) = self.
__improve_parameters(
None, clusters[index_cluster])
307 if len(parent_child_clusters) > 1:
310 child_scores = self.
__splitting_criterion([ parent_child_clusters[0], parent_child_clusters[1] ], parent_child_centers)
312 split_require =
False 315 if self.
__criterion == splitting_type.BAYESIAN_INFORMATION_CRITERION:
316 if parent_scores < child_scores: split_require =
True 318 elif self.
__criterion == splitting_type.MINIMUM_NOISELESS_DESCRIPTION_LENGTH:
321 if parent_scores > child_scores: split_require =
True;
323 if (split_require
is True)
and (amount_free_centers > 0):
324 allocated_centers.append(parent_child_centers[0])
325 allocated_centers.append(parent_child_centers[1])
327 amount_free_centers -= 1
329 allocated_centers.append(centers[index_cluster])
333 allocated_centers.append(centers[index_cluster])
335 return allocated_centers
338 def __splitting_criterion(self, clusters, centers):
340 @brief Calculates splitting criterion for input clusters. 342 @param[in] clusters (list): Clusters for which splitting criterion should be calculated. 343 @param[in] centers (list): Centers of the clusters. 345 @return (double) Returns splitting criterion. High value of splitting cretion means that current structure is much better. 347 @see __bayesian_information_criterion(clusters, centers) 348 @see __minimum_noiseless_description_length(clusters, centers) 352 if self.
__criterion == splitting_type.BAYESIAN_INFORMATION_CRITERION:
355 elif self.
__criterion == splitting_type.MINIMUM_NOISELESS_DESCRIPTION_LENGTH:
362 def __minimum_noiseless_description_length(self, clusters, centers):
364 @brief Calculates splitting criterion for input clusters using minimum noiseless description length criterion. 366 @param[in] clusters (list): Clusters for which splitting criterion should be calculated. 367 @param[in] centers (list): Centers of the clusters. 369 @return (double) Returns splitting criterion in line with bayesian information criterion. 370 Low value of splitting cretion means that current structure is much better. 372 @see __bayesian_information_criterion(clusters, centers) 376 scores = float(
'inf')
387 for index_cluster
in range(0, len(clusters), 1):
388 Ni = len(clusters[index_cluster])
393 for index_object
in clusters[index_cluster]:
396 Wi += euclidean_distance(self.
__pointer_data[index_object], centers[index_cluster])
403 sigma_sqrt /= (N - K)
404 sigma = sigma_sqrt ** 0.5
406 Kw = (1.0 - K / N) * sigma_sqrt
407 Ks = ( 2.0 * alpha * sigma / (N ** 0.5) ) * ( (alpha ** 2.0) * sigma_sqrt / N + W - Kw / 2.0 ) ** 0.5
409 scores = sigma_sqrt * (2 * K)**0.5 * ((2 * K)**0.5 + betta) / N + W - sigma_sqrt + Ks + 2 * alpha**0.5 * sigma_sqrt / N
414 def __bayesian_information_criterion(self, clusters, centers):
416 @brief Calculates splitting criterion for input clusters using bayesian information criterion. 418 @param[in] clusters (list): Clusters for which splitting criterion should be calculated. 419 @param[in] centers (list): Centers of the clusters. 421 @return (double) Splitting criterion in line with bayesian information criterion. 422 High value of splitting criterion means that current structure is much better. 424 @see __minimum_noiseless_description_length(clusters, centers) 428 scores = [float(
'inf')] * len(clusters)
436 for index_cluster
in range(0, len(clusters), 1):
437 for index_object
in clusters[index_cluster]:
438 sigma_sqrt += euclidean_distance_square(self.
__pointer_data[index_object], centers[index_cluster]);
440 N += len(clusters[index_cluster])
443 sigma_sqrt /= (N - K)
444 p = (K - 1) + dimension * K + 1
447 sigma_multiplier = 0.0
448 if sigma_sqrt <= 0.0:
449 sigma_multiplier = float(
'-inf')
451 sigma_multiplier = dimension * 0.5 * log(sigma_sqrt)
454 for index_cluster
in range(0, len(clusters), 1):
455 n = len(clusters[index_cluster])
457 L = n * log(n) - n * log(N) - n * 0.5 * log(2.0 * numpy.pi) - n * sigma_multiplier - (n - K) * 0.5
460 scores[index_cluster] = L - p * 0.5 * log(N)
def __improve_structure(self, clusters, centers)
Check for best structure: divides each cluster into two and checks for best results using splitting c...
def __splitting_criterion(self, clusters, centers)
Calculates splitting criterion for input clusters.
def __minimum_noiseless_description_length(self, clusters, centers)
Calculates splitting criterion for input clusters using minimum noiseless description length criterio...
def get_cluster_encoding(self)
Returns clustering result representation type that indicate how clusters are encoded.
Cluster analysis algorithm: K-Means.
Utils that are used by modules of pyclustering.
def process(self)
Performs cluster analysis in line with rules of X-Means algorithm.
Module for representing clustering results.
def __local_to_global_clusters(self, local_clusters, available_indexes)
Converts clusters in local region define by 'available_indexes' to global clusters.
def __bayesian_information_criterion(self, clusters, centers)
Calculates splitting criterion for input clusters using bayesian information criterion.
Class represents clustering algorithm X-Means.
K-Means++ is an algorithm for choosing the initial centers for algorithms like K-Means or X-Means...
Class represents K-Means clustering algorithm.
def get_clusters(self)
Returns list of allocated clusters, each cluster contains indexes of objects in list of data...
def get_centers(self)
Returns list of centers for allocated clusters.
def __improve_parameters(self, centers, available_indexes=None)
Performs k-means clustering in the specified region.
Collection of center initializers for algorithm that uses initial centers, for example, for K-Means or X-Means.
Enumeration of splitting types that can be used as splitting creation of cluster in X-Means algorithm...
def __init__(self, data, initial_centers=None, kmax=20, tolerance=0.025, criterion=splitting_type.BAYESIAN_INFORMATION_CRITERION, ccore=True)
Constructor of clustering algorithm X-Means.