3 @brief Cluster analysis algorithm: X-Means 4 @details Implementation based on paper @cite article::xmeans::1. 6 @authors Andrei Novikov (pyclustering@yandex.ru) 8 @copyright GNU Public License 10 @cond GNU_PUBLIC_LICENSE 11 PyClustering is free software: you can redistribute it and/or modify 12 it under the terms of the GNU General Public License as published by 13 the Free Software Foundation, either version 3 of the License, or 14 (at your option) any later version. 16 PyClustering is distributed in the hope that it will be useful, 17 but WITHOUT ANY WARRANTY; without even the implied warranty of 18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 19 GNU General Public License for more details. 21 You should have received a copy of the GNU General Public License 22 along with this program. If not, see <http://www.gnu.org/licenses/>. 31 from enum
import IntEnum
39 from pyclustering.core.wrapper
import ccore_library
41 import pyclustering.core.xmeans_wrapper
as wrapper
48 @brief Enumeration of splitting types that can be used as splitting creation of cluster in X-Means algorithm. 64 BAYESIAN_INFORMATION_CRITERION = 0
75 MINIMUM_NOISELESS_DESCRIPTION_LENGTH = 1
80 @brief Class represents clustering algorithm X-Means. 81 @details X-means clustering method starts with the assumption of having a minimum number of clusters, 82 and then dynamically increases them. X-means uses specified splitting criterion to control 83 the process of splitting clusters. Method K-Means++ can be used for calculation of initial centers. 85 CCORE implementation of the algorithm uses thread pool to parallelize the clustering process. 87 Here example how to perform cluster analysis using X-Means algorithm: 89 from pyclustering.cluster import cluster_visualizer 90 from pyclustering.cluster.xmeans import xmeans 91 from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer 92 from pyclustering.utils import read_sample 93 from pyclustering.samples.definitions import SIMPLE_SAMPLES 95 # Read sample 'simple3' from file. 96 sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3) 98 # Prepare initial centers - amount of initial centers defines amount of clusters from which X-Means will 100 amount_initial_centers = 2 101 initial_centers = kmeans_plusplus_initializer(sample, amount_initial_centers).initialize() 103 # Create instance of X-Means algorithm. The algorithm will start analysis from 2 clusters, the maximum 104 # number of clusters that can be allocated is 20. 105 xmeans_instance = xmeans(sample, initial_centers, 20) 106 xmeans_instance.process() 108 # Extract clustering results: clusters and their centers 109 clusters = xmeans_instance.get_clusters() 110 centers = xmeans_instance.get_centers() 112 # Visualize clustering results 113 visualizer = cluster_visualizer() 114 visualizer.append_clusters(clusters, sample) 115 visualizer.append_cluster(centers, None, marker='*', markersize=10) 119 Visualization of clustering results that were obtained using code above and where X-Means algorithm allocates four clusters. 120 @image html xmeans_clustering_simple3.png "Fig. 1. X-Means clustering results (data 'Simple3')." 122 @see center_initializer 126 def __init__(self, data, initial_centers = None, kmax = 20, tolerance = 0.025, criterion = splitting_type.BAYESIAN_INFORMATION_CRITERION, ccore = True):
128 @brief Constructor of clustering algorithm X-Means. 130 @param[in] data (list): Input data that is presented as list of points (objects), each point should be represented by list or tuple. 131 @param[in] initial_centers (list): Initial coordinates of centers of clusters that are represented by list: [center1, center2, ...], 132 if it is not specified then X-Means starts from the random center. 133 @param[in] kmax (uint): Maximum number of clusters that can be allocated. 134 @param[in] tolerance (double): Stop condition for each iteration: if maximum value of change of centers of clusters is less than tolerance than algorithm will stop processing. 135 @param[in] criterion (splitting_type): Type of splitting creation. 136 @param[in] ccore (bool): Defines should be CCORE (C++ pyclustering library) used instead of Python code or not. 143 if initial_centers
is not None:
146 self.
__centers = [ [random.random()
for _
in range(len(data[0])) ] ]
154 self.
__ccore = ccore_library.workable()
159 @brief Performs cluster analysis in line with rules of X-Means algorithm. 161 @remark Results of clustering can be obtained using corresponding gets methods. 174 current_cluster_number = len(self.
__centers)
179 if current_cluster_number == len(allocated_centers):
190 @brief Returns list of allocated clusters, each cluster contains indexes of objects in list of data. 192 @return (list) List of allocated clusters. 204 @brief Returns list of centers for allocated clusters. 206 @return (list) List of centers for allocated clusters. 218 @brief Returns clustering result representation type that indicate how clusters are encoded. 220 @return (type_encoding) Clustering result representation. 226 return type_encoding.CLUSTER_INDEX_LIST_SEPARATION
229 def __improve_parameters(self, centers, available_indexes = None):
231 @brief Performs k-means clustering in the specified region. 233 @param[in] centers (list): Centers of clusters. 234 @param[in] available_indexes (list): Indexes that defines which points can be used for k-means clustering, if None - then all points are used. 236 @return (list) List of allocated clusters, each cluster contains indexes of objects in list of data. 240 if available_indexes
and len(available_indexes) == 1:
241 index_center = available_indexes[0]
245 if available_indexes:
246 local_data = [ self.
__pointer_data[i]
for i
in available_indexes ]
248 local_centers = centers
252 kmeans_instance =
kmeans(local_data, local_centers, tolerance=self.
__tolerance, ccore=
False)
253 kmeans_instance.process()
255 local_centers = kmeans_instance.get_centers()
257 clusters = kmeans_instance.get_clusters()
258 if available_indexes:
261 return clusters, local_centers
264 def __local_to_global_clusters(self, local_clusters, available_indexes):
266 @brief Converts clusters in local region define by 'available_indexes' to global clusters. 268 @param[in] local_clusters (list): Local clusters in specific region. 269 @param[in] available_indexes (list): Map between local and global point's indexes. 271 @return Global clusters. 276 for local_cluster
in local_clusters:
278 for index_point
in local_cluster:
279 current_cluster.append(available_indexes[index_point])
281 clusters.append(current_cluster)
286 def __improve_structure(self, clusters, centers):
288 @brief Check for best structure: divides each cluster into two and checks for best results using splitting criterion. 290 @param[in] clusters (list): Clusters that have been allocated (each cluster contains indexes of points from data). 291 @param[in] centers (list): Centers of clusters. 293 @return (list) Allocated centers for clustering. 297 allocated_centers = []
298 amount_free_centers = self.
__kmax - len(centers)
300 for index_cluster
in range(len(clusters)):
302 (parent_child_clusters, parent_child_centers) = self.
__improve_parameters(
None, clusters[index_cluster])
305 if len(parent_child_clusters) > 1:
308 child_scores = self.
__splitting_criterion([ parent_child_clusters[0], parent_child_clusters[1] ], parent_child_centers)
310 split_require =
False 313 if self.
__criterion == splitting_type.BAYESIAN_INFORMATION_CRITERION:
314 if parent_scores < child_scores: split_require =
True 316 elif self.
__criterion == splitting_type.MINIMUM_NOISELESS_DESCRIPTION_LENGTH:
319 if parent_scores > child_scores: split_require =
True;
321 if (split_require
is True)
and (amount_free_centers > 0):
322 allocated_centers.append(parent_child_centers[0])
323 allocated_centers.append(parent_child_centers[1])
325 amount_free_centers -= 1
327 allocated_centers.append(centers[index_cluster])
331 allocated_centers.append(centers[index_cluster])
333 return allocated_centers
336 def __splitting_criterion(self, clusters, centers):
338 @brief Calculates splitting criterion for input clusters. 340 @param[in] clusters (list): Clusters for which splitting criterion should be calculated. 341 @param[in] centers (list): Centers of the clusters. 343 @return (double) Returns splitting criterion. High value of splitting cretion means that current structure is much better. 345 @see __bayesian_information_criterion(clusters, centers) 346 @see __minimum_noiseless_description_length(clusters, centers) 350 if self.
__criterion == splitting_type.BAYESIAN_INFORMATION_CRITERION:
353 elif self.
__criterion == splitting_type.MINIMUM_NOISELESS_DESCRIPTION_LENGTH:
360 def __minimum_noiseless_description_length(self, clusters, centers):
362 @brief Calculates splitting criterion for input clusters using minimum noiseless description length criterion. 364 @param[in] clusters (list): Clusters for which splitting criterion should be calculated. 365 @param[in] centers (list): Centers of the clusters. 367 @return (double) Returns splitting criterion in line with bayesian information criterion. 368 Low value of splitting cretion means that current structure is much better. 370 @see __bayesian_information_criterion(clusters, centers) 374 scores = float(
'inf')
385 for index_cluster
in range(0, len(clusters), 1):
386 Ni = len(clusters[index_cluster])
391 for index_object
in clusters[index_cluster]:
394 Wi += euclidean_distance(self.
__pointer_data[index_object], centers[index_cluster])
401 sigma_sqrt /= (N - K)
402 sigma = sigma_sqrt ** 0.5
404 Kw = (1.0 - K / N) * sigma_sqrt
405 Ks = ( 2.0 * alpha * sigma / (N ** 0.5) ) * ( (alpha ** 2.0) * sigma_sqrt / N + W - Kw / 2.0 ) ** 0.5
407 scores = sigma_sqrt * (2 * K)**0.5 * ((2 * K)**0.5 + betta) / N + W - sigma_sqrt + Ks + 2 * alpha**0.5 * sigma_sqrt / N
412 def __bayesian_information_criterion(self, clusters, centers):
414 @brief Calculates splitting criterion for input clusters using bayesian information criterion. 416 @param[in] clusters (list): Clusters for which splitting criterion should be calculated. 417 @param[in] centers (list): Centers of the clusters. 419 @return (double) Splitting criterion in line with bayesian information criterion. 420 High value of splitting criterion means that current structure is much better. 422 @see __minimum_noiseless_description_length(clusters, centers) 426 scores = [float(
'inf')] * len(clusters)
434 for index_cluster
in range(0, len(clusters), 1):
435 for index_object
in clusters[index_cluster]:
436 sigma_sqrt += euclidean_distance_square(self.
__pointer_data[index_object], centers[index_cluster]);
438 N += len(clusters[index_cluster])
441 sigma_sqrt /= (N - K)
442 p = (K - 1) + dimension * K + 1
445 sigma_multiplier = 0.0
446 if sigma_sqrt <= 0.0:
447 sigma_multiplier = float(
'-inf')
449 sigma_multiplier = dimension * 0.5 * log(sigma_sqrt)
452 for index_cluster
in range(0, len(clusters), 1):
453 n = len(clusters[index_cluster])
455 L = n * log(n) - n * log(N) - n * 0.5 * log(2.0 * numpy.pi) - n * sigma_multiplier - (n - K) * 0.5
458 scores[index_cluster] = L - p * 0.5 * log(N)
def __improve_structure(self, clusters, centers)
Check for best structure: divides each cluster into two and checks for best results using splitting c...
def __splitting_criterion(self, clusters, centers)
Calculates splitting criterion for input clusters.
def __minimum_noiseless_description_length(self, clusters, centers)
Calculates splitting criterion for input clusters using minimum noiseless description length criterio...
def get_cluster_encoding(self)
Returns clustering result representation type that indicate how clusters are encoded.
Cluster analysis algorithm: K-Means.
Utils that are used by modules of pyclustering.
def process(self)
Performs cluster analysis in line with rules of X-Means algorithm.
Module for representing clustering results.
def __local_to_global_clusters(self, local_clusters, available_indexes)
Converts clusters in local region define by 'available_indexes' to global clusters.
def __bayesian_information_criterion(self, clusters, centers)
Calculates splitting criterion for input clusters using bayesian information criterion.
Class represents clustering algorithm X-Means.
K-Means++ is an algorithm for choosing the initial centers for algorithms like K-Means or X-Means...
Class represents K-Means clustering algorithm.
def get_clusters(self)
Returns list of allocated clusters, each cluster contains indexes of objects in list of data...
def get_centers(self)
Returns list of centers for allocated clusters.
def __improve_parameters(self, centers, available_indexes=None)
Performs k-means clustering in the specified region.
Collection of center initializers for algorithm that uses initial centers, for example, for K-Means or X-Means.
Enumeration of splitting types that can be used as splitting creation of cluster in X-Means algorithm...
def __init__(self, data, initial_centers=None, kmax=20, tolerance=0.025, criterion=splitting_type.BAYESIAN_INFORMATION_CRITERION, ccore=True)
Constructor of clustering algorithm X-Means.