"""!

@brief The module contains G-Means algorithm and other related services.
@details Implementation based on paper @cite inproceedings::cluster::gmeans::1.

@authors Andrei Novikov (pyclustering@yandex.ru)
@copyright GNU Public License

@cond GNU_PUBLIC_LICENSE
    PyClustering is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    PyClustering is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program. If not, see <http://www.gnu.org/licenses/>.
@endcond

"""


import numpy
import scipy.stats

from pyclustering.core.gmeans_wrapper import gmeans as gmeans_wrapper
from pyclustering.core.wrapper import ccore_library

from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.cluster.encoder import type_encoding
from pyclustering.cluster.kmeans import kmeans
from pyclustering.utils import distance_metric, type_metric
class gmeans:
    """!
    @brief Class implements G-Means clustering algorithm.
    @details The G-Means algorithm starts with a small number of centers and grows the number of centers.
              Each iteration of the algorithm splits into two those centers whose data appear not to come from a
              Gaussian distribution. G-Means repeatedly makes decisions based on a statistical test for the data
              assigned to each center.

    Implementation based on the paper @cite inproceedings::cluster::gmeans::1.

    @image html gmeans_example_clustering.png "G-Means clustering results on most common data-sets."

    Example #1. In this example, G-Means starts analysis from a single cluster.
    @code
        from pyclustering.cluster import cluster_visualizer
        from pyclustering.cluster.gmeans import gmeans
        from pyclustering.utils import read_sample
        from pyclustering.samples.definitions import FCPS_SAMPLES

        # Read sample 'Lsun' from file.
        sample = read_sample(FCPS_SAMPLES.SAMPLE_LSUN)

        # Create instance of G-Means algorithm. By default the algorithm starts search from a single cluster.
        gmeans_instance = gmeans(sample).process()

        # Extract clustering results: clusters and their centers.
        clusters = gmeans_instance.get_clusters()
        centers = gmeans_instance.get_centers()

        # Print total sum of metric errors.
        print("Total WCE:", gmeans_instance.get_total_wce())

        # Visualize clustering results.
        visualizer = cluster_visualizer()
        visualizer.append_clusters(clusters, sample)
        visualizer.show()
    @endcode

    Example #2. Sometimes G-Means may converge to a local optimum. The `repeat` value can be used to increase
    the probability of finding the global optimum. The argument `repeat` defines how many times K-Means
    clustering with K-Means++ initialization should be run in order to find the optimal clusters.
    @code
        # Read sample 'Tetra' from file.
        sample = read_sample(FCPS_SAMPLES.SAMPLE_TETRA)

        # Create instance of G-Means algorithm. By default the algorithm starts search from a single cluster.
        gmeans_instance = gmeans(sample, repeat=10).process()

        # Extract clustering results: clusters and their centers.
        clusters = gmeans_instance.get_clusters()

        # Visualize clustering results.
        visualizer = cluster_visualizer()
        visualizer.append_clusters(clusters, sample)
        visualizer.show()
    @endcode

    If labels are required instead of the default clustering result representation
    `CLUSTER_INDEX_LIST_SEPARATION`, the cluster encoder can be used:
    @code
        from pyclustering.cluster.gmeans import gmeans
        from pyclustering.cluster.encoder import type_encoding, cluster_encoder
        from pyclustering.samples.definitions import SIMPLE_SAMPLES
        from pyclustering.utils import read_sample

        data = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE1)

        gmeans_instance = gmeans(data).process()
        clusters = gmeans_instance.get_clusters()

        # Change cluster representation from default to labeling.
        encoder = cluster_encoder(type_encoding.CLUSTER_INDEX_LIST_SEPARATION, clusters, data)
        encoder.set_encoding(type_encoding.CLUSTER_INDEX_LABELING)
        labels = encoder.get_clusters()

        print(labels)    # Display labels.
    @endcode

    The output of the code above:
    @code
        [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    @endcode

    """

    def __init__(self, data, k_init=1, ccore=True, **kwargs):
        """!
        @brief Initializes G-Means algorithm.

        @param[in] data (array_like): Input data that is presented as an array of points (objects); each point
                    should be represented by an array_like data structure.
        @param[in] k_init (uint): Initial amount of centers (by default, the search starts from 1).
        @param[in] ccore (bool): Defines whether the CCORE library (C/C++ part of the library) should be used
                    instead of the Python implementation.
        @param[in] **kwargs: Arbitrary keyword arguments (available arguments: `tolerance`, `repeat`, `k_max`, `random_state`).

        <b>Keyword Args:</b><br>
            - tolerance (double): Stop condition for each K-Means iteration: if the maximum change of the cluster
               centers is less than the tolerance, then the algorithm stops processing.
            - repeat (uint): How many times K-Means should be run to improve parameters (by default 3). Larger
               `repeat` values increase the probability of finding the global optimum (see the usage sketch below).
            - k_max (uint): Maximum amount of clusters that might be allocated. The argument is considered as a stop
               condition: when the maximum amount is reached, the algorithm stops processing. By default the maximum
               amount of clusters is not restricted (`k_max` is -1).
            - random_state (int): Seed for random state (by default `None`; the current system time is used).

        """
        self.__repeat = kwargs.get('repeat', 3)
        self.__k_max = kwargs.get('k_max', -1)
        self.__ccore = ccore_library.workable()
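
    # A minimal usage sketch of the keyword arguments documented above; `my_data` is a placeholder for
    # any array_like input, and the argument values are purely illustrative:
    #
    #     from pyclustering.cluster.gmeans import gmeans
    #
    #     my_data = [[1.0, 1.2], [1.1, 0.9], [4.8, 5.1], [5.2, 4.9]]
    #     instance = gmeans(my_data, k_init=1, tolerance=0.001, repeat=5, k_max=10, random_state=1234)
    #     instance.process()
    #     print(instance.get_clusters(), instance.get_centers())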
    def process(self):
        """!
        @brief Performs cluster analysis in line with rules of G-Means algorithm.

        @return (gmeans) Returns itself (G-Means instance).

        """

    def _process_by_ccore(self):
        """!
        @brief Performs cluster analysis using CCORE (C/C++ part of pyclustering library).

        """

    def _process_by_python(self):
        """!
        @brief Performs cluster analysis using Python.

        """
        current_amount_clusters = len(self.__clusters)
        if current_amount_clusters == len(self.__centers):
    def predict(self, points):
        """!
        @brief Calculates the closest cluster to each point.

        @param[in] points (array_like): Points for which closest clusters are calculated.

        @return (list) List of closest clusters for each point. Each cluster is denoted by its index. Returns an
                 empty collection if the 'process()' method was not called.

        """
        nppoints = numpy.array(points)
        metric = distance_metric(type_metric.EUCLIDEAN_SQUARE, numpy_usage=True)
        differences = numpy.zeros((len(nppoints), len(self.__centers)))
        for index_point in range(len(nppoints)):
            differences[index_point] = metric(nppoints[index_point], self.__centers)
        return numpy.argmin(differences, axis=1)
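
    # A minimal sketch of using predict() after processing; `sample` and `new_points` are placeholders for
    # array_like collections of points with the same dimension as the clustered data:
    #
    #     gmeans_instance = gmeans(sample).process()
    #     new_points = [[0.25, 0.45], [2.5, 0.9]]
    #     closest_clusters = gmeans_instance.predict(new_points)    # index of the closest cluster per point
    #     print(closest_clusters)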
    def get_clusters(self):
        """!
        @brief Returns list of allocated clusters; each cluster contains indexes of objects from the input data.

        @return (array_like) Allocated clusters.

        """

    def get_centers(self):
        """!
        @brief Returns list of centers of allocated clusters.

        @return (array_like) Allocated centers.

        """

    def get_total_wce(self):
        """!
        @brief Returns the sum of metric errors that depends on the metric that was used for clustering
                (by default SSE - Sum of Squared Errors).
        @details The sum of metric errors is calculated using the distance between each point and the center of
                  its cluster:

        \f[error=\sum_{i=0}^{N}distance(x_{i}, center(x_{i}))\f]

        @return (double) Sum of metric errors.

        """

    def get_cluster_encoding(self):
        """!
        @brief Returns clustering result representation type that indicates how clusters are encoded.

        @return (type_encoding) Clustering result representation.

        """

        return type_encoding.CLUSTER_INDEX_LIST_SEPARATION
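
    # A small sketch reproducing the WCE formula above with numpy for an already computed result, assuming
    # `data`, `clusters` and `centers` hold the input points and the get_clusters()/get_centers() values:
    #
    #     import numpy
    #     wce = sum(numpy.sum(numpy.sum((numpy.array(data)[cluster] - numpy.array(center)) ** 2, axis=1))
    #               for cluster, center in zip(clusters, centers))
    #     # With the default EUCLIDEAN_SQUARE metric this matches get_total_wce().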
    def _statistical_optimization(self):
        """!
        @brief Tries to split each cluster into two in order to find the optimal amount of clusters.

        """
        potential_amount_clusters = len(self.__clusters)
            if (new_centers is None) or ((self.__k_max != -1) and (potential_amount_clusters >= self.__k_max)):
                centers += new_centers
                potential_amount_clusters += 1
    def _split_and_search_optimal(self, cluster):
        """!
        @brief Splits the specified cluster into two by performing K-Means clustering and checks the split for
                correctness by the Anderson-Darling test.

        @param[in] cluster (array_like): Cluster that should be analysed and optimized by splitting if it is required.

        @return (array_like) Two new centers if the two new clusters are considered more suitable.
                (None) If the current cluster is more suitable.

        """
        if len(cluster) == 1:
        points = [self.__data[index_point] for index_point in cluster]
        if len(new_centers) > 1:

            if not accept_null_hypothesis:
    def _is_null_hypothesis(self, data, centers):
        """!
        @brief Returns whether the H0 hypothesis is accepted using the Anderson-Darling test statistic.

        @param[in] data (array_like): N-dimensional data for the statistical test.
        @param[in] centers (array_like): Two new allocated centers.

        @return (bool) True if the null hypothesis is acceptable.

        """
        v = numpy.subtract(centers[0], centers[1])
        estimation, critical, _ = scipy.stats.anderson(points, dist='norm')
        # Accept H0 when the statistic is below the critical value for the smallest significance level
        # (1% for the normal distribution in scipy.stats.anderson).
        return estimation < critical[-1]
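
    # A standalone sketch of the statistical decision above (the names `blob_a`/`blob_b` are illustrative):
    # one-dimensional samples from a single Gaussian blob should pass the Anderson-Darling test, while the
    # union of two well-separated blobs should fail it, which is what triggers a cluster split.
    #
    #     import numpy
    #     import scipy.stats
    #
    #     rng = numpy.random.default_rng(5)
    #     blob_a = rng.normal(0.0, 1.0, 500)
    #     blob_b = rng.normal(8.0, 1.0, 500)
    #
    #     for sample in (blob_a, numpy.concatenate([blob_a, blob_b])):
    #         statistic, critical, _ = scipy.stats.anderson(sample, dist='norm')
    #         print(statistic < critical[-1])    # expected: True for one blob, False for two blobs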
    @staticmethod
    def _project_data(data, vector):
        """!
        @brief Transforms input data by projecting it onto the input vector using the formula:

        \f[
        x_{i}^{*}=\frac{\left \langle x_{i}, v \right \rangle}{\left \| v \right \|^{2}}
        \f]

        @param[in] data (array_like): Input data that is represented by points.
        @param[in] vector (array_like): Input vector that is used for projection.

        @return (array_like) Transformed 1-dimensional data.

        """
        square_norm = numpy.sum(numpy.multiply(vector, vector))
        return numpy.divide(numpy.sum(numpy.multiply(data, vector), axis=1), square_norm)
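
    # A tiny worked example of the projection formula (values are illustrative): for v = (1, 1) the squared
    # norm is 2, so the point (3, 5) maps to <(3, 5), (1, 1)> / 2 = 8 / 2 = 4.
    #
    #     import numpy
    #     data = numpy.array([[3.0, 5.0], [1.0, 1.0]])
    #     vector = numpy.array([1.0, 1.0])
    #     print(numpy.sum(data * vector, axis=1) / numpy.sum(vector * vector))    # [4. 1.]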
    def _search_optimal_parameters(self, data, amount):
        """!
        @brief Performs cluster analysis for the specified data several times in order to find the optimal
                clustering result in line with the total WCE value.

        @param[in] data (array_like): Input data that should be clustered.
        @param[in] amount (uint): Amount of clusters that should be allocated.

        @return (tuple) Optimal clustering result: (clusters, centers, wce).

        """
        best_wce, best_clusters, best_centers = float('+inf'), [], []
            candidate_wce = solver.get_total_wce()
            if candidate_wce < best_wce:
                best_wce = candidate_wce
                best_clusters = solver.get_clusters()
                best_centers = solver.get_centers()
            if len(initial_centers) == 1:
        return best_clusters, best_centers, best_wce
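
    # A condensed sketch of what the repeated search does, using the public K-Means++ and K-Means APIs
    # (`sample` and `amount` are illustrative placeholders):
    #
    #     from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
    #     from pyclustering.cluster.kmeans import kmeans
    #
    #     best_wce, best_clusters = float('+inf'), []
    #     for _ in range(3):    # 'repeat' attempts
    #         initial_centers = kmeans_plusplus_initializer(sample, amount).initialize()
    #         solver = kmeans(sample, initial_centers).process()
    #         if solver.get_total_wce() < best_wce:
    #             best_wce, best_clusters = solver.get_total_wce(), solver.get_clusters()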
    def _perform_clustering(self):
        """!
        @brief Performs cluster analysis using the K-Means algorithm with the current centers as the initial centers.

        """

    def _run_condition(self):
        """!
        @brief Defines whether the algorithm should continue processing or should stop.

        @return `True` if the algorithm should continue processing, otherwise returns `False`.

        """

    def _verify_arguments(self):
        """!
        @brief Verifies input parameters for the algorithm and throws an exception in case of incorrectness.

        """
            raise ValueError("Input data is empty (size: '%d')." % len(self.__data))
            raise ValueError("Initial amount of centers should be greater than 0 "
                             "(current value: '%d')." % self.__k_init)
            raise ValueError("Tolerance should be greater than 0 (current value: '%f')." % self.__tolerance)
            raise ValueError("Amount of attempts to find optimal parameters should be greater than 0 "
                             "(current value: '%d')." % self.__repeat)
            raise ValueError("Maximum amount of clusters that might be allocated should be greater than 0 or -1 if "
                             "the algorithm should not be restricted in searching the optimal number of clusters.")
            raise ValueError("Initial amount of clusters should be less than the maximum amount 'k_max'.")