pyclustering  0.10.1
pyclustering is a Python, C++ data mining library.
gmeans.py
1 """!
2 
3 @brief The module contains G-Means algorithm and other related services.
4 @details Implementation based on paper @cite inproceedings::cluster::gmeans::1.
5 
6 @authors Andrei Novikov (pyclustering@yandex.ru)
7 @date 2014-2020
8 @copyright BSD-3-Clause
9 
10 """
11 
12 
13 import numpy
14 import scipy.stats
15 
16 from pyclustering.core.gmeans_wrapper import gmeans as gmeans_wrapper
17 from pyclustering.core.wrapper import ccore_library
18 
19 from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
20 from pyclustering.cluster.encoder import type_encoding
21 from pyclustering.cluster.kmeans import kmeans
22 from pyclustering.utils import distance_metric, type_metric
23 
24 
25 class gmeans:
26  """!
27  @brief Class implements G-Means clustering algorithm.
28  @details The G-means algorithm starts with a small number of centers, and grows the number of centers.
29  Each iteration of the G-Means algorithm splits into two those centers whose data appear not to come from a
30  Gaussian distribution. G-means repeatedly makes decisions based on a statistical test for the data
31  assigned to each center.
32 
33  Implementation based on the paper @cite inproceedings::cluster::gmeans::1.
34 
35  @image html gmeans_example_clustering.png "G-Means clustering results on most common data-sets."
36 
37  Example #1. In this example, G-Means starts analysis from single cluster.
38  @code
39  from pyclustering.cluster import cluster_visualizer
40  from pyclustering.cluster.gmeans import gmeans
41  from pyclustering.utils import read_sample
42  from pyclustering.samples.definitions import FCPS_SAMPLES
43 
44  # Read sample 'Lsun' from file.
45  sample = read_sample(FCPS_SAMPLES.SAMPLE_LSUN)
46 
47  # Create instance of G-Means algorithm. By default the algorithm starts search from a single cluster.
48  gmeans_instance = gmeans(sample).process()
49 
50  # Extract clustering results: clusters and their centers
51  clusters = gmeans_instance.get_clusters()
52  centers = gmeans_instance.get_centers()
53 
54  # Print total sum of metric errors
55  print("Total WCE:", gmeans_instance.get_total_wce())
56 
57  # Visualize clustering results
58  visualizer = cluster_visualizer()
59  visualizer.append_clusters(clusters, sample)
60  visualizer.show()
61  @endcode
62 
63  Example #2. Sometimes G-Means might find a local optimum. The `repeat` argument can be used to increase the
64  probability of finding the global optimum. It defines how many times K-Means clustering with K-Means++
65  initialization should be run in order to find the best candidate clusters.
66  @code
67  # Read sample 'Tetra' from file.
68  sample = read_sample(FCPS_SAMPLES.SAMPLE_TETRA)
69 
70  # Create instance of G-Means algorithm. By default the algorithm starts the search from a single cluster.
71  gmeans_instance = gmeans(sample, repeat=10).process()
72 
73  # Extract clustering results: clusters and their centers
74  clusters = gmeans_instance.get_clusters()
75 
76  # Visualize clustering results
77  visualizer = cluster_visualizer()
78  visualizer.append_clusters(clusters, sample)
79  visualizer.show()
80  @endcode
81 
82  To obtain labels instead of the default clustering result representation `CLUSTER_INDEX_LIST_SEPARATION`:
83  @code
84  from pyclustering.cluster.gmeans import gmeans
85  from pyclustering.cluster.encoder import type_encoding, cluster_encoder
86  from pyclustering.samples.definitions import SIMPLE_SAMPLES
87  from pyclustering.utils import read_sample
88 
89  data = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE1)
90 
91  gmeans_instance = gmeans(data).process()
92  clusters = gmeans_instance.get_clusters()
93 
94  # Change cluster representation from default to labeling.
95  encoder = cluster_encoder(type_encoding.CLUSTER_INDEX_LIST_SEPARATION, clusters, data)
96  encoder.set_encoding(type_encoding.CLUSTER_INDEX_LABELING)
97  labels = encoder.get_clusters()
98 
99  print(labels) # Display labels
100  @endcode
101 
102  The output of the code above:
103  @code
104  [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
105  @endcode
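The same index-list-to-labels conversion can be done by hand; the helper below is a plain-Python sketch (not part of pyclustering) equivalent to the `cluster_encoder` step above:

```python
def clusters_to_labels(clusters, amount_points):
    # clusters: list of clusters, each a list of point indexes.
    # Returns one cluster label per point.
    labels = [None] * amount_points
    for cluster_index, cluster in enumerate(clusters):
        for point_index in cluster:
            labels[point_index] = cluster_index
    return labels

print(clusters_to_labels([[0, 1, 2], [3, 4]], 5))  # [0, 0, 0, 1, 1]
```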
106 
107  """
108 
109  def __init__(self, data, k_init=1, ccore=True, **kwargs):
110  """!
111  @brief Initializes G-Means algorithm.
112 
113  @param[in] data (array_like): Input data that is presented as array of points (objects), each point should be
114  represented by array_like data structure.
115  @param[in] k_init (uint): Initial amount of centers (by default the search starts from 1).
116  @param[in] ccore (bool): Defines whether CCORE library (C/C++ part of the library) should be used instead of
117  Python code.
118  @param[in] **kwargs: Arbitrary keyword arguments (available arguments: `tolerance`, `repeat`, `k_max`, `random_state`).
119 
120  <b>Keyword Args:</b><br>
121  - tolerance (double): Stop condition for each K-Means iteration: if the maximum change of cluster
122  centers is less than the tolerance, then the algorithm stops processing.
123  - repeat (uint): How many times K-Means should be run to improve parameters (by default 3).
124  Larger `repeat` values increase the probability of finding the global optimum.
125  - k_max (uint): Maximum amount of cluster that might be allocated. The argument is considered as a stop
126  condition. When the maximum amount is reached then algorithm stops processing. By default the maximum
127  amount of clusters is not restricted (`k_max` is -1).
128  - random_state (int): Seed for random state (by default `None`: the current system time is used).
129 
130  """
131  self.__data = data
132  self.__k_init = k_init
133 
134  self.__clusters = []
135  self.__centers = []
136  self.__total_wce = 0.0
137  self.__ccore = ccore
138 
139  self.__tolerance = kwargs.get('tolerance', 0.001)
140  self.__repeat = kwargs.get('repeat', 3)
141  self.__k_max = kwargs.get('k_max', -1)
142  self.__random_state = kwargs.get('random_state', None)
143 
144  if self.__ccore is True:
145  self.__ccore = ccore_library.workable()
146 
147  self._verify_arguments()
148 
149 
150  def process(self):
151  """!
152  @brief Performs cluster analysis in line with rules of G-Means algorithm.
153 
154  @return (gmeans) Returns itself (G-Means instance).
155 
156  @see get_clusters()
157  @see get_centers()
158 
159  """
160  if self.__ccore is True:
161  return self._process_by_ccore()
162 
163  return self._process_by_python()
164 
165 
166  def _process_by_ccore(self):
167  """!
168  @brief Performs cluster analysis using CCORE (C/C++ part of pyclustering library).
169 
170  """
171  self.__clusters, self.__centers, self.__total_wce = gmeans_wrapper(self.__data, self.__k_init, self.__tolerance, self.__repeat, self.__k_max, self.__random_state)
172  return self
173 
174 
175  def _process_by_python(self):
176  """!
177  @brief Performs cluster analysis using Python.
178 
179  """
180  self.__clusters, self.__centers, _ = self._search_optimal_parameters(self.__data, self.__k_init)
181 
182  while self._run_condition():
183  current_amount_clusters = len(self.__clusters)
184  self._statistical_optimization()
185 
186  if current_amount_clusters == len(self.__centers): # amount of centers the same - no need to continue.
187  break
188 
189  self._perform_clustering()
190 
191  return self
192 
193 
194  def predict(self, points):
195  """!
196  @brief Calculates the closest cluster to each point.
197 
198  @param[in] points (array_like): Points for which closest clusters are calculated.
199 
200  @return (list) List of closest clusters for each point. Each cluster is denoted by index. Return empty
201  collection if 'process()' method was not called.
202 
203  """
204  nppoints = numpy.array(points)
205  if len(self.__clusters) == 0:
206  return []
207 
208  metric = distance_metric(type_metric.EUCLIDEAN_SQUARE, numpy_usage=True)
209 
210  npcenters = numpy.array(self.__centers)
211  differences = numpy.zeros((len(nppoints), len(npcenters)))
212  for index_point in range(len(nppoints)):
213  differences[index_point] = metric(nppoints[index_point], npcenters)
214 
215  return numpy.argmin(differences, axis=1)
216 
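The nearest-center assignment performed by `predict()` can be sketched standalone in NumPy (`nearest_center` is a hypothetical helper with made-up illustration data, not a library API):

```python
import numpy as np

def nearest_center(points, centers):
    # Squared Euclidean distance from every point to every center,
    # then the index of the closest center per point.
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diff = points[:, np.newaxis, :] - centers[np.newaxis, :, :]
    distances = np.sum(diff * diff, axis=2)
    return np.argmin(distances, axis=1)

labels = nearest_center([[0.1, 0.2], [5.0, 5.1]], [[0.0, 0.0], [5.0, 5.0]])
print(labels.tolist())  # [0, 1]
```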
217 
218  def get_clusters(self):
219  """!
220  @brief Returns list of allocated clusters, each cluster contains indexes of objects in list of data.
221 
222  @return (array_like) Allocated clusters.
223 
224  @see process()
225  @see get_centers()
226 
227  """
228  return self.__clusters
229 
230 
231  def get_centers(self):
232  """!
233  @brief Returns list of centers of allocated clusters.
234 
235  @return (array_like) Allocated centers.
236 
237  @see process()
238  @see get_clusters()
239 
240  """
241  return self.__centers
242 
243 
244  def get_total_wce(self):
245  """!
246  @brief Returns sum of metric errors that depends on metric that was used for clustering (by default SSE - Sum of Squared Errors).
247  @details Sum of metric errors is calculated using distance between point and its center:
248  \f[error=\sum_{i=0}^{N-1}distance(x_{i}, center(x_{i}))\f]
249 
250  @see process()
251  @see get_clusters()
252 
253  """
254 
255  return self.__total_wce
256 
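For the default squared Euclidean metric, the sum-of-errors formula above amounts to the following (a standalone sketch with illustrative data; `total_wce` is a hypothetical helper):

```python
import numpy as np

def total_wce(points, centers, labels):
    # Sum over all points of the squared Euclidean distance to the
    # center of the cluster each point is assigned to.
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diff = points - centers[labels]
    return float(np.sum(diff * diff))

wce = total_wce([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]],
                [[0.5, 0.0], [5.0, 0.0]],
                [0, 0, 1])
print(wce)  # 0.25 + 0.25 + 0.0 = 0.5
```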
257 
258  def get_cluster_encoding(self):
259  """!
260  @brief Returns clustering result representation type that indicates how clusters are encoded.
261 
262  @return (type_encoding) Clustering result representation.
263 
264  @see get_clusters()
265 
266  """
267 
268  return type_encoding.CLUSTER_INDEX_LIST_SEPARATION
269 
270 
271  def _statistical_optimization(self):
272  """!
273  @brief Tries to split each cluster into two in order to find the optimal amount of clusters.
274 
275  """
276  centers = []
277  potential_amount_clusters = len(self.__clusters)
278  for index in range(len(self.__clusters)):
279  new_centers = self._split_and_search_optimal(self.__clusters[index])
280  if (new_centers is None) or ((self.__k_max != -1) and (potential_amount_clusters >= self.__k_max)):
281  centers.append(self.__centers[index])
282  else:
283  centers += new_centers
284  potential_amount_clusters += 1
285 
286  self.__centers = centers
287 
288 
289  def _split_and_search_optimal(self, cluster):
290  """!
291  @brief Splits the specified cluster into two by performing K-Means clustering and checks correctness
292  using the Anderson-Darling test.
293 
294  @param[in] cluster (array_like): Cluster that should be analysed and, if required, optimized by splitting.
295 
296  @return (array_like) Two new centers if two new clusters are considered as more suitable.
297  (None) If current cluster is more suitable.
298  """
299  if len(cluster) == 1:
300  return None
301 
302  points = [self.__data[index_point] for index_point in cluster]
303  new_clusters, new_centers, _ = self._search_optimal_parameters(points, 2)
304 
305  if len(new_centers) > 1:
306  accept_null_hypothesis = self._is_null_hypothesis(points, new_centers)
307  if not accept_null_hypothesis:
308  return new_centers # If null hypothesis is rejected then use two new clusters
309 
310  return None
311 
312 
313  def _is_null_hypothesis(self, data, centers):
314  """!
315  @brief Returns whether H0 hypothesis is accepted using Anderson-Darling test statistic.
316 
317  @param[in] data (array_like): N-dimensional data for statistical test.
318  @param[in] centers (array_like): Two new allocated centers.
319 
320  @return (bool) True if the null hypothesis is acceptable.
321 
322  """
323  v = numpy.subtract(centers[0], centers[1])
324  points = self._project_data(data, v)
325 
326  estimation, critical, _ = scipy.stats.anderson(points, dist='norm') # the Anderson-Darling test statistic
327 
328  # If the returned statistic is larger than these critical values then for the corresponding significance level,
329  # the null hypothesis that the data come from the chosen distribution can be rejected.
330  return estimation < critical[-1] # False - not a gaussian distribution (reject H0)
331 
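The acceptance rule above can be exercised directly with `scipy.stats.anderson` (a standalone sketch using synthetic samples, not library data):

```python
import numpy as np
import scipy.stats

rng = np.random.RandomState(42)

# A sample drawn from a single Gaussian: H0 should be accepted.
gaussian_sample = rng.normal(loc=0.0, scale=1.0, size=500)
res = scipy.stats.anderson(gaussian_sample, dist='norm')
accept_gaussian = res.statistic < res.critical_values[-1]

# A strongly bimodal sample (two well-separated modes): H0 should be rejected.
bimodal_sample = np.concatenate([rng.normal(-5.0, 1.0, 250),
                                 rng.normal(5.0, 1.0, 250)])
res = scipy.stats.anderson(bimodal_sample, dist='norm')
accept_bimodal = res.statistic < res.critical_values[-1]

print(accept_gaussian, accept_bimodal)
```
This mirrors the comparison `estimation < critical[-1]` in the method: the statistic is checked against the critical value for the strictest significance level returned by the test.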
332 
333  @staticmethod
334  def _project_data(data, vector):
335  """!
336  @brief Transforms input data by projecting it onto the input vector using the formula:
337 
338  \f[
339  x_{i}^{*}=\frac{\left \langle x_{i}, v \right \rangle}{\left \| v \right \|^{2}}.
340  \f]
341 
342  @param[in] data (array_like): Input data that is represented by points.
343  @param[in] vector (array_like): Input vector that is used for projection.
344 
345  @return (array_like) Transformed 1-dimensional data.
346 
347  """
348  square_norm = numpy.sum(numpy.multiply(vector, vector))
349  return numpy.divide(numpy.sum(numpy.multiply(data, vector), axis=1), square_norm)
350 
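The projection formula can be checked numerically (a standalone sketch with made-up data; `project_data` here is a free-function restatement of the static method above):

```python
import numpy as np

def project_data(data, vector):
    # Project every point onto `vector`: <x_i, v> / ||v||^2.
    data = np.asarray(data, dtype=float)
    vector = np.asarray(vector, dtype=float)
    square_norm = np.sum(vector * vector)
    return np.sum(data * vector, axis=1) / square_norm

# Points projected onto v = (2, 0): only the x-coordinate survives, scaled by 1/2.
projected = project_data([[2.0, 0.0], [4.0, 3.0]], [2.0, 0.0])
print(projected.tolist())  # [1.0, 2.0]
```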
351 
352  def _search_optimal_parameters(self, data, amount):
353  """!
354  @brief Performs cluster analysis for specified data several times to find optimal clustering result in line
355  with WCE.
356 
357  @param[in] data (array_like): Input data that should be clustered.
358  @param[in] amount (uint): Amount of clusters that should be allocated.
359 
360  @return (tuple) Optimal clustering result: (clusters, centers, wce).
361 
362  """
363  best_wce, best_clusters, best_centers = float('+inf'), [], []
364  for _ in range(self.__repeat):
365  initial_centers = kmeans_plusplus_initializer(data, amount, random_state=self.__random_state).initialize()
366  solver = kmeans(data, initial_centers, tolerance=self.__tolerance, ccore=False).process()
367 
368  candidate_wce = solver.get_total_wce()
369  if candidate_wce < best_wce:
370  best_wce = candidate_wce
371  best_clusters = solver.get_clusters()
372  best_centers = solver.get_centers()
373 
374  if len(initial_centers) == 1:
375  break # No need to rerun clustering for one initial center.
376 
377  return best_clusters, best_centers, best_wce
378 
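The repeat-and-keep-best pattern used here can be sketched generically (`run_once` is a hypothetical stand-in for a single randomized K-Means run, not a pyclustering API):

```python
import random

def best_of_repeats(run_once, repeat, seed=0):
    # Run the randomized procedure `repeat` times and keep the
    # candidate with the lowest error (analogous to the lowest WCE).
    rng = random.Random(seed)
    best_error, best_result = float('+inf'), None
    for _ in range(repeat):
        error, result = run_once(rng)
        if error < best_error:
            best_error, best_result = error, result
    return best_result, best_error

# Toy objective: each run draws a candidate error uniformly at random.
result, error = best_of_repeats(lambda rng: (rng.uniform(0.0, 1.0), 'candidate'),
                                repeat=5)
print(0.0 <= error <= 1.0)  # True
```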
379 
380  def _perform_clustering(self):
381  """!
382  @brief Performs cluster analysis using the K-Means algorithm, using the current centers as initial centers.
383 
384  @see process()
385 
386  """
387  solver = kmeans(self.__data, self.__centers, tolerance=self.__tolerance, ccore=False).process()
388  self.__clusters = solver.get_clusters()
389  self.__centers = solver.get_centers()
390  self.__total_wce = solver.get_total_wce()
391 
392 
393  def _run_condition(self):
394  """!
395  @brief Defines whether the algorithm should continue processing or should stop.
396 
397  @return `True` if the algorithm should continue processing, otherwise `False`.
398 
399  """
400  if (self.__k_max > 0) and (len(self.__clusters) >= self.__k_max):
401  return False
402 
403  return True
404 
405 
406  def _verify_arguments(self):
407  """!
408  @brief Verify input parameters for the algorithm and throw exception in case of incorrectness.
409 
410  """
411  if len(self.__data) == 0:
412  raise ValueError("Input data is empty (size: '%d')." % len(self.__data))
413 
414  if self.__k_init <= 0:
415  raise ValueError("Initial amount of centers should be greater than 0 "
416  "(current value: '%d')." % self.__k_init)
417 
418  if self.__tolerance <= 0.0:
419  raise ValueError("Tolerance should be greater than 0 (current value: '%f')." % self.__tolerance)
420 
421  if self.__repeat <= 0:
422  raise ValueError("Amount of attempts to find optimal parameters should be greater than 0 "
423  "(current value: '%d')." % self.__repeat)
424 
425  if (self.__k_max != -1) and (self.__k_max <= 0):
426  raise ValueError("Maximum amount of clusters that might be allocated should be greater than 0, or -1 if "
427  "the algorithm should not be restricted in searching for the optimal number of clusters.")
428 
429  if (self.__k_max != -1) and (self.__k_max < self.__k_init):
430  raise ValueError("Initial amount of clusters should be less than the maximum amount 'k_max'.")