pyclustering  0.10.1
pyclustring is a Python, C++ data mining library.
kmedoids.py
1 """!
2 
3 @brief Cluster analysis algorithm: K-Medoids.
4 @details Implementation based on paper @cite inproceedings::cluster::kmedoids::1.
5 
6 @authors Andrei Novikov (pyclustering@yandex.ru)
7 @date 2014-2020
8 @copyright BSD-3-Clause
9 
10 """
11 
12 
13 import numpy
14 
15 from pyclustering.cluster.encoder import type_encoding
16 
17 from pyclustering.utils import medoid
18 from pyclustering.utils.metric import distance_metric, type_metric
19 
20 import pyclustering.core.kmedoids_wrapper as wrapper
21 
22 from pyclustering.core.wrapper import ccore_library
23 from pyclustering.core.metric_wrapper import metric_wrapper
24 
25 
26 class kmedoids:
27  """!
28  @brief Class represents clustering algorithm K-Medoids (PAM algorithm).
29  @details PAM is a partitioning clustering algorithm that uses the medoids instead of centers like in case of K-Means
30  algorithm. Medoid is an object with the smallest dissimilarity to all others in the cluster. PAM algorithm
31  complexity is \f$O\left ( k\left ( n-k \right )^{2} \right )\f$.
32 
33  There is an example where PAM algorithm is used to cluster 'TwoDiamonds' data:
34  @code
35  from pyclustering.cluster.kmedoids import kmedoids
36  from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
37  from pyclustering.cluster import cluster_visualizer
38  from pyclustering.utils import read_sample
39  from pyclustering.samples.definitions import FCPS_SAMPLES
40 
41  # Load list of points for cluster analysis.
42  sample = read_sample(FCPS_SAMPLES.SAMPLE_TWO_DIAMONDS)
43 
44  # Initialize initial medoids using K-Means++ algorithm
45  initial_medoids = kmeans_plusplus_initializer(sample, 2).initialize(return_index=True)
46 
47  # Create instance of K-Medoids (PAM) algorithm.
48  kmedoids_instance = kmedoids(sample, initial_medoids)
49 
50  # Run cluster analysis and obtain results.
51  kmedoids_instance.process()
52  clusters = kmedoids_instance.get_clusters()
53  medoids = kmedoids_instance.get_medoids()
54 
55  # Print allocated clusters.
56  print("Clusters:", clusters)
57 
58  # Display clustering results.
59  visualizer = cluster_visualizer()
60  visualizer.append_clusters(clusters, sample)
61  visualizer.append_cluster(initial_medoids, sample, markersize=12, marker='*', color='gray')
62  visualizer.append_cluster(medoids, sample, markersize=14, marker='*', color='black')
63  visualizer.show()
64  @endcode
65 
66  @image html pam_clustering_two_diamonds.png "Fig. 1. K-Medoids (PAM) clustering results 'TwoDiamonds'."
67 
68  Metric for calculation distance between points can be specified by parameter additional 'metric':
69  @code
70  # create Minkowski distance metric with degree equals to '2'
71  metric = distance_metric(type_metric.MINKOWSKI, degree=2)
72 
73  # create K-Medoids algorithm with specific distance metric
74  kmedoids_instance = kmedoids(sample, initial_medoids, metric=metric)
75 
76  # run cluster analysis and obtain results
77  kmedoids_instance.process()
78  clusters = kmedoids_instance.get_clusters()
79  @endcode
80 
81  Distance matrix can be used instead of sequence of points to increase performance and for that purpose parameter 'data_type' should be used:
82  @code
83  # calculate distance matrix for sample
84  sample = read_sample(path_to_sample)
85  matrix = calculate_distance_matrix(sample)
86 
87  # create K-Medoids algorithm for processing distance matrix instead of points
88  kmedoids_instance = kmedoids(matrix, initial_medoids, data_type='distance_matrix')
89 
90  # run cluster analysis and obtain results
91  kmedoids_instance.process()
92 
93  clusters = kmedoids_instance.get_clusters()
94  medoids = kmedoids_instance.get_medoids()
95  @endcode
96 
97  """
98 
99 
100  def __init__(self, data, initial_index_medoids, tolerance=0.0001, ccore=True, **kwargs):
101  """!
102  @brief Constructor of clustering algorithm K-Medoids.
103 
104  @param[in] data (list): Input data that is presented as list of points (objects), each point should be represented by list or tuple.
105  @param[in] initial_index_medoids (list): Indexes of intial medoids (indexes of points in input data).
106  @param[in] tolerance (double): Stop condition: if maximum value of distance change of medoids of clusters is less than tolerance than algorithm will stop processing.
107  @param[in] ccore (bool): If specified than CCORE library (C++ pyclustering library) is used for clustering instead of Python code.
108  @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'metric', 'data_type', 'itermax').
109 
110  <b>Keyword Args:</b><br>
111  - metric (distance_metric): Metric that is used for distance calculation between two points.
112  - data_type (string): Data type of input sample 'data' that is processed by the algorithm ('points', 'distance_matrix').
113  - itermax (uint): Maximum number of iteration for cluster analysis.
114 
115  """
116  self.__pointer_data = data
117  self.__clusters = []
118  self.__labels = [-1] * len(data)
119  self.__medoid_indexes = initial_index_medoids[:]
120  self.__distance_first_medoid = [float('inf')] * len(data)
121  self.__distance_second_medoid = [float('inf')] * len(data)
122  self.__tolerance = tolerance
123 
124  self.__metric = kwargs.get('metric', distance_metric(type_metric.EUCLIDEAN_SQUARE))
125  self.__data_type = kwargs.get('data_type', 'points')
126  self.__itermax = kwargs.get('itermax', 200)
127 
129 
130  self.__ccore = ccore and self.__metric.get_type() != type_metric.USER_DEFINED
131  if self.__ccore:
132  self.__ccore = ccore_library.workable()
133 
134  self.__verify_arguments()
135 
136 
137  def process(self):
138  """!
139  @brief Performs cluster analysis in line with rules of K-Medoids algorithm.
140 
141  @return (kmedoids) Returns itself (K-Medoids instance).
142 
143  @remark Results of clustering can be obtained using corresponding get methods.
144 
145  @see get_clusters()
146  @see get_medoids()
147 
148  """
149 
150  if self.__ccore is True:
151  ccore_metric = metric_wrapper.create_instance(self.__metric)
152  self.__clusters, self.__medoid_indexes = wrapper.kmedoids(self.__pointer_data, self.__medoid_indexes, self.__tolerance, self.__itermax, ccore_metric.get_pointer(), self.__data_type)
153 
154  else:
155  changes = float('inf')
156  previous_deviation, current_deviation = float('inf'), float('inf')
157 
158  iterations = 0
159 
160  if self.__itermax > 0:
161  current_deviation = self.__update_clusters()
162 
163  while (changes > self.__tolerance) and (iterations < self.__itermax):
164  swap_cost = self.__swap_medoids()
165 
166  if swap_cost != float('inf'):
167  previous_deviation = current_deviation
168  current_deviation = self.__update_clusters()
169  changes = previous_deviation - current_deviation
170  else:
171  return self
172 
173  iterations += 1
174 
175  return self
176 
177 
178  def predict(self, points):
179  """!
180  @brief Calculates the closest cluster to each point.
181 
182  @param[in] points (array_like): Points for which closest clusters are calculated.
183 
184  @return (list) List of closest clusters for each point. Each cluster is denoted by index. Return empty
185  collection if 'process()' method was not called.
186 
187  An example how to calculate (or predict) the closest cluster to specified points.
188  @code
189  from pyclustering.cluster.kmedoids import kmedoids
190  from pyclustering.samples.definitions import SIMPLE_SAMPLES
191  from pyclustering.utils import read_sample
192 
193  # Load list of points for cluster analysis.
194  sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3)
195 
196  # Initial medoids for sample 'Simple3'.
197  initial_medoids = [4, 12, 25, 37]
198 
199  # Create instance of K-Medoids algorithm with prepared centers.
200  kmedoids_instance = kmedoids(sample, initial_medoids)
201 
202  # Run cluster analysis.
203  kmedoids_instance.process()
204 
205  # Calculate the closest cluster to following two points.
206  points = [[0.35, 0.5], [2.5, 2.0]]
207  closest_clusters = kmedoids_instance.predict(points)
208  print(closest_clusters)
209  @endcode
210 
211  """
212 
213  if len(self.__clusters) == 0:
214  return []
215 
216  medoids = [self.__pointer_data[index] for index in self.__medoid_indexes]
217  differences = numpy.zeros((len(points), len(medoids)))
218  for index_point in range(len(points)):
219  differences[index_point] = [self.__metric(points[index_point], center) for center in medoids]
220 
221  return numpy.argmin(differences, axis=1)
222 
223 
224  def get_clusters(self):
225  """!
226  @brief Returns list of allocated clusters, each cluster contains indexes of objects in list of data.
227 
228  @see process()
229  @see get_medoids()
230 
231  """
232 
233  return self.__clusters
234 
235 
236  def get_medoids(self):
237  """!
238  @brief Returns list of medoids of allocated clusters represented by indexes from the input data.
239 
240  @see process()
241  @see get_clusters()
242 
243  """
244 
245  return self.__medoid_indexes
246 
247 
249  """!
250  @brief Returns clustering result representation type that indicate how clusters are encoded.
251 
252  @return (type_encoding) Clustering result representation.
253 
254  @see get_clusters()
255 
256  """
257 
258  return type_encoding.CLUSTER_INDEX_LIST_SEPARATION
259 
260 
261  def __verify_arguments(self):
262  """!
263  @brief Verify input parameters for the algorithm and throw exception in case of incorrectness.
264 
265  """
266  if len(self.__pointer_data) == 0:
267  raise ValueError("Input data is empty (size: '%d')." % len(self.__pointer_data))
268 
269  if len(self.__medoid_indexes) == 0:
270  raise ValueError("Initial medoids are empty (size: '%d')." % len(self.__pointer_data))
271 
272  if self.__tolerance < 0:
273  raise ValueError("Tolerance (current value: '%d') should be greater or equal to 0." %
274  self.__tolerance)
275 
276  if self.__itermax < 0:
277  raise ValueError("Maximum iterations (current value: '%d') should be greater or equal to 0." %
278  self.__tolerance)
279 
280 
281  def __create_distance_calculator(self):
282  """!
283  @brief Creates distance calculator in line with algorithms parameters.
284 
285  @return (callable) Distance calculator.
286 
287  """
288  if self.__data_type == 'points':
289  return lambda index1, index2: self.__metric(self.__pointer_data[index1], self.__pointer_data[index2])
290 
291  elif self.__data_type == 'distance_matrix':
292  if isinstance(self.__pointer_data, numpy.matrix):
293  return lambda index1, index2: self.__pointer_data.item((index1, index2))
294 
295  return lambda index1, index2: self.__pointer_data[index1][index2]
296 
297  else:
298  raise TypeError("Unknown type of data is specified '%s'" % self.__data_type)
299 
300 
301  def __update_clusters(self):
302  """!
303  @brief Calculate distance to each point from the each cluster.
304  @details Nearest points are captured by according clusters and as a result clusters are updated.
305 
306  @return (double) Total deviation (distance from each point to its closest medoid).
307 
308  """
309 
310  total_deviation = 0.0
311  self.__clusters = [[] for i in range(len(self.__medoid_indexes))]
312  for index_point in range(len(self.__pointer_data)):
313  index_optim = -1
314  dist_optim_first = float('Inf')
315  dist_optim_second = float('Inf')
316 
317  for index in range(len(self.__medoid_indexes)):
318  dist = self.__distance_calculator(index_point, self.__medoid_indexes[index])
319 
320  if dist < dist_optim_first:
321  dist_optim_second = dist_optim_first
322  index_optim = index
323  dist_optim_first = dist
324  elif dist < dist_optim_second:
325  dist_optim_second = dist
326 
327  total_deviation += dist_optim_first
328  self.__clusters[index_optim].append(index_point)
329  self.__labels[index_point] = index_optim
330 
331  self.__distance_first_medoid[index_point] = dist_optim_first
332  self.__distance_second_medoid[index_point] = dist_optim_second
333 
334  return total_deviation
335 
336 
337  def __swap_medoids(self):
338  """!
339  @brief Swap existed medoid with non-medoid points in order to find the most optimal medoid.
340 
341  @return (double) Cost that is needed to swap two medoids.
342 
343  """
344 
345  optimal_swap_cost = float('inf')
346  optimal_index_cluster = None
347  optimal_index_medoid = None
348 
349  for index_cluster in range(len(self.__clusters)):
350  for candidate_medoid_index in range(len(self.__pointer_data)):
351  if (candidate_medoid_index in self.__medoid_indexes) or (self.__distance_first_medoid[candidate_medoid_index] == 0.0):
352  continue
353 
354  swap_cost = self.__calculate_swap_cost(candidate_medoid_index, index_cluster)
355  if swap_cost < optimal_swap_cost:
356  optimal_swap_cost = swap_cost
357  optimal_index_cluster = index_cluster
358  optimal_index_medoid = candidate_medoid_index
359 
360  if optimal_index_cluster is not None:
361  self.__medoid_indexes[optimal_index_cluster] = optimal_index_medoid
362 
363  return optimal_swap_cost
364 
365 
366  def __calculate_swap_cost(self, index_candidate, cluster_index):
367  """!
368  @brief Calculates cost to swap `index_candidate` with the current medoid `cluster_index`.
369 
370  @param[in] index_candidate (uint): Index point that is considered as a medoid candidate.
371  @param[in] cluster_index (uint): Index of a cluster where the current medoid is used for calculation.
372 
373  @return (double) Cost that is needed to swap medoids.
374 
375  """
376  cost = 0.0
377 
378  for index_point in range(len(self.__pointer_data)):
379  if index_point == index_candidate:
380  continue
381 
382  candidate_distance = self.__distance_calculator(index_point, index_candidate)
383  if self.__labels[index_point] == cluster_index:
384  cost += min(candidate_distance, self.__distance_second_medoid[index_point]) - self.__distance_first_medoid[index_point]
385  elif candidate_distance < self.__distance_first_medoid[index_point]:
386  cost += candidate_distance - self.__distance_first_medoid[index_point]
387 
388  return cost - self.__distance_first_medoid[index_candidate]
pyclustering.cluster.kmedoids.kmedoids.__labels
__labels
Definition: kmedoids.py:118
pyclustering.cluster.kmedoids.kmedoids.__itermax
__itermax
Definition: kmedoids.py:126
pyclustering.cluster.kmedoids.kmedoids.predict
def predict(self, points)
Calculates the closest cluster to each point.
Definition: kmedoids.py:178
pyclustering.cluster.kmedoids.kmedoids
Class represents clustering algorithm K-Medoids (PAM algorithm).
Definition: kmedoids.py:26
pyclustering.cluster.kmedoids.kmedoids.__update_clusters
def __update_clusters(self)
Calculate distance to each point from the each cluster.
Definition: kmedoids.py:301
pyclustering.cluster.kmedoids.kmedoids.__tolerance
__tolerance
Definition: kmedoids.py:122
pyclustering.cluster.kmedoids.kmedoids.__calculate_swap_cost
def __calculate_swap_cost(self, index_candidate, cluster_index)
Calculates cost to swap index_candidate with the current medoid cluster_index.
Definition: kmedoids.py:366
pyclustering.cluster.kmedoids.kmedoids.__distance_first_medoid
__distance_first_medoid
Definition: kmedoids.py:120
pyclustering.cluster.kmedoids.kmedoids.get_cluster_encoding
def get_cluster_encoding(self)
Returns clustering result representation type that indicate how clusters are encoded.
Definition: kmedoids.py:248
pyclustering.cluster.kmedoids.kmedoids.__init__
def __init__(self, data, initial_index_medoids, tolerance=0.0001, ccore=True, **kwargs)
Constructor of clustering algorithm K-Medoids.
Definition: kmedoids.py:100
pyclustering.cluster.kmedoids.kmedoids.__clusters
__clusters
Definition: kmedoids.py:117
pyclustering.cluster.kmedoids.kmedoids.__swap_medoids
def __swap_medoids(self)
Swap existed medoid with non-medoid points in order to find the most optimal medoid.
Definition: kmedoids.py:337
pyclustering.cluster.kmedoids.kmedoids.__create_distance_calculator
def __create_distance_calculator(self)
Creates distance calculator in line with algorithms parameters.
Definition: kmedoids.py:281
pyclustering.utils.metric.distance_metric
Distance metric performs distance calculation between two points in line with encapsulated function,...
Definition: metric.py:52
pyclustering.cluster.kmedoids.kmedoids.get_clusters
def get_clusters(self)
Returns list of allocated clusters, each cluster contains indexes of objects in list of data.
Definition: kmedoids.py:224
pyclustering.cluster.kmedoids.kmedoids.get_medoids
def get_medoids(self)
Returns list of medoids of allocated clusters represented by indexes from the input data.
Definition: kmedoids.py:236
pyclustering.cluster.kmedoids.kmedoids.__pointer_data
__pointer_data
Definition: kmedoids.py:116
pyclustering.cluster.kmedoids.kmedoids.__medoid_indexes
__medoid_indexes
Definition: kmedoids.py:119
pyclustering.cluster.kmedoids.kmedoids.process
def process(self)
Performs cluster analysis in line with rules of K-Medoids algorithm.
Definition: kmedoids.py:137
pyclustering.cluster.kmedoids.kmedoids.__ccore
__ccore
Definition: kmedoids.py:130
pyclustering.cluster.kmedoids.kmedoids.__distance_second_medoid
__distance_second_medoid
Definition: kmedoids.py:121
pyclustering.cluster.kmedoids.kmedoids.__distance_calculator
__distance_calculator
Definition: kmedoids.py:128
pyclustering.cluster.kmedoids.kmedoids.__data_type
__data_type
Definition: kmedoids.py:125
pyclustering.utils
Utils that are used by modules of pyclustering.
Definition: __init__.py:1
pyclustering.cluster.encoder
Module for representing clustering results.
Definition: encoder.py:1
pyclustering.utils.metric
Module provides various distance metrics - abstraction of the notion of distance in a metric space.
Definition: metric.py:1
pyclustering.cluster.kmedoids.kmedoids.__metric
__metric
Definition: kmedoids.py:124
pyclustering.cluster.kmedoids.kmedoids.__verify_arguments
def __verify_arguments(self)
Verify input parameters for the algorithm and throw exception in case of incorrectness.
Definition: kmedoids.py:261