kmedoids.py
1 """!
2 
3 @brief Cluster analysis algorithm: K-Medoids (PAM - Partitioning Around Medoids).
4 @details Implementation based on papers @cite book::algorithms_for_clustering_data, @cite book::finding_groups_in_data.
5 
6 @authors Andrei Novikov (pyclustering@yandex.ru)
7 @date 2014-2018
8 @copyright GNU Public License
9 
10 @cond GNU_PUBLIC_LICENSE
11  PyClustering is free software: you can redistribute it and/or modify
12  it under the terms of the GNU General Public License as published by
13  the Free Software Foundation, either version 3 of the License, or
14  (at your option) any later version.
15 
16  PyClustering is distributed in the hope that it will be useful,
17  but WITHOUT ANY WARRANTY; without even the implied warranty of
18  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
19  GNU General Public License for more details.
20 
21  You should have received a copy of the GNU General Public License
22  along with this program. If not, see <http://www.gnu.org/licenses/>.
23 @endcond
24 
25 """
26 
27 
28 import numpy
29 
30 from pyclustering.cluster.encoder import type_encoding
31 
32 from pyclustering.utils import median
33 from pyclustering.utils.metric import distance_metric, type_metric
34 
35 import pyclustering.core.kmedoids_wrapper as wrapper
36 
37 from pyclustering.core.wrapper import ccore_library
38 from pyclustering.core.metric_wrapper import metric_wrapper
39 
40 
41 class kmedoids:
42  """!
43  @brief Class represents clustering algorithm K-Medoids (another one title is PAM - Partitioning Around Medoids).
44  @details The algorithm is less sensitive to outliers tham K-Means. The principle difference between K-Medoids and K-Medians is that
45  K-Medoids uses existed points from input data space as medoids, but median in K-Medians can be unreal object (not from
46  input data space).
47 
48  CCORE option can be used to use core pyclustering - C/C++ shared library for processing that significantly increases performance.
49 
50  Clustering example:
51  @code
52  # load list of points for cluster analysis
53  sample = read_sample(path)
54 
55  # set random initial medoids
56  initial_medoids = [1, 10]
57 
58  # create instance of K-Medoids algorithm
59  kmedoids_instance = kmedoids(sample, initial_medoids)
60 
61  # run cluster analysis and obtain results
62  kmedoids_instance.process();
63  clusters = kmedoids_instance.get_clusters()
64 
65  # show allocated clusters
66  print(clusters)
67  @endcode
68 
69  Metric for calculation distance between points can be specified by parameter additional 'metric':
70  @code
71  # create Minkowski distance metric with degree equals to '2'
72  metric = distance_metric(type_metric.MINKOWSKI, degree=2)
73 
74  # create K-Medoids algorithm with specific distance metric
75  kmedoids_instance = kmedoids(sample, initial_medoids, metric=metric)
76 
77  # run cluster analysis and obtain results
78  kmedoids_instance.process()
79  clusters = kmedoids_instance.get_clusters()
80  @endcode
81 
82  Distance matrix can be used instead of sequence of points to increase performance and for that purpose parameter 'data_type' should be used:
83  @code
84  # calculate distance matrix for sample
85  sample = read_sample(path_to_sample)
86  matrix = calculate_distance_matrix(sample)
87 
88  # create K-Medoids algorithm for processing distance matrix instead of points
89  kmedoids_instance = kmedoids(matrix, initial_medoids, data_type='distance_matrix')
90 
91  # run cluster analysis and obtain results
92  kmedoids_instance.process()
93 
94  clusters = kmedoids_instance.get_clusters()
95  medoids = kmedoids_instance.get_medoids()
96  @endcode
97 
98  """
99 
100 
101  def __init__(self, data, initial_index_medoids, tolerance=0.001, ccore=True, **kwargs):
102  """!
103  @brief Constructor of clustering algorithm K-Medoids.
104 
105  @param[in] data (list): Input data that is presented as list of points (objects), each point should be represented by list or tuple.
106  @param[in] initial_index_medoids (list): Indexes of intial medoids (indexes of points in input data).
107  @param[in] tolerance (double): Stop condition: if maximum value of distance change of medoids of clusters is less than tolerance than algorithm will stop processing.
108  @param[in] ccore (bool): If specified than CCORE library (C++ pyclustering library) is used for clustering instead of Python code.
109  @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'metric', 'data_type').
110 
111  <b>Keyword Args:</b><br>
112  - metric (distance_metric): Metric that is used for distance calculation between two points.
113  - data_type (string): Data type of input sample 'data' that is processed by the algorithm ('points', 'distance_matrix').
114 
115  """
116  self.__pointer_data = data
117  self.__clusters = []
118  self.__medoid_indexes = initial_index_medoids
119  self.__tolerance = tolerance
120 
121  self.__metric = kwargs.get('metric', distance_metric(type_metric.EUCLIDEAN_SQUARE))
122  self.__data_type = kwargs.get('data_type', 'points')
124 
125  self.__ccore = ccore and self.__metric.get_type() != type_metric.USER_DEFINED
126  if self.__ccore:
127  self.__ccore = ccore_library.workable()
128 
129 
130  def process(self):
131  """!
132  @brief Performs cluster analysis in line with rules of K-Medoids algorithm.
133 
134  @return (kmedoids) Returns itself (K-Medoids instance).
135 
136  @remark Results of clustering can be obtained using corresponding get methods.
137 
138  @see get_clusters()
139  @see get_medoids()
140 
141  """
142 
143  if self.__ccore is True:
144  ccore_metric = metric_wrapper.create_instance(self.__metric)
145  self.__clusters, self.__medoid_indexes = wrapper.kmedoids(self.__pointer_data, self.__medoid_indexes, self.__tolerance, ccore_metric.get_pointer(), self.__data_type)
146 
147  else:
148  changes = float('inf')
149 
150  stop_condition = self.__tolerance
151 
152  while changes > stop_condition:
153  self.__clusters = self.__update_clusters()
154  update_medoid_indexes = self.__update_medoids()
155 
156  changes = max([self.__distance_calculator(self.__medoid_indexes[index], update_medoid_indexes[index]) for index in range(len(update_medoid_indexes))])
157 
158  self.__medoid_indexes = update_medoid_indexes
159 
160  return self
161 
162 
163  def get_clusters(self):
164  """!
165  @brief Returns list of allocated clusters, each cluster contains indexes of objects in list of data.
166 
167  @see process()
168  @see get_medoids()
169 
170  """
171 
172  return self.__clusters
173 
174 
175  def get_medoids(self):
176  """!
177  @brief Returns list of medoids of allocated clusters represented by indexes from the input data.
178 
179  @see process()
180  @see get_clusters()
181 
182  """
183 
184  return self.__medoid_indexes
185 
186 
188  """!
189  @brief Returns clustering result representation type that indicate how clusters are encoded.
190 
191  @return (type_encoding) Clustering result representation.
192 
193  @see get_clusters()
194 
195  """
196 
197  return type_encoding.CLUSTER_INDEX_LIST_SEPARATION
198 
199 
200  def __create_distance_calculator(self):
201  """!
202  @brief Creates distance calculator in line with algorithms parameters.
203 
204  @return (callable) Distance calculator.
205 
206  """
207  if self.__data_type == 'points':
208  return lambda index1, index2: self.__metric(self.__pointer_data[index1], self.__pointer_data[index2])
209 
210  elif self.__data_type == 'distance_matrix':
211  if isinstance(self.__pointer_data, numpy.matrix):
212  return lambda index1, index2: self.__pointer_data.item((index1, index2))
213 
214  return lambda index1, index2: self.__pointer_data[index1][index2]
215 
216  else:
217  raise TypeError("Unknown type of data is specified '%s'" % self.__data_type)
218 
219 
220  def __update_clusters(self):
221  """!
222  @brief Calculate distance to each point from the each cluster.
223  @details Nearest points are captured by according clusters and as a result clusters are updated.
224 
225  @return (list) updated clusters as list of clusters where each cluster contains indexes of objects from data.
226 
227  """
228 
229  clusters = [[self.__medoid_indexes[i]] for i in range(len(self.__medoid_indexes))]
230  for index_point in range(len(self.__pointer_data)):
231  if index_point in self.__medoid_indexes:
232  continue
233 
234  index_optim = -1
235  dist_optim = float('Inf')
236 
237  for index in range(len(self.__medoid_indexes)):
238  dist = self.__distance_calculator(index_point, self.__medoid_indexes[index])
239 
240  if dist < dist_optim:
241  index_optim = index
242  dist_optim = dist
243 
244  clusters[index_optim].append(index_point)
245 
246  return clusters
247 
248 
249  def __update_medoids(self):
250  """!
251  @brief Find medoids of clusters in line with contained objects.
252 
253  @return (list) list of medoids for current number of clusters.
254 
255  """
256 
257  medoid_indexes = [-1] * len(self.__clusters)
258 
259  for index in range(len(self.__clusters)):
260  medoid_index = median(self.__pointer_data, self.__clusters[index], metric=self.__metric, data_type=self.__data_type)
261  medoid_indexes[index] = medoid_index
262 
263  return medoid_indexes
Module provides various distance metrics - abstraction of the notion of distance in a metric space...
Definition: metric.py:1
def __update_clusters(self)
Calculate distance to each point from the each cluster.
Definition: kmedoids.py:220
Utils that are used by modules of pyclustering.
Definition: __init__.py:1
def __update_medoids(self)
Find medoids of clusters in line with contained objects.
Definition: kmedoids.py:249
def get_cluster_encoding(self)
Returns clustering result representation type that indicate how clusters are encoded.
Definition: kmedoids.py:187
Module for representing clustering results.
Definition: encoder.py:1
Distance metric performs distance calculation between two points in line with encapsulated function...
Definition: metric.py:58
def __init__(self, data, initial_index_medoids, tolerance=0.001, ccore=True, kwargs)
Constructor of clustering algorithm K-Medoids.
Definition: kmedoids.py:101
def __create_distance_calculator(self)
Creates distance calculator in line with algorithms parameters.
Definition: kmedoids.py:200
Class represents clustering algorithm K-Medoids (another one title is PAM - Partitioning Around Medoi...
Definition: kmedoids.py:41
def get_clusters(self)
Returns list of allocated clusters, each cluster contains indexes of objects in list of data...
Definition: kmedoids.py:163
def process(self)
Performs cluster analysis in line with rules of K-Medoids algorithm.
Definition: kmedoids.py:130
def get_medoids(self)
Returns list of medoids of allocated clusters represented by indexes from the input data...
Definition: kmedoids.py:175