3 @brief Cluster analysis algorithm: K-Medoids. 4 @details Implementation based on papers @cite book::algorithms_for_clustering_data, @cite book::finding_groups_in_data. 6 @authors Andrei Novikov (pyclustering@yandex.ru) 8 @copyright GNU Public License 10 @cond GNU_PUBLIC_LICENSE 11 PyClustering is free software: you can redistribute it and/or modify 12 it under the terms of the GNU General Public License as published by 13 the Free Software Foundation, either version 3 of the License, or 14 (at your option) any later version. 16 PyClustering is distributed in the hope that it will be useful, 17 but WITHOUT ANY WARRANTY; without even the implied warranty of 18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 19 GNU General Public License for more details. 21 You should have received a copy of the GNU General Public License 22 along with this program. If not, see <http://www.gnu.org/licenses/>. 35 import pyclustering.core.kmedoids_wrapper
as wrapper
37 from pyclustering.core.wrapper
import ccore_library
38 from pyclustering.core.metric_wrapper
import metric_wrapper
43 @brief Class represents clustering algorithm K-Medoids. 44 @details The algorithm is less sensitive to outliers tham K-Means. The principle difference between K-Medoids and K-Medians is that 45 K-Medoids uses existed points from input data space as medoids, but median in K-Medians can be unreal object (not from 50 from pyclustering.cluster.kmedoids import kmedoids 51 from pyclustering.cluster import cluster_visualizer 52 from pyclustering.utils import read_sample 53 from pyclustering.samples.definitions import FCPS_SAMPLES 55 # Load list of points for cluster analysis. 56 sample = read_sample(FCPS_SAMPLES.SAMPLE_TWO_DIAMONDS) 58 # Set random initial medoids. 59 initial_medoids = [1, 500] 61 # Create instance of K-Medoids algorithm. 62 kmedoids_instance = kmedoids(sample, initial_medoids) 64 # Run cluster analysis and obtain results. 65 kmedoids_instance.process() 66 clusters = kmedoids_instance.get_clusters() 68 # Show allocated clusters. 72 visualizer = cluster_visualizer() 73 visualizer.append_clusters(clusters, sample) 77 Metric for calculation distance between points can be specified by parameter additional 'metric': 79 # create Minkowski distance metric with degree equals to '2' 80 metric = distance_metric(type_metric.MINKOWSKI, degree=2) 82 # create K-Medoids algorithm with specific distance metric 83 kmedoids_instance = kmedoids(sample, initial_medoids, metric=metric) 85 # run cluster analysis and obtain results 86 kmedoids_instance.process() 87 clusters = kmedoids_instance.get_clusters() 90 Distance matrix can be used instead of sequence of points to increase performance and for that purpose parameter 'data_type' should be used: 92 # calculate distance matrix for sample 93 sample = read_sample(path_to_sample) 94 matrix = calculate_distance_matrix(sample) 96 # create K-Medoids algorithm for processing distance matrix instead of points 97 kmedoids_instance = kmedoids(matrix, initial_medoids, data_type='distance_matrix') 99 # run cluster analysis and obtain results 100 kmedoids_instance.process() 102 clusters = kmedoids_instance.get_clusters() 103 medoids = kmedoids_instance.get_medoids() 109 def __init__(self, data, initial_index_medoids, tolerance=0.001, ccore=True, **kwargs):
111 @brief Constructor of clustering algorithm K-Medoids. 113 @param[in] data (list): Input data that is presented as list of points (objects), each point should be represented by list or tuple. 114 @param[in] initial_index_medoids (list): Indexes of intial medoids (indexes of points in input data). 115 @param[in] tolerance (double): Stop condition: if maximum value of distance change of medoids of clusters is less than tolerance than algorithm will stop processing. 116 @param[in] ccore (bool): If specified than CCORE library (C++ pyclustering library) is used for clustering instead of Python code. 117 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'metric', 'data_type', 'itermax'). 119 <b>Keyword Args:</b><br> 120 - metric (distance_metric): Metric that is used for distance calculation between two points. 121 - data_type (string): Data type of input sample 'data' that is processed by the algorithm ('points', 'distance_matrix'). 122 - itermax (uint): Maximum number of iteration for cluster analysis. 131 self.
__data_type = kwargs.get(
'data_type',
'points')
132 self.
__itermax = kwargs.get(
'itermax', 200)
136 self.
__ccore = ccore
and self.
__metric.get_type() != type_metric.USER_DEFINED
138 self.
__ccore = ccore_library.workable()
145 @brief Performs cluster analysis in line with rules of K-Medoids algorithm. 147 @return (kmedoids) Returns itself (K-Medoids instance). 149 @remark Results of clustering can be obtained using corresponding get methods. 157 ccore_metric = metric_wrapper.create_instance(self.
__metric)
161 changes = float(
'inf')
179 @brief Calculates the closest cluster to each point. 181 @param[in] points (array_like): Points for which closest clusters are calculated. 183 @return (list) List of closest clusters for each point. Each cluster is denoted by index. Return empty 184 collection if 'process()' method was not called. 186 An example how to calculate (or predict) the closest cluster to specified points. 188 from pyclustering.cluster.kmedoids import kmedoids 189 from pyclustering.samples.definitions import SIMPLE_SAMPLES 190 from pyclustering.utils import read_sample 192 # Load list of points for cluster analysis. 193 sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3) 195 # Initial medoids for sample 'Simple3'. 196 initial_medoids = [4, 12, 25, 37] 198 # Create instance of K-Medoids algorithm with prepared centers. 199 kmedoids_instance = kmedoids(sample, initial_medoids) 201 # Run cluster analysis. 202 kmedoids_instance.process() 204 # Calculate the closest cluster to following two points. 205 points = [[0.35, 0.5], [2.5, 2.0]] 206 closest_clusters = kmedoids_instance.predict(points) 207 print(closest_clusters) 216 differences = numpy.zeros((len(points), len(medoids)))
217 for index_point
in range(len(points)):
218 differences[index_point] = [ self.
__metric(points[index_point], center)
for center
in medoids ]
220 return numpy.argmin(differences, axis=1)
225 @brief Returns list of allocated clusters, each cluster contains indexes of objects in list of data. 237 @brief Returns list of medoids of allocated clusters represented by indexes from the input data. 249 @brief Returns clustering result representation type that indicate how clusters are encoded. 251 @return (type_encoding) Clustering result representation. 257 return type_encoding.CLUSTER_INDEX_LIST_SEPARATION
260 def __verify_instance(self):
264 def __create_distance_calculator(self):
266 @brief Creates distance calculator in line with algorithms parameters. 268 @return (callable) Distance calculator. 271 if self.__data_type ==
'points':
272 return lambda index1, index2: self.__metric(self.__pointer_data[index1], self.__pointer_data[index2])
274 elif self.__data_type ==
'distance_matrix':
275 if isinstance(self.__pointer_data, numpy.matrix):
276 return lambda index1, index2: self.__pointer_data.item((index1, index2))
278 return lambda index1, index2: self.__pointer_data[index1][index2]
281 raise TypeError(
"Unknown type of data is specified '%s'" % self.__data_type)
284 def __update_clusters(self):
286 @brief Calculate distance to each point from the each cluster. 287 @details Nearest points are captured by according clusters and as a result clusters are updated. 289 @return (list) updated clusters as list of clusters where each cluster contains indexes of objects from data. 293 clusters = [[self.__medoid_indexes[i]]
for i
in range(len(self.__medoid_indexes))]
294 for index_point
in range(len(self.__pointer_data)):
295 if index_point
in self.__medoid_indexes:
299 dist_optim = float(
'Inf')
301 for index
in range(len(self.__medoid_indexes)):
302 dist = self.__distance_calculator(index_point, self.__medoid_indexes[index])
304 if dist < dist_optim:
308 clusters[index_optim].append(index_point)
313 def __update_medoids(self):
315 @brief Find medoids of clusters in line with contained objects. 317 @return (list) list of medoids for current number of clusters. 321 medoid_indexes = [-1] * len(self.__clusters)
323 for index
in range(len(self.__clusters)):
324 medoid_index = medoid(self.__pointer_data, self.__clusters[index], metric=self.__metric, data_type=self.__data_type)
325 medoid_indexes[index] = medoid_index
327 return medoid_indexes
Module provides various distance metrics - abstraction of the notion of distance in a metric space...
def __update_clusters(self)
Calculate distance to each point from the each cluster.
Utils that are used by modules of pyclustering.
def __update_medoids(self)
Find medoids of clusters in line with contained objects.
def get_cluster_encoding(self)
Returns clustering result representation type that indicate how clusters are encoded.
Module for representing clustering results.
Distance metric performs distance calculation between two points in line with encapsulated function...
def __init__(self, data, initial_index_medoids, tolerance=0.001, ccore=True, kwargs)
Constructor of clustering algorithm K-Medoids.
def __create_distance_calculator(self)
Creates distance calculator in line with algorithms parameters.
Class represents clustering algorithm K-Medoids.
def get_clusters(self)
Returns list of allocated clusters, each cluster contains indexes of objects in list of data...
def process(self)
Performs cluster analysis in line with rules of K-Medoids algorithm.
def get_medoids(self)
Returns list of medoids of allocated clusters represented by indexes from the input data...
def predict(self, points)
Calculates the closest cluster to each point.