3 @brief The module contains K-Means algorithm and other related services. 4 @details Implementation based on paper @cite inproceedings::kmeans::1. 6 @authors Andrei Novikov (pyclustering@yandex.ru) 8 @copyright GNU Public License 10 @cond GNU_PUBLIC_LICENSE 11 PyClustering is free software: you can redistribute it and/or modify 12 it under the terms of the GNU General Public License as published by 13 the Free Software Foundation, either version 3 of the License, or 14 (at your option) any later version. 16 PyClustering is distributed in the hope that it will be useful, 17 but WITHOUT ANY WARRANTY; without even the implied warranty of 18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 19 GNU General Public License for more details. 21 You should have received a copy of the GNU General Public License 22 along with this program. If not, see <http://www.gnu.org/licenses/>. 32 import matplotlib.pyplot
as plt
33 import matplotlib.animation
as animation
34 except Exception
as error_instance:
35 warnings.warn(
"Impossible to import matplotlib (please, install 'matplotlib'), pyclustering's visualization " 36 "functionality is not available (details: '%s')." % str(error_instance))
38 import pyclustering.core.kmeans_wrapper
as wrapper
40 from pyclustering.core.wrapper
import ccore_library
41 from pyclustering.core.metric_wrapper
import metric_wrapper
51 @brief Observer of K-Means algorithm that is used to collect information about clustering process on each iteration of the algorithm. 59 @brief Initializer of observer of K-Means algorithm. 69 @brief Returns amount of steps that were observer during clustering process in K-Means algorithm. 77 @brief This method is called by K-Means algorithm to notify about changes. 79 @param[in] clusters (array_like): Allocated clusters by K-Means algorithm. 80 @param[in] centers (array_like): Allocated centers by K-Means algorithm. 89 @brief Set evolution of changes of centers during clustering process. 91 @param[in] evolution_centers (array_like): Evolution of changes of centers during clustering process. 99 @brief Get method to return centers at specific iteration of clustering process. 101 @param[in] iteration (uint): Clustering process iteration at which centers are required. 103 @return (array_like) Centers at specific iteration. 111 @brief Set evolution of changes of centers during clustering process. 113 @param[in] evolution_clusters (array_like): Evolution of changes of clusters during clustering process. 121 @brief Get method to return allocated clusters at specific iteration of clustering process. 123 @param[in] iteration (uint): Clustering process iteration at which clusters are required. 125 @return (array_like) Clusters at specific iteration. 134 @brief Visualizer of K-Means algorithm's results. 135 @details K-Means visualizer provides visualization services that are specific for K-Means algorithm. 139 __default_2d_marker_size = 15
140 __default_3d_marker_size = 70
144 def show_clusters(sample, clusters, centers, initial_centers = None, **kwargs):
146 @brief Display K-Means clustering results. 148 @param[in] sample (list): Dataset that was used for clustering. 149 @param[in] clusters (array_like): Clusters that were allocated by the algorithm. 150 @param[in] centers (array_like): Centers that were allocated by the algorithm. 151 @param[in] initial_centers (array_like): Initial centers that were used by the algorithm, if 'None' then initial centers are not displyed. 152 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'figure', 'display', 'offset'). 154 <b>Keyword Args:</b><br> 155 - figure (figure): If 'None' then new is figure is created, otherwise specified figure is used for visualization. 156 - display (bool): If 'True' then figure will be shown by the method, otherwise it should be shown manually using matplotlib function 'plt.show()'. 157 - offset (uint): Specify axes index on the figure where results should be drawn (only if argument 'figure' is specified). 159 @return (figure) Figure where clusters were drawn. 164 visualizer.append_clusters(clusters, sample)
166 offset = kwargs.get(
'offset', 0)
167 figure = kwargs.get(
'figure',
None)
168 display = kwargs.get(
'display',
True)
171 figure = visualizer.show(display=
False)
173 visualizer.show(figure=figure, display=
False)
175 kmeans_visualizer.__draw_centers(figure, offset, visualizer, centers, initial_centers)
176 kmeans_visualizer.__draw_rays(figure, offset, visualizer, sample, clusters, centers)
185 def __draw_rays(figure, offset, visualizer, sample, clusters, centers):
186 ax = figure.get_axes()[offset]
188 for index_cluster
in range(len(clusters)):
189 color = visualizer.get_cluster_color(index_cluster, 0)
190 kmeans_visualizer.__draw_cluster_rays(ax, color, sample, clusters[index_cluster], centers[index_cluster])
194 def __draw_cluster_rays(ax, color, sample, cluster, center):
195 dimension = len(sample[0])
197 for index_point
in cluster:
198 point = sample[index_point]
200 ax.plot([point[0], center[0]], [0.0, 0.0],
'-', color=color, linewidth=0.5)
202 ax.plot([point[0], center[0]], [point[1], center[1]],
'-', color=color, linewidth=0.5)
204 ax.plot([point[0], center[0]], [point[1], center[1]], [point[2], center[2]],
'-', color=color, linewidth=0.5)
208 def __draw_center(ax, center, color, marker, alpha):
209 dimension = len(center)
212 ax.plot(center[0], 0.0, color=color, alpha=alpha, marker=marker, markersize=kmeans_visualizer.__default_2d_marker_size)
214 ax.plot(center[0], center[1], color=color, alpha=alpha, marker=marker, markersize=kmeans_visualizer.__default_2d_marker_size)
216 ax.scatter(center[0], center[1], center[2], c=color, alpha=alpha, marker=marker, s=kmeans_visualizer.__default_3d_marker_size)
220 def __draw_centers(figure, offset, visualizer, centers, initial_centers):
221 ax = figure.get_axes()[offset]
223 for index_center
in range(len(centers)):
224 color = visualizer.get_cluster_color(index_center, 0)
225 kmeans_visualizer.__draw_center(ax, centers[index_center], color,
'*', 1.0)
227 if initial_centers
is not None:
228 kmeans_visualizer.__draw_center(ax, initial_centers[index_center], color,
'*', 0.4)
234 @brief Animates clustering process that is performed by K-Means algorithm. 236 @param[in] data (list): Dataset that is used for clustering. 237 @param[in] observer (kmeans_observer): EM observer that was used for collection information about clustering process. 238 @param[in] animation_velocity (uint): Interval between frames in milliseconds (for run-time animation only). 239 @param[in] movie_fps (uint): Defines frames per second (for rendering movie only). 240 @param[in] save_movie (string): If it is specified then animation will be stored to file that is specified in this parameter. 243 figure = plt.figure()
246 return frame_generation(0)
248 def frame_generation(index_iteration):
251 figure.suptitle(
"K-Means algorithm (iteration: " + str(index_iteration) +
")", fontsize=18, fontweight=
'bold')
253 clusters = observer.get_clusters(index_iteration)
254 centers = observer.get_centers(index_iteration)
255 kmeans_visualizer.show_clusters(data, clusters, centers,
None, figure=figure, display=
False)
257 figure.subplots_adjust(top=0.85)
259 return [figure.gca()]
261 iterations = len(observer)
262 cluster_animation = animation.FuncAnimation(figure, frame_generation, iterations, interval=animation_velocity,
263 init_func=init_frame, repeat_delay=5000)
265 if save_movie
is not None:
266 cluster_animation.save(save_movie, writer=
'ffmpeg', fps=movie_fps, bitrate=3000)
274 @brief Class implements K-Means clustering algorithm. 275 @details K-Means clustering aims to partition n observations into k clusters in which each observation belongs to 276 the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning 277 of the data space into Voronoi cells. 279 K-Means clustering results depend on initial centers. Algorithm K-Means++ can used for initialization of 280 initial centers - see module 'pyclustering.cluster.center_initializer'. 282 CCORE implementation (C/C++ part of the library) of the algorithm performs parallel processing to ensure maximum 285 Implementation based on the paper @cite inproceedings::kmeans::1. 287 @image html kmeans_example_clustering.png "Fig. 1. K-Means clustering results. At the left - 'Simple03.data' sample, at the right - 'Lsun.data' sample." 289 Example #1 - Clustering using K-Means++ for center initialization: 291 from pyclustering.cluster.kmeans import kmeans, kmeans_visualizer 292 from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer 293 from pyclustering.samples.definitions import FCPS_SAMPLES 294 from pyclustering.utils import read_sample 296 # Load list of points for cluster analysis. 297 sample = read_sample(FCPS_SAMPLES.SAMPLE_TWO_DIAMONDS) 299 # Prepare initial centers using K-Means++ method. 300 initial_centers = kmeans_plusplus_initializer(sample, 2).initialize() 302 # Create instance of K-Means algorithm with prepared centers. 303 kmeans_instance = kmeans(sample, initial_centers) 305 # Run cluster analysis and obtain results. 306 kmeans_instance.process() 307 clusters = kmeans_instance.get_clusters() 308 final_centers = kmeans_instance.get_centers() 310 # Visualize obtained results 311 kmeans_visualizer.show_clusters(sample, clusters, final_centers) 314 Example #2 - Clustering using specific distance metric, for example, Manhattan distance: 316 # prepare input data and initial centers for cluster analysis using K-Means 318 # create metric that will be used for clustering 319 manhattan_metric = distance_metric(type_metric.MANHATTAN) 321 # create instance of K-Means using specific distance metric: 322 kmeans_instance = kmeans(sample, initial_centers, metric=manhattan_metric) 324 # run cluster analysis and obtain results 325 kmeans_instance.process() 326 clusters = kmeans_instance.get_clusters() 329 @see center_initializer 333 def __init__(self, data, initial_centers, tolerance=0.001, ccore=True, **kwargs):
335 @brief Constructor of clustering algorithm K-Means. 336 @details Center initializer can be used for creating initial centers, for example, K-Means++ method. 338 @param[in] data (array_like): Input data that is presented as array of points (objects), each point should be represented by array_like data structure. 339 @param[in] initial_centers (array_like): Initial coordinates of centers of clusters that are represented by array_like data structure: [center1, center2, ...]. 340 @param[in] tolerance (double): Stop condition: if maximum value of change of centers of clusters is less than tolerance then algorithm stops processing. 341 @param[in] ccore (bool): Defines should be CCORE library (C++ pyclustering library) used instead of Python code or not. 342 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'observer', 'metric', 'itermax'). 344 <b>Keyword Args:</b><br> 345 - observer (kmeans_observer): Observer of the algorithm to collect information about clustering process on each iteration. 346 - metric (distance_metric): Metric that is used for distance calculation between two points (by default euclidean square distance). 347 - itermax (uint): Maximum number of iterations that is used for clustering process (by default: 200). 349 @see center_initializer 354 self.
__centers = numpy.array(initial_centers)
358 self.
__observer = kwargs.get(
'observer',
None)
360 self.
__itermax = kwargs.get(
'itermax', 100)
362 if self.
__metric.get_type() != type_metric.USER_DEFINED:
367 self.
__ccore = ccore
and self.
__metric.get_type() != type_metric.USER_DEFINED
369 self.
__ccore = ccore_library.workable()
376 @brief Performs cluster analysis in line with rules of K-Means algorithm. 378 @return (kmeans) Returns itself (K-Means instance). 386 raise ValueError(
"Dimension of the input data and dimension of the initial cluster centers must be equal.")
396 def __process_by_ccore(self):
398 @brief Performs cluster analysis using CCORE (C/C++ part of pyclustering library). 401 ccore_metric = metric_wrapper.create_instance(self.
__metric)
408 self.
__observer.set_evolution_clusters(results[2])
409 self.
__observer.set_evolution_centers(results[3])
414 def __process_by_python(self):
416 @brief Performs cluster analysis using python code. 420 maximum_change = float(
'inf')
444 @brief Calculates the closest cluster to each point. 446 @param[in] points (array_like): Points for which closest clusters are calculated. 448 @return (list) List of closest clusters for each point. Each cluster is denoted by index. Return empty 449 collection if 'process()' method was not called. 453 nppoints = numpy.array(points)
457 differences = numpy.zeros((len(nppoints), len(self.
__centers)))
458 for index_point
in range(len(nppoints)):
459 if self.
__metric.get_type() != type_metric.USER_DEFINED:
462 differences[index_point] = [ self.
__metric(nppoints[index_point], center)
for center
in self.
__centers ]
464 return numpy.argmin(differences, axis=1)
469 @brief Returns list of allocated clusters, each cluster contains indexes of objects in list of data. 481 @brief Returns list of centers of allocated clusters. 496 @brief Returns sum of metric errors that depends on metric that was used for clustering (by default SSE - Sum of Squared Errors). 497 @details Sum of metric errors is calculated using distance between point and its center: 498 \f[error=\sum_{i=0}^{N}distance(x_{i}-center(x_{i}))\f] 510 @brief Returns clustering result representation type that indicate how clusters are encoded. 512 @return (type_encoding) Clustering result representation. 518 return type_encoding.CLUSTER_INDEX_LIST_SEPARATION
521 def __update_clusters(self):
523 @brief Calculate distance (in line with specified metric) to each point from the each cluster. Nearest points 524 are captured by according clusters and as a result clusters are updated. 526 @return (list) Updated clusters as list of clusters. Each cluster contains indexes of objects from data. 530 clusters = [[]
for _
in range(len(self.
__centers))]
534 optimum_indexes = numpy.argmin(dataset_differences, axis=0)
535 for index_point
in range(len(optimum_indexes)):
536 index_cluster = optimum_indexes[index_point]
537 clusters[index_cluster].append(index_point)
539 clusters = [cluster
for cluster
in clusters
if len(cluster) > 0]
544 def __update_centers(self):
546 @brief Calculate centers of clusters in line with contained objects. 548 @return (numpy.array) Updated centers. 553 centers = numpy.zeros((len(self.
__clusters), dimension))
557 centers[index] = cluster_points.mean(axis=0)
559 return numpy.array(centers)
562 def __calculate_total_wce(self):
564 @brief Calculate total within cluster errors that is depend on metric that was chosen for K-Means algorithm. 571 for index_cluster
in range(len(self.
__clusters)):
572 for index_point
in self.
__clusters[index_cluster]:
573 self.
__total_wce += dataset_differences[index_cluster][index_point]
576 def __calculate_dataset_difference(self, amount_clusters):
578 @brief Calculate distance from each point to each cluster center. 581 dataset_differences = numpy.zeros((amount_clusters, len(self.
__pointer_data)))
582 for index_center
in range(amount_clusters):
583 if self.
__metric.get_type() != type_metric.USER_DEFINED:
586 dataset_differences[index_center] = [ self.
__metric(point, self.
__centers[index_center])
589 return dataset_differences
592 def __calculate_changes(self, updated_centers):
594 @brief Calculates changes estimation between previous and current iteration using centers for that purpose. 596 @param[in] updated_centers (array_like): New cluster centers. 598 @return (float) Maximum changes between centers. 601 if len(self.
__centers) != len(updated_centers):
602 maximum_change = float(
'inf')
605 if self.
__metric.get_type() != type_metric.USER_DEFINED:
608 changes = [self.
__metric(center, updated_center)
for center, updated_center
in zip(self.
__centers, updated_centers)]
610 maximum_change = numpy.max(changes)
612 return maximum_change
615 def __verify_arguments(self):
617 @brief Verify input parameters for the algorithm and throw exception in case of incorrectness. 621 raise ValueError(
"Input data is empty (size: '%d')." % len(self.
__pointer_data))
624 raise ValueError(
"Initial centers are empty (size: '%d')." % len(self.
__pointer_data))
627 raise ValueError(
"Tolerance (current value: '%d') should be greater or equal to 0." %
631 raise ValueError(
"Maximum iterations (current value: '%d') should be greater or equal to 0." %
Common visualizer of clusters on 1D, 2D or 3D surface.
pyclustering module for cluster analysis.
def notify(self, clusters, centers)
This method is called by K-Means algorithm to notify about changes.
def get_centers(self)
Returns list of centers of allocated clusters.
def __calculate_dataset_difference(self, amount_clusters)
Calculate distance from each point to each cluster center.
def __verify_arguments(self)
Verify input parameters for the algorithm and throw exception in case of incorrectness.
def process(self)
Performs cluster analysis in line with rules of K-Means algorithm.
Module provides various distance metrics - abstraction of the notion of distance in a metric space...
def get_clusters(self, iteration)
Get method to return allocated clusters at specific iteration of clustering process.
def __calculate_changes(self, updated_centers)
Calculates changes estimation between previous and current iteration using centers for that purpose...
Module for representing clustering results.
Distance metric performs distance calculation between two points in line with encapsulated function...
Observer of K-Means algorithm that is used to collect information about clustering process on each it...
def __update_centers(self)
Calculate centers of clusters in line with contained objects.
def get_total_wce(self)
Returns sum of metric errors that depends on metric that was used for clustering (by default SSE - Su...
Class implements K-Means clustering algorithm.
def __init__(self)
Initializer of observer of K-Means algorithm.
Visualizer of K-Means algorithm's results.
def get_centers(self, iteration)
Get method to return centers at specific iteration of clustering process.
def __calculate_total_wce(self)
Calculate total within cluster errors that is depend on metric that was chosen for K-Means algorithm...
def set_evolution_centers(self, evolution_centers)
Set evolution of changes of centers during clustering process.
def __init__(self, data, initial_centers, tolerance=0.001, ccore=True, kwargs)
Constructor of clustering algorithm K-Means.
def __len__(self)
Returns amount of steps that were observer during clustering process in K-Means algorithm.
def set_evolution_clusters(self, evolution_clusters)
Set evolution of changes of centers during clustering process.
def show_clusters(sample, clusters, centers, initial_centers=None, kwargs)
Display K-Means clustering results.
def get_clusters(self)
Returns list of allocated clusters, each cluster contains indexes of objects in list of data...
def __process_by_ccore(self)
Performs cluster analysis using CCORE (C/C++ part of pyclustering library).
def get_cluster_encoding(self)
Returns clustering result representation type that indicate how clusters are encoded.
def __update_clusters(self)
Calculate distance (in line with specified metric) to each point from the each cluster.
def __process_by_python(self)
Performs cluster analysis using python code.
def animate_cluster_allocation(data, observer, animation_velocity=500, movie_fps=1, save_movie=None)
Animates clustering process that is performed by K-Means algorithm.
def predict(self, points)
Calculates the closest cluster to each point.