3 @brief Cluster analysis algorithm: K-Means 4 @details Implementation based on paper @cite inproceedings::kmeans::1. 6 @authors Andrei Novikov (pyclustering@yandex.ru) 8 @copyright GNU Public License 10 @cond GNU_PUBLIC_LICENSE 11 PyClustering is free software: you can redistribute it and/or modify 12 it under the terms of the GNU General Public License as published by 13 the Free Software Foundation, either version 3 of the License, or 14 (at your option) any later version. 16 PyClustering is distributed in the hope that it will be useful, 17 but WITHOUT ANY WARRANTY; without even the implied warranty of 18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 19 GNU General Public License for more details. 21 You should have received a copy of the GNU General Public License 22 along with this program. If not, see <http://www.gnu.org/licenses/>. 32 import matplotlib.pyplot
as plt
33 import matplotlib.animation
as animation
34 except Exception
as error_instance:
35 warnings.warn(
"Impossible to import matplotlib (please, install 'matplotlib'), pyclustering's visualization " 36 "functionality is not available (details: '%s')." % str(error_instance))
38 import pyclustering.core.kmeans_wrapper
as wrapper
40 from pyclustering.core.wrapper
import ccore_library
41 from pyclustering.core.metric_wrapper
import metric_wrapper
51 @brief Observer of K-Means algorithm that is used to collect information about clustering process on each iteration of the algorithm. 59 @brief Initializer of observer of K-Means algorithm. 69 @brief Returns amount of steps that were observer during clustering process in K-Means algorithm. 77 @brief This method is called by K-Means algorithm to notify about changes. 79 @param[in] clusters (array_like): Allocated clusters by K-Means algorithm. 80 @param[in] centers (array_like): Allocated centers by K-Means algorithm. 89 @brief Set evolution of changes of centers during clustering process. 91 @param[in] evolution_centers (array_like): Evolution of changes of centers during clustering process. 99 @brief Get method to return centers at specific iteration of clustering process. 101 @param[in] iteration (uint): Clustering process iteration at which centers are required. 103 @return (array_like) Centers at specific iteration. 111 @brief Set evolution of changes of centers during clustering process. 113 @param[in] evolution_clusters (array_like): Evolution of changes of clusters during clustering process. 121 @brief Get method to return allocated clusters at specific iteration of clustering process. 123 @param[in] iteration (uint): Clustering process iteration at which clusters are required. 125 @return (array_like) Clusters at specific iteration. 134 @brief Visualizer of K-Means algorithm's results. 135 @details K-Means visualizer provides visualization services that are specific for K-Means algorithm. 139 __default_2d_marker_size = 15
140 __default_3d_marker_size = 70
144 def show_clusters(sample, clusters, centers, initial_centers = None, **kwargs):
146 @brief Display K-Means clustering results. 148 @param[in] sample (list): Dataset that was used for clustering. 149 @param[in] clusters (array_like): Clusters that were allocated by the algorithm. 150 @param[in] centers (array_like): Centers that were allocated by the algorithm. 151 @param[in] initial_centers (array_like): Initial centers that were used by the algorithm, if 'None' then initial centers are not displyed. 152 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'figure', 'display', 'offset'). 154 <b>Keyword Args:</b><br> 155 - figure (figure): If 'None' then new is figure is created, otherwise specified figure is used for visualization. 156 - display (bool): If 'True' then figure will be shown by the method, otherwise it should be shown manually using matplotlib function 'plt.show()'. 157 - offset (uint): Specify axes index on the figure where results should be drawn (only if argument 'figure' is specified). 159 @return (figure) Figure where clusters were drawn. 164 visualizer.append_clusters(clusters, sample)
166 offset = kwargs.get(
'offset', 0)
167 figure = kwargs.get(
'figure',
None)
168 display = kwargs.get(
'display',
True)
171 figure = visualizer.show(display =
False)
173 visualizer.show(figure = figure, display =
False)
175 kmeans_visualizer.__draw_centers(figure, offset, visualizer, centers, initial_centers)
176 kmeans_visualizer.__draw_rays(figure, offset, visualizer, sample, clusters, centers)
185 def __draw_rays(figure, offset, visualizer, sample, clusters, centers):
186 ax = figure.get_axes()[offset]
188 for index_cluster
in range(len(clusters)):
189 color = visualizer.get_cluster_color(index_cluster, 0)
190 kmeans_visualizer.__draw_cluster_rays(ax, color, sample, clusters[index_cluster], centers[index_cluster])
194 def __draw_cluster_rays(ax, color, sample, cluster, center):
195 dimension = len(sample[0])
197 for index_point
in cluster:
198 point = sample[index_point]
200 ax.plot([point[0], center[0]], [0.0, 0.0],
'-', color=color, linewidth=0.5)
202 ax.plot([point[0], center[0]], [point[1], center[1]],
'-', color=color, linewidth=0.5)
204 ax.plot([point[0], center[0]], [point[1], center[1]], [point[2], center[2]],
'-', color=color, linewidth=0.5)
208 def __draw_center(ax, center, color, marker, alpha):
209 dimension = len(center)
212 ax.plot(center[0], 0.0, color=color, alpha=alpha, marker=marker, markersize=kmeans_visualizer.__default_2d_marker_size)
214 ax.plot(center[0], center[1], color=color, alpha=alpha, marker=marker, markersize=kmeans_visualizer.__default_2d_marker_size)
216 ax.scatter(center[0], center[1], center[2], c=color, alpha=alpha, marker=marker, s=kmeans_visualizer.__default_3d_marker_size)
220 def __draw_centers(figure, offset, visualizer, centers, initial_centers):
221 ax = figure.get_axes()[offset]
223 for index_center
in range(len(centers)):
224 color = visualizer.get_cluster_color(index_center, 0)
225 kmeans_visualizer.__draw_center(ax, centers[index_center], color,
'*', 1.0)
227 if initial_centers
is not None:
228 kmeans_visualizer.__draw_center(ax, initial_centers[index_center], color,
'*', 0.4)
234 @brief Animates clustering process that is performed by K-Means algorithm. 236 @param[in] data (list): Dataset that is used for clustering. 237 @param[in] observer (kmeans_observer): EM observer that was used for collection information about clustering process. 238 @param[in] animation_velocity (uint): Interval between frames in milliseconds (for run-time animation only). 239 @param[in] movie_fps (uint): Defines frames per second (for rendering movie only). 240 @param[in] save_movie (string): If it is specified then animation will be stored to file that is specified in this parameter. 243 figure = plt.figure()
246 return frame_generation(0)
248 def frame_generation(index_iteration):
251 figure.suptitle(
"K-Means algorithm (iteration: " + str(index_iteration) +
")", fontsize=18, fontweight=
'bold')
253 clusters = observer.get_clusters(index_iteration)
254 centers = observer.get_centers(index_iteration)
255 kmeans_visualizer.show_clusters(data, clusters, centers,
None, figure=figure, display=
False)
257 figure.subplots_adjust(top=0.85)
259 return [figure.gca()]
261 iterations = len(observer)
262 cluster_animation = animation.FuncAnimation(figure, frame_generation, iterations, interval=animation_velocity,
263 init_func=init_frame, repeat_delay=5000)
265 if save_movie
is not None:
266 cluster_animation.save(save_movie, writer=
'ffmpeg', fps=movie_fps, bitrate=3000)
274 @brief Class represents K-Means clustering algorithm. 275 @details CCORE implementation of the algorithm uses thread pool to parallelize the clustering process. 277 K-Means clustering results depend on initial centers. Algorithm K-Means++ can used for initialization 278 initial centers from module 'pyclustering.cluster.center_initializer'. 280 @image html kmeans_example_clustering.png "K-Means clustering results. At the left - 'Simple03.data' sample, at the right - 'Lsun.data' sample." 282 Example #1 - Clustering using K-Means++ for center initialization: 284 from pyclustering.cluster.kmeans import kmeans, kmeans_visualizer 285 from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer 286 from pyclustering.samples.definitions import FCPS_SAMPLES 287 from pyclustering.utils import read_sample 289 # Load list of points for cluster analysis. 290 sample = read_sample(FCPS_SAMPLES.SAMPLE_TWO_DIAMONDS) 292 # Prepare initial centers using K-Means++ method. 293 initial_centers = kmeans_plusplus_initializer(sample, 2).initialize() 295 # Create instance of K-Means algorithm with prepared centers. 296 kmeans_instance = kmeans(sample, initial_centers) 298 # Run cluster analysis and obtain results. 299 kmeans_instance.process() 300 clusters = kmeans_instance.get_clusters() 301 final_centers = kmeans_instance.get_centers() 303 # Visualize obtained results 304 kmeans_visualizer.show_clusters(sample, clusters, final_centers) 307 Example #2 - Clustering using specific distance metric, for example, Manhattan distance: 309 # prepare input data and initial centers for cluster analysis using K-Means 311 # create metric that will be used for clustering 312 manhattan_metric = distance_metric(type_metric.MANHATTAN) 314 # create instance of K-Means using specific distance metric: 315 kmeans_instance = kmeans(sample, initial_centers, metric=manhattan_metric) 317 # run cluster analysis and obtain results 318 kmeans_instance.process() 319 clusters = kmeans_instance.get_clusters() 322 @see center_initializer 326 def __init__(self, data, initial_centers, tolerance=0.001, ccore=True, **kwargs):
328 @brief Constructor of clustering algorithm K-Means. 329 @details Center initializer can be used for creating initial centers, for example, K-Means++ method. 331 @param[in] data (array_like): Input data that is presented as array of points (objects), each point should be represented by array_like data structure. 332 @param[in] initial_centers (array_like): Initial coordinates of centers of clusters that are represented by array_like data structure: [center1, center2, ...]. 333 @param[in] tolerance (double): Stop condition: if maximum value of change of centers of clusters is less than tolerance then algorithm stops processing. 334 @param[in] ccore (bool): Defines should be CCORE library (C++ pyclustering library) used instead of Python code or not. 335 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'observer', 'metric', 'itermax'). 337 <b>Keyword Args:</b><br> 338 - observer (kmeans_observer): Observer of the algorithm to collect information about clustering process on each iteration. 339 - metric (distance_metric): Metric that is used for distance calculation between two points (by default euclidean square distance). 340 - itermax (uint): Maximum number of iterations that is used for clustering process (by default: 200). 342 @see center_initializer 347 self.
__centers = numpy.array(initial_centers)
351 self.
__observer = kwargs.get(
'observer',
None)
353 self.
__itermax = kwargs.get(
'itermax', 100)
355 if self.
__metric.get_type() != type_metric.USER_DEFINED:
360 self.
__ccore = ccore
and self.
__metric.get_type() != type_metric.USER_DEFINED
362 self.
__ccore = ccore_library.workable()
367 @brief Performs cluster analysis in line with rules of K-Means algorithm. 369 @return (kmeans) Returns itself (K-Means instance). 371 @remark Results of clustering can be obtained using corresponding get methods. 379 raise ValueError(
"Dimension of the input data and dimension of the initial cluster centers must be equal.")
389 def __process_by_ccore(self):
391 @brief Performs cluster analysis using CCORE (C/C++ part of pyclustering library). 394 ccore_metric = metric_wrapper.create_instance(self.
__metric)
401 self.
__observer.set_evolution_clusters(results[2])
402 self.
__observer.set_evolution_centers(results[3])
407 def __process_by_python(self):
409 @brief Performs cluster analysis using python code. 413 maximum_change = float(
'inf')
437 @brief Calculates the closest cluster to each point. 439 @param[in] points (array_like): Points for which closest clusters are calculated. 441 @return (list) List of closest clusters for each point. Each cluster is denoted by index. Return empty 442 collection if 'process()' method was not called. 446 nppoints = numpy.array(points)
450 differences = numpy.zeros((len(nppoints), len(self.
__centers)))
451 for index_point
in range(len(nppoints)):
452 if self.
__metric.get_type() != type_metric.USER_DEFINED:
455 differences[index_point] = [ self.
__metric(nppoints[index_point], center)
for center
in self.
__centers ]
457 return numpy.argmin(differences, axis=1)
462 @brief Returns list of allocated clusters, each cluster contains indexes of objects in list of data. 474 @brief Returns list of centers of allocated clusters. 489 @brief Returns sum of metric errors that depends on metric that was used for clustering (by default SSE - Sum of Squared Errors). 490 @details Sum of metric errors is calculated using distance between point and its center: 491 \f[error=\sum_{i=0}^{N}distance(x_{i}-center(x_{i}))\f] 503 @brief Returns clustering result representation type that indicate how clusters are encoded. 505 @return (type_encoding) Clustering result representation. 511 return type_encoding.CLUSTER_INDEX_LIST_SEPARATION
514 def __update_clusters(self):
516 @brief Calculate distance (in line with specified metric) to each point from the each cluster. Nearest points 517 are captured by according clusters and as a result clusters are updated. 519 @return (list) Updated clusters as list of clusters. Each cluster contains indexes of objects from data. 523 clusters = [[]
for _
in range(len(self.
__centers))]
527 optimum_indexes = numpy.argmin(dataset_differences, axis=0)
528 for index_point
in range(len(optimum_indexes)):
529 index_cluster = optimum_indexes[index_point]
530 clusters[index_cluster].append(index_point)
532 clusters = [cluster
for cluster
in clusters
if len(cluster) > 0]
537 def __update_centers(self):
539 @brief Calculate centers of clusters in line with contained objects. 541 @return (numpy.array) Updated centers. 546 centers = numpy.zeros((len(self.
__clusters), dimension))
550 centers[index] = cluster_points.mean(axis=0)
552 return numpy.array(centers)
555 def __calculate_total_wce(self):
557 @brief Calculate total within cluster errors that is depend on metric that was chosen for K-Means algorithm. 564 for index_cluster
in range(len(self.
__clusters)):
565 for index_point
in self.
__clusters[index_cluster]:
566 self.
__total_wce += dataset_differences[index_cluster][index_point]
569 def __calculate_dataset_difference(self, amount_clusters):
571 @brief Calculate distance from each point to each cluster center. 574 dataset_differences = numpy.zeros((amount_clusters, len(self.
__pointer_data)))
575 for index_center
in range(amount_clusters):
576 if self.
__metric.get_type() != type_metric.USER_DEFINED:
579 dataset_differences[index_center] = [ self.
__metric(point, self.
__centers[index_center])
582 return dataset_differences
585 def __calculate_changes(self, updated_centers):
587 @brief Calculates changes estimation between previous and current iteration using centers for that purpose. 589 @param[in] updated_centers (array_like): New cluster centers. 591 @return (float) Maximum changes between centers. 594 if len(self.
__centers) != len(updated_centers):
595 maximum_change = float(
'inf')
599 maximum_change = numpy.max(changes)
601 return maximum_change
Common visualizer of clusters on 1D, 2D or 3D surface.
pyclustering module for cluster analysis.
def notify(self, clusters, centers)
This method is called by K-Means algorithm to notify about changes.
def get_centers(self)
Returns list of centers of allocated clusters.
def __calculate_dataset_difference(self, amount_clusters)
Calculate distance from each point to each cluster center.
def process(self)
Performs cluster analysis in line with rules of K-Means algorithm.
Module provides various distance metrics - abstraction of the notion of distance in a metric space...
def get_clusters(self, iteration)
Get method to return allocated clusters at specific iteration of clustering process.
def __calculate_changes(self, updated_centers)
Calculates changes estimation between previous and current iteration using centers for that purpose...
Module for representing clustering results.
Distance metric performs distance calculation between two points in line with encapsulated function...
Observer of K-Means algorithm that is used to collect information about clustering process on each it...
def __update_centers(self)
Calculate centers of clusters in line with contained objects.
def get_total_wce(self)
Returns sum of metric errors that depends on metric that was used for clustering (by default SSE - Su...
Class represents K-Means clustering algorithm.
def __init__(self)
Initializer of observer of K-Means algorithm.
Visualizer of K-Means algorithm's results.
def get_centers(self, iteration)
Get method to return centers at specific iteration of clustering process.
def __calculate_total_wce(self)
Calculate total within cluster errors that is depend on metric that was chosen for K-Means algorithm...
def set_evolution_centers(self, evolution_centers)
Set evolution of changes of centers during clustering process.
def __init__(self, data, initial_centers, tolerance=0.001, ccore=True, kwargs)
Constructor of clustering algorithm K-Means.
def __len__(self)
Returns amount of steps that were observer during clustering process in K-Means algorithm.
def set_evolution_clusters(self, evolution_clusters)
Set evolution of changes of centers during clustering process.
def show_clusters(sample, clusters, centers, initial_centers=None, kwargs)
Display K-Means clustering results.
def get_clusters(self)
Returns list of allocated clusters, each cluster contains indexes of objects in list of data...
def __process_by_ccore(self)
Performs cluster analysis using CCORE (C/C++ part of pyclustering library).
def get_cluster_encoding(self)
Returns clustering result representation type that indicate how clusters are encoded.
def __update_clusters(self)
Calculate distance (in line with specified metric) to each point from the each cluster.
def __process_by_python(self)
Performs cluster analysis using python code.
def animate_cluster_allocation(data, observer, animation_velocity=500, movie_fps=1, save_movie=None)
Animates clustering process that is performed by K-Means algorithm.
def predict(self, points)
Calculates the closest cluster to each point.