3 @brief Cluster analysis algorithm: K-Means 4 @details Implementation based on paper @cite inproceedings::kmeans::1. 6 @authors Andrei Novikov (pyclustering@yandex.ru) 8 @copyright GNU Public License 10 @cond GNU_PUBLIC_LICENSE 11 PyClustering is free software: you can redistribute it and/or modify 12 it under the terms of the GNU General Public License as published by 13 the Free Software Foundation, either version 3 of the License, or 14 (at your option) any later version. 16 PyClustering is distributed in the hope that it will be useful, 17 but WITHOUT ANY WARRANTY; without even the implied warranty of 18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 19 GNU General Public License for more details. 21 You should have received a copy of the GNU General Public License 22 along with this program. If not, see <http://www.gnu.org/licenses/>. 32 import matplotlib.pyplot
as plt
33 import matplotlib.animation
as animation
34 except Exception
as error_instance:
35 warnings.warn(
"Impossible to import matplotlib (please, install 'matplotlib'), pyclustering's visualization " 36 "functionality is not available (details: '%s')." % str(error_instance))
38 import pyclustering.core.kmeans_wrapper
as wrapper
40 from pyclustering.core.wrapper
import ccore_library
41 from pyclustering.core.metric_wrapper
import metric_wrapper
51 @brief Observer of K-Means algorithm that is used to collect information about clustering process on each iteration of the algorithm. 59 @brief Initializer of observer of K-Means algorithm. 69 @brief Returns amount of steps that were observer during clustering process in K-Means algorithm. 77 @brief This method is called by K-Means algorithm to notify about changes. 79 @param[in] clusters (array_like): Allocated clusters by K-Means algorithm. 80 @param[in] centers (array_like): Allocated centers by K-Means algorithm. 89 @brief Set evolution of changes of centers during clustering process. 91 @param[in] evolution_centers (array_like): Evolution of changes of centers during clustering process. 99 @brief Get method to return centers at specific iteration of clustering process. 101 @param[in] iteration (uint): Clustering process iteration at which centers are required. 103 @return (array_like) Centers at specific iteration. 111 @brief Set evolution of changes of centers during clustering process. 113 @param[in] evolution_clusters (array_like): Evolution of changes of clusters during clustering process. 121 @brief Get method to return allocated clusters at specific iteration of clustering process. 123 @param[in] iteration (uint): Clustering process iteration at which clusters are required. 125 @return (array_like) Clusters at specific iteration. 134 @brief Visualizer of K-Means algorithm's results. 135 @details K-Means visualizer provides visualization services that are specific for K-Means algorithm. 139 __default_2d_marker_size = 15
140 __default_3d_marker_size = 70
144 def show_clusters(sample, clusters, centers, initial_centers = None, **kwargs):
146 @brief Display K-Means clustering results. 148 @param[in] sample (list): Dataset that was used for clustering. 149 @param[in] clusters (array_like): Clusters that were allocated by the algorithm. 150 @param[in] centers (array_like): Centers that were allocated by the algorithm. 151 @param[in] initial_centers (array_like): Initial centers that were used by the algorithm, if 'None' then initial centers are not displyed. 152 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'figure', 'display', 'offset'). 154 <b>Keyword Args:</b><br> 155 - figure (figure): If 'None' then new is figure is created, otherwise specified figure is used for visualization. 156 - display (bool): If 'True' then figure will be shown by the method, otherwise it should be shown manually using matplotlib function 'plt.show()'. 157 - offset (uint): Specify axes index on the figure where results should be drawn (only if argument 'figure' is specified). 159 @return (figure) Figure where clusters were drawn. 164 visualizer.append_clusters(clusters, sample)
166 offset = kwargs.get(
'offset', 0)
167 figure = kwargs.get(
'figure',
None)
168 display = kwargs.get(
'display',
True)
171 figure = visualizer.show(display =
False)
173 visualizer.show(figure = figure, display =
False)
175 kmeans_visualizer.__draw_centers(figure, offset, visualizer, centers, initial_centers)
176 kmeans_visualizer.__draw_rays(figure, offset, visualizer, sample, clusters, centers)
185 def __draw_rays(figure, offset, visualizer, sample, clusters, centers):
186 ax = figure.get_axes()[offset]
188 for index_cluster
in range(len(clusters)):
189 color = visualizer.get_cluster_color(index_cluster, 0)
190 kmeans_visualizer.__draw_cluster_rays(ax, color, sample, clusters[index_cluster], centers[index_cluster])
194 def __draw_cluster_rays(ax, color, sample, cluster, center):
195 dimension = len(sample[0])
197 for index_point
in cluster:
198 point = sample[index_point]
200 ax.plot([point[0], center[0]], [0.0, 0.0],
'-', color=color, linewidth=0.5)
202 ax.plot([point[0], center[0]], [point[1], center[1]],
'-', color=color, linewidth=0.5)
204 ax.plot([point[0], center[0]], [point[1], center[1]], [point[2], center[2]],
'-', color=color, linewidth=0.5)
208 def __draw_center(ax, center, color, marker, alpha):
209 dimension = len(center)
212 ax.plot(center[0], 0.0, color=color, alpha=alpha, marker=marker, markersize=kmeans_visualizer.__default_2d_marker_size)
214 ax.plot(center[0], center[1], color=color, alpha=alpha, marker=marker, markersize=kmeans_visualizer.__default_2d_marker_size)
216 ax.scatter(center[0], center[1], center[2], c=color, alpha=alpha, marker=marker, s=kmeans_visualizer.__default_3d_marker_size)
220 def __draw_centers(figure, offset, visualizer, centers, initial_centers):
221 ax = figure.get_axes()[offset]
223 for index_center
in range(len(centers)):
224 color = visualizer.get_cluster_color(index_center, 0)
225 kmeans_visualizer.__draw_center(ax, centers[index_center], color,
'*', 1.0)
227 if initial_centers
is not None:
228 kmeans_visualizer.__draw_center(ax, initial_centers[index_center], color,
'*', 0.4)
234 @brief Animates clustering process that is performed by K-Means algorithm. 236 @param[in] data (list): Dataset that is used for clustering. 237 @param[in] observer (kmeans_observer): EM observer that was used for collection information about clustering process. 238 @param[in] animation_velocity (uint): Interval between frames in milliseconds (for run-time animation only). 239 @param[in] movie_fps (uint): Defines frames per second (for rendering movie only). 240 @param[in] save_movie (string): If it is specified then animation will be stored to file that is specified in this parameter. 243 figure = plt.figure()
246 return frame_generation(0)
248 def frame_generation(index_iteration):
251 figure.suptitle(
"K-Means algorithm (iteration: " + str(index_iteration) +
")", fontsize=18, fontweight=
'bold')
253 clusters = observer.get_clusters(index_iteration)
254 centers = observer.get_centers(index_iteration)
255 kmeans_visualizer.show_clusters(data, clusters, centers,
None, figure=figure, display=
False)
257 figure.subplots_adjust(top=0.85)
259 return [figure.gca()]
261 iterations = len(observer)
262 cluster_animation = animation.FuncAnimation(figure, frame_generation, iterations, interval=animation_velocity,
263 init_func=init_frame, repeat_delay=5000)
265 if save_movie
is not None:
266 cluster_animation.save(save_movie, writer=
'ffmpeg', fps=movie_fps, bitrate=3000)
274 @brief Class represents K-Means clustering algorithm. 275 @details CCORE option can be used to use the pyclustering core - C/C++ shared library for processing that significantly increases performance. 277 CCORE implementation of the algorithm uses thread pool to parallelize the clustering process. 279 K-Means clustering results depend on initial centers. Algorithm K-Means++ can used for initialization 280 initial centers from module 'pyclustering.cluster.center_initializer'. 282 @image html kmeans_example_clustering.png "K-Means clustering results. At the left - 'Simple03.data' sample, at the right - 'Lsun.data' sample." 284 Example #1 - Trivial clustering: 286 # load list of points for cluster analysis 287 sample = read_sample(path) 289 # create instance of K-Means algorithm 290 kmeans_instance = kmeans(sample, [ [0.0, 0.1], [2.5, 2.6] ]) 292 # run cluster analysis and obtain results 293 kmeans_instance.process() 294 clusters = kmeans_instance.get_clusters() 297 Example #2 - Clustering using K-Means++ for center initialization: 299 # load list of points for cluster analysis 300 sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE2) 302 # initialize initial centers using K-Means++ method 303 initial_centers = kmeans_plusplus_initializer(sample, 3).initialize() 305 # create instance of K-Means algorithm with prepared centers 306 kmeans_instance = kmeans(sample, initial_centers) 308 # run cluster analysis and obtain results 309 kmeans_instance.process() 310 clusters = kmeans_instance.get_clusters() 311 final_centers = kmeans_instance.get_centers() 314 @see center_initializer 318 def __init__(self, data, initial_centers, tolerance=0.001, ccore=True, **kwargs):
320 @brief Constructor of clustering algorithm K-Means. 321 @details Center initializer can be used for creating initial centers, for example, K-Means++ method. 323 @param[in] data (array_like): Input data that is presented as array of points (objects), each point should be represented by array_like data structure. 324 @param[in] initial_centers (array_like): Initial coordinates of centers of clusters that are represented by array_like data structure: [center1, center2, ...]. 325 @param[in] tolerance (double): Stop condition: if maximum value of change of centers of clusters is less than tolerance then algorithm stops processing. 326 @param[in] ccore (bool): Defines should be CCORE library (C++ pyclustering library) used instead of Python code or not. 327 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'observer', 'metric', 'itermax'). 329 <b>Keyword Args:</b><br> 330 - observer (kmeans_observer): Observer of the algorithm to collect information about clustering process on each iteration. 331 - metric (distance_metric): Metric that is used for distance calculation between two points (by default euclidean square distance). 332 - itermax (uint): Maximum number of iterations that is used for clustering process (by default: 200). 334 @see center_initializer 339 self.
__centers = numpy.array(initial_centers)
343 self.
__observer = kwargs.get(
'observer',
None)
345 self.
__maxiter = kwargs.get(
'maxiter', 200)
347 if self.
__metric.get_type() != type_metric.USER_DEFINED:
352 self.
__ccore = ccore
and self.
__metric.get_type() != type_metric.USER_DEFINED
354 self.
__ccore = ccore_library.workable()
359 @brief Performs cluster analysis in line with rules of K-Means algorithm. 361 @return (kmeans) Returns itself (K-Means instance). 363 @remark Results of clustering can be obtained using corresponding get methods. 371 raise ValueError(
"Dimension of the input data and dimension of the initial cluster centers must be equal.")
381 def __process_by_ccore(self):
383 @brief Performs cluster analysis using CCORE (C/C++ part of pyclustering library). 386 ccore_metric = metric_wrapper.create_instance(self.
__metric)
393 self.
__observer.set_evolution_clusters(results[2])
394 self.
__observer.set_evolution_centers(results[3])
399 def __process_by_python(self):
401 @brief Performs cluster analysis using python code. 405 maximum_change = float(
'inf')
413 while maximum_change > stop_condition
and iteration < self.
__maxiter:
430 @brief Returns list of allocated clusters, each cluster contains indexes of objects in list of data. 442 @brief Returns list of centers of allocated clusters. 457 @brief Returns sum of metric errors that depends on metric that was used for clustering (by default SSE - Sum of Squared Errors). 458 @details Sum of metric errors is calculated using distance between point and its center: 459 \f[error=\sum_{i=0}^{N}distance(x_{i}-center(x_{i}))\f] 471 @brief Returns clustering result representation type that indicate how clusters are encoded. 473 @return (type_encoding) Clustering result representation. 479 return type_encoding.CLUSTER_INDEX_LIST_SEPARATION
482 def __update_clusters(self):
484 @brief Calculate Euclidean distance to each point from the each cluster. Nearest points are captured by according clusters and as a result clusters are updated. 486 @return (list) Updated clusters as list of clusters. Each cluster contains indexes of objects from data. 490 clusters = [[]
for _
in range(len(self.
__centers))]
494 optimum_indexes = numpy.argmin(dataset_differences, axis=0)
495 for index_point
in range(len(optimum_indexes)):
496 index_cluster = optimum_indexes[index_point]
497 clusters[index_cluster].append(index_point)
499 clusters = [cluster
for cluster
in clusters
if len(cluster) > 0]
504 def __update_centers(self):
506 @brief Calculate centers of clusters in line with contained objects. 508 @return (numpy.matrix) Updated centers as list of centers. 513 centers = numpy.zeros((len(self.
__clusters), dimension))
517 centers[index] = cluster_points.mean(axis=0)
519 return numpy.array(centers)
522 def __calculate_total_wce(self):
524 @brief Calculate total within cluster errors that is depend on metric that was chosen for K-Means algorithm. 531 for index_cluster
in range(len(self.
__clusters)):
532 for index_point
in self.
__clusters[index_cluster]:
533 self.
__total_wce += dataset_differences[index_cluster][index_point]
536 def __calculate_dataset_difference(self, amount_clusters):
538 @brief Calculate distance from each point to each cluster center. 541 dataset_differences = numpy.zeros((amount_clusters, len(self.
__pointer_data)))
542 for index_center
in range(amount_clusters):
543 if self.
__metric.get_type() != type_metric.USER_DEFINED:
546 dataset_differences[index_center] = [ self.
__metric(point, self.
__centers[index_center])
549 return dataset_differences
552 def __calculate_changes(self, updated_centers):
554 @brief Calculates changes estimation between previous and current iteration using centers for that purpose. 556 @param[in] updated_centers (array_like): New cluster centers. 558 @return (float) Maximum changes between centers. 561 if len(self.
__centers) != len(updated_centers):
562 maximum_change = float(
'inf')
566 maximum_change = numpy.max(changes)
568 return maximum_change
Common visualizer of clusters on 1D, 2D or 3D surface.
pyclustering module for cluster analysis.
def notify(self, clusters, centers)
This method is called by K-Means algorithm to notify about changes.
def get_centers(self)
Returns list of centers of allocated clusters.
def __calculate_dataset_difference(self, amount_clusters)
Calculate distance from each point to each cluster center.
def process(self)
Performs cluster analysis in line with rules of K-Means algorithm.
Module provides various distance metrics - abstraction of the notion of distance in a metric space...
def get_clusters(self, iteration)
Get method to return allocated clusters at specific iteration of clustering process.
def __calculate_changes(self, updated_centers)
Calculates changes estimation between previous and current iteration using centers for that purpose...
Module for representing clustering results.
Distance metric performs distance calculation between two points in line with encapsulated function...
Observer of K-Means algorithm that is used to collect information about clustering process on each it...
def __update_centers(self)
Calculate centers of clusters in line with contained objects.
def get_total_wce(self)
Returns sum of metric errors that depends on metric that was used for clustering (by default SSE - Su...
Class represents K-Means clustering algorithm.
def __init__(self)
Initializer of observer of K-Means algorithm.
Visualizer of K-Means algorithm's results.
def get_centers(self, iteration)
Get method to return centers at specific iteration of clustering process.
def __calculate_total_wce(self)
Calculate total within cluster errors that is depend on metric that was chosen for K-Means algorithm...
def set_evolution_centers(self, evolution_centers)
Set evolution of changes of centers during clustering process.
def __init__(self, data, initial_centers, tolerance=0.001, ccore=True, kwargs)
Constructor of clustering algorithm K-Means.
def __len__(self)
Returns amount of steps that were observer during clustering process in K-Means algorithm.
def set_evolution_clusters(self, evolution_clusters)
Set evolution of changes of centers during clustering process.
def show_clusters(sample, clusters, centers, initial_centers=None, kwargs)
Display K-Means clustering results.
def get_clusters(self)
Returns list of allocated clusters, each cluster contains indexes of objects in list of data...
def __process_by_ccore(self)
Performs cluster analysis using CCORE (C/C++ part of pyclustering library).
def get_cluster_encoding(self)
Returns clustering result representation type that indicate how clusters are encoded.
def __update_clusters(self)
Calculate Euclidean distance to each point from the each cluster.
def __process_by_python(self)
Performs cluster analysis using python code.
def animate_cluster_allocation(data, observer, animation_velocity=500, movie_fps=1, save_movie=None)
Animates clustering process that is performed by K-Means algorithm.