3 @brief Elbow method to determine the optimal number of clusters for k-means clustering. 4 @details Implementation based on paper @cite article::cluster::elbow::1. 6 @authors Andrei Novikov (pyclustering@yandex.ru) 8 @copyright GNU Public License 10 @cond GNU_PUBLIC_LICENSE 11 PyClustering is free software: you can redistribute it and/or modify 12 it under the terms of the GNU General Public License as published by 13 the Free Software Foundation, either version 3 of the License, or 14 (at your option) any later version. 16 PyClustering is distributed in the hope that it will be useful, 17 but WITHOUT ANY WARRANTY; without even the implied warranty of 18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 19 GNU General Public License for more details. 21 You should have received a copy of the GNU General Public License 22 along with this program. If not, see <http://www.gnu.org/licenses/>. 32 from pyclustering.core.wrapper
import ccore_library
34 import pyclustering.core.elbow_wrapper
as wrapper
39 @brief Class represents Elbow method that is used to find out appropriate amount of clusters in a dataset. 40 @details Elbow method performs clustering using K-Means algorithm for each K and estimate clustering results using 41 sum of square erros. By default K-Means++ algorithm is used to calculate initial centers that are used by 44 The Elbow is determined by max distance from each point (x, y) to segment from kmin-point (x0, y0) to kmax-point (x1, y1), 45 where 'x' is K (amount of clusters), and 'y' is within-cluster error. Following expression is used to calculate Elbow 47 \f[Elbow_{k} = \frac{\left ( y_{0} - y_{1} \right )x_{k} + \left ( x_{1} - x_{0} \right )y_{k} + \left ( x_{0}y_{1} - x_{1}y_{0} \right )}{\sqrt{\left ( x_{1} - x_{0} \right )^{2} + \left ( y_{1} - y_{0} \right )^{2}}}\f] 49 Usage example of Elbow method for cluster analysis: 51 from pyclustering.cluster.kmeans import kmeans, kmeans_visualizer 52 from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer 53 from pyclustering.cluster.elbow import elbow 54 from pyclustering.utils import read_sample 55 from pyclustering.samples.definitions import SIMPLE_SAMPLES 57 # read sample 'Simple3' from file (sample contains four clusters) 58 sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3) 60 # create instance of Elbow method using K value from 1 to 10. 62 elbow_instance = elbow(sample, kmin, kmax) 64 # process input data and obtain results of analysis 65 elbow_instance.process() 66 amount_clusters = elbow_instance.get_amount() # most probable amount of clusters 67 wce = elbow_instance.get_wce() # total within-cluster errors for each K 69 # perform cluster analysis using K-Means algorithm 70 centers = kmeans_plusplus_initializer(sample, amount_clusters, 71 amount_candidates=kmeans_plusplus_initializer.FARTHEST_CENTER_CANDIDATE).initialize() 72 kmeans_instance = kmeans(sample, centers) 73 kmeans_instance.process() 75 # obtain clustering results and visualize them 76 clusters = kmeans_instance.get_clusters() 77 centers = kmeans_instance.get_centers() 78 kmeans_visualizer.show_clusters(sample, clusters, centers) 81 By default Elbow uses K-Means++ initializer to calculate initial centers for K-Means algorithm, it can be changed 82 using argument 'initializer': 84 # perform analysis using Elbow method with random center initializer for K-Means algorithm inside of the method. 86 elbow_instance = elbow(sample, kmin, kmax, initializer=random_center_initializer) 87 elbow_instance.process() 90 @image html elbow_example_simple_03.png "Elbows analysis with further K-Means clustering." 94 def __init__(self, data, kmin, kmax, **kwargs):
96 @brief Construct Elbow method. 98 @param[in] data (array_like): Input data that is presented as array of points (objects), each point should be represented by array_like data structure. 99 @param[in] kmin (int): Minimum amount of clusters that should be considered. 100 @param[in] kmax (int): Maximum amount of clusters that should be considered. 101 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'ccore', 'initializer'). 103 <b>Keyword Args:</b><br> 104 - ccore (bool): If True then CCORE (C++ implementation of pyclustering library) is used (by default True). 105 - initializer (callable): Center initializer that is used by K-Means algorithm (by default K-Means++). 109 self.
__initializer = kwargs.get(
'initializer', kmeans_plusplus_initializer)
111 self.
__ccore = kwargs.get(
'ccore',
True)
or \
112 isinstance(self.
__initializer, kmeans_plusplus_initializer)
or \
116 self.
__ccore = ccore_library.workable()
131 @brief Performs analysis to find out appropriate amount of clusters. 133 @return (elbow) Returns itself (Elbow instance). 146 def __process_by_ccore(self):
148 @brief Performs processing using C++ implementation. 151 if isinstance(self.
__initializer, kmeans_plusplus_initializer):
152 initializer = wrapper.elbow_center_initializer.KMEANS_PLUS_PLUS
154 initializer = wrapper.elbow_center_initializer.RANDOM
159 self.
__wce = result[1]
162 def __process_by_python(self):
164 @brief Performs processing using python implementation. 172 self.
__wce.append(instance.get_total_wce())
180 @brief Returns appropriate amount of clusters. 188 @brief Returns list of total within cluster errors for each K-value (kmin, kmin + 1, ..., kmax - 1). 195 def __calculate_elbows(self):
197 @brief Calculates potential elbows. 198 @details Elbow is calculated as a distance from each point (x, y) to segment from kmin-point (x0, y0) to kmax-point (x1, y1). 202 x0, y0 = 0.0, self.
__wce[0]
203 x1, y1 = float(len(self.
__wce)), self.
__wce[-1]
205 for index_elbow
in range(1, len(self.
__wce) - 1):
206 x, y = float(index_elbow), self.
__wce[index_elbow]
208 segment = abs((y0 - y1) * x + (x1 - x0) * y + (x0 * y1 - x1 * y0))
209 norm = math.sqrt((x1 - x0) ** 2 + (y1 - y0) ** 2)
210 distance = segment / norm
215 def __find_optimal_kvalue(self):
217 @brief Finds elbow and returns corresponding K-value. 220 optimal_elbow_value = max(self.
__elbows)
224 def __verify_arguments(self):
226 @brief Verify input parameters for the algorithm and throw exception in case of incorrectness. 230 raise ValueError(
"Input data is empty (size: '%d')." % len(self.
__data))
233 raise ValueError(
"K min value (current value '%d') should be greater or equal to 1." % self.
__kmin)
236 raise ValueError(
"Amount of K (" + str(self.
__kmax - self.
__kmin) +
") is too small for analysis. " 237 "It is require to have at least three K to build elbow.")
240 raise ValueError(
"K max value '%d' is greater than amount of points in data '%d'." %
def __calculate_elbows(self)
Calculates potential elbows.
The module contains K-Means algorithm and other related services.
def __process_by_python(self)
Performs processing using python implementation.
def __find_optimal_kvalue(self)
Finds elbow and returns corresponding K-value.
Class implements K-Means clustering algorithm.
def __init__(self, data, kmin, kmax, kwargs)
Construct Elbow method.
def __process_by_ccore(self)
Performs processing using C++ implementation.
def process(self)
Performs analysis to find out appropriate amount of clusters.
def __verify_arguments(self)
Verify input parameters for the algorithm and throw exception in case of incorrectness.
Collection of center initializers for algorithm that uses initial centers, for example, for K-Means or X-Means.
Class represents Elbow method that is used to find out appropriate amount of clusters in a dataset...
def get_amount(self)
Returns appropriate amount of clusters.
def get_wce(self)
Returns list of total within cluster errors for each K-value (kmin, kmin + 1, ..., kmax - 1).