3 @brief The module contains G-Means algorithm and other related services. 4 @details Implementation based on paper @cite inproceedings::cluster::gmeans::1. 6 @authors Andrei Novikov (pyclustering@yandex.ru) 8 @copyright GNU Public License 10 @cond GNU_PUBLIC_LICENSE 11 PyClustering is free software: you can redistribute it and/or modify 12 it under the terms of the GNU General Public License as published by 13 the Free Software Foundation, either version 3 of the License, or 14 (at your option) any later version. 16 PyClustering is distributed in the hope that it will be useful, 17 but WITHOUT ANY WARRANTY; without even the implied warranty of 18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 19 GNU General Public License for more details. 21 You should have received a copy of the GNU General Public License 22 along with this program. If not, see <http://www.gnu.org/licenses/>. 31 from pyclustering.core.gmeans_wrapper
import gmeans
as gmeans_wrapper
32 from pyclustering.core.wrapper
import ccore_library
41 @brief Class implements G-Means clustering algorithm. 42 @details The G-means algorithm starts with a small number of centers, and grows the number of centers. 43 Each iteration of the G-Means algorithm splits into two those centers whose data appear not to come from a 44 Gaussian distribution. G-means repeatedly makes decisions based on a statistical test for the data 45 assigned to each center. 47 Implementation based on the paper @cite inproceedings::cluster::gmeans::1. 49 @image html gmeans_example_clustering.png "G-Means clustering results on most common data-sets." 51 Example #1. In this example, G-Means starts analysis from single cluster. 53 from pyclustering.cluster import cluster_visualizer 54 from pyclustering.cluster.gmeans import gmeans 55 from pyclustering.utils import read_sample 56 from pyclustering.samples.definitions import FCPS_SAMPLES 58 # Read sample 'Lsun' from file. 59 sample = read_sample(FCPS_SAMPLES.SAMPLE_LSUN) 61 # Create instance of G-Means algorithm. By default algorithm start search from single cluster. 62 gmeans_instance = gmeans(sample).process() 64 # Extract clustering results: clusters and their centers 65 clusters = gmeans_instance.get_clusters() 66 centers = gmeans_instance.get_centers() 68 # Print total sum of metric errors 69 print("Total WCE:", gmeans_instance.get_total_wce()) 71 # Visualize clustering results 72 visualizer = cluster_visualizer() 73 visualizer.append_clusters(clusters, sample) 77 Example #2. Sometimes G-Means may found local optimum. 'repeat' value can be used to increase probability to 78 find global optimum. Argument 'repeat' defines how many times K-Means clustering with K-Means++ 79 initialization should be run to find optimal clusters. 81 # Read sample 'Tetra' from file. 82 sample = read_sample(FCPS_SAMPLES.SAMPLE_TETRA) 84 # Create instance of G-Means algorithm. By default algorithm start search from single cluster. 85 gmeans_instance = gmeans(sample, repeat=10).process() 87 # Extract clustering results: clusters and their centers 88 clusters = gmeans_instance.get_clusters() 90 # Visualize clustering results 91 visualizer = cluster_visualizer() 92 visualizer.append_clusters(clusters, sample) 98 def __init__(self, data, k_init=1, ccore=True, **kwargs):
100 @brief Initializes G-Means algorithm. 102 @param[in] data (array_like): Input data that is presented as array of points (objects), each point should be 103 represented by array_like data structure. 104 @param[in] k_init (uint): Initial amount of centers (by default started search from 1). 105 @param[in] ccore (bool): Defines whether CCORE library (C/C++ part of the library) should be used instead of 107 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'tolerance', 'repeat'). 109 <b>Keyword Args:</b><br> 110 - tolerance (double): tolerance (double): Stop condition for each K-Means iteration: if maximum value of 111 change of centers of clusters is less than tolerance than algorithm will stop processing. 112 - repeat (unit): How many times K-Means should be run to improve parameters (by default is 3). 113 With larger 'repeat' values suggesting higher probability of finding global optimum. 125 self.
__repeat = kwargs.get(
'repeat', 3)
128 self.
__ccore = ccore_library.workable()
135 @brief Performs cluster analysis in line with rules of G-Means algorithm. 137 @return (gmeans) Returns itself (G-Means instance). 149 def _process_by_ccore(self):
151 @brief Performs cluster analysis using CCORE (C/C++ part of pyclustering library). 158 def _process_by_python(self):
160 @brief Performs cluster analysis using Python. 165 current_amount_clusters = len(self.
__clusters)
168 if current_amount_clusters == len(self.
__centers):
178 @brief Calculates the closest cluster to each point. 180 @param[in] points (array_like): Points for which closest clusters are calculated. 182 @return (list) List of closest clusters for each point. Each cluster is denoted by index. Return empty 183 collection if 'process()' method was not called. 186 nppoints = numpy.array(points)
190 metric = distance_metric(type_metric.EUCLIDEAN_SQUARE, numpy_usage=
True)
192 differences = numpy.zeros((len(nppoints), len(self.
__centers)))
193 for index_point
in range(len(nppoints)):
194 differences[index_point] = metric(nppoints[index_point], self.
__centers)
196 return numpy.argmin(differences, axis=1)
201 @brief Returns list of allocated clusters, each cluster contains indexes of objects in list of data. 203 @return (array_like) Allocated clusters. 214 @brief Returns list of centers of allocated clusters. 216 @return (array_like) Allocated centers. 227 @brief Returns sum of metric errors that depends on metric that was used for clustering (by default SSE - Sum of Squared Errors). 228 @details Sum of metric errors is calculated using distance between point and its center: 229 \f[error=\sum_{i=0}^{N}distance(x_{i}-center(x_{i}))\f] 239 def _statistical_optimization(self):
241 @brief Try to split cluster into two to find optimal amount of clusters. 247 if new_centers
is None:
250 centers += new_centers
255 def _split_and_search_optimal(self, cluster):
257 @brief Split specified cluster into two by performing K-Means clustering and check correctness by 258 Anderson-Darling test. 260 @param[in] cluster (array_like) Cluster that should be analysed and optimized by splitting if it is required. 262 @return (array_like) Two new centers if two new clusters are considered as more suitable. 263 (None) If current cluster is more suitable. 265 if len(cluster) == 1:
268 points = [self.
__data[index_point]
for index_point
in cluster]
271 if len(new_centers) > 1:
273 if not accept_null_hypothesis:
279 def _is_null_hypothesis(self, data, centers):
281 @brief Returns whether H0 hypothesis is accepted using Anderson-Darling test statistic. 283 @param[in] data (array_like): N-dimensional data for statistical test. 284 @param[in] centers (array_like): Two new allocated centers. 286 @return (bool) True is null hypothesis is acceptable. 289 v = numpy.subtract(centers[0], centers[1])
292 estimation, critical, _ = scipy.stats.anderson(points, dist=
'norm')
296 return estimation < critical[-1]
300 def _project_data(data, vector):
302 @brief Transform input data by project it onto input vector using formula: 305 x_{i}^{*}=\frac{\left \langle x_{i}, v \right \rangle}{\left \| v \right \|^{2}}. 308 @param[in] data (array_like): Input data that is represented by points. 309 @param[in] vector (array_like): Input vector that is used for projection. 311 @return (array_like) Transformed 1-dimensional data. 314 square_norm = numpy.sum(numpy.multiply(vector, vector))
315 return numpy.divide(numpy.sum(numpy.multiply(data, vector), axis=1), square_norm)
318 def _search_optimal_parameters(self, data, amount):
320 @brief Performs cluster analysis for specified data several times to find optimal clustering result in line 323 @param[in] data (array_like): Input data that should be clustered. 324 @param[in] amount (unit): Amount of clusters that should be allocated. 326 @return (tuple) Optimal clustering result: (clusters, centers, wce). 329 best_wce, best_clusters, best_centers = float(
'+inf'), [], []
334 candidate_wce = solver.get_total_wce()
335 if candidate_wce < best_wce:
336 best_wce = candidate_wce
337 best_clusters = solver.get_clusters()
338 best_centers = solver.get_centers()
340 if len(initial_centers) == 1:
343 return best_clusters, best_centers, best_wce
346 def _perform_clustering(self):
348 @brief Performs cluster analysis using K-Means algorithm using current centers are initial. 350 @param[in] data (array_like): Input data for cluster analysis. 359 def _verify_arguments(self):
361 @brief Verify input parameters for the algorithm and throw exception in case of incorrectness. 365 raise ValueError(
"Input data is empty (size: '%d')." % len(self.
__data))
368 raise ValueError(
"Initial amount of centers should be greater than 0 " 369 "(current value: '%d')." % self.
__k_init)
372 raise ValueError(
"Tolerance should be greater than 0 (current value: '%f')." % self.
__tolerance)
375 raise ValueError(
"Amount of attempt to find optimal parameters should be greater than 0 " 376 "(current value: '%d')." % self.
__repeat)
The module contains K-Means algorithm and other related services.
def get_clusters(self)
Returns list of allocated clusters, each cluster contains indexes of objects in list of data...
def process(self)
Performs cluster analysis in line with rules of G-Means algorithm.
def get_centers(self)
Returns list of centers of allocated clusters.
Utils that are used by modules of pyclustering.
def _search_optimal_parameters(self, data, amount)
Performs cluster analysis for specified data several times to find optimal clustering result in line ...
K-Means++ is an algorithm for choosing the initial centers for algorithms like K-Means or X-Means...
def __init__(self, data, k_init=1, ccore=True, kwargs)
Initializes G-Means algorithm.
def _verify_arguments(self)
Verify input parameters for the algorithm and throw exception in case of incorrectness.
def _statistical_optimization(self)
Try to split cluster into two to find optimal amount of clusters.
Class implements K-Means clustering algorithm.
def _split_and_search_optimal(self, cluster)
Split specified cluster into two by performing K-Means clustering and check correctness by Anderson-D...
def _process_by_ccore(self)
Performs cluster analysis using CCORE (C/C++ part of pyclustering library).
def get_total_wce(self)
Returns sum of metric errors that depends on metric that was used for clustering (by default SSE - Su...
def _process_by_python(self)
Performs cluster analysis using Python.
Collection of center initializers for algorithm that uses initial centers, for example, for K-Means or X-Means.
def _project_data(data, vector)
Transform input data by project it onto input vector using formula:
Class implements G-Means clustering algorithm.
def _perform_clustering(self)
Performs cluster analysis using K-Means algorithm using current centers are initial.
def _is_null_hypothesis(self, data, centers)
Returns whether H0 hypothesis is accepted using Anderson-Darling test statistic.
def predict(self, points)
Calculates the closest cluster to each point.