3 @brief Collection of center initializers for algorithm that uses initial centers, for example, for K-Means or X-Means.
4 @details Implementation based on paper @cite article::kmeans++::1.
6 @authors Andrei Novikov, Aleksey Kukushkin (pyclustering@yandex.ru)
8 @copyright BSD-3-Clause
10 @see pyclustering.cluster.kmeans
11 @see puclustering.cluster.xmeans
23 @brief Random center initializer is for generation specified amount of random of centers for specified data.
27 def __init__(self, data, amount_centers, **kwargs):
29 @brief Creates instance of random center initializer.
31 @param[in] data (list): List of points where each point is represented by list of coordinates.
32 @param[in] amount_centers (unit): Amount of centers that should be initialized.
33 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'random_state').
35 <b>Keyword Args:</b><br>
36 - random_state (int): Seed for random state (by default is `None`, current system time is used).
44 random.seed(kwargs.get(
'random_state',
None))
47 raise ValueError(
"Amount of cluster centers should be at least 1.")
50 raise ValueError(
"Amount of cluster centers '%d' should be less than data size." % self.
__amount)
55 @brief Generates random centers in line with input parameters.
57 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'return_index').
59 <b>Keyword Args:</b><br>
60 - return_index (bool): If True then returns indexes of points from input data instead of points itself.
62 @return (list) List of initialized initial centers.
63 If argument 'return_index' is False then returns list of points.
64 If argument 'return_index' is True then returns list of indexes.
67 return_index = kwargs.get(
'return_index',
False)
70 return list(range(len(self.
__data)))
76 def __create_center(self, return_index):
78 @brief Generates and returns random center.
80 @param[in] return_index (bool): If True then returns index of point from input data instead of point itself.
83 random_index_point = random.randint(0, len(self.
__data))
90 return random_index_point
91 return self.
__data[random_index_point]
97 @brief K-Means++ is an algorithm for choosing the initial centers for algorithms like K-Means or X-Means.
98 @details K-Means++ algorithm guarantees an approximation ratio O(log k). Clustering results are depends on
99 initial centers in case of K-Means algorithm and even in case of X-Means. This method is used to find
100 out optimal initial centers.
102 Algorithm can be divided into three steps. The first center is chosen from input data randomly with
103 uniform distribution at the first step. At the second, probability to being center is calculated for each point:
104 \f[p_{i}=\frac{D(x_{i})}{\sum_{j=0}^{N}D(x_{j})}\f]
105 where \f$D(x_{i})\f$ is a distance from point \f$i\f$ to the closest center. Using this probabilities next center
106 is chosen. The last step is repeated until required amount of centers is initialized.
108 Pyclustering implementation of the algorithm provides feature to consider several candidates on the second
113 amount_candidates = 3;
114 initializer = kmeans_plusplus_initializer(sample, amount_centers, amount_candidates);
117 If the farthest points should be used as centers then special constant 'FARTHEST_CENTER_CANDIDATE' should be used
118 for that purpose, for example:
121 amount_candidates = kmeans_plusplus_initializer.FARTHEST_CENTER_CANDIDATE;
122 initializer = kmeans_plusplus_initializer(sample, amount_centers, amount_candidates);
125 There is an example of initial centers that were calculated by the K-Means++ method:
127 @image html kmeans_plusplus_initializer_results.png
129 Code example where initial centers are prepared for K-Means algorithm:
131 from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
132 from pyclustering.cluster.kmeans import kmeans
133 from pyclustering.cluster import cluster_visualizer
134 from pyclustering.utils import read_sample
135 from pyclustering.samples.definitions import SIMPLE_SAMPLES
137 # Read data 'SampleSimple3' from Simple Sample collection.
138 sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3)
140 # Calculate initial centers using K-Means++ method.
141 centers = kmeans_plusplus_initializer(sample, 4, kmeans_plusplus_initializer.FARTHEST_CENTER_CANDIDATE).initialize()
143 # Display initial centers.
144 visualizer = cluster_visualizer()
145 visualizer.append_cluster(sample)
146 visualizer.append_cluster(centers, marker='*', markersize=10)
149 # Perform cluster analysis using K-Means algorithm with initial centers.
150 kmeans_instance = kmeans(sample, centers)
152 # Run clustering process and obtain result.
153 kmeans_instance.process()
154 clusters = kmeans_instance.get_clusters()
161 FARTHEST_CENTER_CANDIDATE =
"farthest"
164 def __init__(self, data, amount_centers, amount_candidates=None, **kwargs):
166 @brief Creates K-Means++ center initializer instance.
168 @param[in] data (array_like): List of points where each point is represented by list of coordinates.
169 @param[in] amount_centers (uint): Amount of centers that should be initialized.
170 @param[in] amount_candidates (uint): Amount of candidates that is considered as a center, if the farthest points
171 (with the highest probability) should be considered as centers then special constant should be used
172 'FARTHEST_CENTER_CANDIDATE'. By default the amount of candidates is 3.
173 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'random_state').
175 <b>Keyword Args:</b><br>
176 - random_state (int): Seed for random state (by default is `None`, current system time is used).
178 @see FARTHEST_CENTER_CANDIDATE
182 self.
__data = numpy.array(data)
186 if amount_candidates
is None:
195 random.seed(kwargs.get(
'random_state',
None))
198 def __check_parameters(self):
200 @brief Checks input parameters of the algorithm and if something wrong then corresponding exception is thrown.
204 raise ValueError(
"Amount of cluster centers '" + str(self.
__amount) +
"' should be at least 1 and "
205 "should be less or equal to amount of points in data.")
207 if self.
__candidates != kmeans_plusplus_initializer.FARTHEST_CENTER_CANDIDATE:
209 raise ValueError(
"Amount of center candidates '" + str(self.
__candidates) +
"' should be at least 1 "
210 "and should be less or equal to amount of points in data.")
213 raise ValueError(
"Data is empty.")
216 def __calculate_shortest_distances(self, data, centers):
218 @brief Calculates distance from each data point to nearest center.
220 @param[in] data (numpy.array): Array of points for that initialization is performed.
221 @param[in] centers (numpy.array): Array of indexes that represents centers.
223 @return (numpy.array) List of distances to closest center for each data point.
227 dataset_differences = numpy.zeros((len(centers), len(data)))
228 for index_center
in range(len(centers)):
229 center = data[centers[index_center]]
231 dataset_differences[index_center] = numpy.sum(numpy.square(data - center), axis=1).T
233 with warnings.catch_warnings():
234 numpy.warnings.filterwarnings(
'ignore',
r'All-NaN (slice|axis) encountered')
235 shortest_distances = numpy.nanmin(dataset_differences, axis=0)
237 return shortest_distances
240 def __get_next_center(self, centers):
242 @brief Calculates the next center for the data.
244 @param[in] centers (array_like): Current initialized centers represented by indexes.
246 @return (array_like) Next initialized center.<br>
247 (uint) Index of next initialized center if return_index is True.
253 if self.
__candidates == kmeans_plusplus_initializer.FARTHEST_CENTER_CANDIDATE:
254 for index_point
in centers:
255 distances[index_point] = numpy.nan
256 center_index = numpy.nanargmax(distances)
264 def __get_initial_center(self, return_index):
266 @brief Choose randomly first center.
268 @param[in] return_index (bool): If True then return center's index instead of point.
270 @return (array_like) First center.<br>
271 (uint) Index of first center.
275 index_center = random.randint(0, len(self.
__data) - 1)
279 return self.
__data[index_center]
282 def __calculate_probabilities(self, distances):
284 @brief Calculates cumulative probabilities of being center of each point.
286 @param[in] distances (array_like): Distances from each point to closest center.
288 @return (array_like) Cumulative probabilities of being center of each point.
292 total_distance = numpy.sum(distances)
293 if total_distance != 0.0:
294 probabilities = distances / total_distance
295 return numpy.cumsum(probabilities)
297 return numpy.zeros(len(distances))
300 def __get_probable_center(self, distances, probabilities):
302 @brief Calculates the next probable center considering amount candidates.
304 @param[in] distances (array_like): Distances from each point to closest center.
305 @param[in] probabilities (array_like): Cumulative probabilities of being center of each point.
307 @return (uint) Index point that is next initialized center.
311 index_best_candidate = 0
313 candidate_probability = random.random()
316 for index_object
in range(len(probabilities)):
317 if candidate_probability < probabilities[index_object]:
318 index_candidate = index_object
321 if index_candidate == -1:
323 elif distances[index_best_candidate] < distances[index_candidate]:
324 index_best_candidate = index_candidate
326 return index_best_candidate
331 @brief Calculates initial centers using K-Means++ method.
333 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'return_index').
335 <b>Keyword Args:</b><br>
336 - return_index (bool): If True then returns indexes of points from input data instead of points itself.
338 @return (list) List of initialized initial centers.
339 If argument 'return_index' is False then returns list of points.
340 If argument 'return_index' is True then returns list of indexes.
344 return_index = kwargs.get(
'return_index',
False)
347 centers = [index_point]
353 centers.append(index_point)
357 centers = [self.
__data[index]
for index
in centers]