3 @brief Cluster analysis algorithm: BIRCH 4 @details Implementation based on paper @cite article::birch::1. 6 @authors Andrei Novikov (pyclustering@yandex.ru) 8 @copyright GNU Public License 10 @cond GNU_PUBLIC_LICENSE 11 PyClustering is free software: you can redistribute it and/or modify 12 it under the terms of the GNU General Public License as published by 13 the Free Software Foundation, either version 3 of the License, or 14 (at your option) any later version. 16 PyClustering is distributed in the hope that it will be useful, 17 but WITHOUT ANY WARRANTY; without even the implied warranty of 18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 19 GNU General Public License for more details. 21 You should have received a copy of the GNU General Public License 22 along with this program. If not, see <http://www.gnu.org/licenses/>. 37 @brief Class represents clustering algorithm BIRCH. 39 Example how to extract clusters from 'OldFaithful' sample using BIRCH algorithm: 41 from pyclustering.cluster.birch import birch, measurement_type 42 from pyclustering.cluster import cluster_visualizer 43 from pyclustering.utils import read_sample 44 from pyclustering.samples.definitions import FAMOUS_SAMPLES 46 # Sample for cluster analysis (represented by list) 47 sample = read_sample(FAMOUS_SAMPLES.SAMPLE_OLD_FAITHFUL) 49 # Create BIRCH algorithm 50 birch_instance = birch(sample, 2) 53 birch_instance.process() 55 # Obtain results of clustering 56 clusters = birch_instance.get_clusters() 58 # Visualize allocated clusters 59 visualizer = cluster_visualizer() 60 visualizer.append_clusters(clusters, sample) 66 def __init__(self, data, number_clusters, branching_factor=5, max_node_entries=5, initial_diameter=0.1,
67 type_measurement=measurement_type.CENTROID_EUCLIDEAN_DISTANCE,
69 diameter_multiplier=1.5,
72 @brief Constructor of clustering algorithm BIRCH. 74 @param[in] data (list): Input data presented as list of points (objects), where each point should be represented by list or tuple. 75 @param[in] number_clusters (uint): Number of clusters that should be allocated. 76 @param[in] branching_factor (uint): Maximum number of successor that might be contained by each non-leaf node in CF-Tree. 77 @param[in] max_node_entries (uint): Maximum number of entries that might be contained by each leaf node in CF-Tree. 78 @param[in] initial_diameter (double): Initial diameter that used for CF-Tree construction, it can be increase if entry_size_limit is exceeded. 79 @param[in] type_measurement (measurement_type): Type measurement used for calculation distance metrics. 80 @param[in] entry_size_limit (uint): Maximum number of entries that can be stored in CF-Tree, if it is exceeded during creation then diameter is increased and CF-Tree is rebuilt. 81 @param[in] diameter_multiplier (double): Multiplier that is used for increasing diameter when entry_size_limit is exceeded. 82 @param[in] ccore (bool): If True than CCORE (C++ part of the library) will be used for solving the problem. 84 @remark Despite eight arguments only the first two is mandatory, others can be ommitted. In this case default values are used for instance creation. 99 self.
__tree =
cftree(branching_factor, max_node_entries, initial_diameter, type_measurement)
107 @brief Performs cluster analysis in line with rules of BIRCH algorithm. 109 @return (birch) Returns itself (BIRCH instance). 119 current_number_clusters = len(self.
__features)
127 current_number_clusters = len(self.
__features)
136 @brief Returns list of allocated clusters, each cluster contains indexes of objects in list of data. 138 @remark Allocated noise can be returned only after data processing (use method process() before). Otherwise empty list is returned. 140 @return (list) List of allocated clusters. 152 @brief Returns clustering result representation type that indicate how clusters are encoded. 154 @return (type_encoding) Clustering result representation. 160 return type_encoding.CLUSTER_INDEX_LIST_SEPARATION
163 def __verify_arguments(self):
165 @brief Verify input parameters for the algorithm and throw exception in case of incorrectness. 169 raise ValueError(
"Input data is empty (size: '%d')." % len(self.
__pointer_data))
172 raise ValueError(
"Amount of cluster (current value: '%d') for allocation should be greater than 0." %
176 raise ValueError(
"Limit entry size (current value: '%d') should be greater than 0." %
180 def __extract_features(self):
182 @brief Extracts features from CF-tree cluster. 188 if len(self.
__tree.leafes) == 1:
190 for entry
in self.
__tree.leafes[0].entries:
195 for node
in self.
__tree.leafes:
199 def __decode_data(self):
201 @brief Decodes data from CF-tree features. 211 self.
__clusters[cluster_index].append(index_point)
214 def __insert_data(self):
216 @brief Inserts input data to the tree. 218 @remark If number of maximum number of entries is exceeded than diameter is increased and tree is rebuilt. 224 self.
__tree.insert_cluster([point])
232 def __rebuild_tree(self, index_point):
234 @brief Rebuilt tree in case of maxumum number of entries is exceeded. 236 @param[in] index_point (uint): Index of point that is used as end point of re-building. 238 @return (cftree) Rebuilt tree with encoded points till specified point from input data space. 242 rebuild_result =
False 247 while rebuild_result
is False:
249 if increased_diameter == 0.0:
250 increased_diameter = 1.0
255 for index_point
in range(0, index_point + 1):
257 tree.insert_cluster([point])
264 rebuild_result =
True 269 def __find_nearest_cluster_features(self):
271 @brief Find pair of nearest CF entries. 273 @return (list) List of two nearest enties that are represented by list [index_point1, index_point2]. 277 minimum_distance = float(
"Inf")
281 for index_candidate1
in range(0, len(self.
__features)):
283 for index_candidate2
in range(index_candidate1 + 1, len(self.
__features)):
287 if distance < minimum_distance:
288 minimum_distance = distance
290 index1 = index_candidate1
291 index2 = index_candidate2
293 return [index1, index2]
296 def __get_nearest_feature(self, point, feature_collection):
298 @brief Find nearest entry for specified point. 300 @param[in] point (list): Pointer to point from input dataset. 301 @param[in] feature_collection (list): Feature collection that is used for obtaining nearest feature for the specified point. 303 @return (double, uint) Tuple of distance to nearest entry to the specified point and index of that entry. 307 minimum_distance = float(
"Inf")
308 index_nearest_feature = -1
310 for index_entry
in range(0, len(feature_collection)):
311 point_entry =
cfentry(1, linear_sum([ point ]), square_sum([point]))
313 distance = feature_collection[index_entry].get_distance(point_entry, self.
__measurement_type)
314 if distance < minimum_distance:
315 minimum_distance = distance
316 index_nearest_feature = index_entry
318 return minimum_distance, index_nearest_feature
def __extract_features(self)
Extracts features from CF-tree cluster.
def __insert_data(self)
Inserts input data to the tree.
Utils that are used by modules of pyclustering.
Module for representing clustering results.
def __rebuild_tree(self, index_point)
Rebuilt tree in case of maxumum number of entries is exceeded.
def __verify_arguments(self)
Verify input parameters for the algorithm and throw exception in case of incorrectness.
Class represents clustering algorithm BIRCH.
def process(self)
Performs cluster analysis in line with rules of BIRCH algorithm.
def __init__(self, data, number_clusters, branching_factor=5, max_node_entries=5, initial_diameter=0.1, type_measurement=measurement_type.CENTROID_EUCLIDEAN_DISTANCE, entry_size_limit=200, diameter_multiplier=1.5, ccore=True)
Constructor of clustering algorithm BIRCH.
def __decode_data(self)
Decodes data from CF-tree features.
Clustering feature representation.
def get_cluster_encoding(self)
Returns clustering result representation type that indicate how clusters are encoded.
def __get_nearest_feature(self, point, feature_collection)
Find nearest entry for specified point.
def __find_nearest_cluster_features(self)
Find pair of nearest CF entries.
def get_clusters(self)
Returns list of allocated clusters, each cluster contains indexes of objects in list of data...