3 @brief BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) cluster analysis algorithm. 4 @details Implementation based on paper @cite article::birch::1. 6 @authors Andrei Novikov (pyclustering@yandex.ru) 8 @copyright GNU Public License 10 @cond GNU_PUBLIC_LICENSE 11 PyClustering is free software: you can redistribute it and/or modify 12 it under the terms of the GNU General Public License as published by 13 the Free Software Foundation, either version 3 of the License, or 14 (at your option) any later version. 16 PyClustering is distributed in the hope that it will be useful, 17 but WITHOUT ANY WARRANTY; without even the implied warranty of 18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 19 GNU General Public License for more details. 21 You should have received a copy of the GNU General Public License 22 along with this program. If not, see <http://www.gnu.org/licenses/>. 37 @brief Class represents the clustering algorithm BIRCH (Balanced Iterative Reducing and Clustering using 40 @details BIRCH is suitable for large databases. The algorithm incrementally and dynamically clusters 41 incoming multi-dimensional metric data points using the concepts of Clustering Feature and CF tree. 42 A Clustering Feature is a triple summarizing the information that is maintained about a cluster. 43 The Clustering Feature vector is defined as a triple: 44 \f[CF=\left ( N, \overrightarrow{LS}, SS \right )\f] 46 Example how to extract clusters from 'OldFaithful' sample using BIRCH algorithm: 48 from pyclustering.cluster.birch import birch 49 from pyclustering.cluster import cluster_visualizer 50 from pyclustering.utils import read_sample 51 from pyclustering.samples.definitions import FAMOUS_SAMPLES 53 # Sample for cluster analysis (represented by list) 54 sample = read_sample(FAMOUS_SAMPLES.SAMPLE_OLD_FAITHFUL) 56 # Create BIRCH algorithm 57 birch_instance = birch(sample, 2, diameter=3.0) 60 birch_instance.process() 62 # Obtain results of clustering 63 clusters = birch_instance.get_clusters() 65 # Visualize allocated clusters 66 visualizer = cluster_visualizer() 67 visualizer.append_clusters(clusters, sample) 71 Here is the clustering result produced by BIRCH algorithm: 72 @image html birch_clustering_old_faithful.png "Fig. 1. BIRCH clustering - sample 'OldFaithful'." 74 Methods 'get_cf_entries' and 'get_cf_clusters' can be used to obtain information how does an input data is 75 encoded. Here is an example how the encoding information can be extracted and visualized: 77 from pyclustering.cluster.birch import birch 78 from pyclustering.cluster import cluster_visualizer 79 from pyclustering.utils import read_sample 80 from pyclustering.samples.definitions import FCPS_SAMPLES 82 # Sample 'Lsun' for cluster analysis (represented by list of points) 83 sample = read_sample(FCPS_SAMPLES.SAMPLE_LSUN) 85 # Create BIRCH algorithm 86 birch_instance = birch(sample, 3, diameter=0.5) 89 birch_instance.process() 91 # Obtain results of clustering 92 clusters = birch_instance.get_clusters() 94 # Obtain information how does the 'Lsun' sample is encoded in the CF-tree. 95 cf_entries = birch_instance.get_cf_entries() 96 cf_clusters = birch_instance.get_cf_cluster() 98 cf_centroids = [entry.get_centroid() for entry in cf_entries] 100 # Visualize allocated clusters 101 visualizer = cluster_visualizer(2, 2, titles=["Encoded data by CF-entries", "Data clusters"]) 102 visualizer.append_clusters(cf_clusters, cf_centroids, canvas=0) 103 visualizer.append_clusters(clusters, sample, canvas=1) 107 Here is the clustering result produced by BIRCH algorithm: 108 @image html birch_cf_encoding_lsun.png "Fig. 2. CF-tree encoding and BIRCH clustering of 'Lsun' sample." 112 def __init__(self, data, number_clusters, branching_factor=50, max_node_entries=200, diameter=0.5,
113 type_measurement=measurement_type.CENTROID_EUCLIDEAN_DISTANCE,
114 entry_size_limit=500,
115 diameter_multiplier=1.5,
118 @brief Constructor of clustering algorithm BIRCH. 120 @param[in] data (list): An input data represented as a list of points (objects) where each point is be represented by list of coordinates. 121 @param[in] number_clusters (uint): Amount of clusters that should be allocated. 122 @param[in] branching_factor (uint): Maximum number of successor that might be contained by each non-leaf node in CF-Tree. 123 @param[in] max_node_entries (uint): Maximum number of entries that might be contained by each leaf node in CF-Tree. 124 @param[in] diameter (double): CF-entry diameter that used for CF-Tree construction, it might be increase if 'entry_size_limit' is exceeded. 125 @param[in] type_measurement (measurement_type): Type measurement used for calculation distance metrics. 126 @param[in] entry_size_limit (uint): Maximum number of entries that can be stored in CF-Tree, if it is exceeded 127 during creation then the 'diameter' is increased and CF-Tree is rebuilt. 128 @param[in] diameter_multiplier (double): Multiplier that is used for increasing diameter when 'entry_size_limit' is exceeded. 129 @param[in] ccore (bool): If True than C++ part of the library is used for processing. 144 self.
__tree =
cftree(branching_factor, max_node_entries, diameter, type_measurement)
152 @brief Performs cluster analysis in line with rules of BIRCH algorithm. 154 @return (birch) Returns itself (BIRCH instance). 163 cf_data = [feature.get_centroid()
for feature
in self.
__features]
169 set_encoding(type_encoding.CLUSTER_INDEX_LABELING).
get_clusters()
173 index_cf_entry = numpy.argmin(numpy.sum(numpy.square(
174 numpy.subtract(cf_data, self.
__pointer_data[index_point])), axis=1))
175 index_cluster = cf_labels[index_cf_entry]
176 self.
__clusters[index_cluster].append(index_point)
183 @brief Returns list of allocated clusters, each cluster is represented by a list of indexes where each index 184 corresponds to a point in an input dataset. 186 @return (list) List of allocated clusters. 197 @brief Returns CF-entries that encodes an input dataset. 199 @return (list) CF-entries that encodes an input dataset. 209 @brief Returns list of allocated CF-entry clusters where each cluster is represented by indexes (each index 210 corresponds to CF-entry). 212 @return (list) List of allocated CF-entry clusters. 222 @brief Returns clustering result representation type that indicate how clusters are encoded. 224 @return (type_encoding) Clustering result representation. 230 return type_encoding.CLUSTER_INDEX_LIST_SEPARATION
233 def __verify_arguments(self):
235 @brief Verify input parameters for the algorithm and throw exception in case of incorrectness. 239 raise ValueError(
"Input data is empty (size: '%d')." % len(self.
__pointer_data))
242 raise ValueError(
"Amount of cluster (current value: '%d') for allocation should be greater than 0." %
246 raise ValueError(
"Limit entry size (current value: '%d') should be greater than 0." %
250 def __extract_features(self):
252 @brief Extracts features from CF-tree cluster. 258 if len(self.
__tree.leafes) == 1:
260 for entry
in self.
__tree.leafes[0].entries:
265 for leaf_node
in self.
__tree.leafes:
269 def __insert_data(self):
271 @brief Inserts input data to the tree. 273 @remark If number of maximum number of entries is exceeded than diameter is increased and tree is rebuilt. 279 self.
__tree.insert_point(point)
285 def __rebuild_tree(self, index_point):
287 @brief Rebuilt tree in case of maxumum number of entries is exceeded. 289 @param[in] index_point (uint): Index of point that is used as end point of re-building. 291 @return (cftree) Rebuilt tree with encoded points till specified point from input data space. 295 rebuild_result =
False 300 while rebuild_result
is False:
302 if increased_diameter == 0.0:
303 increased_diameter = 1.0
308 for index_point
in range(0, index_point + 1):
310 tree.insert_point(point)
317 rebuild_result =
True def __init__(self, data, number_clusters, branching_factor=50, max_node_entries=200, diameter=0.5, type_measurement=measurement_type.CENTROID_EUCLIDEAN_DISTANCE, entry_size_limit=500, diameter_multiplier=1.5, ccore=True)
Constructor of clustering algorithm BIRCH.
def __extract_features(self)
Extracts features from CF-tree cluster.
def __insert_data(self)
Inserts input data to the tree.
Module for representing clustering results.
Class represents agglomerative algorithm for cluster analysis.
def get_cf_cluster(self)
Returns list of allocated CF-entry clusters where each cluster is represented by indexes (each index ...
def __rebuild_tree(self, index_point)
Rebuilt tree in case of maxumum number of entries is exceeded.
def __verify_arguments(self)
Verify input parameters for the algorithm and throw exception in case of incorrectness.
Class represents the clustering algorithm BIRCH (Balanced Iterative Reducing and Clustering using Hie...
def process(self)
Performs cluster analysis in line with rules of BIRCH algorithm.
def get_cf_entries(self)
Returns CF-entries that encodes an input dataset.
Provides service to change clustering result representation.
def get_cluster_encoding(self)
Returns clustering result representation type that indicate how clusters are encoded.
Cluster analysis algorithm: agglomerative algorithm.
def get_clusters(self)
Returns list of allocated clusters, each cluster is represented by a list of indexes where each index...