3 @brief Data Structure: CF-Tree 4 @details Implementation based on paper @cite article::birch::1. 6 @authors Andrei Novikov (pyclustering@yandex.ru) 8 @copyright GNU Public License 10 @cond GNU_PUBLIC_LICENSE 11 PyClustering is free software: you can redistribute it and/or modify 12 it under the terms of the GNU General Public License as published by 13 the Free Software Foundation, either version 3 of the License, or 14 (at your option) any later version. 16 PyClustering is distributed in the hope that it will be useful, 17 but WITHOUT ANY WARRANTY; without even the implied warranty of 18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 19 GNU General Public License for more details. 21 You should have received a copy of the GNU General Public License 22 along with this program. If not, see <http://www.gnu.org/licenses/>. 35 from enum
import IntEnum
40 @brief Enumeration of measurement types for CF-Tree. 47 CENTROID_EUCLIDEAN_DISTANCE = 0
50 CENTROID_MANHATTAN_DISTANCE = 1
53 AVERAGE_INTER_CLUSTER_DISTANCE = 2
56 AVERAGE_INTRA_CLUSTER_DISTANCE = 3
59 VARIANCE_INCREASE_DISTANCE = 4
64 @brief Enumeration of CF-Node types that are used by CF-Tree. 83 @brief Clustering feature representation. 93 @brief Returns number of points that are encoded. 95 @return (uint) Number of encoded points. 103 @brief Returns linear sum. 105 @return (list) Linear sum. 114 @brief Returns square sum. 116 @return (double) Square sum. 122 def __init__(self, number_points, linear_sum, square_sum):
124 @brief CF-entry constructor. 126 @param[in] number_points (uint): Number of objects that is represented by the entry. 127 @param[in] linear_sum (list): Linear sum of values that represent objects in each dimension. 128 @param[in] square_sum (double): Square sum of values that represent objects. 143 @returns (cfentry) Makes copy of the CF-entry instance. 151 @return (string) Returns CF-entry representation. 154 return "CF-Entry (N: '%s', LS: '%s', D: '%s')" % \
160 @brief Default cfentry string representation. 168 @brief Overloaded operator add. Performs addition of two clustering features. 170 @param[in] entry (cfentry): Entry that is added to the current. 172 @return (cfentry) Result of addition of two clustering features. 177 result_linear_sum = numpy.add(self.
linear_sum, entry.linear_sum)
178 result_square_sum = self.
square_sum + entry.square_sum
180 return cfentry(number_points, result_linear_sum, result_square_sum)
185 @brief Overloaded operator eq. 186 @details Performs comparison of two clustering features. 188 @param[in] entry (cfentry): Entry that is used for comparison with current. 190 @return (bool) True is both clustering features are equals in line with tolerance, otherwise False. 197 result &= ((self.
square_sum + tolerance > entry.square_sum)
and (self.
square_sum - tolerance < entry.square_sum))
199 for index_dimension
in range(0, len(self.
linear_sum)):
200 result &= ((self.
linear_sum[index_dimension] + tolerance > entry.linear_sum[index_dimension])
and (self.
linear_sum[index_dimension] - tolerance < entry.linear_sum[index_dimension]))
207 @brief Calculates distance between two clusters in line with measurement type. 209 @details In case of usage CENTROID_EUCLIDIAN_DISTANCE square euclidian distance will be returned. 210 Square root should be taken from the result for obtaining real euclidian distance between 213 @param[in] entry (cfentry): Clustering feature to which distance should be obtained. 214 @param[in] type_measurement (measurement_type): Distance measurement algorithm between two clusters. 216 @return (double) Distance between two clusters. 220 if type_measurement
is measurement_type.CENTROID_EUCLIDEAN_DISTANCE:
221 return euclidean_distance_square(entry.get_centroid(), self.
get_centroid())
223 elif type_measurement
is measurement_type.CENTROID_MANHATTAN_DISTANCE:
224 return manhattan_distance(entry.get_centroid(), self.
get_centroid())
226 elif type_measurement
is measurement_type.AVERAGE_INTER_CLUSTER_DISTANCE:
229 elif type_measurement
is measurement_type.AVERAGE_INTRA_CLUSTER_DISTANCE:
232 elif type_measurement
is measurement_type.VARIANCE_INCREASE_DISTANCE:
236 raise ValueError(
"Unsupported type of measurement '%s' is specified." % type_measurement)
241 @brief Calculates centroid of cluster that is represented by the entry. 242 @details It's calculated once when it's requested after the last changes. 244 @return (array_like) Centroid of cluster that is represented by the entry. 257 @brief Calculates radius of cluster that is represented by the entry. 258 @details It's calculated once when it's requested after the last changes. 260 @return (double) Radius of cluster that is represented by the entry. 271 radius_part_2 = 2.0 * numpy.dot(self.
linear_sum, centroid)
272 radius_part_3 = N * numpy.dot(centroid, centroid)
274 self.
__radius = ((1.0 / N) * (radius_part_1 - radius_part_2 + radius_part_3)) ** 0.5
280 @brief Calculates diameter of cluster that is represented by the entry. 281 @details It's calculated once when it's requested after the last changes. 283 @return (double) Diameter of cluster that is represented by the entry. 292 if diameter_part < 0.000000001:
300 def __get_average_inter_cluster_distance(self, entry):
302 @brief Calculates average inter cluster distance between current and specified clusters. 304 @param[in] entry (cfentry): Clustering feature to which distance should be obtained. 306 @return (double) Average inter cluster distance. 310 linear_part_distance = numpy.dot(self.
linear_sum, entry.linear_sum)
315 def __get_average_intra_cluster_distance(self, entry):
317 @brief Calculates average intra cluster distance between current and specified clusters. 319 @param[in] entry (cfentry): Clustering feature to which distance should be obtained. 321 @return (double) Average intra cluster distance. 325 linear_part_first = numpy.add(self.
linear_sum, entry.linear_sum)
326 linear_part_second = linear_part_first
328 linear_part_distance = numpy.dot(linear_part_first, linear_part_second)
330 general_part_distance = 2.0 * (self.
number_points + entry.number_points) * (self.
square_sum + entry.square_sum) - 2.0 * linear_part_distance
332 return (general_part_distance / ((self.
number_points + entry.number_points) * (self.
number_points + entry.number_points - 1.0))) ** 0.5
335 def __get_variance_increase_distance(self, entry):
337 @brief Calculates variance increase distance between current and specified clusters. 339 @param[in] entry (cfentry): Clustering feature to which distance should be obtained. 341 @return (double) Variance increase distance. 345 linear_part_12 = numpy.add(self.
linear_sum, entry.linear_sum)
346 variance_part_first = (self.
square_sum + entry.square_sum) - \
347 2.0 * numpy.dot(linear_part_12, linear_part_12) / (self.
number_points + entry.number_points) + \
348 (self.
number_points + entry.number_points) * numpy.dot(linear_part_12, linear_part_12) / (self.
number_points + entry.number_points)**2.0
353 linear_part_22 = numpy.dot(entry.linear_sum, entry.linear_sum)
354 variance_part_third = -(entry.square_sum - (2.0 / entry.number_points) * linear_part_22 + entry.number_points * (1.0 / entry.number_points ** 2.0) * linear_part_22)
356 return variance_part_first + variance_part_second + variance_part_third
361 @brief Representation of node of CF-Tree. 367 @brief Constructor of abstract CF node. 369 @param[in] feature (cfentry): Clustering feature of the created node. 370 @param[in] parent (cfnode): Parent of the created node. 381 self.
type = cfnode_type.CFNODE_DUMMY
386 @return (string) Default representation of CF node. 390 return 'CF node %s, parent %s, feature %s' % (hex(id(self)), self.
parent, self.
feature)
395 @return (string) String representation of CF node. 403 @brief Calculates distance between nodes in line with specified type measurement. 405 @param[in] node (cfnode): CF-node that is used for calculation distance to the current node. 406 @param[in] type_measurement (measurement_type): Measurement type that is used for calculation distance. 408 @return (double) Distance between two nodes. 417 @brief Representation of clustering feature non-leaf node. 424 @return (list) List of successors of the node. 432 @brief Create CF Non-leaf node. 434 @param[in] feature (cfentry): Clustering feature of the created node. 435 @param[in] parent (non_leaf_node): Parent of the created node. 436 @param[in] successors (list): List of successors of the node. 443 self.
type = cfnode_type.CFNODE_NONLEAF
450 @return (string) Representation of non-leaf node representation. 453 return 'Non-leaf node %s, parent %s, feature %s, successors: %d' % (hex(id(self)), self.
parent, self.
feature, len(self.
successors))
458 @return (string) String non-leaf representation. 466 @brief Insert successor to the node. 468 @param[in] successor (cfnode): Successor for adding. 472 self.
feature += successor.feature
475 successor.parent = self
480 @brief Merge non-leaf node to the current. 482 @param[in] node (non_leaf_node): Non-leaf node that should be merged with current. 488 for child
in node.successors:
495 @brief Find pair of farthest successors of the node in line with measurement type. 497 @param[in] type_measurement (measurement_type): Measurement type that is used for obtaining farthest successors. 499 @return (list) Pair of farthest successors represented by list [cfnode1, cfnode2]. 503 farthest_node1 =
None 504 farthest_node2 =
None 505 farthest_distance = 0.0
512 candidate_distance = candidate1.get_distance(candidate2, type_measurement)
514 if candidate_distance > farthest_distance:
515 farthest_distance = candidate_distance
516 farthest_node1 = candidate1
517 farthest_node2 = candidate2
519 return [farthest_node1, farthest_node2]
524 @brief Find pair of nearest successors of the node in line with measurement type. 526 @param[in] type_measurement (measurement_type): Measurement type that is used for obtaining nearest successors. 528 @return (list) Pair of nearest successors represented by list. 534 nearest_distance = float(
"Inf")
541 candidate_distance = candidate1.get_distance(candidate2, type_measurement)
543 if candidate_distance < nearest_distance:
544 nearest_distance = candidate_distance
545 nearest_node1 = candidate1
546 nearest_node2 = candidate2
548 return [nearest_node1, nearest_node2]
553 @brief Represents clustering feature leaf node. 560 @return (list) List of leaf nodes. 568 @brief Create CF Leaf node. 570 @param[in] feature (cfentry): Clustering feature of the created node. 571 @param[in] parent (non_leaf_node): Parent of the created node. 572 @param[in] entries (list): List of entries of the node. 579 self.
type = cfnode_type.CFNODE_LEAF
586 @return (string) Default leaf node represenation. 591 text_entries +=
"\t" + str(entry) +
"\n" 593 return "Leaf-node: '%s', parent: '%s', feature: '%s', entries: '%d'" % \
599 @return (string) String leaf node representation. 607 @brief Insert new clustering feature to the leaf node. 609 @param[in] entry (cfentry): Clustering feature. 619 @brief Merge leaf node to the current. 621 @param[in] node (leaf_node): Leaf node that should be merged with current. 628 for entry
in node.entries:
634 @brief Find pair of farthest entries of the node. 636 @param[in] type_measurement (measurement_type): Measurement type that is used for obtaining farthest entries. 638 @return (list) Pair of farthest entries of the node that are represented by list. 642 farthest_entity1 =
None 643 farthest_entity2 =
None 644 farthest_distance = 0
646 for i
in range(0, len(self.
entries)):
649 for j
in range(i + 1, len(self.
entries)):
651 candidate_distance = candidate1.get_distance(candidate2, type_measurement)
653 if candidate_distance > farthest_distance:
654 farthest_distance = candidate_distance
655 farthest_entity1 = candidate1
656 farthest_entity2 = candidate2
658 return [farthest_entity1, farthest_entity2]
663 @brief Find nearest index of nearest entry of node for the specified entry. 665 @param[in] entry (cfentry): Entry that is used for calculation distance. 666 @param[in] type_measurement (measurement_type): Measurement type that is used for obtaining nearest entry to the specified. 668 @return (uint) Index of nearest entry of node for the specified entry. 672 minimum_distance = float(
'Inf')
675 for candidate_index
in range(0, len(self.
entries)):
677 if candidate_distance < minimum_distance:
678 minimum_distance = candidate_distance
679 nearest_index = candidate_index
686 @brief Find nearest entry of node for the specified entry. 688 @param[in] entry (cfentry): Entry that is used for calculation distance. 689 @param[in] type_measurement (measurement_type): Measurement type that is used for obtaining nearest entry to the specified. 691 @return (cfentry) Nearest entry of node for the specified entry. 695 min_key =
lambda cur_entity: cur_entity.get_distance(entry, type_measurement)
696 return min(self.
entries, key=min_key)
701 @brief CF-Tree representation. 702 @details A CF-tree is a height-balanced tree with two parameters: branching factor and threshold. 709 @return (cfnode) Root of the tree. 718 @return (list) List of all leaf nodes in the tree. 727 @return (unit) Number of nodes (leaf and non-leaf) in the tree. 736 @return (uint) Number of entries in the tree. 745 @return (uint) Height of the tree. 754 @return (uint) Branching factor of the tree. 755 @details Branching factor defines maximum number of successors in each non-leaf node. 764 @return (double) Threshold of the tree that represents maximum diameter of sub-clusters that is formed by leaf node entries. 773 @return (uint) Maximum number of entries in each leaf node. 782 @return (measurement_type) Type that is used for measuring. 788 def __init__(self, branch_factor, max_entries, threshold, type_measurement = measurement_type.CENTROID_EUCLIDEAN_DISTANCE):
790 @brief Create CF-tree. 792 @param[in] branch_factor (uint): Maximum number of children for non-leaf nodes. 793 @param[in] max_entries (uint): Maximum number of entries for leaf nodes. 794 @param[in] threshold (double): Maximum diameter of feature clustering for each leaf node. 795 @param[in] type_measurement (measurement_type): Measurement type that is used for calculation distance metrics. 820 @brief Traverses CF-tree to obtain nodes at the specified level. 822 @param[in] level (uint): CF-tree level from that nodes should be returned. 824 @return (list) List of CF-nodes that are located on the specified level of the CF-tree. 835 def __recursive_get_level_nodes(self, level, node):
837 @brief Traverses CF-tree to obtain nodes at the specified level recursively. 839 @param[in] level (uint): Current CF-tree level. 840 @param[in] node (cfnode): CF-node from that traversing is performed. 842 @return (list) List of CF-nodes that are located on the specified level of the CF-tree. 848 level_nodes.append(node)
851 for sucessor
in node.successors:
859 @brief Insert point that is represented by list of coordinates. 861 @param[in] point (list): Point represented by list of coordinates that should be inserted to CF tree. 865 entry =
cfentry(len([point]), linear_sum([point]), square_sum([point]))
871 @brief Insert clustering feature to the tree. 873 @param[in] entry (cfentry): Clustering feature that should be inserted. 889 if child_node_updation
is True:
897 @brief Search nearest leaf to the specified clustering feature. 899 @param[in] entry (cfentry): Clustering feature. 900 @param[in] search_node (cfnode): Node from that searching should be started, if None then search process will be started for the root. 902 @return (leaf_node) Nearest node to the specified clustering feature. 906 if search_node
is None:
909 nearest_node = search_node
911 if search_node.type == cfnode_type.CFNODE_NONLEAF:
912 min_key =
lambda child_node: child_node.feature.get_distance(entry, self.
__type_measurement)
913 nearest_child_node = min(search_node.successors, key = min_key)
920 def __recursive_insert(self, entry, search_node):
922 @brief Recursive insert of the entry to the tree. 923 @details It performs all required procedures during insertion such as splitting, merging. 925 @param[in] entry (cfentry): Clustering feature. 926 @param[in] search_node (cfnode): Node from that insertion should be started. 928 @return (bool) True if number of nodes at the below level is changed, otherwise False. 933 if search_node.type == cfnode_type.CFNODE_NONLEAF:
941 def __insert_for_leaf_node(self, entry, search_node):
943 @brief Recursive insert entry from leaf node to the tree. 945 @param[in] entry (cfentry): Clustering feature. 946 @param[in] search_node (cfnode): None-leaf node from that insertion should be started. 948 @return (bool) True if number of nodes at the below level is changed, otherwise False. 952 node_amount_updation =
False 955 index_nearest_entry = search_node.get_nearest_index_entry(entry, self.
__type_measurement)
956 nearest_entry = search_node.entries[index_nearest_entry]
957 merged_entry = nearest_entry + entry
962 search_node.insert_entry(entry)
967 node_amount_updation =
True 973 search_node.entries[index_nearest_entry] = merged_entry
974 search_node.feature += entry
976 return node_amount_updation
979 def __insert_for_noneleaf_node(self, entry, search_node):
981 @brief Recursive insert entry from none-leaf node to the tree. 983 @param[in] entry (cfentry): Clustering feature. 984 @param[in] search_node (cfnode): None-leaf node from that insertion should be started. 986 @return (bool) True if number of nodes at the below level is changed, otherwise False. 990 node_amount_updation =
False 992 min_key =
lambda child_node: child_node.get_distance(search_node, self.
__type_measurement)
993 nearest_child_node = min(search_node.successors, key=min_key)
998 search_node.feature += entry
1004 if search_node
is self.
__root:
1006 search_node.parent = self.
__root 1015 parent = search_node.parent
1016 parent.successors.remove(search_node)
1017 parent.successors.append(new_node1)
1018 parent.successors.append(new_node2)
1022 node_amount_updation =
True 1024 elif child_node_updation
is True:
1029 return node_amount_updation
1032 def __merge_nearest_successors(self, node):
1034 @brief Find nearest sucessors and merge them. 1036 @param[in] node (non_leaf_node): Node whose two nearest successors should be merged. 1038 @return (bool): True if merging has been successfully performed, otherwise False. 1042 merging_result =
False 1044 if node.successors[0].type == cfnode_type.CFNODE_NONLEAF:
1045 [nearest_child_node1, nearest_child_node2] = node.get_nearest_successors(self.
__type_measurement)
1047 if len(nearest_child_node1.successors) + len(nearest_child_node2.successors) <= self.
__branch_factor:
1048 node.successors.remove(nearest_child_node2)
1049 if nearest_child_node2.type == cfnode_type.CFNODE_LEAF:
1050 self.
__leafes.remove(nearest_child_node2)
1052 nearest_child_node1.merge(nearest_child_node2)
1054 merging_result =
True 1056 return merging_result
1059 def __split_procedure(self, split_node):
1061 @brief Starts node splitting procedure in the CF-tree from the specify node. 1063 @param[in] split_node (cfnode): CF-tree node that should be splitted. 1066 if split_node
is self.
__root:
1068 split_node.parent = self.
__root 1081 parent = split_node.parent
1082 parent.successors.remove(split_node)
1083 parent.successors.append(new_node1)
1084 parent.successors.append(new_node2)
1090 def __split_nonleaf_node(self, node):
1092 @brief Performs splitting of the specified non-leaf node. 1094 @param[in] node (non_leaf_node): Non-leaf node that should be splitted. 1096 @return (list) New pair of non-leaf nodes [non_leaf_node1, non_leaf_node2]. 1100 [farthest_node1, farthest_node2] = node.get_farthest_successors(self.
__type_measurement)
1103 new_node1 =
non_leaf_node(farthest_node1.feature, node.parent, [farthest_node1])
1104 new_node2 =
non_leaf_node(farthest_node2.feature, node.parent, [farthest_node2])
1106 farthest_node1.parent = new_node1
1107 farthest_node2.parent = new_node2
1110 for successor
in node.successors:
1111 if (successor
is not farthest_node1)
and (successor
is not farthest_node2):
1115 if distance1 < distance2:
1116 new_node1.insert_successor(successor)
1118 new_node2.insert_successor(successor)
1120 return [new_node1, new_node2]
1123 def __split_leaf_node(self, node):
1125 @brief Performs splitting of the specified leaf node. 1127 @param[in] node (leaf_node): Leaf node that should be splitted. 1129 @return (list) New pair of leaf nodes [leaf_node1, leaf_node2]. 1131 @warning Splitted node is transformed to non_leaf. 1136 [farthest_entity1, farthest_entity2] = node.get_farthest_entries(self.
__type_measurement)
1139 new_node1 =
leaf_node(farthest_entity1, node.parent, [farthest_entity1])
1140 new_node2 =
leaf_node(farthest_entity2, node.parent, [farthest_entity2])
1143 for entity
in node.entries:
1144 if (entity
is not farthest_entity1)
and (entity
is not farthest_entity2):
1148 if distance1 < distance2:
1149 new_node1.insert_entry(entity)
1151 new_node2.insert_entry(entity)
1153 return [new_node1, new_node2]
def linear_sum(self)
Returns linear sum.
Represents clustering feature leaf node.
def get_distance(self, entry, type_measurement)
Calculates distance between two clusters in line with measurement type.
def insert_entry(self, entry)
Insert new clustering feature to the leaf node.
def get_farthest_successors(self, type_measurement)
Find pair of farthest successors of the node in line with measurement type.
def __init__(self, feature, parent, entries)
Create CF Leaf node.
def __init__(self, feature, parent)
Constructor of abstract CF node.
def __str__(self)
Default cfentry string representation.
def __add__(self, entry)
Overloaded operator add.
def __insert_for_noneleaf_node(self, entry, search_node)
Recursive insert entry from none-leaf node to the tree.
Utils that are used by modules of pyclustering.
def __split_procedure(self, split_node)
Starts node splitting procedure in the CF-tree from the specify node.
def __recursive_get_level_nodes(self, level, node)
Traverses CF-tree to obtain nodes at the specified level recursively.
def __get_average_inter_cluster_distance(self, entry)
Calculates average inter cluster distance between current and specified clusters. ...
def __split_nonleaf_node(self, node)
Performs splitting of the specified non-leaf node.
def merge(self, node)
Merge non-leaf node to the current.
def insert_successor(self, successor)
Insert successor to the node.
feature
Clustering feature of the node.
Representation of clustering feature non-leaf node.
def __get_variance_increase_distance(self, entry)
Calculates variance increase distance between current and specified clusters.
def get_nearest_index_entry(self, entry, type_measurement)
Find nearest index of nearest entry of node for the specified entry.
def merge(self, node)
Merge leaf node to the current.
def __init__(self, feature, parent, successors)
Create CF Non-leaf node.
def get_diameter(self)
Calculates diameter of cluster that is represented by the entry.
def __insert_for_leaf_node(self, entry, search_node)
Recursive insert entry from leaf node to the tree.
def insert(self, entry)
Insert clustering feature to the tree.
def find_nearest_leaf(self, entry, search_node=None)
Search nearest leaf to the specified clustering feature.
def __merge_nearest_successors(self, node)
Find nearest sucessors and merge them.
def type_measurement(self)
def __eq__(self, entry)
Overloaded operator eq.
def get_radius(self)
Calculates radius of cluster that is represented by the entry.
def get_farthest_entries(self, type_measurement)
Find pair of farthest entries of the node.
def get_nearest_entry(self, entry, type_measurement)
Find nearest entry of node for the specified entry.
def square_sum(self)
Returns square sum.
parent
Pointer to the parent node (None for root).
def __init__(self, number_points, linear_sum, square_sum)
CF-entry constructor.
def get_distance(self, node, type_measurement)
Calculates distance between nodes in line with specified type measurement.
Enumeration of measurement types for CF-Tree.
type
Type node (leaf or non-leaf).
Clustering feature representation.
Representation of node of CF-Tree.
def number_points(self)
Returns number of points that are encoded.
def __init__(self, branch_factor, max_entries, threshold, type_measurement=measurement_type.CENTROID_EUCLIDEAN_DISTANCE)
Create CF-tree.
Enumeration of CF-Node types that are used by CF-Tree.
def __recursive_insert(self, entry, search_node)
Recursive insert of the entry to the tree.
def get_centroid(self)
Calculates centroid of cluster that is represented by the entry.
def __split_leaf_node(self, node)
Performs splitting of the specified leaf node.
def insert_point(self, point)
Insert point that is represented by list of coordinates.
def get_level_nodes(self, level)
Traverses CF-tree to obtain nodes at the specified level.
def get_nearest_successors(self, type_measurement)
Find pair of nearest successors of the node in line with measurement type.
def __get_average_intra_cluster_distance(self, entry)
Calculates average intra cluster distance between current and specified clusters. ...