3 @brief Cluster analysis algorithm: OPTICS (Ordering Points To Identify Clustering Structure) 4 @details Implementation based on paper @cite article::optics::1. 6 @authors Andrei Novikov (pyclustering@yandex.ru) 8 @copyright GNU Public License 10 @cond GNU_PUBLIC_LICENSE 11 PyClustering is free software: you can redistribute it and/or modify 12 it under the terms of the GNU General Public License as published by 13 the Free Software Foundation, either version 3 of the License, or 14 (at your option) any later version. 16 PyClustering is distributed in the hope that it will be useful, 17 but WITHOUT ANY WARRANTY; without even the implied warranty of 18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 19 GNU General Public License for more details. 21 You should have received a copy of the GNU General Public License 22 along with this program. If not, see <http://www.gnu.org/licenses/>. 32 import matplotlib.pyplot
as plt
33 except Exception
as error_instance:
34 warnings.warn(
"Impossible to import matplotlib (please, install 'matplotlib'), pyclustering's visualization " 35 "functionality is not available (details: '%s')." % str(error_instance))
43 from pyclustering.core.wrapper
import ccore_library
45 import pyclustering.core.optics_wrapper
as wrapper
50 @brief Cluster ordering diagram visualizer that represents dataset graphically as density-based clustering structure. 51 @details This OPTICS algorithm is KD-tree optimized. 53 @see ordering_analyser 60 @brief Display cluster-ordering (reachability-plot) diagram. 62 @param[in] analyser (ordering_analyser): cluster-ordering analyser whose ordering diagram should be displayed. 63 @param[in] amount_clusters (uint): if it is not 'None' then it displays connectivity radius line that can used for allocation of specified amount of clusters 64 and colorize diagram by corresponding cluster colors. 66 Example demonstrates general abilities of 'ordering_visualizer' class: 68 # Display cluster-ordering diagram with connectivity radius is used for allocation of three clusters. 69 ordering_visualizer.show_ordering_diagram(analyser, 3); 71 # Display cluster-ordering diagram without radius. 72 ordering_visualizer.show_ordering_diagram(analyser); 76 ordering = analyser.cluster_ordering
77 axis = plt.subplot(111)
79 if amount_clusters
is not None:
80 radius, borders = analyser.calculate_connvectivity_radius(amount_clusters)
84 current_index_border = 0
85 for index_border
in range(len(borders)):
86 right_index_border = borders[index_border]
87 axis.bar(range(left_index_border, right_index_border), ordering[left_index_border:right_index_border], width = 1.0, color = color_list.TITLES[index_border])
88 left_index_border = right_index_border
89 current_index_border = index_border
91 axis.bar(range(left_index_border, len(ordering)), ordering[left_index_border:len(ordering)], width = 1.0, color = color_list.TITLES[current_index_border + 1])
93 plt.xlim([0, len(ordering)])
95 plt.axhline(y = radius, linewidth = 2, color =
'black')
96 plt.text(0, radius + radius * 0.03,
" Radius: " + str(round(radius, 4)) +
";\n Clusters: " + str(amount_clusters), color =
'b', fontsize = 10)
99 axis.bar(range(0, len(ordering)), ordering[0:len(ordering)], width = 1.0, color =
'black')
100 plt.xlim([0, len(ordering)])
107 @brief Analyser of cluster ordering diagram. 108 @details Using cluster-ordering it is able to connectivity radius for allocation of specified amount of clusters and 109 calculate amount of clusters using specified connectivity radius. Cluster-ordering is formed by OPTICS algorithm 110 during cluster analysis. 119 @brief (list) Returns values of dataset cluster ordering. 127 @brief Analyser of ordering diagram that is based on reachability-distances. 129 @see calculate_connvectivity_radius 137 @brief Returns length of clustering-ordering diagram. 145 @brief Calculates connectivity radius of allocation specified amount of clusters using ordering diagram and marks borders of clusters using indexes of values of ordering diagram. 146 @details Parameter 'maximum_iterations' is used to protect from hanging when it is impossible to allocate specified number of clusters. 148 @param[in] amount_clusters (uint): amount of clusters that should be allocated by calculated connectivity radius. 149 @param[in] maximum_iterations (uint): maximum number of iteration for searching connectivity radius to allocated specified amount of clusters (by default it is restricted by 100 iterations). 151 @return (double, list) Value of connectivity radius and borders of clusters like (radius, borders), radius may be 'None' as well as borders may be '[]' 152 if connectivity radius hasn't been found for the specified amount of iterations. 158 upper_distance = maximum_distance
164 if amount <= amount_clusters:
165 for _
in range(maximum_iterations):
166 radius = (lower_distance + upper_distance) / 2.0
169 if amount == amount_clusters:
176 elif amount > amount_clusters:
177 lower_distance = radius
179 elif amount < amount_clusters:
180 upper_distance = radius
182 return result, borders
187 @brief Obtains amount of clustering that can be allocated by using specified radius for ordering diagram and borders between them. 188 @details When growth of reachability-distances is detected than it is considered as a start point of cluster, 189 than pick is detected and after that recession is observed until new growth (that means end of the 190 current cluster and start of a new one) or end of diagram. 192 @param[in] radius (double): connectivity radius that is used for cluster allocation. 194 @return (unit, list) Amount of clusters that can be allocated by the connectivity radius on ordering diagram and borders between them using indexes 195 from ordering diagram (amount_clusters, border_clusters). 201 cluster_start =
False 203 total_similarity =
True 204 previous_cluster_distance =
None 205 previous_distance =
None 209 for index_ordering
in range(len(self.
__ordering)):
211 if distance >= radius:
212 if cluster_start
is False:
216 if index_ordering != 0:
217 cluster_borders.append(index_ordering)
220 if (distance < previous_cluster_distance)
and (cluster_pick
is False):
223 elif (distance > previous_cluster_distance)
and (cluster_pick
is True):
227 if index_ordering != 0:
228 cluster_borders.append(index_ordering)
230 previous_cluster_distance = distance
233 cluster_start =
False 236 if (previous_distance
is not None)
and (distance != previous_distance):
237 total_similarity =
False 239 previous_distance = distance
241 if (total_similarity
is True)
and (previous_distance > radius):
244 return amount_clusters, cluster_borders
249 @brief Object description that used by OPTICS algorithm for cluster analysis. 253 def __init__(self, index, core_distance = None, reachability_distance = None):
255 @brief Constructor of object description in optics terms. 257 @param[in] index (uint): Index of the object in the data set. 258 @param[in] core_distance (double): Core distance that is minimum distance to specified number of neighbors. 259 @param[in] reachability_distance (double): Reachability distance to this object. 277 @brief Returns string representation of the optics descriptor. 286 @brief Class represents clustering algorithm OPTICS (Ordering Points To Identify Clustering Structure) with KD-tree optimization (ccore options is supported). 287 @details OPTICS is a density-based algorithm. Purpose of the algorithm is to provide explicit clusters, but create clustering-ordering representation of the input data. 288 Clustering-ordering information contains information about internal structures of data set in terms of density and proper connectivity radius can be obtained 289 for allocation required amount of clusters using this diagram. In case of usage additional input parameter 'amount of clusters' connectivity radius should be 290 bigger than real - because it will be calculated by the algorithms if requested amount of clusters is not allocated. 292 @image html optics_example_clustering.png "Scheme how does OPTICS works. At the beginning only one cluster is allocated, but two is requested. At the second step OPTICS calculates connectivity radius using cluster-ordering and performs final cluster allocation." 294 Clustering example using sample 'Chainlink': 296 from pyclustering.cluster import cluster_visualizer 297 from pyclustering.cluster.optics import optics, ordering_analyser, ordering_visualizer 298 from pyclustering.samples.definitions import FCPS_SAMPLES 299 from pyclustering.utils import read_sample 301 # Read sample for clustering from some file. 302 sample = read_sample(FCPS_SAMPLES.SAMPLE_CHAINLINK) 304 # Run cluster analysis where connectivity radius is bigger than real. 307 optics_instance = optics(sample, radius, neighbors) 309 # Performs cluster analysis. 310 optics_instance.process() 312 # Obtain results of clustering. 313 clusters = optics_instance.get_clusters() 314 noise = optics_instance.get_noise() 315 ordering = optics_instance.get_ordering() 317 # Visualize clustering results. 318 visualizer = cluster_visualizer() 319 visualizer.append_clusters(clusters, sample) 323 analyser = ordering_analyser(ordering) 324 ordering_visualizer.show_ordering_diagram(analyser, 2) 327 Amount of clusters that should be allocated can be also specified. In this case connectivity radius should be greater than real, for example: 329 from pyclustering.cluster import cluster_visualizer 330 from pyclustering.cluster.optics import optics, ordering_analyser, ordering_visualizer 331 from pyclustering.samples.definitions import FCPS_SAMPLES 332 from pyclustering.utils import read_sample 334 # Read sample for clustering from some file 335 sample = read_sample(FCPS_SAMPLES.SAMPLE_LSUN) 337 # Run cluster analysis where connectivity radius is bigger than real 340 amount_of_clusters = 3 341 optics_instance = optics(sample, radius, neighbors, amount_of_clusters) 343 # Performs cluster analysis 344 optics_instance.process() 346 # Obtain results of clustering 347 clusters = optics_instance.get_clusters() 348 noise = optics_instance.get_noise() 349 ordering = optics_instance.get_ordering() 351 # Visualize ordering diagram 352 analyser = ordering_analyser(ordering) 353 ordering_visualizer.show_ordering_diagram(analyser, amount_of_clusters) 355 # Visualize clustering results 356 visualizer = cluster_visualizer() 357 visualizer.append_clusters(clusters, sample) 363 def __init__(self, sample, eps, minpts, amount_clusters = None, ccore = True, **kwargs):
365 @brief Constructor of clustering algorithm OPTICS. 367 @param[in] sample (list): Input data that is presented as a list of points (objects), where each point is represented by list or tuple. 368 @param[in] eps (double): Connectivity radius between points, points may be connected if distance between them less than the radius. 369 @param[in] minpts (uint): Minimum number of shared neighbors that is required for establishing links between points. 370 @param[in] amount_clusters (uint): Optional parameter where amount of clusters that should be allocated is specified. 371 In case of usage 'amount_clusters' connectivity radius can be greater than real, in other words, there is place for mistake 372 in connectivity radius usage. 373 @param[in] ccore (bool): if True than DLL CCORE (C++ solution) will be used for solving the problem. 374 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'data_type'). 376 <b>Keyword Args:</b><br> 377 - data_type (string): Data type of input sample 'data' that is processed by the algorithm ('points', 'distance_matrix'). 391 self.
__data_type = kwargs.get(
'data_type',
'points')
399 self.
__ccore = ccore_library.workable()
404 @brief Performs cluster analysis in line with rules of OPTICS algorithm. 406 @remark Results of clustering can be obtained using corresponding gets methods. 421 def __process_by_ccore(self):
423 @brief Performs cluster analysis using CCORE (C/C++ part of pyclustering library). 428 objects_indexes, objects_core_distances, objects_reachability_distances) = \
432 for i
in range(len(objects_indexes)):
433 if objects_core_distances[i] < 0.0:
434 objects_core_distances[i] =
None 436 if objects_reachability_distances[i] < 0.0:
437 objects_reachability_distances[i] =
None 439 optics_object =
optics_descriptor(objects_indexes[i], objects_core_distances[i], objects_reachability_distances[i])
440 optics_object.processed =
True 445 def __process_by_python(self):
447 @brief Performs cluster analysis using python code. 459 if radius
is not None:
464 def __initialize(self, sample):
466 @brief Initializes internal states and resets clustering results in line with input sample. 478 def __allocate_clusters(self):
480 @brief Performs cluster allocation and builds ordering diagram that is based on reachability-distances. 487 if optic_object.processed
is False:
495 @brief Returns list of allocated clusters, where each cluster contains indexes of objects and each cluster is represented by list. 497 @return (list) List of allocated clusters. 511 @brief Returns list of noise that contains indexes of objects that corresponds to input data. 513 @return (list) List of allocated noise objects. 527 @brief Returns clustering ordering information about the input data set. 528 @details Clustering ordering of data-set contains the information about the internal clustering structure in line with connectivity radius. 530 @return (ordering_analyser) Analyser of clustering ordering. 536 @see get_optics_objects() 544 for index_object
in cluster:
546 if optics_object.reachability_distance
is not None:
547 self.
__ordering.append(optics_object.reachability_distance)
554 @brief Returns OPTICS objects where each object contains information about index of point from processed data, 555 core distance and reachability distance. 557 @return (list) OPTICS objects. 562 @see optics_descriptor 571 @brief Returns connectivity radius that is calculated and used for clustering by the algorithm. 572 @details Connectivity radius may be changed only in case of usage additional parameter of the algorithm - amount of clusters for allocation. 574 @return (double) Connectivity radius. 579 @see get_optics_objects() 588 @brief Returns clustering result representation type that indicate how clusters are encoded. 590 @return (type_encoding) Clustering result representation. 596 return type_encoding.CLUSTER_INDEX_LIST_SEPARATION
599 def __create_neighbor_searcher(self, data_type):
601 @brief Returns neighbor searcher in line with data type. 603 @param[in] data_type (string): Data type (points or distance matrix). 606 if data_type ==
'points':
608 elif data_type ==
'distance_matrix':
611 raise TypeError(
"Unknown type of data is specified '%s'" % data_type)
614 def __expand_cluster_order(self, optics_object):
616 @brief Expand cluster order from not processed optic-object that corresponds to object from input data. 617 Traverse procedure is performed until objects are reachable from core-objects in line with connectivity radius. 618 Order database is updated during expanding. 620 @param[in] optics_object (optics_descriptor): Object that hasn't been processed. 624 optics_object.processed =
True 627 optics_object.reachability_distance =
None 632 if len(neighbors_descriptor) >= self.
__minpts:
633 neighbors_descriptor.sort(key =
lambda obj: obj[1])
634 optics_object.core_distance = neighbors_descriptor[self.
__minpts - 1][1]
640 while len(order_seed) > 0:
641 optic_descriptor = order_seed[0]
642 order_seed.remove(optic_descriptor)
645 optic_descriptor.processed =
True 649 if len(neighbors_descriptor) >= self.
__minpts:
650 neighbors_descriptor.sort(key =
lambda obj: obj[1])
651 optic_descriptor.core_distance = neighbors_descriptor[self.
__minpts - 1][1]
655 optic_descriptor.core_distance =
None 658 optics_object.core_distance =
None 661 def __extract_clusters(self):
663 @brief Extract clusters and noise from order database. 672 if (optics_object.reachability_distance
is None)
or (optics_object.reachability_distance > self.
__eps):
673 if (optics_object.core_distance
is not None)
and (optics_object.core_distance <= self.
__eps):
674 self.
__clusters.append([ optics_object.index_object ])
677 self.
__noise.append(optics_object.index_object)
679 current_cluster.append(optics_object.index_object)
682 def __update_order_seed(self, optic_descriptor, neighbors_descriptors, order_seed):
684 @brief Update sorted list of reachable objects (from core-object) that should be processed using neighbors of core-object. 686 @param[in] optic_descriptor (optics_descriptor): Core-object whose neighbors should be analysed. 687 @param[in] neighbors_descriptors (list): List of neighbors of core-object. 688 @param[in|out] order_seed (list): List of sorted object in line with reachable distance. 692 for neighbor_descriptor
in neighbors_descriptors:
693 index_neighbor = neighbor_descriptor[0]
694 current_reachable_distance = neighbor_descriptor[1]
697 reachable_distance = max(current_reachable_distance, optic_descriptor.core_distance)
699 self.
__optics_objects[index_neighbor].reachability_distance = reachable_distance
702 index_insertion = len(order_seed)
703 for index_seed
in range(0, len(order_seed)):
704 if reachable_distance < order_seed[index_seed].reachability_distance:
705 index_insertion = index_seed
711 if reachable_distance < self.
__optics_objects[index_neighbor].reachability_distance:
712 self.
__optics_objects[index_neighbor].reachability_distance = reachable_distance
713 order_seed.sort(key =
lambda obj: obj.reachability_distance)
716 def __neighbor_indexes_points(self, optic_object):
718 @brief Return neighbors of the specified object in case of sequence of points. 720 @param[in] optic_object (optics_descriptor): Object for which neighbors should be returned in line with connectivity radius. 722 @return (list) List of indexes of neighbors in line the connectivity radius. 726 return [[node_tuple[1].payload, math.sqrt(node_tuple[0])]
for node_tuple
in kdnodes
if 727 node_tuple[1].payload != optic_object.index_object]
730 def __neighbor_indexes_distance_matrix(self, optic_object):
732 @brief Return neighbors of the specified object in case of distance matrix. 734 @param[in] optic_object (optics_descriptor): Object for which neighbors should be returned in line with connectivity radius. 736 @return (list) List of indexes of neighbors in line the connectivity radius. 740 return [[index_neighbor, distances[index_neighbor]]
for index_neighbor
in range(len(distances))
741 if ((distances[index_neighbor] <= self.
__eps)
and (index_neighbor != optic_object.index_object))]
Object description that used by OPTICS algorithm for cluster analysis.
def process(self)
Performs cluster analysis in line with rules of OPTICS algorithm.
def __neighbor_indexes_points(self, optic_object)
Return neighbors of the specified object in case of sequence of points.
def __len__(self)
Returns length of clustering-ordering diagram.
def show_ordering_diagram(analyser, amount_clusters=None)
Display cluster-ordering (reachability-plot) diagram.
def __update_order_seed(self, optic_descriptor, neighbors_descriptors, order_seed)
Update sorted list of reachable objects (from core-object) that should be processed using neighbors o...
def get_ordering(self)
Returns clustering ordering information about the input data set.
Class represents clustering algorithm OPTICS (Ordering Points To Identify Clustering Structure) with ...
def get_optics_objects(self)
Returns OPTICS objects where each object contains information about index of point from processed dat...
def __repr__(self)
Returns string representation of the optics descriptor.
def cluster_ordering(self)
(list) Returns values of dataset cluster ordering.
reachability_distance
Reachability distance - the smallest distance to be reachable by core object.
Module for representing clustering results.
Represents KD Tree that is a space-partitioning data structure for organizing points in a k-dimension...
def __create_neighbor_searcher(self, data_type)
Returns neighbor searcher in line with data type.
Colors used by pyclustering library for visualization.
index_object
Index of object from the input data.
processed
True is object has been already traversed.
core_distance
Core distance - the smallest distance to reach specified number of neighbors that is not greater then...
def __init__(self, index, core_distance=None, reachability_distance=None)
Constructor of object description in optics terms.
def __init__(self, sample, eps, minpts, amount_clusters=None, ccore=True, kwargs)
Constructor of clustering algorithm OPTICS.
def get_clusters(self)
Returns list of allocated clusters, where each cluster contains indexes of objects and each cluster i...
def __extract_clusters(self)
Extract clusters and noise from order database.
Analyser of cluster ordering diagram.
def __initialize(self, sample)
Initializes internal states and resets clustering results in line with input sample.
def __allocate_clusters(self)
Performs cluster allocation and builds ordering diagram that is based on reachability-distances.
def extract_cluster_amount(self, radius)
Obtains amount of clustering that can be allocated by using specified radius for ordering diagram and...
def get_noise(self)
Returns list of noise that contains indexes of objects that corresponds to input data.
def calculate_connvectivity_radius(self, amount_clusters, maximum_iterations=100)
Calculates connectivity radius of allocation specified amount of clusters using ordering diagram and ...
def __process_by_ccore(self)
Performs cluster analysis using CCORE (C/C++ part of pyclustering library).
def __neighbor_indexes_distance_matrix(self, optic_object)
Return neighbors of the specified object in case of distance matrix.
def get_cluster_encoding(self)
Returns clustering result representation type that indicate how clusters are encoded.
def __process_by_python(self)
Performs cluster analysis using python code.
Cluster ordering diagram visualizer that represents dataset graphically as density-based clustering s...
def __init__(self, ordering_diagram)
Analyser of ordering diagram that is based on reachability-distances.
def get_radius(self)
Returns connectivity radius that is calculated and used for clustering by the algorithm.
def __expand_cluster_order(self, optics_object)
Expand cluster order from not processed optic-object that corresponds to object from input data...