3 @brief Cluster analysis algorithm: OPTICS (Ordering Points To Identify Clustering Structure) 4 @details Implementation based on paper @cite article::optics::1. 6 @authors Andrei Novikov (pyclustering@yandex.ru) 8 @copyright GNU Public License 10 @cond GNU_PUBLIC_LICENSE 11 PyClustering is free software: you can redistribute it and/or modify 12 it under the terms of the GNU General Public License as published by 13 the Free Software Foundation, either version 3 of the License, or 14 (at your option) any later version. 16 PyClustering is distributed in the hope that it will be useful, 17 but WITHOUT ANY WARRANTY; without even the implied warranty of 18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 19 GNU General Public License for more details. 21 You should have received a copy of the GNU General Public License 22 along with this program. If not, see <http://www.gnu.org/licenses/>. 32 import matplotlib.pyplot
as plt
33 except Exception
as error_instance:
34 warnings.warn(
"Impossible to import matplotlib (please, install 'matplotlib'), pyclustering's visualization " 35 "functionality is not available (details: '%s')." % str(error_instance))
43 from pyclustering.core.wrapper
import ccore_library
45 import pyclustering.core.optics_wrapper
as wrapper
50 @brief Cluster ordering diagram visualizer that represents dataset graphically as density-based clustering structure. 51 @details This OPTICS algorithm is KD-tree optimized. 53 @see ordering_analyser 60 @brief Display cluster-ordering (reachability-plot) diagram. 62 @param[in] analyser (ordering_analyser): cluster-ordering analyser whose ordering diagram should be displayed. 63 @param[in] amount_clusters (uint): if it is not 'None' then it displays connectivity radius line that can used for allocation of specified amount of clusters 64 and colorize diagram by corresponding cluster colors. 66 Example demonstrates general abilities of 'ordering_visualizer' class: 68 # Display cluster-ordering diagram with connectivity radius is used for allocation of three clusters. 69 ordering_visualizer.show_ordering_diagram(analyser, 3); 71 # Display cluster-ordering diagram without radius. 72 ordering_visualizer.show_ordering_diagram(analyser); 76 ordering = analyser.cluster_ordering
77 axis = plt.subplot(111)
79 if amount_clusters
is not None:
80 radius, borders = analyser.calculate_connvectivity_radius(amount_clusters)
84 current_index_border = 0
85 for index_border
in range(len(borders)):
86 right_index_border = borders[index_border]
87 axis.bar(range(left_index_border, right_index_border), ordering[left_index_border:right_index_border], width = 1.0, color = color_list.TITLES[index_border])
88 left_index_border = right_index_border
89 current_index_border = index_border
91 axis.bar(range(left_index_border, len(ordering)), ordering[left_index_border:len(ordering)], width = 1.0, color = color_list.TITLES[current_index_border + 1])
93 plt.xlim([0, len(ordering)])
95 plt.axhline(y = radius, linewidth = 2, color =
'black')
96 plt.text(0, radius + radius * 0.03,
" Radius: " + str(round(radius, 4)) +
";\n Clusters: " + str(amount_clusters), color =
'b', fontsize = 10)
99 axis.bar(range(0, len(ordering)), ordering[0:len(ordering)], width = 1.0, color =
'black')
100 plt.xlim([0, len(ordering)])
107 @brief Analyser of cluster ordering diagram. 108 @details Using cluster-ordering it is able to connectivity radius for allocation of specified amount of clusters and 109 calculate amount of clusters using specified connectivity radius. Cluster-ordering is formed by OPTICS algorithm 110 during cluster analysis. 119 @brief (list) Returns values of dataset cluster ordering. 127 @brief Analyser of ordering diagram that is based on reachability-distances. 129 @see calculate_connvectivity_radius 137 @brief Returns length of clustering-ordering diagram. 145 @brief Calculates connectivity radius of allocation specified amount of clusters using ordering diagram and marks borders of clusters using indexes of values of ordering diagram. 146 @details Parameter 'maximum_iterations' is used to protect from hanging when it is impossible to allocate specified number of clusters. 148 @param[in] amount_clusters (uint): amount of clusters that should be allocated by calculated connectivity radius. 149 @param[in] maximum_iterations (uint): maximum number of iteration for searching connectivity radius to allocated specified amount of clusters (by default it is restricted by 100 iterations). 151 @return (double, list) Value of connectivity radius and borders of clusters like (radius, borders), radius may be 'None' as well as borders may be '[]' 152 if connectivity radius hasn't been found for the specified amount of iterations. 158 upper_distance = maximum_distance
164 if amount <= amount_clusters:
165 for _
in range(maximum_iterations):
166 radius = (lower_distance + upper_distance) / 2.0
169 if amount == amount_clusters:
176 elif amount > amount_clusters:
177 lower_distance = radius
179 elif amount < amount_clusters:
180 upper_distance = radius
182 return result, borders
187 @brief Obtains amount of clustering that can be allocated by using specified radius for ordering diagram and borders between them. 188 @details When growth of reachability-distances is detected than it is considered as a start point of cluster, 189 than pick is detected and after that recession is observed until new growth (that means end of the 190 current cluster and start of a new one) or end of diagram. 192 @param[in] radius (double): connectivity radius that is used for cluster allocation. 194 @return (unit, list) Amount of clusters that can be allocated by the connectivity radius on ordering diagram and borders between them using indexes 195 from ordering diagram (amount_clusters, border_clusters). 201 cluster_start =
False 203 total_similarity =
True 204 previous_cluster_distance =
None 205 previous_distance =
None 209 for index_ordering
in range(len(self.
__ordering)):
211 if distance >= radius:
212 if cluster_start
is False:
216 if index_ordering != 0:
217 cluster_borders.append(index_ordering)
220 if (distance < previous_cluster_distance)
and (cluster_pick
is False):
223 elif (distance > previous_cluster_distance)
and (cluster_pick
is True):
227 if index_ordering != 0:
228 cluster_borders.append(index_ordering)
230 previous_cluster_distance = distance
233 cluster_start =
False 236 if (previous_distance
is not None)
and (distance != previous_distance):
237 total_similarity =
False 239 previous_distance = distance
241 if (total_similarity
is True)
and (previous_distance > radius):
244 return amount_clusters, cluster_borders
249 @brief Object description that used by OPTICS algorithm for cluster analysis. 253 def __init__(self, index, core_distance = None, reachability_distance = None):
255 @brief Constructor of object description in optics terms. 257 @param[in] index (uint): Index of the object in the data set. 258 @param[in] core_distance (double): Core distance that is minimum distance to specified number of neighbors. 259 @param[in] reachability_distance (double): Reachability distance to this object. 277 @brief Returns string representation of the optics descriptor. 286 @brief Class represents clustering algorithm OPTICS (Ordering Points To Identify Clustering Structure) with KD-tree optimization (ccore options is supported). 287 @details OPTICS is a density-based algorithm. Purpose of the algorithm is to provide explicit clusters, but create clustering-ordering representation of the input data. 288 Clustering-ordering information contains information about internal structures of data set in terms of density and proper connectivity radius can be obtained 289 for allocation required amount of clusters using this diagram. In case of usage additional input parameter 'amount of clusters' connectivity radius should be 290 bigger than real - because it will be calculated by the algorithms if requested amount of clusters is not allocated. 292 @image html optics_example_clustering.png "Scheme how does OPTICS works. At the beginning only one cluster is allocated, but two is requested. At the second step OPTICS calculates connectivity radius using cluster-ordering and performs final cluster allocation." 294 Clustering example using sample 'Chainlink': 296 from pyclustering.cluster import cluster_visualizer 297 from pyclustering.cluster.optics import optics, ordering_analyser, ordering_visualizer 298 from pyclustering.samples.definitions import FCPS_SAMPLES 299 from pyclustering.utils import read_sample 301 # Read sample for clustering from some file. 302 sample = read_sample(FCPS_SAMPLES.SAMPLE_CHAINLINK) 304 # Run cluster analysis where connectivity radius is bigger than real. 307 optics_instance = optics(sample, radius, neighbors) 309 # Performs cluster analysis. 310 optics_instance.process() 312 # Obtain results of clustering. 313 clusters = optics_instance.get_clusters() 314 noise = optics_instance.get_noise() 315 ordering = optics_instance.get_ordering() 317 # Visualize clustering results. 318 visualizer = cluster_visualizer() 319 visualizer.append_clusters(clusters, sample) 323 analyser = ordering_analyser(ordering) 324 ordering_visualizer.show_ordering_diagram(analyser, 2) 327 Amount of clusters that should be allocated can be also specified. In this case connectivity radius should be greater than real, for example: 329 from pyclustering.cluster import cluster_visualizer 330 from pyclustering.cluster.optics import optics, ordering_analyser, ordering_visualizer 331 from pyclustering.samples.definitions import FCPS_SAMPLES 332 from pyclustering.utils import read_sample 334 # Read sample for clustering from some file 335 sample = read_sample(FCPS_SAMPLES.SAMPLE_LSUN) 337 # Run cluster analysis where connectivity radius is bigger than real 340 amount_of_clusters = 3 341 optics_instance = optics(sample, radius, neighbors, amount_of_clusters) 343 # Performs cluster analysis 344 optics_instance.process() 346 # Obtain results of clustering 347 clusters = optics_instance.get_clusters() 348 noise = optics_instance.get_noise() 349 ordering = optics_instance.get_ordering() 351 # Visualize ordering diagram 352 analyser = ordering_analyser(ordering) 353 ordering_visualizer.show_ordering_diagram(analyser, amount_of_clusters) 355 # Visualize clustering results 356 visualizer = cluster_visualizer() 357 visualizer.append_clusters(clusters, sample) 361 Here is an example where OPTICS extracts outliers from sample 'Tetra': 363 from pyclustering.cluster import cluster_visualizer 364 from pyclustering.cluster.optics import optics 365 from pyclustering.samples.definitions import FCPS_SAMPLES 366 from pyclustering.utils import read_sample 368 # Read sample for clustering from some file. 369 sample = read_sample(FCPS_SAMPLES.SAMPLE_TETRA) 371 # Run cluster analysis where connectivity radius is bigger than real. 374 optics_instance = optics(sample, radius, neighbors) 376 # Performs cluster analysis. 377 optics_instance.process() 379 # Obtain results of clustering. 380 clusters = optics_instance.get_clusters() 381 noise = optics_instance.get_noise() 383 # Visualize clustering results (clusters and outliers). 384 visualizer = cluster_visualizer() 385 visualizer.append_clusters(clusters, sample) 386 visualizer.append_cluster(noise, sample, marker='x') 390 Visualization result of allocated clusters and outliers is presented on the image below: 391 @image html optics_noise_tetra.png "Clusters and outliers extracted by OPTICS algorithm from sample 'Tetra'." 395 def __init__(self, sample, eps, minpts, amount_clusters=None, ccore=True, **kwargs):
397 @brief Constructor of clustering algorithm OPTICS. 399 @param[in] sample (list): Input data that is presented as a list of points (objects), where each point is represented by list or tuple. 400 @param[in] eps (double): Connectivity radius between points, points may be connected if distance between them less than the radius. 401 @param[in] minpts (uint): Minimum number of shared neighbors that is required for establishing links between points. 402 @param[in] amount_clusters (uint): Optional parameter where amount of clusters that should be allocated is specified. 403 In case of usage 'amount_clusters' connectivity radius can be greater than real, in other words, there is place for mistake 404 in connectivity radius usage. 405 @param[in] ccore (bool): if True than DLL CCORE (C++ solution) will be used for solving the problem. 406 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'data_type'). 408 <b>Keyword Args:</b><br> 409 - data_type (string): Data type of input sample 'data' that is processed by the algorithm ('points', 'distance_matrix'). 423 self.
__data_type = kwargs.get(
'data_type',
'points')
431 self.
__ccore = ccore_library.workable()
438 @brief Performs cluster analysis in line with rules of OPTICS algorithm. 440 @return (optics) Returns itself (OPTICS instance). 457 def __process_by_ccore(self):
459 @brief Performs cluster analysis using CCORE (C/C++ part of pyclustering library). 464 objects_indexes, objects_core_distances, objects_reachability_distances) = \
468 for i
in range(len(objects_indexes)):
469 if objects_core_distances[i] < 0.0:
470 objects_core_distances[i] =
None 472 if objects_reachability_distances[i] < 0.0:
473 objects_reachability_distances[i] =
None 475 optics_object =
optics_descriptor(objects_indexes[i], objects_core_distances[i], objects_reachability_distances[i])
476 optics_object.processed =
True 481 def __process_by_python(self):
483 @brief Performs cluster analysis using python code. 495 if radius
is not None:
500 def __initialize(self, sample):
502 @brief Initializes internal states and resets clustering results in line with input sample. 514 def __allocate_clusters(self):
516 @brief Performs cluster allocation and builds ordering diagram that is based on reachability-distances. 523 if optic_object.processed
is False:
531 @brief Returns list of allocated clusters, where each cluster contains indexes of objects and each cluster is represented by list. 533 @return (list) List of allocated clusters. 547 @brief Returns list of noise that contains indexes of objects that corresponds to input data. 549 @return (list) List of allocated noise objects. 563 @brief Returns clustering ordering information about the input data set. 564 @details Clustering ordering of data-set contains the information about the internal clustering structure in line with connectivity radius. 566 @return (ordering_analyser) Analyser of clustering ordering. 572 @see get_optics_objects() 580 for index_object
in cluster:
582 if optics_object.reachability_distance
is not None:
583 self.
__ordering.append(optics_object.reachability_distance)
590 @brief Returns OPTICS objects where each object contains information about index of point from processed data, 591 core distance and reachability distance. 593 @return (list) OPTICS objects. 598 @see optics_descriptor 607 @brief Returns connectivity radius that is calculated and used for clustering by the algorithm. 608 @details Connectivity radius may be changed only in case of usage additional parameter of the algorithm - amount of clusters for allocation. 610 @return (double) Connectivity radius. 615 @see get_optics_objects() 624 @brief Returns clustering result representation type that indicate how clusters are encoded. 626 @return (type_encoding) Clustering result representation. 632 return type_encoding.CLUSTER_INDEX_LIST_SEPARATION
635 def __create_neighbor_searcher(self, data_type):
637 @brief Returns neighbor searcher in line with data type. 639 @param[in] data_type (string): Data type (points or distance matrix). 642 if data_type ==
'points':
644 elif data_type ==
'distance_matrix':
647 raise TypeError(
"Unknown type of data is specified '%s'" % data_type)
650 def __expand_cluster_order(self, optics_object):
652 @brief Expand cluster order from not processed optic-object that corresponds to object from input data. 653 Traverse procedure is performed until objects are reachable from core-objects in line with connectivity radius. 654 Order database is updated during expanding. 656 @param[in] optics_object (optics_descriptor): Object that hasn't been processed. 660 optics_object.processed =
True 663 optics_object.reachability_distance =
None 668 if len(neighbors_descriptor) >= self.
__minpts:
669 neighbors_descriptor.sort(key =
lambda obj: obj[1])
670 optics_object.core_distance = neighbors_descriptor[self.
__minpts - 1][1]
676 while len(order_seed) > 0:
677 optic_descriptor = order_seed[0]
678 order_seed.remove(optic_descriptor)
681 optic_descriptor.processed =
True 685 if len(neighbors_descriptor) >= self.
__minpts:
686 neighbors_descriptor.sort(key =
lambda obj: obj[1])
687 optic_descriptor.core_distance = neighbors_descriptor[self.
__minpts - 1][1]
691 optic_descriptor.core_distance =
None 694 optics_object.core_distance =
None 697 def __extract_clusters(self):
699 @brief Extract clusters and noise from order database. 708 if (optics_object.reachability_distance
is None)
or (optics_object.reachability_distance > self.
__eps):
709 if (optics_object.core_distance
is not None)
and (optics_object.core_distance <= self.
__eps):
710 self.
__clusters.append([ optics_object.index_object ])
713 self.
__noise.append(optics_object.index_object)
715 current_cluster.append(optics_object.index_object)
718 def __update_order_seed(self, optic_descriptor, neighbors_descriptors, order_seed):
720 @brief Update sorted list of reachable objects (from core-object) that should be processed using neighbors of core-object. 722 @param[in] optic_descriptor (optics_descriptor): Core-object whose neighbors should be analysed. 723 @param[in] neighbors_descriptors (list): List of neighbors of core-object. 724 @param[in|out] order_seed (list): List of sorted object in line with reachable distance. 728 for neighbor_descriptor
in neighbors_descriptors:
729 index_neighbor = neighbor_descriptor[0]
730 current_reachable_distance = neighbor_descriptor[1]
733 reachable_distance = max(current_reachable_distance, optic_descriptor.core_distance)
735 self.
__optics_objects[index_neighbor].reachability_distance = reachable_distance
738 index_insertion = len(order_seed)
739 for index_seed
in range(0, len(order_seed)):
740 if reachable_distance < order_seed[index_seed].reachability_distance:
741 index_insertion = index_seed
747 if reachable_distance < self.
__optics_objects[index_neighbor].reachability_distance:
748 self.
__optics_objects[index_neighbor].reachability_distance = reachable_distance
749 order_seed.sort(key =
lambda obj: obj.reachability_distance)
752 def __neighbor_indexes_points(self, optic_object):
754 @brief Return neighbors of the specified object in case of sequence of points. 756 @param[in] optic_object (optics_descriptor): Object for which neighbors should be returned in line with connectivity radius. 758 @return (list) List of indexes of neighbors in line the connectivity radius. 762 return [[node_tuple[1].payload, math.sqrt(node_tuple[0])]
for node_tuple
in kdnodes
if 763 node_tuple[1].payload != optic_object.index_object]
766 def __neighbor_indexes_distance_matrix(self, optic_object):
768 @brief Return neighbors of the specified object in case of distance matrix. 770 @param[in] optic_object (optics_descriptor): Object for which neighbors should be returned in line with connectivity radius. 772 @return (list) List of indexes of neighbors in line the connectivity radius. 776 return [[index_neighbor, distances[index_neighbor]]
for index_neighbor
in range(len(distances))
777 if ((distances[index_neighbor] <= self.
__eps)
and (index_neighbor != optic_object.index_object))]
780 def __verify_arguments(self):
782 @brief Verify input parameters for the algorithm and throw exception in case of incorrectness. 786 raise ValueError(
"Input data is empty (size: '%d')." % len(self.
__sample_pointer))
789 raise ValueError(
"Connectivity radius (current value: '%d') should be greater or equal to 0." % self.
__eps)
792 raise ValueError(
"Minimum number of neighbors (current value: '%d') should be greater than 0." %
796 raise ValueError(
"Amount of clusters (current value: '%d') should be greater than 0." %
Object description that used by OPTICS algorithm for cluster analysis.
def process(self)
Performs cluster analysis in line with rules of OPTICS algorithm.
def __neighbor_indexes_points(self, optic_object)
Return neighbors of the specified object in case of sequence of points.
def __len__(self)
Returns length of clustering-ordering diagram.
def show_ordering_diagram(analyser, amount_clusters=None)
Display cluster-ordering (reachability-plot) diagram.
def __verify_arguments(self)
Verify input parameters for the algorithm and throw exception in case of incorrectness.
def __update_order_seed(self, optic_descriptor, neighbors_descriptors, order_seed)
Update sorted list of reachable objects (from core-object) that should be processed using neighbors o...
def get_ordering(self)
Returns clustering ordering information about the input data set.
Class represents clustering algorithm OPTICS (Ordering Points To Identify Clustering Structure) with ...
def get_optics_objects(self)
Returns OPTICS objects where each object contains information about index of point from processed dat...
def __repr__(self)
Returns string representation of the optics descriptor.
def cluster_ordering(self)
(list) Returns values of dataset cluster ordering.
reachability_distance
Reachability distance - the smallest distance to be reachable by core object.
Module for representing clustering results.
Represents KD Tree that is a space-partitioning data structure for organizing points in a k-dimension...
def __create_neighbor_searcher(self, data_type)
Returns neighbor searcher in line with data type.
Colors used by pyclustering library for visualization.
index_object
Index of object from the input data.
processed
True is object has been already traversed.
core_distance
Core distance - the smallest distance to reach specified number of neighbors that is not greater then...
def __init__(self, index, core_distance=None, reachability_distance=None)
Constructor of object description in optics terms.
def __init__(self, sample, eps, minpts, amount_clusters=None, ccore=True, kwargs)
Constructor of clustering algorithm OPTICS.
def get_clusters(self)
Returns list of allocated clusters, where each cluster contains indexes of objects and each cluster i...
def __extract_clusters(self)
Extract clusters and noise from order database.
Analyser of cluster ordering diagram.
def __initialize(self, sample)
Initializes internal states and resets clustering results in line with input sample.
def __allocate_clusters(self)
Performs cluster allocation and builds ordering diagram that is based on reachability-distances.
def extract_cluster_amount(self, radius)
Obtains amount of clustering that can be allocated by using specified radius for ordering diagram and...
def get_noise(self)
Returns list of noise that contains indexes of objects that corresponds to input data.
def calculate_connvectivity_radius(self, amount_clusters, maximum_iterations=100)
Calculates connectivity radius of allocation specified amount of clusters using ordering diagram and ...
def __process_by_ccore(self)
Performs cluster analysis using CCORE (C/C++ part of pyclustering library).
def __neighbor_indexes_distance_matrix(self, optic_object)
Return neighbors of the specified object in case of distance matrix.
def get_cluster_encoding(self)
Returns clustering result representation type that indicate how clusters are encoded.
def __process_by_python(self)
Performs cluster analysis using python code.
Cluster ordering diagram visualizer that represents dataset graphically as density-based clustering s...
def __init__(self, ordering_diagram)
Analyser of ordering diagram that is based on reachability-distances.
def get_radius(self)
Returns connectivity radius that is calculated and used for clustering by the algorithm.
def __expand_cluster_order(self, optics_object)
Expand cluster order from not processed optic-object that corresponds to object from input data...