3 @brief Cluster analysis algorithm: OPTICS (Ordering Points To Identify Clustering Structure) 4 @details Implementation based on paper @cite article::optics::1. 6 @authors Andrei Novikov (pyclustering@yandex.ru) 8 @copyright GNU Public License 10 @cond GNU_PUBLIC_LICENSE 11 PyClustering is free software: you can redistribute it and/or modify 12 it under the terms of the GNU General Public License as published by 13 the Free Software Foundation, either version 3 of the License, or 14 (at your option) any later version. 16 PyClustering is distributed in the hope that it will be useful, 17 but WITHOUT ANY WARRANTY; without even the implied warranty of 18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 19 GNU General Public License for more details. 21 You should have received a copy of the GNU General Public License 22 along with this program. If not, see <http://www.gnu.org/licenses/>. 32 import matplotlib.pyplot
as plt
33 except Exception
as error_instance:
34 warnings.warn(
"Impossible to import matplotlib (please, install 'matplotlib'), pyclustering's visualization " 35 "functionality is not available (details: '%s')." % str(error_instance))
43 from pyclustering.core.wrapper
import ccore_library
45 import pyclustering.core.optics_wrapper
as wrapper
50 @brief Cluster ordering diagram visualizer that represents dataset graphically as density-based clustering structure. 51 @details This OPTICS algorithm is KD-tree optimized. 53 @see ordering_analyser 60 @brief Display cluster-ordering (reachability-plot) diagram. 62 @param[in] analyser (ordering_analyser): cluster-ordering analyser whose ordering diagram should be displayed. 63 @param[in] amount_clusters (uint): if it is not 'None' then it displays connectivity radius line that can used for allocation of specified amount of clusters 64 and colorize diagram by corresponding cluster colors. 66 Example demonstrates general abilities of 'ordering_visualizer' class: 68 # Display cluster-ordering diagram with connectivity radius is used for allocation of three clusters. 69 ordering_visualizer.show_ordering_diagram(analyser, 3); 71 # Display cluster-ordering diagram without radius. 72 ordering_visualizer.show_ordering_diagram(analyser); 76 ordering = analyser.cluster_ordering
77 axis = plt.subplot(111)
79 if amount_clusters
is not None:
80 radius, borders = analyser.calculate_connvectivity_radius(amount_clusters)
84 current_index_border = 0
85 for index_border
in range(len(borders)):
86 right_index_border = borders[index_border]
87 axis.bar(range(left_index_border, right_index_border), ordering[left_index_border:right_index_border], width = 1.0, color = color_list.TITLES[index_border])
88 left_index_border = right_index_border
89 current_index_border = index_border
91 axis.bar(range(left_index_border, len(ordering)), ordering[left_index_border:len(ordering)], width = 1.0, color = color_list.TITLES[current_index_border + 1])
93 plt.xlim([0, len(ordering)])
95 plt.axhline(y = radius, linewidth = 2, color =
'black')
96 plt.text(0, radius + radius * 0.03,
" Radius: " + str(round(radius, 4)) +
";\n Clusters: " + str(amount_clusters), color =
'b', fontsize = 10)
99 axis.bar(range(0, len(ordering)), ordering[0:len(ordering)], width = 1.0, color =
'black')
100 plt.xlim([0, len(ordering)])
107 @brief Analyser of cluster ordering diagram. 108 @details Using cluster-ordering it is able to connectivity radius for allocation of specified amount of clusters and 109 calculate amount of clusters using specified connectivity radius. Cluster-ordering is formed by OPTICS algorithm 110 during cluster analysis. 119 @brief (list) Returns values of dataset cluster ordering. 127 @brief Analyser of ordering diagram that is based on reachability-distances. 129 @see calculate_connvectivity_radius 137 @brief Returns length of clustering-ordering diagram. 145 @brief Calculates connectivity radius of allocation specified amount of clusters using ordering diagram and marks borders of clusters using indexes of values of ordering diagram. 146 @details Parameter 'maximum_iterations' is used to protect from hanging when it is impossible to allocate specified number of clusters. 148 @param[in] amount_clusters (uint): amount of clusters that should be allocated by calculated connectivity radius. 149 @param[in] maximum_iterations (uint): maximum number of iteration for searching connectivity radius to allocated specified amount of clusters (by default it is restricted by 100 iterations). 151 @return (double, list) Value of connectivity radius and borders of clusters like (radius, borders), radius may be 'None' as well as borders may be '[]' 152 if connectivity radius hasn't been found for the specified amount of iterations. 158 upper_distance = maximum_distance
164 if amount <= amount_clusters:
165 for _
in range(maximum_iterations):
166 radius = (lower_distance + upper_distance) / 2.0
169 if amount == amount_clusters:
176 elif amount > amount_clusters:
177 lower_distance = radius
179 elif amount < amount_clusters:
180 upper_distance = radius
182 return result, borders
187 @brief Obtains amount of clustering that can be allocated by using specified radius for ordering diagram and borders between them. 188 @details When growth of reachability-distances is detected than it is considered as a start point of cluster, 189 than pick is detected and after that recession is observed until new growth (that means end of the 190 current cluster and start of a new one) or end of diagram. 192 @param[in] radius (double): connectivity radius that is used for cluster allocation. 194 @return (unit, list) Amount of clusters that can be allocated by the connectivity radius on ordering diagram and borders between them using indexes 195 from ordering diagram (amount_clusters, border_clusters). 201 cluster_start =
False 203 total_similarity =
True 204 previous_cluster_distance =
None 205 previous_distance =
None 209 for index_ordering
in range(len(self.
__ordering)):
211 if distance >= radius:
212 if cluster_start
is False:
216 if index_ordering != 0:
217 cluster_borders.append(index_ordering)
220 if (distance < previous_cluster_distance)
and (cluster_pick
is False):
223 elif (distance > previous_cluster_distance)
and (cluster_pick
is True):
227 if index_ordering != 0:
228 cluster_borders.append(index_ordering)
230 previous_cluster_distance = distance
233 cluster_start =
False 236 if (previous_distance
is not None)
and (distance != previous_distance):
237 total_similarity =
False 239 previous_distance = distance
241 if (total_similarity
is True)
and (previous_distance > radius):
244 return amount_clusters, cluster_borders
249 @brief Object description that used by OPTICS algorithm for cluster analysis. 253 def __init__(self, index, core_distance = None, reachability_distance = None):
255 @brief Constructor of object description in optics terms. 257 @param[in] index (uint): Index of the object in the data set. 258 @param[in] core_distance (double): Core distance that is minimum distance to specified number of neighbors. 259 @param[in] reachability_distance (double): Reachability distance to this object. 277 @brief Returns string representation of the optics descriptor. 286 @brief Class represents clustering algorithm OPTICS (Ordering Points To Identify Clustering Structure) with KD-tree optimization (ccore options is supported). 287 @details OPTICS is a density-based algorithm. Purpose of the algorithm is to provide explicit clusters, but create clustering-ordering representation of the input data. 288 Clustering-ordering information contains information about internal structures of data set in terms of density and proper connectivity radius can be obtained 289 for allocation required amount of clusters using this diagram. In case of usage additional input parameter 'amount of clusters' connectivity radius should be 290 bigger than real - because it will be calculated by the algorithms if requested amount of clusters is not allocated. 292 CCORE option can be used to use the pyclustering core - C/C++ shared library for processing that significantly increases performance. 294 @image html optics_example_clustering.png "Scheme how does OPTICS works. At the beginning only one cluster is allocated, but two is requested. At the second step OPTICS calculates connectivity radius using cluster-ordering and performs final cluster allocation." 298 # Read sample for clustering from some file 299 sample = read_sample(path_sample); 301 # Create OPTICS algorithm for cluster analysis 302 optics_instance = optics(sample, 0.5, 6); 304 # Run cluster analysis 305 optics_instance.process(); 307 # Obtain results of clustering 308 clusters = optics_instance.get_clusters(); 309 noise = optics_instance.get_noise(); 311 # Obtain rechability-distances 312 ordering = ordering_analyser(optics_instance.get_ordering()); 314 # Visualization of cluster ordering in line with reachability distance. 315 ordering_visualizer.show_ordering_diagram(ordering); 318 Amount of clusters that should be allocated can be also specified. In this case connectivity radius should be greater than real, for example: 320 # Import required packages 321 from pyclustering.cluster.optics import optics; 322 from pyclustering.samples.definitions import FCPS_SAMPLES; 323 from pyclustering.utils import read_sample; 325 # Read sample for clustering from some file 326 sample = read_sample(FCPS_SAMPLES.SAMPLE_LSUN); 328 # Run cluster analysis where connvectivity radius is bigger than real 331 amount_of_clusters = 3; 333 optics_instance = optics(sample, radius, neighbors, amount_of_clusters); 335 # Obtain results of clustering 336 clusters = optics_instance.get_clusters(); 337 noise = optics_instance.get_noise(); 342 def __init__(self, sample, eps, minpts, amount_clusters = None, ccore = True, **kwargs):
344 @brief Constructor of clustering algorithm OPTICS. 346 @param[in] sample (list): Input data that is presented as a list of points (objects), where each point is represented by list or tuple. 347 @param[in] eps (double): Connectivity radius between points, points may be connected if distance between them less than the radius. 348 @param[in] minpts (uint): Minimum number of shared neighbors that is required for establishing links between points. 349 @param[in] amount_clusters (uint): Optional parameter where amount of clusters that should be allocated is specified. 350 In case of usage 'amount_clusters' connectivity radius can be greater than real, in other words, there is place for mistake 351 in connectivity radius usage. 352 @param[in] ccore (bool): if True than DLL CCORE (C++ solution) will be used for solving the problem. 353 @param[in] **kwargs: Arbitrary keyword arguments (available arguments: 'data_type'). 355 <b>Keyword Args:</b><br> 356 - data_type (string): Data type of input sample 'data' that is processed by the algorithm ('points', 'distance_matrix'). 370 self.
__data_type = kwargs.get(
'data_type',
'points')
378 self.
__ccore = ccore_library.workable()
383 @brief Performs cluster analysis in line with rules of OPTICS algorithm. 385 @remark Results of clustering can be obtained using corresponding gets methods. 400 def __process_by_ccore(self):
402 @brief Performs cluster analysis using CCORE (C/C++ part of pyclustering library). 407 objects_indexes, objects_core_distances, objects_reachability_distances) = \
411 for i
in range(len(objects_indexes)):
412 if objects_core_distances[i] < 0.0:
413 objects_core_distances[i] =
None 415 if objects_reachability_distances[i] < 0.0:
416 objects_reachability_distances[i] =
None 418 optics_object =
optics_descriptor(objects_indexes[i], objects_core_distances[i], objects_reachability_distances[i])
419 optics_object.processed =
True 424 def __process_by_python(self):
426 @brief Performs cluster analysis using python code. 438 if radius
is not None:
443 def __initialize(self, sample):
445 @brief Initializes internal states and resets clustering results in line with input sample. 457 def __allocate_clusters(self):
459 @brief Performs cluster allocation and builds ordering diagram that is based on reachability-distances. 466 if optic_object.processed
is False:
474 @brief Returns list of allocated clusters, where each cluster contains indexes of objects and each cluster is represented by list. 476 @return (list) List of allocated clusters. 490 @brief Returns list of noise that contains indexes of objects that corresponds to input data. 492 @return (list) List of allocated noise objects. 506 @brief Returns clustering ordering information about the input data set. 507 @details Clustering ordering of data-set contains the information about the internal clustering structure in line with connectivity radius. 509 @return (ordering_analyser) Analyser of clustering ordering. 515 @see get_optics_objects() 523 for index_object
in cluster:
525 if optics_object.reachability_distance
is not None:
526 self.
__ordering.append(optics_object.reachability_distance)
533 @brief Returns OPTICS objects where each object contains information about index of point from processed data, 534 core distance and reachability distance. 536 @return (list) OPTICS objects. 541 @see optics_descriptor 550 @brief Returns connectivity radius that is calculated and used for clustering by the algorithm. 551 @details Connectivity radius may be changed only in case of usage additional parameter of the algorithm - amount of clusters for allocation. 553 @return (double) Connectivity radius. 558 @see get_optics_objects() 567 @brief Returns clustering result representation type that indicate how clusters are encoded. 569 @return (type_encoding) Clustering result representation. 575 return type_encoding.CLUSTER_INDEX_LIST_SEPARATION
578 def __create_neighbor_searcher(self, data_type):
580 @brief Returns neighbor searcher in line with data type. 582 @param[in] data_type (string): Data type (points or distance matrix). 585 if data_type ==
'points':
587 elif data_type ==
'distance_matrix':
590 raise TypeError(
"Unknown type of data is specified '%s'" % data_type)
593 def __expand_cluster_order(self, optics_object):
595 @brief Expand cluster order from not processed optic-object that corresponds to object from input data. 596 Traverse procedure is performed until objects are reachable from core-objects in line with connectivity radius. 597 Order database is updated during expanding. 599 @param[in] optics_object (optics_descriptor): Object that hasn't been processed. 603 optics_object.processed =
True 606 optics_object.reachability_distance =
None 611 if len(neighbors_descriptor) >= self.
__minpts:
612 neighbors_descriptor.sort(key =
lambda obj: obj[1])
613 optics_object.core_distance = neighbors_descriptor[self.
__minpts - 1][1]
619 while len(order_seed) > 0:
620 optic_descriptor = order_seed[0]
621 order_seed.remove(optic_descriptor)
624 optic_descriptor.processed =
True 628 if len(neighbors_descriptor) >= self.
__minpts:
629 neighbors_descriptor.sort(key =
lambda obj: obj[1])
630 optic_descriptor.core_distance = neighbors_descriptor[self.
__minpts - 1][1]
634 optic_descriptor.core_distance =
None 637 optics_object.core_distance =
None 640 def __extract_clusters(self):
642 @brief Extract clusters and noise from order database. 651 if (optics_object.reachability_distance
is None)
or (optics_object.reachability_distance > self.
__eps):
652 if (optics_object.core_distance
is not None)
and (optics_object.core_distance <= self.
__eps):
653 self.
__clusters.append([ optics_object.index_object ])
656 self.
__noise.append(optics_object.index_object)
658 current_cluster.append(optics_object.index_object)
661 def __update_order_seed(self, optic_descriptor, neighbors_descriptors, order_seed):
663 @brief Update sorted list of reachable objects (from core-object) that should be processed using neighbors of core-object. 665 @param[in] optic_descriptor (optics_descriptor): Core-object whose neighbors should be analysed. 666 @param[in] neighbors_descriptors (list): List of neighbors of core-object. 667 @param[in|out] order_seed (list): List of sorted object in line with reachable distance. 671 for neighbor_descriptor
in neighbors_descriptors:
672 index_neighbor = neighbor_descriptor[0]
673 current_reachable_distance = neighbor_descriptor[1]
676 reachable_distance = max(current_reachable_distance, optic_descriptor.core_distance)
678 self.
__optics_objects[index_neighbor].reachability_distance = reachable_distance
681 index_insertion = len(order_seed)
682 for index_seed
in range(0, len(order_seed)):
683 if reachable_distance < order_seed[index_seed].reachability_distance:
684 index_insertion = index_seed
690 if reachable_distance < self.
__optics_objects[index_neighbor].reachability_distance:
691 self.
__optics_objects[index_neighbor].reachability_distance = reachable_distance
692 order_seed.sort(key =
lambda obj: obj.reachability_distance)
695 def __neighbor_indexes_points(self, optic_object):
697 @brief Return neighbors of the specified object in case of sequence of points. 699 @param[in] optic_object (optics_descriptor): Object for which neighbors should be returned in line with connectivity radius. 701 @return (list) List of indexes of neighbors in line the connectivity radius. 705 return [[node_tuple[1].payload, math.sqrt(node_tuple[0])]
for node_tuple
in kdnodes
if 706 node_tuple[1].payload != optic_object.index_object]
709 def __neighbor_indexes_distance_matrix(self, optic_object):
711 @brief Return neighbors of the specified object in case of distance matrix. 713 @param[in] optic_object (optics_descriptor): Object for which neighbors should be returned in line with connectivity radius. 715 @return (list) List of indexes of neighbors in line the connectivity radius. 719 return [[index_neighbor, distances[index_neighbor]]
for index_neighbor
in range(len(distances))
720 if ((distances[index_neighbor] <= self.
__eps)
and (index_neighbor != optic_object.index_object))]
Object description that used by OPTICS algorithm for cluster analysis.
def process(self)
Performs cluster analysis in line with rules of OPTICS algorithm.
def __neighbor_indexes_points(self, optic_object)
Return neighbors of the specified object in case of sequence of points.
def __len__(self)
Returns length of clustering-ordering diagram.
def show_ordering_diagram(analyser, amount_clusters=None)
Display cluster-ordering (reachability-plot) diagram.
def __update_order_seed(self, optic_descriptor, neighbors_descriptors, order_seed)
Update sorted list of reachable objects (from core-object) that should be processed using neighbors o...
def get_ordering(self)
Returns clustering ordering information about the input data set.
Class represents clustering algorithm OPTICS (Ordering Points To Identify Clustering Structure) with ...
def get_optics_objects(self)
Returns OPTICS objects where each object contains information about index of point from processed dat...
def __repr__(self)
Returns string representation of the optics descriptor.
def cluster_ordering(self)
(list) Returns values of dataset cluster ordering.
reachability_distance
Reachability distance - the smallest distance to be reachable by core object.
Module for representing clustering results.
Represents KD Tree that is a space-partitioning data structure for organizing points in a k-dimension...
def __create_neighbor_searcher(self, data_type)
Returns neighbor searcher in line with data type.
Colors used by pyclustering library for visualization.
index_object
Index of object from the input data.
processed
True is object has been already traversed.
core_distance
Core distance - the smallest distance to reach specified number of neighbors that is not greater then...
def __init__(self, index, core_distance=None, reachability_distance=None)
Constructor of object description in optics terms.
def __init__(self, sample, eps, minpts, amount_clusters=None, ccore=True, kwargs)
Constructor of clustering algorithm OPTICS.
def get_clusters(self)
Returns list of allocated clusters, where each cluster contains indexes of objects and each cluster i...
def __extract_clusters(self)
Extract clusters and noise from order database.
Analyser of cluster ordering diagram.
def __initialize(self, sample)
Initializes internal states and resets clustering results in line with input sample.
def __allocate_clusters(self)
Performs cluster allocation and builds ordering diagram that is based on reachability-distances.
def extract_cluster_amount(self, radius)
Obtains amount of clustering that can be allocated by using specified radius for ordering diagram and...
def get_noise(self)
Returns list of noise that contains indexes of objects that corresponds to input data.
def calculate_connvectivity_radius(self, amount_clusters, maximum_iterations=100)
Calculates connectivity radius of allocation specified amount of clusters using ordering diagram and ...
def __process_by_ccore(self)
Performs cluster analysis using CCORE (C/C++ part of pyclustering library).
def __neighbor_indexes_distance_matrix(self, optic_object)
Return neighbors of the specified object in case of distance matrix.
def get_cluster_encoding(self)
Returns clustering result representation type that indicate how clusters are encoded.
def __process_by_python(self)
Performs cluster analysis using python code.
Cluster ordering diagram visualizer that represents dataset graphically as density-based clustering s...
def __init__(self, ordering_diagram)
Analyser of ordering diagram that is based on reachability-distances.
def get_radius(self)
Returns connectivity radius that is calculated and used for clustering by the algorithm.
def __expand_cluster_order(self, optics_object)
Expand cluster order from not processed optic-object that corresponds to object from input data...