pyclustering  0.10.1
pyclustering is a Python and C++ data mining library.
pyclustering.cluster.kmeans.kmeans Class Reference

Class implements K-Means clustering algorithm.

Public Member Functions

def __init__ (self, data, initial_centers, tolerance=0.001, ccore=True, **kwargs)
 Constructor of clustering algorithm K-Means.
 
def process (self)
 Performs cluster analysis in line with rules of K-Means algorithm.
 
def predict (self, points)
 Calculates the closest cluster to each point.
 
def get_clusters (self)
 Returns the list of allocated clusters; each cluster contains the indexes of its objects in the input data.
 
def get_centers (self)
 Returns list of centers of allocated clusters.
 
def get_total_wce (self)
 Returns the sum of metric errors, which depends on the metric used for clustering (by default SSE, Sum of Squared Errors).
 
def get_cluster_encoding (self)
 Returns clustering result representation type that indicates how clusters are encoded.
 

Detailed Description

Class implements K-Means clustering algorithm.

K-Means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
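
The partitioning described above can be illustrated with one assignment step followed by one update step. This is a minimal pure-Python sketch of the general technique, not the library's implementation; the helper names `squared_euclidean`, `assign` and `update` are hypothetical and exist only for this illustration.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """For each center, collect the indexes of the points nearest to it."""
    clusters = [[] for _ in centers]
    for index, point in enumerate(points):
        nearest = min(range(len(centers)),
                      key=lambda c: squared_euclidean(point, centers[c]))
        clusters[nearest].append(index)
    return clusters


def update(points, clusters, centers):
    """Move each center to the mean of its cluster (unchanged if the cluster is empty)."""
    new_centers = []
    for cluster, old_center in zip(clusters, centers):
        if not cluster:
            new_centers.append(list(old_center))
            continue
        dimension = len(old_center)
        new_centers.append([sum(points[i][d] for i in cluster) / len(cluster)
                            for d in range(dimension)])
    return new_centers


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers = [[1.0, 1.0], [9.0, 9.0]]

clusters = assign(points, centers)            # [[0, 1], [2, 3]]
centers = update(points, clusters, centers)   # [[0.0, 0.5], [10.0, 10.5]]
```

Repeating these two steps until the centers stop moving gives the basic iteration.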

K-Means clustering results depend on the initial centers. The K-Means++ algorithm can be used to choose the initial centers - see module 'pyclustering.cluster.center_initializer'.
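
The idea behind K-Means++ seeding can be sketched in a few lines: the first center is picked uniformly at random, and each further center is picked with probability proportional to the squared distance to the nearest center chosen so far. The function `kmeans_plusplus_sketch` below is a hypothetical illustration of that general technique; it is not the code of 'kmeans_plusplus_initializer' and assumes the data contains at least as many distinct points as requested centers.

```python
import random


def squared_euclidean(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_plusplus_sketch(points, amount, seed=0):
    """Pick `amount` initial centers from `points` using K-Means++ style sampling."""
    rng = random.Random(seed)
    # First center: uniform random choice.
    centers = [points[rng.randrange(len(points))]]
    while len(centers) < amount:
        # Weight of a point: squared distance to its nearest already chosen center.
        weights = [min(squared_euclidean(p, c) for c in centers) for p in points]
        # Points that are already centers have weight 0 and cannot be drawn again.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


points = [[0.0, 0.0], [0.0, 0.1], [10.0, 10.0], [10.0, 10.1]]
initial = kmeans_plusplus_sketch(points, 2)
```

Points far from every chosen center get large weights, so the seeds tend to spread across the data.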

The CCORE implementation (C/C++ part of the library) of the algorithm uses parallel processing to improve performance.

Implementation based on the paper [26].

Fig. 1. K-Means clustering results. At the left - 'Simple03.data' sample, at the right - 'Lsun.data' sample.

Example #1 - Clustering using K-Means++ for center initialization:

from pyclustering.cluster.kmeans import kmeans, kmeans_visualizer
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.samples.definitions import FCPS_SAMPLES
from pyclustering.utils import read_sample
# Load list of points for cluster analysis.
sample = read_sample(FCPS_SAMPLES.SAMPLE_TWO_DIAMONDS)
# Prepare initial centers using K-Means++ method.
initial_centers = kmeans_plusplus_initializer(sample, 2).initialize()
# Create instance of K-Means algorithm with prepared centers.
kmeans_instance = kmeans(sample, initial_centers)
# Run cluster analysis and obtain results.
kmeans_instance.process()
clusters = kmeans_instance.get_clusters()
final_centers = kmeans_instance.get_centers()
# Visualize obtained results
kmeans_visualizer.show_clusters(sample, clusters, final_centers)

Example #2 - Clustering using specific distance metric, for example, Manhattan distance:

from pyclustering.utils.metric import distance_metric, type_metric
# input data (sample) and initial centers (initial_centers) are prepared as in Example #1
# create metric that will be used for clustering
manhattan_metric = distance_metric(type_metric.MANHATTAN)
# create instance of K-Means using specific distance metric:
kmeans_instance = kmeans(sample, initial_centers, metric=manhattan_metric)
# run cluster analysis and obtain results
kmeans_instance.process()
clusters = kmeans_instance.get_clusters()
See also
center_initializer

Definition at line 253 of file kmeans.py.
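
The choice of metric in Example #2 matters because it can change which center is considered nearest. The pure-Python sketch below uses hypothetical helper functions (`euclidean`, `manhattan`, `nearest`) rather than the library's 'distance_metric' class, and shows one point that is assigned to different centers under the two metrics.

```python
import math


def euclidean(a, b):
    """Euclidean (straight-line) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Manhattan (city-block) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest(point, centers, metric):
    """Index of the center closest to `point` under the given metric."""
    return min(range(len(centers)), key=lambda c: metric(point, centers[c]))


point = [0.0, 0.0]
centers = [[3.0, 3.0], [5.0, 0.0]]

by_euclidean = nearest(point, centers, euclidean)   # distances: about 4.24 vs 5.0
by_manhattan = nearest(point, centers, manhattan)   # distances: 6.0 vs 5.0
```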

Constructor & Destructor Documentation

◆ __init__()

def pyclustering.cluster.kmeans.kmeans.__init__ (   self,
  data,
  initial_centers,
  tolerance = 0.001,
  ccore = True,
**  kwargs 
)

Constructor of clustering algorithm K-Means.

Center initializer can be used for creating initial centers, for example, K-Means++ method.

Parameters
[in]data(array_like): Input data that is presented as array of points (objects), each point should be represented by array_like data structure.
[in]initial_centers(array_like): Initial coordinates of centers of clusters that are represented by array_like data structure: [center1, center2, ...].
[in]tolerance(double): Stop condition: the algorithm stops when the maximum change of the cluster centers between iterations is less than tolerance.
[in]ccore(bool): Defines whether the CCORE library (C++ pyclustering library) is used instead of the Python code.
[in]**kwargs Arbitrary keyword arguments (available arguments: 'observer', 'metric', 'itermax').

Keyword Args:

  • observer (kmeans_observer): Observer of the algorithm to collect information about clustering process on each iteration.
  • metric (distance_metric): Metric that is used for distance calculation between two points (by default squared Euclidean distance).
  • itermax (uint): Maximum number of iterations that is used for clustering process (by default: 200).
See also
center_initializer

Definition at line 314 of file kmeans.py.
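
The interplay of 'tolerance' and 'itermax' can be sketched as a simple stop check. This is an illustration under assumptions: the change of a center is measured here with the Euclidean distance, which may differ from how the library measures it, and the names `max_center_change` and `should_stop` are hypothetical.

```python
import math


def max_center_change(old_centers, new_centers):
    """Largest distance any center moved between two consecutive iterations."""
    return max(math.dist(old, new) for old, new in zip(old_centers, new_centers))


def should_stop(old_centers, new_centers, iteration, tolerance=0.001, itermax=200):
    """Stop when the centers barely moved or the iteration limit is reached."""
    if iteration >= itermax:
        return True
    return max_center_change(old_centers, new_centers) < tolerance


small_move = should_stop([[0.0, 0.0]], [[0.0, 0.0005]], iteration=1)
large_move = should_stop([[0.0, 0.0]], [[0.0, 0.5]], iteration=1)
limit_hit = should_stop([[0.0, 0.0]], [[0.0, 0.5]], iteration=200)
```

A smaller tolerance gives more precise centers at the cost of more iterations; 'itermax' bounds the run time when the centers keep moving.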

Member Function Documentation

◆ get_centers()

def pyclustering.cluster.kmeans.kmeans.get_centers (   self)

Returns list of centers of allocated clusters.

See also
process()
get_clusters()

Definition at line 462 of file kmeans.py.

◆ get_cluster_encoding()

def pyclustering.cluster.kmeans.kmeans.get_cluster_encoding (   self)

Returns clustering result representation type that indicates how clusters are encoded.

Returns
(type_encoding) Clustering result representation.
See also
get_clusters()

Definition at line 491 of file kmeans.py.
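
Since get_clusters() returns clusters as lists of point indexes, it is sometimes convenient to convert that encoding into one label per point. The helper `index_lists_to_labels` below is a hypothetical pure-Python sketch of such a conversion and is not part of the class.

```python
def index_lists_to_labels(clusters, amount_points):
    """Convert lists of point indexes (one list per cluster) into one label per point."""
    labels = [None] * amount_points   # None marks points that belong to no cluster
    for label, cluster in enumerate(clusters):
        for index in cluster:
            labels[index] = label
    return labels


clusters = [[0, 2], [1, 3, 4]]
labels = index_lists_to_labels(clusters, 5)   # [0, 1, 0, 1, 1]
```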

◆ get_clusters()

def pyclustering.cluster.kmeans.kmeans.get_clusters (   self)

Returns the list of allocated clusters; each cluster contains the indexes of its objects in the input data.

See also
process()
get_centers()

Definition at line 450 of file kmeans.py.

Referenced by pyclustering.samples.answer_reader.get_cluster_lengths(), and pyclustering.cluster.optics.optics.process().

◆ get_total_wce()

def pyclustering.cluster.kmeans.kmeans.get_total_wce (   self)

Returns the sum of metric errors, which depends on the metric used for clustering (by default SSE, Sum of Squared Errors).

Sum of metric errors is calculated using distance between point and its center:

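\[error=\sum_{i=0}^{N-1}\mathrm{distance}(x_{i}, \mathrm{center}(x_{i}))\]

where N is the number of points and center(x_i) is the center of the cluster that contains x_i.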
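
As a worked example of this sum with the default squared Euclidean distance, the sketch below computes the total error for four points in two clusters. The function `total_wce_sketch` is a hypothetical helper written for this illustration, not the method's actual code.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def total_wce_sketch(points, clusters, centers):
    """Sum of distances between every point and the center of its cluster."""
    return sum(squared_euclidean(points[index], centers[label])
               for label, cluster in enumerate(clusters)
               for index in cluster)


points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
clusters = [[0, 1], [2, 3]]
centers = [[0.0, 1.0], [10.0, 11.0]]

# Every point lies at squared distance 1.0 from its center, so the total is 4.0.
wce = total_wce_sketch(points, clusters, centers)
```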

See also
process()
get_clusters()

Definition at line 477 of file kmeans.py.

◆ predict()

def pyclustering.cluster.kmeans.kmeans.predict (   self,
  points 
)

Calculates the closest cluster to each point.

Parameters
[in]points(array_like): Points for which closest clusters are calculated.
Returns
(list) List of closest clusters for each point. Each cluster is denoted by its index. Returns an empty collection if the 'process()' method has not been called.

Definition at line 425 of file kmeans.py.
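
The behaviour described for predict() amounts to a nearest-center lookup. A minimal sketch, assuming the default squared Euclidean distance and using the hypothetical name `nearest_center_indexes`:

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center_indexes(points, centers):
    """For each point, return the index of the closest center ([] if there are no centers)."""
    if not centers:
        return []
    return [min(range(len(centers)),
                key=lambda c: squared_euclidean(point, centers[c]))
            for point in points]


centers = [[0.0, 0.5], [10.0, 10.5]]
new_points = [[1.0, 1.0], [9.0, 9.0], [0.0, 0.0]]

predicted = nearest_center_indexes(new_points, centers)   # [0, 1, 0]
without_centers = nearest_center_indexes(new_points, [])  # []
```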

◆ process()

def pyclustering.cluster.kmeans.kmeans.process (   self)

Performs cluster analysis in line with rules of K-Means algorithm.

Returns
(kmeans) Returns itself (K-Means instance).
See also
get_clusters()
get_centers()

Definition at line 355 of file kmeans.py.


The documentation for this class was generated from the file kmeans.py.