pyclustering
0.10.1
pyclustering is a Python, C++ data mining library.
Enumeration of splitting types that can be used as the splitting criterion when creating clusters in the X-Means algorithm.
Static Public Attributes
int BAYESIAN_INFORMATION_CRITERION = 0
Bayesian information criterion (BIC) to approximate the correct number of clusters.
int MINIMUM_NOISELESS_DESCRIPTION_LENGTH = 1
Minimum noiseless description length (MNDL) to approximate the correct number of clusters [37].
Enumeration of splitting types that can be used as the splitting criterion when creating clusters in the X-Means algorithm.
static int BAYESIAN_INFORMATION_CRITERION = 0
Bayesian information criterion (BIC) to approximate the correct number of clusters.
Kass's formula is used to calculate BIC:
\[BIC(\theta) = L(D) - \frac{1}{2}p\ln(N)\]
The number of free parameters \(p\) is simply the sum of \(K - 1\) class probabilities, \(MK\) centroid coordinates, and one variance estimate:
\[p = (K - 1) + MK + 1\]
The log-likelihood of the data:
\[L(D) = n_j\ln(n_j) - n_j\ln(N) - \frac{n_j}{2}\ln(2\pi) - \frac{n_jd}{2}\ln(\hat{\sigma}^2) - \frac{n_j - K}{2}\]
The maximum likelihood estimate (MLE) for the variance:
\[\hat{\sigma}^2 = \frac{1}{N - K}\sum\limits_{j}\sum\limits_{i}||x_{ij} - \hat{C}_j||^2\]
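The formulas above can be sketched as a small scoring function. This is an illustrative transcription of Kass's formula, not pyclustering's internal implementation; the function name `bic_score` and its input layout (clusters as lists of points plus matching centroids) are assumptions for the example.

```python
import math

def bic_score(clusters, centers):
    """Kass-formula BIC for a clustering (illustrative sketch, higher = better).

    clusters: list of clusters, each a list of points (lists of coordinates).
    centers:  matching list of centroids.
    """
    K = len(clusters)
    N = sum(len(cluster) for cluster in clusters)
    M = len(centers[0])  # dimensionality (M in the parameter count, d in L(D))

    # MLE of the shared variance: sum of squared distances to own centroid
    sse = sum(
        sum((xk - ck) ** 2 for xk, ck in zip(point, center))
        for cluster, center in zip(clusters, centers)
        for point in cluster
    )
    sigma2 = max(sse / (N - K) if N > K else 1e-12, 1e-12)  # guard log(0)

    # free parameters: (K - 1) class probabilities + M*K coordinates + 1 variance
    p = (K - 1) + M * K + 1

    # log-likelihood L(D): sum the per-cluster terms from the formula above
    L = 0.0
    for cluster in clusters:
        n_j = len(cluster)
        L += (n_j * math.log(n_j) - n_j * math.log(N)
              - n_j / 2.0 * math.log(2.0 * math.pi)
              - n_j * M / 2.0 * math.log(sigma2)
              - (n_j - K) / 2.0)

    return L - p / 2.0 * math.log(N)
```

In X-Means this score is compared for a parent cluster versus its two children: two well-separated groups score higher as separate clusters than merged into one, which is what triggers a split.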
static int MINIMUM_NOISELESS_DESCRIPTION_LENGTH = 1
Minimum noiseless description length (MNDL) to approximate the correct number of clusters [37].
Beheshti's formula is used to calculate the upper bound:
\[Z = \frac{\sigma^2 \sqrt{2K} }{N}(\sqrt{2K} + \beta) + W - \sigma^2 + \frac{2\alpha\sigma}{\sqrt{N}}\sqrt{\frac{\alpha^2\sigma^2}{N} + W - \left(1 - \frac{K}{N}\right)\frac{\sigma^2}{2}} + \frac{2\alpha^2\sigma^2}{N}\]
where \(\alpha\) and \(\beta\) represent the parameters for validation probability and confidence probability.
To improve clustering results, a deliberate contradiction is introduced: the estimators below use non-squared distances.
\[W = \frac{1}{n_j}\sum\limits_{i}||x_{ij} - \hat{C}_j||\]
\[\hat{\sigma}^2 = \frac{1}{N - K}\sum\limits_{j}\sum\limits_{i}||x_{ij} - \hat{C}_j||\]
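A minimal sketch of the upper bound \(Z\) follows, transcribing the formula above directly. The function name `mndl_upper_bound`, the default values for \(\alpha\) and \(\beta\), the aggregation of the per-cluster \(W\) terms by summation, and the clamp on the radicand are all assumptions of this example, not pyclustering's actual implementation.

```python
import math

def mndl_upper_bound(clusters, centers, alpha=0.9, beta=0.9):
    """Beheshti's upper bound Z (illustrative transcription).

    alpha, beta: parameters for validation and confidence probability.
    """
    K = len(clusters)
    N = sum(len(cluster) for cluster in clusters)

    def dist(x, c):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))

    # contradictory estimators: non-squared distances on purpose
    # (W summed over clusters is an assumption of this sketch)
    W = sum((1.0 / len(cluster)) * sum(dist(x, c) for x in cluster)
            for cluster, c in zip(clusters, centers))
    sigma2 = sum(dist(x, c)
                 for cluster, c in zip(clusters, centers)
                 for x in cluster) / (N - K)
    sigma = math.sqrt(sigma2)

    sqrt2k = math.sqrt(2.0 * K)
    # the radicand can go negative for poor clusterings; clamp at zero
    radicand = max(0.0,
                   alpha ** 2 * sigma2 / N + W - (1.0 - K / N) * sigma2 / 2.0)

    return (sigma2 * sqrt2k / N * (sqrt2k + beta)
            + W - sigma2
            + 2.0 * alpha * sigma / math.sqrt(N) * math.sqrt(radicand)
            + 2.0 * alpha ** 2 * sigma2 / N)
```

For a tight, well-separated clustering the bound stays small; X-Means uses such bounds to decide whether splitting a cluster is justified.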