Click here for full text:
Dependence of Clustering Algorithm Performance on Clustered-ness of Data
Keyword(s): clustering; K-Means; K-Harmonic Means; EM; Data Mining
Abstract: Intuitively, clustering algorithms should work better on the datasets that have well separated clusters. But we found the contrary for the center-based clustering algorithms, including K-Means, K-Harmonic Means and EM. We generated 1200 synthetic datasets with varying ratio of inter-cluster variance over within-cluster variance, which we call the clustered-ness of the dataset. We run K-Means, K-Harmonic Means and EM on these datasets and found that the ratio of the performance over the global optimum grows with increasing clustered-ness. Dependence of clustering algorithm performance on other parameters -- quality of initialization and dimensionality of data -- are also demonstrated.
Back to Index