HP Labs Technical Reports
Click here for full text:
A Local Search Approach to K-Clustering
Zhang, Bin; Kleyner, Gary; Hsu, Meichun
Keyword(s): local search algorithm; data clustering; K-means; clustering aggregated data; clustering large data sets; data compression, vector quantization
Abstract: Data clustering is one of the common techniques used in data mining. A popular performance function for measuring goodness of the K-clustering is the total within-cluster variance, or the total mean-square quantization error (MSE). The K-Means (KM) algorithm is a popular algorithm which attempts to find a K- clustering which minimizes MSE. In this paper, we approach the min-MSE clustering problem by way of a Local Search (LS) algorithm, and analytically derive a clustering algorithm which we call LKM. A number of analyses of LKM are given; in particular, we prove that the set of local optima that can trap KM is a superset of those that can trap LKM. The experimental results also show that LKM converges faster and better than KM. More importantly, LKM naturally extends to an aggregated version, called A-LKM, which can be applied to the problem of clustering large data sets. A-LKM is a clustering algorithm which clusters subsets of data points, or subclusters, instead of individual data points. It can be used to cluster, for example, a large data set that has been aggregated through an algorithm such as the Phase 1 of the BIRCH algorithm ([ZRL96]), with the intention of fitting the aggregated data into the main memory to enable main memory-based clustering. We prove that A-LKM, as applied to the problem of clustering subclusters, preserve the monotone convergence property. Experimental results also show that A-LKM performs better than A-KM, in clustering aggregated data.
Back to Index