
HP Labs Technical Reports
Click here for full text:
A Local Search Approach to KClustering
Zhang, Bin; Kleyner, Gary; Hsu, Meichun
HPL1999119
Keyword(s): local search algorithm; data clustering; Kmeans; clustering aggregated data; clustering large data sets; data compression, vector quantization
Abstract: Data clustering is one of the common techniques used in data mining. A popular performance function for measuring goodness of the Kclustering is the total withincluster variance, or the total meansquare quantization error (MSE). The KMeans (KM) algorithm is a popular algorithm which attempts to find a K clustering which minimizes MSE. In this paper, we approach the minMSE clustering problem by way of a Local Search (LS) algorithm, and analytically derive a clustering algorithm which we call LKM. A number of analyses of LKM are given; in particular, we prove that the set of local optima that can trap KM is a superset of those that can trap LKM. The experimental results also show that LKM converges faster and better than KM. More importantly, LKM naturally extends to an aggregated version, called ALKM, which can be applied to the problem of clustering large data sets. ALKM is a clustering algorithm which clusters subsets of data points, or subclusters, instead of individual data points. It can be used to cluster, for example, a large data set that has been aggregated through an algorithm such as the Phase 1 of the BIRCH algorithm ([ZRL96]), with the intention of fitting the aggregated data into the main memory to enable main memorybased clustering. We prove that ALKM, as applied to the problem of clustering subclusters, preserve the monotone convergence property. Experimental results also show that ALKM performs better than AKM, in clustering aggregated data.
20 Pages
Back to Index
