
HP Labs Technical Reports
Click here for full text:
Linear SpeedUp for a Parallel NonApproximate Recasting of CenterBased Clustering Algorithms, including KMeans, KHarmonic Means, and EM
Forman, George; Zhang, Bin
HPL200093
Keyword(s): ExpectationMaximization; multidimensional data clustering; data mining; very large databases; parallel algorithms; scaleup
Abstract: Data clustering is one of the fundamental techniques in scientific data analysis and data mining. It partitions a data set into groups of similar items, as measured by some distance metric. Over the years, data set sizes have grown rapidly with the exponential growth of computer storage and increasingly automated business and manufacturing processes. To cluster such large data sets, efficient parallel algorithms are called for, both to reduce the computation time, and to bring the resources of multiple machines to bear on a given large problem in order to scale up the largest problem size one can handle. We describe a technique for parallelizing centerbased data clustering algorithms. The central idea is to communicate only sufficient statistics, yielding linear speedup with excellent efficiency. The technique does not involve approximation and may be used orthogonally in conjunction with sampling or aggregationbased methods, such as BIRCH, to lessen the quality degradation of their approximation or to handle larger data sets. We have demonstrated elsewhere the decomposition and theoretical speedup behaviors for three clustering algorithms: KMeans, KHarmonic Means and Expectation Maximization (EM) [Zhang, Hsu, Forman '00]. Here we present experimental measurements of their parallel behaviors, e.g., 93% speedup efficiency for EM on a million data points spread over 128 computers connected via 100baseT Ethernet.
6 Pages
Back to Index
