HP Labs Technical Reports
Click here for full text:
Linear Speed-Up for a Parallel Non-Approximate Recasting of Center-Based Clustering Algorithms, including K-Means, K-Harmonic Means, and EM
Forman, George; Zhang, Bin
Keyword(s): Expectation-Maximization; multidimensional data clustering; data mining; very large databases; parallel algorithms; scale-up
Abstract: Data clustering is one of the fundamental techniques in scientific data analysis and data mining. It partitions a data set into groups of similar items, as measured by some distance metric. Over the years, data set sizes have grown rapidly with the exponential growth of computer storage and increasingly automated business and manufacturing processes. To cluster such large data sets, efficient parallel algorithms are called for, both to reduce the computation time, and to bring the resources of multiple machines to bear on a given large problem in order to scale up the largest problem size one can handle. We describe a technique for parallelizing center-based data clustering algorithms. The central idea is to communicate only sufficient statistics, yielding linear speed-up with excellent efficiency. The technique does not involve approximation and may be used orthogonally in conjunction with sampling or aggregation-based methods, such as BIRCH, to lessen the quality degradation of their approximation or to handle larger data sets. We have demonstrated elsewhere the decomposition and theoretical speed-up behaviors for three clustering algorithms: K-Means, K-Harmonic Means and Expectation Maximization (EM) [Zhang, Hsu, Forman '00]. Here we present experimental measurements of their parallel behaviors, e.g., 93% speed-up efficiency for EM on a million data points spread over 128 computers connected via 100baseT Ethernet.
Back to Index