
HP Labs Technical Reports
Accurate Recasting of Parameter Estimation Algorithms Using Sufficient Statistic for Efficient Parallel Speedup: Demonstrated for Center-Based Data Clustering Algorithms
Zhang, Bin; Hsu, Meichun; Forman, George
HPL-2000-94
Keyword(s): parallel algorithms; data mining; data clustering; K-Means; K-Harmonic Means; Expectation-Maximization; speedup; scaleup
Abstract: Fueled by advances in computer technology and online business, data collection is rapidly accelerating, and so is the importance of its analysis (data mining). Increasing database sizes strain the scalability of many data mining algorithms. Data clustering is one of the fundamental techniques in data mining solutions. The many clustering algorithms developed face new challenges with growing data sets. Algorithms with quadratic or higher computational complexity, such as agglomerative algorithms, drop out quickly. More efficient algorithms, such as K-Means and EM with linear cost per iteration, still need work to scale up to large data sets. This paper shows that many parameter estimation algorithms, including K-Means, K-Harmonic Means and EM, can be recast without approximation in terms of Sufficient Statistics, yielding a superior speedup efficiency. Estimates using today's workstations and local area network technology suggest efficient speedup to several hundred computers, leading to effective scaleup for clustering hundreds of gigabytes of data. Parallel clustering has been implemented in a parallel programming language, ZPL. Experimental results show above 90% utilization.
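To illustrate the recasting idea for the K-Means case, here is a minimal single-process sketch in Python. It is not the report's ZPL implementation; the function names (`local_stats`, `merge_stats`, `update_centers`, `kmeans_step`) and the choice of per-cluster counts and coordinate sums as the sufficient statistics are assumptions made for this example. Each data partition is reduced to a small summary, the summaries are added together, and the new centers computed from the merged summary match those computed from the undivided data, which is what "without approximation" means here.

```python
from typing import List, Sequence, Tuple

# Per-cluster point counts and per-cluster coordinate sums (assumed statistics).
Stats = Tuple[List[int], List[List[float]]]


def nearest(point: Sequence[float], centers: Sequence[Sequence[float]]) -> int:
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])),
    )


def local_stats(chunk: Sequence[Sequence[float]],
                centers: Sequence[Sequence[float]]) -> Stats:
    """Summarize one data partition; this is the part that could run in parallel."""
    k, dim = len(centers), len(centers[0])
    counts = [0] * k
    sums = [[0.0] * dim for _ in range(k)]
    for point in chunk:
        j = nearest(point, centers)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += point[d]
    return counts, sums


def merge_stats(parts: Sequence[Stats]) -> Stats:
    """Add the partition summaries element-wise (the global reduction step)."""
    k, dim = len(parts[0][0]), len(parts[0][1][0])
    counts = [0] * k
    sums = [[0.0] * dim for _ in range(k)]
    for part_counts, part_sums in parts:
        for j in range(k):
            counts[j] += part_counts[j]
            for d in range(dim):
                sums[j][d] += part_sums[j][d]
    return counts, sums


def update_centers(stats: Stats,
                   old_centers: Sequence[Sequence[float]]) -> List[List[float]]:
    """New center = coordinate sum / count; an empty cluster keeps its old center."""
    counts, sums = stats
    return [
        [s / n for s in cluster_sum] if n else list(old)
        for n, cluster_sum, old in zip(counts, sums, old_centers)
    ]


def kmeans_step(chunks: Sequence[Sequence[Sequence[float]]],
                centers: Sequence[Sequence[float]]) -> List[List[float]]:
    """One K-Means iteration over data that has been split into chunks."""
    merged = merge_stats([local_stats(chunk, centers) for chunk in chunks])
    return update_centers(merged, centers)
```

In this sketch only the counts and sums (a volume proportional to the number of clusters times the dimension, independent of the number of data points) would need to cross the network per iteration, which is the intuition behind the speedup estimate in the abstract.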
10 Pages
