Jump to content United States-English
HP.com Home Products and Services Support and Drivers Solutions How to Buy
» Contact HP

HP.com home

Technical Reports


HP Labs

» Research
» News and events
» Technical reports
» About HP Labs
» Careers @ HP Labs
» Worldwide sites
» Downloads
Content starts here

Click here for full text: PDF

Achieving Scalable Automated Diagnosis of Distributed Systems Performance Problems

Huang, Chengdu; Cohen, Ira; Symons, Julie; Abdelzaher, Tarek


Keyword(s): system performance diagnosis; machine learning; transfer learning; scalability

Abstract: Distributed systems continue to grow in scale and complexity, resulting in increasingly more involved interactions among components and increasingly more intricate failure modes that are very hard to diagnose manually. This increased vulnerability of larger systems, together with the increased difficulty of failure diagnosis, has motivated machine learning approaches to automate the diagnosis task. While preliminary encouraging results are achieved, scaling up the existing approaches to large applications remains challenging. With increase in scale, current approaches suffer the curse of dimensionality exacerbated by the exploding set of system states and measured metrics. In this paper, we significantly improve scalability of performance diagnosis methods. Our contributions lie in the use of (i) an intelligent partitioning of the metric space, coupled with a cooperative temporal segmentation algorithm, dividing system observations in time and in space to remove the multiplicative explosion of system states, and (ii) transfer learning techniques that improve accuracy by leveraging dependencies among the partitions. We validate our approaches on several months of production traces from a customer-facing geographically distributed, 24x7, 3-tier internet service. Our results show a significant accuracy improvement (350n average) over the naive partitioning of the state space (without the new temporal segmentation algorithm or transfer learning), and an order of magnitude reduction in computational cost over the .brute force. approach of learning with no partitioning, without loss of accuracy.

14 Pages

Back to Index

»Technical Reports

» 2009
» 2008
» 2007
» 2006
» 2005
» 2004
» 2003
» 2002
» 2001
» 2000
» 1990 - 1999

Heritage Technical Reports

» Compaq & DEC Technical Reports
» Tandem Technical Reports
Printable version
Privacy statement Using this site means you accept its terms Feedback to HP Labs
© 2009 Hewlett-Packard Development Company, L.P.