Ensembles of Models for Automated Diagnosis of System Performance Problems
Zhang, Steve; Cohen, Ira; Goldszmidt, Moises; Symons, Julie; Fox, Armando
Keyword(s): automated diagnosis; self-healing and self-monitoring systems; statistical induction and Bayesian Model Management
Abstract: Violations of service level objectives (SLOs) in Internet services are urgent conditions requiring immediate attention. Previously we explored [1] an approach for identifying which low-level system properties were correlated with high-level SLO violations (the metric attribution problem). The approach is based on automatically inducing models from data using pattern recognition and probability modeling techniques. In this paper we extend our approach to adapt to changing workloads and external disturbances by maintaining an ensemble of probabilistic models, adding new models when existing ones do not accurately capture current system behavior. Using realistic workloads on an implemented prototype system, we show that the ensemble of models captures the performance behavior of the system accurately under changing workloads and conditions. We fuse information from the models in the ensemble to identify likely causes of the performance problem, with results comparable to those produced by an oracle that continuously changes the model based on advance knowledge of the workload. The cost of inducing new models and managing the ensembles is negligible, making our approach both immediately practical and theoretically appealing.
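The ensemble-maintenance idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the per-metric threshold classifier stands in for the paper's probabilistic models, and the names `induce_model`, `Ensemble`, `min_accuracy`, the balanced-accuracy test, and the choice to diagnose with the single best-fitting model are all assumptions made for illustration.

```python
from statistics import mean

# A window is a list of (metrics, violated) samples, where metrics is a dict
# of metric name -> value and violated is a bool. All names are hypothetical.

def induce_model(window):
    """Fit a toy threshold model (a stand-in for a probabilistic model)."""
    bad = [m for m, v in window if v]
    ok = [m for m, v in window if not v]
    model = {}
    for name in window[0][0]:
        hi = mean(m[name] for m in bad)
        lo = mean(m[name] for m in ok)
        if hi != lo:  # keep only metrics that separate the two classes
            model[name] = ((hi + lo) / 2, hi > lo)
    return model

def flags(model, metrics):
    """Metrics whose value falls on the violation side of the threshold."""
    return [n for n, (t, above) in model.items() if (metrics[n] > t) == above]

def predict(model, metrics):
    """Predict a violation when most modelled metrics are flagged."""
    return len(flags(model, metrics)) * 2 > len(model)

def balanced_accuracy(model, window):
    """Average of the detection rates on violating and compliant samples."""
    tp = [predict(model, m) for m, v in window if v]
    tn = [not predict(model, m) for m, v in window if not v]
    return (sum(tp) / len(tp) + sum(tn) / len(tn)) / 2

class Ensemble:
    def __init__(self, min_accuracy=0.9):  # assumed threshold
        self.models = []
        self.min_accuracy = min_accuracy

    def update(self, window):
        """Add a model only if no existing one explains the window."""
        if {v for _, v in window} != {True, False}:
            return False  # need both classes to score or induce a model
        best = max((balanced_accuracy(m, window) for m in self.models),
                   default=0.0)
        if best >= self.min_accuracy:
            return False
        self.models.append(induce_model(window))
        return True

    def diagnose(self, window, metrics):
        """Attribute metrics using the model that best fits recent data."""
        best = max(self.models, key=lambda m: balanced_accuracy(m, window))
        return flags(best, metrics)
```

Under these assumptions, calling `update` on each new window of labelled samples grows the ensemble only when the workload changes enough that no stored model fits, and `diagnose` returns the metrics that the best-fitting model places on the violation side.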