Quantifying Counts, Costs, and Trends Accurately via Machine Learning

Forman, George

HPL-2007-164R1

Keyword(s): supervised machine learning, classification, prevalence estimation, class distribution estimation, cost quantification, quantification research methodology, minimizing training effort, detecting and tracking trends, concept drift, class imbalance, text mining

Abstract: In many business and science applications, it is important to track trends over historical data, for example, measuring the monthly prevalence of influenza incidents at a hospital. In situations where a machine learning classifier is needed to identify the relevant incidents from among all cases in the database, anything less than perfect classification accuracy will result in a consistent and potentially substantial bias in estimating the class prevalence. Machine learning ubiquitously assumes that the class distribution of the training set matches that of the test set, but this is certainly not the case for applications where the goal is to measure changes or trends in the distribution over time. The paper defines two research challenges for machine learning that address this distribution mismatch problem. The 'quantification' task is to accurately estimate the number of positive cases (or class distribution) in an unlabeled test set via machine learning, using a limited training set that may have a substantially different class distribution. The 'cost quantification' task is to estimate the total cost associated with the positive class, where each case is tagged with a cost attribute, such as the hours of labor needed to resolve the case. Obtaining a precise quantification estimate over a set of cases has a very different utility model from traditional classification research, whose goal is to obtain an accurate classification for each individual case. For both forms of quantification, the paper describes a suitable experimental methodology and evaluates a variety of methods. It reveals which methods give more reliable estimates, even when training data is scarce and the testing class distribution differs widely from that of training. Some methods function well even under high class imbalance, e.g., 1% positives. These strengths can make quantification practical for business use, even where classification accuracy is poor.
Publication Info: To be published in the international journal Data Mining and Knowledge Discovery, in a special issue on Utility-Based Data Mining.

25 Pages
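
The bias the abstract describes can be illustrated with a small sketch. It assumes a binary classifier whose true positive rate (tpr) and false positive rate (fpr) are known and stay fixed while the class prevalence changes. Under that assumption, the fraction of cases the classifier flags is tpr * p + fpr * (1 - p) rather than the true prevalence p, and inverting that relation gives a standard correction often called "adjusted count". The abstract does not say which methods the paper evaluates, so this is not presented as the paper's method. The function names and the example rates below are hypothetical, chosen only for illustration.

```python
def expected_observed_rate(p, tpr, fpr):
    """Fraction of cases flagged positive when the true prevalence is p.

    Assumes tpr and fpr do not change with the class distribution.
    """
    return tpr * p + fpr * (1.0 - p)


def adjusted_count(observed_rate, tpr, fpr):
    """Invert the relation above to estimate the true prevalence.

    The result is clipped to [0, 1] because estimated rates can push
    the raw value slightly outside the valid range.
    """
    if tpr <= fpr:
        raise ValueError("tpr must exceed fpr for the correction to be defined")
    estimate = (observed_rate - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, estimate))


if __name__ == "__main__":
    # Hypothetical classifier quality, for illustration only.
    tpr, fpr = 0.80, 0.05
    for true_p in (0.01, 0.10, 0.50):
        observed = expected_observed_rate(true_p, tpr, fpr)
        corrected = adjusted_count(observed, tpr, fpr)
        print(f"true={true_p:.2f} classify-and-count={observed:.4f} "
              f"adjusted={corrected:.4f}")
```

With these made-up rates and a true prevalence of 1%, simply counting the classifier's positive predictions reports roughly 5.75%, while the corrected estimate recovers the 1% figure. In practice tpr and fpr would themselves be estimated from limited training data (for example by cross-validation), so the correction is only as reliable as those estimates.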

© 2009 Hewlett-Packard Development Company, L.P.