Technical Reports

HP Labs

A Method for Discovering the Insignificance of One's Best Classifier and the Unlearnability of a Classification Task

Forman, George


Keyword(s): supervised machine learning; overfitting; 2001 KDD Cup thrombin classification competition

Abstract: Consider the following common scenario: a data mining practitioner tries various specialized classification algorithms on a new dataset of unknown difficulty and selects the apparent best. Supposing its accuracy were 70% on a held-out test set, how can one know whether this is a significant result or not? It can be difficult to tell in the absence of standard benchmark results for the dataset. Surprisingly, it can also be difficult to tell even when the dataset has hundreds of benchmark results. This paper presents a method to address this question by comparing the chosen best classifier to the distribution of performance scores obtained by many simple classifiers that are randomly generated. This can also serve to discover when a classification problem appears nearly unlearnable. It is demonstrated for the results of the 2001 KDD Cup thrombin competition.

Notes: To be published in and presented at Data Mining Lessons Learned Workshop, the 19th International Conference on Machine Learning (ICML), 8-12 July 2002, Sydney, Australia.
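
The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the report's implementation: the abstract does not say how the random simple classifiers are built, so the sketch assumes binary features and a hypothetical "simple classifier" that predicts from one randomly chosen feature with a random polarity. All function names are illustrative.

```python
import random

def random_simple_classifier(n_features, rng):
    """Assumed form of a 'simple classifier': predict the value of one
    randomly chosen binary feature, optionally inverted."""
    j = rng.randrange(n_features)
    flip = rng.random() < 0.5
    return lambda x: (1 - x[j]) if flip else x[j]

def accuracy(clf, X, y):
    """Fraction of test examples the classifier labels correctly."""
    return sum(clf(x) == t for x, t in zip(X, y)) / len(y)

def random_score_distribution(X, y, n_classifiers=1000, seed=0):
    """Accuracy scores of many random simple classifiers on the test set."""
    rng = random.Random(seed)
    n_features = len(X[0])
    return [
        accuracy(random_simple_classifier(n_features, rng), X, y)
        for _ in range(n_classifiers)
    ]

def fraction_at_least(scores, best_score):
    """Share of random classifiers scoring at least as well as the chosen best."""
    return sum(s >= best_score for s in scores) / len(scores)
```

Under these assumptions, one would score the random classifiers on the same held-out test set as the chosen classifier and then call `fraction_at_least(scores, 0.70)`. If a sizable share of random classifiers reach 70%, the chosen classifier's result is hard to call significant; if the whole distribution clusters near the majority-class rate, the task may be close to unlearnable from the given features.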

5 Pages
