Adventures in Feature Selection on an Industrial Dataset... and Ensuing General Discoveries

Forman, George
HP Laboratories


Keyword(s): text feature selection; text classification; document categorization; lessons learned

Abstract: We relate the story of an interesting failure of text feature selection methods on an industrial dataset of technical documents. Our detailed dissection and ultimate understanding of the failure led to general solutions that not only solved the robustness problem we faced, but also improved classification accuracy on simpler, public datasets, which was crucial to making the work publishable.

Additional Publication Information: To be published in the proceedings of Silver 2012: The Silver Lining: learning from unexpected results, ECML/PKDD 2012 Workshop

External Posting Date: September 21, 2012 [Fulltext]. Approved for External Publication
Internal Posting Date: September 21, 2012 [Fulltext]

