Feature Selection "Tomography" - Illustrating that Optimal Feature Filtering is Hopelessly Ungeneralizable
Keyword(s): Feature selection; text categorization; visualization; machine learning classification
Abstract: Feature filtering methods are often used in text classification and other high-dimensional domains to quickly score each feature independently and pass only the best to the learning algorithm. The panoply of available methods has grown over the years, with frequent research publications touting new scoring functions that seem to yield superior learning: variants on Information Gain, Chi Squared, Mutual Information, and others. This slow generate-and-test search in the literature is usually counted as progress towards finding superior filter methods, but that progress is illusory. We provide a new empirical method to reveal the feature preference surface for a given dataset and classifier: cross-validating with one additional feature whose noise characteristics are controlled. By repeating this evaluation at every sensitivity and specificity level, which is computationally expensive, we determine the expected benefit of adding a feature with given characteristics. Features with high expected gain are the ones a filter should prefer. Ideally this preference surface would generalize easily across many datasets and depend on just a few parameters. However, by visualizing the preference surface under different conditions while holding other factors constant, we demonstrate graphically that it depends on more factors than are ever considered in feature selection papers: training set size and class distribution, classifier model and parameters, and, even holding all of these constant, the individual target concept itself. Thus, the ongoing sequence of published papers in this area will not yield an optimal feature selection function of significant generality.
External Posting Date: February 6, 2012. Approved for External Publication
Internal Posting Date: February 6, 2012
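The procedure outlined in the abstract (add one synthetic feature with controlled sensitivity and specificity, cross-validate, and record the gain at each grid point) can be sketched roughly as follows. This is only an illustrative sketch, not the report's implementation. It assumes a small Bernoulli Naive Bayes as a stand-in classifier, accuracy as the performance measure, a toy dataset, and a single random draw per grid point, where a real study would presumably average over many draws. All function names (`synthetic_feature`, `fit_nb`, `predict_nb`, `cv_accuracy`, `preference_surface`, `toy_dataset`) are hypothetical.

```python
import math
import random

def synthetic_feature(labels, sensitivity, specificity, rng):
    """Binary feature that is on for positives with probability `sensitivity`
    and off for negatives with probability `specificity`."""
    feature = []
    for y in labels:
        p_on = sensitivity if y == 1 else 1.0 - specificity
        feature.append(1 if rng.random() < p_on else 0)
    return feature

def fit_nb(X, y):
    """Bernoulli Naive Bayes with Laplace smoothing (assumed stand-in classifier)."""
    n_feat = len(X[0])
    model = {}
    for c in (0, 1):
        rows = [x for x, t in zip(X, y) if t == c]
        prior = (len(rows) + 1) / (len(X) + 2)
        p_on = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                for j in range(n_feat)]
        model[c] = (math.log(prior), p_on)
    return model

def predict_nb(model, x):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c, (log_prior, p_on) in model.items():
        score = log_prior
        for xj, p in zip(x, p_on):
            score += math.log(p if xj else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)

def cv_accuracy(X, y, folds=5):
    """Plain k-fold cross-validated accuracy (interleaved folds)."""
    n = len(y)
    correct = 0
    for k in range(folds):
        test_idx = set(range(k, n, folds))
        X_tr = [X[i] for i in range(n) if i not in test_idx]
        y_tr = [y[i] for i in range(n) if i not in test_idx]
        model = fit_nb(X_tr, y_tr)
        correct += sum(predict_nb(model, X[i]) == y[i] for i in test_idx)
    return correct / n

def preference_surface(X, y, grid, folds=5, seed=0):
    """Gain in CV accuracy from adding one synthetic feature, per grid point."""
    rng = random.Random(seed)
    base = cv_accuracy(X, y, folds)
    surface = {}
    for sens in grid:
        for spec in grid:
            extra = synthetic_feature(y, sens, spec, rng)
            X_aug = [row + [f] for row, f in zip(X, extra)]
            surface[(sens, spec)] = cv_accuracy(X_aug, y, folds) - base
    return base, surface

def toy_dataset(n=200, positive_rate=0.3, seed=1):
    """Toy data: two weakly informative binary features (illustration only)."""
    rng = random.Random(seed)
    y = [1 if rng.random() < positive_rate else 0 for _ in range(n)]
    f1 = synthetic_feature(y, 0.6, 0.7, rng)
    f2 = synthetic_feature(y, 0.55, 0.6, rng)
    return [list(pair) for pair in zip(f1, f2)], y

X, y = toy_dataset()
baseline, surface = preference_surface(X, y, grid=[0.0, 0.5, 1.0])
for (sens, spec), gain in sorted(surface.items()):
    print(f"sensitivity={sens:.1f} specificity={spec:.1f} gain={gain:+.3f}")
```

Rerunning `preference_surface` with a different `toy_dataset` size or `positive_rate`, or with another classifier in place of `fit_nb` and `predict_nb`, is one way to see how the surface shifts with the factors the abstract lists. A finer grid gives a smoother picture, at a cost that grows with the square of the grid size.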