BNS Scaling: An Improved Representation over TF·IDF for SVM Text Classification
Keyword(s): text classification; topic identification; machine learning; feature selection; Support Vector Machine; TF·IDF text representation
Abstract: In machine learning for text classification, TF·IDF is the most widely used representation for real-valued feature vectors. Unfortunately, it is oblivious to the training class labels and therefore scales some features inappropriately. We replace IDF with Bi-Normal Separation (BNS), which was previously found to be excellent at ranking words for feature selection filtering. Empirical evaluation on a benchmark of 237 binary text classification tasks shows substantially better accuracy and F-measure for a Support Vector Machine (SVM) when using the BNS scaling representation. A wide variety of other feature scaling methods, including binary features, were found inferior. Furthermore, BNS scaling yielded better performance without feature selection, obviating the complexities of feature selection.
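As an illustration of replacing IDF with BNS, the following is a minimal sketch rather than the authors' implementation. It assumes the usual definition of Bi-Normal Separation as the absolute difference between the inverse standard Normal CDF of a term's true positive rate and of its false positive rate. The function names `bns_weight` and `scale_document`, the clipping constant `eps=0.0005`, and the choice to multiply raw term counts by the weight are assumptions made for this example; the abstract does not specify them.

```python
from statistics import NormalDist

# Standard Normal distribution (mean 0, standard deviation 1).
_STD_NORMAL = NormalDist()


def bns_weight(tp, fp, pos, neg, eps=0.0005):
    """Return an assumed BNS weight for one term.

    tp  -- number of positive training documents containing the term
    fp  -- number of negative training documents containing the term
    pos -- total number of positive training documents
    neg -- total number of negative training documents
    eps -- assumed clipping bound; keeps the rates away from 0 and 1,
           where the inverse Normal CDF is undefined
    """
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)
    return abs(_STD_NORMAL.inv_cdf(tpr) - _STD_NORMAL.inv_cdf(fpr))


def scale_document(term_counts, weights):
    """Scale each term count by its class-aware weight.

    This mirrors TF·IDF with the IDF factor swapped for the BNS weight.
    Terms without a weight are scaled to zero (an assumption).
    """
    return {t: c * weights.get(t, 0.0) for t, c in term_counts.items()}


# Hypothetical counts for a two-term vocabulary, 100 documents per class.
weights = {
    "refund": bns_weight(tp=90, fp=10, pos=100, neg=100),  # discriminative
    "the": bns_weight(tp=50, fp=50, pos=100, neg=100),     # uninformative
}
scaled = scale_document({"refund": 2, "the": 7}, weights)
```

Unlike IDF, the weight depends on the class labels: a term that occurs equally often in both classes receives a weight of zero, however rare or common it is overall.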
Additional Publication Information: To be presented and published in the ACM 17th Conference on Information and Knowledge Management, Napa Valley, CA, October 26-30, 2008.
External Posting Date: August 6, 2008. Approved for External Publication.
Internal Posting Date: August 6, 2008