Technical Reports



Text Classification for Data Loss Prevention

Hart, Michael; Manadhata, Pratyusa K.; Johnson, Rob
HP Laboratories


Keyword(s): Data Loss Prevention; DLP; SVM; Text Classification

Abstract: Businesses, governments, and individuals leak confidential information, both accidentally and maliciously, at tremendous cost in money, privacy, national security, and reputation. Several security software vendors now offer "data loss prevention" (DLP) solutions that use simple algorithms, such as keyword lists and hashing, which are too coarse to capture the features that make sensitive documents secret. In this paper, we present automatic text classification algorithms for classifying enterprise documents as either sensitive or non-sensitive. We also introduce a novel training strategy, supplement and adjust, to create a classifier that has a low false discovery rate, even when presented with documents unrelated to the enterprise. We evaluated our algorithm on several corpora that we assembled from confidential documents published on WikiLeaks and other archives. Our classifier had a false negative rate of less than 3.0% and a false discovery rate of less than 1.0% on all our tests (i.e., in a real deployment, the classifier can identify more than 97% of information leaks while raising at most 1 false alarm in every 100 alerts).
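
The two error rates quoted in the abstract are easy to confuse, so a minimal sketch of how each is computed from confusion-matrix counts may help. The function names and the document counts below are hypothetical and chosen only for illustration; they are not results from the report.

```python
def false_negative_rate(true_pos: int, false_neg: int) -> float:
    """Fraction of truly sensitive documents that the classifier missed."""
    total_sensitive = true_pos + false_neg
    if total_sensitive == 0:
        raise ValueError("no sensitive documents in the test set")
    return false_neg / total_sensitive


def false_discovery_rate(true_pos: int, false_pos: int) -> float:
    """Fraction of documents flagged as sensitive that were not sensitive."""
    total_flagged = true_pos + false_pos
    if total_flagged == 0:
        raise ValueError("the classifier flagged no documents")
    return false_pos / total_flagged


# Hypothetical counts (assumed for illustration, not taken from the report):
# 1000 sensitive documents, of which 975 are flagged and 25 are missed,
# plus 5 non-sensitive documents flagged by mistake.
fnr = false_negative_rate(true_pos=975, false_neg=25)
fdr = false_discovery_rate(true_pos=975, false_pos=5)
print(f"FNR = {fnr:.1%}, FDR = {fdr:.1%}")
```

Note that the false discovery rate is measured over the documents the classifier flags, not over all non-sensitive documents, which is why it corresponds to the share of alerts that are false alarms.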

21 Pages

External Posting Date: August 6, 2011. Approved for External Publication
Internal Posting Date: August 6, 2011
