Technical Reports

HPL-2012-27

Text Documents as Social Networks

Balinsky, Helen; Balinsky, Alexander; Simske, Steven J.
HP Laboratories

Keyword(s): Small world network; unusual behavior detection; Helmholtz principle; social networks; keyword extraction; data mining

Abstract: The extraction of keywords and features is a fundamental problem in text data mining. Document processing applications depend directly on the quality and speed with which salient terms and phrases are identified. Applications as disparate as automatic document classification, information visualization, filtering, and security policy enforcement all rely on the quality of automatically extracted keywords. Recently, a novel approach to rapid change detection in data streams and documents has been developed. It is based on ideas from image processing, and in particular on the Helmholtz principle from the Gestalt theory of human perception. By modeling a document as a one-parameter family of graphs, with its sentences or paragraphs defining the vertex set and with edges defined by the Helmholtz principle, we demonstrated that for some range of the parameter the resulting graph becomes a small-world network. In this article we investigate the natural orientation of edges in such small-world networks. For two connected sentences, we can say which one comes first and which one comes second, according to their position in the document. This orientation makes such a graph look like a small WWW-type network, so PageRank-type algorithms can produce an interesting ranking of the nodes (sentences or paragraphs) in the document.
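
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration. The abstract does not spell out the Helmholtz-based edge test, so the sketch substitutes a plain shared-word test. The threshold min_shared is a hypothetical parameter, only loosely analogous to the parameter of the graph family. The abstract also does not say which way edges point, so the sketch assumes each edge runs from the later sentence to the earlier one. The function names build_graph and pagerank are likewise hypothetical.

```python
import re


def build_graph(sentences, min_shared=1):
    """Build a directed sentence graph as an adjacency dict.

    ASSUMPTION: two sentences are connected when they share at least
    `min_shared` words. This is a stand-in for the Helmholtz-based
    edge test, which the abstract does not specify.
    ASSUMPTION: each edge points from the later sentence to the
    earlier one, using position in the document as the orientation.
    """
    bags = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    edges = {i: [] for i in range(len(sentences))}
    for later in range(len(sentences)):
        for earlier in range(later):
            if len(bags[earlier] & bags[later]) >= min_shared:
                edges[later].append(earlier)
    return edges


def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over an adjacency dict.

    Nodes without outgoing edges spread their rank uniformly over
    all nodes, so the ranks keep summing to 1.
    """
    n = len(edges)
    if n == 0:
        return {}
    rank = {v: 1.0 / n for v in edges}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in edges if not edges[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in edges}
        for v, outs in edges.items():
            for w in outs:
                new[w] += damping * rank[v] / len(outs)
        rank = new
    return rank


if __name__ == "__main__":
    doc = [
        "graphs model documents",
        "documents contain sentences",
        "sentences form graphs",
        "unrelated filler",
    ]
    graph = build_graph(doc)
    ranks = pagerank(graph)
    # Print sentences from highest to lowest rank.
    for node in sorted(ranks, key=ranks.get, reverse=True):
        print(f"{ranks[node]:.3f}  {doc[node]}")
```

Under these assumptions, sentences that many later sentences connect back to accumulate rank, much as heavily linked pages do on the Web. With min_shared=1 the stand-in test would also connect sentences through common function words, so a practical variant would at least filter stop words.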

13 Pages

External Posting Date: February 6, 2012 [Abstract Only]. Approved for External Publication - External Copyright Consideration
Internal Posting Date: February 6, 2012 [Fulltext]
