Identification of Document Structure and Table of Content in Magazine Archives

Yacoub, Sherif; Peiro, Jose Abad

HPL-2005-101
External - Copyright Consideration

Keyword(s): document analysis; document recognition; table of content

Abstract: In this paper, we present a generic approach for reliable identification of the table of content (TOC) pages in scanned documents. We use multiple sources of information to obtain a reliable assessment of the TOC pages and the position of articles. These sources are produced by three methods: title matching, section keyword matching, and numeric content. Finally, a combination component scores potential TOC pages and selects the best candidates. The system is used to identify the table of content, locate the beginning of articles, aid the process of advertisement identification (where present), and, in general, identify the structure of scanned documents for article extraction and online deployment of digital content. Results of applying the algorithms to an 80-year archive of the weekly magazine Time are presented.

Notes: Copyright IEEE. To be presented at the 8th International Conference on Document Analysis and Recognition, 29 August - 1 September 2005, Seoul, Korea

8 Pages
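
The abstract describes three evidence sources feeding a combination component that scores candidate TOC pages. As a rough illustration only, that idea might be sketched as a weighted sum over per-page scores. Everything below is an assumption made for the sketch and is not taken from the report: the `PageEvidence` fields, the 0-to-1 score ranges, the weights, and the function names are all hypothetical, and the paper's actual combination method may differ.

```python
from dataclasses import dataclass


@dataclass
class PageEvidence:
    """Hypothetical per-page scores, each assumed to lie in [0, 1]."""
    page: int
    title_match: float      # evidence from title matching
    keyword_match: float    # evidence from section keyword matching
    numeric_content: float  # evidence from numeric content on the page


def combined_score(evidence, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the three evidence sources (weights are illustrative)."""
    w_title, w_keyword, w_numeric = weights
    return (w_title * evidence.title_match
            + w_keyword * evidence.keyword_match
            + w_numeric * evidence.numeric_content)


def best_toc_candidates(pages, top_n=1):
    """Return the top_n pages ranked by combined score, highest first."""
    return sorted(pages, key=combined_score, reverse=True)[:top_n]


# Made-up scores for three scanned pages, for illustration only.
pages = [
    PageEvidence(page=1, title_match=0.0, keyword_match=0.1, numeric_content=0.05),
    PageEvidence(page=3, title_match=0.8, keyword_match=0.6, numeric_content=0.4),
    PageEvidence(page=4, title_match=0.2, keyword_match=0.5, numeric_content=0.3),
]
best = best_toc_candidates(pages, top_n=1)
print(best[0].page)
```

A weighted sum is only one plausible way to combine the sources; it is used here because it keeps the contribution of each method easy to inspect.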
