Technical Reports

HP Labs

Impact of imperfect OCR on part-of-speech tagging

Lin, Xiaofan


Keyword(s): part-of-speech tagging; optical character recognition; natural language processing; system combination; majority voting; sensitivity analysis

Abstract: Part-of-speech (POS) tagging is the foundation of natural language processing (NLP) systems and has therefore been an active area of research for many years. However, one question remains unanswered: how does a POS tagger behave when the input text is not error-free? This issue can be of great importance when the text comes from imperfect sources such as optical character recognition (OCR). This paper analyzes the performance of both individual POS taggers and combination systems on imperfect text. Experimental results show that a POS tagger's accuracy decreases linearly with the character error rate, and that the slope of this line indicates the tagger's sensitivity to input text errors.
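
The linear relationship described in the abstract can be illustrated with a small sketch. It fits a least-squares line to pairs of character error rate and tagging accuracy and reads the slope as the tagger's sensitivity. The function name `fit_sensitivity` and all numbers below are hypothetical placeholders chosen for illustration; they are not taken from the report, which may estimate the slope differently.

```python
def fit_sensitivity(error_rates, accuracies):
    """Fit accuracy ~ intercept + slope * error_rate by ordinary least squares.

    Returns (slope, intercept). A more negative slope means the tagger
    loses accuracy faster as the character error rate grows.
    """
    n = len(error_rates)
    if n < 2 or n != len(accuracies):
        raise ValueError("need at least two (error_rate, accuracy) pairs")

    mean_x = sum(error_rates) / n
    mean_y = sum(accuracies) / n

    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in error_rates)
    if sxx == 0:
        raise ValueError("error rates must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(error_rates, accuracies))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up measurements for two hypothetical taggers (NOT results from the report).
char_error_rates = [0.00, 0.02, 0.04, 0.06, 0.08]
tagger_a_accuracy = [0.96, 0.93, 0.90, 0.87, 0.84]
tagger_b_accuracy = [0.95, 0.93, 0.91, 0.89, 0.87]

slope_a, intercept_a = fit_sensitivity(char_error_rates, tagger_a_accuracy)
slope_b, intercept_b = fit_sensitivity(char_error_rates, tagger_b_accuracy)

# The tagger with the steeper (more negative) slope is the more sensitive one.
more_sensitive = "A" if slope_a < slope_b else "B"
print(f"tagger A: slope={slope_a:.2f}, intercept={intercept_a:.2f}")
print(f"tagger B: slope={slope_b:.2f}, intercept={intercept_b:.2f}")
print(f"more sensitive to OCR errors: tagger {more_sensitive}")
```

In this toy data, tagger A starts with slightly higher accuracy on clean text but degrades faster, which is the kind of trade-off a slope-based sensitivity measure is meant to expose.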

6 Pages

© 2009 Hewlett-Packard Development Company, L.P.