Technical Reports


Chart Image Understanding and Numerical Data Extraction

Mishchenko, Ales; Vassilieva, Natalia
HP Laboratories


Keyword(s): Model-based recognition; data extraction; chart classification; image analysis

Abstract: Chart images in digital documents are an important source of valuable information that is largely underutilized for data indexing and information extraction. We developed a framework that automatically extracts the data carried by charts and converts it to XML format. The proposed algorithm classifies an image by chart type, detects graphical and textual components, and extracts semantic relations between graphics and text. Classification is performed by a novel model-based method, which was extensively tested against state-of-the-art supervised learning methods and showed high accuracy, comparable to that of the best supervised approaches. The proposed text detection algorithm is applied prior to optical character recognition and leads to a significant improvement in the text recognition rate (up to 20 times better). Analysis of the graphical components and their relations to textual cues allows the chart data to be recovered. For testing purposes, a benchmark set was created with the XML/SWF Chart tool. By comparing the recovered data with the original data used for chart generation, we are able to evaluate our information extraction framework and confirm its validity.
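
The evaluation idea in the abstract, comparing recovered data against the data used to generate each chart, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the XML element and attribute names (`chart`, `series`, `point`, `label`, `value`), the helper names, and the choice of mean relative error as the comparison metric are hypothetical and are not taken from the report or from the XML/SWF Chart tool.

```python
import xml.etree.ElementTree as ET


def chart_to_xml(chart_type, series):
    """Serialize chart data to XML (hypothetical schema, not the report's)."""
    root = ET.Element("chart", {"type": chart_type})
    for name, points in series.items():
        series_el = ET.SubElement(root, "series", {"name": name})
        for label, value in points:
            ET.SubElement(series_el, "point",
                          {"label": label, "value": repr(value)})
    return ET.tostring(root, encoding="unicode")


def xml_to_series(xml_text):
    """Parse the hypothetical XML back into {series name: [(label, value)]}."""
    root = ET.fromstring(xml_text)
    return {
        s.get("name"): [(p.get("label"), float(p.get("value")))
                        for p in s.findall("point")]
        for s in root.findall("series")
    }


def mean_relative_error(original, recovered):
    """Average relative error over all original points.

    A point missing from the recovered data counts as an error of 1.0.
    This metric is an assumed stand-in for the report's evaluation.
    """
    errors = []
    for name, points in original.items():
        recovered_points = dict(recovered.get(name, []))
        for label, value in points:
            if label not in recovered_points:
                errors.append(1.0)
                continue
            denom = abs(value) if value != 0 else 1.0
            errors.append(abs(recovered_points[label] - value) / denom)
    return sum(errors) / len(errors) if errors else 0.0


# Ground truth used to generate a chart, and a made-up extraction result.
original = {"sales": [("Q1", 10.0), ("Q2", 20.0)]}
recovered = {"sales": [("Q1", 11.0), ("Q2", 20.0)]}

xml_text = chart_to_xml("bar", recovered)
error = mean_relative_error(original, xml_to_series(xml_text))
```

A lower error means the recovered values are closer to the generating data; a real benchmark would also need to account for misclassified chart types and mismatched labels.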

6 Pages

External Posting Date: October 6, 2011 [Abstract]. Approved for External Publication
Internal Posting Date: October 6, 2011 [Fulltext]
