Text-mining based journal splitting
Keyword(s): table of contents; OCR; journal splitting; text mining; text chunking; document understanding
Abstract: This paper introduces a novel journal splitting algorithm that combines several kinds of evidence: text matches, layout, and page numbers. Its core is a highly efficient text-mining procedure that detects phrases shared between the table-of-contents pages and the title pages of individual articles. Experiments show that the algorithm is robust and can split a wide range of journals, magazines, and books.
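
The phrase-matching idea in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm, whose details the abstract does not give; it is a simple stand-in that assumes the OCR text of each page and the table-of-contents entries are already available as strings. All names (`words`, `longest_shared_phrase`, `split_points`) and the thresholds `min_words` and `head_words` are hypothetical, and layout and page-number evidence are ignored.

```python
import re


def words(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def longest_shared_phrase(a, b):
    """Return the length, in words, of the longest contiguous run
    of words that appears in both token lists a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for wa in a:
        cur = [0] * (len(b) + 1)
        for j, wb in enumerate(b, start=1):
            if wa == wb:
                # Extend the run that ended at the previous word of each list.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def split_points(toc_entries, pages, min_words=3, head_words=40):
    """For each contents entry, return the index of the page whose opening
    words share the longest phrase with it, or None if no page shares a
    phrase of at least min_words words (both thresholds are assumptions)."""
    starts = []
    for entry in toc_entries:
        entry_tokens = words(entry)
        best_page, best_len = None, 0
        for i, page in enumerate(pages):
            # Only the top of a page is compared, since titles appear there.
            n = longest_shared_phrase(entry_tokens, words(page)[:head_words])
            if n > best_len:
                best_page, best_len = i, n
        starts.append(best_page if best_len >= min_words else None)
    return starts
```

Given the contents entries and the article pages (with the contents pages themselves excluded), `split_points` returns one candidate start page per entry; consecutive start pages then delimit the individual articles. Requiring a minimum phrase length is one simple way to keep isolated common words from producing false matches on noisy OCR text.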