June 2005

Looking back in TIME

HP helps put 80 years of TIME magazine online



by Simon Firth

Early last year, HP set out to digitize the entire 80-year print run of TIME magazine, one of the world's most widely read publications.

The idea was to automate the process as much as possible, going far beyond what had previously been achieved in the field.

In both technical and organizational terms, the job presented an enormously complex challenge. The archive was not only physically huge -- running to more than half a million pages -- it also changed in character over time, making the material to be digitized something of a moving target.

To accomplish the task, HP forged an unusual collaboration between its Consulting and Integration (C&I) group and HP Labs. The undertaking required both groups to go beyond their usual areas of strength and put everyone under intense pressure to deliver. At times, recalls HP Labs program manager Giuliano Di Vitantonio, “it was absolutely crazy. There were weeks when we just didn’t sleep.”

By the end of the year, however, through a mixture of technological innovation and disciplined process management, the TIME team had processed the entire archive with almost 100 percent accuracy.

The resulting collection, housed at TIME.com, provides a vast online record of history in the making from 1923 to today.

The archive contains each issue's cover, photos and more than 266,000 original articles, covering everything from the Great Depression to the Beatles' U.S. debut (headlined "The Unbarbershopped Quartet") to the formal birth of the United Nations in 1945 to the race to map the human genome.

New technology

Typically, the only way to digitize a magazine archive has been to retype it by hand.

Applying Optical Character Recognition (OCR) software might seem a good technological alternative. But state-of-the-art OCR software is only 99.5 percent accurate, and it lacks the intelligence to reassemble the blocks of text it recognizes into complete articles.

Was there a way to apply new technology to the problem?

Building on previous work they’d done for the MIT Press, researchers at HP Labs came up with a three-stage solution.

First, they passed each digitally scanned magazine page through multiple OCR engines and, using a series of algorithms they developed, selected the best parts of each engine's output.

That increased the OCR accuracy rate beyond what is typically expected. Yet many errors still had to be addressed, particularly the connection of articles across page boundaries.
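The article does not describe the selection algorithms themselves. As a rough illustration of how several OCR outputs can be merged, here is a minimal sketch that uses simple word-level majority voting, a technique swapped in for the example and not the HP Labs method. The function name `vote_words` and the sample engine outputs are hypothetical, and the sketch assumes the engines' outputs are already aligned word for word.

```python
from collections import Counter


def vote_words(engine_outputs):
    """Pick, position by position, the word that most engines agree on.

    engine_outputs: list of word lists, one per (hypothetical) OCR engine.
    Assumes every engine returned the same number of words for the zone;
    real engine outputs would need to be aligned first.
    """
    if len({len(words) for words in engine_outputs}) != 1:
        raise ValueError("outputs must be aligned to the same length")
    merged = []
    for candidates in zip(*engine_outputs):
        # most_common(1) returns the top word; on a tie, the word seen
        # first (from the earliest engine in the list) wins.
        word, _count = Counter(candidates).most_common(1)[0]
        merged.append(word)
    return merged


# Three made-up engine outputs for the same text zone, each with one error.
outputs = [
    ["TIME", "rnagazine", "1923"],
    ["TIME", "magazine", "l923"],
    ["T1ME", "magazine", "1923"],
]
print(vote_words(outputs))
```

Each engine here makes a different mistake, so the vote recovers the correct word at every position. That is the intuition behind combining engines: their errors tend not to coincide.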

Aiming for 100 percent

In a second stage, the Labs team reconstructed the articles from their constituent parts. They did this by creating a software engine that could recognize and exclude sections of each page -- such as advertisements and photographs -- that were not article text. The software then made intelligent guesses to determine the correct sequencing for the text blocks.

At this stage, says Labs' Di Vitantonio, they managed to reach 80 percent accuracy. Links between text sequences were identified with movable arrows on a graphic reproduction of the page.
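The second stage can be pictured as two steps: filter out zones that are not article text, then guess an order for what remains. The sketch below assumes a deliberately naive heuristic (read column by column, top to bottom); the `Zone` fields, the zone labels and the `column_width` parameter are all hypothetical and are not taken from the HP Labs engine.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    zone_id: str
    kind: str   # "text", "ad" or "photo" (labels are assumptions)
    x: float    # left edge of the zone on the page
    y: float    # top edge of the zone on the page


def guess_reading_order(zones, column_width=200.0):
    """Drop non-article zones, then order the rest by column, top to bottom."""
    text_zones = [z for z in zones if z.kind == "text"]
    # Zones whose left edges fall in the same column bucket are read
    # from top to bottom before moving on to the next column.
    return sorted(text_zones, key=lambda z: (int(z.x // column_width), z.y))


page = [
    Zone("c", "text", 210, 50),
    Zone("a", "text", 10, 40),
    Zone("ad", "ad", 10, 300),
    Zone("b", "text", 12, 400),
]
print([z.zone_id for z in guess_reading_order(page)])
```

A fixed heuristic like this breaks down on the dynamic layouts of newer issues, which is one reason a guess of this kind can only go so far before manual correction is needed.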

Full accuracy came with the third stage, which employed a tool designed by HP researchers and developed by C&I consultants. The tool enabled C&I staff to manually link zones together, recreating the reading flow of the articles where the software had guessed wrong.
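One simple way to model this kind of correction is a map from each zone to the zone that follows it, with operator edits overriding the software's guesses. The sketch below assumes that representation; the function names and zone identifiers are hypothetical, and the actual tool's data model is not described in the article.

```python
def apply_corrections(guessed_links, corrections):
    """Return a new link map where operator corrections override guesses."""
    links = dict(guessed_links)   # copy, so the original guesses are kept
    links.update(corrections)
    return links


def follow_flow(start, links):
    """Walk the links from a starting zone; return zone ids in reading order."""
    order, seen = [], set()
    current = start
    # Stop at the end of the article (None) or if a link loops back.
    while current is not None and current not in seen:
        order.append(current)
        seen.add(current)
        current = links.get(current)
    return order


# The software guessed the article stays on page 1; the operator
# re-links the second zone to the continuation on page 2.
guessed = {"p1-z1": "p1-z2", "p1-z2": "p1-z3"}
corrections = {"p1-z2": "p2-z1", "p2-z1": None}
print(follow_flow("p1-z1", apply_corrections(guessed, corrections)))
```

Keeping guesses and corrections separate means the software's original output is never overwritten, so a corrected article can always be compared against what was guessed.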

A moving target

The challenge of digitizing the TIME archive was made all the harder by changes in the magazine over the last eight decades.

Early issues used fonts that no longer existed, for example, and had pages that were often damaged and thus hard to read. The newer issues were clearly printed but often employed dynamic graphic layouts that made it extremely difficult to distinguish text elements from photographs or figures.

“Essentially, we were dealing with the history of modern printing of the last 80 years,” says Jeff Hager, solution delivery manager for Rich Media in HP C&I, who oversaw the TIME project.

Logistical challenge

The work was carried out in six phases, each of which had checkpoints where, if content didn't meet quality standards, it went back a stage.

Managing that process was a huge logistical challenge.
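The checkpoint rule (content that fails a quality check goes back a stage) can be sketched as a small loop. This is an illustration only: the phase functions, the quality checks and the `max_steps` safety limit are hypothetical, and the article does not say how the real workflow was implemented.

```python
def run_phases(item, phases, max_steps=50):
    """Run an item through ordered phases; a failed check sends it back one phase.

    phases: list of (work, check) pairs. work(item) returns the updated item;
    check(item) returns True when the item meets that phase's quality standard.
    """
    index, steps = 0, 0
    while index < len(phases):
        if steps >= max_steps:
            raise RuntimeError("item kept failing checkpoints")
        steps += 1
        work, check = phases[index]
        item = work(item)
        if check(item):
            index += 1                   # passed: move on to the next phase
        else:
            index = max(index - 1, 0)    # failed: go back a stage
    return item


def scan(page):
    return {**page, "scans": page["scans"] + 1}


def ocr(page):
    return {**page, "ocr_runs": page["ocr_runs"] + 1}


# Made-up rule for the example: OCR output only passes after a second scan.
phases = [
    (scan, lambda page: True),
    (ocr, lambda page: page["scans"] >= 2),
]
print(run_phases({"scans": 0, "ocr_runs": 0}, phases))
```

In the example the page fails the OCR checkpoint once, returns to scanning, and passes on the second attempt. With six phases and more than half a million pages, tracking where every item sits in such a loop is the logistical problem the article describes.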

The original magazines were scanned by C&I in Bridgewater, New Jersey, and then the resulting TIF data files were shipped to the HP Labs’ Barcelona Research Office in Spain, home to the HP Labs Digital Content Remastering Program. Once processed, the data were shipped back to Bridgewater for manual correction.

“When you’re dealing with content,” notes HP Labs researcher John Burns, “it’s old, it’s messy, it’s dirty, it’s incomplete. And so you end up running an industrial operation.”

That’s not the kind of thing HP usually does, says Burns. HP Labs also took on the unusual role of running data processing for the TIME job -- a task that, in this case, took 44 days of uninterrupted server operation.

“This wasn’t something we transferred to a division or a business unit and they did it with our technology,” says Di Vitantonio with some pride. “We really ran the operations for the entire volume.”

Close collaboration

While Labs was running some of the project’s operations, C&I was required to do far more software development than usual, notes C&I’s Jeff Hager.

“It was a really good collaborative effort,” Hager says. The project, he says, required everyone on the team to rapidly customize their solution to meet demanding customer needs on very short notice. That wasn’t easy.

“Often, we were trying to figure out what to do, how to do it, how to build the tools to do it -- and actually do it all at one time,” he adds.

“The magic of all this is how all the different parts of HP worked together to make this happen,” says Hager’s boss, Douglas McMahon, VP, HP Systems Solutions and HP’s principal liaison with TIME.

“The relationship between TIME and HP,” McMahon adds, “led to a solution based on people, process and technology that could get this job done.”

Archive online

With the newly digitized content as a prime feature, TIME.com’s archive went live late last December. Visitors to TIME.com can now search the entire output of the magazine from 1923 to the present day, as well as search for covers and browse stories grouped by theme.

Access to the archive is free to subscribers; otherwise, users must pay a small fee or be limited to article summaries.

What's ahead

HP is now looking into whether its digital content remastering solution might be offered commercially to other publishers.

Meanwhile, the HP Labs team is looking for its next challenge in what researchers call content-driven computing, exploring opportunities in creating IT-based solutions for the manipulation of digital content.

“Seeing our research get into a final product or service consumed by real people is always a great achievement,” says Di Vitantonio. “That’s the really exciting aspect of this.”

© 2009 Hewlett-Packard Development Company, L.P.