DSpace: Preserving Digital Data for the Ages

DSpace Aims to Capture, Distribute and Preserve Intellectual Output

August 2003


HP Labs


In 10 years, a book will still be there, but digital data is vulnerable.

Scholars poring over medieval manuscripts or ancient stone tablets can still read what scribes set down hundreds or even thousands of years ago. But today, we're generating billions of bits of digital data that could become obsolete and indecipherable within just a few years -- what Nick Wainwright refers to as "bit rot." Wainwright oversees an HP Labs project called DSpace -- a partnership with MIT to address a problem he says is "only going to get worse" as time goes on.

According to Wainwright, the difficulty MIT and other institutions now face is that much of the intellectual output of professors and researchers is in digital form and has properties that are inherent to being "born digital."

For instance, consider a Web site with its many dynamic links, or a research report linked to the raw experimental results it's based on. Simply "printing out" such data destroys much of its usefulness. And often, it simply disappears.

"A lot of the data was being lost," says HP's lead developer on DSpace, Robert Tansley. "It gets left on people's hard drives. When they leave jobs or change departments or their hard drive crashes, all of that work is lost. Or perhaps it might make it to a departmental Web server that disappears when the department splits up into two separate research centers."

What Makes Digital Data Vulnerable

DSpace was launched to help capture and organize all that data into an "institutional repository" so it will be available to future generations in its original digital form. But it isn't easy.

"Conserving digital data is something you have to do quite aggressively," says Tansley. "With books, you just have a controlled environment room. You put the book in, shut the door and in 10 years the book will still be there." By contrast, digital data is vulnerable both to rapidly changing standards for file formats and software applications and to potential damage to the drums, disks, tapes and other media used to actually store the data.

DSpace isn't a single technology or a new format, but a set of tools for helping institutions keep track of their data, organize it in meaningful ways and migrate that data to new formats as old formats become obsolete. It's really a matter of establishing a system for the curation of this data, says Wainwright. "And we want to do as much of that in an automated way as possible in order to handle the increasing volume and complexity of the data being produced," he says.
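As a rough illustration of what automated format migration could involve, here is a minimal Python sketch. It is not DSpace code: the `StoredItem` class, the `plan_migrations` function and the format pairs in `MIGRATION_TARGETS` are all hypothetical names and values chosen for this example. It only shows the idea of tracking each item's format and flagging the items whose format has a designated replacement.

```python
from dataclasses import dataclass

# Hypothetical mapping from formats considered obsolete to a replacement.
# These pairs are illustrative assumptions, not a DSpace format registry.
MIGRATION_TARGETS = {
    "wordperfect": "pdf",
    "bmp": "png",
}


@dataclass
class StoredItem:
    """One deposited file and the format it is currently stored in."""
    identifier: str
    file_format: str


def plan_migrations(items, targets=MIGRATION_TARGETS):
    """Return (identifier, old_format, new_format) for items needing migration."""
    plan = []
    for item in items:
        # Format names are compared case-insensitively.
        new_format = targets.get(item.file_format.lower())
        if new_format is not None:
            plan.append((item.identifier, item.file_format, new_format))
    return plan


if __name__ == "__main__":
    holdings = [
        StoredItem("thesis-001", "WordPerfect"),
        StoredItem("image-042", "png"),
    ]
    for identifier, old, new in plan_migrations(holdings):
        print(f"{identifier}: migrate {old} -> {new}")
```

Keeping the list of obsolete formats in data rather than in code means a curator could update it as standards change without touching the scanning logic.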

Early Work on DSpace

In attacking the problem, HP researchers found some of their most valuable allies were not IT specialists, but librarians. "Libraries are used to dealing with things for a long time. They make great partners because they're used to thinking of things in terms of decades or even centuries," says Wainwright.

The current version of DSpace is what Tansley calls "bare bones." It contains mainly text documents searchable through a standard SQL database. But he says it's helping to pose new questions about what's still needed and how the system will be used.

"DSpace won't be much use unless the MIT faculty can actually deposit their digital assets in it," says Tansley. And, adds Wainwright, it won't be much use if that data can't be easily retrieved years in the future.

Getting the Right Metadata

One of the key components HP researchers are working on now is the issue of metadata -- that is, "data about data." Think of it as the information on a library card.

"The key to all this," Wainwright says, "is having information about the information you're storing, whether it describes what it is or who put it there or how it got there or what format it's in. And metadata about a collection will continue to grow over the lifetime of the collection."

For instance, a researcher could look up experimental data from a past thesis, and do new experiments or draw new conclusions based on that original data. This new information would then be linked with the earlier studies as part of the overall metadata on that topic.
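The kind of metadata described here can be pictured with a small Python sketch. This is an assumption-laden illustration, not the DSpace data model: `MetadataRecord`, its field names and `link_follow_up` are hypothetical. The record holds what the item is, who deposited it, what format it is in and how it got there, and it can keep growing as later work is linked to it.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    """Hypothetical 'library card' for one deposited item."""
    title: str                # what it is
    depositor: str            # who put it there
    file_format: str          # what format it is in
    provenance: list = field(default_factory=list)  # how it got there, over time
    related: list = field(default_factory=list)     # titles of linked items
    extra: dict = field(default_factory=dict)       # open-ended descriptions

    def note(self, event):
        """Append a provenance event; the record grows over its lifetime."""
        self.provenance.append(event)

    def link(self, other_title):
        """Record a relationship to another item, avoiding duplicates."""
        if other_title not in self.related:
            self.related.append(other_title)


def link_follow_up(original, follow_up):
    """Link a new study to the earlier work it builds on, in both directions."""
    original.link(follow_up.title)
    follow_up.link(original.title)
    original.note(f"cited by follow-up work: {follow_up.title}")


if __name__ == "__main__":
    thesis = MetadataRecord("Past thesis data", "A. Researcher", "csv")
    thesis.note("deposited by author")
    study = MetadataRecord("New experiments", "B. Researcher", "csv")
    link_follow_up(thesis, study)
    print(thesis.related, thesis.provenance)
```

The free-form `extra` dictionary is one simple way to let descriptions vary by type of material, at the cost of less uniform records.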

User-Friendly Input

What Tansley and others are tackling is how to make the metadata open-ended enough to account for the wide variety of data out there -- from biomedical images and teaching videos to data sets and computer programs -- and organize it all in a way that will be useful to future researchers. Suddenly that library card needs to contain a lot more information.

"What you need is a way to allow people to describe the assets in the way that they want, and to allow those descriptions to evolve over time," says Tansley. "It forces you to take into account a lot of metadata you wouldn't normally think about if you were just archiving documents and photographs."

What's more, inputting both the data and relevant metadata will have to be user-friendly enough that busy professors will actually do it. "As the metadata you want to store becomes more complex, you also need to find a way of capturing that information and make it easy for MIT faculty," says Tansley. "If you ask them to fill out a form that's 15 pages long for everything they want to put in the system, they're not going to do it. So you have to trade ease of use off with capturing the richness of information."

Available Online

The current version of DSpace is readily available online as an open source Linux program, and a community of users is rapidly growing around it.

Tansley says people are not only realizing they have a problem with archiving digital data and then looking at DSpace as a possible solution. It also works the other way around. "Actually, people have looked at DSpace and realized they have the problem."

© 2009 Hewlett-Packard Development Company, L.P.