Scholars poring over medieval manuscripts or ancient stone
tablets can still read what scribes set down hundreds or even
thousands of years ago. But today, we're generating billions
of bits of digital data that could become obsolete and indecipherable
within just a few years -- a phenomenon Nick Wainwright refers to
as "bit rot." Wainwright oversees an HP Labs project
called DSpace -- a partnership with MIT to address a problem
he says is "only going to get worse" as time goes
on.
According to Wainwright, the difficulty MIT and other institutions
now face is that much of the intellectual output of professors
and researchers is in digital form and has properties that
are inherent to being "born digital."
For instance, consider a Web site with its many dynamic links,
or a research report linked to the raw experimental results
it's based on. Simply "printing out" such data destroys
much of its usefulness. And often, the data disappears altogether.
"A lot of the data was being lost," says HP's lead
developer on DSpace, Robert Tansley. "It gets left on
people's hard drives. When they leave jobs or change departments
or their hard drive crashes, all of that work is lost. Or
perhaps it might make it to a departmental Web server that
disappears when the department splits up into two separate
research centers."
What Makes Digital Data Vulnerable
DSpace was launched to help capture and organize all that
data into an "institutional repository" so it will
be available to future generations in its original digital
form. But it isn't easy.
"Conserving digital data is something you have to do
quite aggressively," says Tansley. "With books,
you just have a controlled environment room. You put the book
in, shut the door and in 10 years the book will still be there."
By contrast, digital data is vulnerable on two fronts: rapidly
changing standards for file formats and software applications,
and potential damage to the drums, disks, tapes and other media
used to actually store the data.
DSpace isn't a single technology or a new format, but a set
of tools for helping institutions keep track of their data,
organize it in meaningful ways and migrate that data to new
formats as old formats become obsolete. It's really a matter
of establishing a system for the curation of this data, says
Wainwright. "And we want to do as much of that in an
automated way as possible in order to handle the increasing
volume and complexity of the data being produced," he
says.
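As a rough illustration of the automated curation described above, the
sketch below scans a list of stored items and reports which ones are in
formats considered obsolete, together with a suggested target format.
It is not DSpace code: the StoredItem class, the MIGRATION_TARGETS table
and the format names in it are hypothetical, chosen only to show the idea.

```python
from dataclasses import dataclass


@dataclass
class StoredItem:
    """A hypothetical repository entry: an identifier plus its file format."""
    item_id: str
    file_format: str


# Assumed, illustrative mapping from obsolete formats to replacements.
# A real repository would maintain and revise such a table over time.
MIGRATION_TARGETS = {
    "wordstar": "pdf",
    "bmp": "png",
}


def plan_migrations(items):
    """Return (item_id, old_format, new_format) for every item whose
    format appears in MIGRATION_TARGETS; other items are left alone."""
    plan = []
    for item in items:
        target = MIGRATION_TARGETS.get(item.file_format)
        if target is not None:
            plan.append((item.item_id, item.file_format, target))
    return plan


items = [
    StoredItem("report-1", "wordstar"),
    StoredItem("thesis-7", "pdf"),
    StoredItem("scan-3", "bmp"),
]
migration_plan = plan_migrations(items)
```

Keeping the format table separate from the items means that declaring
another format obsolete is a one-line change, after which the same scan
picks up every affected item.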
Early Work on DSpace
In attacking the problem, HP researchers found some of their
most valuable allies were not IT specialists, but librarians.
"Libraries are used to dealing with things for a long
time. They make great partners because they're used to thinking
of things in terms of decades or even centuries," says
Wainwright.
The current version of DSpace is what Tansley calls "bare
bones." It contains mainly text documents searchable
through a standard SQL database. But he says it's helping
to pose new questions about what's still needed and how the
system will be used.
"DSpace won't be much use unless the MIT faculty can
actually deposit their digital assets in it," says Tansley.
And, adds Wainwright, it won't be much use if that data can't
be easily retrieved years in the future.
Getting the Right Metadata
One of the key components HP researchers are working on now
is the issue of metadata -- that is, "data about data."
Think of it as the information on a library card.
"The key to all this," Wainwright says, "is
having information about the information you're storing, whether
it describes what it is or who put it there or how it got
there or what format it's in. And metadata about a collection
will continue to grow over the lifetime of the collection."
For instance, a researcher could look up experimental data
from a past thesis, and do new experiments or draw new conclusions
based on that original data. This new information would then
be linked with the earlier studies as part of the overall
metadata on that topic.
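To make the "library card" idea concrete, here is a minimal sketch of a
metadata record that says what an item is, who deposited it and what
format it is in, and that can be linked to related work later. The
MetadataRecord class, its field names and the link_derived_work helper
are assumptions made for illustration, not the schema DSpace uses.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    """Hypothetical 'data about data' for one deposited item."""
    item_id: str
    description: str      # what the item is
    depositor: str        # who put it there
    file_format: str      # what format it is in
    related_ids: list = field(default_factory=list)  # links to other items


def link_derived_work(repository, new_record, source_id):
    """Store new_record and link it, in both directions, to the earlier
    item it builds on. 'repository' is a plain dict keyed by item_id."""
    source = repository[source_id]  # raises KeyError if the source is unknown
    repository[new_record.item_id] = new_record
    new_record.related_ids.append(source_id)
    source.related_ids.append(new_record.item_id)


repo = {}
thesis = MetadataRecord(
    "thesis-1", "Raw experimental data from a past thesis", "a.student", "csv"
)
repo[thesis.item_id] = thesis

followup = MetadataRecord(
    "study-2", "New conclusions drawn from the thesis data", "b.researcher", "pdf"
)
link_derived_work(repo, followup, "thesis-1")
```

Because the link is recorded on both records, someone who finds the
original thesis data can also discover the later study, which is how the
metadata on a collection keeps growing over its lifetime.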
User-Friendly Input
What Tansley and others are tackling is how to make the metadata
open-ended enough to account for the wide variety of data
out there -- from biomedical images and teaching videos to
data sets and computer programs -- and organize it all in
a way that will be useful to future researchers. Suddenly
that library card needs to contain a lot more information.
"What you need is a way to allow people to describe
the assets in the way that they want, and to allow those descriptions
to evolve over time," says Tansley. "It forces you
to take into account a lot of metadata you wouldn't normally
think about if you were just archiving documents and photographs."
What's more, inputting both the data and relevant metadata
will have to be user-friendly enough that busy professors
will actually do it. "As the metadata you want to store
becomes more complex, you also need to find a way of capturing
that information and make it easy for MIT faculty," says
Tansley. "If you ask them to fill out a form that's 15
pages long for everything they want to put in the system,
they're not going to do it. So you have to trade ease of use
off with capturing the richness of information."
Available Online
The current version of DSpace
is readily available online as an open source Linux program,
and a community of users is rapidly growing around it.
Tansley says the interest runs in both directions. Some people
realize they have a problem with archiving digital data and look
to DSpace as a possible solution. But the reverse also happens:
"Actually, people have looked at DSpace and
realized they have the problem."