Fast, cheap and under control
HP Labs introduces sparse indexing to radically improve data backup for small and medium-sized businesses
By Simon Firth
Most of us are pretty bad at backing up our home PCs, but it’s easy enough to do. We can just copy our whole hard drive to a second, back-up drive.
Enterprises face a vastly greater task, but know the value of their data. They’ll willingly pay large sums for complex, automated data back-up systems: services that employ a technique called deduplication to make sure that only fresh data is recorded every time a backup is made.
Caught in the middle, though, are small and medium businesses (SMBs). They don’t have huge IT budgets, and yet are generating ever larger volumes of data. And, increasingly, that data is vital to their operations and contains information they need to be able to retrieve with relative ease.
Scientist Kave Eshghi
Now, thanks to a collaboration between HP Labs researchers and engineers at HP’s StorageWorks business, SMBs have an affordable alternative: a new line of fast, cheap and flexible disk-based backup devices that offer deduplication on an SMB budget.
The challenge of data indexing
The secret behind HP’s new StorageWorks D2D backup systems is a novel and sophisticated kind of deduplication, called Sparse Indexing.
For deduplication to work, a backup system has to be able to know whether it already holds a copy of any particular piece of data, which means it needs an index to everything it holds.
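The idea can be illustrated with a minimal sketch. It is not HP's implementation: the fixed 4 KB chunk size, the SHA-1 fingerprint, and the names `DedupStore`, `backup` and `restore` are all assumptions chosen for illustration. The point it shows is that every incoming chunk costs one index lookup.

```python
import hashlib


def chunk_fixed(data: bytes, size: int = 4096):
    """Split data into fixed-size chunks (an assumed, simplistic chunking scheme)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class DedupStore:
    """Toy store that keeps exactly one copy of each distinct chunk."""

    def __init__(self):
        self.index = {}   # chunk fingerprint -> position in self.chunks
        self.chunks = []  # unique chunk payloads

    def backup(self, data: bytes, chunk_size: int = 4096):
        """Store data; return (recipe, number of chunks that were new)."""
        recipe, new = [], 0
        for chunk in chunk_fixed(data, chunk_size):
            key = hashlib.sha1(chunk).digest()
            pos = self.index.get(key)      # the per-chunk index lookup
            if pos is None:                # not seen before: store it
                pos = len(self.chunks)
                self.chunks.append(chunk)
                self.index[key] = pos
                new += 1
            recipe.append(pos)             # duplicates only add a reference
        return recipe, new

    def restore(self, recipe):
        """Rebuild the original data from its recipe of chunk positions."""
        return b"".join(self.chunks[p] for p in recipe)
```

In this sketch, backing up the same data a second time stores nothing new, and changing one chunk stores only that chunk.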
“The challenge,” says HP Labs scientist Kave Eshghi, “is that you have data coming in at hundreds of megabytes a second and you keep having to look it up to see if each chunk of data is already there.”
If you put the index in RAM (the fast working memory a PC runs its operating system and programs from), you can look up data fast and avoid an indexing bottleneck. “But then you need a huge amount of RAM,” Eshghi explains, “and that’s very expensive.”
Putting the index on a hard drive doesn’t help, because every lookup has to wait for the drive’s mechanical head to move into position, resulting in an unacceptably slow system.
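A rough back-of-the-envelope calculation shows why neither option is attractive. Every figure below is an illustrative assumption rather than an HP measurement: a 100 MB/s stream, 4 KB chunks, about 10 milliseconds per random disk access, 32 bytes per index entry, and 10 TB of unique stored data.

```python
# All figures are illustrative assumptions, not HP measurements.
INGEST_BYTES_PER_SEC = 100 * 10**6  # assumed 100 MB/s incoming backup stream
CHUNK_BYTES = 4 * 1024              # assumed average chunk size
SEEK_SECONDS = 0.010                # assumed ~10 ms per random disk access
ENTRY_BYTES = 32                    # assumed index entry size (fingerprint + location)
STORE_BYTES = 10 * 10**12           # assumed 10 TB of unique stored data


def lookups_needed_per_second(ingest_bytes: int, chunk_bytes: int) -> float:
    """Index lookups per second required to keep up with the stream."""
    return ingest_bytes / chunk_bytes


def disk_lookups_per_second(seek_seconds: float) -> float:
    """Random lookups per second one disk can serve if each costs one seek."""
    return 1 / seek_seconds


def full_index_bytes(store_bytes: int, chunk_bytes: int, entry_bytes: int) -> int:
    """RAM needed to hold one index entry for every stored chunk."""
    return store_bytes // chunk_bytes * entry_bytes


if __name__ == "__main__":
    print(lookups_needed_per_second(INGEST_BYTES_PER_SEC, CHUNK_BYTES))
    print(disk_lookups_per_second(SEEK_SECONDS))
    print(full_index_bytes(STORE_BYTES, CHUNK_BYTES, ENTRY_BYTES) / 10**9, "GB")
```

Under these assumed figures, the stream needs roughly 24,000 lookups a second against about 100 that a single disk can serve, while a complete in-RAM index would run to about 78 GB.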
HP’s solution lies in sampling the index of data already held. Instead of holding every index item in RAM ready for comparison, the HP team keeps just one in every hundred or so items in RAM and puts the rest onto a hard drive.
This works because duplicate data almost always arrives in bursts. In other words, if one chunk of the arriving stream is a duplicate, it is very likely that many of the chunks that follow are duplicates too. Sparse indexing takes advantage of this phenomenon by storing the hashes (compact fingerprints) of the stored chunks next to each other on disk, in the order the chunks arrived. As a result, a ‘hit’ in the sampled RAM index can direct the system to an area of the disk where many duplicates are likely to be found.
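The sampling idea can be sketched in a few lines. This is a simplified illustration based only on the description above, not HP's actual algorithm: the class `SparseIndexStore`, the `is_sampled` rule, the SHA-1 fingerprints and the sample rate of 1 in 4 (chosen so a small example still produces hits) are all assumptions. Python lists stand in for the disk, and a dictionary stands in for RAM.

```python
import hashlib

SAMPLE_RATE = 4  # assumed: keep ~1 hash in 4 in RAM (the article cites ~1 in 100)


def chunk_hash(chunk: bytes) -> bytes:
    return hashlib.sha1(chunk).digest()


def is_sampled(h: bytes) -> bool:
    """Deterministic sampling: keep hashes whose value is divisible by SAMPLE_RATE."""
    return int.from_bytes(h[:8], "big") % SAMPLE_RATE == 0


class SparseIndexStore:
    def __init__(self):
        self.sparse_index = {}  # "RAM": sampled hash -> id of the run that holds it
        self.segments = []      # "disk": each entry is one run of hashes, in arrival order
        self.disk_reads = 0     # how many runs were fetched from "disk"

    def backup_segment(self, chunks):
        """Deduplicate one incoming run of chunks; return how many were new."""
        hashes = [chunk_hash(c) for c in chunks]
        # Step 1: consult only the small in-RAM sample.
        candidates = {self.sparse_index[h] for h in hashes if h in self.sparse_index}
        # Step 2: each hit points at an on-disk run of neighbouring hashes; load it.
        known = set()
        for seg_id in candidates:
            known.update(self.segments[seg_id])
            self.disk_reads += 1
        # Step 3: count chunks whose hash is in none of the loaded runs.
        new = 0
        for h in hashes:
            if h not in known:
                known.add(h)
                new += 1
        # Step 4: write this run to "disk" and add its sampled hashes to RAM.
        seg_id = len(self.segments)
        self.segments.append(hashes)
        for h in hashes:
            if is_sampled(h):
                self.sparse_index.setdefault(h, seg_id)
        return new
```

In this sketch, repeating a backup of a few hundred chunks costs a single read of the matching run rather than one disk lookup per chunk. The trade-off is that a duplicate whose neighbourhood produces no sampled hit goes undetected and is stored again.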
Scientist Mark Lillibridge
Indeed, when HP’s first D2D products were introduced in 2008, they matched the performance of competing products at no more than a quarter of the price.
Providing solutions to the business
Going into the D2D project, HP’s StorageWorks division already had a long history of working with HP Labs on magnetic tape technologies, notes Graham Perry, the lead engineer on the deduplication technology development team.
“So when we started to look at technologies requiring deduplication,” Perry recalls, “it was natural for us to expand that working relationship into a new area.”
Eshghi and Lillibridge had been working on an allied problem in digital movie making — how to shift gigabytes of data over transatlantic cables in order to remotely process frames of high quality digital animation. In that case, instead of looking to avoid storing redundant data, they were looking to avoid transmitting it. “We had files that would take nine hours to transmit,” Lillibridge remembers. “But with deduplication, we could do it in minutes.”
That experience proved invaluable, says Perry. “But the Labs engineers can also just sit down and come up with something completely new that no one’s thought about,” he adds. “And that’s a huge advantage to us.”
Scaling the service
Scientist Vinay Deolalikar
“SMB customers have to manage more data than they ever expected,” Perry notes. “They also have to manage the data for longer, both because they’re now legally required to keep data online longer and because they need to be able to analyze their data to stay competitive.” As a result, Perry foresees no let-up in the need for SMB backup systems that are both capacious and quickly accessible.
The researchers agree that the next big question for deduplication is whether it can scale in terms of both capacity and speed to meet those escalating needs.
“There’s a limit to what you can do with a single box,” says Lillibridge. “Just adding another machine to the one you already have to increase your scale doesn’t work. So one issue is how you can add multiple machines working together as a group.”
Currently, adds Eshghi, HP’s D2D systems back up data at a rate of around one hundred megabytes per second. “But once we start facing one gigabyte per second or more of data coming in and petabytes of data in the store,” he says, “we’re talking really big numbers. So the challenges are not trivial.”