Fast, cheap and under control

HP Labs introduces sparse indexing to radically improve data backup for small and medium-sized businesses

By Simon Firth

Most of us are pretty bad at backing up our home PCs, even though it’s easy enough to do: we can simply copy our whole hard drive to a second, backup drive.

Enterprises face a vastly greater task, but they know the value of their data. They’ll willingly pay large sums for complex, automated data backup systems: services that employ a technique called deduplication to ensure that only fresh data is recorded each time a backup is made.

Caught in the middle, though, are small and medium businesses (SMBs). They don’t have huge IT budgets, and yet they are generating ever larger volumes of data. Increasingly, that data is both vital to their operations and full of information they need to retrieve with relative ease.

Principal Research Scientist Kave Eshghi

Until recently, most SMBs ran their data backups much as individual consumers do, says HP Labs researcher Kave Eshghi. “Only today that creates a huge amount of redundancy,” he explains, “because you keep backing up the same thing over and over again, and that just raises your costs.”

Now, thanks to a collaboration between HP Labs researchers and engineers at HP’s StorageWorks business, SMBs have an affordable alternative: a new line of fast, cheap and flexible disk-based backup devices that offer deduplication on an SMB budget.

The challenge of data indexing

The secret behind HP’s new StorageWorks D2D backup systems is a novel and sophisticated kind of deduplication called sparse indexing.

For deduplication to work, a backup system has to know whether it already holds a copy of any particular piece of data, which means it needs an index of everything it holds.

“The challenge,” says Eshghi, “is that you have data coming in at hundreds of megabytes a second and you keep having to look it up to see if each chunk of data is already there.”

If you put the index in RAM (the kind of memory that runs a PC’s operating system), you can look up data fast and avoid an indexing bottleneck. “But then you need a huge amount of RAM,” Eshghi explains, “and that’s very expensive.”

Putting the index on a hard drive doesn’t help either, because the drive’s mechanical head can perform only a limited number of lookups per second, resulting in an unacceptably slow system.
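
To make that cost concrete, here is a toy sketch of the full in-RAM index approach the researchers describe (illustrative Python, not HP’s code; the store object and its write method are hypothetical, and the 4 KB chunk size and 20-byte SHA-1 digests are common figures, not ones the article gives):

    import hashlib

    full_index = {}   # chunk digest -> location in the store; one entry per stored chunk

    def backup_chunk(chunk, store):
        """Naive deduplication: every incoming chunk is looked up in a full in-RAM index."""
        digest = hashlib.sha1(chunk).digest()
        if digest not in full_index:                 # new data: store it and index it
            full_index[digest] = store.write(chunk)
        return full_index[digest]                    # duplicates cost nothing extra to store

    # Back-of-the-envelope RAM cost (illustrative figures):
    # 10 TB of data / 4 KB average chunks = ~2.5 billion index entries;
    # at 20 bytes per SHA-1 digest alone, that is ~50 GB of RAM,
    # before any hash-table overhead is counted.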

HP’s solution lies in sampling the index of the data already held. Instead of holding every index item in RAM ready for comparison, the HP team keeps just one in every hundred or so items in RAM and puts the rest onto a hard drive. Duplicate data almost always arrives in bursts. In other words, if one chunk of the arriving stream is a duplicate, it is very likely that many following chunks are duplicates. Sparse indexing takes advantage of this phenomenon by storing the sequence of hashes of the stored chunks next to each other on disk. As a result, a ‘hit’ in the sampled RAM index can direct the system to an area of the disk where many duplicates are likely to be found.
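
A minimal sketch of the idea, assuming a made-up 1-in-128 sampling rate and a hypothetical manifest_store holding the on-disk hash lists (this illustrates the technique; it is not HP’s implementation):

    SAMPLE_MASK = 0x7F    # keep roughly 1 hash in 128 in RAM (illustrative rate)

    sparse_index = {}     # sampled digest -> id of the stored segment that holds it

    def is_sampled(digest):
        # A digest is a "sample" if its low 7 bits are zero: about 1 in 128.
        return digest[0] & SAMPLE_MASK == 0

    def dedup_segment(incoming_hashes, manifest_store):
        """Deduplicate one incoming segment (a run of consecutive chunk hashes)."""
        # 1. Sampled hashes that hit the RAM index point at stored segments
        #    likely to overlap this one.
        candidates = {sparse_index[h] for h in incoming_hashes
                      if is_sampled(h) and h in sparse_index}
        # 2. A few disk reads fetch those segments' full hash lists. Because
        #    duplicates arrive in bursts, these cover most incoming duplicates.
        known = set()
        for seg_id in candidates:
            known.update(manifest_store.load(seg_id))
        # 3. Only chunks absent from the loaded lists are stored as new data.
        return [h for h in incoming_hashes if h not in known]

The point of the design is that RAM holds only the samples; thanks to the bursty nature of duplicates, a handful of sequential disk reads per segment recovers the rest, rather than one random seek per chunk.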

Senior Research Scientist Mark Lillibridge

“You can amortize the speed and memory costs of doing this kind of retrieval over the whole backup,” says Mark Lillibridge, who, with Eshghi and colleague Vinay Deolalikar, made up the HP Labs portion of the D2D team. “The bottom line,” he says, “is that, to index the same amount of data, we’re using something like half the RAM as our nearest competitors, which means the products we are offering can also be a lot cheaper.”

Indeed, when HP’s first D2D products were introduced in 2008, they matched the performance of competing products at as little as a quarter of the price.

Providing solutions to the business

Going into the D2D project, HP’s StorageWorks division already had a long history of working with HP Labs on magnetic tape technologies, notes Graham Perry, the lead engineer on the deduplication technology development team.

“So when we started to look at technologies requiring deduplication,” Perry recalls, “it was natural for us to expand that working relationship into a new area.”

Eshghi and Lillibridge had been working on an allied problem in digital movie making: how to shift gigabytes of data over transatlantic cables in order to remotely process frames of high-quality digital animation. In that case, instead of looking to avoid storing redundant data, they were looking to avoid transmitting it. “We had files that would take nine hours to transmit,” Lillibridge remembers. “But with deduplication, we could do it in minutes.”
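
The transmit-side version of the idea can be sketched the same way (a generic hash-exchange protocol, shown for illustration; the receiver object and its methods are hypothetical, and this is not necessarily how the team’s system worked):

    import hashlib

    def send_file(data, chunk_size, receiver):
        """Ship hashes first; transmit only the chunks the far side is missing."""
        chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
        digests = [hashlib.sha1(c).digest() for c in chunks]
        missing = receiver.which_missing(digests)   # one small round trip
        for i in missing:                           # often a tiny fraction of the file
            receiver.store(digests[i], chunks[i])
        receiver.assemble(digests)                  # rebuild the file hash-by-hash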

That experience proved invaluable, says Perry. “But the Labs engineers can also just sit down and come up with something completely new that no one’s thought about,” he adds. “And that’s a huge advantage to us.”

Scaling the service

Principal Research Scientist Vinay Deolalikar

The team plans to keep working together for some time. There’s plenty to do, they say.

“SMB customers have to manage more data than they ever expected,” Perry notes. “They also have to manage the data for longer, both because they’re now legally required to keep data online longer and because they need to be able to analyze their data to stay competitive.” As a result, Perry foresees no let-up in the need for SMB backup systems that are at once both capacious and quickly accessible.

The researchers agree that the next big question for deduplication is whether it can scale in terms of both capacity and speed to meet those escalating needs.

“There’s a limit to what you can do with a single box,” says Lillibridge. “Just adding another machine to the one you already have to increase your scale doesn’t work. So one issue is how you can add multiple machines working together as a group.”
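
One possible shape for that cooperation, sketched purely as an illustration (the article does not say which design HP pursued), is to route each incoming segment to a node by one of its hashes, so repeated runs of the same data keep arriving at the node whose sparse index already knows them:

    NODES = ["node-0", "node-1", "node-2", "node-3"]   # hypothetical backup nodes

    def route_segment(segment_hashes):
        """Route a whole segment by a stable representative hash, so the same
        data is always deduplicated against the same node's sparse index."""
        anchor = min(segment_hashes)
        return NODES[int.from_bytes(anchor[:4], "big") % len(NODES)]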

Currently, adds Eshghi, HP’s D2D systems back up data at a rate of around one hundred megabytes per second. “But once we start facing one gigabyte per second or more of data coming in and petabytes of data in the store,” he says, “we’re talking really big numbers. So the challenges are not trivial.”
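
The arithmetic behind that concern is straightforward, assuming an illustrative 4 KB average chunk size (a figure the article does not give):

    chunk_size = 4 * 1024              # illustrative average chunk size, in bytes
    ingest_rate = 1_000_000_000        # 1 GB/s, the rate Eshghi anticipates
    lookups = ingest_rate // chunk_size
    print(lookups)                     # ~244,000 index lookups every second,
                                       # versus the ~100 random seeks per second
                                       # a single disk head can manage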