HP Labs helps make data deduplication ready for the enterprise

HP StorageWorks D2D backup systems.

Senior Research Scientist Mark Lillibridge.


Businesses are drowning in data. Regulators are demanding that more documents be saved for easy retrieval. IT departments need a way to back up everything so that crucial information can be found quickly. And they have less time, not more, to meet these additional demands.

“It used to be that backup windows were all night and all weekend,” says Mark Lillibridge, senior research scientist at HP Labs. “In today’s 24/7 digital economy, you’re lucky if you get 2 hours to back up your systems.”

It’s not surprising, then, that one of the top data storage technologies helps businesses handle both the increase in data and the limited time for backup and recovery. Large and small businesses alike can take advantage of advances in data deduplication, the technology that underlies HP’s StoreOnce software, to make their backups faster and more efficient.

Making the most of disk storage

To accelerate data backups and recovery, it’s essential that you use disk instead of tape. “If you need to retrieve data that you’ve backed up before, you can produce it much faster if you’ve saved it on disk,” says Deepavali Bhagwat, software engineer, HP Labs. Making the backups and restoring lost data both take less time as well.

Even though the cost of disk storage has dropped in recent years, the explosion of data would make disk backups expensive and unwieldy if the entire data set were copied onto the backup disk every time a backup was made.

That’s where data deduplication comes in. It allows you to back up data to disk quickly, without recopying your entire data set and running out of storage space.

Breaking up the data

Principal Research Scientist Kave Eshghi.


Data deduplication breaks data into small chunks, each identified by a hash: a short fingerprint computed from the chunk’s contents by a mathematical function. Each chunk is stored once and looked up by its hash. Because documents often receive only minor changes or revisions, most of the data does not need to be saved again. Deduplication technology compares the hashes of incoming chunks with those already stored, recognizes the chunks that haven’t changed, and skips saving them a second time.
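To make the idea concrete, here is a minimal sketch of hash-based deduplication in Python. It is an illustration only, not how StoreOnce works: the fixed 4-byte chunk size, the choice of SHA-256, and the `ChunkStore` class are all assumptions made for this example, since the article does not say which chunking method or hash function HP uses.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical and deliberately tiny so the example is easy to follow


def chunk(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into fixed-size chunks (an assumption; real systems may chunk differently)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class ChunkStore:
    """Hypothetical in-memory store that keeps each unique chunk exactly once."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}  # hash -> chunk contents

    def backup(self, data: bytes) -> list[str]:
        """Store any chunks not seen before and return the list of hashes (the 'recipe')."""
        recipe = []
        for piece in chunk(data):
            digest = hashlib.sha256(piece).hexdigest()
            if digest not in self.chunks:  # unchanged chunks are not saved again
                self.chunks[digest] = piece
            recipe.append(digest)
        return recipe

    def restore(self, recipe: list[str]) -> bytes:
        """Rebuild the original data by looking up each hash in order."""
        return b"".join(self.chunks[d] for d in recipe)


store = ChunkStore()
monday = store.backup(b"AAAABBBBCCCCDDDD")
stored_after_monday = len(store.chunks)

# Tuesday's draft changes only the third chunk.
tuesday = store.backup(b"AAAABBBBXXXXDDDD")
stored_after_tuesday = len(store.chunks)
```

In this toy run, Monday's backup stores four chunks and Tuesday's adds only the one chunk that changed, yet both versions can still be restored in full from their recipes.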

Data deduplication alone has a drawback, however. It means that when one document—a company’s annual report, for example—is backed up day after day with only minor changes, new chunks from Monday will be in one place on the disk and new chunks from Tuesday will be in a different place. After several days, pieces of the annual report will be scattered throughout the backup disk.

If a draft of the report is lost and someone needs to restore it from backup, finding and reassembling the chunks scattered throughout the disk could be a lengthy process. “You can always do a slow search through all your data and eventually find it, but you want to do this at lightning speed,” says Kave Eshghi, principal research scientist at HP Labs.
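The scattering effect can be illustrated with a small Python sketch. It assumes a simplified model that is not taken from HP's design: each day's new chunks are written to that day's own region of the disk, and the cost of a restore is approximated by the number of distinct regions it has to visit. The function names, the 4-byte chunk size, and the use of SHA-256 are hypothetical choices for this example.

```python
import hashlib


def split(data: bytes, size: int = 4) -> list[bytes]:
    """Fixed-size chunking (an assumption made to keep the example small)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def backup(location: dict[str, int], data: bytes, day: int) -> list[str]:
    """Record where each new chunk lands; chunks seen before keep their old region."""
    recipe = []
    for piece in split(data):
        digest = hashlib.sha256(piece).hexdigest()
        location.setdefault(digest, day)  # new chunks go to today's region
        recipe.append(digest)
    return recipe


def regions_touched(location: dict[str, int], recipe: list[str]) -> int:
    """Number of distinct disk regions a restore of this recipe must visit."""
    return len({location[d] for d in recipe})


location: dict[str, int] = {}
drafts = [
    b"AAAABBBBCCCCDDDD",  # day 0: first draft
    b"AAAAbbbbCCCCDDDD",  # day 1: one chunk revised
    b"AAAAbbbbccccDDDD",  # day 2: another chunk revised
    b"AAAAbbbbccccdddd",  # day 3: another chunk revised
]
recipes = [backup(location, draft, day) for day, draft in enumerate(drafts)]
touched = [regions_touched(location, recipe) for recipe in recipes]
```

Under this model, restoring the first draft visits a single region, while restoring the latest draft visits four, one for each day on which part of it was first written.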

StoreOnce software: Taking data deduplication to the next level


HP’s StoreOnce software, based on next-generation deduplication technology, balances the need to find data quickly with the need to conserve disk space. It does this by storing chunks of data from the same document next to each other, even if they were backed up on different days. “We pay special attention to how and where we store the chunks of data so that when you want to restore in the future your information is less fragmented,” Eshghi says.

This means a single document isn’t broken up and stored in too many places, so that “we can get a guaranteed minimum restore speed,” Lillibridge says.

A storage “amplifier” for customers

HP’s StoreOnce software, which was launched in July 2010, is sold as part of a storage appliance. It’s a box that customers can turn on and start using, says Graham Perry, lead engineer on the deduplication technology development team within HP’s StorageWorks division. This offers customers the option of backing up data that needs to be easily restored, but doesn’t necessarily require split-second access. Backing up changes made to a database from a remote office might fall into this category, Perry says.

“Customers have been asking us for an amplifier that could boost their storage capabilities,” Perry says. “That’s what data deduplication does. It allows you to store more data for less money, and it allows you to manage it in a user-friendly way.”

HP has been working on advancing this technology for some time. In July 2008, the company released a solution to help small and medium-size businesses accelerate their backups. Since then, HP has evolved the technology to be more than six times faster and to offer ten times more capacity, Eshghi says.

“It’s the difference between a Model T Ford and a modern Mercedes,” Perry says.

In addition, HP has improved the technology’s ease of use. HP’s data deduplication technology has been re-engineered to make it easier to manage data services and backups at remote offices, for example. “It’s a massive step forward from what was offered two years ago,” Eshghi says. “StoreOnce represents an architecture that allows us to aggressively take data deduplication technology to enterprise customers.”

Eshghi says the technology also stands out because of its flexibility. “The same kind of architecture can be used in a variety of different situations,” he says, citing examples that range from data centers to remote offices with limited hardware.

HP Labs a valuable sounding board for new ideas

HP Labs has played an important role in the development of HP’s data deduplication technology. “We were working on data deduplication before backup speeds became a problem,” Eshghi says. “When we brought the idea to our colleagues working on product development, it was a perfect fit for HP’s customers and it’s now an important part of HP’s product portfolio.”

Eshghi describes HP Labs’ relationship with HP business units as “symbiotic.”

Perry says HP Labs helps the company’s business units quickly explore the viability of new ideas. “Ideas come from all over the place, but as with everything, the devil is in the details,” he says. “The ability to vet ideas with HP Labs and quickly determine which have potential and which ones have already been explored makes us much more efficient.”

HP is looking to the future with StoreOnce, which is currently being sold in appliances with capacity for 40 terabytes of data after deduplication. That number is expected to rise to 90 terabytes in six months, and will multiply tenfold once multinode technology is ready.

“One of the big pushes will be to go from an appliance, which is made up of a single box, to clusters or federations of these boxes working in multiple-node systems,” Perry says. “The data explosion is so rapid that it’s overtaking anything that you could handle with a single piece of hardware.”