Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality
Lillibridge, Mark; Eshghi, Kave; Bhagwat, Deepavali; Deolalikar, Vinay; Trezise, Greg; Camble, Peter
Keyword(s): deduplication, storage, scaling, locality, inline deduplication, sparse index, chunking
Abstract: We present sparse indexing, a technique that uses sampling and exploits the inherent locality within backup streams to solve, for large-scale backup (e.g., hundreds of terabytes), the chunk-lookup disk bottleneck problem that inline, chunk-based deduplication schemes face. The problem is that these schemes traditionally require a full chunk index, which indexes every chunk, in order to determine which chunks have already been stored; unfortunately, at scale it is impractical to keep such an index in RAM, and a disk-based index with one seek per incoming chunk is far too slow. We perform stream deduplication by breaking up an incoming stream into relatively large segments and deduplicating each segment against only a few of the most similar previous segments. To identify similar segments, we use sampling and a sparse index. We choose a small portion of the chunks in the stream as samples; our sparse index maps these samples to the existing segments in which they occur. Thus, we avoid the need for a full chunk index. Since only the sampled chunks' hashes are kept in RAM and the sampling rate is low, we dramatically reduce the RAM required for effective deduplication. At the same time, only a few seeks are required per segment, so the chunk-lookup disk bottleneck is avoided. Sparse indexing has recently been incorporated into a number of Hewlett-Packard backup products.
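The abstract's scheme can be sketched in Python. This is an illustrative reconstruction under my own assumptions, not the authors' implementation: chunking of the raw stream is elided (segments arrive as lists of chunk payloads), and the names `SparseIndex`, `SAMPLE_MASK`, and `MAX_CHAMPIONS` are hypothetical.

```python
# Illustrative sketch of sparse-index deduplication (not the paper's code).
# Only sampled chunk hashes are held in the in-RAM sparse index; each incoming
# segment is deduplicated against a few "champion" segments that share the
# most samples with it.
import hashlib
from collections import Counter, defaultdict

SAMPLE_MASK = 0b1    # sample hashes whose low bit is 0 (~1/2 rate here;
                     # real systems would sample far more sparsely)
MAX_CHAMPIONS = 2    # deduplicate against at most this many similar segments


def chunk_hash(data: bytes) -> int:
    """Content hash of a chunk (truncated SHA-1, for illustration)."""
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


class SparseIndex:
    def __init__(self):
        self.index = defaultdict(list)  # sample hash -> ids of segments containing it
        self.segments = {}              # segment id -> set of chunk hashes (on-disk manifests)

    def dedup_segment(self, seg_id, chunks):
        """Return only the chunks of this segment that must be newly stored."""
        hashes = [chunk_hash(c) for c in chunks]
        samples = [h for h in hashes if h & SAMPLE_MASK == 0]
        # Vote: previous segments sharing the most samples become champions.
        votes = Counter(s for h in samples for s in self.index[h])
        champions = [s for s, _ in votes.most_common(MAX_CHAMPIONS)]
        known = set().union(*(self.segments[s] for s in champions)) if champions else set()
        new_chunks = [c for c, h in zip(chunks, hashes) if h not in known]
        # Record the segment; only its sampled hashes enter the RAM index.
        self.segments[seg_id] = set(hashes)
        for h in samples:
            self.index[h].append(seg_id)
        return new_chunks
```

Note the trade-off the abstract describes: RAM holds only the sampled hashes, and each segment triggers at most `MAX_CHAMPIONS` manifest reads rather than one disk seek per chunk, at the cost of possibly missing some duplicates when locality is weak.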
Additional Publication Information: Published and presented at 7th USENIX Conference on File and Storage Technologies (FAST '09), San Francisco, February 24-27, 2009
External Posting Date: June 6, 2009 [Fulltext]. Approved for External Publication
Internal Posting Date: June 6, 2009 [Fulltext]