Efficient Detection of Large Scale Redundancy in Enterprise File Systems

Forman, George; Eshghi, Kave; Suermondt, Jaap
Keyword(s): data mining, min-hashing, set sketches, directory similarity and deduplication, file systems, scalability, storage management.

Abstract: To catch and reduce waste amid the exponentially growing demand for disk storage, we have developed a technology based on set sketches that enables enterprise storage managers to efficiently detect approximate duplication of large directory hierarchies, e.g., unnecessary mirroring by uncoordinated employees or departments. Identifying these duplicate or near-duplicate hierarchies allows appropriate action to be taken at a high level, e.g., coordinating and consolidating multiple copies in one location.
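
As a rough illustration of the set-sketch idea named in the abstract and keywords, the following is a minimal, generic min-hashing sketch in Python. It is not the authors' implementation: the function names (`min_hash_sketch`, `estimate_similarity`), the sketch size of 128, the use of salted SHA-1 as the hash family, and the treatment of a directory as a plain set of file names are all assumptions made for this example. Each directory is reduced to a fixed-length list of minimum hash values, and the fraction of positions where two sketches agree estimates the Jaccard similarity of the underlying sets.

```python
import hashlib


def _hash(item: str, seed: int) -> int:
    """Hypothetical hash family: SHA-1 salted with a seed, truncated to 64 bits."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def min_hash_sketch(items, num_hashes=128):
    """Return a fixed-size sketch: the minimum hash of the set under each seed."""
    items = list(items)
    if not items:
        raise ValueError("cannot sketch an empty set")
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_similarity(sketch_a, sketch_b):
    """Estimate Jaccard similarity as the fraction of matching sketch positions."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


# Illustrative data only: each directory is modelled as a set of file names.
dir_a = {f"file{i}.dat" for i in range(100)}
dir_b = {f"file{i}.dat" for i in range(10, 110)}   # shares 90 files with dir_a
dir_c = {f"other{i}.bin" for i in range(100)}      # shares nothing with dir_a

sketch_a = min_hash_sketch(dir_a)
sketch_b = min_hash_sketch(dir_b)
sketch_c = min_hash_sketch(dir_c)

print("a vs b:", estimate_similarity(sketch_a, sketch_b))
print("a vs c:", estimate_similarity(sketch_a, sketch_c))
```

Because every sketch has the same small size regardless of how many files a directory holds, comparing two directories costs a fixed number of integer comparisons. In practice the set elements would more plausibly be content hashes than bare file names, but that choice is likewise an assumption here.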

