On Wavelet Compression and Cardinality Estimation of Enterprise Data
Choudur, Lakshminarayan; Dayal, Umeshwar; Gupta, Chetan; Swaminathan, Ram
Keyword(s): compression, wavelets, thresholding, energy
Abstract: Storing and analyzing large volume of structured or unstructured data at the scale of petabytes in applications such as business intelligence of an enterprise, is a daunting task. It is therefore desirable to store data in a compressed form. Compression often is performed using transform methods that permit capture of details in the data while at the same time representing it efficiently; wavelet transform is one such compression technique. Viewing the data as a signal, wavelet transform involves computing coefficients and selecting a small subset judiciously to approximate the signal. Because we pick a subset of the coefficients and not all of them to synthesize the data, wavelet based compression is inherently lossy. Compression in the wavelet domain is traditionally done using hard and soft thresholding techniques and variants thereof. Based on the notion of "total energy" of a signal, in this paper, we introduce two new thresholding methods called, level-independent energy and level-dependent energy thresholding. In particular, our energybased thresholding techniques exploit the relationship between total energy of the signal and wavelet coefficients to obtain compression to meet pre- specified error tolerances. In addition, level- dependent energy thresholding aims at determining wavelet coefficients that carry most "information" at various resolutions of the wavelet decomposition of the data distribution. Detailed studies and experiments over noise contaminated synthetic and TPCH benchmark data sets indicate that energy based thresholding methods yield approximately 100:1 compression with high reconstruction accuracy. As an application of our compression techniques, we show that energy-based thresholding methods improve accuracy of cardinality estimation in databases substantially over the popular methods based on equi- height and max-diff histograms as well as over hard, soft and probabilistic thresholding techniques from the statistics and database literature.
External Posting Date: October 6, 2010 [Fulltext]. Approved for External Publication
Internal Posting Date: October 6, 2010 [Fulltext]