Technical Reports

HPL-2010-22R1

Click here for full text: PDF

A New Frequent Similar Tree Algorithm Motivated by DOM Mining Using RTDM and its new variant SiSTeR

Omer, Barkol; Ruth, Bergman; Shahar, Golan
HP Laboratories

HPL-2010-22R1

Keyword(s): data-mining; HTML; RTDM; SiSTeR

Abstract: The importance of recognizing repeating structures in web applications has generated a large body of work on algorithms for mining the HTML Document Object Model (DOM). A restricted tree edit distance metric, called the Restricted Top Down Metric (RTDM), is most suitable for DOM mining as well as less computationally expensive than the general tree edit distance. Given two trees with input size n1 and n2, the current methods take time O(n1n2) to compute RTDM. Consider, however, looking for patterns that form subtrees within a web page with n elements. The RTDM must be computed for all subtrees, and the running time becomes O(n4). This paper proposes a new algorithm which computes the distance between all the subtrees in a tree in time O(n2), which enables us to obtain better quality as well as better performance, on a DOM mining task. In addition, we propose a new tree edit-distance-SiSTeR (Similar Sibling Trees aware RTDM). This variant of RTDM allows considering the case were repetitious (very similar) subtrees of different quantity

6 Pages

External Posting Date: June 28, 2012 [Fulltext]. Approved for External Publication
Internal Posting Date: June 28, 2012 [Fulltext]

Back to Index