A Heterogeneous Naive-Bayesian Classifier for Relational Databases
Manjunath, Geetha; Murty, M Narasimha; Sitaram, Dinkar
Keyword(s): Relational databases, Classification, Data Mining, RDF
Abstract: Most enterprise data is distributed across multiple relational databases with expert-designed schemas. Applying single-table data mining techniques to distributed relational data not only incurs a computational penalty for converting the data to a "flat" form (mega-join), but also loses the human-specified semantic information present in the relations and schema. Purely relational classification algorithms, on the other hand, do consider detailed relationships between attributes. However, these techniques require either computationally intensive transformations or multiple analyses of fused datasets, which becomes infeasible in practical scenarios. Because classification is one of the most popular predictive data mining tasks, we need practical algorithms that can be applied directly to existing databases. We present such a practical two-phase classification algorithm for relational databases, based on a semantic divide-and-conquer approach. We propose and prove a recursive prediction aggregation technique over heterogeneous classifiers applied to individual tables. Our approach also attempts to leverage the semantic knowledge of the application that is hidden in the database schema by using the application's Join Graph. To automate the classification process, RDF (the core Semantic Web data model) is used for problem specification. A preliminary evaluation over TPC-H and UCI benchmarks shows reduced training time in automated practical scenarios, without any loss of prediction accuracy. In fact, a comparison with other state-of-the-art techniques shows improved accuracy due to the application of heterogeneous classifiers to individual tables.
External Posting Date: September 6, 2009. Approved for External Publication
Internal Posting Date: September 6, 2009
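
Illustrative sketch (not taken from the report): the abstract describes aggregating the predictions of heterogeneous classifiers trained on individual tables. One standard way to combine such per-table outputs, assuming the attribute sets of different tables are conditionally independent given the class, is the naive-Bayes product rule P(c | x_1..x_T) ∝ P(c)^(1-T) · ∏ P(c | x_t). The Python below is a minimal sketch of that rule only. The function name combine_posteriors, the dictionary layout, and the probability floor are assumptions made for illustration; the report's recursive traversal of the Join Graph and its RDF problem specification are not reproduced here.

```python
import math


def combine_posteriors(prior, table_posteriors, floor=1e-12):
    """Combine per-table class posteriors with the naive-Bayes product rule.

    prior            -- dict mapping class label -> P(c), each value > 0
    table_posteriors -- list of dicts, one per table, mapping
                        class label -> P(c | attributes of that table)
    floor            -- hypothetical lower bound that keeps log() finite
                        when a classifier outputs an exact zero

    Assumes the tables' attribute sets are conditionally independent
    given the class. Returns a normalised dict class label -> probability.
    """
    n_tables = len(table_posteriors)
    log_scores = {}
    for label, p_c in prior.items():
        # P(c)^(1 - T) term: each posterior already contains one prior factor.
        score = (1 - n_tables) * math.log(p_c)
        for posterior in table_posteriors:
            score += math.log(max(posterior[label], floor))
        log_scores[label] = score

    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores.values())
    unnormalised = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnormalised.values())
    return {c: v / total for c, v in unnormalised.items()}


if __name__ == "__main__":
    prior = {"yes": 0.5, "no": 0.5}
    # Posteriors from two different classifiers, one per table (made-up numbers).
    per_table = [{"yes": 0.8, "no": 0.2}, {"yes": 0.6, "no": 0.4}]
    print(combine_posteriors(prior, per_table))
```

Because the rule only consumes class posteriors, each table can be handled by whichever classifier suits it, as long as that classifier emits probabilities over the same class labels.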