
Overview
The Data Mining and Machine Learning project aims to develop, extend, and apply technologies and tools for finding (and enabling people to take advantage of) patterns in large datasets and data streams. The discovery of these patterns draws upon such research fields as Machine Learning, Statistics, Databases, Information Retrieval, and Information Visualization, to name a few.
We are developing technologies to intelligently analyze structured and unstructured information, so as to develop new service capabilities for our partners and customers. We work with HP business units as well as with leading-edge external customers to identify exciting research problems and high-potential application opportunities.
Problems Addressed
There are numerous application areas in which data mining plays a promising role. We have an especially productive relationship with HP Customer Support, which has led to numerous innovations and new capabilities. Here are some examples of problems we have recently addressed:
- Configuration analysis and semi-automated system assessments: how to focus proactive support resources on areas and systems that deviate from others in suspicious ways
- Automated categorization of documents in a very large topic hierarchy: how to put hundreds of thousands of documents in the right place, while minimizing the need for training data and dealing with a constantly shifting portfolio
- Enabling efficient storage, archiving, disk-based backup, and content services by taking advantage of commonalities in stored items
- Characterization of storage-system throughput patterns based on workload features, so as to enable automated storage services
- Automated correction and detection of similarities in extremely high-volume commercial transaction flows
- Analysis of conversion patterns within a sales portal
- Rapid detection of changes in hardware performance, response times, service levels, and other variables of interest, without requiring extensive customization, calibration, or configuration of the change-point detection tools
HP Labs Work
To address these and other applications, we have recently developed new technologies and algorithms in the following areas, among others:
- Clustering algorithms
We have developed a new type of clustering, Conjunctive Clustering, which does not rely on mapping data items into a metric space.
We have developed K-Harmonic Means, which has been shown to be more robust than the widely used K-Means algorithm.
We have studied scalability issues, the parallelization of clustering algorithms, and divide-and-conquer clustering in the data stream model of computation, and we have investigated the sample complexity of clustering.
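As an illustration of the K-Harmonic Means idea mentioned above, here is a minimal sketch in plain Python. It follows the commonly published formulation, in which each point contributes the harmonic mean of its distances to all centers, and it is not the HP Labs implementation. The function names, the exponent p=3.5, and the eps floor on distances are assumptions made for this example.

```python
import math

def khm_step(points, centers, p=3.5, eps=1e-8):
    """One K-Harmonic Means center update (illustrative sketch).

    Every point influences every center. A point's influence on center j is
    q_ij = d_ij^(-p-2) / (sum_l d_il^(-p))^2, where d_ij is the Euclidean
    distance from point i to center j, floored at eps to avoid dividing by zero.
    """
    k = len(centers)
    dim = len(centers[0])
    numer = [[0.0] * dim for _ in range(k)]
    denom = [0.0] * k
    for x in points:
        d = [max(math.dist(x, c), eps) for c in centers]
        b = sum(dj ** (-p) for dj in d)
        for j in range(k):
            q = d[j] ** (-p - 2) / (b * b)
            denom[j] += q
            for t in range(dim):
                numer[j][t] += q * x[t]
    # Each new center is the q-weighted mean of all points.
    return [[numer[j][t] / denom[j] for t in range(dim)] for j in range(k)]

def khm(points, centers, iterations=30, p=3.5):
    """Run a fixed number of update steps (no convergence test in this sketch)."""
    for _ in range(iterations):
        centers = khm_step(points, centers, p)
    return centers
```

Unlike K-Means, no point is assigned to a single center; the soft weighting is what is usually credited with making the result less sensitive to the initial choice of centers.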
- Content management
We have developed methods for highly efficient representation of very large data items and structured collections of such items, building on work in intrinsic references and chunking.
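The general idea behind chunking with content-derived references can be sketched as follows. This example assumes that an "intrinsic reference" is a hash computed from a chunk's own content, and it uses fixed-size chunks and SHA-256 for simplicity; the class name ChunkStore and all parameters are hypothetical and do not describe the actual HP Labs design.

```python
import hashlib

class ChunkStore:
    """Toy content-addressed store: identical chunks are kept only once."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = {}  # content hash -> chunk bytes

    def put(self, data):
        """Split data into chunks, store each one, and return its list of references."""
        refs = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            ref = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(ref, chunk)  # no-op if already stored
            refs.append(ref)
        return refs

    def get(self, refs):
        """Rebuild an item from its list of references."""
        return b"".join(self.chunks[r] for r in refs)
```

Two items that share content end up sharing references, so the common chunks occupy storage only once. Fixed-size chunks lose that sharing when data is inserted or shifted, which is why practical systems often choose chunk boundaries from the content itself.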
- Feature selection
We have developed a new feature-selection algorithm, Bi-Normal Separation, which outperforms previously known alternatives.
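A minimal sketch of how a Bi-Normal Separation score can be computed for a single feature is shown below. It uses the commonly published definition: the absolute difference between the inverse standard normal CDF of the feature's true-positive rate and that of its false-positive rate. The function name, argument names, and the clipping constant 0.0005 are assumptions for this example.

```python
from statistics import NormalDist

def bi_normal_separation(tp, fp, pos, neg, clip=0.0005):
    """Score one feature for a two-class problem (illustrative sketch).

    tp:  number of positive documents that contain the feature
    fp:  number of negative documents that contain the feature
    pos: total number of positive documents
    neg: total number of negative documents
    Rates are clipped away from 0 and 1, where the inverse CDF is undefined.
    """
    inv_cdf = NormalDist().inv_cdf
    tpr = min(max(tp / pos, clip), 1.0 - clip)
    fpr = min(max(fp / neg, clip), 1.0 - clip)
    return abs(inv_cdf(tpr) - inv_cdf(fpr))
```

Features would then be ranked by this score and the highest-scoring ones kept. A feature that occurs equally often in both classes scores zero, while one that is strongly tied to either class scores high.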
- Categorization
We created a set of tools to manage a topic hierarchy, provide example documents (training cases) for each topic, tag documents automatically, test the accuracy of the resulting tags, and enable the use of the resulting validated tags in browsing and searching. One of the design goals was to make this toolset easily reusable; other opportunities in content-management solutions might also benefit from our contributions.
- Genetic programming
We have developed GPLab, a robust, powerful, and, above all, flexible platform for genetic programming research, experimentation, and application. We are also conducting fundamental research aimed at expanding the power of the genetic programming paradigm.
Contact: Jaap.Suermondt@hp.com
