CS 236601: Information Retrieval and Digital Libraries

Project

During the semester, the class will build an information indexing and retrieval system that combines elements of a web search engine with elements of a personal digital library. The emphasis will be on the algorithmic and system elements of information retrieval systems: for example, comparing different scoring algorithms for ranking results by relevance, or attempting to automatically build human-sensible topic hierarchies from an amorphous document collection.
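
As one illustration of what a scoring algorithm for relevance ranking might look like, here is a minimal TF-IDF sketch. It is only an assumption about the kind of algorithm a project could compare, not a prescribed method; the function names `tokenize` and `tf_idf_rank` and the whitespace tokenizer are hypothetical choices made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is good enough for a sketch.
    return text.lower().split()


def tf_idf_rank(query, docs):
    """Return (doc_index, score) pairs sorted by descending TF-IDF score."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)

    # Document frequency: the number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for i, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                # Raw term frequency times log inverse document frequency.
                score += tf[term] * math.log(n / df[term])
        scores.append((i, score))

    # Highest score first; ties keep their original document order.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


docs = [
    "python search engine",
    "email filter",
    "search engine ranking search",
]
ranked = tf_idf_rank("search", docs)
```

Swapping the line that computes `score` for another weighting scheme is one simple way to compare scoring algorithms on the same corpus.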

Project Suggestions

Here is a list of project suggestions. You are free to choose any of the listed suggestions, or you may propose an alternative project. Unfortunately, the scope of many of these projects will be limited by the available disk space. Because the Computer Science Department seems to limit students to 30MB of disk space, students cannot create corpora of a reasonable size. In turn, this makes it impractical to build unified corpora with gigabytes of data from disparate sources.
  • A SONIA-like meta-search engine that collects results from various search engines, downloads the returned pages, and post-processes them by clustering the pages into categories and ranking the pages within each category. It should also rank the categories, provide a short description of each category (perhaps as a list of keywords), and provide a means for iterative searching with feedback. It might also store previous searches, as Copernic does, and provide a simple interface that lets users add their own search engines, for example ones that search local information or servers.
  • A bookmark organizer that takes a (perhaps slightly organized) set of bookmarks or favorites and creates a Yahoo-like topic hierarchy. It should also be able to automatically incorporate new pages from the browser's history into the hierarchy, and update the hierarchy as new pages are added. Optionally, pages should be able to belong to multiple relevant categories.
  • An ifile-like email filter that adaptively filters email based on previous user actions. Optionally, documents should be able to live in multiple categories.
  • A Personal Information Assistant that indexes content from several personal information sources, such as email, local files, and bookmarks/favorites, and provides a unified search over personal, local, and global information. Searches over local (corporate) and global (internet) information would be done using meta-search engine capabilities. It would also create and manage a Yahoo-like hierarchical view of the information. Documents should be able to live in multiple categories, and the hierarchy management should adapt to the user's actions when refiling documents. Only students with access to more disk space than the pitiful quota allowed by the Technion Computer Science Department will be able to undertake this project.
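
To make the adaptive email filter suggestion more concrete, here is a minimal sketch of a filter that learns from the folders a user has already filed messages into. It uses multinomial naive Bayes with add-one smoothing as an assumed technique; it is not necessarily how ifile itself works. The class name `NaiveBayesFilter`, its methods, and the sample messages are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesFilter:
    """Suggests a folder for a message based on previously filed messages."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word -> count
        self.doc_counts = Counter()              # folder -> number of messages
        self.vocab = set()

    def train(self, text, folder):
        # Called whenever the user files (or refiles) a message into a folder.
        words = text.lower().split()
        self.word_counts[folder].update(words)
        self.doc_counts[folder] += 1
        self.vocab.update(words)

    def classify(self, text):
        # Returns the most probable folder, or None if nothing was trained.
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_folder, best_score = None, float("-inf")
        for folder in self.doc_counts:
            counts = self.word_counts[folder]
            total_words = sum(counts.values())
            # Log prior plus smoothed log likelihood of each word.
            score = math.log(self.doc_counts[folder] / total_docs)
            for w in words:
                score += math.log(
                    (counts[w] + 1) / (total_words + len(self.vocab))
                )
            if score > best_score:
                best_folder, best_score = folder, score
        return best_folder


mail_filter = NaiveBayesFilter()
mail_filter.train("meeting schedule project deadline", "work")
mail_filter.train("cheap pills offer buy now", "spam")
suggestion = mail_filter.classify("project meeting tomorrow")
```

Calling `train` again each time the user moves a message is what makes the filter adaptive in this sketch. Supporting multiple categories per document could be approximated by returning every folder whose score is close to the best one.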

Available Software

To promote rapid development and prototyping, the projects will be developed in the scripting language Python. Python includes a large number of libraries that provide a great deal of basic functionality, such as fetching URLs, parsing MIME types, and connecting to mail servers. The basic Python system will be enhanced with a number of extensions:
Numeric
A Python interface for linear algebra. While not required, it is a good idea to install it with LAPACK, which in turn should be installed with ATLAS for high-performance operation.
MatPy
A MATLAB-like interface for Python built on Numeric.
PyLapack
A Python interface for the complete LAPACK library.
MySQL-python
A Python interface for the MySQL database server, which will be used instead of a hand-coded inverted index.
wxPython
A binding of the wxWindows GUI environment to Python. "wxPython for newbies" is a nice introduction to wxPython.
libsvm
A library of support vector machine classifiers and regression engines. A Python interface is in development.
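
The idea of using a database server instead of a hand-coded inverted index, mentioned under MySQL-python above, can be sketched roughly as follows. This example substitutes Python's built-in `sqlite3` module for MySQL so that it runs without a server; the `postings` table layout and the function names `build_index` and `search` are assumptions made for illustration, not the course's actual schema. With MySQL-python the SQL would be similar, although the parameter placeholder is `%s` rather than `?`.

```python
import sqlite3


def build_index(docs):
    """Store term -> document postings in an in-memory SQL table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE postings (term TEXT, doc_id INTEGER, freq INTEGER)"
    )
    # The index on term is what makes lookups behave like an inverted index.
    conn.execute("CREATE INDEX idx_term ON postings (term)")
    for doc_id, text in enumerate(docs):
        counts = {}
        for term in text.lower().split():
            counts[term] = counts.get(term, 0) + 1
        conn.executemany(
            "INSERT INTO postings VALUES (?, ?, ?)",
            [(term, doc_id, freq) for term, freq in counts.items()],
        )
    conn.commit()
    return conn


def search(conn, query):
    """Return ids of documents containing every query term, best first."""
    terms = sorted(set(query.lower().split()))
    if not terms:
        return []
    placeholders = ",".join("?" for _ in terms)
    rows = conn.execute(
        "SELECT doc_id, SUM(freq) AS total FROM postings "
        f"WHERE term IN ({placeholders}) "
        "GROUP BY doc_id HAVING COUNT(*) = ? "
        "ORDER BY total DESC, doc_id ASC",
        terms + [len(terms)],
    ).fetchall()
    return [doc_id for doc_id, _ in rows]


conn = build_index([
    "digital library search",
    "web search engine search",
    "email filter",
])
hits = search(conn, "search")
```

Here documents are ordered by the summed frequency of the query terms, which is only a placeholder for a real relevance score.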