
CS 236601: Information Retrieval and Digital Libraries
Project
During the semester the class will build an information indexing and
retrieval system which will combine elements of a web search engine
with elements of a personal digital library. The emphasis will be
on algorithmic and system elements of information retrieval systems.
Examples include comparing different scoring algorithms for result
relevance ranking, or attempting to automatically build human-sensible
topic hierarchies from an amorphous document collection.
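As a concrete starting point for comparing scoring algorithms, here is a
minimal sketch of TF-IDF ranking, one of the simplest relevance scores.
It is an illustration only, not a required design: the function names
`tokenize` and `tfidf_rank`, the whitespace tokenizer, and the sample
documents are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a real system would do more."""
    return text.lower().split()


def tfidf_rank(query, docs):
    """Return (doc_index, score) pairs sorted by descending TF-IDF score."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for i, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                # Rare terms get a larger weight than common ones.
                idf = math.log(n_docs / df[term])
                score += tf[term] * idf
        scores.append((i, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical three-document collection.
docs = [
    "python search engine",
    "digital library search",
    "email filter",
]
ranking = tfidf_rank("python search", docs)
# Document 0 matches both query terms, so it ranks first.
```

A project that compares scoring algorithms could keep the same
`(doc_index, score)` output and swap in a different formula inside the
loop, so that rankings from each formula can be compared side by side.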
Project Suggestions
Here is a list of project suggestions. You are free to choose any of
the listed suggestions, or you may propose an alternate project.
Unfortunately, the scope of many of these projects will be limited
by the available disk space. Since the Computer Science Department
seems to limit students to 30MB of disk space, it is impossible
for students to create reasonably sized corpora. In turn, this
limits the ability to create unified corpora with gigabytes of
data from disparate sources.
- A SONIA-like meta-search engine that collects search results from
various search engines, downloads the returned pages, and
post-processes them to cluster the pages into categories and rank
the pages within each category. It should also rank the categories,
provide a short description of each category (perhaps as a list of
keywords), and provide a means for iteratively searching with
feedback. It might also store previous searches, as Copernic does,
and provide a simple interface that lets users add their own search
engines, for example engines that search local information or
servers.
- A bookmark organizer that takes a (perhaps slightly organized) set
of bookmarks or favorites and creates a Yahoo-like topic hierarchy.
It should also be able to automatically incorporate new pages from
the browser's history in the hierarchy, and update the hierarchy
as new pages are added. Optionally, pages should be able to belong
to multiple relevant categories.
- An ifile-like email filter that adaptively filters email based on
previous user actions. Optionally, documents should be able to live
in multiple categories.
- A Personal Information Assistant that indexes content
from several personal information sources, such as email, local
files, and bookmarks/favorites, and provides a unified search over
the personal, local, and global information. Searches over local
(corporate) and global (internet) information would be done
using meta-search engine capabilities. It would also create and
manage a Yahoo-like hierarchical view of the information.
Documents should be able to live in multiple categories, and
the hierarchy management should adapt to the user's actions with
respect to refiling documents.
Only students with access to more disk space than the pitiful
quota allowed by the Technion Computer Science Department will
be able to undertake this project.
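To make the adaptive email filter suggestion more concrete, here is a
minimal sketch of a filter that learns from the folders a user files
messages into. It assumes a multinomial naive Bayes classifier with
add-one smoothing, which is one common choice for this task and not a
requirement of the project. The class name `NaiveBayesFilter`, its
methods, and the sample messages are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesFilter:
    """Multinomial naive Bayes with add-one smoothing over word counts."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word -> count
        self.doc_counts = Counter()              # folder -> messages seen
        self.vocab = set()

    def train(self, text, folder):
        """Record one user action: `text` was filed into `folder`."""
        words = text.lower().split()
        self.word_counts[folder].update(words)
        self.doc_counts[folder] += 1
        self.vocab.update(words)

    def classify(self, text):
        """Return the folder with the highest posterior log-probability."""
        total_docs = sum(self.doc_counts.values())
        best_folder, best_score = None, float("-inf")
        for folder, n_docs in self.doc_counts.items():
            counts = self.word_counts[folder]
            total_words = sum(counts.values())
            # Start from the log prior of the folder.
            score = math.log(n_docs / total_docs)
            for word in text.lower().split():
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log(
                    (counts[word] + 1) / (total_words + len(self.vocab))
                )
            if score > best_score:
                best_folder, best_score = folder, score
        return best_folder


# Hypothetical training data: each call mimics the user filing a message.
mail_filter = NaiveBayesFilter()
mail_filter.train("project deadline lecture homework", "course")
mail_filter.train("lecture notes syllabus", "course")
mail_filter.train("cheap pills offer now", "junk")
guess = mail_filter.classify("lecture homework")
```

Because `train` only updates counts, the filter adapts each time the
user refiles a message. Supporting multiple categories per document
would require returning every folder whose score passes a threshold
rather than only the best one.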
Available Software
In order to promote rapid development and prototyping, the projects
will be developed in a scripting language, Python. Python includes a
large number of libraries which provide a great deal of basic
functionality, such as fetching URLs, parsing MIME types, and
connecting to mail servers.
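As a small illustration of that built-in functionality, the sketch below
uses only standard library modules to fetch a URL and to parse a MIME
message. It assumes a current Python 3 release, where the relevant
modules are `urllib.request` and `email`. The helper name `fetch_page`
and the sample message are made up for this example, and `fetch_page` is
defined but not called because it needs network access.

```python
import urllib.request
from email import message_from_string


def fetch_page(url, timeout=10):
    """Fetch a URL and return its body as text (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# A hypothetical message, as it might arrive from a mail server.
raw = (
    "From: student@example.org\n"
    "Subject: Project proposal\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "A bookmark organizer with a topic hierarchy.\n"
)
msg = message_from_string(raw)
subject = msg["Subject"]                # header lookup by name
content_type = msg.get_content_type()   # MIME type of the message
body = msg.get_payload()                # text of a single-part message
```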
The basic Python system will be enhanced with a number of extensions:
- Numeric: A Python interface for linear algebra. While not
required, it is a good idea to install it with LAPACK, which in turn
should be installed with ATLAS for high-performance operation.
- MatPy: A MATLAB-like interface for Python built on Numeric.
- PyLapack: A Python interface for the complete LAPACK library.
- MySQL-python: A Python interface for the MySQL database server,
which will be used instead of a hand-coded inverted index.
- wxPython: A binding of the wxWindows GUI environment to Python.
"wxPython for newbies" is a nice introduction to wxPython.
- libsvm: A library of support vector machine learning classifiers
and regression engines. A Python interface is in development.
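To show what using a database in place of a hand-coded inverted index
could look like, here is a minimal sketch of a postings table. It uses
the standard library's `sqlite3` module purely as a stand-in so that
the example runs without a server; the course itself uses MySQL. The
table name `postings`, its columns, and the helper functions are
assumptions for illustration, not a prescribed schema.

```python
import sqlite3

# In-memory SQLite database, standing in for a MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE postings (term TEXT, doc_id INTEGER, freq INTEGER)")
# The index on `term` is what makes lookups behave like an inverted index.
conn.execute("CREATE INDEX idx_term ON postings (term)")


def index_document(doc_id, text):
    """Store one (term, doc_id, frequency) row per distinct term."""
    counts = {}
    for term in text.lower().split():
        counts[term] = counts.get(term, 0) + 1
    conn.executemany(
        "INSERT INTO postings VALUES (?, ?, ?)",
        [(term, doc_id, freq) for term, freq in counts.items()],
    )


def search(term):
    """Return ids of documents containing `term`, most frequent first."""
    rows = conn.execute(
        "SELECT doc_id FROM postings WHERE term = ? ORDER BY freq DESC, doc_id",
        (term.lower(),),
    )
    return [row[0] for row in rows]


# Hypothetical three-document collection.
index_document(1, "search engine search results")
index_document(2, "digital library search")
index_document(3, "email filter")
hits = search("search")
```

The same statements should carry over to MySQL through a DB-API cursor,
although the placeholder syntax differs between drivers, so the `?`
markers would need to be adapted.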