CS 236601: Information Retrieval and Digital Libraries

Software

Many software packages provide information indexing and retrieval functionality. This is a (partial) list of relevant programs that are publicly available. Most of them are covered either by the GPL or by another open-source license.

Please note that the software categories are often blurred, and a package may rightfully belong to more than one group. For example, many search engine packages include a web crawler.

For the software used in this course, please see the Project page.

Search Engines

Harvest
Harvest is a system that collects information and makes it searchable through a Web interface. Harvest can collect information on the Internet using HTTP, FTP, and NNTP, as well as from local files. Supported formats include HTML, DVI, PS, fulltext, mail, man pages, news, troff, WordPerfect, C sources, and many more. Adding support for a new format is easy thanks to Harvest's modular design.
Harvest-ng
Harvest-NG is a set of tools for building a standards-compliant Web crawler. It is implemented in Perl and can provide a complete resource discovery system, with a modular structure that lets others easily customise the system to their own needs. Harvest-NG was started as an attempt to preserve the strong features of the Harvest architecture while allowing more rapid development and prototyping from a cleaner, better structured codebase.
LiSEn
A Little Search Engine for niche topics.
Autojot
Autojot lets you index and search every Web page you view so you can find things again later.
ht-dig
ht://Dig is a world-wide-web search system for an intranet or small internet.
ASPSeek
ASPSeek is an Internet search engine, written in C++ using the STL library. It consists of an indexing robot, a search daemon, and a CGI search frontend.
Webglimpse
Webglimpse is the spider and manager for the files to be indexed by glimpse.
Alkaline
Alkaline is a full-featured standalone search and index server. The spider is a fully remote indexing daemon that supports standards such as robots.txt and "skip" meta tags, and allows multiple distinct configurations and search groups (searching many different sites from your server). Other features include complex regexp indexing paths, authentication, filters for various document formats, XML-based online management and statistics, mrtg-compatible performance numbers, and more.
ESSE
ESSE (Efficient Site Search Engine) is a search engine for Web sites which is based on the PostgreSQL database engine. You can choose extensions for searchable documents, and you can fully customize the output of queries. It is configurable, and allows you to speed up the search by using statistic keywords instead of searching through the full document.
Sintrasearch
Sintrasearch is a robust and stable search engine, used for intranet sites.
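
Several of the spiders listed above honor robots.txt. As a minimal sketch of what such a check involves (not the code of any package listed here), the following uses Python's standard urllib.robotparser module. The robots.txt content, the user-agent name "example-spider", and the example.org URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real spider would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "example-spider", "http://example.org/index.html"))
print(allowed(ROBOTS_TXT, "example-spider", "http://example.org/private/a.html"))
```

A spider would typically run this check once per URL before fetching it and cache the parsed rules per host.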

Web Crawlers

Webbase
webbase is an internet web crawler based on a MySQL database. It is available as a command line program or as a library (shared or static).
Larbin
Larbin is an HTTP Web crawler with an easy interface that runs under Linux. It can fetch more than 5 million pages a day on a standard PC (with a good network).
wget
Wget is a freely available network utility for retrieving files from the World Wide Web using HTTP and FTP, the two most widely used Internet protocols. It works non-interactively, so it can keep running in the background after the user has logged off.
FWget
FWget allows you to download Web sites. It is based on wget, which has serious problems processing huge sites. FWget uses hash tables to avoid such performance problems.
pagesucker
Pagesucker downloads a web page from a remote web server. A list of webservers, paths and output filename can be passed on the command-line or in a file.
Parallel URL fetcher
puf is a tool that can be used to download single files or mirror entire servers. It is similar to GNU wget (and has a partly-compatible command line), but has the ability to do many downloads in parallel. This is very useful if you have a high-bandwidth Internet connection.
httpr
httpr is a Perl script that recursively downloads entire Web sites or parts of Web sites. Given a URL, httpr downloads that URL and all files and images on the same site that it links to.
Sirobot
Sirobot is a Perl script that downloads Web pages recursively. Its main advantage over wget is that it fetches pages concurrently. It can also continue aborted downloads and convert absolute links to relative ones. It uses curses, supports HTTPS, and has a pattern-matching filter to prevent you from downloading the whole Internet.
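
The recursive downloaders above share a common core: extract links, stay on the same site, and remember which URLs have already been visited (the hash-table idea that FWget's description mentions). The following is a minimal sketch of that loop using only the Python standard library. It is not the implementation of any listed tool; the PAGES dictionary is a hypothetical in-memory site that stands in for real HTTP fetches, so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl restricted to the host of start_url.

    fetch(url) is supplied by the caller and returns HTML text or None.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}           # hash set: constant-time duplicate checks
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)   # resolve relative links
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical in-memory "site" used instead of real HTTP requests.
PAGES = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="http://other.example/">X</a>',
    "http://example.org/a.html": '<a href="/">home</a> <a href="b.html">B</a>',
    "http://example.org/b.html": "<p>leaf</p>",
}

print(crawl("http://example.org/", PAGES.get))
```

Passing the fetch function as a parameter keeps the traversal logic separate from the transport, so a real HTTP client (or a parallel one, as in puf or Sirobot) could be swapped in.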

Text Indexing and Retrieval

Mifluz
The purpose of mifluz is to provide a C++ library to build and query a full text inverted index. It is dynamically updatable, scalable (up to 1Tb indexes), uses a controlled amount of memory, shares index files and memory cache among processes or threads and compresses index files to 50% of the raw data.
Cheshire
The Cheshire II project is developing a next-generation online catalog and full-text information retrieval system using advanced IR techniques.
glimpse
Glimpse is a very powerful indexing and querying system that allows you to search through all your files very quickly. It can be used by individuals for their personal file systems as well as by organizations for large data collections.
ISearch
Isearch is software for indexing and searching text documents. It supports full text and field based search, relevance ranked results, Boolean queries, and heterogeneous databases. Isearch can parse many kinds of documents "out of the box," including HTML, mail folders, list digests, SGML-style tagged data, and USMARC.
locus
locus lets you find words in your texts, for example newsgroup messages, Web page mirrors, electronic books - whatever you have. It uses word patterns (order, locality etc.) to match queries to texts, makes reasonable choices by default yet does exactly what you want when you specify it.
lq-text
lq-text is a command line-oriented text retrieval package for UNIX. You can use lq-text to index your text or HTML documents, and then search for words or phrases in them.
mg
The MG system is a suite of programs for compressing and indexing text and images. Most of the functionality implemented in the suite is described in the book "Managing Gigabytes: Compressing and Indexing Documents and Images".
namazu
Namazu is a full-text search system intended for easy use. Not only does it work as a small or medium scale Web search engine, but also as a personal search system for email or other files. Supported document types: HTML, Mail/News, MHonArc, RFC, TeX (with detex), man (with groff), Word (with wvWare), PDF (with pdftotext) and plain text.
SWISH++
SWISH++ is a Unix-based file indexing and searching engine (typically used to index and search files on web sites). It was based on SWISH-E although SWISH++ is a complete rewrite. SWISH++ is at least 10 times faster and can handle much larger numbers of files. Additionally, it has unique features such as selective non-indexing, on-the-fly filters, user-selectable stemming, and more.
YASE
YASE is a text indexing and retrieval system. It allows you to index your document collection very easily. All words are indexed and can be optionally stemmed. The query tool supports searching all/any terms and can rank query results by relevance using the cosine measure.
libbow
Libbow is a library of C code intended for writing statistical text-processing programs. This distribution includes the library, as well as a text classification front-end, and a document retrieval front-end.
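
Two ideas recur in the descriptions above: the full-text inverted index (Mifluz) and relevance ranking by the cosine measure (YASE). The sketch below illustrates both on a toy collection. It is an assumption-laden illustration, not the algorithm of any listed package: the whitespace tokenizer, the TF-IDF weighting with "+1" smoothing, and the DOCS collection are all choices made for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def rank(query, docs, index):
    """Rank documents by the cosine of the angle between TF-IDF vectors."""
    n = len(docs)
    # Assumed weighting: idf = log(N / df) + 1, so every weight is positive.
    idf = {term: math.log(n / len(postings)) + 1.0
           for term, postings in index.items()}

    # Squared Euclidean norm of each document vector.
    norms = defaultdict(float)
    for term, postings in index.items():
        for doc_id, tf in postings.items():
            norms[doc_id] += (tf * idf[term]) ** 2

    # Query vector, ignoring terms that never occur in the collection.
    q_weights = {term: tf * idf[term]
                 for term, tf in Counter(tokenize(query)).items()
                 if term in index}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))

    # Dot products, touching only the postings of the query terms.
    scores = defaultdict(float)
    for term, q_weight in q_weights.items():
        for doc_id, tf in index[term].items():
            scores[doc_id] += q_weight * tf * idf[term]

    results = [(doc_id, dot / (q_norm * math.sqrt(norms[doc_id])))
               for doc_id, dot in scores.items()]
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy collection.
DOCS = {
    "d1": "web crawler fetches web pages",
    "d2": "inverted index maps terms to documents",
    "d3": "crawler builds an inverted index",
}
INDEX = build_index(DOCS)

for doc_id, score in rank("inverted index", DOCS, INDEX):
    print(doc_id, round(score, 3))
```

Because scoring walks only the postings lists of the query terms, documents that share no term with the query are never touched, which is the main reason an inverted index makes retrieval fast.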

Meta-Crawlers

iSearch
The iSearch bot is a Perl script intended to adapt itself automatically to search engines: it performs a search operation, retrieves the results, and works out the structure of those results. This should hold even if the search engine is unknown to the iSearch script, or if the engine has changed its search method or results structure.
Rover Search Server
Rover Search Server is a search engine interface that searches many search engines as well as custom searches like sound files, images, etc. It features tools such as saving and editing search results, and returns nicely formatted search results without graphics or frames. It works well with any Web browser and can be used locally or remotely.
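
A meta-crawler has to combine the ranked lists returned by several engines into one. One simple strategy, shown here as a sketch and not as the method used by either package above, is to interleave the lists round-robin and drop duplicate URLs. The engine result lists are hypothetical.

```python
from itertools import zip_longest


def merge_results(*ranked_lists):
    """Interleave ranked URL lists round-robin, keeping the first copy of each URL."""
    seen = set()
    merged = []
    for tier in zip_longest(*ranked_lists):   # tier = the i-th hit of every engine
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Hypothetical results from two engines for the same query.
engine_a = ["http://a.example/1", "http://shared.example/", "http://a.example/2"]
engine_b = ["http://shared.example/", "http://b.example/1"]

print(merge_results(engine_a, engine_b))
```

Round-robin merging needs no relevance scores, which is convenient because scores from different engines are generally not comparable.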