
CS 236601: Information Retrieval and Digital Libraries
Software
Many software packages provide information indexing and
retrieval functionality. This is a partial list of relevant
programs that are publicly available. Most of them are
covered by the GPL or another open source license.
Please note that the software categories are often blurred
and a software package may rightfully belong to more than
one group. For example, many search engine packages
include a web crawler.
For the software used in this course, please see the
Project page.
Search Engines
- Harvest
- Harvest is a system for collecting information and making it searchable
through a Web interface. Harvest can collect information on the Internet
via HTTP, FTP, and NNTP, as well as from local files. Supported formats
include HTML, DVI, PS, fulltext, mail, man pages, news, troff, WordPerfect,
C sources, and many more. Adding support for new formats is easy due
to Harvest's modular design.
- Harvest-ng
- Harvest-NG is a set of tools for building a standards-compliant Web
crawler. It is implemented in Perl, and can provide a complete
resource discovery system, along with a modular structure which
allows others to easily customise the system to their own needs.
The development of Harvest-NG was started as an attempt to preserve
the strong features of the Harvest architecture, but to allow more
rapid development and prototyping from a cleaner, better-structured
codebase.
- LiSEn
- A Little Search Engine for niche topics.
- Autojot
- Autojot lets you index and search every Web page you view
so you can find things again later.
- ht-dig
- ht://Dig is a world-wide-web search system for an intranet
or small internet.
- ASPSeek
- ASPSeek is an Internet search engine, written in C++ using the
STL. It consists of an indexing robot, a search daemon,
and a CGI search frontend.
- Webglimpse
- Webglimpse is the spider and manager for the files to be indexed
by glimpse.
- Alkaline
- Alkaline is a full-featured standalone search and index server. The
spider is a fully remote indexing daemon which supports standards
such as robots.txt and "skip" meta tags, and allows multiple
distinct configurations and search groups (searching many different
sites from your server), including complex regexp indexing paths,
authentication, filters for various document formats, XML-based
online management and statistics, mrtg-compatible performance numbers,
and more.
- ESSE
- ESSE (Efficient Site Search Engine) is a search engine for Web sites
which is based on the PostgreSQL database engine. You can choose
extensions for searchable documents, and you can fully customize
the output of queries. It is configurable, and allows you to speed
up the search by using statistical keywords instead of searching
through the full document.
- Sintrasearch
- Sintrasearch is a robust and stable search engine, used for intranet
sites.
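The Alkaline entry above mentions support for robots.txt. As a rough
illustration of what honouring that file involves (this is not taken from
any of the packages listed), the sketch below uses Python's standard
urllib.robotparser module. The helper name `allowed`, the agent name
"ExampleBot", and the example.com URLs are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules_text, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text.

    `allowed` is a hypothetical helper; a real crawler would download the
    rules from the site's /robots.txt instead of taking them as a string.
    """
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Made-up rules: everything under /private/ is off limits to all agents.
    rules = "User-agent: *\nDisallow: /private/\n"
    for url in ("http://example.com/index.html",
                "http://example.com/private/x.html"):
        print(url, allowed(rules, "ExampleBot", url))
```

A crawler would typically run such a check before each fetch and skip any
URL for which it returns False.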
Web Crawlers
- Webbase
- webbase is an internet web crawler based on a MySQL database.
It is available as a command line program or as a library
(shared or static).
- Larbin
- Larbin is an HTTP Web crawler with an easy interface that runs
under Linux. It can fetch more than 5 million pages a day on a
standard PC (with a good network).
- wget
- Wget is a freely available network utility to retrieve files from
the World Wide Web using HTTP and FTP, the two most widely used
Internet protocols. It works non-interactively, so it can keep running
in the background after the user has logged off.
- FWget
- FWget allows you to download Web sites. It is based on wget, which
has serious problems processing huge sites. FWget uses hash tables
to avoid such performance problems.
- pagesucker
- Pagesucker downloads a web page from a remote web server. A list of
web servers, paths, and output filenames can be passed on the
command line or in a file.
- Parallel URL fetcher
- puf is a tool that can be used to download single files or mirror
entire servers. It is similar to GNU wget (and has a partly-compatible
command line), but has the ability to do many downloads in parallel.
This is very useful if you have a high-bandwidth Internet connection.
- httpr
- httpr is a Perl script to recursively download entire Web sites or
parts of Web sites. Given a URL, httpr will download that page and
all files/images (on the same site) that it links to.
- Sirobot
- Sirobot is a Perl script that downloads Web pages recursively. Its
main advantage over wget is that it can fetch pages concurrently.
It can also continue aborted downloads and convert absolute links
to relative ones. It uses curses, supports HTTPS, and has a
pattern-matching filter to prevent you from downloading the whole
Internet.
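The FWget entry above credits hash tables with keeping large crawls fast.
The sketch below shows the general idea only, and is not FWget's actual
implementation: a hash set of already-seen URLs makes the "have I visited
this?" check cheap, so each page is queued at most once. The names `crawl`
and `get_links` are hypothetical, and the link graph is an in-memory
stand-in for fetching and parsing real pages.

```python
from collections import deque


def crawl(start, get_links, limit=1000):
    """Breadth-first traversal that visits each URL at most once.

    `get_links(url)` is assumed to return the URLs linked from `url`.
    """
    seen = {start}            # hash set: constant-time membership on average
    queue = deque([start])
    order = []
    while queue and len(order) < limit:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:   # skip anything already queued or visited
                seen.add(link)
                queue.append(link)
    return order


if __name__ == "__main__":
    # Toy site with cycles; a real crawler would fetch pages over HTTP.
    graph = {"/": ["/a", "/b"], "/a": ["/", "/b"],
             "/b": ["/a", "/c"], "/c": []}
    print(crawl("/", lambda url: graph.get(url, [])))
```

Without the `seen` set, the cycles in the toy graph would make the crawl
revisit the same pages indefinitely; with a list instead of a set, each
membership check would slow down as the site grows.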
Text Indexing and Retrieval
- Mifluz
- The purpose of mifluz is to provide a C++ library to build and query
a full text inverted index. It is dynamically updatable, scalable
(up to 1Tb indexes), uses a controlled amount of memory, shares index
files and memory cache among processes or threads and compresses index
files to 50% of the raw data.
- Cheshire
- The Cheshire II project is developing a next-generation online catalog
and full-text information retrieval system using advanced IR techniques.
- glimpse
- Glimpse is a very powerful indexing and querying system that allows
you to search through all your files very quickly. It can be used
by individuals for their personal file systems as well as by
organizations for large data collections.
- ISearch
- Isearch is software for indexing and searching text documents. It
supports full text and field based search, relevance ranked results,
Boolean queries, and heterogeneous databases. Isearch can parse many
kinds of documents "out of the box," including HTML, mail folders,
list digests, SGML-style tagged data, and USMARC.
- locus
- locus lets you find words in your texts, for example newsgroup
messages, Web page mirrors, electronic books - whatever you have.
It uses word patterns (order, locality etc.) to match queries to
texts, makes reasonable choices by default yet does exactly what
you want when you specify it.
- lq-text
- lq-text is a command line-oriented text retrieval package for UNIX.
You can use lq-text to index your text or HTML documents, and then
search for words or phrases in them.
- mg
- The MG system is a suite of programs for compressing and indexing
text and images. Most of the functionality implemented in the suite
is as described in the book "Managing Gigabytes: Compressing and
Indexing Documents and Images".
- namazu
- Namazu is a full-text search system intended for easy use. Not only
does it work as a small or medium scale Web search engine, but also
as a personal search system for email or other files. Supported
document types: HTML, Mail/News, MHonArc, RFC, TeX (with detex),
man (with groff), Word (with wvWare), PDF (with pdftotext) and plain
text.
- SWISH++
- SWISH++ is a Unix-based file indexing and searching engine (typically
used to index and search files on web sites). It was based on SWISH-E,
although SWISH++ is a complete rewrite. SWISH++ is at least 10 times
faster than SWISH-E and can handle much larger numbers of files. Additionally, it
has unique features such as selective non-indexing, on-the-fly filters,
user-selectable stemming, and more.
- YASE
- YASE is a text indexing and retrieval system. It allows you to index
your document collection very easily. All words are indexed and can
be optionally stemmed. The query tool supports searching all/any
terms and can rank query results by relevance using the cosine measure.
- libbow
- Libbow is a library of C code intended for writing
statistical text-processing programs. This distribution includes
the library, as well as a text classification front-end, and a
document retrieval front-end.
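Several entries above refer to inverted indexes (Mifluz) and to ranking
by the cosine measure (YASE). The sketch below illustrates both ideas in
miniature. It is not the code of any listed package: the function names
are hypothetical, tokenization is a plain whitespace split, and term
weights are raw term frequencies with no stemming or IDF weighting.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and split on whitespace; real systems also strip
    # punctuation and may apply stemming.
    return text.lower().split()


def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def cosine_rank(query, docs, index):
    """Return (doc_id, score) pairs, best match first.

    The score is the cosine of the angle between the query and document
    term-frequency vectors. Only documents sharing at least one term
    with the query are scored.
    """
    q_vec = Counter(tokenize(query))
    q_norm = math.sqrt(sum(v * v for v in q_vec.values()))
    dots = defaultdict(float)
    for term, q_tf in q_vec.items():
        # The index lets us touch only documents containing the term.
        for doc_id, d_tf in index.get(term, {}).items():
            dots[doc_id] += q_tf * d_tf
    scores = []
    for doc_id, dot in dots.items():
        d_vec = Counter(tokenize(docs[doc_id]))
        d_norm = math.sqrt(sum(v * v for v in d_vec.values()))
        scores.append((doc_id, dot / (q_norm * d_norm)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    docs = {
        "a": "web crawler fetches web pages",
        "b": "text retrieval system",
        "c": "web search",
    }
    index = build_index(docs)
    for doc_id, score in cosine_rank("web search", docs, index):
        print(doc_id, round(score, 3))
```

Production systems usually precompute document norms at indexing time
rather than recomputing them per query, as this sketch does for brevity.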
Meta-Crawlers
- iSearch
- The iSearch bot is a Perl script designed to adapt itself
automatically to search engines: it performs a search, retrieves the
results, and works out the structure of those results. This should
hold even if the search engine is unknown to the iSearch script
or has changed its search method or results structure.
- Rover Search Server
- Rover Search Server is a search engine interface that queries many
search engines and also supports custom searches for sound files,
images, etc. It features tools such as saving and editing search
results, and returns nicely formatted search results without graphics
or frames. It works well with any Web browser and can be used locally
or remotely.