|
This is a collection of research ideas I have for my spare time.
They are typically unrelated to my day job, and progress is generally
slow and sporadic.
- lmbench
- A micro-benchmark suite
- search
- An information retrieval package
- Asynchronous File I/O
-
lmbench is an ongoing
project to develop a suite of microbenchmarks to analyze various
important aspects of system performance.
lmbench version 1 was developed by Larry McVoy and provided latency
and bandwidth measurements for many important aspects of system
performance.
lmbench version 2 introduced a new timing harness which provided the
means for lmbench to make accurate measurements in a platform
independent fashion, and it was developed jointly by Carl Staelin and
Larry McVoy. The timing harness analyzed the system clock to
determine how long experiments needed to be repeated in order to
minimize timing errors induced by clock resolution. In addition, some
new benchmarks were introduced, and at least one benchmark
(mhz) was rewritten from scratch.
lmbench version 3 introduced (yet another) new timing harness to
measure system performance under scalable load, which is
primarily relevent for multi-processor machines.
In addition, lmbench version 3 introduced a number of new benchmarks
which attempt to characterize various aspects of the hardware
performance, such as the TLB size, cache line size, and memory
subsystem parallelism.
cache attempts to determine various attributes of the memory
hierarchy, such as the number of caches, the size of each cache, the
line size for each cache, the average memory latency for cache hits,
and the available memory parallelism to each cache. It utilizes
routines from line and par_mem to determine the cache line size and
memory parallelism.
The biggest new challenge presented by cache is to determine
the number and sizes of the various caches in the memory hierarchy.
The obvious approach is to measure memory latency for a range of
memory sizes. In an ideal world the graph would look like a series of
plateaus separated by sharp cliffs. The cliff edges would indicate
cache boundaries. Unfortunately, this is not an ideal world, so there
are a number of hardware/OS combinations which produce memory graphs
with gently sloping regions instead of sharp cliffs. This makes it
very hard to determine the cache edge because in principle it could
reside anywhere in the sloping region.
There are apparently a number of factors which can cause these sloping
regions. One primary cause is the fact that larger caches are often
direct mapped, or have a low degree of associativity, which makes them
prone to page collisions. Some operating systems are sensitive to
these issues and use
page coloring algorithms
to reduce cache conflicts. However, if page coloring is not used,
then the average memory latency for dereferencing a pointer chain
which should fit into cache but has conflicting pages can be
significantly larger than a cache hit since it will include all the
cache misses due to conflicting pages.
To reduce the impact of this problem I am investigating the use of
genetic algorithms to select pages from a larger pool of pages with
the aim of choosing a page set which minimizes cache collisions.
Initial results indicate that this is a promising approach, and it
dramatically improves cache's ability to precisely locate
cache boundaries. However, there are still some hardware/OS
combinations with sloped memory latency graphs, indicating that there
might be additional complications.
Initially developed as a simple demonstration of information retrieval
techniques for
CS236601
based on the description of the Google
data structures in
The anatomy of a large-scale hypertextual web search engine,
I am slowly in the process of turning it into a more complete, high
performance indexing and retrieval engine.
The initial implementation was able to index data very rapidly
and as far as possible used the data structures from Brin without
modification.
It had a number of interesting features that improved the indexing
performance. First of all, it used multiple threads to read data
from disk to make better use of technologies such as multi-disk
storage devices and disk scheduling. Secondly, it accumulated
the forward index in a large memory buffer (128MB on a 512MB machine),
and when the buffer was filled, it used qsort to
create the fragmentary inverted index, which was then written
to disk. The last stage was sequentially streaming the various
inverted index fragments and merging them into a single
inverted index. The output buffers were double-buffered
so the system could continue streaming input data while sorting
and flushing the (full) output buffer.
There are a few areas currently being developed:
- Incremental updates
- Add the ability to rapidly update the database by adding, removing,
and updating documents
- Multiple inverted indexes
- It is not necessary, or even cost effective to continually
modify a monolithic inverted index, particularly for very
dynamic collections such as one's email. Instead it may make
sense to have a few inverted indexes, creating new ones with
new or updated content, and occassionally merge or update
them using strategies similar to memory garbage collection.
- Parsing
- The demonstration code used a very simple, hand coded parser
which is unspeakably primitive. A more general and flexible
solution is desperately required.
- Merging with Python
- One solution is to create a very simple token-based interface
which allows a few simple operations, and to push all the
parsing, stemming, and other word manipulation operations
into the high-level scripting language, such as python.
- Threading
- In order for the system to become "real", it must be made
more robust, in particular it should be able to have multiple
applications reading and writing the database in parallel
(safely).
Modify one or more applications (e.g., glimpseindex, find, squid)
to utilize the asynchronous I/O and measure application performance
improvements.
Last modified: $Date: 2001/11/25 07:33:12 $
|