[Gc] mark phase and prefetch

Aliaksey Kandratsenka alk at tut.by
Sun Sep 13 00:31:02 PDT 2009


Hi.

I'd like to present the results of my attempt to do prefetching during
the mark phase more smartly. My main motivation is that the current
prefetch approach in libgc looks suboptimal: it prefetches addresses
before putting them on the mark stack, but due to the nature of stack
discipline some addresses get processed too soon after being prefetched
and some too late.

For ideal prefetching we would need to know the N addresses we will
handle next, which is impossible with depth-first traversal order. With
breadth-first traversal it's trivial, but breadth-first traversal needs
far more memory than depth-first traversal.

My idea is to try something intermediate. We use a stack as usual, but
after fetching an address from the stack we start a prefetch and put the
address into a small fixed-size queue, which delays its processing while
the hardware handles the prefetch. We do the main work on items taken
from the head of this queue. I think it can be proved that this
traversal order needs at most N times more memory than depth-first
traversal, where N is the size of the queue (the prefetch distance). On
the other hand, it provides an equal prefetch distance for every
traversed node.
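A minimal sketch of the idea in C, for concreteness. This is not the
patch itself: the node struct, QUEUE_SIZE (the prefetch distance N) and
the toy mark-stack bound are made up for illustration; the real marker
operates on arbitrary heap blocks, not a fixed struct:

```c
#include <stddef.h>

/* Hypothetical node type; stands in for an object on the GC heap. */
struct node {
    struct node *left, *right;
    int marked;
};

#define QUEUE_SIZE 8          /* prefetch distance N (assumed value)  */
#define STACK_MAX  (1 << 16)  /* toy mark-stack bound, not the real one */

static struct node *stack[STACK_MAX];
static size_t sp;

static void push(struct node *n) { if (n) stack[sp++] = n; }

/* Pipelined DFS: addresses popped from the mark stack are prefetched
 * and parked in a small FIFO; the marking work happens on the queue
 * head, by which time the prefetch has (hopefully) completed.
 * Returns the number of nodes marked. */
static size_t pipelined_mark(struct node *root)
{
    struct node *queue[QUEUE_SIZE];
    size_t head = 0, count = 0, visited = 0;

    push(root);
    for (;;) {
        /* Refill: move entries from stack to queue, issuing prefetches. */
        while (count < QUEUE_SIZE && sp > 0) {
            struct node *n = stack[--sp];
            __builtin_prefetch(n, 0 /* read */, 3 /* high locality */);
            queue[(head + count) % QUEUE_SIZE] = n;
            count++;
        }
        if (count == 0)
            break;  /* stack and queue both empty: traversal done */

        /* Process the oldest queued node; its cache line should be in. */
        struct node *n = queue[head];
        head = (head + 1) % QUEUE_SIZE;
        count--;
        if (!n->marked) {
            n->marked = 1;
            visited++;
            push(n->left);
            push(n->right);
        }
    }
    return visited;
}
```

The point of the ring buffer is that every node sits in the queue for
roughly QUEUE_SIZE pops before it is touched, so each one gets the same
prefetch distance; the extra mark-stack growth comes from the up-to-N
in-flight nodes whose children haven't been pushed yet.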

I've made a small test program that compares DFS, BFS (with prefetch)
and this 'pipelined DFS' on a quite large tree. The tree is ideally
balanced, but its nodes are randomly shuffled in memory. The results are
very encouraging, with pipelined DFS being almost 2 times faster than
DFS and even faster than BFS (I think due to less memory write-back
traffic). But note that real-world object graphs are usually not random,
and my test program does almost nothing aside from touching memory. So
the speedup for GC (with more work between memory accesses and
non-random graphs) cannot be that large. I'm attaching the source of
this program.

Recently I tried to apply this prefetch strategy to bdwgc. I'm attaching
the dirty and very much work-in-progress patch against 6.8. It's not
production quality and makes more changes than necessary. In particular
I've disabled 'credit' handling, to avoid its potential ill effects on
the pipelined nature of the new code (I haven't checked whether there
are any).

I benchmarked this using GCBench.c (modified only for larger heap usage
and thus longer running time). On this benchmark my code outperforms
both the stock prefetch strategy and the code with prefetch removed.
The numbers:
a) patch - 8.89s
b) no-prefetch - 9.34s
c) stock - 9.75s
Note that 6.8 doesn't seem to have proper support for AMD64, which I
use, because the lion's share of the time is spent in mutex_trylock
called from GC_malloc. Without this, the speedup should be larger.

Now I'm interested in trying this on more real-world object graphs. In
particular, it's not clear whether the increased mark-stack consumption
of my code is an issue. What do you guys think?

-- 
Aliaksey Kandratsenka <alk at tut.by>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tree-traversal.c
Type: text/x-csrc
Size: 3865 bytes
Desc: not available
Url : http://napali.hpl.hp.com/pipermail/gc/attachments/20090913/c2d2011d/tree-traversal.c
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-pipelined-marking.patch
Type: text/x-patch
Size: 6583 bytes
Desc: not available
Url : http://napali.hpl.hp.com/pipermail/gc/attachments/20090913/c2d2011d/0001-pipelined-marking.bin
