[Gc] mark phase and prefetch
hans.boehm at hp.com
Sun Sep 13 11:01:41 PDT 2009
I highly recommend looking at the ASPLOS 2004 paper by Cher, Hosking, and Vijaykumar (https://www.dacapo-group.org/papers/gc.pdf). They explore this in some detail, though unfortunately their work never made it into our GC distribution.
> -----Original Message-----
> From: gc-bounces at napali.hpl.hp.com
> [mailto:gc-bounces at napali.hpl.hp.com] On Behalf Of Aliaksey
> Sent: Sunday, September 13, 2009 12:31 AM
> To: gc at napali.hpl.hp.com
> Subject: [Gc] mark phase and prefetch
> I'd like to present the results of my attempt to do prefetching
> during the mark phase more intelligently. My main motivation is
> that the current prefetch approach in libgc looks suboptimal: it
> prefetches addresses before pushing them onto the mark stack, but
> due to the nature of stack discipline some addresses are processed
> too soon after being prefetched and some too late.
> For ideal prefetching we would need to know the N addresses we
> will handle next, which is impossible with depth-first traversal
> order. With breadth-first traversal it's trivial, but
> breadth-first traversal needs far more memory than depth-first
> traversal.
> My idea is to try something intermediate. We use a stack as
> usual, but after popping an address from the stack we issue a
> prefetch for it and put it into a small fixed-size queue, which
> delays its processing while the hardware handles the prefetch.
> We do the main work on items taken from the head of this queue.
> I think it can be proved that this traversal order needs at most
> N times more memory than depth-first traversal, where N is the
> size of the queue (the prefetch distance). On the other hand, it
> provides an equal prefetch distance for every traversed node.
> I've made a small test program that compares DFS, BFS (with
> prefetch) and this 'pipelined DFS' on a fairly large tree. The
> tree is perfectly balanced, but its nodes are randomly shuffled
> in memory. The results are very encouraging, with pipelined DFS
> being almost 2 times faster than DFS and even faster than BFS
> (I think due to less memory write-back traffic). But note that
> real-world object graphs are usually not random, plus my test
> program does almost nothing aside from touching memory. So the
> speedup for the GC (with more work between memory accesses and
> non-random graphs) cannot be that large. I'm attaching the
> source of this program.
> Recently I tried to apply this prefetch strategy to bdwgc.
> I'm attaching a dirty and very much work-in-progress patch
> against 6.8. It's not production quality and makes more changes
> than necessary. In particular, I've disabled 'credit' handling
> to avoid its potential ill effects on the pipelined nature of
> the new code (I haven't checked whether there are any).
> I benchmarked this using GCBench.c (modified only for larger
> heap usage and thus a longer running time). On this benchmark my
> code outperforms both the stock prefetch strategy and the code
> with prefetch removed. The numbers:
> a) patch - 8.89s
> b) no-prefetch - 9.34s
> c) stock - 9.75s
> Note that 6.8 doesn't seem to have proper support for AMD64,
> which I use, because the lion's share of the time is spent in
> mutex_trylock called from GC_malloc. Without this, the speedup
> should be larger.
> Now I'm interested in trying this on more realistic object
> graphs. In particular, it's not clear whether the increased mark
> stack consumption of my code is an issue. What do you guys think?
> Aliaksey Kandratsenka <alk at tut.by>