[Gc] mark phase and prefetch

Boehm, Hans hans.boehm at hp.com
Sun Sep 13 11:01:41 PDT 2009

I highly recommend looking at the ASPLOS 2004 paper by Cher, Hosking, and Vijaykumar (https://www.dacapo-group.org/papers/gc.pdf)  They explore this in some detail, though undortunate;ly their work never made it into our GC distribution.


> -----Original Message-----
> From: gc-bounces at napali.hpl.hp.com 
> [mailto:gc-bounces at napali.hpl.hp.com] On Behalf Of Aliaksey 
> Kandratsenka
> Sent: Sunday, September 13, 2009 12:31 AM
> To: gc at napali.hpl.hp.com
> Subject: [Gc] mark phase and prefetch
> Hi.
> I'd like to present results of my attempt on doing prefetch 
> during mark phase more smartly. My main motivation is that 
> current prefetch approach in libgc looks suboptimal. It does 
> prefetch on addresses before putting them on mark stack, but 
> due to nature of stack discipline some addresses will be 
> handled too early after prefetching them and some too late.
> For ideal prefetch we need to know N addresses we will be 
> handling next, which is impossible with depth-first order of 
> traversal. With breath-first traversal it's trivial, but 
> breath-first traversal needs too much memory compared to 
> depth-first traversal.
> My idea is to try something intermediate. We use stack as 
> usual, but after fetching address from stack we start 
> prefetch and put it into small fixed size queue which delays 
> it's processing while hardware is handling prefetch. We do 
> main work on items from head of this queue.
> I think it can be proved that this traversal order will need 
> at most N times more memory than depth-first traversal, where 
> N is size of the queue (prefetch distance). On the other hand 
> it provides equal prefetch distance for each traversed node.
> I've made small test program that compares DFS, BFS (with 
> prefetch) and this 'pipelined DFS' on quite large tree. The 
> tree is ideally balanced, but the nodes of this tree are 
> randomly shuffled in memory. The results are very encouraging 
> with pipelined DFS being almost 2 times faster than DFS and 
> even faster then BFS (I think, due to less memory write-back 
> traffic). But note that real-world object graphs are usually 
> not random, plus my test program does almost nothing aside 
> from touching memory. So speedup for GC (with more work 
> between memory accesses and not random
> graphs) cannot be that large. I'm attaching the source of 
> this program.
> Recently I tried to apply this prefetch strategy to bdwgc. 
> I'm attaching the dirty & very much work-in-progress patch 
> against 6.8. It's not production quality and does more 
> changes than necessary. In particular I've disabled 'credit' 
> handling, to avoid it's potential ill-effects on pipelined 
> nature of new code (I haven't checked if there are any).
> I was benchmarking this using GCBench.c (modified only for 
> larger heap usage and thus larger running time). On this 
> benchmark my code outperforms both stock prefetch strategy 
> and the code with removed prefetch. With numbers being:
> a) patch - 8.89s
> b) no-prefetch - 9.34s
> c) stock - 9.75s
> Note that it seems that 6.8 doesn't have proper support for 
> AMD64 which I use, 'cause lions share of time is spent in 
> mutex_trylock that's called from GC_malloc. Without this, the 
> speedup should be larger.
> Now I'm interested in trying this on more real-world object 
> graphs. In particular it's not clear if increased mark stack 
> consumption of my code is an issue. What do you guys think ?
> --
> Aliaksey Kandratsenka <alk at tut.by>

More information about the Gc mailing list