[Gc] Generational collection performance disaster
hans.boehm at hp.com
Thu Apr 7 11:04:03 PDT 2005
Answers/comments are interspersed with the questions.
> -----Original Message-----
> From: Ralph Becket [mailto:rafe at ceres.cs.mu.oz.au]
> Hans, Ben, many thanks for the swift replies.
> We ran the benchmark under Debian Linux with kernel version
> 2.4.16. The boehm_gc version we're using is 6.3.2.
That makes it more puzzling.
> Boehm, Hans, Wednesday, 6 April 2005:
> > It might be worth checking where the protection faults are being
> > generated, i.e. which updates are causing problems.
> Ah, can you offer a quick pointer as to how I can do this?
I would set a breakpoint in GC_write_fault_handler. With a correctly
working gdb, the backtrace should tell you what you want. With a badly
working gdb, you may have to explicitly look at the signal context.
> > - It's also probably worthwhile to look at some GC logs both ways.
> How do I enable GC logging? Is it just a case of removing
> SILENT from the Makefile and recompiling?
That works. Setting the GC_PRINT_STATS environment variable may be
enough, though. Depending on the outcome, a profile of the application
may also be interesting.
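No rebuild is needed for the environment-variable route; for instance (the program name below is just a placeholder):

```shell
# Enable GC statistics for a single run; libgc reads GC_PRINT_STATS
# at startup.  "./your-program" is a placeholder for your binary.
GC_PRINT_STATS=1 ./your-program
```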
> > - The current incremental GC depends on signal delivery
> > among other things.  I assume this was on Linux?  Its signal
> > delivery performance is normally pretty good.
> Yes, we're using Linux 2.4.16, although a fair number of our
> users run under Windows.
I think the story under Windows is actually a bit worse. The overhead
of the current implementation is probably appreciably higher.
Ben Hutchings has contributed code to use GetWriteWatch instead.
But it's only in the 7.0alpha tree.
> > - Stubborn allocation support was changed completely in 7.0.  This is
> > implemented a bit more completely in my 7.0 tree than it is in
> > 7.0alpha1, but I'm sure it needs work.  Instead of stubborn
> > allocation, there is support for explicit write barrier calls.
> > Objects directly reachable from the stack are viewed as dirty, so you
> > shouldn't need those calls for a functional language.  If someone
> > wants to make this work for real, I can make available a newer
> > "experimental" version.
> We could be interested in this if we can identify the source
> of the current performance anomaly and have a reasonable
> expectation of some gain (a few per cent would be enough).
I'll try to get my current snapshot out before I leave for Norway.
> > - Multiple generations were tried a long time ago.  At the time, the
> > added overhead of keeping track of additional per-object bits seemed
> > to outweigh the benefit of avoiding premature promotions.  Given that
> > it increasingly makes sense to reserve a whole byte for each mark bit
> > for concurrency reasons, there may be more room for clever
> > implementations in this area now.
Actually, I should correct this. I think we tried delayed
promotion strategies, which don't promote just-allocated
objects. But those address the same issue of needing a full
GC to reclaim objects allocated just before a minor collection.
> I've wondered whether it might be worth managing generations
> on a per-page basis rather than a per-object basis.
Is this different from a combination of:
(1) restricting allocation to a limited number of "young" pages
in the current scheme, and
(2) avoiding immediate promotion after surviving only a single GC?
I'm not sure how to really do (1) in a way that keeps "young"
pages emptier than average. There seems to be an inherent
tension between keeping occupancy low, and reusing pages,
at least if you can't copy. And if you fail to keep occupancy low,
you still end up with a lot of "dirty" objects that you need to scan,
though perhaps not in your case, with the trick of viewing directly
referenced objects as dirty. (Currently I think I dirty the
whole page for directly referenced objects. But that could
perhaps be avoided.)
In the past, my experience with trying to keep the young generation
in the cache has been fairly negative. But that was with cache
sizes of usually less than a MB. (Root sets tend to be larger
than that. My impression is that even for copying collectors in
JVMs, people usually give up on the idea pretty quickly for real
applications.) As cache sizes approaching 10MB become more
common, this is probably worth revisiting.
> Assuming roughly that
> - most new objects die young (but do take *some* time to die)
> - recently allocated objects will be in the cache
> - it is better to reuse addresses in the cache where possible
> - minor collections only clear the mark bits from younger
> pages prior to tracing
> - tracing will only touch the relatively small amount of live data in
> the younger generations
> - if garbage cells are reused first from the younger generations then
> they are more likely to still be in the cache
> - as pages age they will tend to fill up with longer lived objects and
> can therefore be promoted.
> One way to optimise things is to preserve a copy of the mark
> bits on a page before clearing them prior to a minor
> collection. On the next minor collection, restore the old
> mark bits to avoid retracing older objects. This way
> collection can be interleaved with allocation and the page
> promotion strategy experimented with fairly easily.
> I readily admit that I'm not a GC expert and that the above
> may be very naive or be exactly what was tried before.
> Either way, I'd be very interested in hearing your opinion.
> -- Ralph