[Gc] RE: Abuse of collector...

Talbot, George Gtalbot at locuspharma.com
Mon May 11 10:27:32 PDT 2009

Hi Ivan,

Thanks for the hints.  I wasn't building the collector with the options you suggest.  I have now corrected this.  My story is a little more complicated, and I'd like to give some details and ask advice again.

I'm running the collector and my program on a 2-CPU dual core 2.4GHz AMD Opteron box (4 cores total) and I have 4GB of physical RAM with 2GB each installed next to each CPU.  If I recall correctly, the motherboard supports AMD's idea of NUMA such that memory accesses on RAM installed next to a CPU are faster than accesses to memory installed next to the "other" CPU.  I'm running a slightly older version of 64-bit Ubuntu (7.10) and compiling both the collector and my program with GCC 4.1.3.

Here's what one of my processors looks like from /proc/cpuinfo:

processor       : 3
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 2216
stepping        : 2
cpu MHz         : 2400.000
cache size      : 1024 KB
physical id     : 1
siblings        : 2
core id         : 1
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips        : 4800.45
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

I've built the collector both with and without debugging from CVS (which I pulled on Friday).  Here's my current configure command line:

CFLAGS="-O2" ./configure --enable-threads=posix --enable-thread-local-alloc --enable-parallel-mark --enable-cplusplus --enable-large-config --enable-munmap --enable-gc-debug

I don't have a copy of the GC_PRINT_STATS=1 output for a collection at hand right now, but from memory: with ~400MB of pointer-containing data and ~300MB of pointer-free reachable bytes in a 1.5GB heap, my collection times were ~2200ms for a non-debug build of the collector and ~2500ms for an optimized build with the --enable-gc-debug configure option.  Unfortunately, I can't build my program with GC_DEBUG defined and run it in the current scenario, as it becomes too slow and takes up too much memory to be useful right now.

I attempted setting GC_ENABLE_INCREMENTAL=1 in both non-debug and debug builds of the collector, and while it didn't cause any problems, it didn't appear to change collection times and my program still underwent full world-stop collection each time.

My program is heavily multithreaded (~90 threads are running during the initial startup).

In my previous email I indicated that all of my startup time was taken up in the collector.  This has been fixed; it was actually a problem with my program, not with the collector.  The above-mentioned 2s world-stop collections happen on the order of every 5 minutes or so, not every second or so, though during startup, as the large data structure is built, collections are more frequent (every 5-10s or so over a period of a few minutes).

On the whole, I'm pretty satisfied with this, and am very happy that changes to my data structure and program enabled by the collector brought my overall address space and resident set sizes down by two thirds!

I do have apprehension about collection times, however.  Right now I can tolerate a 2200ms stop in the collector, but I do notice it, and I do expect my data set size to rise over time.  Is there something I'm missing?  Does it make sense to try to disable parallel marking and try again enabling incremental collection?  Will this significantly affect the performance of normal allocation?  Are there any other suggestions for bringing down the collection times?

I will try (probably in a day or two when my cluster run is complete) to rebuild disabling parallel marking and enabling incremental collection to see if that helps.  I don't know, however, that I would be able to turn off interior pointer checks.  I can try that.

Thanks for all of your suggestions.

George T. Talbot
<gtalbot at locuspharma.com>

P.S.  Enabling munmap() was very helpful when looking at output of "top" and "ps".

From: gc-bounces at napali.hpl.hp.com [gc-bounces at napali.hpl.hp.com] On Behalf Of Ivan Maidanski [ivmai at mail.ru]
Sent: Thursday, May 07, 2009 3:55 AM
To: gc at napali.hpl.hp.com
Cc: Talbot, George
Subject: Re: [Gc] RE: Abuse of collector...


"Boehm, Hans" <hans.boehm at hp.com> wrote:
> > -----Original Message-----
> > From: gc-bounces at napali.hpl.hp.com
> > [mailto:gc-bounces at napali.hpl.hp.com] On Behalf Of Talbot, George
> > Sent: Wednesday, May 06, 2009 12:59 PM
> > To: gc at linux.hpl.hp.com
> > Subject: [Gc] Abuse of collector...
> >
> > Hi all,
> >
> > I've integrated the collector into my program and it works
> > ...
> > Questions:
> >
> > 1)  Does the time spent sound sane with experience that
> > others have had?
> If this is a modern X86 box or the like, it sounds too high to me.  I would have expected under a second.  What's the OS?  Can you profile the executable?  If not, random interruptions in a debugger might give you an idea.  If the time is not being spent in GC_mark_from(), something fishy is going on.  Looking at the log output might also be informative.  You don't have GC assertions enabled, right?

And you build the collector with -O2 -mtune=native -DNO_DEBUGGING -DLARGE_CONFIG, don't you?

My approx. average timings for a parser app that operates on tree-like data (500MiB heap, 2x2.4GHz CPU):
- 260 MiB of ptr-containing data per 1280 ms (single threaded collector),
- 280 MiB of ptr-containing data per 70 ms !!! (single threaded, GC_ENABLE_INCREMENTAL),
- 260 MiB of ptr-containing data per 800 ms (2 marker threads),
- 270 MiB of ptr-containing data per 240 ms (2 marker threads, GC_ENABLE_INCREMENTAL).
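
For a rough comparison, these timings can be turned into effective marking rates. The following is only a sketch in C: the function names are made up for illustration, and the incremental figures presumably cover just the work done in one pause rather than a full trace, so they are not directly comparable to the non-incremental ones. George's figure from his reply (~400MB in ~2200ms) is included as if MB meant MiB, which is an assumption.

```c
#include <assert.h>
#include <stdio.h>

/* Effective marking rate in MiB/s for one reported collection pause. */
double mib_per_second(double mib_marked, double pause_ms)
{
    assert(pause_ms > 0.0);
    return mib_marked * 1000.0 / pause_ms;
}

/* Print one line of the comparison. */
void report(const char *label, double mib_marked, double pause_ms)
{
    printf("%-40s %8.1f MiB/s\n", label,
           mib_per_second(mib_marked, pause_ms));
}

/* Hypothetical driver: feeds in the figures quoted in this thread. */
void print_thread_figures(void)
{
    report("Ivan: single-threaded", 260.0, 1280.0);
    report("Ivan: single-threaded, incremental", 280.0, 70.0);
    report("Ivan: 2 markers", 260.0, 800.0);
    report("Ivan: 2 markers, incremental", 270.0, 240.0);
    report("George: parallel mark, 4 cores", 400.0, 2200.0);
}
```

By this measure George's full collections run at roughly 180 MiB/s, a little below Ivan's single-threaded non-incremental case (about 203 MiB/s), which is consistent with Hans's remark that the time sounds too high.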

> > 2)  Is there a way to spend less time in the allocator during
> > the initial startup?
> Explicitly calling GC_expand_hp() with the approximate final heap size should help.
> > 3)  Am I reasonable to believe that in the parallel
> > collector, generational features will save me from super-long
> > collections if my data structure is relatively constant after
> > the startup?  (i.e. no more than say 5% changes every couple
> > of hours or so.)
> Incremental collection currently doesn't combine well with parallel collection.  And incremental collection is somewhat tricky to use anyway, depending on the platform.  It's not on by default.
> Turning on only generational collection with parallel GC may be OK.  To do that, set GC_time_limit to GC_TIME_UNLIMITED, and then call GC_enable_incremental().

Simply call GC_enable_incremental() after (or instead of) GC_INIT() (GC_time_limit is set as needed by the collector itself), or set the GC_ENABLE_INCREMENTAL environment variable. Either way, check whether it shortens or lengthens the GC delays (on average) for your app. With GC_ENABLE_INCREMENTAL (but without parallel marking), the collections are typically much shorter (see above) but more frequent (and the total collection pause time is bigger).

> I also really need to get out a new version; there is unfortunately some chance you are running into an old problem.

But, for now, it may be a good idea to try your app with the CVS snapshot (at least, it isn't more unstable than any of gc v7).

More tips:
- set GC_all_interior_pointers=0 before GC_INIT() (or build the collector without -DALL_INTERIOR_POINTERS) if you always preserve pointers to the beginning of live objects.
- play with GC_free_space_divisor ("GC_FREE_SPACE_DIVISOR" env var) value.
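
To get a feel for what the divisor changes: as I understand the comment in gc.h, the collector tries to let the program allocate at least N/GC_free_space_divisor bytes between collections, where N is twice the traced bytes plus the untraced (pointer-free) bytes plus a rough estimate of the root set size. The sketch below applies that formula in C; the function name is made up, and the root set term is left out as a simplifying assumption.

```c
#include <assert.h>

/* Approximate amount (in MB) the collector tries to let the program
   allocate between collections: N / divisor, with
   N = 2 * traced + untraced. The root set estimate that the real
   heuristic also adds to N is ignored here (assumption). */
double mb_between_collections(double traced_mb, double untraced_mb,
                              unsigned divisor)
{
    assert(divisor > 0);
    return (2.0 * traced_mb + untraced_mb) / divisor;
}
```

With the sizes George reports (~400MB pointer-containing, ~300MB pointer-free), this gives roughly 367MB between collections for a divisor of 3 and 550MB for a divisor of 2. A smaller divisor should therefore mean fewer collections at the cost of a larger heap; it does not by itself make an individual collection shorter.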

> ...
