[Gc] RE: Abuse of collector...
Gtalbot at locuspharma.com
Tue May 12 07:06:31 PDT 2009
From: Ivan Maidanski [ivmai at mail.ru]
Sent: Tuesday, May 12, 2009 2:48 AM
> I don't know how NUMA affects gc speed. Hans, may be, knows more...
> Your task is the only heavy-weight one running on the box at the same time, right?
Yes. That's correct.
> > I've built the collector both with and without debugging from CVS (which I pulled on Friday). Here's my current configure command line:
> > CFLAGS="-O2" ./configure --enable-threads=posix --enable-thread-local-alloc --enable-parallel-mark --enable-cplusplus --enable-large-config --enable-munmap --enable-gc-debug
> You'd better show me what are the args passed to "gcc" (not configure).
gcc -DPACKAGE_NAME=\"gc\" -DPACKAGE_TARNAME=\"gc\" -DPACKAGE_VERSION=\"7.2alpha1\" "-DPACKAGE_STRING=\"gc 7.2alpha1\"" -DPACKAGE_BUGREPORT=\"Hans.Boehm at hp.com\" -DGC_VERSION_MAJOR=7 -DGC_VERSION_MINOR=2 -DGC_ALPHA_VERSION=1 -DPACKAGE=\"gc\" -DVERSION=\"7.2alpha1\" -DGC_LINUX_THREADS=1 -D_REENTRANT=1 -DPARALLEL_MARK=1 -DTHREAD_LOCAL_ALLOC=1 -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DNO_EXECUTE_PERMISSION=1 -DALL_INTERIOR_POINTERS=1 -DGC_GCJ_SUPPORT=1 -DKEEP_BACK_PTRS=1 -DDBG_HDRS_ALL=1 -DMAKE_BACK_GRAPH=1 -DSAVE_CALL_COUNT=8 -DJAVA_FINALIZATION=1 -DATOMIC_UNCOLLECTABLE=1 -DLARGE_CONFIG=1 -DUSE_MMAP=1 -DUSE_MUNMAP=1 -DMUNMAP_THRESHOLD=6 -I./include -fexceptions -I libatomic_ops/src -O2 -MT reclaim.lo -MD -MP -MF .deps/reclaim.Tpo -c reclaim.c -fPIC -DPIC -o .libs/reclaim.o
> I'm also using -fno-strict-aliasing along with -O for safety (but I can't say whether the world is safer with it or not - gcc produces some warnings without it).
That seems worth doing; I'll try it next. Is -O2 the "correct" optimization level, or is -O3 OK too?
> Unless You are debugging your app, I see no reason to have --enable-gc-debug or use GC_DEBUG.
I've been doing some debugging, as adding the collector let me make some significant code changes: I removed the use of boost::shared_ptr<> from my program entirely. With any change that big, you're going to introduce bugs. For example, I was relying on shared_ptr's default construction to set a pointer to null:
Let's say, for convenience, I've done this:

    class Directory : public FileOrDir {
    public:
        typedef boost::shared_ptr<const FileOrDir> const_child_t;
        ...
    };

You will end up relying on this:

    Directory::const_child_t x; // x is null after construction.
So when you change that typedef to "const FileOrDir *", you unfortunately have to go back through your program and look everywhere for uninitialized variables. GCC helps some, but doesn't catch everything.
For right now, I'm leaving debugging in, so that when things crash or I hit a problem I can call GC_dump() from the debugger. It's bumping the collection time by about 10%. I'm _not_ defining GC_DEBUG, as that takes too much time and RAM at startup for my program to be usable in the full configuration. For some debugging, I'll define GC_DEBUG and only bring up a couple of connecting machines.
> Try GC_ENABLE_INCREMENTAL with tests/test.c first - if it works for Your platform, then the number
> of collections (printed at the end) should be smaller (by approx. 1/4) in incremental mode. Also try
> to measure the average pause time in different collector modes (e.g., with/without PARALLEL_MARK).
I'll give that a shot.
> > My program is heavily multithreaded (~90 threads are running during the initial startup).
> And all those threads operate on garbage-collected memory (or is the app too complex to analyse that), right?
Yes. I've entirely converted over to garbage-collected memory at this point.
> What's the total size of all threads' stacks?
256K/thread + 2MB for the initial thread = 96 * 256k + 2MB = 26MB
128K/thread wasn't enough once the collector started running, given the stack depths my program reaches during recursive traversal and modification of my data structure.
> If You think that you could miss an external event (or not respond to it in a reasonable time)
> while the world is stopped then consider using stop_func (set the default one with GC_set_stop_func()).
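For anyone else reading, a minimal sketch of what that might look like (the urgent_event_pending flag and how it gets set are hypothetical; GC_set_stop_func and the GC_stop_func callback type are the collector's API, and this would need linking against the collector to build):

```cpp
#include <gc.h>

// Hypothetical flag, set from an event handler when something urgent arrives.
static volatile int urgent_event_pending = 0;

// Returning nonzero tells the collector to abandon the current collection
// attempt so the world can be restarted promptly.
static int GC_CALLBACK abandon_if_urgent(void) {
    return urgent_event_pending;
}

int main(void) {
    GC_INIT();
    GC_set_stop_func(abandon_if_urgent);
    /* ... run the application ... */
    return 0;
}
```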
I'll look into that. Thanks.
> > On the whole, I'm pretty satisfied with this, and am very happy that changes to my data
> > structure and program enabled by the collector brought my overall address space and
> > resident set sizes down by two thirds!
> > ...
> >Are there any other suggestions for bringing down the collection times?
> Try to maximize the use of GC_malloc_atomic[_...]() instead of GC_malloc[_...]().
I'm doing that at this point. Do the _ignore_off_page versions also help, where it's possible to use them?
> Call GC_no_dls(1) before GC_INIT() if it is possible for your app.
Do I need to link the collector statically to do that?
> Play also with GC_full_freq ("GC_FULL_FREQUENCY" env var) value (only if GC_ENABLE_INCREMENTAL works).
Interesting. Will try.
> > I will try (probably in a day or two when my cluster run is complete) to rebuild disabling
> > parallel marking and enabling incremental collection to see if that helps. I don't know,
> > however, that I would be able to turn off interior pointer checks.
(as an aside, if and when GCC gets plugins, wouldn't that be useful for generating the type information needed to avoid interior pointer checks...)
> You can disable parallel marking without recompilation - just set env GC_MARKERS=1.
Ah. Very helpful.
Thanks for all of your help with this.
George T. Talbot
<gtalbot at locuspharma.com>