Re: [Gc] RE: Abuse of collector...
ivmai at mail.ru
Mon May 11 23:48:24 PDT 2009
"Talbot, George" <Gtalbot at locuspharma.com> wrote:
> Hi Ivan,
> Thanks for the hints. I wasn't building the collector with the options you suggest. I have now corrected this. My story is a little more complicated, and I'd like to give some details and ask advice again.
> I'm running the collector and my program on a 2-CPU dual core 2.4GHz AMD Opteron box (4 cores total) and I have 4GB of physical RAM with 2GB each installed next to each CPU. If I recall correctly, the motherboard supports AMD's idea of NUMA such that memory accesses on RAM installed next to a CPU are faster than accesses to memory installed next to the "other" CPU. I'm running a slightly older version of 64-bit Ubuntu (7.10) and compiling both the collector and my program with GCC 4.1.3.
I don't know how NUMA affects GC speed. Hans may know more...
Is your task the only heavyweight one running on the box at that time?
> Here's what one of my processors looks like from /proc/cpuinfo:
I'm disregarding all of the supplied info except: 4 cores, 2.4 GHz, 4 GiB RAM, Linux, GCC 4, and the task being compiled for amd64.
> I've built the collector both with and without debugging from CVS (which I pulled on Friday). Here's my current configure command line:
> CFLAGS="-O2" ./configure --enable-threads=posix --enable-thread-local-alloc --enable-parallel-mark --enable-cplusplus --enable-large-config --enable-munmap --enable-gc-debug
It would be more useful to see the arguments actually passed to "gcc" (not to configure).
I also use -fno-strict-aliasing along with -O for safety (though I can't say whether it really makes things safer; gcc produces some warnings without it).
> I unfortunately right this second don't have a copy of the output with GC_PRINT_STATS=1 for a collection turned on, but from memory, with ~400MB of pointer-containing data, and ~300MB of pointer-free reachable bytes, in a 1.5GB heap, my collection times were ~2200ms for a non-debug build of the collector and ~2500ms for an optimized build with the --enable-gc-debug configure option. I am unfortunately unable to build my program with GC_DEBUG defined and run it in the current scenario, as it becomes too slow and takes up too much memory to be useful right now.
Unless you are debugging your app, I see no reason to use --enable-gc-debug or GC_DEBUG.
> I attempted setting GC_ENABLE_INCREMENTAL=1 in both non-debug and debug builds of the collector, and while it didn't cause any problems, it didn't appear to change collection times and my program still underwent full world-stop collection each time.
Try GC_ENABLE_INCREMENTAL with tests/test.c first: if it works on your platform, the number of collections (printed at the end) should be smaller (by roughly 1/4) in incremental mode. Also try measuring the average pause time in the different collector modes (e.g., with and without PARALLEL_MARK).
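A minimal sketch of one way to take such a measurement, assuming a POSIX system with clock_gettime(). The helper names (elapsed_ms, average_ms, sample_workload) are hypothetical and are not part of the collector; the workload is only a stand-in so that the sketch compiles on its own. Note that timing a forced collection gives the cost of a full collection, not the length of an individual incremental pause.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Hypothetical helper: wall-clock time, in milliseconds, of one call to fn(). */
static double elapsed_ms(void (*fn)(void))
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1000.0
         + (t1.tv_nsec - t0.tv_nsec) / 1.0e6;
}

/* Hypothetical helper: average time of `runs` calls to fn(), in milliseconds. */
static double average_ms(void (*fn)(void), int runs)
{
    double total = 0.0;
    for (int i = 0; i < runs; i++)
        total += elapsed_ms(fn);
    return runs > 0 ? total / runs : 0.0;
}

/* Stand-in workload so the sketch is self-contained; in a real
 * measurement the function passed in would be GC_gcollect. */
static void sample_workload(void)
{
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < 100000UL; i++)
        sink += i;
}
```

In a program linked against the collector, something like average_ms(GC_gcollect, 5), called after GC_INIT() and once the heap is populated, could be run once per collector mode and the results compared.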
> My program is heavily multithreaded (~90 threads are running during the initial startup).
And all of those threads operate on garbage-collected memory (or the app is too complex to determine that), right?
What's the total size of all threads' stacks?
> In my previous email I was indicating that all of my startup time was taken up in the collector. This has been fixed and was actually a problem with my program, and not with the collector. The above mentioned 2s world-stop collection times are happening on the order of every 5 minutes or so, not every second or so, though during startup as the large data structure is built the collection times are more frequent (every 5-10s or so over a period of a few minutes).
If you think you could miss an external event (or fail to respond to it in a reasonable time) while the world is stopped, then consider using a stop_func (set the default one with GC_set_stop_func()).
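As a rough illustration, a stop function is a callback of the form int f(void) that the collector polls during a collection; a nonzero return asks it to abandon that collection. The sketch below gives up once a time budget is exceeded. The names arm_gc_deadline and deadline_stop_func are hypothetical, and the time-budget policy is only an assumption about what "reasonable time" might mean for your app.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Absolute time after which a running collection should be abandoned. */
static struct timespec gc_deadline;

/* Hypothetical helper: allow a collection to run for budget_ms from now. */
static void arm_gc_deadline(long budget_ms)
{
    clock_gettime(CLOCK_MONOTONIC, &gc_deadline);
    gc_deadline.tv_sec += budget_ms / 1000;
    gc_deadline.tv_nsec += (budget_ms % 1000) * 1000000L;
    if (gc_deadline.tv_nsec >= 1000000000L) {
        gc_deadline.tv_sec++;
        gc_deadline.tv_nsec -= 1000000000L;
    }
}

/* Has the shape of a GC_stop_func: returns nonzero once the deadline
 * has passed. It does no allocation and takes no locks. */
static int deadline_stop_func(void)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    if (now.tv_sec != gc_deadline.tv_sec)
        return now.tv_sec > gc_deadline.tv_sec;
    return now.tv_nsec >= gc_deadline.tv_nsec;
}

/* Possible use with the collector (requires <gc.h>, link with -lgc):
 *
 *   arm_gc_deadline(500);
 *   if (!GC_try_to_collect(deadline_stop_func)) {
 *       // collection was abandoned; retry at a quieter moment
 *   }
 *
 * GC_set_stop_func(deadline_stop_func) would install the same function
 * as the default; the deadline then has to be re-armed by the app. */
```

An abandoned collection reclaims nothing, so a budget that is too tight will probably show up as heap growth rather than as shorter pauses.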
> On the whole, I'm pretty satisfied with this, and am very happy that changes to my data structure and program enabled by the collector brought my overall address space and resident set sizes down by two thirds!
> I do have apprehension about collection times, however. Right now I can tolerate a 2200ms stop in the collector, but I do notice it, and I do expect my data set size to rise over time. Is there something I'm missing? Does it make sense to try to disable parallel marking and try again enabling incremental collection? Will this significantly affect the performance of normal allocation? Are there any other suggestions for bringing down the collection times?
Try to maximize the use of GC_malloc_atomic[_...]() instead of GC_malloc[_...]().
Set GC_no_dls to 1 (e.g., by calling GC_set_no_dls(1)) before GC_INIT(), if that is possible for your app.
Also experiment with the GC_full_freq value (the "GC_FULL_FREQUENCY" env var), but only if GC_ENABLE_INCREMENTAL works.
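To illustrate the first suggestion above: the collector never scans objects obtained from GC_malloc_atomic(), so moving bulk pointer-free data (numbers, strings, raw buffers) there reduces the amount of memory the marker has to walk. A minimal sketch follows; the sample type and the sample_new function are hypothetical, and the USE_BDWGC macro with its stand-in definitions exists only so the fragment compiles without the collector installed.

```c
#include <stddef.h>
#include <string.h>

#ifdef USE_BDWGC                 /* define when building against the collector (-lgc) */
#  include <gc.h>
#else                            /* stand-ins, NOT the collector: for compiling the sketch only */
#  include <stdlib.h>
#  define GC_MALLOC(n)           calloc(1, (n))   /* GC_MALLOC returns cleared memory */
#  define GC_MALLOC_ATOMIC(n)    malloc(n)        /* atomic memory is not cleared */
#endif

/* Hypothetical list node: the node itself holds pointers, so it must be
 * scanned; the array it points to holds only doubles, so it need not be. */
typedef struct sample {
    struct sample *next;
    double *values;
    size_t count;
} sample;

static sample *sample_new(sample *next, const double *src, size_t count)
{
    sample *s = GC_MALLOC(sizeof *s);              /* pointer-containing: scanned */
    if (s == NULL)
        return NULL;
    s->values = GC_MALLOC_ATOMIC(count * sizeof *s->values);  /* pointer-free: skipped */
    if (s->values == NULL)
        return NULL;
    memcpy(s->values, src, count * sizeof *s->values);  /* initialize: atomic memory is not zeroed */
    s->count = count;
    s->next = next;
    return s;
}
```

The one rule to keep in mind is that an atomic object must never hold the only reference to another collected object, since the collector will not see it.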
> I will try (probably in a day or two when my cluster run is complete) to rebuild disabling parallel marking and enabling incremental collection to see if that helps. I don't know, however, that I would be able to turn off interior pointer checks. I can try that.
You can disable parallel marking without recompiling: just set the environment variable GC_MARKERS=1.
> Thanks for all of your suggestions.
> George T. Talbot
> <gtalbot at locuspharma.com>
> P.S. Enabling munmap() was very helpful when looking at output of "top" and "ps".