[Gc] Re: Understanding the performance of a
hans.boehm at hp.com
Tue Nov 28 11:06:57 PST 2006
Notice that roughly half the time is spent in GC_malloc, which grabs the first element from a free list. I think it's spending all of it's time acquiring and releasing locks. It looks to me like disappearing links and finalization are in the noise, as expected.
To check this conclusion, it might be useful to look at an instruction-level profile for GC_malloc. (This is blatant enough that even gdb and ^C might work. Otherwise oprofile or qprof can probably do it.)
I believe the Linux/X86 default behavior is to use a GC-implemented spin-lock, the fast path of which is inlined. The problem is that still requires an atomic operation like test and set, which is typically over 100 cycles on a Pentium 4 in the best case. (It may be one instruction, but it can easily be far slower than a function call.) The advantage of the spin lock is that you only do one per allocation, instead of one to lock and one to unlock.
A partial solution is probably to use the thread-local allocation facility. For 6.8:
1. Make sure the collector is built with THREAD_LOCAL_ALLOC defined, and make sure that GC_MALLOC and GC_MALLOC_ATOMIC (all caps) are used to allocate. (I'm afraid the collector you used for the profile is not built with THREAD_LOCAL_ALLOC. IIRC, that would cause the collector to switch to pthread locks, and would probably cause your test to run even slower.)
2. Define GC_REDIRECT_TO_LOCAL and then include gc_local_alloc.h before making any of the above calls.
If you still have performance issues, you may want to make sure that the correct (i.e. fastest) thread local storage mechanism is being used to get at the per-thread free lists. You probably want to make sure that USE_COMPILER_TLS gets defined on a modern Linux/X86 system.
A lot of this is better in 7.0. There, if the collector is built to support thread-local allocation, GC_malloc just does it. There are also hooks to inline the allocation calls and to avoid the thread local storage lookups if you can explicitly pass a pointer to the free lists in a local.
Bigloo uses inlined allocation for 6.8 and earlier, but that's slightly problematic, in that the 6.8 inlined allocation code is dependent on GC internals, and tends to break if you link against a different collector version. Manuel Serrano can tell you all about that.
> -----Original Message-----
> From: gc-bounces at napali.hpl.hp.com
> [mailto:gc-bounces at napali.hpl.hp.com] On Behalf Of Ludovic Courtès
> Sent: Tuesday, November 28, 2006 4:27 AM
> To: gc at napali.hpl.hp.com
> Subject: [Gc] Re: Understanding the performance of a
> (Re-sending without the large attachment...)
> "Boehm, Hans" <hans.boehm at hp.com> writes:
> > Can you generate an execution profile of the libgc-based Guile,
> > perhaps running GCBench?
> Sure. The full `gprof' output of the libgc-based Guile
> running GCBench is available at . `libgc' is 6.8. For
> some reason, the number of calls of libgc functions is not displayed.
> For comparison, here is the "top 10" flat profile of "normal Guile"
> running `GCBench' as well:
> Each sample counts as 0.01 seconds.
> % cumulative self self total
> time seconds seconds calls Ks/call Ks/call name
> 25.69 529.30 529.30 139465033 0.00 0.00
> 23.79 1019.51 490.21 9970 0.00 0.00 deval
> 23.07 1494.79 475.28 3074617 0.00 0.00
> 6.73 1633.40 138.61 264869792 0.00 0.00 scm_gc_mark
> 3.12 1697.78 64.38 143001687 0.00 0.00 scm_ilookup
> 2.95 1758.65 60.87 85265684 0.00 0.00 scm_list_2
> 2.42 1808.54 49.89 117193958 0.00 0.00 scm_list_1
> 2.05 1850.68 42.14 57511095 0.00 0.00 scm_acons
> 0.95 1870.33 19.65 15685771 0.00 0.00 scm_less_p
> 0.91 1889.10 18.77 1278 0.00 0.00
> > If you run GCBench with the standard Guile, presumably there are
> > essentially no malloc calls involved, since essentially all the
> > objects are small?
> Right, only the GC allocation routines are used.
> > Are you using a pre-built GC package, or did you build it?
> If you are
> > using a standard thread-safe build, without thread-local
> > the GC allocator will end up spending a lot of its time
> acquiring and
> > releasing the allocation lock, especially on a
> Pentium4-based machine.
> > If you are comparing it to a purely single-threaded Guile
> > that could easily account for the difference. (GC7 may largely fix
> > that, if that's indeed what you're seeing.)
> I'm usually using Debian's `libgc1c2' package. However, this
> time, I compiled my own (with `-pg'), without specifying any
> `configure' flag.
> Note that Guile supports preemptive multithreading since 1.8.
> From a GC viewpoint, each thread has its own "freelist" from
> which it can sweep cells, but other state is shared among
> threads and thus requires synchronization. In any case,
> isn't `pthread_mutex_lock ()' essentially a single "test and
> set" instruction on most platforms (since Linux has futexes)?
>  https://www.laas.fr/~lcourtes/tmp/libgc-guile-prof.txt.gz
> Gc mailing list
> Gc at linux.hpl.hp.com
More information about the Gc