[Gc] Re: Understanding the performance of a `libgc'-based application

Ludovic Courtès ludovic.courtes at laas.fr
Tue Nov 28 14:56:57 PST 2006


Hi,

"Boehm, Hans" <hans.boehm at hp.com> writes:

> Notice that roughly half the time is spent in GC_malloc, which grabs
> the first element from a free list.  I think it's spending all of its
> time acquiring and releasing locks.  It looks to me like disappearing
> links and finalization are in the noise, as expected.
>
> To check this conclusion, it might be useful to look at an
> instruction-level profile for GC_malloc.  (This is blatant enough that
> even gdb and ^C might work.  Otherwise oprofile or qprof can probably
> do it.)

I haven't been able to do anything meaningful in that respect so far
(looking at gdb's disassembler output is not very helpful).
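
If I understand the suggestion correctly, the idea is to attach to the
running test, interrupt it repeatedly, and see which instruction of
`GC_malloc' it keeps landing on, i.e. roughly (hypothetical session;
path and PID made up):

  $ gdb /usr/bin/guile 12345
  (gdb) continue
  ^C
  (gdb) x/i $pc
  (gdb) disassemble GC_malloc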

> I believe the Linux/X86 default behavior is to use a GC-implemented
> spin-lock, the fast path of which is inlined.  The problem is that this
> still requires an atomic operation like test-and-set, which is
> typically over 100 cycles on a Pentium 4 in the best case.  (It may be
> one instruction, but it can easily be far slower than a function
> call.)  The advantage of the spin lock is that you only do one per
> allocation, instead of one to lock and one to unlock.

FWIW, the profile I sent was obtained on GNU/Linux on PowerPC.
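
If I read the above correctly, the fast path in question boils down to
something like this (my own sketch using GCC's `__sync' builtins, not
the collector's actual code):

  static volatile int alloc_lock = 0;

  static void *
  sketch_malloc (void **free_list)
  {
    void *result;

    /* Acquire: the one atomic operation per allocation.  */
    while (__sync_lock_test_and_set (&alloc_lock, 1) != 0)
      ;                                 /* spin */

    result = *free_list;                /* grab the first free-list element */
    if (result != 0)
      *free_list = *(void **) result;

    alloc_lock = 0;                     /* release: a plain store, no atomic */
    return result;
  }

which would indeed make the atomic instruction, rather than the
free-list manipulation itself, the dominant cost.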

> A partial solution is probably to use the thread-local allocation
> facility.  For 6.8:
>
> 1. Make sure the collector is built with THREAD_LOCAL_ALLOC defined,
> and make sure that GC_MALLOC and GC_MALLOC_ATOMIC (all caps) are used
> to allocate.  (I'm afraid the collector you used for the profile is
> not built with THREAD_LOCAL_ALLOC.  IIRC, that would cause the
> collector to switch to pthread locks, and would probably cause your
> test to run even slower.)
>
> 2. Define GC_REDIRECT_TO_LOCAL and then include gc_local_alloc.h
> before making any of the above calls.

I did so, but the resulting code always segfaults when trying to access
thread-specific storage (see below).
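
For the record, my understanding of the recipe is roughly this (a
minimal sketch; the <gc/...> install path and the `alloc_cell' wrapper
are just for illustration):

  #define GC_REDIRECT_TO_LOCAL 1
  #include <gc/gc_local_alloc.h>

  static void *
  alloc_cell (void)
  {
    /* With the redirection in place, GC_MALLOC goes through
       GC_local_malloc, which is where the crash below happens.  */
    return GC_MALLOC (336);
  }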

> If you still have performance issues, you may want to make sure that
> the correct (i.e. fastest) thread local storage mechanism is being
> used to get at the per-thread free lists.  You probably want to make
> sure that USE_COMPILER_TLS gets defined on a modern Linux/X86 system.

No, `USE_COMPILER_TLS' doesn't get defined on Linux/x86 (as can be seen
from `configure.in').
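
(For context, my understanding is that `USE_COMPILER_TLS' makes the
collector reach the per-thread free lists through the compiler's
`__thread' keyword, roughly:

  /* Sketch, not the collector's code.  */
  static __thread void **my_free_lists;

instead of going through a pthread key lookup on every allocation.)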

So, I did the following:

  1. ./configure --enable-threads=posix

  2. make CPPFLAGS='-DUSE_COMPILER_TLS=1'

  3. make install

  4. run Guile

Guile terminated as follows (on x86 this time):

  Program terminated with signal 11, Segmentation fault.
  #0  GC_local_malloc (bytes=336) at pthread_support.c:300
  300             my_entry = *my_fl;

I tried without specifying any `CPPFLAGS' in step 2:

  Program terminated with signal 11, Segmentation fault.
  #0  GC_local_malloc (bytes=336) at ./include/private/specific.h:87
  87          tse * entry = *entry_ptr;   /* Must be loaded only once.    */
  (gdb) info registers 
  eax            0xf      15
  ecx            0x3c     60
  edx            0xbfff0  786416
  ebx            0xb7efd370       -1209019536
  esp            0xbfff0c60       0xbfff0c60
  ebp            0xbfff0c88       0xbfff0c88
  esi            0x0      0
  edi            0x2a     42
  eip            0xb7ef6385       0xb7ef6385 <GC_local_malloc+85>
  eflags         0x10206  [ PF IF RF ]
  cs             0x73     115
  ss             0x7b     123
  ds             0x7b     123
  es             0xc010007b       -1072693125
  fs             0x0      0
  gs             0x33     51

Finally, I tried CPPFLAGS='-DUSE_PTHREAD_SPECIFIC=1', but the resulting
code doesn't seem to use `pthread_getspecific ()' at all (presumably
because of the ifdef machinery in `pthread_support.c').
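
One way to double-check that, I suppose, is to see whether the library
references the symbol at all (the `.libs' path is just where libtool
left the shared object here):

  $ nm -D .libs/libgc.so | grep pthread_getspecific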

I'll keep experimenting with all this tomorrow and report back.

Thanks,
Ludovic.


