[Gc] Threaded GC questions
hans.boehm at hp.com
Wed Aug 17 11:10:48 PDT 2005
> From: Travis Griggs
> Sent: Wednesday, August 17, 2005 10:31 AM
> To: gc at napali.hpl.hp.com
> Subject: Re: [Gc] Threaded GC questions
> On Aug 16, 2005, at 17:27, Boehm, Hans wrote:
> >> 2) Mutator threads need to periodically suspend all other threads, it
> >> sounds like this is done via the use of program wide signal, and
> >> specialized signal handlers which yield/spin/whatever so that the
> >> mutator is not disturbed during collection. I'm sure this is fine for
> >> normal linux scheduling semantics, but what if one is using
> >> pthread_sched() to make some of the threads sched_rr and/or
> >> sched_fifo at higher levels. Is there a possibility for priority
> >> inversion in this case? I'm curious what other signal interactions
> >> one can expect as well...
> > Suspended threads normally wait for a second signal. In rare cases,
> > they might sleep. I don't immediately see any priority inversion
> > issues.
> > The default locking strategy without thread-local allocation does use
> > custom locks that spin/yield/sleep. I think that should be avoidable
> > by defining USE_PTHREAD_LOCKS (a rarely tested option).
> Is this something the client program needs to define when linking
> against it? Or does libgc have to be built with this option? I think
> I'd like to look at using this. Not because I don't trust your locks
> :), but because I've already got to worry about the pthread stuff, I'd
> like to just keep worrying about that and not anything else.
It has to be defined when libgc is built. With thread-local allocation,
it's implied. It's a bit risky to use it without thread-local allocation.
Some older pthreads implementations had mutexes that effectively
handed off the mutex to a waiter but left the previous owner
running. This often leads to convoying, with one context switch
per lock acquisition, i.e. one context switch per allocation. If
you run with a pthreads implementation that does this, the results
won't be pretty. You're talking about integral factors of slowdown,
> >> 3) The mentioned page talks about the thread local allocation
> >> strategy. Am I right in understanding that when using this, the
> >> degree of intrathread synchronization is reduced (or is it removed
> >> completely?) at the expense of slightly higher allocation times. Do I
> >> have to use both -DTHREAD_LOCAL_ALLOC as well as include
> >> gc_local_alloc.h? Or is the first enough? Or do I have to actually
> >> rebuild the gc lib? Can thread_local allocation be mixed with non?
> > This is changing with 7.0. Under 6.x, you have to build the GC with
> > THREAD_LOCAL_ALLOC defined, and then call the custom allocation
> > functions from gc_local_alloc.h (or include that file after defining
> > GC_REDIRECT_TO_LOCAL, and then use the uppercase names).
> > In (experimental!) 7.0alpha4, if you build with THREAD_LOCAL_ALLOC
> > defined, GC_malloc gets you thread-local allocation.
> Build the collector? Or the client program? If all I have to do is add
> -DTHREAD_LOCAL_ALLOC to my Makefile, that's a desirable thing. :)
It's a libgc build option. However, the GNU style build machinery
turns it on by default, if the collector is configured for threads,
and it's supported on your platform. If you're using a standard
package on Debian, it's probably enabled. (And you're probably
getting pthread synchronization anyway, so you should really use
> > Thread-local allocation is usually a good thing, both for allocation
> > times and processor scalability. It should really be the default if
> > you need thread support. It may cost you a bit of space.
> OK, I'm interested. Is this site
> >) the place to get it. How close to solid release is this?
Yes. I'm trying to find time to solidify it, and to make a
CVS tree generally accessible. It's hard to say ...
> Does thread local allocation remove the need to pause all other
> threads? The space I have to spare.
No. Thread-local allocation only removes the need to acquire
and release a lock per allocation. The garbage collection process
is largely unaffected.
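
To make the lock-per-allocation point concrete, here is a minimal sketch
in C. It is not libgc code: alloc_shared, alloc_local, CHUNK, and the
counters are hypothetical names, and the "heap" is just a counter that
hands out object ids. It only illustrates the general idea of a thread
taking a batch of objects under the lock and then allocating from that
batch without synchronization.

```c
#include <pthread.h>

#define CHUNK 64  /* objects handed to a thread per refill (arbitrary choice) */

static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long next_id;            /* stand-in for the shared heap */
static unsigned long lock_acquisitions;  /* how often alloc_lock was taken */

/* Shared path: every allocation acquires and releases the lock. */
unsigned long alloc_shared(void)
{
    unsigned long id;
    pthread_mutex_lock(&alloc_lock);
    lock_acquisitions++;
    id = next_id++;
    pthread_mutex_unlock(&alloc_lock);
    return id;
}

/* Per-thread cache of ids reserved from the shared heap. */
struct local_cache { unsigned long next, limit; };
static _Thread_local struct local_cache cache;  /* starts empty: next == limit */

/* Thread-local path: the lock is taken only when the cache runs dry. */
unsigned long alloc_local(void)
{
    if (cache.next == cache.limit) {
        pthread_mutex_lock(&alloc_lock);
        lock_acquisitions++;
        cache.next = next_id;
        next_id += CHUNK;        /* reserve a whole chunk for this thread */
        cache.limit = next_id;
        pthread_mutex_unlock(&alloc_lock);
    }
    return cache.next++;
}

/* Read the lock counter under the lock. */
unsigned long lock_count(void)
{
    unsigned long n;
    pthread_mutex_lock(&alloc_lock);
    n = lock_acquisitions;
    pthread_mutex_unlock(&alloc_lock);
    return n;
}

/* Zero the lock counter; the heap and the caches are left alone. */
void reset_stats(void)
{
    pthread_mutex_lock(&alloc_lock);
    lock_acquisitions = 0;
    pthread_mutex_unlock(&alloc_lock);
}
```

In this sketch, 1000 calls to alloc_shared take the lock 1000 times,
while 1000 calls to alloc_local from one thread take it 16 times. The
unused tail of each thread's chunk is where the extra space goes. Nothing
here stops other threads, which is why the collection itself is
unaffected.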
If you turn on parallel marking (another build-time option, not
enabled by default, since it slows down uniprocessor GC slightly)
all threads will still be stopped during GC, but the GC will use
as many threads as processors, so you shouldn't see idle processors
> >> I'd appreciate any insights/suggestions on what the best possible way
> >> to mate libgc with our program is:
> >> An example instance of our program might be running as many as 10+
> >> threads at 4 different priority levels. The lowest level (normal
> >> pthread scheduling semantics) is where the socket server runs at,
> >> accepts new connections (creates a thread per) and services outside
> >> requests. These connections rarely allocate memory, usually a large
> >> structure or two. Mostly they just facilitate queries against the run
> >> state.
> > Without parallel marking, the collector just runs inside the thread
> > that triggered the GC. This may not be a good thing in your
> > environment, since its priority will be unpredictable. However, once
> > it acquires the allocation lock, I'd expect it to run as everything
> > else blocks waiting on the lock. I'm not sure how this would
> > account for a deadlock.
> So using parallel marking sounds like a good thing for us probably.
> I'll look around, same questions here I guess. Is it an aspect of the
> way the library is built or the client?
The library. Parallel marking actually doesn't avoid this issue,
it just complicates it. Currently the collector runs in one of the
client threads + <processors - 1> helper threads. There are also
other reasons it probably shouldn't run in the client thread.
> What about incremental? One of the web pages I read said something
> about "most applications are fine as is". I decided for the time to run
> that way, but wondered if I'd want that long term.
If you can do without it, you don't want it. Currently on X86/Linux
it relies on mprotect+signals to track heap modification. This
requires that you are careful about system calls that write to the heap,
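
As a rough illustration of the mprotect+signals technique (not the
collector's actual implementation), here is a minimal sketch in C,
assuming Linux. All names (tracker_init, tracker_protect_all, NPAGES,
and so on) are hypothetical. Pages are write-protected; the first write
to a page faults, and the handler records the page as dirty and
unprotects it so the write can be retried. Calling mprotect from a
signal handler is not guaranteed safe by POSIX, though it works on
Linux.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 4  /* size of the toy "heap" in pages (arbitrary) */

static char *heap;
static size_t page_size;
static volatile sig_atomic_t dirty[NPAGES];

/* Fault handler: mark the written page dirty and make it writable again. */
static void on_fault(int sig, siginfo_t *info, void *ctx)
{
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)heap;
    (void)ctx;
    if (addr < base || addr >= base + NPAGES * page_size) {
        signal(sig, SIG_DFL);  /* not our page: let the fault recur and crash */
        return;
    }
    size_t idx = (addr - base) / page_size;
    dirty[idx] = 1;
    mprotect(heap + idx * page_size, page_size, PROT_READ | PROT_WRITE);
}

/* Map the toy heap and install the fault handler. Returns 0 on success. */
int tracker_init(void)
{
    struct sigaction sa;
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    heap = mmap(NULL, NPAGES * page_size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (heap == MAP_FAILED)
        return -1;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}

/* Clear the dirty bits and write-protect every page. Returns 0 on success. */
int tracker_protect_all(void)
{
    for (size_t i = 0; i < NPAGES; i++)
        dirty[i] = 0;
    return mprotect(heap, NPAGES * page_size, PROT_READ);
}

char *tracker_heap(void) { return heap; }
size_t tracker_page_size(void) { return page_size; }
int tracker_is_dirty(size_t i) { return dirty[i]; }
```

After tracker_init() and tracker_protect_all(), a store into the third
page sets tracker_is_dirty(2) and leaves the other pages clean. The
system call caveat shows up here too: if the kernel is asked to write
into a still-protected page (say, read() into that buffer), the call
fails with EFAULT instead of running the handler.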
> One of the things working with the Smalltalk garbage collector has
> taught me is that simple things are simple to tune, but anything
> complex, you really want some way of profiling what's going on to be
> able to tune it. What (if any) tools or techniques can one use to
> determine how well the collector is working with one's environment? For
> example, one of the things I've wondered... how often is it
> and how long is it taking? Also, we've seen some processor times lately
> that showed our CPU usage climbing over time. We wondered "is this a
> heap fragmentation thing?" Is there a way to prove/disprove this?
Setting the GC_PRINT_STATS environment variable will give you
some information. See doc/README.environment for some other hooks.
Fragmentation is typically a problem at most at the large block level.
It very rarely impacts time performance noticeably, at least in my
I also find a simple PC sampling profiler (e.g. qprof) helpful.
> > It would be nice to see the (relevant pieces of) the thread stacks
> > after a deadlock. Usually it's fairly easy to tell what went wrong.
> I'll see if I can grab these.
> > Note that depending on your gdb and gc version, gdb itself may induce
> > deadlocks. The most recent gc versions work around one of the
> > problems there, but there may be others.
> Thanks for the reply Hans.
> Travis Griggs
> "It had better be a pretty good meeting, to be better than no
> meeting at all" -- Boyd K Packer
> DISCLAIMER: This email is bound by the terms and conditions
> described at https://www.key.net/disclaimer.htm