[Gc] Race condition between thread termination and garbage
collection under Solaris 10/x86
hans.boehm at hp.com
Tue Mar 2 17:35:12 PST 2010
It appears that my statements about Linux below were incorrect. We may have similar issues there. As far as I can tell the missing thread is actually still around; gdb just doesn't see it anymore.
> -----Original Message-----
> From: gc-bounces at napali.hpl.hp.com
> [mailto:gc-bounces at napali.hpl.hp.com] On Behalf Of Boehm, Hans
> Sent: Tuesday, March 02, 2010 5:23 PM
> To: Burkhard Linke; gc at napali.hpl.hp.com
> Subject: RE: [Gc] Race condition between thread termination
> and garbage collection under Solaris 10/x86
> I don't know why they decided to block signals in exiting
> threads. I don't immediately find any justification in Unix
> or Posix standards. If anything, it seems inconsistent with
> those. (If you Google "An exiting thread runs with all
> signals blocked" you get only Solaris pages.)
> This is a somewhat fundamental problem for us. Once a thread
> enters this exiting state, we can't stop it. Probably the
> best we could do is notice that a thread is in this state and
> defer collection until it actually exits. We don't have any
> way to safely collect while it's still running. But such
> threads can (a) run indefinitely or (b) block indefinitely
> for another thread. (And for correct C++ code, I suspect (b)
> is sometimes unavoidable.) Thus this option doesn't seem
> very attractive at all, and is likely to provoke nastier
> failures under less likely conditions.
> Even for non-GC code, this seems like a very dubious design
> decision to me. It means that cleanup code can't rely on any
> library calls that internally use signals, e.g. for
> timeouts. I'm not sure how I would even know which library
> calls do that.
> On my old Itanium Linux machine, your program also sometimes
> hangs, but for other reasons. It looks like the child is
> gone but still in the thread table. I'll see if I can debug
> that further. It looks like cancellation is generally still
> a problem, even in the CVS version.
> As far as the Solaris issue is concerned, aside from possibly
> filing a bug with Sun, maybe deferring GC somehow is still
> the best option. It would probably at least substantially
> reduce the deadlock frequency. Unfortunately, a hung exiting
> thread is likely to cause the process to run out of memory.
> We may want to defer for only so long. And I think we want
> to do this only on Solaris.
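The bounded deferral suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions, not libgc code: the names demo_thread_exit_begin(), demo_thread_exit_end() and demo_defer_collection() are hypothetical, and the sketch treats "the thread has exited" as "the thread has finished its cleanup handler", which is an approximation.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical bookkeeping; none of these names exist in libgc. */
static pthread_mutex_t demo_exit_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t demo_exit_cond = PTHREAD_COND_INITIALIZER;
static int demo_exiting_threads = 0;

/* Called by a thread when it enters its exit/cleanup path. */
static void demo_thread_exit_begin(void)
{
    pthread_mutex_lock(&demo_exit_lock);
    demo_exiting_threads++;
    pthread_mutex_unlock(&demo_exit_lock);
}

/* Called by a thread when its cleanup has finished. */
static void demo_thread_exit_end(void)
{
    pthread_mutex_lock(&demo_exit_lock);
    assert(demo_exiting_threads > 0);
    demo_exiting_threads--;
    if (demo_exiting_threads == 0)
        pthread_cond_broadcast(&demo_exit_cond);
    pthread_mutex_unlock(&demo_exit_lock);
}

/*
 * Called by the collector before stopping the world.
 * Waits at most timeout_ms for exiting threads to finish.
 * Returns 1 if no thread is exiting, 0 if the deadline passed first.
 */
static int demo_defer_collection(long timeout_ms)
{
    struct timespec deadline;
    int rc = 0;
    int clear;

    /* Condition variables use CLOCK_REALTIME by default. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&demo_exit_lock);
    /* Stop waiting on timeout or on any other error. */
    while (demo_exiting_threads > 0 && rc == 0)
        rc = pthread_cond_timedwait(&demo_exit_cond, &demo_exit_lock,
                                    &deadline);
    clear = (demo_exiting_threads == 0);
    pthread_mutex_unlock(&demo_exit_lock);
    return clear;
}
```

A return value of 0 means the collector has to choose between collecting anyway (risking the deadlock) and letting the heap grow, which is the trade-off described above.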
> > -----Original Message-----
> > From: gc-bounces at napali.hpl.hp.com
> > [mailto:gc-bounces at napali.hpl.hp.com] On Behalf Of Burkhard Linke
> > Sent: Monday, March 01, 2010 7:56 AM
> > To: gc at napali.hpl.hp.com
> > Subject: [Gc] Race condition between thread termination and garbage
> > collection under Solaris 10/x86
> > Hi,
> > after having random deadlocks with mono and the current GC release
> > (both 7.1 and CVS checkout), I was finally able to locate the problem
> > in the thread exit handler of boehm-gc.
> > libgc was configured with --enable-parallel-mark
> > --enable-threads=posix --enable-munmap=2 --enable-large-config CC=cc
> > LDFLAGS=-m64 CPPFLAGS=-m64 CXX=CC (using Sun Studio compiler 12.1,
> > also tested with gcc 4.1 and gcc 4.3)
> > Symptoms (stack trace of deadlock produced by pstack):
> > 1918: /vol/src/gnu/mono/contrib/bdwgc/.libs/threadtest
> > ----------------- lwp# 1 / thread# 1 --------------------
> > 00007fffffd172f7 lwp_park (0, 0, 0)
> > 00007fffffd0b59b sema_wait () + b
> > 00007fffffe36eb2 sem_wait () + 22
> > 00007fffffe85c43 GC_stop_world () + 147 ...
> > ----------------- lwp# 2 / thread# 2 --------------------
> > 00007fffffd172f7 lwp_park (0, 0, 0)
> > 00007fffffd0fd08 mutex_lock_impl () + e8
> > 00007fffffd0fdfb () + b
> > 00007fffffe8514c GC_lock () + 38
> > 00007fffffe846d4 GC_unregister_my_thread () + 2c
> > 00007fffffe84929 GC_thread_exit_proc () + 1c1 ...
> > Taken from the Solaris pthread_exit manpages:
> > *snipsnap*
> > An exiting thread runs with all signals blocked. All thread
> > termination functions, including cancellation cleanup
> > handlers and thread-specific data destructor functions, are
> > called with all signals blocked.
> > *snipsnap*
> > So the race condition occurs if a thread is currently terminating and
> > another thread triggers a garbage collection.
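> > The suspend signal is blocked within the exit handler, so the
> > collecting thread waits forever in GC_stop_world() for an
> > acknowledgement while holding the GC lock, and the exiting thread
> > blocks acquiring that lock in GC_unregister_my_thread(): a deadlock.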
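The underlying mechanism, a signal that is sent to a thread which has it blocked and therefore stays pending without running its handler, can be illustrated with a small sketch. It assumes SIGUSR1 as a stand-in for the collector's suspend signal, and all demo_* names are hypothetical. It runs in an ordinary thread context, so it does not reproduce the Solaris exit path, where reportedly even unblocking does not lead to delivery.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <string.h>

/* Assumption: SIGUSR1 stands in for the collector's suspend signal. */
#define DEMO_SUSPEND_SIGNAL SIGUSR1

static volatile sig_atomic_t demo_handler_runs = 0;

static void demo_suspend_handler(int sig)
{
    (void)sig;
    demo_handler_runs++;
}

/*
 * Returns 1 if a suspend signal sent while blocked stays pending without
 * running the handler and is delivered once it is unblocked; 0 if not;
 * -1 if the setup fails.
 */
static int demo_blocked_signal_stays_pending(void)
{
    struct sigaction sa, old_sa;
    sigset_t block, old_mask, pending;
    int was_pending, ran_while_blocked, ran_after_unblock;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = demo_suspend_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(DEMO_SUSPEND_SIGNAL, &sa, &old_sa) != 0)
        return -1;

    demo_handler_runs = 0;
    sigemptyset(&block);
    sigaddset(&block, DEMO_SUSPEND_SIGNAL);
    if (pthread_sigmask(SIG_BLOCK, &block, &old_mask) != 0)
        return -1;

    /* The "collector" asks this thread to suspend. */
    if (pthread_kill(pthread_self(), DEMO_SUSPEND_SIGNAL) != 0)
        return -1;

    sigemptyset(&pending);
    if (sigpending(&pending) != 0)
        return -1;
    was_pending = sigismember(&pending, DEMO_SUSPEND_SIGNAL);
    ran_while_blocked = demo_handler_runs;

    /* Unblocking delivers the pending signal before the call returns. */
    pthread_sigmask(SIG_UNBLOCK, &block, NULL);
    ran_after_unblock = demo_handler_runs;

    /* Restore the previous mask and handler. */
    pthread_sigmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(DEMO_SUSPEND_SIGNAL, &old_sa, NULL);

    return was_pending == 1 && ran_while_blocked == 0
        && ran_after_unblock == 1;
}
```

While the signal is blocked, a sender that waits for the handler's acknowledgement (as the sem_wait() in the GC_stop_world() frame above suggests) would wait indefinitely.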
> > I've attached a little test program that allows me to reproduce this
> > error (thread.c). I've also added code to the exit function
> > GC_thread_exit_proc() defined in pthread_support.c to write debug
> > information about the signal mask and the currently pending signals.
> > It also unblocks the suspend signal and even resends the suspend
> > signal if it is currently pending.
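The attached debug code is not reproduced in this message. A minimal sketch of how the blocked and pending state of each signal could be queried from inside a thread might look like this; demo_signal_state() and demo_dump_signals() are hypothetical names, and the output format only loosely follows the listing below.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>

/*
 * Hypothetical helper: report whether `sig` is blocked and/or pending
 * for the calling thread. Returns 0 on success, -1 on failure.
 */
static int demo_signal_state(int sig, int *blocked, int *pending)
{
    sigset_t mask, pend;
    int b, p;

    assert(blocked != NULL && pending != NULL);

    /* A NULL new set only queries the current mask. */
    if (pthread_sigmask(SIG_SETMASK, NULL, &mask) != 0)
        return -1;
    if (sigpending(&pend) != 0)
        return -1;

    b = sigismember(&mask, sig);
    p = sigismember(&pend, sig);
    if (b < 0 || p < 0)
        return -1;  /* not a valid signal number */

    *blocked = b;
    *pending = p;
    return 0;
}

/*
 * Print one line per valid signal in [first, last].
 * Returns the number of lines printed.
 */
static int demo_dump_signals(FILE *out, int first, int last)
{
    int sig, printed = 0;

    for (sig = first; sig <= last; sig++) {
        int blocked, pending;

        if (demo_signal_state(sig, &blocked, &pending) != 0)
            continue;
        fprintf(out, "signal: (%2d) blocked: %-3s pending: %s\n",
                sig, blocked ? "yes" : "no", pending ? "yes" : "no");
        printed++;
    }
    return printed;
}
```

Calling demo_dump_signals(stderr, 1, SIGRTMAX) from a thread exit handler would show which signals are blocked there; SIGRTMAX may not be available on every platform.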
> > Output of the test program:
> > > ./threadtest
> > starting test thread
> > cancelling test thread
> > doing final collection
> > signals in thread 2
> > thread: 2 signal: EXIT ( 0) blocked: yes pending: yes
> > thread: 2 signal: HUP ( 1) blocked: yes pending: no
> > ... most signals blocked except KILL, STOP and CANCEL...
> > thread: 2 signal: RTMIN (41) blocked: yes pending: no
> > thread: 2 signal: RTMIN+1 (42) blocked: yes pending: no
> > thread: 2 signal: RTMIN+2 (43) blocked: yes pending: no
> > thread: 2 signal: RTMIN+3 (44) blocked: yes pending: no
> > thread: 2 signal: RTMAX-3 (45) blocked: yes pending: no
> > thread: 2 signal: RTMAX-2 (46) blocked: yes pending: no
> > thread: 2 signal: RTMAX-1 (47) blocked: yes pending: yes
> > signal 47 in thread 2 sending pending suspend signal 47 to 2
> > unregistering thread
> > At this point the test program is deadlocked. Even adding code to
> > GC_thread_exit_proc() that unblocks the suspend signal and calls
> > pthread_kill() if the signal is pending does not trigger signal
> > delivery.
> > I do not have any clue how to fix this problem. The applications I
> > develop using mono are heavily based on threading, and thus this
> > random deadlock problem is a show stopper for me. Maybe someone with
> > more insight into the Solaris pthread implementation can give advice
> > on how to fix this.
> > With best regards
> > Burkhard Linke