[Gc] Race condition between thread termination and garbage collection under Solaris 10/x86
hans.boehm at hp.com
Tue Mar 2 17:22:36 PST 2010
I don't know why they decided to block signals in an exiting thread. I don't immediately find any justification in the Unix or POSIX standards. If anything, it seems inconsistent with those. (If you Google "An exiting thread runs with all signals blocked" you get only Solaris pages.)
This is a somewhat fundamental problem for us. Once a thread enters this exiting state, we can't stop it. Probably the best we could do is notice that a thread is in this state and defer collection until it actually exits. We don't have any way to safely collect while it's still running. But such threads can (a) run indefinitely or (b) block indefinitely waiting for another thread. (And for correct C++ code, I suspect (b) is sometimes unavoidable.) Thus this option doesn't seem very attractive at all, and is likely to provoke nastier failures under less likely conditions.
Even for non-GC code, this seems like a very dubious design decision to me. It means that cleanup code can't rely on any library calls that internally use signals, e.g. for timeouts. I'm not sure how I would even know which library calls do that.
On my old Itanium Linux machine, your program also sometimes hangs, but for other reasons. It looks like the child is gone but still in the thread table. I'll see if I can debug that further. It looks like cancellation is generally still a problem, even in the CVS version.
As far as the Solaris issue is concerned, aside from possibly filing a bug with Sun, maybe deferring GC somehow is still the best option. It would probably at least substantially reduce the deadlock frequency. Unfortunately, a hung exiting thread is likely to cause the process to run out of memory. We may want to defer for only so long. And I think we want to do this only on Solaris.
> -----Original Message-----
> From: gc-bounces at napali.hpl.hp.com
> [mailto:gc-bounces at napali.hpl.hp.com] On Behalf Of Burkhard Linke
> Sent: Monday, March 01, 2010 7:56 AM
> To: gc at napali.hpl.hp.com
> Subject: [Gc] Race condition between thread termination and
> garbage collection under Solaris 10/x86
> After having random deadlocks with mono and the current GC
> release (both 7.1 and CVS checkout), I was finally able to
> locate the problem in the thread exit handler of boehm-gc.
> libgc was configured with --enable-parallel-mark
> --enable-threads=posix --enable-munmap=2
> --enable-large-config CC=cc LDFLAGS=-m64 CPPFLAGS=-m64 CXX=CC
> (using Sun Studio compiler 12.1, also tested with gcc 4.1
> and gcc 4.3)
> Symptoms (stack trace of deadlock produced by pstack):
> 1918: /vol/src/gnu/mono/contrib/bdwgc/.libs/threadtest
> ----------------- lwp# 1 / thread# 1 --------------------
> 00007fffffd172f7 lwp_park (0, 0, 0)
> 00007fffffd0b59b sema_wait () + b
> 00007fffffe36eb2 sem_wait () + 22
> 00007fffffe85c43 GC_stop_world () + 147 ...
> ----------------- lwp# 2 / thread# 2 --------------------
> 00007fffffd172f7 lwp_park (0, 0, 0)
> 00007fffffd0fd08 mutex_lock_impl () + e8
> 00007fffffd0fdfb mutex_lock () + b
> 00007fffffe8514c GC_lock () + 38
> 00007fffffe846d4 GC_unregister_my_thread () + 2c
> 00007fffffe84929 GC_thread_exit_proc () + 1c1 ...
> Taken from the Solaris pthread_exit manpages:
> An exiting thread runs with all signals blocked. All thread
> termination functions, including cancellation cleanup
> handlers and thread-specific data destructor functions, are
> called with all signals blocked.
> So the race condition occurs if a thread is currently
> terminating and another thread triggers a garbage collection.
> The suspend signal is blocked within the exit handler and
> acquiring the GC lock in GC_unregister_my_thread results in a
> deadlock.
> I've attached a little test program that allows me to
> reproduce this error (thread.c). I've also added code to the
> exit function GC_thread_exit_proc() defined in
> pthread_support.c to write debug information about the signal
> mask and the currently pending signals. It also unblocks the
> suspend signal and even resends the suspend signal if it is
> currently pending.
> Output of the test program:
> > ./threadtest
> starting test thread
> cancelling test thread
> doing final collection
> signals in thread 2
> thread: 2 signal: EXIT ( 0) blocked: yes pending: yes
> thread: 2 signal: HUP ( 1) blocked: yes pending: no
> ... most signals blocked except KILL, STOP and CANCEL...
> thread: 2 signal: RTMIN (41) blocked: yes pending: no
> thread: 2 signal: RTMIN+1 (42) blocked: yes pending: no
> thread: 2 signal: RTMIN+2 (43) blocked: yes pending: no
> thread: 2 signal: RTMIN+3 (44) blocked: yes pending: no
> thread: 2 signal: RTMAX-3 (45) blocked: yes pending: no
> thread: 2 signal: RTMAX-2 (46) blocked: yes pending: no
> thread: 2 signal: RTMAX-1 (47) blocked: yes pending: yes
> unblocking signal 47 in thread 2
> sending pending suspend signal 47 to 2
> unregistering thread
> At this point the test program is deadlocked. Even adding
> code to GC_thread_exit_proc() that unblocks the suspend signal
> and calls pthread_kill() if the signal is pending does not
> trigger signal delivery.
> I do not have any clue how to fix this problem. The
> applications I develop using mono are heavily based on
> threading, and thus this random deadlock problem is a show
> stopper for me. Maybe someone with more insight into the
> Solaris pthread implementation can advise on how to fix this.
> With best regards
> Burkhard Linke