[Gc] Race condition between thread termination and garbage collection under Solaris 10/x86

Burkhard Linke blinke at CeBiTec.Uni-Bielefeld.DE
Mon Mar 1 07:56:23 PST 2010


After experiencing random deadlocks with mono and the current GC release 
(both 7.1 and a CVS checkout), I was finally able to locate the problem in 
the thread exit handler of boehm-gc. libgc was configured 
with --enable-parallel-mark --enable-threads=posix --enable-munmap=2 --enable-large-config 
CC=cc LDFLAGS=-m64 CPPFLAGS=-m64 CXX=CC  (using the Sun Studio 12.1 compiler; 
also tested with gcc 4.1 and gcc 4.3).

Symptoms (stack trace of the deadlock, produced by pstack):

1918:   /vol/src/gnu/mono/contrib/bdwgc/.libs/threadtest
-----------------  lwp# 1 / thread# 1  --------------------
 00007fffffd172f7 lwp_park (0, 0, 0)
 00007fffffd0b59b sema_wait () + b
 00007fffffe36eb2 sem_wait () + 22
 00007fffffe85c43 GC_stop_world () + 147
-----------------  lwp# 2 / thread# 2  --------------------
 00007fffffd172f7 lwp_park (0, 0, 0)
 00007fffffd0fd08 mutex_lock_impl () + e8
 00007fffffd0fdfb mutex_lock () + b
 00007fffffe8514c GC_lock () + 38
 00007fffffe846d4 GC_unregister_my_thread () + 2c
 00007fffffe84929 GC_thread_exit_proc () + 1c1

Taken from the Solaris pthread_exit man page:


     An exiting thread runs with all signals blocked. All  thread
     termination   functions,   including   cancellation  cleanup
     handlers and thread-specific data destructor functions,  are
     called with all signals blocked.


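So the race condition occurs if a thread is terminating while another thread 
triggers a garbage collection. The collecting thread holds the GC lock and 
waits in GC_stop_world() for the exiting thread to acknowledge the suspend 
signal. That signal is blocked within the exit handler, so it is never 
delivered, and the exiting thread in turn waits for the GC lock in 
GC_unregister_my_thread(). Neither thread can make progress, which results 
in a deadlock.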
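
To illustrate the mechanism outside of libgc, here is a minimal sketch of a 
signal-based stop handshake using plain POSIX threads. It is not code from 
boehm-gc: SUSPEND_SIG, handshake_acknowledged() and the semaphore names are 
made up for this example, SIGUSR1 stands in for the collector's suspend 
signal, and the worker blocks the signal itself to mimic the mask that 
Solaris applies to an exiting thread. The initiator waits with a one second 
timeout where a real collector would wait forever.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <string.h>
#include <time.h>

/* Stand-in for the collector's suspend signal (an assumption of this sketch). */
#define SUSPEND_SIG SIGUSR1

static sem_t ack_sem;     /* posted by the signal handler: "I am stopped"   */
static sem_t ready_sem;   /* posted by the worker once its mask is set up   */
static sem_t release_sem; /* posted by the initiator to let the worker exit */

static void suspend_handler(int sig)
{
    (void)sig;
    sem_post(&ack_sem); /* sem_post() is async-signal-safe */
}

/* Retry sem_wait() when a signal handler interrupts it. */
static void wait_sem(sem_t *s)
{
    while (sem_wait(s) == -1 && errno == EINTR)
        ;
}

static void *worker(void *arg)
{
    int block = *(int *)arg;
    sigset_t set;

    sigemptyset(&set);
    sigaddset(&set, SUSPEND_SIG);
    /* block != 0 mimics the signal mask of an exiting thread on Solaris. */
    pthread_sigmask(block ? SIG_BLOCK : SIG_UNBLOCK, &set, NULL);
    sem_post(&ready_sem);
    wait_sem(&release_sem);
    return NULL;
}

/* Returns 1 if the worker acknowledged the signal within one second,
 * 0 if the wait timed out (a real collector would hang at this point). */
static int handshake_acknowledged(int block_in_worker)
{
    struct sigaction sa;
    struct timespec deadline;
    pthread_t tid;
    int rc;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = suspend_handler;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SUSPEND_SIG, &sa, NULL);
    assert(rc == 0);

    sem_init(&ack_sem, 0, 0);
    sem_init(&ready_sem, 0, 0);
    sem_init(&release_sem, 0, 0);

    rc = pthread_create(&tid, NULL, worker, &block_in_worker);
    assert(rc == 0);
    wait_sem(&ready_sem);           /* worker has installed its mask */
    pthread_kill(tid, SUSPEND_SIG); /* "stop the world" request      */

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 1;
    do {
        rc = sem_timedwait(&ack_sem, &deadline);
    } while (rc == -1 && errno == EINTR);

    sem_post(&release_sem);
    pthread_join(tid, NULL);
    sem_destroy(&ack_sem);
    sem_destroy(&ready_sem);
    sem_destroy(&release_sem);
    return rc == 0;
}
```

Calling handshake_acknowledged(0) should report an acknowledgement, while 
handshake_acknowledged(1) should time out because the blocked signal stays 
pending and the handler never runs. The sketch assumes a platform with 
unnamed POSIX semaphores and sem_timedwait().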

I've attached a little test program that allows me to reproduce this error 
(threads.c). I've also added code to the exit function GC_thread_exit_proc() 
defined in pthread_support.c that writes debug information about the signal 
mask and the currently pending signals. It also unblocks the suspend signal 
and resends it if it is currently pending.
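
For readers without access to the attachment, the kind of check described 
above can be sketched with standard POSIX calls. This is not the attached 
exit_proc_debug.diff; all function names below are hypothetical helpers 
written for this example. pthread_sigmask() with a NULL new set only queries 
the calling thread's mask, and sigpending() reports signals that are raised 
but still blocked.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

/* Counts deliveries so a caller can observe when the handler has run. */
static volatile sig_atomic_t delivered;

static void count_handler(int sig)
{
    (void)sig;
    delivered++;
}

static int install_counting_handler(int sig)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = count_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(sig, &sa, NULL);
}

/* 1 if sig is in the calling thread's signal mask, 0 if not, -1 on error. */
static int signal_is_blocked(int sig)
{
    sigset_t mask;

    if (pthread_sigmask(SIG_SETMASK, NULL, &mask) != 0)
        return -1;
    return sigismember(&mask, sig);
}

/* 1 if sig is pending for the caller, 0 if not, -1 on error. */
static int signal_is_pending(int sig)
{
    sigset_t pending;

    if (sigpending(&pending) != 0)
        return -1;
    return sigismember(&pending, sig);
}

/* Add (how == SIG_BLOCK) or remove (how == SIG_UNBLOCK) one signal. */
static int change_mask(int how, int sig)
{
    sigset_t set;

    sigemptyset(&set);
    sigaddset(&set, sig);
    return pthread_sigmask(how, &set, NULL);
}

static int block_signal(int sig)   { return change_mask(SIG_BLOCK, sig); }
static int unblock_signal(int sig) { return change_mask(SIG_UNBLOCK, sig); }

/* Print one line per signal, loosely modelled on the debug output format. */
static void report_signal(int sig)
{
    assert(sig > 0);
    printf("signal: %2d  blocked: %3s  pending: %3s\n", sig,
           signal_is_blocked(sig) == 1 ? "yes" : "no",
           signal_is_pending(sig) == 1 ? "yes" : "no");
}
```

On a conforming system, unblocking a pending signal with pthread_sigmask() 
delivers it before the call returns, so after block_signal(), raise() and 
unblock_signal() the counter should have advanced. The report above suggests 
that this does not happen inside the exit handler on Solaris 10.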

Output of the test program:

> ./threadtest 
starting test thread
cancelling test thread
doing final collection
signals in thread 2
thread: 2 signal:    EXIT ( 0)  blocked: yes  pending: yes
thread: 2 signal:     HUP ( 1)  blocked: yes  pending:  no

... most signals blocked except KILL, STOP and CANCEL...

thread: 2 signal:   RTMIN (41)  blocked: yes  pending:  no
thread: 2 signal: RTMIN+1 (42)  blocked: yes  pending:  no
thread: 2 signal: RTMIN+2 (43)  blocked: yes  pending:  no
thread: 2 signal: RTMIN+3 (44)  blocked: yes  pending:  no
thread: 2 signal: RTMAX-3 (45)  blocked: yes  pending:  no
thread: 2 signal: RTMAX-2 (46)  blocked: yes  pending:  no
thread: 2 signal: RTMAX-1 (47)  blocked: yes  pending: yes
unblocking signal 47 in thread 2
sending pending suspend signal 47 to 2
unregistering thread

At this point the test program is deadlocked. Even unblocking the suspend 
signal in GC_thread_exit_proc() and calling pthread_kill() when the signal 
is pending does not trigger signal delivery.

I do not have any clue how to fix this problem. The applications I develop 
using mono rely heavily on threading, so this random deadlock is a show 
stopper for me. Maybe someone with more insight into the Solaris pthread 
implementation can advise on how to fix this.

With best regards
Burkhard Linke
-------------- next part --------------
A non-text attachment was scrubbed...
Name: threads.c
Type: text/x-csrc
Size: 1220 bytes
Desc: not available
Url : http://napali.hpl.hp.com/pipermail/gc/attachments/20100301/d8c89c0b/threads.c
-------------- next part --------------
A non-text attachment was scrubbed...
Name: exit_proc_debug.diff
Type: text/x-diff
Size: 2185 bytes
Desc: not available
Url : http://napali.hpl.hp.com/pipermail/gc/attachments/20100301/d8c89c0b/exit_proc_debug.bin
