[Gc] Race condition between thread termination and garbage collection under Solaris 10/x86

Boehm, Hans hans.boehm at hp.com
Tue Mar 2 17:35:12 PST 2010


Correction:

It appears that my statements about Linux below were incorrect.  We may have similar issues there.  As far as I can tell the missing thread is actually still around; gdb just doesn't see it anymore.

Hans 

> -----Original Message-----
> From: gc-bounces at napali.hpl.hp.com 
> [mailto:gc-bounces at napali.hpl.hp.com] On Behalf Of Boehm, Hans
> Sent: Tuesday, March 02, 2010 5:23 PM
> To: Burkhard Linke; gc at napali.hpl.hp.com
> Subject: RE: [Gc] Race condition between thread termination 
> and garbage collection under Solaris 10/x86
> 
> Yucch.
> 
> I don't know why they decided to block signals in an exiting 
> thread. I don't immediately find any justification in Unix 
> or Posix standards.  If anything, it seems inconsistent with 
> those.  (If you Google "An exiting thread runs with all 
> signals blocked" you get only Solaris pages.)
> 
> This is a somewhat fundamental problem for us.  Once a thread 
> enters this exiting state, we can't stop it.  Probably the 
> best we could do is notice that a thread is in this state and 
> defer collection until it actually exits.  We don't have any 
> way to safely collect while it's still running.  But such 
> threads can (a) run indefinitely or (b) block indefinitely 
> for another thread.  (And for correct C++ code, I suspect (b) 
> is sometimes unavoidable.)  Thus this option doesn't seem 
> very attractive at all, and is likely to provoke nastier 
> failures under less likely conditions.
> 
> Even for non-GC code, this seems like a very dubious design 
> decision to me.  It means that cleanup code can't rely on any 
> library calls that internally use signals, e.g. for 
> timeouts.  I'm not sure how I would even know which library 
> calls do that.
> 
> On my old Itanium Linux machine, your program also sometimes 
> hangs, but for other reasons.  It looks like the child is 
> gone but still in the thread table.  I'll see if I can debug 
> that further.  It looks like cancellation is generally still 
> a problem, even in the CVS version.
> 
> As far as the Solaris issue is concerned, aside from possibly 
> filing a bug with Sun, maybe deferring GC somehow is still 
> the best option.  It would probably at least substantially 
> reduce the deadlock frequency.  Unfortunately, a hung exiting 
> thread is likely to cause the process to run out of memory.  
> We may want to defer for only so long.  And I think we want 
> to do this only on Solaris.
> 
> Hans
> 
> > -----Original Message-----
> > From: gc-bounces at napali.hpl.hp.com
> > [mailto:gc-bounces at napali.hpl.hp.com] On Behalf Of Burkhard Linke
> > Sent: Monday, March 01, 2010 7:56 AM
> > To: gc at napali.hpl.hp.com
> > Subject: [Gc] Race condition between thread termination and garbage 
> > collection under Solaris 10/x86
> > 
> > Hi,
> > 
> > after having random deadlocks with mono and the current GC release 
> > (both 7.1 and CVS checkout), I was finally able to locate the problem 
> > in the thread exit handler of boehm-gc.
> > libgc was configured with --enable-parallel-mark 
> > --enable-threads=posix --enable-munmap=2 --enable-large-config CC=cc 
> > LDFLAGS=-m64 CPPFLAGS=-m64 CXX=CC  (using Sun Studio compiler 12.1, 
> > also tested with gcc 4.1 and gcc 4.3)
> > 
> > Symptoms (stack trace of deadlock produced by pstack):
> > 
> > 1918:   /vol/src/gnu/mono/contrib/bdwgc/.libs/threadtest
> > -----------------  lwp# 1 / thread# 1  --------------------
> >  00007fffffd172f7 lwp_park (0, 0, 0)
> >  00007fffffd0b59b sema_wait () + b
> >  00007fffffe36eb2 sem_wait () + 22
> >  00007fffffe85c43 GC_stop_world () + 147 ...
> > -----------------  lwp# 2 / thread# 2  --------------------
> >  00007fffffd172f7 lwp_park (0, 0, 0)
> >  00007fffffd0fd08 mutex_lock_impl () + e8
> >  00007fffffd0fdfb mutex_lock () + b
> >  00007fffffe8514c GC_lock () + 38
> >  00007fffffe846d4 GC_unregister_my_thread () + 2c
> >  00007fffffe84929 GC_thread_exit_proc () + 1c1 ...
> > 
> > Taken from the Solaris pthread_exit manpages:
> > 
> > *snipsnap*
> > 
> >      An exiting thread runs with all signals blocked. All  thread
> >      termination   functions,   including   cancellation  cleanup
> >      handlers and thread-specific data destructor functions,  are
> >      called with all signals blocked.
> > 
> > *snipsnap*
> > 
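> > So the race condition occurs if a thread is currently terminating and 
> > another thread triggers a garbage collection.
> > The suspend signal is blocked within the exit handler, so the exiting 
> > thread never acknowledges it and the collecting thread keeps waiting in 
> > GC_stop_world() while holding the GC lock. Acquiring the GC lock in 
> > GC_unregister_my_thread() therefore results in a deadlock.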
> > 
> > I've attached a little test program that allows me to reproduce this 
> > error (thread.c). I've also added code to the exit function 
> > GC_thread_exit_proc() defined in pthread_support.c to write debug 
> > information about the signal mask and the currently pending signals. 
> > It also unblocks the suspend signal and even resends the suspend 
> > signal if it is currently pending.
> > 
> > Output of the test program:
> > 
> > > ./threadtest
> > starting test thread
> > cancelling test thread
> > doing final collection
> > signals in thread 2
> > thread: 2 signal:    EXIT ( 0)  blocked: yes  pending: yes
> > thread: 2 signal:     HUP ( 1)  blocked: yes  pending:  no
> > 
> > ... most signals blocked except KILL, STOP and CANCEL...
> > 
> > thread: 2 signal:   RTMIN (41)  blocked: yes  pending:  no
> > thread: 2 signal: RTMIN+1 (42)  blocked: yes  pending:  no
> > thread: 2 signal: RTMIN+2 (43)  blocked: yes  pending:  no
> > thread: 2 signal: RTMIN+3 (44)  blocked: yes  pending:  no
> > thread: 2 signal: RTMAX-3 (45)  blocked: yes  pending:  no
> > thread: 2 signal: RTMAX-2 (46)  blocked: yes  pending:  no
> > thread: 2 signal: RTMAX-1 (47)  blocked: yes  pending: yes
> > unblocking signal 47 in thread 2
> > sending pending suspend signal 47 to 2
> > unregistering thread
> > 
> > At this point the test program is deadlocked. Even adding code to 
> > GC_thread_exit_proc() that unblocks the suspend signal and calls 
> > pthread_kill() if the signal is pending does not trigger signal 
> > delivery.
> > 
> > I do not have any clue how to fix this problem. The applications I 
> > develop using mono are heavily based on threading, and thus this 
> > random deadlock problem is a show stopper for me. Maybe someone with 
> > more insight into the Solaris pthread implementation can give advice 
> > on how to fix this.
> > 
> > With best regards
> > Burkhard Linke
> > 
> _______________________________________________
> Gc mailing list
> Gc at linux.hpl.hp.com
> http://www.hpl.hp.com/hosted/linux/mail-archives/gc/
> 
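For readers who want to reproduce the kind of diagnostic output quoted above, here is a minimal sketch of how a thread can report which signals it has blocked and which are pending. It is not the code that was added to GC_thread_exit_proc(); the helper names (signal_is_blocked, signal_is_pending, dump_signal_state) are made up for illustration, and only standard POSIX calls (pthread_sigmask, sigpending, sigismember) are used.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>

/* Hypothetical helpers for illustration; none of these are libgc API. */

/* Returns 1 if sig is blocked in the calling thread, 0 if not, -1 on error. */
static int signal_is_blocked(int sig)
{
    sigset_t blocked;

    /* With a NULL new set, pthread_sigmask only reports the current mask. */
    if (pthread_sigmask(SIG_BLOCK, NULL, &blocked) != 0)
        return -1;
    return sigismember(&blocked, sig);
}

/* Returns 1 if sig is pending for the calling thread, 0 if not, -1 on error. */
static int signal_is_pending(int sig)
{
    sigset_t pending;

    if (sigpending(&pending) != 0)
        return -1;
    return sigismember(&pending, sig);
}

/* Prints one line per signal number in [first, last]. Returns 0 on success. */
static int dump_signal_state(int first, int last)
{
    assert(first >= 1 && first <= last);

    for (int sig = first; sig <= last; sig++) {
        int b = signal_is_blocked(sig);
        int p = signal_is_pending(sig);

        if (b < 0 || p < 0)
            continue;  /* not a valid signal number on this platform */
        printf("signal: %3d  blocked: %3s  pending: %3s\n",
               sig, b ? "yes" : "no", p ? "yes" : "no");
    }
    return 0;
}
```

Calling dump_signal_state() from a cancellation cleanup handler or a thread-specific data destructor should show whether a given platform runs exiting threads with signals blocked, as the Solaris manpage describes.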

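The bounded deferral suggested in the quoted message ("We may want to defer for only so long") could look roughly like the sketch below. This is an illustration under assumptions, not libgc code: all names (note_exit_begin, note_exit_end, defer_for_exiting_threads) are hypothetical, and it assumes the collector calls defer_for_exiting_threads() before it takes the GC lock and tries to stop the world. A thread can still begin exiting right after the check, so at best this reduces the deadlock frequency rather than eliminating it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

/* All names below are hypothetical; none of this is libgc API. */
static pthread_mutex_t exit_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  exit_done = PTHREAD_COND_INITIALIZER;
static int exiting_threads = 0;  /* threads currently inside exit handlers */

/* Would be called at the start of a thread's exit handler. */
static void note_exit_begin(void)
{
    pthread_mutex_lock(&exit_lock);
    exiting_threads++;
    pthread_mutex_unlock(&exit_lock);
}

/* Would be called once the thread has finished unregistering itself. */
static void note_exit_end(void)
{
    pthread_mutex_lock(&exit_lock);
    assert(exiting_threads > 0);
    exiting_threads--;
    pthread_cond_broadcast(&exit_done);
    pthread_mutex_unlock(&exit_lock);
}

/* Waits up to max_wait_ms for all exiting threads to finish.
   Returns 1 if none remain, 0 if the wait gave up (timeout or error)
   and the caller has to decide whether to collect anyway. */
static int defer_for_exiting_threads(long max_wait_ms)
{
    struct timespec deadline;
    int rc = 0;
    int clear;

    /* pthread_cond_timedwait takes an absolute CLOCK_REALTIME deadline
       for a condition variable with default attributes. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += max_wait_ms / 1000;
    deadline.tv_nsec += (max_wait_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&exit_lock);
    while (exiting_threads > 0 && rc == 0)
        rc = pthread_cond_timedwait(&exit_done, &exit_lock, &deadline);
    clear = (exiting_threads == 0);
    pthread_mutex_unlock(&exit_lock);
    return clear;
}
```

The timeout is what keeps a hung exiting thread from postponing collection forever; what the collector should do once the wait gives up is left open here, as it is in the discussion above.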
