[Gc] [PATCH] Race condition when restarting threads

Boehm, Hans hans.boehm at hp.com
Fri Jul 8 14:46:58 PDT 2005


Thanks for tracking this down.  I agree that it's a serious problem.

I'm not 100% enthusiastic about the patch, since I think it potentially
introduces lots of extra context switches and an otherwise unnecessary
delay for the thread that triggered the GC.  It's certainly better than
what's there now, but I would prefer to avoid waiting for
acknowledgements.

I'm still looking for another way to determine which signal was actually
delivered.  I think we just need to avoid restarting on something like
SIGQUIT.  Sigwaitinfo almost does the trick, as would a thread local,
but neither of those seems to be async-signal-safe.  I'm still thinking
...
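
One possible shape for "don't restart on something like SIGQUIT" is sketched
below. It is only an illustration, not the collector's code: the suspended
thread loops on sigsuspend() and re-checks a flag that the restarting thread
clears before it sends the restart signal, so a wakeup caused by any other
signal just goes back to waiting. All names here (world_is_stopped, SIG_SUSPEND,
SIG_STRAY, the demo function) are hypothetical, SIGUSR1/SIGUSR2 stand in for the
real suspend/restart signals, and the sketch assumes lock-free C11 atomics,
which may be used from a signal handler.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <time.h>

#define SIG_SUSPEND SIGUSR1  /* stand-in for the suspend signal          */
#define SIG_RESTART SIGUSR2  /* stand-in for SIG_THR_RESTART             */
#define SIG_STRAY   SIGQUIT  /* an unrelated signal that may also arrive */

static atomic_int world_is_stopped; /* cleared before the restart signal is sent */
static atomic_int suspended;        /* demo only: thread is inside the wait loop */
static atomic_int wakeups;          /* demo only: times sigsuspend() returned    */
static atomic_int worker_done;

static void nap(void) { struct timespec ts = {0, 1000000L}; nanosleep(&ts, NULL); }

static void noop_handler(int sig) { (void)sig; }

static void suspend_handler(int sig)
{
    int saved_errno = errno;
    sigset_t wait_mask;
    (void)sig;
    sigfillset(&wait_mask);               /* block everything while waiting ... */
    sigdelset(&wait_mask, SIG_RESTART);   /* ... except the restart signal      */
    sigdelset(&wait_mask, SIG_STRAY);     /* ... and signals like SIGQUIT       */
    atomic_store(&suspended, 1);
    do {
        sigsuspend(&wait_mask);           /* returns after *any* handler ran */
        atomic_fetch_add(&wakeups, 1);
    } while (atomic_load(&world_is_stopped)); /* stray wakeup: wait again */
    atomic_store(&suspended, 0);
    errno = saved_errno;
}

static void *suspendable_worker(void *arg)
{
    (void)arg;
    while (!atomic_load(&worker_done)) nap();
    return NULL;
}

/* Returns 1 if the thread was still suspended after a stray signal. */
int stray_signal_keeps_thread_suspended(void)
{
    struct sigaction sa;
    pthread_t t;
    int still_suspended;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = noop_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIG_RESTART, &sa, NULL);
    sigaction(SIG_STRAY, &sa, NULL);
    sa.sa_handler = suspend_handler;
    sigfillset(&sa.sa_mask);  /* keep other signals pending until sigsuspend() */
    sigaction(SIG_SUSPEND, &sa, NULL);

    atomic_store(&wakeups, 0);
    atomic_store(&worker_done, 0);
    atomic_store(&world_is_stopped, 1);
    pthread_create(&t, NULL, suspendable_worker, NULL);
    pthread_kill(t, SIG_SUSPEND);
    while (!atomic_load(&suspended)) nap();

    pthread_kill(t, SIG_STRAY);                 /* must not restart the thread */
    while (atomic_load(&wakeups) < 1) nap();
    still_suspended = atomic_load(&suspended);

    atomic_store(&world_is_stopped, 0);         /* clear the flag first ...    */
    pthread_kill(t, SIG_RESTART);               /* ... then send the restart   */
    while (atomic_load(&suspended)) nap();

    atomic_store(&worker_done, 1);
    pthread_join(t, NULL);
    return still_suspended;
}
```

Note that the restarting thread does not wait for anything here; the polling
loops exist only so the demo can observe the worker's state.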

Hans

> -----Original Message-----
> From: gc-bounces at napali.hpl.hp.com 
> [mailto:gc-bounces at napali.hpl.hp.com] On Behalf Of Ben Maurer
> Sent: Sunday, July 03, 2005 8:29 AM
> To: mono-devel-list at lists.ximian.com; gc at napali.hpl.hp.com
> Subject: [Gc] [PATCH] Race condition when restarting threads
> 
> 
> Hey,
> 
> In a Mono bug report, we noticed a very rare race in the GC 
> when restarting the world. GC_restart_handler states:
> 
>     /* Let the GC_suspend_handler() know that we got a SIG_THR_RESTART. */
>     /* The lookup here is safe, since I'm doing this on behalf  */
>     /* of a thread which holds the allocation lock in order     */
>     /* to stop the world.  Thus concurrent modification of the  */
>     /* data structure is impossible.                            */
> 
> However, this comment is not always true. When starting the 
> world, the restarting thread releases the GC_lock *without* 
> waiting for every restarted thread to get past the point where 
> it still needs the structures used by the lookup.
> 
> So the sequence of events looked something like:
> 
>       * T1 signals T2 to restart the world
>       * T1 releases the GC_lock
>       * T3 is a newborn thread and adds itself to the table
>       * T2 gets the signal and sees a corrupt table because T3 is
>         concurrently modifying it.
> 
> What would end up happening when we experienced the race was 
> either a deadlock or a SIGSEGV.
> 
> The race was extremely rare. It took 1-2 hours to reproduce 
> on an SMP machine. With the attached patch, it has not 
> segfaulted or hung for 21 hrs.
> 
> -- Ben
> 
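
The patch itself is not reproduced in this message. As a rough sketch of the
acknowledgement idea discussed above (an assumption about its general shape,
not the actual patch): the restarting thread keeps the lock until every
signalled thread has posted a semaphore from its restart handler, so no newborn
thread can modify the table while a handler is still doing its lookup. The
names gc_lock, restart_ack, NTHREADS and restart_world_with_acks are
hypothetical, and SIGUSR2 stands in for SIG_THR_RESTART.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <time.h>

#define SIG_RESTART SIGUSR2  /* stand-in for SIG_THR_RESTART */
#define NTHREADS 4

static pthread_mutex_t gc_lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in for GC_lock */
static sem_t restart_ack;        /* posted once per handled restart signal */
static atomic_int lookups_done;  /* handlers that have finished their "lookup" */
static atomic_int stop_workers;

static void restart_handler(int sig)
{
    int saved_errno = errno;
    (void)sig;
    /* A real handler would look itself up in the thread table here. */
    atomic_fetch_add(&lookups_done, 1);
    sem_post(&restart_ack);      /* sem_post() is async-signal-safe */
    errno = saved_errno;
}

static void *ack_worker(void *arg)
{
    struct timespec ts = {0, 1000000L};
    (void)arg;
    while (!atomic_load(&stop_workers))
        nanosleep(&ts, NULL);    /* may return early on a signal; just loop */
    return NULL;
}

/* Returns how many handlers had finished when the lock was about to be released. */
int restart_world_with_acks(void)
{
    pthread_t t[NTHREADS];
    struct sigaction sa;
    int i, seen;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = restart_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    sigaction(SIG_RESTART, &sa, NULL);

    sem_init(&restart_ack, 0, 0);
    atomic_store(&lookups_done, 0);
    atomic_store(&stop_workers, 0);
    for (i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, ack_worker, NULL);

    pthread_mutex_lock(&gc_lock);
    for (i = 0; i < NTHREADS; i++)
        pthread_kill(t[i], SIG_RESTART);
    for (i = 0; i < NTHREADS; i++)           /* one acknowledgement per thread */
        while (sem_wait(&restart_ack) == -1 && errno == EINTR)
            ;
    seen = atomic_load(&lookups_done);       /* every lookup is finished here */
    pthread_mutex_unlock(&gc_lock);          /* only now may the table change */

    atomic_store(&stop_workers, 1);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    sem_destroy(&restart_ack);
    return seen;
}
```

The cost Hans points out is visible in the second loop: the restarting thread
blocks once per restarted thread before it can release the lock and continue.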


