[Gc] [PATCH] Race condition when restarting threads
hans.boehm at hp.com
Mon Jul 11 14:15:29 PDT 2005
I've attached a different patch, which I think should solve the
problem without additional synchronization and context switches,
at least in the vast majority of cases. (It should solve the
problem in all cases. Additional context switches will be
needed only if the sigsuspend wakes up early, I claim.)
Please let me know if you have any problems with this, or if
this doesn't look right to you. I tested only superficially.
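
The attachment is not preserved in this archive, so what follows is not the patch itself. It is only a minimal sketch of one way to get the behaviour described above, under these assumptions: the restart handler does no table lookup at all, and the suspended thread re-checks a shared flag each time sigsuspend returns. Every name here (SIG_SUSPEND_DEMO, SIG_RESTART_DEMO, world_is_stopped, run_demo and so on) is hypothetical and does not come from the collector. An extra trip around the loop happens only when sigsuspend wakes up early, for instance on SIGQUIT.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <semaphore.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>

/* Hypothetical names throughout; none are taken from the attached patch. */
#define SIG_SUSPEND_DEMO SIGUSR1
#define SIG_RESTART_DEMO SIGUSR2

static atomic_int world_is_stopped; /* 1 while threads must stay parked      */
static atomic_int early_wakeups;    /* sigsuspend returns while still stopped */
static atomic_int resumed;          /* threads that left the suspend handler  */
static atomic_int worker_done;
static sem_t suspend_ack;           /* sem_post is async-signal-safe          */

/* Used for the restart signal and for SIGQUIT: no shared-table access. */
static void noop_handler(int sig) { (void)sig; }

static void suspend_handler(int sig) {
    int saved_errno = errno;
    sigset_t wait_mask;
    (void)sig;
    sigfillset(&wait_mask);
    sigdelset(&wait_mask, SIG_RESTART_DEMO);
    sigdelset(&wait_mask, SIGQUIT);  /* unrelated signals stay deliverable */
    sem_post(&suspend_ack);          /* tell the stopper we are parked     */
    for (;;) {
        sigsuspend(&wait_mask);
        if (!atomic_load(&world_is_stopped)) break; /* genuine restart */
        atomic_fetch_add(&early_wakeups, 1);        /* woke early: wait again */
    }
    atomic_fetch_add(&resumed, 1);
    errno = saved_errno;
}

static void install(int sig, void (*fn)(int)) {
    struct sigaction sa;
    int rc;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = fn;
    sigfillset(&sa.sa_mask); /* signals are delivered only inside sigsuspend */
    rc = sigaction(sig, &sa, NULL);
    assert(rc == 0);
    (void)rc;
}

static void *worker(void *arg) {
    (void)arg;
    while (!atomic_load(&worker_done)) sched_yield();
    return NULL;
}

/* Returns 0 if the worker stayed parked across a stray SIGQUIT and */
/* resumed only after the flag was cleared and the restart was sent. */
int run_demo(void) {
    pthread_t t;
    int stayed_parked;
    sem_init(&suspend_ack, 0, 0);
    install(SIG_SUSPEND_DEMO, suspend_handler);
    install(SIG_RESTART_DEMO, noop_handler);
    install(SIGQUIT, noop_handler);
    pthread_create(&t, NULL, worker, NULL);

    atomic_store(&world_is_stopped, 1);
    pthread_kill(t, SIG_SUSPEND_DEMO);
    while (sem_wait(&suspend_ack) != 0) {} /* retry if interrupted */

    pthread_kill(t, SIGQUIT);              /* makes sigsuspend wake early */
    while (atomic_load(&early_wakeups) == 0) sched_yield();
    stayed_parked = (atomic_load(&resumed) == 0);

    atomic_store(&world_is_stopped, 0);    /* clear the flag, then signal */
    pthread_kill(t, SIG_RESTART_DEMO);
    while (atomic_load(&resumed) == 0) sched_yield();

    atomic_store(&worker_done, 1);
    pthread_join(t, NULL);
    return (stayed_parked && atomic_load(&early_wakeups) == 1) ? 0 : 1;
}
```

Because both handlers are installed with a full sa_mask, the restart signal can be delivered only inside sigsuspend, so there is no window between testing the flag and going back to sleep.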
> -----Original Message-----
> From: gc-bounces at napali.hpl.hp.com
> [mailto:gc-bounces at napali.hpl.hp.com] On Behalf Of Boehm, Hans
> Sent: Friday, July 08, 2005 2:47 PM
> To: Ben Maurer; mono-devel-list at lists.ximian.com; gc at napali.hpl.hp.com
> Subject: RE: [Gc] [PATCH] Race condition when restarting threads
> Thanks for tracking this down. I agree that it's a serious problem.
> I'm not 100% enthusiastic about the patch, since I think it
> potentially introduces lots of extra context switches and an
> otherwise unnecessary delay for the thread that triggered the
> GC. It's certainly better than what's there now, but I would
> prefer to avoid waiting for acknowledgements.
> I'm still looking for another way to determine which signal
> was actually delivered. I think we just need to avoid
> restarting on something like SIGQUIT. Sigwaitinfo almost
> does the trick, as would a thread local, but neither of those
> seems to be async-signal-safe. I'm still thinking ...
> > -----Original Message-----
> > From: gc-bounces at napali.hpl.hp.com
> > [mailto:gc-bounces at napali.hpl.hp.com] On Behalf Of Ben Maurer
> > Sent: Sunday, July 03, 2005 8:29 AM
> > To: mono-devel-list at lists.ximian.com; gc at napali.hpl.hp.com
> > Subject: [Gc] [PATCH] Race condition when restarting threads
> > Hey,
> > In a Mono bug report, we noticed a very rare race in the GC
> > when restarting the world. GC_restart_handler states:
> > /* Let the GC_suspend_handler() know that we got a SIG_THR_RESTART. */
> > /* The lookup here is safe, since I'm doing this on behalf */
> > /* of a thread which holds the allocation lock in order */
> > /* to stop the world. Thus concurrent modification of the */
> > /* data structure is impossible. */
> > However, this comment is not always true. When starting the
> > world, the thread that does the restarting releases the GC_lock
> > *without* waiting for all threads to get past the point where
> > they need the structures used by the lookup.
> > So the sequence of events looked something like:
> > * T1 signals T2 to restart the world
> > * T1 releases the GC_lock
> > * T3 is a newborn thread and adds itself to the table
> > * T2 gets the signal and sees a corrupt table because T3 is
> > concurrently modifying it.
> > What would end up happening when we experienced the race was
> > either a deadlock or a SIGSEGV.
> > The race was extremely rare. It took 1-2 hours to reproduce
> > on an SMP machine. With the attached patch, it has not
> > segfaulted or hung for 21 hrs.
> > -- Ben
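
Ben's patch is likewise not reproduced in this archive. The reply above characterizes it as waiting for acknowledgements, so here is a minimal sketch of that general idea, not of the patch: the restarting thread (T1) keeps the lock until the restarted thread (T2) has finished its lookup and posted a semaphore, so a newborn thread (T3) cannot modify the table in between. All names are hypothetical, table_lock merely stands in for GC_lock, and the thread table is reduced to a counter.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <sched.h>
#include <semaphore.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>

#define SIG_RESTART_DEMO SIGUSR2 /* hypothetical stand-in for the restart signal */

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int table_entries = 1;    /* toy "thread table": just a count */
static atomic_int seen_by_handler = -1; /* what T2's lookup observed        */
static atomic_int t2_done;
static sem_t restart_ack;

static void restart_handler(int sig) {
    (void)sig;
    /* The "lookup": T2 reads the table on behalf of the restart. */
    atomic_store(&seen_by_handler, atomic_load(&table_entries));
    sem_post(&restart_ack); /* async-signal-safe acknowledgement */
}

static void *t2_body(void *arg) {
    (void)arg;
    while (!atomic_load(&t2_done)) sched_yield();
    return NULL;
}

static void *t3_body(void *arg) { /* newborn thread adding itself */
    (void)arg;
    pthread_mutex_lock(&table_lock);
    atomic_fetch_add(&table_entries, 1);
    pthread_mutex_unlock(&table_lock);
    return NULL;
}

/* Plays the role of T1. Returns the entry count T2's handler observed. */
int run_ack_demo(void) {
    pthread_t t2, t3;
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = restart_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIG_RESTART_DEMO, &sa, NULL);
    sem_init(&restart_ack, 0, 0);

    pthread_create(&t2, NULL, t2_body, NULL);
    pthread_mutex_lock(&table_lock);        /* T1 holds the lock             */
    pthread_create(&t3, NULL, t3_body, NULL); /* T3 blocks on the lock       */

    pthread_kill(t2, SIG_RESTART_DEMO);     /* T1 signals T2 to restart      */
    while (sem_wait(&restart_ack) != 0) {}  /* wait for T2's acknowledgement */
    pthread_mutex_unlock(&table_lock);      /* only now may T3 touch the table */

    pthread_join(t3, NULL);
    atomic_store(&t2_done, 1);
    pthread_join(t2, NULL);
    return atomic_load(&seen_by_handler);
}
```

The cost raised in the reply is visible in run_ack_demo: T1 blocks in sem_wait once per restarted thread before it can release the lock and continue.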
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 4779 bytes
Url : http://napali.hpl.hp.com/pipermail/gc/attachments/20050711/8af6035b/race.obj