[Gc] pthread_cancel(3) confuses 'GC_suspend_all ()'
hans.boehm at hp.com
Mon Sep 14 17:17:46 PDT 2009
Without actually having reproduced this, but looking at the code, it seems to me that we could fix this immediate problem by putting a
before the "GC_register_my_thread_inner" call in GC_inner_start_routine() and
pthread_setcancelstate(oldstate, &dummy /* is 0 allowed? */)
call after the pthread_cleanup_push. That should ensure that threads don't disappear unexpectedly, since the cleanup handler needs to acquire the allocation locks, which is held by the collecting thread.
But thinking about this some more, I don't think that this is worth doing, at least not by itself. If any collector code can be cancelled, we're likely to end up with an unreleased allocation lock, since LOCK() doesn't push a cancellation handler. For performance reasons, it probably should not. So a better solution is probably to disable cancellation around all cancellation points invoked by the collector, which should also address this problem. And it should avoid making GC_malloc a cancellation point, which it should not be.
Diabling cancellation is unfortunately at least a little problematic, since the thread suspension signal handler has to call sigsuspend(), which is async-signal-safe and a cancellation point. But I believe pthread_set_cancel_state() is not async-signal-safe, and thus can't officially be called from a handler. Yuch.
Actually, this is probably a more serious problem in itself. Putting a cancellation point in an asynchronous handler is pretty much guaranteed to break any code using cancellation anyway. (I'm assuming the client is not using asynchronous cancellation in the Posix sense, which I doubt is handled correctly by any nontrivial library.)
If I look at the actual glibc implementation of pthread_setcancelstate, it actually doesn't seem to do anything questionable (unless the client enabled asynchronous cancellation).
Thus I'd be inclined to go ahead and disable cancellation across all potential cancellation points we can find, including the one in the handler, documenting that one as potentially not completely portable. We should also add the warning that Posix asynchronous cancellation should never be used with the collector (or, in my opinion, without it).
Note that although we should try to get this right, all of this currently appears to be of limited practical interest. There is no portable way to use cancellation with nontrivial C++ code, and very little C code is cancellation-safe, at least as far as I can tell. The C++ and Posix committees have so far failed to resolve the (admittedly nontrivial) C++ issues. In most contexts, we're talking about fixing one of several unrelated reasons that cancellation doesn't work.
> -----Original Message-----
> From: gc-bounces at napali.hpl.hp.com
> [mailto:gc-bounces at napali.hpl.hp.com] On Behalf Of Ludovic Courtès
> Sent: Monday, September 14, 2009 2:58 PM
> To: gc at napali.hpl.hp.com
> Subject: [Gc] pthread_cancel(3) confuses 'GC_suspend_all ()'
> The attached example leads to a deadlock most of the time on
> a dual-core 'x86_64-unknown-linux-gnu' with a recent CVS
> snapshot of libgc.
> Commenting out the pthread_cancel(3) call appears to fix the problem.
> When the deadlock occurs, all but one thread are waiting in
> sigsuspend(2). The remaining thread is the one that
> initiated a stop-the-world collection; it is waiting on
> sem_wait(3) from 'GC_stop_world ()' because it computed an
> 'n_live_threads' that is greater than the actual number of
> other threads.
> It looks like there's a subtle race condition here. Ideas?
More information about the Gc