[Gc] test_stack on powerpc (power7)
hans.boehm at hp.com
Tue Jan 28 23:05:43 PST 2014
I suspect a memory ordering bug in atomic_ops_stack.c, or possibly a subtle miscompilation. As the rest of the comment states, the scenario you describe should be impossible, because the first note on the list was added to the blacklist. Thus it could have been removed, but not reinserted (without low-order-bit-twiddling) by another thread. If any of this had happened, the compare_and_swap should have failed.
The algorithm and correctness argument (for a hypothetical sequentially consistent machine) are described in a PODC 2004 paper (see http://www.hpl.hp.com/techreports/2004/HPL-2004-105.pdf). Not that PODC papers are always correct; but they have to be close enough to convince smart referees. And the fact that this only fails on Power7 suggests memory ordering issues rather than an outright algorithmic bug to me.
I'm somewhat concerned whether the store in the AO_compare_and_swap_acquire() in AO_stack_pop_explicit_aux_acquire() is actually ordered before the subsequent loads of first_ptr and next. The acquire is enforced with an lwsync, which normally doesn't order a store followed by a load. Since the store is atomic with a load, it may do the right thing; I'm not sure. Can you try inserting a "sync" and see if that helps?
I would also check that the recheck of first against AO_load(list) appears in the assembly code where it should.
There could easily be yet a different issue that could explain this. I don't immediately see anything that would definitely explain it. But given the subtleties of the Power memory model, and the fact that libatomic_ops' semantics aren't as carefully defined as they should be (or as carefully as the later C++11 primitives), there seem to be plenty of bug opportunities here.
> -----Original Message-----
> From: gc-bounces at linux.hpl.hp.com [mailto:gc-bounces at linux.hpl.hp.com]
> On Behalf Of Will Schmidt
> Sent: Tuesday, January 28, 2014 1:23 PM
> To: gc at linux.hpl.hp.com
> Cc: kmp at linux.vnet.ibm.com; deepakcs at linux.vnet.ibm.com;
> lsorense at csclub.uwaterloo.ca
> Subject: [Gc] test_stack on powerpc (power7)
> Hi All,
> I've been looking at the test_stack test case failure as seen on
> ppc64 / power7 based systems. I don't have a fix, but believe I
> understand where the problem is occurring.
> The simplest case I've been able to duplicate is with three threads.
> As I've added debug to the code, the problem gets harder to nail down
> precisely, but this is what seems to be happening.
> In the failure scenario:
> The list appears OK during run_one_test() before and after
> AO_stack_pop() is called. The thread is holding two entries in the t[i] array,
> and the list still looks OK. The list is damaged after the
> AO_stack_push() call is made.
> Within AO_stack_push(),
> The malfunction seems to be triggered while one of the threads is between
> the "first=AO_load(list);" and the
> "AO_compare_and_swap_release(list,first,next);". Either one or both of the
> other threads will have removed and replaced multiple elements, such that
> the compare and swap of list,first,next will pass the check, but the list
> entries, particularly the next pointer at first, has changed.
> This is referenced in the comment at that location:
> /* Thus its next link cannot have changed out from under us, and we */
> /* removed exactly one entry and preserved the rest of the list. */
> /* Note that it is quite possible that an additional entry was */
> /* inserted and removed while we were running; this is OK since the */
> /* part of the list following first must have remained unchanged, and */
> /* first must again have been at the head of the list when the */
> /* compare_and_swap succeeded. */
> which seem to be untrue in this case.
> The powerpc AO_* functions seem to be OK. We'd prefer the gcc atomic
> builtins be used (http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-
> (thats what they are there for), but I don't think that change would help in
> this case.
> My recommendation is that the test be rewritten to handle the case where
> first->next has changed underneath the current thread.
> A shorter term fix would probably be to disable the test_stack test for
> power7 and newer processors, until it can be fixed.
> Gc mailing list
> Gc at linux.hpl.hp.com
More information about the Gc