Will Schmidt
2014-01-28 21:22:35 UTC
Hi All,
I've been looking at the test_stack test case failure as seen on
ppc64 / power7 based systems. I don't have a fix, but believe I
understand where the problem is occurring.
The simplest case I've been able to duplicate is with three threads.
As I've added debug to the code, the problem gets harder to nail down
precisely, but this is what seems to be happening.
In the failure scenario:
The list appears OK during run_one_test() before and after
AO_stack_pop() is called. The thread is holding two entries in the t[i]
array, and the list still looks OK. The list is damaged after the
AO_stack_push() call is made.
Within AO_stack_push(),
[src/atomic_ops_stack.c:AO_stack_pop_explicit_aux_require()]
The malfunction seems to be triggered while one of the threads is
between the "first=AO_load(list);" and the
"AO_compare_and_swap_release(list,first,next);". Either one or both of
the other threads will have removed and replaced multiple elements, such
that the compare and swap of list,first,next will pass the check, but
the list entries, particularly the next pointer at first, has changed.
This is referenced in the comment at that location:
/* Thus its next link cannot have changed out from under us, and we */
/* removed exactly one entry and preserved the rest of the list. */
/* Note that it is quite possible that an additional entry was */
/* inserted and removed while we were running; this is OK since the */
/* part of the list following first must have remained unchanged, and */
/* first must again have been at the head of the list when the */
/* compare_and_swap succeeded. */
which seem to be untrue in this case.
The powerpc AO_* functions seem to be OK. We'd prefer the gcc atomic
builtins be used (http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html),
(thats what they are there for), but I don't think that change
would help in this case.
My recommendation is that the test be rewritten to handle the case where
first->next has changed underneath the current thread.
A shorter term fix would probably be to disable the test_stack test for
power7 and newer processors, until it can be fixed.
Thanks,
-Will
I've been looking at the test_stack test case failure as seen on
ppc64 / power7 based systems. I don't have a fix, but believe I
understand where the problem is occurring.
The simplest case I've been able to duplicate is with three threads.
As I've added debug to the code, the problem gets harder to nail down
precisely, but this is what seems to be happening.
In the failure scenario:
The list appears OK during run_one_test() before and after
AO_stack_pop() is called. The thread is holding two entries in the t[i]
array, and the list still looks OK. The list is damaged after the
AO_stack_push() call is made.
Within AO_stack_push(),
[src/atomic_ops_stack.c:AO_stack_pop_explicit_aux_require()]
The malfunction seems to be triggered while one of the threads is
between the "first=AO_load(list);" and the
"AO_compare_and_swap_release(list,first,next);". Either one or both of
the other threads will have removed and replaced multiple elements, such
that the compare and swap of list,first,next will pass the check, but
the list entries, particularly the next pointer at first, has changed.
This is referenced in the comment at that location:
/* Thus its next link cannot have changed out from under us, and we */
/* removed exactly one entry and preserved the rest of the list. */
/* Note that it is quite possible that an additional entry was */
/* inserted and removed while we were running; this is OK since the */
/* part of the list following first must have remained unchanged, and */
/* first must again have been at the head of the list when the */
/* compare_and_swap succeeded. */
which seem to be untrue in this case.
The powerpc AO_* functions seem to be OK. We'd prefer the gcc atomic
builtins be used (http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html),
(thats what they are there for), but I don't think that change
would help in this case.
My recommendation is that the test be rewritten to handle the case where
first->next has changed underneath the current thread.
A shorter term fix would probably be to disable the test_stack test for
power7 and newer processors, until it can be fixed.
Thanks,
-Will