Discussion:
[Gc] pthread_cancel(3) confuses ‘GC_suspend_all ()’
(too old to reply)
Ludovic Courtès
2009-09-14 21:57:33 UTC
Permalink
Hello,

The attached example leads to a deadlock most of the time on a dual-core
‘x86_64-unknown-linux-gnu’ with a recent CVS snapshot of libgc.
Commenting out the pthread_cancel(3) call appears to fix the problem.

When the deadlock occurs, all but one thread are waiting in
sigsuspend(2). The remaining thread is the one that initiated a
stop-the-world collection; it is waiting on sem_wait(3) from
‘GC_stop_world ()’ because it computed an ‘n_live_threads’ that is
greater than the actual number of other threads.

It looks like there’s a subtle race condition here. Ideas?

Thanks,
Ludo’.
Boehm, Hans
2009-09-15 00:17:46 UTC
Permalink
Without actually having reproduced this, but looking at the code, it seems to me that we could fix this immediate problem by putting a

pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &oldstate)

before the "GC_register_my_thread_inner" call in GC_inner_start_routine() and
a

pthread_setcancelstate(oldstate, &dummy /* is 0 allowed? */)

call after the pthread_cleanup_push. That should ensure that threads don't disappear unexpectedly, since the cleanup handler needs to acquire the allocation locks, which is held by the collecting thread.

But thinking about this some more, I don't think that this is worth doing, at least not by itself. If any collector code can be cancelled, we're likely to end up with an unreleased allocation lock, since LOCK() doesn't push a cancellation handler. For performance reasons, it probably should not. So a better solution is probably to disable cancellation around all cancellation points invoked by the collector, which should also address this problem. And it should avoid making GC_malloc a cancellation point, which it should not be.

Diabling cancellation is unfortunately at least a little problematic, since the thread suspension signal handler has to call sigsuspend(), which is async-signal-safe and a cancellation point. But I believe pthread_set_cancel_state() is not async-signal-safe, and thus can't officially be called from a handler. Yuch.

Actually, this is probably a more serious problem in itself. Putting a cancellation point in an asynchronous handler is pretty much guaranteed to break any code using cancellation anyway. (I'm assuming the client is not using asynchronous cancellation in the Posix sense, which I doubt is handled correctly by any nontrivial library.)

If I look at the actual glibc implementation of pthread_setcancelstate, it actually doesn't seem to do anything questionable (unless the client enabled asynchronous cancellation).

Thus I'd be inclined to go ahead and disable cancellation across all potential cancellation points we can find, including the one in the handler, documenting that one as potentially not completely portable. We should also add the warning that Posix asynchronous cancellation should never be used with the collector (or, in my opinion, without it).

Other opinions?

Note that although we should try to get this right, all of this currently appears to be of limited practical interest. There is no portable way to use cancellation with nontrivial C++ code, and very little C code is cancellation-safe, at least as far as I can tell. The C++ and Posix committees have so far failed to resolve the (admittedly nontrivial) C++ issues. In most contexts, we're talking about fixing one of several unrelated reasons that cancellation doesn't work.

Hans
-----Original Message-----
Sent: Monday, September 14, 2009 2:58 PM
Subject: [Gc] pthread_cancel(3) confuses 'GC_suspend_all ()'
Hello,
The attached example leads to a deadlock most of the time on
a dual-core 'x86_64-unknown-linux-gnu' with a recent CVS
snapshot of libgc.
Commenting out the pthread_cancel(3) call appears to fix the problem.
When the deadlock occurs, all but one thread are waiting in
sigsuspend(2). The remaining thread is the one that
initiated a stop-the-world collection; it is waiting on
sem_wait(3) from 'GC_stop_world ()' because it computed an
'n_live_threads' that is greater than the actual number of
other threads.
It looks like there's a subtle race condition here. Ideas?
Thanks,
Ludo'.
Ludovic Courtès
2009-09-15 07:24:55 UTC
Permalink
Hi,

Thanks for the quick reply and detailed analysis!

"Boehm, Hans" <***@hp.com> writes:

[...]
Post by Boehm, Hans
Thus I'd be inclined to go ahead and disable cancellation across all
potential cancellation points we can find, including the one in the
handler, documenting that one as potentially not completely portable.
We should also add the warning that Posix asynchronous cancellation
should never be used with the collector (or, in my opinion, without
it).
It makes sense to me. Glibc’s pthread_setcancelstate(3) shouldn’t
introduce much overhead, so that’s probably OK. Any idea how many
cancellation points lie in libgc code that holds the allocation lock?

Thanks,
Ludo’.
Boehm, Hans
2009-09-15 18:56:56 UTC
Permalink
-----Original Message-----
Sent: Tuesday, September 15, 2009 12:25 AM
Subject: [Gc] Re: pthread_cancel(3) confuses 'GC_suspend_all ()'
Hi,
Thanks for the quick reply and detailed analysis!
[...]
Post by Boehm, Hans
Thus I'd be inclined to go ahead and disable cancellation
across all
Post by Boehm, Hans
potential cancellation points we can find, including the one in the
handler, documenting that one as potentially not completely
portable.
Post by Boehm, Hans
We should also add the warning that Posix asynchronous cancellation
should never be used with the collector (or, in my opinion, without
it).
It makes sense to me. Glibc's pthread_setcancelstate(3)
shouldn't introduce much overhead, so that's probably OK.
Any idea how many cancellation points lie in libgc code that
holds the allocation lock?
I think not a lot. I'll make a pass over the code to try to get the obvious ones. The hard part will be to find the ones I forget.

Hans
Boehm, Hans
2009-09-16 22:18:08 UTC
Permalink
Here's an attempt at a patch. Tested superficially on Linux (Itanium and ARM) and MacOS with today's CVS. (I seem to run into unrelated build problems on Cygwin, but that may be my ancient Cygwin installation.)

Could you confirm that this solves the original problem?

Assuming it does, my inclination would be to check it in. There is some down side risk. Especially with assertions enabled, it may introduce new failures, though they should be near trivial to track down. Having no story at all about cancellation is probably worse.

This patch tries to disable cancellation in a few relatively high level routines, and then (with assertions enabled) checks in the low level routines that it actually was disabled. Thus I think the performance cost with assertions disabled is usually minimal. Writing to the log file probably gets measurably, but not substantially, slower, since GC_printf/GC_write is called from all over the place, and thus needs to explicitly disable cancellation.

We do need additional work on the normal/fast allocation path or in the mark loop or the like.

We currently do assertion checking for cancellation only on platforms on which we believe __thread works, which is currently mostly Linux. This could probably be refined. But so long as we check on one common platform, we should catch most of the problems. And this mechanism has already caught a few.

Hans

* include/private/gcconfig.h (CANCEL_SAFE, IF_CANCEL): new macros.
* include/private/gc_priv.h (DISABLE_CANCEL, RESTORE_CANCEL,
ASSERT_CANCEL_DISABLED): New macros.
* alloc.c (GC_maybe_gc): Assert cancellation disabled.
(GC_collect_a_little_inner,GC_try_to_collect, GC_collect_or_expand):
Disable cancellation.
* misc.c (cancel_disable_count): declare.
(GC_init, GC_write): Disable cancellation.
(GC_init): Remove redundant GC_is_initialized test.
* os_dep.c (GC_repeat_read): Assert cancellation disabled.
(GC_get_stack_base): Disable cancellation.
* pthread_stop_world.c (GC_suspend_handler_inner): Disable
cancellation.
* pthread_support.c (GC_mark_thread): Permanently disable
cancellation.
(GC_wait_for_gc_completion, GC_wait_builder, GC_wait_marker):
Assert cancellation disabled.
(fork handling): Disable cancellation, fix comment.
(GC_unregister_my_thread): Disable cancellation.
-----Original Message-----
Sent: Tuesday, September 15, 2009 12:25 AM
Subject: [Gc] Re: pthread_cancel(3) confuses 'GC_suspend_all ()'
Hi,
Thanks for the quick reply and detailed analysis!
[...]
Post by Boehm, Hans
Thus I'd be inclined to go ahead and disable cancellation
across all
Post by Boehm, Hans
potential cancellation points we can find, including the one in the
handler, documenting that one as potentially not completely
portable.
Post by Boehm, Hans
We should also add the warning that Posix asynchronous cancellation
should never be used with the collector (or, in my opinion, without
it).
It makes sense to me. Glibc's pthread_setcancelstate(3)
shouldn't introduce much overhead, so that's probably OK.
Any idea how many cancellation points lie in libgc code that
holds the allocation lock?
Thanks,
Ludo'.
_______________________________________________
Gc mailing list
http://www.hpl.hp.com/hosted/linux/mail-archives/gc/
Ivan Maidanski
2009-09-17 08:22:23 UTC
Permalink
Hi!
Post by Boehm, Hans
Here's an attempt at a patch. Tested superficially on Linux (Itanium and ARM) and MacOS with today's CVS. (I seem to run into unrelated build problems on Cygwin, but that may be my ancient Cygwin installation.)
Some points to fix:

1. GC_try_to_collect: I think RESTORE should be before invoke finalizers (or even, might be, DISABLE/RESTORE should be inside LOCK/UNLOCK?)

2. cancel_disable_count should be prefixed with GC_ (since not static)

3. "IF_CANCEL(int cancel_state);" -> "IF_CANCEL(int cancel_state;)" (all occurrences).

4. GC_supend_handler_inner: RESTORE missing before "return" statement.

5. fork_cancel_state should be static.

Bye.
Boehm, Hans
2009-09-17 19:00:05 UTC
Permalink
Thanks for the careful proofreading. I applied these to my tree.

Unfortunately, I think they don't explain Ludo's results. I'll need to do some testing/debugging before posting the next version.

Hans
-----Original Message-----
Sent: Thursday, September 17, 2009 1:22 AM
Subject: Re[2]: [Gc] Re: pthread_cancel(3) confuses
'GC_suspend_all ()'
Hi!
Post by Boehm, Hans
Here's an attempt at a patch. Tested superficially on
Linux (Itanium
Post by Boehm, Hans
and ARM) and MacOS with today's CVS. (I seem to run into unrelated
build problems on Cygwin, but that may be my ancient Cygwin
installation.)
1. GC_try_to_collect: I think RESTORE should be before invoke
finalizers (or even, might be, DISABLE/RESTORE should be
inside LOCK/UNLOCK?)
2. cancel_disable_count should be prefixed with GC_ (since not static)
3. "IF_CANCEL(int cancel_state);" -> "IF_CANCEL(int
cancel_state;)" (all occurrences).
4. GC_supend_handler_inner: RESTORE missing before "return" statement.
5. fork_cancel_state should be static.
Bye.
_______________________________________________
Gc mailing list
http://www.hpl.hp.com/hosted/linux/mail-archives/gc/
Ivan Maidanski
2009-09-17 19:22:44 UTC
Permalink
Hi!
Post by Boehm, Hans
Thanks for the careful proofreading. I applied these to my tree.
Unfortunately, I think they don't explain Ludo's results. I'll need to do some testing/debugging before posting the next version.
Hans
A question: it works for Cygwin... but is this really needed for Cygwin?

Bye.
Ivan Maidanski
2009-09-18 06:39:00 UTC
Permalink
Hi!
Post by Boehm, Hans
Thanks for the careful proofreading. I applied these to my tree.
Unfortunately, I think they don't explain Ludo's results. I'll need to do some testing/debugging before posting the next version.
Hans
One more place (to make a bit better):

# if defined(GC_ASSERTIONS) && \
(defined(LINUX) || defined(HPUX) /* and probably others ... */)

->

# if defined(GC_ASSERTIONS) && (defined(USE_COMPILER_TLS) \
|| (defined(LINUX) && !defined(ARM32) \
&& (__GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 3)) \
|| defined(HPUX) /* and probably others ... */)

Bye.

Ludovic Courtès
2009-09-17 10:28:19 UTC
Permalink
Hi,
Post by Boehm, Hans
Here's an attempt at a patch. Tested superficially on Linux (Itanium
and ARM) and MacOS with today's CVS. (I seem to run into unrelated
build problems on Cygwin, but that may be my ancient Cygwin
installation.)
Could you confirm that this solves the original problem?
Thanks for the quick patch. Unfortunately, it doesn’t solve the problem
here on ‘x86_64-unknown-linux-gnu’: the test program I posted still
freezes eventually.

I suppose you need to make sure to test on a multi-core machine, because
I can’t reproduce the problem here with “numactl -C0 ./a.out”.

Thanks,
Ludo’.
Continue reading on narkive:
Loading...