[Gc] Lock Elision in eglibc 2.19
Paul Bone
2014-06-21 01:14:13 UTC
Hi Andi

I've been tracking down a bug that came up when I upgraded from eglibc 2.18
to eglibc 2.19 (on Debian jessie, on x86_64).

The Mercury programming language http://www.mercurylang.org uses the
Boehm-Demers-Weiser Garbage Collector http://hboehm.info/gc/. Both use
pthreads on Linux. This is https://www.mercurylang.org/bugs/view.php?id=334
in Mercury's BTS.

I noticed the following error:
mercury_compile: ../nptl/pthread_mutex_lock.c:80: __pthread_mutex_cond_lock: Assertion `mutex->__data.__owner == 0' failed.

This is thrown (indirectly) from a call to pthread_cond_wait in
pthread_support.c line 2036 in Boehm GC 7.4.2. I have the same problem with
Boehm GC 7.2. There doesn't appear to be anything suspicious about the use
of the mutex or condition variable involved here.

A different bug affecting libtirpc and mount_nfs also started occurring when
I upgraded to eglibc 2.19. When investigating this I found that eglibc 2.19
introduced lock elision using TSX extensions and found your article here:
http://lwn.net/Articles/534758/ I use an i7-4770 processor which supports
TSX. (I chose this one because I wanted to experiment with some lock free
code myself.)

I've looked at the NPTL code and the Boehm code and I don't see anything
obvious - not that the NPTL assembler is easy to read. Given that the
assertion refers to the __owner field, and that the elision paths don't
update this field, I wonder if they're related, that is, whether not updating
the __owner field causes other issues.

This mutex and condition variable refer to the Boehm collector's marking
phase, which will read and update a lot of memory. Is the mutex code
falling back from lock elision to normal locks for this mutex and then
triggering the assertion because the owner field hasn't been updated?

As a work-around I'd like to explicitly disable elision for this mutex.
I've searched the glibc/eglibc sources and documentation and haven't found a
way to disable elision. But some things I read (mailing list messages etc)
say that it should be possible either per mutex or completely (with an
environment variable). Could you tell me how? Thanks.


I have a second question that is less important, but I'd like to understand
nevertheless. Your LWN article suggests that the entire critical section
(from pthread_mutex_lock to pthread_mutex_unlock) is a transactional memory
transaction. Have I understood correctly? If so, why not just start and
finish the transactional memory transaction within the pthread_mutex_lock
code? That is, after acquiring the lock, finish the TM transaction so that
the processor doesn't need to handle all the memory use until the
pthread_mutex_unlock call specially.

Thanks.
--
Paul Bone
Andi Kleen
2014-06-21 18:20:16 UTC
Post by Paul Bone
mercury_compile: ../nptl/pthread_mutex_lock.c:80: __pthread_mutex_cond_lock: Assertion `mutex->__data.__owner == 0' failed.
This is thrown (indirectly) from a call to pthread_cond_wait in
pthread_support.c line 2036 in Boehm GC 7.4.2. I have the same problem with
Boehm GC 7.2. There doesn't appear to be anything suspicious about the use
of the mutex or condition variable involved here.
This could be a variant of
https://sourceware.org/bugzilla/show_bug.cgi?id=16657

Do the patches there help?
Post by Paul Bone
A different bug affecting libtirpc and mount_nfs also started occurring when
I upgraded to eglibc 2.19. When investigating this I found that eglibc 2.19
Likely related to locking?
Post by Paul Bone
As a work-around I'd like to explicitly disable elision for this mutex.
I've searched the glibc/eglibc sources and documentation and haven't found a
way to disable elision. But some things I read (mailing list messages etc)
say that it should be possible either per mutex or completely (with an
environment variable). Could you tell me how? Thanks.
My patches to do this were unfortunately not accepted. glibc
supports it internally but there is no way to request it
for user programs. I hope this can be revisited in the future.

The old tuning patches are in my github tree in the rtm-devel9 branch.
http://github.com/andikleen/glibc
Post by Paul Bone
I have a second question that is less important, but I'd like to understand
nevertheless. Your LWN article suggests that the entire critical section
(from pthread_mutex_lock to pthread_mutex_unlock) is a transactional memory
transaction. Have I understood correctly?
Yes.
Post by Paul Bone
If so, why not just start and
finish the transactional memory transaction within the pthread_mutex_lock
code? That is, after acquiring the lock, finish the TM transaction so that
the processor doesn't need to handle all the memory use until the
pthread_mutex_unlock call specially.
The point of lock elision is to allow full parallelism of the critical
section including all memory accesses in it. So the transaction
has to span the whole critical section, otherwise atomicity couldn't
be guaranteed.

Here's a newer article that has some more details:

http://queue.acm.org/detail.cfm?id=2579227

-Andi
--
***@linux.intel.com -- Speaking for myself only.
Paul Bone
2014-06-25 11:07:01 UTC
Post by Andi Kleen
Post by Paul Bone
mercury_compile: ../nptl/pthread_mutex_lock.c:80: __pthread_mutex_cond_lock: Assertion `mutex->__data.__owner == 0' failed.
This is thrown (indirectly) from a call to pthread_cond_wait in
pthread_support.c line 2036 in Boehm GC 7.4.2. I have the same problem with
Boehm GC 7.2. There doesn't appear to be anything suspicious about the use
of the mutex or condition variable involved here.
This could be a variant of
https://sourceware.org/bugzilla/show_bug.cgi?id=16657
Do the patches there help?
I'll test these if I get time. It'll have to wait for a good weekend. ;-)
I'll report back with information after I've tried so that we know if this
or a different patch is required.
Post by Andi Kleen
Post by Paul Bone
A different bug affecting libtirpc and mount_nfs also started occurring when
I upgraded to eglibc 2.19. When investigating this I found that eglibc 2.19
Likely related to locking?
Yes. If I recall correctly they unlocked a lock with a thread that didn't
own the lock. For some reason this used to work but since elision has been
introduced it now segfaults on an xend (I think, I don't remember the name
exactly) instruction.
Post by Andi Kleen
Post by Paul Bone
As a work-around I'd like to explicitly disable elision for this mutex.
I've searched the glibc/eglibc sources and documentation and haven't found a
way to disable elision. But some things I read (mailing list messages etc)
say that it should be possible either per mutex or completely (with an
environment variable). Could you tell me how? Thanks.
My patches to do this were unfortunately not accepted. glibc
supports it internally but there is no way to request it
for user programs. I hope this can be revisited in the future.
The old tuning patches are in my github tree in the rtm-devel9 branch.
http://github.com/andikleen/glibc
I've found that creating the mutex with the error-checking attribute, which
is already supported and portable, avoids the crash. So at this point the
issue isn't critical anymore, although it's probably still important to fix.

I've submitted a patch to work around this to the Boehm GC project:
https://lists.opendylan.org/pipermail/bdwgc/2014-June/005962.html
Post by Andi Kleen
Post by Paul Bone
I have a second question that is less important, but I'd like to understand
nevertheless. Your LWN article suggests that the entire critical section
(from pthread_mutex_lock to pthread_mutex_unlock) is a transactional memory
transaction. Have I understood correctly?
Yes.
Post by Paul Bone
If so, why not just start and
finish the transactional memory transaction within the pthread_mutex_lock
code? That is, after acquiring the lock, finish the TM transaction so that
the processor doesn't need to handle all the memory use until the
pthread_mutex_unlock call specially.
The point of lock elision is to allow full parallelism of the critical
section including all memory accesses in it. So the transaction
has to span the whole critical section, otherwise atomicity couldn't
be guaranteed.
Okay, that makes sense. I did enough reading to learn that if elision fails
(say because of a buffer overflow or a system call) then NPTL can recover,
and that it's then less likely that NPTL will try to use elision in the
future. So I'm less concerned about using this with large transactions.
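
The back-off behaviour described above can be sketched as a counter simulation. This is not glibc's code: the transaction attempt is a hypothetical callback rather than a real hardware transaction, no actual lock is taken, and SKIP_AFTER_ABORT is an assumed value chosen for the example.

```c
#include <stdbool.h>

/* Hypothetical stand-in for starting a hardware transaction.
   Returning false models an abort. */
typedef bool (*try_transaction_fn)(void);

struct adaptive_lock {
    int skip_count;  /* upcoming acquisitions that will skip elision */
    int elided;      /* acquisitions that ran as a transaction */
    int locked;      /* acquisitions that fell back to the real lock */
};

enum { SKIP_AFTER_ABORT = 3 };  /* assumed tuning value, not glibc's */

/* Returns true if the acquisition was elided, false if it fell back
   to the (simulated) real lock. */
bool adaptive_acquire(struct adaptive_lock *l, try_transaction_fn try_tx)
{
    if (l->skip_count > 0) {
        l->skip_count--;                     /* still backing off */
    } else if (try_tx()) {
        l->elided++;                         /* transaction started */
        return true;
    } else {
        l->skip_count = SKIP_AFTER_ABORT;    /* abort: avoid elision for a while */
    }
    l->locked++;                             /* take the real lock instead */
    return false;
}

/* Two trivial transaction models for experimenting with the sketch. */
bool always_abort(void)  { return false; }
bool always_commit(void) { return true; }
```

After one abort, the next SKIP_AFTER_ABORT acquisitions go straight to the real lock before elision is tried again, which is why a critical section that always aborts costs little more than ordinary locking.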
Post by Andi Kleen
http://queue.acm.org/detail.cfm?id=2579227
Thanks, and thanks for all the information.
--
Paul Bone