[Gc] libatomic_ops CAS ordering constraints correspondence to C11 atomic ones

Discussion:

Ivan Maidanski

2016-03-24 20:19:51 UTC

Hello Hans,

AO Readme.txt is a bit weak in respect to CAS ordering constraints. Let's fix it by mapping to C11 atomic memory ordering (based on common sense and inspection of libatomic_ops exsiting CAS assembly implementations).

Is the following correct?
|| AO CAS suffix || C11 CAS success || C11 CAS failure ||
| - Â Â Â Â Â Â |Â __ATOMIC_RELAXED |Â __ATOMIC_RELAXED |
| acquire |Â __ATOMIC_ACQUIRE Â |Â __ATOMIC_ACQUIRE |
| release |Â __ATOMIC_RELEASE Â |Â __ATOMIC_RELAXEDÂ |
| full Â Â Â Â |Â __ATOMIC_SEQ_CST Â |Â __ATOMIC_ACQUIRE |

[1]Â https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
[2]Â https://github.com/ivmai/libatomic_ops/blob/master/doc/README.txt Â

--
Best regards,
Ivan

Hans Boehm

2016-03-24 23:40:40 UTC

Permalink

I agree entirely with the first three.

The last one is debatable. I agree that the current spec is not very
satisfactory all the way around. But I believe the current implementations
of _full on architectures where it matters (Power in particular) actually
interpret it as acq_rel for RMW operations. Without fixing
implementations, I think the last line should read acq_rel | acquire. For
operations that don't include a read operation, we seem to map _full to
something like a release (if there's a write) plus a trailing seq_cst fence.

If I had to design this again, full would always mean seq_cst, mostly
because I can define that precisely. But that would slow down Power and
ARMv7 code, and may be a bit much of a revision now.

This library API was a step along the evolution of C/C++ atomics. The
latter are much better defined (though still imperfect). When this API was
designed, neither I nor anyone else fully appreciated some of the many
subtleties.

Hans

Post by Ivan Maidanski
Hello Hans,
AO Readme.txt is a bit weak in respect to CAS ordering constraints. Let's
fix it by mapping to C11 atomic memory ordering (based on common sense and
inspection of libatomic_ops exsiting CAS assembly implementations).
Is the following correct?
|| AO CAS suffix || C11 CAS success || C11 CAS failure ||
| - | __ATOMIC_RELAXED | __ATOMIC_RELAXED |
| acquire | __ATOMIC_ACQUIRE | __ATOMIC_ACQUIRE |
| release | __ATOMIC_RELEASE | __ATOMIC_RELAXED |
| full | __ATOMIC_SEQ_CST | __ATOMIC_ACQUIRE |
[1] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
[2] https://github.com/ivmai/libatomic_ops/blob/master/doc/README.txt
--
Best regards,
Ivan

Ivan Maidanski

2016-03-30 23:08:00 UTC

Permalink

Hello Hans,

I agree about cas_full to be defined as acq_rel | acquire to avoid performance regression on replacement asm with C11 atomics.
But the documentation should be refined and the whole code base should be reviewed to avoid (or minimize) inconsistencies (e.g. the behavior should not depend onÂ AO_PREFER_GENERALIZED presence/absence).
And we definitely do not want stability issues in apps using old AO releases after update to the upcoming one (except for AArch64 which already uses C11 atomics some of them might have too strict memory ordering mapping).

Â For operations that don't include a read operation, we seem to map _full to something like a release (if there's a write) plus a trailing seq_cst fence.

I see. Please correct the mapping below.

Let's review rest operations ('=' stands for definition in generalize.h):

nop:
* read | acquire
* write | release
* full | acq_rel or seq_cst? should it match __sync_synchronize?
load:
* none | relax
* acq | acquire
* read = load + nop_read
* full = load + nop_full or load + nop_read?
* anything else worth to map directly to C11 atomics (among existing ao orders)?
store:
* none | relax
* rel | release
* write = nop_write + none
* full = release + nop_full or nop_read + release?
* anything else?
test_and_set:
* none | relax
* acq | acquire
* rel | release
* full | acq_rel or seq_cst? (note that nop_full = test_and_set_full(&dummy))
* anything else?
fetch_and_add: same as test_and_set?
and, or, xor: same as fetch_and_add?

Regards,
Ivan

I agree entirely with the first three.
The last one is debatable.Â I agree that the current spec is not very satisfactory all the way around.Â But I believe the current implementations of _full on architectures where it matters (Power in particular) actually interpret it as acq_rel for RMW operations.Â Without fixing implementations, I think the last line should read acq_rel | acquire.Â For operations that don't include a read operation, we seem to map _full to something like a release (if there's a write) plus a trailing seq_cst fence.
If I had to design this again, full would always mean seq_cst, mostly because I can define that precisely.Â But that would slow down Power and ARMv7 code, and may be a bit much of a revision now.
This library API was a step along the evolution of C/C++ atomics. The latter are much better defined (though still imperfect).Â When this API was designed, neither I nor anyone else fully appreciated some of the many subtleties.
Hans

Post by Ivan Maidanski
Hello Hans,
AO Readme.txt is a bit weak in respect to CAS ordering constraints. Let's fix it by mapping to C11 atomic memory ordering (based on common sense and inspection of libatomic_ops exsiting CAS assembly implementations).
Is the following correct?
|| AO CAS suffix || C11 CAS success || C11 CAS failure ||
| - Â Â Â Â Â Â |Â __ATOMIC_RELAXED |Â __ATOMIC_RELAXED |
| acquire |Â __ATOMIC_ACQUIRE Â |Â __ATOMIC_ACQUIRE |
| release |Â __ATOMIC_RELEASE Â |Â __ATOMIC_RELAXEDÂ |
| full Â Â Â Â |Â __ATOMIC_SEQ_CST Â |Â __ATOMIC_ACQUIRE |
[1]Â https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
[2]Â https://github.com/ivmai/libatomic_ops/blob/master/doc/README.txt Â
--
Best regards,
Ivan

_______________________________________________
bdwgc mailing list
https://lists.opendylan.org/mailman/listinfo/bdwgc

nop:
* read | acquire
* write | release
* full | acq_rel or seq_cst? should it match __sync_synchronize?

load:
* none | relax
* acq | acquire
* read = load + nop_read
* full = load + nop_full or load + nop_read?
* anything else (among existing ao orders)?

store:
* none | relax
* rel | release
* write = nop_write + none
* full = release + nop_full or nop_read + release?
* anything else?

test_and_set:
* none | relax
* acq | acquire
* rel | release
* full | acq_rel or seq_cst?
* anything else?

fetch_and_add: same as test_and_set?

and, or, xor: same as fetch_and_add?

Ivan Maidanski

2016-04-26 08:47:58 UTC

Permalink

Hello Hans,
Any progress with the answer?
PS I have added c11 atomics usage for x86 x86_64 arm aarch64 mips mips64 (gcc and clang).
Regards,
Ivan

Post by Ivan Maidanski
Hello Hans,
I agree about cas_full to be defined as acq_rel | acquire to avoid performance regression on replacement asm with C11 atomics.
But the documentation should be refined and the whole code base should be reviewed to avoid (or minimize) inconsistencies (e.g. the behavior should not depend onÂ AO_PREFER_GENERALIZED presence/absence).
And we definitely do not want stability issues in apps using old AO releases after update to the upcoming one (except for AArch64 which already uses C11 atomics some of them might have too strict memory ordering mapping).

Â For operations that don't include a read operation, we seem to map _full to something like a release (if there's a write) plus a trailing seq_cst fence.

I see. Please correct the mapping below.
* read | acquire
* write | release
* full | acq_rel or seq_cst? should it match __sync_synchronize?
* none | relax
* acq | acquire
* read = load + nop_read
* full = load + nop_full or load + nop_read?
* anything else worth to map directly to C11 atomics (among existing ao orders)?
* none | relax
* rel | release
* write = nop_write + none
* full = release + nop_full or nop_read + release?
* anything else?
* none | relax
* acq | acquire
* rel | release
* full | acq_rel or seq_cst? (note that nop_full = test_and_set_full(&dummy))
* anything else?
fetch_and_add: same as test_and_set?
and, or, xor: same as fetch_and_add?
Regards,
Ivan

Post by Ivan Maidanski
Hello Hans,
AO Readme.txt is a bit weak in respect to CAS ordering constraints. Let's fix it by mapping to C11 atomic memory ordering (based on common sense and inspection of libatomic_ops exsiting CAS assembly implementations).
Is the following correct?
|| AO CAS suffix || C11 CAS success || C11 CAS failure ||
| - Â Â Â Â Â Â |Â __ATOMIC_RELAXED |Â __ATOMIC_RELAXED |
| acquire |Â __ATOMIC_ACQUIRE Â |Â __ATOMIC_ACQUIRE |
| release |Â __ATOMIC_RELEASE Â |Â __ATOMIC_RELAXEDÂ |
| full Â Â Â Â |Â __ATOMIC_SEQ_CST Â |Â __ATOMIC_ACQUIRE |
[1]Â https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
[2]Â https://github.com/ivmai/libatomic_ops/blob/master/doc/README.txt Â
--
Best regards,
Ivan

_______________________________________________
bdwgc mailing list
https://lists.opendylan.org/mailman/listinfo/bdwgc

* read | acquire
* write | release
* full | acq_rel or seq_cst? should it match __sync_synchronize?
* none | relax
* acq | acquire
* read = load + nop_read
* full = load + nop_full or load + nop_read?
* anything else (among existing ao orders)?
* none | relax
* rel | release
* write = nop_write + none
* full = release + nop_full or nop_read + release?
* anything else?
* none | relax
* acq | acquire
* rel | release
* full | acq_rel or seq_cst?
* anything else?
fetch_and_add: same as test_and_set?
and, or, xor: same as fetch_and_add?

Hans Boehm

2016-05-05 02:11:45 UTC

Permalink

[Sorry about the slow response.]

Post by Ivan Maidanski
Hello Hans,
I agree about cas_full to be defined as acq_rel | acquire to avoid

performance regression on replacement asm with C11 atomics.

Post by Ivan Maidanski
But the documentation should be refined and the whole code base should be

reviewed to avoid (or minimize) inconsistencies (e.g. the behavior should
not depend on AO_PREFER_GENERALIZED presence/absence).

Post by Ivan Maidanski
And we definitely do not want stability issues in apps using old AO

releases after update to the upcoming one (except for AArch64 which already
uses C11 atomics some of them might have too strict memory ordering
mapping).
Actually I agree that Aarch64 is a problem. We should probably change the
documentation to make it clear that for _acquire and _read the ordering is
between the load and subsequent operations, except in the nop case, where
it orders all prior and all subsequent operations. And similarly for
_release and _write. The Aarch64 implementation already weakens it in this
way.

The alternative is to not use Aarch64 acquire/release operations, and use
DMB instead. But that plays poorly with code that does this right.

Post by Ivan Maidanski

For operations that don't include a read operation, we seem to map

_full to something like a release (if there's a write) plus a trailing
seq_cst fence.

Post by Ivan Maidanski
I see. Please correct the mapping below.
* read | acquire
* write | release

Yes. In hindsight, we should never have used read and write.

Post by Ivan Maidanski
* full | acq_rel or seq_cst? should it match __sync_synchronize?

It should just be seq_cst, which I hope finally matches sync_synchronize.
Nobody should be using the latter.

Post by Ivan Maidanski
* none | relax
* acq | acquire

Yes.

Post by Ivan Maidanski
* read = load + nop_read

Or better yet, acquire. We should be strengthening the read and write
stuff to the more general acquire/release versions. It almost never makes a
difference and is much less error-prone.

Post by Ivan Maidanski
* full = load + nop_full or load + nop_read?

I think I would just map it to seq_cst and clarify the spec.

Post by Ivan Maidanski
* anything else worth to map directly to C11 atomics (among existing ao orders)?

I would map dd_acquire_read to acquire. Ideally it should map to consume,
but that's currently broken, and will map to acquire anyway.

Post by Ivan Maidanski
* none | relax
* rel | release
* write = nop_write + none

I would again just use release.

Post by Ivan Maidanski
* full = release + nop_full or nop_read + release?
* anything else?
* none | relax
* acq | acquire
* rel | release

Yes.

Post by Ivan Maidanski
* full | acq_rel or seq_cst? (note that nop_full =

test_and_set_full(&dummy))
I would make it seq_cst. It probably doesn't really matter. We should
explicitly define nop_full to be atomic_thread_fence(memory_order_seq_cst).
test_and_set_full(&dummy)) actually doesn't do the right thing on ARMv8,
where it needs to be "dmb ish".

Post by Ivan Maidanski
* anything else?
fetch_and_add: same as test_and_set?

I think so.

Post by Ivan Maidanski
and, or, xor: same as fetch_and_add?

I would change the documentation of _full to

_full: The associated operation is ordered with respect to both earlier and
later memory ops.
If the associated operation is nop, then this orders all earlier
memory operations with respect to subsequent ones.
AO_store_full or AO_nop_full are the normal ways to force a store
to be ordered with respect to a later load.

Post by Ivan Maidanski
Regards,
Ivan

Ivan Maidanski

2016-05-13 08:42:38 UTC

Permalink

Hello Hans,
Based on your email:
1. Readme update:
https://github.com/ivmai/libatomic_ops/commit/e558b28e2ce997075cbaf5cbf5f0c3da8a6b910a
2. ToDo items added (see .template file): https://github.com/ivmai/libatomic_ops/commit/9fdee25ae12e9fc3fcfdadee441b551ca9e929d4
PS. I'm close to making gc and atomic ops releases - I suppose to send them to you next week.
Regards,
Ivan
--

Post by Hans Boehm
[Sorry about the slow response.]

Post by Ivan Maidanski
Hello Hans,
I agree about cas_full to be defined as acq_rel | acquire to avoid performance regression on replacement asm with C11 atomics.
But the documentation should be refined and the whole code base should be reviewed to avoid (or minimize) inconsistencies (e.g. the behavior should not depend on AO_PREFER_GENERALIZED presence/absence).
And we definitely do not want stability issues in apps using old AO releases after update to the upcoming one (except for AArch64 which already uses C11 atomics some of them might have too strict memory ordering mapping).

Actually I agree that Aarch64 is a problem.Â We should probably change the documentation to make it clear that for _acquire and _read the ordering is between the load and subsequent operations, except in the nop case, where it orders all prior and all subsequent operations.Â And similarly for _release and _write.Â The Aarch64 implementation already weakens it in this way.
The alternative is to not use Aarch64 acquire/release operations, and use DMB instead.Â But that plays poorly with code that does this right.

Post by Ivan Maidanski

For operations that don't include a read operation, we seem to map _full to something like a release (if there's a write) plus a trailing seq_cst fence.

I see. Please correct the mapping below.
* read | acquire
* write | release

Yes.Â In hindsight, we should never have used read and write.

Post by Ivan Maidanski
* full | acq_rel or seq_cst? should it match __sync_synchronize?

It should just be seq_cst, which I hope finally matches sync_synchronize.Â Nobody should be using the latter.

Post by Ivan Maidanski
* none | relax
* acq | acquire

Yes.

Post by Ivan Maidanski
* read = load + nop_read

Or better yet, acquire.Â We should be strengthening the read and write stuff to the more general acquire/release versions. It almost never makes a difference and is much less error-prone.

Post by Ivan Maidanski
* full = load + nop_full or load + nop_read?

I think I would just map it to seq_cst and clarify the spec.

Post by Ivan Maidanski
* anything else worth to map directly to C11 atomics (among existing ao orders)?

I would map dd_acquire_read to acquire. Ideally it should map to consume, but that's currently broken, and will map to acquire anyway.

Post by Ivan Maidanski
* none | relax
* rel | release
* write = nop_write + none

I would again just use release.

Post by Ivan Maidanski
* full = release + nop_full or nop_read + release?
* anything else?
* none | relax
* acq | acquire
* rel | release

Yes.

Post by Ivan Maidanski
* full | acq_rel or seq_cst? (note that nop_full = test_and_set_full(&dummy))

I would make it seq_cst.Â It probably doesn't really matter.Â We should explicitly define nop_full to be atomic_thread_fence(memory_order_seq_cst). Â test_and_set_full(&dummy)) actually doesn't do the right thing on ARMv8, where it needs to be "dmb ish".

Post by Ivan Maidanski
* anything else?
fetch_and_add: same as test_and_set?

I think so.

Post by Ivan Maidanski
and, or, xor: same as fetch_and_add?

I would change the documentation of _full to
_full: The associated operation is ordered with respect to both earlier and later memory ops.
Â Â Â Â If the associated operation is nop, then this orders all earlier memory operations with respect to subsequent ones.
Â Â Â Â AO_store_full or AO_nop_full are the normal ways to force a store
Â Â Â Â to be ordered with respect to a later load. >

Post by Ivan Maidanski
Regards,
Ivan

_______________________________________________
bdwgc mailing list
https://lists.opendylan.org/mailman/listinfo/bdwgc