OrderAccess Refactoring


OrderAccess Refactoring

Erik Österlund
Hi,

I have some proposed changes to OrderAccess and thought I’d check whether they are wanted. They come in two parts: correctness fixes, and refactoring the code to a generalization/specialization approach to eliminate the large amount of redundant platform-specific code that is not actually platform specific. It’s possible to pick only one if both are not wanted.

2.a) OrderAccess Maintainability
Summary: Remove platform-specific code that is not actually platform specific (most of it), make it shared, and separate the different concerns in the code.

Several different concerns are currently joined into one big blob: which barrier to use (acquire, release, fence) and where to put it w.r.t. memory accesses, how to implement those barriers, how to make atomic memory accesses, and how to specialize more efficient variants are all different concerns, yet they are flattened out and repeated over and over again for each platform even though the behaviour is very general. I propose to move everything except specialization and barrier implementation to shared code, separating these concerns: how to make atomic memory accesses and where to put the barriers w.r.t. accesses is the same for everyone except the few specialized variants with inline asm. This removes most of the code for most of the individual platforms and makes it easier to reason about the different concerns in isolation without changing stuff all over the place. The idea is to split the class into 3 classes using inheritance: OrderAccessBase, OrderAccessPlatform and OrderAccess (of course). Most of the stuff goes in OrderAccessBase, but platforms with inline assembly can put their specialized variants in OrderAccessPlatform.
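To give a feel for the shape, here is a rough sketch (illustrative only, not the actual patch; everything except the OrderAccess name is a placeholder, and the barrier bodies simply assume a GCC-style compiler barrier on a TSO machine):

#include <cstdint>

// Shared layer: the pattern of where barriers go relative to accesses is the
// same for every platform, so it lives here once.
class OrderAccessBase {
 public:
  static int32_t load_acquire(const volatile int32_t* p) {
    int32_t v = *p;   // atomic load of an aligned word (assumed)
    acquire();        // platform-supplied barrier
    return v;
  }
  static void release_store(volatile int32_t* p, int32_t v) {
    release();        // platform-supplied barrier
    *p = v;
  }
 protected:
  // Placeholder barrier implementations: a GCC-style compiler barrier, which
  // is enough on TSO hardware but would need real fences elsewhere.
  static void acquire() { __asm__ volatile("" : : : "memory"); }
  static void release() { __asm__ volatile("" : : : "memory"); }
};

// Platform layer: barrier definitions and optional inline-asm specializations
// that fuse the access and the ordering for speed.
class OrderAccessPlatform : public OrderAccessBase {};

// The class the rest of the VM keeps using.
class OrderAccess : public OrderAccessPlatform {};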

2.b) OrderAccess Correctness
Summary: Relying on volatile to provide acquire/release on TSO cannot generally be guaranteed; this must be fixed on all TSO platforms except x86 linux and windows.

Correctness concerns are inconsistent across platforms and dangerous, which is my second issue with OrderAccess. Specifically, for TSO architectures like SPARC and x86, OrderAccess relies on C++ volatile for acquire/release semantics, which in most cases is wrong. In every compiler I know of except MSVC 2008, this cannot be relied upon, as C++ standard section 1.9 is vague about whether volatiles may be reordered w.r.t. non-volatiles and has been interpreted differently by compiler implementors. It only really promises reordering constraints between volatiles, which is not enough for acquire/release semantics. As a matter of fact, GCC explicitly states they may be arbitrarily reordered at will, and therefore we need compiler barriers to prevent this. This problem was observed and fixed by Mikael on Linux. But the exact same problem (relying on volatile to have acquire/release semantics) still lurks around on the other TSO architectures, waiting for the right moment to strike. With the refactoring, correcting such an issue (which I also intend to do because I think the current implementation is flawed) can be done easily with 3 LoC per platform - the barrier selection for memory accesses - instead of for each type of memory access on nearly every TSO platform.
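To make the fix concrete, this is the kind of compiler barrier I mean (a GCC-style sketch, not code taken from the HotSpot sources):

int payload;        // ordinary data
volatile int flag;  // "volatile" publication flag

// Broken publish: the compiler may legally sink the plain store to payload
// below the volatile store to flag, because volatile only orders volatiles.
void publish_broken(int v) {
  payload = v;
  flag = 1;
}

// Fixed publish: an empty asm with a "memory" clobber is a pure compiler
// barrier; on TSO hardware that is all a release store needs.
void publish_fixed(int v) {
  payload = v;
  __asm__ volatile("" : : : "memory");
  flag = 1;
}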

The refactoring will as a bonus make a later transition to C++11/14 easier when compilers are ready, also with only a few LoC. Since we are now forced to use a compiler barrier which can be a bit too restrictive for the intended semantics, a more modern compiler can instead relax this a bit without sacrificing correctness and give us the exact semantics we want.
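For reference, a later C++11 mapping could look roughly like this (a sketch under the assumption that std::atomic is usable; not part of the current proposal):

#include <atomic>

// Expresses exactly acquire/release to the compiler instead of a full
// compiler barrier, so only the reorderings that actually matter are blocked.
inline int load_acquire_cxx11(const std::atomic<int>* p) {
  return p->load(std::memory_order_acquire);
}
inline void release_store_cxx11(std::atomic<int>* p, int v) {
  p->store(v, std::memory_order_release);
}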

Any idea if this is wanted, and if so both a and b or only b? IMO it’s much easier to fix the correctness issues on different platforms with that extra bit of refactoring too. ;)

Thanks,
/Erik

Re: OrderAccess Refactoring

Karen Kinnear
Erik,

I am delighted to have you looking at this code in detail to make sure it is accurate and maintainable.

> On Jan 9, 2015, at 5:32 AM, Erik Österlund <[hidden email]> wrote:
>
> Hi,
>
> I have some proposed changes to OrderAccess, thought I’d check if they are wanted. It comes in two parts: consistency fixing, and refactoring the code to a generalization/specialization approach instead to eliminate lots of redundant platform specific code which is not platform specific. It’s possible to pick only one if both are not wanted.
>
> 2.a) OrderAccess Maintainability
> Summary: Remove platform specific code which is not platform specific (most of it) and make it shared, and separate different concerns in the code.
>
> Different concerns are now joined to a big blob: Which barrier to use (acquire, release, fence) and where to put it w.r.t. to memory accesses, how to implement those barriers, how to make atomic memory accesses, how to specialize more efficient variants are all different concerns, now flattened out to be done over and over again for each platform even though the behaviour is very general. I propose to move everything except specialization and barrier implementation to shared code to separate their concerns, i.e. how to make atomic memory accesses and where to put the barriers w.r.t. accesses is the same for everyone except the few specialized variants with inline asm. This removes most of the code for most of the individual platforms and makes it easier to reason about the different concerns in isolation without changing stuff all over the place. The idea is to split the class into 3 classes using inheritance: OrderAccessBase, OrderAccessPlatform and OrderAccess (of course). Most of the stuff goes in OrderAccessBase, but platforms with inline assembly can put their specialized variants in OrderAccessPlatform.
In theory I like refactoring. In practice I hope there is a very clear way to specify the per-platform assumptions and to make it very obvious what the end results are
for a given platform and API.
 
>
> 2.b) OrderAccess Correctness
> Summary: Relying on volatile to provide acquire/release on TSO cannot generally be guaranteed, must be fixed on all TSO except x86 linux and windows.
>
> Correctness concerns are inconsistent over different platforms and dangerous which is my second issue with OrderAccess. Specifically, for TSO architectures like SPARC and x86, OrderAccess relies on C++ volatile for acquire release semantics, which in most cases is wrong. In every compiler I know of except MSVC 2008, this can not be relied upon as the C++ standard section 1.9 is vague whether volatiles may be reordered w.r.t. non-volatiles and has been interpreted differently by compiler implementors. It only really promises reordering constraints between volatiles which is not enough for acquire release semantics. As a matter of fact, GCC explicitly states they may be arbitrarily reordered at will, and therefore we need compiler barriers to prevent this. This problem was observed on linux and fixed by Mikael, on Linux. But the exact same problem (relying on volatile to have acquire release semantics) still lurks around on the other TSO architectures, waiting for the right moment to strike. With the refactoring, correcting such an issue (which I also intend to do because I think the current implementation is flawed) can be easily done with 3 LoC per platform - the barrier selection for memory accesses, instead of for each type of memory access on nearly every TSO platform.
Having code reviewed (and obviously missed things in) earlier rounds of OrderAccess - I just wanted to note that we never thought that C++ volatile would provide
acquire/release semantics. volatile has always been just a compiler directive, and yes volatile-volatile only. Acquire/release are instructions to the hardware - not
what I would call compiler barriers.

Personally I would do the correctness fixes first - and make sure they are really well tested and carefully code reviewed studying the cookbook and memory model.
I have not been following the more recent java memory model evolution - I believe Aleksey Shipilev has - if you have references to any updated advice
from Doug Lea - that would be very helpful for your reviewers to study. I know the last time I reviewed this I had to make my own small charts and walk it
through based on per-platform guarantees.

I would find it helpful if you are aware of specific bugs - if you could list the actual problems and call those out? It is very hard to get this right and
hard to review - so if you were to have targeted reviews for specific fixes we could give the changes the depth of study that they need.

Doing the refactoring afterwards allows us the easier exercise of ensuring that the refactoring produces the same code that we already have studied
in depth and believe to be correct.

So as you refactor please do find a place to clarify the per-platform assumptions so if we realize our assumptions are wrong we will re-review the code.
>
> The refactoring will as a bonus make a later transition to C++11/14 easier when compilers are ready, also with only a few LoC. Since we are now forced to use a compiler barrier which can be a bit too restrictive for the intended semantics, a more modern compiler can instead relax this a bit without sacrificing correctness and give us the exact semantics we want.
>
> Any idea if this is wanted, and if so both a and b or only b? IMO it’s much easier to fix the correctness issues on different platforms with that extra bit of refactoring too. ;)
thank you for offering and for asking for opinions :-)
Karen
>
> Thanks,
> /Erik


Re: OrderAccess Refactoring

Erik Österlund
Hi Karen,

Thanks for the reply!

On 09 Jan 2015, at 15:18, Karen Kinnear <[hidden email]<mailto:[hidden email]>> wrote:

> Erik,
>
> I am delighted to have you looking at this code in detail to make sure it is accurate and maintainable.

:)

> In theory I like refactoring. In practice I hope there is a very clear way to specify the per-platform assumptions and to make it very obvious what the end results are
> for a given platform and API.

For each platform you only have to specify two things:

1. What is needed for acquire, release and fence for memory accesses, for that platform (5 LoC per platform). The pattern of when to use them (semantics) is generalized in shared code.

2. To provide any specializations for optimization purposes when applicable (certain TSO platforms that can join the access and semantics to a single inline asm instruction). Note that this is optional, and only done to get faster accesses. Many platforms do not do it at all.

I think it’s very intuitive! Will provide webrev if this seems worthwhile.
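As an illustration of point 2, a specialized variant could look something like this hypothetical x86 release_store (sketch only; the real per-platform files may look different):

#include <cstdint>

// On x86 (TSO) a plain store already has release semantics in hardware, so a
// single inline-asm statement with a "memory" clobber covers both the access
// and the compiler ordering in one go.
inline void release_store_specialized(volatile int32_t* p, int32_t v) {
  __asm__ volatile("movl %1, %0"
                   : "=m" (*p)
                   : "r" (v)
                   : "memory");
}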

> Having code reviewed (and obviously missed things) earlier rounds of OrderAccess - I just wanted to note that we never thought that C++ volatile would provide
> acquire/release semantics. volatile has always been just a compiler directive, and yes volatile-volatile only. Acquire/release are instructions to the hardware - not
> what I would call compiler barriers.

Acquire/release are semantics, not necessarily bound to any particular compiler/hardware. They give a certain contract to the user that expects these acquire/release semantics. Compiler barriers and fences are merely a means of providing these semantics, but the user should not have to care about such details.

If what you are saying is that it was intended only as acquire/release for hardware but not constrain compiler reorderings to comply with the semantics, then the contract is broken (does in fact not promise acquire release, only on hardware level but compilers could do whatever with non-volatiles which is equally as bad).

A store release should synchronize with an acquire load to the same location, like in this example:

T1: store x1
T1: release store x2

T2: load acquire x2
T2: load x1

The semantics should guarantee that if load acquire x2 happens after store release x2 in total order, then it is guaranteed that store x1 will be observed by the load x1, and not some old value that was there before store x1. This contract does not currently hold since the normal non-volatile memory accesses may be reordered with volatile memory accesses; the stores and loads may be reordered in this example by the compiler at will.
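Written against the OrderAccess API the example looks roughly like this (illustrative field types; it only compiles inside HotSpot where the header is available, and the exact overloads may differ slightly from my memory):

#include "runtime/orderAccess.hpp"  // HotSpot-internal header

struct Example {
  int           _x1;  // ordinary (non-volatile) payload
  volatile jint _x2;  // flag accessed through OrderAccess

  void t1_publish() {
    _x1 = 42;                              // store x1
    OrderAccess::release_store(&_x2, 1);   // release store x2
  }
  jint t2_consume() {
    if (OrderAccess::load_acquire(&_x2) == 1) {  // load acquire x2
      return _x1;  // must observe 42 if the contract covers plain accesses
    }
    return -1;
  }
};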

Therefore acquire release semantics are not guaranteed in general, but only w.r.t. volatiles. This is neither clearly documented nor used in such a way.

OrderAccess is used in a way that assumes acquire/release semantics. But for certain TSO platforms, the contract only provides volatile semantics, not acquire release semantics. On other platforms, it properly provides acquire release semantics, which is expected.

It only makes sense for the contract to promise acquire/release semantics (both for compiler and hardware reorderings) for uses of these methods, and consistently do so on all platforms.

> Personally I would do the correctness fixes first - and make sure they are really well tested and carefully code reviewed studying the cookbook and memory model.
> I have not been following the more recent java memory model evolution - I believe Aleksey Snipilev has - if you have references to any updated advice
> from Doug Lea - that would be very helpful for your reviewers to study. I know the last time I reviewed this I had to make my own small charts and walk it
> through based on per-platform guarantees.

There is nothing wrong with the Java memory model. The only problem is that OrderAccess promises semantics that are only guaranteed on some platforms.

> I would find it helpful if you are aware of specific bugs - if you could list the actual problems and call those out? It is very hard to get this right and
> hard to review - so if you were to have targeted reviews for specific fixes we could give the changes the depth of study that they need.

In https://bugs.openjdk.java.net/browse/JDK-8061964, for instance, this was observed and caused trouble for G1 card refinement that went boom. In the discussion I pointed out the issue of relying on volatiles as a cause for the crash, and it was fixed, but not consistently for all platforms, but only for gcc on linux where the reordering was observed in the generated asm output. But it is equally as dangerous on other (TSO) platforms (except windows where the compiler actually explicitly promises not to reorder - their volatiles follow acquire/release in compiler reorderings). I think hoping for compilers to not reorder without any guarantees, waiting for things to go boom where it has already been observed on other platforms seems dangerous. Better safe than sorry. :)

One different solution is to redocument the contract clearly as being acquire/release w.r.t. volatile accesses, which is currently the actual contract provided by the implementation.
But since bugs were already discovered (which is very understandable because no other atomic library has such semantics that I know of) and its code is still unchanged, it seems dangerous to do so without re-evaluating the use of OrderAccess in hotspot, which seems a bit painful. Also such a revised contract would be a disaster when you need to copy loads of data and then store release for instance. This would force the copying (of potentially lots of data) to either be done using volatile accesses rather than in an accelerated fashion (e.g. SSE extensions, write combining), or add compiler barriers to such copying to give the release store actual release store semantics. This would also require fixing G1 where a bug was observed to not rely on the intuitive and expected contract, but on an unintuitive and provably error prone contract. Ugh!!
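Concretely, the copy-then-publish pattern I have in mind is something like this (hypothetical code, again assuming the OrderAccess header; under a volatile-only contract the bulk copy below would not be ordered before the publishing store):

#include <cstring>
#include "runtime/orderAccess.hpp"  // HotSpot-internal header

struct PublishedBuffer {
  char          _data[4096];
  volatile jint _ready;
};

// Fast non-volatile bulk copy followed by a publishing release store. With
// acquire/release w.r.t. all accesses this is correct as written; with
// acquire/release only w.r.t. volatiles, the copy would have to be done
// through volatile accesses or be followed by an explicit compiler barrier.
void publish(PublishedBuffer* buf, const char* src, std::size_t len) {
  std::memcpy(buf->_data, src, len);
  OrderAccess::release_store(&buf->_ready, 1);
}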

The only solution that seems to make sense to me is to fix the semantics of OrderAccess (that were already changed on Linux) to be what is expected, and that is acquire/release w.r.t. other accesses (volatile or non-volatile).

> Doing the refactoring afterwards allows us the easier exercise of ensuring that the refactoring produces the same code that we already have studied
> in depth and believe to be correct.
>
> So as you refactor please do find a place to clarify the per-platform assumptions so if we realize our assumptions are wrong we will re-review the code.

I was hoping to do both at the same time, since there are many platforms to be changed, and making them over and over for every type of access on every platform to be changed is a bit tedious (and error prone) compared to doing it once per platform, reusing the shared pattern (to be reviewed once instead of once per platform) and review the choice of barriers once for each platform instead of for every memory access on every platform. Of course I don’t mind if we choose to do things in the other order though if that is preferred. :)

If you compare the change in semantics I propose, without thinking about the refactoring, just compare x86 on bsd and linux, you see that on linux a compiler barrier is correctly used, but not on bsd. This is the change that needs to be added to all TSO platforms except linux and windows to ensure correctness.

> thank you for offering and for asking for opinions :-)

Thank you for the good feedback! It looks like this change seems wanted, so I’ll go ahead and prepare a webrev with the proposed changes. :)

Thanks,
Erik


Re: OrderAccess Refactoring

Erik Österlund
[Hoping to fix nesting of comments removed by server so you can actually
see who said what in previous post :-|]

Hi Karen,

Thanks for the reply!

On 09 Jan 2015, at 15:18, Karen Kinnear
<[hidden email]<mailto:[hidden email]>> wrote:

> Erik,
>
> I am delighted to have you looking at this code in detail to make sure it is accurate and maintainable.

:)

> In theory I like refactoring. In practice I hope there is a very clear way to specify the per-platform assumptions and to make it very obvious what the end results are
> for a given platform and API.

For each platform you only have to specify two things:

1. What is needed for acquire, release and fence for memory accesses,
for that platform (5 LoC per platform). The pattern of when to use them
(semantics) is generalized in shared code.

2. To provide any specializations for optimization purposes when
applicable (certain TSO platforms that can join the access and semantics
to a single inline asm instruction). Note that this is optional, and
only done to get faster accesses. Many platforms do not do it at all.

I think it’s very intuitive! Will provide webrev if this seems worthwhile.

> Having code reviewed (and obviously missed things) earlier rounds of OrderAccess - I just wanted to note that we never thought that C++ volatile would provide
> acquire/release semantics. volatile has always been just a compiler directive, and yes volatile-volatile only. Acquire/release are instructions to the hardware - not
> what I would call compiler barriers.

Acquire/release are semantics, not necessarily bound to any particular
compiler/hardware. They give a certain contract to the user that expects
these acquire/release semantics. Compiler barriers and fences are merely
a means of providing these semantics, but the user should not have to
care about such details.

If what you are saying is that it was intended only as acquire/release
for hardware but not constrain compiler reorderings to comply with the
semantics, then the contract is broken (does in fact not promise acquire
release, only on hardware level but compilers could do whatever with
non-volatiles which is equally as bad).

A store release should synchronize with an acquire load to the same
location, like in this example:

T1: store x1
T1: release store x2

T2: load acquire x2
T2: load x1

The semantics should guarantee that if load acquire x2 happens after
store release x2 in total order, then it is guaranteed that store x1
will be observed by the load x1, and not some old value that was there
before store x1. This contract does not currently hold since the normal
non-volatile memory accesses may be reordered with volatile memory
accesses; the stores and loads may be reordered in this example by the
compiler at will.

Therefore acquire release semantics are not guaranteed in general, but
only w.r.t. volatiles. This is neither clearly documented nor used in
such a way.

OrderAccess is used in a way that assumes acquire/release semantics.
But for certain TSO platforms, the contract only provides volatile
semantics, not acquire release semantics. On other platforms, it
properly provides acquire release semantics, which is expected.

It only makes sense for the contract to promise acquire/release semantics
(both for compiler and hardware reorderings) for uses of these methods,
and consistently do so on all platforms.

> Personally I would do the correctness fixes first - and make sure they are really well tested and carefully code reviewed studying the cookbook and memory model.
> I have not been following the more recent java memory model evolution - I believe Aleksey Snipilev has - if you have references to any updated advice
> from Doug Lea - that would be very helpful for your reviewers to study. I know the last time I reviewed this I had to make my own small charts and walk it
> through based on per-platform guarantees.

There is nothing wrong with the Java memory model. The only problem is
that OrderAccess promises semantics that are only guaranteed on some
platforms.

> I would find it helpful if you are aware of specific bugs - if you could list the actual problems and call those out? It is very hard to get this right and
> hard to review - so if you were to have targeted reviews for specific fixes we could give the changes the depth of study that they need.

In https://bugs.openjdk.java.net/browse/JDK-8061964, for instance, this
was observed and caused trouble for G1 card refinement that went boom.
In the discussion I pointed out the issue of relying on volatiles as a
cause for the crash, and it was fixed, but not consistently for all
platforms, but only for gcc on linux where the reordering was observed
in the generated asm output. But it is equally as dangerous on other
(TSO) platforms (except windows where the compiler actually explicitly
promises not to reorder - their volatiles follow acquire/release in
compiler reorderings). I think hoping for compilers to not reorder
without any guarantees, waiting for things to go boom where it has
already been observed on other platforms seems dangerous. Better safe
than sorry. :)

One different solution is to redocument the contract clearly as being
acquire/release w.r.t. volatile accesses, which is currently the actual
contract provided by the implementation.
But since bugs were already discovered (which is very understandable
because no other atomic library has such semantics that I know of) and
its code is still unchanged, it seems dangerous to do so without
re-evaluating the use of OrderAccess in hotspot, which seems a bit
painful. Also such a revised contract would be a disaster when you need
to copy loads of data and then store release for instance. This would
force the copying (of potentially lots of data) to either be done using
volatile accesses rather than in an accelerated fashion (e.g. SSE
extensions, write combining), or add compiler barriers to such copying
to give the release store actual release store semantics. This would
also require fixing G1 where a bug was observed to not rely on the
intuitive and expected contract, but on an unintuitive and provably
error prone contract. Ugh!!

The only solution that seems to make sense to me is to fix the semantics
of OrderAccess (that were already changed on Linux) to be what is
expected, and that is acquire/release w.r.t. other accesses (volatile or
non-volatile).

> Doing the refactoring afterwards allows us the easier exercise of ensuring that the refactoring produces the same code that we already have studied
> in depth and believe to be correct.
>
> So as you refactor please do find a place to clarify the per-platform assumptions so if we realize our assumptions are wrong we will re-review the code.

I was hoping to do both at the same time, since there are many platforms
to be changed, and making them over and over for every type of access on
every platform to be changed is a bit tedious (and error prone) compared
to doing it once per platform, reusing the shared pattern (to be
reviewed once instead of once per platform) and review the choice of
barriers once for each platform instead of for every memory access on
every platform. Of course I don’t mind if we choose to do things in the
other order though if that is preferred. :)

If you compare the change in semantics I propose, without thinking about
the refactoring, just compare x86 on bsd and linux, you see that on
linux a compiler barrier is correctly used, but not on bsd. This is the
change that needs to be added to all TSO platforms except linux and
windows to ensure correctness.

thank you for offering and for asking for opinions :-)

Thank you for the good feedback! It looks like this change seems wanted,
so I’ll go ahead and prepare a webrev with the proposed changes. :)

Thanks,
Erik

Re: OrderAccess Refactoring

Paul Hohensee-3
In reply to this post by Erik Österlund
Hi Erik,

Being the person who put OrderAccess into its present form :), here's some
background/comments.

First, I'm happy that someone's looking at this.  I wanted to make
available both barriers and acquire/release functionality, and also provide
common composition methods.  In retrospect, I didn't separate the two
concepts well enough in the interface (the work was done as part of an ia64
port, so I was kinda fixated on ia64 concepts at the time), so that's
definitely something that should be fixed.

We understood that volatile semantics wouldn't be enough by themselves to
implement the ordering semantics we wanted on most platforms, but they
*are* necessary to prevent compilers (not the hardware though) from
reordering volatile accesses with respect to each other.  Whether or not
the volatile keyword prevents non-volatile memory accesses from being
reordered wrt volatile ones seemed a bit fuzzy in the C++ standard at the
time (and still does, at least to me), but we figured we could handle that
in the os_cpu files by checking which compiler was being used.  The code
that implemented the *hardware* semantics also went into the os_cpu
file(s).  That's why there's some seemingly silly code there, including
assembly that executes what amount to nops.

I initially wrote the code in factored form, but changed to the current
scheme because I found it difficult to see all the code for a single
platform in one place (I wasn't using an IDE :) ).  In retrospect, an
unwise decision due to the usual code duplication problems.  Mea culpa. :)

Looking forward to your webrev.

Thanks,

Paul


Re: OrderAccess Refactoring

David Holmes
In reply to this post by Erik Österlund
Erik,

Some very quick comments as there was a related meta-discussion on this
very recently (core-libs-dev) in relation to some descriptions of the
Unsafe memory barrier operations.

storestore, loadload etc are just as important (more so in my opinion)
as acquire/release.

acquire/release semantics are problematic because there is no single
common accepted meaning. Even talking about acquire() and release() as
independent operations is somewhat meaningless - it is only when they
are paired, and fixed in place that they start to give meaning to things.

load_acquire/store_release don't necessarily map cleanly to
acquire()/release() as currently described in orderAccess.hpp; nor do
they necessarily map cleanly to the same named operations on eg ARMv8;
nor to similar named entities in C++11. As Paul stated these were
introduced to map to the ia64 architecture - which I believe had
somewhat different semantics.

This whole area is a minefield because there are different concepts of
what "acquire" and "release" may mean - so you need to start with
detailed definitions. Personally, and I know I am not alone, I abhor the
acquire/release terminology and usage, I much prefer the unambiguous
storeload, storestore etc. :)

The current code often expresses things in a semantically incorrect way
eg that:

store_release(addr, val) == store(addr, val); release();

but in practice the end code provides what is required (often providing
a stronger barrier because that's all that is available on a platform).

I also agree that these primitives have to manage both compiler
reorderings and hardware reorderings.

And I agree with Karen - tackle correctness first then refactoring. It
will be too hard to track correctness otherwise. Personally, given I
work in this code quite a bit, I like to see the complete set of
definitions for a platform in one place. The duplication is something I
can live with when it is mostly one-liners.

Cheers,
David

On 10/01/2015 7:17 AM, Erik Österlund wrote:

> [Hoping to fix nesting of comments removed by server so you can actually
> see who said what in previous post :-|]
>
> Hi Karen,
>
> Thanks for the reply!
>
> On 09 Jan 2015, at 15:18, Karen Kinnear
> <[hidden email]<mailto:[hidden email]>> wrote:
>
>> Erik,
>>
>> I am delighted to have you looking at this code in detail to make sure it is accurate and maintainable.
>
> :)
>
>> In theory I like refactoring. In practice I hope there is a very clear way to specify the per-platform assumptions and to make it very obvious what the end results are
>> for a given platform and API.
>
> For each platform you only have to specify two things:
>
> 1. What is needed for acquire, release and fence for memory accesses,
> for that platform (5 LoC per platform). The pattern of when to use them
> (semantics) is generalized in shared code.
>
> 2. To provide any specializations for optimization purposes when
> applicable (certain TSO platforms that can join the access and semantics
> to a single inline asm instruction). Note that this is optional, and
> only done to get faster accesses. Many platforms do not do it at all.
>
> I think it’s very intuitive! Will provide webrev if this seems worthwhile.
>
>> Having code reviewed (and obviously missed things) earlier rounds of OrderAccess - I just wanted to note that we never thought that C++ volatile would provide
>> acquire/release semantics. volatile has always been just a compiler directive, and yes volatile-volatile only. Acquire/release are instructions to the hardware - not
>> what I would call compiler barriers.
>
> Acquire/release are semantics, not necessarily bound to neither
> compiler/hardware. It gives a certain contract to the user that expects
> these acquire/release semantics. Compiler barriers and fences is merely
> a means of providing these semantics, but the user should not have to
> care about such details.
>
> If what you are saying is that it was intended only as acquire/release
> for hardware but not constrain compiler reorderings to comply with the
> semantics, then the contract is broken (does in fact not promise acquire
> release, only on hardware level but compilers could do whatever with
> non-volatiles which is equally as bad).
>
> A store release should synchronize with an acquire load to the same
> location, like in this example:
>
> T1: store x1
> T1: release store x2
>
> T2: load acquire x2
> T2: load x1
>
> The semantics should guarantee that if load acquire x2 happens after
> store release x2 in total order, then it is guaranteed that store x1
> will be observed by the load x1, and not some old value that was there
> before store x1. This contract does not currently hold since the normal
> non-volatile memory accesses may be reordered with volatile memory
> accesses; the stores and loads may be reordered in this example by the
> compiler at will.
>
> Therefore acquire release semantics are not guaranteed in general, but
> only w.r.t. volatiles. This is neither clearly documented nor used in
> such a way.
>
> OrderedAccess is used in a way that assumes acquire/release semantics.
> But for certain TSO platforms, the contract only provides volatile
> semantics, not acquire release semantics. On other platforms, it
> properly provides acquire release semantics, which is expected.
>
> It only makes sense to by contract promise acquire/release semantics
> (both for compiler and hardware reorderings) for uses of these methods,
> and consistently do so on all platforms.
>
>> Personally I would do the correctness fixes first - and make sure they are really well tested and carefully code reviewed studying the cookbook and memory model.
>> I have not been following the more recent java memory model evolution - I believe Aleksey Snipilev has - if you have references to any updated advice
>> from Doug Lea - that would be very helpful for your reviewers to study. I know the last time I reviewed this I had to make my own small charts and walk it
>> through based on per-platform guarantees.
>
> There is nothing wrong with the Java memory model. The only problem is
> that OrderAccess promises semantics that are only guaranteed on some
> platforms.
>
>> I would find it helpful if you are aware of specific bugs - if you could list the actual problems and call those out? It is very hard to get this right and
>> hard to review - so if you were to have targeted reviews for specific fixes we could give the changes the depth of study that they need.
>
> In for instance https://bugs.openjdk.java.net/browse/JDK-8061964 this
> was observed and caused trouble for G1 card refinement that went boom.
> In the discussion I pointed out the issue of relying on volatiles as a
> cause for the crash, and it was fixed, but not consistently for all
> platforms, but only for gcc on linux where the reordering was observed
> in the generated asm output. But it is equally as dangerous on other
> (TSO) platforms (except windows where the compiler actually explicitly
> promises not to reorder - their volatiles follow acquire/release in
> compiler reorderings). I think hoping for for compilers to not reorder
> without any guarantees, waiting for things to go boom where it has
> already been observed on other platforms seems dangerous. Better safe
> than sorry. :)
>
> One different solution is to redocument the contract clearly as being
> acquire/release w.r.t. volatile accesses, which is currently the actual
> contract provided by the implementation.
> But since bugs were already discovered (which is very understandable
> because no other atomic library has such semantics that I know of) and
> its code is still unchanged, it seems dangerous to do so without
> re-evaluating the use of OrderedAccess in hotspot, which seems a bit
> painful. Also such a revised contract would be a disaster when you need
> to copy loads of data and then store release for instance. This would
> force the copying (of potentially lots of data) to either be done using
> volatile accesses rather than in an accelerated fashion (e.g. SSE
> extensions, write combining), or add compiler barriers to such copying
> to give the release store actual release store semantics. This would
> also require fixing G1 where a bug was observed to not rely on the
> intuitive and expected contract, but on an unintuitive and provably
> error prone contract. Ugh!!
>
> The only solution that seems to make sense to me is to fix the semantics
> of OrderedAccess (that were already changed on Linux) to be what is
> expected, and that is acquire/release w.r.t. other accesses (volatile or
> non-volatile).
>
>> Doing the refactoring afterwards allows us the easier exercise of ensuring that the refactoring produces the same code that we already have studied
>> in depth and believe to be correct.
>>
>> So as you refactor please do find a place to clarify the per-platform assumptions so if we realize our assumptions are wrong we will re-review the code.
>
> I was hoping to do both at the same time, since there are many platforms
> to be changed, and making them over and over for every type of access on
> every platform to be changed is a bit tedious (and error prone) compared
> to doing it once per platform, reusing the shared pattern (to be
> reviewed once instead of once per platform) and review the choice of
> barriers once for each platform instead of for every memory access on
> every platform. Of course I don’t mind if we choose to do things in the
> other order though if that is preferred. :)
>
> If you compare the change in semantics I propose, without thinking about
> the refactoring, just compare x86 on bsd and linux, you see that on
> linux a compiler barrier is correctly used, but not on bsd. This is the
> change that needs to be added to all TSO platforms except linux and
> windows to ensure correctness.
>
> thank you for offering and for asking for opinions :-)
>
> Thank you for the good feedback! It looks like this change seems wanted,
> so I’ll go ahead and prepare a webrev with the proposed changes. :)
>
> Thanks,
> Erik
>

Re: OrderAccess Refactoring

Erik Österlund
In reply to this post by Paul Hohensee-3
Hi Paul,

Thank you for clearing things up, it really helps talking to the person
who originally wrote this code!

It seems to me that this code was written at a time when what the hardware
and the compilers guarantee was vaguely and incorrectly documented.
Must have been a nightmare! I think it's still vague today.

On 09/01/15 23:45, Paul Hohensee wrote:

> We understood that volatile semantics wouldn't be enough by themselves
> to implement the semantics we wanted ordering on most platforms, but
> they *are* necessary to prevent compilers (not the hardware though) from
> reordering volatile accesses with respect to each other.  Whether or not
> the volatile keyword prevents non-volatile memory accesses from being
> reordered wrt volatile ones seemed a bit fuzzy in the C++ standard at
> the time (and still does, at least to me), but we figured we could
> handle that in the os_cpu files by checking which compiler was being
> used.  The code that implemented the *hardware* semantics also went into
> the os_cpu file(s).  That's why there's some seemingly silly code there,
> including assembly that executes what amount to nops.

I agree it was and still is very vague which is why different compilers
interpreted it differently. MSVC is explicit about using acquire release
for volatiles (wanted) and GCC is explicit about not guaranteeing
non-volatile w.r.t. volatile. I have not read anything explicit from the
others so will assume it's unsafe to rely on anything unless we find
explicit proof it is reliable.

I saw the strange barriers with nops and dummy volatile writes you speak
of, and intend to change these to normal compiler barriers instead,
because the dummy volatile memory accesses don't actually give us the
guarantees we want, for the same reason outlined before (volatile
semantics are not good enough), while also looking a bit strange at the
same time haha!
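For the record, the contrast I mean is roughly this (paraphrased from memory of the old per-platform files, so the old code shown here may not match exactly):

// Old style: a dummy volatile access standing in for a barrier. It only
// constrains other volatile accesses, so the compiler may still move
// ordinary loads and stores across it.
inline void release_old_style() {
  volatile int local_dummy = 0;
  (void)local_dummy;
}

// Intended replacement: a real compiler barrier (GCC-style) that pins all
// memory accesses, volatile or not; on TSO hardware nothing more is needed.
inline void release_new_style() {
  __asm__ volatile("" : : : "memory");
}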

> I initially wrote the code in factored form, but changed to the current
> scheme because I found it difficult to see all the code for a single
> platform in one place (I wasn't using an IDE :) ).  In retrospect, an
> unwise decision due to the usual code duplication problems.  Mea culpa. :)

:)

Thank you for the background and useful context Paul!

/Erik

Re: OrderAccess Refactoring

Erik Österlund
In reply to this post by David Holmes
Hi David,

Thank you for the feedback! :)

On 10/01/15 11:11, David Holmes wrote:
> acquire/release semantics are problematic because there is no single
> common accepted meaning. Even talking about acquire() and release() as
> independent operations is somewhat meaningless - it is only when they
> are paired, and fixed in place that they start to give meaning to things.

I agree it is a bit vague. The description in the comments about
preventing all memory accesses from floating up and down is the one I
recognize and it is conceptually good. But then it is later defined in a
table as acquire = Load(Store|Load) and release = (Store|Load)Store. The
reason is understandable - even though the description of
acquire/release is quite general, they are only intended and designed to
be used in certain contexts like release store and load acquire (with
variants). Outside such contexts, they would not give the
guarantees described.

This gets more confusing when you go on reading that fence is
conceptually acquire and release. It is true on the conceptual level
using the conceptual descriptions of acquire/release, but the actual
guarantees of release + acquire is StoreStore, LoadLoad and LoadStore
according to the tables below, while fence() also provides StoreLoad
constraints. I don't know about you but I find that a bit misleading.

I mean, it might be quite obvious for anyone using it to only use
acquire/release in the intended contexts where it makes sense and then
there is no trouble, but I guess it doesn't hurt to add some comments
making it explicit to avoid any confusion. We don't want confusion I think.

> load_acquire/store_release don't necessarily map cleanly to
> acquire()/release() as currently described in orderAccess.hpp; nor do
> they necessarily map cleanly to the same named operations on eg ARMv8;
> nor to similar named entities in C++11. As Paul stated these were
> introduced to map to the ia64 architecture - which I believe had
> somewhat different semantics.

They do map cleanly after I change acquire() and release() to compiler
barriers on TSO platforms (and fencing on PPC). There are some exceptions,
like load acquire on PPC, which was specialized for speed using twi-isync,
and x86 on windows, which doesn't need the stronger barriers since its
volatile already provides the needed compiler ordering anyway.

> This whole area is a minefield because there are different concepts of
> what "acquire" and "release" may mean - so you need to start with
> detailed definitions. Personally, and I know I am not alone, I abhor the
> acquire/release terminology and usage, I much prefer the unambiguous
> storeload, storestore etc. :)

Yes I think we need a better concrete guarantee/contract. The current
conceptual description of acquire/release is nice I think and useful
also for understanding when to use it, but I want to make it explicit
that release is only used in conjunction with writes and acquire only in
conjunction with loads (and swapping instructions etc), and therefore do
not give the same guarantees as described as the more general conceptual
description promises.

The concrete guarantee users can rely on is acquire = Load(Store|Load)
and release = (Store|Load)Store, and that's consistent with the
implementation and reasonable intended usage. Hence, it's fine to use
them in the way acquire/release is described now, iff these barriers
suffice to guarantee it, otherwise acquire/release should not be used in
the given context. Does that make sense?

> The current code often expresses things in a semantically incorrect way
> eg that:
>
> store_release(addr, val) == store(addr, val); release();
>
> but in practice the end code provides what is required (often providing
> a stronger barrier because that's all that is available on a platform).

I'm not sure I get what you mean by this being semantically incorrect. I
get it's suboptimal and might be slower triggering compiler barriers
that won't be necessary when joined, but semantically surely it should
be the same, right?

> And I agree with Karen - tackle correctness first then refactoring. IT
> will be too hard to track correctness otherwise. Personally, given I
> work in this code quite a bit, I like to see the complete set of
> definitions for a platform in one place. The duplication is something I
> can live with when it is mostly one-liners.

Okay no problem. I built the general variant first (because it was
easier since I don't have access to other platforms except mine haha).
But I fully understand yours and Karen's motivation so I will get back
to you when I manage to do it the code duplicated way so it's easier to
follow the evolution.

Oh and speaking of which, I find it very strange that store_fence and
release_store_fence are duplicated on x86 because we don't want to cast
away the volatile. This whole non-volatile thing seems a bit off to me -
surely all calls to OrderAccess should be expected to be volatile,
nothing else would make any sense? And if so, why is store_fence the one
exception that is non-volatile while none of the others are? Unless
anyone knows there is a good reason for this, I'd like to make it
consistent with the rest of the code and make it volatile as well (and
hence get rid of some code duplication while at it). Permission granted? ;)

Thank you David for a nice discussion about this! :)

/Erik

Re: OrderAccess Refactoring

David Holmes
Hi Erik,

Quick partial response ...

On 13/01/2015 12:18 AM, Erik Österlund wrote:

> Hi David,
>
> Thank you for the feedback! :)
>
> On 10/01/15 11:11, David Holmes wrote:
>> acquire/release semantics are problematic because there is no single
>> common accepted meaning. Even talking about acquire() and release() as
>> independent operations is somewhat meaningless - it is only when they
>> are paired, and fixed in place that they start to give meaning to things.
>
> I agree it is a bit vague. The description in the comments about
> preventing all memory accesses from floating up and down is the one I
> recognize and it is conceptually good. But then it is later defined in a
> table as acquire = Load(Store|Load) and release = (Store|Load)Store.

The table is not a definition, it is an example of how the
acquire/release semantics can be implemented on the different architectures:

membar loadStore|loadLoad

is sufficient to implement acquire, but not exactly equivalent -
acquire/release semantics can not be expressed exactly in terms of
loadload|loadStore etc, as acquire/release define one-way barriers
whereas loadload etc are bi-directional.

> The

> reason is understandable - even though the description of
> acquire/release is quite general, they are only intended and designed to
> be used in certain contexts like release store and load acquire (with
> variants). Outside such similar contexts, it would not give the
> guarantees described.
>
> This gets more confusing when you go on reading that fence is
> conceptually acquire and release. It is true on the conceptual level
> using the conceptual descriptions of acquire/release, but the actual
> guarantees of release + acquire is StoreStore, LoadLoad and LoadStore
> according to the tables below, while fence() also provides StoreLoad
> constraints. I don't know about you but I find that a bit misleading.

Yes conceptually fence is "back-to-back acquire and release" but that
doesn't mean fence can necessarily be implemented by concatenating the
implementations of acquire and release on any given platform. The
problem is the "back-to-back" part as you need something to stop loads
from moving down past the acquire, and stores from moving up before the
release. Hence to implement fence you need all 4 membars and you don't
need the "dummy" load or store.
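For example, on x86 the usual way to get that extra StoreLoad constraint is a locked instruction rather than a concatenation of acquire and release (a sketch along the lines of what the x86 files do, from memory, so the details may differ):

inline void fence_x86() {
#ifdef __x86_64__
  __asm__ volatile("lock; addl $0,0(%%rsp)" : : : "cc", "memory");
#else
  __asm__ volatile("lock; addl $0,0(%%esp)" : : : "cc", "memory");
#endif
}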

> I mean, it might be quite obvious for anyone using it to only use
> acquire/release in the intended contexts where it makes sense and then
> there is no trouble, but I guess it doesn't hurt to add some comments
> making it explicit to avoid any confusion. We don't want confusion I think.
>
>> load_acquire/store_release don't necessarily map cleanly to
>> acquire()/release() as currently described in orderAccess.hpp; nor do
>> they necessarily map cleanly to the same named operations on eg ARMv8;
>> nor to similar named entities in C++11. As Paul stated these were
>> introduced to map to the ia64 architecture - which I believe had
>> somewhat different semantics.
>
> They do map cleanly after I change acquire() and release() to compiler
> barriers on TSO platforms (and fencing on PPC). Except some exceptions
> like load acquire on PPC that was specialized for speed using twi-isync
> and x86 on windows that don't need the stronger barriers since volatile
> is more accurate anyway.
>
>> This whole area is a minefield because there are different concepts of
>> what "acquire" and "release" may mean - so you need to start with
>> detailed definitions. Personally, and I know I am not alone, I abhor the
>> acquire/release terminology and usage, I much prefer the unambiguous
>> storeload, storestore etc. :)
>
> Yes, I think we need a better concrete guarantee/contract. The current
> conceptual description of acquire/release is nice, I think, and also
> useful for understanding when to use it, but I want to make it explicit
> that release is only used in conjunction with writes and acquire only in
> conjunction with loads (and swapping instructions etc.), and that they
> therefore do not give the full guarantees that the more general
> conceptual description promises.

Or, put another way, there is no acquire()/release(), only
load_acquire()/store_release()?

> The concrete guarantee users can rely on is acquire = Load(Store|Load)
> and release = (Store|Load)Store, and that's consistent with the

Nope - as per above, that is not correct.

> implementation and reasonable intended usage. Hence, it's fine to use
> them in the way acquire/release is described now, iff these barriers
> suffice to guarantee it, otherwise acquire/release should not be used in
> the given context. Does that make sense?
>
>> The current code often expresses things in a semantically incorrect way
>> eg that:
>>
>> store_release(addr, val) == store(addr, val); release();
>>
>> but in practice the end code provides what is required (often providing
>> a stronger barrier because that's all that is available on a platform).
>
> I'm not sure I get what you mean by this being semantically incorrect. I
> get that it's suboptimal and might be slower, triggering compiler barriers
> that wouldn't be necessary when joined, but semantically surely it should
> be the same, right?

Sorry I got the API backwards. What we often see is:

release_store(addr, val) == release(); store(addr,val);

but semantically that is wrong because release() allows stores to move
ahead of it so the RHS could be executed as:

store(addr, val); release()

which != release_store (which binds the store tightly to the 'release'
part). But on a given platform release() might be implemented in such a
way that the store can't move ahead of it - so the code is right, but
the way it is written is potentially incorrect given the abstract
specifications. Though another way to look at this, given these are all
contained in per-platform files, is that when you see something like:

release_store(addr, val) == release(); store(addr,val);

it isn't a claim of general equivalence, but a statement that is
accurate for the implementation on the current platform. (Though I'd
still prefer not to see it written this way - and that is what I want to
fix under https://bugs.openjdk.java.net/browse/JDK-7143664 )
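
To make the abstract-specification point concrete, here is a hypothetical use
site (field name invented, HotSpot's jint and OrderAccess API assumed):

  volatile jint _flag = 0;

  void publish_combined() {
    // The store is part of the release_store operation, so it is bound to
    // the release and cannot float above it.
    OrderAccess::release_store(&_flag, 1);
  }

  void publish_split() {
    // Abstractly, release() is one-way: it only keeps *earlier* accesses
    // from moving below it, so nothing in the specification stops the
    // following store from floating up above the release().
    OrderAccess::release();
    _flag = 1;
  }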

BTW this is another point of confusion when we have an API like
release_store() but instructions like st.rel - which side of the barrier
is the store on, and does it matter?

>
>> And I agree with Karen - tackle correctness first, then refactoring. It
>> will be too hard to track correctness otherwise. Personally, given I
>> work in this code quite a bit, I like to see the complete set of
>> definitions for a platform in one place. The duplication is something I
>> can live with when it is mostly one-liners.
>
> Okay no problem. I built the general variant first (because it was
> easier since I don't have access to other platforms except mine haha).
> But I fully understand your and Karen's motivation, so I will get back
> to you when I manage to do it the code-duplicated way so it's easier to
> follow the evolution.
>
> Oh and speaking of which, I find it very strange that store_fence and
> release_store_fence are duplicated on x86 because we don't want to cast
> away the volatile. This whole non-volatile thing seems a bit off to me -

Very long, depressing history of issues with volatiles and casting with
different compilers. So what you see is the result of what needed to be
done to "make things work".

> surely all calls to OrderAccess should be expected to be volatile,

I agree, and I've suggested this in the past too, but was told that
making variables volatile incurs too much performance penalty in some
cases, so we only cast them to volatile when needed for these ops.

> nothing else would make any sense? And if so, why is store_fence the one
> exception that is non-volatile while none of the others are? Unless
> anyone knows there is a good reason for this, I'd like to make it
> consistent with the rest of the code and make it volatile as well (and
> hence get rid of some code duplication while at it). Permission granted? ;)

That case would need to be examined in detail. :)

> Thank you David for a nice discussion about this! :)

I'd like to say it is fun, but it really isn't :) This stuff gives me
continual headaches.

Cheers,
David

> /Erik
>
Reply | Threaded
Open this post in threaded view
|

Re: OrderAccess Refactoring

Erik Österlund
Hi David,

Thank you for your reply! :)

On 13/01/15 03:37, David Holmes wrote:
> The table is not a definition, it is an example of how the
> acquire/release semantics can be implemented on the different architectures:
>
> membar loadStore|loadLoad
>
> is sufficient to implement acquire, but not exactly equivalent -
> acquire/release semantics can not be expressed exactly in terms of
> loadload|loadStore etc, as acquire/release define one-way barriers
> whereas loadload etc are bi-directional.

Understood, but I think it is a bit more than just an example (agreed,
it is an approximation of the intended semantics). It is in fact how
acquire/release concretely behaves and is implemented on all platforms,
as opposed to the conceptual description, which in general cannot be
relied upon outside certain narrowed contexts that could do with being
clarified. I personally think it's useful to point out the context in
which acquire/release is defined, so that anyone in doubt can see what
is actually guaranteed by the implementation and what is not, and hence
when such semantics can be relied upon and when they can't. If you
disagree though, I'm fine with leaving the comment as it was. ;)

> Yes conceptually fence is "back-to-back acquire and release" but that
> doesn't mean fence can necessarily be implemented by concatenating the
> implementations of acquire and release on any given platform.

And that is exactly what I believe could be confusing to somebody out
there. If acquire() and release() had the general semantics described,
you would be able to call them back-to-back and get a fence; but since
they are assumed to only be paired with loads and stores in a certain
well-defined order, you never need the StoreLoad constraint - so the
combination is not a fence, and of course it shouldn't be.

> and you don't need the "dummy" load or store.

Agreed, I got rid of them all. :)

> Or, put another way, there is no acquire()/release(), only
> load_acquire()/store_release()?

Yes, these are the contexts in which the implementation gives the
intended semantics. Outside similar contexts it makes no sense to use
them. Hopefully this is clear enough to the user.

> Sorry I got the API backwards. What we often see is:
>
> release_store(addr, val) == release(); store(addr,val);
>
> but semantically that is wrong because release() allows stores to move
> ahead of it so the RHS could be executed as:
>
> store(addr, val); release()

Actually, those two forms are semantically equivalent in every respect -
by definition, by the current implementation, and by intention. And this
is why I'm stressing that acquire() and release() only make sense
together with a load and a store. Whether they are separate methods or
not is irrelevant; the semantics are the same either way, and they have
to be.

To demonstrate my point, let's take a simple example. If your reasoning
of splitting barriers from memory accesses is correct, then look what
happens in the following example:

1.a release store
1.b load acquire
1.c load acquire

...should be equivalent to:

1.a release
1.b store
1.c load
1.d acquire
1.e load
1.f acquire

...and if it weren't, we would need a full fence in acquire (cf. 1.d) to
prevent reordering between 1.b and 1.e - a guarantee the general
definition of acquire would promise if it were not constrained to its
load. Ouch!! ;)

But fear not: when they are coupled with a store/load, regardless of
whether you use release(); store() or release_store(), they are
equivalent and act /as if/ they were joined into a single instruction.
And as you can see, 1.b and 1.c may reorder without violating any rules,
so we don't have to issue any expensive full memory barriers.

This is why I'm saying that the description we have of acquire() and
release() only makes sense in a narrowed context: when acquire is only
used after a load and release only before a store. In that context the
semantics hold true to the definition, and whether you use the split or
joined form you get the same observed results and semantics. And in that
narrowed context it is equivalent to TSO memory ordering, which - I know
you didn't like it - can be defined as acquire = Load(Load|Store) and
release = (Load|Store)Store, valid w.r.t. the original definition
precisely within that narrowed context. These are also the actual
guarantees in our implementation, which is helpful. :)

I know you didn't like this as a definition, but it solves the problem
of equivalence here and precisely defines the scope in which acquire()
and release() can be used - and that scope is exactly where my
definition coincides with the current one. ;)

> (Though I'd
> still prefer not to see it written this way - and that is what I want to
> fix under https://bugs.openjdk.java.net/browse/JDK-7143664 )

I do agree it's better to use the combined variant, but maybe for a
different reason. :) They are actually semantically equivalent, so I
disagree about that. However, it is possible to provide a better barrier
for the combined variant. For instance, PPC has an optimized acquiring
load, and x86 on Windows doesn't need compiler barriers because MSVC's
volatiles already have acquire/release compiler-reordering semantics,
which is tighter, etc.

And then of course the user of the API wouldn't have to read between
the lines of the documentation to find out in which contexts the
acquire() and release() barriers actually work as defined. By removing
the possibility of doing something potentially wrong, fewer errors might
be made! ;)

> BTW this is another point of confusion when we have an API like
> release_store() but instructions like st.rel - which side of the barrier
> is the store on, and does it matter?

Yes, it does matter. The release is always before the store and the
acquire after the load - this is well defined. The reason is that it
matches an abstract machine with a FIFO store buffer, which is
equivalent to TSO - well understood and portable. If they were not on
these sides, it would be useless for this usage:

T1: write stuff
T1: release store x1

T2: load acquire x1
T2: load stuff

The release/acquire store/load to x1 should synchronize so that the
loading of stuff in T2 will see the writing of stuff in T1. Note that if
the release and acquire were on the wrong side, this would break.
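
In OrderAccess terms that usage looks roughly like the following sketch
(field names invented, jint used for illustration):

  volatile jint _ready = 0;
  jint          _stuff = 0;

  // T1
  void writer() {
    _stuff = 42;                             // write stuff
    OrderAccess::release_store(&_ready, 1);  // release store x1: the write to
                                             // _stuff cannot pass this store
  }

  // T2
  jint reader() {
    if (OrderAccess::load_acquire(&_ready) == 1) {  // load acquire x1
      return _stuff;  // load stuff - guaranteed to observe T1's write of 42
    }
    return -1;        // not yet published
  }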

> Very long, depressing history of issues with volatiles and casting with
> different compilers. So what you see is the result of what needed to be
> done to "make things work".

:)

>> surely all calls to OrderAccess should be expected to be volatile,
>
> I agree, and I've suggested this in the past too, but was told that
> making variables volatile incurs too much performance penalty in some
> cases, so we only cast them to volatile when needed for these ops.

Performance is a non-argument IMO: the non-volatile access is only used
for store_fence, which needs at least a compiler barrier for the fence
part anyway, so volatile won't make a difference at all. And even if it
did (which it doesn't), the real cost is the fence, not the volatile
memory access (because C++ volatile is not Java volatile).
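
For reference, the kind of compiler barrier I mean is just an empty
GCC-style asm with a memory clobber; it already forces the compiler to
treat all memory as potentially touched, so whether the access itself goes
through a volatile pointer adds nothing for ordering purposes:

  // No machine instruction is emitted, but the compiler may not cache
  // values in registers across it or reorder memory accesses around it.
  #define COMPILER_BARRIER() __asm__ volatile ("" : : : "memory")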

If that was the only argument, I'd like to make this consistent! :)

>> nothing else would make any sense? And if so, why is store_fence the one
>> exception that is non-volatile while none of the others are? Unless
>> anyone knows there is a good reason for this, I'd like to make it
>> consistent with the rest of the code and make it volatile as well (and
>> hence get rid of some code duplication while at it). Permission granted? ;)
>
> That case would need to be examined in detail. :)

My assessment:
1. Volatile can't impact performance in store_fence: it requires a
compiler barrier anyway, volatile in C++ only constrains compiler-level
reordering, and the fence would dominate any other effect even if there
were one (which there isn't).
2. Volatile won't break anything or create illegal reorderings, because
it is stricter than non-volatile. On the contrary, without it, it's easy
to make mistakes that lead to trouble on, for instance, Windows.
3. It makes the code more consistent, allows several copy-pastes to be
removed, and generalizes the code, making it more readable.

If I get your permission, I'll add volatile!
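
To be concrete, the change I have in mind is roughly the following sketch
of the shared definition (jint shown for illustration; the other widths
would follow the same pattern):

  // Take a volatile pointer like the other operations do, so the separate
  // non-volatile duplicate can go away.
  inline void OrderAccess::store_fence(volatile jint* p, jint v) {
    *p = v;
    fence();
  }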

>> Thank you David for a nice discussion about this! :)
>
> I'd like to say it is fun, but it really isn't :) This stuff gives me
> continual headaches.

Sorry about the headaches; I feel a bit guilty! Hopefully this cleanup
will relieve at least some future headaches, though. :)

Thanks for the discussion, and I hope this email won't make the headache
worse!

/Erik