RFR: 8261649: AArch64: Optimize LSE atomics in C++ code

RFR: 8261649: AArch64: Optimize LSE atomics in C++ code

Andrew Haley-2
Now that we have support for LSE atomics in C++ HotSpot source, we can generate much better code for them. In particular, the sequence we generate for CMPXCHG with a full two-way barrier using two DMBs is way suboptimal.

Barrier-ordered-before, Arm Architecture Reference Manual B2.3:

   | Barrier instructions order prior Memory effects before subsequent
   | Memory effects generated by the same Observer. A read or a write RW1
   | is Barrier-ordered-before a read or a write RW2 from the same Observer
   | if and only if RW1 appears in program order before RW2 and any of the
   | following cases apply:
   |
   | [...]
   |
   | * RW1 appears in program order before an atomic instruction with both
   | Acquire and Release semantics that appears in program order before RW2.

So a prior load or store cannot be reordered with the load of an atomic swap with Acquire and Release semantics. This barrier-ordered-before in combination with sequential consistency gives us everything we need for a full barrier. However, we still need a DMB after the cmpxchg to ensure that subsequent loads and stores cannot be reordered with the store in an atomic instruction.
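
As a rough illustration of the shape described above (one atomic with both Acquire and Release semantics, followed by a single trailing barrier), here is a minimal portable C++ sketch. It is not the stub code from this patch: the helper name `cmpxchg_full_barrier` is hypothetical, and it uses `std::atomic` rather than HotSpot's own atomics. When built for AArch64 with LSE enabled, a compiler would typically lower the acquire-release compare-exchange to a CASAL and the fence to a DMB, but that lowering is an assumption about the toolchain, and the C++ memory model does not by itself state the Arm-level guarantee quoted above.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper for illustration only; not HotSpot's Atomic::cmpxchg.
// Returns the value observed in dest, which equals compare_value on success.
inline uint64_t cmpxchg_full_barrier(std::atomic<uint64_t>& dest,
                                     uint64_t compare_value,
                                     uint64_t exchange_value) {
  uint64_t observed = compare_value;
  // One atomic with both Acquire and Release semantics. Earlier loads and
  // stores are ordered before it, so no leading barrier is issued here.
  dest.compare_exchange_strong(observed, exchange_value,
                               std::memory_order_acq_rel,
                               std::memory_order_acquire);
  // Trailing barrier: keeps later loads and stores from being reordered
  // with the store performed by the atomic instruction.
  std::atomic_thread_fence(std::memory_order_seq_cst);
  return observed;
}
```

A caller would compare the return value with `compare_value` to learn whether the exchange took place.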

-------------

Commit messages:
 - Everything

Changes: https://git.openjdk.java.net/jdk/pull/2612/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=2612&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8261649
  Stats: 280 lines in 4 files changed: 164 ins; 51 del; 65 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2612.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2612/head:pull/2612

PR: https://git.openjdk.java.net/jdk/pull/2612

Re: RFR: 8261649: AArch64: Optimize LSE atomics in C++ code

Andrew Haley-2
On Wed, 17 Feb 2021 18:06:55 GMT, Andrew Haley <[hidden email]> wrote:

> Now that we have support for LSE atomics in C++ HotSpot source, we can generate much better code for them. In particular, the sequence we generate for CMPXCHG with a full two-way barrier using two DMBs is way suboptimal.
>
> Barrier-ordered-before, Arm Architecture Reference Manual B2.3:
>
>    | Barrier instructions order prior Memory effects before subsequent
>    | Memory effects generated by the same Observer. A read or a write RW1
>    | is Barrier-ordered-before a read or a write RW2 from the same Observer
>    | if and only if RW1 appears in program order before RW2 and any of the
>    | following cases apply:
>    |
>    | [...]
>    |
>    | * RW1 appears in program order before an atomic instruction with both
>    | Acquire and Release semantics that appears in program order before RW2.
>
> So a prior load or store cannot be reordered with the load of an atomic swap with Acquire and Release semantics. This barrier-ordered-before in combination with sequential consistency gives us everything we need for a full barrier. However, we still need a DMB after the cmpxchg to ensure that subsequent loads and stores cannot be reordered with the store in an atomic instruction.

This patch:

 - Moves memory barriers from the atomic_linux_aarch64 file into the stubs.
 - Rewrites the LSE versions of the stubs to be more efficient.
 - Fixes a race condition in stub generation.
 - Mostly leaves the pre-LSE stubs alone, except that I added a PRFM, which according to kernel engineers improves performance.
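
To make the PRFM point concrete, here is a small hedged sketch of a prefetch-for-write hint placed ahead of a compare-and-swap retry loop. It is not the stub code from this patch: `fetch_add_with_prefetch` is a hypothetical name, and `__builtin_prefetch` is a GCC/Clang builtin, guarded so the code still compiles elsewhere. On AArch64 without LSE, such a loop would typically be lowered to an exclusive load/store pair and the builtin to a PRFM, but both are assumptions about the compiler.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper for illustration only; not the stub from the patch.
// Atomically adds add_value to dest and returns the previous value.
inline uint64_t fetch_add_with_prefetch(std::atomic<uint64_t>& dest,
                                        uint64_t add_value) {
#if defined(__GNUC__) || defined(__clang__)
  // Hint that the cache line is about to be written (second argument 1).
  __builtin_prefetch(&dest, 1);
#endif
  uint64_t old_value = dest.load(std::memory_order_relaxed);
  // On failure, compare_exchange_weak reloads old_value, so the loop
  // retries with the freshly observed contents.
  while (!dest.compare_exchange_weak(old_value, old_value + add_value,
                                     std::memory_order_acq_rel,
                                     std::memory_order_relaxed)) {
  }
  return old_value;
}
```

The prefetch is only a hint, so the function behaves the same with or without it; any benefit would show up as timing rather than as a change in results.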

-------------

PR: https://git.openjdk.java.net/jdk/pull/2612

Withdrawn: 8261649: AArch64: Optimize LSE atomics in C++ code

Andrew Haley-2
In reply to this post by Andrew Haley-2
On Wed, 17 Feb 2021 18:06:55 GMT, Andrew Haley <[hidden email]> wrote:

> Now that we have support for LSE atomics in C++ HotSpot source, we can generate much better code for them. In particular, the sequence we generate for CMPXCHG with a full two-way barrier using two DMBs is way suboptimal.
>
> Barrier-ordered-before, Arm Architecture Reference Manual B2.3:
>
>    | Barrier instructions order prior Memory effects before subsequent
>    | Memory effects generated by the same Observer. A read or a write RW1
>    | is Barrier-ordered-before a read or a write RW2 from the same Observer
>    | if and only if RW1 appears in program order before RW2 and any of the
>    | following cases apply:
>    |
>    | [...]
>    |
>    | * RW1 appears in program order before an atomic instruction with both
>    | Acquire and Release semantics that appears in program order before RW2.
>
> So a prior load or store cannot be reordered with the load of an atomic swap with Acquire and Release semantics. This barrier-ordered-before in combination with sequential consistency gives us everything we need for a full barrier. However, we still need a DMB after the cmpxchg to ensure that subsequent loads and stores cannot be reordered with the store in an atomic instruction.

This pull request has been closed without being integrated.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2612

Re: RFR: 8261649: AArch64: Optimize LSE atomics in C++ code

Andrew Haley-2
In reply to this post by Andrew Haley-2
On Wed, 17 Feb 2021 18:15:02 GMT, Andrew Haley <[hidden email]> wrote:

>> Now that we have support for LSE atomics in C++ HotSpot source, we can generate much better code for them. In particular, the sequence we generate for CMPXCHG with a full two-way barrier using two DMBs is way suboptimal.
>>
>> Barrier-ordered-before, Arm Architecture Reference Manual B2.3:
>>
>>    | Barrier instructions order prior Memory effects before subsequent
>>    | Memory effects generated by the same Observer. A read or a write RW1
>>    | is Barrier-ordered-before a read or a write RW2 from the same Observer
>>    | if and only if RW1 appears in program order before RW2 and any of the
>>    | following cases apply:
>>    |
>>    | [...]
>>    |
>>    | * RW1 appears in program order before an atomic instruction with both
>>    | Acquire and Release semantics that appears in program order before RW2.
>>
>> So a prior load or store cannot be reordered with the load of an atomic swap with Acquire and Release semantics. This barrier-ordered-before in combination with sequential consistency gives us everything we need for a full barrier. However, we still need a DMB after the cmpxchg to ensure that subsequent loads and stores cannot be reordered with the store in an atomic instruction.
>
> This patch:
>
>  - Moves memory barriers from the atomic_linux_aarch64 file into the stubs.
>  - Rewrites the LSE versions of the stubs to be more efficient.
>  - Fixes a race condition in stub generation.
>  - Mostly leaves the pre-LSE stubs alone, except that I added a PRFM, which according to kernel engineers improves performance.

Closing because this is a duplicate.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2612