Now that we have support for LSE atomics in C++ HotSpot source, we can generate much better code for them. In particular, the sequence we generate for CMPXCHG with a full two-way barrier using two DMBs is way suboptimal.
Barrier-ordered-before, Arm Architecture Reference Manual B2.3:

| Barrier instructions order prior Memory effects before subsequent
| Memory effects generated by the same Observer. A read or a write RW1
| is Barrier-ordered-before a read or a write RW2 from the same Observer
| if and only if RW1 appears in program order before RW2 and any of the
| following cases apply:
|
| [...]
|
| * RW1 appears in program order before an atomic instruction with both
|   Acquire and Release semantics that appears in program order before RW2.

So a prior load or store cannot be reordered with the load of an atomic swap with Acquire and Release semantics. This barrier-ordered-before in combination with sequential consistency gives us everything we need for a full barrier. However, we still need a DMB after the cmpxchg to ensure that subsequent loads and stores cannot be reordered with the store in an atomic instruction.

-------------

Commit messages:
 - Everything

Changes: https://git.openjdk.java.net/jdk/pull/2612/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=2612&range=00
Issue: https://bugs.openjdk.java.net/browse/JDK-8261649
Stats: 280 lines in 4 files changed: 164 ins; 51 del; 65 mod
Patch: https://git.openjdk.java.net/jdk/pull/2612.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/2612/head:pull/2612

PR: https://git.openjdk.java.net/jdk/pull/2612
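The ordering argument in the first message (an atomic with both Acquire and Release semantics, followed by one trailing barrier) can be sketched in portable C++. This is only an illustration under stated assumptions, not the HotSpot stub code from the patch: the function name `cmpxchg_full_barrier` is hypothetical, and the mapping to instructions is an expectation rather than a guarantee. On an AArch64 compiler targeting LSE, one would expect the compare-exchange to become a single CASAL and the fence to become a DMB, but that depends on the toolchain and flags.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper for illustration only; not taken from the patch.
// Compare-and-exchange intended to behave as a full two-way barrier.
// Returns the value that was in dest before the operation.
inline uint64_t cmpxchg_full_barrier(std::atomic<uint64_t>& dest,
                                     uint64_t compare_value,
                                     uint64_t exchange_value) {
  uint64_t observed = compare_value;
  // Acquire+Release atomic: prior loads and stores are ordered before it
  // (the "barrier-ordered-before" case quoted from the Arm manual).
  dest.compare_exchange_strong(observed, exchange_value,
                               std::memory_order_acq_rel,
                               std::memory_order_acquire);
  // Trailing fence: keeps subsequent loads and stores from being
  // reordered with the store half of the atomic.
  std::atomic_thread_fence(std::memory_order_seq_cst);
  // On success observed still equals compare_value (the old value);
  // on failure it was overwritten with the current value of dest.
  return observed;
}
```

Compared with wrapping a relaxed compare-exchange in two fences, this form needs only one explicit barrier, which is the saving the message describes.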
On Wed, 17 Feb 2021 18:06:55 GMT, Andrew Haley <[hidden email]> wrote:

> Now that we have support for LSE atomics in C++ HotSpot source, we can generate much better code for them.
> [...]

This patch:

Moves memory barriers from the atomic_linux_aarch64 file into the stubs.
Rewrites the LSE versions of the stubs to be more efficient.
Fixes a race condition in stub generation.
Mostly leaves the pre-LSE stubs alone, except that I added a PRFM which according to kernel engineers improves performance.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2612
On Wed, 17 Feb 2021 18:06:55 GMT, Andrew Haley <[hidden email]> wrote:

> Now that we have support for LSE atomics in C++ HotSpot source, we can generate much better code for them.
> [...]

This pull request has been closed without being integrated.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2612
On Wed, 17 Feb 2021 18:15:02 GMT, Andrew Haley <[hidden email]> wrote:

>> Now that we have support for LSE atomics in C++ HotSpot source, we can generate much better code for them.
>> [...]
>
> This patch:
> [...]

Closing because this is a duplicate.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2612