RFR: 8261649: AArch64: Optimize LSE atomics in C++ code

classic Classic list List threaded Threaded
18 messages Options
Reply | Threaded
Open this post in threaded view
|

RFR: 8261649: AArch64: Optimize LSE atomics in C++ code

Andrew Haley-2
8261649: AArch64: Optimize LSE atomics in C++ code

-------------

Commit messages:
 - Restore LSE CAS generation.
 - Merge https://github.com/openjdk/jdk into JDK-8261650
 - Trailing membar.
 - Flush everything correctly
 - Committed
 - Intermediate

Changes: https://git.openjdk.java.net/jdk/pull/2611/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=2611&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8261649
  Stats: 282 lines in 5 files changed: 166 ins; 51 del; 65 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2611.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2611/head:pull/2611

PR: https://git.openjdk.java.net/jdk/pull/2611
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261649: AArch64: Optimize LSE atomics in C++ code

Andrew Dinn-2
On Wed, 17 Feb 2021 17:14:46 GMT, Andrew Haley <[hidden email]> wrote:

> 8261649: AArch64: Optimize LSE atomics in C++ code

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5651:

> 5649:     __ mov(prev, compare_val);
> 5650:     __ lse_cas(prev, exchange_val, ptr, size, acquire, release, /*not_pair*/true);
> 5651:     if (acquire && release) {

These two flags are only ever passed as true,true or false,false. Does any other combination make sense? If not then should you not be using a single flag? or at least asserting (pro tem) that they are both equal?

src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.S line 75:

> 73:         .align 5
> 74: aarch64_atomic_cmpxchg_1_default_impl:
> 75:         dmb     ish

Having argued above that this dmb is never needed why is it in this default impl? (also for size 4 and 8)

-------------

PR: https://git.openjdk.java.net/jdk/pull/2611
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261649: AArch64: Optimize LSE atomics in C++ code

Andrew Haley-2
On Wed, 17 Feb 2021 17:48:06 GMT, Andrew Dinn <[hidden email]> wrote:

>> 8261649: AArch64: Optimize LSE atomics in C++ code
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5651:
>
>> 5649:     __ mov(prev, compare_val);
>> 5650:     __ lse_cas(prev, exchange_val, ptr, size, acquire, release, /*not_pair*/true);
>> 5651:     if (acquire && release) {
>
> These two flags are only ever passed as true,true or false,false. Does any other combination make sense? If not then should you not be using a single flag? or at least asserting (pro tem) that they are both equal?

Today HotSpot only really supports mo_conservative and mo_relaxed, but there are many where release on its own would make sense; I think Aleksey  recently found some. Having said that, it would be clearer here to expose mo_conservative as well. I'll do so.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2611
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261649: AArch64: Optimize LSE atomics in C++ code

Andrew Haley-2
In reply to this post by Andrew Dinn-2
On Wed, 17 Feb 2021 17:51:35 GMT, Andrew Dinn <[hidden email]> wrote:

>> 8261649: AArch64: Optimize LSE atomics in C++ code
>
> src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.S line 75:
>
>> 73:         .align 5
>> 74: aarch64_atomic_cmpxchg_1_default_impl:
>> 75:         dmb     ish
>
> Having argued above that this dmb is never needed why is it in this default impl? (also for size 4 and 8)

The default impl uses  ARMv8 LDXR ... STXR but the DMB is not needed if and only if ARMv8.1 LSE instructions.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2611
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261649: AArch64: Optimize LSE atomics in C++ code

Andrew Dinn-2
On Thu, 18 Feb 2021 09:20:21 GMT, Andrew Haley <[hidden email]> wrote:

>> src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.S line 75:
>>
>>> 73:         .align 5
>>> 74: aarch64_atomic_cmpxchg_1_default_impl:
>>> 75:         dmb     ish
>>
>> Having argued above that this dmb is never needed why is it in this default impl? (also for size 4 and 8)
>
> The default impl uses  ARMv8 LDXR ... STXR but the DMB is not needed if and only if ARMv8.1 LSE instructions.

Doh! And it says that clearly in the comment. Ok.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2611
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261649: AArch64: Optimize LSE atomics in C++ code

Andrew Dinn-2
In reply to this post by Andrew Haley-2
On Wed, 17 Feb 2021 17:14:46 GMT, Andrew Haley <[hidden email]> wrote:

> Now that we have support for LSE atomics in C++ HotSpot source, we can generate much better code for them. In particular, the sequence we generate for CMPXCHG with a full two-way barrier using two DMBs is way suboptimal.
>
> This patch:
>
> Moves memory barriers from the atomic_linux_aarch64 file into the stubs.
> Rewrites the LSE versions of the stubs to be more efficient.
> Fixes a race condition in stub generation.
> Mostly leaves the pre-LSE stubs alone, except that I added a PRFM which according to kernel engineers improves performance.

Ok, this looks good enough to me.

-------------

Marked as reviewed by adinn (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/2611
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261649: AArch64: Optimize LSE atomics in C++ code [v2]

Andrew Haley-2
In reply to this post by Andrew Haley-2
> Now that we have support for LSE atomics in C++ HotSpot source, we can generate much better code for them. In particular, the sequence we generate for CMPXCHG with a full two-way barrier using two DMBs is way suboptimal.
>
> This patch:
>
> Moves memory barriers from the atomic_linux_aarch64 file into the stubs.
> Rewrites the LSE versions of the stubs to be more efficient.
> Fixes a race condition in stub generation.
> Mostly leaves the pre-LSE stubs alone, except that I added a PRFM which according to kernel engineers improves performance.

Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:

  Make things slightly less confusing

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/2611/files
  - new: https://git.openjdk.java.net/jdk/pull/2611/files/23df4485..68ff88d9

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2611&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2611&range=00-01

  Stats: 21 lines in 1 file changed: 8 ins; 0 del; 13 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2611.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2611/head:pull/2611

PR: https://git.openjdk.java.net/jdk/pull/2611
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261649: AArch64: Optimize LSE atomics in C++ code [v2]

Andrew Haley-2
In reply to this post by Andrew Haley-2
On Thu, 18 Feb 2021 09:15:12 GMT, Andrew Haley <[hidden email]> wrote:

>> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5651:
>>
>>> 5649:     __ mov(prev, compare_val);
>>> 5650:     __ lse_cas(prev, exchange_val, ptr, size, acquire, release, /*not_pair*/true);
>>> 5651:     if (acquire && release) {
>>
>> These two flags are only ever passed as true,true or false,false. Does any other combination make sense? If not then should you not be using a single flag? or at least asserting (pro tem) that they are both equal?
>
> Today HotSpot only really supports mo_conservative and mo_relaxed, but there are many places in HotSpot where release on its own would make sense; I think Aleksey  recently found some. Having said that, it would be clearer here to expose mo_conservative as well. I'll do so.

Clearer now?

-------------

PR: https://git.openjdk.java.net/jdk/pull/2611
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261649: AArch64: Optimize LSE atomics in C++ code [v3]

Andrew Haley-2
In reply to this post by Andrew Haley-2
> Now that we have support for LSE atomics in C++ HotSpot source, we can generate much better code for them. In particular, the sequence we generate for CMPXCHG with a full two-way barrier using two DMBs is way suboptimal.
>
> This patch:
>
> Moves memory barriers from the atomic_linux_aarch64 file into the stubs.
> Rewrites the LSE versions of the stubs to be more efficient.
> Fixes a race condition in stub generation.
> Mostly leaves the pre-LSE stubs alone, except that I added a PRFM which according to kernel engineers improves performance.

Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:

  Remove mistaken change.

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/2611/files
  - new: https://git.openjdk.java.net/jdk/pull/2611/files/68ff88d9..6b338901

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2611&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2611&range=01-02

  Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2611.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2611/head:pull/2611

PR: https://git.openjdk.java.net/jdk/pull/2611
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261649: AArch64: Optimize LSE atomics in C++ code [v3]

Aleksey Shipilev-5
On Thu, 18 Feb 2021 16:52:29 GMT, Andrew Haley <[hidden email]> wrote:

>> Now that we have support for LSE atomics in C++ HotSpot source, we can generate much better code for them. In particular, the sequence we generate for CMPXCHG with a full two-way barrier using two DMBs is way suboptimal.
>>
>> This patch:
>>
>> Moves memory barriers from the atomic_linux_aarch64 file into the stubs.
>> Rewrites the LSE versions of the stubs to be more efficient.
>> Fixes a race condition in stub generation.
>> Mostly leaves the pre-LSE stubs alone, except that I added a PRFM which according to kernel engineers improves performance.
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
>
>   Remove mistaken change.

Passer-by comments...

src/hotspot/cpu/aarch64/vm_version_aarch64.cpp line 192:

> 190:   if (_cpu == CPU_ARM && (_model == 0xd07 || _model2 == 0xd07)) _features |= CPU_STXR_PREFETCH;
> 191:
> 192:   _features |= CPU_STXR_PREFETCH;

This looks weird. The line above it adds `CPU_STXR_PREFETCH` conditionally. Which one is correct?

-------------

PR: https://git.openjdk.java.net/jdk/pull/2611
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261649: AArch64: Optimize LSE atomics in C++ code [v3]

Andrew Haley-2
In reply to this post by Andrew Dinn-2
On Thu, 18 Feb 2021 16:28:47 GMT, Andrew Dinn <[hidden email]> wrote:

> Ok, this looks good enough to me.

I pulled the vm_version.cpp change, which I committed by mistake. I think it's the right thing to do, but it needs a separate discussion. OK?

-------------

PR: https://git.openjdk.java.net/jdk/pull/2611
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261649: AArch64: Optimize LSE atomics in C++ code [v2]

Aleksey Shipilev-5
In reply to this post by Andrew Haley-2
On Thu, 18 Feb 2021 16:49:00 GMT, Andrew Haley <[hidden email]> wrote:

>> Now that we have support for LSE atomics in C++ HotSpot source, we can generate much better code for them. In particular, the sequence we generate for CMPXCHG with a full two-way barrier using two DMBs is way suboptimal.
>>
>> This patch:
>>
>> Moves memory barriers from the atomic_linux_aarch64 file into the stubs.
>> Rewrites the LSE versions of the stubs to be more efficient.
>> Fixes a race condition in stub generation.
>> Mostly leaves the pre-LSE stubs alone, except that I added a PRFM which according to kernel engineers improves performance.
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
>
>   Make things slightly less confusing

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5655:

> 5653:       acquire = false, release = false; break;
> 5654:     default:
> 5655:       acquire = true, release = true; break;

I think hotspot style is to indent the `case`-s. Also commas in `acquire` and `release` assignments look weird. Consider:

    Register prev = r3;
    Register ptr = c_rarg0;
    Register compare_val = c_rarg1;
    Register exchange_val = c_rarg2;

    bool acquire, release;
    switch (order) {
      case memory_order_relaxed:
        acquire = false;
        release = false;
        break;
      default:
        acquire = true;
        release = true;
        break;
    }

-------------

PR: https://git.openjdk.java.net/jdk/pull/2611
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261649: AArch64: Optimize LSE atomics in C++ code [v2]

Andrew Haley-2
On Thu, 18 Feb 2021 16:49:26 GMT, Aleksey Shipilev <[hidden email]> wrote:

>> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Make things slightly less confusing
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5655:
>
>> 5653:       acquire = false, release = false; break;
>> 5654:     default:
>> 5655:       acquire = true, release = true; break;
>
> I think hotspot style is to indent the `case`-s. Also commas in `acquire` and `release` assignments look weird. Consider:
>
>     Register prev = r3;
>     Register ptr = c_rarg0;
>     Register compare_val = c_rarg1;
>     Register exchange_val = c_rarg2;
>
>     bool acquire, release;
>     switch (order) {
>       case memory_order_relaxed:
>         acquire = false;
>         release = false;
>         break;
>       default:
>         acquire = true;
>         release = true;
>         break;
>     }

Oh noes! I'll have to change my emacs indentation setup. I may be gone some time...

-------------

PR: https://git.openjdk.java.net/jdk/pull/2611
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261649: AArch64: Optimize LSE atomics in C++ code [v4]

Andrew Haley-2
In reply to this post by Andrew Haley-2
> Now that we have support for LSE atomics in C++ HotSpot source, we can generate much better code for them. In particular, the sequence we generate for CMPXCHG with a full two-way barrier using two DMBs is way suboptimal.
>
> This patch:
>
> Moves memory barriers from the atomic_linux_aarch64 file into the stubs.
> Rewrites the LSE versions of the stubs to be more efficient.
> Fixes a race condition in stub generation.
> Mostly leaves the pre-LSE stubs alone, except that I added a PRFM which according to kernel engineers improves performance.

Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:

  Switch statement layout.

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/2611/files
  - new: https://git.openjdk.java.net/jdk/pull/2611/files/6b338901..83016c1d

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2611&range=03
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2611&range=02-03

  Stats: 8 lines in 1 file changed: 4 ins; 0 del; 4 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2611.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2611/head:pull/2611

PR: https://git.openjdk.java.net/jdk/pull/2611
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261649: AArch64: Optimize LSE atomics in C++ code [v2]

Andrew Haley-2
In reply to this post by Andrew Haley-2
On Thu, 18 Feb 2021 16:58:39 GMT, Andrew Haley <[hidden email]> wrote:

>> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5655:
>>
>>> 5653:       acquire = false, release = false; break;
>>> 5654:     default:
>>> 5655:       acquire = true, release = true; break;
>>
>> I think hotspot style is to indent the `case`-s. Also commas in `acquire` and `release` assignments look weird. Consider:
>>
>>     Register prev = r3;
>>     Register ptr = c_rarg0;
>>     Register compare_val = c_rarg1;
>>     Register exchange_val = c_rarg2;
>>
>>     bool acquire, release;
>>     switch (order) {
>>       case memory_order_relaxed:
>>         acquire = false;
>>         release = false;
>>         break;
>>       default:
>>         acquire = true;
>>         release = true;
>>         break;
>>     }
>
> Oh noes! I'll have to change my emacs indentation setup. I may be gone some time...

Ha! Turns out it's an emacs FAQ: (c-set-offset 'case-label '+)

-------------

PR: https://git.openjdk.java.net/jdk/pull/2611
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261649: AArch64: Optimize LSE atomics in C++ code [v4]

Andrew Haley-2
In reply to this post by Aleksey Shipilev-5
On Thu, 18 Feb 2021 16:43:29 GMT, Aleksey Shipilev <[hidden email]> wrote:

>> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Switch statement layout.
>
> src/hotspot/cpu/aarch64/vm_version_aarch64.cpp line 192:
>
>> 190:   if (_cpu == CPU_ARM && (_model == 0xd07 || _model2 == 0xd07)) _features |= CPU_STXR_PREFETCH;
>> 191:
>> 192:   _features |= CPU_STXR_PREFETCH;
>
> This looks weird. The line above it adds `CPU_STXR_PREFETCH` conditionally. Which one is correct?

It was a mistake. Gone now.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2611
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261649: AArch64: Optimize LSE atomics in C++ code [v4]

Andrew Dinn-2
In reply to this post by Aleksey Shipilev-5
On Thu, 18 Feb 2021 16:52:29 GMT, Aleksey Shipilev <[hidden email]> wrote:

>> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Switch statement layout.
>
> Passer-by comments...

The memory_order_conservative/relaxed change is better. still good to go.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2611
Reply | Threaded
Open this post in threaded view
|

Integrated: 8261649: AArch64: Optimize LSE atomics in C++ code

Andrew Haley-2
In reply to this post by Andrew Haley-2
On Wed, 17 Feb 2021 17:14:46 GMT, Andrew Haley <[hidden email]> wrote:

> Now that we have support for LSE atomics in C++ HotSpot source, we can generate much better code for them. In particular, the sequence we generate for CMPXCHG with a full two-way barrier using two DMBs is way suboptimal.
>
> This patch:
>
> Moves memory barriers from the atomic_linux_aarch64 file into the stubs.
> Rewrites the LSE versions of the stubs to be more efficient.
> Fixes a race condition in stub generation.
> Mostly leaves the pre-LSE stubs alone, except that I added a PRFM which according to kernel engineers improves performance.

This pull request has now been integrated.

Changeset: 1b0c36b0
Author:    Andrew Haley <[hidden email]>
URL:       https://git.openjdk.java.net/jdk/commit/1b0c36b0
Stats:     285 lines in 4 files changed: 170 ins; 45 del; 70 mod

8261649: AArch64: Optimize LSE atomics in C++ code

Reviewed-by: adinn

-------------

PR: https://git.openjdk.java.net/jdk/pull/2611