RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code

RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code

Andrew Haley-2
Go back a few years, and there were simple atomic load/store exclusive
instructions on Arm. Say you want to do an atomic increment of a
counter. You'd do an atomic load to get the counter into your local cache
in exclusive state, increment that counter locally, then write that
incremented counter back to memory with an atomic store. All the time
that cache line was in exclusive state, so you're guaranteed that
no-one else changed anything on that cache line while you had it.

This is hard to scale on a very large system (e.g. Fugaku) because if
many processors are incrementing that counter you get a lot of cache
line ping-ponging between cores.

So, Arm decided to add a locked memory increment instruction that
works without needing to load an entire line into local cache. It's a
single instruction that loads, increments, and writes back. The secret
is to send a cache control message to whichever processor owns the
cache line containing the count, tell that processor to increment the
counter and return the incremented value. That way cache coherency
traffic is minimized. This new set of instructions is known as Large
System Extensions, or LSE.
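The difference between the two approaches can be sketched with standard C++ atomics (an illustrative sketch, not the HotSpot code): a CAS retry loop is lowered to ldxr/stxr on pre-LSE targets, while a plain fetch_add can become a single ldadd when the compiler is allowed to target LSE.

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> counter{0};

// LL/SC style: load, modify locally, attempt the store, retry on failure.
// On AArch64 without LSE a loop like this compiles to an ldxr/stxr retry loop.
uint64_t increment_llsc_style() {
  uint64_t old_val = counter.load(std::memory_order_relaxed);
  while (!counter.compare_exchange_weak(old_val, old_val + 1,
                                        std::memory_order_seq_cst)) {
    // on failure, old_val has been reloaded with the current value; retry
  }
  return old_val + 1;
}

// Single-RMW style: with -march=armv8-a+lse the fetch_add below can
// compile to a single ldadd instruction, no retry loop needed.
uint64_t increment_lse_style() {
  return counter.fetch_add(1, std::memory_order_seq_cst) + 1;
}
```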

Unfortunately, on recent processors the "old" load/store exclusive
instructions sometimes perform very badly. Therefore, it's now
necessary for software to detect which version of Arm it's running
on, and use the "new" LSE instructions if they're available. Otherwise
performance can be very poor under heavy contention.

GCC's -moutline-atomics does this by providing library calls which use
LSE if it's available, but this option is only provided on newer
versions of GCC. This is particularly problematic with older versions
of OpenJDK, which build using old GCC versions.
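On Linux the runtime detection itself is straightforward: the kernel advertises LSE through the auxiliary vector. A minimal sketch (this is how such detection generally works, not the HotSpot VM_Version code):

```cpp
#include <cstdio>
#if defined(__linux__)
#include <sys/auxv.h>   // getauxval, AT_HWCAP
#endif

// On Linux/AArch64 the kernel sets the HWCAP_ATOMICS bit (bit 8 of
// AT_HWCAP) when the CPU implements the LSE atomic instructions.
#ifndef HWCAP_ATOMICS
#define HWCAP_ATOMICS (1 << 8)
#endif

bool cpu_has_lse() {
#if defined(__linux__) && defined(__aarch64__)
  return (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
#else
  return false;  // not Linux/AArch64: conservatively report no LSE
#endif
}
```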

Also, I suspect that some other operating systems could use this.
Perhaps not macOS, given that all Apple CPUs support LSE, but
maybe Windows.

-------------

Commit messages:
 - Hoist load of stub pointer before FULL_MEM_BARRIER.
 - Oops
 - Move stuff around
 - Cleanup
 - Intermediate for perf test
 - Untabify
 - Intermediate
 - Intermediate

Changes: https://git.openjdk.java.net/jdk/pull/2434/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=2434&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8261027
  Stats: 396 lines in 6 files changed: 362 ins; 6 del; 28 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2434.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2434/head:pull/2434

PR: https://git.openjdk.java.net/jdk/pull/2434

Re: RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code

Andrew Haley-2
On Fri, 5 Feb 2021 18:56:46 GMT, Andrew Haley <[hidden email]> wrote:

> Go back a few years, and there were simple atomic load/store exclusive
> instructions on Arm. [...]

With regard to performance, the overhead of the `call ... ret` sequence seems to be almost negligible on the systems I've tested.

On ThunderX2 there is little difference whatever you do; a straight-line count-and-increment loop is about 5% slower.

On Neoverse N1 there is some 25% straight-line improvement for a simple count and increment loop with this patch. GCC's -moutline-atomics isn't quite as good as this patch, with only a 17% improvement.

But simple straight-line tests aren't really the point of LSE. The big performance hit with the "old" atomics happens at times of heavy contention, when fairness problems cause severe scaling issues. This is more likely to be a problem on large systems with many cores and large heaps.
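The kind of workload where the difference shows up can be sketched like this (an illustrative micro-benchmark skeleton, not the test used for the numbers below): many threads hammering one cache line, which is exactly where failed stxr retries and fairness problems bite with LL/SC atomics.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// All threads contend on a single atomic counter, i.e. a single cache
// line. Under LL/SC this is where ping-ponging and retry storms occur;
// LSE offloads the increment to wherever the line currently lives.
uint64_t contended_count(int nthreads, uint64_t iters_per_thread) {
  std::atomic<uint64_t> counter{0};
  std::vector<std::thread> threads;
  for (int t = 0; t < nthreads; t++) {
    threads.emplace_back([&] {
      for (uint64_t i = 0; i < iters_per_thread; i++)
        counter.fetch_add(1, std::memory_order_seq_cst);
    });
  }
  for (auto& th : threads) th.join();
  return counter.load();
}
```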


**ThunderX2:**

Baseline:

real 0m24.001s

Patched:

-XX:+UseLSE

real 0m25.222s

-XX:-UseLSE

real 0m25.215s

Built with -moutline-atomics:

real 0m25.227s


**Neoverse N1:**

Baseline:

real 0m10.027s

Patched:

-XX:+UseLSE

real 0m8.027s

-XX:-UseLSE

real 0m10.429s

Built with -moutline-atomics:

real 0m8.538s

-------------

PR: https://git.openjdk.java.net/jdk/pull/2434

Re: RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code

Nick Gasson
In reply to this post by Andrew Haley-2
On Fri, 5 Feb 2021 18:56:46 GMT, Andrew Haley <[hidden email]> wrote:

> Go back a few years, and there were simple atomic load/store exclusive
> instructions on Arm. [...]

src/hotspot/cpu/aarch64/atomic_aarch64.S line 1:

> 1: // Copyright (c) 2021, Red Hat Inc. All rights reserved.

Does this file work with the Windows assembler?

src/hotspot/cpu/aarch64/atomic_aarch64.S line 35:

> 33:         ret
> 34:
> 35:         .globl aarch64_atomic_fetch_add_4_default_impl

The N1 optimisation guide suggests aligning branch targets on 32-byte boundaries.

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5579:

> 5577:   //
> 5578:   // If LSE is in use, generate LSE versions of all the stubs. The
> 5579:   // non-LSE versions are in atomic_aarch64.S.

IMO it would be better for maintainability if the LSE versions were in atomic_aarch64.S too (with an explicit `.arch armv8-a+lse` directive). Is there any reason to generate them here, other than to support old toolchains? As far as I can tell GNU as supported LSE as far back as binutils 2.27.

https://sourceware.org/binutils/docs-2.27/as/AArch64-Extensions.html

-------------

PR: https://git.openjdk.java.net/jdk/pull/2434

Re: RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code

Andrew Haley-2
On Mon, 8 Feb 2021 08:35:49 GMT, Nick Gasson <[hidden email]> wrote:

>> Go back a few years, and there were simple atomic load/store exclusive
>> instructions on Arm. [...]
>
> src/hotspot/cpu/aarch64/atomic_aarch64.S line 35:
>
>> 33:         ret
>> 34:
>> 35:         .globl aarch64_atomic_fetch_add_4_default_impl
>
> The N1 optimisation guide suggests aligning branch targets on 32 byte boundaries.

OK.

> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5579:
>
>> 5577:   //
>> 5578:   // If LSE is in use, generate LSE versions of all the stubs. The
>> 5579:   // non-LSE versions are in atomic_aarch64.S.
>
> IMO it would be better for maintainability if the LSE versions were in atomic_aarch64.S too (with an explicit `.arch armv8-a+lse` directive). Is there any reason to generate them here, other than to support old toolchains? As far as I can tell GNU as supported LSE as far back as binutils 2.27.
>
> https://sourceware.org/binutils/docs-2.27/as/AArch64-Extensions.html

I can't see any reason to do this. There'd be no benefit to moving this stuff, and it would be harder to change in the future. I'd do the whole lot as runtime stubs if I could, but they're needed before VM startup.

> src/hotspot/cpu/aarch64/atomic_aarch64.S line 1:
>
>> 1: // Copyright (c) 2021, Red Hat Inc. All rights reserved.
>
> Does this file work with the Windows assembler?

I have no idea. If it doesn't, please tell me; I have no Windows system.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2434

Re: RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code

Nick Gasson
On Mon, 8 Feb 2021 09:52:41 GMT, Andrew Haley <[hidden email]> wrote:

>> src/hotspot/cpu/aarch64/atomic_aarch64.S line 1:
>>
>>> 1: // Copyright (c) 2021, Red Hat Inc. All rights reserved.
>>
>> Does this file work with the Windows assembler?
>
> I have no idea. If it doesn't, please tell me; I have no Windows system.

I don't either unfortunately. Maybe @mo-beck or @luhenry could help? The Windows build on GitHub Actions failed cryptically with:

 Creating support/modules_libs/java.base/server/jvm.dll from 1045 file(s)

make[3]: *** [lib/CompileJvm.gmk:143: /cygdrive/d/a/jdk/jdk/jdk/build/windows-aarch64/hotspot/variant-server/libjvm/objs/atomic_aarch64.obj] Error 2

-------------

PR: https://git.openjdk.java.net/jdk/pull/2434

Re: RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code

Nick Gasson
On Mon, 8 Feb 2021 10:06:05 GMT, Nick Gasson <[hidden email]> wrote:

>> I have no idea. If it doesn't, please tell me; I have no Windows system.
>
> I don't either unfortunately. Maybe @mo-beck or @luhenry could help? The Windows build on GitHub Actions failed cryptically with:
>
>  Creating support/modules_libs/java.base/server/jvm.dll from 1045 file(s)
>
> make[3]: *** [lib/CompileJvm.gmk:143: /cygdrive/d/a/jdk/jdk/jdk/build/windows-aarch64/hotspot/variant-server/libjvm/objs/atomic_aarch64.obj] Error 2

I suppose you could just move it to `os_cpu/linux_aarch64/` as these are only called from the Linux atomics?

-------------

PR: https://git.openjdk.java.net/jdk/pull/2434

Re: RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code

Andrew Haley-2
On Mon, 8 Feb 2021 10:11:30 GMT, Nick Gasson <[hidden email]> wrote:

>> I don't either unfortunately. Maybe @mo-beck or @luhenry could help? The Windows build on GitHub Actions failed cryptically with:
>>
>>  Creating support/modules_libs/java.base/server/jvm.dll from 1045 file(s)
>>
>> make[3]: *** [lib/CompileJvm.gmk:143: /cygdrive/d/a/jdk/jdk/jdk/build/windows-aarch64/hotspot/variant-server/libjvm/objs/atomic_aarch64.obj] Error 2
>
> I suppose you could just move it to `os_cpu/linux_aarch64/` as these are only called from the Linux atomics?

They're probably needed on Windows, or I'd have put them in linux_aarch64.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2434

Re: RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code

Ludovic Henry-2
On Mon, 8 Feb 2021 11:24:00 GMT, Andrew Haley <[hidden email]> wrote:

>> I suppose you could just move it to `os_cpu/linux_aarch64/` as these are only called from the Linux atomics?
>
> They're probably needed on Windows, or I'd have put them in linux_aarch64.

I can confirm that assembly code targeted at Linux generally won't compile out of the box on Windows. For `windows_aarch64`, could we have simple fallbacks in C++ code that simply call `InterlockedAdd`, `InterlockedExchange`, and `InterlockedCompareExchange`? It would be similar to [atomic_windows_aarch64.hpp](https://github.com/openjdk/jdk/blob/master/src/hotspot/os_cpu/windows_aarch64/atomic_windows_aarch64.hpp)
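A fallback along those lines might look roughly like this (a sketch only; the function name follows the PR's stub naming scheme, and whether `InterlockedAdd` generates good code on Windows/AArch64 is exactly the open question below):

```cpp
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// Sketch of a C++ fallback for one of the atomic stubs. The stub
// contract (as in the PR) is fetch-and-add: return the *old* value.
extern "C" int32_t aarch64_atomic_fetch_add_4_impl(volatile int32_t* dest,
                                                   int32_t add_value) {
#ifdef _WIN32
  // InterlockedAdd returns the resulting (new) value, so subtract
  // add_value to recover the old value the stub contract requires.
  return InterlockedAdd(reinterpret_cast<volatile LONG*>(dest), add_value)
         - add_value;
#else
  // Non-Windows sketch path: GCC/Clang builtin with the same semantics.
  return __atomic_fetch_add(dest, add_value, __ATOMIC_SEQ_CST);
#endif
}
```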

-------------

PR: https://git.openjdk.java.net/jdk/pull/2434

Re: RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code

Andrew Dinn-2
In reply to this post by Andrew Haley-2
On Fri, 5 Feb 2021 18:56:46 GMT, Andrew Haley <[hidden email]> wrote:

> Go back a few years, and there were simple atomic load/store exclusive
> instructions on Arm. [...]

It would help to change the name of old_value to value. Otherwise this is OK as is.

src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp line 88:

> 86:   template<typename D, typename I>
> 87:   D add_and_fetch(D volatile* dest, I add_value, atomic_memory_order order) const {
> 88:     D old_value = fetch_and_add(dest, add_value, order) + add_value;

I'm not sure this should be called old_value. It is actually old_value + add_value, i.e. it really ought to be called either new_value (or just value?).

src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp line 95:

> 93: template<>
> 94: template<typename D, typename I>
> 95: inline D Atomic::PlatformAdd<4>::fetch_and_add(D volatile* dest, I add_value,

It may be possible to avoid the cut-and-paste of declaring almost exactly the same template body for byte_size == 4 and 8: a generic template that includes a function type element, plus two auxiliary templates that instantiate that function element with either aarch64_atomic_fetch_add_4_impl or aarch64_atomic_fetch_add_8_impl. That would mean more templates but less repetition. On the whole I prefer fewer templates, so perhaps this is best left as is.
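The deduplication described above might be sketched like this (hypothetical helper names standing in for the per-size stub pointers, which the real code resolves at startup):

```cpp
#include <cstdint>

// Stand-ins for the per-size stubs (aarch64_atomic_fetch_add_{4,8}_impl
// in the PR), implemented here with a GCC/Clang builtin for illustration.
int32_t fetch_add_4_impl(volatile int32_t* d, int32_t v) {
  return __atomic_fetch_add(d, v, __ATOMIC_SEQ_CST);
}
int64_t fetch_add_8_impl(volatile int64_t* d, int64_t v) {
  return __atomic_fetch_add(d, v, __ATOMIC_SEQ_CST);
}

// One generic body parameterized by operand type and the stub to call,
// instead of two nearly identical per-size specializations.
template<typename T, T (*Stub)(volatile T*, T)>
struct PlatformFetchAdd {
  T operator()(volatile T* dest, T add_value) const {
    return Stub(dest, add_value);  // returns the old value
  }
};

using FetchAdd4 = PlatformFetchAdd<int32_t, fetch_add_4_impl>;
using FetchAdd8 = PlatformFetchAdd<int64_t, fetch_add_8_impl>;
```

As the comment says, this trades repetition for more template machinery; whether that is a win is a matter of taste.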

-------------

Marked as reviewed by adinn (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/2434

Re: RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code

Andrew Dinn-2
On Mon, 8 Feb 2021 12:21:22 GMT, Andrew Dinn <[hidden email]> wrote:

>> Go back a few years, and there were simple atomic load/store exclusive
>> instructions on Arm. [...]
>
> src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp line 95:
>
>> 93: template<>
>> 94: template<typename D, typename I>
>> 95: inline D Atomic::PlatformAdd<4>::fetch_and_add(D volatile* dest, I add_value,
>
> It may be possible to avoid the cut-and-paste of declaring almost exactly the same template body for byte_size == 4 and 8: a generic template that includes a function type element, plus two auxiliary templates that instantiate that function element with either aarch64_atomic_fetch_add_4_impl or aarch64_atomic_fetch_add_8_impl. That would mean more templates but less repetition. On the whole I prefer fewer templates, so perhaps this is best left as is.

N.B. the same comment applies to the cut-and-paste for PlatformXchg and PlatformCmpxchg.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2434

Re: RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code

Andrew Haley
In reply to this post by Ludovic Henry-2
On 2/8/21 1:38 PM, Ludovic Henry wrote:
> On Mon, 8 Feb 2021 11:24:00 GMT, Andrew Haley <[hidden email]> wrote:
>
>>> I suppose you could just move it to `os_cpu/linux_aarch64/` as these are only called from the Linux atomics?
>>
>> They're probably needed on Windows, or I'd have put them in linux_aarch64.
>
> I can confirm that assembly code targeted at Linux generally won't compile out of the box on Windows.

Generally, OK, but what's wrong with that specific file? It should run
just fine. We're getting an incomprehensible error message, but what
does it mean?

> For `windows_aarch64`, could we have simple fallbacks in C++ code that simply call `InterlockedAdd`, `InterlockedExchange`, and `InterlockedCompareExchange`? It would be similar to [atomic_windows_aarch64.hpp](https://github.com/openjdk/jdk/blob/master/src/hotspot/os_cpu/windows_aarch64/atomic_windows_aarch64.hpp)

Well, I don't know. Do InterlockedAdd and its friends generate good code?
You've got the machines there, so you can have a look, but I can't.
It's up to you to decide whether you care about code quality. I'm
trying my best to help, but without your assistance there's nothing I
can do.

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code [v2]

Andrew Haley-2
In reply to this post by Andrew Haley-2
> Go back a few years, and there were simple atomic load/store exclusive
> instructions on Arm. [...]

Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:

  Review changes

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/2434/files
  - new: https://git.openjdk.java.net/jdk/pull/2434/files/4f17903b..31f9c003

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2434&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2434&range=00-01

  Stats: 16 lines in 3 files changed: 14 ins; 0 del; 2 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2434.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2434/head:pull/2434

PR: https://git.openjdk.java.net/jdk/pull/2434

Re: RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code [v2]

Andrew Haley-2
In reply to this post by Andrew Dinn-2
On Mon, 8 Feb 2021 12:06:51 GMT, Andrew Dinn <[hidden email]> wrote:

>> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Review changes
>
> src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp line 88:
>
>> 86:   template<typename D, typename I>
>> 87:   D add_and_fetch(D volatile* dest, I add_value, atomic_memory_order order) const {
>> 88:     D old_value = fetch_and_add(dest, add_value, order) + add_value;
>
> I'm not sure this should be called old_value. It is actually old_value + add_value, i.e. it really ought to be called either new_value (or just value?).

Sure. This templated code is copied from os_cpu/linux_x86, so I kept the names the same as in that file. But you're right, that would make more sense.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2434

Re: RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code [v2]

Andrew Haley-2
In reply to this post by Andrew Dinn-2
On Mon, 8 Feb 2021 12:23:31 GMT, Andrew Dinn <[hidden email]> wrote:

>> src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp line 95:
>>
>>> 93: template<>
>>> 94: template<typename D, typename I>
>>> 95: inline D Atomic::PlatformAdd<4>::fetch_and_add(D volatile* dest, I add_value,
>>
>> It may be possible to avoid the cut-and-paste of declaring almost exactly the same template body for byte_size == 4 and 8: a generic template that includes a function type element, plus two auxiliary templates that instantiate that function element with either aarch64_atomic_fetch_add_4_impl or aarch64_atomic_fetch_add_8_impl. That would mean more templates but less repetition. On the whole I prefer fewer templates, so perhaps this is best left as is.
>
> N.B. the same comment applies to the cut-and-paste for PlatformXchg and PlatformCmpxchg.

Lawks! Even Oracle didn't do that with Linux/x86. I think I'll leave it as it is.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2434

Re: RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code [v2]

Andrew Haley-2
In reply to this post by Andrew Dinn-2
On Mon, 8 Feb 2021 14:52:00 GMT, Andrew Dinn <[hidden email]> wrote:

>> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Review changes
>
> It would help to change the name of old_value to value. Otherwise this si ok as is.

> _Mailing list message from [Andrew Haley](mailto:[hidden email]) on [hotspot-dev](mailto:[hidden email]):_
> Well, I don't know. Do InterlockedAdd and its friends generate good code?
> You've got the machines there, so you can have a look, but I can't.

I'm sorry, that was unnecessarily sharp of me! It's entirely up to you, but you might find investigating this to be useful.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2434

Re: RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code [v2]

Stefan Karlsson-3
In reply to this post by Andrew Haley-2
On Mon, 8 Feb 2021 17:40:27 GMT, Andrew Haley <[hidden email]> wrote:

>> src/hotspot/os_cpu/linux_aarch64/atomic_linux_aarch64.hpp line 88:
>>
>>> 86:   template<typename D, typename I>
>>> 87:   D add_and_fetch(D volatile* dest, I add_value, atomic_memory_order order) const {
>>> 88:     D old_value = fetch_and_add(dest, add_value, order) + add_value;
>>
>> I'm not sure this should be called old_value. It is actually old_value + add_value, i.e. it really ought to be called either new_value (or just value?).
>
> Sure. This templated code is copied from os_cpu/linux_x86, so I kept the names the same as in that file. But you're right, that would make more sense.

FTR, the linux_x86 code doesn't use old_value for the add_and_fetch version:
  template<typename D, typename I>
  D add_and_fetch(D volatile* dest, I add_value, atomic_memory_order order) const {
    return fetch_and_add(dest, add_value, order) + add_value;
  }
only the fetch_and_add functions do. But I think that's correct.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2434

Re: RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code [v2]

Volker Simonis-3
In reply to this post by Andrew Haley-2
On Mon, 8 Feb 2021 18:50:09 GMT, Andrew Haley <[hidden email]> wrote:

>> Go back a few years, and there were simple atomic load/store exclusive
>> instructions on Arm. Say you want to do an atomic increment of a
>> counter. You'd do an atomic load to get the counter into your local cache
>> in exclusive state, increment that counter locally, then write that
>> incremented counter back to memory with an atomic store. All the time
>> that cache line was in exclusive state, so you're guaranteed that
>> no-one else changed anything on that cache line while you had it.
>>
>> This is hard to scale on a very large system (e.g. Fugaku) because if
>> many processors are incrementing that counter you get a lot of cache
>> line ping-ponging between cores.
>>
>> So, Arm decided to add a locked memory increment instruction that
>> works without needing to load an entire line into local cache. It's a
>> single instruction that loads, increments, and writes back. The secret
>> is to send a cache control message to whichever processor owns the
>> cache line containing the count, tell that processor to increment the
>> counter and return the incremented value. That way cache coherency
>> traffic is minimized. This new set of instructions is known as Large
>> System Extensions, or LSE.
>>
>> Unfortunately, in recent processors, the "old" load/store exclusive
>> instructions sometimes perform very badly. Therefore, it's now
>> necessary for software to detect which version of Arm it's running
>> on, and use the "new" LSE instructions if they're available. Otherwise
>> performance can be very poor under heavy contention.
>>
>> GCC's -moutline-atomics does this by providing library calls which use
>> LSE if it's available, but this option is only provided on newer
>> versions of GCC. This is particularly problematic with older versions
>> of OpenJDK, which build using old GCC versions.
>>
>> Also, I suspect that some other operating systems could use this.
>> Perhaps not MacOS, given that all Apple CPUs support LSE, but
>> maybe Windows.
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
>
>   Review changes

Hi Andrew,

I'm happy to see this change after the [long](https://mail.openjdk.java.net/pipermail/hotspot-dev/2019-November/039930.html) and [tedious](https://mail.openjdk.java.net/pipermail/hotspot-dev/2019-November/039932.html) discussions we had about preferring C++ intrinsics over inline assembly :)

In general I'm fine with the change. Some of the previous C++ intrinsics (e.g. `__atomic_exchange_n` and `__atomic_add_fetch`) were called with `__ATOMIC_RELEASE` semantics, which has now been dropped in the new implementation. But I think that's safe and a nice "optimization" because the instructions are followed by a full membar anyway.

One question I still have is why we need the default assembler implementations at all. As far as I can see, the MacroAssembler already dispatches based on LSE availability. So why can't we just use the generated stubs exclusively? This would also solve the platform problems with assembler code.

Finally, I didn't fully understand how you've measured the `call..ret` overhead and what's the "*simple straight-line test*" you've posted performance numbers for.

Other than that, the change looks fine to me.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2434

Re: RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code [v2]

Andrew Haley-2
On Tue, 9 Feb 2021 11:12:21 GMT, Volker Simonis <[hidden email]> wrote:

> I'm happy to see this change after the [long](https://mail.openjdk.java.net/pipermail/hotspot-dev/2019-November/039930.html) and [tedious](https://mail.openjdk.java.net/pipermail/hotspot-dev/2019-November/039932.html) discussions we had about preferring C++ intrinsics over inline assembly :)
>
> In general I'm fine with the change. Some of the previous C++ intrinsics (e.g. `__atomic_exchange_n` and `__atomic_add_fetch`) were called with `__ATOMIC_RELEASE` semantics which has now been dropped in the new implementation. But I think that's safe and a nice "optimization" because the instructions are followed by a full membar anyway.

None of these sequences is ideal, so I'll follow up with some higher-performance LSE versions in a new patch.

> One question I still have is why we need the default assembler implementations at all. As far as I can see, the MacroAssembler already dispatches based on LSE availability. So why can't we just use the generated stubs exclusively? This would also solve the platform problems with assembler code.

We'd need an instance of Assembler very early, before the JVM is initialized. It could be done, but it would also require a page of memory to be allocated early. I did try, but it was rather awful. That is why I ended up with these simple bootstrapping versions of the atomics.

> Finally, I didn't fully understand how you've measured the `call..ret` overhead and what's the "_simple straight-line test_" you've posted performance numbers for.

That was just a counter in a loop. It's not terribly important for this case, given that the current code is far from optimal, but I wanted to know if the call...ret overhead could be measured. It can't, because it's swamped by the cost of the barriers.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2434

Re: RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code [v3]

Andrew Haley-2
In reply to this post by Andrew Haley-2
> Go back a few years, and there were simple atomic load/store exclusive
> instructions on Arm. Say you want to do an atomic increment of a
> counter. You'd do an atomic load to get the counter into your local cache
> in exclusive state, increment that counter locally, then write that
> incremented counter back to memory with an atomic store. All the time
> that cache line was in exclusive state, so you're guaranteed that
> no-one else changed anything on that cache line while you had it.
>
> This is hard to scale on a very large system (e.g. Fugaku) because if
> many processors are incrementing that counter you get a lot of cache
> line ping-ponging between cores.
>
> So, Arm decided to add a locked memory increment instruction that
> works without needing to load an entire line into local cache. It's a
> single instruction that loads, increments, and writes back. The secret
> is to send a cache control message to whichever processor owns the
> cache line containing the count, tell that processor to increment the
> counter and return the incremented value. That way cache coherency
> traffic is minimized. This new set of instructions is known as Large
> System Extensions, or LSE.
>
> Unfortunately, in recent processors, the "old" load/store exclusive
> instructions sometimes perform very badly. Therefore, it's now
> necessary for software to detect which version of Arm it's running
> on, and use the "new" LSE instructions if they're available. Otherwise
> performance can be very poor under heavy contention.
>
> GCC's -moutline-atomics does this by providing library calls which use
> LSE if it's available, but this option is only provided on newer
> versions of GCC. This is particularly problematic with older versions
> of OpenJDK, which build using old GCC versions.
>
> Also, I suspect that some other operating systems could use this.
> Perhaps not MacOS, given that all Apple CPUs support LSE, but
> maybe Windows.

Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:

  Properly align everything

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/2434/files
  - new: https://git.openjdk.java.net/jdk/pull/2434/files/31f9c003..1cec2f54

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2434&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2434&range=01-02

  Stats: 7 lines in 1 file changed: 0 ins; 0 del; 7 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2434.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2434/head:pull/2434

PR: https://git.openjdk.java.net/jdk/pull/2434

Re: RFR: 8261027: AArch64: Support for LSE atomics C++ HotSpot code [v3]

Volker Simonis-3
On Tue, 9 Feb 2021 15:07:05 GMT, Andrew Haley <[hidden email]> wrote:

>> Go back a few years, and there were simple atomic load/store exclusive
>> instructions on Arm. Say you want to do an atomic increment of a
>> counter. You'd do an atomic load to get the counter into your local cache
>> in exclusive state, increment that counter locally, then write that
>> incremented counter back to memory with an atomic store. All the time
>> that cache line was in exclusive state, so you're guaranteed that
>> no-one else changed anything on that cache line while you had it.
>>
>> This is hard to scale on a very large system (e.g. Fugaku) because if
>> many processors are incrementing that counter you get a lot of cache
>> line ping-ponging between cores.
>>
>> So, Arm decided to add a locked memory increment instruction that
>> works without needing to load an entire line into local cache. It's a
>> single instruction that loads, increments, and writes back. The secret
>> is to send a cache control message to whichever processor owns the
>> cache line containing the count, tell that processor to increment the
>> counter and return the incremented value. That way cache coherency
>> traffic is minimized. This new set of instructions is known as Large
>> System Extensions, or LSE.
>>
>> Unfortunately, in recent processors, the "old" load/store exclusive
>> instructions sometimes perform very badly. Therefore, it's now
>> necessary for software to detect which version of Arm it's running
>> on, and use the "new" LSE instructions if they're available. Otherwise
>> performance can be very poor under heavy contention.
>>
>> GCC's -moutline-atomics does this by providing library calls which use
>> LSE if it's available, but this option is only provided on newer
>> versions of GCC. This is particularly problematic with older versions
>> of OpenJDK, which build using old GCC versions.
>>
>> Also, I suspect that some other operating systems could use this.
>> Perhaps not MacOS, given that all Apple CPUs support LSE, but
>> maybe Windows.
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
>
>   Properly align everything

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 35:

> 33: #include "interpreter/interpreter.hpp"
> 34: #include "memory/universe.hpp"
> 35: #include "atomic_aarch64.hpp"

I think the convention is to put the includes in alphabetical order after `#include "precompiled.hpp"`
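For illustration, the reordered includes would look something like this (only the includes quoted above are shown; the surrounding includes in the file are omitted):

```cpp
#include "precompiled.hpp"
// Alphabetical order after precompiled.hpp:
#include "atomic_aarch64.hpp"
#include "interpreter/interpreter.hpp"
#include "memory/universe.hpp"
```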

-------------

PR: https://git.openjdk.java.net/jdk/pull/2434