RFR: 8261493: Shenandoah: reconsider bitmap access memory ordering

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

RFR: 8261493: Shenandoah: reconsider bitmap access memory ordering

Aleksey Shipilev-5
Shenandoah currently uses its own marking bitmap (added by JDK-8254315). It accesses the marking bitmap with "acquire" for reads and "conservative" for updates. Hotspot's default for atomic operations is memory_order_conservative, which emits two-way memory fences around the CASes at least on AArch64 and PPC64. This seems to be excessive for Shenandoah marking bitmap updates, and "release" is enough.

I think both are actually excessive for marking bitmap accesses: we do not piggyback object updates on it, the atomics there are only to guarantee the access atomicity and CAS updates to bits. So we might as well use "relaxed" modes for both loads and updates.

Sample run with aggressive (back-to-back cycles) on SPECjvm2008:compiler.compiler on AArch64:

# Baseline
[135.357s][info][gc,stats] Concurrent Marking             =   38.795 s (a =   146951 us) (n =   264)
    (lvls, us =      172,     1719,   150391,   275391,   348305)

# Patched
[130.475s][info][gc,stats] Concurrent Marking             =   34.874 s (a =   120672 us) (n =   289)
    (lvls, us =      178,     1777,   132812,   222656,   323957)

Average time goes down, the number of GC cycles go up, since the cycles are shorter.

Additional testing:
 - [x] Linux x86_64 `hotspot_gc_shenandoah`
 - [x] Linux AArch64 `hotspot_gc_shenandoah`
 - [x] Linux AArch64 `tier1` with Shenandoah

-------------

Commit messages:
 - 8261493: Shenandoah: reconsider bitmap access memory ordering

Changes: https://git.openjdk.java.net/jdk/pull/2497/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=2497&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8261493
  Stats: 18 lines in 2 files changed: 0 ins; 14 del; 4 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2497.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2497/head:pull/2497

PR: https://git.openjdk.java.net/jdk/pull/2497
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261493: Shenandoah: reconsider bitmap access memory ordering

Roman Kennke
On Wed, 10 Feb 2021 09:32:18 GMT, Aleksey Shipilev <[hidden email]> wrote:

> Shenandoah currently uses its own marking bitmap (added by JDK-8254315). It accesses the marking bitmap with "acquire" for reads and "conservative" for updates. Hotspot's default for atomic operations is memory_order_conservative, which emits two-way memory fences around the CASes at least on AArch64 and PPC64. This seems to be excessive for Shenandoah marking bitmap updates, and "release" is enough.
>
> I think both are actually excessive for marking bitmap accesses: we do not piggyback object updates on it, the atomics there are only to guarantee the access atomicity and CAS updates to bits. So we might as well use "relaxed" modes for both loads and updates.
>
> Sample run with aggressive (back-to-back cycles) on SPECjvm2008:compiler.compiler on AArch64:
>
> # Baseline
> [135.357s][info][gc,stats] Concurrent Marking             =   38.795 s (a =   146951 us) (n =   264)
>     (lvls, us =      172,     1719,   150391,   275391,   348305)
>
> # Patched
> [130.475s][info][gc,stats] Concurrent Marking             =   34.874 s (a =   120672 us) (n =   289)
>     (lvls, us =      178,     1777,   132812,   222656,   323957)
>
> Average time goes down, the number of GC cycles go up, since the cycles are shorter.
>
> Additional testing:
>  - [x] Linux x86_64 `hotspot_gc_shenandoah`
>  - [x] Linux AArch64 `hotspot_gc_shenandoah`
>  - [x] Linux AArch64 `tier1` with Shenandoah

Nice improvement!
I think that makes sense. Patch looks good to me!

-------------

Marked as reviewed by rkennke (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/2497
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261493: Shenandoah: reconsider bitmap access memory ordering

Zhengyu Gu-3
In reply to this post by Aleksey Shipilev-5
On Wed, 10 Feb 2021 09:32:18 GMT, Aleksey Shipilev <[hidden email]> wrote:

> Shenandoah currently uses its own marking bitmap (added by JDK-8254315). It accesses the marking bitmap with "acquire" for reads and "conservative" for updates. Hotspot's default for atomic operations is memory_order_conservative, which emits two-way memory fences around the CASes at least on AArch64 and PPC64.
>
> I think both are actually excessive for marking bitmap accesses: we do not piggyback object updates on it, the atomics there are only to guarantee the access atomicity and CAS updates to bits.  It seems "relaxed" is enough for marking bitmap accesses.
>
> Sample run with "compact" (frequent GC cycles) on SPECjvm2008:compiler.sunflow on AArch64:
>
> # Baseline
> # Baseline
> [146.028s][info][gc,stats] Concurrent Marking =   50.315 s (a =   258024 us) (n =   195) (lvls, us =    31836,   230469,   273438,   306641,   464255)
> [141.458s][info][gc,stats] Concurrent Marking =   47.819 s (a =   242737 us) (n =   197) (lvls, us =    42773,   197266,   267578,   287109,   433948)
> [144.108s][info][gc,stats] Concurrent Marking =   49.806 s (a =   250283 us) (n =   199) (lvls, us =    32227,   201172,   267578,   296875,   448549)
>
> # Patched
> [144.238s][info][gc,stats] Concurrent Marking =   46.627 s (a =   220981 us) (n =   211) (lvls, us =    24414,   197266,   238281,   259766,   345112)
> [138.406s][info][gc,stats] Concurrent Marking =   45.022 s (a =   227383 us) (n =   198) (lvls, us =    20508,   205078,   244141,   271484,   427658)
> [140.950s][info][gc,stats] Concurrent Marking =   45.073 s (a =   222036 us) (n =   203) (lvls, us =    21680,   181641,   240234,   265625,   375750)
>
> Average time goes down, total marking time goes down.
>
> Additional testing:
>  - [x] Linux x86_64 `hotspot_gc_shenandoah`
>  - [x] Linux AArch64 `hotspot_gc_shenandoah`
>  - [x] Linux AArch64 `tier1` with Shenandoah

Now, you need load barrier on read side (e.g. ShenandoahMarkBitMap::at()). Although, it is not a correctness issue, but seeing stale value means extra unnecessary work.

-------------

Changes requested by zgu (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/2497
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261493: Shenandoah: reconsider bitmap access memory ordering

Aleksey Shipilev-5
On Thu, 11 Feb 2021 18:07:58 GMT, Zhengyu Gu <[hidden email]> wrote:

> Now, you need load barrier on read side (e.g. ShenandoahMarkBitMap::at()). Although, it is not a correctness issue, but seeing stale value means extra unnecessary work.

I don't see why. Adding load barriers would not affect promptness of seeing the memory updates to the bitmap itself. It might affect the promptness of seeing the object contents that we are reading after asking `is_marked` -- but that would be a race either way, because we do not use mark bitmap for memory ordering at all (i.e. there is no "release" on bitmap update).

-------------

PR: https://git.openjdk.java.net/jdk/pull/2497
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261493: Shenandoah: reconsider bitmap access memory ordering

Zhengyu Gu-3
On Fri, 12 Feb 2021 07:31:34 GMT, Aleksey Shipilev <[hidden email]> wrote:

> > Now, you need load barrier on read side (e.g. ShenandoahMarkBitMap::at()). Although, it is not a correctness issue, but seeing stale value means extra unnecessary work.
>
> I don't see why. Adding load barriers would not affect promptness of seeing the memory updates to the bitmap itself. It might affect the promptness of seeing the object contents that we are reading after asking `is_marked` -- but that would be a race either way, because we do not use mark bitmap for memory ordering at all (i.e. there is no "release" on bitmap update).

The `load barrier` I were thinking, is something that can prompt seeing the updating of bitmap. I did a little digging, it does not seem we have something like that.

We rarely call is_marked() and variants during mark phase, except SATB filtering. I wonder if adding a leading fence in SH::requires_marking() can accomplish that. Although, it is still a race, but I think 1) small price to pay compares to the work of enqueuing a marked oop 2) might help the termination by not enqueuing marked oops.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2497
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261493: Shenandoah: reconsider bitmap access memory ordering

Aleksey Shipilev-5
On Fri, 12 Feb 2021 13:16:03 GMT, Zhengyu Gu <[hidden email]> wrote:

> The `load barrier` I were thinking, is something that can prompt seeing the updating of bitmap. I did a little digging, it does not seem we have something like that.
>
> We rarely call is_marked() and variants during mark phase, except SATB filtering. I wonder if adding a leading fence in SH::requires_marking() can accomplish that. Although, it is still a race, but I think 1) small price to pay compares to the work of enqueuing a marked oop 2) might help the termination by not enqueuing marked oops.

I suspect it would not help much, mostly because hardware does not hoard mutable data on the timescales that are important for performance (they do it on timescales that are important for correctness though: data coming out of order "just" 100ps later is still out of order). I believe we would be paying barrier costs for a very little gain in promptness. Our current handshaking-before-final-mark and SATB locking provides enough of memory bashing, I think.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2497
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261493: Shenandoah: reconsider bitmap access memory ordering

Zhengyu Gu-3
In reply to this post by Aleksey Shipilev-5
On Wed, 10 Feb 2021 09:32:18 GMT, Aleksey Shipilev <[hidden email]> wrote:

> Shenandoah currently uses its own marking bitmap (added by JDK-8254315). It accesses the marking bitmap with "acquire" for reads and "conservative" for updates. Hotspot's default for atomic operations is memory_order_conservative, which emits two-way memory fences around the CASes at least on AArch64 and PPC64.
>
> I think both are actually excessive for marking bitmap accesses: we do not piggyback object updates on it, the atomics there are only to guarantee the access atomicity and CAS updates to bits.  It seems "relaxed" is enough for marking bitmap accesses.
>
> Sample run with "compact" (frequent GC cycles) on SPECjvm2008:compiler.sunflow on AArch64:
>
> # Baseline
> # Baseline
> [146.028s][info][gc,stats] Concurrent Marking =   50.315 s (a =   258024 us) (n =   195) (lvls, us =    31836,   230469,   273438,   306641,   464255)
> [141.458s][info][gc,stats] Concurrent Marking =   47.819 s (a =   242737 us) (n =   197) (lvls, us =    42773,   197266,   267578,   287109,   433948)
> [144.108s][info][gc,stats] Concurrent Marking =   49.806 s (a =   250283 us) (n =   199) (lvls, us =    32227,   201172,   267578,   296875,   448549)
>
> # Patched
> [144.238s][info][gc,stats] Concurrent Marking =   46.627 s (a =   220981 us) (n =   211) (lvls, us =    24414,   197266,   238281,   259766,   345112)
> [138.406s][info][gc,stats] Concurrent Marking =   45.022 s (a =   227383 us) (n =   198) (lvls, us =    20508,   205078,   244141,   271484,   427658)
> [140.950s][info][gc,stats] Concurrent Marking =   45.073 s (a =   222036 us) (n =   203) (lvls, us =    21680,   181641,   240234,   265625,   375750)
>
> Average time goes down, total marking time goes down.
>
> Additional testing:
>  - [x] Linux x86_64 `hotspot_gc_shenandoah`
>  - [x] Linux AArch64 `hotspot_gc_shenandoah`
>  - [x] Linux AArch64 `tier1` with Shenandoah

Marked as reviewed by zgu (Reviewer).

-------------

PR: https://git.openjdk.java.net/jdk/pull/2497
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261493: Shenandoah: reconsider bitmap access memory ordering

Zhengyu Gu-3
In reply to this post by Aleksey Shipilev-5
On Fri, 12 Feb 2021 15:50:31 GMT, Aleksey Shipilev <[hidden email]> wrote:

> > The `load barrier` I were thinking, is something that can prompt seeing the updating of bitmap. I did a little digging, it does not seem we have something like that.
> > We rarely call is_marked() and variants during mark phase, except SATB filtering. I wonder if adding a leading fence in SH::requires_marking() can accomplish that. Although, it is still a race, but I think 1) small price to pay compares to the work of enqueuing a marked oop 2) might help the termination by not enqueuing marked oops.
>
> I suspect it would not help much, mostly because hardware does not hoard mutable data on the timescales that are important for performance (they do it on timescales that are important for correctness though: data coming out of order "just" 100ps later is still out of order). I believe we would be paying barrier costs for a very little gain in promptness. Our current handshaking-before-final-mark and SATB locking provides enough of memory bashing, I think.

Okay, then.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2497
Reply | Threaded
Open this post in threaded view
|

Integrated: 8261493: Shenandoah: reconsider bitmap access memory ordering

Aleksey Shipilev-5
In reply to this post by Aleksey Shipilev-5
On Wed, 10 Feb 2021 09:32:18 GMT, Aleksey Shipilev <[hidden email]> wrote:

> Shenandoah currently uses its own marking bitmap (added by JDK-8254315). It accesses the marking bitmap with "acquire" for reads and "conservative" for updates. Hotspot's default for atomic operations is memory_order_conservative, which emits two-way memory fences around the CASes at least on AArch64 and PPC64.
>
> I think both are actually excessive for marking bitmap accesses: we do not piggyback object updates on it, the atomics there are only to guarantee the access atomicity and CAS updates to bits.  It seems "relaxed" is enough for marking bitmap accesses.
>
> Sample run with "compact" (frequent GC cycles) on SPECjvm2008:compiler.sunflow on AArch64:
>
> # Baseline
> # Baseline
> [146.028s][info][gc,stats] Concurrent Marking =   50.315 s (a =   258024 us) (n =   195) (lvls, us =    31836,   230469,   273438,   306641,   464255)
> [141.458s][info][gc,stats] Concurrent Marking =   47.819 s (a =   242737 us) (n =   197) (lvls, us =    42773,   197266,   267578,   287109,   433948)
> [144.108s][info][gc,stats] Concurrent Marking =   49.806 s (a =   250283 us) (n =   199) (lvls, us =    32227,   201172,   267578,   296875,   448549)
>
> # Patched
> [144.238s][info][gc,stats] Concurrent Marking =   46.627 s (a =   220981 us) (n =   211) (lvls, us =    24414,   197266,   238281,   259766,   345112)
> [138.406s][info][gc,stats] Concurrent Marking =   45.022 s (a =   227383 us) (n =   198) (lvls, us =    20508,   205078,   244141,   271484,   427658)
> [140.950s][info][gc,stats] Concurrent Marking =   45.073 s (a =   222036 us) (n =   203) (lvls, us =    21680,   181641,   240234,   265625,   375750)
>
> Average time goes down, total marking time goes down.
>
> Additional testing:
>  - [x] Linux x86_64 `hotspot_gc_shenandoah`
>  - [x] Linux AArch64 `hotspot_gc_shenandoah`
>  - [x] Linux AArch64 `tier1` with Shenandoah

This pull request has now been integrated.

Changeset: 745c0b91
Author:    Aleksey Shipilev <[hidden email]>
URL:       https://git.openjdk.java.net/jdk/commit/745c0b91
Stats:     18 lines in 2 files changed: 0 ins; 14 del; 4 mod

8261493: Shenandoah: reconsider bitmap access memory ordering

Reviewed-by: rkennke, zgu

-------------

PR: https://git.openjdk.java.net/jdk/pull/2497