Atomic operations: your thoughts are welcome


Atomic operations: your thoughts are welcome

Andrew Haley
I've been looking at the hottest Atomic operations in HotSpot, with a view to
finding out if the default memory_order_conservative (which is very expensive
on some architectures) can be weakened to something less. It's impossible to
fix all of them, but perhaps we can fix some of the most frequent.

These are the hottest compare-and-swap uses in HotSpot, with the count
at the end of each line.

<G1ParScanThreadState::trim_queue_to_threshold(unsigned int)+7284>:     :: = 16406757

This one is already memory_order_relaxed, so no problem.

<OopOopIterateDispatch<G1CMOopClosure>::Table::oop_oop_iterate<InstanceKlass, narrowOop>(G1CMOopClosure*, oopDesc*, Klass*)+336>:       :: = 3903178

This is actually MarkBitMap::par_mark calling BitMap::par_set_bit. Does this
need to be memory_order_conservative, or would something weaker do? Even
acq_rel or seq_cst would be better.
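[For readers unfamiliar with the code in question: BitMap::par_set_bit is essentially a CAS loop over a bitmap word. A minimal sketch in standard C++ atomics (not HotSpot's actual Atomic API), with the memory order parameterized the way this thread discusses:]

```cpp
#include <atomic>
#include <cstdint>

// Illustrative sketch of a CAS-based par_set_bit; names and the memory
// order parameter are ours, not HotSpot's. Returns true if this caller
// set the bit, false if it was already set.
inline bool par_set_bit(std::atomic<uintptr_t>& word, unsigned bit,
                        std::memory_order order = std::memory_order_seq_cst) {
  const uintptr_t mask = uintptr_t(1) << bit;
  uintptr_t old = word.load(std::memory_order_relaxed);
  do {
    if (old & mask) return false;  // lost the race: bit already set
  } while (!word.compare_exchange_weak(old, old | mask, order,
                                       std::memory_order_relaxed));
  return true;
}
```

The question in the thread is whether `order` here can be weaker than a full two-way barrier; for a pure "mark" bit that is only ever set, relaxed may well suffice.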

<Symbol::decrement_refcount()+44>:      :: = 2376632
<Symbol::increment_refcount()+44>:      :: = 2003895

I can't imagine that either of these actually need memory_order_conservative,
they're just reference counts.

<OtherRegionsTable::add_reference(void*, unsigned int)+248>:    :: = 1719614

BitMap::par_set_bit again.

<G1ParScanThreadState::steal_and_trim_queue(GenericTaskQueueSet<OverflowTaskQueue<ScannerTask, (MEMFLAGS)5, 131072u>, (MEMFLAGS)5>*)+432>:      :: = 1617659

This one is GenericTaskQueue::pop_global calling cmpxchg_age().
Again, do we need conservative here?

There is, I suppose, always a possibility that some code somewhere is taking
advantage of the memory-serializing properties of adjusting refcounts.

Thanks,

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: Atomic operations: your thoughts are welcome

Ioi Lam
Just curious, which benchmark is this?

Thanks
- Ioi

On 2/8/21 10:14 AM, Andrew Haley wrote:

> I've been looking at the hottest Atomic operations in HotSpot, with a view to
> finding out if the default memory_order_conservative (which is very expensive
> on some architectures) can be weakened to something less. [...]


Re: Atomic operations: your thoughts are welcome

Andrew Haley
Oh, sorry. This is my favourite benchmark, javac all of java.base. I'm mostly
using that because it's easy to run without any external dependencies, and it
loads a lot of classes. It's no better or worse than any other random program.

On 2/10/21 6:44 AM, Ioi Lam wrote:
> Just curious, which benchmark is this?
>
> Thanks
> - Ioi
>
> On 2/8/21 10:14 AM, Andrew Haley wrote:
>> I've been looking at the hottest Atomic operations in HotSpot, with a view to
>> finding out if the default memory_order_conservative (which is very expensive
>> on some architectures) can be weakened to something less.

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: Atomic operations: your thoughts are welcome

Kim Barrett
In reply to this post by Andrew Haley
> On Feb 8, 2021, at 1:14 PM, Andrew Haley <[hidden email]> wrote:
>
> I've been looking at the hottest Atomic operations in HotSpot, with a view to
> finding out if the default memory_order_conservative (which is very expensive
> on some architectures) can be weakened to something less. It's impossible to
> fix all of them, but perhaps we can fix some of the most frequent.

Is there any information about the possible performance improvement from
such changes?  1.5-3M occurrences doesn't mean much without context.

We don't presently have support for sequentially consistent semantics, only
"conservative". My recollection is that this is in part because there might
be code that is assuming the possibly stronger "conservative" semantics, and
in part because there are different and incompatible approaches to
implementing sequentially consistent semantics on some hardware platforms
and we didn't want to make assumptions there.

We also don't presently have any cmpxchg implementation that really supports
anything between conservative and relaxed, nor do we support different order
constraints for the success vs failure cases. Things can be complicated
enough as is; while we *could* fill some of that in, I'm not sure we should.
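[For comparison: standard C++ atomics, unlike the HotSpot Atomic API as described here, already let a caller give separate success and failure orderings to a CAS. A minimal sketch, with illustrative names:]

```cpp
#include <atomic>

// Sketch only: acquire-release ordering if the CAS succeeds, relaxed
// (no ordering) on the failure path. This is the shape of the
// "between conservative and relaxed" variant the thread is weighing.
bool try_claim(std::atomic<int>& flag) {
  int expected = 0;
  return flag.compare_exchange_strong(expected, 1,
                                      std::memory_order_acq_rel,
                                      std::memory_order_relaxed);
}
```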

> These are the hottest compare-and-swap uses in HotSpot, with the count
> at the end of each line.
>
> <G1ParScanThreadState::trim_queue_to_threshold(unsigned int)+7284>:     :: = 16406757
>
> This one is already memory_order_relaxed, so no problem.

Right.  Although I'm now wondering why this doesn't need to do anything on the
failure side, similar to what was needed in the corresponding place in
ParallelGC when that was changed to use a relaxed cmpxchg.

> <OopOopIterateDispatch<G1CMOopClosure>::Table::oop_oop_iterate<InstanceKlass, narrowOop>(G1CMOopClosure*, oopDesc*, Klass*)+336>:       :: = 3903178
>
> This is actually MarkBitMap::par_mark calling BitMap::par_set_bit. Does this
> need to be memory_order_conservative, or would something weaker do? Even
> acq_rel or seq_cst would be better.

I think for setting bits in a bitmap the thing to do would be to identify
places that are safe and useful (impacts performance) to do so first. Then
add a weaker variant for use in those places, assuming any are found.

> <Symbol::decrement_refcount()+44>:      :: = 2376632
> <Symbol::increment_refcount()+44>:      :: = 2003895
>
> I can't imagine that either of these actually need memory_order_conservative,
> they're just reference counts.

The "usual" refcount implementation involves relaxed increment and stronger
ordering for decrement. (If I'm remembering correctly, dec-acquire and a
release fence on the zero value path before deleting. But I've not thought
about what one might want for this CAS-based variant that handles boundary
cases specially.) And as you say, whether any of these could be weakened
depends on whether there is any code surrounding a use that depends on the
stronger ordering semantics. At a guess I suspect increment could be changed
to relaxed, but I've not looked. This one is probably a question for runtime
folks.
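[The "usual" scheme Kim describes looks roughly like the following standard-C++ sketch: relaxed increment, release decrement, and an acquire fence on the zero path before deletion. This is the textbook pattern, not HotSpot's CAS-based Symbol refcount, which handles boundary cases specially.]

```cpp
#include <atomic>

// Textbook refcount ordering, for illustration only.
struct RefCounted {
  std::atomic<int> refs{1};

  void inc() {
    // No ordering needed: taking a new reference publishes nothing.
    refs.fetch_add(1, std::memory_order_relaxed);
  }

  // Returns true when the caller holds the last reference and may delete.
  bool dec() {
    // release: our prior writes happen-before any deleting thread's reads
    if (refs.fetch_sub(1, std::memory_order_release) == 1) {
      // acquire: the deleter sees every other thread's writes to the object
      std::atomic_thread_fence(std::memory_order_acquire);
      return true;
    }
    return false;
  }
};
```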

> <OtherRegionsTable::add_reference(void*, unsigned int)+248>:    :: = 1719614
>
> BitMap::par_set_bit again.
>
> <G1ParScanThreadState::steal_and_trim_queue(GenericTaskQueueSet<OverflowTaskQueue<ScannerTask, (MEMFLAGS)5, 131072u>, (MEMFLAGS)5>*)+432>:      :: = 1617659
>
> This one is GenericTaskQueue::pop_global calling cmpxchg_age().
> Again, do we need conservative here?

This needs at least sequentially consistent semantics on the success path.
See the referenced paper by Le, et al.

There is also a cmpxchg_age in pop_local_slow.  The Le, et al paper doesn't
deal with that path.  But it's also not in your list, which is good since
this is supposed to be infrequently taken.
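[A much-simplified sketch of the age CAS in question, in standard C++ atomics: the age word packs a top index and a tag, and a thief claims the top slot with a CAS that is seq_cst on the success path, as required above. Names and packing are illustrative, not HotSpot's GenericTaskQueue.]

```cpp
#include <atomic>
#include <cstdint>

// Illustrative only: low 32 bits = top index, high 32 bits = ABA tag.
static std::atomic<uint64_t> age{0};

// A thief tries to advance top from expected_top to expected_top + 1.
inline bool claim_top(uint32_t expected_top) {
  uint64_t old = age.load(std::memory_order_relaxed);
  if (uint32_t(old) != expected_top) return false;
  uint64_t desired = (old & ~uint64_t(0xffffffffu)) | (expected_top + 1);
  // seq_cst on success per Le et al.; relaxed failure is the weakening
  // the thread is asking about.
  return age.compare_exchange_strong(old, desired,
                                     std::memory_order_seq_cst,
                                     std::memory_order_relaxed);
}
```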


Re: Atomic operations: your thoughts are welcome

Kim Barrett
> On Feb 10, 2021, at 10:59 PM, Kim Barrett <[hidden email]> wrote:
> We also don't presently have any cmpxchg implementation that really supports
> anything between conservative and relaxed, nor do we support different order
> constraints for the success vs failure cases. Things can be complicated
> enough as is; while we *could* fill some of that in, I'm not sure we should.

I forgot that the linux-ppc port tries harder in this area.

This was so a release-cmpxchg could be used in ParallelGC's
PSPromotionManager::copy_to_survivor_space and benefit from that.
The initial proposal was to use relaxed-cmpxchg, but that was shown
to be insufficient.



Re: Atomic operations: your thoughts are welcome

Aleksey Shipilev-4
In reply to this post by Kim Barrett
On 2/11/21 4:59 AM, Kim Barrett wrote:
>> On Feb 8, 2021, at 1:14 PM, Andrew Haley <[hidden email]> wrote:
>>
>> I've been looking at the hottest Atomic operations in HotSpot, with a view to
>> finding out if the default memory_order_conservative (which is very expensive
>> on some architectures) can be weakened to something less. It's impossible to
>> fix all of them, but perhaps we can fix some of the most frequent.
>
> Is there any information about the possible performance improvement from
> such changes?  1.5-3M occurrences doesn't mean much without context.

I am going through the exercise of relaxing some of the memory orders in Shenandoah code, and
AArch64 benefits greatly from it (= two-way barriers are bad in hot code).

There are obvious things like relaxing counter updates:
  JDK-8261503: Shenandoah: reconsider verifier memory ordering
  JDK-8261501: Shenandoah: reconsider heap statistics memory ordering
  JDK-8261500: Shenandoah: reconsider region live data memory ordering
  JDK-8261496: Shenandoah: reconsider pacing updates memory ordering
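[In standard-C++ terms, the counter relaxations above amount to something like this sketch (names are ours): a statistic that is only read coherently at a safepoint does not need a fully ordered update.]

```cpp
#include <atomic>
#include <cstddef>

static std::atomic<size_t> live_data{0};

// Before: live_data.fetch_add(n);  // seq_cst -- two-way barriers on AArch64
// After: a bare atomic add; the value is still exact, just unordered
// with respect to surrounding memory accesses.
void add_live(size_t n) {
  live_data.fetch_add(n, std::memory_order_relaxed);
}
```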

There are more interesting things like relaxing accesses to marking bitmap (which is a large counter
array in disguise) -- which effectively implies a CAS (and thus two FULL_MEM_BARRIER-s on AArch64)
per marked object:
  JDK-8261493: Shenandoah: reconsider bitmap access memory ordering

The five relaxations above cut down marking phase time on AArch64 by about 10..15%.

And there is more advanced stuff where relaxed is not enough, but conservative is too conservative.
There, acq/rel should be enough -- but we cannot yet test it, because AArch64 cmpxchg does not do
anything except relaxed/conservative (JDK-8261579):
  JDK-8261492: Shenandoah: reconsider forwardee accesses memory ordering
  JDK-8261495: Shenandoah: reconsider update references memory ordering

These two (along with the experimental 8261579 fix) cut down evacuation and update-references phase
times by about 25..30% and 10..15%, respectively.

All in all, this cuts down Shenandoah GC cycle times on AArch64 by about 15..20%! So, I believe
this shows enough benefit to invest our time. Heavy-duty GC code is where I expect the most benefit.

--
Thanks,
-Aleksey


Re: Atomic operations: your thoughts are welcome

Andrew Haley
In reply to this post by Kim Barrett
On 11/02/2021 03:59, Kim Barrett wrote:

>> On Feb 8, 2021, at 1:14 PM, Andrew Haley <[hidden email]> wrote:
>>
>> I've been looking at the hottest Atomic operations in HotSpot, with a view to
>> finding out if the default memory_order_conservative (which is very expensive
>> on some architectures) can be weakened to something less. It's impossible to
>> fix all of them, but perhaps we can fix some of the most frequent.
>
> Is there any information about the possible performance improvement from
> such changes?  1.5-3M occurrences doesn't mean much without context.
>
> We don't presently have support for sequentially consistent semantics, only
> "conservative". My recollection is that this is in part because there might
> be code that is assuming the possibly stronger "conservative" semantics, and
> in part because there are different and incompatible approaches to
> implementing sequentially consistent semantics on some hardware platforms
> and we didn't want to make assumptions there.
>
> We also don't presently have any cmpxchg implementation that really supports
> anything between conservative and relaxed, nor do we support different order
> constraints for the success vs failure cases. Things can be complicated
> enough as is; while we *could* fill some of that in, I'm not sure we should.

OK. However, even though we don't implement any of them, we do have an
API that includes acq, rel, and seq_cst. The fact that we don't have
anything behind them is, I thought, To Be Done rather than Won't Do.

>> <OopOopIterateDispatch<G1CMOopClosure>::Table::oop_oop_iterate<InstanceKlass, narrowOop>(G1CMOopClosure*, oopDesc*, Klass*)+336>:       :: = 3903178
>>
>> This is actually MarkBitMap::par_mark calling BitMap::par_set_bit. Does this
>> need to be memory_order_conservative, or would something weaker do? Even
>> acq_rel or seq_cst would be better.
>
> I think for setting bits in a bitmap the thing to do would be to identify
> places that are safe and useful (impacts performance) to do so first. Then
> add a weaker variant for use in those places, assuming any are found.

I see. I'm assuming that frequency of use is a useful proxy for impact.
Aleksey has already, very helpfully, measured how significant these are
for Shenandoah, and I suspect all concurrent GCs would benefit in a
similar fashion.

>> <Symbol::decrement_refcount()+44>:      :: = 2376632
>> <Symbol::increment_refcount()+44>:      :: = 2003895
>>
>> I can't imagine that either of these actually need memory_order_conservative,
>> they're just reference counts.
>
> The "usual" refcount implementation involves relaxed increment and stronger
> ordering for decrement. (If I'm remembering correctly, dec-acquire and a
> release fence on the zero value path before deleting. But I've not thought
> about what one might want for this CAS-based variant that handles boundary
> cases specially.) And as you say, whether any of these could be weakened
> depends on whether there is any code surrounding a use that depends on the
> stronger ordering semantics. At a guess I suspect increment could be changed
> to relaxed, but I've not looked. This one is probably a question for runtime
> folks.

OK, this makes sense. I'm thinking of the long road to getting this stuff
documented so that we can see what side effects of atomic operations are
actually required.

>> <OtherRegionsTable::add_reference(void*, unsigned int)+248>:    :: = 1719614
>>
>> BitMap::par_set_bit again.
>>
>> <G1ParScanThreadState::steal_and_trim_queue(GenericTaskQueueSet<OverflowTaskQueue<ScannerTask, (MEMFLAGS)5, 131072u>, (MEMFLAGS)5>*)+432>:      :: = 1617659
>>
>> This one is GenericTaskQueue::pop_global calling cmpxchg_age().
>> Again, do we need conservative here?
>
> This needs at least sequentially consistent semantics on the success path.

Yep. That's easy, it's the full barrier in the failure path that
I'd love to eliminate.

> See the referenced paper by Le, et al.
>
> There is also a cmpxchg_age in pop_local_slow.  The Le, et al paper doesn't
> deal with that path.  But it's also not in your list, which is good since
> this is supposed to be infrequently taken.

Right. I'm trying to concentrate on the low-hanging fruit.

Thank you for the very detailed and informative reply.

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


RE: Atomic operations: your thoughts are welcome

Doerr, Martin
Hi,

I appreciate this investigation.
PPC64 has optimized versions for _relaxed, _acquire, _release and _acq_rel which are substantially faster than the other memory order modes.
So we should be able to observe performance improvements when any of these ones are used in hot code.

Best regards,
Martin


> -----Original Message-----
> From: hotspot-dev <[hidden email]> On Behalf Of Andrew Haley
> Sent: Donnerstag, 11. Februar 2021 14:34
> To: Kim Barrett <[hidden email]>
> Cc: hotspot-gc-dev openjdk.java.net <[hidden email]>; [hidden email]
> Subject: Re: Atomic operations: your thoughts are welcome
>
> [...]


Re: Atomic operations: your thoughts are welcome

Kim Barrett
In reply to this post by Andrew Haley
> On Feb 11, 2021, at 8:33 AM, Andrew Haley <[hidden email]> wrote:
>
> On 11/02/2021 03:59, Kim Barrett wrote:
>>> On Feb 8, 2021, at 1:14 PM, Andrew Haley <[hidden email]> wrote:
>>>
>>> I've been looking at the hottest Atomic operations in HotSpot, with a view to
>>> finding out if the default memory_order_conservative (which is very expensive
>>> on some architectures) can be weakened to something less. It's impossible to
>>> fix all of them, but perhaps we can fix some of the most frequent.
>>
>> Is there any information about the possible performance improvement from
>> such changes?  1.5-3M occurrences doesn't mean much without context.
>>
>> We don't presently have support for sequentially consistent semantics, only
>> "conservative". My recollection is that this is in part because there might
>> be code that is assuming the possibly stronger "conservative" semantics, and
>> in part because there are different and incompatible approaches to
>> implementing sequentially consistent semantics on some hardware platforms
>> and we didn't want to make assumptions there.
>>
>> We also don't presently have any cmpxchg implementation that really supports
>> anything between conservative and relaxed, nor do we support different order
>> constraints for the success vs failure cases. Things can be complicated
>> enough as is; while we *could* fill some of that in, I'm not sure we should.
>
> OK. However, even though we don't implement any of them, we do have an
> API that includes acq, rel, and seq_cst. The fact that we don't have
> anything behind them is, I thought, To Be Done rather than Won't Do.

My inclination is to be pretty conservative in this area. (No pun intended.)
I'm not eager to have a lot of reviews like the one for JDK-8154736. (And in
looking back at that, I see we ended up not addressing non-ppc platforms,
even though there was specific concern at the time that by not dealing with
them (particularly arm/aarch64) that we might be fobbing off some really
hard debugging on some poor future person.)

>>> <OopOopIterateDispatch<G1CMOopClosure>::Table::oop_oop_iterate<InstanceKlass, narrowOop>(G1CMOopClosure*, oopDesc*, Klass*)+336>:       :: = 3903178
>>>
>>> This is actually MarkBitMap::par_mark calling BitMap::par_set_bit. Does this
>>> need to be memory_order_conservative, or would something weaker do? Even
>>> acq_rel or seq_cst would be better.
>>
>> I think for setting bits in a bitmap the thing to do would be to identify
>> places that are safe and useful (impacts performance) to do so first. Then
>> add a weaker variant for use in those places, assuming any are found.
>
> I see. I'm assuming that frequency of use is a useful proxy for impact.
> Aleksey has already, very helpfully, measured how significant these are
> for Shenandoah, and I suspect all concurrent GCs would benefit in a
> similar fashion.

Absolute counts don't say much without context. So what if there are a
million of these, if they are swamped by the 100 bazillion not-these?

Aleksey's measurements turned out to be less informative to me than they
seemed at first reading. Many of the proposed changes involve simple
counters or accumulators. Changing such to use relaxed atomic addition
operations is likely an easy improvement. But even that can suffer badly
from contention. If one is serious about reducing the cost of multi-threaded
accumulators, much better would be something like
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p0261r4.html
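[The contention-avoiding idea behind P0261's distributed counters can be sketched as follows: each writer updates its own cache-line-padded cell with a relaxed add, and readers sum the cells. This is our simplified illustration, not P0261's proposed interface.]

```cpp
#include <atomic>

constexpr int kStripes = 16;

// One counter cell per stripe, padded to a cache line so concurrent
// writers on different stripes don't false-share.
struct alignas(64) Cell { std::atomic<long> v{0}; };
static Cell cells[kStripes];

// Writers hash (here: mod) their thread id to a stripe; the add itself
// is relaxed, since only the eventual sum matters.
void add(int thread_id, long n) {
  cells[thread_id % kStripes].v.fetch_add(n, std::memory_order_relaxed);
}

// Readers pay the cost: sum all stripes for an approximate snapshot.
long read_sum() {
  long s = 0;
  for (auto& c : cells) s += c.v.load(std::memory_order_relaxed);
  return s;
}
```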

>>> <G1ParScanThreadState::steal_and_trim_queue(GenericTaskQueueSet<OverflowTaskQueue<ScannerTask, (MEMFLAGS)5, 131072u>, (MEMFLAGS)5>*)+432>:      :: = 1617659
>>>
>>> This one is GenericTaskQueue::pop_global calling cmpxchg_age().
>>> Again, do we need conservative here?
>>
>> This needs at least sequentially consistent semantics on the success path.
>
> Yep. That's easy, it's the full barrier in the failure path that
> I'd love to eliminate.

Why does the failure path matter here?

It should be rare [*], since it only fails when either there is contention
between a thief and the owner for the sole entry in the queue, or there is
contention between multiple thieves. The former should be rare because
non-empty queues usually contain more than one element. The latter should be
rare because of the random selection of queues to steal from. And in both
cases a losing thief will look for a new queue to steal from.

[*] The age/top (where pop_global takes from) and bottom (where push adds
and pop_local takes from) used to be adjacent members, so local operations
might induce false-sharing failures for the age/top CAS. These members were
separated in JDK 15.


Re: Atomic operations: your thoughts are welcome

Andrew Haley
On 12/02/2021 09:58, Kim Barrett wrote:

>> On Feb 11, 2021, at 8:33 AM, Andrew Haley <[hidden email]> wrote:
>>
>> On 11/02/2021 03:59, Kim Barrett wrote:
>>>
>>> We also don't presently have any cmpxchg implementation that really supports
>>> anything between conservative and relaxed, nor do we support different order
>>> constraints for the success vs failure cases. Things can be complicated
>>> enough as is; while we *could* fill some of that in, I'm not sure we should.
>>
>> OK. However, even though we don't implement any of them, we do have an
>> API that includes acq, rel, and seq_cst. The fact that we don't have
>> anything behind them is, I thought, To Be Done rather than Won't Do.
>
> My inclination is to be pretty conservative in this area. (No pun intended.)
> I'm not eager to have a lot of reviews like that for JDK-8154736. (And in
> looking back at that, I see we ended up not addressing non-ppc platforms,
> even though there was specific concern at the time that by not dealing with
> them (particularly arm/aarch64) that we might be fobbing off some really
> hard debugging on some poor future person.)

Sure, and as you are probably aware I've had to do that, more than
once, on dusty old GC code that didn't follow the memory model.

IMVHO, there are not many places where seq_cst won't be adequate.

>> I see. I'm assuming that frequency of use is a useful proxy for impact.
>> Aleksey has already, very helpfully, measured how significant these are
>> for Shenandoah, and I suspect all concurrent GCs would benefit in a
>> similar fashion.
>
> Absolute counts don't say much without context. So what if there are a
> million of these, if they are swamped by the 100 bazillion not-these?
>
> Aleksey's measurements turned out to be less informative to me than they
> seemed at first reading. Many of the proposed changes involve simple
> counters or accumulators. Changing such to use relaxed atomic addition
> operations is likely an easy improvement. But even that can suffer badly
> from contention. If one is serious about reducing the cost of multi-threaded
> accumulators, much better would be something like
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p0261r4.html

I very strongly disagree. Aleksey managed to prove a substantial
gain with only a couple of hours' work. We're talking about low-
hanging fruit here.

>>>> <G1ParScanThreadState::steal_and_trim_queue(GenericTaskQueueSet<OverflowTaskQueue<ScannerTask, (MEMFLAGS)5, 131072u>, (MEMFLAGS)5>*)+432>:      :: = 1617659
>>>>
>>>> This one is GenericTaskQueue::pop_global calling cmpxchg_age().
>>>> Again, do we need conservative here?
>>>
>>> This needs at least sequentially consistent semantics on the success path.
>>
>> Yep. That's easy, it's the full barrier in the failure path that
>> I'd love to eliminate.
>
> Why does the failure path matter here?
>
> It should be rare [*], since it only fails when either there is contention
> between a thief and the owner for the sole entry in the queue, or there is
> contention between multiple thieves.

OK, so that's useful guidance for an implementer: full barriers for CAS
failures should be wrapped in a conditional. That is a pain, because it
complicates the code, but OK.

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671