RFR (XS): 8188877: Improper synchronization in offer_termination


RFR (XS): 8188877: Improper synchronization in offer_termination

White, Derek

Hi All,

 

The field ParallelTaskTerminator::_offered_termination is modified atomically, but is read as a normal field access.

 

In ParallelTaskTerminator::offer_termination(), this field is read within a loop, but there's nothing stopping a compiler from hoisting the load out of the loop. See CR for examples of possible load hoisting.

 

Trivial fix: The _offered_termination field should be declared volatile.
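
To illustrate the hazard (this is only a standalone sketch with made-up names and types, not the HotSpot code; the real one-word change is in the webrev below):

  #include <cstdint>

  struct TerminatorSketch {
    uint32_t          _n_threads;
    volatile uint32_t _offered_termination;   // the fix: volatile forces a fresh load

    void spin_until_all_offered() {
      // With a plain (non-volatile) field the compiler could legally hoist the
      // load out of the loop and spin forever on a stale value; with volatile,
      // the field is re-read on every iteration.
      while (_offered_termination < _n_threads) {
        // the real code spins/yields/sleeps here
      }
    }
  };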

 

This was found by code inspection. I haven't seen an actual error, but compiler optimizations or unrelated code changes could make an error occur in the future.

 

CR: https://bugs.openjdk.java.net/browse/JDK-8188877

Webrev: http://cr.openjdk.java.net/~drwhite/8188877/webrev.01/

 

Minor build and testing done, but inspected the generated code for offer_termination() and found no differences.

 

  • Derek

Re: RFR (XS): 8188877: Improper synchronization in offer_termination

Kim Barrett
> On Oct 6, 2017, at 7:13 PM, White, Derek <[hidden email]> wrote:
>
> Hi All,
>  
> The field ParallelTaskTerminator::_offered_termination is modified atomically, but is read as a normal field access.
>  
> In ParallelTaskTerminator::offer_termination(), this field is read within a loop, but there's nothing stopping a compiler from hoisting the load out of the loop. See CR for examples of possible load hoisting.
>  
> Trivial fix: The _offered_termination field should be declared volatile.
>  
> This was found by code inspection. I haven't seen an actual error, but compiler optimizations or unrelated code changes could make an error occur in the future.
>  
> CR: https://bugs.openjdk.java.net/browse/JDK-8188877
> Webrev: http://cr.openjdk.java.net/~drwhite/8188877/webrev.01/
>  
> Minor build and testing done, but inspected the generated code for offer_termination() and found no differences.
>  
> • Derek

------------------------------------------------------------------------------
src/hotspot/share/gc/shared/taskqueue.cpp

The change to make _offered_termination volatile prevents the compiler
from hoisting it out of the loop, but I'm not sure it is enough to
prevent unnecessary delays in recognizing the termination condition.
In particular, while offer_termination is spinning, it's not clear
there's anything to ensure an update by some other thread will become
visible during the repeated spin cycles.  yield and sleep seem likely
to involve fences that will make such updates visible, but not so much
for SpinPause.

I also noticed the SpinPause implementations for linux_arm(64) and
linux_aarch64 are different, with the latter perhaps being lacking.

------------------------------------------------------------------------------  

In addition, some pre-existing issues:

------------------------------------------------------------------------------  
src/hotspot/share/gc/shared/taskqueue.cpp

[pre-existing]
The casts in ParallelTaskTerminator::offer_termination's
Atomic::inc/dec of _offered_termination are no longer needed.
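
That is, roughly (sketch only; the exact casts in the current source are
not reproduced here):

  // before (sketch):
  //   Atomic::inc((int*)&_offered_termination);
  //   Atomic::dec((int*)&_offered_termination);
  // after: the (now volatile uint) field can be passed directly
  Atomic::inc(&_offered_termination);
  Atomic::dec(&_offered_termination);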

------------------------------------------------------------------------------
src/hotspot/share/gc/shared/taskqueue.cpp

[pre-existing]
In ParallelTaskTerminator::offer_termination, a reference to
WorkStealingHardSpins is cast to uint.  The only reason for that cast
is that the variable is declared uintx, when it really ought to be
uint.  Probably all three of the WorkStealingMUMBLE experimental
parameters ought to be of type uint rather than uintx.

------------------------------------------------------------------------------
src/hotspot/share/gc/shared/taskqueue.cpp

[pre-existing]
ParallelTaskTerminator::offer_termination computes the hard_spin_limit
as
  hard_spin_limit = WorkStealingHardSpins >> WorkStealingSpinToYieldRatio;
  hard_spin_limit = MAX2(hard_spin_limit, 1U);

The use of a right shift will result in undefined behavior for
seemingly reasonable non-default choices for the ratio parameter,
e.g. any value >= number of bits in type of WorkStealingHardSpins.

Perhaps this should be addressed by limiting the permitted range for
the ratio parameter.
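
Alternatively (just a sketch, with an illustrative clamp; constraining the
flag's permitted range would be the more direct fix):

  // Clamp the shift so it can never reach the bit width of the flag's type,
  // since shifting by >= the width is undefined behavior.
  uint max_shift = (uint)(sizeof(WorkStealingHardSpins) * BitsPerByte) - 1;
  uint shift = MIN2((uint)WorkStealingSpinToYieldRatio, max_shift);
  hard_spin_limit = (uint)(WorkStealingHardSpins >> shift);
  hard_spin_limit = MAX2(hard_spin_limit, 1U);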

------------------------------------------------------------------------------


RE: RFR (XS): 8188877: Improper synchronization in offer_termination

White, Derek
Hi Kim,

Thanks for the review. More comments below...

> -----Original Message-----
> From: Kim Barrett [mailto:[hidden email]]
> Sent: Sunday, October 08, 2017 5:32 PM
> To: White, Derek <[hidden email]>
> Cc: [hidden email]
> Subject: Re: RFR (XS): 8188877: Improper synchronization in
> offer_termination
>
> > On Oct 6, 2017, at 7:13 PM, White, Derek <[hidden email]>
> wrote:
> >
> > Hi All,
> >
> > The field ParallelTaskTerminator::_offered_termination is modified
> atomically, but is read as a normal field access.
> >
> > In ParallelTaskTerminator::offer_termination(), this field is read within a
> loop, but there's nothing stopping a compiler from hoisting the load out of
> the loop. See CR for examples of possible load hoisting.
> >
> > Trivial fix: The _offered_termination field should be declared volatile.
> >
> > This was found by code inspection. I haven't seen an actual error, but
> compiler optimizations or unrelated code changes could make an error occur
> in the future.
> >
> > CR: https://bugs.openjdk.java.net/browse/JDK-8188877
> > Webrev: http://cr.openjdk.java.net/~drwhite/8188877/webrev.01/
> >
> > Minor build and testing done, but inspected the generated code for
> offer_termination() and found no differences.
> >
> > • Derek
>
> ------------------------------------------------------------------------------
> src/hotspot/share/gc/shared/taskqueue.cpp
>
> The change to make _offered_termination volatile prevents the compiler
> from hoisting it out of the loop, but I'm not sure it is enough to prevent
> unnecessary delays in recognizing the termination condition.
> In particular, while offer_termination is spinning, it's not clear there's
> anything to ensure an update by some other thread will become visible
> during the repeated spin cycles.  yield and sleep seem likely to involve fences
> that will make such updates visible, but not so much for SpinPause.

[Derek]
The concurrent access around _offered_termination is (semi*)-independent of other data accesses. So I'm not sure if more ordering constraints will help much here?

The atomic incr/dec operations provide a fence to publish the change. On aarch64 this is a DMB instruction. A full sync (DSB) will also publish the change, but blocks until the publish has been acked. But it doesn't publish any faster.

The only extra thing we could do is read _offered_termination using OrderAccess::load_acquire(). I think the only thing this adds in this case is ordering between the loads of _offered_termination in different  loop iterations. Is this required or helpful?

If so, we should probably encapsulate all accesses to _offered_termination and provide correct  access order constraints.
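
Something like this, as a sketch (accessor names invented here, not a concrete proposal):

  // sketch only
  uint offered_termination()  { return OrderAccess::load_acquire(&_offered_termination); }
  void offer()                { Atomic::inc(&_offered_termination); }
  void withdraw_offer()       { Atomic::dec(&_offered_termination); }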

*SEMI_DEPENDENT DATA ACCESSES
Ok, offered_termination accesses are related to TaskQueue operations, esp. peek(). Those operations seem to involve volatile accesses also, and seem safe to me, but I didn't try to understand the whole implementation of TaskQueues.

> I also noticed the SpinPause implementations for linux_arm(64) and
> linux_aarch64 are different, with the latter perhaps being lacking.

[Derek]
Hah! That's why I was looking at offer_termination() in the first place*.  There is more discussion about what to do with SpinPause on aarch64 than you'd first expect.

*And in the second place, offer_termination() shows up in the top 10 on perf when benchmarking Hadoop in a default GC config: 63+ GC threads don't have much to do when there's only a 4 GB heap.

> ------------------------------------------------------------------------------
>
> In addition, some pre-existing issues:
>
> ------------------------------------------------------------------------------
> src/hotspot/share/gc/shared/taskqueue.cpp
>
> [pre-existing]
> The casts in ParallelTaskTerminator::offer_termination's
> Atomic::inc/dec of _offered_termination are no longer needed.

[Derek]
OK.

> ------------------------------------------------------------------------------
> src/hotspot/share/gc/shared/taskqueue.cpp
>
> [pre-existing]
> In ParallelTaskTerminator::offer_termination, a reference to
> WorkStealingHardSpins is cast to uint.  The only reason for that cast is that
> the variable is declared uintx, when it really ought to be uint.  Probably all
> three of the WorkStealingMUMBLE experimental parameters ought to be of
> type uint rather than uintx.

[Derek]
OK

> ------------------------------------------------------------------------------
> src/hotspot/share/gc/shared/taskqueue.cpp
>
> [pre-existing]
> ParallelTaskTerminator::offer_termination computes the hard_spin_limit as
>   hard_spin_limit = WorkStealingHardSpins >> WorkStealingSpinToYieldRatio;
>   hard_spin_limit = MAX2(hard_spin_limit, 1U);
>
> The use of a right shift will result in undefined behavior for seemingly
> reasonable non-default choices for the ratio parameter, e.g. any value >=
> number of bits in type of WorkStealingHardSpins.
>
> Perhaps this should be addressed by limiting the permitted range for the ratio
> parameter.

[Derek]
OK.

> ------------------------------------------------------------------------------

[Derek] I'll get a new webrev out tomorrow.

Re: RFR (XS): 8188877: Improper synchronization in offer_termination

Kim Barrett
> On Oct 9, 2017, at 8:20 PM, White, Derek <[hidden email]> wrote:
>> src/hotspot/share/gc/shared/taskqueue.cpp
>>
>> The change to make _offered_termination volatile prevents the compiler
>> from hoisting it out of the loop, but I'm not sure it is enough to prevent
>> unnecessary delays in recognizing the termination condition.
>> In particular, while offer_termination is spinning, it's not clear there's
>> anything to ensure an update by some other thread will become visible
>> during the repeated spin cycles.  yield and sleep seem likely to involve fences
>> that will make such updates visible, but not so much for SpinPause.
>
> [Derek]
> The concurrent access around _offered_termination is (semi*)-independent of other data accesses. So I'm not sure if more ordering constraints will help much here?
>
> The atomic incr/dec operations provide a fence to publish the change. On aarch64 this is a DMB instruction. A full sync (DSB) will also publish the change, but blocks until the publish has been acked. But it doesn't publish any faster.
>
> The only extra thing we could do is read _offered_termination using OrderAccess::load_acquire(). I think the only thing this adds in this case is ordering between the loads of _offered_termination in different  loop iterations. Is this required or helpful?

Consider the case where there is no further work to do, and all
threads are heading toward offer_termination.  Until the last thread
makes the offer, the others will be in the spin1...spinN/yield/sleep
cycle, as expected.  But because there appear to be no memory barriers
in the spin part of that cycle, a thread in the spin part might not
see the final offer until it completes the full spin cycle and then
performs the yield (assuming yield does have sufficient memory
barriers to acquire the published value).

So I think the read of _offered_termination for comparison with
_n_threads probably ought to be an acquire.
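
Roughly (a sketch, not a concrete patch):

  // at the termination check inside the spin/yield/sleep loop:
  if (OrderAccess::load_acquire(&_offered_termination) == _n_threads) {
    // the last offer has been observed
  }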

>> […]
>>
>> src/hotspot/share/gc/shared/taskqueue.cpp
>>
>> [pre-existing]
>> ParallelTaskTerminator::offer_termination computes the hard_spin_limit as
>>  hard_spin_limit = WorkStealingHardSpins >> WorkStealingSpinToYieldRatio;
>>  hard_spin_limit = MAX2(hard_spin_limit, 1U);
>>
>> The use of a right shift will result in undefined behavior for seemingly
>> reasonable non-default choices for the ratio parameter, e.g. any value >=
>> number of bits in type of WorkStealingHardSpins.
>>
>> Perhaps this should be addressed by limiting the permitted range for the ratio
>> parameter.

Thinking about this some more, WorkStealingSpinToYieldRatio seems
*really* poorly named; that hardly looks like a ratio at all.


Re: RFR (XS): 8188877: Improper synchronization in offer_termination

Thomas Schatzl
On Mon, 2017-10-09 at 21:44 -0400, Kim Barrett wrote:

> > On Oct 9, 2017, at 8:20 PM, White, Derek <[hidden email]>
> > wrote:
> > > src/hotspot/share/gc/shared/taskqueue.cpp
> > >
> > > The change to make _offered_termination volatile prevents the
> > > compiler
> > > from hoisting it out of the loop, but I'm not sure it is enough
> > > to prevent
> > > unnecessary delays in recognizing the termination condition.
> > > In particular, while offer_termination is spinning, it's not
> > > clear there's
> > > anything to ensure an update by some other thread will become
> > > visible
> > > during the repeated spin cycles.  yield and sleep seem likely to
> > > involve fences
> > > that will make such updates visible, but not so much for
> > > SpinPause.
> >
> > [Derek] 
> > The concurrent access around _offered_termination is (semi*)-
> > independent of other data accesses. So I'm not sure if more
> > ordering constraints will help much here?
> >
> > The atomic incr/dec operations provide a fence to publish the
> > change. On aarch64 this is a DMB instruction. A full sync (DSB)
> > will also publish the change, but blocks until the publish has been
> > acked. But it doesn't publish any faster.
> >
> > The only extra thing we could do is read _offered_termination using
> > OrderAccess::load_acquire(). I think the only thing this adds in
> > this case is ordering between the loads of _offered_termination in
> > different  loop iterations. Is this required or helpful? 
>
> Consider the case where there is no further work to do, and all
> threads are heading toward offer_termination.  Until the last thread
> makes the offer, the others will be in the spin1...spinN/yield/sleep
> cycle, as expected.  But because there appear to be no memory
> barriers in the spin part of that cycle, a thread in the spin part
> might not see the final offer until it completes the full spin cycle
> and then performs the yield (assuming yield does have sufficient
> memory barriers to acquire the published value).
>
> So I think the read of _offered_termination for comparison with
> _n_threads probably ought to be an acquire.

  I agree with Derek that adding an acquire for the load of
_offered_termination does not seem to improve termination properties,
and is imho unnecessary (but the volatile declaration of course is).

Thanks,
  Thomas


RE: RFR (XS): 8188877: Improper synchronization in offer_termination

White, Derek
Hi Kim, Thomas,

So Thomas, I think I agree with my original self too 😊.

Kim's concern:
> Consider the case where there is no further work to do, and all
> threads are heading toward offer_termination.  Until the last thread
> makes the offer, the others will be in the spin1...spinN/yield/sleep
> cycle, as expected.  But because there appear to be no memory barriers
> in the spin part of that cycle, a thread in the spin part might not
> see the final offer until it completes the full spin cycle and then
> performs the yield (assuming yield does have sufficient memory
> barriers to acquire the published value).

My understanding is that the "acquire" semantics are entirely about memory ordering, within a CPU. In particular it prevents "following loads" from executing before the "load acquire".

There is nothing in the "load acquire" that causes it to synchronize with the memory system more or less quickly than a naked load. Either kind of load will eventually notice that its local cached value has been invalidated and will fetch the updated value.

In the context of a spin-loop, there could be some large number of loads (from the same variable), comparisons, and branches in flight. Without a load-acquire, it's not clear which load will see the final value, but it doesn't matter. As long as one iteration of the loop sees that value, the loop will terminate.

BTW, I have seen several explanations of what the Intel PAUSE instruction does. In its earliest forms it seemed to be about avoiding having to clean up either mis-predicted instructions in flight or carefully ordering the memory loads. I might be misremembering, and I can't find my sources for this, though.

 - Derek

> -----Original Message-----
> From: Thomas Schatzl [mailto:[hidden email]]
> Sent: Thursday, November 16, 2017 3:42 AM
> To: Kim Barrett <[hidden email]>; White, Derek
> <[hidden email]>
> Cc: [hidden email]
> Subject: Re: RFR (XS): 8188877: Improper synchronization in
> offer_termination
>
> On Mon, 2017-10-09 at 21:44 -0400, Kim Barrett wrote:
> > > On Oct 9, 2017, at 8:20 PM, White, Derek <[hidden email]>
> > > wrote:
> > > > src/hotspot/share/gc/shared/taskqueue.cpp
> > > >
> > > > The change to make _offered_termination volatile prevents the
> > > > compiler from hoisting it out of the loop, but I'm not sure it is
> > > > enough to prevent unnecessary delays in recognizing the
> > > > termination condition.
> > > > In particular, while offer_termination is spinning, it's not clear
> > > > there's anything to ensure an update by some other thread will
> > > > become visible during the repeated spin cycles.  yield and sleep
> > > > seem likely to involve fences that will make such updates visible,
> > > > but not so much for SpinPause.
> > >
> > > [Derek]
> > > The concurrent access around _offered_termination is (semi*)-
> > > independent of other data accesses. So I'm not sure if more ordering
> > > constraints will help much here?
> > >
> > > The atomic incr/dec operations provide a fence to publish the
> > > change. On aarch64 this is a DMB instruction. A full sync (DSB) will
> > > also publish the change, but blocks until the publish has been
> > > acked. But it doesn't publish any faster.
> > >
> > > The only extra thing we could do is read _offered_termination using
> > > OrderAccess::load_acquire(). I think the only thing this adds in
> > > this case is ordering between the loads of _offered_termination in
> > > different  loop iterations. Is this required or helpful?
> >
> > Consider the case where there is no further work to do, and all
> > threads are heading toward offer_termination.  Until the last thread
> > makes the offer, the others will be in the spin1...spinN/yield/sleep
> > cycle, as expected.  But because there appear to be no memory barriers
> > in the spin part of that cycle, a thread in the spin part might not
> > see the final offer until it completes the full spin cycle and then
> > performs the yield (assuming yield does have sufficient memory
> > barriers to acquire the published value).
> >
> > So I think the read of _offered_termination for comparison with
> > _n_threads probably ought to be an acquire.
>
>   I agree with Derek that adding an acquire for the load of
> _offered_termination does not seem to improve termination properties, and
> is imho unnecessary (but the volatile declaration of course is).
>
> Thanks,
>   Thomas


Re: RFR (XS): 8188877: Improper synchronization in offer_termination

Kim Barrett
> On Nov 21, 2017, at 4:53 PM, White, Derek <[hidden email]> wrote:
>
> Hi Kim, Thomas,
>
> So Thomas, I think I agree with my original self too 😊.

Agreed.  I fell into a common trap, and should know better.

Looks good.


Re: RFR (XS): 8188877: Improper synchronization in offer_termination

Andrew Haley
In reply to this post by White, Derek
On 21/11/17 21:53, White, Derek wrote:
> My understanding is that the "acquire" semantics are entirely about
> memory ordering, within a CPU. In particular it prevents "following
> loads" from executing before the "load acquire".
>
>
> There is nothing in the "load acquire" that causes it to synchronize
> with the memory system more or less quickly than a naked load.

The abstract architecture only specifies things in terms of ordering
between loads, but it has to be implemented somehow, and this is MESI
or something similar.  Stores cause invalidate messages to be sent,
and these are put into the reader's invalidate queue.  When that
reader executes a load barrier it marks all the entries currently in
its invalidate queue.  The next load will wait until all marked
entries have been applied to the reader's cache.

> Either kind of read will eventually notice that its local cached
> value has been invalidated and will fetch the updated value.

Eventually, yes.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: RFR (XS): 8188877: Improper synchronization in offer_termination

Thomas Schatzl
Hi all,

On Wed, 2017-11-22 at 09:13 +0000, Andrew Haley wrote:

> On 21/11/17 21:53, White, Derek wrote:
> > My understanding is that the "acquire" semantics are entirely about
> > memory ordering, within a CPU. In particular it prevents "following
> > loads" from executing before the "load acquire".
> >
> >
> > There is nothing in the "load acquire" that causes it to
> > synchronize
> > with the memory system more or less quickly than a naked load.
>
> The abstract architecture only specifies things in terms of ordering
> between loads, but it has to be implemented somehow, and this is MESI
> or something similar.  Stores cause invalidate messages to be sent,
> and these are put into the reader's invalidate queue.  When that
> reader executes a load barrier it marks all the entries currently in
> its invalidate queue.  The next load will wait until all marked
> entries have been applied to the reader's cache.
>
> > Either kind of read will eventually notice that its local cached
> > value has been invalidated and will fetch the updated value.
>
> Eventually, yes.
>

  so, summing up this discussion, the change is good to go? I think we
can always add any implementation dependent optimizations later.

If everybody agrees, I will push it.

Thanks,
  Thomas


Re: RFR (XS): 8188877: Improper synchronization in offer_termination

Thomas Schatzl
Hi again,

On Mon, 2017-11-27 at 10:30 +0100, Thomas Schatzl wrote:

> Hi all,
>
> On Wed, 2017-11-22 at 09:13 +0000, Andrew Haley wrote:
> > On 21/11/17 21:53, White, Derek wrote:
> > > My understanding is that the "acquire" semantics are entirely
> > > about
> > > memory ordering, within a CPU. In particular it prevents
> > > "following
> > > loads" from executing before the "load acquire".
> > >
> > >
> > > There is nothing in the "load acquire" that causes it to
> > > synchronize
> > > with the memory system more or less quickly than a naked load.
> >
> > The abstract architecture only specifies things in terms of
> > ordering
> > between loads, but it has to be implemented somehow, and this is
> > MESI
> > or something similar.  Stores cause invalidate messages to be sent,
> > and these are put into the reader's invalidate queue.  When that
> > reader executes a load barrier it marks all the entries currently
> > in
> > its invalidate queue.  The next load will wait until all marked
> > entries have been applied to the reader's cache.
> >
> > > Either kind of read will eventually notice that its local cached
> > > value has been invalidated and will fetch the updated value.
> >
> > Eventually, yes.
> >
>
>   so, summing up this discussion, the change is good to go? I think
> we can always add any implementation dependent optimizations later.
>
> If everybody agrees, I will push it.

  and btw, looks good :)

Thomas


Re: RFR (XS): 8188877: Improper synchronization in offer_termination

Andrew Haley
In reply to this post by Thomas Schatzl
On 27/11/17 09:30, Thomas Schatzl wrote:
>   so, summing up this discussion, the change is good to go? I think we
> can always add any implementation dependent optimizations later.
>
> If everybody agrees, I will push it.

OK from me.  I'm still a bit nervous about the possible performance
implications of interpreted code taking an extremely long time to get
to a safepoint.  However, I did some experiments and it looks like the
worst case we're likely to encounter is about a couple of hundred
microseconds.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: RFR (XS): 8188877: Improper synchronization in offer_termination

Andrew Dinn
In reply to this post by Andrew Haley
On 22/11/17 09:13, Andrew Haley wrote:

> On 21/11/17 21:53, White, Derek wrote:
>> My understanding is that the "acquire" semantics are entirely about
>> memory ordering, within a CPU. In particular it prevents "following
>> loads" from executing before the "load acquire".
>>
>>
>> There is nothing in the "load acquire" that causes it to synchronize
>> with the memory system more or less quickly than a naked load.
>
> The abstract architecture only specifies things in terms of ordering
> between loads, but it has to be implemented somehow, and this is MESI
> or something similar.  Stores cause invalidate messages to be sent,
> and these are put into the reader's invalidate queue.  When that
> reader executes a load barrier it marks all the entries currently in
> its invalidate queue.  The next load will wait until all marked
> entries have been applied to the reader's cache.

That's what happens when the reader executes a read barrier. The
interesting question is what happens when the reader does not execute a
read barrier.

>> Either kind of read will eventually notice that its local cached
>> value has been invalidated and will fetch the updated value.
>
> Eventually, yes.
That's a rather ill-defined span of time ;-)

I understand that you tested this and found that it took no longer than
a few hundred microseconds. However, I really have to ask what precisely
the reader was doing during the test?

Specifically, does the time taken to 'eventually' notice a write to the
LDRed location depend upon what other instructions are executed between
successive LDRs?

regards,


Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander

Re: RFR (XS): 8188877: Improper synchronization in offer_termination

Andrew Haley
On 27/11/17 12:30, Andrew Dinn wrote:

> On 22/11/17 09:13, Andrew Haley wrote:
>> On 21/11/17 21:53, White, Derek wrote:
>>> My understanding is that the "acquire" semantics are entirely about
>>> memory ordering, within a CPU. In particular it prevents "following
>>> loads" from executing before the "load acquire".
>>>
>>> There is nothing in the "load acquire" that causes it to synchronize
>>> with the memory system more or less quickly than a naked load.
>>
>> The abstract architecture only specifies things in terms of ordering
>> between loads, but it has to be implemented somehow, and this is MESI
>> or something similar.  Stores cause invalidate messages to be sent,
>> and these are put into the reader's invalidate queue.  When that
>> reader executes a load barrier it marks all the entries currently in
>> its invalidate queue.  The next load will wait until all marked
>> entries have been applied to the reader's cache.
>
> That's what happens when the reader executes a read barrier. The
> interesting question is what happens when the reader does not execute a
> read barrier.

The invalidate messages still arrive at the reader, but they sit in
the invalidate queue and aren't acted upon immediately.  Eventually
they must be processed, either lazily or because the reader's
invalidate queue fills up.

>>> Either kind of read will eventually notice that its local cached
>>> value has been invalidated and will fetch the updated value.
>>
>> Eventually, yes.
> That's a rather ill-defined span of time ;-)
>
> I understand that you tested this and found that it took no longer than
> a few hundred microseconds. However, I really have to ask what precisely
> the reader was doing during the test?

Nothing except spinning and loading, and that's a few microseconds'
delay rather than a few hundred.

> Specifically, does the time taken to 'eventually' notice a write to the
> LDRed location depend upon what other instructions are executed between
> successive LDRs?

It's really hard to be definite about that.  In practice it may well
be that back-to-back local cache accesses saturate the CPU<->cache
interconnect so much that they delay the processing of invalidate
queue entries, but that's my speculation and it's secret sauce anyway.

It is likely, though, that if you issue a load barrier the next load
must cause the contents of the invalidate queue to be applied to the
cache, in order to ensure that everything happens in order.  So I
suspect that load barriers will cause changes to be seen earlier.
Having said that, load barriers slow down all reads for a while.

And one final caveat: I'm talking about MESI, but there are more
elaborate and sophisticated ways of making this stuff work.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: RFR (XS): 8188877: Improper synchronization in offer_termination

Andrew Dinn
On 27/11/17 14:53, Andrew Haley wrote:
> On 27/11/17 12:30, Andrew Dinn wrote:
<snip>
>> That's what happens when the reader executes a read barrier. The
>> interesting question is what happens when the reader does not execute a
>> read barrier.
>
> The invalidate messages still arrive at the reader, but they sit in
> the invalidate queue and aren't acted upon immediately.  Eventually
> they must be processed, either lazily or because the reader's
> invalidate queue fills up.

Hmm, that explanation assumes there will be other invalidate messages.
But at a STW pause that's not necessarily going to be the case. In the
worst case all other threads could be spinning on a barrier count
while this one core/thread has a single invalidate message in its queue.

>> I understand that you tested this and found that it took no longer than
>> a few hundred microseconds. However, I really have to ask what precisely
>> the reader was doing during the test?
>
> Nothing except spinning and loading, and that's a few microseconds'
> delay rather than a few hundred.

Ok, so it's not as if the reader noticed the invalidate because it was
writing memory or performing some other 'active' interaction with the
memory system that forced the invalidate queue to be processed (assuming
that invalidate detection is not a sneaky side-effect of a nop :-).

>> Specifically, does the time taken to 'eventually' notice a write to the
>> LDRed location depend upon what other instructions are executed between
>> successive LDRs?
>
> It's really hard to be definite about that.  In practice it may well
> be that back-to-back local cache accesses saturate the CPU<->cache
> interconnect so much that they delay the processing of invalidate
> queue entries, but that's my speculation and it's secret sauce anyway.
 . . .
> And one final caveat: I'm talking about MESI, but there are more
> elaborate and sophisticated ways of making this stuff work.

Well yes, which is fine so long as any such elaborate sophistricat^H^H^H
improvement of current behaviour doesn't assume that a core can ignore
the invalidate queue absent an ldar (or, maybe, absent ldar union some
other subset of the available memory ops).

regards,


Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander

Re: RFR (XS): 8188877: Improper synchronization in offer_termination

Andrew Haley
On 27/11/17 15:44, Andrew Dinn wrote:

> On 27/11/17 14:53, Andrew Haley wrote:
>> On 27/11/17 12:30, Andrew Dinn wrote:
> <snip>
>>> That's what happens when the reader executes a read barrier. The
>>> interesting question is what happens when the reader does not execute a
>>> read barrier.
>>
>> The invalidate messages still arrive at the reader, but they sit in
>> the invalidate queue and aren't acted upon immediately.  Eventually
>> they must be processed, either lazily or because the reader's
>> invalidate queue fills up.
>
> Hmm, that explanation assumes there will be other invalidate messages.

No, not at all.  By "lazily" I mean that while a core has nothing else
to do it might as well process its invalidate queue, and AFAIK that is
what happens.

> But at a STW pause that's not necessarily going to be the case. In the
> worst case all other threads could be spinning on a barrier count
> while this one core/thread has a single invalidate message in its queue.

That could be, but there are other things going on.  There are other threads
active, and invalidate messages get sent to everyone.

In practice, I've never seen more than a few microseconds of delay.

Bear in mind that the interpreter changes we just made mean that
interpreted code won't necessarily see safepoint status changes for
about 100 microseconds, so the lack of an acquiring load in our
code is really not the biggest issue.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

RE: RFR (XS): 8188877: Improper synchronization in offer_termination

White, Derek
In reply to this post by Thomas Schatzl
Thomas, Kim, Andrew, Andrew,

Thanks for the review comments and push!

 - Derek

> -----Original Message-----
> From: Thomas Schatzl [mailto:[hidden email]]
> Sent: Monday, November 27, 2017 4:32 AM
> To: Andrew Haley <[hidden email]>; White, Derek
> <[hidden email]>; Kim Barrett <[hidden email]>
> Cc: [hidden email]
> Subject: Re: RFR (XS): 8188877: Improper synchronization in
> offer_termination
>
> Hi again,
>
> On Mon, 2017-11-27 at 10:30 +0100, Thomas Schatzl wrote:
> > Hi all,
> >
> > On Wed, 2017-11-22 at 09:13 +0000, Andrew Haley wrote:
> > > On 21/11/17 21:53, White, Derek wrote:
> > > > My understanding is that the "acquire" semantics are entirely
> > > > about memory ordering, within a CPU. In particular it prevents
> > > > "following loads" from executing before the "load acquire".
> > > >
> > > >
> > > > There is nothing in the "load acquire" that causes it to
> > > > synchronize with the memory system more or less quickly than a
> > > > naked load.
> > >
> > > The abstract architecture only specifies things in terms of ordering
> > > between loads, but it has to be implemented somehow, and this is
> > > MESI or something similar.  Stores cause invalidate messages to be
> > > sent, and these are put into the reader's invalidate queue.  When
> > > that reader executes a load barrier it marks all the entries
> > > currently in its invalidate queue.  The next load will wait until
> > > all marked entries have been applied to the reader's cache.
> > >
> > > > Either kind of read will eventually notice that its local cached
> > > > value has been invalidated and will fetch the updated value.
> > >
> > > Eventually, yes.
> > >
> >
> >   so, summing up this discussion, the change is good to go? I think we
> > can always add any implementation dependent optimizations later.
> >
> > If everybody agrees, I will push it.
>
>   and btw, looks good :)
>
> Thomas


RE: RFR (XS): 8188877: Improper synchronization in offer_termination

White, Derek
In reply to this post by Andrew Haley
Hi Andrew and Andrew,

Thanks for the discussion on load-acquire. This has been informative.

Related to timely notifications, one thing about offer_termination is that it's not doing a classic spin-wait. Classic spin-waits (as seen in mutex.cpp, objectMonitor.cpp, safepoint.cpp, and synchronizer.cpp, for example) test the termination condition as part of the loop.

offer_termination just has a simple for-loop that delays for some number of cycles. That can be as many as 4k iterations * 140 cycles (per SpinPause() on x86), or roughly 573,000 cycles. For this case, especially where the termination test is a simple load, I think we should test _offered_termination in the spin-wait. This should have low overhead on the spinning thread and impose no impact on other threads.
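
For example (a sketch; the surrounding code is simplified):

  for (uint j = 0; j < hard_spin_limit; j++) {
    SpinPause();
    // proposed: notice the last offer without waiting out the whole delay loop
    if (_offered_termination == _n_threads) {
      break;
    }
  }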

Unless there's disagreement I'll create an enhancement request for this. I'll add a note about the cleanups that Kim mentioned also.

 - Derek

> -----Original Message-----
> From: Andrew Haley [mailto:[hidden email]]
> Sent: Monday, November 27, 2017 10:57 AM
> To: Andrew Dinn <[hidden email]>; White, Derek
> <[hidden email]>; Thomas Schatzl
> <[hidden email]>; Kim Barrett <[hidden email]>
> Cc: [hidden email]
> Subject: Re: RFR (XS): 8188877: Improper synchronization in
> offer_termination
>
> On 27/11/17 15:44, Andrew Dinn wrote:
> > On 27/11/17 14:53, Andrew Haley wrote:
> >> On 27/11/17 12:30, Andrew Dinn wrote:
> > <snip>
> >>> That's what happens when the reader executes a read barrier. The
> >>> interesting question is what happens when the reader does not
> >>> execute a read barrier.
> >>
> >> The invalidate messages still arrive at the reader, but they sit in
> >> the invalidate queue and aren't acted upon immediately.  Eventually
> >> they must be processed, either lazily or because the reader's
> >> invalidate queue fills up.
> >
> > Hmm, that explanation assumes there will be other invalidate messages.
>
> No, not at all.  By "lazily" I mean that while a core has nothing else to do it
> might as well process its invalidate queue, and AFAIK that is what happens.
>
> > But at a STW pause that's not necessarily going to be the case. In the
> > worst case all other threads could be spinning on a barrier count
> > while this one core/thread has a single invalidate message in its queue.
>
> That could be, but there are other things going on.  There are other threads
> active, and invalidate messages get sent to everyone.
>
> In practice, I've never seen more than a few microseconds of delay.
>
> Bear in mind that the interpreter changes we just made mean that
> interpreted code won't necessarily see safepoint status changes for about
> 100 microseconds, so the lack of an acquiring load in our code is really not
> the biggest issue.
>
> --
> Andrew Haley
> Java Platform Lead Engineer
> Red Hat UK Ltd. <https://www.redhat.com>
> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: RFR (XS): 8188877: Improper synchronization in offer_termination

Andrew Haley
On 28/11/17 21:11, White, Derek wrote:
> Offer-termination just has a simple for-loop that delays some number of cycles. As high as 4k iterations * 140 cycles (per SpinPause() on x86), could be 573,000 cycles or so. For this case, especially where the termination test is a simple load, I think we should test _offered_termination in the spin-wait. This should have low overhead on the spinning thread and impose no impact on other threads.

Please point to the exact lines of code you're talking about.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

RE: RFR (XS): 8188877: Improper synchronization in offer_termination

White, Derek
Hi Andrew,

> -----Original Message-----
> From: Andrew Haley [mailto:[hidden email]]
> Sent: Wednesday, November 29, 2017 5:24 AM
> To: White, Derek <[hidden email]>; Andrew Dinn
> <[hidden email]>; Thomas Schatzl <[hidden email]>; Kim
> Barrett <[hidden email]>
> Cc: [hidden email]
> Subject: Re: RFR (XS): 8188877: Improper synchronization in
> offer_termination
>
> On 28/11/17 21:11, White, Derek wrote:
> > Offer-termination just has a simple for-loop that delays some number of
> cycles. As high as 4k iterations * 140 cycles (per SpinPause() on x86), could be
> 573,000 cycles or so. For this case, especially where the termination test is a
> simple load, I think we should test _offered_termination in the spin-wait.
> This should have low overhead on the spinning thread and impose no impact
> on other threads.
>
> Please point to the exact lines of code you're talking about.

taskqueue.cpp, in offer_termination(), about line 209:

          for (uint j = 0; j < hard_spin_limit; j++) {
            SpinPause();
          }

hard_spin_limit ranges from 4..4096.

SpinPause() on Intel is "pause" (no surprise).

Pause is up to 140 cycles on Skylake and above (up from 10 cycles).
"Intel® 64 and IA-32 Architectures Optimization Reference Manual", Sect 8.4.7.

 - Derek