RFR: 8189596: AArch64: implementation for Thread-local handshakes


RFR: 8189596: AArch64: implementation for Thread-local handshakes

Andrew Haley
It was quite tricky to get this one right.

Some notes:

I deleted LIR_Assembler::poll_for_safepoint because it's dead code.

I set ThreadLocalHandshakes to true.  This follows x86-64 and is
probably the right thing to do, but it does have the disadvantage that
it considerably extends the time to safepoint for interpreted code
because we only poll at branches.  I think this is probably a bad
idea, but again it's the same as x86-64.  OTOH, checking the safepoint
bit at every bytecode would seriously bloat the interpreter.  Oh dear.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: RFR: 8189596: AArch64: implementation for Thread-local handshakes

Andrew Haley
On 23/11/17 11:30, Andrew Haley wrote:

> It was quite tricky to get this one right.
>
> Some notes:
>
> I deleted LIR_Assembler::poll_for_safepoint because it's dead code.
>
> I set ThreadLocalHandshakes to true.  This follows x86-64 and is
> probably the right thing to do, but it does have the disadvantage that
> it considerably extends the time to safepoint for interpreted code
> because we only poll at branches.  I think this is probably a bad
> idea, but again it's the same as x86-64.  OTOH, checking the safepoint
> bit at every bytecode would seriously bloat the interpreter.  Oh dear.
>

http://cr.openjdk.java.net/~aph/8189596-1/

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: [aarch64-port-dev ] RFR: 8189596: AArch64: implementation for Thread-local handshakes

Andrew Dinn
On 23/11/17 14:12, Andrew Haley wrote:

> On 23/11/17 11:30, Andrew Haley wrote:
>> It was quite tricky to get this one right.
>>
>> Some notes:
>>
>> I deleted LIR_Assembler::poll_for_safepoint because it's dead code.
>>
>> I set ThreadLocalHandshakes to true.  This follows x86-64 and is
>> probably the right thing to do, but it does have the disadvantage that
>> it considerably extends the time to safepoint for interpreted code
>> because we only poll at branches.  I think this is probably a bad
>> idea, but again it's the same as x86-64.  OTOH, checking the safepoint
>> bit at every bytecode would seriously bloat the interpreter.  Oh dear.
>>
>
> http://cr.openjdk.java.net/~aph/8189596-1/
I have reviewed this by eyeball and it all looks good. I still need to
build and run to be 100% sure.

regards,


Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander

Re: RFR: 8189596: AArch64: implementation for Thread-local handshakes

Erik Österlund-2
In reply to this post by Andrew Haley
Hi Andrew,

I am glad to see you also support the move to use Thread-local
handshakes. Thank you for that.
I think your approach looks good. However, there is a certain race that
I believe you are currently vulnerable to that could cause trouble for
you. So I thought I'd point it out.

Specifically, when transitioning from native back to Java, there is a
peculiar race condition on your platform unless more synchronization is
applied. I would recommend loading the thread-local poll value with ldar
when waking up from native. Otherwise, a subsequent is_at_safepoint()
check on the thread waking up from native can give the wrong result,
which could cause a MutexLockerEx that is conditionally taken only when
we are not at a safepoint to hit the wrong path and not be taken,
because the thread still thinks it is at a safepoint.

The race is due to an interaction between the wake-up path of the native
wrapper and the safepoint unsynchronizing path.

When SafepointSynchronize::end() starts, all JavaThreads except the ones
in native and in blocked are blocked. Specifically there exists a race
with the native wrapper trying to wake up from native while the
SafepointSynchronizer is waking up from a safepoint.

The SafepointSynchronize::end() will conceptually perform the following
important steps:

VM.1) SafepointSynchronize::_state = _not_synchronized
VM.2) For each thread: disarm local poll

Your native wrapper wakeup path filters slow paths on the local poll
(without acquire() or loadload()), but no longer the global _state value.

So a JavaThread waking up from native could do:

JT.1) load local poll (with plain ldr), check if it is armed (it is not,
because it observes the store from VM.1)
JT.2) release_store the thread state to Java with stlr
JT.3) enter arbitrary Java code
JT.4) make leaf call to the runtime from Java without thread transition
JT.5) Check if in a safepoint by loading SafepointSynchronize::_state

At JT.5, it is not guaranteed that the load observes the updated
_not_synchronized value written in VM.1 (because the load in JT.1 and
JT.5 may reorder). In this scenario, the check to see whether you are in
a safepoint might, if very unlucky, spuriously return true. From then
on, bad things can happen.

By placing loading the local poll value with ldar *only* in the native
wrapper wakeup function, you avoid these issues.
Another approach you may elect to use is to only in the native wrapper
load both the thread-local poll value and the
SafepointSynchronize::_state, when filtering slow paths to avoid this
unfortunate race.
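
The race can be sketched with C++11 atomics. All names here are made up
for the sketch (they are not HotSpot's), and the acquire load stands in
for ldar:

```cpp
#include <atomic>

std::atomic<int> global_state{1};   // 1 models _synchronized
std::atomic<int> local_poll{1};     // 1 models an armed local poll

// SafepointSynchronize::end(), conceptually:
void vm_end() {
  global_state.store(0, std::memory_order_relaxed);  // VM.1
  local_poll.store(0, std::memory_order_release);    // VM.2: disarm
}

// Native wrapper wakeup. With a plain ldr (relaxed) at JT.1, the load
// at JT.5 may be satisfied by a stale _synchronized value. Loading the
// poll with acquire (ldar) forbids that reordering: if JT.1 observes
// the disarm from VM.2, JT.5 must also observe the store from VM.1.
bool still_at_safepoint() {
  if (local_poll.load(std::memory_order_acquire) == 0) {  // JT.1
    // JT.2-JT.4: transition to Java, run code, make a leaf call ...
    return global_state.load(std::memory_order_relaxed) == 1;  // JT.5
  }
  return true;  // poll armed: take the slow path
}
```

With the acquire load, once vm_end() has run and the disarm is observed,
still_at_safepoint() cannot return true.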

I have a bunch of other ideas as well exploiting dependent loads, but
thought maybe I should start the conversation and see what you think
before running off too much.

Again, very happy to see this move to use thread-local handshakes.
Thanks for doing this.

Thanks,
/Erik

On 2017-11-23 15:12, Andrew Haley wrote:

> On 23/11/17 11:30, Andrew Haley wrote:
>> It was quite tricky to get this one right.
>>
>> Some notes:
>>
>> I deleted LIR_Assembler::poll_for_safepoint because it's dead code.
>>
>> I set ThreadLocalHandshakes to true.  This follows x86-64 and is
>> probably the right thing to do, but it does have the disadvantage that
>> it considerably extends the time to safepoint for interpreted code
>> because we only poll at branches.  I think this is probably a bad
>> idea, but again it's the same as x86-64.  OTOH, checking the safepoint
>> bit at every bytecode would seriously bloat the interpreter.  Oh dear.
>>
> http://cr.openjdk.java.net/~aph/8189596-1/
>


Re: RFR: 8189596: AArch64: implementation for Thread-local handshakes

Andrew Dinn
On 24/11/17 10:36, Erik Österlund wrote:
> By placing loading the local poll value with ldar *only* in the native
> wrapper wakeup function, you avoid these issues.
> Another approach you may elect to use is to only in the native wrapper
> load both the thread-local poll value and the
> SafepointSynchronize::_state, when filtering slow paths to avoid this
> unfortunate race.

I can see why an ldar (actually ldarw) is needed when safepoint_poll is
called from the native wrapper. Can you explain why ldar is not needed
for *all* calls to safepoint_poll?

> I have a bunch of other ideas as well exploiting dependent loads, but
> thought maybe I should start the conversation and see what you think
> before running off too much

regards,


Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander

Re: RFR: 8189596: AArch64: implementation for Thread-local handshakes

Erik Österlund-2
Hi Andrew,

On 2017-11-24 13:07, Andrew Dinn wrote:

> On 24/11/17 10:36, Erik Österlund wrote:
>> By placing loading the local poll value with ldar *only* in the native
>> wrapper wakeup function, you avoid these issues.
>> Another approach you may elect to use is to only in the native wrapper
>> load both the thread-local poll value and the
>> SafepointSynchronize::_state, when filtering slow paths to avoid this
>> unfortunate race.
> I can see why an ldar (actually ldarw) is needed when safepoint_poll is
> called from the nativewrapper. Can you explain why ldar is not needed
> for *all* calls to safepoint_poll?

That is a long story. :) But since you asked, here we go...

First we need to recognize that the following pre-existing polls (each
used in a different context):

1) Poll in JIT-compiled code (poll against the global page)
2) Poll in the VM for thread transitions
3) Poll in the interpreter (safepoint table)
4) Poll in the native wrapper

...have been replaced with polls reading the thread-local state. But
they have different races they are concerned with. Previously, the
global _state, the safepoint table, and polling page, have all been
manipulated in different places in SafepointSynchronize::begin/end,
causing interesting interactions now that it is replaced with one poll.

When we synchronize a safepoint, we have a fast-path poll on the local
poll, but if it is indeed armed, we *have* to use acquire before loading
the global state to check if we need to block or not. If the local poll
is disarmed on the other hand, then there is no harm with races with
synchronization of a safepoint. The lack of acquire can cause a stale
load of the global state. But the possible stale values when racing with
the synchronization is that you observe _not_synchronized instead of
_synchronizing, which is already racy and fine. So the race with
synchronization comes with the constraint that if the local poll is
armed, we have to acquire before reading the global state.

When we unsynchronize the safepoint, on the other hand, we are in a much
more dangerous situation. Unsynchronization races with transitions from
dormant states (blocked and native) to active states (Java and VM).
During the dormant to active transitions, reading a stale value of the
_state from SafepointSynchronize::end() is a disaster, because now a
stale load may observe _synchronized instead of !_synchronized values.
This must be prevented (to not get false positives on the query
is_at_safepoint()) by either a) making the load of the local poll use
acquire unconditionally (making sure subsequent loads of the _state are
not stale), or check both the local poll and global _state to make sure
you simply do not wake up when the stale state is observed to be
_synchronized.

So looking back at the scenarios we have replaced in order:

1) The JIT-compiled code races with SafepointSynchronize::begin(), not
end(), and hence only has the requirement that acquire() is
conditionally used before reading an armed poll value and checking if
the global _state is _synchronizing.
2) In the VM, the conservative standpoint is taken - the load uses
load_acquire unconditionally. Because it might be used for waking up
from dormant or transitioning between active states.
3) The interpreter scenario is similar to the JIT-compiled code in that
it races with SafepointSynchronize::begin(), and not end(). So similar
reasoning applies.
4) The poll in the native wrapper wakes up from a dormant state, and
hence races with SafepointSynchronize::end(). Therefore it comes with
the additional requirement of either using a load_acquire, or polling
both the local and global state to make sure stale loads do not cause us
to observe we are still in a safepoint when we are not.

The check for blocking in the VM currently uses acquire unconditionally,
but if you find that too expensive, it really only needs to
conditionally use acquire if the local poll is armed before reading the
global state, and unconditionally when waking up from a dormant state
(racing with SafepointSynchronize::end()). Those are the true restrictions.
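
The two poll flavours described above can be sketched with C++11 atomics
(illustrative names, not HotSpot's real API):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uintptr_t> poll_word{0};  // models the per-thread poll word
std::atomic<int> sp_state{0};         // 0 models _not_synchronized

// Poll on an active thread: races with begin(). Acquire is only needed
// on the armed path, before reading the global state. A stale read of
// sp_state while the poll is disarmed merely shows _not_synchronized
// instead of _synchronizing, which is already racy and fine.
bool poll_active_thread() {
  if (poll_word.load(std::memory_order_relaxed) == 0)
    return false;
  std::atomic_thread_fence(std::memory_order_acquire);
  return sp_state.load(std::memory_order_relaxed) != 0;
}

// Poll when waking from a dormant state: races with end(), so the load
// must be acquire unconditionally (ldar) -- or the global state must be
// checked as well -- otherwise a stale sp_state read can spuriously
// report that we are still at a safepoint.
bool poll_waking_thread() {
  return poll_word.load(std::memory_order_acquire) != 0;
}
```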

I hope this sheds some light on the important races you need to be aware of.

Thanks,
/Erik

>> I have a bunch of other ideas as well exploiting dependent loads, but
>> thought maybe I should start the conversation and see what you think
>> before running off too much
> regards,
>
>
> Andrew Dinn
> -----------
> Senior Principal Software Engineer
> Red Hat UK Ltd
> Registered in England and Wales under Company Registration No. 03798903
> Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander


Re: [aarch64-port-dev ] RFR: 8189596: AArch64: implementation for Thread-local handshakes

Andrew Dinn
In reply to this post by Andrew Dinn
On 23/11/17 16:15, Andrew Dinn wrote:
>> http://cr.openjdk.java.net/~aph/8189596-1/
> I have reviewed this by eyeball and it all looks good. I still need to
> build and run to be 100% sure.
Well, it builds and runs stuff ok (e.g. netbeans) despite Erik's race
(I'll address that separately in reply to Erik).

One minor peccadillo I did notice:

in src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp

+void MacroAssembler::safepoint_poll(Label& slow_path) {
+  if (SafepointMechanism::uses_thread_local_poll()) {
+    ldr(rscratch1, Address(rthread, Thread::polling_page_offset()));
+    tbnz(rscratch1, exact_log2(SafepointMechanism::poll_bit()), slow_path);
+  } else {
+    ExternalAddress state(SafepointSynchronize::address_of_state());
+    unsigned long offset;
+    adrp(rscratch1, ExternalAddress(SafepointSynchronize::address_of_state()), offset);
+    ldrw(rscratch1, Address(rscratch1, offset));
+    assert(SafepointSynchronize::_not_synchronized == 0, "rewrite this code");
+    cbnz(rscratch1, slow_path);
+  }
+}

In the else branch you declare ExternalAddress state then don't use it
in the adrp call.

regards,


Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander

Re: RFR: 8189596: AArch64: implementation for Thread-local handshakes

Andrew Dinn
In reply to this post by Erik Österlund-2
On 24/11/17 13:36, Erik Österlund wrote:

> On 2017-11-24 13:07, Andrew Dinn wrote:
>> On 24/11/17 10:36, Erik Österlund wrote:
>>> By placing loading the local poll value with ldar *only* in the native
>>> wrapper wakeup function, you avoid these issues.
>>> Another approach you may elect to use is to only in the native wrapper
>>> load both the thread-local poll value and the
>>> SafepointSynchronize::_state, when filtering slow paths to avoid this
>>> unfortunate race.
>> I can see why an ldar (actually ldarw) is needed when safepoint_poll is
>> called from the nativewrapper. Can you explain why ldar is not needed
>> for *all* calls to safepoint_poll?
>
> That is a long story. :) But since you asked, here we go...
> . . .
> I hope this sheds some light on the important races you need to be aware
> of.

Well, that's a good story and maybe needs to be included somewhere in
the code even if only in precis. The asymmetry between start and end
makes clear why you want an ldar in the native wrapper.

The one detail I am still not sure of is how this design ensures that
the benign races in JIT/interpreter ever come to an end. What guarantees
that JITted code or the interpreter actually performs an acquiring load
and thereby notices a change to the armed poll value? That might take a
very long time (if it happens at all?).

regards,


Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander

Re: RFR: 8189596: AArch64: implementation for Thread-local handshakes

Erik Österlund-2
Hi Andrew,

On 2017-11-24 15:39, Andrew Dinn wrote:

> On 24/11/17 13:36, Erik Österlund wrote:
>> On 2017-11-24 13:07, Andrew Dinn wrote:
>>> On 24/11/17 10:36, Erik Österlund wrote:
>>>> By placing loading the local poll value with ldar *only* in the native
>>>> wrapper wakeup function, you avoid these issues.
>>>> Another approach you may elect to use is to only in the native wrapper
>>>> load both the thread-local poll value and the
>>>> SafepointSynchronize::_state, when filtering slow paths to avoid this
>>>> unfortunate race.
>>> I can see why an ldar (actually ldarw) is needed when safepoint_poll is
>>> called from the nativewrapper. Can you explain why ldar is not needed
>>> for *all* calls to safepoint_poll?
>> That is a long story. :) But since you asked, here we go...
>> . . .
>> I hope this sheds some light on the important races you need to be aware
>> of.
> Well, that's a good story and maybe needs to be included somewhere in
> the code even if only in precis. The asymmetry between start and end
> makes clear why you want an ldar in the native wrapper.

Thank you. And yes, you are right - this should probably be documented.

> The one detail I am still not sure of is how this design ensures that
> the benign races in JIT/interpreter ever come to an end. What guarantees
> that JITted code or the interpreter actually performs an acquiring load
> and thereby notices a change to the armed poll value? That might take a
> very long time (if it happens at all?).

The JMM defines opaque stores as stores that among other properties
guarantee progress: the store will eventually become observable to other
threads. The stores that arm the thread-local polls use a
release_store(), which is strictly stronger than opaque on each platform
(and hence have the guarantee that they will "eventually" become
observable to opaque loads). Therefore, the fast-path in JIT-compiled
code and interpreter code are guaranteed to "eventually" observe those
stores with opaque loads (and ldr is an opaque load). When they do, they
will jump into the VM (directly or through a trampoline), and perform
the acquiring load on the thread-local poll, used before loading the
global state.

As for how long time it takes for the stores to become eventually
observable by remote loads, I imagine it should be no worse than the
delay of the TLB shootdown event. And just for completeness for this
discussion: the arming of the local polls is followed by a fence() which
performs a dmb ish in your GCC intrinsics.
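
The progress argument can be sketched with C++11 atomics: a relaxed
(ldr-style) polling loop eventually observes a release store made by
another thread, with no barrier needed on the polling side. Names are
illustrative; termination in practice follows from stores eventually
draining to coherent memory, as described above:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> poll{0};

long spin_until_armed() {
  std::thread armer([] {
    poll.store(1, std::memory_order_release);  // arming store
  });
  long spins = 0;
  while (poll.load(std::memory_order_relaxed) == 0)  // fast-path poll
    ++spins;
  armer.join();
  return spins;  // finite in practice: the store becomes visible
}
```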

Thanks,
/Erik

> regards,
>
>
> Andrew Dinn
> -----------
> Senior Principal Software Engineer
> Red Hat UK Ltd
> Registered in England and Wales under Company Registration No. 03798903
> Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander


Re: [aarch64-port-dev ] RFR: 8189596: AArch64: implementation for Thread-local handshakes

Andrew Haley
In reply to this post by Andrew Dinn
On 24/11/17 12:07, Andrew Dinn wrote:
> I can see why an ldar (actually ldarw) is needed when safepoint_poll is
> called from the nativewrapper. Can you explain why ldar is not needed
> for *all* calls to safepoint_poll?

Because there's no ordering to worry about: the worst that will happen
is that a safepoint will be acted on after, say, a memory location is
updated.  But because in general we don't care about that ordering, it
doesn't matter.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: [aarch64-port-dev ] RFR: 8189596: AArch64: implementation for Thread-local handshakes

Andrew Haley
In reply to this post by Andrew Dinn
On 24/11/17 14:32, Andrew Dinn wrote:
> In the else branch you declare ExternalAddress state then don't use it
> in the adrp call.

Oh yes.  Oops, will fix.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: RFR: 8189596: AArch64: implementation for Thread-local handshakes

Andrew Dinn
In reply to this post by Erik Österlund-2
On 24/11/17 15:10, Erik Österlund wrote:

> On 2017-11-24 15:39, Andrew Dinn wrote:
>> The one detail I am still not sure of is how this design ensures that
>> the benign races in JIT/interpreter ever come to an end. What guarantees
>> that JITted code or the interpreter actually performs an acquiring load
>> and thereby notices a change to the armed poll value? That might take a
>> very long time (if it happens at all?).
>
> The JMM defines opaque stores as stores that among other properties
> guarantee progress: the store will eventually become observable to other
> threads. The stores that arm the thread-local polls use a
> release_store(), which is strictly stronger than opaque on each platform
> (and hence have the guarantee that they will "eventually" become
> observable to opaque loads). Therefore, the fast-path in JIT-compiled
> code and interpreter code are guaranteed to "eventually" observe those
> stores with opaque loads (and ldr is an opaque load). When they do, they
> will jump into the VM (directly or through a trampoline), and perform
> the acquiring load on the thread-local poll, used before loading the
> global state.
>
> As for how long time it takes for the stores to become eventually
> observable by remote loads, I imagine it should be no worse than the
> delay of the TLB shootdown event. And just for completeness for this
> discussion: the arming of the local polls is followed by a fence() which
> performs a dmb ish in your GCC intrinsics.
Well, that's really what I was thinking about with the original
question. I understand that writes will become visible "eventually". I'm
just concerned to know what sort of upper bound is implied by that word
(theoretical might be interesting but I'm more interested in practical).
This is the mechanism intended to halt individual threads or all
threads. If there is any possibility of some extended delay --
especially in the latter case -- then that would merit quantifying.

I assume "no worse than the delay of the TLB shootdown event" means "TLB
shootdown relies on the same sort of benign race"? If so that doesn't
help quantify anything except to say that prior experience has not shown
any significant delay (so far as we know).

regards,


Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander

Re: RFR: 8189596: AArch64: implementation for Thread-local handshakes

Erik Österlund-2
Hi Andrew,

On 2017-11-24 16:55, Andrew Dinn wrote:

> On 24/11/17 15:10, Erik Österlund wrote:
>> On 2017-11-24 15:39, Andrew Dinn wrote:
>>> The one detail I am still not sure of is how this design ensures that
>>> the benign races in JIT/interpreter ever come to an end. What guarantees
>>> that JITted code or the interpreter actually performs an acquiring load
>>> and thereby notices a change to the armed poll value? That might take a
>>> very long time (if it happens at all?).
>> The JMM defines opaque stores as stores that among other properties
>> guarantee progress: the store will eventually become observable to other
>> threads. The stores that arm the thread-local polls use a
>> release_store(), which is strictly stronger than opaque on each platform
>> (and hence have the guarantee that they will "eventually" become
>> observable to opaque loads). Therefore, the fast-path in JIT-compiled
>> code and interpreter code are guaranteed to "eventually" observe those
>> stores with opaque loads (and ldr is an opaque load). When they do, they
>> will jump into the VM (directly or through a trampoline), and perform
>> the acquiring load on the thread-local poll, used before loading the
>> global state.
>>
>> As for how long time it takes for the stores to become eventually
>> observable by remote loads, I imagine it should be no worse than the
>> delay of the TLB shootdown event. And just for completeness for this
>> discussion: the arming of the local polls is followed by a fence() which
>> performs a dmb ish in your GCC intrinsics.
> Well, that's really what I was thinking about with the original
> question. I understand that writes will become visible "eventually". I'm
> just concerned to know what sort of upper bound is implied by that word
> (theoretical might be interesting but I'm more interested in practical).
> This is the mechanism intended to halt individual threads or all
> threads. If there is any possibility of some extended delay --
> especially in the latter case -- then that would merit quantifying.
>
> I assume "no worse than the delay of the TLB shootdown event" means "TLB
> shootdown relies on the same sort of benign race"? If so that doesn't
> help quantify anything except to say that prior experience has not shown
> any significant delay (so far as we know).

Naturally the actual delay of a single store to make it through a store
buffer and then the cache coherency protocol to eventually show up in a
load on another CPU depends on a bunch of stuff that is bound to look
different on every machine. And I can't say I know what typical delays
look like on typical AArch64 machines. But I would be *very* surprised
if it would get anywhere close to millisecond levels, which is typically
when it starts being noticeable on the radar at all to a safepointing
operation (compared to the time of doing the crazy things we do in
safepoints). A gut feeling is that it should be comparable to the time
for performing a dmb sy (which by contract makes preceding stores
globally observable in the whole system).

Based on what Andrew had observed, it seems like the delays were
dominated by the sparse polling of the interpreter, suggesting to me
that it does not seem like a large issue to propagate the state change
in practice.

Thanks,
/Erik

> regards,
>
>
> Andrew Dinn
> -----------
> Senior Principal Software Engineer
> Red Hat UK Ltd
> Registered in England and Wales under Company Registration No. 03798903
> Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander


Re: RFR: 8189596: AArch64: implementation for Thread-local handshakes

Andrew Haley
On 24/11/17 16:19, Erik Österlund wrote:

> But I would be *very* surprised if it would get anywhere close to
> millisecond levels, which is typically when it starts being
> noticeable on the radar at all to a safepointing operation (compared
> to the time of doing the crazy things we do in safepoints).

It's very unlikely but not impossible.

> A gut feeling is that it should be comparable to the time for
> performing a dmb sy (which by contract makes preceding stores
> globally observable in the whole system).

It's not really comparable.  It can be microseconds before a store
from one core is picked up by another; I have measured times as long
as that.  On the other hand, a full barrier can be a purely local
operation: all it has to do is flush the store buffer to the local
cache and mark the entries in the invalidate queue.  That takes a
matter of nanoseconds and the core can proceed, but of course, it does
slow down subsequent load and store operations.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: RFR: 8189596: AArch64: implementation for Thread-local handshakes

Andrew Haley
In reply to this post by Erik Österlund-2
On 24/11/17 10:36, Erik Österlund wrote:
> By placing loading the local poll value with ldar *only* in the native
> wrapper wakeup function, you avoid these issues.

OK, thanks.

http://cr.openjdk.java.net/~aph/8189596-2

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: RFR: 8189596: AArch64: implementation for Thread-local handshakes

Erik Österlund-2
Hi Andrew,

> On 24 Nov 2017, at 18:39, Andrew Haley <[hidden email]> wrote:
>
>> On 24/11/17 10:36, Erik Österlund wrote:
>> By placing loading the local poll value with ldar *only* in the native
>> wrapper wakeup function, you avoid these issues.
>
> OK, thanks.

You are welcome. :)

> http://cr.openjdk.java.net/~aph/8189596-2

Looks a lot better now. Thanks for doing this.

/Erik

> --
> Andrew Haley
> Java Platform Lead Engineer
> Red Hat UK Ltd. <https://www.redhat.com>
> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: [aarch64-port-dev ] RFR: 8189596: AArch64: implementation for Thread-local handshakes

Andrew Dinn
In reply to this post by Andrew Haley
On 24/11/17 17:39, Andrew Haley wrote:
> On 24/11/17 10:36, Erik Österlund wrote:
>> By placing loading the local poll value with ldar *only* in the native
>> wrapper wakeup function, you avoid these issues.
>
> OK, thanks.
>
> http://cr.openjdk.java.net/~aph/8189596-2
Yes, looks good.

Nice catch on the two dmb instructions :-)

regards,


Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander

Re: RFR: 8189596: AArch64: implementation for Thread-local handshakes

Dmitry Samersoff-3
In reply to this post by Andrew Haley
Andrew,

> http://cr.openjdk.java.net/~aph/8189596-2

1. interp_masm_aarch64.cpp

 456   if (needs_thread_local_poll) {

It might be better to put the code related to needs_thread_local_poll
into the single block for better readability.

Approx :

if (needs_thread_local_poll) {
   NOT_PRODUCT(block_comment("Thread-local Safepoint poll"));
   ldr(rscratch2, Address(rthread, Thread::polling_page_offset()));
   tbz(rscratch2, exact_log2(SafepointMechanism::poll_bit()), notsafepoint);

   lea(rscratch2, ExternalAddress((address)safepoint_table));
   ldr(rscratch2, Address(rscratch2, rscratch1, Address::uxtw(3)));
   br(rscratch2);

}

bind(notsafepoint);

2. macroAssembler_aarch64.cpp

321     safepoint_poll(slow_path);

It might be better to put global poll code to a separate function, to
avoid double checking of SafepointMechanism::uses_thread_local_poll()

3. templateInterpreterGenerator_aarch64.cpp : 1382

1382     __ safepoint_poll_acquire(L);

Do we really need acquire here?

-Dmitry


On 24.11.2017 20:39, Andrew Haley wrote:
> On 24/11/17 10:36, Erik Österlund wrote:
>> By placing loading the local poll value with ldar *only* in the native
>> wrapper wakeup function, you avoid these issues.
>
> OK, thanks.
>
> http://cr.openjdk.java.net/~aph/8189596-2
>



Re: RFR: 8189596: AArch64: implementation for Thread-local handshakes

Andrew Haley
On 27/11/17 15:30, Dmitry Samersoff wrote:

> Andrew,
>
>> http://cr.openjdk.java.net/~aph/8189596-2
>
> 1. interp_masm_aarch64.cpp
>
>  456   if (needs_thread_local_poll) {
>
> It might be better to put the code related to needs_thread_local_poll
> into the single block for better readability.
>
> Approx :
>
> if (needs_thread_local_poll) {
>    NOT_PRODUCT(block_comment("Thread-local Safepoint poll"));
>    ldr(rscratch2, Address(rthread, Thread::polling_page_offset()));
>    tbz(rscratch2, exact_log2(SafepointMechanism::poll_bit()), notsafepoint);
>
>    lea(rscratch2, ExternalAddress((address)safepoint_table));
>    ldr(rscratch2, Address(rscratch2, rscratch1, Address::uxtw(3)));
>    br(rscratch2);
>
> }
>
> bind(notsafepoint);

I didn't want to do that because I wanted to put the uncommon case out
of line.  In any case, I don't think it warrants a respin.

> 2. macroAssembler_aarch64.cpp
>
> 321     safepoint_poll(slow_path);
>
> It might be better to put global poll code to a separate function, to
> avoid double checking of SafepointMechanism::uses_thread_local_poll();

We don't need to do it for efficiency because GCC already knows how to
do that.  I can't think of any other reason we'd want to change it.

> 3. templateInterpreterGenerator_aarch64.cpp : 1382
>
> 1382     __ safepoint_poll_acquire(L);
>
> Do we really need acquire here?

I believe so, for the reason discussed upthread.  We need to do it
whenever transitioning from native to Java state.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: RFR: 8189596: AArch64: implementation for Thread-local handshakes

Dmitry Samersoff-3
Andrew,

Thank you for the explanation.

-Dmitry


On 27.11.2017 18:47, Andrew Haley wrote:

> On 27/11/17 15:30, Dmitry Samersoff wrote:
>> Andrew,
>>
>>> http://cr.openjdk.java.net/~aph/8189596-2
>>
>> 1. interp_masm_aarch64.cpp
>>
>>  456   if (needs_thread_local_poll) {
>>
>> It might be better to put the code related to needs_thread_local_poll
>> into the single block for better readability.
>>
>> Approx :
>>
>> if (needs_thread_local_poll) {
>>    NOT_PRODUCT(block_comment("Thread-local Safepoint poll"));
>>    ldr(rscratch2, Address(rthread, Thread::polling_page_offset()));
>>    tbz(rscratch2, exact_log2(SafepointMechanism::poll_bit()), notsafepoint);
>>
>>    lea(rscratch2, ExternalAddress((address)safepoint_table));
>>    ldr(rscratch2, Address(rscratch2, rscratch1, Address::uxtw(3)));
>>    br(rscratch2);
>>
>> }
>>
>> bind(notsafepoint);
>
> I didn't want to do that because I wanted to put the uncommon case out
> of line.  In any case, I don't think it warrants a respin.
>
>> 2. macroAssembler_aarch64.cpp
>>
>> 321     safepoint_poll(slow_path);
>>
>> It might be better to put global poll code to a separate function, to
>> avoid double checking of SafepointMechanism::uses_thread_local_poll();
>
> We don't need to do it for efficiency because GCC already knows how to
> do that.  I can't think of any other reason we'd want to change it.
>
>> 3. templateInterpreterGenerator_aarch64.cpp : 1382
>>
>> 1382     __ safepoint_poll_acquire(L);
>>
>> Do we really need acquire here?
>
> I believe so,, for the reason discussed upthread.  We need to do it
> whenever transitioning from native to Java state.
>