Work-in-progress: 8236485: Epoch synchronization protocol for G1 concurrent refinement

Work-in-progress: 8236485: Epoch synchronization protocol for G1 concurrent refinement

Man Cao
Hi all,

I have written up a description of, and the open challenges for, implementing
an epoch synchronization protocol. This protocol is necessary for removing the
StoreLoad fence from G1's post-write barrier (JDK-8226731).

Description: https://bugs.openjdk.java.net/browse/JDK-8236485
Work-in-progress webrev:
https://cr.openjdk.java.net/~manc/8236485/webrev_wip0/

There are two main challenges that I'm not sure how to resolve:
- Triggering a thread-local handshake is a blocking operation that can pass
a safepoint.
- There are native post-write barriers executed by threads in native/VM
state.

Discussions and suggestions are highly appreciated!

-Man

Re: Work-in-progress: 8236485: Epoch synchronization protocol for G1 concurrent refinement

Man Cao
We had an offline discussion on this. To keep the community in the loop,
here is what we discussed.

a. Using the Linux membarrier syscall, or an equivalent on other OSes, seems a
cleaner solution than a thread-local handshake (TLH). But we need a fallback
mechanism for OSes and older Linux kernels that do not provide such a syscall
(see the sketch after this list).
b. For the blocking property of TLH,
https://bugs.openjdk.java.net/browse/JDK-8230594 may help solve the problem
once it is implemented.
c. TLH could be issued to a subset of all threads, e.g. only to threads that
have not yet reached the global epoch. This could save a lot of time for
the handshake.
d. Compiler threads are Java threads, but they are mostly not in Java state.
They could be a source of problems for the epoch synchronization protocol.
e. The filter in G1EpochSynchronizer::check_and_update_frontier() may be
incorrect, because it racily reads a remote thread's state and thus may not
observe all threads that are in Java state.
f. Implementing asynchronous processing of the dirty card buffers could
avoid a lot of TLH requests, so the speed of TLH may not be a huge concern.
g. It may be OK to slow down the native post-write barrier a bit with more
frequent execution of the StoreLoad fence. We could do some benchmarking to
test this. A more debatable issue is whether we should make the native
post-write barrier different from the post-write barrier in Java code, so
that only the native barrier has the StoreLoad fence.
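
For point (a), here is a minimal sketch of what a membarrier-based path could
look like on Linux. The helper name and fallback policy are illustrative, not
actual prototype code; MEMBARRIER_CMD_GLOBAL is the renamed
MEMBARRIER_CMD_SHARED and requires no prior registration:

  #include <linux/membarrier.h>   // MEMBARRIER_CMD_* (Linux >= 4.3)
  #include <sys/syscall.h>        // __NR_membarrier
  #include <unistd.h>             // syscall()

  // Hypothetical helper: make every running thread execute a full memory
  // barrier. Returns false if the kernel does not support membarrier, in
  // which case the caller must fall back to TLH or per-thread signals.
  static bool issue_system_wide_barrier() {
    long ret = syscall(__NR_membarrier, MEMBARRIER_CMD_GLOBAL, 0);
    if (ret == 0) {
      return true;   // all running threads have now executed a full fence
    }
    // ENOSYS: kernel < 4.3; EINVAL: command unsupported by this kernel.
    return false;
  }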

I will further work on these issues.

-Man

Re: Work-in-progress: 8236485: Epoch synchronization protocol for G1 concurrent refinement

Florian Weimer
* Man Cao:

> We had an offline discussion on this. To keep the community in the loop,
> here is what we discussed.
>
> a. Using Linux membarrier syscall or equivalent on other OSes seems a
> cleaner solution than thread-local handshake (TLH). But we need to have a
> backup mechanism for OSes and older Linuxes that do not have such a
> syscall.

Can you make do with a membarrier call that doesn't require registration?

The usual fallback for membarrier is sending a special signal to all
threads, and making sure that they have run code in a signal handler
(possibly using a CPU barrier there).  But of course this is rather
slow.
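
For reference, a minimal sketch of such a signal-based fallback could look
like this (the signal number, bookkeeping, and busy-wait are illustrative, and
it assumes barrier_handler has been installed for SIGUSR2 via sigaction):

  #include <atomic>
  #include <csignal>
  #include <pthread.h>
  #include <sched.h>

  static std::atomic<int> g_acks{0};

  // Handler: the receiving thread executes a full fence here, which is the
  // property the initiator relies on.
  static void barrier_handler(int) {
    std::atomic_thread_fence(std::memory_order_seq_cst);
    g_acks.fetch_add(1, std::memory_order_release);
  }

  // Initiator: signal every known thread and wait until each one has run the
  // handler. 'threads'/'nthreads' stand in for the VM's thread list.
  static void signal_based_membarrier(pthread_t* threads, int nthreads) {
    g_acks.store(0, std::memory_order_relaxed);
    for (int i = 0; i < nthreads; i++) {
      pthread_kill(threads[i], SIGUSR2);
    }
    while (g_acks.load(std::memory_order_acquire) < nthreads) {
      sched_yield();   // crude wait; a real implementation would do better
    }
  }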

membarrier has seen some backporting activity, but as far as I can see,
that hasn't been consistent across architectures.

Thanks,
Florian

Re: Work-in-progress: 8236485: Epoch synchronization protocol for G1 concurrent refinement

Man Cao
Hi all,

I finally managed to allocate more time to make progress on this, and
resolved most issues since the last discussion.
I've updated the description in
https://bugs.openjdk.java.net/browse/JDK-8236485, and the current prototype
is the HEAD commit at https://github.com/caoman/jdk/tree/g1EpochSync.
Notable changes include:
- The protocol uses the async handshake from JDK-8238761
<https://bugs.openjdk.java.net/browse/JDK-8238761> to resolve the blocking
issue of a normal handshake.
- To support asynchronous refinement resulting from the async handshake, I
added support for a _deferred global queue to G1DirtyCardQueueSet. Buffers
rarely get enqueued to _deferred at run-time (see the sketch after this list).
- The async handshake only executes for a subset of threads.
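
A rough sketch of the _deferred idea (all names below are illustrative
stand-ins, not the prototype's actual G1DirtyCardQueueSet code): a buffer is
refined immediately if the epoch synchronization it depends on has completed,
and parked on a deferred list otherwise.

  #include <atomic>
  #include <cstdint>
  #include <deque>
  #include <mutex>

  // Illustrative stand-ins; the real code lives in G1DirtyCardQueueSet.
  struct Buffer { uint64_t required_epoch; /* cards ... */ };

  static std::atomic<uint64_t> g_synchronized_epoch{0};
  static std::deque<Buffer*>   g_deferred;
  static std::mutex            g_deferred_lock;

  void refine_buffer(Buffer* buf);   // processes the cards in 'buf'

  // Refine immediately in the common case; defer in the rare case where the
  // async handshake for this buffer's epoch has not finished yet.
  void process_or_defer(Buffer* buf) {
    if (g_synchronized_epoch.load(std::memory_order_acquire) >= buf->required_epoch) {
      refine_buffer(buf);
    } else {
      std::lock_guard<std::mutex> guard(g_deferred_lock);
      g_deferred.push_back(buf);   // retried later by a refinement thread
    }
  }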

I have a couple of questions:

1. Code review and patch size.
Should I start a pull request for this change, so it is easier to give
feedback?

What is the recommended approach to dealing with large changes? Currently the
patch is about 1200 lines, without changing the write barrier itself
(JDK-8226731).
Since pushing only this patch would add some overhead to refinement without
bringing any performance improvement from removing the write barrier's fence,
do you recommend that I also implement the write barrier change in the same
patch?
Keeping the epoch sync patch separate from the write barrier patch has some
benefit for testing, in case the epoch patch introduces any bugs.
Currently the new code is mostly guarded by a flag,
-XX:+G1TestEpochSyncInConcRefinement, which will be removed after the
write barrier change. It could be used as an emergency flag to work around
bugs, instead of backing out of the entire change. We probably cannot have
such a flag if we bundle the changes in one patch (it would be too ugly to
have such a flag in the interpreter and compilers).

2. Checking if a remote thread is in _thread_in_Java state.
eosterlund@ pointed out it was incorrect to do JavaThread::thread_state()
== _thread_in_Java.
I looked into thread state transitions, and revised the check to also compare
with _thread_in_native_trans and _thread_in_vm_trans.
I think it is correct now for the purpose of epoch synchronization, i.e., it
never "misses" a remote thread that is actually in the _thread_in_Java state.
A detailed comment is here:
https://github.com/caoman/jdk/blob/2047ecefceb074e80d73e0d521d64a220fdc5779/src/hotspot/share/gc/g1/g1EpochSynchronizer.cpp#L67-L90
*Erik, could you take a look and decide if it is correct? If it is still
incorrect, could you advise a proper way to do this?*
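
For reference, the revised filter boils down to something like the following
(a sketch; the helper name is illustrative, while the state names are
HotSpot's JavaThreadState values):

  // A remote thread is skipped only when its racily-read state proves it
  // cannot be executing (or about to execute) the fast-path barrier; threads
  // observed in a transitional state are conservatively included.
  static bool possibly_in_java(JavaThread* t) {
    JavaThreadState s = t->thread_state();
    return s == _thread_in_Java ||
           s == _thread_in_native_trans ||
           s == _thread_in_vm_trans;
  }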

3. Native write barrier (G1BarrierSet::write_ref_field_post).
The epoch synchronization protocol does not synchronize with threads in
_thread_in_native or _thread_in_vm state. It would be much slower if we
synchronized with such threads.
Moreover, there are non-Java threads (e.g. concurrent mark workers) that
could execute the native write barrier.
As a result, it's probably best to keep the StoreLoad fence in the native
write barrier.
The final post-write barrier for JDK-8226731 would be:
Given:
x.a = q
and
p = @x.a

For the Interpreter/C1/C2 barrier:
  if (p and q are in the same region, or q == NULL) -> exit
  if (card(p) == Dirty) -> exit
  card(p) = Dirty
  enqueue(card(p))

For the native barrier:
  if (p and q are in the same region, or q == NULL) -> exit
  StoreLoad
  if (card(p) == Dirty) -> exit
  card(p) = Dirty
  enqueue(card(p))
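
In C++-ish form, the native variant would be roughly as follows (a sketch;
same_region(), card_for(), enqueue() and storeload_fence() are illustrative
stand-ins for the real G1/HotSpot helpers):

  // Illustrative declarations; in HotSpot these map onto the G1 card table
  // helpers and OrderAccess::fence().
  bool same_region(void* p, void* q);
  volatile unsigned char* card_for(void* p);
  void enqueue(volatile unsigned char* card);
  void storeload_fence();
  const unsigned char Dirty = 0;

  void native_post_write_barrier(void* p, void* q) {
    if (q == nullptr || same_region(p, q)) return;  // same filter as the Java barrier
    storeload_fence();                              // StoreLoad kept only in the native barrier
    volatile unsigned char* card = card_for(p);
    if (*card == Dirty) return;                     // card already dirty
    *card = Dirty;
    enqueue(card);                                  // hand the card to concurrent refinement
  }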

Does the above look good? Do we need to micro-benchmark the potential
overhead added to the native barrier? Or are typical macro-benchmarks
sufficient?

4. Performance benchmarking.
I did some preliminary benchmarking with DaCapo and BigRamTester, without
changing the write barrier. This is to measure the overhead added by the
epoch synchronization protocol.
Under the default JVM settings, I didn't see any performance difference.
When I tuned the JVM to refine cards aggressively, there was
still no difference for BigRamTester (probably because it only has 2
threads). Some DaCapo benchmarks saw 10-15% more CPU usage due to doing
more work in the refinement threads, and a 2-5% total throughput regression
for these benchmarks.
The aggressive refinement flags are "-XX:-G1UseAdaptiveConcRefinement
-XX:G1UpdateBufferSize=4 -XX:G1ConcRefinementGreenZone=0
-XX:G1ConcRefinementYellowZone=1".

I wonder how seriously we should treat the aggressive refinement case.
Regression in this case is likely unavoidable, so how much regression is
tolerable?
Also, does anyone know a better benchmark to test refinement with default
JVM settings? Ideally it (1) has many mutator threads; (2) triggers
concurrent refinement frequently; (3) runs with a sizable -Xmx (e.g., 2 GiB
or above).

-Man

Re: Work-in-progress: 8236485: Epoch synchronization protocol for G1 concurrent refinement

Man Cao
Has anyone had a chance to take a look?

I also implemented the fence removal in the interpreter and C1/C2. The patch
is quite small, so perhaps it is better to merge them into one pull request:
https://github.com/caoman/jdk/tree/8226731fenceRemoval

-Man