RFR: Parallelize safepoint cleanup

classic Classic list List threaded Threaded
55 messages Options
123
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RFR: Parallelize safepoint cleanup

Roman Kennke-6
Some operations in safepoint cleanup have been observed to (sometimes)
take significant time. Most notably, idle monitor deflation and nmethod
marking stick out in some popular applications and benchmarks.

I propose to:
- parallelize safepoint cleanup processing
- enable to hook up idle monitor deflation and nmethod marking to GC VM
ops, if GC can support it (resulting in even more efficient
deflation/nmethod marking)

In some of my measurements this resulted in much improved pause times.
For example, in one popular benchmark on a server-class machine, I got
total average pause time down from ~80ms to ~30ms. In none of my
measurements has this resulted in decreased performance (although it may
be possible to construct something. For example, it may not be worth to
spin up worker threads if there's no work to do.)

Some implementation notes:

I introduced a dedicated worker thread pool in SafepointSynchronize.
This is only initialized when -XX:+ParallelSafepointCleanup is enabled,
and uses -XX:ParallelSafepointCleanupThreads=X threads, defaulting to 8
threads (just a wild guess, open for discussion. With
-XX:-ParallelSafepointCleanup turned off (the default) it will use the
old serial safepoint cleanup processing (with optional GC hooks, see below).

Parallel processing first lets all worker threads scan threads and
thereby deflate idle monitors and mark nmethods (in one pass). The rest
of the cleanup work is divided into claimed chunks by using SubTasksDone
(like, e.g., in G1RootProcessor).

Notice that I tried a bunch of other alternatives:

- First I tried to let Java threads deflate their own monitors on
safepoint arrival. This did not work out, because deflation (currently)
depends on all Java threads having arrived. Adding another sync point
there would have defeated the purpose.

- Then I tried to always use workers of the current GC. This did not
work either, because the GC may be using them, for example if a cleanup
safepoint is happening during concurrent marking.

- Then I gave SafepointSynchronize its own workers, and I now think this
is the best solution: indepdent of the GC and relatively isolated code-wise.


The other big thing in this change is the possibility to let the GC take
over deflation and nmethod marking. The motivation for this is simple:
GCs often scan threads themselves, and when they do, they can just as
well also do the deflation and nmethod marking. This is more efficient,
because it's better on caches, and because it parallelizes better (other
GC workers can do other GC stuff while some are still busy with the
threads). Notice that this change only provides convenient APIs for the
GCs to consume, but no actual implementation. More specifically:

- Deflation of idle monitors can be enabled for GC ops by overriding
VM_Operation::deflates_idle_monitors() to return true. This
automatically makes Threads::oops_do() and friends to deflate idle
monitors. CAUTION: this only works if the GC leaves oop's mark words
alone. Unfortunately, I think all GCs currently in OpenJDK preserve the
mark word and temporarily use it as forwarding pointer, and thus this
optimization is not possible. I have done it successfully in Shenandoah
GC. GC devs need to evaluate this.

- NMethod marking can be enabled by overriding
VM_Operation::marks_nmethods() to return true. In order to mark nmethods
during GC thread scanning, one has to call
NMethodSweeper::prepare_mark_active_nmethods() and pass the returned
CodeBlobClosure to Thread::oops_do() or
Threads::possibly_parallel_oops_do(). This is relatively simple and
should work for all existing GCs. Again, I have done it successfully in
Shenandoah GC.

- Hooking up deflation and nmethod marking also works with serial
safepoint cleanup. This may be useful for workloads where it's not worth
to spin up additional worker threads. They would still benefit from
improved cleanup at GC pauses.

Webrev:
http://cr.openjdk.java.net/~rkennke/8180932/webrev.00/
<http://cr.openjdk.java.net/%7Erkennke/8180932/webrev.00/>

Bug:
https://bugs.openjdk.java.net/browse/JDK-8180932

Testing: specjvm, specjbb, hotspot_gc

I suppose this requires CSR for the new options?

Opinions?

Roman

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: Parallelize safepoint cleanup

Erik Osterlund
Hi Roman,

Just something to keep in mind when letting the GC steal safepoint cleanup tasks from the safepoint synchronizer:

When the cleanup has been skipped, it is crucial that the GC operation indeed does keep to the contract. So if for example a JNI locker blocks the GC, one has to still perform the cleanup tasks before exiting the VM operation. Unless of course the order was inversed so that cleanup runs after the VM operation and could check what is left to be done. But I am not certain what the consequences would be of doing that.

Just something to keep in mind.

Thanks,
/Erik

> On 24 May 2017, at 16:40, Roman Kennke <[hidden email]> wrote:
>
> Erik Helin asked me on IRC to trim down the scope of this change and
> split up the big patch into 3: the parallel cleanup, the GC hookup for
> deflation, and the GC hookup for nmethods marking. So here comes the
> first part:
>
> http://cr.openjdk.java.net/~rkennke/8180932/webrev.01/
> <http://cr.openjdk.java.net/%7Erkennke/8180932/webrev.01/>
>
> The description for that part still applies. Should be simpler to review
> this way. Will file 2 more enhancement-bugs for the other two parts.
>
> Roman
>
>> Some operations in safepoint cleanup have been observed to (sometimes)
>> take significant time. Most notably, idle monitor deflation and nmethod
>> marking stick out in some popular applications and benchmarks.
>>
>> I propose to:
>> - parallelize safepoint cleanup processing
>> - enable to hook up idle monitor deflation and nmethod marking to GC VM
>> ops, if GC can support it (resulting in even more efficient
>> deflation/nmethod marking)
>>
>> In some of my measurements this resulted in much improved pause times.
>> For example, in one popular benchmark on a server-class machine, I got
>> total average pause time down from ~80ms to ~30ms. In none of my
>> measurements has this resulted in decreased performance (although it may
>> be possible to construct something. For example, it may not be worth to
>> spin up worker threads if there's no work to do.)
>>
>> Some implementation notes:
>>
>> I introduced a dedicated worker thread pool in SafepointSynchronize.
>> This is only initialized when -XX:+ParallelSafepointCleanup is enabled,
>> and uses -XX:ParallelSafepointCleanupThreads=X threads, defaulting to 8
>> threads (just a wild guess, open for discussion. With
>> -XX:-ParallelSafepointCleanup turned off (the default) it will use the
>> old serial safepoint cleanup processing (with optional GC hooks, see below).
>>
>> Parallel processing first lets all worker threads scan threads and
>> thereby deflate idle monitors and mark nmethods (in one pass). The rest
>> of the cleanup work is divided into claimed chunks by using SubTasksDone
>> (like, e.g., in G1RootProcessor).
>>
>> Notice that I tried a bunch of other alternatives:
>>
>> - First I tried to let Java threads deflate their own monitors on
>> safepoint arrival. This did not work out, because deflation (currently)
>> depends on all Java threads having arrived. Adding another sync point
>> there would have defeated the purpose.
>>
>> - Then I tried to always use workers of the current GC. This did not
>> work either, because the GC may be using them, for example if a cleanup
>> safepoint is happening during concurrent marking.
>>
>> - Then I gave SafepointSynchronize its own workers, and I now think this
>> is the best solution: indepdent of the GC and relatively isolated code-wise.
>>
>>
>> The other big thing in this change is the possibility to let the GC take
>> over deflation and nmethod marking. The motivation for this is simple:
>> GCs often scan threads themselves, and when they do, they can just as
>> well also do the deflation and nmethod marking. This is more efficient,
>> because it's better on caches, and because it parallelizes better (other
>> GC workers can do other GC stuff while some are still busy with the
>> threads). Notice that this change only provides convenient APIs for the
>> GCs to consume, but no actual implementation. More specifically:
>>
>> - Deflation of idle monitors can be enabled for GC ops by overriding
>> VM_Operation::deflates_idle_monitors() to return true. This
>> automatically makes Threads::oops_do() and friends to deflate idle
>> monitors. CAUTION: this only works if the GC leaves oop's mark words
>> alone. Unfortunately, I think all GCs currently in OpenJDK preserve the
>> mark word and temporarily use it as forwarding pointer, and thus this
>> optimization is not possible. I have done it successfully in Shenandoah
>> GC. GC devs need to evaluate this.
>>
>> - NMethod marking can be enabled by overriding
>> VM_Operation::marks_nmethods() to return true. In order to mark nmethods
>> during GC thread scanning, one has to call
>> NMethodSweeper::prepare_mark_active_nmethods() and pass the returned
>> CodeBlobClosure to Thread::oops_do() or
>> Threads::possibly_parallel_oops_do(). This is relatively simple and
>> should work for all existing GCs. Again, I have done it successfully in
>> Shenandoah GC.
>>
>> - Hooking up deflation and nmethod marking also works with serial
>> safepoint cleanup. This may be useful for workloads where it's not worth
>> to spin up additional worker threads. They would still benefit from
>> improved cleanup at GC pauses.
>>
>> Webrev:
>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.00/
>> <http://cr.openjdk.java.net/%7Erkennke/8180932/webrev.00/>
>>
>> Bug:
>> https://bugs.openjdk.java.net/browse/JDK-8180932
>>
>> Testing: specjvm, specjbb, hotspot_gc
>>
>> I suppose this requires CSR for the new options?
>>
>> Opinions?
>>
>> Roman
>>
>

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: Parallelize safepoint cleanup

Erik Osterlund
Hi Roman,

> On 24 May 2017, at 20:43, Roman Kennke <[hidden email]> wrote:
>
> Thought about it a little more, and probably warrants some comments:
>
> - I am not sure that it 'is crucial'. not deflating idle monitors or
> marking nmethods is not very nice, but is not catastrophic. It just
> means we cling to monitors or nmethods longer than we should (and
> probably induce an artificial cleanup safepoint).

I think the sweeper thread expects marking to be done in the next safepoint. If it isn't done, things will not work. As for deflation, you might be right. Although if ForceSafepoint issued specifically to scavenge monitors is coalesced in the same safepoint as the GC, then I am not 100% convinced things will work properly if it was skipped. It might, but I can't convince myself yet.

> That being said, I
> would not encourage to not clean up if you say you clean up (hey, I have
> 3 kids ;-) )

Wow!

> - Inversing the order and doing cleanup at the end of the pause is
> probably not a good idea, at least for deflating monitors. It means the
> GC would have to scan all those idle monitors that would otherwise have
> been deflated before. And some programs generate a lot of them!

Well, you did propose to merge them... ;)
Then once you get to the end of the safepoint, they are "ticked" as done and skipped!

/Erik

> Cheers, Roman
>
>> Hi Roman,
>>
>> Just something to keep in mind when letting the GC steal safepoint cleanup tasks from the safepoint synchronizer:
>>
>> When the cleanup has been skipped, it is crucial that the GC operation indeed does keep to the contract. So if for example a JNI locker blocks the GC, one has to still perform the cleanup tasks before exiting the VM operation. Unless of course the order was inversed so that cleanup runs after the VM operation and could check what is left to be done. But I am not certain what the consequences would be of doing that.
>>
>> Just something to keep in mind.
>>
>> Thanks,
>> /Erik
>>
>>> On 24 May 2017, at 16:40, Roman Kennke <[hidden email]> wrote:
>>>
>>> Erik Helin asked me on IRC to trim down the scope of this change and
>>> split up the big patch into 3: the parallel cleanup, the GC hookup for
>>> deflation, and the GC hookup for nmethods marking. So here comes the
>>> first part:
>>>
>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.01/
>>> <http://cr.openjdk.java.net/%7Erkennke/8180932/webrev.01/>
>>>
>>> The description for that part still applies. Should be simpler to review
>>> this way. Will file 2 more enhancement-bugs for the other two parts.
>>>
>>> Roman
>>>
>>>> Some operations in safepoint cleanup have been observed to (sometimes)
>>>> take significant time. Most notably, idle monitor deflation and nmethod
>>>> marking stick out in some popular applications and benchmarks.
>>>>
>>>> I propose to:
>>>> - parallelize safepoint cleanup processing
>>>> - enable to hook up idle monitor deflation and nmethod marking to GC VM
>>>> ops, if GC can support it (resulting in even more efficient
>>>> deflation/nmethod marking)
>>>>
>>>> In some of my measurements this resulted in much improved pause times.
>>>> For example, in one popular benchmark on a server-class machine, I got
>>>> total average pause time down from ~80ms to ~30ms. In none of my
>>>> measurements has this resulted in decreased performance (although it may
>>>> be possible to construct something. For example, it may not be worth to
>>>> spin up worker threads if there's no work to do.)
>>>>
>>>> Some implementation notes:
>>>>
>>>> I introduced a dedicated worker thread pool in SafepointSynchronize.
>>>> This is only initialized when -XX:+ParallelSafepointCleanup is enabled,
>>>> and uses -XX:ParallelSafepointCleanupThreads=X threads, defaulting to 8
>>>> threads (just a wild guess, open for discussion. With
>>>> -XX:-ParallelSafepointCleanup turned off (the default) it will use the
>>>> old serial safepoint cleanup processing (with optional GC hooks, see below).
>>>>
>>>> Parallel processing first lets all worker threads scan threads and
>>>> thereby deflate idle monitors and mark nmethods (in one pass). The rest
>>>> of the cleanup work is divided into claimed chunks by using SubTasksDone
>>>> (like, e.g., in G1RootProcessor).
>>>>
>>>> Notice that I tried a bunch of other alternatives:
>>>>
>>>> - First I tried to let Java threads deflate their own monitors on
>>>> safepoint arrival. This did not work out, because deflation (currently)
>>>> depends on all Java threads having arrived. Adding another sync point
>>>> there would have defeated the purpose.
>>>>
>>>> - Then I tried to always use workers of the current GC. This did not
>>>> work either, because the GC may be using them, for example if a cleanup
>>>> safepoint is happening during concurrent marking.
>>>>
>>>> - Then I gave SafepointSynchronize its own workers, and I now think this
>>>> is the best solution: indepdent of the GC and relatively isolated code-wise.
>>>>
>>>>
>>>> The other big thing in this change is the possibility to let the GC take
>>>> over deflation and nmethod marking. The motivation for this is simple:
>>>> GCs often scan threads themselves, and when they do, they can just as
>>>> well also do the deflation and nmethod marking. This is more efficient,
>>>> because it's better on caches, and because it parallelizes better (other
>>>> GC workers can do other GC stuff while some are still busy with the
>>>> threads). Notice that this change only provides convenient APIs for the
>>>> GCs to consume, but no actual implementation. More specifically:
>>>>
>>>> - Deflation of idle monitors can be enabled for GC ops by overriding
>>>> VM_Operation::deflates_idle_monitors() to return true. This
>>>> automatically makes Threads::oops_do() and friends to deflate idle
>>>> monitors. CAUTION: this only works if the GC leaves oop's mark words
>>>> alone. Unfortunately, I think all GCs currently in OpenJDK preserve the
>>>> mark word and temporarily use it as forwarding pointer, and thus this
>>>> optimization is not possible. I have done it successfully in Shenandoah
>>>> GC. GC devs need to evaluate this.
>>>>
>>>> - NMethod marking can be enabled by overriding
>>>> VM_Operation::marks_nmethods() to return true. In order to mark nmethods
>>>> during GC thread scanning, one has to call
>>>> NMethodSweeper::prepare_mark_active_nmethods() and pass the returned
>>>> CodeBlobClosure to Thread::oops_do() or
>>>> Threads::possibly_parallel_oops_do(). This is relatively simple and
>>>> should work for all existing GCs. Again, I have done it successfully in
>>>> Shenandoah GC.
>>>>
>>>> - Hooking up deflation and nmethod marking also works with serial
>>>> safepoint cleanup. This may be useful for workloads where it's not worth
>>>> to spin up additional worker threads. They would still benefit from
>>>> improved cleanup at GC pauses.
>>>>
>>>> Webrev:
>>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.00/
>>>> <http://cr.openjdk.java.net/%7Erkennke/8180932/webrev.00/>
>>>>
>>>> Bug:
>>>> https://bugs.openjdk.java.net/browse/JDK-8180932
>>>>
>>>> Testing: specjvm, specjbb, hotspot_gc
>>>>
>>>> I suppose this requires CSR for the new options?
>>>>
>>>> Opinions?
>>>>
>>>> Roman
>>>>
>

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: Parallelize safepoint cleanup

Kirk Pepperdine
In reply to this post by Roman Kennke-6
Hi Roman,

I also see opportunities for some wins with this enhancement.

>>
>>
>>
>> - Then I gave SafepointSynchronize its own workers, and I now think this
>> is the best solution: indepdent of the GC and relatively isolated code-wise.

This seems like the best option to me as everyone would benefit irregardless of choice of collector. For example, serial is still by far the best option for smaller devices/heaps. Additionally, I’d be hestiant to place more load on the GC threads.

Kind regards,
Kirk

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: Parallelize safepoint cleanup

David Holmes
In reply to this post by Roman Kennke-6
Hi Roman,

I will look at this when I get time. As always this needs a lot of
careful consideration and is not something I would want to rush into.
Putting it under an experimental opt-in flag initially would avoid the
need for a (possibly premature) CSR request.

Thanks,
David

On 25/05/2017 12:40 AM, Roman Kennke wrote:

> Erik Helin asked me on IRC to trim down the scope of this change and
> split up the big patch into 3: the parallel cleanup, the GC hookup for
> deflation, and the GC hookup for nmethods marking. So here comes the
> first part:
>
> http://cr.openjdk.java.net/~rkennke/8180932/webrev.01/
> <http://cr.openjdk.java.net/%7Erkennke/8180932/webrev.01/>
>
> The description for that part still applies. Should be simpler to review
> this way. Will file 2 more enhancement-bugs for the other two parts.
>
> Roman
>
>> Some operations in safepoint cleanup have been observed to (sometimes)
>> take significant time. Most notably, idle monitor deflation and nmethod
>> marking stick out in some popular applications and benchmarks.
>>
>> I propose to:
>> - parallelize safepoint cleanup processing
>> - enable to hook up idle monitor deflation and nmethod marking to GC VM
>> ops, if GC can support it (resulting in even more efficient
>> deflation/nmethod marking)
>>
>> In some of my measurements this resulted in much improved pause times.
>> For example, in one popular benchmark on a server-class machine, I got
>> total average pause time down from ~80ms to ~30ms. In none of my
>> measurements has this resulted in decreased performance (although it may
>> be possible to construct something. For example, it may not be worth to
>> spin up worker threads if there's no work to do.)
>>
>> Some implementation notes:
>>
>> I introduced a dedicated worker thread pool in SafepointSynchronize.
>> This is only initialized when -XX:+ParallelSafepointCleanup is enabled,
>> and uses -XX:ParallelSafepointCleanupThreads=X threads, defaulting to 8
>> threads (just a wild guess, open for discussion. With
>> -XX:-ParallelSafepointCleanup turned off (the default) it will use the
>> old serial safepoint cleanup processing (with optional GC hooks, see below).
>>
>> Parallel processing first lets all worker threads scan threads and
>> thereby deflate idle monitors and mark nmethods (in one pass). The rest
>> of the cleanup work is divided into claimed chunks by using SubTasksDone
>> (like, e.g., in G1RootProcessor).
>>
>> Notice that I tried a bunch of other alternatives:
>>
>> - First I tried to let Java threads deflate their own monitors on
>> safepoint arrival. This did not work out, because deflation (currently)
>> depends on all Java threads having arrived. Adding another sync point
>> there would have defeated the purpose.
>>
>> - Then I tried to always use workers of the current GC. This did not
>> work either, because the GC may be using them, for example if a cleanup
>> safepoint is happening during concurrent marking.
>>
>> - Then I gave SafepointSynchronize its own workers, and I now think this
>> is the best solution: indepdent of the GC and relatively isolated code-wise.
>>
>>
>> The other big thing in this change is the possibility to let the GC take
>> over deflation and nmethod marking. The motivation for this is simple:
>> GCs often scan threads themselves, and when they do, they can just as
>> well also do the deflation and nmethod marking. This is more efficient,
>> because it's better on caches, and because it parallelizes better (other
>> GC workers can do other GC stuff while some are still busy with the
>> threads). Notice that this change only provides convenient APIs for the
>> GCs to consume, but no actual implementation. More specifically:
>>
>> - Deflation of idle monitors can be enabled for GC ops by overriding
>> VM_Operation::deflates_idle_monitors() to return true. This
>> automatically makes Threads::oops_do() and friends to deflate idle
>> monitors. CAUTION: this only works if the GC leaves oop's mark words
>> alone. Unfortunately, I think all GCs currently in OpenJDK preserve the
>> mark word and temporarily use it as forwarding pointer, and thus this
>> optimization is not possible. I have done it successfully in Shenandoah
>> GC. GC devs need to evaluate this.
>>
>> - NMethod marking can be enabled by overriding
>> VM_Operation::marks_nmethods() to return true. In order to mark nmethods
>> during GC thread scanning, one has to call
>> NMethodSweeper::prepare_mark_active_nmethods() and pass the returned
>> CodeBlobClosure to Thread::oops_do() or
>> Threads::possibly_parallel_oops_do(). This is relatively simple and
>> should work for all existing GCs. Again, I have done it successfully in
>> Shenandoah GC.
>>
>> - Hooking up deflation and nmethod marking also works with serial
>> safepoint cleanup. This may be useful for workloads where it's not worth
>> to spin up additional worker threads. They would still benefit from
>> improved cleanup at GC pauses.
>>
>> Webrev:
>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.00/
>> <http://cr.openjdk.java.net/%7Erkennke/8180932/webrev.00/>
>>
>> Bug:
>> https://bugs.openjdk.java.net/browse/JDK-8180932
>>
>> Testing: specjvm, specjbb, hotspot_gc
>>
>> I suppose this requires CSR for the new options?
>>
>> Opinions?
>>
>> Roman
>>
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: Parallelize safepoint cleanup

David Holmes
In reply to this post by Roman Kennke-6
Hi Roman,

Has a RFE been filed for this yet?

On 24/05/2017 8:43 PM, Roman Kennke wrote:
> Some operations in safepoint cleanup have been observed to (sometimes)
> take significant time. Most notably, idle monitor deflation and nmethod
> marking stick out in some popular applications and benchmarks.
>
> I propose to:
> - parallelize safepoint cleanup processing
> - enable to hook up idle monitor deflation and nmethod marking to GC VM
> ops, if GC can support it (resulting in even more efficient
> deflation/nmethod marking)

This seems reasonable in principle. Though I do cringe at creating yet
another set of VM threads! But as it is opt-in that is not so bad.

This will need to be reworked a little to fit in with some recent
changes done under:
     - 8152953: ForceSafepoint operations should be more specific
     - 8152955: Many safepoints of "no vm operation" kind

> In some of my measurements this resulted in much improved pause times.
> For example, in one popular benchmark on a server-class machine, I got
> total average pause time down from ~80ms to ~30ms. In none of my
> measurements has this resulted in decreased performance (although it may
> be possible to construct something. For example, it may not be worth to
> spin up worker threads if there's no work to do.)

The impact on startup time should be measured. For opt-in this is not a
big deal, but if this were to be the default someday then it does have
to be considered.

> Some implementation notes:
>
> I introduced a dedicated worker thread pool in SafepointSynchronize.
> This is only initialized when -XX:+ParallelSafepointCleanup is enabled,
> and uses -XX:ParallelSafepointCleanupThreads=X threads, defaulting to 8
> threads (just a wild guess, open for discussion. With
> -XX:-ParallelSafepointCleanup turned off (the default) it will use the
> old serial safepoint cleanup processing (with optional GC hooks, see below).

I think the default needs to be ergonomically set, as for other "thread
pools". Both flags should be experimental at this stage.

> Parallel processing first lets all worker threads scan threads and
> thereby deflate idle monitors and mark nmethods (in one pass). The rest
> of the cleanup work is divided into claimed chunks by using SubTasksDone
> (like, e.g., in G1RootProcessor).
>
> Notice that I tried a bunch of other alternatives:
>
> - First I tried to let Java threads deflate their own monitors on
> safepoint arrival. This did not work out, because deflation (currently)
> depends on all Java threads having arrived. Adding another sync point
> there would have defeated the purpose.

Right, deflation by its very nature must occur at a safepoint regardless
of which thread(s) actually does the work.

> - Then I tried to always use workers of the current GC. This did not
> work either, because the GC may be using them, for example if a cleanup
> safepoint is happening during concurrent marking.
>
> - Then I gave SafepointSynchronize its own workers, and I now think this
> is the best solution: indepdent of the GC and relatively isolated code-wise.
>
>
> The other big thing in this change is the possibility to let the GC take
> over deflation and nmethod marking. The motivation for this is simple:
> GCs often scan threads themselves, and when they do, they can just as
> well also do the deflation and nmethod marking. This is more efficient,
> because it's better on caches, and because it parallelizes better (other
> GC workers can do other GC stuff while some are still busy with the
> threads). Notice that this change only provides convenient APIs for the
> GCs to consume, but no actual implementation. More specifically:
>
> - Deflation of idle monitors can be enabled for GC ops by overriding
> VM_Operation::deflates_idle_monitors() to return true. This
> automatically makes Threads::oops_do() and friends to deflate idle
> monitors. CAUTION: this only works if the GC leaves oop's mark words
> alone. Unfortunately, I think all GCs currently in OpenJDK preserve the
> mark word and temporarily use it as forwarding pointer, and thus this
> optimization is not possible. I have done it successfully in Shenandoah
> GC. GC devs need to evaluate this.
>
> - NMethod marking can be enabled by overriding
> VM_Operation::marks_nmethods() to return true. In order to mark nmethods
> during GC thread scanning, one has to call
> NMethodSweeper::prepare_mark_active_nmethods() and pass the returned
> CodeBlobClosure to Thread::oops_do() or
> Threads::possibly_parallel_oops_do(). This is relatively simple and
> should work for all existing GCs. Again, I have done it successfully in
> Shenandoah GC.
>
> - Hooking up deflation and nmethod marking also works with serial
> safepoint cleanup. This may be useful for workloads where it's not worth
> to spin up additional worker threads. They would still benefit from
> improved cleanup at GC pauses.

That sounds reasonable, but the devil is in the details and the GC (and
compiler?) folk would need to give that due consideration.

> Webrev:
> http://cr.openjdk.java.net/~rkennke/8180932/webrev.00/
> <http://cr.openjdk.java.net/%7Erkennke/8180932/webrev.00/>

A few specific, minor comments:

src/share/vm/runtime/safepoint.cpp

  549   // If op does both deflating and nmethod marking, we don't
bother firing up
  550   // the workers.
  551   bool op_does_cleanup = op != NULL && op->marks_nmethods() &&
op->deflates_idle_monitors();
  552   if (ParallelSafepointCleanup && ! op_does_cleanup) {

I'd prefer to see the op_does_cleanup logic evaluated only when
ParallelSafepointCleanup is true.

---

src/share/vm/runtime/sweeper.cpp

  205     // TODO: Is this really needed?
  206     OrderAccess::storestore();

I can't see what following store the original code is trying to maintain
order with. ?? Compiler folk would need to chime in on that I think.

---

src/share/vm/runtime/synchronizer.cpp

1675       if (obj != NULL && cl != NULL) {
1676         cl->do_oop((oop*) mid->object_addr());
1677       }

What is this logic doing? I'm unclear why we need to do something to a
monitor we could not deflate.

1796   thread->omInUseCount-= deflated_count;

Nit: space needed before -=

---

src/share/vm/runtime/vm_operations.hpp

  222   // a CodeBlobClosure*, which then needs to be passes as
nmethods_cl argument

Typo: passes as -> passed as the

---

Thanks,
David
-----

> Bug:
> https://bugs.openjdk.java.net/browse/JDK-8180932
>
> Testing: specjvm, specjbb, hotspot_gc
>
> I suppose this requires CSR for the new options?
>
> Opinions?
>
> Roman
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: Parallelize safepoint cleanup

Roman Kennke-6

Hi David,

> Has a RFE been filed for this yet?

https://bugs.openjdk.java.net/browse/JDK-8180932

Also noticed that you reviewed the first version of the patch. Erik
Helin asked me to split out the GC related parts and only propose/review
the parallel safepoint parts for now:

http://cr.openjdk.java.net/~rkennke/8180932/webrev.01/
<http://cr.openjdk.java.net/%7Erkennke/8180932/webrev.01/>

This makes most of your comments go away for now :-)

Let me reply to some of them anyway:

> On 24/05/2017 8:43 PM, Roman Kennke wrote:
>> Some operations in safepoint cleanup have been observed to (sometimes)
>> take significant time. Most notably, idle monitor deflation and nmethod
>> marking stick out in some popular applications and benchmarks.
>>
>> I propose to:
>> - parallelize safepoint cleanup processing
>> - enable to hook up idle monitor deflation and nmethod marking to GC VM
>> ops, if GC can support it (resulting in even more efficient
>> deflation/nmethod marking)
>
> This seems reasonable in principle. Though I do cringe at creating yet
> another set of VM threads! But as it is opt-in that is not so bad.
>
> This will need to be reworked a little to fit in with some recent
> changes done under:
>     - 8152953: ForceSafepoint operations should be more specific
>     - 8152955: Many safepoints of "no vm operation" kind

Ok will do.

>> In some of my measurements this resulted in much improved pause times.
>> For example, in one popular benchmark on a server-class machine, I got
>> total average pause time down from ~80ms to ~30ms. In none of my
>> measurements has this resulted in decreased performance (although it may
>> be possible to construct something. For example, it may not be worth to
>> spin up worker threads if there's no work to do.)
>
> The impact on startup time should be measured. For opt-in this is not
> a big deal, but if this were to be the default someday then it does
> have to be considered.
Do we have a test for this? Or should I try to fire up HelloWorld 100x
and see how well that performs with and without the patch?

>> Some implementation notes:
>>
>> I introduced a dedicated worker thread pool in SafepointSynchronize.
>> This is only initialized when -XX:+ParallelSafepointCleanup is enabled,
>> and uses -XX:ParallelSafepointCleanupThreads=X threads, defaulting to 8
>> threads (just a wild guess, open for discussion. With
>> -XX:-ParallelSafepointCleanup turned off (the default) it will use the
>> old serial safepoint cleanup processing (with optional GC hooks, see
>> below).
>
> I think the default needs to be ergonomically set, as for other
> "thread pools". Both flags should be experimental at this stage.

Ok

>> - Then I tried to always use workers of the current GC. This did not
>> work either, because the GC may be using them, for example if a cleanup
>> safepoint is happening during concurrent marking.
>>
>> - Then I gave SafepointSynchronize its own workers, and I now think this
>> is the best solution: indepdent of the GC and relatively isolated
>> code-wise.
>>
>>
>> The other big thing in this change is the possibility to let the GC take
>> over deflation and nmethod marking. The motivation for this is simple:
>> GCs often scan threads themselves, and when they do, they can just as
>> well also do the deflation and nmethod marking. This is more efficient,
>> because it's better on caches, and because it parallelizes better (other
>> GC workers can do other GC stuff while some are still busy with the
>> threads). Notice that this change only provides convenient APIs for the
>> GCs to consume, but no actual implementation. More specifically:
>>
>> - Deflation of idle monitors can be enabled for GC ops by overriding
>> VM_Operation::deflates_idle_monitors() to return true. This
>> automatically makes Threads::oops_do() and friends to deflate idle
>> monitors. CAUTION: this only works if the GC leaves oop's mark words
>> alone. Unfortunately, I think all GCs currently in OpenJDK preserve the
>> mark word and temporarily use it as forwarding pointer, and thus this
>> optimization is not possible. I have done it successfully in Shenandoah
>> GC. GC devs need to evaluate this.
>>
>> - NMethod marking can be enabled by overriding
>> VM_Operation::marks_nmethods() to return true. In order to mark nmethods
>> during GC thread scanning, one has to call
>> NMethodSweeper::prepare_mark_active_nmethods() and pass the returned
>> CodeBlobClosure to Thread::oops_do() or
>> Threads::possibly_parallel_oops_do(). This is relatively simple and
>> should work for all existing GCs. Again, I have done it successfully in
>> Shenandoah GC.
>>
>> - Hooking up deflation and nmethod marking also works with serial
>> safepoint cleanup. This may be useful for workloads where it's not worth
>> to spin up additional worker threads. They would still benefit from
>> improved cleanup at GC pauses.
>
> That sounds reasonable, but the devil is in the details and the GC
> (and compiler?) folk would need to give that due consideration.

Those GC related parts are exluded for now, but I'm in contact with them
to work out the details.

> src/share/vm/runtime/safepoint.cpp
>
>  549   // If op does both deflating and nmethod marking, we don't
> bother firing up
>  550   // the workers.
>  551   bool op_does_cleanup = op != NULL && op->marks_nmethods() &&
> op->deflates_idle_monitors();
>  552   if (ParallelSafepointCleanup && ! op_does_cleanup) {
>
> I'd prefer to see the op_does_cleanup logic evaluated only when
> ParallelSafepointCleanup is true.

Right.

> src/share/vm/runtime/sweeper.cpp
>
>  205     // TODO: Is this really needed?
>  206     OrderAccess::storestore();
>
> I can't see what following store the original code is trying to
> maintain order with. ?? Compiler folk would need to chime in on that I
> think.

The only one possibly being affected by this (as far as I can tell) is
the NMethodSweeper thread and the compiler threads, and I think both
only resume work after the safepoint, so should get a fence() in any
case. Right?

> ---
>
> src/share/vm/runtime/synchronizer.cpp
>
> 1675       if (obj != NULL && cl != NULL) {
> 1676         cl->do_oop((oop*) mid->object_addr());
> 1677       }
>
> What is this logic doing? I'm unclear why we need to do something to a
> monitor we could not deflate.

This part is out for now, but it's basically a combined
deflation/iteration pass over the monitors. It scans all of the thread's
monitors and either deflates them or else calls the closure to, e.g.,
mark its oop (or whatever the GC wants it to do). The idea is that when
the GC needs to scan the monitors anyway, we don't need two passes over
it, but can deflate and apply the closure in one pass.

I will update my patch to latest jdk10-hs and incorporate the changes
that you suggested and send the updated patch for review soon. Thanks
for reviewing!

Roman
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: Parallelize safepoint cleanup

David Holmes
In reply to this post by David Holmes
Hi Roman,

Very quick response...

On 29/05/2017 6:23 PM, Roman Kennke wrote:
> Hi David,
>
> the patch applied cleanly, i.e. was not affected by any hotspot changes.

I don't think those changes have been pushed yet.

David
-----

> Most of your suggestions applied to the GC part that I excluded from the
> scope of this change, except for the following:
>
>>> I introduced a dedicated worker thread pool in SafepointSynchronize.
>>> This is only initialized when -XX:+ParallelSafepointCleanup is enabled,
>>> and uses -XX:ParallelSafepointCleanupThreads=X threads, defaulting to 8
>>> threads (just a wild guess, open for discussion. With
>>> -XX:-ParallelSafepointCleanup turned off (the default) it will use the
>>> old serial safepoint cleanup processing (with optional GC hooks, see
>>> below).
>>
>> I think the default needs to be ergonomically set, as for other
>> "thread pools". Both flags should be experimental at this stage.
>
> I implemented this using the following:
>
>    if (ParallelSafepointCleanup &&
> FLAG_IS_DEFAULT(ParallelSafepointCleanupThreads)) {
>      // This will pick up ParallelGCThreads if set...
>      FLAG_SET_DEFAULT(ParallelSafepointCleanupThreads,
> Abstract_VM_Version::parallel_worker_threads());
>    }
>
> As noted, it will pick up ParallelGCThreads, which might be a reasonable
> number. Otherwise we'd need to re-work the logic in vm_version.cpp to
> calculate the number of worker threads without considering
> ParallelGCThreads.
>
> http://cr.openjdk.java.net/~rkennke/8180932/webrev.02/
> <http://cr.openjdk.java.net/%7Erkennke/8180932/webrev.02/>
>
> What do you think?
>
> Roman
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: Parallelize safepoint cleanup

Roman Kennke-6
Am 29.05.2017 um 10:53 schrieb David Holmes:

> Hi Roman,
>
> Very quick response...
>
> On 29/05/2017 6:23 PM, Roman Kennke wrote:
>> Hi David,
>>
>> the patch applied cleanly, i.e. was not affected by any hotspot changes.
>
> I don't think those changes have been pushed yet.

I just checked. Both:

    - 8152953: ForceSafepoint operations should be more specific
    - 8152955: Many safepoints of "no vm operation" kind

are in jdk10-hs

Roman

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: Parallelize safepoint cleanup

Kim Barrett
In reply to this post by David Holmes
> On May 29, 2017, at 4:23 AM, Roman Kennke <[hidden email]> wrote:
> I implemented this using the following:
>
>  if (ParallelSafepointCleanup &&
> FLAG_IS_DEFAULT(ParallelSafepointCleanupThreads)) {
>    // This will pick up ParallelGCThreads if set...
>    FLAG_SET_DEFAULT(ParallelSafepointCleanupThreads,
> Abstract_VM_Version::parallel_worker_threads());
>  }

Just following along, and not presently planning to do a real review, but the
above caught my eye.  Clients of the VM_Version infrastructure should never
directly reference Abstract_VM_Version::<anything>, as doing so may bypass
platform-specific specializations.

I thought I’d filed a bug about that sort of thing, but apparently that got pushed
deep onto my todo list.  I’ll go take care of that now.

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: Parallelize safepoint cleanup

David Holmes
On 29/05/2017 8:47 PM, Kim Barrett wrote:

>> On May 29, 2017, at 4:23 AM, Roman Kennke <[hidden email]> wrote:
>> I implemented this using the following:
>>
>>   if (ParallelSafepointCleanup &&
>> FLAG_IS_DEFAULT(ParallelSafepointCleanupThreads)) {
>>     // This will pick up ParallelGCThreads if set...
>>     FLAG_SET_DEFAULT(ParallelSafepointCleanupThreads,
>> Abstract_VM_Version::parallel_worker_threads());
>>   }
>
> Just following along, and not presently planning to do a real review, but the
> above caught my eye.  Clients of the VM_Version infrastructure should never
> directly reference Abstract_VM_Version::<anything>, as doing so may bypass
> platform-specific specializations.
>
> I thought I’d filed a bug about that sort of thing, but apparently that got pushed
> deep onto my todo list.  I’ll go take care of that now.

Beat you to it by six years:

https://bugs.openjdk.java.net/browse/JDK-7041262

That's a pretty sad statement. Fortunately most things in
Abstract_VM_Version are not overridden.

David

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: Parallelize safepoint cleanup

Robbin Ehn
In reply to this post by David Holmes
Hi Roman,

On 05/29/2017 10:23 AM, Roman Kennke wrote:

> Hi David,
>
> the patch applied cleanly, i.e. was not affected by any hotspot changes.
> Most of your suggestions applied to the GC part that I excluded from the
> scope of this change, except for the following:
>
>>> I introduced a dedicated worker thread pool in SafepointSynchronize.
>>> This is only initialized when -XX:+ParallelSafepointCleanup is enabled,
>>> and uses -XX:ParallelSafepointCleanupThreads=X threads, defaulting to 8
>>> threads (just a wild guess, open for discussion. With
>>> -XX:-ParallelSafepointCleanup turned off (the default) it will use the
>>> old serial safepoint cleanup processing (with optional GC hooks, see
>>> below).
>>
>> I think the default needs to be ergonomically set, as for other
>> "thread pools". Both flags should be experimental at this stage.
>
> I implemented this using the following:
>
>    if (ParallelSafepointCleanup &&
> FLAG_IS_DEFAULT(ParallelSafepointCleanupThreads)) {
>      // This will pick up ParallelGCThreads if set...
>      FLAG_SET_DEFAULT(ParallelSafepointCleanupThreads,
> Abstract_VM_Version::parallel_worker_threads());
>    }
>
> As noted, it will pick up ParallelGCThreads, which might be a reasonable
> number. Otherwise we'd need to re-work the logic in vm_version.cpp to
> calculate the number of worker threads without considering
> ParallelGCThreads.
>
> http://cr.openjdk.java.net/~rkennke/8180932/webrev.02/
> <http://cr.openjdk.java.net/%7Erkennke/8180932/webrev.02/>
>

This worries me because heuristics will be very hard to the average user.

1885   if (ParallelSafepointCleanup && FLAG_IS_DEFAULT(ParallelSafepointCleanupThreads)) {
1886     // This will pick up ParallelGCThreads if set...
1887     FLAG_SET_DEFAULT(ParallelSafepointCleanupThreads, Abstract_VM_Version::parallel_worker_threads());
1888   }

So on a 8 vcpu machine we could be running 8 threads cleanup and 8 workers threads might be marking?

"and I now think this is the best solution"

I don't agree that is the the best solution and I'm not in favor of a second thread pool.
If you want short safepoints it seem much better to use available cpu time finishing the cleanup.
Maybe steal or reserve worker threads... just keep 1 pool of thread which we can do decent sizing for.

"For example, in one popular benchmark on a server-class machine, I got
total average pause time down from ~80ms to ~30ms."

What's the numbers for a 4 core/8 threads machine? And other GCs?

Looked through the patch quick, I think you should do some abstraction.
Now this code is already lacking abstraction so I understand why you did it this way.

For example:
void ObjectSynchronizer::deflate_idle_monitors(bool deflate_thread_local_monitors)
...
   if (MonitorInUseLists) {
     if (deflate_thread_local_monitors) {

You have mandatory parameter which is ignore by a global flag.
Right now there are 4 different cases in the same code path.
Does ParallelSafepointCleanup work with -MonitorInUseLists ?
Please create some kind of abstraction for this and set it a startup.

I do not see any reason for duplicating the SafepointSynchronize::serial_cleanup(), just create proper abstraction so it can be used in both cases?

Thanks!

/Robbin

> What do you think?
>
> Roman
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: Parallelize safepoint cleanup

Roman Kennke-6
Am 29.05.2017 um 15:39 schrieb Erik Österlund:
> Hi,
>
> How do we feel about reusing the Parallel GC Threads work gang during
> safepoint cleanup?
I actually tried that, but failed, at least with Shenandoah, because...

> We know that the Parallel GC Threads worker gang is idle then.

Not Shenandoah :-) Shenandoah GC threads are currently happy to continue
working during (non-GC) safepoints. One exception is the heap dumper,
which currently requires some extra synchronization. If you know of any
other safepoints that would require stopping concurrent GC threads
(marking, evacuating or updating references) let me know. I haven't
found any. Parallelizing safepoint cleanup might be a reason to
implement GC thread safepointing though.

Another problem with this approach is that it's clearly GC dependend,
and requires support (i.e. a worker thread pool) and coordination (i.e.
pause-resume GC work) by the GC. For example, as Kirk pointed out,
serial gc couldn't support that.

> This saves creating another gazillion threads and another JVM flag to
> keep track of.

It would only be a few threads. There's just not that much work to do.
The prime scenario for this seems to be many- (like hundreds or
thousands of) threads work loads, in which case it hardly matters to add
some more to support more efficient cleanup.

Roman

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: Parallelize safepoint cleanup

Erik Osterlund


On 2017-05-29 16:08, Roman Kennke wrote:

> Am 29.05.2017 um 15:39 schrieb Erik Österlund:
>> Hi,
>>
>> How do we feel about reusing the Parallel GC Threads work gang during
>> safepoint cleanup?
> I actually tried that, but failed, at least with Shenandoah, because...
>
>> We know that the Parallel GC Threads worker gang is idle then.
> Not Shenandoah :-) Shenandoah GC threads are currently happy to continue
> working during (non-GC) safepoints. One exception is the heap dumper,
> which currently requires some extra synchronization. If you know of any
> other safepoints that would require stopping concurrent GC threads
> (marking, evacuating or updating references) let me know. I haven't
> found any. Parallelizing safepoint cleanup might be a reason to
> implement GC thread safepointing though.

So perhaps one solution would be to have a
SafepointSynchronize::worker_gang() that is initialized during
bootstrapping. Then the GC can decide whether it wants to share its
ParallelGCThreads worker gang, or do something else like creating a new
worker gang.

> Another problem with this approach is that it's clearly GC dependend,
> and requires support (i.e. a worker thread pool) and coordination (i.e.
> pause-resume GC work) by the GC. For example, as Kirk pointed out,
> serial gc couldn't support that.

You typically use Serial GC because you don't want a bunch of worker
gangs. I don't think safepoint cleanup is an exception to that.

>
>> This saves creating another gazillion threads and another JVM flag to
>> keep track of.
> It would only be a few threads. There's just not that much work to do.
> The prime scenario for this seems to be many- (like hundreds or
> thousands of) threads work loads, in which case it hardly matters to add
> some more to support more efficient cleanup.

I see your point. I am just a bit hesitant to adding new worker gangs
with user configurable knobs unless there is no other way.

Thanks,
/Erik

> Roman
>

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: Parallelize safepoint cleanup

David Holmes
I missed this before replying earlier up the thread ...

On 30/05/2017 1:25 AM, Erik Österlund wrote:

>
>
> On 2017-05-29 16:08, Roman Kennke wrote:
>> Am 29.05.2017 um 15:39 schrieb Erik Österlund:
>>> Hi,
>>>
>>> How do we feel about reusing the Parallel GC Threads work gang during
>>> safepoint cleanup?
>> I actually tried that, but failed, at least with Shenandoah, because...
>>
>>> We know that the Parallel GC Threads worker gang is idle then.
>> Not Shenandoah :-) Shenandoah GC threads are currently happy to continue
>> working during (non-GC) safepoints. One exception is the heap dumper,
>> which currently requires some extra synchronization. If you know of any
>> other safepoints that would require stopping concurrent GC threads
>> (marking, evacuating or updating references) let me know. I haven't
>> found any. Parallelizing safepoint cleanup might be a reason to
>> implement GC thread safepointing though.
>
> So perhaps one solution would be to have a
> SafepointSynchronize::worker_gang() that is initialized during
> bootstrapping. Then the GC can decide whether it wants to share its
> ParallelGCThreads worker gang, or do something else like creating a new
> worker gang.

I think the best approach would be to factor out the "worker gang" into
a more general purpose thread pool (of non-JavaThreads) which can be
dynamically sized, buffer tasks during high demand etc ( as per
java.util.concurrent.ThreadPoolExecutor). Thereby isolating the worker
threads from any specific task they may be engaged for.

Cheers,
David
-----

>> Another problem with this approach is that it's clearly GC dependend,
>> and requires support (i.e. a worker thread pool) and coordination (i.e.
>> pause-resume GC work) by the GC. For example, as Kirk pointed out,
>> serial gc couldn't support that.
>
> You typically use Serial GC because you don't want a bunch of worker
> gangs. I don't think safepoint cleanup is an exception to that.
>
>>
>>> This saves creating another gazillion threads and another JVM flag to
>>> keep track of.
>> It would only be a few threads. There's just not that much work to do.
>> The prime scenario for this seems to be many- (like hundreds or
>> thousands of) threads work loads, in which case it hardly matters to add
>> some more to support more efficient cleanup.
>
> I see your point. I am just a bit hesitant to adding new worker gangs
> with user configurable knobs unless there is no other way.
>
> Thanks,
> /Erik
>
>> Roman
>>
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: Parallelize safepoint cleanup

David Holmes
In reply to this post by David Holmes
Hi Roman,

On 29/05/2017 6:23 PM, Roman Kennke wrote:

> Hi David,
>
> the patch applied cleanly, i.e. was not affected by any hotspot changes.
> Most of your suggestions applied to the GC part that I excluded from the
> scope of this change, except for the following:
>
>>> I introduced a dedicated worker thread pool in SafepointSynchronize.
>>> This is only initialized when -XX:+ParallelSafepointCleanup is enabled,
>>> and uses -XX:ParallelSafepointCleanupThreads=X threads, defaulting to 8
>>> threads (just a wild guess, open for discussion. With
>>> -XX:-ParallelSafepointCleanup turned off (the default) it will use the
>>> old serial safepoint cleanup processing (with optional GC hooks, see
>>> below).
>>
>> I think the default needs to be ergonomically set, as for other
>> "thread pools". Both flags should be experimental at this stage.
>
> I implemented this using the following:
>
>    if (ParallelSafepointCleanup &&
> FLAG_IS_DEFAULT(ParallelSafepointCleanupThreads)) {
>      // This will pick up ParallelGCThreads if set...
>      FLAG_SET_DEFAULT(ParallelSafepointCleanupThreads,
> Abstract_VM_Version::parallel_worker_threads());
>    }
>
> As noted, it will pick up ParallelGCThreads, which might be a reasonable
> number. Otherwise we'd need to re-work the logic in vm_version.cpp to
> calculate the number of worker threads without considering
> ParallelGCThreads.

I think this is way too big a number of threads to consider for this
situation. GC has a huge amount of work to do, but the safepoint cleanup
should be much less work.

I'm more inclined now to set a small hard-wired default initially - say
4 - and for future work look at generalising the existing fixed GC
thread pool into a more general purpose, dynamic thread pool (something
similar to java.util.concurrent.ThreadPoolExecutor.) I note others
object to a dedicated thread pool for this, so generalising the existing
one seems to be a reasonable future direction, whilst avoiding tying
this to specific GCs.

Thanks,
David
-----

> http://cr.openjdk.java.net/~rkennke/8180932/webrev.02/
> <http://cr.openjdk.java.net/%7Erkennke/8180932/webrev.02/>
>
> What do you think?
>
> Roman
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: Parallelize safepoint cleanup

Robbin Ehn
In reply to this post by Robbin Ehn
Hi Roman,

On 05/29/2017 04:16 PM, Roman Kennke wrote:
>
> I agree that having a single pool would be good. The current WorkGang
> doesn't do it though, because we can't borrow threads while the GC is
> doing work, at least not in a way that is GC agnostic (see my reply to
> Robbin).

Correct me if I'm wrong, using WorkGang works for all the GCs that we have in JDK10.
(serial I would assume just use the VmThread)

If Shenandoah needs modification to the WorkGang then you would be-able to have them in your repo.
If the modification is very large alternative you could use separate thread pool for this in your Shenandoah repo.
Since for vanilla JDK 10, as now, I don't see we need it?

>
> Yeah, I will think about it. The easiest way out would be to treat the
> serial case as 1-thread-MT case (called by VMThread), but this also
> means we can't easily keep the existing code path as-is while having a
> new separate code path. Maybe give the new code path some testing and
> time to iron out bugs (if any) and later switch the old path to a
> degenerated 1-threaded-MT path?

This sounds good, I think you could get away with doing just 1 case and we will iron out bugs faster I guess.

Thanks Robbin

>
> Roman
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: Parallelize safepoint cleanup

Roman Kennke-6
Hi Robbin,

I'm currently testing  an abstraction that allows to use GC workers, and also makes serial and parallel cleanup to use the same code path. Stay tuned ;-)

/Roman

Am 30. Mai 2017 14:57:23 MESZ schrieb Robbin Ehn <[hidden email]>:

>Hi Roman,
>
>On 05/29/2017 04:16 PM, Roman Kennke wrote:
>>
>> I agree that having a single pool would be good. The current WorkGang
>> doesn't do it though, because we can't borrow threads while the GC is
>> doing work, at least not in a way that is GC agnostic (see my reply
>to
>> Robbin).
>
>Correct me if I'm wrong, using WorkGang works for all the GCs that we
>have in JDK10.
>(serial I would assume just use the VmThread)
>
>If Shenandoah needs modification to the WorkGang then you would be-able
>to have them in your repo.
>If the modification is very large alternative you could use separate
>thread pool for this in your Shenandoah repo.
>Since for vanilla JDK 10, as now, I don't see we need it?
>
>>
>> Yeah, I will think about it. The easiest way out would be to treat
>the
>> serial case as 1-thread-MT case (called by VMThread), but this also
>> means we can't easily keep the existing code path as-is while having
>a
>> new separate code path. Maybe give the new code path some testing and
>> time to iron out bugs (if any) and later switch the old path to a
>> degenerated 1-threaded-MT path?
>
>This sounds good, I think you could get away with doing just 1 case and
>we will iron out bugs faster I guess.
>
>Thanks Robbin
>
>>
>> Roman
>>

--
Sent from my FairPhone
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: Parallelize safepoint cleanup

Robbin Ehn
In reply to this post by David Holmes
Hi Roman, I agree that is really needed but:

On 05/31/2017 10:27 AM, Roman Kennke wrote:

> I realized that sharing workers with GC is not so easy.
>
> We need to be able to use the workers at a safepoint during concurrent
> GC work (which also uses the same workers). This does not only require
> that those workers be suspended, like e.g.
> SuspendibleThreadSet::yield(), but they need to be idle, i.e. have
> finished their tasks. This needs some careful handling to work without
> races: it requires a SuspendibleThreadSetJoiner around the corresponding
> run_task() call and also the tasks themselves need to join the STS and
> handle requests for safepoints not by yielding, but by leaving the task.
> This is far too peculiar for me to make the call to hook up GC workers
> for safepoint cleanup, and I thus removed those parts. I left the API in
> CollectedHeap in place. I think GC devs who know better about G1 and CMS
> should make that call, or else just use a separate thread pool.
>
> http://cr.openjdk.java.net/~rkennke/8180932/webrev.05/
> <http://cr.openjdk.java.net/%7Erkennke/8180932/webrev.05/>
>
> Is it ok now?

I still think you should put the "Parallel Safepoint Cleanup" workers inside Shenandoah,
so the SafepointSynchronizer only calls get_safepoint_workers, e.g.:

_cleanup_workers = heap->get_safepoint_workers();
_num_cleanup_workers = _cleanup_workers != NULL ? _cleanup_workers->total_workers() : 1;
ParallelSPCleanupTask cleanup(_cleanup_subtasks);
StrongRootsScope srs(_num_cleanup_workers);
if (_cleanup_workers != NULL) {
   _cleanup_workers->run_task(&cleanup, _num_cleanup_workers);
} else {
   cleanup.work(0);
}

That way you don't even need your new flags, but it will be up to the other GCs to make their worker available
or cheat with a separate workgang.

Otherwise this looking reasonable.

Thanks for doing this!

/Robbin

>
> Roman
>
> Am 30.05.2017 um 03:59 schrieb David Holmes:
>> Hi Roman,
>>
>> On 29/05/2017 6:23 PM, Roman Kennke wrote:
>>> Hi David,
>>>
>>> the patch applied cleanly, i.e. was not affected by any hotspot changes.
>>> Most of your suggestions applied to the GC part that I excluded from the
>>> scope of this change, except for the following:
>>>
>>>>> I introduced a dedicated worker thread pool in SafepointSynchronize.
>>>>> This is only initialized when -XX:+ParallelSafepointCleanup is
>>>>> enabled,
>>>>> and uses -XX:ParallelSafepointCleanupThreads=X threads, defaulting
>>>>> to 8
>>>>> threads (just a wild guess, open for discussion. With
>>>>> -XX:-ParallelSafepointCleanup turned off (the default) it will use the
>>>>> old serial safepoint cleanup processing (with optional GC hooks, see
>>>>> below).
>>>>
>>>> I think the default needs to be ergonomically set, as for other
>>>> "thread pools". Both flags should be experimental at this stage.
>>>
>>> I implemented this using the following:
>>>
>>>     if (ParallelSafepointCleanup &&
>>> FLAG_IS_DEFAULT(ParallelSafepointCleanupThreads)) {
>>>       // This will pick up ParallelGCThreads if set...
>>>       FLAG_SET_DEFAULT(ParallelSafepointCleanupThreads,
>>> Abstract_VM_Version::parallel_worker_threads());
>>>     }
>>>
>>> As noted, it will pick up ParallelGCThreads, which might be a reasonable
>>> number. Otherwise we'd need to re-work the logic in vm_version.cpp to
>>> calculate the number of worker threads without considering
>>> ParallelGCThreads.
>>
>> I think this is way too big a number of threads to consider for this
>> situation. GC has a huge amount of work to do, but the safepoint
>> cleanup should be much less work.
>>
>> I'm more inclined now to set a small hard-wired default initially -
>> say 4 - and for future work look at generalising the existing fixed GC
>> thread pool into a more general purpose, dynamic thread pool
>> (something similar to java.util.concurrent.ThreadPoolExecutor.) I note
>> others object to a dedicated thread pool for this, so generalising the
>> existing one seems to be a reasonable future direction, whilst
>> avoiding tying this to specific GCs.
>>
>> Thanks,
>> David
>> -----
>>
>>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.02/
>>> <http://cr.openjdk.java.net/%7Erkennke/8180932/webrev.02/>
>>>
>>> What do you think?
>>>
>>> Roman
>>>
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: Parallelize safepoint cleanup

Roman Kennke-6
Am 31.05.2017 um 22:06 schrieb Robbin Ehn:

> Hi Roman, I agree that is really needed but:
>
> On 05/31/2017 10:27 AM, Roman Kennke wrote:
>> I realized that sharing workers with GC is not so easy.
>>
>> We need to be able to use the workers at a safepoint during concurrent
>> GC work (which also uses the same workers). This does not only require
>> that those workers be suspended, like e.g.
>> SuspendibleThreadSet::yield(), but they need to be idle, i.e. have
>> finished their tasks. This needs some careful handling to work without
>> races: it requires a SuspendibleThreadSetJoiner around the corresponding
>> run_task() call and also the tasks themselves need to join the STS and
>> handle requests for safepoints not by yielding, but by leaving the task.
>> This is far too peculiar for me to make the call to hook up GC workers
>> for safepoint cleanup, and I thus removed those parts. I left the API in
>> CollectedHeap in place. I think GC devs who know better about G1 and CMS
>> should make that call, or else just use a separate thread pool.
>>
>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.05/
>> <http://cr.openjdk.java.net/%7Erkennke/8180932/webrev.05/>
>>
>> Is it ok now?
>
> I still think you should put the "Parallel Safepoint Cleanup" workers
> inside Shenandoah,
> so the SafepointSynchronizer only calls get_safepoint_workers, e.g.:
>
> _cleanup_workers = heap->get_safepoint_workers();
> _num_cleanup_workers = _cleanup_workers != NULL ?
> _cleanup_workers->total_workers() : 1;
> ParallelSPCleanupTask cleanup(_cleanup_subtasks);
> StrongRootsScope srs(_num_cleanup_workers);
> if (_cleanup_workers != NULL) {
>   _cleanup_workers->run_task(&cleanup, _num_cleanup_workers);
> } else {
>   cleanup.work(0);
> }
>
> That way you don't even need your new flags, but it will be up to the
> other GCs to make their worker available
> or cheat with a separate workgang.
I can do that, I don't mind. The question is, do we want that?
I wouldn't call it 'cheating with a separate workgang' though. I see
that both G1 and CMS suspend their worker threads at a safepoint. However:
- Do they finish their work, stop, and then restart work after
safepoint? Or are the workers simply calling STS::yield() to suspend and
later resume their work where they left off. If they only call yield()
(or whatever equivalent in CMS), then this is not enough: the workers
need to be truly idle in order to be used by the safepoint cleaners.
- Parallel and serial GC don't have workgangs of their own.

So, as far as I can tell, this means that parallel safepoint cleanup
would only be supported by GCs for which we explicitely implement it,
after having carefully checked if/how workgangs are suspended at
safepoints, or by providing GC-internal thread pools. Do we really want
that?

Roman

123
Loading...