RFR: JDK-8260332: ParallelGC: Cooperative pretouch for oldgen expansion

classic Classic list List threaded Threaded
22 messages Options
12
Reply | Threaded
Open this post in threaded view
|

RFR: JDK-8260332: ParallelGC: Cooperative pretouch for oldgen expansion

Amit Pawar
In case of ParallelGC, oldgen expansion can happen during promotion. Expanding thread will touch the pages and can't request for task execution as this GC thread is already executing a task. The expanding thread holds the lock on "ExpandHeap_lock" to resize the oldgen and other threads may wait for their turn. This is a blocking call.

This patch changes this behavior by adding another constructor in "MutexLocker" class to enable non blocking or try_lock operation. This way one thread will acquire the lock and other threads can join pretouch work. Threads failed to acquire the lock will join pretouch only when task is marked ready by expanding thread.

Following minimum expansion size are seen during expansion.
1. 512KB without largepages and without UseNUMA.
2. 64MB without largepages and with UseNUMA,
3. 2MB (on x86)  with large pages and without UseNUMA,
4. 64MB without large pages and with UseNUMA.

When Oldgen is expanding repeatedly with smaller size then this change wont help. For such cases, resize size should adapt to application demand to make use of this change. For example if application nature triggers 100 expansion with smaller sizes in same GC then it is better to increase the expansion size during each resize to reduce the number of resizes. If this patch is accepted then will plan to fix this case in another patch.

Jtreg all test passed.

Please review this change.

-------------

Commit messages:
 - ParallelGC: Cooperative pretouch for oldgen expansion

Changes: https://git.openjdk.java.net/jdk/pull/2976/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=2976&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8260332
  Stats: 185 lines in 9 files changed: 158 ins; 0 del; 27 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2976.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2976/head:pull/2976

PR: https://git.openjdk.java.net/jdk/pull/2976
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8260332: ParallelGC: Cooperative pretouch for oldgen expansion

David Holmes
Hi Amit,

On 13/03/2021 6:01 am, Amit Pawar wrote:
> In case of ParallelGC, oldgen expansion can happen during promotion. Expanding thread will touch the pages and can't request for task execution as this GC thread is already executing a task. The expanding thread holds the lock on "ExpandHeap_lock" to resize the oldgen and other threads may wait for their turn. This is a blocking call.
>
> This patch changes this behavior by adding another constructor in "MutexLocker" class to enable non blocking or try_lock operation. This way one thread will acquire the lock and other threads can join pretouch work. Threads failed to acquire the lock will join pretouch only when task is marked ready by expanding thread.

Sorry but I do not like the change to MutexLocker - a tryLock is only
maybe locking the mutex and the fact you have to return a value to
indicate whether it managed to lock or not, strikes me as a bad use of a
RAII style of object. I think explicit lock/try_lock/unlock would be
much clearer here.

Thanks,
David

>
> Following minimum expansion size are seen during expansion.
> 1. 512KB without largepages and without UseNUMA.
> 2. 64MB without largepages and with UseNUMA,
> 3. 2MB (on x86)  with large pages and without UseNUMA,
> 4. 64MB without large pages and with UseNUMA.
>
> When Oldgen is expanding repeatedly with smaller size then this change wont help. For such cases, resize size should adapt to application demand to make use of this change. For example if application nature triggers 100 expansion with smaller sizes in same GC then it is better to increase the expansion size during each resize to reduce the number of resizes. If this patch is accepted then will plan to fix this case in another patch.
>
> Jtreg all test passed.
>
> Please review this change.
>
> -------------
>
> Commit messages:
>   - ParallelGC: Cooperative pretouch for oldgen expansion
>
> Changes: https://git.openjdk.java.net/jdk/pull/2976/files
>   Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=2976&range=00
>    Issue: https://bugs.openjdk.java.net/browse/JDK-8260332
>    Stats: 185 lines in 9 files changed: 158 ins; 0 del; 27 mod
>    Patch: https://git.openjdk.java.net/jdk/pull/2976.diff
>    Fetch: git fetch https://git.openjdk.java.net/jdk pull/2976/head:pull/2976
>
> PR: https://git.openjdk.java.net/jdk/pull/2976
>
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8260332: ParallelGC: Cooperative pretouch for oldgen expansion [v2]

Amit Pawar
In reply to this post by Amit Pawar
> In case of ParallelGC, oldgen expansion can happen during promotion. Expanding thread will touch the pages and can't request for task execution as this GC thread is already executing a task. The expanding thread holds the lock on "ExpandHeap_lock" to resize the oldgen and other threads may wait for their turn. This is a blocking call.
>
> This patch changes this behavior by adding another constructor in "MutexLocker" class to enable non blocking or try_lock operation. This way one thread will acquire the lock and other threads can join pretouch work. Threads failed to acquire the lock will join pretouch only when task is marked ready by expanding thread.
>
> Following minimum expansion size are seen during expansion.
> 1. 512KB without largepages and without UseNUMA.
> 2. 64MB without largepages and with UseNUMA,
> 3. 2MB (on x86)  with large pages and without UseNUMA,
> 4. 64MB without large pages and with UseNUMA.
>
> When Oldgen is expanding repeatedly with smaller size then this change wont help. For such cases, resize size should adapt to application demand to make use of this change. For example if application nature triggers 100 expansion with smaller sizes in same GC then it is better to increase the expansion size during each resize to reduce the number of resizes. If this patch is accepted then will plan to fix this case in another patch.
>
> Jtreg all test passed.
>
> Please review this change.

Amit Pawar has updated the pull request incrementally with one additional commit since the last revision:

  Fixed build issues for some targets and updated with suggested changes.

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/2976/files
  - new: https://git.openjdk.java.net/jdk/pull/2976/files/831eab9b..01e5077e

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2976&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2976&range=00-01

  Stats: 22 lines in 5 files changed: 2 ins; 17 del; 3 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2976.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2976/head:pull/2976

PR: https://git.openjdk.java.net/jdk/pull/2976
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8260332: ParallelGC: Cooperative pretouch for oldgen expansion

Amit Pawar
In reply to this post by Amit Pawar
On Fri, 12 Mar 2021 19:56:23 GMT, Amit Pawar <[hidden email]> wrote:

> In case of ParallelGC, oldgen expansion can happen during promotion. Expanding thread will touch the pages and can't request for task execution as this GC thread is already executing a task. The expanding thread holds the lock on "ExpandHeap_lock" to resize the oldgen and other threads may wait for their turn. This is a blocking call.
>
> This patch changes this behavior by adding another constructor in "MutexLocker" class to enable non blocking or try_lock operation. This way one thread will acquire the lock and other threads can join pretouch work. Threads failed to acquire the lock will join pretouch only when task is marked ready by expanding thread.
>
> Following minimum expansion size are seen during expansion.
> 1. 512KB without largepages and without UseNUMA.
> 2. 64MB without largepages and with UseNUMA,
> 3. 2MB (on x86)  with large pages and without UseNUMA,
> 4. 64MB without large pages and with UseNUMA.
>
> When Oldgen is expanding repeatedly with smaller size then this change wont help. For such cases, resize size should adapt to application demand to make use of this change. For example if application nature triggers 100 expansion with smaller sizes in same GC then it is better to increase the expansion size during each resize to reduce the number of resizes. If this patch is accepted then will plan to fix this case in another patch.
>
> Jtreg all test passed.
>
> Please review this change.

Thank you for your feedback. I have updated as per your suggestion and fixed build issues on some targets. Please check.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2976
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8260332: ParallelGC: Cooperative pretouch for oldgen expansion [v2]

David Holmes-2
In reply to this post by Amit Pawar
On Sun, 14 Mar 2021 13:27:26 GMT, Amit Pawar <[hidden email]> wrote:

>> In case of ParallelGC, oldgen expansion can happen during promotion. Expanding thread will touch the pages and can't request for task execution as this GC thread is already executing a task. The expanding thread holds the lock on "ExpandHeap_lock" to resize the oldgen and other threads may wait for their turn. This is a blocking call.
>>
>> This patch changes this behavior by adding another constructor in "MutexLocker" class to enable non blocking or try_lock operation. This way one thread will acquire the lock and other threads can join pretouch work. Threads failed to acquire the lock will join pretouch only when task is marked ready by expanding thread.
>>
>> Following minimum expansion size are seen during expansion.
>> 1. 512KB without largepages and without UseNUMA.
>> 2. 64MB without largepages and with UseNUMA,
>> 3. 2MB (on x86)  with large pages and without UseNUMA,
>> 4. 64MB without large pages and with UseNUMA.
>>
>> When Oldgen is expanding repeatedly with smaller size then this change wont help. For such cases, resize size should adapt to application demand to make use of this change. For example if application nature triggers 100 expansion with smaller sizes in same GC then it is better to increase the expansion size during each resize to reduce the number of resizes. If this patch is accepted then will plan to fix this case in another patch.
>>
>> Jtreg all test passed.
>>
>> Please review this change.
>
> Amit Pawar has updated the pull request incrementally with one additional commit since the last revision:
>
>   Fixed build issues for some targets and updated with suggested changes.

src/hotspot/share/gc/shared/pretouchTask.cpp line 66:

> 64:
> 65:   // Following atomic loads are required to make other processor store
> 66:   // visible to all threads from this points.

That is not what an Atomic::load does; they only protect against word-tearing (which is only a theoretical possibility on some platforms). If you want to guarantee those loads see the most recent store then the fence needs to come first.

src/hotspot/share/gc/shared/pretouchTask.cpp line 94:

> 92:
> 93:   if (thread_num == 0) {
> 94:     Atomic::release_store(&_cur_addr, _end_addr);

The release is unnecessary given the Atomic::sub.

src/hotspot/share/gc/shared/pretouchTask.hpp line 47:

> 45:
> 46:   void set_task_status(TaskStatus status) { Atomic::release_store(&_task_status, (size_t)status);  }
> 47:   void set_task_done()                    { Atomic::release_store(&_task_status, (size_t)Done);    }

There is a convention that when a setter embodies a release_store that release is included in the name e.g. release_set_task_done().

src/hotspot/share/gc/shared/pretouchTask.hpp line 51:

> 49:   void set_task_notready()                { set_task_status(NotReady); }
> 50:   bool is_task_ready()                    { return Atomic::load(&_task_status) == Ready; }
> 51:   bool is_task_done()                     { return Atomic::load(&_task_status) == Done;  }

Given these fields are set with a release_store I would expect to see them read with a load_acquire to match with it. And named eg. is_task_done_acquire().

-------------

PR: https://git.openjdk.java.net/jdk/pull/2976
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8260332: ParallelGC: Cooperative pretouch for oldgen expansion [v2]

David Holmes-2
In reply to this post by Amit Pawar
On Sun, 14 Mar 2021 13:27:26 GMT, Amit Pawar <[hidden email]> wrote:

>> In case of ParallelGC, oldgen expansion can happen during promotion. Expanding thread will touch the pages and can't request for task execution as this GC thread is already executing a task. The expanding thread holds the lock on "ExpandHeap_lock" to resize the oldgen and other threads may wait for their turn. This is a blocking call.
>>
>> This patch changes this behavior by adding another constructor in "MutexLocker" class to enable non blocking or try_lock operation. This way one thread will acquire the lock and other threads can join pretouch work. Threads failed to acquire the lock will join pretouch only when task is marked ready by expanding thread.
>>
>> Following minimum expansion size are seen during expansion.
>> 1. 512KB without largepages and without UseNUMA.
>> 2. 64MB without largepages and with UseNUMA,
>> 3. 2MB (on x86)  with large pages and without UseNUMA,
>> 4. 64MB without large pages and with UseNUMA.
>>
>> When Oldgen is expanding repeatedly with smaller size then this change wont help. For such cases, resize size should adapt to application demand to make use of this change. For example if application nature triggers 100 expansion with smaller sizes in same GC then it is better to increase the expansion size during each resize to reduce the number of resizes. If this patch is accepted then will plan to fix this case in another patch.
>>
>> Jtreg all test passed.
>>
>> Please review this change.
>
> Amit Pawar has updated the pull request incrementally with one additional commit since the last revision:
>
>   Fixed build issues for some targets and updated with suggested changes.

src/hotspot/share/gc/shared/pretouchTask.cpp line 96:

> 94:     Atomic::release_store(&_cur_addr, _end_addr);
> 95:     OrderAccess::storestore();
> 96:     set_task_done();

The storestore barrier is not needed as the set_task_done() is a release_store which already has a storestore barrier.

src/hotspot/share/gc/shared/pretouchTask.cpp line 56:

> 54: void PretouchTask::reinitialize(char* start_addr, char* end_addr) {
> 55:   Atomic::release_store(&_cur_addr, start_addr);
> 56:   Atomic::release_store(&_end_addr, end_addr);

Back-to-back release-stores have redundant loadstore semantics. You could just use a storestore() barrier after the first release_store, then use plain Atomic::store for the second field.

src/hotspot/share/gc/parallel/mutableSpace.cpp line 149:

> 147:         // waiting to expand old-gen will join from PSOldGen::expand_for_allocate
> 148:         // function for pretouch work.
> 149:         pretouch_task->set_task_ready();

The storestore is not needed as it is included in the release_store of set_task_ready().

-------------

PR: https://git.openjdk.java.net/jdk/pull/2976
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8260332: ParallelGC: Cooperative pretouch for oldgen expansion

Kim Barrett
In reply to this post by Amit Pawar
> On Mar 12, 2021, at 3:01 PM, Amit Pawar <[hidden email]> wrote:
>
> In case of ParallelGC, oldgen expansion can happen during promotion. Expanding thread will touch the pages and can't request for task execution as this GC thread is already executing a task. The expanding thread holds the lock on "ExpandHeap_lock" to resize the oldgen and other threads may wait for their turn. This is a blocking call.
>
> This patch changes this behavior by adding another constructor in "MutexLocker" class to enable non blocking or try_lock operation. This way one thread will acquire the lock and other threads can join pretouch work. Threads failed to acquire the lock will join pretouch only when task is marked ready by expanding thread.
>
> Following minimum expansion size are seen during expansion.
> 1. 512KB without largepages and without UseNUMA.
> 2. 64MB without largepages and with UseNUMA,
> 3. 2MB (on x86)  with large pages and without UseNUMA,
> 4. 64MB without large pages and with UseNUMA.
>
> When Oldgen is expanding repeatedly with smaller size then this change wont help. For such cases, resize size should adapt to application demand to make use of this change. For example if application nature triggers 100 expansion with smaller sizes in same GC then it is better to increase the expansion size during each resize to reduce the number of resizes. If this patch is accepted then will plan to fix this case in another patch.

Sorry, but a change like this needs better motivation. What you say
above suggests this change doesn't actually help.

It's intentional that oldgen expansions aren't generally large, as the
oldgen shouldn't be grown unnecessarily. There are already parameters
such as MinHeapDeltaBytes to control and manipulate this.

It is also preferable to complete an expansion request quickly to make
the additional space available to other threads in the main allocation
path, rather than making them go to the expand path. Making expansions
larger could force more threads to take the slower expand path, which
doesn't seem like a win even if they then help with the pretouch part
of another thread's expansion. (And that also assumes UsePreTouch is
even enabled.)

So the followup change that you say is needed to make this one
profitable seems questionable.

The proposed change is also surprisingly large and intrusive for
something that seems like it should be very localized.

> Jtreg all test passed.

A change like this needs a lot more testing than that, both functionally
and performance.

Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8260332: ParallelGC: Cooperative pretouch for oldgen expansion [v2]

Kim Barrett-2
In reply to this post by Amit Pawar
On Sun, 14 Mar 2021 13:27:26 GMT, Amit Pawar <[hidden email]> wrote:

>> In case of ParallelGC, oldgen expansion can happen during promotion. Expanding thread will touch the pages and can't request for task execution as this GC thread is already executing a task. The expanding thread holds the lock on "ExpandHeap_lock" to resize the oldgen and other threads may wait for their turn. This is a blocking call.
>>
>> This patch changes this behavior by adding another constructor in "MutexLocker" class to enable non blocking or try_lock operation. This way one thread will acquire the lock and other threads can join pretouch work. Threads failed to acquire the lock will join pretouch only when task is marked ready by expanding thread.
>>
>> Following minimum expansion size are seen during expansion.
>> 1. 512KB without largepages and without UseNUMA.
>> 2. 64MB without largepages and with UseNUMA,
>> 3. 2MB (on x86)  with large pages and without UseNUMA,
>> 4. 64MB without large pages and with UseNUMA.
>>
>> When Oldgen is expanding repeatedly with smaller size then this change wont help. For such cases, resize size should adapt to application demand to make use of this change. For example if application nature triggers 100 expansion with smaller sizes in same GC then it is better to increase the expansion size during each resize to reduce the number of resizes. If this patch is accepted then will plan to fix this case in another patch.
>>
>> Jtreg all test passed.
>>
>> Please review this change.
>
> Amit Pawar has updated the pull request incrementally with one additional commit since the last revision:
>
>   Fixed build issues for some targets and updated with suggested changes.

Changes requested by kbarrett (Reviewer).

src/hotspot/share/gc/parallel/psOldGen.cpp line 226:

> 224:         while (!pretouch()->is_task_done()) {
> 225:           if (pretouch()->is_task_ready()) {
> 226:             pretouch()->work(Thread::current()->osthread()->thread_id());

worker thread_id and os thread_id are entirely different things.  This is another indication that reusing (abusing) PretouchTask in this way is a mistake.

src/hotspot/share/gc/parallel/psOldGen.hpp line 49:

> 47:   PSGenerationCounters*    _gen_counters;
> 48:   SpaceCounters*           _space_counters;
> 49:   PretouchTask*            _pretouch;          // Used when old gen resized during scavenging.

I think abusing PretouchTask in this way, completely outside the workgang protocol, is confusing and shouldn't be done.  There might be some code that could be shared between this use and PretouchTask, but if so then it should be factored out for such sharing, rather than mangling PretouchTask in the way being proposed.

src/hotspot/share/gc/shared/pretouchTask.cpp line 68:

> 66:   // visible to all threads from this points.
> 67:   char *cur_addr = Atomic::load(&_cur_addr);
> 68:   char *end_addr = Atomic::load(&_end_addr);

end needs to be read before cur.  Otherwise, cur could be from one pretouch instance and end could be from a later, unrelated, pretouch instance, leading to scribbling.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2976
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8260332: ParallelGC: Cooperative pretouch for oldgen expansion [v2]

Kim Barrett-2
In reply to this post by David Holmes-2
On Sun, 14 Mar 2021 21:29:59 GMT, David Holmes <[hidden email]> wrote:

>> Amit Pawar has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Fixed build issues for some targets and updated with suggested changes.
>
> src/hotspot/share/gc/shared/pretouchTask.cpp line 94:
>
>> 92:
>> 93:   if (thread_num == 0) {
>> 94:     Atomic::release_store(&_cur_addr, _end_addr);
>
> The release is unnecessary given the Atomic::sub.

Besides what David said, I think this is also confusing.  I think it's to account for the claim (using fetch_and_add) unconditionally adding the chunk size to cur_addr, which could lead to overshoot.  I think it would be clearer to prevent the overshoot from occurring.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2976
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8260332: ParallelGC: Cooperative pretouch for oldgen expansion [v2]

Amit Pawar
In reply to this post by David Holmes-2
On Sun, 14 Mar 2021 21:27:50 GMT, David Holmes <[hidden email]> wrote:

>> Amit Pawar has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Fixed build issues for some targets and updated with suggested changes.
>
> src/hotspot/share/gc/shared/pretouchTask.cpp line 66:
>
>> 64:
>> 65:   // Following atomic loads are required to make other processor store
>> 66:   // visible to all threads from this points.
>
> That is not what an Atomic::load does; they only protect against word-tearing (which is only a theoretical possibility on some platforms). If you want to guarantee those loads see the most recent store then the fence needs to come first.

Thank you David for your reply and will do as suggested.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2976
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8260332: ParallelGC: Cooperative pretouch for oldgen expansion [v2]

Amit Pawar
In reply to this post by Kim Barrett-2
On Mon, 15 Mar 2021 02:07:27 GMT, Kim Barrett <[hidden email]> wrote:

>> src/hotspot/share/gc/shared/pretouchTask.cpp line 94:
>>
>>> 92:
>>> 93:   if (thread_num == 0) {
>>> 94:     Atomic::release_store(&_cur_addr, _end_addr);
>>
>> The release is unnecessary given the Atomic::sub.
>
> Besides what David said, I think this is also confusing.  I think it's to account for the claim (using fetch_and_add) unconditionally adding the chunk size to cur_addr, which could lead to overshoot.  I think it would be clearer to prevent the overshoot from occurring.

David, will update as suggested.

Kim, thanks for your feedback. Updating **_cur_addr** was not needed here but thought to keep the markers in clean state. Again, **_cur_addr** & **_end_addr** gets updated before the start of next pretouch resize so will remove this. If something breaks then will fix as suggested. I hope this should be OK.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2976
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8260332: ParallelGC: Cooperative pretouch for oldgen expansion [v2]

Amit Pawar
In reply to this post by David Holmes-2
On Sun, 14 Mar 2021 21:31:32 GMT, David Holmes <[hidden email]> wrote:

>> Amit Pawar has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Fixed build issues for some targets and updated with suggested changes.
>
> src/hotspot/share/gc/shared/pretouchTask.hpp line 47:
>
>> 45:
>> 46:   void set_task_status(TaskStatus status) { Atomic::release_store(&_task_status, (size_t)status);  }
>> 47:   void set_task_done()                    { Atomic::release_store(&_task_status, (size_t)Done);    }
>
> There is a convention that when a setter embodies a release_store that release is included in the name e.g. release_set_task_done().

OK, will change.

> src/hotspot/share/gc/shared/pretouchTask.hpp line 51:
>
>> 49:   void set_task_notready()                { set_task_status(NotReady); }
>> 50:   bool is_task_ready()                    { return Atomic::load(&_task_status) == Ready; }
>> 51:   bool is_task_done()                     { return Atomic::load(&_task_status) == Done;  }
>
> Given these fields are set with a release_store I would expect to see them read with a load_acquire to match with it. And named eg. is_task_done_acquire().

OK, will change.

> src/hotspot/share/gc/shared/pretouchTask.cpp line 96:
>
>> 94:     Atomic::release_store(&_cur_addr, _end_addr);
>> 95:     OrderAccess::storestore();
>> 96:     set_task_done();
>
> The storestore barrier is not needed as the set_task_done() is a release_store which already has a storestore barrier.

OK.

> src/hotspot/share/gc/shared/pretouchTask.cpp line 56:
>
>> 54: void PretouchTask::reinitialize(char* start_addr, char* end_addr) {
>> 55:   Atomic::release_store(&_cur_addr, start_addr);
>> 56:   Atomic::release_store(&_end_addr, end_addr);
>
> Back-to-back release-stores have redundant loadstore semantics. You could just use a storestore() barrier after the first release_store, then use plain Atomic::store for the second field.

OK, will change as per your suggestion.

> src/hotspot/share/gc/parallel/mutableSpace.cpp line 149:
>
>> 147:         // waiting to expand old-gen will join from PSOldGen::expand_for_allocate
>> 148:         // function for pretouch work.
>> 149:         pretouch_task->set_task_ready();
>
> The storestore is not needed as it is included in the release_store of set_task_ready().

OK.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2976
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8260332: ParallelGC: Cooperative pretouch for oldgen expansion [v2]

Amit Pawar
In reply to this post by Kim Barrett-2
On Mon, 15 Mar 2021 01:43:38 GMT, Kim Barrett <[hidden email]> wrote:

>> Amit Pawar has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Fixed build issues for some targets and updated with suggested changes.
>
> src/hotspot/share/gc/parallel/psOldGen.cpp line 226:
>
>> 224:         while (!pretouch()->is_task_done()) {
>> 225:           if (pretouch()->is_task_ready()) {
>> 226:             pretouch()->work(Thread::current()->osthread()->thread_id());
>
> worker thread_id and os thread_id are entirely different things.  This is another indication that reusing (abusing) PretouchTask in this way is a mistake.

Is following change OK ?
pretouch()->work(static_cast<AbstractGangWorker*>(Thread::current())->id());

otherwise please suggest.

> src/hotspot/share/gc/shared/pretouchTask.cpp line 68:
>
>> 66:   // visible to all threads from this points.
>> 67:   char *cur_addr = Atomic::load(&_cur_addr);
>> 68:   char *end_addr = Atomic::load(&_end_addr);
>
> end needs to be read before cur.  Otherwise, cur could be from one pretouch instance and end could be from a later, unrelated, pretouch instance, leading to scribbling.

didn't realize this. will change per your suggestion.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2976
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8260332: ParallelGC: Cooperative pretouch for oldgen expansion [v2]

Amit Pawar
In reply to this post by Kim Barrett-2
On Mon, 15 Mar 2021 01:46:08 GMT, Kim Barrett <[hidden email]> wrote:

>> Amit Pawar has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Fixed build issues for some targets and updated with suggested changes.
>
> src/hotspot/share/gc/parallel/psOldGen.hpp line 49:
>
>> 47:   PSGenerationCounters*    _gen_counters;
>> 48:   SpaceCounters*           _space_counters;
>> 49:   PretouchTask*            _pretouch;          // Used when old gen resized during scavenging.
>
> I think abusing PretouchTask in this way, completely outside the workgang protocol, is confusing and shouldn't be done.  There might be some code that could be shared between this use and PretouchTask, but if so then it should be factored out for such sharing, rather than mangling PretouchTask in the way being proposed.

Thanks for your suggestion. It will be better if you can give some direction as G1GC also need similar fix during the expansion. I think G1 has array of regions to be touched during the expansion and that task needs to be shared across the threads. pretouch class can be put inside another class and that can be shared across the threads to synchronize the pretouch task and not sure whether that will be OK again. I explored this approach after your suggestion and thought to update after the feedback during the review process. Please suggest.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2976
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8260332: ParallelGC: Cooperative pretouch for oldgen expansion [v2]

Amit Pawar
In reply to this post by Kim Barrett-2
On Mon, 15 Mar 2021 02:13:46 GMT, Kim Barrett <[hidden email]> wrote:

>> Amit Pawar has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Fixed build issues for some targets and updated with suggested changes.
>
> Changes requested by kbarrett (Reviewer).

>
>
> _Mailing list message from [Kim Barrett](mailto:[hidden email]) on [hotspot-dev](mailto:[hidden email]):_
>
> > On Mar 12, 2021, at 3:01 PM, Amit Pawar <github.com+71302734+amitdpawar at openjdk.java.net> wrote:
> > In case of ParallelGC, oldgen expansion can happen during promotion. Expanding thread will touch the pages and can't request for task execution as this GC thread is already executing a task. The expanding thread holds the lock on "ExpandHeap_lock" to resize the oldgen and other threads may wait for their turn. This is a blocking call.
> > This patch changes this behavior by adding another constructor in "MutexLocker" class to enable non blocking or try_lock operation. This way one thread will acquire the lock and other threads can join pretouch work. Threads failed to acquire the lock will join pretouch only when task is marked ready by expanding thread.
> > Following minimum expansion size are seen during expansion.
> > 1. 512KB without largepages and without UseNUMA.
> > 2. 64MB without largepages and with UseNUMA,
> > 3. 2MB (on x86)  with large pages and without UseNUMA,
> > 4. 64MB without large pages and with UseNUMA.
> > When Oldgen is expanding repeatedly with smaller size then this change wont help. For such cases, resize size should adapt to application demand to make use of this change. For example if application nature triggers 100 expansion with smaller sizes in same GC then it is better to increase the expansion size during each resize to reduce the number of resizes. If this patch is accepted then will plan to fix this case in another patch.
>
> Sorry, but a change like this needs better motivation. What you say
> above suggests this change doesn't actually help.
>
> It's intentional that oldgen expansions aren't generally large, as the
> oldgen shouldn't be grown unnecessarily. There are already parameters
> such as MinHeapDeltaBytes to control and manipulate this.
>
> It is also preferable to complete an expansion request quickly to make
> the additional space available to other threads in the main allocation
> path, rather than making them go to the expand path. Making expansions
> larger could force more threads to take the slower expand path, which
> doesn't seem like a win even if they then help with the pretouch part
> of another thread's expansion. (And that also assumes UsePreTouch is
> even enabled.)
>
> So the followup change that you say is needed to make this one
> profitable seems questionable.
>
> The proposed change is also surprisingly large and intrusive for
> something that seems like it should be very localized.
>
> > Jtreg all test passed.
>
> A change like this needs a lot more testing than that, both functionally
> and performance.

[https://bugs.openjdk.java.net/browse/JDK-8254699](JDK-8254699) contains test results in XL file to show PreTouchParallelChunkSize was recently changed from 1GB to 4MB on Linux after testing various sizes. I have downloaded the same XL file and same is updated for Oldgen case during resize and it gives some rough idea about the improvement for this fix and follow up fix. Please check "PretouchOldgenDuringResize" sheet for "Co-operative Fix" and "Adaptive Resize Fix" columns.

[PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx](https://github.com/openjdk/jdk/files/6147180/PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx)

Running SPECJbb composite shows 30-40% reduction in GC pause time when old-gen expands upto ~1GB for UseNUMA case (I tested for minimum oldgen size to trigger the resize). Non UseNUMA case will show improvement only when resize expands more than minimum "PreTouchParallelChunkSize" size to let other thread participate in pretouch work. Two cases 1 & 3 (without UseNUMA and with/without hugpages) above didnt shows any improvement because of "PreTouchParallelChunkSize" limit and that is the reason why I suggested another fix.

The "Adaptive Resize Fix" column in the sheet is for next suggested fix and may possibly help to improve further. For server JVM, expansion size of 512KB,  2MB (hugepages) and 64MB looks good for first resize but later needs some attention I think. JVM flag "MinHeapDeltaBytes" needs to be known by the user and need to set it upfront. I think this can be consider for first resize in every GC and later dynamically go for higher size like double the previous size to adopt to application nature. This way it may help to reduce the GC pause time during the expansion. I thought to share my observation and my understanding could be wrong. So please check and suggest.

Again, thanks for your feedback.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2976
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8260332: ParallelGC: Cooperative pretouch for oldgen expansion [v2]

Kim Barrett-2
On Tue, 16 Mar 2021 08:32:09 GMT, Amit Pawar <[hidden email]> wrote:

> [https://bugs.openjdk.java.net/browse/JDK-8254699](JDK-8254699) contains test results in XL file to show PreTouchParallelChunkSize was recently changed from 1GB to 4MB on Linux after testing various sizes. I have downloaded the same XL file and same is updated for Oldgen case during resize and it gives some rough idea about the improvement for this fix and follow up fix. Please check "PretouchOldgenDuringResize" sheet for "Co-operative Fix" and "Adaptive Resize Fix" columns.
>
> [PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx](https://github.com/openjdk/jdk/files/6147180/PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx)

I'm not sure how to interpret this. In particular, it says "Test done using November 27rd Openjdk build". That predates a number of recent changes in this area that seem relevant. Is that correct? Or is that stale and the data has been updated for a newer baseline.

Also, can you provide details about how this information is generated and collected?  I would like to reproduce the results and do some other experiments.

I think that, as proposed, the change is making already messy code still more messy, and some refactoring is needed. I don't know what the details of that might look like yet; I'm still exploring and poking at the code. I also need to look at the corresponding part of G1 that you mentioned.

> The "Adaptive Resize Fix" column in the sheet is for next suggested fix and may possibly help to improve further. For server JVM, expansion size of 512KB, 2MB (hugepages) and 64MB looks good for first resize but later needs some attention I think. JVM flag "MinHeapDeltaBytes" needs to be known by the user and need to set it upfront. I think this can be consider for first resize in every GC and later dynamically go for higher size like double the previous size to adopt to application nature. This way it may help to reduce the GC pause time during the expansion. I thought to share my observation and my understanding could be wrong. So please check and suggest.

I think something like this has a serious risk of growing the oldgen a lot more than needed, which may have serious downsides.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2976
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8260332: ParallelGC: Cooperative pretouch for oldgen expansion [v2]

Amit Pawar
On Mon, 22 Mar 2021 04:39:48 GMT, Kim Barrett <[hidden email]> wrote:

>>>
>>>
>>> _Mailing list message from [Kim Barrett](mailto:[hidden email]) on [hotspot-dev](mailto:[hidden email]):_
>>>
>>> > On Mar 12, 2021, at 3:01 PM, Amit Pawar <github.com+71302734+amitdpawar at openjdk.java.net> wrote:
>>> > In case of ParallelGC, oldgen expansion can happen during promotion. Expanding thread will touch the pages and can't request for task execution as this GC thread is already executing a task. The expanding thread holds the lock on "ExpandHeap_lock" to resize the oldgen and other threads may wait for their turn. This is a blocking call.
>>> > This patch changes this behavior by adding another constructor in "MutexLocker" class to enable non blocking or try_lock operation. This way one thread will acquire the lock and other threads can join pretouch work. Threads failed to acquire the lock will join pretouch only when task is marked ready by expanding thread.
>>> > Following minimum expansion size are seen during expansion.
>>> > 1. 512KB without largepages and without UseNUMA.
>>> > 2. 64MB without largepages and with UseNUMA,
>>> > 3. 2MB (on x86)  with large pages and without UseNUMA,
>>> > 4. 64MB without large pages and with UseNUMA.
>>> > When Oldgen is expanding repeatedly with smaller size then this change wont help. For such cases, resize size should adapt to application demand to make use of this change. For example if application nature triggers 100 expansion with smaller sizes in same GC then it is better to increase the expansion size during each resize to reduce the number of resizes. If this patch is accepted then will plan to fix this case in another patch.
>>>
>>> Sorry, but a change like this needs better motivation. What you say
>>> above suggests this change doesn't actually help.
>>>
>>> It's intentional that oldgen expansions aren't generally large, as the
>>> oldgen shouldn't be grown unnecessarily. There are already parameters
>>> such as MinHeapDeltaBytes to control and manipulate this.
>>>
>>> It is also preferable to complete an expansion request quickly to make
>>> the additional space available to other threads in the main allocation
>>> path, rather than making them go to the expand path. Making expansions
>>> larger could force more threads to take the slower expand path, which
>>> doesn't seem like a win even if they then help with the pretouch part
>>> of another thread's expansion. (And that also assumes UsePreTouch is
>>> even enabled.)
>>>
>>> So the followup change that you say is needed to make this one
>>> profitable seems questionable.
>>>
>>> The proposed change is also surprisingly large and intrusive for
>>> something that seems like it should be very localized.
>>>
>>> > Jtreg all test passed.
>>>
>>> A change like this needs a lot more testing than that, both functionally
>>> and performance.
>>
>> [https://bugs.openjdk.java.net/browse/JDK-8254699](JDK-8254699) contains test results in XL file to show PreTouchParallelChunkSize was recently changed from 1GB to 4MB on Linux after testing various sizes. I have downloaded the same XL file and same is updated for Oldgen case during resize and it gives some rough idea about the improvement for this fix and follow up fix. Please check "PretouchOldgenDuringResize" sheet for "Co-operative Fix" and "Adaptive Resize Fix" columns.
>>
>> [PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx](https://github.com/openjdk/jdk/files/6147180/PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx)
>>
>> Running SPECJbb composite shows 30-40% reduction in GC pause time when old-gen expands upto ~1GB for UseNUMA case (I tested for minimum oldgen size to trigger the resize). Non UseNUMA case will show improvement only when resize expands more than minimum "PreTouchParallelChunkSize" size to let other thread participate in pretouch work. Two cases 1 & 3 (without UseNUMA and with/without hugpages) above didnt shows any improvement because of "PreTouchParallelChunkSize" limit and that is the reason why I suggested another fix.
>>
>> The "Adaptive Resize Fix" column in the sheet is for next suggested fix and may possibly help to improve further. For server JVM, expansion size of 512KB,  2MB (hugepages) and 64MB looks good for first resize but later needs some attention I think. JVM flag "MinHeapDeltaBytes" needs to be known by the user and need to set it upfront. I think this can be consider for first resize in every GC and later dynamically go for higher size like double the previous size to adopt to application nature. This way it may help to reduce the GC pause time during the expansion. I thought to share my observation and my understanding could be wrong. So please check and suggest.
>>
>> Again, thanks for your feedback.
>
>> [https://bugs.openjdk.java.net/browse/JDK-8254699](JDK-8254699) contains test results in XL file to show PreTouchParallelChunkSize was recently changed from 1GB to 4MB on Linux after testing various sizes. I have downloaded the same XL file and same is updated for Oldgen case during resize and it gives some rough idea about the improvement for this fix and follow up fix. Please check "PretouchOldgenDuringResize" sheet for "Co-operative Fix" and "Adaptive Resize Fix" columns.
>>
>> [PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx](https://github.com/openjdk/jdk/files/6147180/PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx)
>
> I'm not sure how to interpret this. In particular, it says "Test done using November 27rd Openjdk build". That predates a number of recent changes in this area that seem relevant. Is that correct? Or is that stale and the data has been updated for a newer baseline.
>
> Also, can you provide details about how this information is generated and collected?  I would like to reproduce the results and do some other experiments.
>
> I think that, as proposed, the change is making already messy code still more messy, and some refactoring is needed. I don't know what the details of that might look like yet; I'm still exploring and poking at the code. I also need to look at the corresponding part of G1 that you mentioned.
>
>> The "Adaptive Resize Fix" column in the sheet is for next suggested fix and may possibly help to improve further. For server JVM, expansion size of 512KB, 2MB (hugepages) and 64MB looks good for first resize but later needs some attention I think. JVM flag "MinHeapDeltaBytes" needs to be known by the user and need to set it upfront. I think this can be consider for first resize in every GC and later dynamically go for higher size like double the previous size to adopt to application nature. This way it may help to reduce the GC pause time during the expansion. I thought to share my observation and my understanding could be wrong. So please check and suggest.
>
> I think something like this has a serious risk of growing the oldgen a lot more than needed, which may have serious downsides.

Thank you Kim for your reply and please see my inline comments.
>
>
> > [https://bugs.openjdk.java.net/browse/JDK-8254699](JDK-8254699) contains test results in XL file to show PreTouchParallelChunkSize was recently changed from 1GB to 4MB on Linux after testing various sizes. I have downloaded the same XL file and same is updated for Oldgen case during resize and it gives some rough idea about the improvement for this fix and follow up fix. Please check "PretouchOldgenDuringResize" sheet for "Co-operative Fix" and "Adaptive Resize Fix" columns.
> > [PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx](https://github.com/openjdk/jdk/files/6147180/PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx)
>
> I'm not sure how to interpret this. In particular, it says "Test done using November 27rd Openjdk build". That predates a number of recent changes in this area that seem relevant. Is that correct? Or is that stale and the data has been updated for a newer baseline.

Sorry for the confusion. I didnt test again and pre-touch time taken with different chunk size per thread was already recorded in the spreadsheet and thought to use it for reference to reply to David feedback "A change like this needs a lot more testing than that, both functionally and performance.".

>
> Also, can you provide details about how this information is generated and collected? I would like to reproduce the results and do some other experiments.

JDK-8254699 was fixed by me and test was done using SPECJbb composite with following command along with JVM flag "PreTouchParallelChunkSize" with values 1G (earlier default on x86) and 4M (current default on x86). To generate this info,  "Ticks::now()" function was called to record the time taken for pretouch operation. The first table (please ignore resize related) in the sheet "SPECjbb_Summary"  was created from the sheet "SPECjbb_128C256T_GCPretouchTime" that contains pretouch log dumps for SPECjbb composite tests.

Command (also mentioned in SPECjbb_128C256T_GCPretouchTime sheet):
-Xms24g -Xmx970g -Xmn960g -server -XX:+UseParallelGC -XX:+AlwaysPreTouch -XX:+UseLargePages -XX:LargePageSizeInBytes=2m -XX:MaxTenuringThreshold=15 -XX:+ScavengeBeforeFullGC -XX:+UseAdaptiveSizePolicy -XX:ParallelGCThreads=256 -XX:-UseBiasedLocking -XX:SurvivorRatio=200 -XX:TargetSurvivorRatio=95 -XX:+UseNUMA -XX:-UseNUMAInterleaving -XX:+UseTransparentHugePages -XX:-UseAdaptiveNUMAChunkSizing

>
> I think that, as proposed, the change is making already messy code still more messy, and some refactoring is needed. I don't know what the details of that might look like yet; I'm still exploring and poking at the code. I also need to look at the corresponding part of G1 that you mentioned.

First, I thought to push this change for ParallelGC and next G1GC. Earlier ParallelGC pretouch was done with single threaded and later this was fixed by moving multithreaded pretouch common code from G1GC and ParallelGC to PretouchTask class. I thought to do it similarly for this case too.

>
> > The "Adaptive Resize Fix" column in the sheet is for next suggested fix and may possibly help to improve further. For server JVM, expansion size of 512KB, 2MB (hugepages) and 64MB looks good for first resize but later needs some attention I think. JVM flag "MinHeapDeltaBytes" needs to be known by the user and need to set it upfront. I think this can be consider for first resize in every GC and later dynamically go for higher size like double the previous size to adopt to application nature. This way it may help to reduce the GC pause time during the expansion. I thought to share my observation and my understanding could be wrong. So please check and suggest.
>
> I think something like this has a serious risk of growing the oldgen a lot more than needed, which may have serious downsides.

I believe this depends on application runtime nature and it may expand as needed. If you still think that this change is not useful then please suggest. Will withdraw this PR.

Thanks,
Amit

-------------

PR: https://git.openjdk.java.net/jdk/pull/2976
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8260332: ParallelGC: Cooperative pretouch for oldgen expansion [v2]

Kim Barrett-2
On Mon, 22 Mar 2021 17:39:35 GMT, Amit Pawar <[hidden email]> wrote:

> > > [PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx](https://github.com/openjdk/jdk/files/6147180/PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx)
> >
> >
> > I'm not sure how to interpret this. In particular, it says "Test done using November 27rd Openjdk build". That predates a number of recent changes in this area that seem relevant. Is that correct? Or is that stale and the data has been updated for a newer baseline.
>
> Sorry for the confusion. I didnt test again and pre-touch time taken with different chunk size per thread was already recorded in the spreadsheet and thought to use it for reference to reply to David feedback "A change like this needs a lot more testing than that, both functionally and performance.".

Getting any benefit from parallel pretouch here suggests we have many threads piling up on the expand mutex. Before JDK-8260045 (pushed together with JDK-8260044 on 2/12/2021) that would lead to "expansion storms", where several threads piled up on that mutex and serially entered and did a new expansion (with associated pretouch, and possibly of significant size). Getting rid of the expansion storms may alleviate the serial pretouch quite a bit. There may still be some benefit to cooperative parallization, but new measurements are needed to determine how much of a problem still exists.

If it is still worth addressing, I think the proposed approach has problems, as it makes messy complicated code messier and more complicated.  I think some preliminary refactoring is needed.  For example, splitting MutableSpace::initialize into a setup-pages part and set-the-range part.  I've not explored in detail what's needed yet, pending new measurements.  (I haven't had time to do those measurements yet myself.)

-------------

PR: https://git.openjdk.java.net/jdk/pull/2976
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8260332: ParallelGC: Cooperative pretouch for oldgen expansion [v3]

Amit Pawar
In reply to this post by Amit Pawar
> In case of ParallelGC, oldgen expansion can happen during promotion. Expanding thread will touch the pages and can't request for task execution as this GC thread is already executing a task. The expanding thread holds the lock on "ExpandHeap_lock" to resize the oldgen and other threads may wait for their turn. This is a blocking call.
>
> This patch changes this behavior by adding another constructor in "MutexLocker" class to enable non blocking or try_lock operation. This way one thread will acquire the lock and other threads can join pretouch work. Threads failed to acquire the lock will join pretouch only when task is marked ready by expanding thread.
>
> Following minimum expansion size are seen during expansion.
> 1. 512KB without largepages and without UseNUMA.
> 2. 64MB without largepages and with UseNUMA,
> 3. 2MB (on x86)  with large pages and without UseNUMA,
> 4. 64MB without large pages and with UseNUMA.
>
> When Oldgen is expanding repeatedly with smaller size then this change wont help. For such cases, resize size should adapt to application demand to make use of this change. For example if application nature triggers 100 expansion with smaller sizes in same GC then it is better to increase the expansion size during each resize to reduce the number of resizes. If this patch is accepted then will plan to fix this case in another patch.
>
> Jtreg all test passed.
>
> Please review this change.

Amit Pawar has updated the pull request incrementally with one additional commit since the last revision:

  Updated the code as per the suggestion and enabled co-operative pretouch for
  both ParallelGC and G1GC.

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/2976/files
  - new: https://git.openjdk.java.net/jdk/pull/2976/files/01e5077e..f75593aa

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2976&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2976&range=01-02

  Stats: 312 lines in 10 files changed: 168 ins; 105 del; 39 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2976.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2976/head:pull/2976

PR: https://git.openjdk.java.net/jdk/pull/2976
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8260332: ParallelGC: Cooperative pretouch for oldgen expansion [v2]

Amit Pawar
In reply to this post by Kim Barrett-2
On Mon, 29 Mar 2021 09:22:43 GMT, Kim Barrett <[hidden email]> wrote:

>> Thank you Kim for your reply and please see my inline comments.
>>>
>>>
>>> > [https://bugs.openjdk.java.net/browse/JDK-8254699](JDK-8254699) contains test results in XL file to show PreTouchParallelChunkSize was recently changed from 1GB to 4MB on Linux after testing various sizes. I have downloaded the same XL file and same is updated for Oldgen case during resize and it gives some rough idea about the improvement for this fix and follow up fix. Please check "PretouchOldgenDuringResize" sheet for "Co-operative Fix" and "Adaptive Resize Fix" columns.
>>> > [PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx](https://github.com/openjdk/jdk/files/6147180/PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx)
>>>
>>> I'm not sure how to interpret this. In particular, it says "Test done using November 27rd Openjdk build". That predates a number of recent changes in this area that seem relevant. Is that correct? Or is that stale and the data has been updated for a newer baseline.
>>
>> Sorry for the confusion. I didnt test again and pre-touch time taken with different chunk size per thread was already recorded in the spreadsheet and thought to use it for reference to reply to David feedback "A change like this needs a lot more testing than that, both functionally and performance.".
>>
>>>
>>> Also, can you provide details about how this information is generated and collected? I would like to reproduce the results and do some other experiments.
>>
>> JDK-8254699 was fixed by me and test was done using SPECJbb composite with following command along with JVM flag "PreTouchParallelChunkSize" with values 1G (earlier default on x86) and 4M (current default on x86). To generate this info,  "Ticks::now()" function was called to record the time taken for pretouch operation. The first table (please ignore resize related) in the sheet "SPECjbb_Summary"  was created from the sheet "SPECjbb_128C256T_GCPretouchTime" that contains pretouch log dumps for SPECjbb composite tests.
>>
>> Command (also mentioned in SPECjbb_128C256T_GCPretouchTime sheet):
>> -Xms24g -Xmx970g -Xmn960g -server -XX:+UseParallelGC -XX:+AlwaysPreTouch -XX:+UseLargePages -XX:LargePageSizeInBytes=2m -XX:MaxTenuringThreshold=15 -XX:+ScavengeBeforeFullGC -XX:+UseAdaptiveSizePolicy -XX:ParallelGCThreads=256 -XX:-UseBiasedLocking -XX:SurvivorRatio=200 -XX:TargetSurvivorRatio=95 -XX:+UseNUMA -XX:-UseNUMAInterleaving -XX:+UseTransparentHugePages -XX:-UseAdaptiveNUMAChunkSizing
>>
>>>
>>> I think that, as proposed, the change is making already messy code still more messy, and some refactoring is needed. I don't know what the details of that might look like yet; I'm still exploring and poking at the code. I also need to look at the corresponding part of G1 that you mentioned.
>>
>> First, I thought to push this change for ParallelGC and next G1GC. Earlier ParallelGC pretouch was done with single threaded and later this was fixed by moving multithreaded pretouch common code from G1GC and ParallelGC to PretouchTask class. I thought to do it similarly for this case too.
>>
>>>
>>> > The "Adaptive Resize Fix" column in the sheet is for next suggested fix and may possibly help to improve further. For server JVM, expansion size of 512KB, 2MB (hugepages) and 64MB looks good for first resize but later needs some attention I think. JVM flag "MinHeapDeltaBytes" needs to be known by the user and need to set it upfront. I think this can be consider for first resize in every GC and later dynamically go for higher size like double the previous size to adopt to application nature. This way it may help to reduce the GC pause time during the expansion. I thought to share my observation and my understanding could be wrong. So please check and suggest.
>>>
>>> I think something like this has a serious risk of growing the oldgen a lot more than needed, which may have serious downsides.
>>
>> I believe this depends on application runtime nature and it may expand as needed. If you still think that this change is not useful then please suggest. Will withdraw this PR.
>>
>> Thanks,
>> Amit
>
>> > > [PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx](https://github.com/openjdk/jdk/files/6147180/PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx)
>> >
>> >
>> > I'm not sure how to interpret this. In particular, it says "Test done using November 27rd Openjdk build". That predates a number of recent changes in this area that seem relevant. Is that correct? Or is that stale and the data has been updated for a newer baseline.
>>
>> Sorry for the confusion. I didnt test again and pre-touch time taken with different chunk size per thread was already recorded in the spreadsheet and thought to use it for reference to reply to David feedback "A change like this needs a lot more testing than that, both functionally and performance.".
>
> Getting any benefit from parallel pretouch here suggests we have many threads piling up on the expand mutex. Before JDK-8260045 (pushed together with JDK-8260044 on 2/12/2021) that would lead to "expansion storms", where several threads piled up on that mutex and serially entered and did a new expansion (with associated pretouch, and possibly of significant size). Getting rid of the expansion storms may alleviate the serial pretouch quite a bit. There may still be some benefit to cooperative parallization, but new measurements are needed to determine how much of a problem still exists.
>
> If it is still worth addressing, I think the proposed approach has problems, as it makes messy complicated code messier and more complicated.  I think some preliminary refactoring is needed.  For example, splitting MutableSpace::initialize into a setup-pages part and set-the-range part.  I've not explored in detail what's needed yet, pending new measurements.  (I haven't had time to do those measurements yet myself.)

Thanks Kim for your feedback.

I have updated the code and it looks good now. Please use "Full" diff to see all the changes.

Summary of changes: added new class in pretouchTask.hpp file for synchronization. Now co-operative pretouch is enabled for both ParallelGC and G1GC. For testing purpose, added new flag "UseMultithreadedPretouchForOldGen" and timing is enabled for old-gen pretouch case for measurements. Enabling this flag makes try_lock to be used instead of lock (which is current default).

This change will be useful only if PreTouchParallelChunkSize value is not very huge compared to old-gen expansion size. For Linux it is set to 4MB and for other OS it is still 1GB.

Testing SPECJbb composite on AMD EPYC machine shows following improvement.

- Parallel GC: with UseNUMA, minimum expansion size is 64MB
-- with 2MB hugepages, master branch time taken is ~9-10ms and current fix time taken is ~1-2ms`
-- without hugepages, master branch time taken is ~14-17ms and current fix time taken is ~2-3ms

- Parallel GC: without UseNUMA, minimum expansion size is 512KB so this change is not useful for this size. Sometime expansion goes upto 20MB then pretouch takes less time for such large size.

- G1 GC:  with UseNUMA, maximum expansion size is 32MB
-- with 2MB hugepages, master branch time taken is ~5-7ms and current fix time taken is ~1-7ms` (higher time due to single thread activity as there was no other thread waiting to acquire the lock)

- G1 GC:  without UseNUMA, minimum expansion size 64KB and maximum expansion size is 32MB
-- without hugepages, master branch time taken = ~9-11ms and current fix time taken ~2-9ms (same here too)

Please check.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2976
12