[8u] RFR for backport of JDK-8148175: C1: G1 barriers don't preserve FP registers to jdk8u-dev


[8u] RFR for backport of JDK-8148175: C1: G1 barriers don't preserve FP registers to jdk8u-dev

Muthusamy Chinnathambi
Hi,
                 
Please review the backport of bug: "JDK-8148175: C1: G1 barriers don't preserve FP registers" to jdk8u-dev

Please note that this is not a clean backport due to new entries in TEST.groups and copyright changes.

Webrev: http://cr.openjdk.java.net/~mchinnathamb/8148175/webrev.00/
jdk9 bug: https://bugs.openjdk.java.net/browse/JDK-8148175 
Original patch pushed to jdk9: http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/a9334e785873 

Test:  Run jtreg and jprt hotspot testsets.

Regards,
Muthusamy C

Strange G1 behavior

Kirk Pepperdine-2
Hi,

I’ve got a GC log that contains records where the “to-space” is resized to 0 regions. This sizing is maintained, causing overall heap occupancy to spike, which then triggers a Full GC. This is a behavior that I’ve observed in many versions of the JVM, from 7.x to 8.x. I was waiting for a recent build to display it so I could sort out more information before raising the issue. I have a customer (Wikipedia) that can reproduce this behavior and is willing to work with us to help resolve the issue. I’ve attached their GC log.

It all starts with the young gen collection….

2017-10-08T04:19:26.417+0000: 240984.824
…. snip
 (to-space exhausted), 0.2808928 secs].    <—— bad, generally means survivor is too small which is why (Charlie) we still need the age data.
…. SNIP

   [Eden: 0.0B(608.0M)->0.0B(612.0M) Survivors: 4096.0K->0.0B Heap: 11.9G(12.0G)->11.9G(12.0G)].    <—— you are now painted into a corner, which most likely needs a Full GC to get out...
…. SNIP to the next young…
   [Eden: 0.0B(612.0M)->0.0B(612.0M) Survivors: 0.0B->0.0B Heap: 11.9G(12.0G)->11.9G(12.0G)].          <—— now a Full GC is coming, no questions.
….SNIP….
2017-10-08T04:19:26.714+0000: 240985.121: [Full GC                                                                               <—— and here it is….
 [Times: user=52.17 sys=0.12, real=26.26 secs]

After which everything is normal, but the cycle may repeat itself. The condition generally takes several days to show up, so I’ve not seen it reproduced yet in a test environment. Hopefully there is a test case that can run in a reasonable amount of time.

Kind regards,
Kirk Pepperdine





Attachment: wdqs-blazegraph_jvm_gc.pid29188.log.0.zip (3M)

Re: Strange G1 behavior

Kirk Pepperdine-2
Hi,

I should add that Charlie Hunt contributed some invaluable information on our Friends of jClarity list that belongs here. He looked at the adaptive sizing records in the log and concluded that the regions in tenured were not collected because they are over the 85% live mark. According to the AS records this is true, but then one has to ask: how does a Full GC recover so much memory if there are so many tenured regions that are more than 85% full? I have other logs from older JVMs where heap occupancy after each mixed collection is greater than the occupancy after the previous collection, and then you get the Full GC that collects everything. I didn’t spam the list with those logs because they didn’t have print adaptive sizing turned on and they were from older JVMs, meaning the problem may already have been solved. This JVM is a 141 build, so while it behaves somewhat differently, it’s the same pattern: occupancy spikes up until only a Full GC can help, the Full GC runs, and life goes on as if nothing happened.
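For reference, the adaptive sizing records being discussed come from JDK 8 GC logging roughly along these lines (a sketch only; the exact flag set used for this particular log is an assumption):

    java -XX:+UseG1GC \
         -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
         -XX:+PrintAdaptiveSizePolicy \
         -Xloggc:gc.log <application>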

Kind regards,
Kirk


RE: [8u] RFR for backport of JDK-8148175: C1: G1 barriers don't preserve FP registers to jdk8u-dev

Muthusamy Chinnathambi
In reply to this post by Muthusamy Chinnathambi
May I please get a second review for the change.

Regards,
Muthusamy C

-----Original Message-----
From: Vladimir Ivanov
Sent: Wednesday, October 11, 2017 5:29 PM
To: Muthusamy Chinnathambi <[hidden email]>
Cc: [hidden email]; hotspot compiler <[hidden email]>
Subject: Re: [8u] RFR for backport of JDK-8148175: C1: G1 barriers don't preserve FP registers to jdk8u-dev

Looks good.

Best regards,
Vladimir Ivanov


Re: [8u] RFR for backport of JDK-8148175: C1: G1 barriers don't preserve FP registers to jdk8u-dev

Erik Osterlund
Hi Muthusamy,

Looks good. But...

In PreserveFPRegistersTest.java:54
54: long regionSize = 1_000_000; //WB.g1RegionSize();

Was it intentional to hard code the region size to 1 000 000?
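For illustration, a WhiteBox-based version might look roughly like this (a sketch only, assuming the test already runs with the usual -Xbootclasspath/a, -XX:+UnlockDiagnosticVMOptions and -XX:+WhiteBoxAPI setup):

    import sun.hotspot.WhiteBox;

    public class RegionSizeSketch {
        public static void main(String[] args) {
            // Ask the VM for the real G1 region size instead of hard coding it.
            WhiteBox wb = WhiteBox.getWhiteBox();
            long regionSize = wb.g1RegionSize();
            System.out.println("G1 region size: " + regionSize);
        }
    }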

Thanks,
/Erik



Re: [8u] RFR for backport of JDK-8148175: C1: G1 barriers don't preserve FP registers to jdk8u-dev

Vladimir Kozlov
In reply to this post by Muthusamy Chinnathambi
Why do you need to add the test explicitly to the hotspot_compiler group?
It should be included implicitly in the compact1_minimal group like other compiler/ tests, and compact1_minimal should be used in all other testing. Did you check that the test is executed without your TEST.groups modification?

Thanks,
Vladimir K


RE: [8u] RFR for backport of JDK-8148175: C1: G1 barriers don't preserve FP registers to jdk8u-dev

Muthusamy Chinnathambi
Hi Vladimir,

> Why do you need to add test explicitly to hotspot_compiler group?
> It should be included implicitly into compact1_minimal group as other compiler/ tests. And compact1_minimal should be used in all other
> testing.
You are right, it should get picked up implicitly as part of the compact1_minimal group.

> Did you check that the test is executed without you modifying TEST.groups?
Now - yes. Without my TEST.groups modification the test gets executed.

I will drop the change in TEST.groups file.
Please note, this request is only for 8u.

Regards,
Muthusamy C


Re: Strange G1 behavior

Thomas Schatzl
In reply to this post by Kirk Pepperdine-2
Hi Kirk,

On Wed, 2017-10-11 at 19:05 +0200, Kirk Pepperdine wrote:

> Hi,
>
> I should add that Charlie Hunt added some invaluable information in
> our Friend of jclarity list that I should add here. He looked at the
> adaptive sizing records that are in the log and came to the
> conclusion that the regions in tenured were not collected because
> they are over the 85% live mark. This according to the AS records is
> true.. but then one has to ask, how does a full recover so much
> memory if more there are so many tenured regions that are more than
> 85% full. I have other logs from older JVMs where heap occupancy
> after each mixed collection is greater than the previous collection
> occupancy.. and then you get the full that collects everything. I

I suspect... the application is the main cause for the full gc here -
and bad timing.

> didn’t spam the list with those logs because; they didn’t have print
> adaptive sizing turned on and they were from older JVMs meaning the
> problem may have already been solved. This JVM is a 141 build so
> while it behaves some what differently, it’s the same spike up ‘till
> only a full can help, fun the full and life goes on as if nothing
> happened.

If you look at the previous cycles of allocation and reclamation, you
can see that most of them peak at 6G total heap.

There are a few spikes in the heap occupancy graph where during a short
interval more than 6G of memory is in use temporarily. That means,
memory usage spikes a few times in succession, with concurrent marking
running all the time.

Here are the locations of the peaks within the log:

- ~227938.857: peaking at 6.9G, after that young gc a single concurrent
mark cleans out 1G of completely empty memory.
- ~228085.205: peaking at 7G, again a following concurrent mark cleans out 3G of completely empty memory.
- ~240985.106: the 12G peak with the full gc.
- ~242363.281: a 8.6G peak.
- ~246666.233: a 6.5G peak
- ~249968.146: a 6.5G peak
- ~251480.608: a 7.5G peak

In all/most of these cases, either the Cleanup pause or mixed GC can reclaim lots of memory in between, but memory usage spikes again and again, repeating this behavior a few times until things settle again.

All this sprinkled with initial marks caused by humongous allocations.

This leads to my hypothesis that humongous objects are potentially an issue here, in combination with the application simply doing lots of large allocations and holding on to them for a long time.

Let me explain a few (messy) heap usage graphs that show my reasoning:

http://cr.openjdk.java.net/~tschatzl/g1-strange-log/strange-g1-overview.png shows an overview of heap usage after gc of the whole log file, split by GC/GC reason.

You can easily spot the live set spikes in that graph, including the
one that led to full gc.

Another interesting line here is the brownish one with the triangles,
showing heap usage for initial mark gcs caused by humongous
allocations. As you can see there, the application uses a considerable
amount of humongous objects which may be the cause for the large memory
usage - i.e. if they are not ideally sized.

Btw, these spikes seem to be somewhat regular, at least from around the middle of the log. Could it be some "result flushing" occurring roughly every 1500 seconds?

http://cr.openjdk.java.net/~tschatzl/g1-strange-log/strange-g1-example.png shows the heap usage graphs for the spike next to the full gc.

It shows that
- there are occasionally phases of the application where a lot of
memory is in use for like a minute or so (e.g. from second 242300 to
242380).
- during this phase, mixed gc can clean up quite a bit every time (pink
triangles), but apparently the application is also often allocating
faster than g1 can reclaim (which causes the full gc in the other
case).
- a lot of the marking is initiated by humongous allocation (brown
triangles)
- the cleanup pause after marking can often decrease memory usage quite
a bit by itself (green squares), which indicates humongous objects
dying fast (cleanup reclaims all completely empty regions, and
typically these are humongous objects).

I can't see any unusual issue with G1, apart from that it issues two useless young gcs with eden size of zero - compared to the following full gc they are probably negligible ;) (this is https://bugs.openjdk.java.net/browse/JDK-8165150)

So my suggestion is to look at whether something can be done about the
humongous objects here, i.e. look if the waste caused by them is large
or not, and if so, try to decrease that.

There does not seem to be a significant if any fragmentation issue with
humongous objects here, and whether marking or eager reclaim catches
them has the same effect, although eager reclaim is cheaper. (I think I
saw some occurrences of eager reclaim).

Before suggesting potential fixes and optimization, I recommend
verifying this hypothesis ;)

The simplest and least intrusive way (to me) to look at humongous object sizes is to enable -XX:+G1TraceEagerReclaimHumongousObjects (it is an experimental option) to get information about humongous objects at every GC; if you prefer somewhat less frequent output, -XX:+G1PrintRegionLivenessInfo prints at the end of every concurrent marking. Look at the occupancy of the last region of each set of regions and see if there is lots of emptiness in there.
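As a sketch, enabling both on a jdk8 command line might look like this (the unlock flags are assumptions based on the options being experimental/diagnostic in this release):

    java -XX:+UseG1GC \
         -XX:+UnlockExperimentalVMOptions -XX:+G1TraceEagerReclaimHumongousObjects \
         -XX:+UnlockDiagnosticVMOptions -XX:+G1PrintRegionLivenessInfo \
         -Xloggc:gc.log <application>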

You can also do heap dumps with jmap and analyze the arrays with e.g.
MAT, but that is very intrusive.

_If_ there is lots of waste at the end of humongous regions, you can do
one of the following:

- fix the application to not use such awkward array sizes ;) I.e. the
typical double-the-array-size algorithm can create lots of waste for G1
as this creates 2^x + 16 bytes sized objects which just don't fit into
regions.

Maybe, by sizing these arrays better in advance (like if you have an
approximate idea about their size), you can actually avoid lots of
memory allocations and copying of array contents (independently of the
chosen collector) which is not free either, to get a significant boost
of your application.

- increase the heap region size using G1HeapRegionSize to get fewer
humongous objects, so that there are not that many huge arrays wasting
space. This is probably the best short-term fix to get rid of a lot of
humongous object allocations (see the example command line after this list).

- increase the heap

- contributions for https://bugs.openjdk.java.net/browse/JDK-8172713
are welcome.
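To illustrate the region size suggestion above, a sketch of the command line change (8m is purely an example value; the region size must be a power of two and should be chosen after looking at the actual humongous object sizes):

    # larger regions so that fewer of the big arrays cross the humongous threshold
    java -XX:+UseG1GC -Xmx12g -XX:G1HeapRegionSize=8m <application>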

Other issues:

- there are marking failures, i.e. g1 can't complete marking and needs
to restart it, particularly during the ramp-up of heap usage that leads
to the full gc.
I do not think it makes a difference for that particular situation, but it
might be awkward for others.
Increase -XX:MarkStackSize/-XX:MaxMarkStackSize (I think that's their
name) to something larger to avoid the "concurrent-mark-reset-for-overflow" messages.
Or switch to jdk9 where this issue most likely will not occur.

- the time from end of marking to reclaiming memory is a bit long and
might be problematic sometimes. The only fix is to upgrade to jdk9,
which I highly recommend. (Also because the logging in jdk8 is a mess
;))

- the (by default) selected IHOP value seems too low, the application
wastes 6G of heap. Again, either update to jdk9 and profit from the
adaptive IHOP (and other improvements), or increase the IHOP manually
(-XX:InitiatingHeapOccupancyPercent) for increased throughput.

Note that increasing the IHOP makes no real difference in the frequency
of the full gcs as it does not change how much and how long the
application holds on to memory.
In your application, except in the cases when these spikes occur, marking takes like 300ms max (see http://cr.openjdk.java.net/~tschatzl/g1-strange-log/strange-g1-concmark.png).

In case of spikes, particularly even with the repeated marking, it
takes like 9s where the long marking times coincide with the spikes. I
assume that during these times a single very large array of j.l.O
serializes marking progress.

These spikes should be much more muted in jdk9 too :)

Thanks,
  Thomas

P.S.: Did I already mention to try out jdk9? ;)


Re: Strange G1 behavior

Kirk Pepperdine-2
Hi Thomas,

Thanks for this very detailed analysis. Where to start...

There is no disputing your analysis of this GC log as it pretty much matches my thoughts on what is happening. My question is centered around my experiences running into this spiking behavior in many different applications across many different domains. A common artifact prior to the Full is seeing survivor resized to 0B. In fact, if you see survivor resized to 0B there doesn’t seem to be a way to avoid the Full. (Survivor X->0.0B appears to be a side effect of the root of the problem.)

I guess it is possible to have the same conditions appearing in many applications running across many domains that result in this identical behavior. That said, I guess I’m very lucky that I am running into the accidental timing issue as frequently as I am. The question I want to ask is: is there a common coding pattern that all the applications share that is aggravating the garbage collector in a nasty way? And if so, is there something the collector could do to prevent the Full? That it is reproducible in this particular case suggests that the answer is yes. Have I been able to eliminate this behavior by “adjusting” the application’s use of humongous objects? The answer is, I think it has made things better but hasn’t eliminated it. Quite typically this problem takes several days to show up, which makes testing tricky.

So I have to ask the question: is it possible that humongous regions are being ignored in such a way that they build up and create a condition where a Full GC is necessary to recover?

I have in my GitHub repository a project that displays the data from -XX:+G1PrintRegionLivenessInfo. It was useful for visualizing humongous allocation behavior.

As for Java 9, not a real possibility for the moment. I can hope but I’m not so confident after this find that it’s going to offer relief for this particular issue.


Kind regards,
Kirk




Re: [8u] RFR for backport of JDK-8148175: C1: G1 barriers don't preserve FP registers to jdk8u-dev

Vladimir Kozlov
In reply to this post by Muthusamy Chinnathambi
Good.

Thanks,
Vladimir


Re: Strange G1 behavior

Thomas Schatzl
In reply to this post by Kirk Pepperdine-2
Hi Kirk,

On Fri, 2017-10-13 at 15:23 +0200, Kirk Pepperdine wrote:

> Hi Thomas,
>
> Thanks for the this very detailed analysis. Where to start…….
>
> There is no disputing your analysis of this GC log as it’s pretty
> much matches my thoughts on what is happening. My question is
> centered around my experiences running into this spiking behavior in
> many different applications across many different domains. A common
> artifact prior to the Full is seeing survivor resized to 0B. In fact
> if you see survivor be resized to 0B there doesn’t seem to be a way
> to avoid the full. (Survivor X->0.0B appears to be a side effect of
> the root of the problem).
>
> I guess it is possible to have the same conditions appearing in many
> applications running across many domains that result in this
> identical behavior. That said, I guess I’m very lucky that I am
> running into the accidental timing issue as frequently as I am. It
> feels like the question I want to as is; is there a common coding
> pattern that all the applications share that is aggravating the
> garbage collector in nasty way?

Given the current information I have, the application's live set and allocation rate simply seem temporarily too high for the current GC settings.

For the allocation rate, please compare the slopes of heap usage after
(young) gcs during these spikes (particularly in that full gc case) and
normal operation.

In this application, given the information I have, every like 1500s,
there seems to be some component in the application that allocates a
lot of memory in a short time, and holds onto most of it for its
duration.

Now, there can be other reasons for the high allocation rate in the old
gen (that the heap usage graphs show), one is fragmentation in the
PLABs (TLABs used during GC)/to-space during copying.

I.e. with high PLAB waste/fragmentation you potentially need more space
than necessary in the old gen. In addition to that, these mostly empty
regions are not going to be evacuated in the current mixed gc cycle
again as G1 does not have any liveness information about them except
"completely full". So you potentially waste tons (and I know of at
least one internal application that suffered a lot from that) of
memory.

That could explain why the full gc can free so much memory.

However there is no exact way in jdk8 to get information about that - some of it can be inferred from -XX:+PrintPLAB. For testing purposes you could also try simply disabling PLAB resizing (-XX:-ResizePLAB), setting "reasonable" PLAB sizes manually, and seeing if the old gen allocation rate decreases during these spikes.
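A sketch of such an experiment (the PLAB size values below are illustrative assumptions only, not recommendations; they are specified in heap words):

    java -XX:+UseG1GC \
         -XX:+PrintPLAB \
         -XX:-ResizePLAB -XX:YoungPLABSize=4096 -XX:OldPLABSize=1024 \
         -Xloggc:gc.log <application>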

Again, in JDK9 we significantly improved this area, both about PLAB
resizing and reporting.
We have seen significant improvements on memory usage during gc and
decreases in gc frequency (and actually significant improvements on
throughput/latency because of that) on some internal applications.

Please look at the task JDK-8030849, and in particular the resolved
changes referenced there.

If you ask about common coding patterns that can avoid issues with
humongous object fragmentation, there are (to me at least ;) "easy"
ones: initialize your arrays with non-2^n element count values (unlike
what the Java collections do), or use a different exponential function than
doubling to increase the size of your arrays (or think a bit about
expected sizes of your containers to spend less cpu cycles on doubling
their memory backing, potentially improving overall application
throughput).

I.e. an initial array size of e.g. 16 ultimately results in an object
size of 2^n bytes + object header size bytes, which is enough to cause
lots of wasted space as soon as your array gets humongous. If you used
e.g. 15 as an initial value, everything would be fine from that POV.
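A small sketch of that arithmetic (the region size and array header size here are assumptions for illustration; the real values depend on heap size and VM configuration):

    public class HumongousWasteSketch {
        public static void main(String[] args) {
            final long regionSize = 4L * 1024 * 1024; // assumed, e.g. -XX:G1HeapRegionSize=4m
            final long header = 16;                   // typical array header size, VM dependent
            // Doubling from a power-of-two capacity: 16, 32, ... elements of 8 bytes each.
            for (long elems = 16; elems <= (1L << 22); elems *= 2) {
                long objSize = elems * 8 + header;
                if (objSize >= regionSize / 2) {      // G1's humongous threshold: half a region
                    long regions = (objSize + regionSize - 1) / regionSize;
                    long waste = regions * regionSize - objSize;
                    System.out.printf("%,d elements -> %,d bytes in %d regions, %,d bytes wasted%n",
                                      elems, objSize, regions, waste);
                }
            }
        }
    }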

Providing a patch for the previously mentioned JDK-8172713 would also
fix that - for future JDK versions only though.

> And if so, is there something the collector could do to prevent the
> Full? That is it reproducible in this particular case suggest that
> the answer is yes.

Depends on further information. If the application holds onto too much
memory and allocates too quickly, there is ultimately nothing that GC
can do.

If it is e.g. the issue with PLAB fragmentation, then yes, and that
issue has already been fixed in jdk9 (to a very large degree; as you
can see from JDK-8030849, there are some ideas that were left out
because they seemed unnecessary.)

JDK 18.3 will get a parallel full gc to improve the worst-case
situation a bit; in JDK9 the young gcs with to-space exhaustion should
already be very fast (I just saw that in that log with jdk8 they take
like seconds too).

> Have I been able to eliminate this behavior by “adjusting” the
> applications use of humongous object? The answer to that is, I think
> it has made it better but hasn’t eliminated it.

That seems logical: improving humongous object space usage only delays
the inevitable as it only frees some more memory the application can
hold onto a bit longer.

> Quite typically this problem takes several days to show up which
> makes testing tricky.

At least in this case, the issue seems to be quite reproducible: there
is something happening every like 1500s that causes these spikes.

> So I have to ask the question,  is it possible that humongous regions
> are being ignored in such a way that they build up to create a
> condition where a Full GC is necessary to recover from it?

I kind of fail to understand the goal of this request: humongous
objects are allocated by the Java application, and as such part of the
Java heap, and obviously counted against heap usage, wherever they are
allocated. They will also always cause at least some kind of
fragmentation waste wherever they are located.

Note that without quantification of the impact of humongous object
fragmentation waste in this particular instance I do not know whether a
more clever humongous object sizing improves the situation
significantly enough.

It seems to exacerbate the situation though.

> I have in my GitHub repository a project that displays the data from
> -XX:+G1PrintRegionLivenessInfo. It was useful for visualizing
> humongous allocation behavior.
>
> As for Java 9, not a real possibility for the moment. I can hope but
> I’m not so confident after this find that it’s going to offer relief
> for this particular issue.

Thanks,
  Thomas


Re: Strange G1 behavior

Kirk Pepperdine-2
Hi Thomas,

Again, thank you for the detailed response.


> On Oct 16, 2017, at 1:32 PM, Thomas Schatzl <[hidden email]> wrote:
>
> Hi Kirk,
>
> On Fri, 2017-10-13 at 15:23 +0200, Kirk Pepperdine wrote:
>> Hi Thomas,
>>
>> Thanks for the this very detailed analysis. Where to start…….
>>
>> There is no disputing your analysis of this GC log as it’s pretty
>> much matches my thoughts on what is happening. My question is
>> centered around my experiences running into this spiking behavior in
>> many different applications across many different domains. A common
>> artifact prior to the Full is seeing survivor resized to 0B. In fact
>> if you see survivor be resized to 0B there doesn’t seem to be a way
>> to avoid the full. (Survivor X->0.0B appears to be a side effect of
>> the root of the problem).
>>
>> I guess it is possible to have the same conditions appearing in many
>> applications running across many domains that result in this
>> identical behavior. That said, I guess I’m very lucky that I am
>> running into the accidental timing issue as frequently as I am. It
>> feels like the question I want to as is; is there a common coding
>> pattern that all the applications share that is aggravating the
>> garbage collector in nasty way?
>
> Given the current information I have, the application's live set and
> allocation rate simply seems too high to support current GC settings
> temporarily.
>
> For the allocation rate, please compare the slopes of heap usage after
> (young) gcs during these spikes (particularly in that full gc case) and
> normal operation.

Censum estimates allocation rates as this is a metric that I routinely evaluate.

This log shows a spike at 10:07 which correlates with the Full GC event. However the allocation rates, while high, are well within values I’ve seen with many other applications that are well behaved. Censum also estimates rates of promotion, and those seem exceedingly high at 10:07. That said, there are spikes just after 10:10 and around 10:30 which don’t trigger a Full. In both cases the estimates for allocation rates are high, though the estimates for rates of promotion, while high, are not as high as those seen at 10:07.

All in all, nothing here seems out of the ordinary, and while I want you to be right about the waste and PLAB behaviors, these spikes feel artificial, i.e. I still want to blame the collector for not being able to cope with some aspect of application behavior that it should be able to cope with, that is, something other than a high allocation rate with low recovery due to data simply being referenced and therefore not eligible for collection...

>
> In this application, given the information I have, every like 1500s,
> there seems to be some component in the application that allocates a
> lot of memory in a short time, and holds onto most of it for its
> duration.

Sorry but I’m not seeing this pattern in either the occupancy-after-GC or the allocation rate views. What I do see is a systematic loss of free heap over time (slow memory leak??? Effects of caching???).

As I look at all of the views in Censum I see nothing outstanding that leads me to believe that this Full is a by-product of some interaction between the collector and the application (some form of zombies????). Also, one certainly cannot rule out your speculation about heap fragmentation in PLABs. I simply don’t have the data to say anything about that, though I know it can be a huge issue. What I can say is that even if there is 20% waste, it can’t account for the amount of memory being recovered. I qualify that with: unless there is a blast of barely humongous allocations taking place. I’d like to think this is a waste issue but I’m suspicious. I’m also suspicious that it’s simply the application allocating in a burst and then releasing. If that were the case I’d expect a much more gradual reduction in the live set size.

I think the answer right now is: we need more data. I’ll try to get the “client” to turn on the extra flags and see what that yields. I won’t play with PLAB sizing this go ‘round if you don’t mind. If you’re right and it is a problem with waste, then the beer is on me the next time we meet.

The “don’t allocate arrays with power-of-2 sizes” comment is an interesting one. While there are clear advantages to allocating arrays with power-of-2 sizes, I believe these cases are specialized, and I don’t generally see people dogmatically allocating this way.

Thanks for the invaluable thoughts.

Kind regards,
Kirk




Re: Strange G1 behavior

Thomas Schatzl
Hi Kirk,

On Mon, 2017-10-16 at 20:07 +0200, Kirk Pepperdine wrote:

> Hi Thomas,
>
> Again, thank you for the detailed response.
>
>
> > On Oct 16, 2017, at 1:32 PM, Thomas Schatzl <thomas.schatzl@oracle.
> > com> wrote:
> >
> > For the allocation rate, please compare the slopes of heap usage
> > after (young) gcs during these spikes (particularly in that full gc
> > case) and normal operation.
>
> Censum estimates allocation rates as this is a metric that I
> routinely evaluate.
>
> This log shows a spike at 10:07 which correlates with the Full GC
> event. However the allocation rates while high, are well within
> values I’ve seen with many other applications that are well behaved.
> Censum also estimates rates of promotion and those seem exceedingly
> high at 10:07. That said, there are spikes just after 10:10 and
> around 10:30 which don’t trigger a Full. In both cases the estimates
> for allocation rates are high though the estimates for rates
> of  promotion while high, are not as high as those seen at 10:07.
>
> All in all, nothing here seems out of the ordinary and while I want
> you to be right about the waste and PLAB behaviors, these spikes feel
> artificial, i.e. I still want to blame the collector for not being
> able to cope with some aspect of application behavior that it should
> be able to cope with.. that is something other than a high allocation
> rate with low recover due to data simply being reference and therefor
> not eligible for collection...

When I said "allocation rate" here I always meant promotion rate. For this discussion (and in general), in a generational collector the application's real allocation rate is usually not very interesting.

Sorry for being imprecise.

> > In this application, given the information I have, every like
> > 1500s, there seems to be some component in the application that
> > allocates a lot of memory in a short time, and holds onto most of
> > it for its duration.
>
> Sorry but I’m not seeing this pattern either in occupancy after or
> allocation rate views. What I do see is a systematic loss of free
> heap over time (slow memory leak??? Effects of caching???).

Let's have a look at the heap usage after gc over time for a few collection cycles before that full gc event. Please look at http://cr.openjdk.java.net/~tschatzl/g1-strange-log/strange-g1-promo.png which just shows a few of these.

I added rough linear interpolations of the heap usage after gc (so that
the slope of these lines corresponds to the promotion rates). I can see
a large, significant difference in the slopes between the collection
cycles before the full gc event (black lines) and the full gc event
(red line), while all black ones are roughly the same. :)

Note that my graphs were painted off-hand without any actual
calculation, and particular the red one looks like an underestimation
of the slope.
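In other words, the slope is just the change in heap-used-after-GC divided by the elapsed time; a trivial sketch of that calculation (names are made up for illustration):

    class PromotionRateSketch {
        // Promotion rate between two consecutive collections, computed from the
        // occupancy-after-GC values and their timestamps (units: MB and seconds).
        static double promotionRateMBps(double t0, double usedAfterGc0,
                                        double t1, double usedAfterGc1) {
            return (usedAfterGc1 - usedAfterGc0) / (t1 - t0);
        }
    }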

> As I look at all of the views in Censum I see nothing outstanding
> that leads me to believe that this Full is a by-product of some
> interaction between the collector and the application (some form of
> zombies????). Also, one certainly cannot rule out your speculation

It does not look like there is an issue e.g. with j.l.ref.References of
any kind.

> for heap fragmentation in PLABs. I simply don’t have the data to say
> anything about that though I know it can be a huge issue. What I can
> say is that even if there is 20% waste, it can’t account for the
> amount of memory being recovered. I qualify that with, unless there
> is a blast of barely humongous allocations taking place. I’d like to
> thing this is a waste issue but I’m suspicious. I’m also suspicious
> that it’s simply the application allocating in a burst then
> releasing. If that were the case I’d expect a much more gradual
> reduction in the live set size.
>
> I think the answer right now is; we need more data.

Agree.

> I’ll try to get
> the “client” to turn on the extra flags and see what that yields. I
> won’t play with plab sizing this go ‘round if you don’t mind. If
> you’re right and it is a problem with waste, then the beer is on me
> the next time we meet.
>
> The don’t allocate arrays in size of powers of 2 is an interesting
> comment. While there are clear advantages to allocating arrays in
> size of powers of 2, I believe in that these cases are specialized
> and that I don’t generally see people dogmatically allocating this
> way.

There are cases where you probably want 2^n sized buffers, but in many,
many cases like some serialization for data transfer, it does not
matter a bit whether the output buffer can hold exactly 2^n bytes or
not, i.e. is just a bit smaller.

Of course this is something G1 should handle better by itself, but for
now that is what you can do about this.

Thanks,
  Thomas


Re: Strange G1 behavior

Kirk Pepperdine-2

> On Oct 17, 2017, at 10:22 AM, Thomas Schatzl <[hidden email]> wrote:
>
> Hi Kirk,
>
> On Mon, 2017-10-16 at 20:07 +0200, Kirk Pepperdine wrote:
>> Hi Thomas,
>>
>> Again, thank you for the detailed response.
>>
>>
>>> On Oct 16, 2017, at 1:32 PM, Thomas Schatzl <thomas.schatzl@oracle.
>>> com> wrote:
>>>
>>> For the allocation rate, please compare the slopes of heap usage
>>> after (young) gcs during these spikes (particularly in that full gc
>>> case) and normal operation.
>>
>> Censum estimates allocation rates as this is a metric that I
>> routinely evaluate.
>>
>> This log shows a spike at 10:07 which correlates with the Full GC
>> event. However the allocation rates while high, are well within
>> values I’ve seen with many other applications that are well behaved.
>> Censum also estimates rates of promotion and those seem exceedingly
>> high at 10:07. That said, there are spikes just after 10:10 and
>> around 10:30 which don’t trigger a Full. In both cases the estimates
>> for allocation rates are high though the estimates for rates
>> of  promotion while high, are not as high as those seen at 10:07.
>>
>> All in all, nothing here seems out of the ordinary and while I want
>> you to be right about the waste and PLAB behaviors, these spikes feel
>> artificial, i.e. I still want to blame the collector for not being
>> able to cope with some aspect of application behavior that it should
>> be able to cope with.. that is something other than a high allocation
>> rate with low recover due to data simply being reference and therefor
>> not eligible for collection...
>
> I always meant "promotion rate" here as allocation rate. For this
> discussion (and in general) in a generational collector the
> application's real allocation rate is usually not very interesting.
>
> Sorry for being imprecise.

No worries. I get that promotion rates are very important, but when deciding what needs to be tuned, allocation rates are also very important, so I try to be very precise here, maybe too precise at times.

>
>>> In this application, given the information I have, every like
>>> 1500s, there seems to be some component in the application that
>>> allocates a lot of memory in a short time, and holds onto most of
>>> it for its duration.
>>
>> Sorry but I’m not seeing this pattern either in occupancy after or
>> allocation rate views. What I do see is a systematic loss of free
>> heap over time (slow memory leak??? Effects of caching???).
>
> Let's have a look at the heap usage after gc over time for a few
> collection cycle before that full gc event. Please look at http://cr.op
> enjdk.java.net/~tschatzl/g1-strange-log/strange-g1-promo.png which just
> shows a few of these.

Sorry but I can’t see these charts.

The issue with promotion rates is that you have to ask, is this the dog wagging the tail, or the tail wagging the dog. In this case I think it’s the tail wagging the dog. Would it be possible that a significant portion of the data being promoted was directly due to to-space being resized to 0?

Sorry for being so slow but I think I’m starting to understand the underlying pathology here. I’ll hypothesize here and you can blow the whole idea out of the water with more insightful comments...

Since tenured can grow into the space that to-space is likely to also want, preference towards tenured deflates survivor, which in turn puts more pressure on tenured, which in turn deflates survivor even further; wash, rinse, repeat, Full GC to clean up. Since this didn’t happen with the (older) generational collectors, it seems as if everything could be ok if a minimum to-space size could be maintained.
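
One way to check that hypothesis across a pile of logs is to scan for exactly those two symptoms together: the survivor target collapsing to zero and evacuation failures. A rough sketch; the string patterns are assumptions based on typical 8u -XX:+PrintGCDetails output, so adjust them to the log at hand:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SurvivorSqueeze {
    // Matches a survivor transition whose target size dropped to zero,
    // e.g. "Survivors: 4096.0K->0.0B" in the per-pause summary line.
    private static final Pattern SURVIVOR_TO_ZERO =
            Pattern.compile("Survivors: ([0-9.]+[BKMG])->0\\.0B");

    public static void main(String[] args) throws IOException {
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            if (line.contains("(to-space exhausted)")) {
                System.out.println("evacuation failure: " + line.trim());
            }
            Matcher m = SURVIVOR_TO_ZERO.matcher(line);
            if (m.find()) {
                System.out.println("survivor target went to 0 (was " + m.group(1) + ")");
            }
        }
    }
}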

>
> I added rough linear interpolations of the heap usage after gc (so that
> the slope of these lines corresponds to the promotion rates). I can see
> a large, significant difference in the slopes between the collection
> cycles before the full gc event (black lines) and the full gc event
> (red line), while all black ones are roughly the same. :)
>
> Note that my graphs were painted off-hand without any actual
> calculation, and particular the red one looks like an underestimation
> of the slope.
>
>> As I look at all of the views in Censum I see nothing outstanding
>> that leads me to believe that this Full is a by-product of some
>> interaction between the collector and the application (some form of
>> zombies????). Also, one certainly cannot rule out your speculation
>
> It does not look like there is an issue e.g. with j.l.ref.References of
> any kind.

LinkedList used to be susceptible to nepotism, and all generational collectors suffer from zombies.

>
>> for heap fragmentation in PLABs. I simply don’t have the data to say
>> anything about that though I know it can be a huge issue. What I can
>> say is that even if there is 20% waste, it can’t account for the
>> amount of memory being recovered. I qualify that with, unless there
>> is a blast of barely humongous allocations taking place. I’d like to
>> thing this is a waste issue but I’m suspicious. I’m also suspicious
>> that it’s simply the application allocating in a burst then
>> releasing. If that were the case I’d expect a much more gradual
>> reduction in the live set size.
>>
>> I think the answer right now is; we need more data.
>
> Agree.

It’s on its way!! Yay for more data.
You might be interested in this for visualization.  https://github.com/kcpeppe/regions
It’s a wee bit crude at the moment but I’ll accept pull requests for any refinements.

Kind regards,
Kirk


Re: Strange G1 behavior

Bernd Eckenfels-4
In reply to this post by Thomas Schatzl
Hello,

First of all I wanted to mention that, as I understand it, the application is a graph query server, therefore freeing and creating large graphs is the workload the GC has to deal with. (I think Thomas described that allocation pattern - it would lead to spiky promotions if a query transaction lasts longer than a collection cycle (or multiple))

And in addition, Thomas, did you remove your graph or is java.net refusing to serve pages now?

Regards,
Bernd


From: hotspot-gc-dev <[hidden email]> on behalf of Thomas Schatzl <[hidden email]>
Sent: Tuesday, October 17, 2017 10:22:54 AM
To: Kirk Pepperdine
Cc: [hidden email] openjdk.java.net
Subject: Re: Strange G1 behavior
 
[...]


Re: Strange G1 behavior

Kirk Pepperdine-2
Hi Bernd,


On Oct 17, 2017, at 12:12 PM, Bernd Eckenfels <[hidden email]> wrote:

> Hello,

> First of all I wanted to mention that, as I understand it, the application is a graph query server,

I think it’s a triple store.

> therefore freeing and creating large graphs is the workload the GC has to deal with. (I think Thomas described that allocation pattern - it would lead to spiky promotions if a query transaction lasts longer than a collection cycle (or multiple))

Right, unfortunately we’ve gotten fixated on this particular log, whereas this is a pattern I’ve seen across many JVMs running in a number of different domains that are not SPARQL. Since these are from older JVMs (8 and 7) I didn’t post them as I assumed, with the rate of change in G1, that it would be difficult to pin the behavior on a pathological case. I will try to dig out these older logs if you think it will help.

> And in addition, Thomas, did you remove your graph or is java.net refusing to serve pages now?

I’ve got all the relevant views in Censum and if one is missing I can add it in short order so I can visualize what Thomas is describing.

java.net is defunct, has been for a few months. I’ve tried to convince the OTN.. or whatever they’ve decided to rebrand themselves as, that the loss of java.net is a real loss as that site was well branded, and now… but that is a different issue and a different discussion and not for this list ;-)

Regards,
Kirk


Re: Strange G1 behavior

Thomas Schatzl
In reply to this post by Kirk Pepperdine-2
Hi,

On Tue, 2017-10-17 at 11:52 +0200, Kirk Pepperdine wrote:

> > On Oct 17, 2017, at 10:22 AM, Thomas Schatzl <thomas.schatzl@oracle
> > .com> wrote:
> >
> > Hi Kirk,
> >
> > On Mon, 2017-10-16 at 20:07 +0200, Kirk Pepperdine wrote:
> > > Hi Thomas,
> > >
> > > Again, thank you for the detailed response.
> > >
> > >
> > > > On Oct 16, 2017, at 1:32 PM, Thomas Schatzl
> > > > <[hidden email]> wrote:
> > > >
> > > > For the allocation rate, please compare the slopes of heap
> > > > usage after (young) gcs during these spikes (particularly in
> > > > that full gc case) and normal operation.
> > >
> > > Censum estimates allocation rates as this is a metric that I
> > > routinely evaluate.
> > >
> > > This log shows a spike at 10:07 which correlates with the Full GC
> > > event. However the allocation rates while high, are well within
> > > values I’ve seen with many other applications that are well
> > > behaved. Censum also estimates rates of promotion and those seem
> > > exceedingly high at 10:07. That said, there are spikes just after
> > > 10:10 and around 10:30 which don’t trigger a Full. In both cases
> > > the estimates for allocation rates are high though the estimates
> > > for rates of  promotion while high, are not as high as those seen
> > > at 10:07.
> > >
> > > All in all, nothing here seems out of the ordinary and while I
> > > want you to be right about the waste and PLAB behaviors, these
> > > spikes feel artificial, i.e. I still want to blame the collector
> > > for not being able to cope with some aspect of application
> > > behavior that it should be able to cope with.. that is something
> > > other than a high allocation rate with low recover due to data
> > > simply being reference and therefor not eligible for
> > > collection...
> >
> > I always meant "promotion rate" here as allocation rate. For this
> > discussion (and in general) in a generational collector the
> > application's real allocation rate is usually not very interesting.
> >
> > Sorry for being imprecise.
>
> No worries, I get that promotion rates are very important but in when
> deciding what needs to be tuned, allocation rates are also very
> important so I try to be very precise here, maybe too precise at
> times.

Let's qualify this further: for the purpose of investigating this full
gc :)

> > > > In this application, given the information I have, every like
> > > > 1500s, there seems to be some component in the application that
> > > > allocates a lot of memory in a short time, and holds onto most
> > > > of
> > > > it for its duration.
> > >
> > > Sorry but I’m not seeing this pattern either in occupancy after
> > > or
> > > allocation rate views. What I do see is a systematic loss of free
> > > heap over time (slow memory leak??? Effects of caching???).
> >
> > Let's have a look at the heap usage after gc over time for a few
> > collection cycle before that full gc event. Please look at
> > http://cr.openjdk.java.net/~tschatzl/g1-strange-log/strange-g1-
> > promo.png which just shows a few of these.
>
> Sorry but I can’t see these charts.

My email client added some line breaks in the URL because it was too
long, sorry. One option is to manually stitch it together; all graphs
so far are collected in

http://cr.openjdk.java.net/~tschatzl/g1-strange-log/ , the
strange-g1-promo.png image.

(and cr.openjdk.java.net is live and kicking)

> The issue with promotion rates is that you have to ask, is this the
> dog wagging the tail, or the tail wagging the dog. In this case I
> think it’s the tail wagging the dog. Would it be possible that a
> significant portion of the data being promoted was directly due to
> to-space being resized to 0?

No. At that point the full gc was already inevitable, from my
understanding of the log. If you look at the log, the second to-space
exhaustion before the full gc does not, as assumed in the policy, give
back any free regions (typically they do), so there is no point in
restarting the application. Finally some time later it performs the
full gc.

Overall, in my understanding there seems to be a basic out-of-memory
situation here: the application's live set (plus any overheads caused
by memory management) during that application phase is too much for the
given available memory for G1.

Full GC can free up some of that extra space (and potentially the
application already let go of some data at that time), allowing the
application to continue until everything returns to normal.

That a GC algorithm needs extra space to work well is not a secret, and
is something the user needs to account for. Iirc some (rather old now)
literature (on smaller heaps, with different applications) indicates
that performance may start to degrade significantly once the live set
exceeds ~33% of the heap. After that full gc we are already at ~66%
live data...
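
Putting rough numbers on that headroom point, as a back-of-the-envelope check (all concrete values here are illustrative assumptions, not measurements from the log):

public class HeapSizingSketch {
    public static void main(String[] args) {
        double currentHeapGb  = 12.0;  // assumed -Xmx, for illustration only
        double liveFraction   = 0.66;  // estimate above: ~66% live after the full gc
        double comfortCeiling = 0.33;  // rule-of-thumb point where degradation may start

        double liveGb          = currentHeapGb * liveFraction;
        double suggestedHeapGb = liveGb / comfortCeiling;
        System.out.printf("~%.1f GB live -> ~%.0f GB heap to get back under %.0f%%%n",
                liveGb, suggestedHeapGb, comfortCeiling * 100);
        // Prints roughly: ~7.9 GB live -> ~24 GB heap to get back under 33%
    }
}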

So the behavior we are seeing now seems like expected (but of course
undesirable) behavior of the GC algorithm to me.

Now there is of course the question where that overhead comes from, and
how it can be reduced as the memory used after full gc is quite a bit
below the maximum heap. :)

All that will probably be a bit clearer with more data.

> Sorry for being so slow but I think I’m starting to understand the
> underlying pathology here. I’ll hypothesis here and you can blow the
> whole idea out of the water with more insightful comments..
>
> Since tenured can grow into the space that the to-space is likely to
> also want, preference towards tenured deflates survivor which in turn
> puts more pressure on tenured which in turn deflates survivor even
> further, wash rinse, repeat, Full GC to clean up. Since this didn’t
> happen with the (older) generational collectors, it seems as if it’s
> possible that everything could be ok if a minimum to-space size could
> be maintained.

Not sure what you mean by "preference towards tenured": of course the
generations share the same total memory, and if one of them gets
unusually large, the other needs to adapt. G1 already has a minimum
young gen size (which is, as you can see, not a hard minimum), but in
this degenerate situation it simply thinks: "okay, let's try limping
along with less than that because we can. Maybe we are lucky, the
application gets back to normal within the next few seconds and so can
avoid full gc".

The alternative would basically be to start a fully compacting gc
earlier (the other collectors would do that btw) as soon as the space
statically allocated to old gen runs out.

Do you think this is preferable behavior?

> > I added rough linear interpolations of the heap usage after gc (so
> > that the slope of these lines corresponds to the promotion rates).
> > I can see a large, significant difference in the slopes between the
> > collection cycles before the full gc event (black lines) and the
> > full gc event (red line), while all black ones are roughly the
> > same. :)
> >
> > Note that my graphs were painted off-hand without any actual
> > calculation, and particular the red one looks like an
> > underestimation of the slope.
> >
> > > As I look at all of the views in Censum I see nothing outstanding
> > > that leads me to believe that this Full is a by-product of some
> > > interaction between the collector and the application (some form
> > > of zombies????). Also, one certainly cannot rule out your
> > > speculation 
> >
> > It does not look like there is an issue e.g. with
> > j.l.ref.References of
> > any kind.
>
> LinkList used to be susceptible to nepotism and all generations
> collectors suffer from zombies.

Maybe. I did not preclude more issues here :)

[...]

> > >
> > > I think the answer right now is; we need more data.
> >
> > Agree.
>
> It’s on it’s way!! Yeah for more data.
> You might be interested in this for visualization.  https://github.co
> m/kcpeppe/regions
> It’s a wee bit crude at the moment but I’ll accept pull requests for
> any refinements.
>

Thanks,
  Thomas


Re: Strange G1 behavior

Guillaume Lederrey
It looks like a 20MB attachment isn't the way to go... So here is the log: https://people.wikimedia.org/~gehel/wdqs-blazegraph_jvm_gc.pid13326.log.0

On 17 October 2017 at 14:21, Guillaume Lederrey <[hidden email]> wrote:
Hello!

Thanks a lot for your analysis! I have to admit that I barely follow your train of thought.

Here is another GC log after adding "-XX:+G1PrintRegionLivenessInfo". This is only a few hours worth of data, I'll send additional logs as they are generated.

From what I understand of your discussions, we also need to have a deeper look at why our application is allocating that much memory and what we can do to prevent this. My very high level guess: we have some clients running particularly expensive queries. Since our SPARQL endpoint is publicly accessible, that would mean we need to find a way to proactively estimate the memory cost of a query and drop the ones that are too expensive. Not an easy problem either...

Thanks again!

  Guillaume
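
For completeness: -XX:+G1PrintRegionLivenessInfo is a diagnostic option in 8u, so it has to be paired with the diagnostic unlock. A sketch of a logging setup along the lines of what is being collected here (the exact flag set actually in use is an assumption):

java -Xloggc:gc.log \
     -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
     -XX:+PrintAdaptiveSizePolicy \
     -XX:+PrintReferenceGC \
     -XX:+UnlockDiagnosticVMOptions -XX:+G1PrintRegionLivenessInfo \
     -jar service.jar   # placeholder for the actual launcher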

On 17 October 2017 at 13:31, Thomas Schatzl <[hidden email]> wrote:
[...]
--
mobile : +41 76 573 32 40
skype : Guillaume.Lederrey
Freenode: gehel

Re: Strange G1 behavior

Kirk Pepperdine
Hi all,

I’ve run the data through my crude visualization tool. I’ve screencast it for those that don’t want to go through the pain of cloning and sorting out how to use the tooling.


This log looks different in that the mixed collections are actually recovering space. However, there seems to be an issue with RSet update times just as heap occupancy jumps, though I would view this as a normal response to increasing tenured occupancies. The spike in tenured occupancy does force young to shrink to a size that should see “to-space” with no room to accept incoming survivors.

Specific recommendations: the app is churning through enough weak references that it would benefit from parallelized reference processing (off by default); I would double the max heap and limit the shrinking of young to 20% to start with (the default is 5%).

Kind regards,
Kirk

On Oct 17, 2017, at 2:36 PM, Guillaume Lederrey <[hidden email]> wrote:

It looks like a 20MB attachement isn't the way to go... So here is the log: https://people.wikimedia.org/~gehel/wdqs-blazegraph_jvm_gc.pid13326.log.0

[...]


Re: Strange G1 behavior

Guillaume Lederrey
Quick note before going to bed...

On 17 October 2017 at 23:28, Kirk Pepperdine <[hidden email]> wrote:
> Hi all,
>
> I’ve run the data through my crude visualization tool. I’ve screencast it for those that don’t want to go through the pain of cloning and sorting out how to use the tooling.

I don't seem to have access, I sent an access request...

> This log looks different in that the mixed collections are actually recovering space. However, there seems to be an issue with RSet update times just as heap occupancy jumps, though I would view this as a normal response to increasing tenured occupancies. The spike in tenured occupancy does force young to shrink to a size that should see “to-space” with no room to accept incoming survivors.

> Specific recommendations: the app is churning through enough weak references that it would benefit from parallelized reference processing (off by default); I would double the max heap and limit the shrinking of young to 20% to start with (the default is 5%).

I'll double max heap tomorrow. Parallel ref processing is already enabled (-XX:+ParallelRefProcEnabled), and young is already limited to max 25% (-XX:G1MaxNewSizePercent=25). I'll add -XX:G1NewSizePercent=20 (if that's the correct option).
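
G1NewSizePercent is indeed the matching knob for "limit the shrinking of young"; note that in 8u it and G1MaxNewSizePercent are experimental options, so they only take effect behind -XX:+UnlockExperimentalVMOptions. A sketch of the combination being discussed (the doubled heap size is illustrative, not a measured requirement):

java -Xms24g -Xmx24g \
     -XX:+UseG1GC \
     -XX:+ParallelRefProcEnabled \
     -XX:+UnlockExperimentalVMOptions \
     -XX:G1NewSizePercent=20 \
     -XX:G1MaxNewSizePercent=25 \
     -jar service.jar   # placeholder for the actual launcher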

Thanks for the analysis!
 

> Kind regards,
> Kirk

On Oct 17, 2017, at 2:36 PM, Guillaume Lederrey <[hidden email]> wrote:
[...]




--
mobile : +41 76 573 32 40
skype : Guillaume.Lederrey
Freenode: gehel