High sys time in G1 GC logs and questions about big heap tuning


High sys time in G1 GC logs and questions about big heap tuning

Hector Caballero
Hi,

We've been troubleshooting a big traffic application for some time now
and have discovered GC times are impacting performance. Next paragraph
just to give some context:

We are using G1 GC with JDK 8u131 on a RHEL 6 server with latest patches
and these JVM parameters:

         javaOptions = -Xms128g
         javaOptions = -Xmx128g
         javaOptions = -XX:MetaspaceSize=128m
         javaOptions = -XX:+UseG1GC
         javaOptions = -XX:InitiatingHeapOccupancyPercent=15
         javaOptions = -XX:MaxGCPauseMillis=200
         javaOptions = -XX:+ParallelRefProcEnabled
         javaOptions = -XX:SoftRefLRUPolicyMSPerMB=0
         javaOptions = -XX:ParallelGCThreads=24
         javaOptions = -XX:+PrintGCDetails
         javaOptions = -XX:+PrintGCDateStamps
         javaOptions = -XX:+PrintTenuringDistribution
         javaOptions = -XX:+PrintAdaptiveSizePolicy
         javaOptions = -XX:+PrintReferenceGC
         javaOptions = -XX:+UnlockDiagnosticVMOptions
         javaOptions = -XX:+G1SummarizeRSetStats
         javaOptions = -Xloggc:/opt/gerrit/review_site/logs/jvm/jvm_gc-%t.log
         javaOptions = -XX:+UseGCLogFileRotation
         javaOptions = -XX:GCLogFileSize=40M
         javaOptions = -XX:NumberOfGCLogFiles=20
         javaOptions = -Djava.util.Arrays.useLegacyMergeSort=true

and the system was stable. One clarification: the IHOP value was lowered to avoid the full GCs that were killing us, and it worked well for that purpose. The only problem we saw with these parameters was some impact on throughput but, as said, the system was still very usable.

After upgrading the application to the latest version, the memory pattern changed dramatically and full GCs came back, making the application unusable. The main change we saw was a very fast increase in the old generation (up to 115GB) while the young generation stayed small. As an urgent measure, the system memory was upgraded and the heap size raised to 256GB, which gave a dramatic improvement in application performance, but we're still seeing a big regression in some operations, which are taking 3-4 times longer. We understand we need to look at the code itself (this application is an open source one) to see how object creation has changed, but we are under a lot of pressure from users and management to solve this right now, so code changes need to wait a bit. Our goal right now is to make the application as usable as possible by tuning memory usage, to give us a bit of room to start looking at the code.

The increase in heap size led to very low throughput, around 70%. To try to improve it, we decided to relax the pause goal to 300ms. This didn't hurt the application (at least it was no worse) and things improved a bit, but not enough. Next we decided to start collecting a bit later to reduce the number of mixed collections, so we increased IHOP to 18% (a very modest change). After that, looking at the GC logs we saw that, even without full GCs, there are young collections taking a long time (up to a minute in one case). In some of those cases, the sys time was much higher than the user time, which is strange; we haven't found much information about the causes of this, but we can see it is affecting us a lot:

Timestamp    User Time (secs)    Sys Time (secs)    Real Time (secs)
2017-10-20T14:19:41    1.68    9.5    0.48
2017-10-20T14:19:59    2.26    9.56    0.56
2017-10-20T14:34:45    12.26    14.01    1.23
2017-10-20T15:26:33    9.82    31.73    2.39
2017-10-20T15:26:37    15.68    46.47    3.19
2017-10-20T15:26:41    10.11    39.64    2.44
2017-10-20T15:28:18    15.53    45.14    3.08
2017-10-20T15:29:01    5.4    41.63    2.43
2017-10-20T15:29:05    74.69    93.91    11.51
2017-10-20T15:29:18    80.67    137.35    18.38
2017-10-20T15:30:55    19.65    23.99    2.16
2017-10-20T15:31:03    11.03    29.28    1.82
2017-10-20T15:31:28    4.88    8.44    0.77
2017-10-20T15:31:30    1.58    59.97    3.21
2017-10-20T15:31:35    57.13    135.82    14.96
2017-10-20T15:32:23    25.46    150.54    14.79
2017-10-20T15:33:11    87.92    112.98    16.46
2017-10-20T15:35:13    17.99    32.15    2.33
2017-10-20T15:35:54    154.7    191.82    29.67
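These user/sys/real figures come from the [Times: user=... sys=... real=... secs] lines of the attached log; a quick filter along these lines pulls them out of the rotated files (adjust the path as needed):

         grep -h "Times:" /opt/gerrit/review_site/logs/jvm/jvm_gc-*.log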

Any hints here?

The second question is about general tuning for this big heap, since we don't seem to be doing well so far. We've been reading a lot of documentation about tuning this collector and thought we had a fairly acceptable understanding of the process, but this has proven wrong: some of our changes (such as increasing IHOP to bigger values and increasing the number of parallel threads) haven't shown any improvement, and in some cases have caused regressions such as longer young collections or a drop in throughput. I'm attaching a chunk of the log so you can have a quick look if possible and perhaps point us to some ideas.

Thanks a lot,

Hector Caballero





Attachment: jvm_gc-2017-10-20_14-19-14.log.0.current.gz (1M)

Re: High sys time in G1 GC logs and questions about big heap tuning

Vitaly Davidovich


Do you have transparent huge pages (THP) enabled? This is a fairly common cause for outsized sys times. 
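On RHEL 6 the quickest check is the sysfs switch (RHEL 6 kernels expose it under redhat_transparent_hugepage; stock kernels use transparent_hugepage), e.g.:

         cat /sys/kernel/mm/redhat_transparent_hugepage/enabled
         cat /sys/kernel/mm/redhat_transparent_hugepage/defrag

To turn it off for a test, echo never into both (as root), or add transparent_hugepage=never to the kernel command line so the setting survives reboots.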
 


Re: High sys time in G1 GC logs and questions about big heap tuning

Kirk Pepperdine-2
In reply to this post by Hector Caballero
Hi Hector,

This log lit up our CPU analytic. Out of 3042 GC pause events, kernel times exceeded User times 25 times and GC threads collected an abnormal amount of kernel time 352 times.
There could be a number of reasons for this. If the I/O channels are overloaded then writing perf data to temp could result in kernel time being accumulated. Transparent Huge Pages is also a common issue. Given the distribution of the problem, this could be likely.
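If you want to rule out the perf-data angle cheaply: by default the JVM memory-maps its jstat counters into /tmp (the hsperfdata files), and that can be switched off with a single flag. The trade-off is that jps/jstat will no longer see the process, so treat it as a diagnostic experiment rather than a permanent setting:

         javaOptions = -XX:+PerfDisableSharedMem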

The heap appears to be large, maybe even too large. That said, the live set is increasing over time. However, the runtime isn’t long enough to say if this is a problem or not.

The real issue is that allocation rates are insanely high. On top of that, there is a problem with promotions that suggests young is being resized too small. I don’t see the run-up in memory that you were seeing, but my belief is that the high allocation rate combined with the undersized young would be responsible for this. You survive because IHOP at 15% means you're making up for the small young by running more mixed collections. This all adds to the overhead.

The dominating cost in the collections is consistently object copy. If you set the minimum size of young to a larger value, you might be able to reduce the overall overhead simply by reducing the volume of data being prematurely promoted to tenured. The big win will come from a reduction in allocation rates.
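For reference, the minimum young size in JDK 8's G1 is the experimental G1NewSizePercent flag (the default is 5% of the heap), so an experiment would look something like the lines below — the 20 is purely illustrative, pick a value based on your allocation rate rather than copying it:

         javaOptions = -XX:+UnlockExperimentalVMOptions
         javaOptions = -XX:G1NewSizePercent=20

Avoid pinning young with -Xmn or -XX:NewSize under G1, as that takes away its ability to resize young to meet the pause goal.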

This is a start.

Kind regards,
Kirk Pepperdine




Re: High sys time in G1 GC logs and questions about big heap tuning

Hector Caballero

Hi all,


First, thanks to all for your help. We're confirming with our sysadmins, but as we understand it THP is enabled by default on RHEL 6.

Unfortunately, this log was collected when traffic was low by our standards, but we have seen the live set going up, and the old generation peaks at about 180GB under full traffic.

We'll try your suggestions and post again with the results, if that's OK with you.
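In the meantime we'll also watch the THP/compaction counters while the long collections happen, with something like the line below (exact counter names depend on the kernel), to see whether compact_stall and the thp_* counters jump during the bad pauses:

         egrep 'thp_|compact_' /proc/vmstat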

Thanks again

