GcInfo: longest pause duration?

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

GcInfo: longest pause duration?

Hohensee, Paul

I’d like to solicit opinions on reporting GC pause duration (stopped-world pause time) via JMX. This info would be useful in figuring out whether or not GC pause times are factors in failing to meet response time SLAs. The info is of course available directly from GC logs, but parsing logs is fraught and JMX doesn’t seem to report the equivalent info.

 

GcInfo

 

https://docs.oracle.com/javase/9/docs/api/com/sun/management/GcInfo.html

 

has a getDuration() method which works fine for the non-concurrent collectors (since they’re STW), but for CMS and G1 it appears to report the duration of an entire concurrent cycle, which isn’t what I want. The number of STW pauses during a concurrent cycle varies by collector, so ideally there would be a method that reports cause (as a string) and duration for each STW pause. If that’s too much, perhaps the minimum might be a getMaxPauseDuration() method that reports the maximum pause duration of all the STW pauses that happen during a concurrent cycle.

 

Relatedly, the full compacting GCs that happen as a result of CMS and G1 concurrent mode failure aren’t reported separately from concurrent cycles. It would be useful to differentiate these from “ConcurrentMarkSweep” and “G1 Old Generation”. Perhaps add collector types to CMS and G1, vis. “MarkSweepCompact” (which already exists and is literally what’s executed by CMS) and a new “G1 MarkSweepCompact” collector for G1.

 

If there’s a consensus that something should be done about either of these issues, I’d be happy to file RFE(s) and do the work.

 

Thanks,

 

Paul

 

Reply | Threaded
Open this post in threaded view
|

Re: GcInfo: longest pause duration?

Jeremy Manson-4
Yes, this information is useful.

We've done a couple of things to get at it, which folks might find interesting.

1) We extended the hsperfdata to include a large number of stats for CMS.  Parsing hsperfdata is ridiculously easy, and hasn't changed in years, so we just parse it directly and report the values to monitoring tools.  I'm too lazy to retype what we export, so are the relevant CMS variables from our code; what they do should be relatively obvious:

    _concurrent_mode_failure_count =
        PerfDataManager::create_counter(NULL_NS, "concurrent-mode-failure-count",
                                        PerfData::U_Events, CHECK);
    _concurrent_mode_failure_due_to_fragmentation_count =
        PerfDataManager::create_counter(NULL_NS, "concurrent-mode-failure-due-to-fragmentation-count",
                                        PerfData::U_Events, CHECK);
    _concurrent_mode_failure_time =
        PerfDataManager::create_counter(NULL_NS, "concurrent-mode-failure-time",
                                        PerfData::U_Ticks, CHECK);
    _cms_initial_mark_count =
        PerfDataManager::create_counter(NULL_NS, "cms-initial-mark-count",
                                        PerfData::U_Events, CHECK);
    _cms_initial_mark_time =
        PerfDataManager::create_counter(NULL_NS, "cms-initial-mark-time",
                                        PerfData::U_Ticks, CHECK);
    _cms_remark_count =
        PerfDataManager::create_counter(NULL_NS, "cms-remark-count",
                                        PerfData::U_Events, CHECK);
    _cms_remark_time =
        PerfDataManager::create_counter(NULL_NS, "cms-remark-time",
                                        PerfData::U_Ticks, CHECK);
    _cms_full_gc_count =
        PerfDataManager::create_counter(NULL_NS, "cms-full-gc-count",
                                        PerfData::U_Events, CHECK);
    _cms_full_gc_time =
        PerfDataManager::create_counter(NULL_NS, "cms-full-gc-time",
                                        PerfData::U_Ticks, CHECK);
    _cms_compacting_full_gc_count =
        PerfDataManager::create_counter(NULL_NS, "cms-compacting-full-gc-count",
                                        PerfData::U_Events, CHECK);
    _cms_compacting_full_gc_time =
        PerfDataManager::create_counter(NULL_NS, "cms-compacting-full-gc-time",
                                        PerfData::U_Ticks, CHECK);
    _cms_noncompacting_full_gc_count =
        PerfDataManager::create_counter(NULL_NS, "cms-noncompacting-full-gc-count",
                                        PerfData::U_Events, CHECK);
    _cms_noncompacting_full_gc_time =
        PerfDataManager::create_counter(NULL_NS, "cms-noncompacting-full-gc-time",
                                        PerfData::U_Ticks, CHECK);
    _cms_conc_mark_count =
        PerfDataManager::create_counter(NULL_NS, "cms-conc-mark-count",
                                        PerfData::U_Events, CHECK);
    _cms_conc_mark_time =
        PerfDataManager::create_counter(NULL_NS, "cms-conc-mark-time",
                                        PerfData::U_Ticks, CHECK);
    _cms_conc_preclean_count =
        PerfDataManager::create_counter(NULL_NS, "cms-conc-preclean-count",
                                        PerfData::U_Events, CHECK);
    _cms_conc_preclean_time =
        PerfDataManager::create_counter(NULL_NS, "cms-conc-preclean-time",
                                        PerfData::U_Ticks, CHECK);
    _cms_conc_abortable_preclean_count =
        PerfDataManager::create_counter(NULL_NS, "cms-conc-abortable-preclean-count",
                                        PerfData::U_Events, CHECK);
    _cms_conc_abortable_preclean_time =
        PerfDataManager::create_counter(NULL_NS, "cms-conc-abortable-preclean-time",
                                        PerfData::U_Ticks, CHECK);
    _cms_conc_sweep_count =
        PerfDataManager::create_counter(NULL_NS, "cms-conc-sweep-count",
                                        PerfData::U_Events, CHECK);
    _cms_conc_sweep_time =
        PerfDataManager::create_counter(NULL_NS, "cms-conc-sweep-time",
                                        PerfData::U_Ticks, CHECK);
    _cms_conc_reset_count =
        PerfDataManager::create_counter(NULL_NS, "cms-conc-reset-count",
                                        PerfData::U_Events, CHECK);
    _cms_conc_reset_time =
        PerfDataManager::create_counter(NULL_NS, "cms-conc-reset-time",
                                        PerfData::U_Ticks, CHECK);
    _cms_collection_count =
        PerfDataManager::create_counter(NULL_NS, "cms-collection-count",
                                        PerfData::U_Events, CHECK);
    _cms_collection_time =
        PerfDataManager::create_counter(NULL_NS, "cms-collection-time",
                                        PerfData::U_Ticks, CHECK);
    _cms_ref_proc_count =
        PerfDataManager::create_counter(NULL_NS, "cms-ref-proc-count",
                                        PerfData::U_Events, CHECK);
    _cms_ref_proc_time =
        PerfDataManager::create_counter(NULL_NS, "cms-ref-proc-time",
                                        PerfData::U_Ticks, CHECK);

    _yg_alloc_bytes =
        PerfDataManager::create_variable(NULL_NS, "jvm-gc-cms-young-gen-alloc-bytes",
                                         PerfData::U_None, CHECK);
    _yg_promo_bytes =
        PerfDataManager::create_variable(NULL_NS, "jvm-gc-cms-young-gen-promo-bytes",
                                         PerfData::U_None, CHECK);
    _og_direct_alloc_bytes =
        PerfDataManager::create_variable(NULL_NS, "jvm-gc-cms-old-gen-directalloc-bytes",
                                         PerfData::U_None, CHECK);
    _og_alloc_bytes =
        PerfDataManager::create_variable(NULL_NS, "jvm-gc-cms-old-gen-alloc-bytes",
                                         PerfData::U_None, CHECK);
    _total_alloc_bytes =
        PerfDataManager::create_variable(NULL_NS, "jvm-gc-cms-total-alloc-bytes",
                                         PerfData::U_None, CHECK);


2) We also implemented an API built on JFR's handling of GC events.  We use it to export the GC history to monitoring tools in a format that doesn't involve parsing logs that can change.
(Prior to JFR, we had a bespoke implementation where we instrumented everywhere a log happened, but we fortunately don't have to do that anymore.)

We'd love both of these things to be something that was more generally usable.

Jeremy

On Mon, Oct 23, 2017 at 9:31 AM, Hohensee, Paul <[hidden email]> wrote:

I’d like to solicit opinions on reporting GC pause duration (stopped-world pause time) via JMX. This info would be useful in figuring out whether or not GC pause times are factors in failing to meet response time SLAs. The info is of course available directly from GC logs, but parsing logs is fraught and JMX doesn’t seem to report the equivalent info.

 

GcInfo

 

https://docs.oracle.com/javase/9/docs/api/com/sun/management/GcInfo.html

 

has a getDuration() method which works fine for the non-concurrent collectors (since they’re STW), but for CMS and G1 it appears to report the duration of an entire concurrent cycle, which isn’t what I want. The number of STW pauses during a concurrent cycle varies by collector, so ideally there would be a method that reports cause (as a string) and duration for each STW pause. If that’s too much, perhaps the minimum might be a getMaxPauseDuration() method that reports the maximum pause duration of all the STW pauses that happen during a concurrent cycle.

 

Relatedly, the full compacting GCs that happen as a result of CMS and G1 concurrent mode failure aren’t reported separately from concurrent cycles. It would be useful to differentiate these from “ConcurrentMarkSweep” and “G1 Old Generation”. Perhaps add collector types to CMS and G1, vis. “MarkSweepCompact” (which already exists and is literally what’s executed by CMS) and a new “G1 MarkSweepCompact” collector for G1.

 

If there’s a consensus that something should be done about either of these issues, I’d be happy to file RFE(s) and do the work.

 

Thanks,

 

Paul

 


Reply | Threaded
Open this post in threaded view
|

Re: GcInfo: longest pause duration?

David Holmes
In reply to this post by Hohensee, Paul
Hi Paul,

Probably best to also include the GC mailing list here.

David

On 24/10/2017 2:31 AM, Hohensee, Paul wrote:

> I’d like to solicit opinions on reporting GC pause duration
> (stopped-world pause time) via JMX. This info would be useful in
> figuring out whether or not GC pause times are factors in failing to
> meet response time SLAs. The info is of course available directly from
> GC logs, but parsing logs is fraught and JMX doesn’t seem to report the
> equivalent info.
>
> GcInfo
>
> https://docs.oracle.com/javase/9/docs/api/com/sun/management/GcInfo.html
>
> has a getDuration() method which works fine for the non-concurrent
> collectors (since they’re STW), but for CMS and G1 it appears to report
> the duration of an entire concurrent cycle, which isn’t what I want. The
> number of STW pauses during a concurrent cycle varies by collector, so
> ideally there would be a method that reports cause (as a string) and
> duration for each STW pause. If that’s too much, perhaps the minimum
> might be a getMaxPauseDuration() method that reports the maximum pause
> duration of all the STW pauses that happen during a concurrent cycle.
>
> Relatedly, the full compacting GCs that happen as a result of CMS and G1
> concurrent mode failure aren’t reported separately from concurrent
> cycles. It would be useful to differentiate these from
> “ConcurrentMarkSweep” and “G1 Old Generation”. Perhaps add collector
> types to CMS and G1, vis. “MarkSweepCompact” (which already exists and
> is literally what’s executed by CMS) and a new “G1 MarkSweepCompact”
> collector for G1.
>
> If there’s a consensus that something should be done about either of
> these issues, I’d be happy to file RFE(s) and do the work.
>
> Thanks,
>
> Paul
>
Reply | Threaded
Open this post in threaded view
|

Re: GcInfo: longest pause duration?

Hohensee, Paul
In reply to this post by Jeremy Manson-4

+ hotspot-gc-dev.

 

I’d say that hsperfdata is the right place to put detailed collector-specific data, and JMX the place to put anything that might be generalized.

 

How far does anyone want to go with this? I took a look at Azul’s GC-related MXBean docs

 

http://docs.azul.com/zing/Zing_UserGuide/Zing_UserGuide/ZingMXBeans.htm

 

http://docs.azul.com/zing/ZingMXBeans_javadoc/

 

which has good ideas in it, e.g., PauseDetails and a data collection interval concept. My proposed getMaxPauseDuration() seems to be the same as the max of Azul’s GarbageCollectorMXBean.getLastGCDetails.getCollectorMaxPauseDetails().

 

A JEP might be in order. Opinions?

 

Thanks,

 

Paul

 

From: Jeremy Manson <[hidden email]>
Date: Monday, October 23, 2017 at 11:50 AM
To: "Hohensee, Paul" <[hidden email]>, Derek Thomson <[hidden email]>, Wessam Hassanein <[hidden email]>, Man Cao <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Subject: Re: GcInfo: longest pause duration?

 

Yes, this information is useful.

 

We've done a couple of things to get at it, which folks might find interesting.

 

1) We extended the hsperfdata to include a large number of stats for CMS.  Parsing hsperfdata is ridiculously easy, and hasn't changed in years, so we just parse it directly and report the values to monitoring tools.  I'm too lazy to retype what we export, so are the relevant CMS variables from our code; what they do should be relatively obvious:

 

    _concurrent_mode_failure_count =
        PerfDataManager::create_counter(NULL_NS, "concurrent-mode-failure-count",
                                        PerfData::U_Events, CHECK);
    _concurrent_mode_failure_due_to_fragmentation_count =
        PerfDataManager::create_counter(NULL_NS, "concurrent-mode-failure-due-to-fragmentation-count",
                                        PerfData::U_Events, CHECK);
    _concurrent_mode_failure_time =
        PerfDataManager::create_counter(NULL_NS, "concurrent-mode-failure-time",
                                        PerfData::U_Ticks, CHECK);
    _cms_initial_mark_count =
        PerfDataManager::create_counter(NULL_NS, "cms-initial-mark-count",
                                        PerfData::U_Events, CHECK);
    _cms_initial_mark_time =
        PerfDataManager::create_counter(NULL_NS, "cms-initial-mark-time",
                                        PerfData::U_Ticks, CHECK);
    _cms_remark_count =
        PerfDataManager::create_counter(NULL_NS, "cms-remark-count",
                                        PerfData::U_Events, CHECK);
    _cms_remark_time =
        PerfDataManager::create_counter(NULL_NS, "cms-remark-time",
                                        PerfData::U_Ticks, CHECK);
    _cms_full_gc_count =
        PerfDataManager::create_counter(NULL_NS, "cms-full-gc-count",
                                        PerfData::U_Events, CHECK);
    _cms_full_gc_time =
        PerfDataManager::create_counter(NULL_NS, "cms-full-gc-time",
                                        PerfData::U_Ticks, CHECK);
    _cms_compacting_full_gc_count =
        PerfDataManager::create_counter(NULL_NS, "cms-compacting-full-gc-count",
                                        PerfData::U_Events, CHECK);
    _cms_compacting_full_gc_time =
        PerfDataManager::create_counter(NULL_NS, "cms-compacting-full-gc-time",
                                        PerfData::U_Ticks, CHECK);
    _cms_noncompacting_full_gc_count =
        PerfDataManager::create_counter(NULL_NS, "cms-noncompacting-full-gc-count",
                                        PerfData::U_Events, CHECK);
    _cms_noncompacting_full_gc_time =
        PerfDataManager::create_counter(NULL_NS, "cms-noncompacting-full-gc-time",
                                        PerfData::U_Ticks, CHECK);
    _cms_conc_mark_count =
        PerfDataManager::create_counter(NULL_NS, "cms-conc-mark-count",
                                        PerfData::U_Events, CHECK);
    _cms_conc_mark_time =
        PerfDataManager::create_counter(NULL_NS, "cms-conc-mark-time",
                                        PerfData::U_Ticks, CHECK);
    _cms_conc_preclean_count =
        PerfDataManager::create_counter(NULL_NS, "cms-conc-preclean-count",
                                        PerfData::U_Events, CHECK);
    _cms_conc_preclean_time =
        PerfDataManager::create_counter(NULL_NS, "cms-conc-preclean-time",
                                        PerfData::U_Ticks, CHECK);
    _cms_conc_abortable_preclean_count =
        PerfDataManager::create_counter(NULL_NS, "cms-conc-abortable-preclean-count",
                                        PerfData::U_Events, CHECK);
    _cms_conc_abortable_preclean_time =
        PerfDataManager::create_counter(NULL_NS, "cms-conc-abortable-preclean-time",
                                        PerfData::U_Ticks, CHECK);
    _cms_conc_sweep_count =
        PerfDataManager::create_counter(NULL_NS, "cms-conc-sweep-count",
                                        PerfData::U_Events, CHECK);
    _cms_conc_sweep_time =
        PerfDataManager::create_counter(NULL_NS, "cms-conc-sweep-time",
                                        PerfData::U_Ticks, CHECK);
    _cms_conc_reset_count =
        PerfDataManager::create_counter(NULL_NS, "cms-conc-reset-count",
                                        PerfData::U_Events, CHECK);
    _cms_conc_reset_time =
        PerfDataManager::create_counter(NULL_NS, "cms-conc-reset-time",
                                        PerfData::U_Ticks, CHECK);
    _cms_collection_count =
        PerfDataManager::create_counter(NULL_NS, "cms-collection-count",
                                        PerfData::U_Events, CHECK);
    _cms_collection_time =
        PerfDataManager::create_counter(NULL_NS, "cms-collection-time",
                                        PerfData::U_Ticks, CHECK);
    _cms_ref_proc_count =
        PerfDataManager::create_counter(NULL_NS, "cms-ref-proc-count",
                                        PerfData::U_Events, CHECK);
    _cms_ref_proc_time =
        PerfDataManager::create_counter(NULL_NS, "cms-ref-proc-time",
                                        PerfData::U_Ticks, CHECK);
 
    _yg_alloc_bytes =
        PerfDataManager::create_variable(NULL_NS, "jvm-gc-cms-young-gen-alloc-bytes",
                                         PerfData::U_None, CHECK);
    _yg_promo_bytes =
        PerfDataManager::create_variable(NULL_NS, "jvm-gc-cms-young-gen-promo-bytes",
                                         PerfData::U_None, CHECK);
    _og_direct_alloc_bytes =
        PerfDataManager::create_variable(NULL_NS, "jvm-gc-cms-old-gen-directalloc-bytes",
                                         PerfData::U_None, CHECK);
    _og_alloc_bytes =
        PerfDataManager::create_variable(NULL_NS, "jvm-gc-cms-old-gen-alloc-bytes",
                                         PerfData::U_None, CHECK);
    _total_alloc_bytes =
        PerfDataManager::create_variable(NULL_NS, "jvm-gc-cms-total-alloc-bytes",
                                         PerfData::U_None, CHECK);

 

 

2) We also implemented an API built on JFR's handling of GC events.  We use it to export the GC history to monitoring tools in a format that doesn't involve parsing logs that can change.

(Prior to JFR, we had a bespoke implementation where we instrumented everywhere a log happened, but we fortunately don't have to do that anymore.)

 

We'd love both of these things to be something that was more generally usable.

 

Jeremy

 

On Mon, Oct 23, 2017 at 9:31 AM, Hohensee, Paul <[hidden email]> wrote:

I’d like to solicit opinions on reporting GC pause duration (stopped-world pause time) via JMX. This info would be useful in figuring out whether or not GC pause times are factors in failing to meet response time SLAs. The info is of course available directly from GC logs, but parsing logs is fraught and JMX doesn’t seem to report the equivalent info.

 

GcInfo

 

https://docs.oracle.com/javase/9/docs/api/com/sun/management/GcInfo.html

 

has a getDuration() method which works fine for the non-concurrent collectors (since they’re STW), but for CMS and G1 it appears to report the duration of an entire concurrent cycle, which isn’t what I want. The number of STW pauses during a concurrent cycle varies by collector, so ideally there would be a method that reports cause (as a string) and duration for each STW pause. If that’s too much, perhaps the minimum might be a getMaxPauseDuration() method that reports the maximum pause duration of all the STW pauses that happen during a concurrent cycle.

 

Relatedly, the full compacting GCs that happen as a result of CMS and G1 concurrent mode failure aren’t reported separately from concurrent cycles. It would be useful to differentiate these from “ConcurrentMarkSweep” and “G1 Old Generation”. Perhaps add collector types to CMS and G1, vis. “MarkSweepCompact” (which already exists and is literally what’s executed by CMS) and a new “G1 MarkSweepCompact” collector for G1.

 

If there’s a consensus that something should be done about either of these issues, I’d be happy to file RFE(s) and do the work.

 

Thanks,

 

Paul