Re: Low-Overhead Heap Profiling

Re: Low-Overhead Heap Profiling

Robbin Ehn
Hi,

To me the compiler changes look as expected.
It would be good if someone from compiler could take a look at that.
Added compiler to mail thread.

Also adding Serguei; it would be good to have his view as well.

My initial take on it: I read through most of the code and took it for a ride.

##############################
- Regarding the compiler changes: I think we need the 'TLAB end' trickery (mentioned by Tony P)
instead of a separate check for sampling in fast path for the final version.
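For readers unfamiliar with the trick Tony P mentioned, here is a rough, self-contained simulation of the idea (hypothetical names, not the HotSpot code): the TLAB end pointer handed to compiled code is lowered to the next sample point, so the existing fast-path end check doubles as the sampling check.

```cpp
#include <cstddef>

// Simulation of the 'TLAB end' trick: instead of a separate
// "should we sample?" branch in the allocation fast path, the
// TLAB end seen by compiled code is lowered to the next sample
// point. The existing end check then fails there and falls into
// the slow path, which records the sample and re-arms the end.
struct FakeTlab {
  size_t top;          // bump pointer, in bytes
  size_t actual_end;   // real end of the TLAB
  size_t sampled_end;  // end as seen by the fast path
  size_t samples;      // samples taken so far
};

static void set_sample_point(FakeTlab* t, size_t interval) {
  size_t next = t->top + interval;
  t->sampled_end = next < t->actual_end ? next : t->actual_end;
}

// Fast path: the one compare compiled code already performs.
static bool fast_alloc(FakeTlab* t, size_t size) {
  if (t->top + size > t->sampled_end) return false;  // go slow
  t->top += size;
  return true;
}

// Slow path: record the sample and re-arm the next sample point.
static void slow_alloc(FakeTlab* t, size_t size, size_t interval) {
  t->samples++;
  t->top += size;  // assume it still fits in the real TLAB
  set_sample_point(t, interval);
}

// Returns how many of 'allocs' fixed-size allocations got sampled.
size_t run_sim(size_t allocs, size_t size, size_t interval) {
  FakeTlab t = {0, allocs * size, 0, 0};
  set_sample_point(&t, interval);
  for (size_t i = 0; i < allocs; i++) {
    if (!fast_alloc(&t, size)) slow_alloc(&t, size, interval);
  }
  return t.samples;
}
```

With 16-byte allocations and a 160-byte interval, ten fast allocations fill the interval and the eleventh trips the lowered end, so roughly one allocation in eleven is sampled, with zero extra branches on the fast path.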

##############################
- This patch I had to apply to get it to compile on JDK 10:

diff -r ac3ded340b35 src/share/vm/gc/shared/collectedHeap.inline.hpp
--- a/src/share/vm/gc/shared/collectedHeap.inline.hpp Fri Apr 28 14:31:38 2017 +0200
+++ b/src/share/vm/gc/shared/collectedHeap.inline.hpp Thu May 04 10:22:56 2017 +0200
@@ -87,3 +87,3 @@
      // support for object alloc event (no-op most of the time)
-    if (klass() != NULL && klass()->name() != NULL) {
+    if (klass != NULL && klass->name() != NULL) {
        Thread *base_thread = Thread::current();
diff -r ac3ded340b35 src/share/vm/runtime/heapMonitoring.cpp
--- a/src/share/vm/runtime/heapMonitoring.cpp Fri Apr 28 14:31:38 2017 +0200
+++ b/src/share/vm/runtime/heapMonitoring.cpp Thu May 04 10:22:56 2017 +0200
@@ -316,3 +316,3 @@
    JavaThread *thread = reinterpret_cast<JavaThread *>(Thread::current());
-  assert(o->size() << LogHeapWordSize == byte_size,
+  assert(o->size() << LogHeapWordSize == (long)byte_size,
           "Object size is incorrect.");

##############################
- This patch I had to apply to get it not asserting during slowdebug:

--- a/src/share/vm/runtime/heapMonitoring.cpp Fri Apr 28 15:15:16 2017 +0200
+++ b/src/share/vm/runtime/heapMonitoring.cpp Thu May 04 10:24:25 2017 +0200
@@ -32,3 +32,3 @@
  // TODO(jcbeyler): should we make this into a JVMTI structure?
-struct StackTraceData {
+struct StackTraceData : CHeapObj<mtInternal> {
    ASGCT_CallTrace *trace;
@@ -143,3 +143,2 @@
  StackTraceStorage::StackTraceStorage() :
-    _allocated_traces(new StackTraceData*[MaxHeapTraces]),
      _allocated_traces_size(MaxHeapTraces),
@@ -147,2 +146,3 @@
      _allocated_count(0) {
+  _allocated_traces = NEW_C_HEAP_ARRAY(StackTraceData*, MaxHeapTraces, mtInternal);
    memset(_allocated_traces, 0, sizeof(*_allocated_traces) * MaxHeapTraces);
@@ -152,3 +152,3 @@
  StackTraceStorage::~StackTraceStorage() {
-  delete[] _allocated_traces;
+  FREE_C_HEAP_ARRAY(StackTraceData*, _allocated_traces);
  }

- Classes should extend the correct base class for the type of memory used for them, e.g.: CHeapObj<mt????>, StackObj or AllStatic
- The style in heapMonitoring.cpp is a bit different from normal VM style, e.g. using C++ casts instead of C casts. You mix NEW_C_HEAP_ARRAY, os::malloc and new.
- In jvmtiHeapTransition.hpp you use C casts instead.

##############################
- This patch I had to apply to get traces without setting an ‘unrelated’ capability
- Should this not be a new capability?

diff -r c02a5d8785bf src/share/vm/prims/forte.cpp
--- a/src/share/vm/prims/forte.cpp Fri Apr 28 15:15:16 2017 +0200
+++ b/src/share/vm/prims/forte.cpp Thu May 04 10:24:25 2017 +0200
@@ -530,6 +530,6 @@

-  if (!JvmtiExport::should_post_class_load()) {
+/*  if (!JvmtiExport::should_post_class_load()) {
      trace->num_frames = ticks_no_class_load; // -1
      return;
-  }
+  }*/

##############################
- forte.cpp: (I know this is not part of your changes, but)
find_jmethod_id_or_null gives me NULL for my test.
It looks like we actually want the regular jmethod_id()?

Since we are the thread we are talking about (and in the same ucontext), and the thread is in the VM and has a last Java frame,
I think most of the checks done in AsyncGetCallTrace are irrelevant, so you should be able to call forte_fill_call_trace_given_top directly.
But since we might need jmethod_id() to avoid getting a NULL method id,
we need some fixes in the forte code, or we could just do the vframeStream loop inside heapMonitoring.cpp and not use forte.cpp.

Something like:

   if (jthread->has_last_Java_frame()) { // just to be safe
     vframeStream vfst(jthread);
     while (!vfst.at_end()) {
       Method* m = vfst.method();
       m->jmethod_id();
       m->line_number_from_bci(vfst.bci());
       vfst.next();
     }
   }

- This is a bit confusing in forte.cpp: trace->frames[count].lineno = bci.
The line number should be m->line_number_from_bci(bci);
Is heapMonitoring supposed to trace with the bci or the line number?
I would say the bci, meaning we should either rename the ASGCT_CallFrame lineno field or use another data structure with a field that says bci.

##############################
- // TODO(jcbeyler): remove this extra code handling the extra trace for
Please fix all these TODO's :)

##############################
- heapMonitoring.hpp:
// TODO(jcbeyler): is this algorithm acceptable in open source?

Why is this comment here? What is the implication?
Have you tested any simpler algorithm?
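The reply later in the thread identifies this as a known fast-log algorithm, presumably used to pick sample intervals. As an illustration only (an assumption, not the patch's actual code), samplers of this kind commonly draw each interval from an exponential distribution, which is where a log computation on the allocation path comes from:

```cpp
#include <cmath>
#include <cstddef>
#include <random>

// Sketch (an assumption, not the patch's actual code) of how such
// samplers typically pick the next sample interval: draw it from
// an exponential distribution with the configured mean (e.g. 512 KB),
// so that every allocated byte is equally likely to be the sampled
// one and fixed allocation strides cannot alias with the period.
// A "fast log" would replace std::log with a table or bit-trick
// approximation to keep this cheap.
size_t next_sample_interval(std::mt19937* rng, double mean_bytes) {
  // Uniform value in (0, 1); construction avoids log(0).
  double u = (static_cast<double>((*rng)()) + 1.0) /
             (static_cast<double>(std::mt19937::max()) + 2.0);
  double interval = -std::log(u) * mean_bytes;
  return interval < 1.0 ? 1 : static_cast<size_t>(interval);
}
```

Averaged over many draws, the intervals converge to the configured mean, so the sampling rate stays at the requested value while individual intervals are randomized.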

##############################
- Create a sanity jtreg test. (./hotspot/make/test/JtregNative.gmk for building the agent)

##############################
- monitoring_period vs HeapMonitorRate, pick rate or period.

##############################
- globals.hpp
Why is MaxHeapTraces not settable/overridable from the JVMTI interface? That would be handy.

##############################
- jvmtiStackTraceData + ASGCT_CallFrame memory
Is the agent supposed to loop through and free all the ASGCT_CallFrame arrays?
Wouldn't it be better with some kind of protocol, like:
(*jvmti)->GetLiveTraces(jvmti, &stack_traces, &num_traces);
(*jvmti)->ReleaseTraces(jvmti, stack_traces, num_traces);

Also, using another data structure that has num_traces inside it simplifies things.
So I'm not convinced using the async structure is the best way forward.
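To make the ownership rule of the suggested protocol concrete, here is a minimal mock (all names and layouts are hypothetical stand-ins, not a real JVMTI extension): the VM allocates one container that carries its own count, and the agent hands the whole thing back instead of freeing frame arrays one by one.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical structures standing in for the proposed API shape.
struct MockFrame  { int bci; void* method_id; };
struct MockTrace  { MockFrame* frames; int frame_count; };
struct MockTraces { MockTrace* traces; int num_traces; };  // count lives inside

// "VM side": allocate all traces and frames behind one handle.
void GetLiveTraces(MockTraces* out) {
  out->num_traces = 2;
  out->traces = static_cast<MockTrace*>(calloc(2, sizeof(MockTrace)));
  for (int i = 0; i < out->num_traces; i++) {
    out->traces[i].frame_count = 3;
    out->traces[i].frames =
        static_cast<MockFrame*>(calloc(3, sizeof(MockFrame)));
  }
}

// "VM side": free everything the matching Get allocated, so the
// agent never walks the frame arrays to release them itself.
void ReleaseTraces(MockTraces* t) {
  for (int i = 0; i < t->num_traces; i++) {
    free(t->traces[i].frames);
  }
  free(t->traces);
  t->traces = nullptr;
  t->num_traces = 0;
}
```

The agent then pairs every Get call with exactly one Release call and never needs to know how the frames were allocated internally.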


I have more questions, but I think it's better if you respond and update the code first.

Thanks!

/Robbin


On 04/21/2017 11:34 PM, JC Beyler wrote:

> Hi all,
>
> I've added size information to the allocation sampling system. This allows the callback to remember the size of each sampled allocation.
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.01/
>
> The new webrev.01 also adds the actual heap monitoring sampling system in files:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.01/src/share/vm/runtime/heapMonitoring.cpp.patch
> and
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.01/src/share/vm/runtime/heapMonitoring.hpp.patch
>
> My next step is to add the GC part to the webrev, which will allow users to determine what objects are live and what are garbage.
>
> Thanks for your attention and let me know if there are any questions!
>
> Have a wonderful Friday!
> Jc
>
> On Mon, Apr 17, 2017 at 12:37 PM, JC Beyler <[hidden email] <mailto:[hidden email]>> wrote:
>
>     Hi all,
>
>     I worked on getting a few numbers for overhead and accuracy for my feature. I'm unsure if here is the right place to provide the full data, so I am just summarizing
>     here for now.
>
>     - Overhead of the feature
>
>     Using the Dacapo benchmark (http://dacapobench.org/), my initial results are that sampling adds about 2.4% overhead with a 512k sampling interval, 512k being our default setting.
>
>     - Note: this was without the tradesoap, tradebeans and tomcat benchmarks since they did not work with my JDK9 (issue between Dacapo and JDK9 it seems)
>     - I want to rerun next week to ensure number stability
>
>     - Accuracy of the feature
>
>     I wrote a small microbenchmark that allocates from two different stacktraces at a given ratio. For example, 10% of stacktrace S1 and 90% from stacktrace S2. The
>     microbenchmark was run 20 times, I averaged the results and looked for accuracy. It seems that statistically it is sound since if I allocated 10% S1 and 90% S2, with a
>     sampling rate of 512k, I obtained 9.61% S1 and 90.49% S2.
>
>     Let me know if there are any questions on the numbers and if you'd like to see some more data.
>
>     Note: this was done using our internal JDK8 implementation since the webrev provided by http://cr.openjdk.java.net/~rasbold/heapz/webrev.00/index.html
>     <http://cr.openjdk.java.net/~rasbold/heapz/webrev.00/index.html> does not yet contain the whole implementation and therefore would have been misleading.
>
>     Thanks,
>     Jc
>
>
>     On Tue, Apr 4, 2017 at 3:55 PM, JC Beyler <[hidden email] <mailto:[hidden email]>> wrote:
>
>         Hi all,
>
>         To move the discussion forward, with Chuck Rasbold's help to make a webrev, we pushed this:
>         http://cr.openjdk.java.net/~rasbold/heapz/webrev.00/index.html <http://cr.openjdk.java.net/~rasbold/heapz/webrev.00/index.html>
>         415 lines changed: 399 ins; 13 del; 3 mod; 51122 unchg
>
>         This is not a final change that does the whole proposition from the JBS entry: https://bugs.openjdk.java.net/browse/JDK-8177374
>         <https://bugs.openjdk.java.net/browse/JDK-8177374>; what it does show is parts of the implementation that is proposed and hopefully can start the conversation going
>         as I work through the details.
>
>         For example, the changes to C2 are done here for the allocations: http://cr.openjdk.java.net/~rasbold/heapz/webrev.00/src/share/vm/opto/macro.cpp.patch
>         <http://cr.openjdk.java.net/~rasbold/heapz/webrev.00/src/share/vm/opto/macro.cpp.patch>
>
>         Hopefully this all makes sense and thank you for all your future comments!
>         Jc
>
>
>         On Tue, Dec 13, 2016 at 1:11 PM, JC Beyler <[hidden email] <mailto:[hidden email]>> wrote:
>
>             Hello all,
>
>             This is a follow-up from Jeremy's initial email from last year:
>             http://mail.openjdk.java.net/pipermail/serviceability-dev/2015-June/017543.html <http://mail.openjdk.java.net/pipermail/serviceability-dev/2015-June/017543.html>
>
>             I've gone ahead and started working on preparing this and Jeremy and I went down the route of actually writing it up in JEP form:
>             https://bugs.openjdk.java.net/browse/JDK-8171119
>
>             I think original conversation that happened last year in that thread still holds true:
>
>               - We have a patch at Google that we think others might be interested in
>                  - It provides a means to understand where the allocation hotspots are at a very low overhead
>                  - Since it is at a low overhead, we can leave it on by default
>
>             So I come to the mailing list with Jeremy's initial question:
>             "I thought I would ask if there is any interest / if I should write a JEP / if I should just forget it."
>
>             A year ago, it seemed some thought it was a good idea, is this still true?
>
>             Thanks,
>             Jc
>
>
>
>
>

Re: Low-Overhead Heap Profiling

JC Beyler
So first off: thank you so much for your diligent look at the patch. A lot of the comments seem to be due either to me using the wrong baseline (I was using JDK9) or to the fact that I first worked on keeping things simple, so some things are still in flight, which is why they might seem unclear. I will try to rectify most of those in the next webrev.

I will say again that I really do appreciate the time you took to give all those comments. I was expecting a high level review at first and not an in-depth one :)

I'll work on the assumption that my code base is wrong and figure out how to get the right one tomorrow. I've inlined my comments which mostly are questions to help me fix the code.

On Thu, May 4, 2017 at 2:13 AM, Robbin Ehn <[hidden email]> wrote:
Hi,

To me the compiler changes look as expected.
It would be good if someone from compiler could take a look at that.
Added compiler to mail thread.

Also adding Serguei; it would be good to have his view as well.

My initial take on it: I read through most of the code and took it for a ride.

##############################
- Regarding the compiler changes: I think we need the 'TLAB end' trickery (mentioned by Tony P)
instead of a separate check for sampling in fast path for the final version.


Agreed. I am still working on assessing it, and the final version most likely will be TLAB-based if I can prove to myself and to you that the overhead is acceptable. My initial data and the patch are simple enough that it seems like it should work. I might send out a webrev with it just to show it so people can play with it and comment on it.
 
- For simplicity, I've removed your actual patches; most of them are needed, I believe, because the JDK I was basing off of was not the right one compared to the one you are using. If you have a second, could you provide a link or the process for getting the right Mercurial repositories set up? I'll set myself up the same way to align with you.

- I'll also double check slowdebug to ensure no problems.

- Classes should extend the correct base class for the type of memory used for them, e.g.: CHeapObj<mt????>, StackObj or AllStatic
- The style in heapMonitoring.cpp is a bit different from normal VM style, e.g. using C++ casts instead of C casts. You mix NEW_C_HEAP_ARRAY, os::malloc and new.
- In jvmtiHeapTransition.hpp you use C casts instead.


Noted. I thought I had cleaned up most of them, but seemingly not. I'll go and clean it up and align with what the surrounding code is doing, both in the runtime code and in the JVMTI code.

 
##############################
- This patch I had to apply to get traces without setting an ‘unrelated’ capability
- Should this not be a new capability?

diff -r c02a5d8785bf src/share/vm/prims/forte.cpp
--- a/src/share/vm/prims/forte.cpp      Fri Apr 28 15:15:16 2017 +0200
+++ b/src/share/vm/prims/forte.cpp      Thu May 04 10:24:25 2017 +0200
@@ -530,6 +530,6 @@

-  if (!JvmtiExport::should_post_class_load()) {
+/*  if (!JvmtiExport::should_post_class_load()) {
     trace->num_frames = ticks_no_class_load; // -1
     return;
-  }
+  }*/


So the way I've been testing all of this for now was by creating a Java agent that sets everything up. I was saving this conversation for when we were a bit further along. Since you have brought it up:
  caps.can_get_line_numbers = 1;
  caps.can_get_source_file_name = 1;
And I've added some callbacks too to help set things up: 
  ernum = global_jvmti->SetEventCallbacks(&callbacks, sizeof(jvmtiEventCallbacks));
  ernum = global_jvmti->SetEventNotificationMode(JVMTI_ENABLE, JVMTI_EVENT_VM_INIT, NULL);
  ernum = global_jvmti->SetEventNotificationMode(JVMTI_ENABLE, JVMTI_EVENT_CLASS_LOAD, NULL);
The class load callback allows me to force the creation of the jmethod ids for the class, which most likely explains the problem you hit here.
Quick question: does it make sense to put my agent code somewhere in the webrev for all to see how I've been calling into this?
I was waiting for things to settle a bit first, but I could do that easily. Is there a spot where that would make sense?

##############################
- forte.cpp: (I know this is not part of your changes, but)
find_jmethod_id_or_null gives me NULL for my test.
It looks like we actually want the regular jmethod_id()?

Since we are the thread we are talking about (and in the same ucontext), and the thread is in the VM and has a last Java frame,
I think most of the checks done in AsyncGetCallTrace are irrelevant, so you should be able to call forte_fill_call_trace_given_top directly.
But since we might need jmethod_id() to avoid getting a NULL method id,
we need some fixes in the forte code, or we could just do the vframeStream loop inside heapMonitoring.cpp and not use forte.cpp.

Something like:

  if (jthread->has_last_Java_frame()) { // just to be safe
    vframeStream vfst(jthread);
    while (!vfst.at_end()) {
      Method* m = vfst.method();
      m->jmethod_id();
      m->line_number_from_bci(vfst.bci());
      vfst.next();
    }
  }

I have no idea about this and will have to try :). I am open to any solution that will work, and this approach seems to bypass forte.cpp. We have a few changes in forte.cpp that I have not included here that help get more frames than the current implementation allows. Let me test this and see if it works :)
 

- This is a bit confusing in forte.cpp: trace->frames[count].lineno = bci.
The line number should be m->line_number_from_bci(bci);
Is heapMonitoring supposed to trace with the bci or the line number?
I would say the bci, meaning we should either rename the ASGCT_CallFrame lineno field or use another data structure with a field that says bci.

Agreed. I just did not want to create another structure straight away because I thought it would be confusing, and it seemed it might not be necessary. As for the bci instead of the line number: it is possible that, because this code has been ported over and over, we never cleaned it up or realized that a method for this was now available. Let me test it out and see how it works.
 

##############################
- // TODO(jcbeyler): remove this extra code handling the extra trace for
Please fix all these TODO's :)


Will do; they were my TODO trackers, and I sent the webrev out a bit ahead of cleaning them up. I will try to clean out most of them for the next webrev :)
 
##############################
- heapMonitoring.hpp:
// TODO(jcbeyler): is this algorithm acceptable in open source?

Why is this comment here? What is the implication?
Have you tested any simpler algorithm?

Yes, sorry, this is me being paranoid and trying to figure out whether the licenses are compatible. This implementation exists in the open source world, but I wanted to double check via this TODO that the original code's license and the JDK's are compatible before bringing this in.

From what I've seen, this is a known fast log implementation/algorithm and the license seems fine, so I believe it is, but I still need to do the due diligence. I will double check and ensure it is fine :)
 

##############################
- Create a sanity jtreg test. (./hotspot/make/test/JtregNative.gmk for building the agent)

Sounds good, I will look at how to do that :)
 

##############################
- monitoring_period vs HeapMonitorRate, pick rate or period.

Agreed.
 

##############################
- globals.hpp
Why is MaxHeapTraces not settable/overridable from jvmti interface? That would be handy.

Just to be clear and complete: MaxHeapTraces will actually disappear in a future webrev:
  1) It was here only to show how the whole system will look, especially the general JVMTI interface, the compiler changes, and the hooks
  2) In practice, we keep all live sampled objects and free up the list via the GC events for sampled objects
  3) In our implementation, and in the final one that would be shown here:
      - MaxHeapTraces disappears, but we would have the analogous MaxGarbageHeapTraces
      - That could use a flag to be settable/overridable
 

##############################
- jvmtiStackTraceData + ASGCT_CallFrame memory
Is the agent supposed to loop through and free all the ASGCT_CallFrame arrays?
Wouldn't it be better with some kind of protocol, like:
(*jvmti)->GetLiveTraces(jvmti, &stack_traces, &num_traces);
(*jvmti)->ReleaseTraces(jvmti, stack_traces, num_traces);

I will change it as you suggest. Our implementation did do the free in the agent, but in the general case your solution is more logical.
 

Also, using another data structure that has num_traces inside it simplifies things.
So I'm not convinced using the async structure is the best way forward.

I'm not yet sure about leaving the async structure entirely; I will double check on that before committing to it. I will let you know.
 


I have more questions, but I think it's better if you respond and update the code first.

I will work on the code changes and adapt to all your suggestions here. Hopefully I've at least partially answered most of them; if you have more, please feel free to ask them now :) Or, as you said, wait until I send my next version out if you think we should first iterate on this before opening other questions and conversations!

Thank you again for taking the time to do such a thorough review and I hope I have answered the initial questions well!
Jc
 

Thanks!

/Robbin




Re: Low-Overhead Heap Profiling

JC Beyler
In reply to this post by Robbin Ehn
Dear all,

I have a new webrev up:

This webrev has, I hope, fixed a lot of the comments from Robbin:
  - The casts are now normally all C++ style
  - Moved this to jdk10-hs
     - I have not tested slowdebug yet; hopefully it does not break there
  - Added the garbage collection system:
     - Live sampled allocations are now tracked throughout their lifetime
        - When a GC happens, the sampled allocation information is moved to two lists: the recent and frequent GC lists
           - Those lists use the array system that the live objects were using before but have different re-use strategies
     - Added the JVMTI API for them via GetFrequentGarbageTraces and GetGarbageTraces
        - Both use the same JVMTI structures
     - Added the calls to them in the test, though I've kept that test simple for now:
        - As I write this, I notice my webrev is missing a final change I made to the test that calls the same ReleaseTraces for each live/garbage/frequent structure. This is updated in my local repo and will be in the next webrev.

Next steps for this work are:
   - Putting in the TLAB implementation (yes, not forgotten ;-))
   - Adding more testing and separating out the current test system to check things a bit more thoroughly
   - Have not tried to circumvent AsyncGetCallTrace yet
   - Still have to double check the stack walker a bit more

Happy webrev perusal!
Jc


On Tue, May 16, 2017 at 5:20 AM, Robbin Ehn <[hidden email]> wrote:
Just a few answers,

On 05/15/2017 06:48 PM, JC Beyler wrote:
Dear all,

I've updated the webrev to:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.02/

I'll look at this later, thanks!


Robbin,
I believe I have addressed most of your items with webrev 02:
   - I added a JTreg test to show how it works:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.02/raw_files/new/test/serviceability/jvmti/HeapMonitor/libHeapMonitor.c
  - I've modified the code to use its own data structures both internally and externally; this will make it easier to move away from AsyncGetCallTrace as we go forward, which is still on my TODO list
  - I cleaned up the JVMTI API by passing a structure that handles the num_traces and put in a ReleaseTraces as well
  - I cleaned up other issues as well.

However, I have three questions, which are probably because I'm new in this community:
  1) My previous webrevs were based off of JDK9 by mistake. When I took JDK10 via: hg clone http://hg.openjdk.java.net/jdk10/jdk10 jdk10
      - I don't see code compatible with what you were showing (i.e. your patches don't make sense for that code base; e.g. klass is still accessed via klass() in collectedHeap.inline.hpp)
      - Would you know the right hg clone command so we are working on the same code base?

We use jdk10-hs, e.g.
hg tclone http://hg.openjdk.java.net/jdk10/hs 10-hs

There are sporadic big merges going from jdk9->jdk10->jdk10-hs and jdk10-hs->jdk10, so 10 is moving...


  2) You mentioned I was using os::malloc, new and NEW_C_HEAP_ARRAY; I cleaned out the os::malloc, but which of new vs NEW_C_HEAP_ARRAY should I use? It might be that I don't understand when one uses one or the other, but I see both used around the code base.
    - Is it that new is to be used for anything internal and NEW_C_HEAP_ARRAY for anything provided to JVMTI users outside of the JVM?

We overload the new operator when you extend the correct base class, e.g. CHeapObj<mtInternal>, so use 'new'.
But for arrays you will need the macro NEW_C_HEAP_ARRAY.
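As a rough illustration of that answer (a mock, not HotSpot's real CHeapObj, which additionally tags allocations for native memory tracking), the base class overloads operator new/delete, so a plain 'new' on a subclass routes through the tracked C-heap allocator:

```cpp
#include <cstddef>
#include <cstdlib>

// Mock stand-in for the NMT accounting the real base class feeds.
static size_t g_cheap_bytes = 0;

// Mock stand-in for CHeapObj<mtInternal>: the template parameter
// plays the role of the memory-tag flag, and operator new/delete
// route allocations through the "tracked" C heap. Arrays still need
// the NEW_C_HEAP_ARRAY macro because the tag must be attached to a
// raw allocation, which new[] cannot do.
template <int MEM_TAG>
class MockCHeapObj {
 public:
  void* operator new(size_t size) {
    g_cheap_bytes += size;  // tracked C-heap allocation
    return malloc(size);
  }
  void operator delete(void* p) { free(p); }
};

// Analogous to 'struct StackTraceData : CHeapObj<mtInternal>'.
struct MockStackTraceData : public MockCHeapObj<1> {
  int frame_count;
};
```

Once the class extends the base, call sites simply write `new MockStackTraceData()` and the right allocator is used without any change at the call site.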


  3) Casts: same kind of question: which should I use? The code was using a bit of everything; I'll refactor it entirely, but I was not clear whether I should go with C casts or C++ casts as I see both in the code base. What is the convention I should use?

Just be consistent; use what suits you. C++ casts might be preferable, if we are moving towards C++11.
And use the 'right' cast, e.g. going from Thread* to JavaThread* you should use a C cast or static_cast, not reinterpret_cast, I would say.
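A toy example of the 'right cast' point, using mock classes in place of the real Thread hierarchy: static_cast performs a compile-time-checked downcast that adjusts the pointer correctly through the hierarchy, whereas reinterpret_cast would silently accept a conversion between unrelated types.

```cpp
// Mock classes standing in for Thread/JavaThread (not the real ones).
class MockThread {
 public:
  virtual ~MockThread() {}
  int thread_kind = 0;
};

class MockJavaThread : public MockThread {
 public:
  MockJavaThread() { thread_kind = 1; }
  int java_data = 42;
};

// Caller guarantees 't' really is a MockJavaThread, just as the VM
// knows the current thread is a Java thread at such call sites.
// static_cast is the 'right' cast here: the compiler verifies the
// two types are related; reinterpret_cast would not.
int java_data_of(MockThread* t) {
  MockJavaThread* jt = static_cast<MockJavaThread*>(t);
  return jt->java_data;
}
```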


Final notes on this webrev:
   - I am still missing:
     - Putting a TLAB implementation so that we can compare both webrevs
     - Have not tried to circumvent AsyncGetCallTrace
     - Putting in the handling of GC'd objects
     - Fix a stack walker issue I have seen; I think I know the problem and will test that theory out for the next webrev

I will work on integrating those items for the next webrev!

Thanks!


Thanks for your help,
Jc

Ps:  I tested this on a new repo:

hg clone http://hg.openjdk.java.net/jdk10/jdk10 jdk10
... building it
cd test
jtreg -nativepath:<path-to-jdk10>/build/linux-x86_64-normal-server-release/support/test/hotspot/jtreg/native/lib/ -jdk <path-to-jdk10>/linux-x86_64-normal-server-release/images/jdk ../hotspot/test/serviceability/jvmti/HeapMonitor/


I'll test it out!

/Robbin



On Thu, May 4, 2017 at 11:21 PM, [hidden email] wrote:

    Robbin,

    Thank you for forwarding!
    I will review it.

    Thanks,
    Serguei



    On 5/4/17 02:13, Robbin Ehn wrote:

        Hi,

        To me the compiler changes looks what is expected.
        It would be good if someone from compiler could take a look at that.
        Added compiler to mail thread.

        Also adding Serguei, It would be good with his view also.

        My initial take on it, read through most of the code and took it for a ride.

        ##############################
        - Regarding the compiler changes: I think we need the 'TLAB end' trickery (mentioned by Tony P)
        instead of a separate check for sampling in fast path for the final version.

        ##############################
        - This patch I had to apply to get it compile on JDK 10:

        diff -r ac3ded340b35 src/share/vm/gc/shared/collectedHeap.inline.hpp
        --- a/src/share/vm/gc/shared/collectedHeap.inline.hpp    Fri Apr 28 14:31:38 2017 +0200
        +++ b/src/share/vm/gc/shared/collectedHeap.inline.hpp    Thu May 04 10:22:56 2017 +0200
        @@ -87,3 +87,3 @@
              // support for object alloc event (no-op most of the time)
        -    if (klass() != NULL && klass()->name() != NULL) {
        +    if (klass != NULL && klass->name() != NULL) {
                Thread *base_thread = Thread::current();
        diff -r ac3ded340b35 src/share/vm/runtime/heapMonitoring.cpp
        --- a/src/share/vm/runtime/heapMonitoring.cpp    Fri Apr 28 14:31:38 2017 +0200
        +++ b/src/share/vm/runtime/heapMonitoring.cpp    Thu May 04 10:22:56 2017 +0200
        @@ -316,3 +316,3 @@
            JavaThread *thread = reinterpret_cast<JavaThread *>(Thread::current());
        -  assert(o->size() << LogHeapWordSize == byte_size,
        +  assert(o->size() << LogHeapWordSize == (long)byte_size,
                   "Object size is incorrect.");

        ##############################
        - This patch I had to apply to get it not asserting during slowdebug:

        --- a/src/share/vm/runtime/heapMonitoring.cpp    Fri Apr 28 15:15:16 2017 +0200
        +++ b/src/share/vm/runtime/heapMonitoring.cpp    Thu May 04 10:24:25 2017 +0200
        @@ -32,3 +32,3 @@
          // TODO(jcbeyler): should we make this into a JVMTI structure?
        -struct StackTraceData {
        +struct StackTraceData : CHeapObj<mtInternal> {
            ASGCT_CallTrace *trace;
        @@ -143,3 +143,2 @@
          StackTraceStorage::StackTraceStorage() :
        -    _allocated_traces(new StackTraceData*[MaxHeapTraces]),
              _allocated_traces_size(MaxHeapTraces),
        @@ -147,2 +146,3 @@
              _allocated_count(0) {
        +  _allocated_traces = NEW_C_HEAP_ARRAY(StackTraceData*, MaxHeapTraces, mtInternal);
            memset(_allocated_traces, 0, sizeof(*_allocated_traces) * MaxHeapTraces);
        @@ -152,3 +152,3 @@
          StackTraceStorage::~StackTraceStorage() {
        -  delete[] _allocated_traces;
        +  FREE_C_HEAP_ARRAY(StackTraceData*, _allocated_traces);
          }

        - Classes should extend the correct base class for the type of memory used for them, e.g. CHeapObj<mt????>, StackObj, or AllStatic.
        - The style in heapMonitoring.cpp is a bit different from normal VM style, e.g. using C++ casts instead of C casts. You mix NEW_C_HEAP_ARRAY, os::malloc, and new.
        - In jvmtiHeapTransition.hpp you use a C cast instead.

        ##############################
        - This patch I had to apply to get traces without setting an ‘unrelated’ capability
        - Should this not be a new capability?

        diff -r c02a5d8785bf src/share/vm/prims/forte.cpp
        --- a/src/share/vm/prims/forte.cpp    Fri Apr 28 15:15:16 2017 +0200
        +++ b/src/share/vm/prims/forte.cpp    Thu May 04 10:24:25 2017 +0200
        @@ -530,6 +530,6 @@

        -  if (!JvmtiExport::should_post_class_load()) {
        +/*  if (!JvmtiExport::should_post_class_load()) {
              trace->num_frames = ticks_no_class_load; // -1
              return;
        -  }
        +  }*/

        ##############################
        - forte.cpp: (I know this is not part of your changes but)
        find_jmethod_id_or_null gives me NULL for my test.
        It looks like we actually want the regular jmethod_id()?

        Since we are the thread we are talking about (and in the same ucontext), and the thread is in the VM and has a last Java frame,
        I think most of the checks done in AsyncGetCallTrace are irrelevant, so you should be able to call forte_fill_call_trace_given_top directly.
        But since we might need jmethod_id() to avoid getting a NULL method id,
        we need some fixes in the forte code, or we could just do the vframeStream loop inside heapMonitoring.cpp and not use forte.cpp.

        Something like:

           if (jthread->has_last_Java_frame()) { // just to be safe
             vframeStream vfst(jthread);
             while (!vfst.at_end()) {
               Method* m = vfst.method();
               m->jmethod_id();
               m->line_number_from_bci(vfst.bci());
               vfst.next();
             }
           }

        - This is a bit confusing in forte.cpp: trace->frames[count].lineno = bci.
        The line number should be m->line_number_from_bci(bci);
        Is heapMonitoring supposed to trace with the bci or the line number?
        I would say the bci, meaning we should either rename ASGCT_CallFrame→lineno or use another data structure that says bci.

        ##############################
        - // TODO(jcbeyler): remove this extra code handling the extra trace for
        Please fix all these TODO's :)

        ##############################
        - heapMonitoring.hpp:
        // TODO(jcbeyler): is this algorithm acceptable in open source?

        Why is this comment here? What is the implication?
        Have you tested any simpler algorithm?

        ##############################
        - Create a sanity jtreg test. (./hotspot/make/test/JtregNative.gmk for building the agent)

        ##############################
        - monitoring_period vs HeapMonitorRate, pick rate or period.

        ##############################
        - globals.hpp
        Why is MaxHeapTraces not settable/overridable from the JVMTI interface? That would be handy.

        ##############################
        - jvmtiStackTraceData + ASGCT_CallFrame memory
        Is the agent supposed to loop through and free all the ASGCT_CallFrames?
        Wouldn't it be better with some kind of protocol, like:
        (*jvmti)->GetLiveTraces(jvmti, &stack_traces, &num_traces);
        (*jvmti)->ReleaseTraces(jvmti, stack_traces, num_traces);

        Also, using another data structure that has num_traces inside it simplifies things.
        So I'm not convinced using the async structure is the best way forward.


        I have more questions, but I think it's better if you respond and update the code first.

        Thanks!

        /Robbin


        On 04/21/2017 11:34 PM, JC Beyler wrote:

            Hi all,

            I've added size information to the allocation sampling system. This allows the callback to remember the size of each sampled allocation.
             http://cr.openjdk.java.net/~rasbold/8171119/webrev.01/

            The new webrev.01 also adds the actual heap monitoring sampling system in files:
             http://cr.openjdk.java.net/~rasbold/8171119/webrev.01/src/share/vm/runtime/heapMonitoring.cpp.patch
             and
             http://cr.openjdk.java.net/~rasbold/8171119/webrev.01/src/share/vm/runtime/heapMonitoring.hpp.patch

            My next step is to add the GC part to the webrev, which will allow users to determine what objects are live and what are garbage.

            Thanks for your attention and let me know if there are any questions!

            Have a wonderful Friday!
            Jc

             On Mon, Apr 17, 2017 at 12:37 PM, JC Beyler <[hidden email]> wrote:

                 Hi all,

                 I worked on getting a few numbers for overhead and accuracy for my feature. I'm unsure if here is the right place to provide the full data, so I am just
            summarizing
                 here for now.

                 - Overhead of the feature

                 I used the DaCapo benchmark suite (http://dacapobench.org/). My initial results are that sampling adds 2.4% overhead with a 512k sampling rate, 512k being our default setting.

                 - Note: this was without the tradesoap, tradebeans and tomcat benchmarks since they did not work with my JDK9 (issue between Dacapo and JDK9 it seems)
                 - I want to rerun next week to ensure number stability

                 - Accuracy of the feature

                 I wrote a small microbenchmark that allocates from two different stacktraces at a given ratio, for example 10% from stacktrace S1 and 90% from stacktrace S2.
                 The microbenchmark was run 20 times; I averaged the results and looked at the accuracy. It seems statistically sound: when I allocated 10% S1 and 90% S2,
                 with a sampling rate of 512k, I obtained 9.61% S1 and 90.49% S2.

                 Let me know if there are any questions on the numbers and if you'd like to see some more data.

                 Note: this was done using our internal JDK8 implementation, since the webrev provided by http://cr.openjdk.java.net/~rasbold/heapz/webrev.00/index.html
                 does not yet contain the whole implementation and therefore would have been misleading.

                 Thanks,
                 Jc


                 On Tue, Apr 4, 2017 at 3:55 PM, JC Beyler <[hidden email]> wrote:

                     Hi all,

                     To move the discussion forward, with Chuck Rasbold's help to make a webrev, we pushed this:
                     http://cr.openjdk.java.net/~rasbold/heapz/webrev.00/index.html
                     415 lines changed: 399 ins; 13 del; 3 mod; 51122 unchg

                     This is not a final change that implements the whole proposition from the JBS entry: https://bugs.openjdk.java.net/browse/JDK-8177374
                     What it does show is parts of the proposed implementation, which hopefully can get the conversation going as I work through the details.

                     For example, the changes to C2 for the allocations are done here: http://cr.openjdk.java.net/~rasbold/heapz/webrev.00/src/share/vm/opto/macro.cpp.patch

                     Hopefully this all makes sense and thank you for all your future comments!
                     Jc


                     On Tue, Dec 13, 2016 at 1:11 PM, JC Beyler <[hidden email]> wrote:

                         Hello all,

                         This is a follow-up from Jeremy's initial email from last year:
                         http://mail.openjdk.java.net/pipermail/serviceability-dev/2015-June/017543.html

                         I've gone ahead and started working on preparing this, and Jeremy and I went down the route of actually writing it up in JEP form:
                         https://bugs.openjdk.java.net/browse/JDK-8171119

                         I think the original conversation that happened last year in that thread still holds true:

                           - We have a patch at Google that we think others might be interested in
                              - It provides a means to understand where the allocation hotspots are at a very low overhead
                              - Since it is at a low overhead, we can leave it on by default

                         So I come to the mailing list with Jeremy's initial question:
                         "I thought I would ask if there is any interest / if I should write a JEP / if I should just forget it."

                         A year ago, it seemed some thought it was a good idea, is this still true?

                         Thanks,
                         Jc









Re: Low-Overhead Heap Profiling

serguei.spitsyn@oracle.com
Hi JC,

Some initial JVMTI-specific questions/comments.

I was not able to apply your patch as there are many merge conflicts.
Could you, please, re-base it on the latest JDK 10 sources?

I think it would make sense to introduce a new optional capability, something like:
   can_sample_heap_allocations

Do you consider this API to be used by a single agent, or is it a multi-agent feature?
What if one agent calls StartHeapSampling() but another calls one of the other functions?
The API spec needs to clarify this.

I'm not sure you had a chance to think carefully about which JVMTI phases are allowed for the new functions.


http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/src/share/vm/runtime/heapMonitoring.cpp.html

I do not see the variable HeapMonitoring::_enabled being checked in the functions:
    HeapMonitoring::get_live_traces(),
    HeapMonitoring::get_garbage_traces(),
    HeapMonitoring::get_frequent_garbage_traces(),
    HeapMonitoring::release_traces()

The multi-agent question from above applies here as well.


Thanks,
Serguei



On 6/21/17 13:45, JC Beyler wrote:
Hi all,

First off: Thanks again to Robbin and Thomas for their reviews :)

Next, I've uploaded a new webrev:

Here is an update:

- @Robbin, I forgot to say that yes I need to look at implementing this for the other architectures and testing it before it is all ready to go. Is it common to have it working on all possible combinations or is there a subset that I should be doing first and we can do the others later?
- I've tested slowdebug, built and ran the JTreg tests I wrote with slowdebug and fixed a few more issues
- I've refactored a bit of the code following Thomas' comments
   - I think I've handled all the comments from Thomas (I put comments inline below for the specifics)

The biggest addition to this webrev is more testing: I've added two tests looking specifically at the two garbage sampling data sets, Recent vs Frequent:
   and

   - Side question: I was looking at trying to make this a multi-file library so you would not have all the code in one spot. It seems this is not really done?
      - Is there a solution to this? What I really wanted was:
        - The core library that will help testing be easier
        - The JNI code for each Java class separate and calling into the core library

- Following Thomas' comments on statistics, I want to add some quality assurance tests and find that the easiest way would be to have a few counters of what is happening in the sampler and expose that to the user.
   - I'll be adding that in the next version if no one sees any objections to that.
   - This will allow me to add a sanity test in JTreg about number of samples and average of sampling rate

@Thomas: I had a few questions that I inlined below but I will summarize the "bigger ones" here:
   - You mentioned constants are not using the right conventions; I looked around and didn't see any convention other than normal naming for static constants. Is that right?
   - You mentioned putting the weak_oops_do in a separate method and logging, I inlined my answer below and would appreciate your feedback there too.

Thanks again for your reviews and patience!
Jc

PS: I've also inlined my answers to Thomas below:

On Tue, Jun 13, 2017 at 8:03 AM, Thomas Schatzl <[hidden email]> wrote:
Hi all,

On Mon, 2017-06-12 at 11:11 -0700, JC Beyler wrote:
> Dear all,
>
> I've continued working on this and have done the following webrev:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/

[...]
> Things I still need to do:
>    - Have to fix that TLAB case for the FastTLABRefill
>    - Have to start looking at the data to see that it is consistent
> and does gather the right samples, right frequency, etc.
>    - Have to check the GC elements and what that produces
>    - Run a slowdebug run and ensure I fixed all those issues you saw
> Robbin
>
> Thanks for looking at the webrev and have a great week!

  scratching a bit on the surface of this change, so apologies for
rather shallow comments:

- macroAssembler_x86.cpp:5604: while this is compiler code, and I am
not sure this is final, please avoid littering the code with TODO
remarks :) They tend to be candidates for later wtf moments only.

Just file a CR for that.


Newcomer question: what is a CR? Also, I'm not sure I have the rights to do that yet :)
 
- calling HeapMonitoring::do_weak_oops() (which should probably be
called weak_oops_do() like other similar methods) only if string
deduplication is enabled (in g1CollectedHeap.cpp:4511) seems wrong.

The call should be at least around 6 lines up outside the if.

Preferentially in a method like process_weak_jni_handles(), including
additional logging. (No new (G1) gc phase without minimal logging :)).

Done but really not sure because:

I put for logging:
  log_develop_trace(gc, freelist)("G1ConcRegionFreeing [other] : heap monitoring");

Since weak_jni_handles didn't have logging for me to take inspiration from, I did that, but I'm unconvinced this is what should be done.
 

Seeing the changes in referenceProcessor.cpp, you need to add the call to
HeapMonitoring::do_weak_oops() exactly similar to
process_weak_jni_handles() in case there is no reference processing
done.

- psParallelCompact.cpp:2172 similar to other places where the change
adds the call to HeapMonitoring::do_weak_oops(), remove the empty line.

Done from what I can tell in the whole webrev. (The only empty lines I still see are when I maintain an empty line, exception being http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/src/share/vm/gc/shared/collectedHeap.cpp.patch where I think it was "nicer")
 

- the change doubles the size of
CollectedHeap::allocate_from_tlab_slow() above the "small and nice"
threshold. Maybe it could be refactored a bit.

Done I think, it looks better to me :).
 

- referenceProcessor.cpp:40: please make sure that the includes are
sorted alphabetically (I did not check the other files yet).

Done and checked all other files normally.
 

- referenceProcessor.cpp:261: the change should add logging about the
number of references encountered, maybe after the corresponding "JNI
weak reference count" log message.

Just to double check: are you saying you'd like the heap sampler to keep track of how many sampled objects were encountered in HeapMonitoring::weak_oops_do?
   - Would returning the number of handled references from the method and logging that work?
   - Additionally, would you prefer it in a separate block with its GCTraceTime?


- threadLocalAllocBuffer.cpp:331: one more "TODO"

Removed it and added it to my personal todos to look at.
 

- threadLocalAllocBuffer.hpp: ThreadLocalAllocBuffer class
documentation should be updated about the sampling additions. I would
have no clue what the difference between "actual_end" and "end" would
be from the given information.

If you are talking about the comments in this file, I hope I made them clearer in the new webrev. If it was somewhere else, let me know where to change.
 

- threadLocalAllocBuffer.hpp:130: extra whitespace ;)

Fixed :)
 

- some copyrights :)


I think I fixed all the issues, if you see specific issues, let me know.
 
- in heapMonitoring.hpp: there are some random comments about some code
that has been grabbed from "util/math/fastmath.[h|cc]". I can't tell
whether this is code that can be used but I assume that Noam Shazeer is
okay with that (i.e. that's all Google code).

Jeremy and I double checked and we can release that as I thought. I removed the comment from that piece of code entirely.
 

- heapMonitoring.hpp/cpp static constant naming does not correspond to
Hotspot's. Additionally, in Hotspot static methods are cased like other
methods.

I think I fixed the methods to be cased the same way as all other methods. For static constants, I was not sure. I fixed a few other variables but I could not seem to really see a consistent trend for constants. I made them as variables but I'm not sure now.

 

- in heapMonitoring.cpp there are a few cryptic comments at the top
that seem to refer to internal stuff that should probably be removed.

Sorry about that! My personal todos not cleared out.
 

I did not think through the impact of the TLAB changes on collector
behavior yet (if there are). Also I did not check for problems with
concurrent mark and SATB/G1 (if there are).

I would love to know your thoughts on this; I think this is fine. I see issues with multiple threads right now hitting the stack storage instance. Previous webrevs had a mutex lock here, but we took it out for simplicity (and only for now).
 

Thanks,
  Thomas




Re: Low-Overhead Heap Profiling

Robbin Ehn
In reply to this post by JC Beyler
Hi,

On 06/28/2017 12:04 AM, JC Beyler wrote:
> Dear Thomas et al,
>
> Here is the newest webrev:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/

You have some more bits to fix in there, but generally this looks good and really nice with more tests.
I'll do a deep dive and re-test this when I get back from my long vacation, with whatever patch version you have then.

Also I think it's time you provide incremental (v06->07 changes) as well as complete change-sets.

Thanks, Robbin



>
> Thomas, I "think" I have answered all your remarks. The summary is:
>
> - The statistic system is up and provides insight on what the heap sampler is doing
>     - I've noticed that, though the sampling rate is at the right mean, we are missing some samples, I have not yet tracked out why (details below)
>
> - I've run a tiny benchmark that is the worst case: it is a very tight loop that allocates a small array
>     - In this case, I see no overhead when the system is off so that is a good start :)
>     - I see right now a high overhead in this case when sampling is on. This is not really too surprising, but I'm going to see if this is consistent with our internal
> implementation. The benchmark is really allocation-stressful so I'm not too surprised, but I want to do the due diligence.
>
>   - The statistic system up is up and I have a new test
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatTest.java.patch
>      - I did a bit of a study about the random generator here, more details are below but basically it seems to work well
>
>   - I added a capability but since this is the first time doing this, I was not sure I did it right
>     - I did add a test though for it and the test seems to do what I expect (all methods are failing with the JVMTI_ERROR_MUST_POSSESS_CAPABILITY error).
>         - http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorNoCapabilityTest.java.patch
>
>   - I still need to figure out what to do about the multi-agent vs single-agent issue
>
>   - As far as measurements, it seems I still need to look at:
>     - Why we do the 20 random calls first, are they necessary?
>     - Look at the mean of the sampling rate that the random generator does and also what is actually sampled
>     - What is the overhead in terms of memory/performance when on?
>
> I have inlined my answers, I think I got them all in the new webrev, let me know your thoughts.
>
> Thanks again!
> Jc
>
>
> On Fri, Jun 23, 2017 at 3:52 AM, Thomas Schatzl <[hidden email] <mailto:[hidden email]>> wrote:
>
>     Hi,
>
>     On Wed, 2017-06-21 at 13:45 -0700, JC Beyler wrote:
>     > Hi all,
>     >
>     > First off: Thanks again to Robbin and Thomas for their reviews :)
>     >
>     > Next, I've uploaded a new webrev:
>     > http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/ <http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/>
>     >
>     > Here is an update:
>     >
>     > - @Robbin, I forgot to say that yes I need to look at implementing
>     > this for the other architectures and testing it before it is all
>     > ready to go. Is it common to have it working on all possible
>     > combinations or is there a subset that I should be doing first and we
>     > can do the others later?
>     > - I've tested slowdebug, built and ran the JTreg tests I wrote with
>     > slowdebug and fixed a few more issues
>     > - I've refactored a bit of the code following Thomas' comments
>     >    - I think I've handled all the comments from Thomas (I put
>     > comments inline below for the specifics)
>
>     Thanks for handling all those.
>
>     > - Following Thomas' comments on statistics, I want to add some
>     > quality assurance tests and find that the easiest way would be to
>     > have a few counters of what is happening in the sampler and expose
>     > that to the user.
>     >    - I'll be adding that in the next version if no one sees any
>     > objections to that.
>     >    - This will allow me to add a sanity test in JTreg about number of
>     > samples and average of sampling rate
>     >
>     > @Thomas: I had a few questions that I inlined below but I will
>     > summarize the "bigger ones" here:
>     >    - You mentioned constants are not using the right conventions, I
>     > looked around and didn't see any convention except normal naming then
>     > for static constants. Is that right?
>
>     I looked through https://wiki.openjdk.java.net/display/HotSpot/StyleGuide
>     and the rule is to "follow an existing pattern and must have a
>     distinct appearance from other names". Which does not help a lot I
>     guess :/ The GC team started using upper camel case, e.g.
>     SomeOtherConstant, but very likely this is probably not applied
>     consistently throughout. So I am fine with not adding another style
>     (like kMaxStackDepth, with the "k" in front having some unknown
>     meaning).
>
>     (Chances are you will find that style somewhere used anyway too,
>     apologies if so :/)
>
>
> Thanks for that link, now I know where to look. I used the upper camel case in my code as well then :) I should have gotten them all.
>
>
>      > PS: I've also inlined my answers to Thomas below:
>      >
>      > On Tue, Jun 13, 2017 at 8:03 AM, Thomas Schatzl <thomas.schatzl@oracle.com> wrote:
>      > > Hi all,
>      > >
>      > > On Mon, 2017-06-12 at 11:11 -0700, JC Beyler wrote:
>      > > > Dear all,
>      > > >
>      > > > I've continued working on this and have done the following
>      > > webrev:
>      > > > http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/ <http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/>
>      > >
>      > > [...]
>      > > > Things I still need to do:
>      > > >    - Have to fix that TLAB case for the FastTLABRefill
>      > > >    - Have to start looking at the data to see that it is
>      > > consistent and does gather the right samples, right frequency, etc.
>      > > >    - Have to check the GC elements and what that produces
>      > > >    - Run a slowdebug run and ensure I fixed all those issues you
>      > > saw > Robbin
>      > > >
>      > > > Thanks for looking at the webrev and have a great week!
>      > >
>      > >   scratching a bit on the surface of this change, so apologies for
>      > > rather shallow comments:
>      > >
>      > > - macroAssembler_x86.cpp:5604: while this is compiler code, and I
>      > > am not sure this is final, please avoid littering the code with
>      > > TODO remarks :) They tend to be candidates for later wtf moments
>      > > only.
>      > >
>      > > Just file a CR for that.
>      > >
>      > Newcomer question: what is a CR and not sure I have the rights to do
>      > that yet ? :)
>
>     Apologies. CR is a change request, this suggests to file a bug in the
>     bug tracker. And you are right, you can't just create a new account in
>     the OpenJDK JIRA yourselves. :(
>
>
> Ok good to know, I'll continue with my own todo list but I'll work hard on not letting it slip in the webrevs anymore :)
>
>
>     I was mostly referring to the "... but it is a TODO" part of that
>     comment in macroassembler_x86.cpp. Comments about the why of the code
>     are appreciated.
>
>
>     [Note that I now understand that this is to some degree still work in
>     progress. As long as the final changeset does not contain TODOs I am
>     fine (and it's not a hard objection, rather their use in "final" code
>     is typically limited in my experience)]
>
>     5603   // Currently, if this happens, just set back the actual end to
>     where it was.
>     5604   // We miss a chance to sample here.
>
>     Would be okay, if explaining "this" and the "why" of missing a chance
>     to sample here would be best.
>
>     Like maybe:
>
>     // If we needed to refill TLABs, just set the actual end point to
>     // the end of the TLAB again. We do not sample here although we could.
>
> Done with your comment, it works well in my mind.
>
>     I am not sure whether "miss a chance to sample" meant "we could, but
>     consciously don't because it's not that useful" or "it would be
>     necessary but don't because it's too complicated to do.".
>
>     Looking at the original comment once more, I am also not sure if that
>     comment shouldn't referring to the "end" variable (not actual_end)
>     because that's the variable that is responsible for taking the sampling
>     path? (Going from the member description of ThreadLocalAllocBuffer).
>
>
> I've moved this code and it no longer shows up here but the rationale and answer was:
>
> So.. Yes, end is the variable provoking the sampling. Actual end is the actual end of the TLAB.
>
> What was happening here is that the code is resetting _end to point towards the end of the new TLAB. Because we now have _end for sampling and _actual_end for the
> actual end, we need to update _actual_end as well.
>
> Normally, were we to do the real work here, we would calculate the (end - start) offset, then do:
>
> - Set the new end to : start + (old_end - old_start)
> - Set the actual end like we do here now, because it is the actual end.
>
> Why is this not done here now anymore?
>    - I was still debating which path to take:
>       - Do it in the fast refill code, it has its perks:
>           - In a world where fast refills are happening all the time or a lot, we can augment there the code to do the sampling
>       - Remember what we had as an end before leaving the slowpath and check on return
>           - This is what I'm doing now, it removes the need to go fix up all fast refill paths but if you remain in fast refill paths, you won't get sampling. I have to
> think of the consequences of that, maybe a future change later on?
>              - I have the statistics now so I'm going to study that
>                 -> By the way, though my statistics show I'm missing some samples, if I turn off FastTlabRefill the loss is the same, so for now it seems this does
> not occur in my simple test.
>
>
>
>     But maybe I am only confused and it's best to just leave the comment
>     away. :)
>
>     Thinking about it some more, doesn't this not-sampling in this case
>     mean that sampling does not work in any collector that does inline TLAB
>     allocation at the moment? (Or is inline TLAB alloc automatically
>     disabled with sampling somehow?)
>
>
>     That would indeed be a bigger TODO then :)
>
>
> Agreed, this remark made me think that perhaps as a first step the new way of doing it is better but I did have to:
>   - Remove the const from the ThreadLocalAllocBuffer remaining and hard_end methods
>   - Move hard_end out of the header file to have a bit more logic there
>
> Please let me know what you think of that and if you prefer it this way or changing the fast refills. (I prefer this way now because it is more incremental).
>
>
>     > > - calling HeapMonitoring::do_weak_oops() (which should probably be
>     > > called weak_oops_do() like other similar methods) only if string
>     > > deduplication is enabled (in g1CollectedHeap.cpp:4511) seems wrong.
>     >
>     > The call should be at least around 6 lines up outside the if.
>     >
>     > Preferentially in a method like process_weak_jni_handles(), including
>     > additional logging. (No new (G1) gc phase without minimal logging
>     > :)).
>     > Done but really not sure because:
>     >
>     > I put for logging:
>     >   log_develop_trace(gc, freelist)("G1ConcRegionFreeing [other] : heap
>     > monitoring");
>
>     I would think that "gc, ref" would be more appropriate log tags for
>     this similar to jni handles.
    (I am also not sure what weak reference handling has to do with
>     G1ConcRegionFreeing, so I am a bit puzzled)
>
>
> I was not sure what to put for the tags or really as the message. I cleaned it up a bit now to:
>     log_develop_trace(gc, ref)("HeapSampling [other] : heap monitoring processing");
>
>
>
>     > Since weak_jni_handles didn't have logging for me to be inspired
>     > from, I did that but unconvinced this is what should be done.
>
>     The JNI handle processing does have logging, but only in
>     ReferenceProcessor::process_discovered_references(). In
>     process_weak_jni_handles() only overall time is measured (in a G1
    specific way, since only G1 supports disabling reference processing) :/
>
>     The code in ReferenceProcessor prints both time taken
>     referenceProcessor.cpp:254, as well as the count, but strangely only in
>     debug VMs.
>
>     I have no idea why this logging is that unimportant to only print that
>     in a debug VM. However there are reviews out for changing this area a
>     bit, so it might be useful to wait for that (JDK-8173335).
>
>
> I cleaned it up a bit anyway and now it returns the count of objects that are in the system.
>
>
>     > > - the change doubles the size of
>     > > CollectedHeap::allocate_from_tlab_slow() above the "small and nice"
>     > > threshold. Maybe it could be refactored a bit.
>     > Done I think, it looks better to me :).
>
>     In ThreadLocalAllocBuffer::handle_sample() I think the
>     set_back_actual_end()/pick_next_sample() calls could be hoisted out of
>     the "if" :)
>
>
> Done!
>
>
>     > > - referenceProcessor.cpp:261: the change should add logging about
>     > > the number of references encountered, maybe after the corresponding
>     > > "JNI weak reference count" log message.
>     > Just to double check, are you saying that you'd like to have the heap
>     > sampler to keep in store how many sampled objects were encountered in
>     > the HeapMonitoring::weak_oops_do?
>     >    - Would a return of the method with the number of handled
>     > references and logging that work?
>
>     Yes, it's fine if HeapMonitoring::weak_oops_do() only returned the
>     number of processed weak oops.
>
>
> Done also (but I admit I have not tested the output yet) :)
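A standalone sketch of what returning the processed count from such a weak-oop walk could look like; `BoolObjectClosure` and `StackTraceData` here are simplified stand-ins, not the HotSpot classes:

```cpp
#include <cstddef>
#include <vector>

// Standalone sketch (simplified stand-ins, not the HotSpot classes) of
// a weak-oop walk over the sampled-trace table that returns how many
// references it processed, so the caller can log the count.
struct BoolObjectClosure {
  virtual bool do_object_b(void* obj) = 0;
  virtual ~BoolObjectClosure() {}
};

struct StackTraceData {
  void* obj;  // the sampled object; frames etc. omitted
};

size_t weak_oops_do(std::vector<StackTraceData>& traces,
                    BoolObjectClosure* is_alive) {
  size_t count = 0;
  for (StackTraceData& data : traces) {
    if (data.obj != nullptr && is_alive->do_object_b(data.obj)) {
      ++count;  // live: keep it (a real version would also forward the oop)
    } else {
      data.obj = nullptr;  // dead: clear so the reference cannot go stale
    }
  }
  return count;
}
```

The returned count is exactly what the logging discussed above would print next to the "JNI weak reference count" message.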
>
>
>     >    - Additionally, would you prefer it in a separate block with its
>     > GCTraceTime?
>
>     Yes. Both kinds of information is interesting: while the time taken is
>     typically more important, the next question would be why, and the
>     number of references typically goes a long way there.
>
>     See above though, it is probably best to wait a bit.
>
>
> Agreed that I "could" wait but, if it's ok, I'll just refactor/remove this when we get closer to something final. Either, JDK-8173335
> has gone in and I will notice it now or it will soon and I can change it then.
>
>
>     > > - threadLocalAllocBuffer.cpp:331: one more "TODO"
>     > Removed it and added it to my personal todos to look at.
>     >  
>     > >
>     > > - threadLocalAllocBuffer.hpp: ThreadLocalAllocBuffer class
>     > > documentation should be updated about the sampling additions. I
>     > > would have no clue what the difference between "actual_end" and
>     > > "end" would be from the given information.
>     > If you are talking about the comments in this file, I made them more
>     > clear I hope in the new webrev. If it was somewhere else, let me know
>     > where to change.
>
>     Thanks, that's much better. Maybe a note in the comment of the class
>     that ThreadLocalBuffer provides some sampling facility by modifying the
>     end() of the TLAB to cause "frequent" calls into the runtime call where
>     actual sampling takes place.
>
>
> Done, I think it's better now. Added something about the slow_path_end as well.
>
>
>     > > - in heapMonitoring.hpp: there are some random comments about some
>     > > code that has been grabbed from "util/math/fastmath.[h|cc]". I
>     > > can't tell whether this is code that can be used but I assume that
>     > > Noam Shazeer is okay with that (i.e. that's all Google code).
>     > Jeremy and I double checked and we can release that as I thought. I
>     > removed the comment from that piece of code entirely.
>
>     Thanks.
>
>     > > - heapMonitoring.hpp/cpp static constant naming does not correspond
>     > > to Hotspot's. Additionally, in Hotspot static methods are cased
>     > > like other methods.
>     > I think I fixed the methods to be cased the same way as all other
>     > methods. For static constants, I was not sure. I fixed a few other
>     > variables but I could not seem to really see a consistent trend for
>     > constants. I made them as variables but I'm not sure now.
>
>     Sorry again, style is a kind of mess. The goal of my suggestions here
>     is only to prevent yet another style creeping in.
>
>     > > - in heapMonitoring.cpp there are a few cryptic comments at the top
>     > > that seem to refer to internal stuff that should probably be
>     > > removed.
>     > Sorry about that! My personal todos not cleared out.
>
>     I am happy about comments, but I simply did not understand any of that
>     and I do not know about other readers as well.
>
>     If you think you will remember removing/updating them until the review
>     proper (I misunderstood the review situation a little it seems).
>
>     > > I did not think through the impact of the TLAB changes on collector
>     > > behavior yet (if there are). Also I did not check for problems with
>     > > concurrent mark and SATB/G1 (if there are).
>     > I would love to know your thoughts on this, I think this is fine. I
>
>     I think so too now. No objects are made live out of thin air :)
>
>     > see issues with multiple threads right now hitting the stack storage
>     > instance. Previous webrevs had a mutex lock here but we took it out
>     > for simplicity (and only for now).
>
>     :) When looking at this after some thinking I now assume for this
>     review that this code is not MT safe at all. There seems to be more
>     synchronization missing than just the one for the StackTraceStorage. So
>     no comments about this here.
>
>
> I double-checked a bit (quickly, I admit) but it seems that synchronization in StackTraceStorage is really all you need (all methods lead to a StackTraceStorage one and can
> be multithreaded outside of that).
> There is a question about the initialization where the method HeapMonitoring::initialize_profiling is not thread safe.
> It would work (famous last words) and not crash if there was a race but we could add a synchronization point there as well (and therefore on the stop as well).
>
> But anyway I will really check and do this once we add back synchronization.
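One hedged sketch of how the initialize/stop race mentioned above could be closed with a single atomic flag (illustrative only; HotSpot code would use its own Atomic/Mutex primitives rather than `<atomic>`):

```cpp
#include <atomic>

// Sketch of closing the initialize/stop race with one atomic flag
// (illustrative only; real HotSpot code would use its own Atomic or
// Mutex primitives rather than <atomic>).
static std::atomic<bool> profiling_on{false};

// Exactly one concurrent caller wins and performs the actual setup.
bool try_start_profiling() {
  bool expected = false;
  return profiling_on.compare_exchange_strong(expected, true);
}

// Symmetric guard for the stop path.
bool try_stop_profiling() {
  bool expected = true;
  return profiling_on.compare_exchange_strong(expected, false);
}
```

A compare-and-swap like this makes both start and stop idempotent under races, which is the "synchronization point there as well (and therefore on the stop as well)" idea.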
>
>
>     Also, this would require some kind of specification of what is allowed
>     to be called when and where.
>
>
> Would we specify this with the methods in the jvmti.xml file? We could start by specifying in each that they are not thread safe but I saw no mention of that for other methods.
>
>
>     One potentially relevant observation about locking here: depending on
>     sampling frequency, StackTraceStore::add_trace() may be rather
>     frequently called. I assume that you are going to do measurements :)
>
>
> Though we don't have the TLAB implementation in our code, the compiler-generated sampler adds about 2% overhead at a 512k sampling rate. I can do real measurements when the
> code settles and we can see how costly this is as a TLAB implementation.
> However, my theory is that if the rate is 512k, the memory/performance overhead should be minimal since that is what we saw with our code/workloads (though not called the
> same way, we call it essentially at the same rate).
> If you have a benchmark you'd like me to test, let me know!
>
> Right now, with my really small test, this does use a bit of overhead even for a 512k sample size. I don't know yet why, I'm going to see what is going on.
>
> Finally, I think it is not reasonable to assume the overhead will be negligible if the sampling rate used is too low. The user should know that the lower the rate, the
> higher the overhead (documentation TODO?).
>
>
>     I am not sure what the expected usage of the API is, but
>     StackTraceStore::add_trace() seems to be able to grow without bounds.
>     Only a GC truncates them to the live ones. That in itself seems to be
>     problematic (GCs can be *wide* apart), and of course some of the API
>     methods add to that because they duplicate that unbounded array. Do you
>     have any concerns/measurements about this?
>
>
> So, the theory is that yes, add_trace can grow without bounds, but it grows at one sample per 512k of allocated space. The stacks it gathers are currently capped at
> 64 (I'd like to expand that to an option to the user though at some point). So I have no concerns because:
>
> - If really this is taking a lot of space, that means the job is keeping a lot of objects in memory as well, therefore the entire heap is getting huge
> - If this is the case, you will be triggering a GC at some point anyway.
>
> (I'm putting under the rug the issue of "What if we set the rate to 1 for example" because as you lower the sampling rate, we cannot guarantee low overhead; the idea behind
> this feature is to have a means of having meaningful allocated samples at a low overhead)
>
> I have no measurements really right now but since I now have some statistics I can poll, I will look a bit more at this question.
>
> My last sentence from above applies here as well: the user should expect this to happen if the sampling rate is too small. That probably can be reflected in the StartHeapSampling
> documentation as a note: careful, this might impact your performance.
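As a rough sanity check of the growth argument above, the expected table size is easy to bound. This is only a back-of-the-envelope sketch; `kMaxStackDepth` reflects the 64-frame cap mentioned in this thread:

```cpp
#include <cstddef>

// Back-of-the-envelope for the growth concern: with one sample per
// 'rate' bytes allocated, the table gains roughly allocated/rate
// entries between GCs, each holding at most kMaxStackDepth frames.
const size_t kMaxStackDepth = 64;  // the current cap mentioned above

size_t expected_samples(size_t allocated_bytes, size_t rate) {
  return allocated_bytes / rate;
}

size_t worst_case_frames(size_t allocated_bytes, size_t rate) {
  return expected_samples(allocated_bytes, rate) * kMaxStackDepth;
}
```

So a job allocating 1 GB between GCs at the default 512k rate stores about 2048 traces, which stays small next to the heap that produced them.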
>
>
>     Also, these stack traces might hold on to huge arrays. Any
>     consideration of that? Particularly it might be the cause for OOMEs in
>     tight memory situations.
>
>
> There is a stack size maximum that is set to 64 so it should not hold huge arrays. I don't think this is an issue but I can double check with a test or two.
>
>
>     - please consider adding a safepoint check in
>     HeapMonitoring::weak_oops_do to prevent accidental misuse.
>
>     - in struct StackTraceStorage, the public fields may also need
>     underscores. At least some files in the runtime directory have structs
>     with underscored public members (and some don't). The runtime team
>     should probably comment on that.
>
>
> Agreed, I did not know. I looked around and it seemed a lot of structs did not have them, so I left it as is. I will happily change it if someone prefers (I was not sure if
> you really preferred it or not; your sentence seemed to be more a note of "this might need to change but I don't know if the runtime team enforces that", let me know if I read
> that wrongly).
>
>
>     - In StackTraceStorage::weak_oops_do(), when examining the
>     StackTraceData, maybe it is useful to consider having a non-NULL
>     reference outside of the heap's reserved space an error. There should
>     be no oop outside of the heap's reserved space ever.
>
>     Unless you allow storing random values in StackTraceData::obj, which I
>     would not encourage.
>
>
> I suppose you are talking about this part:
> if ((value != NULL && Universe::heap()->is_in_reserved(value)) &&
>            (is_alive == NULL || is_alive->do_object_b(value))) {
>
> What you are saying is that I could have something like:
> if (value != my_non_null_reference &&
>            (is_alive == NULL || is_alive->do_object_b(value))) {
>
> Is that what you meant? Is there really a reason to do so? When I look at the code, is_in_reserved seems like an O(1) method call. I'm not even sure we can have a NULL value
> to be honest. I might have to study that to see if this was not a paranoid test to begin with.
>
> The is_alive code has now morphed due to the comment below.
>
>
>
>     - HeapMonitoring::weak_oops_do() does not seem to use the
>     passed AbstractRefProcTaskExecutor.
>
>
> It did use it:
>   size_t HeapMonitoring::weak_oops_do(
>      AbstractRefProcTaskExecutor *task_executor,
>      BoolObjectClosure* is_alive,
>      OopClosure *f,
>      VoidClosure *complete_gc) {
>    assert(SafepointSynchronize::is_at_safepoint(), "must be at safepoint");
>
>    if (task_executor != NULL) {
>      task_executor->set_single_threaded_mode();
>    }
>    return StackTraceStorage::storage()->weak_oops_do(is_alive, f, complete_gc);
> }
>
> But due to the comment below, I refactored this, so this is no longer here. Now I have an always true closure that is passed.
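The always-true-closure pattern mentioned here can be sketched standalone (simplified types, not the HotSpot ones): outside a GC pause every stored object is treated as live, so no caller ever passes NULL and no callee checks for it:

```cpp
// Sketch of the "always true" default closure that replaces the NULL
// is_alive parameter (simplified types, not the HotSpot ones): outside
// of a GC pause every stored object counts as live, so callers never
// pass, and callees never check for, NULL.
struct BoolObjectClosure {
  virtual bool do_object_b(void* obj) = 0;
  virtual ~BoolObjectClosure() {}
};

struct AlwaysTrueClosure : BoolObjectClosure {
  bool do_object_b(void*) override { return true; }
};

static AlwaysTrueClosure always_true;

// Normalizes an optional closure into an always-valid one.
BoolObjectClosure* ensure_closure(BoolObjectClosure* is_alive) {
  return is_alive != nullptr ? is_alive : &always_true;
}
```

Centralizing the default this way is what removes the NULL-permissive branches from the walk itself.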
>
>
>     - I do not understand allowing to call this method with a NULL
>     complete_gc closure. This would mean that objects referenced from the
>     object that is referenced by the StackTraceData are not pulled, meaning
>     they would get stale.
>
>     - same with is_alive parameter value of NULL
>
>
> So these questions made me look a bit closer at this code. This code I think was written this way to have a very small impact on the file but you are right, there is no
> reason for this here. I've simplified the code by making in referenceProcessor.cpp a process_HeapSampling method that handles everything there.
>
> The code allowed NULLs because it depended on where you were coming from and how the code was being called.
>
> - I added a static always_true variable and pass that now to be more consistent with the rest of the code.
> - I moved the complete_gc into process_phaseHeapSampling now (new method) and handle the task_executor and the complete_gc there
>     - Newbie question: in our code we did a set_single_threaded_mode but I see that process_phaseJNI does it right before its call, do I need to do it for the
> process_phaseHeapSample?
> That API is much cleaner (in my mind) and is consistent with what is done around it (again in my mind).
>
>
>     - heapMonitoring.cpp:590: I do not completely understand the purpose of
>     this code: in the end this results in a fixed value directly dependent
>     on the Thread address anyway?
>     IOW, what is special about exactly 20 rounds?
>
>
> So we really want a fast random number generator that has a specific mean (512k is the default we use). The code uses the thread address as the start number of the sequence
> (the rationale being that it is random enough). Then, instead of just starting there, we prime the sequence and really only start at the 21st number; that is arbitrary and I have
> not done a study to see if we could do more or less of that.
>
> As I have the statistics of the system up and running, I'll run some experiments to see if this is needed, is 20 good, or not.
>
>
>     - also I would consider stripping a few bits of the threads' address as
>     initialization value for your rng. The last three bits (and probably
>     more, check whether the Thread object is allocated on special
>     boundaries) are always zero for them.
>     Not sure if the given "random" value is random enough before/after
>     this method, so just skip that comment if you think this is not
>     required.
>
>
> I don't know is the honest answer. I think what is important is that we tend towards a mean and it is random "enough" to not fall into the pitfall of only sampling a subset of
> objects due to their allocation order. I added that as a test, to see if it changes the mean in any way for the 512k default value and/or if the first 1000 elements look
> better.
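To make the two points above concrete (the priming rounds and stripping the aligned-address low bits), here is a hedged, standalone sketch of such a sampler. The xorshift generator and the 20-round constant are illustrative, not the actual HeapMonitoring code:

```cpp
#include <cmath>
#include <cstdint>

// Hedged, standalone sketch of the sampling-interval picker discussed
// above (the xorshift generator and the 20-round priming constant are
// illustrative, not the actual HeapMonitoring code).
uint64_t next_random(uint64_t x) {
  x ^= x << 13;  // xorshift64 step
  x ^= x >> 7;
  x ^= x << 17;
  return x;
}

uint64_t init_rng(const void* thread_addr) {
  // Thread objects are aligned, so the low bits are always zero and
  // carry no entropy: strip them before seeding.
  uint64_t seed = (uint64_t)(uintptr_t)thread_addr >> 3;
  if (seed == 0) seed = 1;  // xorshift must not start at zero
  for (int i = 0; i < 20; i++) {
    seed = next_random(seed);  // the "20 rounds" of priming
  }
  return seed;
}

// Draws a geometrically distributed interval with the requested mean,
// so samples stay unbiased with respect to allocation order.
uint64_t pick_next_sample(uint64_t* state, uint64_t mean_bytes) {
  *state = next_random(*state);
  double u = (double)(*state >> 11) / (double)(1ULL << 53);  // uniform [0,1)
  if (u <= 0.0) u = 1e-18;  // guard the log
  return (uint64_t)(-std::log(u) * (double)mean_bytes) + 1;
}
```

Averaging many draws should land near the configured mean, which is exactly the kind of sanity check the statistics system mentioned above can automate.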
>
>
>     Some more random nits I did not find a place to put anywhere:
>
>     - ThreadLocalAllocBuffer::_extra_space does not seem to be used
>     anywhere?
>
>
> Good catch :).
>
>
>     - Maybe indent the declaration of ThreadLocalAllocBuffer::_bytes_until_sample to align below the other members of that group.
>
>
> Done, moved it up a bit to have non-static members together and static ones separate.
>
>     Thanks,
>        Thomas
>
>
> Thanks for your review!
> Jc
>

Re: Low-Overhead Heap Profiling

JC Beyler
Hi Robbin,

A) Fast Refill

Sounds good for the fast refill, it will be simpler without that code so I'll be happy if it gets removed :).

Right now, I fixed it in a way where this is no longer an issue when fast refill is turned on. If it does get removed, then my code can be simplified, which is always good. When you return, we can see whether you would rather I simplify my code first and we then work on removing FastTLABRefill, or the other way around.

B) Incremental changes

Again, another newbie question here...

For showing the incremental changes, is there a link that explains how to do that? I apologize for my newbie questions all the time :)

Right now, I do:

 ksh ../webrev.ksh -m -N

That generates a webrev.zip which I send to Chuck Rasbold. He then uploads it to a new webrev.

I tried committing my change and adding a small change on top. Then if I just do ksh ../webrev.ksh without any options, it seems to produce a similar page but now with only the changes I had (so the 06-07 comparison you were talking about) and a changeset that has it all. I imagine that is what you meant.

Which means that my workflow would become:

1) Make changes
2) Make a webrev without any options to show just the differences with the tip
3) Amend my changes to my local commit so that I have it done with
4) Go to 1

Does that seem correct to you?

Note that when I do this, I only see the full change of a file in the full change set (Side note here: now the page says change set and not patch, which is maybe why Serguei was having issues?).

Thanks!
Jc



On Wed, Jun 28, 2017 at 1:12 AM, Robbin Ehn <[hidden email]> wrote:
Hi,

On 06/28/2017 12:04 AM, JC Beyler wrote:
Dear Thomas et al,

Here is the newest webrev:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/

You have some more bits in there but generally this looks good and really nice with more tests.
I'll do a deep dive and re-test this when I get back from my long vacation, with whatever patch version you have then.

Also I think it's time you provide incremental (v06->07 changes) as well as complete change-sets.

Thanks, Robbin




Thomas, I "think" I have answered all your remarks. The summary is:

- The statistics system is up and provides insight on what the heap sampler is doing
    - I've noticed that, though the sampling rate is at the right mean, we are missing some samples; I have not yet tracked down why (details below)

- I've run a tiny benchmark that is the worst case: a very tight loop that allocates a small array
    - In this case, I see no overhead when the system is off, so that is a good start :)
    - I see right now a high overhead in this case when sampling is on. This is not really surprising but I'm going to see if it is consistent with our internal implementation. The benchmark is really allocation stressful so I'm not too surprised, but I want to do the due diligence.

  - The statistics system is up and I have a new test http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatTest.java.patch
     - I did a bit of a study about the random generator here, more details are below but basically it seems to work well

  - I added a capability but since this is the first time doing this, I was not sure I did it right
    - I did add a test though for it and the test seems to do what I expect (all methods are failing with the JVMTI_ERROR_MUST_POSSESS_CAPABILITY error).
        - http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorNoCapabilityTest.java.patch

  - I still need to figure out what to do about the multi-agent vs single-agent issue

  - As far as measurements, it seems I still need to look at:
    - Why we do the 20 random calls first, are they necessary?
    - Look at the mean of the sampling rate that the random generator does and also what is actually sampled
    - What is the overhead in terms of memory/performance when on?

I have inlined my answers, I think I got them all in the new webrev, let me know your thoughts.

Thanks again!
Jc


On Fri, Jun 23, 2017 at 3:52 AM, Thomas Schatzl <[hidden email] <mailto:[hidden email]>> wrote:

    Hi,

    On Wed, 2017-06-21 at 13:45 -0700, JC Beyler wrote:
    > Hi all,
    >
    > First off: Thanks again to Robbin and Thomas for their reviews :)
    >
    > Next, I've uploaded a new webrev:
    > http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/

    >
    > Here is an update:
    >
    > - @Robbin, I forgot to say that yes I need to look at implementing
    > this for the other architectures and testing it before it is all
    > ready to go. Is it common to have it working on all possible
    > combinations or is there a subset that I should be doing first and we
    > can do the others later?
    > - I've tested slowdebug, built and ran the JTreg tests I wrote with
    > slowdebug and fixed a few more issues
    > - I've refactored a bit of the code following Thomas' comments
    >    - I think I've handled all the comments from Thomas (I put
    > comments inline below for the specifics)

    Thanks for handling all those.

    > - Following Thomas' comments on statistics, I want to add some
    > quality assurance tests and find that the easiest way would be to
    > have a few counters of what is happening in the sampler and expose
    > that to the user.
    >    - I'll be adding that in the next version if no one sees any
    > objections to that.
    >    - This will allow me to add a sanity test in JTreg about number of
    > samples and average of sampling rate
    >
    > @Thomas: I had a few questions that I inlined below but I will
    > summarize the "bigger ones" here:
    >    - You mentioned constants are not using the right conventions, I
    > looked around and didn't see any convention except normal naming then
    > for static constants. Is that right?

    I looked through
    https://wiki.openjdk.java.net/display/HotSpot/StyleGuide and the
    rule is to "follow an existing pattern and must have a distinct
    appearance from other names". Which does not help a lot I guess :/
    The GC team started using upper camel case, e.g. SomeOtherConstant,
    but very likely this is not applied consistently throughout. So not
    adding another style (like kMaxStackDepth, with the "k" in front
    with some unknown meaning) is fine.

    (Chances are you will find that style somewhere used anyway too,
    apologies if so :/)


Thanks for that link, now I know where to look. I used the upper camel case in my code as well then :) I should have gotten them all.


     > PS: I've also inlined my answers to Thomas below:
     >
     > On Tue, Jun 13, 2017 at 8:03 AM, Thomas Schatzl <thomas.schatzl@oracle.com> wrote:
     > > Hi all,
     > >
     > > On Mon, 2017-06-12 at 11:11 -0700, JC Beyler wrote:
     > > > Dear all,
     > > >
     > > > I've continued working on this and have done the following
     > > webrev:
     > > > http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/

     > >
     > > [...]
     > > > Things I still need to do:
     > > >    - Have to fix that TLAB case for the FastTLABRefill
     > > >    - Have to start looking at the data to see that it is
     > > consistent and does gather the right samples, right frequency, etc.
     > > >    - Have to check the GC elements and what that produces
     > > >    - Run a slowdebug run and ensure I fixed all those issues you
     > > saw > Robbin
     > > >
     > > > Thanks for looking at the webrev and have a great week!
     > >
     > >   scratching a bit on the surface of this change, so apologies for
     > > rather shallow comments:
     > >
     > > - macroAssembler_x86.cpp:5604: while this is compiler code, and I
     > > am not sure this is final, please avoid littering the code with
     > > TODO remarks :) They tend to be candidates for later wtf moments
     > > only.
     > >
     > > Just file a CR for that.
     > >
     > Newcomer question: what is a CR and not sure I have the rights to do
     > that yet ? :)

    Apologies. CR is a change request, this suggests to file a bug in the
    bug tracker. And you are right, you can't just create a new account in
    the OpenJDK JIRA yourselves. :(


Ok good to know, I'll continue with my own todo list but I'll work hard on not letting it slip in the webrevs anymore :)


    I was mostly referring to the "... but it is a TODO" part of that
    comment in macroassembler_x86.cpp. Comments about the why of the code
    are appreciated.

    [Note that I now understand that this is to some degree still work in
    progress. As long as the final changeset does not contain TODO's I am
    fine (and it's not a hard objection, rather their use in "final" code
    is typically limited in my experience)]

    5603   // Currently, if this happens, just set back the actual end to
    where it was.
    5604   // We miss a chance to sample here.

    Would be okay, if explaining "this" and the "why" of missing a chance
    to sample here would be best.

    Like maybe:

    // If we needed to refill TLABs, just set the actual end point to
    // the end of the TLAB again. We do not sample here although we could.

Done with your comment, it works well in my mind.

    I am not sure whether "miss a chance to sample" meant "we could, but
    consciously don't because it's not that useful" or "it would be
    necessary but don't because it's too complicated to do.".

    Looking at the original comment once more, I am also not sure if that
    comment shouldn't be referring to the "end" variable (not actual_end)
    because that's the variable that is responsible for taking the sampling
    path? (Going from the member description of ThreadLocalAllocBuffer).


I've moved this code and it no longer shows up here but the rationale and answer was:

So.. Yes, end is the variable provoking the sampling. Actual end is the actual end of the TLAB.

What was happening here is that the code is resetting _end to point towards the end of the new TLAB. Because we now have _end for sampling and _actual_end for the actual end, we need to update _actual_end as well.

Normally, were we to do the real work here, we would calculate the (end - start) offset, then do:

- Set the new end to : start + (old_end - old_start)
- Set the actual end like we do here now, because it is the actual end.

Why is this not done here now anymore?
   - I was still debating which path to take:
      - Do it in the fast refill code, it has its perks:
          - In a world where fast refills are happening all the time or a lot, we can augment there the code to do the sampling
      - Remember what we had as an end before leaving the slowpath and check on return
          - This is what I'm doing now, it removes the need to go fix up all fast refill paths but if you remain in fast refill paths, you won't get sampling. I have to think of the consequences of that, maybe a future change later on?
             - I have the statistics now so I'm going to study that
                -> By the way, though my statistics are showing I'm missing some samples, if I turn off FastTlabRefill, it is the same loss so for now, it seems this does not occur in my simple test.



    But maybe I am only confused and it's best to just leave the comment
    away. :)

    Thinking about it some more, doesn't this not-sampling in this case
    mean that sampling does not work in any collector that does inline TLAB
    allocation at the moment? (Or is inline TLAB alloc automatically
    disabled with sampling somehow?)

    That would indeed be a bigger TODO then :)


Agreed, this remark made me think that perhaps as a first step the new way of doing it is better but I did have to:
  - Remove the const of the ThreadLocalBuffer remaining and hard_end methods
  - Move hard_end out of the header file to have a bit more logic there

Please let me know what you think of that and if you prefer it this way or changing the fast refills. (I prefer this way now because it is more incremental).


    > > - calling HeapMonitoring::do_weak_oops() (which should probably be
    > > called weak_oops_do() like other similar methods) only if string
    > > deduplication is enabled (in g1CollectedHeap.cpp:4511) seems wrong.
    >
    > The call should be at least around 6 lines up outside the if.
    >
    > Preferentially in a method like process_weak_jni_handles(), including
    > additional logging. (No new (G1) gc phase without minimal logging
    > :)).
    > Done but really not sure because:
    >
    > I put for logging:
    >   log_develop_trace(gc, freelist)("G1ConcRegionFreeing [other] : heap
    > monitoring");

    I would think that "gc, ref" would be more appropriate log tags for
    this similar to jni handles.
    (I am als not sure what weak reference handling has to do with
    G1ConcRegionFreeing, so I am a bit puzzled)


I was not sure what to put for the tags or really as the message. I cleaned it up a bit now to:
    log_develop_trace(gc, ref)("HeapSampling [other] : heap monitoring processing");



    > Since weak_jni_handles didn't have logging for me to be inspired
    > from, I did that but unconvinced this is what should be done.

    The JNI handle processing does have logging, but only in
    ReferenceProcessor::process_discovered_references(). In
    process_weak_jni_handles() only overall time is measured (in a G1
    specific way, since only G1 supports disabling reference processing) :/

    The code in ReferenceProcessor prints both time taken
    referenceProcessor.cpp:254, as well as the count, but strangely only in
    debug VMs.

    I have no idea why this logging is that unimportant to only print that
    in a debug VM. However there are reviews out for changing this area a
    bit, so it might be useful to wait for that (JDK-8173335).


I cleaned it up a bit anyway and now it returns the count of objects that are in the system.


    > > - the change doubles the size of
    > > CollectedHeap::allocate_from_tlab_slow() above the "small and nice"
    > > threshold. Maybe it could be refactored a bit.
    > Done I think, it looks better to me :).

    In ThreadLocalAllocBuffer::handle_sample() I think the
    set_back_actual_end()/pick_next_sample() calls could be hoisted out of
    the "if" :)


Done!


    > > - referenceProcessor.cpp:261: the change should add logging about
    > > the number of references encountered, maybe after the corresponding
    > > "JNI weak reference count" log message.
    > Just to double check, are you saying that you'd like to have the heap
    > sampler to keep in store how many sampled objects were encountered in
    > the HeapMonitoring::weak_oops_do?
    >    - Would a return of the method with the number of handled
    > references and logging that work?

    Yes, it's fine if HeapMonitoring::weak_oops_do() only returned the
    number of processed weak oops.


Done also (but I admit I have not tested the output yet) :)


    >    - Additionally, would you prefer it in a separate block with its
    > GCTraceTime?

    Yes. Both kinds of information are interesting: while the time taken is
    typically more important, the next question would be why, and the
    number of references typically goes a long way there.

    See above though, it is probably best to wait a bit.


Agreed that I "could" wait but, if it's ok, I'll just refactor/remove this when we get closer to something final. Either JDK-8173335
will have gone in by then and I will notice it, or it will go in soon after and I can change this then.


    > > - threadLocalAllocBuffer.cpp:331: one more "TODO"
    > Removed it and added it to my personal todos to look at.
    >      > >
    > > - threadLocalAllocBuffer.hpp: ThreadLocalAllocBuffer class
    > > documentation should be updated about the sampling additions. I
    > > would have no clue what the difference between "actual_end" and
    > > "end" would be from the given information.
    > If you are talking about the comments in this file, I made them more
    > clear I hope in the new webrev. If it was somewhere else, let me know
    > where to change.

    Thanks, that's much better. Maybe add a note in the class comment that
    ThreadLocalAllocBuffer provides a sampling facility by modifying the
    end() of the TLAB to cause "frequent" calls into the runtime where the
    actual sampling takes place.


Done, I think it's better now. Added something about the slow_path_end as well.
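For readers following along, the "end vs actual_end" trick the comment describes can be sketched roughly as follows; the names (_start, _end, _actual_end, bytes_until_sample) mirror the discussion, but this is an illustrative stand-in, not the HotSpot implementation:

```cpp
#include <cassert>
#include <cstddef>

// Toy model: the fast allocation path only compares against _end, so
// lowering _end below the buffer's true end forces a trip into the runtime
// (where sampling can happen) after 'bytes_until_sample' bytes.
struct ToyTlab {
  size_t _start, _end, _actual_end;

  void fill(size_t start, size_t size, size_t bytes_until_sample) {
    _start = start;
    _actual_end = start + size;                 // real end of the buffer
    // Lower the visible end so the fast path bails out for sampling.
    _end = (bytes_until_sample < size) ? start + bytes_until_sample
                                       : _actual_end;
  }

  // Fast-path allocation: only sees the (possibly lowered) _end.
  bool allocate(size_t bytes, size_t* top) {
    if (*top + bytes > _end) return false;      // slow path: maybe sample
    *top += bytes;
    return true;
  }

  // Slow path, after sampling: restore _end to the buffer's true end.
  void set_back_actual_end() { _end = _actual_end; }
};
```

The fast path stays a single pointer comparison; only the value it compares against changes.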


    > > - in heapMonitoring.hpp: there are some random comments about some
    > > code that has been grabbed from "util/math/fastmath.[h|cc]". I
    > > can't tell whether this is code that can be used but I assume that
    > > Noam Shazeer is okay with that (i.e. that's all Google code).
    > Jeremy and I double checked and we can release that as I thought. I
    > removed the comment from that piece of code entirely.

    Thanks.

    > > - heapMonitoring.hpp/cpp static constant naming does not correspond
    > > to Hotspot's. Additionally, in Hotspot static methods are cased
    > > like other methods.
    > I think I fixed the methods to be cased the same way as all other
    > methods. For static constants, I was not sure. I fixed a few other
    > variables but I could not seem to really see a consistent trend for
    > constants. I made them as variables but I'm not sure now.

    Sorry again, style is a kind of mess. The goal of my suggestions here
    is only to prevent yet another style creeping in.

    > > - in heapMonitoring.cpp there are a few cryptic comments at the top
    > > that seem to refer to internal stuff that should probably be
    > > removed.
    > Sorry about that! My personal todos not cleared out.

    I am happy about comments, but I simply did not understand any of these,
    and I suspect other readers will not either.

    It is fine if you think you will remember to remove/update them before
    the review proper (I misunderstood the review situation a little, it
    seems).

    > > I did not think through the impact of the TLAB changes on collector
    > > behavior yet (if there are). Also I did not check for problems with
    > > concurrent mark and SATB/G1 (if there are).
    > I would love to know your thoughts on this, I think this is fine. I

    I think so too now. No objects are made live out of thin air :)

    > see issues with multiple threads right now hitting the stack storage
    > instance. Previous webrevs had a mutex lock here but we took it out
    > for simplicity (and only for now).

    :) When looking at this after some thinking I now assume for this
    review that this code is not MT safe at all. There seems to be more
    synchronization missing than just the one for the StackTraceStorage. So
    no comments about this here.


I double checked a bit (quickly, I admit) but it seems that synchronization in StackTraceStorage is really all you need: all methods lead to a StackTraceStorage one, and everything outside of that can run multithreaded.
There is a question about initialization, where the method HeapMonitoring::initialize_profiling is not thread safe.
It would work (famous last words) and not crash if there was a race, but we could add a synchronization point there as well (and therefore on the stop path as well).

But anyway I will really check and do this once we add back synchronization.
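A minimal sketch of the kind of synchronization being discussed: one lock serializing all mutation of the storage, plus a function-local static whose initialization C++11 guarantees to be thread-safe (standing in for the initialize_profiling race). ToyStackTraceStorage and its methods are hypothetical stand-ins; HotSpot code would use its own Mutex class, not std::mutex.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// All entry points funnel into this one class, so a single internal lock
// is enough to make the storage itself safe, matching the observation
// above that "all methods lead to a StackTraceStorage one".
class ToyStackTraceStorage {
 public:
  static ToyStackTraceStorage& storage() {
    static ToyStackTraceStorage instance;   // thread-safe init in C++11
    return instance;
  }

  void add_trace(long trace_id) {
    std::lock_guard<std::mutex> guard(_lock);
    _traces.push_back(trace_id);
  }

  size_t trace_count() {
    std::lock_guard<std::mutex> guard(_lock);
    return _traces.size();
  }

 private:
  ToyStackTraceStorage() {}
  std::mutex _lock;
  std::vector<long> _traces;
};
```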


    Also, this would require some kind of specification of what is allowed
    to be called when and where.


Would we specify this with the methods in the jvmti.xml file? We could start by specifying in each that they are not thread safe but I saw no mention of that for other methods.


    One potentially relevant observation about locking here: depending on
    sampling frequency, StackTraceStore::add_trace() may be rather
    frequently called. I assume that you are going to do measurements :)


Though we don't have the TLAB implementation in our internal code, our compiler-generated sampler incurs about 2% overhead with a 512k sampling rate. I can do real measurements when the code settles, and we can see how costly this is as a TLAB implementation.
However, my theory is that if the rate is 512k, the memory/performance overhead should be minimal, since that is what we saw with our code/workloads (though not called the same way, it is called at essentially the same rate).
If you have a benchmark you'd like me to test, let me know!

Right now, with my really small test, this does incur a bit of overhead even at a 512k sampling rate. I don't know yet why; I'm going to see what is going on.

Finally, I think it is not reasonable to assume the overhead is negligible if the sampling rate is set too low. The user should know that the lower the rate, the higher the overhead (documentation TODO?).


    I am not sure what the expected usage of the API is, but
    StackTraceStore::add_trace() seems to be able to grow without bounds.
    Only a GC truncates them to the live ones. That in itself seems to be
    problematic (GCs can be *wide* apart), and of course some of the API
    methods add to that because they duplicate that unbounded array. Do you
    have any concerns/measurements about this?


So, the theory is that yes, add_trace can grow without bounds, but it only grows by one sample per 512k of allocated space. The stacks it gathers are currently capped at 64 frames (though at some point I'd like to make that an option for the user). So I have no concerns because:

- If this really is taking a lot of space, that means the job is keeping a lot of objects live as well, so the entire heap is getting huge.
- If that is the case, you will be triggering a GC at some point anyway.

(I'm sweeping under the rug the issue of "what if we set the rate to 1, for example", because as you lower the sampling rate we cannot guarantee low overhead; the idea behind this feature is to provide meaningful allocation samples at low overhead.)

I have no measurements really right now but since I now have some statistics I can poll, I will look a bit more at this question.

Same closing point as above: the user should expect this to happen if the sampling rate is too small. That can probably be reflected in the StartHeapSampling documentation as a note: be careful, this might impact your performance.


    Also, these stack traces might hold on to huge arrays. Any
    consideration of that? Particularly it might be the cause for OOMEs in
    tight memory situations.


There is a stack size maximum that is set to 64 so it should not hold huge arrays. I don't think this is an issue but I can double check with a test or two.
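As a rough back-of-envelope check of this claim, the sampler's worst-case bookkeeping can be bounded as below. The 16-bytes-per-frame figure is an assumption (roughly a jvmtiFrameInfo on 64-bit), not a number from the patch:

```cpp
#include <cassert>
#include <cstddef>

// Upper bound on sampler storage under stated assumptions: one sample per
// 'rate' bytes of live heap, at most 'max_frames' frames per trace, and
// 'bytes_per_frame' per stored frame entry. All constants illustrative.
size_t worst_case_sampler_bytes(size_t live_heap_bytes, size_t rate,
                                size_t max_frames, size_t bytes_per_frame) {
  size_t samples = live_heap_bytes / rate;
  return samples * max_frames * bytes_per_frame;
}
```

Under these assumptions, a 4 GiB live heap at the default 512k rate yields about 8192 samples and roughly 8 MiB of trace storage, well under 1% of the heap.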


    - please consider adding a safepoint check in
    HeapMonitoring::weak_oops_do to prevent accidental misuse.

    - in struct StackTraceStorage, the public fields may also need
    underscores. At least some files in the runtime directory have structs
    with underscored public members (and some don't). The runtime team
    should probably comment on that.


Agreed, I did not know. I looked around and it seemed a lot of structs did not have them, so I left it as is. I will happily change it if someone prefers (I was not sure whether you really had a preference; your sentence read more as a note of "this might need to change but I don't know if the runtime team enforces that"; let me know if I read that wrongly).


    - In StackTraceStorage::weak_oops_do(), when examining the
    StackTraceData, maybe it is useful to consider having a non-NULL
    reference outside of the heap's reserved space an error. There should
    be no oop outside of the heap's reserved space ever.

    Unless you allow storing random values in StackTraceData::obj, which I
    would not encourage.


I suppose you are talking about this part:

  if ((value != NULL && Universe::heap()->is_in_reserved(value)) &&
      (is_alive == NULL || is_alive->do_object_b(value))) {

What you are saying is that I could have something like:

  if (value != my_non_null_reference &&
      (is_alive == NULL || is_alive->do_object_b(value))) {

Is that what you meant? Is there really a reason to do so? When I look at the code, is_in_reserved seems like an O(1) method call. I'm not even sure we can have a NULL value, to be honest. I might have to study that to see if this was not a paranoid test to begin with.

The is_alive code has now morphed due to the comment below.



    - HeapMonitoring::weak_oops_do() does not seem to use the
    passed AbstractRefProcTaskExecutor.


It did use it:

  size_t HeapMonitoring::weak_oops_do(
      AbstractRefProcTaskExecutor *task_executor,
      BoolObjectClosure* is_alive,
      OopClosure *f,
      VoidClosure *complete_gc) {
    assert(SafepointSynchronize::is_at_safepoint(), "must be at safepoint");

    if (task_executor != NULL) {
      task_executor->set_single_threaded_mode();
    }
    return StackTraceStorage::storage()->weak_oops_do(is_alive, f, complete_gc);
  }

But due to the comment below, I refactored this, so this is no longer here. Now I have an always true closure that is passed.
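The always-true closure idiom mentioned here, in miniature; the types below are mocked stand-ins for HotSpot's BoolObjectClosure and oop, and only the shape of the refactoring is real:

```cpp
#include <cassert>

typedef void* oop;   // placeholder for HotSpot's oop type

class BoolObjectClosure {
 public:
  virtual bool do_object_b(oop obj) = 0;
  virtual ~BoolObjectClosure() {}
};

// Instead of accepting a NULL is_alive closure and special-casing it at
// every use, pass a closure that reports everything alive.
class AlwaysTrueClosure : public BoolObjectClosure {
 public:
  bool do_object_b(oop) { return true; }
};

// The caller no longer needs "is_alive == NULL ||" checks:
int count_live(BoolObjectClosure* is_alive, oop* objs, int n) {
  int live = 0;
  for (int i = 0; i < n; i++) {
    if (is_alive->do_object_b(objs[i])) live++;   // never NULL by contract
  }
  return live;
}
```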


    - I do not understand allowing to call this method with a NULL
    complete_gc closure. This would mean that objects referenced from the
    object that is referenced by the StackTraceData are not pulled, meaning
    they would get stale.

    - same with is_alive parameter value of NULL


So these questions made me look a bit closer at this code. I think it was written this way to have a very small impact on the file, but you are right, there is no reason for it here. I've simplified the code by adding a process_HeapSampling method in referenceProcessor.cpp that handles everything there.

The code allowed NULLs because it depended on where you were coming from and how the code was being called.

- I added a static always_true variable and now pass that, to be more consistent with the rest of the code.
- I moved the complete_gc into process_phaseHeapSampling (a new method) and handle the task_executor and the complete_gc there.
    - Newbie question: in our code we did a set_single_threaded_mode, but I see that process_phaseJNI does it right before its call; do I need to do it for process_phaseHeapSampling?
That API is much cleaner (in my mind) and consistent with what is done around it (again, in my mind).


    - heapMonitoring.cpp:590: I do not completely understand the purpose of
    this code: in the end this results in a fixed value directly dependent
    on the Thread address anyway?
    IOW, what is special about exactly 20 rounds?


So we really want a fast random number generator whose intervals have a specific mean (512k is the default we use). The code uses the thread address as the starting number of the sequence (the rationale being: why not, it is random enough). Then, instead of just starting there, we prime the sequence and really only start at the 21st number; that is arbitrary and I have not done a study to see whether we could do more or less of that.

As I have the statistics of the system up and running, I'll run some experiments to see whether this priming is needed and whether 20 is a good number.
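For readers, the overall shape of such a sampler can be sketched as below: seed from the thread address, prime the generator for some rounds, then draw geometrically distributed intervals around the requested mean. The PRNG choice (xorshift64) and the priming count are illustrative; the patch uses its own generator.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

struct ToySampler {
  uint64_t _state;

  // Prime the sequence so nearby seeds (e.g. nearby thread addresses)
  // diverge before the first interval is drawn.
  explicit ToySampler(uint64_t seed, int prime_rounds = 20) {
    _state = seed | 1;                    // avoid the all-zero state
    for (int i = 0; i < prime_rounds; i++) next_random();
  }

  uint64_t next_random() {                // xorshift64 step
    _state ^= _state << 13;
    _state ^= _state >> 7;
    _state ^= _state << 17;
    return _state;
  }

  // Exponential draw with the given mean: -mean * ln(U), U uniform in (0,1).
  size_t pick_next_sample(size_t mean_bytes) {
    double u = (next_random() >> 11) * (1.0 / (1ULL << 53));  // [0,1)
    if (u <= 0.0) u = 1e-12;
    return (size_t)(-std::log(u) * (double)mean_bytes);
  }
};
```

Averaged over many draws, the intervals tend to the configured mean, which is exactly the property the statistics discussed above are meant to verify.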


    - also I would consider stripping a few bits of the threads' address as
    initialization value for your rng. The last three bits (and probably
    more, check whether the Thread object is allocated on special
    boundaries) are always zero for them.
    Not sure if the given "random" value is random enough before/after
    this method, so just skip that comment if you think this is not
    required.


"I don't know" is the honest answer. I think what is important is that we tend towards the mean and that it is random "enough" not to fall into the pitfall of only sampling a subset of objects due to their allocation order. I added this as a test to do: see whether it changes the mean in any way for the 512k default value and/or whether the first 1000 elements look better.
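Thomas' point about the low bits can be illustrated with a small seed-derivation sketch; the 3-bit alignment and the splitmix64-style mixer are assumptions for illustration, not what the patch does:

```cpp
#include <cassert>
#include <cstdint>

// Thread objects are allocated on aligned boundaries, so the low bits of
// their address are always zero and carry no entropy. Shifting them out and
// running the result through a mixer (here the splitmix64 finalizer, a
// cheap, well-studied avalanche function) gives nearby thread addresses
// well-separated seeds.
uint64_t seed_from_thread_address(uint64_t addr, unsigned alignment_bits = 3) {
  uint64_t x = addr >> alignment_bits;      // drop the always-zero low bits
  x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL; // splitmix64 finalizer
  x ^= x >> 27; x *= 0x94d049bb133111ebULL;
  x ^= x >> 31;
  return x;
}
```

Because the mixer is a bijection, distinct (shifted) addresses always map to distinct seeds.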


    Some more random nits I did not find a place to put anywhere:

    - ThreadLocalAllocBuffer::_extra_space does not seem to be used
    anywhere?


Good catch :).


    - Maybe indent the declaration of ThreadLocalAllocBuffer::_bytes_until_sample to align below the other members of that group.


Done; I moved it up a bit to keep non-static members together and static ones separate.

    Thanks,
       Thomas


Thanks for your review!
Jc



Re: Low-Overhead Heap Profiling

Robbin Ehn
In reply to this post by JC Beyler
Hi Thomas,

First I want to thank you for helping out with this!

On 07/03/2017 02:44 PM, Thomas Schatzl wrote:

> Hi,
>
> On Wed, 2017-06-28 at 09:51 +0200, Robbin Ehn wrote:
>> Hi,
>>
>> On 06/21/2017 10:45 PM, JC Beyler wrote:
>>>
>>> Hi all,
>>>
>>> First off: Thanks again to Robbin and Thomas for their reviews :)
>>>
>>> Next, I've uploaded a new webrev:
>>> http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/
>>>
>>> Here is an update:
>>>
>>> - @Robbin, I forgot to say that yes I need to look at implementing
>>> this for the other architectures and testing it before it is all
>>> ready to go. Is it common to have it working on all possible
>>> combinations or is there a subset that I should be doing first and
>>> we can do the others later?
>>
>> I will change my mind here and propose a different approach.
>> Right now the only compiler platform change you have is for the
>> FastTLABRefill.
>> FastTLABRefill works for all GCs except G1 (CMS is deprecated), but
>> only in C1 compiled code.
>> The performance angle of this is not really relevant, since you will
>> then have C2 compiled code.
>
> Afaik C2 compiled code also uses the stubs for TLAB refill/inline
> allocation too (I may remember wrongly).

The FastTLABRefill is only checked in e.g.:
c1_Runtime1_x86.cpp

>
>> So I suggested that we should remove FastTLABRefill completely, I
>> will try to makes this happen but most likely not before my long
>> vacation.
>>
>> That means you can avoid all platform specific code (as patch is
>> now), correct?
>>
>> If so just keep the x86 support in there and we remove that if I
>> succeed, plus you can support all platforms when
>> FastTLABRefill=false.
>>
>> Sounds good?
>
> Not sure it is easy to remove FastTLABRefill, but that may be
> something to discuss in more detail. Just noting that while the runtime
> call can be negligible on large heaps (with large TLABs) it might not
> be on small ones (or you have many Java threads which causes your TLABs
> to get smallish due to how TLAB size is calculated iirc).
>
> E.g. Serial GC may still be used a lot on embedded platforms which tend
> to use smaller heaps and benefit the most from this optimization.

This was a worry I also had, I was told not to worry (by several folks).

>
> Now one can always automatically disable FastTLABRefill if you want
> this kind of heap monitoring (and it does not work with
> FastTLABRefill).

That would be good.

When I get back I will start the CSR for removing FastTLABRefill.
The CSR can of course be 'denied', but I will give it a go.

Thanks!

/Robbin

>
> Thanks,
>    Thomas
>

Re: Low-Overhead Heap Profiling

JC Beyler
In reply to this post by Robbin Ehn
Forgot to say, about my numbers:
 - The actual numbers I got for the various array sizes are not in the test; I ran the program 30 times and parsed the output. Here are the averages and standard deviations:
      1000:     1.28% average; 1.13% standard deviation
      10000:    1.59% average; 1.25% standard deviation
      100000:   1.26% average; 1.26% standard deviation

The 1000/10000/100000 are the sizes of the arrays being allocated. These are allocated 100k times and the sampling rate is 111 times the size of the array.

Thanks!
Jc


On Mon, Sep 25, 2017 at 3:01 PM, JC Beyler <[hidden email]> wrote:
Hi all,

After a bit of a break, I am back working on this :). As before, here are two webrevs:

    (This version is compared to version 8 I last showed but ported to the new folder hierarchy)

In this version I have:
  - Handled Thomas' comments from his email of 07/03:
       - Merged the logging to be standard
       - Fixed up the code a bit where asked
       - Added some notes about the code not being thread-safe yet
   - Removed additional dead code from the version that modifies interpreter/c1/c2 
   - Fixed compiler issues so that it compiles with --disable-precompiled-header
        - Tested with ./configure --with-boot-jdk=<jdk8> --with-debug-level=slowdebug --disable-precompiled-headers

   - This allocates a number of arrays and checks that we obtain the number of samples we want, with an accepted error of 5%. I tested it 100 times and it passed every time; I can test more if wanted
   - The actual numbers I got for the various array sizes are not in the test; I ran the program 30 times and parsed the output. Here are the averages and standard deviations:
      1000:     1.28% average; 1.13% standard deviation
      10000:    1.59% average; 1.25% standard deviation
      100000:   1.26% average; 1.26% standard deviation

What this means is that we were always at about 1~2% of the number of samples the test expected.

Let me know what you think,
Jc
   
 

On Wed, Jul 5, 2017 at 9:31 PM, JC Beyler <[hidden email]> wrote:
Hi all,

I apologize, I have not yet handled your remarks but thought this new webrev would also be useful to see and comment on perhaps.

Here is the latest webrev, it is generated slightly different than the others since now I'm using webrev.ksh without the -N option:

And the webrev.07 to webrev.08 diff is here:

(Let me know if it works well)

It's a small change between versions but it:
  - provides a fix that makes the average sample rate correct (more on that below).
  - fixes the code to actually have it play nicely with the fast tlab refill
  - cleaned up a bit the JVMTI text and now use jvmtiFrameInfo
- moved the capability to be onload solo

With this webrev, I've done a small study of the random number generator we use here for the sampling rate. I took a small program and it can be simplified to:

for (outer loop)
  for (inner loop)
    int[] tmp = new int[arraySize];

- I've fixed the outer and inner loops to 800 for this experiment, meaning we allocate an array of a given size 640000 times.

- Each program provides the average sample size used for the whole execution

- Then, I ran each variation 30 times and calculated the average of the average sample size used for various array sizes. I selected the array size to be one of the following: 1, 10, 100, 1000, 10000.

- When compared to 512kb, the error of the average sample size over 30 runs is:
1: 4.62% of error
10: 3.09% of error
100: 0.36% of error
1000: 0.1% of error
10000: 0.03% of error 

What it shows is that, depending on the number of samples, the average does become better. This is because with an allocation of 1 element per array, it will take longer to hit one of the thresholds. This is seen by looking at the sample count statistic I put in. For the same number of iterations (800 * 800), the different array sizes provoke:
1: 62 samples
10: 125 samples
100: 788 samples
1000: 6166 samples
10000: 57721 samples

And of course, the more samples you have, the more sample rates you pick, which means that your average gets closer using that math.
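That convergence argument is easy to check with a tiny simulation: draw exponentially distributed intervals around the mean and watch the relative error of their average shrink as the number of draws grows. The rand()-based generator here is purely illustrative, not the patch's sampler:

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>

// Relative error of the observed average interval vs the configured mean
// (512 KiB in the discussion above) for a given number of samples. More
// samples -> smaller error, which is why runs allocating bigger arrays
// (and thus taking more samples) report averages closer to 512k.
double average_interval_error(int num_samples, double mean, unsigned seed) {
  srand(seed);
  double sum = 0.0;
  for (int i = 0; i < num_samples; i++) {
    double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);  // uniform (0,1)
    sum += -std::log(u) * mean;                            // exponential draw
  }
  double avg = sum / num_samples;
  return std::fabs(avg - mean) / mean;                     // relative error
}
```

The standard error of the mean shrinks like 1/sqrt(n), matching the trend in the 62-sample vs 57721-sample numbers above.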

Thanks,
Jc

On Thu, Jun 29, 2017 at 10:01 PM, JC Beyler <[hidden email]> wrote:
Thanks Robbin,

This seems to have worked. When I have the next webrev ready, we will find out but I'm fairly confident it will work!

Thanks again!
Jc

On Wed, Jun 28, 2017 at 11:46 PM, Robbin Ehn <[hidden email]> wrote:
Hi JC,

On 06/29/2017 12:15 AM, JC Beyler wrote:
B) Incremental changes

I guess the most common workflow here is using mq:
hg qnew fix_v1
edit files
hg qrefresh
hg qnew fix_v2
edit files
hg qrefresh

if you do hg log you will see 2 commits

webrev.ksh -r -2 -o my_inc_v1_v2
webrev.ksh -o my_full_v2


In your .hgrc you might need:
[extensions]
mq =

/Robbin


Again, another newbie question here...

For showing the incremental changes, is there a link that explains how to do that? I apologize for my newbie questions all the time :)

Right now, I do:

  ksh ../webrev.ksh -m -N

That generates a webrev.zip, which I send to Chuck Rasbold. He then uploads it as a new webrev.

I tried committing my change and adding a small change. Then, if I just do ksh ../webrev.ksh without any options, it seems to produce a similar page, but now with only the changes I had (so the 06-07 comparison you were talking about) and a changeset that has it all. I imagine that is what you meant.

Which means that my workflow would become:

1) Make changes
2) Make a webrev without any options to show just the differences with the tip
3) Amend my changes into my local commit so that it is all in one changeset
4) Go to 1

Does that seem correct to you?

Note that when I do this, I only see the full change of a file in the full change set (Side note here: now the page says change set and not patch, which is maybe why Serguei was having issues?).

Thanks!
Jc



On Wed, Jun 28, 2017 at 1:12 AM, Robbin Ehn <[hidden email]> wrote:

    Hi,

    On 06/28/2017 12:04 AM, JC Beyler wrote:

        Dear Thomas et al,

        Here is the newest webrev:
        http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/



    You have some more bits to fix in there, but generally this looks good, and it is really nice with more tests.
    I'll do a deep dive and re-test this when I get back from my long vacation, with whatever patch version you have then.

    Also I think it's time you provide incremental (v06->07 changes) as well as complete change-sets.

    Thanks, Robbin




        Thomas, I "think" I have answered all your remarks. The summary is:

        - The statistics system is up and provides insight on what the heap sampler is doing
             - I've noticed that, though the sampling rate is at the right mean, we are missing some samples; I have not yet tracked down why (details below)

        - I've run a tiny benchmark that is the worst case: it is a very tight loop that allocates a small array
             - In this case, I see no overhead when the system is off, so that is a good start :)
             - I see right now a high overhead in this case when sampling is on. This is not really too surprising, but I'm going to see if it is consistent with our
        internal implementation. The benchmark is really allocation-stressful, so I'm not too surprised, but I want to do the due diligence.

           - The statistics system is up and I have a new test
        http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatTest.java.patch
              - I did a bit of a study about the random generator here, more details are below but basically it seems to work well

           - I added a capability but since this is the first time doing this, I was not sure I did it right
             - I did add a test though for it and the test seems to do what I expect (all methods are failing with the JVMTI_ERROR_MUST_POSSESS_CAPABILITY error).
                 - http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorNoCapabilityTest.java.patch

           - I still need to figure out what to do about the multi-agent vs single-agent issue

           - As far as measurements, it seems I still need to look at:
             - Why we do the 20 random calls first, are they necessary?
             - Look at the mean of the sampling rate that the random generator does and also what is actually sampled
             - What is the overhead in terms of memory/performance when on?

        I have inlined my answers, I think I got them all in the new webrev, let me know your thoughts.

        Thanks again!
        Jc


        On Fri, Jun 23, 2017 at 3:52 AM, Thomas Schatzl <[hidden email]> wrote:

             Hi,

             On Wed, 2017-06-21 at 13:45 -0700, JC Beyler wrote:
             > Hi all,
             >
             > First off: Thanks again to Robbin and Thomas for their reviews :)
             >
             > Next, I've uploaded a new webrev:
             > http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/

             >
             > Here is an update:
             >
             > - @Robbin, I forgot to say that yes I need to look at implementing
             > this for the other architectures and testing it before it is all
             > ready to go. Is it common to have it working on all possible
             > combinations or is there a subset that I should be doing first and we
             > can do the others later?
             > - I've tested slowdebug, built and ran the JTreg tests I wrote with
             > slowdebug and fixed a few more issues
             > - I've refactored a bit of the code following Thomas' comments
             >    - I think I've handled all the comments from Thomas (I put
             > comments inline below for the specifics)

             Thanks for handling all those.

             > - Following Thomas' comments on statistics, I want to add some
             > quality assurance tests and find that the easiest way would be to
             > have a few counters of what is happening in the sampler and expose
             > that to the user.
             >    - I'll be adding that in the next version if no one sees any
             > objections to that.
             >    - This will allow me to add a sanity test in JTreg about number of
             > samples and average of sampling rate
             >
             > @Thomas: I had a few questions that I inlined below but I will
             > summarize the "bigger ones" here:
             >    - You mentioned constants are not using the right conventions, I
             > looked around and didn't see any convention except normal naming then
             > for static constants. Is that right?

             I looked through
             https://wiki.openjdk.java.net/display/HotSpot/StyleGuide
             and the rule is to "follow an existing pattern and must have a
             distinct appearance from other names". Which does not help a
             lot, I guess :/ The GC team started using upper camel case,
             e.g. SomeOtherConstant, but very likely this is not applied
             consistently throughout. So not adding yet another style (like
             kMaxStackDepth, with a "k" in front with some unknown meaning)
             is fine with me.

             (Chances are you will find that style somewhere used anyway too,
             apologies if so :/)


        Thanks for that link, now I know where to look. I used the upper camel case in my code as well then :) I should have gotten them all.


              > PS: I've also inlined my answers to Thomas below:
              >
              > On Tue, Jun 13, 2017 at 8:03 AM, Thomas Schatzl
              > <thomas.schatzl@oracle.com> wrote:
              > > Hi all,
              > >
              > > On Mon, 2017-06-12 at 11:11 -0700, JC Beyler wrote:
              > > > Dear all,
              > > >
              > > > I've continued working on this and have done the following
              > > webrev:
              > > > http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/

              > >
              > > [...]
              > > > Things I still need to do:
              > > >    - Have to fix that TLAB case for the FastTLABRefill
              > > >    - Have to start looking at the data to see that it is
              > > consistent and does gather the right samples, right frequency, etc.
              > > >    - Have to check the GC elements and what that produces
              > > >    - Run a slowdebug run and ensure I fixed all those issues you
              > > saw > Robbin
              > > >
              > > > Thanks for looking at the webrev and have a great week!
              > >
              > >   I am scratching a bit on the surface of this change, so apologies for
              > > rather shallow comments:
              > >
              > > - macroAssembler_x86.cpp:5604: while this is compiler code, and I
              > > am not sure this is final, please avoid littering the code with
              > > TODO remarks :) They tend to be candidates for later wtf moments
              > > only.
              > >
              > > Just file a CR for that.
              > >
              > Newcomer question: what is a CR and not sure I have the rights to do
              > that yet ? :)

             Apologies. CR is a change request, this suggests to file a bug in the
             bug tracker. And you are right, you can't just create a new account in
             the OpenJDK JIRA yourselves. :(


        Ok good to know, I'll continue with my own todo list but I'll work hard on not letting it slip in the webrevs anymore :)


             I was mostly referring to the "... but it is a TODO" part of that
             comment in macroassembler_x86.cpp. Comments about the why of the code
             are appreciated.

             [Note that I now understand that this is to some degree still work in
             progress. As long as the final changeset does not contain TODO's I am
             fine (and it's not a hard objection, rather their use in "final" code
             is typically limited in my experience)]

             5603   // Currently, if this happens, just set back the actual end to
             where it was.
             5604   // We miss a chance to sample here.

             It would be okay, though explaining what "this" is and why we miss a
             chance to sample here would be best.

             Like maybe:

             // If we needed to refill TLABs, just set the actual end point to
             // the end of the TLAB again. We do not sample here although we could.

        Done with your comment, it works well in my mind.

             I am not sure whether "miss a chance to sample" meant "we could, but
             consciously don't because it's not that useful" or "it would be
             necessary but don't because it's too complicated to do.".

             Looking at the original comment once more, I am also not sure if that
             comment shouldn't be referring to the "end" variable (not actual_end)
             because that's the variable that is responsible for taking the sampling
             path? (Going from the member description of ThreadLocalAllocBuffer).


        I've moved this code and it no longer shows up here but the rationale and answer was:

        So.. Yes, end is the variable provoking the sampling. Actual end is the actual end of the TLAB.

        What was happening here is that the code is resetting _end to point towards the end of the new TLAB. Because we now have _end for sampling and _actual_end for
        the actual end, we need to update _actual_end as well.

        Normally, were we to do the real work here, we would calculate the (end - start) offset, then do:

        - Set the new end to : start + (old_end - old_start)
        - Set the actual end like we do here now, because it is the actual end.

        Why is this not done here now anymore?
            - I was still debating which path to take:
               - Do it in the fast refill code, it has its perks:
                   - In a world where fast refills are happening all the time or a lot, we can augment there the code to do the sampling
               - Remember what we had as an end before leaving the slowpath and check on return
                   - This is what I'm doing now, it removes the need to go fix up all fast refill paths but if you remain in fast refill paths, you won't get sampling. I
        have to think of the consequences of that, maybe a future change later on?
                      - I have the statistics now so I'm going to study that
                         -> By the way, though my statistics are showing I'm missing some samples, if I turn off FastTlabRefill, it is the same loss so for now, it seems
        this does not occur in my simple test.
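To make the end/_actual_end bookkeeping concrete, here is a small sketch of the two-pointer scheme described above (all names and numbers are made up for illustration; this is not the actual HotSpot code):

```java
// Hypothetical sketch: the fast path compares against a clamped 'end',
// while the real end of the TLAB is kept separately in 'actualEnd'.
public class TlabSketch {
    long start, top;
    long actualEnd;   // real end of the TLAB (_actual_end)
    long end;         // clamped end seen by the fast path (_end)

    // Called on (re)fill: clamping 'end' to the next sample point forces
    // the fast path to fall into the slow path exactly there.
    void fill(long newStart, long newActualEnd, long bytesUntilSample) {
        start = top = newStart;
        actualEnd = newActualEnd;
        end = Math.min(newStart + bytesUntilSample, newActualEnd);
    }

    // Fast-path allocation: returns false when allocation would cross
    // 'end'; the slow path then decides whether this is a sample point
    // or a real refill.
    boolean allocate(long size) {
        if (top + size > end) {
            return false;
        }
        top += size;
        return true;
    }

    // Slow path after taking a sample: "set back the actual end" so
    // allocation can continue up to the real end of the TLAB.
    void setBackActualEnd() {
        end = actualEnd;
    }

    public static void main(String[] args) {
        TlabSketch t = new TlabSketch();
        t.fill(0, 1024, 256);
        System.out.println(t.allocate(200)); // fits below the sample point
        System.out.println(t.allocate(100)); // would cross it: slow path
        t.setBackActualEnd();
        System.out.println(t.allocate(100)); // fits again up to the real end
    }
}
```

The sketch also shows why the fast refill paths need fixing up: a refill that only restores `end = actualEnd` silently drops the pending sample point.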



             But maybe I am only confused and it's best to just leave the comment
             away. :)

             Thinking about it some more, doesn't this not-sampling in this case
             mean that sampling does not work in any collector that does inline TLAB
             allocation at the moment? (Or is inline TLAB alloc automatically
             disabled with sampling somehow?)

             That would indeed be a bigger TODO then :)


        Agreed, this remark made me think that perhaps as a first step the new way of doing it is better but I did have to:
           - Remove the const from the ThreadLocalAllocBuffer remaining and hard_end methods
           - Move hard_end out of the header file to have a bit more logic there

        Please let me know what you think of that and if you prefer it this way or changing the fast refills. (I prefer this way now because it is more incremental).


             > > - calling HeapMonitoring::do_weak_oops() (which should probably be
             > > called weak_oops_do() like other similar methods) only if string
             > > deduplication is enabled (in g1CollectedHeap.cpp:4511) seems wrong.
             >
             > The call should be at least around 6 lines up outside the if.
             >
             > Preferentially in a method like process_weak_jni_handles(), including
             > additional logging. (No new (G1) gc phase without minimal logging
             > :)).
             > Done but really not sure because:
             >
             > I put for logging:
             >   log_develop_trace(gc, freelist)("G1ConcRegionFreeing [other] : heap
             > monitoring");

             I would think that "gc, ref" would be more appropriate log tags for
             this similar to jni handles.
             (I am also not sure what weak reference handling has to do with
             G1ConcRegionFreeing, so I am a bit puzzled)


        I was not sure what to put for the tags or really as the message. I cleaned it up a bit now to:
             log_develop_trace(gc, ref)("HeapSampling [other] : heap monitoring processing");



             > Since weak_jni_handles didn't have logging for me to be inspired
             > from, I did that but unconvinced this is what should be done.

             The JNI handle processing does have logging, but only in
             ReferenceProcessor::process_discovered_references(). In
             process_weak_jni_handles() only overall time is measured (in a G1
             specific way, since only G1 supports disabling reference processing) :/

             The code in ReferenceProcessor prints both time taken
             referenceProcessor.cpp:254, as well as the count, but strangely only in
             debug VMs.

             I have no idea why this logging is that unimportant to only print that
             in a debug VM. However there are reviews out for changing this area a
             bit, so it might be useful to wait for that (JDK-8173335).


        I cleaned it up a bit anyway and now it returns the count of objects that are in the system.


             > > - the change doubles the size of
             > > CollectedHeap::allocate_from_tlab_slow() above the "small and nice"
             > > threshold. Maybe it could be refactored a bit.
             > Done I think, it looks better to me :).

             In ThreadLocalAllocBuffer::handle_sample() I think the
             set_back_actual_end()/pick_next_sample() calls could be hoisted out of
             the "if" :)


        Done!


             > > - referenceProcessor.cpp:261: the change should add logging about
             > > the number of references encountered, maybe after the corresponding
             > > "JNI weak reference count" log message.
             > Just to double check, are you saying that you'd like to have the heap
             > sampler to keep in store how many sampled objects were encountered in
             > the HeapMonitoring::weak_oops_do?
             >    - Would a return of the method with the number of handled
             > references and logging that work?

             Yes, it's fine if HeapMonitoring::weak_oops_do() only returned the
             number of processed weak oops.


        Done also (but I admit I have not tested the output yet) :)


             >    - Additionally, would you prefer it in a separate block with its
             > GCTraceTime?

             Yes. Both kinds of information are interesting: while the time taken is
             typically more important, the next question would be why, and the
             number of references typically goes a long way there.

             See above though, it is probably best to wait a bit.


        Agreed that I "could" wait but, if it's ok, I'll just refactor/remove this when we get closer to something final. Either JDK-8173335
        has gone in by then and I will notice it, or it will go in soon and I can change it then.


             > > - threadLocalAllocBuffer.cpp:331: one more "TODO"
             > Removed it and added it to my personal todos to look at.
             >      > >
             > > - threadLocalAllocBuffer.hpp: ThreadLocalAllocBuffer class
             > > documentation should be updated about the sampling additions. I
             > > would have no clue what the difference between "actual_end" and
             > > "end" would be from the given information.
             > If you are talking about the comments in this file, I made them more
             > clear I hope in the new webrev. If it was somewhere else, let me know
             > where to change.

             Thanks, that's much better. Maybe add a note in the comment of the class
             that ThreadLocalAllocBuffer provides a sampling facility by modifying
             the end() of the TLAB to cause "frequent" calls into the runtime, where
             the actual sampling takes place.


        Done, I think it's better now. Added something about the slow_path_end as well.


             > > - in heapMonitoring.hpp: there are some random comments about some
             > > code that has been grabbed from "util/math/fastmath.[h|cc]". I
             > > can't tell whether this is code that can be used but I assume that
             > > Noam Shazeer is okay with that (i.e. that's all Google code).
             > Jeremy and I double checked and we can release that as I thought. I
             > removed the comment from that piece of code entirely.

             Thanks.

             > > - heapMonitoring.hpp/cpp static constant naming does not correspond
             > > to Hotspot's. Additionally, in Hotspot static methods are cased
             > > like other methods.
             > I think I fixed the methods to be cased the same way as all other
             > methods. For static constants, I was not sure. I fixed a few other
             > variables but I could not seem to really see a consistent trend for
             > constants. I made them as variables but I'm not sure now.

             Sorry again, style is a kind of mess. The goal of my suggestions here
             is only to prevent yet another style creeping in.

             > > - in heapMonitoring.cpp there are a few cryptic comments at the top
             > > that seem to refer to internal stuff that should probably be
             > > removed.
             > Sorry about that! My personal todos not cleared out.

             I am happy about comments, but I simply did not understand any of that
             and I do not know about other readers as well.

             If you think you will remember to remove/update them by the review
             proper, that is fine (I misunderstood the review situation a little, it seems).

             > > I did not think through the impact of the TLAB changes on collector
             > > behavior yet (if there are). Also I did not check for problems with
             > > concurrent mark and SATB/G1 (if there are).
             > I would love to know your thoughts on this, I think this is fine. I

             I think so too now. No objects are made live out of thin air :)

             > see issues with multiple threads right now hitting the stack storage
             > instance. Previous webrevs had a mutex lock here but we took it out
             > for simplificity (and only for now).

             :) When looking at this after some thinking I now assume for this
             review that this code is not MT safe at all. There seems to be more
             synchronization missing than just the one for the StackTraceStorage. So
             no comments about this here.


        I double checked a bit (quickly, I admit) but it seems that synchronization in StackTraceStorage is really all you need (all methods lead to a StackTraceStorage one
        and can be multithreaded outside of that).
        There is a question about the initialization where the method HeapMonitoring::initialize_profiling is not thread safe.
        It would work (famous last words) and not crash if there was a race but we could add a synchronization point there as well (and therefore on the stop as well).

        But anyway I will really check and do this once we add back synchronization.


             Also, this would require some kind of specification of what is allowed
             to be called when and where.


        Would we specify this with the methods in the jvmti.xml file? We could start by specifying in each that they are not thread safe but I saw no mention of that for
        other methods.


             One potentially relevant observation about locking here: depending on
             sampling frequency, StackTraceStore::add_trace() may be rather
             frequently called. I assume that you are going to do measurements :)


        Though we don't have the TLAB implementation in our code, the compiler generated sampler adds about 2% overhead with a 512k sampling rate. I can do real measurements
        when the code settles and we can see how costly this is as a TLAB implementation.
        However, my theory is that if the rate is 512k, the memory/performance overhead should be minimal since it is what we saw with our code/workloads (though not called
        the same way, we call it essentially at the same rate).
        If you have a benchmark you'd like me to test, let me know!

        Right now, with my really small test, this does use a bit of overhead even for a 512k sample size. I don't know yet why, I'm going to see what is going on.

        Finally, I think it is not reasonable to suppose the overhead to be negligible if the sampling rate used is too low. The user should know that the lower the rate,
        the higher the overhead (documentation TODO?).


             I am not sure what the expected usage of the API is, but
             StackTraceStore::add_trace() seems to be able to grow without bounds.
             Only a GC truncates them to the live ones. That in itself seems to be
             problematic (GCs can be *wide* apart), and of course some of the API
             methods add to that because they duplicate that unbounded array. Do you
             have any concerns/measurements about this?


        So, the theory is that yes, add_trace can grow without bounds, but it grows at a sample per 512k of allocated space. The stacks it gathers are currently
        maxed at 64 (I'd like to expand that to an option to the user though at some point). So I have no concerns because:

        - If really this is taking a lot of space, that means the job is keeping a lot of objects in memory as well, therefore the entire heap is getting huge
        - If this is the case, you will be triggering a GC at some point anyway.

        (I'm putting under the rug the issue of "What if we set the rate to 1 for example" because as you lower the sampling rate, we cannot guarantee low overhead; the
        idea behind this feature is to have a means of having meaningful allocated samples at a low overhead)

        I have no measurements really right now but since I now have some statistics I can poll, I will look a bit more at this question.

        I have the same last sentence than above: the user should expect this to happen if the sampling rate is too small. That probably can be reflected in the
        StartHeapSampling as a note : careful this might impact your performance.
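As a rough back-of-the-envelope check of the growth argument above (the 512k interval and 64-frame cap come from the discussion; the per-trace byte cost is a made-up number purely for illustration):

```java
// Sketch: estimate how much trace storage accumulates between GCs at the
// default sampling interval. The per-trace cost is a hypothetical figure.
public class StorageGrowthEstimate {
    // Number of traces retained after 'allocatedBytes' of allocation,
    // at one sample per 'intervalBytes'.
    static long samples(long allocatedBytes, long intervalBytes) {
        return allocatedBytes / intervalBytes;
    }

    public static void main(String[] args) {
        long interval = 512 * 1024;   // default sampling interval (bytes)
        long allocated = 1L << 30;    // say, 1 GiB allocated between GCs
        long n = samples(allocated, interval);
        // Assumed cost: 64 frames at 16 bytes each, plus some header.
        long perTrace = 64 * 16 + 64;
        System.out.println(n + " traces, ~" + (n * perTrace) / 1024 + " KiB");
    }
}
```

Even a full GiB of allocation between collections only yields a couple of thousand bounded traces, which supports the "the heap would be huge anyway" argument, while also showing the cost scales linearly if the interval is set very small.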


             Also, these stack traces might hold on to huge arrays. Any
             consideration of that? Particularly it might be the cause for OOMEs in
             tight memory situations.


        There is a stack size maximum that is set to 64 so it should not hold huge arrays. I don't think this is an issue but I can double check with a test or two.


             - please consider adding a safepoint check in
             HeapMonitoring::weak_oops_do to prevent accidental misuse.

             - in struct StackTraceStorage, the public fields may also need
             underscores. At least some files in the runtime directory have structs
             with underscored public members (and some don't). The runtime team
             should probably comment on that.


        Agreed, I did not know. I looked around and a lot of structs did not seem to have them, so I left it as is. I will happily change it if someone prefers (I was not
        sure if you really preferred it or not; your sentence seemed to be more a note of "this might need to change but I don't know if the runtime team enforces that", let
        me know if I read that wrongly).


             - In StackTraceStorage::weak_oops_do(), when examining the
             StackTraceData, maybe it is useful to consider having a non-NULL
             reference outside of the heap's reserved space an error. There should
             be no oop outside of the heap's reserved space ever.

             Unless you allow storing random values in StackTraceData::obj, which I
             would not encourage.


        I suppose you are talking about this part:
        if ((value != NULL && Universe::heap()->is_in_reserved(value)) &&
                    (is_alive == NULL || is_alive->do_object_b(value))) {

        What you are saying is that I could have something like:
        if (value != my_non_null_reference &&
                    (is_alive == NULL || is_alive->do_object_b(value))) {

        Is that what you meant? Is there really a reason to do so? When I look at the code, is_in_reserved seems like an O(1) method call. I'm not even sure we can have a
        NULL value to be honest. I might have to study that to see if this was not a paranoid test to begin with.

        The is_alive code has now morphed due to the comment below.



             - HeapMonitoring::weak_oops_do() does not seem to use the
             passed AbstractRefProcTaskExecutor.


        It did use it:
           size_t HeapMonitoring::weak_oops_do(
              AbstractRefProcTaskExecutor *task_executor,
              BoolObjectClosure* is_alive,
              OopClosure *f,
              VoidClosure *complete_gc) {
            assert(SafepointSynchronize::is_at_safepoint(), "must be at safepoint");

            if (task_executor != NULL) {
              task_executor->set_single_threaded_mode();
            }
            return StackTraceStorage::storage()->weak_oops_do(is_alive, f, complete_gc);
        }

        But due to the comment below, I refactored this, so this is no longer here. Now I have an always true closure that is passed.


             - I do not understand allowing to call this method with a NULL
             complete_gc closure. This would mean that objects referenced from the
             object that is referenced by the StackTraceData are not pulled, meaning
             they would get stale.

             - same with is_alive parameter value of NULL


        So these questions made me look a bit closer at this code. This code I think was written this way to have a very small impact on the file but you are right, there
        is no reason for this here. I've simplified the code by making in referenceProcessor.cpp a process_HeapSampling method that handles everything there.

        The code allowed NULLs because it depended on where you were coming from and how the code was being called.

        - I added a static always_true variable and pass that now to be more consistent with the rest of the code.
        - I moved the complete_gc into process_phaseHeapSampling now (new method) and handle the task_executor and the complete_gc there
             - Newbie question: in our code we did a set_single_threaded_mode but I see that process_phaseJNI does it right before its call, do I need to do it for the
        process_phaseHeapSample?
        That API is much cleaner (in my mind) and is consistent with what is done around it (again in my mind).


             - heapMonitoring.cpp:590: I do not completely understand the purpose of
             this code: in the end this results in a fixed value directly dependent
             on the Thread address anyway?
             IOW, what is special about exactly 20 rounds?


        So we really want a fast random number generator that has a specific mean (512k is the default we use). The code uses the thread address as the start number of
        the sequence (why not, it is random enough, is the rationale). Then, instead of just starting there, we prime the sequence and really only start at the 21st number;
        it is arbitrary and I have not done a study to see if we could do more or less of that.

        As I have the statistics of the system up and running, I'll run some experiments to see if this is needed, is 20 good, or not.


             - also I would consider stripping a few bits of the threads' address as
             initialization value for your rng. The last three bits (and probably
             more, check whether the Thread object is allocated on special
             boundaries) are always zero for them.
             Not sure if the given "random" value is random enough before/after
             this method, so just skip that comment if you think this is not
             required.


        I don't know is the honest answer. I think what is important is that we tend towards a mean and it is random "enough" to not fall in pitfalls of only sampling a
        subset of objects due to their allocation order. I added that as a test to do, to see if it changes the mean in any way for the 512k default value and/or if the first
        1000 elements look better.
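A sketch of the kind of generator being discussed (seed derived from the thread address with the always-zero low bits stripped, "primed" by discarding the first 20 outputs, intervals drawn with a fixed mean); the xorshift variant and all constants here are illustrative, not the actual heapMonitoring.cpp code:

```java
// Hypothetical sampling-interval generator, not the real HotSpot one.
public class SampleIntervalSketch {
    private long state;

    public SampleIntervalSketch(long threadAddress) {
        // Low bits of an aligned Thread address are always zero, so strip
        // them before seeding; keep the state non-zero for xorshift.
        state = (threadAddress >>> 3) | 1;
        // "Prime" the sequence: discard the first 20 outputs so nearby
        // seeds (nearby thread addresses) diverge quickly.
        for (int i = 0; i < 20; i++) {
            next();
        }
    }

    private long next() { // plain Marsaglia xorshift64
        state ^= state << 13;
        state ^= state >>> 7;
        state ^= state << 17;
        return state;
    }

    // Bytes until the next sample, exponentially distributed with the
    // given mean (e.g. 512 * 1024), so sampling approximates a Poisson
    // process over allocated bytes.
    public long pickNextSample(long meanBytes) {
        double u = (next() >>> 11) * 0x1.0p-53; // uniform in [0, 1)
        return (long) (-Math.log(1.0 - u) * meanBytes) + 1;
    }

    public static void main(String[] args) {
        SampleIntervalSketch rng = new SampleIntervalSketch(0x7f1234567000L);
        long mean = 512 * 1024, sum = 0, n = 200_000;
        for (long i = 0; i < n; i++) {
            sum += rng.pickNextSample(mean);
        }
        System.out.println("observed mean = " + (sum / n));
    }
}
```

Running the `main` above is exactly the kind of experiment mentioned: check that the observed mean converges to the configured 512k, with and without the 20-round priming.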


             Some more random nits I did not find a place to put anywhere:

             - ThreadLocalAllocBuffer::_extra_space does not seem to be used
             anywhere?


        Good catch :).


             - Maybe indent the declaration of ThreadLocalAllocBuffer::_bytes_until_sample to align below the other members of that group.


        Done, moved it up a bit to have non-static members together and static ones separate.

             Thanks,
                Thomas


        Thanks for your review!
        Jc







Re: Low-Overhead Heap Profiling

JC Beyler
Dear all,


In order to really test this, I needed to add the locking, so I thought now was a good time. It required a few changes here for the creation to ensure correctness and safety. Now we keep the static pointer but clear the data internally, so re-initialization will be a bit more costly than before. I don't think this is a huge use-case, so I did not think it was a problem. I used the internal MutexLocker; I think I used it well, let me know.

I also added three tests:

1) Stack depth test:

This test shows that the maximum stack depth system is working.

2) Thread safety:

The test creates 24 threads and they all allocate at the same time. The test then checks it does find samples from all the threads.

3) Thread on/off safety

The test creates 24 threads that all allocate a bunch of memory. Then another thread turns the sampling on/off.

Btw, both tests 2 & 3 failed without the locks.
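Test (2) could be sketched roughly as follows (hypothetical: the real test drives a JVMTI agent and then checks the sampler's collected traces, which is stubbed out here):

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a thread-safety test: 24 threads allocating concurrently.
public class ThreadedAllocSketch {
    static final AtomicLong allocated = new AtomicLong();

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[24];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) {
                    // Allocations the sampler should observe; the volume
                    // is tracked only so this sketch can check it ran.
                    byte[] b = new byte[1024];
                    allocated.addAndGet(b.length);
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        // A real test would now query the heap sampler and verify that
        // stack traces from all 24 threads are present.
        System.out.println(allocated.get() + " bytes allocated");
    }
}
```

Without internal locking in the trace storage, concurrent `add_trace` calls from a test like this are exactly what exposes the races mentioned above.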

As I worked on this, I saw a lot of places where the tests are doing very similar things, I'm going to clean up the code a bit and make a HeapAllocator class that all tests can call directly. This will greatly simplify the code.

Thanks for any comments/criticisms!
Jc


On Mon, Oct 2, 2017 at 8:52 PM, JC Beyler <[hidden email]> wrote:
Dear all,

Small update to the webrev:

Full webrev is here:

I updated a bit of the naming, removed a TODO comment, and I added a test for testing the sampling rate. I also updated the maximum stack depth to 1024, there is no reason to keep it so small. I did a micro benchmark that tests the overhead and it seems relatively the same.

I compared allocations from a stack depth of 10 and allocations from a stack depth of 1024 (allocations are from the same helper method in http://cr.openjdk.java.net/~rasbold/8171119/webrev.10/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatRateTest.java):
          - For an array of 1 integer allocated in a loop; stack depth 1024 vs stack depth 10: 1% slower
          - For an array of 200k integers allocated in a loop; stack depth 1024 vs stack depth 10: 3% slower

So basically I am now moving the maximum stack depth to 1024, but we only copy over the stack frames actually used.

For the next webrev, I will be adding a stack depth test to show that it works and probably put back the mutex locking so that we can see how difficult it is to keep thread safe.

Let me know what you think!
Jc



On Mon, Sep 25, 2017 at 3:02 PM, JC Beyler <[hidden email]> wrote:
Forgot to say that for my numbers:
 - Not in the test are the actual numbers I got for the various array sizes, I ran the program 30 times and parsed the output; here are the averages and standard deviation:
      1000:     1.28% average; 1.13% standard deviation
      10000:    1.59% average; 1.25% standard deviation
      100000:   1.26% average; 1.26% standard deviation

The 1000/10000/100000 are the sizes of the arrays being allocated. These are allocated 100k times and the sampling rate is 111 times the size of the array.

Thanks!
Jc


On Mon, Sep 25, 2017 at 3:01 PM, JC Beyler <[hidden email]> wrote:
Hi all,

After a bit of a break, I am back working on this :). As before, here are two webrevs:

    (This version is compared to version 8 I last showed but ported to the new folder hierarchy)

In this version I have:
  - Handled Thomas' comments from his email of 07/03:
       - Merged the logging to be standard
       - Fixed up the code a bit where asked
       - Added some notes about the code not being thread-safe yet
   - Removed additional dead code from the version that modifies interpreter/c1/c2 
   - Fixed compiler issues so that it compiles with --disable-precompiled-header
        - Tested with ./configure --with-boot-jdk=<jdk8> --with-debug-level=slowdebug --disable-precompiled-headers

   - This allocates a number of arrays and checks that we obtain the number of samples we want, with an accepted error of 5%. I tested it 100 times and it passed every time; I can test more if wanted
   - Not in the test are the actual numbers I got for the various array sizes, I ran the program 30 times and parsed the output; here are the averages and standard deviation:
      1000:     1.28% average; 1.13% standard deviation
      10000:    1.59% average; 1.25% standard deviation
      100000:   1.26% average; 1.26% standard deviation

What this means is that we were always at about 1~2% of the number of samples the test expected.

Let me know what you think,
Jc
   
 

On Wed, Jul 5, 2017 at 9:31 PM, JC Beyler <[hidden email]> wrote:
Hi all,

I apologize, I have not yet handled your remarks but thought this new webrev would also be useful to see and comment on perhaps.

Here is the latest webrev, it is generated slightly different than the others since now I'm using webrev.ksh without the -N option:

And the webrev.07 to webrev.08 diff is here:

(Let me know if it works well)

It's a small change between versions but it:
  - provides a fix that makes the average sample rate correct (more on that below).
  - fixes the code to actually have it play nicely with the fast tlab refill
  - cleaned up a bit the JVMTI text and now use jvmtiFrameInfo
- moved the capability to be onload solo

With this webrev, I've done a small study of the random number generator we use here for the sampling rate. I took a small program and it can be simplified to:

for (int i = 0; i < outerLoop; i++) {
  for (int j = 0; j < innerLoop; j++) {
    int[] tmp = new int[arraySize];
  }
}

- I've fixed the outer and inner loops at 800 for this experiment, meaning we allocate 640000 times an array of a given array size.

- Each program provides the average sample size used for the whole execution

- Then, I ran each variation 30 times and calculated the average of the average sample size used for various array sizes. I selected the array size to be one of the following: 1, 10, 100, 1000, 10000.

- When compared to 512kb, the average sample size of 30 runs:
1: 4.62% of error
10: 3.09% of error
100: 0.36% of error
1000: 0.1% of error
10000: 0.03% of error 

What it shows is that, depending on the number of samples, the average does become better. This is because with an allocation of 1 element per array, it will take longer to hit one of the thresholds. This is seen by looking at the sample count statistic I put in. For the same number of iterations (800 * 800), the different array sizes provoke:
1: 62 samples
10: 125 samples
100: 788 samples
1000: 6166 samples
10000: 57721 samples

And of course, the more samples you have, the more sampling intervals you draw, which means that your average gets closer to the configured mean.
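This is just the law of large numbers: the relative error of the observed mean shrinks as 1/sqrt(n). A quick illustration against the sample counts above (ignoring the additional averaging over 30 runs, so the absolute numbers will not match the measured errors exactly):

```java
// Sketch: expected 1/sqrt(n) scaling of the relative error of the mean
// for the sample counts reported for array sizes 1 ... 10000.
public class SamplingErrorSketch {
    public static void main(String[] args) {
        long[] counts = {62, 125, 788, 6166, 57721};
        for (long n : counts) {
            // For exponential intervals, stddev equals the mean, so the
            // standard error of the mean is mean/sqrt(n).
            double relErrPercent = 100.0 / Math.sqrt(n);
            System.out.printf("%6d samples -> ~%.2f%% expected error%n",
                              n, relErrPercent);
        }
    }
}
```

The trend matches the measurements: the 1-element arrays (62 samples) show several percent of error, while the 10000-element arrays (57721 samples) are well under one percent.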

Thanks,
Jc

On Thu, Jun 29, 2017 at 10:01 PM, JC Beyler <[hidden email]> wrote:
Thanks Robbin,

This seems to have worked. When I have the next webrev ready, we will find out but I'm fairly confident it will work!

Thanks again!
Jc

On Wed, Jun 28, 2017 at 11:46 PM, Robbin Ehn <[hidden email]> wrote:
Hi JC,

On 06/29/2017 12:15 AM, JC Beyler wrote:
B) Incremental changes

I guess the most common workflow here is using mq:
hg qnew fix_v1
edit files
hg qrefresh
hg qnew fix_v2
edit files
hg qrefresh

if you do hg log you will see 2 commits

webrev.ksh -r -2 -o my_inc_v1_v2
webrev.ksh -o my_full_v2


In your .hgrc you might need:
[extensions]
mq =

/Robbin


Again another newbiew question here...

For showing the incremental changes, is there a link that explains how to do that? I apologize for my newbie questions all the time :)

Right now, I do:

  ksh ../webrev.ksh -m -N

That generates a webrev.zip and send it to Chuck Rasbold. He then uploads it to a new webrev.

I tried committing my change and adding a small change. Then, if I just do ksh ../webrev.ksh without any options, it seems to produce a similar page but now with only the changes I had (so the 06-07 comparison you were talking about) and a changeset that has it all. I imagine that is what you meant.

Which means that my workflow would become:

1) Make changes
2) Make a webrev without any options to show just the differences with the tip
3) Amend my changes to my local commit so that I have it done with
4) Go to 1

Does that seem correct to you?

Note that when I do this, I only see the full change of a file in the full change set (Side note here: now the page says change set and not patch, which is maybe why Serguei was having issues?).

Thanks!
Jc



On Wed, Jun 28, 2017 at 1:12 AM, Robbin Ehn <[hidden email]> wrote:

    Hi,

    On 06/28/2017 12:04 AM, JC Beyler wrote:

        Dear Thomas et al,

        Here is the newest webrev:
        http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/



    You have some more bits to fix in there, but generally this looks good, and really nice with more tests.
    I'll do a deep dive and re-test this when I get back from my long vacation, with whatever patch version you have then.

    Also I think it's time you provide incremental (v06->07 changes) as well as complete change-sets.

    Thanks, Robbin




        Thomas, I "think" I have answered all your remarks. The summary is:

        - The statistic system is up and provides insight on what the heap sampler is doing
             - I've noticed that, though the sampling rate is at the right mean, we are missing some samples, I have not yet tracked out why (details below)

        - I've run a tiny benchmark that is the worst case: it is a very tight loop that allocates a small array
             - In this case, I see no overhead when the system is off so that is a good start :)
             - I see right now a high overhead in this case when sampling is on. This is not really too surprising, but I'm going to see if this is consistent with our
        internal implementation. The benchmark is really allocation stressful, so I'm not too surprised, but I want to do the due diligence.

           - The statistic system up is up and I have a new test
        http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatTest.java.patch
              - I did a bit of a study about the random generator here, more details are below but basically it seems to work well

           - I added a capability but since this is the first time doing this, I was not sure I did it right
             - I did add a test though for it and the test seems to do what I expect (all methods are failing with the JVMTI_ERROR_MUST_POSSESS_CAPABILITY error).
                 - http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorNoCapabilityTest.java.patch

           - I still need to figure out what to do about the multi-agent vs single-agent issue

           - As far as measurements, it seems I still need to look at:
             - Why we do the 20 random calls first, are they necessary?
             - Look at the mean of the sampling rate that the random generator does and also what is actually sampled
             - What is the overhead in terms of memory/performance when on?

        I have inlined my answers, I think I got them all in the new webrev, let me know your thoughts.

        Thanks again!
        Jc


        On Fri, Jun 23, 2017 at 3:52 AM, Thomas Schatzl <[hidden email]> wrote:

             Hi,

             On Wed, 2017-06-21 at 13:45 -0700, JC Beyler wrote:
             > Hi all,
             >
             > First off: Thanks again to Robbin and Thomas for their reviews :)
             >
             > Next, I've uploaded a new webrev:
             > http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/

             >
             > Here is an update:
             >
             > - @Robbin, I forgot to say that yes I need to look at implementing
             > this for the other architectures and testing it before it is all
             > ready to go. Is it common to have it working on all possible
             > combinations or is there a subset that I should be doing first and we
             > can do the others later?
             > - I've tested slowdebug, built and ran the JTreg tests I wrote with
             > slowdebug and fixed a few more issues
             > - I've refactored a bit of the code following Thomas' comments
             >    - I think I've handled all the comments from Thomas (I put
             > comments inline below for the specifics)

             Thanks for handling all those.

             > - Following Thomas' comments on statistics, I want to add some
             > quality assurance tests and find that the easiest way would be to
             > have a few counters of what is happening in the sampler and expose
             > that to the user.
             >    - I'll be adding that in the next version if no one sees any
             > objections to that.
             >    - This will allow me to add a sanity test in JTreg about number of
             > samples and average of sampling rate
             >
             > @Thomas: I had a few questions that I inlined below but I will
             > summarize the "bigger ones" here:
             >    - You mentioned constants are not using the right conventions, I
             > looked around and didn't see any convention except normal naming then
             > for static constants. Is that right?

             I looked through https://wiki.openjdk.java.net/display/HotSpot/StyleGuide
             and the rule is to "follow an existing pattern and must have a
             distinct appearance from other names". Which does not help a lot I
             guess :/ The GC team started using upper camel case, e.g.
             SomeOtherConstant, but very likely this is not applied
             consistently throughout. So not adding yet another style
             (like kMaxStackDepth, with the "k" in front having some unknown
             meaning) is fine with me.

             (Chances are you will find that style somewhere used anyway too,
             apologies if so :/)


        Thanks for that link, now I know where to look. I used the upper camel case in my code as well then :) I should have gotten them all.


              > PS: I've also inlined my answers to Thomas below:
              >
              > On Tue, Jun 13, 2017 at 8:03 AM, Thomas Schatzl <thomas.schatzl@oracle.com> wrote:
              > > Hi all,
              > >
              > > On Mon, 2017-06-12 at 11:11 -0700, JC Beyler wrote:
              > > > Dear all,
              > > >
              > > > I've continued working on this and have done the following
              > > webrev:
              > > > http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/

              > >
              > > [...]
              > > > Things I still need to do:
              > > >    - Have to fix that TLAB case for the FastTLABRefill
              > > >    - Have to start looking at the data to see that it is
              > > consistent and does gather the right samples, right frequency, etc.
              > > >    - Have to check the GC elements and what that produces
              > > >    - Run a slowdebug run and ensure I fixed all those issues you
              > > saw > Robbin
              > > >
              > > > Thanks for looking at the webrev and have a great week!
              > >
              > >   scratching a bit on the surface of this change, so apologies for
              > > rather shallow comments:
              > >
              > > - macroAssembler_x86.cpp:5604: while this is compiler code, and I
              > > am not sure this is final, please avoid littering the code with
              > > TODO remarks :) They tend to be candidates for later wtf moments
              > > only.
              > >
              > > Just file a CR for that.
              > >
              > Newcomer question: what is a CR and not sure I have the rights to do
              > that yet ? :)

             Apologies. CR is a change request, this suggests to file a bug in the
             bug tracker. And you are right, you can't just create a new account in
             the OpenJDK JIRA yourselves. :(


        Ok good to know, I'll continue with my own todo list but I'll work hard on not letting it slip in the webrevs anymore :)


             I was mostly referring to the "... but it is a TODO" part of that
             comment in macroassembler_x86.cpp. Comments about the why of the code
             are appreciated.

             [Note that I now understand that this is to some degree still work in
             progress. As long as the final changeset does not contain TODOs I am
             fine (and it's not a hard objection, rather their use in "final" code
             is typically limited in my experience)]

             5603   // Currently, if this happens, just set back the actual end to
             where it was.
             5604   // We miss a chance to sample here.

             Would be okay, if explaining "this" and the "why" of missing a chance
             to sample here would be best.

             Like maybe:

             // If we needed to refill TLABs, just set the actual end point to
             // the end of the TLAB again. We do not sample here although we could.

        Done with your comment, it works well in my mind.

             I am not sure whether "miss a chance to sample" meant "we could, but
             consciously don't because it's not that useful" or "it would be
             necessary but don't because it's too complicated to do.".

             Looking at the original comment once more, I am also not sure if that
             comment shouldn't be referring to the "end" variable (not actual_end)
             because that's the variable that is responsible for taking the sampling
             path? (Going from the member description of ThreadLocalAllocBuffer).


        I've moved this code and it no longer shows up here but the rationale and answer was:

        So.. Yes, end is the variable provoking the sampling. Actual end is the actual end of the TLAB.

        What was happening here is that the code was resetting _end to point to the end of the new TLAB. Because we now have _end for sampling and _actual_end for
        the actual end, we need to update _actual_end as well.

        Normally, were we to do the real work here, we would calculate the (end - start) offset, then do:

        - Set the new end to: start + (old_end - old_start)
        - Set the actual end like we do here now, because it is the actual end.

        Why is this not done here now anymore?
            - I was still debating which path to take:
               - Do it in the fast refill code, it has its perks:
                   - In a world where fast refills are happening all the time or a lot, we can augment there the code to do the sampling
               - Remember what we had as an end before leaving the slowpath and check on return
                   - This is what I'm doing now, it removes the need to go fix up all fast refill paths but if you remain in fast refill paths, you won't get sampling. I
        have to think of the consequences of that, maybe a future change later on?
                      - I have the statistics now so I'm going to study that
                         -> By the way, though my statistics are showing I'm missing some samples, if I turn off FastTlabRefill, it is the same loss so for now, it seems
        this does not occur in my simple test.



             But maybe I am only confused and it's best to just leave the comment
             away. :)

             Thinking about it some more, doesn't this not-sampling in this case
             mean that sampling does not work in any collector that does inline TLAB
             allocation at the moment? (Or is inline TLAB alloc automatically
             disabled with sampling somehow?)

             That would indeed be a bigger TODO then :)


        Agreed, this remark made me think that perhaps as a first step the new way of doing it is better but I did have to:
           - Remove the const of the ThreadLocalBuffer remaining and hard_end methods
           - Move hard_end out of the header file to have a bit more logic there

        Please let me know what you think of that and if you prefer it this way or changing the fast refills. (I prefer this way now because it is more incremental).


             > > - calling HeapMonitoring::do_weak_oops() (which should probably be
             > > called weak_oops_do() like other similar methods) only if string
             > > deduplication is enabled (in g1CollectedHeap.cpp:4511) seems wrong.
             >
             > The call should be at least around 6 lines up outside the if.
             >
             > Preferentially in a method like process_weak_jni_handles(), including
             > additional logging. (No new (G1) gc phase without minimal logging
             > :)).
             > Done but really not sure because:
             >
             > I put for logging:
             >   log_develop_trace(gc, freelist)("G1ConcRegionFreeing [other] : heap
             > monitoring");

             I would think that "gc, ref" would be more appropriate log tags for
             this similar to jni handles.
             (I am also not sure what weak reference handling has to do with
             G1ConcRegionFreeing, so I am a bit puzzled)


        I was not sure what to put for the tags or really as the message. I cleaned it up a bit now to:
             log_develop_trace(gc, ref)("HeapSampling [other] : heap monitoring processing");



             > Since weak_jni_handles didn't have logging for me to be inspired
             > from, I did that but unconvinced this is what should be done.

             The JNI handle processing does have logging, but only in
             ReferenceProcessor::process_discovered_references(). In
             process_weak_jni_handles() only overall time is measured (in a G1
             specific way, since only G1 supports disabling reference processing) :/

             The code in ReferenceProcessor prints both time taken
             referenceProcessor.cpp:254, as well as the count, but strangely only in
             debug VMs.

             I have no idea why this logging is that unimportant to only print that
             in a debug VM. However there are reviews out for changing this area a
             bit, so it might be useful to wait for that (JDK-8173335).


        I cleaned it up a bit anyway and now it returns the count of objects that are in the system.


             > > - the change doubles the size of
             > > CollectedHeap::allocate_from_tlab_slow() above the "small and nice"
             > > threshold. Maybe it could be refactored a bit.
             > Done I think, it looks better to me :).

             In ThreadLocalAllocBuffer::handle_sample() I think the
             set_back_actual_end()/pick_next_sample() calls could be hoisted out of
             the "if" :)


        Done!


             > > - referenceProcessor.cpp:261: the change should add logging about
             > > the number of references encountered, maybe after the corresponding
             > > "JNI weak reference count" log message.
             > Just to double check, are you saying that you'd like to have the heap
             > sampler to keep in store how many sampled objects were encountered in
             > the HeapMonitoring::weak_oops_do?
             >    - Would a return of the method with the number of handled
             > references and logging that work?

             Yes, it's fine if HeapMonitoring::weak_oops_do() only returned the
             number of processed weak oops.


        Done also (but I admit I have not tested the output yet) :)


             >    - Additionally, would you prefer it in a separate block with its
             > GCTraceTime?

             Yes. Both kinds of information is interesting: while the time taken is
             typically more important, the next question would be why, and the
             number of references typically goes a long way there.

             See above though, it is probably best to wait a bit.


        Agreed that I "could" wait but, if it's ok, I'll just refactor/remove this when we get closer to something final. Either JDK-8173335
        will have gone in and I will notice it, or it soon will and I can change it then.


             > > - threadLocalAllocBuffer.cpp:331: one more "TODO"
             > Removed it and added it to my personal todos to look at.
             >      > >
             > > - threadLocalAllocBuffer.hpp: ThreadLocalAllocBuffer class
             > > documentation should be updated about the sampling additions. I
             > > would have no clue what the difference between "actual_end" and
             > > "end" would be from the given information.
             > If you are talking about the comments in this file, I made them more
             > clear I hope in the new webrev. If it was somewhere else, let me know
             > where to change.

             Thanks, that's much better. Maybe a note in the comment of the class
             that ThreadLocalBuffer provides some sampling facility by modifying the
             end() of the TLAB to cause "frequent" calls into the runtime call where
             actual sampling takes place.


        Done, I think it's better now. Added something about the slow_path_end as well.


             > > - in heapMonitoring.hpp: there are some random comments about some
             > > code that has been grabbed from "util/math/fastmath.[h|cc]". I
             > > can't tell whether this is code that can be used but I assume that
             > > Noam Shazeer is okay with that (i.e. that's all Google code).
             > Jeremy and I double checked and we can release that as I thought. I
             > removed the comment from that piece of code entirely.

             Thanks.

             > > - heapMonitoring.hpp/cpp static constant naming does not correspond
             > > to Hotspot's. Additionally, in Hotspot static methods are cased
             > > like other methods.
             > I think I fixed the methods to be cased the same way as all other
             > methods. For static constants, I was not sure. I fixed a few other
             > variables but I could not seem to really see a consistent trend for
             > constants. I made them as variables but I'm not sure now.

             Sorry again, style is a kind of mess. The goal of my suggestions here
             is only to prevent yet another style creeping in.

             > > - in heapMonitoring.cpp there are a few cryptic comments at the top
             > > that seem to refer to internal stuff that should probably be
             > > removed.
             > Sorry about that! My personal todos not cleared out.

             I am happy about comments, but I simply did not understand any of that
             and I do not know about other readers as well.

             If you think you will remember removing/updating them until the review
             proper, that is fine (I misunderstood the review situation a little, it seems).

             > > I did not think through the impact of the TLAB changes on collector
             > > behavior yet (if there are). Also I did not check for problems with
             > > concurrent mark and SATB/G1 (if there are).
             > I would love to know your thoughts on this, I think this is fine. I

             I think so too now. No objects are made live out of thin air :)

             > see issues with multiple threads right now hitting the stack storage
             > instance. Previous webrevs had a mutex lock here but we took it out
             > for simplicity (and only for now).

             :) When looking at this after some thinking I now assume for this
             review that this code is not MT safe at all. There seems to be more
             synchronization missing than just the one for the StackTraceStorage. So
             no comments about this here.


        I double-checked a bit (quickly, I admit) but it seems that synchronization in StackTraceStorage is really all you need (all methods lead to a StackTraceStorage one
        and can be multithreaded outside of that).
        There is a question about the initialization where the method HeapMonitoring::initialize_profiling is not thread safe.
        It would work (famous last words) and not crash if there was a race but we could add a synchronization point there as well (and therefore on the stop as well).

        But anyway I will really check and do this once we add back synchronization.


             Also, this would require some kind of specification of what is allowed
             to be called when and where.


        Would we specify this with the methods in the jvmti.xml file? We could start by specifying in each that they are not thread safe but I saw no mention of that for
        other methods.


             One potentially relevant observation about locking here: depending on
             sampling frequency, StackTraceStore::add_trace() may be rather
             frequently called. I assume that you are going to do measurements :)


        Though we don't have the TLAB implementation in our code, the compiler-generated sampler incurs 2% overhead with a 512k sampling rate. I can do real measurements
        when the code settles and we can see how costly this is as a TLAB implementation.
        However, my theory is that if the rate is 512k, the memory/performance overhead should be minimal since that is what we saw with our code/workloads (though not called
        the same way, we call it essentially at the same rate).
        If you have a benchmark you'd like me to test, let me know!

        Right now, with my really small test, this does use a bit of overhead even for a 512k sample size. I don't know yet why, I'm going to see what is going on.

        Finally, I think it is not reasonable to suppose the overhead to be negligible if the sampling rate used is too low. The user should know that the lower the rate,
        the higher the overhead (documentation TODO?).


             I am not sure what the expected usage of the API is, but
             StackTraceStore::add_trace() seems to be able to grow without bounds.
             Only a GC truncates them to the live ones. That in itself seems to be
             problematic (GCs can be *wide* apart), and of course some of the API
             methods add to that because they duplicate that unbounded array. Do you
             have any concerns/measurements about this?


        So, the theory is that yes, add_trace can grow without bounds, but it grows at one sample per 512k of allocated space. The stacks it gathers are currently
        maxed at 64 (I'd like to expand that into an option for the user at some point though). So I have no concerns because:

        - If really this is taking a lot of space, that means the job is keeping a lot of objects in memory as well, therefore the entire heap is getting huge
        - If this is the case, you will be triggering a GC at some point anyway.

        (I'm putting under the rug the issue of "What if we set the rate to 1 for example" because as you lower the sampling rate, we cannot guarantee low overhead; the
        idea behind this feature is to have a means of having meaningful allocated samples at a low overhead)

        I have no measurements really right now but since I now have some statistics I can poll, I will look a bit more at this question.

        I have the same last sentence as above: the user should expect this to happen if the sampling rate is too small. That probably can be reflected in the
        StartHeapSampling as a note: careful, this might impact your performance.


             Also, these stack traces might hold on to huge arrays. Any
             consideration of that? Particularly it might be the cause for OOMEs in
             tight memory situations.


        There is a stack size maximum that is set to 64 so it should not hold huge arrays. I don't think this is an issue but I can double check with a test or two.


             - please consider adding a safepoint check in
             HeapMonitoring::weak_oops_do to prevent accidental misuse.

             - in struct StackTraceStorage, the public fields may also need
             underscores. At least some files in the runtime directory have structs
             with underscored public members (and some don't). The runtime team
             should probably comment on that.


        Agreed, I did not know. I looked around and a lot of structs did not seem to have them, so I left it as is. I will happily change it if someone prefers (I was not
        sure if you really preferred it or not; your sentence seemed to be more a note of "this might need to change but I don't know if the runtime team enforces that", let
        me know if I read that wrongly).


             - In StackTraceStorage::weak_oops_do(), when examining the
             StackTraceData, maybe it is useful to consider having a non-NULL
             reference outside of the heap's reserved space an error. There should
             be no oop outside of the heap's reserved space ever.

             Unless you allow storing random values in StackTraceData::obj, which I
             would not encourage.


        I suppose you are talking about this part:
        if ((value != NULL && Universe::heap()->is_in_reserved(value)) &&
                    (is_alive == NULL || is_alive->do_object_b(value))) {

        What you are saying is that I could have something like:
        if (value != my_non_null_reference &&
                    (is_alive == NULL || is_alive->do_object_b(value))) {

        Is that what you meant? Is there really a reason to do so? When I look at the code, is_in_reserved seems like an O(1) method call. I'm not even sure we can have a
        NULL value, to be honest. I might have to study that to see if this was not a paranoid test to begin with.

        The is_alive code has now morphed due to the comment below.



             - HeapMonitoring::weak_oops_do() does not seem to use the
             passed AbstractRefProcTaskExecutor.


        It did use it:
           size_t HeapMonitoring::weak_oops_do(
              AbstractRefProcTaskExecutor *task_executor,
              BoolObjectClosure* is_alive,
              OopClosure *f,
              VoidClosure *complete_gc) {
            assert(SafepointSynchronize::is_at_safepoint(), "must be at safepoint");

            if (task_executor != NULL) {
              task_executor->set_single_threaded_mode();
            }
            return StackTraceStorage::storage()->weak_oops_do(is_alive, f, complete_gc);
        }

        But due to the comment below, I refactored this, so this is no longer here. Now I have an always true closure that is passed.


             - I do not understand allowing to call this method with a NULL
             complete_gc closure. This would mean that objects referenced from the
             object that is referenced by the StackTraceData are not pulled, meaning
             they would get stale.

             - same with is_alive parameter value of NULL


        So these questions made me look a bit closer at this code. This code I think was written this way to have a very small impact on the file but you are right, there
        is no reason for this here. I've simplified the code by making in referenceProcessor.cpp a process_HeapSampling method that handles everything there.

        The code allowed NULLs because it depended on where you were coming from and how the code was being called.

        - I added a static always_true variable and pass that now to be more consistent with the rest of the code.
        - I moved the complete_gc into process_phaseHeapSampling now (new method) and handle the task_executor and the complete_gc there
             - Newbie question: in our code we did a set_single_threaded_mode but I see that process_phaseJNI does it right before its call, do I need to do it for the
        process_phaseHeapSample?
        That API is much cleaner (in my mind) and is consistent with what is done around it (again in my mind).


             - heapMonitoring.cpp:590: I do not completely understand the purpose of
             this code: in the end this results in a fixed value directly dependent
             on the Thread address anyway?
             IOW, what is special about exactly 20 rounds?


        So we really want a fast random number generator that has a specific mean (512k is the default we use). The code uses the thread address as the start number of the
        sequence (the rationale being: why not, it is random enough). Then instead of just starting there, we prime the sequence and really only start at the 21st number; this
        is arbitrary and I have not done a study to see if we could do more or less of that.

        As I have the statistics of the system up and running, I'll run some experiments to see if this is needed, is 20 good, or not.


             - also I would consider stripping a few bits of the threads' address as
             initialization value for your rng. The last three bits (and probably
             more, check whether the Thread object is allocated on special
             boundaries) are always zero for them.
             Not sure if the given "random" value is random enough before/after,
             this method, so just skip that comment if you think this is not
             required.


        I don't know is the honest answer. I think what is important is that we tend towards a mean and it is random "enough" to not fall into the pitfall of only sampling a
        subset of objects due to their allocation order. I added that as a test to do, to see if it changes the mean in any way for the 512k default value and/or if the first
        1000 elements look better.


             Some more random nits I did not find a place to put anywhere:

             - ThreadLocalAllocBuffer::_extra_space does not seem to be used
             anywhere?


        Good catch :).


             - Maybe indent the declaration of ThreadLocalAllocBuffer::_bytes_until_sample to align below the other members of that group.


        Done, moved it up a bit to have non-static members together and static ones separate.

             Thanks,
                Thomas


        Thanks for your review!
        Jc









Re: Low-Overhead Heap Profiling

Robbin Ehn
Hi JC,

I saw a webrev.12 in the directory, with only test changes (11->12), so I took that version.
I had a look and tested the tests, worked fine!

First glance at the code (looking at full v12) some minor things below, mostly unused stuff.

Thanks, Robbin

diff -r 9047e0d726d6 src/hotspot/share/runtime/heapMonitoring.cpp
--- a/src/hotspot/share/runtime/heapMonitoring.cpp Mon Oct 16 16:54:06 2017 +0200
+++ b/src/hotspot/share/runtime/heapMonitoring.cpp Mon Oct 16 17:42:42 2017 +0200
@@ -211,2 +211,3 @@
    void initialize(int max_storage) {
+    // validate max_storage to sane value ? What would 0 mean ?
      MutexLocker mu(HeapMonitor_lock);
@@ -227,8 +228,4 @@
    bool initialized() { return _initialized; }
-  volatile bool *initialized_address() { return &_initialized; }

   private:
-  // Protects the traces currently sampled (below).
-  volatile intptr_t _stack_storage_lock[1];
-
    // The traces currently sampled.
@@ -313,3 +310,2 @@
    _initialized(false) {
-    _stack_storage_lock[0] = 0;
  }
@@ -532,13 +528,2 @@

-// Delegate the initialization question to the underlying storage system.
-bool HeapMonitoring::initialized() {
-  return StackTraceStorage::storage()->initialized();
-}
-
-// Delegate the initialization question to the underlying storage system.
-bool *HeapMonitoring::initialized_address() {
-  return
-      const_cast<bool*>(StackTraceStorage::storage()->initialized_address());
-}
-
  void HeapMonitoring::get_live_traces(jvmtiStackTraces *traces) {
diff -r 9047e0d726d6 src/hotspot/share/runtime/heapMonitoring.hpp
--- a/src/hotspot/share/runtime/heapMonitoring.hpp Mon Oct 16 16:54:06 2017 +0200
+++ b/src/hotspot/share/runtime/heapMonitoring.hpp Mon Oct 16 17:42:42 2017 +0200
@@ -35,3 +35,2 @@
    static uint64_t _rnd;
-  static bool _initialized;
    static jint _monitoring_rate;
@@ -92,7 +91,2 @@

-  // Is the profiler initialized and where is the address to the initialized
-  // boolean.
-  static bool initialized();
-  static bool *initialized_address();
-
    // Called when o is to be sampled from a given thread and a given size.



On 10/10/2017 12:57 AM, JC Beyler wrote:

> Dear all,
>
> Thread-safety is back!! Here is the update webrev:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/
>
> Full webrev is here:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.11/
>
> In order to really test this, I needed to add it, so I thought now was a good time. It required a few changes to the creation code to ensure correctness and safety. We now
> keep the static pointer but clear the data internally, so re-initialization will be a bit more costly than before. I don't think this is a common use case, so I did not
> consider it a problem. I used the internal MutexLocker; I think I used it correctly, let me know.
>
> I also added three tests:
>
> 1) Stack depth test:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStackDepthTest.java.patch
>
> This test shows that the maximum stack depth system is working.
>
> 2) Thread safety:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorThreadTest.java.patch
>
> The test creates 24 threads that all allocate at the same time. The test then checks that it finds samples from all the threads.
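A minimal sketch of the shape of such a multi-threaded allocation test (hypothetical code, not the actual jtreg test; the real test calls into a JVMTI agent to fetch the samples):

```java
import java.util.ArrayList;
import java.util.List;

public class ThreadedAllocDemo {
    static final int THREADS = 24;
    static final int ITERATIONS = 10_000;

    public static void main(String[] args) throws InterruptedException {
        List<Thread> threads = new ArrayList<>();
        // All 24 threads allocate concurrently, stressing the sampler's shared storage.
        for (int i = 0; i < THREADS; i++) {
            threads.add(new Thread(() -> {
                for (int j = 0; j < ITERATIONS; j++) {
                    int[] tmp = new int[64];   // allocation the sampler may pick up
                    if (tmp.length != 64) throw new AssertionError();
                }
            }));
        }
        for (Thread t : threads) t.start();
        for (Thread t : threads) t.join();
        // The real test would now ask the agent whether it saw samples
        // from every one of the 24 threads.
        System.out.println("done");
    }
}
```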
>
> 3) Thread on/off safety
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorThreadOnOffTest.java.patch
>
> The test creates 24 threads that all allocate a bunch of memory. Then another thread turns the sampling on/off.
>
> Btw, both tests 2 & 3 failed without the locks.
>
> As I worked on this, I saw a lot of places where the tests do very similar things; I'm going to clean up the code a bit and make a HeapAllocator class that all tests
> can call directly. This will greatly simplify the code.
>
> Thanks for any comments/criticisms!
> Jc
>
>
> On Mon, Oct 2, 2017 at 8:52 PM, JC Beyler <[hidden email]> wrote:
>
>     Dear all,
>
>     Small update to the webrev:
>     http://cr.openjdk.java.net/~rasbold/8171119/webrev.09_10/
>
>     Full webrev is here:
>     http://cr.openjdk.java.net/~rasbold/8171119/webrev.10/
>
>     I updated a bit of the naming, removed a TODO comment, and added a test for the sampling rate. I also updated the maximum stack depth to 1024; there is no
>     reason to keep it so small. I did a micro benchmark that tests the overhead and it seems about the same.
>
>     I compared allocations from a stack depth of 10 and allocations from a stack depth of 1024 (allocations are from the same helper method in
>     http://cr.openjdk.java.net/~rasbold/8171119/webrev.10/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatRateTest.java):
>                - For an array of 1 integer allocated in a loop; stack depth 1024 vs stack depth 10: 1% slower
>                - For an array of 200k integers allocated in a loop; stack depth 1024 vs stack depth 10: 3% slower
>
>     So the maximum stack depth is now 1024, but we only copy over the stack frames actually used.
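Copying only the frames actually used, rather than the full 1024-deep buffer, can be sketched like this (illustrative only; the real code is C++ inside the JVM, and the names here are made up):

```java
public class FrameCopyDemo {
    static final int MAX_DEPTH = 1024;

    // Copy only the 'depth' frames actually filled in, not all MAX_DEPTH slots.
    static long[] copyUsedFrames(long[] buffer, int depth) {
        long[] out = new long[depth];
        System.arraycopy(buffer, 0, out, 0, depth);
        return out;
    }

    public static void main(String[] args) {
        long[] buffer = new long[MAX_DEPTH];
        for (int i = 0; i < 10; i++) buffer[i] = i + 1;  // pretend 10 frames were used
        long[] copy = copyUsedFrames(buffer, 10);
        System.out.println(copy.length);     // 10, not 1024
        System.out.println(copy[9]);         // last used frame survived the copy
    }
}
```

This is why raising the maximum depth to 1024 costs little: the per-sample work scales with the actual depth, not the maximum.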
>
>     For the next webrev, I will be adding a stack depth test to show that it works and probably put back the mutex locking so that we can see how difficult it is to keep
>     thread safe.
>
>     Let me know what you think!
>     Jc
>
>
>
>     On Mon, Sep 25, 2017 at 3:02 PM, JC Beyler <[hidden email]> wrote:
>
>         Forgot to say that for my numbers:
>           - Not in the test are the actual numbers I got for the various array sizes, I ran the program 30 times and parsed the output; here are the averages and standard
>         deviation:
>                1000:     1.28% average; 1.13% standard deviation
>                10000:    1.59% average; 1.25% standard deviation
>                100000:   1.26% average; 1.26% standard deviation
>
>         The 1000/10000/100000 are the sizes of the arrays being allocated. These are allocated 100k times and the sampling rate is 111 times the size of the array.
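The averages and standard deviations quoted above boil down to the usual bookkeeping over the 30 parsed runs; a small sketch of that arithmetic (with made-up error values, not the real data):

```java
public class StatsDemo {
    public static void main(String[] args) {
        // Hypothetical per-run error percentages; the real data came from 30 runs.
        double[] errors = {1.0, 2.0, 3.0, 4.0};
        double sum = 0;
        for (double e : errors) sum += e;
        double mean = sum / errors.length;
        double sq = 0;
        for (double e : errors) sq += (e - mean) * (e - mean);
        // Population standard deviation over the runs.
        double stddev = Math.sqrt(sq / errors.length);
        System.out.println(mean);
        System.out.println(stddev);
    }
}
```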
>
>         Thanks!
>         Jc
>
>
>         On Mon, Sep 25, 2017 at 3:01 PM, JC Beyler <[hidden email]> wrote:
>
>             Hi all,
>
>             After a bit of a break, I am back working on this :). As before, here are two webrevs:
>
>             - Full change set: http://cr.openjdk.java.net/~rasbold/8171119/webrev.09/
>             - Compared to version 8: http://cr.openjdk.java.net/~rasbold/8171119/webrev.08_09/
>                  (This version is compared to version 8 I last showed but ported to the new folder hierarchy)
>
>             In this version I have:
>                - Handled Thomas' comments from his email of 07/03:
>                     - Merged the logging to be standard
>                     - Fixed up the code a bit where asked
>                     - Added some notes about the code not being thread-safe yet
>                 - Removed additional dead code from the version that modifies interpreter/c1/c2
>                 - Fixed compiler issues so that it compiles with --disable-precompiled-headers
>                      - Tested with ./configure --with-boot-jdk=<jdk8> --with-debug-level=slowdebug --disable-precompiled-headers
>
>             Additionally, I added a test to check the sanity of the sampler: HeapMonitorStatCorrectnessTest
>             (http://cr.openjdk.java.net/~rasbold/8171119/webrev.08_09/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatCorrectnessTest.java.patch)
>                 - This allocates a number of arrays and checks that we obtain the number of samples we want, with an accepted error of 5%. I tested it 100 times and it
>             passed every time; I can test more if wanted
>                 - Not in the test are the actual numbers I got for the various array sizes, I ran the program 30 times and parsed the output; here are the averages and
>             standard deviation:
>                    1000:     1.28% average; 1.13% standard deviation
>                    10000:    1.59% average; 1.25% standard deviation
>                    100000:   1.26% average; 1.26% standard deviation
>
>             What this means is that we were always within about 1-2% of the number of samples the test expected.
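The 5% acceptance check described above reduces to comparing the observed sample count against bytesAllocated / samplingInterval; a sketch of that comparison (names and numbers are hypothetical, not from the actual test):

```java
public class ErrorCheckDemo {
    // Returns true if 'observed' is within 'tolerance' (e.g. 0.05) of 'expected'.
    static boolean withinTolerance(double observed, double expected, double tolerance) {
        return Math.abs(observed - expected) / expected <= tolerance;
    }

    public static void main(String[] args) {
        long bytesAllocated = 100_000L * 4_000;   // e.g. 100k arrays of 1000 ints
        long samplingInterval = 512 * 1024;       // 512KB mean sampling interval
        double expectedSamples = (double) bytesAllocated / samplingInterval;
        // 2% off passes the 5% gate; 10% off fails it.
        System.out.println(withinTolerance(expectedSamples * 1.02, expectedSamples, 0.05));
        System.out.println(withinTolerance(expectedSamples * 1.10, expectedSamples, 0.05));
    }
}
```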
>
>             Let me know what you think,
>             Jc
>
>             On Wed, Jul 5, 2017 at 9:31 PM, JC Beyler <[hidden email]> wrote:
>
>                 Hi all,
>
>                 I apologize, I have not yet handled your remarks but thought this new webrev would also be useful to see and comment on perhaps.
>
>                 Here is the latest webrev, it is generated slightly different than the others since now I'm using webrev.ksh without the -N option:
>                 http://cr.openjdk.java.net/~rasbold/8171119/webrev.08/
>
>                 And the webrev.07 to webrev.08 diff is here:
>                 http://cr.openjdk.java.net/~rasbold/8171119/webrev.07_08/
>
>                 (Let me know if it works well)
>
>                 It's a small change between versions but it:
>                    - provides a fix that makes the average sample rate correct (more on that below).
>                    - fixes the code to actually have it play nicely with the fast tlab refill
>                    - cleaned up a bit the JVMTI text and now use jvmtiFrameInfo
>                    - moved the capability to be onload solo
>
>                 With this webrev, I've done a small study of the random number generator we use here for the sampling rate. I took a small program and it can be simplified to:
>
>                 for (outer loop)
>                 for (inner loop)
>                 int[] tmp = new int[arraySize];
>
>                 - I've fixed the outer and inner loops to 800 for this experiment, meaning we allocate an array of a given size 640,000 times.
>
>                 - Each program provides the average sample size used for the whole execution
>
>                 - Then, I ran each variation 30 times and calculated the average of the average sample size used for various array sizes. I selected the array size to
>                 be one of the following: 1, 10, 100, 1000, 10000.
>
>                 - Compared to the expected 512KB, the error of the average sample size over 30 runs was:
>                 1: 4.62% of error
>                 10: 3.09% of error
>                 100: 0.36% of error
>                 1000: 0.1% of error
>                 10000: 0.03% of error
>
>                 What it shows is that, depending on the number of samples, the average does become better. This is because with an allocation of 1 element per array, it
>                 will take longer to hit one of the thresholds. This is seen by looking at the sample count statistic I put in. For the same number of iterations (800 *
>                 800), the different array sizes provoke:
>                 1: 62 samples
>                 10: 125 samples
>                 100: 788 samples
>                 1000: 6166 samples
>                 10000: 57721 samples
>
>                 And of course, the more samples you have, the more sample rates you pick, which means that your average gets closer using that math.
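The convergence effect described above can be reproduced with a toy sampler that draws each next-sample interval from an exponential distribution with a 512KB mean (the same idea as the rate picker being studied, but a simplified stand-in, not the JVM code):

```java
import java.util.Random;

public class RateSimDemo {
    // Draw an exponentially distributed interval with the given mean,
    // mimicking the idea behind a randomized next-sample picker.
    static long nextInterval(Random rnd, long mean) {
        return (long) (-Math.log(1.0 - rnd.nextDouble()) * mean);
    }

    public static void main(String[] args) {
        final long MEAN = 512 * 1024;             // 512KB mean sampling interval
        Random rnd = new Random(42);              // fixed seed for reproducibility
        long[] arrayBytes = {16, 4_016, 40_016};  // rough byte sizes of small/large arrays
        long prev = -1;
        boolean increasing = true;
        for (long size : arrayBytes) {
            long allocated = 0, samples = 0, next = nextInterval(rnd, MEAN);
            for (int i = 0; i < 640_000; i++) {   // 800 * 800 allocations
                allocated += size;
                while (allocated >= next) {       // a big array may cross several intervals
                    samples++;
                    next += nextInterval(rnd, MEAN);
                }
            }
            double avg = (double) allocated / Math.max(samples, 1);
            System.out.printf("%d bytes: %d samples, %.1f%% error%n",
                size, samples, 100 * Math.abs(avg - MEAN) / MEAN);
            increasing &= samples > prev;
            prev = samples;
        }
        // Larger arrays produce more samples, and the observed average
        // interval converges toward the mean as the sample count grows.
        System.out.println("increasing=" + increasing);
    }
}
```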
>
>                 Thanks,
>                 Jc
>
>                 On Thu, Jun 29, 2017 at 10:01 PM, JC Beyler <[hidden email]> wrote:
>
>                     Thanks Robbin,
>
>                     This seems to have worked. When I have the next webrev ready, we will find out but I'm fairly confident it will work!
>
>                     Thanks again!
>                     Jc
>
>                     On Wed, Jun 28, 2017 at 11:46 PM, Robbin Ehn <[hidden email]> wrote:
>
>                         Hi JC,
>
>                         On 06/29/2017 12:15 AM, JC Beyler wrote:
>
>                             B) Incremental changes
>
>
>                         I guess the most common workflow here is using mq:
>                         hg qnew fix_v1
>                         edit files
>                         hg qrefresh
>                         hg qnew fix_v2
>                         edit files
>                         hg qrefresh
>
>                         if you do hg log you will see 2 commits
>
>                         webrev.ksh -r -2 -o my_inc_v1_v2
>                         webrev.ksh -o my_full_v2
>
>
>                         In  your .hgrc you might need:
>                         [extensions]
>                         mq =
>
>                         /Robbin
>
>
>                             Again, another newbie question here...
>
>                             For showing the incremental changes, is there a link that explains how to do that? I apologize for my newbie questions all the time :)
>
>                             Right now, I do:
>
>                                ksh ../webrev.ksh -m -N
>
>                             That generates a webrev.zip and send it to Chuck Rasbold. He then uploads it to a new webrev.
>
>                             I tried committing my change and adding a small change. Then if I just do ksh ../webrev.ksh without any options, it seems to produce a similar
>                             page but now with only the changes I had (so the 06-07 comparison you were talking about) and a changeset that has it all. I imagine that is
>                             what you meant.
>
>                             Which means that my workflow would become:
>
>                             1) Make changes
>                             2) Make a webrev without any options to show just the differences with the tip
>                             3) Amend my changes to my local commit so that I have it done with
>                             4) Go to 1
>
>                             Does that seem correct to you?
>
>                             Note that when I do this, I only see the full change of a file in the full change set (Side note here: now the page says change set and not
>                             patch, which is maybe why Serguei was having issues?).
>
>                             Thanks!
>                             Jc
>
>
>
>                             On Wed, Jun 28, 2017 at 1:12 AM, Robbin Ehn <[hidden email]> wrote:
>
>                                  Hi,
>
>                                  On 06/28/2017 12:04 AM, JC Beyler wrote:
>
>                                      Dear Thomas et al,
>
>                                      Here is the newest webrev:
>                             http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/
>
>
>
>                                  You have some more bits to fix in there, but generally this looks good, and it's really nice with more tests.
>                                  I'll do a deep dive and re-test this when I get back from my long vacation, with whatever patch version you have then.
>
>                                  Also I think it's time you provide incremental (v06->07 changes) as well as complete change-sets.
>
>                                  Thanks, Robbin
>
>
>
>
>                                      Thomas, I "think" I have answered all your remarks. The summary is:
>
>                                      - The statistic system is up and provides insight on what the heap sampler is doing
>                                           - I've noticed that, though the sampling rate is at the right mean, we are missing some samples; I have not yet tracked down why
>                             (details below)
>
>                                      - I've run a tiny benchmark that is the worst case: a very tight loop that allocates a small array
>                                           - In this case, I see no overhead when the system is off so that is a good start :)
>                                           - I see right now a high overhead in this case when sampling is on. This is not really surprising, but I'm going to see if
>                             this is consistent with our
>                                      internal implementation. The benchmark is really allocation-stressful, so I'm not too surprised, but I want to do the due diligence.
>
>                                         - The statistics system is up and I have a new test
>                             http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatTest.java.patch
>                                            - I did a bit of a study about the random generator here, more details are below but basically it seems to work well
>
>                                         - I added a capability but since this is the first time doing this, I was not sure I did it right
>                                           - I did add a test though for it and the test seems to do what I expect (all methods are failing with the
>                             JVMTI_ERROR_MUST_POSSESS_CAPABILITY error).
>                                               -
>                             http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorNoCapabilityTest.java.patch
>
>                                         - I still need to figure out what to do about the multi-agent vs single-agent issue
>
>                                         - As far as measurements, it seems I still need to look at:
>                                           - Why we do the 20 random calls first, are they necessary?
>                                           - Look at the mean of the sampling rate that the random generator does and also what is actually sampled
>                                           - What is the overhead in terms of memory/performance when on?
>
>                                      I have inlined my answers, I think I got them all in the new webrev, let me know your thoughts.
>
>                                      Thanks again!
>                                      Jc
>
>
>                                      On Fri, Jun 23, 2017 at 3:52 AM, Thomas Schatzl <[hidden email]> wrote:
>
>                                           Hi,
>
>                                           On Wed, 2017-06-21 at 13:45 -0700, JC Beyler wrote:
>                                           > Hi all,
>                                           >
>                                           > First off: Thanks again to Robbin and Thomas for their reviews :)
>                                           >
>                                           > Next, I've uploaded a new webrev:
>                                           > http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/
>
>                                           >
>                                           > Here is an update:
>                                           >
>                                           > - @Robbin, I forgot to say that yes I need to look at implementing
>                                           > this for the other architectures and testing it before it is all
>                                           > ready to go. Is it common to have it working on all possible
>                                           > combinations or is there a subset that I should be doing first and we
>                                           > can do the others later?
>                                           > - I've tested slowdebug, built and ran the JTreg tests I wrote with
>                                           > slowdebug and fixed a few more issues
>                                           > - I've refactored a bit of the code following Thomas' comments
>                                           >    - I think I've handled all the comments from Thomas (I put
>                                           > comments inline below for the specifics)
>
>                                           Thanks for handling all those.
>
>                                           > - Following Thomas' comments on statistics, I want to add some
>                                           > quality assurance tests and find that the easiest way would be to
>                                           > have a few counters of what is happening in the sampler and expose
>                                           > that to the user.
>                                           >    - I'll be adding that in the next version if no one sees any
>                                           > objections to that.
>                                           >    - This will allow me to add a sanity test in JTreg about number of
>                                           > samples and average of sampling rate
>                                           >
>                                           > @Thomas: I had a few questions that I inlined below but I will
>                                           > summarize the "bigger ones" here:
>                                           >    - You mentioned constants are not using the right conventions, I
>                                           > looked around and didn't see any convention except normal naming then
>                                           > for static constants. Is that right?
>
>                                           I looked through
>                                           https://wiki.openjdk.java.net/display/HotSpot/StyleGuide
>                                           and the rule is to "follow an existing pattern and must have a
>                                           distinct appearance from other names". Which does not help a lot, I
>                                           guess :/ The GC team started using upper camel case, e.g.
>                                           SomeOtherConstant, but very likely this is not applied
>                                           consistently throughout. So I am fine with not adding another style
>                                           (like kMaxStackDepth, with the "k" prefix of unknown meaning).
>
>                                           (Chances are you will find that style somewhere used anyway too,
>                                           apologies if so :/)
>
>
>                                      Thanks for that link, now I know where to look. I used the upper camel case in my code as well then :) I should have gotten them all.
>
>
>                                            > PS: I've also inlined my answers to Thomas below:
>                                            >
>                                            > On Tue, Jun 13, 2017 at 8:03 AM, Thomas Schatzl <thomas.schatzl@oracle.com> wrote:
>                                            > > Hi all,
>                                            > >
>                                            > > On Mon, 2017-06-12 at 11:11 -0700, JC Beyler wrote:
>                                            > > > Dear all,
>                                            > > >
>                                            > > > I've continued working on this and have done the following
>                                            > > webrev:
>                                            > > > http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/
>
>                                            > >
>                                            > > [...]
>                                            > > > Things I still need to do:
>                                            > > >    - Have to fix that TLAB case for the FastTLABRefill
>                                            > > >    - Have to start looking at the data to see that it is
>                                            > > consistent and does gather the right samples, right frequency, etc.
>                                            > > >    - Have to check the GC elements and what that produces
>                                            > > >    - Run a slowdebug run and ensure I fixed all those issues you
>                                            > > saw > Robbin
>                                            > > >
>                                            > > > Thanks for looking at the webrev and have a great week!
>                                            > >
>                                            > >   scratching a bit on the surface of this change, so apologies for
>                                            > > rather shallow comments:
>                                            > >
>                                            > > - macroAssembler_x86.cpp:5604: while this is compiler code, and I
>                                            > > am not sure this is final, please avoid littering the code with
>                                            > > TODO remarks :) They tend to be candidates for later wtf moments
>                                            > > only.
>                                            > >
>                                            > > Just file a CR for that.
>                                            > >
>                                            > Newcomer question: what is a CR and not sure I have the rights to do
>                                            > that yet ? :)
>
>                                           Apologies. CR is a change request, this suggests to file a bug in the
>                                           bug tracker. And you are right, you can't just create a new account in
>                                           the OpenJDK JIRA yourselves. :(
>
>
>                                      Ok good to know, I'll continue with my own todo list but I'll work hard on not letting it slip in the webrevs anymore :)
>
>
>                                           I was mostly referring to the "... but it is a TODO" part of that
>                                           comment in macroassembler_x86.cpp. Comments about the why of the code
>                                           are appreciated.
>
>                                           [Note that I now understand that this is to some degree still work in
>                                           progress. As long as the final changeset does not contain TODOs I am
>                                           fine (and it's not a hard objection; rather, their use in "final" code
>                                           is typically limited in my experience)]
>
>                                           5603   // Currently, if this happens, just set back the actual end to
>                                           where it was.
>                                           5604   // We miss a chance to sample here.
>
>                                           Would be okay, if explaining "this" and the "why" of missing a chance
>                                           to sample here would be best.
>
>                                           Like maybe:
>
>                                           // If we needed to refill TLABs, just set the actual end point to
>                                           // the end of the TLAB again. We do not sample here although we could.
>
>                                      Done with your comment, it works well in my mind.
>
>                                           I am not sure whether "miss a chance to sample" meant "we could, but
>                                           consciously don't because it's not that useful" or "it would be
>                                           necessary but don't because it's too complicated to do.".
>
>                                           Looking at the original comment once more, I am also not sure if that
>                                           comment shouldn't referring to the "end" variable (not actual_end)
>                                           because that's the variable that is responsible for taking the sampling
>                                           path? (Going from the member description of ThreadLocalAllocBuffer).
>
>
>                                      I've moved this code and it no longer shows up here but the rationale and answer was:
>
>                                      So.. Yes, end is the variable provoking the sampling. Actual end is the actual end of the TLAB.
>
>                                      What was happening here is that the code resets _end to point towards the end of the new TLAB. Because we now have _end for
>                             sampling and _actual_end for
>                                      the actual end, we need to update _actual_end as well.
>
>                                      Normally, were we to do the real work here, we would calculate the (end - start) offset, then do:
>
>                                      - Set the new end to: start + (old_end - old_start)
>                                      - Set the actual end like we do here now, because it is the actual end.
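The end/actual_end bookkeeping described above can be sketched with a toy TLAB model (field names echo the discussion, but this is an illustrative stand-in, not the HotSpot C++ code):

```java
public class TlabEndDemo {
    long start, end, actualEnd;   // 'end' may be pulled in early to force a sample

    TlabEndDemo(long start, long size) {
        this.start = start;
        this.actualEnd = start + size;
        this.end = this.actualEnd;
    }

    // Pull 'end' in so the fast path falls into the slow path at the sample
    // point, while 'actualEnd' remembers where the TLAB really stops.
    void setSamplePoint(long bytesUntilSample) {
        end = Math.min(start + bytesUntilSample, actualEnd);
    }

    // On refill, both must be updated: the new end keeps the old (end - start)
    // sampling offset, and actualEnd becomes the real end of the new TLAB.
    void refill(long newStart, long newSize) {
        long sampleOffset = end - start;
        start = newStart;
        actualEnd = newStart + newSize;
        end = Math.min(newStart + sampleOffset, actualEnd);
    }

    public static void main(String[] args) {
        TlabEndDemo tlab = new TlabEndDemo(0, 1 << 20);  // 1MB TLAB at address 0
        tlab.setSamplePoint(4096);
        tlab.refill(1 << 20, 1 << 20);                   // new 1MB TLAB
        System.out.println(tlab.end - tlab.start);       // sampling offset preserved
        System.out.println(tlab.actualEnd - tlab.start); // real TLAB size intact
    }
}
```

Forgetting the second update is exactly the bug the quoted patch to keep end and actual_end in sync guards against: the thread would either sample at the wrong point or overrun the buffer boundary.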
>
>                                      Why is this not done here now anymore?
>                                          - I was still debating which path to take:
>                                             - Do it in the fast refill code, it has its perks:
>                                                 - In a world where fast refills are happening all the time or a lot, we can augment there the code to do the sampling
>                                             - Remember what we had as an end before leaving the slowpath and check on return
>                                                 - This is what I'm doing now; it removes the need to go fix up all fast refill paths, but if you remain in fast refill paths,
>                             you won't get sampling. I
>                                      have to think about the consequences of that, maybe a future change later on?
>                                                    - I have the statistics now so I'm going to study that
>                                                       -> By the way, though my statistics are showing I'm missing some samples, if I turn off FastTLABRefill, the loss is the
>                             same, so for now it seems
>                                      this does not occur in my simple test.
>
>
>
>                                           But maybe I am only confused and it's best to just leave the comment
>                                           out. :)
>
>                                           Thinking about it some more, doesn't this not-sampling in this case
>                                           mean that sampling does not work in any collector that does inline TLAB
>                                           allocation at the moment? (Or is inline TLAB alloc automatically
>                                           disabled with sampling somehow?)
>
>                                           That would indeed be a bigger TODO then :)
>
>
>                                      Agreed, this remark made me think that perhaps as a first step the new way of doing it is better, but I did have to:
>                                         - Remove the const from the ThreadLocalBuffer remaining and hard_end methods
>                                         - Move hard_end out of the header file to have a bit more logic there
>
>                                      Please let me know what you think of that and whether you prefer it this way or changing the fast refills. (I prefer this way now because it is more incremental.)
>
>
>                                           > > - calling HeapMonitoring::do_weak_oops() (which should probably be
>                                           > > called weak_oops_do() like other similar methods) only if string
>                                           > > deduplication is enabled (in g1CollectedHeap.cpp:4511) seems wrong.
>                                           >
>                                           > The call should be at least around 6 lines up outside the if.
>                                           >
>                                           > Preferentially in a method like process_weak_jni_handles(), including
>                                           > additional logging. (No new (G1) gc phase without minimal logging
>                                           > :)).
>                                           > Done but really not sure because:
>                                           >
>                                           > I put for logging:
>                                           >   log_develop_trace(gc, freelist)("G1ConcRegionFreeing [other] : heap
>                                           > monitoring");
>
>                                           I would think that "gc, ref" would be more appropriate log tags for
>                                           this, similar to jni handles.
>                                           (I am also not sure what weak reference handling has to do with
>                                           G1ConcRegionFreeing, so I am a bit puzzled.)
>
>
>                                      I was not sure what to put for the tags or really for the message. I cleaned it up a bit; it is now:
>                                           log_develop_trace(gc, ref)("HeapSampling [other] : heap monitoring processing");
>
>
>
>                                           > Since weak_jni_handles didn't have logging for me to be inspired
>                                           > from, I did that but unconvinced this is what should be done.
>
>                                           The JNI handle processing does have logging, but only in
>                                           ReferenceProcessor::process_discovered_references(). In
>                                           process_weak_jni_handles() only overall time is measured (in a G1
>                                           specific way, since only G1 supports disabling reference processing) :/
>
>                                           The code in ReferenceProcessor prints both time taken
>                                           referenceProcessor.cpp:254, as well as the count, but strangely only in
>                                           debug VMs.
>
>                                           I have no idea why this logging is that unimportant to only print that
>                                           in a debug VM. However there are reviews out for changing this area a
>                                           bit, so it might be useful to wait for that (JDK-8173335).
>
>
>                                      I cleaned it up a bit anyway and now it returns the count of objects that are in the system.
>
>
>                                           > > - the change doubles the size of
>                                           > > CollectedHeap::allocate_from_tlab_slow() above the "small and nice"
>                                           > > threshold. Maybe it could be refactored a bit.
>                                           > Done I think, it looks better to me :).
>
>                                           In ThreadLocalAllocBuffer::handle_sample() I think the
>                                           set_back_actual_end()/pick_next_sample() calls could be hoisted out of
>                                           the "if" :)
>
>
>                                      Done!
>
>
>                                           > > - referenceProcessor.cpp:261: the change should add logging about
>                                           > > the number of references encountered, maybe after the corresponding
>                                           > > "JNI weak reference count" log message.
>                                           > Just to double check, are you saying that you'd like to have the heap
>                                           > sampler to keep in store how many sampled objects were encountered in
>                                           > the HeapMonitoring::weak_oops_do?
>                                           >    - Would a return of the method with the number of handled
>                                           > references and logging that work?
>
>                                           Yes, it's fine if HeapMonitoring::weak_oops_do() only returned the
>                                           number of processed weak oops.
>
>
>                                      Done also (but I admit I have not tested the output yet) :)
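As a sketch of what such a counting weak_oops_do could look like (the types below are simplified stand-ins for HotSpot's BoolObjectClosure and storage classes, not the real signatures):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for an object whose liveness can be queried.
struct Object { bool live; };

// Stand-in for HotSpot's BoolObjectClosure.
struct IsAliveClosure {
  bool do_object_b(Object* o) { return o != nullptr && o->live; }
};

// Stand-in for the sampled-trace storage. weak_oops_do() walks the
// weak references, clears the dead ones, and returns how many live
// entries were processed so the caller can log the count.
struct Storage {
  std::vector<Object*> traces;

  size_t weak_oops_do(IsAliveClosure* is_alive) {
    size_t count = 0;
    for (Object*& obj : traces) {
      if (obj != nullptr && is_alive->do_object_b(obj)) {
        ++count;        // keep and count the live entry
      } else {
        obj = nullptr;  // drop the stale entry
      }
    }
    return count;       // caller logs this, e.g. "N weak oops processed"
  }
};
```

The real code would additionally apply an OopClosure to keep the surviving oops up to date and run a complete_gc closure afterwards; only the counting aspect is shown here.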
>
>
>                                           >    - Additionally, would you prefer it in a separate block with its
>                                           > GCTraceTime?
>
>                                           Yes. Both kinds of information is interesting: while the time taken is
>                                           typically more important, the next question would be why, and the
>                                           number of references typically goes a long way there.
>
>                                           See above though, it is probably best to wait a bit.
>
>
>                                      Agreed that I "could" wait but, if it's ok, I'll just refactor/remove this when we get closer to something final. Either JDK-8173335 has already gone in and I will notice it now, or it will go in soon and I can change it then.
>
>
>                                           > > - threadLocalAllocBuffer.cpp:331: one more "TODO"
>                                           > Removed it and added it to my personal todos to look at.
>                                           >      > >
>                                           > > - threadLocalAllocBuffer.hpp: ThreadLocalAllocBuffer class
>                                           > > documentation should be updated about the sampling additions. I
>                                           > > would have no clue what the difference between "actual_end" and
>                                           > > "end" would be from the given information.
>                                           > If you are talking about the comments in this file, I made them more
>                                           > clear I hope in the new webrev. If it was somewhere else, let me know
>                                           > where to change.
>
>                                           Thanks, that's much better. Maybe add a note in the class comment
>                                           that ThreadLocalBuffer provides some sampling facility by modifying
>                                           the end() of the TLAB to cause "frequent" calls into the runtime
>                                           where the actual sampling takes place.
>
>
>                                      Done, I think it's better now. Added something about the slow_path_end as well.
>
>
>                                           > > - in heapMonitoring.hpp: there are some random comments about some
>                                           > > code that has been grabbed from "util/math/fastmath.[h|cc]". I
>                                           > > can't tell whether this is code that can be used but I assume that
>                                           > > Noam Shazeer is okay with that (i.e. that's all Google code).
>                                           > Jeremy and I double checked and we can release that as I thought. I
>                                           > removed the comment from that piece of code entirely.
>
>                                           Thanks.
>
>                                           > > - heapMonitoring.hpp/cpp static constant naming does not correspond
>                                           > > to Hotspot's. Additionally, in Hotspot static methods are cased
>                                           > > like other methods.
>                                           > I think I fixed the methods to be cased the same way as all other
>                                           > methods. For static constants, I was not sure. I fixed a few other
>                                           > variables but I could not seem to really see a consistent trend for
>                                           > constants. I made them as variables but I'm not sure now.
>
>                                           Sorry again, style is a kind of mess. The goal of my suggestions here
>                                           is only to prevent yet another style creeping in.
>
>                                           > > - in heapMonitoring.cpp there are a few cryptic comments at the top
>                                           > > that seem to refer to internal stuff that should probably be
>                                           > > removed.
>                                           > Sorry about that! My personal todos not cleared out.
>
>                                           I am happy about comments, but I simply did not understand any of that
>                                           and I do not know about other readers as well.
>
>                                           It is fine if you think you will remember to remove/update them
>                                           before the review proper (I misunderstood the review situation a
>                                           little, it seems).
>
>                                           > > I did not think through the impact of the TLAB changes on collector
>                                           > > behavior yet (if there are). Also I did not check for problems with
>                                           > > concurrent mark and SATB/G1 (if there are).
>                                           > I would love to know your thoughts on this, I think this is fine. I
>
>                                           I think so too now. No objects are made live out of thin air :)
>
>                                           > see issues with multiple threads right now hitting the stack storage
>                                           > instance. Previous webrevs had a mutex lock here but we took it out
>                                           > for simplicity (and only for now).
>
>                                           :) When looking at this after some thinking I now assume for this
>                                           review that this code is not MT safe at all. There seems to be more
>                                           synchronization missing than just the one for the StackTraceStorage. So
>                                           no comments about this here.
>
>
>                                      I double checked a bit (quickly, I admit) but it seems that the synchronization in StackTraceStorage is really all you need (all methods lead to a StackTraceStorage one and can be multithreaded outside of that).
>                                      There is a question about initialization: the method HeapMonitoring::initialize_profiling is not thread safe.
>                                      It would work (famous last words) and not crash if there was a race, but we could add a synchronization point there as well (and therefore on the stop as well).
>
>                                      But anyway, I will really check and do this once we add back synchronization.
>
>
>                                           Also, this would require some kind of specification of what is allowed
>                                           to be called when and where.
>
>
>                                      Would we specify this with the methods in the jvmti.xml file? We could start by specifying in each that they are not thread safe, but I saw no mention of that for other methods.
>
>
>                                           One potentially relevant observation about locking here: depending on
>                                           sampling frequency, StackTraceStore::add_trace() may be rather
>                                           frequently called. I assume that you are going to do measurements :)
>
>
>                                      Though we don't have the TLAB implementation in our code, the compiler-generated sampler adds about 2% overhead with a 512k sampling rate. I can do real measurements when the code settles and we can see how costly this is as a TLAB implementation.
>                                      However, my theory is that if the rate is 512k, the memory/performance overhead should be minimal, since that is what we saw with our code/workloads (though not called the same way, we call it essentially at the same rate).
>                                      If you have a benchmark you'd like me to test, let me know!
>
>                                      Right now, with my really small test, this does use a bit of overhead even for a 512k sample size. I don't know yet why; I'm going to see what is going on.
>
>                                      Finally, I think it is not reasonable to assume the overhead will be negligible if the sampling rate used is too low. The user should know that the lower the rate, the higher the overhead (documentation TODO?).
>
>
>                                           I am not sure what the expected usage of the API is, but
>                                           StackTraceStore::add_trace() seems to be able to grow without bounds.
>                                           Only a GC truncates them to the live ones. That in itself seems to be
>                                           problematic (GCs can be *wide* apart), and of course some of the API
>                                           methods add to that because they duplicate that unbounded array. Do you
>                                           have any concerns/measurements about this?
>
>
>                                      So, the theory is that yes, add_trace can grow without bounds, but it grows at a sample per 512k of allocated space. The stacks it gathers are currently capped at 64 frames (I'd like to turn that into a user option at some point). So I have no concerns because:
>
>                                      - If this is really taking a lot of space, that means the job is keeping a lot of objects in memory as well, so the entire heap is getting huge
>                                      - If that is the case, you will be triggering a GC at some point anyway.
>
>                                      (I'm putting under the rug the issue of "what if we set the rate to 1, for example", because as you lower the sampling rate, we cannot guarantee low overhead; the idea behind this feature is to have a means of getting meaningful allocation samples at a low overhead.)
>
>                                      I have no real measurements right now, but since I now have some statistics I can poll, I will look a bit more at this question.
>
>                                      I have the same last sentence as above: the user should expect this to happen if the sampling rate is too small. That can probably be reflected in StartHeapSampling as a note: careful, this might impact your performance.
>
>
>                                           Also, these stack traces might hold on to huge arrays. Any
>                                           consideration of that? Particularly it might be the cause for OOMEs in
>                                           tight memory situations.
>
>
>                                      There is a stack size maximum that is set to 64, so it should not hold huge arrays. I don't think this is an issue, but I can double check with a test or two.
>
>
>                                           - please consider adding a safepoint check in
>                                           HeapMonitoring::weak_oops_do to prevent accidental misuse.
>
>                                           - in struct StackTraceStorage, the public fields may also need
>                                           underscores. At least some files in the runtime directory have structs
>                                           with underscored public members (and some don't). The runtime team
>                                           should probably comment on that.
>
>
>                                      Agreed, I did not know. I looked around and a lot of structs did not seem to have them, so I left it as is. I will happily change it if someone prefers (I was not sure if you really had a preference; your sentence seemed to be more a note of "this might need to change but I don't know if the runtime team enforces that". Let me know if I read that wrongly).
>
>
>                                           - In StackTraceStorage::weak_oops_do(), when examining the
>                                           StackTraceData, maybe it is useful to consider having a non-NULL
>                                           reference outside of the heap's reserved space an error. There should
>                                           be no oop outside of the heap's reserved space ever.
>
>                                           Unless you allow storing random values in StackTraceData::obj, which I
>                                           would not encourage.
>
>
>                                      I suppose you are talking about this part:
>                                      if ((value != NULL && Universe::heap()->is_in_reserved(value)) &&
>                                                  (is_alive == NULL || is_alive->do_object_b(value))) {
>
>                                      What you are saying is that I could have something like:
>                                      if (value != my_non_null_reference &&
>                                                  (is_alive == NULL || is_alive->do_object_b(value))) {
>
>                                      Is that what you meant? Is there really a reason to do so? When I look at the code, is_in_reserved seems like an O(1) method call. I'm not even sure we can have a NULL value, to be honest. I might have to study that to see whether this was a paranoid test to begin with.
>
>                                      The is_alive code has now morphed due to the comment below.
>
>
>
>                                           - HeapMonitoring::weak_oops_do() does not seem to use the
>                                           passed AbstractRefProcTaskExecutor.
>
>
>                                      It did use it:
>                                         size_t HeapMonitoring::weak_oops_do(
>                                            AbstractRefProcTaskExecutor *task_executor,
>                                            BoolObjectClosure* is_alive,
>                                            OopClosure *f,
>                                            VoidClosure *complete_gc) {
>                                          assert(SafepointSynchronize::is_at_safepoint(), "must be at safepoint");
>
>                                          if (task_executor != NULL) {
>                                            task_executor->set_single_threaded_mode();
>                                          }
>                                          return StackTraceStorage::storage()->weak_oops_do(is_alive, f, complete_gc);
>                                      }
>
>                                      But due to the comment below, I refactored this, so this is no longer here. Now I have an always true closure that is passed.
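The always-true-closure idea mentioned here can be sketched as follows. All names are hypothetical; the real BoolObjectClosure hierarchy and call sites in HotSpot differ:

```cpp
#include <cassert>
#include <cstddef>

struct Object;  // opaque; only pointers are handled here

// Stand-in for HotSpot's BoolObjectClosure interface.
struct BoolObjectClosure {
  virtual bool do_object_b(Object* o) = 0;
  virtual ~BoolObjectClosure() {}
};

// A closure that considers every object alive. Passing this instead of
// NULL removes the need for NULL checks at every use site.
struct AlwaysTrueClosure : BoolObjectClosure {
  bool do_object_b(Object*) override { return true; }
};

// Hypothetical processing function: substitutes the always-true
// closure once, up front, then uses the closure unconditionally.
size_t process(BoolObjectClosure* is_alive, Object** objs, size_t n) {
  static AlwaysTrueClosure always_true;
  if (is_alive == nullptr) {
    is_alive = &always_true;  // no NULL checks needed below
  }
  size_t live = 0;
  for (size_t i = 0; i < n; ++i) {
    if (is_alive->do_object_b(objs[i])) ++live;
  }
  return live;
}
```

The design choice is the usual null-object pattern: one substitution at the entry point keeps the hot loop branch-free with respect to the "no filter" case.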
>
>
>                                           - I do not understand allowing to call this method with a NULL
>                                           complete_gc closure. This would mean that objects referenced from the
>                                           object that is referenced by the StackTraceData are not pulled, meaning
>                                           they would get stale.
>
>                                           - same with is_alive parameter value of NULL
>
>
>                                      So these questions made me look a bit closer at this code. I think this code was written this way to have a very small impact on the file, but you are right, there is no reason for this here. I've simplified the code by adding a process_HeapSampling method in referenceProcessor.cpp that handles everything there.
>
>                                      The code allowed NULLs because it depended on where you were coming from and how the code was being called.
>
>                                      - I added a static always_true variable and pass that now, to be more consistent with the rest of the code.
>                                      - I moved the complete_gc into process_phaseHeapSampling (new method) and handle the task_executor and the complete_gc there
>                                           - Newbie question: in our code we did a set_single_threaded_mode, but I see that process_phaseJNI does it right before its call; do I need to do it for process_phaseHeapSample?
>                                      That API is much cleaner (in my mind) and is consistent with what is done around it (again, in my mind).
>
>
>                                           - heapMonitoring.cpp:590: I do not completely understand the purpose
>                                           of this code: in the end this results in a fixed value directly
>                                           dependent on the Thread address anyway?
>                                           IOW, what is special about exactly 20 rounds?
>
>
>                                      So we really want a fast random number generator whose results have a specific mean (512k is the default we use). The code uses the thread address as the seed of the sequence (the rationale being: why not, it is random enough). Then, instead of just starting there, we prime the sequence and really only start at the 21st number; that is arbitrary and I have not done a study to see if we could do more or less of that.
>
>                                      As I have the statistics of the system up and running, I'll run some experiments to see whether this is needed and whether 20 is a good number.
>
>
>                                           - also I would consider stripping a few bits of the threads' address as
>                                           initialization value for your rng. The last three bits (and probably
>                                           more, check whether the Thread object is allocated on special
>                                           boundaries) are always zero for them.
>                                           Not sure if the given "random" value is random enough before/after,
>                                           this method, so just skip that comment if you think this is not
>                                           required.
>
>
>                                      I don't know, is the honest answer. I think what is important is that we tend towards a mean and it is random "enough" not to fall into the pitfall of only sampling a subset of objects due to their allocation order. I added that as a test to do, to see whether it changes the mean in any way for the 512k default value and/or whether the first 1000 elements look better.
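To make the discussion concrete, here is a rough standalone sketch of such a sampler: seeded from a thread address with the low alignment bits shifted out, primed for a number of rounds, and drawing exponentially distributed intervals around the requested mean. The xorshift variant, the shift of 3, and all other constants are illustrative, not HotSpot's actual code:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstddef>

// Sketch of a per-thread sampling-interval picker.
struct SamplePicker {
  uint64_t _rnd;

  explicit SamplePicker(uintptr_t thread_addr) {
    // The low bits of an aligned address are always zero; shift them
    // out so the seed has some entropy, and never seed xorshift with 0.
    _rnd = (uint64_t)(thread_addr >> 3);
    if (_rnd == 0) _rnd = 1;
    for (int i = 0; i < 20; ++i) next();  // "prime" the sequence
  }

  uint64_t next() {  // xorshift64 step
    _rnd ^= _rnd << 13;
    _rnd ^= _rnd >> 7;
    _rnd ^= _rnd << 17;
    return _rnd;
  }

  // Bytes until the next sample: exponentially distributed with the
  // requested mean, which keeps sampling unbiased with respect to
  // allocation order (a fixed interval would always hit the same
  // allocation sites in a regular allocation pattern).
  size_t pick_next_sample(double mean_bytes) {
    double u = (next() >> 11) * (1.0 / 9007199254740992.0);  // [0,1)
    if (u <= 0.0) u = 1e-12;  // guard against log(0)
    return (size_t)(-std::log(u) * mean_bytes) + 1;
  }
};
```

Whether 20 priming rounds are needed at all is exactly the open question above; with a reasonable mixing function the empirical mean converges to the requested one either way.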
>
>
>                                           Some more random nits I did not find a place to put anywhere:
>
>                                           - ThreadLocalAllocBuffer::_extra_space does not seem to be used
>                                           anywhere?
>
>
>                                      Good catch :).
>
>
>                                           - Maybe indent the declaration of ThreadLocalAllocBuffer::_bytes_until_sample to align below the other members of that group.
>
>
>                                      Done; moved it up a bit to have non-static members together and static ones separate.
>
>                                           Thanks,
>                                              Thomas
>
>
>                                      Thanks for your review!
>                                      Jc
>
>
>
>
>
>
>
>

Re: Low-Overhead Heap Profiling

JC Beyler
Dear all,

Small update this week with this new webrev:

I patched in the code changes shown by Robbin last week and I refactored collectedHeap.cpp:

The original code had become a bit too complex in my opinion, with handle_heap_sampling handling too many things. So I subdivided the logic into two smaller methods and moved some of the logic out to make it clearer. Hopefully it is :)

Let me know if you have any questions/comments :)
Jc

On Mon, Oct 16, 2017 at 9:34 AM, JC Beyler <[hidden email]> wrote:
Hi Robbin,

That is because version 11 to 12 was only a test change. I was going to write about it and say here are the webrev links:
Incremental:

Full webrev:

This change focused only on refactoring the tests to be more manageable, readable, and maintainable. As all tests look at allocations, I moved common code to a java class:

And then most tests call into that class to turn on/off the sampling, allocate, etc. This has removed almost 500 lines of test code so I'm happy about that.

Thanks for your changes; those were a few relics of previous versions :). I've already integrated them into my code and will make a new webrev at the end of this week with a bit of refactoring of the code handling the TLAB slow path. I find it could use some refactoring to make it easier to follow, so I'm going to take a stab at it this week.

Any other issues/comments?

Thanks!
Jc


On Mon, Oct 16, 2017 at 8:46 AM, Robbin Ehn <[hidden email]> wrote:
Hi JC,

I saw a webrev.12 in the directory, with only test changes(11->12), so I took that version.
I had a look and tested the tests, worked fine!

First glance at the code (looking at full v12) some minor things below, mostly unused stuff.

Thanks, Robbin

diff -r 9047e0d726d6 src/hotspot/share/runtime/heapMonitoring.cpp
--- a/src/hotspot/share/runtime/heapMonitoring.cpp      Mon Oct 16 16:54:06 2017 +0200
+++ b/src/hotspot/share/runtime/heapMonitoring.cpp      Mon Oct 16 17:42:42 2017 +0200
@@ -211,2 +211,3 @@
   void initialize(int max_storage) {
+    // validate max_storage to sane value ? What would 0 mean ?
     MutexLocker mu(HeapMonitor_lock);
@@ -227,8 +228,4 @@
   bool initialized() { return _initialized; }
-  volatile bool *initialized_address() { return &_initialized; }

  private:
-  // Protects the traces currently sampled (below).
-  volatile intptr_t _stack_storage_lock[1];
-
   // The traces currently sampled.
@@ -313,3 +310,2 @@
   _initialized(false) {
-    _stack_storage_lock[0] = 0;
 }
@@ -532,13 +528,2 @@

-// Delegate the initialization question to the underlying storage system.
-bool HeapMonitoring::initialized() {
-  return StackTraceStorage::storage()->initialized();
-}
-
-// Delegate the initialization question to the underlying storage system.
-bool *HeapMonitoring::initialized_address() {
-  return
-      const_cast<bool*>(StackTraceStorage::storage()->initialized_address());
-}
-
 void HeapMonitoring::get_live_traces(jvmtiStackTraces *traces) {
diff -r 9047e0d726d6 src/hotspot/share/runtime/heapMonitoring.hpp
--- a/src/hotspot/share/runtime/heapMonitoring.hpp      Mon Oct 16 16:54:06 2017 +0200
+++ b/src/hotspot/share/runtime/heapMonitoring.hpp      Mon Oct 16 17:42:42 2017 +0200
@@ -35,3 +35,2 @@
   static uint64_t _rnd;
-  static bool _initialized;
   static jint _monitoring_rate;
@@ -92,7 +91,2 @@

-  // Is the profiler initialized and where is the address to the initialized
-  // boolean.
-  static bool initialized();
-  static bool *initialized_address();
-
   // Called when o is to be sampled from a given thread and a given size.



On 10/10/2017 12:57 AM, JC Beyler wrote:
Dear all,

Thread-safety is back!! Here is the update webrev:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/

Full webrev is here:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.11/

In order to really test this, I needed to add this so thought now was a good time. It required a few changes here for the creation to ensure correctness and safety. Now we keep the static pointer but clear the data internally so on re-initialize, it will be a bit more costly than before. I don't think this is a huge use-case so I did not think it was a problem. I used the internal MutexLocker, I think I used it well, let me know.
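A rough standalone sketch of the re-initialization scheme described above: the static instance is kept, and initialize() clears the internal data under a lock instead of replacing the object. Here std::mutex stands in for HotSpot's MutexLocker, and all names (TraceStorage, add_trace) are invented for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Toy model of a storage singleton that supports safe re-initialization.
class TraceStorage {
 public:
  static TraceStorage* storage() {
    static TraceStorage instance;  // the "static pointer" is kept alive
    return &instance;
  }

  // Re-initialization clears the data in place under the lock; this is
  // a bit more costly than swapping a pointer, but keeps all readers
  // pointing at a valid object.
  void initialize(int max_storage) {
    std::lock_guard<std::mutex> lock(_lock);
    _traces.clear();
    _traces.reserve((size_t)max_storage);
    _initialized = true;
  }

  void add_trace(int id) {
    std::lock_guard<std::mutex> lock(_lock);
    if (_initialized) _traces.push_back(id);
  }

  size_t count() {
    std::lock_guard<std::mutex> lock(_lock);
    return _traces.size();
  }

 private:
  std::mutex _lock;
  std::vector<int> _traces;
  bool _initialized = false;
};
```

The trade-off is exactly the one mentioned in the mail: re-initialize is more expensive than before, but since starting and stopping sampling is rare, the cost does not matter much.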

I also added three tests:

1) Stack depth test:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStackDepthTest.java.patch

This test shows that the maximum stack depth system is working.

2) Thread safety:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorThreadTest.java.patch

The test creates 24 threads that all allocate at the same time. The test then checks that it finds samples from all the threads.

3) Thread on/off safety
http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorThreadOnOffTest.java.patch

The test creates 24 threads that all allocate a bunch of memory. Then another thread turns the sampling on/off.

Btw, both tests 2 & 3 failed without the locks.

As I worked on this, I saw a lot of places where the tests do very similar things, so I'm going to clean up the code a bit and make a HeapAllocator class that all tests can call directly. This will greatly simplify the code.
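As a rough illustration, a shared helper along those lines might look like the following sketch (the class name HeapAllocator comes from the message above, but the method names and shape are my assumptions, not the actual patch):

```java
// Hypothetical consolidation of the allocation loops the HeapMonitor tests
// repeat; names and signatures are illustrative only.
public class HeapAllocator {
  // Keep a reference so the allocations are not trivially optimized away.
  private static int[] sink;

  // Allocate 'count' int arrays of 'size' elements each; returns the
  // approximate number of bytes requested (4 bytes per int element,
  // object headers ignored).
  public static long allocate(int count, int size) {
    long bytes = 0;
    for (int i = 0; i < count; i++) {
      sink = new int[size];
      bytes += 4L * size;
    }
    return bytes;
  }

  public static void main(String[] args) {
    System.out.println(allocate(1000, 1000)); // prints 4000000
  }
}
```

Each test could then call HeapAllocator.allocate(...) instead of carrying its own loop.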

Thanks for any comments/criticisms!
Jc


On Mon, Oct 2, 2017 at 8:52 PM, JC Beyler <[hidden email]> wrote:

    Dear all,

    Small update to the webrev:
    http://cr.openjdk.java.net/~rasbold/8171119/webrev.09_10/

    Full webrev is here:
    http://cr.openjdk.java.net/~rasbold/8171119/webrev.10/

    I updated a bit of the naming, removed a TODO comment, and added a test for the sampling rate. I also updated the maximum stack depth to 1024; there is no
    reason to keep it so small. I did a micro benchmark that tests the overhead and it seems relatively unchanged.

    I compared allocations from a stack depth of 10 and allocations from a stack depth of 1024 (allocations are from the same helper method in
    http://cr.openjdk.java.net/~rasbold/8171119/webrev.10/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatRateTest.java):
               - For an array of 1 integer allocated in a loop; stack depth 1024 vs stack depth 10: 1% slower
               - For an array of 200k integers allocated in a loop; stack depth 1024 vs stack depth 10: 3% slower

    So the maximum stack depth is now 1024, but we only copy over the stack frames actually used.
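That "copy only what is used" behavior can be sketched as follows (hypothetical names; the patch's actual storage code differs):

```java
import java.util.Arrays;

public class FrameCopy {
  // Even with a 1024-frame maximum, only the frames actually filled in
  // are copied into the stored trace, so a depth-10 stack costs a
  // 10-element copy, not a 1024-element one.
  static long[] storeTrace(long[] frameBuffer, int framesUsed, int maxDepth) {
    int n = Math.min(framesUsed, maxDepth);
    return Arrays.copyOf(frameBuffer, n);
  }
}
```

This is consistent with the small (1-3%) overhead difference reported between depth 10 and depth 1024.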

    For the next webrev, I will be adding a stack depth test to show that it works and probably put back the mutex locking so that we can see how difficult it is to keep
    thread safe.

    Let me know what you think!
    Jc



    On Mon, Sep 25, 2017 at 3:02 PM, JC Beyler <[hidden email]> wrote:

        Forgot to say that for my numbers:
          - Not in the test are the actual numbers I got for the various array sizes. I ran the program 30 times and parsed the output; here are the averages and standard
        deviations:
               1000:     1.28% average; 1.13% standard deviation
               10000:    1.59% average; 1.25% standard deviation
               100000:   1.26% average; 1.26% standard deviation

        The 1000/10000/100000 are the sizes of the arrays being allocated. These are allocated 100k times and the sampling rate is 111 times the size of the array.

        Thanks!
        Jc


        On Mon, Sep 25, 2017 at 3:01 PM, JC Beyler <[hidden email]> wrote:

            Hi all,

            After a bit of a break, I am back working on this :). As before, here are two webrevs:

            - Full change set: http://cr.openjdk.java.net/~rasbold/8171119/webrev.09/
            - Compared to version 8: http://cr.openjdk.java.net/~rasbold/8171119/webrev.08_09/
                 (This version is compared to version 8 I last showed but ported to the new folder hierarchy)

            In this version I have:
               - Handled Thomas' comments from his email of 07/03:
                    - Merged the logging to be standard
                    - Fixed up the code a bit where asked
                    - Added some notes about the code not being thread-safe yet
                - Removed additional dead code from the version that modifies interpreter/c1/c2
                - Fixed compiler issues so that it compiles with --disable-precompiled-headers
                     - Tested with ./configure --with-boot-jdk=<jdk8> --with-debug-level=slowdebug --disable-precompiled-headers

            Additionally, I added a test to check the sanity of the sampler: HeapMonitorStatCorrectnessTest
            (http://cr.openjdk.java.net/~rasbold/8171119/webrev.08_09/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatCorrectnessTest.java.patch)
                - This allocates a number of arrays and checks that we obtain the number of samples we want with an accepted error of 5%. I tested it 100 times and it
            passed every time; I can test more if wanted
                - Not in the test are the actual numbers I got for the various array sizes. I ran the program 30 times and parsed the output; here are the averages and
            standard deviations:
                   1000:     1.28% average; 1.13% standard deviation
                   10000:    1.59% average; 1.25% standard deviation
                   100000:   1.26% average; 1.26% standard deviation

            What this means is that we were always within about 1~2% of the number of samples the test expected.
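For context, the expected sample count such a test checks against can be estimated as total bytes allocated divided by the mean sampling interval. A back-of-the-envelope sketch (the 512KB interval and the 4-bytes-per-int approximation are my assumptions; object headers are ignored):

```java
public class ExpectedSamples {
  // Rough expected number of samples for 'count' allocations of int[size]
  // at a given mean sampling interval in bytes.
  static long expected(long count, long size, long intervalBytes) {
    long totalBytes = count * size * 4; // 4 bytes per int element
    return totalBytes / intervalBytes;
  }

  public static void main(String[] args) {
    // e.g. 100k allocations of int[1000] at a 512KB mean interval:
    System.out.println(expected(100_000, 1000, 512 * 1024)); // prints 762
  }
}
```

A 5% acceptance band around such an estimate is comfortably wider than the 1~2% deviations reported above.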

            Let me know what you think,
            Jc

            On Wed, Jul 5, 2017 at 9:31 PM, JC Beyler <[hidden email]> wrote:

                Hi all,

                I apologize, I have not yet handled your remarks but thought this new webrev would also be useful to see and comment on perhaps.

                Here is the latest webrev, it is generated slightly different than the others since now I'm using webrev.ksh without the -N option:
                http://cr.openjdk.java.net/~rasbold/8171119/webrev.08/

                And the webrev.07 to webrev.08 diff is here:
                http://cr.openjdk.java.net/~rasbold/8171119/webrev.07_08/

                (Let me know if it works well)

                It's a small change between versions but it:
                   - provides a fix that makes the average sample rate correct (more on that below).
                   - fixes the code to actually have it play nicely with the fast tlab refill
                   - cleaned up the JVMTI text a bit and now use jvmtiFrameInfo
                   - moved the capability to be onload solo

                With this webrev, I've done a small study of the random number generator we use here for the sampling rate. I took a small program and it can be simplified to:

                for (int i = 0; i < outerLoop; i++) {
                  for (int j = 0; j < innerLoop; j++) {
                    int[] tmp = new int[arraySize];
                  }
                }

                - I've fixed the outer and inner loops at 800 for this experiment, meaning we allocate an array of a given size 640,000 times.

                - Each program provides the average sample size used for the whole execution

                - Then, I ran each variation 30 times and then calculated the average of the average sample size used for various array sizes. I selected the array size to
                be one of the following: 1, 10, 100, 1000.

                - When compared to the 512KB target, the error of the average sample size over 30 runs:
                1: 4.62% of error
                10: 3.09% of error
                100: 0.36% of error
                1000: 0.1% of error
                10000: 0.03% of error

                What it shows is that, depending on the number of samples, the average does become better. This is because with an allocation of 1 element per array, it
                will take longer to hit one of the thresholds. This is seen by looking at the sample count statistic I put in. For the same number of iterations (800 *
                800), the different array sizes produce:
                1: 62 samples
                10: 125 samples
                100: 788 samples
                1000: 6166 samples
                10000: 57721 samples

                And of course, the more samples you have, the more sampling intervals you draw, which means the average converges.
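The convergence can be illustrated with a small simulation. The exponential draw below is a common way for such samplers to pick the next sample point; this is a sketch, not the patch's actual random number generator:

```java
import java.util.Random;

public class SamplingAverage {
  // Draw the number of bytes until the next sample from an exponential
  // distribution with the given mean: -mean * ln(U), U uniform in (0, 1].
  static double nextInterval(Random r, double meanBytes) {
    return -meanBytes * Math.log(1.0 - r.nextDouble());
  }

  public static void main(String[] args) {
    Random r = new Random(42);
    double mean = 512 * 1024;
    for (int n : new int[] {10, 100, 10_000}) {
      double sum = 0;
      for (int i = 0; i < n; i++) {
        sum += nextInterval(r, mean);
      }
      double errPct = Math.abs(sum / n - mean) / mean * 100.0;
      System.out.printf("%d samples: %.2f%% error%n", n, errPct);
    }
  }
}
```

With only a few dozen samples the observed average interval can be several percent off; with thousands it settles near the mean, which matches the 62-sample vs 57721-sample numbers above.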

                Thanks,
                Jc

                On Thu, Jun 29, 2017 at 10:01 PM, JC Beyler <[hidden email]> wrote:

                    Thanks Robbin,

                    This seems to have worked. When I have the next webrev ready, we will find out but I'm fairly confident it will work!

                    Thanks again!
                    Jc

                    On Wed, Jun 28, 2017 at 11:46 PM, Robbin Ehn <[hidden email]> wrote:

                        Hi JC,

                        On 06/29/2017 12:15 AM, JC Beyler wrote:

                            B) Incremental changes


                        I guess the most common workflow here is using mq:
                        hg qnew fix_v1
                        edit files
                        hg qrefresh
                        hg qnew fix_v2
                        edit files
                        hg qrefresh

                        if you do hg log you will see 2 commits

                        webrev.ksh -r -2 -o my_inc_v1_v2
                        webrev.ksh -o my_full_v2


                        In  your .hgrc you might need:
                        [extensions]
                        mq =

                        /Robbin


                            Again another newbie question here...

                            For showing the incremental changes, is there a link that explains how to do that? I apologize for my newbie questions all the time :)

                            Right now, I do:

                               ksh ../webrev.ksh -m -N

                            That generates a webrev.zip, which I send to Chuck Rasbold. He then uploads it to a new webrev.

                            I tried committing my change and adding a small change. Then if I just do ksh ../webrev.ksh without any options, it seems to produce a similar
                            page but now with only the changes I had (so the 06-07 comparison you were talking about) and a changeset that has it all. I imagine that is
                            what you meant.

                            Which means that my workflow would become:

                            1) Make changes
                            2) Make a webrev without any options to show just the differences with the tip
                            3) Amend my changes to my local commit so that I have it done with
                            4) Go to 1

                            Does that seem correct to you?

                            Note that when I do this, I only see the full change of a file in the full change set (Side note here: now the page says change set and not
                            patch, which is maybe why Serguei was having issues?).

                            Thanks!
                            Jc



                            On Wed, Jun 28, 2017 at 1:12 AM, Robbin Ehn <[hidden email]> wrote:

                                 Hi,

                                 On 06/28/2017 12:04 AM, JC Beyler wrote:

                                     Dear Thomas et al,

                                     Here is the newest webrev:
                            http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/



                                 You have some more bits to fix in there, but generally this looks good, and it's really nice with more tests.
                                 I'll do a deep dive and re-test this when I get back from my long vacation, with whatever patch version you have then.

                                 Also I think it's time you provide incremental (v06->07 changes) as well as complete change-sets.

                                 Thanks, Robbin




                                     Thomas, I "think" I have answered all your remarks. The summary is:

                                      - The statistics system is up and provides insight into what the heap sampler is doing
                                           - I've noticed that, though the sampling rate is at the right mean, we are missing some samples; I have not yet tracked down why
                             (details below)

                                      - I've run a tiny benchmark that is the worst case: a very tight loop allocating a small array
                                           - In this case, I see no overhead when the system is off, so that is a good start :)
                                           - I currently see a high overhead in this case when sampling is on. This is not really surprising, but I'm going to see if
                             this is consistent with our
                                      internal implementation. The benchmark is really allocation-stressful, so I'm not too surprised, but I want to do the due diligence.

                                         - The statistics system is up and I have a new test
                             http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatTest.java.patch
                                           - I did a bit of a study about the random generator here, more details are below but basically it seems to work well

                                        - I added a capability but since this is the first time doing this, I was not sure I did it right
                                          - I did add a test though for it and the test seems to do what I expect (all methods are failing with the
                            JVMTI_ERROR_MUST_POSSESS_CAPABILITY error).
                                              -
                             http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorNoCapabilityTest.java.patch

                                        - I still need to figure out what to do about the multi-agent vs single-agent issue

                                        - As far as measurements, it seems I still need to look at:
                                           - Why do we do the 20 random calls first; are they necessary?
                                          - Look at the mean of the sampling rate that the random generator does and also what is actually sampled
                                          - What is the overhead in terms of memory/performance when on?

                                     I have inlined my answers, I think I got them all in the new webrev, let me know your thoughts.

                                     Thanks again!
                                     Jc


                                      On Fri, Jun 23, 2017 at 3:52 AM, Thomas Schatzl <[hidden email]> wrote:

                                          Hi,

                                          On Wed, 2017-06-21 at 13:45 -0700, JC Beyler wrote:
                                          > Hi all,
                                          >
                                          > First off: Thanks again to Robbin and Thomas for their reviews :)
                                          >
                                          > Next, I've uploaded a new webrev:
                                           > http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/

                                          >
                                          > Here is an update:
                                          >
                                          > - @Robbin, I forgot to say that yes I need to look at implementing
                                          > this for the other architectures and testing it before it is all
                                          > ready to go. Is it common to have it working on all possible
                                          > combinations or is there a subset that I should be doing first and we
                                          > can do the others later?
                                          > - I've tested slowdebug, built and ran the JTreg tests I wrote with
                                          > slowdebug and fixed a few more issues
                                          > - I've refactored a bit of the code following Thomas' comments
                                          >    - I think I've handled all the comments from Thomas (I put
                                          > comments inline below for the specifics)

                                          Thanks for handling all those.

                                          > - Following Thomas' comments on statistics, I want to add some
                                          > quality assurance tests and find that the easiest way would be to
                                          > have a few counters of what is happening in the sampler and expose
                                          > that to the user.
                                          >    - I'll be adding that in the next version if no one sees any
                                          > objections to that.
                                          >    - This will allow me to add a sanity test in JTreg about number of
                                          > samples and average of sampling rate
                                          >
                                          > @Thomas: I had a few questions that I inlined below but I will
                                          > summarize the "bigger ones" here:
                                          >    - You mentioned constants are not using the right conventions, I
                                          > looked around and didn't see any convention except normal naming then
                                          > for static constants. Is that right?

                                          I looked through https://wiki.openjdk.java.net/display/HotSpot/StyleGuide
                                          and the rule is to "follow an existing pattern and must have a
                                          distinct appearance from other names". Which does not help a lot I
                                          guess :/ The GC team started using upper camel case, e.g.
                                          SomeOtherConstant, but very likely this is probably not applied
                                          consistently throughout. So I am fine with not adding another style
                                          (like kMaxStackDepth, with the "k" prefix of unknown meaning).

                                          (Chances are you will find that style somewhere used anyway too,
                                          apologies if so :/)


                                     Thanks for that link, now I know where to look. I used the upper camel case in my code as well then :) I should have gotten them all.


                                           > PS: I've also inlined my answers to Thomas below:
                                           >
                                            > On Tue, Jun 13, 2017 at 8:03 AM, Thomas Schatzl <thomas.schatzl@oracle.com> wrote:
                                           > > Hi all,
                                           > >
                                           > > On Mon, 2017-06-12 at 11:11 -0700, JC Beyler wrote:
                                           > > > Dear all,
                                           > > >
                                           > > > I've continued working on this and have done the following
                                           > > webrev:
                                            > > > http://cr.openjdk.java.net/~rasbold/8171119/webrev.05/

                                           > >
                                           > > [...]
                                           > > > Things I still need to do:
                                           > > >    - Have to fix that TLAB case for the FastTLABRefill
                                           > > >    - Have to start looking at the data to see that it is
                                           > > consistent and does gather the right samples, right frequency, etc.
                                           > > >    - Have to check the GC elements and what that produces
                                           > > >    - Run a slowdebug run and ensure I fixed all those issues you
                                           > > saw > Robbin
                                           > > >
                                           > > > Thanks for looking at the webrev and have a great week!
                                           > >
                                           > >   scratching a bit on the surface of this change, so apologies for
                                           > > rather shallow comments:
                                           > >
                                           > > - macroAssembler_x86.cpp:5604: while this is compiler code, and I
                                           > > am not sure this is final, please avoid littering the code with
                                           > > TODO remarks :) They tend to be candidates for later wtf moments
                                           > > only.
                                           > >
                                           > > Just file a CR for that.
                                           > >
                                           > Newcomer question: what is a CR and not sure I have the rights to do
                                           > that yet ? :)

                                          Apologies. CR is a change request, this suggests to file a bug in the
                                          bug tracker. And you are right, you can't just create a new account in
                                          the OpenJDK JIRA yourselves. :(


                                     Ok good to know, I'll continue with my own todo list but I'll work hard on not letting it slip in the webrevs anymore :)


                                          I was mostly referring to the "... but it is a TODO" part of that
                                          comment in macroassembler_x86.cpp. Comments about the why of the code
                                          are appreciated.

                                          [Note that I now understand that this is to some degree still work in
                                           progress. As long as the final changeset does not contain TODOs I am
                                          fine (and it's not a hard objection, rather their use in "final" code
                                          is typically limited in my experience)]

                                          5603   // Currently, if this happens, just set back the actual end to
                                          where it was.
                                          5604   // We miss a chance to sample here.

                                          Would be okay, if explaining "this" and the "why" of missing a chance
                                          to sample here would be best.

                                          Like maybe:

                                          // If we needed to refill TLABs, just set the actual end point to
                                          // the end of the TLAB again. We do not sample here although we could.

                                     Done with your comment, it works well in my mind.

                                          I am not sure whether "miss a chance to sample" meant "we could, but
                                          consciously don't because it's not that useful" or "it would be
                                          necessary but don't because it's too complicated to do.".

                                          Looking at the original comment once more, I am also not sure if that
                                          comment shouldn't referring to the "end" variable (not actual_end)
                                          because that's the variable that is responsible for taking the sampling
                                          path? (Going from the member description of ThreadLocalAllocBuffer).


                                     I've moved this code and it no longer shows up here but the rationale and answer was:

                                     So.. Yes, end is the variable provoking the sampling. Actual end is the actual end of the TLAB.

                                      What was happening here is that the code is resetting _end to point to the end of the new TLAB. Because we now have _end for
                             sampling and _actual_end for
                                      the actual end, we need to update _actual_end as well.

                                     Normally, were we to do the real work here, we would calculate the (end - start) offset, then do:

                                      - Set the new end to: start + (old_end - old_start)
                                      - Set the actual end like we do here now, because it is the actual end.
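The _end/_actual_end bookkeeping being discussed can be modeled with a toy sketch (field and method names are illustrative; the real ThreadLocalAllocBuffer code differs):

```java
// Toy model of the TLAB "end trickery": 'end' is lowered to the next
// sample point so the inlined fast-path bounds check fails there and the
// allocation takes the slow path, while 'actualEnd' remembers where the
// TLAB really ends.
public class TlabModel {
  long start, end, actualEnd;

  TlabModel(long start, long sizeBytes) {
    this.start = start;
    this.actualEnd = start + sizeBytes;
    this.end = actualEnd;
  }

  // Lower 'end' so an allocation crossing the sample point leaves the fast path.
  void setSamplePoint(long bytesUntilSample) {
    end = Math.min(start + bytesUntilSample, actualEnd);
  }

  // Slow path after the fake "full" TLAB: take the sample, then restore 'end'.
  void restoreAfterSample() {
    end = actualEnd;
  }
}
```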

                                     Why is this not done here now anymore?
                                         - I was still debating which path to take:
                                            - Do it in the fast refill code, it has its perks:
                                                - In a world where fast refills are happening all the time or a lot, we can augment there the code to do the sampling
                                            - Remember what we had as an end before leaving the slowpath and check on return
                                                - This is what I'm doing now; it removes the need to fix up all the fast refill paths, but if you remain on fast refill paths,
                             you won't get sampling. I
                                      have to think through the consequences of that; maybe a future change later on?
                                                   - I have the statistics now so I'm going to study that
                                                       -> By the way, though my statistics show I'm missing some samples, the loss is the same if I turn off
                             FastTLABRefill, so for now it seems
                                      the fast path is not the cause in my simple test.



                                          But maybe I am only confused and it's best to just leave the comment
                                          away. :)

                                          Thinking about it some more, doesn't this not-sampling in this case
                                          mean that sampling does not work in any collector that does inline TLAB
                                          allocation at the moment? (Or is inline TLAB alloc automatically
                                          disabled with sampling somehow?)

                                          That would indeed be a bigger TODO then :)


                                     Agreed, this remark made me think that perhaps as a first step the new way of doing it is better but I did have to:
                                         - Remove the const from the ThreadLocalAllocBuffer remaining and hard_end methods
                                        - Move hard_end out of the header file to have a bit more logic there

                                     Please let me know what you think of that and if you prefer it this way or changing the fast refills. (I prefer this way now because it
                            is more incremental).


                                          > > - calling HeapMonitoring::do_weak_oops() (which should probably be
                                          > > called weak_oops_do() like other similar methods) only if string
                                          > > deduplication is enabled (in g1CollectedHeap.cpp:4511) seems wrong.
                                          >
                                          > The call should be at least around 6 lines up outside the if.
                                          >
                                          > Preferentially in a method like process_weak_jni_handles(), including
                                          > additional logging. (No new (G1) gc phase without minimal logging
                                          > :)).
                                          > Done but really not sure because:
                                          >
                                          > I put for logging:
                                          >   log_develop_trace(gc, freelist)("G1ConcRegionFreeing [other] : heap
                                          > monitoring");

                                          I would think that "gc, ref" would be more appropriate log tags for
                                          this, similar to jni handles.
                                          (I am also not sure what weak reference handling has to do with
                                          G1ConcRegionFreeing, so I am a bit puzzled.)


                                      I was not sure what to put for the tags, or really for the message. I cleaned it up a bit now to:
                                          log_develop_trace(gc, ref)("HeapSampling [other] : heap monitoring processing");



                                          > Since weak_jni_handles didn't have logging for me to be inspired
                                          > from, I did that but unconvinced this is what should be done.

                                          The JNI handle processing does have logging, but only in
                                          ReferenceProcessor::process_discovered_references(). In
                                          process_weak_jni_handles() only overall time is measured (in a
                                          G1-specific way, since only G1 supports disabling reference
                                          processing) :/

                                          The code in ReferenceProcessor prints both the time taken
                                          (referenceProcessor.cpp:254) and the count, but strangely only in
                                          debug VMs.

                                          I have no idea why this logging is considered so unimportant that it
                                          is only printed in a debug VM. However, there are reviews out for
                                          changing this area a bit, so it might be useful to wait for that
                                          (JDK-8173335).


                                     I cleaned it up a bit anyway and now it returns the count of objects that are in the system.


                                          > > - the change doubles the size of
                                          > > CollectedHeap::allocate_from_tlab_slow() above the "small and nice"
                                          > > threshold. Maybe it could be refactored a bit.
                                          > Done I think, it looks better to me :).

                                          In ThreadLocalAllocBuffer::handle_sample() I think the
                                          set_back_actual_end()/pick_next_sample() calls could be hoisted out of
                                          the "if" :)


                                     Done!


                                          > > - referenceProcessor.cpp:261: the change should add logging about
                                          > > the number of references encountered, maybe after the corresponding
                                          > > "JNI weak reference count" log message.
                                          > Just to double check, are you saying that you'd like to have the heap
                                          > sampler to keep in store how many sampled objects were encountered in
                                          > the HeapMonitoring::weak_oops_do?
                                          >    - Would a return of the method with the number of handled
                                          > references and logging that work?

                                          Yes, it's fine if HeapMonitoring::weak_oops_do() just returns the
                                          number of processed weak oops.


                                     Done also (but I admit I have not tested the output yet) :)
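As a rough illustration of the shape of such a walk (illustrative types and signature only, not the real HeapMonitoring::weak_oops_do, which works on oops and closures): dead entries are cleared, and the number of processed entries is returned so the caller can log it.

```cpp
#include <cassert>
#include <cstddef>

// Sketch of a weak-oops walk that reports how many entries it processed.
// Entry and the is_alive callback are illustrative stand-ins for the
// real oop storage and BoolObjectClosure machinery.
struct Entry { void* referent; };

inline size_t weak_oops_do(Entry* entries, size_t len,
                           bool (*is_alive)(void*)) {
  size_t processed = 0;
  for (size_t i = 0; i < len; i++) {
    if (entries[i].referent == nullptr) continue;  // already cleared
    if (!is_alive(entries[i].referent)) {
      entries[i].referent = nullptr;  // clear the dead sample
    }
    processed++;
  }
  return processed;  // caller logs this count
}
```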


                                          >    - Additionally, would you prefer it in a separate block with its
                                          > GCTraceTime?

                                          Yes. Both kinds of information are interesting: while the time taken
                                          is typically more important, the next question would be why, and the
                                          number of references typically goes a long way there.

                                          See above though, it is probably best to wait a bit.


                                      Agreed that I "could" wait but, if it's ok, I'll just refactor/remove this when we get closer to something final. Either JDK-8173335
                                      has already gone in and I will notice it now, or it will go in soon and I can change it then.


                                          > > - threadLocalAllocBuffer.cpp:331: one more "TODO"
                                          > Removed it and added it to my personal todos to look at.
                                           >
                                          > > - threadLocalAllocBuffer.hpp: ThreadLocalAllocBuffer class
                                          > > documentation should be updated about the sampling additions. I
                                          > > would have no clue what the difference between "actual_end" and
                                          > > "end" would be from the given information.
                                          > If you are talking about the comments in this file, I made them more
                                          > clear I hope in the new webrev. If it was somewhere else, let me know
                                          > where to change.



Re: Low-Overhead Heap Profiling

Thomas Schatzl
Hi Jc,

  sorry for taking a bit long to respond.... ;)

On Mon, 2017-10-23 at 08:27 -0700, JC Beyler wrote:

> Dear all,
>
> Small update this week with this new webrev:
>   - http://cr.openjdk.java.net/~rasbold/8171119/webrev.13/
>   - Incremental is here: http://cr.openjdk.java.net/~rasbold/8171119/
> webrev.12_13/
>
> I patched the code changes shown by Robbin last week and I
> refactored collectedHeap.cpp:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.12_13/src/hotspot/
> share/gc/shared/collectedHeap.cpp.patch
>
> The original code became a bit too complex in my opinion with the
> handle_heap_sampling handling too many things. So I subdivided the
> logic into two smaller methods and moved out a bit of the logic to
> make it more clear. Hopefully it is :)
>
> Let me know if you have any questions/comments :)
> Jc

A few minor issues:

  - weak reference handling has been factored out in JDK-8189359, now
you only need to add the additions required for this change to one
place. :) Please update the webrev :)

  - the one issue Robbin noticed.

  - in the declaration of CollectedHeap::sample_allocation, it would be
nice if the fix_sample_rate parameter were described - it takes some
time to figure out what it's used for. I.e. in case an allocation goes
beyond the sampling watermark, this value, which represents the amount
of overallocation, is used to adjust the next sampling watermark so we
sample at the correct rate.
Something like this - and if what I wrote is incorrect, there is even
more reason to document it.
Or maybe just rename "fix_sample_rate" to something more descriptive
- but I have no good idea about that.
Since the type carries no unit, it would also be nice to have the unit
in the identifier name, as done elsewhere.
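If the description above is right, the adjustment boils down to subtracting the overshoot from the next interval (the helper name below is hypothetical; the actual declaration of CollectedHeap::sample_allocation may differ):

```cpp
#include <cassert>
#include <cstddef>

// Illustrative sketch of what a fix_sample_rate-style parameter appears
// to compensate for: if an allocation crosses the watermark by
// overshoot_bytes, the next watermark is pulled in by that amount so
// samples still average one per interval_bytes.
inline size_t next_watermark_bytes(size_t interval_bytes,
                                   size_t overshoot_bytes) {
  // Clamp so a huge allocation can't push the next watermark negative.
  return overshoot_bytes >= interval_bytes
             ? 0
             : interval_bytes - overshoot_bytes;
}
```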

  - some (or most actually) of the new setters and getters in the
ThreadLocalAllocBuffer class could be private I think. Also, we
typically do not use "simple" getters that just return a member in the
class where they are defined.

  - ThreadLocalAllocBuffer::set_sample_end(): please use
pointer_delta() for pointer subtractions.
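For context, a pointer_delta-style helper does the subtraction as an unsigned distance with an explicit ordering check (a simplified sketch; the real one lives in HotSpot's utilities/globalDefinitions.hpp and may differ in detail):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Simplified sketch in the spirit of pointer_delta(): the subtraction is
// done on unsigned integers with an explicit ordering check, so an
// accidental right > left is caught instead of silently producing a
// huge bogus distance from a negative ptrdiff_t.
inline size_t pointer_delta(const void* left, const void* right,
                            size_t element_size) {
  uintptr_t l = reinterpret_cast<uintptr_t>(left);
  uintptr_t r = reinterpret_cast<uintptr_t>(right);
  assert(l >= r && "pointer_delta: left is below right");
  return static_cast<size_t>(l - r) / element_size;
}
```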

  - ThreadLocalAllocBuffer::pick_next_sample() - I recommend making the
first check an assert - it seems that it is only useful to call this
with heap monitoring enabled, as is done right now.

  - ThreadLocalAllocBuffer::pick_next_sample() - please use
"PTR_FORMAT" (or INTPTR_FORMAT - they are the same) as format string
for printing pointer values as is customary within Hotspot. %p output
is OS dependent. I.e. I heard that e.g. on Ubuntu it prints "null"
instead of 0x0...0 .... which is kind of annoying.
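The portability point can be illustrated with the standard <cinttypes> macros (PTR_FORMAT itself is a HotSpot-internal macro; format_ptr below is a made-up helper): a fixed-width hex format gives the same output everywhere, while %p output is implementation-defined and may, for example, omit the 0x prefix or print "(nil)" for null.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdio>
#include <cstring>

// Illustrative helper: format a pointer as fixed-width hex via the
// portable PRIxPTR macro instead of the implementation-defined %p.
inline int format_ptr(char* buf, size_t len, const void* p) {
  return snprintf(buf, len, "0x%016" PRIxPTR,
                  reinterpret_cast<uintptr_t>(p));
}
```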

  - personal preference: do not allocate
HeapMonitoring::AlwaysTrueClosure globally, but only locally when it's
used. Setting it up seems to be very cheap.

  - HeapMonitoring::next_random() - the different names for the
constants use different formatting. Preferable (to me) is
UpperCamelCase, but at least make them uniform.

  - in HeapMonitoring::next_random(), you might want to use
right_n_bits() to create your mask.
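For illustration, a right_n_bits-style helper boils down to a low-bits mask (this standalone version is a sketch, not HotSpot's actual macro), which is what next_random() needs instead of a hand-written hex constant:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of a right_n_bits-style helper: a mask with the low n bits set.
inline uint64_t right_n_bits_64(int n) {
  return (n >= 64) ? ~UINT64_C(0) : ((UINT64_C(1) << n) - 1);
}
```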

  - I am not really convinced that it is a good idea to leave
StartHeapSampling() and StopHeapSampling() unguarded against being
called by multiple threads.

Otherwise looks okay from what I can see.

Thanks,
  Thomas


Re: Low-Overhead Heap Profiling

JC Beyler
In reply to this post by JC Beyler
Clearly a last minute clean-up gone awry... Fixed for next webrev :)

On Wed, Oct 25, 2017 at 12:30 AM, Robbin Ehn <[hidden email]> wrote:
Hi,

325     HeapWord *tlab_old_end = thread->tlab().return end();

Should be something like:

325     HeapWord *tlab_old_end = thread->tlab().end();

Thanks, Robbin

On 2017-10-23 17:27, JC Beyler wrote:
Dear all,

Small update this week with this new webrev:
   - http://cr.openjdk.java.net/~rasbold/8171119/webrev.13/
   - Incremental is here: http://cr.openjdk.java.net/~rasbold/8171119/webrev.12_13/

I patched the code changes shown by Robbin last week and I refactored collectedHeap.cpp:
http://cr.openjdk.java.net/~rasbold/8171119/webrev.12_13/src/hotspot/share/gc/shared/collectedHeap.cpp.patch

The original code became a bit too complex in my opinion with the handle_heap_sampling handling too many things. So I subdivided the logic into two smaller methods and moved out a bit of the logic to make it more clear. Hopefully it is :)

Let me know if you have any questions/comments :)
Jc

On Mon, Oct 16, 2017 at 9:34 AM, JC Beyler <[hidden email]> wrote:

    Hi Robbin,

    That is because version 11 to 12 was only a test change. I was going to
    write about it and say here are the webrev links:
    Incremental:
    http://cr.openjdk.java.net/~rasbold/8171119/webrev.11_12/

    Full webrev:
    http://cr.openjdk.java.net/~rasbold/8171119/webrev.12/

    This change focused only on refactoring the tests to be more manageable,
    readable, and maintainable. As all tests are looking at allocations, I moved
    common code to a java class:
    common code to a java class:
    http://cr.openjdk.java.net/~rasbold/8171119/webrev.11_12/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitor.java.patch

    And then most tests call into that class to turn on/off the sampling,
    allocate, etc. This has removed almost 500 lines of test code so I'm happy
    about that.

    Thanks for your changes - a few relics of previous versions :). I've
    already integrated them into my code and will make a new webrev at the end
    of this week with a bit of refactoring of the code handling the tlab slow
    path. I find it could be easier to follow, so I'm going to take a stab at
    it this week.

    Any other issues/comments?

    Thanks!
    Jc


    On Mon, Oct 16, 2017 at 8:46 AM, Robbin Ehn <[hidden email]> wrote:

        Hi JC,

        I saw a webrev.12 in the directory, with only test changes (11->12), so
        I took that version.
        I had a look and tested the tests, worked fine!

        First glance at the code (looking at full v12) some minor things below,
        mostly unused stuff.

        Thanks, Robbin

        diff -r 9047e0d726d6 src/hotspot/share/runtime/heapMonitoring.cpp
        --- a/src/hotspot/share/runtime/heapMonitoring.cpp      Mon Oct 16
        16:54:06 2017 +0200
        +++ b/src/hotspot/share/runtime/heapMonitoring.cpp      Mon Oct 16
        17:42:42 2017 +0200
        @@ -211,2 +211,3 @@
            void initialize(int max_storage) {
        +    // validate max_storage to sane value ? What would 0 mean ?
              MutexLocker mu(HeapMonitor_lock);
        @@ -227,8 +228,4 @@
            bool initialized() { return _initialized; }
        -  volatile bool *initialized_address() { return &_initialized; }

           private:
        -  // Protects the traces currently sampled (below).
        -  volatile intptr_t _stack_storage_lock[1];
        -
            // The traces currently sampled.
        @@ -313,3 +310,2 @@
            _initialized(false) {
        -    _stack_storage_lock[0] = 0;
          }
        @@ -532,13 +528,2 @@

        -// Delegate the initialization question to the underlying storage system.
        -bool HeapMonitoring::initialized() {
        -  return StackTraceStorage::storage()->initialized();
        -}
        -
        -// Delegate the initialization question to the underlying storage system.
        -bool *HeapMonitoring::initialized_address() {
        -  return
        -             const_cast<bool*>(StackTraceStorage::storage()->initialized_address());
        -}
        -
          void HeapMonitoring::get_live_traces(jvmtiStackTraces *traces) {
        diff -r 9047e0d726d6 src/hotspot/share/runtime/heapMonitoring.hpp
        --- a/src/hotspot/share/runtime/heapMonitoring.hpp      Mon Oct 16
        16:54:06 2017 +0200
        +++ b/src/hotspot/share/runtime/heapMonitoring.hpp      Mon Oct 16
        17:42:42 2017 +0200
        @@ -35,3 +35,2 @@
            static uint64_t _rnd;
        -  static bool _initialized;
            static jint _monitoring_rate;
        @@ -92,7 +91,2 @@

        -  // Is the profiler initialized and where is the address to the
        initialized
        -  // boolean.
        -  static bool initialized();
        -  static bool *initialized_address();
        -
            // Called when o is to be sampled from a given thread and a given size.



        On 10/10/2017 12:57 AM, JC Beyler wrote:

            Dear all,

            Thread-safety is back!! Here is the update webrev:
            http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/

            Full webrev is here:
            http://cr.openjdk.java.net/~rasbold/8171119/webrev.11/

            In order to really test this, I needed to add thread safety, so I
            thought now was a good time. It required a few changes here for the
            creation to ensure correctness and safety. Now we keep the static
            pointer but clear the data internally, so re-initialization will be
            a bit more costly than before. I don't think this is a huge
            use-case, so I did not think it was a problem. I used the internal
            MutexLocker; I think I used it well, let me know.

            I also added three tests:

            1) Stack depth test:
            http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStackDepthTest.java.patch

            This test shows that the maximum stack depth system is working.

            2) Thread safety:
            http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorThreadTest.java.patch

            The test creates 24 threads and they all allocate at the same time.
            The test then checks it does find samples from all the threads.

            3) Thread on/off safety
            http://cr.openjdk.java.net/~rasbold/8171119/webrev.10_11/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorThreadOnOffTest.java.patch

            The test creates 24 threads that all allocate a bunch of memory.
            Then another thread turns the sampling on/off.

            Btw, both tests 2 & 3 failed without the locks.

            As I worked on this, I saw a lot of places where the tests are doing
            very similar things, I'm going to clean up the code a bit and make a
            HeapAllocator class that all tests can call directly. This will
            greatly simplify the code.

            Thanks for any comments/criticisms!
            Jc


            On Mon, Oct 2, 2017 at 8:52 PM, JC Beyler <[hidden email]> wrote:

                 Dear all,

                 Small update to the webrev:
            http://cr.openjdk.java.net/~rasbold/8171119/webrev.09_10/

                 Full webrev is here:
            http://cr.openjdk.java.net/~rasbold/8171119/webrev.10/

                 I updated a bit of the naming, removed a TODO comment, and I
            added a test for the sampling rate. I also updated the maximum stack
            depth to 1024; there is no reason to keep it so small. I did a micro
            benchmark that tests the overhead and it seems relatively the same.

                 I compared allocations from a stack depth of 10 and allocations
            from a stack depth of 1024 (allocations are from the same helper
            method in
            http://cr.openjdk.java.net/~rasbold/8171119/webrev.10/raw_files/new/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatRateTest.java):
                            - For an array of 1 integer allocated in a loop;
            stack depth 1024 vs stack depth 10: 1% slower
                            - For an array of 200k integers allocated in a loop;
            stack depth 1024 vs stack depth 10: 3% slower

                 So basically the maximum stack depth is now 1024, but we only
            copy over the stack depth actually used.

                 For the next webrev, I will be adding a stack depth test to
            show that it works and probably put back the mutex locking so that
            we can see how difficult it is to keep
                 thread safe.

                 Let me know what you think!
                 Jc



                 On Mon, Sep 25, 2017 at 3:02 PM, JC Beyler <[hidden email]> wrote:

                     Forgot to say that for my numbers:
                       - Not in the test are the actual numbers I got for the
            various array sizes, I ran the program 30 times and parsed the
            output; here are the averages and standard
                     deviation:
                            1000:     1.28% average; 1.13% standard deviation
                            10000:    1.59% average; 1.25% standard deviation
                            100000:   1.26% average; 1.26% standard deviation

                     The 1000/10000/100000 are the sizes of the arrays being
            allocated. These are allocated 100k times and the sampling rate is
            111 times the size of the array.

                     Thanks!
                     Jc


                     On Mon, Sep 25, 2017 at 3:01 PM, JC Beyler <[hidden email]> wrote:

                         Hi all,

                         After a bit of a break, I am back working on this :).
            As before, here are two webrevs:

                         - Full change set:
            http://cr.openjdk.java.net/~rasbold/8171119/webrev.09/
                         - Compared to version 8:
            http://cr.openjdk.java.net/~rasbold/8171119/webrev.08_09/
                              (This version is compared to version 8 I last
            showed but ported to the new folder hierarchy)

                         In this version I have:
                            - Handled Thomas' comments from his email of 07/03:
                                 - Merged the logging to be standard
                                 - Fixed up the code a bit where asked
                                 - Added some notes about the code not being
            thread-safe yet
                             - Removed additional dead code from the version
            that modifies interpreter/c1/c2
                             - Fixed compiler issues so that it compiles with
            --disable-precompiled-header
                                  - Tested with ./configure
            --with-boot-jdk=<jdk8> --with-debug-level=slowdebug
            --disable-precompiled-headers

                         Additionally, I added a test to check the sanity of the
            sampler: HeapMonitorStatCorrectnessTest
                           (http://cr.openjdk.java.net/~rasbold/8171119/webrev.08_09/test/hotspot/jtreg/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatCorrectnessTest.java.patch)
                             - This allocates a number of arrays and checks that
            we obtain the number of samples we want with an accepted error of
            5%. I tested it 100 times and it
                         passed every time; I can test more if wanted
                             - Not in the test are the actual numbers I got for
            the various array sizes, I ran the program 30 times and parsed the
            output; here are the averages and
                         standard deviation:
                                1000:     1.28% average; 1.13% standard deviation
                                10000:    1.59% average; 1.25% standard deviation
                                100000:   1.26% average; 1.26% standard deviation

                         What this means is that we were always within about
            1~2% of the number of samples the test expected.

                         Let me know what you think,
                         Jc

                         On Wed, Jul 5, 2017 at 9:31 PM, JC Beyler <[hidden email]> wrote:

                             Hi all,

                             I apologize, I have not yet handled your remarks
            but thought this new webrev would also be useful to see and comment
            on perhaps.

                             Here is the latest webrev, it is generated slightly
            different than the others since now I'm using webrev.ksh without the
            -N option:
            http://cr.openjdk.java.net/~rasbold/8171119/webrev.08/

                             And the webrev.07 to webrev.08 diff is here:
            http://cr.openjdk.java.net/~rasbold/8171119/webrev.07_08/

                             (Let me know if it works well)

                             It's a small change between versions but it:
                                - provides a fix that makes the average sample
            rate correct (more on that below).
                                - fixes the code to actually have it play nicely
            with the fast tlab refill
                                - cleaned up a bit the JVMTI text and now use
            jvmtiFrameInfo
                             - moved the capability to be onload solo

                             With this webrev, I've done a small study of the
            random number generator we use here for the sampling rate. I took a
            small program and it can be simplified to:

                             for (outer loop)
                             for (inner loop)
                             int[] tmp = new int[arraySize];

                             - I've fixed the outer and inner loops to 800
            for this experiment, meaning we allocate 640000 times an array of a
            given array size.

                             - Each program provides the average sample size
            used for the whole execution

                             - Then, I ran each variation 30 times and then
            calculated the average of the average sample size used for various
            array sizes. I selected the array size to
                             be one of the following: 1, 10, 100, 1000.

                             - When compared to 512kb, the average sample size
            of 30 runs:
                             1: 4.62% of error
                             10: 3.09% of error
                             100: 0.36% of error
                             1000: 0.1% of error
                             10000: 0.03% of error

                             What it shows is that, depending on the number of
            samples, the average does become better. This is because with an
            allocation of 1 element per array, it
                             will take longer to hit one of the thresholds. This
            is seen by looking at the sample count statistic I put in. For the
            same number of iterations (800 *
                             800), the different array sizes provoke:
                             1: 62 samples
                             10: 125 samples
                             100: 788 samples
                             1000: 6166 samples
                             10000: 57721 samples

                             And of course, the more samples you have, the more
            sampling intervals you draw, which means that your average gets
            closer with that math.
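For reference, the standard way such a sampler keeps the long-run rate correct is to draw each interval from an exponential distribution whose mean is the configured sampling interval (this is a sketch of the general technique, not the exact HotSpot code, which uses its own next_random() machinery):

```cpp
#include <cassert>
#include <random>

// Sketch of the standard Poisson-style sampling draw: the number of
// bytes until the next sample is exponentially distributed with mean
// mean_interval_bytes, so on average one sample is taken per interval
// regardless of how allocations are sized.
inline double bytes_until_next_sample(double mean_interval_bytes,
                                      std::mt19937_64& rng) {
  std::exponential_distribution<double> d(1.0 / mean_interval_bytes);
  return d(rng);
}
```

With few samples (tiny arrays, so the watermark is hit rarely) the observed average interval is noisy, and with many samples it converges to the mean, which matches the error numbers above.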

                             Thanks,
                             Jc

                             On Thu, Jun 29, 2017 at 10:01 PM, JC Beyler <[hidden email]> wrote:

                                 Thanks Robbin,

                                 This seems to have worked. When I have the next
            webrev ready, we will find out but I'm fairly confident it will work!

                                 Thanks again!
                                 Jc

                                 On Wed, Jun 28, 2017 at 11:46 PM, Robbin Ehn <[hidden email]> wrote:

                                     Hi JC,

                                     On 06/29/2017 12:15 AM, JC Beyler wrote:

                                         B) Incremental changes


                                      I guess the most common workflow here is
            using mq:
                                     hg qnew fix_v1
                                     edit files
                                     hg qrefresh
                                     hg qnew fix_v2
                                     edit files
                                     hg qrefresh

                                     if you do hg log you will see 2 commits

                                     webrev.ksh -r -2 -o my_inc_v1_v2
                                     webrev.ksh -o my_full_v2


                                     In  your .hgrc you might need:
                                     [extensions]
                                     mq =

                                     /Robbin


                                         Again another newbie question here...

                                         For showing the incremental changes, is
            there a link that explains how to do that? I apologize for my newbie
            questions all the time :)

                                         Right now, I do:

                                            ksh ../webrev.ksh -m -N

                                         That generates a webrev.zip and send it
            to Chuck Rasbold. He then uploads it to a new webrev.

                                         I tried committing my change and adding
            a small change. Then if I just do ksh ../webrev.ksh without any
            options, it seems to produce a similar
                                         page but now with only the changes I
            had (so the 06-07 comparison you were talking about) and a changeset
            that has it all. I imagine that is
                                         what you meant.

                                         Which means that my workflow would become:

                                         1) Make changes
                                         2) Make a webrev without any options to
            show just the differences with the tip
                                         3) Amend my changes to my local commit
            so that I have it done with
                                         4) Go to 1

                                         Does that seem correct to you?

                                         Note that when I do this, I only see
            the full change of a file in the full change set (Side note here:
            now the page says change set and not
                                         patch, which is maybe why Serguei was
            having issues?).

                                         Thanks!
                                         Jc



                                         On Wed, Jun 28, 2017 at 1:12 AM, Robbin Ehn <[hidden email]> wrote:

                                              Hi,

                                              On 06/28/2017 12:04 AM, JC Beyler
            wrote:

                                                  Dear Thomas et al,

                                                  Here is the newest webrev:
             http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/



                                              You have some more bits to do in
             there but generally this looks good, and really nice with more tests.
                                              I'll do a deep dive and re-test
             this when I get back from my long vacation, with whatever patch
             version you have then.

                                              Also I think it's time you provide
            incremental (v06->07 changes) as well as complete change-sets.

                                              Thanks, Robbin




                                                  Thomas, I "think" I have
            answered all your remarks. The summary is:

                                                  - The statistics system is up
             and provides insight into what the heap sampler is doing
                                                       - I've noticed that,
             though the sampling rate is at the right mean, we are missing some
             samples; I have not yet tracked down why (details below)

                                                  - I've run a tiny benchmark
             that is the worst case: it is a very tight loop that allocates a
             small array
                                                       - In this case, I see no
             overhead when the system is off, so that is a good start :)
                                                       - I see a high overhead
             in this case when sampling is on. This is not really too
             surprising, but I'm going to see if this is consistent with our
             internal implementation. The benchmark is really allocation
             stressful, so I'm not too surprised, but I want to do the due
             diligence.

                                                     - The statistics system is
             up and I have a new test:
             http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatTest.java.patch
                                                        - I did a bit of a study
            about the random generator here, more details are below but
            basically it seems to work well

                                                     - I added a capability but
            since this is the first time doing this, I was not sure I did it right
                                                       - I did add a test though
            for it and the test seems to do what I expect (all methods are
            failing with the
                                         JVMTI_ERROR_MUST_POSSESS_CAPABILITY error).
                                                           - http://cr.openjdk.java.net/~rasbold/8171119/webrev.07/test/serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorNoCapabilityTest.java.patch

                                                     - I still need to figure
            out what to do about the multi-agent vs single-agent issue

                                                     - As far as measurements,
            it seems I still need to look at:
                                                       - Why we do the 20 random
            calls first, are they necessary?
                                                       - Look at the mean of the
            sampling rate that the random generator does and also what is
            actually sampled
                                                       - What is the overhead in
            terms of memory/performance when on?

                                                  I have inlined my answers, I
            think I got them all in the new webrev, let me know your thoughts.

                                                  Thanks again!
                                                  Jc


                                                  On Fri, Jun 23, 2017 at 3:52 AM, Thomas Schatzl <[hidden email]> wrote:

                                                       Hi,

                                                       On Wed, 2017-06-21 at
            13:45 -0700, JC Beyler wrote:
                                                       > Hi all,
                                                       >
                                                       > First off: Thanks again
            to Robbin and Thomas for their reviews :)
                                                       >
                                                       > Next, I've uploaded a
            new webrev:
                                                       >
             http://cr.openjdk.java.net/~rasbold/8171119/webrev.06/

                                                       >
                                                       > Here is an update:
                                                       >
                                                       > - @Robbin, I forgot to
            say that yes I need to look at implementing
                                                       > this for the other
            architectures and testing it before it is all
                                                       > ready to go. Is it
            common to have it working on all possible
                                                       > combinations or is
            there a subset that I should be doing first and we


Re: Low-Overhead Heap Profiling

Thomas Schatzl
In reply to this post by Thomas Schatzl
Hi JC,

  sorry for the long wait.

On Wed, 2017-11-01 at 10:46 -0700, JC Beyler wrote:

> Dear all,
>
> Here is the next webrev:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.14a/
>
> Incremental since the rebase:
> http://cr.openjdk.java.net/~rasbold/8171119/webrev.14_14a/
>
> (I'm still not too familiar with hg, so I had to do a fresh rebase:
> v14 is the state once the rebase was done, and v14a integrates the
> changes requested by Thomas).

Yeah, there seems to be something missing in the incremental webrev, but
thanks for the effort.

> I have also inlined answers to Thomas' email:
> > A few minor issues:
>  
> >   - in the declaration of CollectedHeap::sample_allocation, it
> > would be nice if the fix_sample_rate parameter would be described -
> > it takes time to figure out what it is used for. I.e. in case an
> > allocation goes beyond the sampling watermark, this value which
> > represents the amount of overallocation is used to adjust the next
> > sampling watermark to sample at the correct rate.
> > Something like this - and if what I wrote is incorrect, there is
> > even more reason to document it.
> > Or maybe just renaming "fix_sample_rate" to something more
> > descriptive - but I have no good idea about that.
> > With lack of units in the type, it would also be nice to have the
> > unit in the identifier name, as done elsewhere.
>

Thanks. Could you s/passed/past in that documentation?

> Done for Robbin's issue and changed it to 
> >   - some (or most actually) of the new setters and getters in the
> > ThreadLocalAllocBuffer class could be private I think. Also, we
> > typically do not use "simple" getters that just return a member in
> > the class where they are defined.
>
> I removed all that were not used that I could see (not used outside
> the class) and moved the ones that are not simple to private where
> possible. I think it's a bit weird because now some of the setting of
> the TLAB internal data is using methods, while others set it directly.
> Let me know what you think.

That's fine with me. You need to start somewhere I guess.

> >   - ThreadLocalAllocBuffer::pick_next_sample() - I recommend making
> > the first check an assert - it seems that it is only useful to call
> > this with heap monitoring enabled, as is done right now.
>
> Longer conversation below about this; it cannot be an assert (I could
> remove the test altogether, though).

I looked at the description, and you are right. I missed that. Keep it
as is. :)

> > - HeapMonitoring::next_random() - the different names for the
> > constants use different formatting. Preferable (to me) is
> > UpperCamelCase, but at least make them uniform.
> >
>
> I think done the way you wanted!

In heapMonitoring.hpp:50-53 the constants still have different format?
 

> >  
> >   - not really convinced that it is a good idea to not somehow
> > guard
> > StartHeapSampling() and StopHeapSampling() against being called by
> > multiple threads.
> >
>
> I added another mutex for the start/stop so that way it will protect
> from that.
>

Thanks!

>  
> > Otherwise looks okay from what I can see.

Still okay. I do not need a re-review for the changes suggested in this
email.

>
> Awesome, what do you think I still need for this before going to the
> next step (which is what by the way? :)).

I think:

- look through the JEP if it is still current and fix the descriptions
if required
  - add a link to the latest webrev to the JEP as comment
- if not done already, file CSRs [0] for
  - the new flag
  - JVMTI changes (I think, not sure actually)
- find a sponsor from Oracle to do some internal work (pushing, and
before that there is iirc still some background work related to JEPs
that can only be done by Oracle, mostly shepherding :/).

I talked to Robbin about this, and he offered to help you with that.

Thanks,
  Thomas

[0] https://wiki.openjdk.java.net/display/csr/Main