Re: Low-Overhead Heap Profiling


Re: Low-Overhead Heap Profiling

Tony Printezis-3
Jeremy (and all),

I’m not on the serviceability list so I won’t include the messages so far. :-) Also CCing the hotspot GC list, in case they have some feedback on this.

Could I suggest a (much) simpler but at least as powerful and flexible way to do this? (This is something we’ve been meaning to do for a while now for TwitterJDK, the JDK we develop and deploy here at Twitter.) You can force allocations to go into the slow path periodically by artificially setting the TLAB top to a lower value. So, imagine a TLAB is 4M. You can set top to (bottom+1M). When an allocation thinks the TLAB is full (in this case, the first 1MB is full) it will call the allocation slow path. There, you can intercept it, sample the allocation (and, like in your case, you’ll also have the correct stack trace), notice that the TLAB is not actually full, extend its top to, say, (bottom+2M), and you’re done.
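To make this concrete, here’s a minimal toy sketch of the idea (illustrative names and types only, not actual HotSpot code; the real change would live in the TLAB / slow-path allocation code):

    // Toy model of the artificial-top idea; names are illustrative.
    #include <cstddef>
    #include <cstdio>

    struct TLAB {
      char*  cur;       // next free byte
      char*  top;       // artificially lowered limit seen by the fast path
      char*  real_end;  // actual end of the TLAB
      size_t step;      // sampling interval in bytes (e.g., 1M)
    };

    // hypothetical hook: in the VM, this is where the stack trace would be taken
    static void sample(size_t size) { std::printf("sampled %zu bytes\n", size); }

    static char* alloc_slow(TLAB& t, size_t size) {
      if (t.cur + size > t.real_end) return nullptr;  // genuinely full: caller refills the TLAB
      sample(size);                                   // we only hit the artificial top
      char* new_top = t.top + t.step;                 // extend top by another interval
      t.top = (new_top < t.real_end) ? new_top : t.real_end;
      char* obj = t.cur; t.cur += size; return obj;
    }

    static char* alloc(TLAB& t, size_t size) {        // fast path stays as it is today
      if (t.cur + size <= t.top) { char* obj = t.cur; t.cur += size; return obj; }
      return alloc_slow(t, size);
    }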

Advantages of this approach:

* This is a much smaller, simpler, and self-contained change (no compiler changes necessary to maintain...).

* When it’s off, the overhead is only one extra test at the slow path TLAB allocation (i.e., negligible; we do some sampling on TLABs in TwitterJDK using a similar mechanism and, when it’s off, I’ve observed no performance overhead).

* (most importantly) You can turn this on and off, and adjust the sampling rate, dynamically. If you do the sampling based on JITed code, you’ll have to recompile all methods with allocation sites to turn the sampling on or off. (You can of course have it always on and just discard the output; it’d be nice not to have to do that though. IMHO, at least.)

* You can also very cheaply turn this on and off (or adjust the sampling frequency) per thread, if that’d be helpful in some way (just add the appropriate info on the thread’s TLAB).

A few extra comments on the previous discussion:

* "JFR samples per new TLAB allocation. It provides really very good picture and I haven't seen overhead more than 2” : When TLABs get very large, I don’t think sampling one object per TLAB is enough to get a good sample (IMHO, at least). It’s probably OK for something like jbb which mostly allocates instances of a handful of classes and has very few allocation sites. But, a lot of the code we run at Twitter is a lot more elaborate than that and, in our experience, sampling one object per TLAB is not enough. You can, of course, decrease the TLAB size to increase the sampling size. But it’d be good not to have to do that given a smaller TLAB size could increase contention across threads.

* "Should it *just* take a stack trace, or should the behavior be configurable?” : I think we’d have to separate the allocation sampling mechanism from the consumption of the allocation samples. Once the sampling mechanism is in, different JVMs can take advantage of it in different ways. I assume that the Oracle folks would like at least a JFR event for every such sample. But in your build you can add extra code to collect the information in the way you have now.

* Talking of JFR, it’s a bit unfortunate that the AllocObjectInNewTLAB event has both the new TLAB information and the allocation information. It would have been nice if that event was split into two, say NewTLAB and AllocObjectInTLAB, and we’d be able to fire the latter for each sample.

* "Should the interval between samples be configurable?” : Totally. In fact, it’d be helpful if it was configurable dynamically. Imagine if a JVM starts misbehaving after 2-3 weeks of running. You can dynamically increase the sampling rate to get a better profile if the default is not giving fine-grain enough information.

* "As long of these features don’t contribute to sampling bias” : If the sampling interval is fixed, sampling bias would be a very real concern. In the above example, I’d increment top by 1M (the sampling frequency) + p% (a fudge factor). 

* "Yes, a perhaps optional callbacks would be nice too.” : Oh, no. :-) But, as I said, we should definitely separate the sampling mechanism from the mechanism that consumes the samples.

* "Another problem with our submitting things is that we can't really test on anything other than Linux.” : Another reason to go with a as platform independent solution as possible. :-)

Regards,

Tony

-----

Tony Printezis | JVM/GC Engineer / VM Team | Twitter

@TonyPrintezis


Re: Low-Overhead Heap Profiling

Jeremy Manson-4
I don't want the size of the TLAB, which is ergonomically adjusted, to be tied to the sampling rate.  There is no reason to do that.  I want reasonable statistical sampling of the allocations.  

All this requires is a separate counter that is set to the next sampling interval, and decremented when an allocation happens, which goes into a slow path when the decrement hits 0.  Doing a subtraction and a pointer bump in allocation instead of just a pointer bump is basically free.  Note that it has been doing an additional addition (to keep track of per thread allocation) as part of allocation since Java 7, and no one has complained.
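For the shape of it, a toy sketch (illustrative types and names, not the actual patch we carry):

    // Toy sketch of the counter-based scheme; not the real implementation.
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    struct ThreadAllocState {
      char*   cur;                    // TLAB allocation pointer
      char*   end;                    // TLAB end
      int64_t bytes_until_sample;     // countdown to the next sample
    };

    static const int64_t kSampleInterval = 512 * 1024;    // illustrative value

    static void take_sample(size_t size) {                // stack trace taken here
      std::printf("sampled a %zu-byte allocation\n", size);
    }

    static char* allocate(ThreadAllocState& t, size_t size) {
      if (t.cur + size <= t.end) {
        char* obj = t.cur;
        t.cur += size;                                    // the usual pointer bump
        t.bytes_until_sample -= (int64_t)size;            // the extra subtraction
        if (t.bytes_until_sample <= 0) {                  // slow path only when it trips
          take_sample(size);
          t.bytes_until_sample = kSampleInterval;
        }
        return obj;
      }
      return nullptr;                                     // TLAB full: refill elsewhere
    }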

I'm not worried about the ease of implementation here, because we've already implemented it.  It hasn't even been hard for us to do the forward port, except when the relevant Hotspot code is significantly refactored.

We can also turn the sampling off, if we want.  We can set the sampling rate to 2^32, have the sampling code do nothing, and no one will ever notice.  In fact, we could just have the sampling code do nothing, and no one would ever notice.

Honestly, no one ever notices the overhead of the sampling, anyway.  JDK8 made it more expensive to grab a stack trace (the cost became proportional to the number of loaded classes), but we have a patch that mitigates that, which we would also be happy to upstream.

As for the other concern: my concern about *just* having the callback mechanism is that there is quite a lot you can't do from user code during an allocation, because of lack of access to JNI.  However, you can do pretty much anything from the VM itself.  Crucially (for us), we don't just log the stack traces, we also keep track of which are live and which aren't.  We can't do this in a callback, if the callback can't create weak refs to the object.

What we do at Google is to have two methods: one that you pass a callback to (the callback gets invoked with a StackTraceData object, as I've defined above), and another that just tells you which sampled objects are still live.  We could also add a third, which allowed a callback to set the sampling interval (basically, the VM would call it to get the integer number of bytes to be allocated before the next sample).  
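To make the shape of that API concrete, purely illustrative signatures (not our actual internal interface; the real one would be whatever the JEP ends up specifying):

    // Illustrative native API sketch only.
    #include <cstddef>
    #include <cstdint>

    struct StackTraceData;            // opaque: frames, object size, thread id, etc.

    // Native callback invoked by the VM for each sampled allocation.
    // Note: no JNIEnv* is passed; Java code must not run during allocation.
    typedef void (*AllocationSampledCallback)(const StackTraceData* sample);

    // Optional callback the VM queries for the number of bytes to allocate
    // before the next sample.
    typedef int64_t (*NextSampleIntervalCallback)();

    // (1) register the per-sample callback
    void RegisterAllocationSampledCallback(AllocationSampledCallback cb);

    // (2) ask which previously sampled objects are still live; returns the
    //     number of entries written into 'out'
    size_t GetLiveSampledObjects(const StackTraceData** out, size_t out_capacity);

    // (3) optionally let user code drive the sampling interval
    void RegisterNextSampleIntervalCallback(NextSampleIntervalCallback cb);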

Would people be amenable to that?  It makes the code more complex, but, as I say, it's nice for detecting memory leaks ("Hey!  Where did that 1 GB object come from?").

Jeremy



Re: Low-Overhead Heap Profiling

Tony Printezis-3
Hi Jeremy,

Please see inline.

On June 23, 2015 at 7:22:13 PM, Jeremy Manson ([hidden email]) wrote:

I don't want the size of the TLAB, which is ergonomically adjusted, to be tied to the sampling rate.  There is no reason to do that.  I want reasonable statistical sampling of the allocations.  


As I said explicitly in my e-mail, I totally agree with this. Which is why I never suggested to resize TLABs in order to vary the sampling rate. (Apologies if my e-mail was not clear.)



All this requires is a separate counter that is set to the next sampling interval, and decremented when an allocation happens, which goes into a slow path when the decrement hits 0.  Doing a subtraction and a pointer bump in allocation instead of just a pointer bump is basically free.  


Maybe it’s cheap on Intel, but maybe it’s not on other platforms that other folks care about.


Note that it has been doing an additional addition (to keep track of per thread allocation) as part of allocation since Java 7, 


Interesting. I hadn’t realized that. Does that keep track of total size allocated per thread or number of allocated objects per thread? If it’s the former, why isn’t it possible to calculate that from the TLABs information?


and no one has complained.

I'm not worried about the ease of implementation here, because we've already implemented it.  


Yeah, but someone will have to maintain it moving forward.


It hasn't even been hard for us to do the forward port, except when the relevant Hotspot code is significantly refactored.

We can also turn the sampling off, if we want.  We can set the sampling rate to 2^32, have the sampling code do nothing, and no one will ever notice.  


You still have extra instructions in the allocation path, so it’s not turned off (i.e., you have the tax without any benefit).


In fact, we could just have the sampling code do nothing, and no one would ever notice.

Honestly, no one ever notices the overhead of the sampling, anyway.  JDK8 made it more expensive to grab a stack trace (the cost became proportional to the number of loaded classes), but we have a patch that mitigates that, which we would also be happy to upstream.

As for the other concern: my concern about *just* having the callback mechanism is that there is quite a lot you can't do from user code during an allocation, because of lack of access to JNI.


Maybe I missed something. Are the callbacks in Java? I.e., do you call them using JNI from the slow path you call directly from the allocation code?


  However, you can do pretty much anything from the VM itself.  Crucially (for us), we don't just log the stack traces, we also keep track of which are live and which aren't.  We can't do this in a callback, if the callback can't create weak refs to the object.

What we do at Google is to have two methods: one that you pass a callback to (the callback gets invoked with a StackTraceData object, as I've defined above), and another that just tells you which sampled objects are still live.  We could also add a third, which allowed a callback to set the sampling interval (basically, the VM would call it to get the integer number of bytes to be allocated before the next sample).  

Would people be amenable to that?  It makes the code more complex, but, as I say, it's nice for detecting memory leaks ("Hey!  Where did that 1 GB object come from?").


Well, that 1GB object would have most likely been allocated outside a TLAB and you could have identified it by instrumenting the “outside-of-TLAB allocation path” (just saying…).

But, seriously, why didn’t you like my proposal? It can do anything your scheme can with fewer and simpler code changes. The only thing that it cannot do is to sample based on object count (i.e., every 100 objects) instead of based on object size (i.e., every 1MB of allocations). But I think doing sampling based on size is the right approach here (IMHO).

Tony




-----

Tony Printezis | JVM/GC Engineer / VM Team | Twitter

@TonyPrintezis


Re: Low-Overhead Heap Profiling

Jeremy Manson-4


On Wed, Jun 24, 2015 at 10:57 AM, Tony Printezis <[hidden email]> wrote:
Hi Jeremy,

Please see inline.

On June 23, 2015 at 7:22:13 PM, Jeremy Manson ([hidden email]) wrote:

I don't want the size of the TLAB, which is ergonomically adjusted, to be tied to the sampling rate.  There is no reason to do that.  I want reasonable statistical sampling of the allocations.  


As I said explicitly in my e-mail, I totally agree with this. Which is why I never suggested to resize TLABs in order to vary the sampling rate. (Apologies if my e-mail was not clear.)


My fault - I misread it.  Doesn't your proposal miss out-of-TLAB allocs entirely (and, in fact, not work if TLAB support is turned off)?  I might be missing something obvious (and see my response below).
 
All this requires is a separate counter that is set to the next sampling interval, and decremented when an allocation happens, which goes into a slow path when the decrement hits 0.  Doing a subtraction and a pointer bump in allocation instead of just a pointer bump is basically free.  


Maybe it’s cheap on Intel, but maybe it’s not on other platforms that other folks care about.

Really?  A memory read and a subtraction?  Which architectures care about that?

Again, notice that no one has complained about the addition that was added for total bytes allocated per thread.  I note that was actually added in the 6u20 timeframe.

Note that it has been doing an additional addition (to keep track of per thread allocation) as part of allocation since Java 7, 


Interesting. I hadn’t realized that. Does that keep track of total size allocated per thread or number of allocated objects per thread? If it’s the former, why isn’t it possible to calculate that from the TLABs information?


Total size allocated per thread.  It isn't possible to calculate that from the TLAB because of out-of-TLAB allocation (and hypothetically disabled TLABs).

For some reason, they never included it in the ThreadMXBean interface, but it is in com.sun.management.ThreadMXBean, so you can cast your ThreadMXBean to a com.sun.management.ThreadMXBean and call getThreadAllocatedBytes() on it.
 
and no one has complained.

I'm not worried about the ease of implementation here, because we've already implemented it.  


Yeah, but someone will have to maintain it moving forward.


I've been maintaining it internally to Google for 5 years.  It's actually pretty self-contained.  The only work involved is when they refactor something (so I've had to move it), or when a bug in the existing implementation is discovered.  It is very closely parallel to the TLAB code, which doesn't change much / at all.
 

It hasn't even been hard for us to do the forward port, except when the relevant Hotspot code is significantly refactored.

We can also turn the sampling off, if we want.  We can set the sampling rate to 2^32, have the sampling code do nothing, and no one will ever notice.  


You still have extra instructions in the allocation path, so it’s not turned off (i.e., you have the tax without any benefit).


Hey, you have a counter in your allocation path you've never noticed, which none of your code uses.  Pipelining is a wonderful thing.  :)

In fact, we could just have the sampling code do nothing, and no one would ever notice.

Honestly, no one ever notices the overhead of the sampling, anyway.  JDK8 made it more expensive to grab a stack trace (the cost became proportional to the number of loaded classes), but we have a patch that mitigates that, which we would also be happy to upstream.

As for the other concern: my concern about *just* having the callback mechanism is that there is quite a lot you can't do from user code during an allocation, because of lack of access to JNI.


Maybe I missed something. Are the callbacks in Java? I.e., do you call them using JNI from the slow path you call directly from the allocation code?

(For context: this referred to the hypothetical feature where we can provide a callback that invokes some code from allocation.)

(It's not actually hypothetical, because we've already implemented it, but let's call it hypothetical for the moment.)

We invoke native code.  You can't invoke any Java code during allocation, including calling JNI methods, because that would make allocation potentially reentrant, which doesn't work for all sorts of reasons.  The native code doesn't even get passed a JNIEnv * - there is nothing it can do with it without making the VM crash a lot.

Or, rather, you might be able to do that, but it would take a lot of Hotspot rearchitecting.  When I tried to do it, I realized it would be an extremely deep dive.

  However, you can do pretty much anything from the VM itself.  Crucially (for us), we don't just log the stack traces, we also keep track of which are live and which aren't.  We can't do this in a callback, if the callback can't create weak refs to the object.

What we do at Google is to have two methods: one that you pass a callback to (the callback gets invoked with a StackTraceData object, as I've defined above), and another that just tells you which sampled objects are still live.  We could also add a third, which allowed a callback to set the sampling interval (basically, the VM would call it to get the integer number of bytes to be allocated before the next sample).  

Would people be amenable to that?  It makes the code more complex, but, as I say, it's nice for detecting memory leaks ("Hey!  Where did that 1 GB object come from?").


Well, that 1GB object would have most likely been allocated outside a TLAB and you could have identified it by instrumenting the “outside-of-TLAB allocation path” (just saying…).


That's orthogonal to the point I was making in the quote above - the point I was making there was that we want to be able to detect what sampled objects are live.  We can do that regardless of how we implement the sampling (although it did involve my making a new kind of weak oop processing mechanism inside the VM).

But to the question of whether we can just instrument the outside-of-tlab allocation path...  There are a few weirdnesses here.  The first one that jumps to mind is that there's also a fast path for allocating in the YG outside of TLABs, if an object is too large to fit in the current TLAB.  Those objects would never get sampled.  So "outside of tlab" doesn't always mean "slow path".

Another one that jumps to mind is that we don't know whether the outside-of-TLAB path actually passes the sampling threshold, especially if we let users configure the sampling threshold.  So how would we know whether to sample it?

You also have to keep track of the sampling interval in the code where we allocate new TLABs, in case the sampling threshold is larger than the TLAB size.  That's not a big deal, of course.

And, every time the TLAB code changes, we have to consider whether / how those changes affect this sampling mechanism.

I guess my larger point is that there are so many little corner cases with TLAB allocation, including whether it even happens, that basing the sampling strategy around it seems like a cop-out.  And my belief is that the arguments against our strategy don't really hold water, especially given the presence of the per-thread allocation counter that no one noticed.  

Heck, I've already had it reviewed internally by a Hotspot reviewer (Chuck Rasbold).  All we really need is to write an acceptable JEP, to adjust the code based on the changes the community wants, and someone from Oracle willing to say "yes".

For reference, to keep track of sampling, the delta to C2 is about 150 LOC (much of which is newlines-because-of-formatting for methods that take a lot of parameters), the delta to C1 is about 60 LOC, the delta to each x86 template interpreter is about 20 LOC, and the delta for the assembler is about 40 LOC.      It's not completely trivial, but the code hasn't changed substantially in the 5 years since I wrote it (other than a couple of bugfixes).

Obviously, assembler/template interpreter would have to be dup'd across platforms - we can do that for PPC and aarch64, on which we do active development, at least.

But, seriously, why didn’t you like my proposal? It can do anything your scheme can with fewer and simpler code changes. The only thing that it cannot do is to sample based on object count (i.e., every 100 objects) instead of based on object size (i.e., every 1MB of allocations). But I think doing sampling based on size is the right approach here (IMHO).

I agree that sampling based on size is the right approach.  

(And your approach is definitely simpler - I don't mean to discount it.  And if that's what it takes to get this feature accepted, we'll do it, but I'll grumble about it.)

Jeremy 

Re: Low-Overhead Heap Profiling

Bernd Eckenfels-4
Am Wed, 24 Jun 2015 16:26:35 -0700
schrieb Jeremy Manson <[hidden email]>:

> > As for the other concern: my concern about *just* having the
> > callback mechanism is that there is quite a lot you can't do from
> > user code during an allocation, because of lack of access to JNI.
> >
> >
> > Maybe I missed something. Are the callbacks in Java? I.e., do you
> > call them using JNI from the slow path you call directly from the
> > allocation code?
> >
> > (For context: this referred to the hypothetical feature where we can
> provide a callback that invokes some code from allocation.)

What about a hypothetical queueing feature, so you can process the
events asynchronously (perhaps with some backpressure control). This
would work well for statistics processing.

(Your other use case, throwing an OOM, would not work, I guess.)

But it's an elegant solution to provide a code environment generic enough
for all kinds of instrumentation and independent of the "allocation
recursion".

Greetings
Bernd

Re: Low-Overhead Heap Profiling

Jeremy Manson-4
Hey Bernd,

In addition to the overhead of the mechanism (bad enough to implement backpressure control?), I'd also be worried about losing thread-local state in that case.  It would be hard to use TLS at the native layer to persist data, or to map invocations to particular Java threads.

Also, you would have no guarantees about the callbacks being executed in a timely fashion, or at all.

A user could set up a mechanism to do this themselves, if they wanted.  All they would have to do is implement their own, purely native queue in the callback mechanism.  Another thread could pull from the queue and use JNI / Java to process the events.
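Roughly along these lines (illustrative only; the callback name and event contents are made up):

    // Illustration of the split: the sampling callback only enqueues a copy of
    // the event; an ordinary consumer thread drains the queue and is free to
    // attach to the VM and use JNI / Java.
    #include <condition_variable>
    #include <cstdio>
    #include <deque>
    #include <mutex>

    struct SampleEvent { size_t size; /* plus copied stack frames, etc. */ };

    static std::mutex              g_mu;
    static std::condition_variable g_cv;
    static std::deque<SampleEvent> g_queue;

    // Invoked from the VM during allocation: must be quick and must not call Java.
    void on_allocation_sampled(size_t size) {
      { std::lock_guard<std::mutex> lock(g_mu); g_queue.push_back({size}); }
      g_cv.notify_one();
    }

    // Runs on a normal thread: here it is safe to call into JNI / Java.
    void consume_samples() {
      for (;;) {
        std::unique_lock<std::mutex> lock(g_mu);
        g_cv.wait(lock, [] { return !g_queue.empty(); });
        SampleEvent ev = g_queue.front();
        g_queue.pop_front();
        lock.unlock();
        std::printf("processing sampled allocation of %zu bytes\n", ev.size);
      }
    }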

In fact, we could provide an extension library to do this for them, if we wanted, with the guidance "Use this particular callback, and you can process the events in a Java thread".

Jeremy





Re: Low-Overhead Heap Profiling

Kirk Pepperdine-2
In reply to this post by Tony Printezis-3


But, seriously, why didn’t you like my proposal? It can do anything your scheme can with fewer and simpler code changes. The only thing that it cannot do is to sample based on object count (i.e., every 100 objects) instead of based on object size (i.e., every 1MB of allocations). But I think doing sampling based on size is the right approach here (IMHO).



I would think that size-based sampling would create a size-based bias in your sampling. Since, IME, it’s allocation frequency that is more damaging to performance, I’d prefer to see time-boxed sampling.

Kind regards,
Kirk Pepperdine


Re: Low-Overhead Heap Profiling

Jeremy Manson-4
Why would allocation frequency be more damaging to performance?  Allocation is cheap, and as long as the objects are dead before the YG collection, it costs the same to collect one 1MB object as it does to collect 1000 1K objects.

Jeremy




Re: Low-Overhead Heap Profiling

Tony Printezis-3
In reply to this post by Jeremy Manson-4
Hi Jeremy,

Inline.

On June 24, 2015 at 7:26:55 PM, Jeremy Manson ([hidden email]) wrote:



On Wed, Jun 24, 2015 at 10:57 AM, Tony Printezis <[hidden email]> wrote:
Hi Jeremy,

Please see inline.

On June 23, 2015 at 7:22:13 PM, Jeremy Manson ([hidden email]) wrote:

I don't want the size of the TLAB, which is ergonomically adjusted, to be tied to the sampling rate.  There is no reason to do that.  I want reasonable statistical sampling of the allocations.  


As I said explicitly in my e-mail, I totally agree with this. Which is why I never suggested to resize TLABs in order to vary the sampling rate. (Apologies if my e-mail was not clear.)


My fault - I misread it.  Doesn't your proposal miss out-of-TLAB allocs entirely


This is correct: We’ll also have to intercept the outside-TLAB allocs. But, IMHO, this is a feature as it’s helpful to know how many (and which) allocations happen outside TLABs. These are generally very infrequent (and slow anyway), so sampling all of those, instead of only sampling some of them, does not have much of an overhead. But, you could also do sampling for the outside-TLAB allocs too, if you want: just accumulate their size on a separate per-thread counter and sample the allocation that bumps that counter over the limit.
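(Conceptually, something like this — illustrative only, not actual HotSpot code:)

    // Toy sketch: sampling outside-TLAB allocations via a separate per-thread counter.
    #include <cstddef>
    #include <cstdio>

    struct ThreadSamplingState {
      size_t outside_tlab_bytes = 0;          // bytes allocated outside TLABs since last sample
      size_t outside_tlab_limit = 1 << 20;    // illustrative: sample roughly every 1M
    };

    static void sample_outside_tlab(size_t size) {
      std::printf("sampled outside-TLAB allocation of %zu bytes\n", size);
    }

    // Called from the (already slow) outside-TLAB allocation path.
    void on_outside_tlab_alloc(ThreadSamplingState& t, size_t size) {
      t.outside_tlab_bytes += size;
      if (t.outside_tlab_bytes >= t.outside_tlab_limit) {
        sample_outside_tlab(size);
        t.outside_tlab_bytes = 0;             // or subtract the limit to keep the remainder
      }
    }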

An additional observation (orthogonal to the main point, but I thought I’d mention it anyway): For the outside-TLAB allocs it’d be helpful to also know which generation the object ended up in (e.g., young gen or direct-to-old-gen). This is very helpful in some situations when you’re trying to work out which allocation(s) grew the old gen occupancy between two young GCs.

FWIW, the existing JFR events follow the approach I described above:

* one event for each new TLAB + first alloc in that TLAB (my proposal basically generalizes this and removes the 1-1 relationship between object alloc sampling and new TLAB operation)

* one event for all allocs outside a TLAB

I think the above separation is helpful. But if you think it could confuse users, you can of course easily just combine the information (but I strongly believe it’s better to report the information separately).


(and, in fact, not work if TLAB support is turned off)? 


Who turns off TLABs? Is -UseTLAB even tested by Oracle? (This is a genuine question.)


 I might be missing something obvious (and see my response below).


 
All this requires is a separate counter that is set to the next sampling interval, and decremented when an allocation happens, which goes into a slow path when the decrement hits 0.  Doing a subtraction and a pointer bump in allocation instead of just a pointer bump is basically free.  


Maybe it’s cheap on Intel, but maybe it’s not on other platforms that other folks care about.

Really?  A memory read and a subtraction?  Which architectures care about that?


I was not concerned with the read and subtraction; I was more concerned with the conditional that follows them (Intel has great branch prediction).

And a personal pet peeve (based on past experience): How many “free” instructions do you have to add before they are not free any more?



Again, notice that no one has complained about the addition that was added for total bytes allocated per thread.  I note that was actually added in the 6u20 timeframe.

Note that it has been doing an additional addition (to keep track of per thread allocation) as part of allocation since Java 7, 


Interesting. I hadn’t realized that. Does that keep track of total size allocated per thread or number of allocated objects per thread? If it’s the former, why isn’t it possible to calculate that from the TLABs information?


Total size allocated per thread.  It isn't possible to calculate that from the TLAB because of out-of-TLAB allocation 


The allocating Thread is passed to the slow (outside-TLAB) alloc path so it would be trivial to update the per-thread allocation stats from there too (in fact, it does; see below).


(and hypothetically disabled TLABs).


Anyone cares? :-)



For some reason, they never included it in the ThreadMXBean interface, but it is in com.sun.management.ThreadMXBean, so you can cast your ThreadMXBean to a com.sun.management.ThreadMXBean and call getThreadAllocatedBytes() on it.


Thanks for the tip. I’ll look into this...

and no one has complained.

I'm not worried about the ease of implementation here, because we've already implemented it.  


Yeah, but someone will have to maintain it moving forward.


I've been maintaining it internally to Google for 5 years.  It's actually pretty self-contained.  The only work involved is when they refactor something (so I've had to move it), or when a bug in the existing implementation is discovered.  It is very closely parallel to the TLAB code, which doesn't change much / at all.


The TLAB code has really not changed much for a while. ;-) (but haven’t looked at the JDK 9 source very closely though…)

It hasn't even been hard for us to do the forward port, except when the relevant Hotspot code is significantly refactored.

We can also turn the sampling off, if we want.  We can set the sampling rate to 2^32, have the sampling code do nothing, and no one will ever notice.  


You still have extra instructions in the allocation path, so it’s not turned off (i.e., you have the tax without any benefit).


Hey, you have a counter in your allocation path you've never noticed, which none of your code uses.  Pipelining is a wonderful thing.  :)


See above re: “free” instructions.



In fact, we could just have the sampling code do nothing, and no one would ever notice.

Honestly, no one ever notices the overhead of the sampling, anyway.  JDK8 made it more expensive to grab a stack trace (the cost became proportional to the number of loaded classes), but we have a patch that mitigates that, which we would also be happy to upstream.

As for the other concern: my concern about *just* having the callback mechanism is that there is quite a lot you can't do from user code during an allocation, because of lack of access to JNI.


Maybe I missed something. Are the callbacks in Java? I.e., do you call them using JNI from the slow path you call directly from the allocation code?

(For context: this referred to the hypothetical feature where we can provide a callback that invokes some code from allocation.)

(It's not actually hypothetical, because we've already implemented it, but let's call it hypothetical for the moment.)


OK.



We invoke native code.  You can't invoke any Java code during allocation, including calling JNI methods, because that would make allocation potentially reentrant, which doesn't work for all sorts of reasons.


That’s what I was worried about….


  The native code doesn't even get passed a JNIEnv * - there is nothing it can do with it without making the VM crash a lot.


So, thanks for the clarification. Being able to attach a callback to this in, say, the JVM would be totally fine. I was worried that you wanted to call Java. :-)



Or, rather, you might be able to do that, but it would take a lot of Hotspot rearchitecting.  When I tried to do it, I realized it would be an extremely deep dive.


I believe you. :-)



  However, you can do pretty much anything from the VM itself.  Crucially (for us), we don't just log the stack traces, we also keep track of which are live and which aren't.  We can't do this in a callback, if the callback can't create weak refs to the object.

What we do at Google is to have two methods: one that you pass a callback to (the callback gets invoked with a StackTraceData object, as I've defined above), and another that just tells you which sampled objects are still live.  We could also add a third, which allowed a callback to set the sampling interval (basically, the VM would call it to get the integer number of bytes to be allocated before the next sample).  

Would people be amenable to that?  It makes the code more complex, but, as I say, it's nice for detecting memory leaks ("Hey!  Where did that 1 GB object come from?").


Well, that 1GB object would have most likely been allocated outside a TLAB and you could have identified it by instrumenting the “outside-of-TLAB allocation path” (just saying…).


That's orthogonal to the point I was making in the quote above - the point I was making there was that we want to be able to detect what sampled objects are live.  We can do that regardless of how we implement the sampling (although it did involve my making a new kind of weak oop processing mechanism inside the VM).


Yeah, I was thinking of doing something similar (tracking object lifetimes, and other attributes, with WeakRefs). 



But to the question of whether we can just instrument the outside-of-tlab allocation path...  There are a few weirdnesses here.  The first one that jumps to mind is that there's also a fast path for allocating in the YG outside of TLABs, if an object is too large to fit in the current TLAB.  Those objects would never get sampled.  So "outside of tlab" doesn't always mean "slow path".


CollectedHeap::common_mem_allocate_noinit() is the first level of the slow path called when a TLAB allocation fails because the object doesn’t fit in the current TLAB. It checks (allocate_from_tlab() / allocate_from_tlab_slow()) whether to refill the current TLAB or keep the TLAB and delegate to the GC (mem_allocate()) to allocate the object outside a TLAB (either in the young or old gen; the GC might also decide to do a collection at this point if, say, the eden is full...). So, it depends on what you mean by slow path but, yes, any allocations that go through the above path should be considered as “slow path” allocations.

One more piece of data: AllocTracer::send_allocation_outside_tlab_event() (the JFR entry point for outside-TLAB allocs) is fired from common_mem_allocate_noinit(). So, if there are other non-TLAB allocation paths outside that method, that entry point has been placed incorrectly (it’s possible of course; but I think that it’s actually placed correctly).

(note: I only looked at the JDK 8 sources, haven’t checked the JDK 9 sources yet, the above might have been changed)

BTW, when looking at the common_mem_allocate_noinit() code I noticed the following:

THREAD->incr_allocated_bytes(size * HeapWordSize);

(as predicted earlier)



Another one that jumps to mind is that we don't know whether the outside-of-TLAB path actually passes the sampling threshold, especially if we let users configure the sampling threshold.  So how would we know whether to sample it?


See above (IMHO: sample all of them).



You also have to keep track of the sampling interval in the code where we allocate new TLABs, in case the sampling threshold is larger than the TLAB size.  That's not a big deal, of course.


Of course, but that’s kinda trivial. BTW, one approach here would be "given that refilling a TLAB is slow anyway, always sample the first object in each TLAB irrespective of desired sampling frequency”. Another would be "don’t do that, I set the sampling frequency pretty low so as not to be flooded with data when the TLABs are very small”. I have to say I’m in the latter camp.
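(For completeness, a toy sketch of carrying the leftover sampling budget across a TLAB refill, continuing the artificial-top idea from my first e-mail — names illustrative, not HotSpot code:)

    // Toy sketch: carry the remaining bytes-to-next-sample across TLAB refills,
    // so the effective sampling rate is independent of the TLAB size.
    #include <cstddef>

    struct TLAB {
      char*  cur;
      char*  top;        // artificial limit seen by the fast path
      char*  real_end;
      size_t step;       // sampling interval in bytes
    };

    void refill_tlab(TLAB& t, char* new_start, size_t new_size,
                     size_t step, size_t* bytes_to_next_sample /* per-thread */) {
      t.cur = new_start;
      t.real_end = new_start + new_size;
      t.step = step;
      size_t first = (*bytes_to_next_sample > 0) ? *bytes_to_next_sample : step;
      if (first < new_size) {
        t.top = new_start + first;                  // next sample lands inside this TLAB
        *bytes_to_next_sample = 0;
      } else {
        t.top = t.real_end;                         // threshold larger than the TLAB:
        *bytes_to_next_sample = first - new_size;   // carry the remainder forward
      }
    }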




And, every time the TLAB code changes, we have to consider whether / how those changes affect this sampling mechanism.


Yes, but how often does the TLAB code change? :-)



I guess my larger point is that there are so many little corner cases with TLAB allocation, including whether it even happens, that basing the sampling strategy around it seems like a cop-out.  


There are not many little corner cases. There are two cases: allocation inside a TLAB, allocation outside a TLAB. The former is by far the most common. The latter is generally very infrequent and has a well-defined code path (I described it earlier). And, as I said, it could be very helpful and informative to treat (and account for) the two cases separately.


And my belief is that the arguments against our strategy don't really hold water, especially given the presence of the per-thread allocation counter that no one noticed.  


I’ve already addressed that.



Heck, I've already had it reviewed internally by a Hotspot reviewer (Chuck Rasbold).  All we really need is to write an acceptable JEP, to adjust the code based on the changes the community wants, and someone from Oracle willing to say "yes".



For reference, to keep track of sampling, the delta to C2 is about 150 LOC (much of which is newlines-because-of-formatting for methods that take a lot of parameters), the delta to C1 is about 60 LOC, the delta to each x86 template interpreter is about 20 LOC, and the delta for the assembler is about 40 LOC.      It's not completely trivial, but the code hasn't changed substantially in the 5 years since I wrote it (other than a couple of bugfixes).

Obviously, assembler/template interpreter would have to be dup'd across platforms - we can do that for PPC and aarch64, on which we do active development, at least.


I’ll again vote for the simplicity of having a simple change in only one place (OK, two places…).



But, seriously, why didn’t you like my proposal? It can do anything your scheme can with fewer and simpler code changes. The only thing that it cannot do is to sample based on object count (i.e., every 100 objects) instead of based on object size (i.e., every 1MB of allocations). But I think doing sampling based on size is the right approach here (IMHO).

I agree that sampling based on size is the right approach.  

(And your approach is definitely simpler - I don't mean to discount it.  And if that's what it takes to get this feature accepted, we'll do it, but I'll grumble about it.)


That’s fine. :-)

Tony



-----

Tony Printezis | JVM/GC Engineer / VM Team | Twitter

@TonyPrintezis


Re: Low-Overhead Heap Profiling

Karen Kinnear
Jeremy,

Did I follow this correctly - that your approach modifies the compilers and interpreters and Tony's modifies the
common allocation code?

Given that the number of compilers and interpreters and interpreter platforms keeps expanding - I'd like to
add a vote to have heap allocation profiling in common allocation code.

thanks,
Karen

On Jun 25, 2015, at 4:28 PM, Tony Printezis wrote:

Hi Jeremy,

Inline.

On June 24, 2015 at 7:26:55 PM, Jeremy Manson ([hidden email]) wrote:



On Wed, Jun 24, 2015 at 10:57 AM, Tony Printezis <[hidden email]> wrote:
Hi Jeremy,

Please see inline.

On June 23, 2015 at 7:22:13 PM, Jeremy Manson ([hidden email]) wrote:

I don't want the size of the TLAB, which is ergonomically adjusted, to be tied to the sampling rate.  There is no reason to do that.  I want reasonable statistical sampling of the allocations.  


As I said explicitly in my e-mail, I totally agree with this. Which is why I never suggested to resize TLABs in order to vary the sampling rate. (Apologies if my e-mail was not clear.)


My fault - I misread it.  Doesn't your proposal miss out of TLAB allocs entirely


This is correct: We’ll also have to intercept the outside-TLAB allocs. But, IMHO, this is a feature as it’s helpful to know how many (and which) allocations happen outside TLABs. These are generally very infrequent (and slow anyway), so sampling all of those, instead of only sampling some of them, does not have much of an overhead. But, you could also do sampling for the outside-TLAB allocs too, if you want: just accumulate their size on a separate per-thread counter and sample the one that bumps that counter goes over a limit.

An additional observation (orthogonal to the main point, but I thought I’d mention it anyway): For the outside-TLAB allocs it’d be helpful to also know which generation the object ended up in (e.g., young gen or direct-to-old-gen). This is very helpful in some situations when you’re trying to work out which allocation(s) grew the old gen occupancy between two young GCs.

FWIW, the existing JFR events follow the approach I described above:

* one event for each new TLAB + first alloc in that TLAB (my proposal basically generalizes this and removes the 1-1 relationship between object alloc sampling and new TLAB operation)

* one event for all allocs outside a TLAB

I think the above separation is helpful. But if you think it could confuse users, you can of course easily just combine the information (but I strongly believe it’s better to report the information separately).


(and, in fact, not work if TLAB support is turned off)? 


Who turns off TLABs? Is -UseTLAB even tested by Oracle? (This is a genuine question.)


 I might be missing something obvious (and see my response below).


 
All this requires is a separate counter that is set to the next sampling interval, and decremented when an allocation happens, which goes into a slow path when the decrement hits 0.  Doing a subtraction and a pointer bump in allocation instead of just a pointer bump is basically free.  


Maybe on intel is cheap, but maybe it’s not on other platforms that other folks care about.

Really?  A memory read and a subtraction?  Which architectures care about that?


I was not concerned with the read and subtraction, I was more concerned with the conditional that follows them (intel has great branch prediction).

And a personal pet peeve (based on past experience): How many “free” instructions do you have to add before they are not free any more?



Again, notice that no one has complained about the addition that was added for total bytes allocated per thread.  I note that was actually added in the 6u20 timeframe.

Note that it has been doing an additional addition (to keep track of per thread allocation) as part of allocation since Java 7, 


Interesting. I hadn’t realized that. Does that keep track of total size allocated per thread or number of allocated objects per thread? If it’s the former, why isn’t it possible to calculate that from the TLABs information?


Total size allocated per thread.  It isn't possible to calculate that from the TLAB because of out-of-TLAB allocation 


The allocating Thread is passed to the slow (outside-TLAB) alloc path so it would be trivial to update the per-thread allocation stats from there too (in fact, it does; see below).


(and hypothetically disabled TLABs).


Anyone cares? :-)



For some reason, they never included it in the ThreadMXBean interface, but it is in com.sun.management.ThreadMXBean, so you can cast your ThreadMXBean to a com.sun.management.ThreadMXBean and call getThreadAllocatedBytes() on it.


Thanks for the tip. I’ll look into this...

and no one has complained.

I'm not worried about the ease of implementation here, because we've already implemented it.  


Yeah, but someone will have to maintain it moving forward.


I've been maintaining it internally to Google for 5 years.  It's actually pretty self-contained.  The only work involved is when they refactor something (so I've had to move it), or when a bug in the existing implementation is discovered.  It is very closely parallel to the TLAB code, which doesn't change much / at all.


The TLAB code has really not changed much for a while. ;-) (but haven’t looked at the JDK 9 source very closely though…)

It hasn't even been hard for us to do the forward port, except when the relevant Hotspot code is significantly refactored.

We can also turn the sampling off, if we want.  We can set the sampling rate to 2^32, have the sampling code do nothing, and no one will ever notice.  


You still have extra instructions in the allocation path, so it’s not turned off (i.e., you have the tax without any benefit).


Hey, you have a counter in your allocation path you've never noticed, which none of your code uses.  Pipelining is a wonderful thing.  :)


See above re: “free” instructions.



In fact, we could just have the sampling code do nothing, and no one would ever notice.

Honestly, no one ever notices the overhead of the sampling, anyway.  JDK8 made it more expensive to grab a stack trace (the cost became proportional to the number of loaded classes), but we have a patch that mitigates that, which we would also be happy to upstream.

As for the other concern: my concern about *just* having the callback mechanism is that there is quite a lot you can't do from user code during an allocation, because of lack of access to JNI.


Maybe I missed something. Are the callbacks in Java? I.e., do you call them using JNI from the slow path you call directly from the allocation code?

(For context: this referred to the hypothetical feature where we can provide a callback that invokes some code from allocation.)

(It's not actually hypothetical, because we've already implemented it, but let's call it hypothetical for the moment.)


OK.



We invoke native code.  You can't invoke any Java code during allocation, including calling JNI methods, because that would make allocation potentially reentrant, which doesn't work for all sorts of reasons.


That’s what I was worried about….


  The native code doesn't even get passed a JNIEnv * - there is nothing it can do with it without making the VM crash a lot.


So, thanks for the clarification. Being able to attach a callback to this in, say, the JVM it’d be totally fine. I was worried that you wanted to call Java. :-)



Or, rather, you might be able to do that, but it would take a lot of Hotspot rearchitecting.  When I tried to do it, I realized it would be an extremely deep dive.


I believe you. :-)



  However, you can do pretty much anything from the VM itself.  Crucially (for us), we don't just log the stack traces, we also keep track of which are live and which aren't.  We can't do this in a callback, if the callback can't create weak refs to the object.

What we do at Google is to have two methods: one that you pass a callback to (the callback gets invoked with a StackTraceData object, as I've defined above), and another that just tells you which sampled objects are still live.  We could also add a third, which allowed a callback to set the sampling interval (basically, the VM would call it to get the integer number of bytes to be allocated before the next sample).  

Would people be amenable to that?  It makes the code more complex, but, as I say, it's nice for detecting memory leaks ("Hey!  Where did that 1 GB object come from?").


Well, that 1GB object would have most likely been allocated outside a TLAB and you could have identified it by instrumenting the “outside-of-TLAB allocation path” (just saying…).


That's orthogonal to the point I was making in the quote above - the point I was making there was that we want to be able to detect what sampled objects are live.  We can do that regardless of how we implement the sampling (although it did involve my making a new kind of weak oop processing mechanism inside the VM).


Yeah, I was thinking of doing something similar (tracking object lifetimes, and other attributes, with WeakRefs). 



But to the question of whether we can just instrument the outside-of-tlab allocation path...  There are a few weirdnesses here.  The first one that jumps to mind is that there's also a fast path for allocating in the YG outside of TLABs, if an object is too large to fit in the current TLAB.  Those objects would never get sampled.  So "outside of tlab" doesn't always mean "slow path".


CollectedHeap::common_mem_allocate_noinit() is the first level of the slow path, called when a TLAB allocation fails because the object doesn’t fit in the current TLAB. It checks (allocate_from_tlab() / allocate_from_tlab_slow()) whether to refill the current TLAB or keep the TLAB and delegate to the GC (mem_allocate()) to allocate the object outside a TLAB (either in the young or old gen; the GC might also decide to do a collection at this point if, say, the eden is full...). So, it depends on what you mean by slow path but, yes, any allocations that go through the above path should be considered “slow path” allocations.

One more piece of data: AllocTracer::send_allocation_outside_tlab_event() (the JFR entry point for outside-TLAB allocs) is fired from common_mem_allocate_noinit(). So, if there are other non-TLAB allocation paths outside that method, that entry point has been placed incorrectly (it’s possible of course; but I think that it’s actually placed correctly).

(note: I only looked at the JDK 8 sources, haven’t checked the JDK 9 sources yet, the above might have been changed)
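
In very rough pseudo-code, the flow looks something like this (a paraphrase from memory of the JDK 8 code, not the actual source):

// Paraphrase of CollectedHeap::common_mem_allocate_noinit() (JDK 8);
// illustrative only, not the actual source.
HeapWord* result = allocate_from_tlab(klass, THREAD, size);   // may refill via allocate_from_tlab_slow()
if (result == NULL) {
  // Keep the current TLAB and ask the GC to place the object outside a TLAB
  // (young or old gen); the GC may also decide to collect at this point.
  result = Universe::heap()->mem_allocate(size, &gc_overhead_limit_was_exceeded);
  if (result != NULL) {
    // This is where the outside-TLAB JFR event is fired.
    AllocTracer::send_allocation_outside_tlab_event(klass, size * HeapWordSize);
  }
}
return result;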

BTW, when looking at the common_mem_allocate_noinit() code I noticed the following:

THREAD->incr_allocated_bytes(size * HeapWordSize);

(as predicted earlier)



Another one that jumps to mind is that we don't know whether the outside-of-TLAB path actually passes the sampling threshold, especially if we let users configure the sampling threshold.  So how would we know whether to sample it?


See above (IMHO: sample all of them).



You also have to keep track of the sampling interval in the code where we allocate new TLABs, in case the sampling threshold is larger than the TLAB size.  That's not a big deal, of course.


Of course, but that’s kinda trivial. BTW, one approach here would be “given that refilling a TLAB is slow anyway, always sample the first object in each TLAB irrespective of the desired sampling frequency”. Another would be “don’t do that, I set the sampling frequency pretty low so as not to be flooded with data when the TLABs are very small”. I have to say I’m in the latter camp.




And, every time the TLAB code changes, we have to consider whether / how those changes affect this sampling mechanism.


Yes, but how often does the TLAB code change? :-)



I guess my larger point is that there are so many little corner cases with TLAB allocation, including whether it even happens, that basing the sampling strategy around it seems like a cop-out.  


There are not many little corner cases. There are two cases: allocation inside a TLAB, allocation outside a TLAB. The former is by far the most common. The latter is generally very infrequent and has a well-defined code path (I described it earlier). And, as I said, it could be very helpful and informative to treat (and account for) the two cases separately.


And my belief is that the arguments against our strategy don't really hold water, especially given the presence of the per-thread allocation counter that no one noticed.  


I’ve already addressed that.



Heck, I've already had it reviewed internally by a Hotspot reviewer (Chuck Rasbold).  All we really need is to write an acceptable JEP, to adjust the code based on the changes the community wants, and someone from Oracle willing to say "yes".



For reference, to keep track of sampling, the delta to C2 is about 150 LOC (much of which is newlines-because-of-formatting for methods that take a lot of parameters), the delta to C1 is about 60 LOC, the delta to each x86 template interpreter is about 20 LOC, and the delta for the assembler is about 40 LOC.      It's not completely trivial, but the code hasn't changed substantially in the 5 years since I wrote it (other than a couple of bugfixes).

Obviously, assembler/template interpreter would have to be dup'd across platforms - we can do that for PPC and aarch64, on which we do active development, at least.


I’ll again vote for the simplicity of having a simple change in only one place (OK, two places…).



But, seriously, why didn’t you like my proposal? It can do anything your scheme can with fewer and simpler code changes. The only thing that it cannot do is to sample based on object count (i.e., every 100 objects) instead of based on object size (i.e., every 1MB of allocations). But I think doing sampling based on size is the right approach here (IMHO).

I agree that sampling based on size is the right approach.  

(And your approach is definitely simpler - I don't mean to discount it.  And if that's what it takes to get this feature accepted, we'll do it, but I'll grumble about it.)


That’s fine. :-)

Tony



-----

Tony Printezis | JVM/GC Engineer / VM Team | Twitter

@TonyPrintezis




Re: Low-Overhead Heap Profiling

Tony Printezis-3
In reply to this post by Bernd Eckenfels-4
Bernd,

I like the idea of buffering up the sampled objects in some data structure. But I assume it’d have to be a per-thread data structure to avoid contention issues. So, we’ll also need a periodic task that collects all such data structures and makes them available (somehow) to whoever wants to consume them?
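
Something like the following, say (just a sketch; none of these names exist in HotSpot, and synchronization between the allocating thread and the drainer is hand-waved - a safepoint or similar would do):

#include <cstddef>
#include <cstdint>

// Hypothetical sketch only.
struct SampleRecord {
  size_t   size_in_bytes;
  uint64_t stack_trace_id;   // plus klass, timestamp, ... whatever we decide to record
};

// One of these would hang off each thread, so recording a sample needs
// no synchronization with other allocating threads.
class PerThreadSampleBuffer {
  static const int kCapacity = 256;
  SampleRecord _records[kCapacity];
  int _count = 0;
 public:
  void record(const SampleRecord& r) {
    if (_count < kCapacity) _records[_count++] = r;   // or overwrite the oldest, etc.
  }
  // Called by the periodic task (e.g. at a safepoint, or while holding the
  // thread list lock): hand the contents to the consumer and reset.
  int drain(SampleRecord* out) {
    int n = _count;
    for (int i = 0; i < n; i++) out[i] = _records[i];
    _count = 0;
    return n;
  }
};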

Tony

On June 24, 2015 at 7:49:06 PM, Bernd Eckenfels ([hidden email]) wrote:

On Wed, 24 Jun 2015 16:26:35 -0700,
Jeremy Manson <[hidden email]> wrote:

> > As for the other concern: my concern about *just* having the
> > callback mechanism is that there is quite a lot you can't do from
> > user code during an allocation, because of lack of access to JNI.
> >
> >
> > Maybe I missed something. Are the callbacks in Java? I.e., do you
> > call them using JNI from the slow path you call directly from the
> > allocation code?
> >
> > (For context: this referred to the hypothetical feature where we can
> provide a callback that invokes some code from allocation.)

What about a hypothetical queueing feature, so you can process the events asynchronously (perhaps with some backpressure control)? This would work well for statistics processing.

(Your other use case, the throwing of OOM would not work, I guess)

But it's an elegant solution to provide a code environment generic enough for all kinds of instrumentation and independent of the "allocation recursion".

Greetings
Bernd
-----

Tony Printezis | JVM/GC Engineer / VM Team | Twitter

@TonyPrintezis


Re: Low-Overhead Heap Profiling

Tony Printezis-3
In reply to this post by Kirk Pepperdine-2
Hi Kirk,

(long time!) See inline.

On June 25, 2015 at 2:54:04 AM, Kirk Pepperdine ([hidden email]) wrote:


But, seriously, why didn’t you like my proposal? It can do anything your scheme can with fewer and simpler code changes. The only thing that it cannot do is to sample based on object count (i.e., every 100 objects) instead of based on object size (i.e., every 1MB of allocations). But I think doing sampling based on size is the right approach here (IMHO).



I would think that the size based sampling would create a size based bias in your sampling. 


That’s actually true. And this could be good (if you’re interested in what’s filling up your eden, the larger objects might be of more interest) or bad (if you want to get a general idea of what’s being allocated, the size bias might make you miss some types of objects / allocation sites).


Since, IME, allocation frequency is more damaging to performance, I’d prefer to see time-boxed sampling


Do you mean “sample every X ms, say”?


Tony



Kind regards,
Kirk Pepperdine



-----

Tony Printezis | JVM/GC Engineer / VM Team | Twitter

@TonyPrintezis


Re: Low-Overhead Heap Profiling

Tony Printezis-3
In reply to this post by Tony Printezis-3
BTW, could we get a reaction from the Oracle folks on this? Even though Jeremy and I are proposing different implementation approaches, we both agree (and Jeremy, please correct me on this) that having an allocation sampling mechanism that’s more flexible than what’s already in HotSpot (in particular: the sampling frequency not being tied to the TLAB size) would be a very helpful profiling feature. Is this something that we should pursue contributing?

Tony

On June 24, 2015 at 1:57:44 PM, Tony Printezis ([hidden email]) wrote:

Hi Jeremy,

Please see inline.

On June 23, 2015 at 7:22:13 PM, Jeremy Manson ([hidden email]) wrote:

I don't want the size of the TLAB, which is ergonomically adjusted, to be tied to the sampling rate.  There is no reason to do that.  I want reasonable statistical sampling of the allocations.  


As I said explicitly in my e-mail, I totally agree with this. Which is why I never suggested to resize TLABs in order to vary the sampling rate. (Apologies if my e-mail was not clear.)



All this requires is a separate counter that is set to the next sampling interval, and decremented when an allocation happens, which goes into a slow path when the decrement hits 0.  Doing a subtraction and a pointer bump in allocation instead of just a pointer bump is basically free.  


Maybe on Intel it’s cheap, but maybe it’s not on other platforms that other folks care about.


Note that it has been doing an additional addition (to keep track of per thread allocation) as part of allocation since Java 7, 


Interesting. I hadn’t realized that. Does that keep track of total size allocated per thread or number of allocated objects per thread? If it’s the former, why isn’t it possible to calculate that from the TLABs information?


and no one has complained.

I'm not worried about the ease of implementation here, because we've already implemented it.  


Yeah, but someone will have to maintain it moving forward.


It hasn't even been hard for us to do the forward port, except when the relevant Hotspot code is significantly refactored.

We can also turn the sampling off, if we want.  We can set the sampling rate to 2^32, have the sampling code do nothing, and no one will ever notice.  


You still have extra instructions in the allocation path, so it’s not turned off (i.e., you have the tax without any benefit).


In fact, we could just have the sampling code do nothing, and no one would ever notice.

Honestly, no one ever notices the overhead of the sampling, anyway.  JDK8 made it more expensive to grab a stack trace (the cost became proportional to the number of loaded classes), but we have a patch that mitigates that, which we would also be happy to upstream.

As for the other concern: my concern about *just* having the callback mechanism is that there is quite a lot you can't do from user code during an allocation, because of lack of access to JNI.


Maybe I missed something. Are the callbacks in Java? I.e., do you call them using JNI from the slow path you call directly from the allocation code?


  However, you can do pretty much anything from the VM itself.  Crucially (for us), we don't just log the stack traces, we also keep track of which are live and which aren't.  We can't do this in a callback, if the callback can't create weak refs to the object.

What we do at Google is to have two methods: one that you pass a callback to (the callback gets invoked with a StackTraceData object, as I've defined above), and another that just tells you which sampled objects are still live.  We could also add a third, which allowed a callback to set the sampling interval (basically, the VM would call it to get the integer number of bytes to be allocated before the next sample).  

Would people be amenable to that?  It makes the code more complex, but, as I say, it's nice for detecting memory leaks ("Hey!  Where did that 1 GB object come from?").


Well, that 1GB object would have most likely been allocated outside a TLAB and you could have identified it by instrumenting the “outside-of-TLAB allocation path” (just saying…).

But, seriously, why didn’t you like my proposal? It can do anything your scheme can with fewer and simpler code changes. The only thing that it cannot do is to sample based on object count (i.e., every 100 objects) instead of based on object size (i.e., every 1MB of allocations). But I think doing sampling based on size is the right approach here (IMHO).

Tony



Jeremy


On Tue, Jun 23, 2015 at 1:06 PM, Tony Printezis <[hidden email]> wrote:
Jeremy (and all),

I’m not on the serviceability list so I won’t include the messages so far. :-) Also CCing the hotspot GC list, in case they have some feedback on this.

Could I suggest a (much) simpler but at least as powerful and flexible way to do this? (This is something we’ve been meaning to do for a while now for TwitterJDK, the JDK we develop and deploy here at Twitter.) You can force allocations to go into the slow path periodically by artificially setting the TLAB top to a lower value. So, imagine a TLAB is 4M. You can set top to (bottom+1M). When an allocation thinks the TLAB is full (in this case, the first 1MB is full) it will call the allocation slow path. There, you can intercept it, sample the allocation (and, like in your case, you’ll also have the correct stack trace), notice that the TLAB is not actually full, extend its to top to, say, (bottom+2M), and you’re done.

Advantages of this approach:

* This is a much smaller, simpler, and self-contained change (no compiler changes necessary to maintain...).

* When it’s off, the overhead is only one extra test at the slow path TLAB allocation (i.e., negligible; we do some sampling on TLABs in TwitterJDK using a similar mechanism and, when it’s off, I’ve observed no performance overhead).

* (most importantly) You can turn this on and off, and adjust the sampling rate, dynamically. If you do the sampling based on JITed code, you’ll have to recompile all methods with allocation sites to turn the sampling on or off. (You can of course have it always on and just discard the output; it’d be nice not to have to do that though. IMHO, at least.)

* You can also very cheaply turn this on and off (or adjust the sampling frequncy) per thread, if that’s be helpful in some way (just add the appropriate info on the thread’s TLAB).

A few extra comments on the previous discussion:

* "JFR samples per new TLAB allocation. It provides really very good picture and I haven't seen overhead more than 2” : When TLABs get very large, I don’t think sampling one object per TLAB is enough to get a good sample (IMHO, at least). It’s probably OK for something like jbb which mostly allocates instances of a handful of classes and has very few allocation sites. But, a lot of the code we run at Twitter is a lot more elaborate than that and, in our experience, sampling one object per TLAB is not enough. You can, of course, decrease the TLAB size to increase the sampling size. But it’d be good not to have to do that given a smaller TLAB size could increase contention across threads.

* "Should it *just* take a stack trace, or should the behavior be configurable?” : I think we’d have to separate the allocation sampling mechanism from the consumption of the allocation samples. Once the sampling mechanism is in, different JVMs can take advantage of it in different ways. I assume that the Oracle folks would like at least a JFR event for every such sample. But in your build you can add extra code to collect the information in the way you have now.

* Talking of JFR, it’s a bit unfortunate that the AllocObjectInNewTLAB event has both the new TLAB information and the allocation information. It would have been nice if that event was split into two, say NewTLAB and AllocObjectInTLAB, and we’d be able to fire the latter for each sample.

* "Should the interval between samples be configurable?” : Totally. In fact, it’d be helpful if it was configurable dynamically. Imagine if a JVM starts misbehaving after 2-3 weeks of running. You can dynamically increase the sampling rate to get a better profile if the default is not giving fine-grain enough information.

* "As long of these features don’t contribute to sampling bias” : If the sampling interval is fixed, sampling bias would be a very real concern. In the above example, I’d increment top by 1M (the sampling frequency) + p% (a fudge factor). 

* "Yes, a perhaps optional callbacks would be nice too.” : Oh, no. :-) But, as I said, we should definitely separate the sampling mechanism from the mechanism that consumes the samples.

* "Another problem with our submitting things is that we can't really test on anything other than Linux.” : Another reason to go with a as platform independent solution as possible. :-)

Regards,

Tony

-----

Tony Printezis | JVM/GC Engineer / VM Team | Twitter

@TonyPrintezis









-----

Tony Printezis | JVM/GC Engineer / VM Team | Twitter

@TonyPrintezis

-----

Tony Printezis | JVM/GC Engineer / VM Team | Twitter

@TonyPrintezis


Re: Low-Overhead Heap Profiling

Jeremy Manson-4
In reply to this post by Tony Printezis-3


On Thu, Jun 25, 2015 at 1:28 PM, Tony Printezis <[hidden email]> wrote:
Hi Jeremy,

Inline.

On June 24, 2015 at 7:26:55 PM, Jeremy Manson ([hidden email]) wrote:



On Wed, Jun 24, 2015 at 10:57 AM, Tony Printezis <[hidden email]> wrote:
Hi Jeremy,

Please see inline.

On June 23, 2015 at 7:22:13 PM, Jeremy Manson ([hidden email]) wrote:

I don't want the size of the TLAB, which is ergonomically adjusted, to be tied to the sampling rate.  There is no reason to do that.  I want reasonable statistical sampling of the allocations.  


As I said explicitly in my e-mail, I totally agree with this. Which is why I never suggested to resize TLABs in order to vary the sampling rate. (Apologies if my e-mail was not clear.)


My fault - I misread it.  Doesn't your proposal miss out-of-TLAB allocs entirely


This is correct: We’ll also have to intercept the outside-TLAB allocs. But, IMHO, this is a feature as it’s helpful to know how many (and which) allocations happen outside TLABs. These are generally very infrequent (and slow anyway), so sampling all of those, instead of only sampling some of them, does not have much of an overhead. But, you could also do sampling for the outside-TLAB allocs too, if you want: just accumulate their size on a separate per-thread counter and sample the allocation that bumps that counter over a limit.


The outside-TLAB allocations generally get caught anyway, because they tend to be large enough to jump over the sample size immediately.
 

An additional observation (orthogonal to the main point, but I thought I’d mention it anyway): For the outside-TLAB allocs it’d be helpful to also know which generation the object ended up in (e.g., young gen or direct-to-old-gen). This is very helpful in some situations when you’re trying to work out which allocation(s) grew the old gen occupancy between two young GCs.


True.  We don't have this implemented, but it would be reasonably straightforward to glean it from the oop.
 

FWIW, the existing JFR events follow the approach I described above:

* one event for each new TLAB + first alloc in that TLAB (my proposal basically generalizes this and removes the 1-1 relationship between object alloc sampling and new TLAB operation)

* one event for all allocs outside a TLAB

I think the above separation is helpful. But if you think it could confuse users, you can of course easily just combine the information (but I strongly believe it’s better to report the information separately).


I do think it would make a confusing API.  It might make more sense to have a reporting mechanism that had a set number of fields with very concrete information (size, class, stacktrace), but allowed for platform-specific metadata.  We end up with a very long list of things we want in the sample: generation (how do you describe a generation?), object age (by number of GCs survived?  What kind of GC?), was it a TLAB allocation, etc.
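
Hypothetically, the record could have a shape like this (all names made up, just to illustrate the fixed-concrete-fields-plus-opaque-metadata idea):

#include <jni.h>

// Hypothetical sketch of a sample record with a fixed set of concrete fields
// plus opaque, platform-specific metadata; all names are made up.
struct AllocationSample {
  jlong      size_in_bytes;
  jclass     klass;
  jint       frame_count;
  jmethodID* methods;              // stack trace, one entry per frame
  jint*      bcis;                 // bytecode index per frame
  // Anything VM-specific (generation, object age, was it a TLAB allocation, ...)
  // goes behind an opaque blob that each platform is free to define:
  const void* platform_metadata;
  jint        platform_metadata_version;
};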


(and, in fact, not work if TLAB support is turned off)? 


Who turns off TLABs? Is -UseTLAB even tested by Oracle? (This is a genuine question.)


I don't think they do.  I have turned them off for various reasons (usually, I'm trying to instrument allocations and I don't want to muck about with thinking about TLABs), and the code paths seem a little crufty.  ISTR at some point finding something that clearly only worked by mistake, but I can't remember now what it was.
 
[snip]

 
  However, you can do pretty much anything from the VM itself.  Crucially (for us), we don't just log the stack traces, we also keep track of which are live and which aren't.  We can't do this in a callback, if the callback can't create weak refs to the object.

What we do at Google is to have two methods: one that you pass a callback to (the callback gets invoked with a StackTraceData object, as I've defined above), and another that just tells you which sampled objects are still live.  We could also add a third, which allowed a callback to set the sampling interval (basically, the VM would call it to get the integer number of bytes to be allocated before the next sample).  

Would people be amenable to that?  It makes the code more complex, but, as I say, it's nice for detecting memory leaks ("Hey!  Where did that 1 GB object come from?").


Well, that 1GB object would have most likely been allocated outside a TLAB and you could have identified it by instrumenting the “outside-of-TLAB allocation path” (just saying…).


That's orthogonal to the point I was making in the quote above - the point I was making there was that we want to be able to detect what sampled objects are live.  We can do that regardless of how we implement the sampling (although it did involve my making a new kind of weak oop processing mechanism inside the VM).


Yeah, I was thinking of doing something similar (tracking object lifetimes, and other attributes, with WeakRefs). 


We have all of that implemented, so hopefully I can save you the trouble. :) 

But to the question of whether we can just instrument the outside-of-tlab allocation path...  There are a few weirdnesses here.  The first one that jumps to mind is that there's also a fast path for allocating in the YG outside of TLABs, if an object is too large to fit in the current TLAB.  Those objects would never get sampled.  So "outside of tlab" doesn't always mean "slow path".


CollectedHeap::common_mem_allocate_noinit() is the first level of the slow path, called when a TLAB allocation fails because the object doesn’t fit in the current TLAB. It checks (allocate_from_tlab() / allocate_from_tlab_slow()) whether to refill the current TLAB or keep the TLAB and delegate to the GC (mem_allocate()) to allocate the object outside a TLAB (either in the young or old gen; the GC might also decide to do a collection at this point if, say, the eden is full...). So, it depends on what you mean by slow path but, yes, any allocations that go through the above path should be considered “slow path” allocations.


Let me be more specific.  Here is a place where allocations go through a fast path that is outside of a TLAB:


If the object won't fit in the TLAB, but will fit in the Eden, it will be allocated in the Eden, with hand-generated assembly.  This case will be entirely missed by sampling just the TLAB creation (or your variant) and the slow path.  I may be missing something about that code, but I can't really see what it is.

One more piece of data: AllocTracer::send_allocation_outside_tlab_event() (the JFR entry point for outside-TLAB allocs) is fired from common_mem_allocate_noinit(). So, if there are other non-TLAB allocation paths outside that method, that entry point has been placed incorrectly (it’s possible of course; but I think that it’s actually placed correctly).


What is happening in the line to which I referred, then?  To me, it kind of reads like "this is close enough to being TLAB allocation that I don't care that it isn't".
 
And that's really what's going on here. Your strategy is to tie what I see as a platform feature to a particular implementation.  If the implementation changes, or if we really don't understand it as well as we think we do, the whole thing falls on the floor.  If we mention TLABs in the docs, and TLABs do change, then it won't mean anything anymore.

A particular example pops to mind: I believe Metronome doesn't have TLABs at all.  Is that correct?  Can J9 developers implement this feature?
For reference, to keep track of sampling, the delta to C2 is about 150 LOC (much of which is newlines-because-of-formatting for methods that take a lot of parameters), the delta to C1 is about 60 LOC, the delta to each x86 template interpreter is about 20 LOC, and the delta for the assembler is about 40 LOC.      It's not completely trivial, but the code hasn't changed substantially in the 5 years since I wrote it (other than a couple of bugfixes).

Obviously, assembler/template interpreter would have to be dup'd across platforms - we can do that for PPC and aarch64, on which we do active development, at least.


I’ll again vote for the simplicity of having a simple change in only one place (OK, two places…).


This isn't a simple change anyway, if we're keeping track of live references.  We have to hook into reference processing - when a weak oop is detected to be dead, we have to delete the metadata.  And we have to change JVMTI.

Jeremy

Re: Low-Overhead Heap Profiling

Jeremy Manson-4
In reply to this post by Karen Kinnear
Karen,

I understand your concerns.  For reference, this is the additional code in the x86 assembler.  There are very small stubs in C1 and the template interpreter to call out to this macro (forgive the gratuitous use of the string "google" - we sprinkle it around a bit because it makes it a little easier to distinguish our code from upstream code).

#define GOOGLE_HEAP_MONITORING(ma, thread, var_size_in_bytes, con_size_in_bytes, object, t1, t2, sample_invocation) \
do {                                                                \
  {                                                                 \
    SkipIfEqual skip_if(ma, &GoogleHeapMonitor, 0);                 \
    Label skip_sample;                                              \
    Register thr = thread;                                          \
    if (!thr->is_valid()) {                                         \
      NOT_LP64(assert(t1 != noreg,                                  \
                      "Need temporary register for constants"));    \
      thr = NOT_LP64(t1) LP64_ONLY(r15_thread);                     \
      NOT_LP64(ma -> get_thread(thr);)                              \
    }                                                               \
    /* Trigger heap monitoring event */                             \
    Address bus(thr,                                                \
                JavaThread::google_bytes_until_sample_offset());    \
                                                                    \
    if (var_size_in_bytes->is_valid()) {                            \
      ma -> NOT_LP64(subl) LP64_ONLY(subq)(bus, var_size_in_bytes); \
    } else {                                                        \
      int csib = (con_size_in_bytes);                               \
      assert(t2 != noreg,                                           \
             "Need temporary register for constants");              \
      ma -> NOT_LP64(movl) LP64_ONLY(mov64)(t2, csib);              \
      ma -> NOT_LP64(subl) LP64_ONLY(subq)(bus, t2);                \
    }                                                               \
                                                                    \
    ma -> jcc(Assembler::positive, skip_sample);                    \
                                                                    \
    {                                                               \
      sample_invocation                                             \
    }                                                               \
    ma -> bind(skip_sample);                                        \
  }                                                                 \
} while(0)

It's not all that hard to port to additional architectures, but we'll have to think about it.

Jeremy

On Thu, Jun 25, 2015 at 1:41 PM, Karen Kinnear <[hidden email]> wrote:
Jeremy,

Did I follow this correctly - that your approach modifies the compilers and interpreters and Tony's modifies the
common allocation code?

Given that the number of compilers and interpreters and interpreter platforms keeps expanding - I'd like to
add a vote to have heap allocation profiling in common allocation code.

thanks,
Karen

On Jun 25, 2015, at 4:28 PM, Tony Printezis wrote:

Hi Jeremy,

Inline.

On June 24, 2015 at 7:26:55 PM, Jeremy Manson ([hidden email]) wrote:



On Wed, Jun 24, 2015 at 10:57 AM, Tony Printezis <[hidden email]> wrote:
Hi Jeremy,

Please see inline.

On June 23, 2015 at 7:22:13 PM, Jeremy Manson ([hidden email]) wrote:

I don't want the size of the TLAB, which is ergonomically adjusted, to be tied to the sampling rate.  There is no reason to do that.  I want reasonable statistical sampling of the allocations.  


As I said explicitly in my e-mail, I totally agree with this. Which is why I never suggested to resize TLABs in order to vary the sampling rate. (Apologies if my e-mail was not clear.)


My fault - I misread it.  Doesn't your proposal miss out-of-TLAB allocs entirely


This is correct: We’ll also have to intercept the outside-TLAB allocs. But, IMHO, this is a feature as it’s helpful to know how many (and which) allocations happen outside TLABs. These are generally very infrequent (and slow anyway), so sampling all of those, instead of only sampling some of them, does not have much of an overhead. But, you could also do sampling for the outside-TLAB allocs too, if you want: just accumulate their size on a separate per-thread counter and sample the allocation that bumps that counter over a limit.

An additional observation (orthogonal to the main point, but I thought I’d mention it anyway): For the outside-TLAB allocs it’d be helpful to also know which generation the object ended up in (e.g., young gen or direct-to-old-gen). This is very helpful in some situations when you’re trying to work out which allocation(s) grew the old gen occupancy between two young GCs.

FWIW, the existing JFR events follow the approach I described above:

* one event for each new TLAB + first alloc in that TLAB (my proposal basically generalizes this and removes the 1-1 relationship between object alloc sampling and new TLAB operation)

* one event for all allocs outside a TLAB

I think the above separation is helpful. But if you think it could confuse users, you can of course easily just combine the information (but I strongly believe it’s better to report the information separately).


(and, in fact, not work if TLAB support is turned off)? 


Who turns off TLABs? Is -UseTLAB even tested by Oracle? (This is a genuine question.)


 I might be missing something obvious (and see my response below).


 
All this requires is a separate counter that is set to the next sampling interval, and decremented when an allocation happens, which goes into a slow path when the decrement hits 0.  Doing a subtraction and a pointer bump in allocation instead of just a pointer bump is basically free.  


Maybe on Intel it’s cheap, but maybe it’s not on other platforms that other folks care about.

Really?  A memory read and a subtraction?  Which architectures care about that?


I was not concerned with the read and subtraction; I was more concerned with the conditional that follows them (Intel has great branch prediction).

And a personal pet peeve (based on past experience): How many “free” instructions do you have to add before they are not free any more?



Again, notice that no one has complained about the addition that was added for total bytes allocated per thread.  I note that was actually added in the 6u20 timeframe.

Note that it has been doing an additional addition (to keep track of per thread allocation) as part of allocation since Java 7, 


Interesting. I hadn’t realized that. Does that keep track of total size allocated per thread or number of allocated objects per thread? If it’s the former, why isn’t it possible to calculate that from the TLABs information?


Total size allocated per thread.  It isn't possible to calculate that from the TLAB because of out-of-TLAB allocation 


The allocating Thread is passed to the slow (outside-TLAB) alloc path so it would be trivial to update the per-thread allocation stats from there too (in fact, it does; see below).


(and hypothetically disabled TLABs).


Anyone cares? :-)



For some reason, they never included it in the ThreadMXBean interface, but it is in com.sun.management.ThreadMXBean, so you can cast your ThreadMXBean to a com.sun.management.ThreadMXBean and call getThreadAllocatedBytes() on it.


Thanks for the tip. I’ll look into this...

and no one has complained.

I'm not worried about the ease of implementation here, because we've already implemented it.  


Yeah, but someone will have to maintain it moving forward.


I've been maintaining it internally to Google for 5 years.  It's actually pretty self-contained.  The only work involved is when they refactor something (so I've had to move it), or when a bug in the existing implementation is discovered.  It is very closely parallel to the TLAB code, which doesn't change much / at all.


The TLAB code has really not changed much for a while. ;-) (but haven’t looked at the JDK 9 source very closely though…)

It hasn't even been hard for us to do the forward port, except when the relevant Hotspot code is significantly refactored.

We can also turn the sampling off, if we want.  We can set the sampling rate to 2^32, have the sampling code do nothing, and no one will ever notice.  


You still have extra instructions in the allocation path, so it’s not turned off (i.e., you have the tax without any benefit).


Hey, you have a counter in your allocation path you've never noticed, which none of your code uses.  Pipelining is a wonderful thing.  :)


See above re: “free” instructions.



In fact, we could just have the sampling code do nothing, and no one would ever notice.

Honestly, no one ever notices the overhead of the sampling, anyway.  JDK8 made it more expensive to grab a stack trace (the cost became proportional to the number of loaded classes), but we have a patch that mitigates that, which we would also be happy to upstream.

As for the other concern: my concern about *just* having the callback mechanism is that there is quite a lot you can't do from user code during an allocation, because of lack of access to JNI.


Maybe I missed something. Are the callbacks in Java? I.e., do you call them using JNI from the slow path you call directly from the allocation code?

(For context: this referred to the hypothetical feature where we can provide a callback that invokes some code from allocation.)

(It's not actually hypothetical, because we've already implemented it, but let's call it hypothetical for the moment.)


OK.



We invoke native code.  You can't invoke any Java code during allocation, including calling JNI methods, because that would make allocation potentially reentrant, which doesn't work for all sorts of reasons.


That’s what I was worried about….


  The native code doesn't even get passed a JNIEnv * - there is nothing it can do with it without making the VM crash a lot.


So, thanks for the clarification. Being able to attach a callback to this in, say, the JVM would be totally fine. I was worried that you wanted to call Java. :-)



Or, rather, you might be able to do that, but it would take a lot of Hotspot rearchitecting.  When I tried to do it, I realized it would be an extremely deep dive.


I believe you. :-)



  However, you can do pretty much anything from the VM itself.  Crucially (for us), we don't just log the stack traces, we also keep track of which are live and which aren't.  We can't do this in a callback, if the callback can't create weak refs to the object.

What we do at Google is to have two methods: one that you pass a callback to (the callback gets invoked with a StackTraceData object, as I've defined above), and another that just tells you which sampled objects are still live.  We could also add a third, which allowed a callback to set the sampling interval (basically, the VM would call it to get the integer number of bytes to be allocated before the next sample).  

Would people be amenable to that?  It makes the code more complex, but, as I say, it's nice for detecting memory leaks ("Hey!  Where did that 1 GB object come from?").


Well, that 1GB object would have most likely been allocated outside a TLAB and you could have identified it by instrumenting the “outside-of-TLAB allocation path” (just saying…).


That's orthogonal to the point I was making in the quote above - the point I was making there was that we want to be able to detect what sampled objects are live.  We can do that regardless of how we implement the sampling (although it did involve my making a new kind of weak oop processing mechanism inside the VM).


Yeah, I was thinking of doing something similar (tracking object lifetimes, and other attributes, with WeakRefs). 



But to the question of whether we can just instrument the outside-of-tlab allocation path...  There are a few weirdnesses here.  The first one that jumps to mind is that there's also a fast path for allocating in the YG outside of TLABs, if an object is too large to fit in the current TLAB.  Those objects would never get sampled.  So "outside of tlab" doesn't always mean "slow path".


CollectedHeap::common_mem_allocate_noinit() is the first level of the slow path, called when a TLAB allocation fails because the object doesn’t fit in the current TLAB. It checks (allocate_from_tlab() / allocate_from_tlab_slow()) whether to refill the current TLAB or keep the TLAB and delegate to the GC (mem_allocate()) to allocate the object outside a TLAB (either in the young or old gen; the GC might also decide to do a collection at this point if, say, the eden is full...). So, it depends on what you mean by slow path but, yes, any allocations that go through the above path should be considered “slow path” allocations.

One more piece of data: AllocTracer::send_allocation_outside_tlab_event() (the JFR entry point for outside-TLAB allocs) is fired from common_mem_allocate_noinit(). So, if there are other non-TLAB allocation paths outside that method, that entry point has been placed incorrectly (it’s possible of course; but I think that it’s actually placed correctly).

(note: I only looked at the JDK 8 sources, haven’t checked the JDK 9 sources yet, the above might have been changed)

BTW, when looking at the common_mem_allocate_noinit() code I noticed the following:

THREAD->incr_allocated_bytes(size * HeapWordSize);

(as predicted earlier)



Another one that jumps to mind is that we don't know whether the outside-of-TLAB path actually passes the sampling threshold, especially if we let users configure the sampling threshold.  So how would we know whether to sample it?


See above (IMHO: sample all of them).



You also have to keep track of the sampling interval in the code where we allocate new TLABs, in case the sampling threshold is larger than the TLAB size.  That's not a big deal, of course.


Of course, but that’s kinda trivial. BTW, one approach here would be “given that refilling a TLAB is slow anyway, always sample the first object in each TLAB irrespective of the desired sampling frequency”. Another would be “don’t do that, I set the sampling frequency pretty low so as not to be flooded with data when the TLABs are very small”. I have to say I’m in the latter camp.




And, every time the TLAB code changes, we have to consider whether / how those changes affect this sampling mechanism.


Yes, but how often does the TLAB code change? :-)



I guess my larger point is that there are so many little corner cases with TLAB allocation, including whether it even happens, that basing the sampling strategy around it seems like a cop-out.  


There are not many little corner cases. There are two cases: allocation inside a TLAB, allocation outside a TLAB. The former is by far the most common. The latter is generally very infrequent and has a well-defined code path (I described it earlier). And, as I said, it could be very helpful and informative to treat (and account for) the two cases separately.


And my belief is that the arguments against our strategy don't really hold water, especially given the presence of the per-thread allocation counter that no one noticed.  


I’ve already addressed that.



Heck, I've already had it reviewed internally by a Hotspot reviewer (Chuck Rasbold).  All we really need is to write an acceptable JEP, to adjust the code based on the changes the community wants, and someone from Oracle willing to say "yes".



For reference, to keep track of sampling, the delta to C2 is about 150 LOC (much of which is newlines-because-of-formatting for methods that take a lot of parameters), the delta to C1 is about 60 LOC, the delta to each x86 template interpreter is about 20 LOC, and the delta for the assembler is about 40 LOC.      It's not completely trivial, but the code hasn't changed substantially in the 5 years since I wrote it (other than a couple of bugfixes).

Obviously, assembler/template interpreter would have to be dup'd across platforms - we can do that for PPC and aarch64, on which we do active development, at least.


I’ll again vote for the simplicity of having a simple change in only one place (OK, two places…).



But, seriously, why didn’t you like my proposal? It can do anything your scheme can with fewer and simpler code changes. The only thing that it cannot do is to sample based on object count (i.e., every 100 objects) instead of based on object size (i.e., every 1MB of allocations). But I think doing sampling based on size is the right approach here (IMHO).

I agree that sampling based on size is the right approach.  

(And your approach is definitely simpler - I don't mean to discount it.  And if that's what it takes to get this feature accepted, we'll do it, but I'll grumble about it.)


That’s fine. :-)

Tony



-----

Tony Printezis | JVM/GC Engineer / VM Team | Twitter

@TonyPrintezis





Re: Low-Overhead Heap Profiling

Jeremy Manson-4
In reply to this post by Tony Printezis-3

On Thu, Jun 25, 2015 at 1:55 PM, Tony Printezis <[hidden email]> wrote:
Bernd,

I like the idea of buffering up the sampled objects in some data structure. But I assume it’d have to be a per-thread data structure to avoid contention issues. So, we’ll also need a periodic task that collects all such data structures and makes them available (somehow) to whoever wants to consume them?

This is easily done.  But, per my last message, I don't think it should be the default.  It can just be available as another callback, if you want it.

Jeremy


Re: Low-Overhead Heap Profiling

Jeremy Manson-4
In reply to this post by Tony Printezis-3


On Thu, Jun 25, 2015 at 2:23 PM, Tony Printezis <[hidden email]> wrote:
BTW, could we get a reaction from the Oracle folks on this? Even though Jeremy and I are proposing different implementation approaches, we both agree (and Jeremy, please correct me on this) that having an allocation sampling mechanism that’s more flexible than what’s already in HotSpot (in particular: the sampling frequency not being tied to the TLAB size) would be a very helpful profiling feature. Is this something that we should pursue contributing?


Yes.  I think we are 90% in agreement.  And I think being able to query for live objects is pretty helpful, too.  

Jeremy 

Re: Low-Overhead Heap Profiling

Jeremy Manson-4
In reply to this post by Tony Printezis-3


On Thu, Jun 25, 2015 at 2:08 PM, Tony Printezis <[hidden email]> wrote:
Hi Kirk,

(long time!) See inline.

On June 25, 2015 at 2:54:04 AM, Kirk Pepperdine ([hidden email]) wrote:


But, seriously, why didn’t you like my proposal? It can do anything your scheme can with fewer and simpler code changes. The only thing that it cannot do is to sample based on object count (i.e., every 100 objects) instead of based on object size (i.e., every 1MB of allocations). But I think doing sampling based on size is the right approach here (IMHO).



I would think that the size based sampling would create a size based bias in your sampling. 


That’s actually true. And this could be good (if you’re interested in what’s filling up your eden, the larger objects might be of more interest) or bad (if you want to get a general idea of what’s being allocated, the size bias might make you miss some types of objects / allocation sites).


Note that it catches both large objects and objects that are frequently allocated in the same way.  Both of those are useful pieces of information.
 
Particularly, if we find, say, 200 of the same stack trace, and we know they aren't in the live set, then we know we have a place in the code that generates a lot of garbage.  That can be a useful piece of information for tuning.

Since, IME, allocation frequency is more damaging to performance, I’d prefer to see time-boxed sampling


Do you mean “sample every X ms, say”?


This is not impossible, but a little weird.  The only obvious way I can think to do it without enormous overhead is having a thread that wakes up once every X ms and sets a shared location to 1.  Then you check that shared location on every allocation.  If it is 1, you go into a slow path where you try to CAS it to 0.  If the CAS succeeds, take the sample.

You could imagine some sampling problems caused by, say, thread priority issues.
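
In outline, something like this (just a sketch of the idea; sleep_ms and take_sample are assumed helpers, not real APIs):

#include <atomic>
#include <cstddef>

// Assumed helpers, not real HotSpot APIs:
void sleep_ms(int ms);               // sleep the watcher thread
void take_sample(size_t size);       // record one allocation sample

static std::atomic<int> sample_requested{0};

// Watcher thread: request a sample every interval_ms milliseconds.
void watcher_loop(int interval_ms) {
  for (;;) {
    sleep_ms(interval_ms);
    sample_requested.store(1, std::memory_order_relaxed);
  }
}

// Allocation fast path: one relaxed load and a predictable branch.
inline void maybe_sample(size_t size) {
  if (sample_requested.load(std::memory_order_relaxed) != 0) {
    int expected = 1;
    // Only the allocating thread that wins the CAS takes the sample.
    if (sample_requested.compare_exchange_strong(expected, 0)) {
      take_sample(size);
    }
  }
}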

Jeremy


Re: Low-Overhead Heap Profiling

Jeremy Manson-4
In reply to this post by Jeremy Manson-4
Another thought.  Since:

- It would be kind of surprising for Thread->allocated_bytes() to be different from the number used as the interval for tracking (e.g., if your interval is, say, 512K, you check allocated bytes, it says 0, you allocate 512K, you check allocated bytes, it says 512K, but no sample was taken), AND 
- We're already taking the maintenance hit to maintain the allocated bytes counter everywhere, 

Maybe a good compromise would be to piggyback on the allocated bytes counter?  If the allocated bytes counter is at N, and we pick a next sampling interval of K, we set a per-thread variable to N+K, and everywhere we increment the allocated bytes counter, we just test to see if it is greater than N+K?

This would add an additional load and another easily predicted branch, but no additional subtraction.  Also, it would have very obvious and tractable modifications to make in existing places that already have logic for the counter, so there wouldn't be much of an additional maintenance burden.  Finally, it would more-or-less address my concerns, because the non-TLAB fast paths I'm worried about are already instrumented for it.
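
Roughly (a sketch of the idea only, with made-up names rather than the real HotSpot fields):

#include <cstddef>
#include <cstdint>

// Illustrative field/function names only - not the real HotSpot ones.
struct ThreadCounters {
  uint64_t allocated_bytes;   // the per-thread counter we already maintain
  uint64_t next_sample_at;    // N + K: counter value at which to sample next
};

uint64_t pick_next_interval();                      // e.g. randomized around the configured K
void take_sample(ThreadCounters* t, size_t bytes);  // slow path: record stack trace etc.

inline void note_allocation(ThreadCounters* t, size_t bytes) {
  t->allocated_bytes += bytes;                      // the existing increment
  if (t->allocated_bytes > t->next_sample_at) {     // extra load + predictable branch
    take_sample(t, bytes);
    t->next_sample_at = t->allocated_bytes + pick_next_interval();
  }
}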

Jeremy

On Thu, Jun 25, 2015 at 10:27 PM, Jeremy Manson <[hidden email]> wrote:


On Thu, Jun 25, 2015 at 1:28 PM, Tony Printezis <[hidden email]> wrote:
Hi Jeremy,

Inline.

On June 24, 2015 at 7:26:55 PM, Jeremy Manson ([hidden email]) wrote:



On Wed, Jun 24, 2015 at 10:57 AM, Tony Printezis <[hidden email]> wrote:
Hi Jeremy,

Please see inline.

On June 23, 2015 at 7:22:13 PM, Jeremy Manson ([hidden email]) wrote:

I don't want the size of the TLAB, which is ergonomically adjusted, to be tied to the sampling rate.  There is no reason to do that.  I want reasonable statistical sampling of the allocations.  


As I said explicitly in my e-mail, I totally agree with this. Which is why I never suggested to resize TLABs in order to vary the sampling rate. (Apologies if my e-mail was not clear.)


My fault - I misread it.  Doesn't your proposal miss out-of-TLAB allocs entirely


This is correct: We’ll also have to intercept the outside-TLAB allocs. But, IMHO, this is a feature as it’s helpful to know how many (and which) allocations happen outside TLABs. These are generally very infrequent (and slow anyway), so sampling all of those, instead of only sampling some of them, does not have much of an overhead. But, you could also do sampling for the outside-TLAB allocs too, if you want: just accumulate their size on a separate per-thread counter and sample the allocation that bumps that counter over a limit.


The outside-TLAB allocations generally get caught anyway, because they tend to be large enough to jump over the sample size immediately.
 

An additional observation (orthogonal to the main point, but I thought I’d mention it anyway): For the outside-TLAB allocs it’d be helpful to also know which generation the object ended up in (e.g., young gen or direct-to-old-gen). This is very helpful in some situations when you’re trying to work out which allocation(s) grew the old gen occupancy between two young GCs.


True.  We don't have this implemented, but it would be reasonably straightforward to glean it from the oop.
 

FWIW, the existing JFR events follow the approach I described above:

* one event for each new TLAB + first alloc in that TLAB (my proposal basically generalizes this and removes the 1-1 relationship between object alloc sampling and new TLAB operation)

* one event for all allocs outside a TLAB

I think the above separation is helpful. But if you think it could confuse users, you can of course easily just combine the information (but I strongly believe it’s better to report the information separately).


I do think it would make a confusing API.  It might make more sense to have a reporting mechanism that had a set number of fields with very concrete information (size, class, stacktrace), but allowed for platform-specific metadata.  We end up with a very long list of things we want in the sample: generation (how do you describe a generation?), object age (by number of GCs survived?  What kind of GC?), was it a TLAB allocation, etc.


(and, in fact, not work if TLAB support is turned off)? 


Who turns off TLABs? Is -UseTLAB even tested by Oracle? (This is a genuine question.)


I don't think they do.  I have turned them off for various reasons (usually, I'm trying to instrument allocations and I don't want to muck about with thinking about TLABs), and the code paths seem a little crufty.  ISTR at some point finding something that clearly only worked by mistake, but I can't remember now what it was.
 
[snip]

 
  However, you can do pretty much anything from the VM itself.  Crucially (for us), we don't just log the stack traces, we also keep track of which are live and which aren't.  We can't do this in a callback, if the callback can't create weak refs to the object.

What we do at Google is to have two methods: one that you pass a callback to (the callback gets invoked with a StackTraceData object, as I've defined above), and another that just tells you which sampled objects are still live.  We could also add a third, which allowed a callback to set the sampling interval (basically, the VM would call it to get the integer number of bytes to be allocated before the next sample).  

Would people be amenable to that?  It makes the code more complex, but, as I say, it's nice for detecting memory leaks ("Hey!  Where did that 1 GB object come from?").


Well, that 1GB object would have most likely been allocated outside a TLAB and you could have identified it by instrumenting the “outside-of-TLAB allocation path” (just saying…).


That's orthogonal to the point I was making in the quote above - the point I was making there was that we want to be able to detect what sampled objects are live.  We can do that regardless of how we implement the sampling (although it did involve my making a new kind of weak oop processing mechanism inside the VM).


Yeah, I was thinking of doing something similar (tracking object lifetimes, and other attributes, with WeakRefs). 


We have all of that implemented, so hopefully I can save you the trouble. :) 

But to the question of whether we can just instrument the outside-of-tlab allocation path...  There are a few weirdnesses here.  The first one that jumps to mind is that there's also a fast path for allocating in the YG outside of TLABs, if an object is too large to fit in the current TLAB.  Those objects would never get sampled.  So "outside of tlab" doesn't always mean "slow path".


CollectedHeap::common_mem_allocate_noinit() is the first level of the slow path, called when a TLAB allocation fails because the object doesn’t fit in the current TLAB. It checks (allocate_from_tlab() / allocate_from_tlab_slow()) whether to refill the current TLAB or keep the TLAB and delegate to the GC (mem_allocate()) to allocate the object outside a TLAB (either in the young or old gen; the GC might also decide to do a collection at this point if, say, the eden is full...). So, it depends on what you mean by slow path but, yes, any allocations that go through the above path should be considered “slow path” allocations.


Let me be more specific.  Here is a place where allocations go through a fast path that is outside of a TLAB:


If the object won't fit in the TLAB, but will fit in the Eden, it will be allocated in the Eden, with hand-generated assembly.  This case will be entirely missed by sampling just the TLAB creation (or your variant) and the slow path.  I may be missing something about that code, but I can't really see what it is.

One more piece of data: AllocTracer::send_allocation_outside_tlab_event() (the JFR entry point for outside-TLAB allocs) is fired from common_mem_allocate_noinit(). So, if there are other non-TLAB allocation paths outside that method, that entry point has been placed incorrectly (it’s possible of course; but I think that it’s actually placed correctly).


What is happening in the line to which I referred, then?  To me, it kind of reads like "this is close enough to being TLAB allocation that I don't care that it isn't".
 
And that's really what's going on here. Your strategy is to tie what I see as a platform feature to a particular implementation.  If the implementation changes, or if we really don't understand it as well as we think we do, the whole thing falls on the floor.  If we mention TLABs in the docs, and TLABs do change, then it won't mean anything anymore.

A particular example pops to mind: I believe Metronome doesn't have TLABs at all.  Is that correct?  Can J9 developers implement this feature?
For reference, to keep track of sampling, the delta to C2 is about 150 LOC (much of which is newlines-because-of-formatting for methods that take a lot of parameters), the delta to C1 is about 60 LOC, the delta to each x86 template interpreter is about 20 LOC, and the delta for the assembler is about 40 LOC.      It's not completely trivial, but the code hasn't changed substantially in the 5 years since I wrote it (other than a couple of bugfixes).

Obviously, assembler/template interpreter would have to be dup'd across platforms - we can do that for PPC and aarch64, on which we do active development, at least.


I’ll again vote for the simplicity of having a simple change in only one place (OK, two places…).


This isn't a simple change anyway, if we're keeping track of live references.  We have to hook into reference processing - when a weak oop is detected to be dead, we have to delete the metadata.  And we have to change JVMTI.

Jeremy


Re: Low-Overhead Heap Profiling

Kirk Pepperdine-2
In reply to this post by Jeremy Manson-4
Hi Jeremy,

Sorry I wasn’t so clear: it’s not about collection, it’s about allocation. In this regard it’s not about size, it’s about the frequency. People tend to allocate small objects frequently and they will avoid allocating large objects frequently. The assumption is, large is expensive but small isn’t. These events will show up using execution profilers but, given the safe-point bias of execution profilers and other factors, it’s often clearer to view this problem using memory profilers.

Kind regards,
Kirk

On Jun 25, 2015, at 7:34 PM, Jeremy Manson <[hidden email]> wrote:

Why would allocation frequency be more damaging to performance?  Allocation is cheap, and as long as the objects become dead before the YG collection, it costs the same to collect one 1MB object as it does to collect 1000 1K objects.

Jeremy

On Wed, Jun 24, 2015 at 11:54 PM, Kirk Pepperdine <[hidden email]> wrote:


But, seriously, why didn’t you like my proposal? It can do anything your scheme can with fewer and simpler code changes. The only thing that it cannot do is to sample based on object count (i.e., every 100 objects) instead of based on object size (i.e., every 1MB of allocations). But I think doing sampling based on size is the right approach here (IMHO).



I would think that the size based sampling would create a size based bias in your sampling. Since, IME, allocation frequency is more damaging to performance, I’d prefer to see time-boxed sampling

Kind regards,
Kirk Pepperdine


