RFR: JDK-8261238: NMT should not limit baselining by size threshold

classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

RFR: JDK-8261238: NMT should not limit baselining by size threshold

Thomas Stuefe
NMT is a very useful tool to detect memory leaks. Its easy to use, relatively cheap and requires almost zero setup.

But its use for detecting "slow riser" leaks is limited since it omits small leaks in the output, rather arbitrarily. That introduces subtle errors in the output: allocations below a certain threshold do not appear in the output at all; only if the leak rises above the threshold they appear, and then it will seem like the leak suddenly happened. This is an issue for both the absolute stats as well as stats referring to an earlier taken baseline.

There are two thresholds:
1) when collecting baseline information for a report, it omits call sites smaller than `MemBaseline::SIZE_THRESHOLD`, which is hard coded to 1K.
2) when printing summary information, it omits categories whose weight would be less than whatever unit we display the NMT report in. E.g. if the report were for scale=G, we would not see categories allocating less than 1G. Similarly, when printing detail sites, it omits all sites whose size would be less than NMT scale.

Note that setting the scale to 1 at the jcmd line will effectively disable threshold (2) but (1) is still in place.

I propose to remove the threshold (1) completely. This is needed to get accurate baseline diffs - otherwise, a baseline seeing an allocation of 1023 bytes, diff'ed against a later baseline of 1025 bytes, would listed as having a delta of +1025, not +2 as it would be correct.

The limit (2) can be kept in place, but the NMT report should contain a hint about omitted information to reduce confusion.

Footprint costs of omitting the (1) threshold:

According to my measurements, omitting that threshold check increases the cost of a MemBaseLine from today ~60K to ~270K - an increase of ~210K.

A single MemBaseLine object is used while generating the report. If the "baseline" feature is used in jcmd, a second MemBaseLine object is used to hold the baseline. The first MemBaseLine is temporary, the second one permanent, since it is not destroyed.

Therefore, the standard footprint should not be affected at all. If NMT is active *and* someone runs `jcmd VM.native_memory detail`, it will cause about 270K (210K more than today) of temporary allocations. If someone runs `jcmd VM.native_memory baseline`, that increase is not temporary but sticks.

Note that these numbers were taken with from some long running java programs which means the majority of malloc call sites should have been hit. I believe these numbers to be representative. While the program I ran may not have covered all call sites, the total number is bounded and I believe not too far off of what I measured. In fact I was not able to drastically change this number with different runs.

Also note that omitting the (1) threshold only affects baselining malloc call sites. While also used for baselining virtual memory call sites, in practice this has no effect since all of those virtual memory allocations happen at page granularity, which is >= 4K and hence always about the (1) threshold.

Bottomline: I think the footprint increase of is acceptable. It gives us more comprehensive numbers and the ability to scan for small leaks.

Note: should footprint really be an issue, we could take a look at how NMT manages its data. Currently allocation site objects are copied by value at various places: inside the MemBaseline as well as some temporary sorting lists when sorting output. We could at least share the call stack portion of these objects; there is no reason for multiple call stack objects to exist which refer to a single call site. That would reduce the size of MemBaseline objects by about half.

-------------

Commit messages:
 - Start

Changes: https://git.openjdk.java.net/jdk/pull/2428/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=2428&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8261238
  Stats: 64 lines in 4 files changed: 27 ins; 16 del; 21 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2428.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2428/head:pull/2428

PR: https://git.openjdk.java.net/jdk/pull/2428
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8261238: NMT should not limit baselining by size threshold [v2]

Thomas Stuefe
> NMT is a very useful tool to detect memory leaks. Its easy to use, relatively cheap and requires almost zero setup.
>
> But its use for detecting "slow riser" leaks is limited since it omits small leaks in the output, rather arbitrarily. That introduces subtle errors in the output: allocations below a certain threshold do not appear in the output at all; only if the leak rises above the threshold they appear, and then it will seem like the leak suddenly happened. This is an issue for both the absolute stats as well as stats referring to an earlier taken baseline.
>
> There are two thresholds:
> 1) when collecting baseline information for a report, it omits call sites smaller than `MemBaseline::SIZE_THRESHOLD`, which is hard coded to 1K.
> 2) when printing summary information, it omits categories whose weight would be less than whatever unit we display the NMT report in. E.g. if the report were for scale=G, we would not see categories allocating less than 1G. Similarly, when printing detail sites, it omits all sites whose size would be less than NMT scale.
>
> Note that setting the scale to 1 at the jcmd line will effectively disable threshold (2) but (1) is still in place.
>
> I propose to remove the threshold (1) completely. This is needed to get accurate baseline diffs - otherwise, a baseline seeing an allocation of 1023 bytes, diff'ed against a later baseline of 1025 bytes, would listed as having a delta of +1025, not +2 as it would be correct.
>
> The limit (2) can be kept in place, but the NMT report should contain a hint about omitted information to reduce confusion.
>
> Footprint costs of omitting the (1) threshold:
>
> According to my measurements, omitting that threshold check increases the cost of a MemBaseLine from today ~60K to ~270K - an increase of ~210K.
>
> A single MemBaseLine object is used while generating the report. If the "baseline" feature is used in jcmd, a second MemBaseLine object is used to hold the baseline. The first MemBaseLine is temporary, the second one permanent, since it is not destroyed.
>
> Therefore, the standard footprint should not be affected at all. If NMT is active *and* someone runs `jcmd VM.native_memory detail`, it will cause about 270K (210K more than today) of temporary allocations. If someone runs `jcmd VM.native_memory baseline`, that increase is not temporary but sticks.
>
> Note that these numbers were taken with from some long running java programs which means the majority of malloc call sites should have been hit. I believe these numbers to be representative. While the program I ran may not have covered all call sites, the total number is bounded and I believe not too far off of what I measured. In fact I was not able to drastically change this number with different runs.
>
> Also note that omitting the (1) threshold only affects baselining malloc call sites. While also used for baselining virtual memory call sites, in practice this has no effect since all of those virtual memory allocations happen at page granularity, which is >= 4K and hence always about the (1) threshold.
>
> Bottomline: I think the footprint increase of is acceptable. It gives us more comprehensive numbers and the ability to scan for small leaks.
>
> Note: should footprint really be an issue, we could take a look at how NMT manages its data. Currently allocation site objects are copied by value at various places: inside the MemBaseline as well as some temporary sorting lists when sorting output. We could at least share the call stack portion of these objects; there is no reason for multiple call stack objects to exist which refer to a single call site. That would reduce the size of MemBaseline objects by about half.

Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision:

  Tweak reporting of omitted sites; make final report byte-bytes

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/2428/files
  - new: https://git.openjdk.java.net/jdk/pull/2428/files/59807474..6f98dbee

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2428&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2428&range=00-01

  Stats: 44 lines in 4 files changed: 16 ins; 5 del; 23 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2428.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2428/head:pull/2428

PR: https://git.openjdk.java.net/jdk/pull/2428
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8261238: NMT should not limit baselining by size threshold [v3]

Thomas Stuefe
In reply to this post by Thomas Stuefe
> NMT is a very useful tool to detect memory leaks. Its easy to use, relatively cheap and requires almost zero setup.
>
> But its use for detecting "slow riser" leaks is limited since it omits small leaks in the output, rather arbitrarily. That introduces subtle errors in the output: allocations below a certain threshold do not appear in the output at all; only if the leak rises above the threshold they appear, and then it will seem like the leak suddenly happened. This is an issue for both the absolute stats as well as stats referring to an earlier taken baseline.
>
> There are two thresholds:
> 1) when collecting baseline information for a report, it omits call sites smaller than `MemBaseline::SIZE_THRESHOLD`, which is hard coded to 1K.
> 2) when printing summary information, it omits categories whose weight would be less than whatever unit we display the NMT report in. E.g. if the report were for scale=G, we would not see categories allocating less than 1G. Similarly, when printing detail sites, it omits all sites whose size would be less than NMT scale.
>
> Note that setting the scale to 1 at the jcmd line will effectively disable threshold (2) but (1) is still in place.
>
> I propose to remove the threshold (1) completely. This is needed to get accurate baseline diffs - otherwise, a baseline seeing an allocation of 1023 bytes, diff'ed against a later baseline of 1025 bytes, would listed as having a delta of +1025, not +2 as it would be correct.
>
> The limit (2) can be kept in place, but the NMT report should contain a hint about omitted information to reduce confusion.
>
> Footprint costs of omitting the (1) threshold:
>
> According to my measurements, omitting that threshold check increases the cost of a MemBaseLine from today ~60K to ~270K - an increase of ~210K.
>
> A single MemBaseLine object is used while generating the report. If the "baseline" feature is used in jcmd, a second MemBaseLine object is used to hold the baseline. The first MemBaseLine is temporary, the second one permanent, since it is not destroyed.
>
> Therefore, the standard footprint should not be affected at all. If NMT is active *and* someone runs `jcmd VM.native_memory detail`, it will cause about 270K (210K more than today) of temporary allocations. If someone runs `jcmd VM.native_memory baseline`, that increase is not temporary but sticks.
>
> Note that these numbers were taken with from some long running java programs which means the majority of malloc call sites should have been hit. I believe these numbers to be representative. While the program I ran may not have covered all call sites, the total number is bounded and I believe not too far off of what I measured. In fact I was not able to drastically change this number with different runs.
>
> Also note that omitting the (1) threshold only affects baselining malloc call sites. While also used for baselining virtual memory call sites, in practice this has no effect since all of those virtual memory allocations happen at page granularity, which is >= 4K and hence always about the (1) threshold.
>
> Bottomline: I think the footprint increase of is acceptable. It gives us more comprehensive numbers and the ability to scan for small leaks.
>
> Note: should footprint really be an issue, we could take a look at how NMT manages its data. Currently allocation site objects are copied by value at various places: inside the MemBaseline as well as some temporary sorting lists when sorting output. We could at least share the call stack portion of these objects; there is no reason for multiple call stack objects to exist which refer to a single call site. That would reduce the size of MemBaseline objects by about half.

Thomas Stuefe has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision:

  Start

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/2428/files
  - new: https://git.openjdk.java.net/jdk/pull/2428/files/6f98dbee..4d352439

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2428&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2428&range=01-02

  Stats: 31 lines in 3 files changed: 5 ins; 8 del; 18 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2428.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2428/head:pull/2428

PR: https://git.openjdk.java.net/jdk/pull/2428
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8261238: NMT should not limit baselining by size threshold [v3]

Zhengyu Gu-3
On Mon, 8 Feb 2021 09:19:55 GMT, Thomas Stuefe <[hidden email]> wrote:

>> NMT is a very useful tool to detect memory leaks. Its easy to use, relatively cheap and requires almost zero setup.
>>
>> But its use for detecting "slow riser" leaks is limited since it omits small leaks in the output, rather arbitrarily. That introduces subtle errors in the output: allocations below a certain threshold do not appear in the output at all; only if the leak rises above the threshold they appear, and then it will seem like the leak suddenly happened. This is an issue for both the absolute stats as well as stats referring to an earlier taken baseline.
>>
>> There are two thresholds:
>> 1) when collecting baseline information for a report, it omits call sites smaller than `MemBaseline::SIZE_THRESHOLD`, which is hard coded to 1K.
>> 2) when printing summary information, it omits categories whose weight would be less than whatever unit we display the NMT report in. E.g. if the report were for scale=G, we would not see categories allocating less than 1G. Similarly, when printing detail sites, it omits all sites whose size would be less than NMT scale.
>>
>> Note that setting the scale to 1 at the jcmd line will effectively disable threshold (2) but (1) is still in place.
>>
>> I propose to remove the threshold (1) completely. This is needed to get accurate baseline diffs - otherwise, a baseline seeing an allocation of 1023 bytes, diff'ed against a later baseline of 1025 bytes, would listed as having a delta of +1025, not +2 as it would be correct.
>>
>> The limit (2) can be kept in place, but the NMT report should contain a hint about omitted information to reduce confusion.
>>
>> Footprint costs of omitting the (1) threshold:
>>
>> According to my measurements, omitting that threshold check increases the cost of a MemBaseLine from today ~60K to ~270K - an increase of ~210K.
>>
>> A single MemBaseLine object is used while generating the report. If the "baseline" feature is used in jcmd, a second MemBaseLine object is used to hold the baseline. The first MemBaseLine is temporary, the second one permanent, since it is not destroyed.
>>
>> Therefore, the standard footprint should not be affected at all. If NMT is active *and* someone runs `jcmd VM.native_memory detail`, it will cause about 270K (210K more than today) of temporary allocations. If someone runs `jcmd VM.native_memory baseline`, that increase is not temporary but sticks.
>>
>> Note that these numbers were taken with from some long running java programs which means the majority of malloc call sites should have been hit. I believe these numbers to be representative. While the program I ran may not have covered all call sites, the total number is bounded and I believe not too far off of what I measured. In fact I was not able to drastically change this number with different runs.
>>
>> Also note that omitting the (1) threshold only affects baselining malloc call sites. While also used for baselining virtual memory call sites, in practice this has no effect since all of those virtual memory allocations happen at page granularity, which is >= 4K and hence always about the (1) threshold.
>>
>> Bottomline: I think the footprint increase of is acceptable. It gives us more comprehensive numbers and the ability to scan for small leaks.
>>
>> Note: should footprint really be an issue, we could take a look at how NMT manages its data. Currently allocation site objects are copied by value at various places: inside the MemBaseline as well as some temporary sorting lists when sorting output. We could at least share the call stack portion of these objects; there is no reason for multiple call stack objects to exist which refer to a single call site. That would reduce the size of MemBaseline objects by about half.
>
> Thomas Stuefe has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR.

The threshold was put in place purely due to memory consumption concerns.
There was a memory overhead requirement in original document (prior to JEP, Oracle private). If this is no longer a concern, I am good with this change.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2428
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8261238: NMT should not limit baselining by size threshold [v3]

Thomas Stuefe
On Mon, 8 Feb 2021 16:03:59 GMT, Zhengyu Gu <[hidden email]> wrote:

> The threshold was put in place purely due to memory consumption concerns.
> There was a memory overhead requirement in original document (prior to JEP, Oracle private). If this is no longer a concern, I am good with this change.

Great, Zhengyu, thanks!

-------------

PR: https://git.openjdk.java.net/jdk/pull/2428
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8261238: NMT should not limit baselining by size threshold [v3]

David Holmes
On 9/02/2021 2:29 am, Thomas Stuefe wrote:
> On Mon, 8 Feb 2021 16:03:59 GMT, Zhengyu Gu <[hidden email]> wrote:
>
>> The threshold was put in place purely due to memory consumption concerns.
>> There was a memory overhead requirement in original document (prior to JEP, Oracle private). If this is no longer a concern, I am good with this change.
>
> Great, Zhengyu, thanks!

To know if this is still a concern we know to know what the original
concern was.

David

> -------------
>
> PR: https://git.openjdk.java.net/jdk/pull/2428
>
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8261238: NMT should not limit baselining by size threshold [v4]

Thomas Stuefe
In reply to this post by Thomas Stuefe
> NMT is a very useful tool to detect memory leaks. Its easy to use, relatively cheap and requires almost zero setup.
>
> But its use for detecting "slow riser" leaks is limited since it omits small leaks in the output, rather arbitrarily. That introduces subtle errors in the output: allocations below a certain threshold do not appear in the output at all; only if the leak rises above the threshold they appear, and then it will seem like the leak suddenly happened. This is an issue for both the absolute stats as well as stats referring to an earlier taken baseline.
>
> There are two thresholds:
> 1) when collecting baseline information for a report, it omits call sites smaller than `MemBaseline::SIZE_THRESHOLD`, which is hard coded to 1K.
> 2) when printing summary information, it omits categories whose weight would be less than whatever unit we display the NMT report in. E.g. if the report were for scale=G, we would not see categories allocating less than 1G. Similarly, when printing detail sites, it omits all sites whose size would be less than NMT scale.
>
> Note that setting the scale to 1 at the jcmd line will effectively disable threshold (2) but (1) is still in place.
>
> I propose to remove the threshold (1) completely. This is needed to get accurate baseline diffs - otherwise, a baseline seeing an allocation of 1023 bytes, diff'ed against a later baseline of 1025 bytes, would listed as having a delta of +1025, not +2 as it would be correct.
>
> The limit (2) can be kept in place, but the NMT report should contain a hint about omitted information to reduce confusion.
>
> Footprint costs of omitting the (1) threshold:
>
> According to my measurements, omitting that threshold check increases the cost of a MemBaseLine from today ~60K to ~270K - an increase of ~210K.
>
> A single MemBaseLine object is used while generating the report. If the "baseline" feature is used in jcmd, a second MemBaseLine object is used to hold the baseline. The first MemBaseLine is temporary, the second one permanent, since it is not destroyed.
>
> Therefore, the standard footprint should not be affected at all. If NMT is active *and* someone runs `jcmd VM.native_memory detail`, it will cause about 270K (210K more than today) of temporary allocations. If someone runs `jcmd VM.native_memory baseline`, that increase is not temporary but sticks.
>
> Note that these numbers were taken with from some long running java programs which means the majority of malloc call sites should have been hit. I believe these numbers to be representative. While the program I ran may not have covered all call sites, the total number is bounded and I believe not too far off of what I measured. In fact I was not able to drastically change this number with different runs.
>
> Also note that omitting the (1) threshold only affects baselining malloc call sites. While also used for baselining virtual memory call sites, in practice this has no effect since all of those virtual memory allocations happen at page granularity, which is >= 4K and hence always about the (1) threshold.
>
> Bottomline: I think the footprint increase of is acceptable. It gives us more comprehensive numbers and the ability to scan for small leaks.
>
> Note: should footprint really be an issue, we could take a look at how NMT manages its data. Currently allocation site objects are copied by value at various places: inside the MemBaseline as well as some temporary sorting lists when sorting output. We could at least share the call stack portion of these objects; there is no reason for multiple call stack objects to exist which refer to a single call site. That would reduce the size of MemBaseline objects by about half.

Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision:

  Exclude zero-sized regions from baseline

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/2428/files
  - new: https://git.openjdk.java.net/jdk/pull/2428/files/4d352439..41ee3877

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2428&range=03
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2428&range=02-03

  Stats: 20 lines in 1 file changed: 14 ins; 2 del; 4 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2428.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2428/head:pull/2428

PR: https://git.openjdk.java.net/jdk/pull/2428
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8261238: NMT should not limit baselining by size threshold [v3]

Thomas Stuefe
In reply to this post by Thomas Stuefe
On Mon, 8 Feb 2021 16:26:59 GMT, Thomas Stuefe <[hidden email]> wrote:

>> The threshold was put in place purely due to memory consumption concerns.
>> There was a memory overhead requirement in original document (prior to JEP, Oracle private). If this is no longer a concern, I am good with this change.
>
>> The threshold was put in place purely due to memory consumption concerns.
>> There was a memory overhead requirement in original document (prior to JEP, Oracle private). If this is no longer a concern, I am good with this change.
>
> Great, Zhengyu, thanks!

> _Mailing list message from [David Holmes](mailto:[hidden email]) on [hotspot-runtime-dev](mailto:[hidden email]):_
>
> On 9/02/2021 2:29 am, Thomas Stuefe wrote:
>
> > On Mon, 8 Feb 2021 16:03:59 GMT, Zhengyu Gu <zgu at openjdk.org> wrote:
> > > The threshold was put in place purely due to memory consumption concerns.
> > > There was a memory overhead requirement in original document (prior to JEP, Oracle private). If this is no longer a concern, I am good with this change.
> >
> >
> > Great, Zhengyu, thanks!
>
> To know if this is still a concern we know to know what the original
> concern was.
>
> David

Yes, it would be helpful to know what the initial memory requirements were as well as the considerations behind it. NMT tries to be very slim; which is good, but at various places we trade performance for memory footprint (eg. live with a load factor of 5+ in the malloc site table rather than spending an additional 4-12K to broaden the table). Without knowing these requirements and without deciding if they are still valid improving NMT is difficult.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2428
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8261238: NMT should not limit baselining by size threshold [v3]

Thomas Stuefe
On Tue, 9 Feb 2021 10:17:52 GMT, Thomas Stuefe <[hidden email]> wrote:

>>> The threshold was put in place purely due to memory consumption concerns.
>>> There was a memory overhead requirement in original document (prior to JEP, Oracle private). If this is no longer a concern, I am good with this change.
>>
>> Great, Zhengyu, thanks!
>
>> _Mailing list message from [David Holmes](mailto:[hidden email]) on [hotspot-runtime-dev](mailto:[hidden email]):_
>>
>> On 9/02/2021 2:29 am, Thomas Stuefe wrote:
>>
>> > On Mon, 8 Feb 2021 16:03:59 GMT, Zhengyu Gu <zgu at openjdk.org> wrote:
>> > > The threshold was put in place purely due to memory consumption concerns.
>> > > There was a memory overhead requirement in original document (prior to JEP, Oracle private). If this is no longer a concern, I am good with this change.
>> >
>> >
>> > Great, Zhengyu, thanks!
>>
>> To know if this is still a concern we know to know what the original
>> concern was.
>>
>> David
>
> Yes, it would be helpful to know what the initial memory requirements were as well as the considerations behind it. NMT tries to be very slim; which is good, but at various places we trade performance for memory footprint (eg. live with a load factor of 5+ in the malloc site table rather than spending an additional 4-12K to broaden the table). Without knowing these requirements and without deciding if they are still valid improving NMT is difficult.

I am currently working on a patch which reduces cost of baselining: https://bugs.openjdk.java.net/browse/JDK-8261491 which should be more than enough to mitigate the modest footprint increase of this patch.

So, can we get this patch approved please, unless there are other technical considerations I am not seeing?

-------------

PR: https://git.openjdk.java.net/jdk/pull/2428
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8261238: NMT should not limit baselining by size threshold [v5]

Thomas Stuefe
In reply to this post by Thomas Stuefe
> Quick summary:
>
> - patch is technically simple
> - it fixes subtle reporting errors when dealing with low weight allocation sites
> - footprint increases temporarily while generating an NMT report by about 200K (400K when comparing with a baseline) which I argue below is worth the more precise reports
> - Also note that I have changes in work (https://bugs.openjdk.java.net/browse/JDK-8261491) which will reduce baselining footprint and more than negate the footprint increase of this patch. For clarity reasons I would like to keep the issues separate though.
>
> --------
>
> NMT is a very useful tool to detect memory leaks. Its easy to use, relatively cheap and requires almost zero setup.
>
> But its use for detecting "slow riser" leaks is limited since it omits small leaks in the output, rather arbitrarily. That introduces subtle errors in the output: allocations below a certain threshold do not appear in the output at all; only if the leak rises above the threshold they appear, and then it will seem like the leak suddenly happened. This is an issue for both the absolute stats as well as stats referring to an earlier taken baseline.
>
> There are two thresholds:
> 1) when collecting baseline information for a report, it omits call sites smaller than `MemBaseline::SIZE_THRESHOLD`, which is hard coded to 1K.
> 2) when printing summary information, it omits categories whose weight would be less than whatever unit we display the NMT report in. E.g. if the report were for scale=G, we would not see categories allocating less than 1G. Similarly, when printing detail sites, it omits all sites whose size would be less than NMT scale.
>
> Note that setting the scale to 1 at the jcmd line will effectively disable threshold (2) but (1) is still in place.
>
> I propose to remove the threshold (1) completely. This is needed to get accurate baseline diffs - otherwise, a baseline seeing an allocation of 1023 bytes, diff'ed against a later baseline of 1025 bytes, would listed as having a delta of +1025, not +2 as it would be correct.
>
> The limit (2) can be kept in place, but the NMT report should contain a hint about omitted information to reduce confusion.
>
> Footprint costs of omitting the (1) threshold:
>
> According to my measurements, omitting that threshold check increases the cost of a MemBaseLine from today ~60K to ~270K - an increase of ~210K.
>
> A single MemBaseLine object is used while generating the report. If the "baseline" feature is used in jcmd, a second MemBaseLine object is used to hold the baseline. The first MemBaseLine is temporary, the second one permanent, since it is not destroyed.
>
> Therefore, the standard footprint should not be affected at all. If NMT is active *and* someone runs `jcmd VM.native_memory detail`, it will cause about 270K (210K more than today) of temporary allocations. If someone runs `jcmd VM.native_memory baseline`, that increase is not temporary but sticks.
>
> Note that these numbers were taken with from some long running java programs which means the majority of malloc call sites should have been hit. I believe these numbers to be representative. While the program I ran may not have covered all call sites, the total number is bounded and I believe not too far off of what I measured. In fact I was not able to drastically change this number with different runs.
>
> Also note that omitting the (1) threshold only affects baselining malloc call sites. While also used for baselining virtual memory call sites, in practice this has no effect since all of those virtual memory allocations happen at page granularity, which is >= 4K and hence always about the (1) threshold.
>
> Bottomline: I think the footprint increase of is acceptable. It gives us more comprehensive numbers and the ability to scan for small leaks.
>
> Note: should footprint really be an issue, we could take a look at how NMT manages its data. Currently allocation site objects are copied by value at various places: inside the MemBaseline as well as some temporary sorting lists when sorting output. We could at least share the call stack portion of these objects; there is no reason for multiple call stack objects to exist which refer to a single call site. That would reduce the size of MemBaseline objects by about half.

Thomas Stuefe has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision:

 - Merge
 - Split out display changes
 - Exclude zero-sized regions from baseline
 - Start

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/2428/files
  - new: https://git.openjdk.java.net/jdk/pull/2428/files/41ee3877..8df3f4bc

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2428&range=04
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2428&range=03-04

  Stats: 37357 lines in 910 files changed: 23066 ins; 9328 del; 4963 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2428.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2428/head:pull/2428

PR: https://git.openjdk.java.net/jdk/pull/2428
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8261238: NMT should not limit baselining by size threshold [v3]

Thomas Stuefe
In reply to this post by Thomas Stuefe
On Thu, 11 Feb 2021 14:39:34 GMT, Thomas Stuefe <[hidden email]> wrote:

>>> _Mailing list message from [David Holmes](mailto:[hidden email]) on [hotspot-runtime-dev](mailto:[hidden email]):_
>>>
>>> On 9/02/2021 2:29 am, Thomas Stuefe wrote:
>>>
>>> > On Mon, 8 Feb 2021 16:03:59 GMT, Zhengyu Gu <zgu at openjdk.org> wrote:
>>> > > The threshold was put in place purely due to memory consumption concerns.
>>> > > There was a memory overhead requirement in original document (prior to JEP, Oracle private). If this is no longer a concern, I am good with this change.
>>> >
>>> >
>>> > Great, Zhengyu, thanks!
>>>
>>> To know if this is still a concern we know to know what the original
>>> concern was.
>>>
>>> David
>>
>> Yes, it would be helpful to know what the initial memory requirements were as well as the considerations behind it. NMT tries to be very slim; which is good, but at various places we trade performance for memory footprint (eg. live with a load factor of 5+ in the malloc site table rather than spending an additional 4-12K to broaden the table). Without knowing these requirements and without deciding if they are still valid improving NMT is difficult.
>
> I am currently working on a patch which reduces cost of baselining: https://bugs.openjdk.java.net/browse/JDK-8261491 which should be more than enough to mitigate the modest footprint increase of this patch.
>
> So, can we get this patch approved please, unless there are other technical considerations I am not seeing?

I split out non-controversial display changes into JDK-8262165.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2428
Reply | Threaded
Open this post in threaded view
|

Re: RFR: JDK-8261238: NMT should not limit baselining by size threshold [v3]

Thomas Stuefe
On Tue, 23 Feb 2021 07:20:31 GMT, Thomas Stuefe <[hidden email]> wrote:

>> I am currently working on a patch which reduces cost of baselining: https://bugs.openjdk.java.net/browse/JDK-8261491 which should be more than enough to mitigate the modest footprint increase of this patch.
>>
>> So, can we get this patch approved please, unless there are other technical considerations I am not seeing?
>
> I split out non-controversial display changes into JDK-8262165.

Gentle Ping.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2428