RFR: 8261710: SA DSO objects have sizes that are too large

classic Classic list List threaded Threaded
30 messages Options
12
Reply | Threaded
Open this post in threaded view
|

RFR: 8261710: SA DSO objects have sizes that are too large

Yasumasa Suenaga-7
This PR relates to [JDK-8261702](https://bugs.openjdk.java.net/browse/JDK-8261702) ( #2562 )
When SA creates a DSO object, which is used to represent a shared object file (.so), it initializes the "size" to be the size of the shared object file. This usually results in the size being too big. This can cause SA to get confused about whether or not an address is in the shared object. SA should instead set the DSO's size to the amount of the file that is actually mapped for executable code.

-------------

Commit messages:
 - Use entire of address range of shared library
 - 8261710: SA DSO objects have sizes that are too large

Changes: https://git.openjdk.java.net/jdk/pull/2563/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=2563&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8261710
  Stats: 49 lines in 5 files changed: 27 ins; 3 del; 19 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2563.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2563/head:pull/2563

PR: https://git.openjdk.java.net/jdk/pull/2563
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261710: SA DSO objects have sizes that are too large

Chris Plummer-2
On Sun, 14 Feb 2021 07:14:24 GMT, Yasumasa Suenaga <[hidden email]> wrote:

> This PR relates to [JDK-8261702](https://bugs.openjdk.java.net/browse/JDK-8261702) ( #2562 )
> When SA creates a DSO object, which is used to represent a shared object file (.so), it initializes the "size" to be the size of the shared object file. This usually results in the size being too big. This can cause SA to get confused about whether or not an address is in the shared object. SA should instead set the DSO's size to the amount of the file that is actually mapped for executable code.

These changes look good, but it would be nice to fix OSX too. Fortunately it should be easier. As part of the fix for [JDK-8247515](https://bugs.openjdk.java.net/browse/JDK-8247515), I added a `lib_info->memsz` field. I think it is correct and is what you need. I needed it in order to properly scan .dylibs for symbols without scanning too far. It seems to be working. Unfortunately we don't really have a good tests for this, but you could look at the before and after for ClhsdbPmap test to get an idea of if it is working. If you don't have an OSX machine to try, I can try out your changes for you.

BTW, I have no idea if Windows is getting the size right, but I guess we'll just have to assume it is:

    env->CallVoidMethod(obj, addLoadObject_ID, strName, (jlong) params[u].Size,
                        (jlong) params[u].Base);

-------------

PR: https://git.openjdk.java.net/jdk/pull/2563
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261710: SA DSO objects have sizes that are too large

Yasumasa Suenaga-7
On Mon, 15 Feb 2021 06:43:30 GMT, Chris Plummer <[hidden email]> wrote:

>> This PR relates to [JDK-8261702](https://bugs.openjdk.java.net/browse/JDK-8261702) ( #2562 )
>> When SA creates a DSO object, which is used to represent a shared object file (.so), it initializes the "size" to be the size of the shared object file. This usually results in the size being too big. This can cause SA to get confused about whether or not an address is in the shared object. SA should instead set the DSO's size to the amount of the file that is actually mapped for executable code.
>
> These changes look good, but it would be nice to fix OSX too. Fortunately it should be easier. As part of the fix for [JDK-8247515](https://bugs.openjdk.java.net/browse/JDK-8247515), I added a `lib_info->memsz` field. I think it is correct and is what you need. I needed it in order to properly scan .dylibs for symbols without scanning too far. It seems to be working. Unfortunately we don't really have a good tests for this, but you could look at the before and after for ClhsdbPmap test to get an idea of if it is working. If you don't have an OSX machine to try, I can try out your changes for you.
>
> BTW, I have no idea if Windows is getting the size right, but I guess we'll just have to assume it is:
>
>     env->CallVoidMethod(obj, addLoadObject_ID, strName, (jlong) params[u].Size,
>                         (jlong) params[u].Base);

@plummercj Thanks for the review!

> These changes look good, but it would be nice to fix OSX too. Fortunately it should be easier. As part of the fix for [JDK-8247515](https://bugs.openjdk.java.net/browse/JDK-8247515), I added a `lib_info->memsz` field. I think it is correct and is what you need.

AFAICS `lib_info->memsz` is not set to loaded size, it seems to relates to the offset of the symbol in the binary.

https://github.com/openjdk/jdk/blob/34ae7aeb64db90dcb4d2f3d4acb16c714a32824f/src/jdk.hotspot.agent/macosx/native/libsaproc/libproc_impl.c#L261
https://github.com/openjdk/jdk/blob/34ae7aeb64db90dcb4d2f3d4acb16c714a32824f/src/jdk.hotspot.agent/macosx/native/libsaproc/symtab.c#L185

Can I believe `lib_info->memsz` for this purpose?
I'm not familiar of Mach-O, and I don't have Mac, so I can't evaluate it.

> I needed it in order to properly scan .dylibs for symbols without scanning too far. It seems to be working. Unfortunately we don't really have a good tests for this, but you could look at the before and after for ClhsdbPmap test to get an idea of if it is working. If you don't have an OSX machine to try, I can try out your changes for you.

In process attach on Linux, we can test the change with /proc/<PID>/maps, but in other case, we might not test it.
In coredump on Linux, it is difficult because we cannot get memory map from other source (excepts the core).

> BTW, I have no idea if Windows is getting the size right, but I guess we'll just have to assume it is:
>
> ```
>     env->CallVoidMethod(obj, addLoadObject_ID, strName, (jlong) params[u].Size,
>                         (jlong) params[u].Base);
> ```

According to [Microsoft Docs](https://docs.microsoft.com/ja-jp/windows-hardware/drivers/ddi/dbgeng/ns-dbgeng-_debug_module_parameters), it sounds good.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2563
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261710: SA DSO objects have sizes that are too large

Chris Plummer-2
On Tue, 16 Feb 2021 01:37:15 GMT, Yasumasa Suenaga <[hidden email]> wrote:

>> These changes look good, but it would be nice to fix OSX too. Fortunately it should be easier. As part of the fix for [JDK-8247515](https://bugs.openjdk.java.net/browse/JDK-8247515), I added a `lib_info->memsz` field. I think it is correct and is what you need. I needed it in order to properly scan .dylibs for symbols without scanning too far. It seems to be working. Unfortunately we don't really have a good tests for this, but you could look at the before and after for ClhsdbPmap test to get an idea of if it is working. If you don't have an OSX machine to try, I can try out your changes for you.
>>
>> BTW, I have no idea if Windows is getting the size right, but I guess we'll just have to assume it is:
>>
>>     env->CallVoidMethod(obj, addLoadObject_ID, strName, (jlong) params[u].Size,
>>                         (jlong) params[u].Base);
>
> @plummercj Thanks for the review!
>
>> These changes look good, but it would be nice to fix OSX too. Fortunately it should be easier. As part of the fix for [JDK-8247515](https://bugs.openjdk.java.net/browse/JDK-8247515), I added a `lib_info->memsz` field. I think it is correct and is what you need.
>
> AFAICS `lib_info->memsz` is not set to loaded size, it seems to relates to the offset of the symbol in the binary.
>
> https://github.com/openjdk/jdk/blob/34ae7aeb64db90dcb4d2f3d4acb16c714a32824f/src/jdk.hotspot.agent/macosx/native/libsaproc/libproc_impl.c#L261
> https://github.com/openjdk/jdk/blob/34ae7aeb64db90dcb4d2f3d4acb16c714a32824f/src/jdk.hotspot.agent/macosx/native/libsaproc/symtab.c#L185
>
> Can I believe `lib_info->memsz` for this purpose?
> I'm not familiar of Mach-O, and I don't have Mac, so I can't evaluate it.
>
>> I needed it in order to properly scan .dylibs for symbols without scanning too far. It seems to be working. Unfortunately we don't really have a good tests for this, but you could look at the before and after for ClhsdbPmap test to get an idea of if it is working. If you don't have an OSX machine to try, I can try out your changes for you.
>
> In process attach on Linux, we can test the change with /proc/<PID>/maps, but in other case, we might not test it.
> In coredump on Linux, it is difficult because we cannot get memory map from other source (excepts the core).
>
>> BTW, I have no idea if Windows is getting the size right, but I guess we'll just have to assume it is:
>>
>> ```
>>     env->CallVoidMethod(obj, addLoadObject_ID, strName, (jlong) params[u].Size,
>>                         (jlong) params[u].Base);
>> ```
>
> According to [Microsoft Docs](https://docs.microsoft.com/ja-jp/windows-hardware/drivers/ddi/dbgeng/ns-dbgeng-_debug_module_parameters), it sounds good.

https://bugs.openjdk.java.net/browse/JDK-8250750 has some interesting details that I forgot about. Note this is referencing the state of the code before fixing JDK-8250750, which changed how `lib->memsz` is calculated.

> `lib->memsz` comes from the size of the `LC_SEGMENT_64` that the library was discovered in. However, the library can actually span multiple segments. In this case of the vtable address, the address was in the segment that follows the initial `LC_SEGMENT_64`. Because of this `lib->memsz` is too small, resulting in symbol lookups being restricted to addresses that are in the initial segment.

>  The simplest approach to fixing this seems to be locating the largest offset found in the symbol table, round that up to a page boundary, and use it as `lib->memsz`. I've implemented this and it seems to be working.

So it's not perfect, but I think it's good enough for our needs and better than using the file size and ending up with a size that is too big. I think the main way that this could fail is if you use `findpc` on an address that is beyond the last symbol (if it is even possible for there to be memory mapped at an address after the last symbol). In this case `findpc` will just say it doesn't know where the address is rather than saying it is in the dso + some very large offset. For symbols in the dso, they will always be found since lib`->memsz` now covers all symbols.

Also, since I never did figure out how to determine the size of a mach-o symbol, it's possible that the last symbol extends beyond the page boundary. If that is the case and you use `findpc` on and address that is part of the symbol but beyond the page boundary, `findpc` won't find it and say it doesn't know what the address is. So again, not perfect, but this issue can only happen with the last symbol in the dso, and only if the symbol crosses a page boundary, and only if the searched for address is after the page boundary. If you search for the start of the symbol, you'll still find it.

Having said all that, I think maybe we can get the correct size by taking the address of the highest symbol and then looking up the `map_info` that it is in. Then it's just a matter of adding `map->vaddr + map->memsz`. But this assumes that there are no more maps for the dso  after the one that has the last symbol, so maybe still not perfect. I'm not sure any of this extra work is worth it. When I dove into the mach-o support last year to fix some issues, it ended up being a huge rat hole that took way more of my time then I had expected. It was very badly broken, far worse then I first thought, and many things about mach-o core files were a mystery, and still remain so. For example, see https://bugs.openjdk.java.net/browse/JDK-8249779. We never did figure out how to create a link map.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2563
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261710: SA DSO objects have sizes that are too large [v2]

Yasumasa Suenaga-7
In reply to this post by Yasumasa Suenaga-7
> This PR relates to [JDK-8261702](https://bugs.openjdk.java.net/browse/JDK-8261702) ( #2562 )
> When SA creates a DSO object, which is used to represent a shared object file (.so), it initializes the "size" to be the size of the shared object file. This usually results in the size being too big. This can cause SA to get confused about whether or not an address is in the shared object. SA should instead set the DSO's size to the amount of the file that is actually mapped for executable code.

Yasumasa Suenaga has updated the pull request incrementally with one additional commit since the last revision:

  Fix for OS X

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/2563/files
  - new: https://git.openjdk.java.net/jdk/pull/2563/files/9af4e232..b33398ea

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2563&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2563&range=00-01

  Stats: 35 lines in 7 files changed: 19 ins; 2 del; 14 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2563.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2563/head:pull/2563

PR: https://git.openjdk.java.net/jdk/pull/2563
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261710: SA DSO objects have sizes that are too large [v3]

Yasumasa Suenaga-7
In reply to this post by Yasumasa Suenaga-7
> This PR relates to [JDK-8261702](https://bugs.openjdk.java.net/browse/JDK-8261702) ( #2562 )
> When SA creates a DSO object, which is used to represent a shared object file (.so), it initializes the "size" to be the size of the shared object file. This usually results in the size being too big. This can cause SA to get confused about whether or not an address is in the shared object. SA should instead set the DSO's size to the amount of the file that is actually mapped for executable code.

Yasumasa Suenaga has updated the pull request incrementally with one additional commit since the last revision:

  Remove unnecessary code

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/2563/files
  - new: https://git.openjdk.java.net/jdk/pull/2563/files/b33398ea..c9a34723

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2563&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2563&range=01-02

  Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2563.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2563/head:pull/2563

PR: https://git.openjdk.java.net/jdk/pull/2563
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261710: SA DSO objects have sizes that are too large

Yasumasa Suenaga-7
In reply to this post by Chris Plummer-2
On Tue, 16 Feb 2021 02:45:23 GMT, Chris Plummer <[hidden email]> wrote:

>> @plummercj Thanks for the review!
>>
>>> These changes look good, but it would be nice to fix OSX too. Fortunately it should be easier. As part of the fix for [JDK-8247515](https://bugs.openjdk.java.net/browse/JDK-8247515), I added a `lib_info->memsz` field. I think it is correct and is what you need.
>>
>> AFAICS `lib_info->memsz` is not set to loaded size, it seems to relates to the offset of the symbol in the binary.
>>
>> https://github.com/openjdk/jdk/blob/34ae7aeb64db90dcb4d2f3d4acb16c714a32824f/src/jdk.hotspot.agent/macosx/native/libsaproc/libproc_impl.c#L261
>> https://github.com/openjdk/jdk/blob/34ae7aeb64db90dcb4d2f3d4acb16c714a32824f/src/jdk.hotspot.agent/macosx/native/libsaproc/symtab.c#L185
>>
>> Can I believe `lib_info->memsz` for this purpose?
>> I'm not familiar of Mach-O, and I don't have Mac, so I can't evaluate it.
>>
>>> I needed it in order to properly scan .dylibs for symbols without scanning too far. It seems to be working. Unfortunately we don't really have a good tests for this, but you could look at the before and after for ClhsdbPmap test to get an idea of if it is working. If you don't have an OSX machine to try, I can try out your changes for you.
>>
>> In process attach on Linux, we can test the change with /proc/<PID>/maps, but in other case, we might not test it.
>> In coredump on Linux, it is difficult because we cannot get memory map from other source (excepts the core).
>>
>>> BTW, I have no idea if Windows is getting the size right, but I guess we'll just have to assume it is:
>>>
>>> ```
>>>     env->CallVoidMethod(obj, addLoadObject_ID, strName, (jlong) params[u].Size,
>>>                         (jlong) params[u].Base);
>>> ```
>>
>> According to [Microsoft Docs](https://docs.microsoft.com/ja-jp/windows-hardware/drivers/ddi/dbgeng/ns-dbgeng-_debug_module_parameters), it sounds good.
>
> https://bugs.openjdk.java.net/browse/JDK-8250750 has some interesting details that I forgot about. Note this is referencing the state of the code before fixing JDK-8250750, which changed how `lib->memsz` is calculated.
>
>> `lib->memsz` comes from the size of the `LC_SEGMENT_64` that the library was discovered in. However, the library can actually span multiple segments. In this case of the vtable address, the address was in the segment that follows the initial `LC_SEGMENT_64`. Because of this `lib->memsz` is too small, resulting in symbol lookups being restricted to addresses that are in the initial segment.
>
>>  The simplest approach to fixing this seems to be locating the largest offset found in the symbol table, round that up to a page boundary, and use it as `lib->memsz`. I've implemented this and it seems to be working.
>
> So it's not perfect, but I think it's good enough for our needs and better than using the file size and ending up with a size that is too big. I think the main way that this could fail is if you use `findpc` on an address that is beyond the last symbol (if it is even possible for there to be memory mapped at an address after the last symbol). In this case `findpc` will just say it doesn't know where the address is rather than saying it is in the dso + some very large offset. For symbols in the dso, they will always be found since lib`->memsz` now covers all symbols.
>
> Also, since I never did figure out how to determine the size of a mach-o symbol, it's possible that the last symbol extends beyond the page boundary. If that is the case and you use `findpc` on and address that is part of the symbol but beyond the page boundary, `findpc` won't find it and say it doesn't know what the address is. So again, not perfect, but this issue can only happen with the last symbol in the dso, and only if the symbol crosses a page boundary, and only if the searched for address is after the page boundary. If you search for the start of the symbol, you'll still find it.
>
> Having said all that, I think maybe we can get the correct size by taking the address of the highest symbol and then looking up the `map_info` that it is in. Then it's just a matter of adding `map->vaddr + map->memsz`. But this assumes that there are no more maps for the dso  after the one that has the last symbol, so maybe still not perfect. I'm not sure any of this extra work is worth it. When I dove into the mach-o support last year to fix some issues, it ended up being a huge rat hole that took way more of my time then I had expected. It was very badly broken, far worse then I first thought, and many things about mach-o core files were a mystery, and still remain so. For example, see https://bugs.openjdk.java.net/browse/JDK-8249779. We never did figure out how to create a link map.

Thanks for explanation!
I understood Marc-O is complex, and to use `lib->memsz` is reasonable as you said.

I updated patch for macOS. I cannot test it, but all of tier 1 servicerbility tests were passed on GitHub Actions. Could you review again?

-------------

PR: https://git.openjdk.java.net/jdk/pull/2563
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261710: SA DSO objects have sizes that are too large

Daniel D.Daugherty
On Tue, 16 Feb 2021 12:19:38 GMT, Yasumasa Suenaga <[hidden email]> wrote:

>> https://bugs.openjdk.java.net/browse/JDK-8250750 has some interesting details that I forgot about. Note this is referencing the state of the code before fixing JDK-8250750, which changed how `lib->memsz` is calculated.
>>
>>> `lib->memsz` comes from the size of the `LC_SEGMENT_64` that the library was discovered in. However, the library can actually span multiple segments. In this case of the vtable address, the address was in the segment that follows the initial `LC_SEGMENT_64`. Because of this `lib->memsz` is too small, resulting in symbol lookups being restricted to addresses that are in the initial segment.
>>
>>>  The simplest approach to fixing this seems to be locating the largest offset found in the symbol table, round that up to a page boundary, and use it as `lib->memsz`. I've implemented this and it seems to be working.
>>
>> So it's not perfect, but I think it's good enough for our needs and better than using the file size and ending up with a size that is too big. I think the main way that this could fail is if you use `findpc` on an address that is beyond the last symbol (if it is even possible for there to be memory mapped at an address after the last symbol). In this case `findpc` will just say it doesn't know where the address is rather than saying it is in the dso + some very large offset. For symbols in the dso, they will always be found since lib`->memsz` now covers all symbols.
>>
>> Also, since I never did figure out how to determine the size of a mach-o symbol, it's possible that the last symbol extends beyond the page boundary. If that is the case and you use `findpc` on and address that is part of the symbol but beyond the page boundary, `findpc` won't find it and say it doesn't know what the address is. So again, not perfect, but this issue can only happen with the last symbol in the dso, and only if the symbol crosses a page boundary, and only if the searched for address is after the page boundary. If you search for the start of the symbol, you'll still find it.
>>
>> Having said all that, I think maybe we can get the correct size by taking the address of the highest symbol and then looking up the `map_info` that it is in. Then it's just a matter of adding `map->vaddr + map->memsz`. But this assumes that there are no more maps for the dso  after the one that has the last symbol, so maybe still not perfect. I'm not sure any of this extra work is worth it. When I dove into the mach-o support last year to fix some issues, it ended up being a huge rat hole that took way more of my time then I had expected. It was very badly broken, far worse then I first thought, and many things about mach-o core files were a mystery, and still remain so. For example, see https://bugs.openjdk.java.net/browse/JDK-8249779. We never did figure out how to create a link map.
>
> Thanks for explanation!
> I understood Marc-O is complex, and to use `lib->memsz` is reasonable as you said.
>
> I updated patch for macOS. I cannot test it, but all of tier 1 servicerbility tests were passed on GitHub Actions. Could you review again?

While I was characterizing:

JDK-8261844 serviceability/sa/ClhsdbFindPC.java#id1 failed with "'In code in NMethod for LingeredAppWithTrivialMain.main' missing from stdout/stderr"

I determined that I could easily reproduce serviceability/sa/ClhsdbFindPC.java failures
on my Ubuntu 16.04 test machine. I've applied this version of this PR at @plummercj's
request:

$ git log HEAD^!
commit c9a34723e34550301e1b5fd06185e0b0977da78e (HEAD -> pull/2563)
Author: Yasumasa Suenaga <[hidden email]>
Date:   Tue Feb 16 18:05:43 2021 +0900

    Remove unnecessary code

and retested serviceability/sa/ClhsdbFindPC.java:

$ do_java_test serviceability/sa/ClhsdbFindPC.java 2>&1 | tee -a do_java_test.8261710.log
INFO: GNUMAKE=make
INFO: GNUMAKE version is: GNU Make 4.1

INFO: JTREG options:
INFO:   JOBS=16
INFO:   TEST_MODE=othervm
INFO:   EXTRA_PROBLEM_LISTS=ProblemList-extra.txt
INFO:   VM_OPTIONS=
INFO: test_val=serviceability/sa/ClhsdbFindPC.java
Test Config: linux-x86_64-normal-server-release
    INFO: TIMEOUT_FACTOR=4
    Done testing
    Test Run linux-x86_64-normal-server-release time: 0.55 minutes.

    TEST                                              TOTAL  PASS  FAIL ERROR
    jtreg:open/test/hotspot/jtreg/serviceability/sa/ClhsdbFindPC.java
    >>                                                       4     3     1     0 <<

    1 failure(s) found in log=do_java_test.linux-x86_64-normal-server-release.log

    TEST: serviceability/sa/ClhsdbFindPC.java#id3
    ERROR: Failed to instantiate timeout handler: jdk.test.failurehandler.jtreg.GatherProcessInfoTimeoutHandler: file does not exist


Test Config: linux-x86_64-normal-server-fastdebug
    INFO: TIMEOUT_FACTOR=6
    Done testing
    Test Run linux-x86_64-normal-server-fastdebug time: 0.87 minutes.

    TEST                                              TOTAL  PASS  FAIL ERROR
    jtreg:open/test/hotspot/jtreg/serviceability/sa/ClhsdbFindPC.java
                                                          4     4     0     0

Test Config: linux-x86_64-normal-server-slowdebug
    INFO: TIMEOUT_FACTOR=12
    Done testing
    Test Run linux-x86_64-normal-server-slowdebug time: 2.97 minutes.

    TEST                                              TOTAL  PASS  FAIL ERROR
    jtreg:open/test/hotspot/jtreg/serviceability/sa/ClhsdbFindPC.java
    >>                                                       4     3     1     0 <<

    1 failure(s) found in log=do_java_test.linux-x86_64-normal-server-slowdebug.log

    TEST: serviceability/sa/ClhsdbFindPC.java#id3
    LOG: build/linux-x86_64-normal-server-slowdebug/test-support/jtreg_open_test_hotspot_jtreg_serviceability_sa_ClhsdbFindPC_java/serviceability/sa/ClhsdbFindPC_id3.jtr
    Saving build/linux-x86_64-normal-server-slowdebug/test-support/jtreg_open_test_hotspot_jtreg_serviceability_sa_ClhsdbFindPC_java/serviceability/sa/ClhsdbFindPC_id3.jtr as /work/shared/bug_hunt/8261844_for_jdk17.git/test_failures.2021-02-16-171036/ClhsdbFindPC_id3.jtr.slowdebug
    Saving /work/shared/bug_hunt/8261844_for_jdk17.git/build/linux-x86_64-normal-server-slowdebug/test-support/jtreg_open_test_hotspot_jtreg_serviceability_sa_ClhsdbFindPC_java/serviceability/sa/ClhsdbFindPC_id3/hs_err_pid16470.log as /work/shared/bug_hunt/8261844_for_jdk17.git/test_failures.2021-02-16-171036/hs_err_pid16470.log
    Moving /work/shared/bug_hunt/8261844_for_jdk17.git/build/linux-x86_64-normal-server-slowdebug/test-support/jtreg_open_test_hotspot_jtreg_serviceability_sa_ClhsdbFindPC_java/serviceability/sa/ClhsdbFindPC_id3/core to /work/shared/bug_hunt/8261844_for_jdk17.git/test_failures.2021-02-16-171036/core.16470


Total test time: 4.40 minutes.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2563
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261710: SA DSO objects have sizes that are too large

Chris Plummer-2
In reply to this post by Yasumasa Suenaga-7
On Tue, 16 Feb 2021 12:19:38 GMT, Yasumasa Suenaga <[hidden email]> wrote:

> I updated patch for macOS. I cannot test it, but all of tier 1 servicerbility tests were passed on GitHub Actions. Could you review again?

The OSX changes look good and passed my testing. However as Dan indicated above, it seems with your changes in place the original problem is still being reproduced. Note the original problem never reproduced with the testing I've been doing, so my testing doesn't prove that you fixed the issue. However, Dan is running on a local host that does reproduce the issue pretty reliably, and seems to still reproduce it after your fixes.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2563
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261710: SA DSO objects have sizes that are too large

Chris Plummer-2
On Tue, 16 Feb 2021 22:50:58 GMT, Chris Plummer <[hidden email]> wrote:

>> Thanks for explanation!
>> I understood Marc-O is complex, and to use `lib->memsz` is reasonable as you said.
>>
>> I updated patch for macOS. I cannot test it, but all of tier 1 servicerbility tests were passed on GitHub Actions. Could you review again?
>
>> I updated patch for macOS. I cannot test it, but all of tier 1 servicerbility tests were passed on GitHub Actions. Could you review again?
>
> The OSX changes look good and passed my testing. However as Dan indicated above, it seems with your changes in place the original problem is still being reproduced. Note the original problem never reproduced with the testing I've been doing, so my testing doesn't prove that you fixed the issue. However, Dan is running on a local host that does reproduce the issue pretty reliably, and seems to still reproduce it after your fixes.

Dan included the logs of the failures in the CR. This is what they show:

  + findpc 0x00002b77f084d116
Address 0x00002b77f084d116: /lib/x86_64-linux-gnu/libnss_files.so.2 + 0x21b116
...
java.lang.RuntimeException: Test ERROR java.lang.RuntimeException: 'In interpreter codelet' missing from stdout/stderr

-------------

PR: https://git.openjdk.java.net/jdk/pull/2563
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261710: SA DSO objects have sizes that are too large

Chris Plummer-2
On Tue, 16 Feb 2021 22:51:34 GMT, Chris Plummer <[hidden email]> wrote:

>>> I updated patch for macOS. I cannot test it, but all of tier 1 servicerbility tests were passed on GitHub Actions. Could you review again?
>>
>> The OSX changes look good and passed my testing. However as Dan indicated above, it seems with your changes in place the original problem is still being reproduced. Note the original problem never reproduced with the testing I've been doing, so my testing doesn't prove that you fixed the issue. However, Dan is running on a local host that does reproduce the issue pretty reliably, and seems to still reproduce it after your fixes.
>
> Dan included the logs of the failures in the CR. This is what they show:
>
>   + findpc 0x00002b77f084d116
> Address 0x00002b77f084d116: /lib/x86_64-linux-gnu/libnss_files.so.2 + 0x21b116
> ...
> java.lang.RuntimeException: Test ERROR java.lang.RuntimeException: 'In interpreter codelet' missing from stdout/stderr

I should add that the failures Dan is seeing are with #id3, which is no -Xcomp, but with a core file instead of a live process. With -Xcomp this part of the test is not run, so possibly this is just an issue with the dso size calculation for core files, and works correctly with a live process.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2563
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261710: SA DSO objects have sizes that are too large

Chris Plummer-2
On Tue, 16 Feb 2021 23:28:28 GMT, Chris Plummer <[hidden email]> wrote:

>> Dan included the logs of the failures in the CR. This is what they show:
>>
>>   + findpc 0x00002b77f084d116
>> Address 0x00002b77f084d116: /lib/x86_64-linux-gnu/libnss_files.so.2 + 0x21b116
>> ...
>> java.lang.RuntimeException: Test ERROR java.lang.RuntimeException: 'In interpreter codelet' missing from stdout/stderr
>
> I should add that the failures Dan is seeing are with #id3, which is no -Xcomp, but with a core file instead of a live process. With -Xcomp this part of the test is not run, so possibly this is just an issue with the dso size calculation for core files, and works correctly with a live process.

If you run ClhsdbPmap.java, you can see pmap output for both core and live processes. The sizes of the maps are very large for both of them, and actually a bit bigger with the live process. Here's the output for a live process:

0x000014755360c000 4048K /usr/lib64/libnss_sss.so.2
0x0000147553815000 4012K /usr/lib64/libnss_files-2.17.so
0x0000147560208000 4064K /usr/lib64/libm-2.17.so
0x000014756050a000 3032K /usr/lib64/librt-2.17.so
0x0000147560712000 32892K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/lib/server/libjvm.so
0x0000147562731000 4924K /usr/lib64/libc-2.17.so
0x0000147562aff000 3076K /usr/lib64/libdl-2.17.so
0x0000147562d03000 3060K /usr/lib64/libpthread-2.17.so
0x0000147562f1f000 2948K /usr/lib64/libz.so.1.2.7
0x0000147563135000 2860K /usr/lib64/ld-2.17.so
0x0000147563164000 92K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/lib/libnet.so
0x000014756317b000 80K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/lib/libnio.so
0x00001475631e0000 156K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/lib/libjava.so
0x0000147563207000 128K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/lib/libjimage.so
0x000014756332c000 68K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/lib/libjli.so
0x0000563c950bf000 16K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/bin/java
`/usr/lib64/libnss_files-2.17.so` is the one that turned up in the test failure. It's only a 68k file but has a 4064k map. It's second in the list. I'm not sure if this is the order we would always see on linux systems. My assumption was it was the library at the highest address that was causing the problem, and that the inteprerter was located right after it, but that might not be the case.

The address in the interpreter that we are doing findpc on turned up at `libnss_files.so.2 + 0x21b116`, or at an offset of 2200k. I added a "pmap" command to ClhsdbFindPC, and from my test runs the interpreter seemed to alway be just before the first library. However, maybe on some systems it is intermixed with the libraries.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2563
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261710: SA DSO objects have sizes that are too large [v4]

Yasumasa Suenaga-7
In reply to this post by Yasumasa Suenaga-7
> This PR relates to [JDK-8261702](https://bugs.openjdk.java.net/browse/JDK-8261702) ( #2562 )
> When SA creates a DSO object, which is used to represent a shared object file (.so), it initializes the "size" to be the size of the shared object file. This usually results in the size being too big. This can cause SA to get confused about whether or not an address is in the shared object. SA should instead set the DSO's size to the amount of the file that is actually mapped for executable code.

Yasumasa Suenaga has updated the pull request incrementally with one additional commit since the last revision:

  Use p_filesz instead of p_memsz

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/2563/files
  - new: https://git.openjdk.java.net/jdk/pull/2563/files/c9a34723..033e096e

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2563&range=03
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2563&range=02-03

  Stats: 28 lines in 1 file changed: 10 ins; 10 del; 8 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2563.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2563/head:pull/2563

PR: https://git.openjdk.java.net/jdk/pull/2563
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261710: SA DSO objects have sizes that are too large

Yasumasa Suenaga-7
In reply to this post by Chris Plummer-2
On Wed, 17 Feb 2021 00:13:02 GMT, Chris Plummer <[hidden email]> wrote:

>> I should add that the failures Dan is seeing are with #id3, which is no -Xcomp, but with a core file instead of a live process. With -Xcomp this part of the test is not run, so possibly this is just an issue with the dso size calculation for core files, and works correctly with a live process.
>
> If you run ClhsdbPmap.java, you can see pmap output for both core and live processes. The sizes of the maps are very large for both of them, and actually a bit bigger with the live process. Here's the output for a live process:
>
> 0x000014755360c000 4048K /usr/lib64/libnss_sss.so.2
> 0x0000147553815000 4012K /usr/lib64/libnss_files-2.17.so
> 0x0000147560208000 4064K /usr/lib64/libm-2.17.so
> 0x000014756050a000 3032K /usr/lib64/librt-2.17.so
> 0x0000147560712000 32892K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/lib/server/libjvm.so
> 0x0000147562731000 4924K /usr/lib64/libc-2.17.so
> 0x0000147562aff000 3076K /usr/lib64/libdl-2.17.so
> 0x0000147562d03000 3060K /usr/lib64/libpthread-2.17.so
> 0x0000147562f1f000 2948K /usr/lib64/libz.so.1.2.7
> 0x0000147563135000 2860K /usr/lib64/ld-2.17.so
> 0x0000147563164000 92K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/lib/libnet.so
> 0x000014756317b000 80K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/lib/libnio.so
> 0x00001475631e0000 156K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/lib/libjava.so
> 0x0000147563207000 128K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/lib/libjimage.so
> 0x000014756332c000 68K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/lib/libjli.so
> 0x0000563c950bf000 16K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/bin/java
> `/usr/lib64/libnss_files-2.17.so` is the one that turned up in the test failure. It's only a 68k file but has a 4064k map. It's second in the list. I'm not sure if this is the order we would always see on linux systems. My assumption was it was the library at the highest address that was causing the problem, and that the inteprerter was located right after it, but that might not be the case.
>
> The address in the interpreter that we are doing findpc on turned up at `libnss_files.so.2 + 0x21b116`, or at an offset of 2200k. I added a "pmap" command to ClhsdbFindPC, and from my test runs the interpreter seemed to alway be just before the first library. However, maybe on some systems it is intermixed with the libraries.

I pushed new change to use `ELF_PHDR.p_filesz` instead of `p_memsz`. It almost works fine, but it is not perfect solution.
For example, let's consider for libnss_sss (provided by Fedora 33) - `/proc/<PID>/maps` shows libnss as following. There are 5 segments.

7f0ba6ec5000-7f0ba6ec7000 r--p 00000000 08:03 340133                     /usr/lib64/libnss_sss.so.2
7f0ba6ec7000-7f0ba6ece000 r-xp 00002000 08:03 340133                     /usr/lib64/libnss_sss.so.2
7f0ba6ece000-7f0ba6ed0000 r--p 00009000 08:03 340133                     /usr/lib64/libnss_sss.so.2
7f0ba6ed0000-7f0ba6ed1000 r--p 0000a000 08:03 340133                     /usr/lib64/libnss_sss.so.2
7f0ba6ed1000-7f0ba6ed2000 rw-p 0000b000 08:03 340133                     /usr/lib64/libnss_sss.so.2

However I could see only 4 segments in libnss_sss.so when I ran `readelf -l /usr/lib64/libnss_sss.so.2`:

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000001468 0x0000000000001468  R      0x1000
  LOAD           0x0000000000002000 0x0000000000002000 0x0000000000002000
                 0x0000000000006931 0x0000000000006931  R E    0x1000
  LOAD           0x0000000000009000 0x0000000000009000 0x0000000000009000
                 0x0000000000001110 0x0000000000001110  R      0x1000
  LOAD           0x000000000000ac78 0x000000000000bc78 0x000000000000bc78
                 0x000000000000044c 0x0000000000000658  RW     0x1000

Linux kernel seems to separate final segment (0xbc78) into RO and RW segments when it attempts to load shared library. (but I'm not sure)

I think we need to refactor handling shared libraries in other ways.

For live process, we can use `/proc/<PID>/maps`.
For coredump, we can use `NT_FILE` in note section in corefile, It has valid segments as below.

$ readelf -n core
  :
    0x00007f0ba6ec5000  0x00007f0ba6ec7000  0x0000000000000000

    0x00007f0ba6ec7000  0x00007f0ba6ece000  0x0000000000000002

    0x00007f0ba6ece000  0x00007f0ba6ed0000  0x0000000000000009

    0x00007f0ba6ed0000  0x00007f0ba6ed1000  0x000000000000000a

    0x00007f0ba6ed1000  0x00007f0ba6ed2000  0x000000000000000b


But they makes big change to SA.
As an option, we can integrate this change at first, then we will refactor them.
What do you think?
(I want to resolve this problem with smaller fix if I can of course, so another solutions are welcome)

-------------

PR: https://git.openjdk.java.net/jdk/pull/2563
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261710: SA DSO objects have sizes that are too large

Chris Plummer-2
On Wed, 17 Feb 2021 06:46:27 GMT, Yasumasa Suenaga <[hidden email]> wrote:

> Linux kernel seems to separate final segment (0xbc78) into RO and RW segments when it attempts to load shared library. (but I'm not sure)

I know I've seen this brought up before, if not with SA then with hotspot itself. Maybe it was CDS or NMT. Or maybe it has something to do with how hotspot generates core dumps (I seem to recall there are some bits you can set somewhere that determine what does or does not get dumped to the core).

In any case, I think the main issue it causes for you is that your rounding up the size of the last (4th) segment may not enough. I think in most cases you would need to round it up to a page boundary, and then add another page to it. Consider a page as 4k and let's say the segment size is 2k, but it is split into two 1k segments. Each of those segment would take one page, so a total of 8k. But if you round 2k up to a page boundary you only get to 4k. You need to add another page to that. I think the only time you don't want to add an extra page to the size is if one of the segment's size was already page aligned. However, you have know way of knowing if that will be the case, unless you can determine the RO and RW sizes of the segment.

> But they makes big change to SA.
As an option, we can integrate this change at first, then we will refactor them.
What do you think?

Whatever works for you. I'm actually not too concerned about getting this right, because with my PointerFinder workaround I don't think this issue with the map sizes has much impact on SA. Probably the only place it will show up is with SA pmap output.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2563
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261710: SA DSO objects have sizes that are too large

Chris Plummer-2
In reply to this post by Yasumasa Suenaga-7
On Wed, 17 Feb 2021 06:46:27 GMT, Yasumasa Suenaga <[hidden email]> wrote:

>> If you run ClhsdbPmap.java, you can see pmap output for both core and live processes. The sizes of the maps are very large for both of them, and actually a bit bigger with the live process. Here's the output for a live process:
>>
>> 0x000014755360c000 4048K /usr/lib64/libnss_sss.so.2
>> 0x0000147553815000 4012K /usr/lib64/libnss_files-2.17.so
>> 0x0000147560208000 4064K /usr/lib64/libm-2.17.so
>> 0x000014756050a000 3032K /usr/lib64/librt-2.17.so
>> 0x0000147560712000 32892K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/lib/server/libjvm.so
>> 0x0000147562731000 4924K /usr/lib64/libc-2.17.so
>> 0x0000147562aff000 3076K /usr/lib64/libdl-2.17.so
>> 0x0000147562d03000 3060K /usr/lib64/libpthread-2.17.so
>> 0x0000147562f1f000 2948K /usr/lib64/libz.so.1.2.7
>> 0x0000147563135000 2860K /usr/lib64/ld-2.17.so
>> 0x0000147563164000 92K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/lib/libnet.so
>> 0x000014756317b000 80K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/lib/libnio.so
>> 0x00001475631e0000 156K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/lib/libjava.so
>> 0x0000147563207000 128K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/lib/libjimage.so
>> 0x000014756332c000 68K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/lib/libjli.so
>> 0x0000563c950bf000 16K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/bin/java
>> `/usr/lib64/libnss_files-2.17.so` is the one that turned up in the test failure. It's only a 68k file but has a 4064k map. It's second in the list. I'm not sure if this is the order we would always see on linux systems. My assumption was it was the library at the highest address that was causing the problem, and that the inteprerter was located right after it, but that might not be the case.
>>
>> The address in the interpreter that we are doing findpc on turned up at `libnss_files.so.2 + 0x21b116`, or at an offset of 2200k. I added a "pmap" command to ClhsdbFindPC, and from my test runs the interpreter seemed to alway be just before the first library. However, maybe on some systems it is intermixed with the libraries.
>
> I pushed new change to use `ELF_PHDR.p_filesz` instead of `p_memsz`. It almost works fine, but it is not perfect solution.
> For example, let's consider for libnss_sss (provided by Fedora 33) - `/proc/<PID>/maps` shows libnss as following. There are 5 segments.
>
> 7f0ba6ec5000-7f0ba6ec7000 r--p 00000000 08:03 340133                     /usr/lib64/libnss_sss.so.2
> 7f0ba6ec7000-7f0ba6ece000 r-xp 00002000 08:03 340133                     /usr/lib64/libnss_sss.so.2
> 7f0ba6ece000-7f0ba6ed0000 r--p 00009000 08:03 340133                     /usr/lib64/libnss_sss.so.2
> 7f0ba6ed0000-7f0ba6ed1000 r--p 0000a000 08:03 340133                     /usr/lib64/libnss_sss.so.2
> 7f0ba6ed1000-7f0ba6ed2000 rw-p 0000b000 08:03 340133                     /usr/lib64/libnss_sss.so.2
>
> However I could see only 4 segments in libnss_sss.so when I ran `readelf -l /usr/lib64/libnss_sss.so.2`:
>
> Program Headers:
>   Type           Offset             VirtAddr           PhysAddr
>                  FileSiz            MemSiz              Flags  Align
>   LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
>                  0x0000000000001468 0x0000000000001468  R      0x1000
>   LOAD           0x0000000000002000 0x0000000000002000 0x0000000000002000
>                  0x0000000000006931 0x0000000000006931  R E    0x1000
>   LOAD           0x0000000000009000 0x0000000000009000 0x0000000000009000
>                  0x0000000000001110 0x0000000000001110  R      0x1000
>   LOAD           0x000000000000ac78 0x000000000000bc78 0x000000000000bc78
>                  0x000000000000044c 0x0000000000000658  RW     0x1000
>
> Linux kernel seems to separate final segment (0xbc78) into RO and RW segments when it attempts to load shared library. (but I'm not sure)
>
> I think we need to refactor handling shared libraries in other ways.
>
> For live process, we can use `/proc/<PID>/maps`.
> For coredump, we can use `NT_FILE` in note section in corefile, It has valid segments as below.
>
> $ readelf -n core
>   :
>     0x00007f0ba6ec5000  0x00007f0ba6ec7000  0x0000000000000000
>
>     0x00007f0ba6ec7000  0x00007f0ba6ece000  0x0000000000000002
>
>     0x00007f0ba6ece000  0x00007f0ba6ed0000  0x0000000000000009
>
>     0x00007f0ba6ed0000  0x00007f0ba6ed1000  0x000000000000000a
>
>     0x00007f0ba6ed1000  0x00007f0ba6ed2000  0x000000000000000b
>
>
> But they makes big change to SA.
> As an option, we can integrate this change at first, then we will refactor them.
> What do you think?
> (I want to resolve this problem with smaller fix if I can of course, so another solutions are welcome)

@YaSuenag https://bugs.openjdk.java.net/browse/JDK-8250826 is the bug I was thinking of that sounds like the RO/RW issue you were talking about.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2563
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261710: SA DSO objects have sizes that are too large

Chris Plummer-2
In reply to this post by Yasumasa Suenaga-7
On Wed, 17 Feb 2021 06:46:27 GMT, Yasumasa Suenaga <[hidden email]> wrote:

>> If you run ClhsdbPmap.java, you can see pmap output for both core and live processes. The sizes of the maps are very large for both of them, and actually a bit bigger with the live process. Here's the output for a live process:
>>
>> 0x000014755360c000 4048K /usr/lib64/libnss_sss.so.2
>> 0x0000147553815000 4012K /usr/lib64/libnss_files-2.17.so
>> 0x0000147560208000 4064K /usr/lib64/libm-2.17.so
>> 0x000014756050a000 3032K /usr/lib64/librt-2.17.so
>> 0x0000147560712000 32892K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/lib/server/libjvm.so
>> 0x0000147562731000 4924K /usr/lib64/libc-2.17.so
>> 0x0000147562aff000 3076K /usr/lib64/libdl-2.17.so
>> 0x0000147562d03000 3060K /usr/lib64/libpthread-2.17.so
>> 0x0000147562f1f000 2948K /usr/lib64/libz.so.1.2.7
>> 0x0000147563135000 2860K /usr/lib64/ld-2.17.so
>> 0x0000147563164000 92K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/lib/libnet.so
>> 0x000014756317b000 80K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/lib/libnio.so
>> 0x00001475631e0000 156K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/lib/libjava.so
>> 0x0000147563207000 128K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/lib/libjimage.so
>> 0x000014756332c000 68K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/lib/libjli.so
>> 0x0000563c950bf000 16K /scratch/cplummer/ws/jdk/jdk.clean/build/linux-x64-debug/images/jdk/bin/java
>> `/usr/lib64/libnss_files-2.17.so` is the one that turned up in the test failure. It's only a 68k file but has a 4064k map. It's second in the list. I'm not sure if this is the order we would always see on linux systems. My assumption was it was the library at the highest address that was causing the problem, and that the inteprerter was located right after it, but that might not be the case.
>>
>> The address in the interpreter that we are doing findpc on turned up at `libnss_files.so.2 + 0x21b116`, or at an offset of 2200k. I added a "pmap" command to ClhsdbFindPC, and from my test runs the interpreter seemed to alway be just before the first library. However, maybe on some systems it is intermixed with the libraries.
>
> I pushed new change to use `ELF_PHDR.p_filesz` instead of `p_memsz`. It almost works fine, but it is not perfect solution.
> For example, let's consider for libnss_sss (provided by Fedora 33) - `/proc/<PID>/maps` shows libnss as following. There are 5 segments.
>
> 7f0ba6ec5000-7f0ba6ec7000 r--p 00000000 08:03 340133                     /usr/lib64/libnss_sss.so.2
> 7f0ba6ec7000-7f0ba6ece000 r-xp 00002000 08:03 340133                     /usr/lib64/libnss_sss.so.2
> 7f0ba6ece000-7f0ba6ed0000 r--p 00009000 08:03 340133                     /usr/lib64/libnss_sss.so.2
> 7f0ba6ed0000-7f0ba6ed1000 r--p 0000a000 08:03 340133                     /usr/lib64/libnss_sss.so.2
> 7f0ba6ed1000-7f0ba6ed2000 rw-p 0000b000 08:03 340133                     /usr/lib64/libnss_sss.so.2
>
> However I could see only 4 segments in libnss_sss.so when I ran `readelf -l /usr/lib64/libnss_sss.so.2`:
>
> Program Headers:
>   Type           Offset             VirtAddr           PhysAddr
>                  FileSiz            MemSiz              Flags  Align
>   LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
>                  0x0000000000001468 0x0000000000001468  R      0x1000
>   LOAD           0x0000000000002000 0x0000000000002000 0x0000000000002000
>                  0x0000000000006931 0x0000000000006931  R E    0x1000
>   LOAD           0x0000000000009000 0x0000000000009000 0x0000000000009000
>                  0x0000000000001110 0x0000000000001110  R      0x1000
>   LOAD           0x000000000000ac78 0x000000000000bc78 0x000000000000bc78
>                  0x000000000000044c 0x0000000000000658  RW     0x1000
>
> Linux kernel seems to separate final segment (0xbc78) into RO and RW segments when it attempts to load shared library. (but I'm not sure)
>
> I think we need to refactor handling shared libraries in other ways.
>
> For live process, we can use `/proc/<PID>/maps`.
> For coredump, we can use `NT_FILE` in note section in corefile, It has valid segments as below.
>
> $ readelf -n core
>   :
>     0x00007f0ba6ec5000  0x00007f0ba6ec7000  0x0000000000000000
>
>     0x00007f0ba6ec7000  0x00007f0ba6ece000  0x0000000000000002
>
>     0x00007f0ba6ece000  0x00007f0ba6ed0000  0x0000000000000009
>
>     0x00007f0ba6ed0000  0x00007f0ba6ed1000  0x000000000000000a
>
>     0x00007f0ba6ed1000  0x00007f0ba6ed2000  0x000000000000000b
>
>
> But they makes big change to SA.
> As an option, we can integrate this change at first, then we will refactor them.
> What do you think?
> (I want to resolve this problem with smaller fix if I can of course, so another solutions are welcome)

@YaSuenag I asked Dan to run a modified `ClhsdbFindPC` that also issues a `clhsdb pmap` command so we can see what the maps look like, and compare them to the address being looked up. This is before your latest fix, so the the sizes are still too big, but that's ok for this analysis. First, this is the `findpc` command that was suppose to show the address in the interpreter:

hsdb> + findpc 0x00002ab36ca942b6
Address 0x00002ab36ca942b6: /lib/x86_64-linux-gnu/libnss_files.so.2 + 0x21b2b6

And here's the pmap output . I had to manually sort by address, and I also added the location of the interpreter address being looked up.

0x00005652c8fd0000 16K <jdkdir>/jdk/bin/java
0x00002ab3692ae000 3400K /lib64/ld-linux-x86-64.so.2
0x00002ab3692e0000 12K <jdkdir>/test/hotspot/jtreg/native/libLingeredApp.so
0x00002ab3692ed000 84K <jdkdir>/jdk/bin/../lib/libjli.so
0x00002ab369406000 144K <jdkdir>/jdk/lib/libjimage.so
0x00002ab36942a000 200K <jdkdir>/jdk/lib/libjava.so
0x00002ab3694bc000 88K <jdkdir>/jdk/lib/libnio.so
0x00002ab3694d6000 3240K /lib/x86_64-linux-gnu/libz.so.1
0x00002ab3696f0000 3136K /lib/x86_64-linux-gnu/libpthread.so.0
0x00002ab36990d000 3020K /lib/x86_64-linux-gnu/libdl.so.2
0x00002ab369b11000 5052K /lib/x86_64-linux-gnu/libc.so.6
0x00002ab369edb000 31100K <jdkdir>/jdk/lib/server/libjvm.so
0x00002ab36bd3a000 2840K /lib/x86_64-linux-gnu/librt.so.1
0x00002ab36bf42000 4856K /lib/x86_64-linux-gnu/libm.so.6
0x00002ab36c24b000 3796K /lib/x86_64-linux-gnu/libnss_compat.so.2
0x00002ab36c454000 3760K /lib/x86_64-linux-gnu/libnsl.so.1
0x00002ab36c66d000 3660K /lib/x86_64-linux-gnu/libnss_nis.so.2
0x00002ab36c879000 3612K /lib/x86_64-linux-gnu/libnss_files.so.2
0x00002ab36ca942b6: /lib/x86_64-linux-gnu/libnss_files.so.2 + 0x21b2b6
0x00002ab38fc08000 112K <jdkdir>/jdk/lib/libnet.so
0x00002ab38fc55000 3756K /usr/lib/x86_64-linux-gnu/libstdc++.so.6
0x00002ab3bc000000 4096K /lib/x86_64-linux-gnu/libgcc_s.so.1

There appears to be a very large gap between `libnss_files.so.2` and `libnet.so` (about 590mb) so I assume a lot of hotspot memory allocations are located in this area, including the interpreter.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2563
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261710: SA DSO objects have sizes that are too large [v4]

Serguei Spitsyn-2
In reply to this post by Yasumasa Suenaga-7
On Wed, 17 Feb 2021 06:28:09 GMT, Yasumasa Suenaga <[hidden email]> wrote:

>> This PR relates to [JDK-8261702](https://bugs.openjdk.java.net/browse/JDK-8261702) ( #2562 )
>> When SA creates a DSO object, which is used to represent a shared object file (.so), it initializes the "size" to be the size of the shared object file. This usually results in the size being too big. This can cause SA to get confused about whether or not an address is in the shared object. SA should instead set the DSO's size to the amount of the file that is actually mapped for executable code.
>
> Yasumasa Suenaga has updated the pull request incrementally with one additional commit since the last revision:
>
>   Use p_filesz instead of p_memsz

Hi Yasumasa,

Just some comments.

src/jdk.hotspot.agent/linux/native/libsaproc/libproc_impl.c
+static inline uintptr_t align_down(uintptr_t ptr, size_t size) {
+  return (ptr & ~(size - 1));
+}
+
+static inline uintptr_t align_up(uintptr_t ptr, size_t size) {
+  return ((ptr + size - 1) & ~(size - 1));
+}

The name of 'size' parameter is confusing. Should it be renamed to something like page_size of psize?

+  lib->end = (uintptr_t)-1L;
. . .
+      if ((lib->end == (uintptr_t)-1L) || (lib->end < aligned_end)) {
+        lib->end = aligned_end;
       }
 
The condition  `(lib->end == (uintptr_t)-1L)` is a subset of `(lib->end < aligned_end)`, and so, can be removed. The same is true for the condition:
+        if ((lib->exec_end == (uintptr_t)-1L) || (lib->exec_end < aligned_end)) {
+          lib->exec_end = aligned_end;
+        }

`+      print_debug("%s [%d] 0x%lx-0x%lx: base = 0x%lx, vaddr = 0x%lx, memsz = 0x%lx, filesz = 0x%lx\n", lib->name, cnt, aligned_start, aligned_end, lib->base, ph->p_vaddr, ph->p_memsz, ph->p_filesz);`

It is better to split this long line.

Thanks,
Serguei

-------------

PR: https://git.openjdk.java.net/jdk/pull/2563
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261710: SA DSO objects have sizes that are too large

Yasumasa Suenaga-7
In reply to this post by Yasumasa Suenaga-7
On Sun, 14 Feb 2021 07:14:24 GMT, Yasumasa Suenaga <[hidden email]> wrote:

> This PR relates to [JDK-8261702](https://bugs.openjdk.java.net/browse/JDK-8261702) ( #2562 )
> When SA creates a DSO object, which is used to represent a shared object file (.so), it initializes the "size" to be the size of the shared object file. This usually results in the size being too big. This can cause SA to get confused about whether or not an address is in the shared object. SA should instead set the DSO's size to the amount of the file that is actually mapped for executable code.

> @YaSuenag https://bugs.openjdk.java.net/browse/JDK-8250826 is the bug I was thinking of that sounds like the RO/RW issue you were talking about.

It is similar, but it's different from this issue because JDK-8250826 is caused by `mprotect()` call against memory segment in ELF binary.

> In any case, I think the main issue it causes for you is that your rounding up the size of the last (4th) segment may not enough. I think in most cases you would need to round it up to a page boundary, and then add another page to it.

Hmm... it might be page-boundary problem as you said, but I don't have any ideas where we can collect the information about that excepting note section in the core.
My latest patch shows following debug message on the console. It shows all PT_LOAD segments have been handled correctly.

libsaproc DEBUG: /lib64/libnss_sss.so.2 [0] 0x7f0ba6ec5000-0x7f0ba6ec7000: base = 0x7f0ba6ec5000, vaddr = 0x0, memsz = 0x1468, filesz = 0x1468
libsaproc DEBUG: /lib64/libnss_sss.so.2 [1] 0x7f0ba6ec7000-0x7f0ba6ece000: base = 0x7f0ba6ec5000, vaddr = 0x2000, memsz = 0x6931, filesz = 0x6931
libsaproc DEBUG: /lib64/libnss_sss.so.2 [2] 0x7f0ba6ece000-0x7f0ba6ed0000: base = 0x7f0ba6ec5000, vaddr = 0x9000, memsz = 0x1110, filesz = 0x1110
libsaproc DEBUG: /lib64/libnss_sss.so.2 [3] 0x7f0ba6ed0000-0x7f0ba6ed1000: base = 0x7f0ba6ec5000, vaddr = 0xbc78, memsz = 0x658, filesz = 0x44c

> I'm actually not too concerned about getting this right, because with my PointerFinder workaround I don't think this issue with the map sizes has much impact on SA. Probably the only place it will show up is with SA pmap output.

Ok, I will move forward this fix.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2563
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261710: SA DSO objects have sizes that are too large [v5]

Yasumasa Suenaga-7
In reply to this post by Yasumasa Suenaga-7
> This PR relates to [JDK-8261702](https://bugs.openjdk.java.net/browse/JDK-8261702) ( #2562 )
> When SA creates a DSO object, which is used to represent a shared object file (.so), it initializes the "size" to be the size of the shared object file. This usually results in the size being too big. This can cause SA to get confused about whether or not an address is in the shared object. SA should instead set the DSO's size to the amount of the file that is actually mapped for executable code.

Yasumasa Suenaga has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains seven commits:

 - Fix comments
 - Merge remote-tracking branch 'upstream/master' into JDK-8261710
 - Use p_filesz instead of p_memsz
 - Remove unnecessary code
 - Fix for OS X
 - Use entire of address range of shared library
 - 8261710: SA DSO objects have sizes that are too large

-------------

Changes: https://git.openjdk.java.net/jdk/pull/2563/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=2563&range=04
  Stats: 95 lines in 9 files changed: 56 ins; 13 del; 26 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2563.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2563/head:pull/2563

PR: https://git.openjdk.java.net/jdk/pull/2563
12