Quantcast

[9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test

classic Classic list List threaded Threaded
16 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test

Zoltán Majó
Hi,


please review the fix for 8173151.

https://bugs.openjdk.java.net/browse/JDK-8173151
http://cr.openjdk.java.net/~zmajo/8173151/webrev.00/

The crash reported in the bug is caused by the corruption of the
'non-profiled nmethods' code heap. CodeHeap::_freelist for that code
heap points to an address one segment before the heap's address space.
The sweeper starts iterating through the code heap from the beginning of
the heap's address space. Thus, the sweeper assumes that the first item
in the code heap is a HeapBlock/FreeBlock (with the appropriate length
and usage information). However, that is not the case, as the first item
in the heap is actually *before* the heap. So the sweeper crashes.

This is a hard-to-reproduce problem (I managed to reproduce it only once
in 350 iterations, each iteration taking ~25 minutes). So the fix I
propose is based on core file debugging and source code investigation.
But I managed to write a regression test that triggers a problem similar
to the original problem.

I think that problem happens because a CodeBlob allocated in one code
heap (A) is returned to a different code heap (B). When the CodeBlob is
returned B, it is added to B's freelist. However, as the CodeBlob was
allocated in A, the freelist of B now points into A (i.e., B is corrupted).

The code cache uses CodeCache::get_code_heap(const CodeBlob* cb) to
determine to which code heap a 'cb' is supposed to be returned to. Since
8171008 (AOT) [1], the check is:

CodeHeap* CodeCache::get_code_heap(const CodeBlob* cb) {
assert(cb != NULL, "CodeBlob is null");
FOR_ALL_HEAPS(heap) {
- if ((*heap)->contains(cb)) {
+ if ((*heap)->contains(cb->code_begin())) {
return *heap;
}
}

The blob 'cb' can be returned to the wrong heap if, for example:
- 'cb' is at the end code heap A and
- the size of the code contained in 'cb' is 0 (therefore code_begin()
returns the address after 'cb', i.e., the first address in code heap B).

The fix proposes to restore CodeCache::get_code_heap() to its pre-AOT
state (as I'm not aware of the reasons why AOT changed that check). I
also propose to add some guarantees after allocation/deallocation in the
code heap to possibly easier catch this or related problems in the future.

The regression test I propose achieves the above condition and results
in a crash. The regression test works only with product builds, because
in a product build a BufferBlob fits into one segment whereas in a
fastdebug build it does not.

The test needs to set the CodeCacheMinBlockLength flag to 1. The flag is
currently develop and we would need to make it product for the test to
work. (Other flags controlling the code cache, e.g.,
CodeCacheExpansionSize, are also product.) I could also experiment with
reproducing the problem with different block lengths/segment sizes, but
that would most likely make the test more fragile (and
CodeCacheSegmentSize is anyway develop as well).

I tested the fix with JPRT, RBT is in progress.

Thank you!

Best regards,


Zoltan

[1] http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/777aaa19c4b1#l116.71

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test

Tobias Hartmann-2
Hi Zoltán,

good catch!

Here are some questions/suggestions:
- Wouldn't it make sense to include an upper bound check in the guarantees as well?
- I wonder why AOT changed the code to use cb->code_begin() instead of cb, maybe someone more familiar with the AOT implementation could clarify this?
- I'm not sure if we should make CodeCacheMinBlockLength a product flag because we would then also need more testing with different flag values. What about making it diagnostic or experimental?

Thanks,
Tobias

On 06.02.2017 16:29, Zoltán Majó wrote:

> Hi,
>
>
> please review the fix for 8173151.
>
> https://bugs.openjdk.java.net/browse/JDK-8173151
> http://cr.openjdk.java.net/~zmajo/8173151/webrev.00/
>
> The crash reported in the bug is caused by the corruption of the 'non-profiled nmethods' code heap. CodeHeap::_freelist for that code heap points to an address one segment before the heap's address space. The sweeper starts iterating through the code heap from the beginning of the heap's address space. Thus, the sweeper assumes that the first item in the code heap is a HeapBlock/FreeBlock (with the appropriate length and usage information). However, that is not the case, as the first item in the heap is actually *before* the heap. So the sweeper crashes.
>
> This is a hard-to-reproduce problem (I managed to reproduce it only once in 350 iterations, each iteration taking ~25 minutes). So the fix I propose is based on core file debugging and source code investigation. But I managed to write a regression test that triggers a problem similar to the original problem.
>
> I think that problem happens because a CodeBlob allocated in one code heap (A) is returned to a different code heap (B). When the CodeBlob is returned B, it is added to B's freelist. However, as the CodeBlob was allocated in A, the freelist of B now points into A (i.e., B is corrupted).
>
> The code cache uses CodeCache::get_code_heap(const CodeBlob* cb) to determine to which code heap a 'cb' is supposed to be returned to. Since 8171008 (AOT) [1], the check is:
>
> CodeHeap* CodeCache::get_code_heap(const CodeBlob* cb) {
> assert(cb != NULL, "CodeBlob is null");
> FOR_ALL_HEAPS(heap) {
> - if ((*heap)->contains(cb)) {
> + if ((*heap)->contains(cb->code_begin())) {
> return *heap;
> }
> }
>
> The blob 'cb' can be returned to the wrong heap if, for example:
> - 'cb' is at the end code heap A and
> - the size of the code contained in 'cb' is 0 (therefore code_begin() returns the address after 'cb', i.e., the first address in code heap B).
>
> The fix proposes to restore CodeCache::get_code_heap() to its pre-AOT state (as I'm not aware of the reasons why AOT changed that check). I also propose to add some guarantees after allocation/deallocation in the code heap to possibly easier catch this or related problems in the future.
>
> The regression test I propose achieves the above condition and results in a crash. The regression test works only with product builds, because in a product build a BufferBlob fits into one segment whereas in a fastdebug build it does not.
>
> The test needs to set the CodeCacheMinBlockLength flag to 1. The flag is currently develop and we would need to make it product for the test to work. (Other flags controlling the code cache, e.g., CodeCacheExpansionSize, are also product.) I could also experiment with reproducing the problem with different block lengths/segment sizes, but that would most likely make the test more fragile (and CodeCacheSegmentSize is anyway develop as well).
>
> I tested the fix with JPRT, RBT is in progress.
>
> Thank you!
>
> Best regards,
>
>
> Zoltan
>
> [1] http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/777aaa19c4b1#l116.71
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test

Zoltán Majó
Hi Tobias,


thank you for the feedback!

On 02/06/2017 04:52 PM, Tobias Hartmann wrote:
> Hi Zoltán,
>
> good catch!
>
> Here are some questions/suggestions:
> - Wouldn't it make sense to include an upper bound check in the guarantees as well?

That certainly makes sense, please see the updated webrev.

> - I wonder why AOT changed the code to use cb->code_begin() instead of cb, maybe someone more familiar with the AOT implementation could clarify this?

As we discussed offline, it seems that AOTCompiledMethods are allocated
on the C heap (instead in the code cache). That is, only the code of an
AOT method is allocated in the code cache. Could somebody from the AOT
team please confirm that?

If the above is true, we could add a virtual method
CodeBlob::blob_begin() to CodeBlob, which would return 'this'.
AOTCompiledMethod would then override blob_begin() to return
code_begin(). In CodeCache::get_code_heap() we would then use
...contains(cb->blob_begin()). Would that make sense? (I'm sure we can
find some better for that method.)

> - I'm not sure if we should make CodeCacheMinBlockLength a product flag because we would then also need more testing with different flag values. What about making it diagnostic or experimental?

Let's make it diagnostic then.

Here is the updated webrev:
http://cr.openjdk.java.net/~zmajo/8173151/webrev.01/

JPRT testing is in progress.

Thank you!

Best regards,


Zoltan

>
> Thanks,
> Tobias
>
> On 06.02.2017 16:29, Zoltán Majó wrote:
>> Hi,
>>
>>
>> please review the fix for 8173151.
>>
>> https://bugs.openjdk.java.net/browse/JDK-8173151
>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.00/
>>
>> The crash reported in the bug is caused by the corruption of the 'non-profiled nmethods' code heap. CodeHeap::_freelist for that code heap points to an address one segment before the heap's address space. The sweeper starts iterating through the code heap from the beginning of the heap's address space. Thus, the sweeper assumes that the first item in the code heap is a HeapBlock/FreeBlock (with the appropriate length and usage information). However, that is not the case, as the first item in the heap is actually *before* the heap. So the sweeper crashes.
>>
>> This is a hard-to-reproduce problem (I managed to reproduce it only once in 350 iterations, each iteration taking ~25 minutes). So the fix I propose is based on core file debugging and source code investigation. But I managed to write a regression test that triggers a problem similar to the original problem.
>>
>> I think that problem happens because a CodeBlob allocated in one code heap (A) is returned to a different code heap (B). When the CodeBlob is returned B, it is added to B's freelist. However, as the CodeBlob was allocated in A, the freelist of B now points into A (i.e., B is corrupted).
>>
>> The code cache uses CodeCache::get_code_heap(const CodeBlob* cb) to determine to which code heap a 'cb' is supposed to be returned to. Since 8171008 (AOT) [1], the check is:
>>
>> CodeHeap* CodeCache::get_code_heap(const CodeBlob* cb) {
>> assert(cb != NULL, "CodeBlob is null");
>> FOR_ALL_HEAPS(heap) {
>> - if ((*heap)->contains(cb)) {
>> + if ((*heap)->contains(cb->code_begin())) {
>> return *heap;
>> }
>> }
>>
>> The blob 'cb' can be returned to the wrong heap if, for example:
>> - 'cb' is at the end code heap A and
>> - the size of the code contained in 'cb' is 0 (therefore code_begin() returns the address after 'cb', i.e., the first address in code heap B).
>>
>> The fix proposes to restore CodeCache::get_code_heap() to its pre-AOT state (as I'm not aware of the reasons why AOT changed that check). I also propose to add some guarantees after allocation/deallocation in the code heap to possibly easier catch this or related problems in the future.
>>
>> The regression test I propose achieves the above condition and results in a crash. The regression test works only with product builds, because in a product build a BufferBlob fits into one segment whereas in a fastdebug build it does not.
>>
>> The test needs to set the CodeCacheMinBlockLength flag to 1. The flag is currently develop and we would need to make it product for the test to work. (Other flags controlling the code cache, e.g., CodeCacheExpansionSize, are also product.) I could also experiment with reproducing the problem with different block lengths/segment sizes, but that would most likely make the test more fragile (and CodeCacheSegmentSize is anyway develop as well).
>>
>> I tested the fix with JPRT, RBT is in progress.
>>
>> Thank you!
>>
>> Best regards,
>>
>>
>> Zoltan
>>
>> [1] http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/777aaa19c4b1#l116.71
>>

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test

Vladimir Kozlov
On 2/6/17 9:18 AM, Zoltán Majó wrote:

> Hi Tobias,
>
>
> thank you for the feedback!
>
> On 02/06/2017 04:52 PM, Tobias Hartmann wrote:
>> Hi Zoltán,
>>
>> good catch!
>>
>> Here are some questions/suggestions:
>> - Wouldn't it make sense to include an upper bound check in the
>> guarantees as well?
>
> That certainly makes sense, please see the updated webrev.
>
>> - I wonder why AOT changed the code to use cb->code_begin() instead of
>> cb, maybe someone more familiar with the AOT implementation could
>> clarify this?
>
> As we discussed offline, it seems that AOTCompiledMethods are allocated
> on the C heap (instead in the code cache). That is, only the code of an
> AOT method is allocated in the code cache. Could somebody from the AOT
> team please confirm that?

Yes, CodeBlob descriptors of AOT methods are allocated in C heap since
AOT code is in RO memory (in loaded AOT library) and we can't put
modifiable (addresses are unknown during AOT lib generation) CodeBlob there.

It is different from nmethods where CodeBlob is "wrapper" of nmethod
code in CodeCache. That is why we have to use cb->code_begin().

Sorry but your changes for get_code_heap() are incorrect for AOT code.

There is an other similar code which could be problematic -
CodeBlobIterator:

http://hg.openjdk.java.net/jdk9/hs/hotspot/file/8d77eb1a669d/src/share/vm/code/codeCache.hpp#l298

Note, CodeHeap::contains() is virtual and aotCodeHeap overwrite it but
still accept an address. Based on usage cases (how many) we may change
it to accept CodeBlob type instead of address and do appropriate checks
depending on AOT or not AOT heap. Or we add an other CodeHeap method
contains_blob(CodeBlob *) which do that.

>
> If the above is true, we could add a virtual method
> CodeBlob::blob_begin() to CodeBlob, which would return 'this'.
> AOTCompiledMethod would then override blob_begin() to return
> code_begin(). In CodeCache::get_code_heap() we would then use
> ...contains(cb->blob_begin()). Would that make sense? (I'm sure we can
> find some better for that method.)

We worked very hard to avoid virtual calls in AOT code since we saw
performance regression when scanning code cache. So we can't do virtual
calls.

Thanks,
Vladimir

>
>> - I'm not sure if we should make CodeCacheMinBlockLength a product
>> flag because we would then also need more testing with different flag
>> values. What about making it diagnostic or experimental?
>
> Let's make it diagnostic then.
>
> Here is the updated webrev:
> http://cr.openjdk.java.net/~zmajo/8173151/webrev.01/
>
> JPRT testing is in progress.
>
> Thank you!
>
> Best regards,
>
>
> Zoltan
>
>>
>> Thanks,
>> Tobias
>>
>> On 06.02.2017 16:29, Zoltán Majó wrote:
>>> Hi,
>>>
>>>
>>> please review the fix for 8173151.
>>>
>>> https://bugs.openjdk.java.net/browse/JDK-8173151
>>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.00/
>>>
>>> The crash reported in the bug is caused by the corruption of the
>>> 'non-profiled nmethods' code heap. CodeHeap::_freelist for that code
>>> heap points to an address one segment before the heap's address
>>> space. The sweeper starts iterating through the code heap from the
>>> beginning of the heap's address space. Thus, the sweeper assumes that
>>> the first item in the code heap is a HeapBlock/FreeBlock (with the
>>> appropriate length and usage information). However, that is not the
>>> case, as the first item in the heap is actually *before* the heap. So
>>> the sweeper crashes.
>>>
>>> This is a hard-to-reproduce problem (I managed to reproduce it only
>>> once in 350 iterations, each iteration taking ~25 minutes). So the
>>> fix I propose is based on core file debugging and source code
>>> investigation. But I managed to write a regression test that triggers
>>> a problem similar to the original problem.
>>>
>>> I think that problem happens because a CodeBlob allocated in one code
>>> heap (A) is returned to a different code heap (B). When the CodeBlob
>>> is returned B, it is added to B's freelist. However, as the CodeBlob
>>> was allocated in A, the freelist of B now points into A (i.e., B is
>>> corrupted).
>>>
>>> The code cache uses CodeCache::get_code_heap(const CodeBlob* cb) to
>>> determine to which code heap a 'cb' is supposed to be returned to.
>>> Since 8171008 (AOT) [1], the check is:
>>>
>>> CodeHeap* CodeCache::get_code_heap(const CodeBlob* cb) {
>>> assert(cb != NULL, "CodeBlob is null");
>>> FOR_ALL_HEAPS(heap) {
>>> - if ((*heap)->contains(cb)) {
>>> + if ((*heap)->contains(cb->code_begin())) {
>>> return *heap;
>>> }
>>> }
>>>
>>> The blob 'cb' can be returned to the wrong heap if, for example:
>>> - 'cb' is at the end code heap A and
>>> - the size of the code contained in 'cb' is 0 (therefore code_begin()
>>> returns the address after 'cb', i.e., the first address in code heap B).
>>>
>>> The fix proposes to restore CodeCache::get_code_heap() to its pre-AOT
>>> state (as I'm not aware of the reasons why AOT changed that check). I
>>> also propose to add some guarantees after allocation/deallocation in
>>> the code heap to possibly easier catch this or related problems in
>>> the future.
>>>
>>> The regression test I propose achieves the above condition and
>>> results in a crash. The regression test works only with product
>>> builds, because in a product build a BufferBlob fits into one segment
>>> whereas in a fastdebug build it does not.
>>>
>>> The test needs to set the CodeCacheMinBlockLength flag to 1. The flag
>>> is currently develop and we would need to make it product for the
>>> test to work. (Other flags controlling the code cache, e.g.,
>>> CodeCacheExpansionSize, are also product.) I could also experiment
>>> with reproducing the problem with different block lengths/segment
>>> sizes, but that would most likely make the test more fragile (and
>>> CodeCacheSegmentSize is anyway develop as well).
>>>
>>> I tested the fix with JPRT, RBT is in progress.
>>>
>>> Thank you!
>>>
>>> Best regards,
>>>
>>>
>>> Zoltan
>>>
>>> [1]
>>> http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/777aaa19c4b1#l116.71
>>>
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test

dean.long
In reply to this post by Zoltán Majó
When do we allocate a CodeBlob with a code size of 0?  Is it really
useful?  Would having a minimum code size of 1 fix the problem?

dl


On 2/6/17 7:29 AM, Zoltán Majó wrote:

> Hi,
>
>
> please review the fix for 8173151.
>
> https://bugs.openjdk.java.net/browse/JDK-8173151
> http://cr.openjdk.java.net/~zmajo/8173151/webrev.00/
>
> The crash reported in the bug is caused by the corruption of the
> 'non-profiled nmethods' code heap. CodeHeap::_freelist for that code
> heap points to an address one segment before the heap's address space.
> The sweeper starts iterating through the code heap from the beginning
> of the heap's address space. Thus, the sweeper assumes that the first
> item in the code heap is a HeapBlock/FreeBlock (with the appropriate
> length and usage information). However, that is not the case, as the
> first item in the heap is actually *before* the heap. So the sweeper
> crashes.
>
> This is a hard-to-reproduce problem (I managed to reproduce it only
> once in 350 iterations, each iteration taking ~25 minutes). So the fix
> I propose is based on core file debugging and source code
> investigation. But I managed to write a regression test that triggers
> a problem similar to the original problem.
>
> I think that problem happens because a CodeBlob allocated in one code
> heap (A) is returned to a different code heap (B). When the CodeBlob
> is returned B, it is added to B's freelist. However, as the CodeBlob
> was allocated in A, the freelist of B now points into A (i.e., B is
> corrupted).
>
> The code cache uses CodeCache::get_code_heap(const CodeBlob* cb) to
> determine to which code heap a 'cb' is supposed to be returned to.
> Since 8171008 (AOT) [1], the check is:
>
> CodeHeap* CodeCache::get_code_heap(const CodeBlob* cb) {
> assert(cb != NULL, "CodeBlob is null");
> FOR_ALL_HEAPS(heap) {
> - if ((*heap)->contains(cb)) {
> + if ((*heap)->contains(cb->code_begin())) {
> return *heap;
> }
> }
>
> The blob 'cb' can be returned to the wrong heap if, for example:
> - 'cb' is at the end code heap A and
> - the size of the code contained in 'cb' is 0 (therefore code_begin()
> returns the address after 'cb', i.e., the first address in code heap B).
>
> The fix proposes to restore CodeCache::get_code_heap() to its pre-AOT
> state (as I'm not aware of the reasons why AOT changed that check). I
> also propose to add some guarantees after allocation/deallocation in
> the code heap to possibly easier catch this or related problems in the
> future.
>
> The regression test I propose achieves the above condition and results
> in a crash. The regression test works only with product builds,
> because in a product build a BufferBlob fits into one segment whereas
> in a fastdebug build it does not.
>
> The test needs to set the CodeCacheMinBlockLength flag to 1. The flag
> is currently develop and we would need to make it product for the test
> to work. (Other flags controlling the code cache, e.g.,
> CodeCacheExpansionSize, are also product.) I could also experiment
> with reproducing the problem with different block lengths/segment
> sizes, but that would most likely make the test more fragile (and
> CodeCacheSegmentSize is anyway develop as well).
>
> I tested the fix with JPRT, RBT is in progress.
>
> Thank you!
>
> Best regards,
>
>
> Zoltan
>
> [1] http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/777aaa19c4b1#l116.71
>

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test

Tobias Hartmann-2
Hi Dean,

On 07.02.2017 00:14, [hidden email] wrote:
> When do we allocate a CodeBlob with a code size of 0?  Is it really useful?  Would having a minimum code size of 1 fix the problem?

Yes, I've discussed that with Zoltan offline and I agree that we should enforce a minimum code size of 1 and fix the stress test(s) accordingly. It shouldn't be necessary to be able to create regular CodeBlobs with code size 0.

Best regards,
Tobias

> dl
>
>
> On 2/6/17 7:29 AM, Zoltán Majó wrote:
>> Hi,
>>
>>
>> please review the fix for 8173151.
>>
>> https://bugs.openjdk.java.net/browse/JDK-8173151
>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.00/
>>
>> The crash reported in the bug is caused by the corruption of the 'non-profiled nmethods' code heap. CodeHeap::_freelist for that code heap points to an address one segment before the heap's address space. The sweeper starts iterating through the code heap from the beginning of the heap's address space. Thus, the sweeper assumes that the first item in the code heap is a HeapBlock/FreeBlock (with the appropriate length and usage information). However, that is not the case, as the first item in the heap is actually *before* the heap. So the sweeper crashes.
>>
>> This is a hard-to-reproduce problem (I managed to reproduce it only once in 350 iterations, each iteration taking ~25 minutes). So the fix I propose is based on core file debugging and source code investigation. But I managed to write a regression test that triggers a problem similar to the original problem.
>>
>> I think that problem happens because a CodeBlob allocated in one code heap (A) is returned to a different code heap (B). When the CodeBlob is returned B, it is added to B's freelist. However, as the CodeBlob was allocated in A, the freelist of B now points into A (i.e., B is corrupted).
>>
>> The code cache uses CodeCache::get_code_heap(const CodeBlob* cb) to determine to which code heap a 'cb' is supposed to be returned to. Since 8171008 (AOT) [1], the check is:
>>
>> CodeHeap* CodeCache::get_code_heap(const CodeBlob* cb) {
>> assert(cb != NULL, "CodeBlob is null");
>> FOR_ALL_HEAPS(heap) {
>> - if ((*heap)->contains(cb)) {
>> + if ((*heap)->contains(cb->code_begin())) {
>> return *heap;
>> }
>> }
>>
>> The blob 'cb' can be returned to the wrong heap if, for example:
>> - 'cb' is at the end code heap A and
>> - the size of the code contained in 'cb' is 0 (therefore code_begin() returns the address after 'cb', i.e., the first address in code heap B).
>>
>> The fix proposes to restore CodeCache::get_code_heap() to its pre-AOT state (as I'm not aware of the reasons why AOT changed that check). I also propose to add some guarantees after allocation/deallocation in the code heap to possibly easier catch this or related problems in the future.
>>
>> The regression test I propose achieves the above condition and results in a crash. The regression test works only with product builds, because in a product build a BufferBlob fits into one segment whereas in a fastdebug build it does not.
>>
>> The test needs to set the CodeCacheMinBlockLength flag to 1. The flag is currently develop and we would need to make it product for the test to work. (Other flags controlling the code cache, e.g., CodeCacheExpansionSize, are also product.) I could also experiment with reproducing the problem with different block lengths/segment sizes, but that would most likely make the test more fragile (and CodeCacheSegmentSize is anyway develop as well).
>>
>> I tested the fix with JPRT, RBT is in progress.
>>
>> Thank you!
>>
>> Best regards,
>>
>>
>> Zoltan
>>
>> [1] http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/777aaa19c4b1#l116.71
>>
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test

Zoltán Majó
Hi Dean,
Hi Tobias,


thank you for the feedback!

On 02/07/2017 07:01 AM, Tobias Hartmann wrote:
> Hi Dean,
>
> On 07.02.2017 00:14, [hidden email] wrote:
>> When do we allocate a CodeBlob with a code size of 0?  Is it really useful?  Would having a minimum code size of 1 fix the problem?
> Yes, I've discussed that with Zoltan offline and I agree that we should enforce a minimum code size of 1 and fix the stress test(s) accordingly. It shouldn't be necessary to be able to create regular CodeBlobs with code size 0.

It is indeed a valid point that code sizes of 0 are not useful in practice.

However, the problem is a bit more complicated than the regression test
(and my description) suggests it to be.  The regression test works only
if CodeCacheMinBlockLength set to 1.  Only that way is it possible to
allocate BufferBlobs with size == 1 segment and set
CodeBlob::_code_begin to point to the next segment (which is possibly in
the next code heap).

For both crashes I've seen, however, CodeCacheMinBlockLength was equal
to 4.  With that CodeCacheMinBlockLength value, every allocation in the
code heap is at least 4 segments long.  It is therefore impossible to
create a BufferBlob whose _code_begin points "outside" of the space
reserved for that BufferBlob.  (Because BufferBlob::_code_begin will
either point to the beginning of second segment of the BufferBlob --
with product -- or somewhere into the second segment -- with fastdebug.)

Other blob types can have their _code_begin point to the boundary of 4
segments.  E.g., nmethods are between 2 and 3 segments long.  If the
relocation information in the nmethod fills up the space to the 4
segment boundary, _code_begin will be exactly at the segment boundary.

But I don't know the exact sequence of steps lead to the code heap
corruption (because the crash reproduced only twice so far).  I just
know that the first free block in the (corrupted) code heap is of length
8 segments (the contents of the first freelist item and of the segmap
both confirm this); the heap is corrupted because the freelist starts
before the code heap.

What sequence of steps lead to this?  Some nmethod
allocations/deallocations interleaved with BufferBlob
allocations/deallocations (possibly of size 0)?  I'm not sure.  In what
way were blocks in the heap merged/split by the
allocations/deallocations leading to the crash?  I don't know.

Did allocation/deallocation of BufferBlobs of size 0 play a role at
all?  I don't know either and I won't be able to tell unless the problem
reproduces more often.  However, by using code size == 0 it is possible
to write a regression test that triggers a real problem (which is
hopefully related to the crashes we've been seeing).

So I'd like to allow code sizes of 0 for now and attempt to fix
CodeCache::get_code_heap() and the CodeBlobIterator constructor the way
Vladimir suggested.  If that turns out to be difficult/risky or require
large code changes, I'll propose a fix instead that disallows code sizes
of 1 and fixes the test(s) -- just as you suggested.

Best regards,


Zoltan

> Best regards,
> Tobias
>
>> dl
>>
>>
>> On 2/6/17 7:29 AM, Zoltán Majó wrote:
>>> Hi,
>>>
>>>
>>> please review the fix for 8173151.
>>>
>>> https://bugs.openjdk.java.net/browse/JDK-8173151
>>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.00/
>>>
>>> The crash reported in the bug is caused by the corruption of the 'non-profiled nmethods' code heap. CodeHeap::_freelist for that code heap points to an address one segment before the heap's address space. The sweeper starts iterating through the code heap from the beginning of the heap's address space. Thus, the sweeper assumes that the first item in the code heap is a HeapBlock/FreeBlock (with the appropriate length and usage information). However, that is not the case, as the first item in the heap is actually *before* the heap. So the sweeper crashes.
>>>
>>> This is a hard-to-reproduce problem (I managed to reproduce it only once in 350 iterations, each iteration taking ~25 minutes). So the fix I propose is based on core file debugging and source code investigation. But I managed to write a regression test that triggers a problem similar to the original problem.
>>>
>>> I think that problem happens because a CodeBlob allocated in one code heap (A) is returned to a different code heap (B). When the CodeBlob is returned B, it is added to B's freelist. However, as the CodeBlob was allocated in A, the freelist of B now points into A (i.e., B is corrupted).
>>>
>>> The code cache uses CodeCache::get_code_heap(const CodeBlob* cb) to determine to which code heap a 'cb' is supposed to be returned to. Since 8171008 (AOT) [1], the check is:
>>>
>>> CodeHeap* CodeCache::get_code_heap(const CodeBlob* cb) {
>>> assert(cb != NULL, "CodeBlob is null");
>>> FOR_ALL_HEAPS(heap) {
>>> - if ((*heap)->contains(cb)) {
>>> + if ((*heap)->contains(cb->code_begin())) {
>>> return *heap;
>>> }
>>> }
>>>
>>> The blob 'cb' can be returned to the wrong heap if, for example:
>>> - 'cb' is at the end code heap A and
>>> - the size of the code contained in 'cb' is 0 (therefore code_begin() returns the address after 'cb', i.e., the first address in code heap B).
>>>
>>> The fix proposes to restore CodeCache::get_code_heap() to its pre-AOT state (as I'm not aware of the reasons why AOT changed that check). I also propose to add some guarantees after allocation/deallocation in the code heap to possibly easier catch this or related problems in the future.
>>>
>>> The regression test I propose achieves the above condition and results in a crash. The regression test works only with product builds, because in a product build a BufferBlob fits into one segment whereas in a fastdebug build it does not.
>>>
>>> The test needs to set the CodeCacheMinBlockLength flag to 1. The flag is currently develop and we would need to make it product for the test to work. (Other flags controlling the code cache, e.g., CodeCacheExpansionSize, are also product.) I could also experiment with reproducing the problem with different block lengths/segment sizes, but that would most likely make the test more fragile (and CodeCacheSegmentSize is anyway develop as well).
>>>
>>> I tested the fix with JPRT, RBT is in progress.
>>>
>>> Thank you!
>>>
>>> Best regards,
>>>
>>>
>>> Zoltan
>>>
>>> [1] http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/777aaa19c4b1#l116.71
>>>

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test

Zoltán Majó
In reply to this post by Vladimir Kozlov
Hi Vladimir,


thank you for the feedback!

On 02/06/2017 09:22 PM, Vladimir Kozlov wrote:

> On 2/6/17 9:18 AM, Zoltán Majó wrote:
>> [...]
>>
>> As we discussed offline, it seems that AOTCompiledMethods are allocated
>> on the C heap (instead in the code cache). That is, only the code of an
>> AOT method is allocated in the code cache. Could somebody from the AOT
>> team please confirm that?
>
> Yes, CodeBlob descriptors of AOT methods are allocated in C heap since
> AOT code is in RO memory (in loaded AOT library) and we can't put
> modifiable (addresses are unknown during AOT lib generation) CodeBlob
> there.
>
> It is different from nmethods where CodeBlob is "wrapper" of nmethod
> code in CodeCache. That is why we have to use cb->code_begin().

Thank you for clarifying.

>
> Sorry but your changes for get_code_heap() are incorrect for AOT code.
>
> There is an other similar code which could be problematic -
> CodeBlobIterator:
>
> http://hg.openjdk.java.net/jdk9/hs/hotspot/file/8d77eb1a669d/src/share/vm/code/codeCache.hpp#l298 
>

No problem, thank you for pointing me to the other code location.

>
> Note, CodeHeap::contains() is virtual and aotCodeHeap overwrite it but
> still accept an address. Based on usage cases (how many) we may change
> it to accept CodeBlob type instead of address and do appropriate
> checks depending on AOT or not AOT heap. Or we add an other CodeHeap
> method contains_blob(CodeBlob *) which do that.

OK, I added the CodeHeap::contains_blob(CodeBlob*) virtual method, as
you suggested. I updated both CodeCache::get_code_heap() and
CodeBlobIterator::CodeBlobIterator() to use that method instead of the
CodeHeap::contains(void*) method (which is also virtual, so we did not
increase the number of virtual calls by using contains_blob()).

Here is the updated webrev:
http://cr.openjdk.java.net/~zmajo/8173151/webrev.02/

>
>>
>> If the above is true, we could add a virtual method
>> CodeBlob::blob_begin() to CodeBlob, which would return 'this'.
>> AOTCompiledMethod would then override blob_begin() to return
>> code_begin(). In CodeCache::get_code_heap() we would then use
>> ...contains(cb->blob_begin()). Would that make sense? (I'm sure we can
>> find some better for that method.)
>
> We worked very hard to avoid virtual calls in AOT code since we saw
> performance regression when scanning code cache. So we can't do
> virtual calls.

I see. The solution in webrev.02 above does not increase the number of
virtual calls (unless I miss something, of course).

I tested the new patch with JPRT, all tests pass. I'll start RBT soon.

Thank you!

Best regards,


Zoltan

>
> Thanks,
> Vladimir
>
>>
>>> - I'm not sure if we should make CodeCacheMinBlockLength a product
>>> flag because we would then also need more testing with different flag
>>> values. What about making it diagnostic or experimental?
>>
>> Let's make it diagnostic then.
>>
>> Here is the updated webrev:
>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.01/
>>
>> JPRT testing is in progress.
>>
>> Thank you!
>>
>> Best regards,
>>
>>
>> Zoltan
>>
>>>
>>> Thanks,
>>> Tobias
>>>
>>> On 06.02.2017 16:29, Zoltán Majó wrote:
>>>> Hi,
>>>>
>>>>
>>>> please review the fix for 8173151.
>>>>
>>>> https://bugs.openjdk.java.net/browse/JDK-8173151
>>>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.00/
>>>>
>>>> The crash reported in the bug is caused by the corruption of the
>>>> 'non-profiled nmethods' code heap. CodeHeap::_freelist for that code
>>>> heap points to an address one segment before the heap's address
>>>> space. The sweeper starts iterating through the code heap from the
>>>> beginning of the heap's address space. Thus, the sweeper assumes that
>>>> the first item in the code heap is a HeapBlock/FreeBlock (with the
>>>> appropriate length and usage information). However, that is not the
>>>> case, as the first item in the heap is actually *before* the heap. So
>>>> the sweeper crashes.
>>>>
>>>> This is a hard-to-reproduce problem (I managed to reproduce it only
>>>> once in 350 iterations, each iteration taking ~25 minutes). So the
>>>> fix I propose is based on core file debugging and source code
>>>> investigation. But I managed to write a regression test that triggers
>>>> a problem similar to the original problem.
>>>>
>>>> I think that problem happens because a CodeBlob allocated in one code
>>>> heap (A) is returned to a different code heap (B). When the CodeBlob
>>>> is returned B, it is added to B's freelist. However, as the CodeBlob
>>>> was allocated in A, the freelist of B now points into A (i.e., B is
>>>> corrupted).
>>>>
>>>> The code cache uses CodeCache::get_code_heap(const CodeBlob* cb) to
>>>> determine to which code heap a 'cb' is supposed to be returned to.
>>>> Since 8171008 (AOT) [1], the check is:
>>>>
>>>> CodeHeap* CodeCache::get_code_heap(const CodeBlob* cb) {
>>>> assert(cb != NULL, "CodeBlob is null");
>>>> FOR_ALL_HEAPS(heap) {
>>>> - if ((*heap)->contains(cb)) {
>>>> + if ((*heap)->contains(cb->code_begin())) {
>>>> return *heap;
>>>> }
>>>> }
>>>>
>>>> The blob 'cb' can be returned to the wrong heap if, for example:
>>>> - 'cb' is at the end code heap A and
>>>> - the size of the code contained in 'cb' is 0 (therefore code_begin()
>>>> returns the address after 'cb', i.e., the first address in code
>>>> heap B).
>>>>
>>>> The fix proposes to restore CodeCache::get_code_heap() to its pre-AOT
>>>> state (as I'm not aware of the reasons why AOT changed that check). I
>>>> also propose to add some guarantees after allocation/deallocation in
>>>> the code heap to possibly easier catch this or related problems in
>>>> the future.
>>>>
>>>> The regression test I propose achieves the above condition and
>>>> results in a crash. The regression test works only with product
>>>> builds, because in a product build a BufferBlob fits into one segment
>>>> whereas in a fastdebug build it does not.
>>>>
>>>> The test needs to set the CodeCacheMinBlockLength flag to 1. The flag
>>>> is currently develop and we would need to make it product for the
>>>> test to work. (Other flags controlling the code cache, e.g.,
>>>> CodeCacheExpansionSize, are also product.) I could also experiment
>>>> with reproducing the problem with different block lengths/segment
>>>> sizes, but that would most likely make the test more fragile (and
>>>> CodeCacheSegmentSize is anyway develop as well).
>>>>
>>>> I tested the fix with JPRT, RBT is in progress.
>>>>
>>>> Thank you!
>>>>
>>>> Best regards,
>>>>
>>>>
>>>> Zoltan
>>>>
>>>> [1]
>>>> http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/777aaa19c4b1#l116.71
>>>>
>>

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test

Vladimir Kozlov
In reply to this post by Zoltán Majó
Hi, Zoltan

Is it possible that the problem exist in free list maintenance code? Ot
allocation?

Or free size at the end of heap calculated incorrectly when CodeBlob is
allocated there and it can cross boundary?

Can you look in core file that all 3 heaps have consistent start address
and size so that next heap start at correct address. Like
_number_of_committed_segments, _number_of_reserved_segments, _next_segment

I am concern that may be there is a code which check status of next
segment and use it regardless if it belong to current heap or not.

Thanks,
Vladimir

On 2/7/17 5:17 AM, Zoltán Majó wrote:

> Hi Dean,
> Hi Tobias,
>
>
> thank you for the feedback!
>
> On 02/07/2017 07:01 AM, Tobias Hartmann wrote:
>> Hi Dean,
>>
>> On 07.02.2017 00:14, [hidden email] wrote:
>>> When do we allocate a CodeBlob with a code size of 0?  Is it really
>>> useful?  Would having a minimum code size of 1 fix the problem?
>> Yes, I've discussed that with Zoltan offline and I agree that we
>> should enforce a minimum code size of 1 and fix the stress test(s)
>> accordingly. It shouldn't be necessary to be able to create regular
>> CodeBlobs with code size 0.
>
> It is indeed a valid point that code sizes of 0 are not useful in practice.
>
> However, the problem is a bit more complicated than the regression test
> (and my description) suggests it to be.  The regression test works only
> if CodeCacheMinBlockLength set to 1.  Only that way is it possible to
> allocate BufferBlobs with size == 1 segment and set
> CodeBlob::_code_begin to point to the next segment (which is possibly in
> the next code heap).
>
> For both crashes I've seen, however, CodeCacheMinBlockLength was equal
> to 4.  With that CodeCacheMinBlockLength value, every allocation in the
> code heap is at least 4 segments long.  It is therefore impossible to
> create a BufferBlob whose _code_begin points "outside" of the space
> reserved for that BufferBlob.  (Because BufferBlob::_code_begin will
> either point to the beginning of second segment of the BufferBlob --
> with product -- or somewhere into the second segment -- with fastdebug.)
>
> Other blob types can have their _code_begin point to the boundary of 4
> segments.  E.g., nmethods are between 2 and 3 segments long.  If the
> relocation information in the nmethod fills up the space to the 4
> segment boundary, _code_begin will be exactly at the segment boundary.
>
> But I don't know the exact sequence of steps lead to the code heap
> corruption (because the crash reproduced only twice so far).  I just
> know that the first free block in the (corrupted) code heap is of length
> 8 segments (the contents of the first freelist item and of the segmap
> both confirm this); the heap is corrupted because the freelist starts
> before the code heap.
>
> What sequence of steps lead to this?  Some nmethod
> allocations/deallocations interleaved with BufferBlob
> allocations/deallocations (possibly of size 0)?  I'm not sure.  In what
> way were blocks in the heap merged/split by the
> allocations/deallocations leading to the crash?  I don't know.
>
> Did allocation/deallocation of BufferBlobs of size 0 play a role at
> all?  I don't know either and I won't be able to tell unless the problem
> reproduces more often.  However, by using code size == 0 it is possible
> to write a regression test that triggers a real problem (which is
> hopefully related to the crashes we've been seeing).
>
> So I'd like to allow code sizes of 0 for now and attempt to fix
> CodeCache::get_code_heap() and the CodeBlobIterator constructor the way
> Vladimir suggested.  If that turns out to be difficult/risky or require
> large code changes, I'll propose a fix instead that disallows code sizes
> of 1 and fixes the test(s) -- just as you suggested.
>
> Best regards,
>
>
> Zoltan
>
>> Best regards,
>> Tobias
>>
>>> dl
>>>
>>>
>>> On 2/6/17 7:29 AM, Zoltán Majó wrote:
>>>> Hi,
>>>>
>>>>
>>>> please review the fix for 8173151.
>>>>
>>>> https://bugs.openjdk.java.net/browse/JDK-8173151
>>>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.00/
>>>>
>>>> The crash reported in the bug is caused by the corruption of the
>>>> 'non-profiled nmethods' code heap. CodeHeap::_freelist for that code
>>>> heap points to an address one segment before the heap's address
>>>> space. The sweeper starts iterating through the code heap from the
>>>> beginning of the heap's address space. Thus, the sweeper assumes
>>>> that the first item in the code heap is a HeapBlock/FreeBlock (with
>>>> the appropriate length and usage information). However, that is not
>>>> the case, as the first item in the heap is actually *before* the
>>>> heap. So the sweeper crashes.
>>>>
>>>> This is a hard-to-reproduce problem (I managed to reproduce it only
>>>> once in 350 iterations, each iteration taking ~25 minutes). So the
>>>> fix I propose is based on core file debugging and source code
>>>> investigation. But I managed to write a regression test that
>>>> triggers a problem similar to the original problem.
>>>>
>>>> I think that problem happens because a CodeBlob allocated in one
>>>> code heap (A) is returned to a different code heap (B). When the
>>>> CodeBlob is returned B, it is added to B's freelist. However, as the
>>>> CodeBlob was allocated in A, the freelist of B now points into A
>>>> (i.e., B is corrupted).
>>>>
>>>> The code cache uses CodeCache::get_code_heap(const CodeBlob* cb) to
>>>> determine to which code heap a 'cb' is supposed to be returned to.
>>>> Since 8171008 (AOT) [1], the check is:
>>>>
>>>> CodeHeap* CodeCache::get_code_heap(const CodeBlob* cb) {
>>>> assert(cb != NULL, "CodeBlob is null");
>>>> FOR_ALL_HEAPS(heap) {
>>>> - if ((*heap)->contains(cb)) {
>>>> + if ((*heap)->contains(cb->code_begin())) {
>>>> return *heap;
>>>> }
>>>> }
>>>>
>>>> The blob 'cb' can be returned to the wrong heap if, for example:
>>>> - 'cb' is at the end code heap A and
>>>> - the size of the code contained in 'cb' is 0 (therefore
>>>> code_begin() returns the address after 'cb', i.e., the first address
>>>> in code heap B).
>>>>
>>>> The fix proposes to restore CodeCache::get_code_heap() to its
>>>> pre-AOT state (as I'm not aware of the reasons why AOT changed that
>>>> check). I also propose to add some guarantees after
>>>> allocation/deallocation in the code heap to possibly easier catch
>>>> this or related problems in the future.
>>>>
>>>> The regression test I propose achieves the above condition and
>>>> results in a crash. The regression test works only with product
>>>> builds, because in a product build a BufferBlob fits into one
>>>> segment whereas in a fastdebug build it does not.
>>>>
>>>> The test needs to set the CodeCacheMinBlockLength flag to 1. The
>>>> flag is currently develop and we would need to make it product for
>>>> the test to work. (Other flags controlling the code cache, e.g.,
>>>> CodeCacheExpansionSize, are also product.) I could also experiment
>>>> with reproducing the problem with different block lengths/segment
>>>> sizes, but that would most likely make the test more fragile (and
>>>> CodeCacheSegmentSize is anyway develop as well).
>>>>
>>>> I tested the fix with JPRT, RBT is in progress.
>>>>
>>>> Thank you!
>>>>
>>>> Best regards,
>>>>
>>>>
>>>> Zoltan
>>>>
>>>> [1]
>>>> http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/777aaa19c4b1#l116.71
>>>>
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test

Vladimir Kozlov
In reply to this post by Zoltán Majó
Yes, this looks good.

I think it is good to push this first and see if we hit this problem
again later. The problem could be related to segmented codecache and how
we manipulate free and used segments. So the fix may not solve it but at
least it will close one issue you found.

Thanks,
Vladimir

On 2/7/17 5:17 AM, Zoltán Majó wrote:

> Hi Vladimir,
>
>
> thank you for the feedback!
>
> On 02/06/2017 09:22 PM, Vladimir Kozlov wrote:
>> On 2/6/17 9:18 AM, Zoltán Majó wrote:
>>> [...]
>>>
>>> As we discussed offline, it seems that AOTCompiledMethods are allocated
>>> on the C heap (instead in the code cache). That is, only the code of an
>>> AOT method is allocated in the code cache. Could somebody from the AOT
>>> team please confirm that?
>>
>> Yes, CodeBlob descriptors of AOT methods are allocated in C heap since
>> AOT code is in RO memory (in loaded AOT library) and we can't put
>> modifiable (addresses are unknown during AOT lib generation) CodeBlob
>> there.
>>
>> It is different from nmethods where CodeBlob is "wrapper" of nmethod
>> code in CodeCache. That is why we have to use cb->code_begin().
>
> Thank you for clarifying.
>
>>
>> Sorry but your changes for get_code_heap() are incorrect for AOT code.
>>
>> There is an other similar code which could be problematic -
>> CodeBlobIterator:
>>
>> http://hg.openjdk.java.net/jdk9/hs/hotspot/file/8d77eb1a669d/src/share/vm/code/codeCache.hpp#l298
>>
>
> No problem, thank you for pointing me to the other code location.
>
>>
>> Note, CodeHeap::contains() is virtual and aotCodeHeap overwrite it but
>> still accept an address. Based on usage cases (how many) we may change
>> it to accept CodeBlob type instead of address and do appropriate
>> checks depending on AOT or not AOT heap. Or we add an other CodeHeap
>> method contains_blob(CodeBlob *) which do that.
>
> OK, I added the CodeHeap::contains_blob(CodeBlob*) virtual method, as
> you suggested. I updated both CodeCache::get_code_heap() and
> CodeBlobIterator::CodeBlobIterator() to use that method instead of the
> CodeHeap::contains(void*) method (which is also virtual, so we did not
> increase the number of virtual calls by using contains_blob()).
>
> Here is the updated webrev:
> http://cr.openjdk.java.net/~zmajo/8173151/webrev.02/
>
>>
>>>
>>> If the above is true, we could add a virtual method
>>> CodeBlob::blob_begin() to CodeBlob, which would return 'this'.
>>> AOTCompiledMethod would then override blob_begin() to return
>>> code_begin(). In CodeCache::get_code_heap() we would then use
>>> ...contains(cb->blob_begin()). Would that make sense? (I'm sure we can
>>> find some better for that method.)
>>
>> We worked very hard to avoid virtual calls in AOT code since we saw
>> performance regression when scanning code cache. So we can't do
>> virtual calls.
>
> I see. The solution in webrev.02 above does not increase the number of
> virtual calls (unless I miss something, of course).
>
> I tested the new patch with JPRT, all tests pass. I'll start RBT soon.
>
> Thank you!
>
> Best regards,
>
>
> Zoltan
>
>>
>> Thanks,
>> Vladimir
>>
>>>
>>>> - I'm not sure if we should make CodeCacheMinBlockLength a product
>>>> flag because we would then also need more testing with different flag
>>>> values. What about making it diagnostic or experimental?
>>>
>>> Let's make it diagnostic then.
>>>
>>> Here is the updated webrev:
>>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.01/
>>>
>>> JPRT testing is in progress.
>>>
>>> Thank you!
>>>
>>> Best regards,
>>>
>>>
>>> Zoltan
>>>
>>>>
>>>> Thanks,
>>>> Tobias
>>>>
>>>> On 06.02.2017 16:29, Zoltán Majó wrote:
>>>>> Hi,
>>>>>
>>>>>
>>>>> please review the fix for 8173151.
>>>>>
>>>>> https://bugs.openjdk.java.net/browse/JDK-8173151
>>>>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.00/
>>>>>
>>>>> The crash reported in the bug is caused by the corruption of the
>>>>> 'non-profiled nmethods' code heap. CodeHeap::_freelist for that code
>>>>> heap points to an address one segment before the heap's address
>>>>> space. The sweeper starts iterating through the code heap from the
>>>>> beginning of the heap's address space. Thus, the sweeper assumes that
>>>>> the first item in the code heap is a HeapBlock/FreeBlock (with the
>>>>> appropriate length and usage information). However, that is not the
>>>>> case, as the first item in the heap is actually *before* the heap. So
>>>>> the sweeper crashes.
>>>>>
>>>>> This is a hard-to-reproduce problem (I managed to reproduce it only
>>>>> once in 350 iterations, each iteration taking ~25 minutes). So the
>>>>> fix I propose is based on core file debugging and source code
>>>>> investigation. But I managed to write a regression test that triggers
>>>>> a problem similar to the original problem.
>>>>>
>>>>> I think that problem happens because a CodeBlob allocated in one code
>>>>> heap (A) is returned to a different code heap (B). When the CodeBlob
>>>>> is returned B, it is added to B's freelist. However, as the CodeBlob
>>>>> was allocated in A, the freelist of B now points into A (i.e., B is
>>>>> corrupted).
>>>>>
>>>>> The code cache uses CodeCache::get_code_heap(const CodeBlob* cb) to
>>>>> determine to which code heap a 'cb' is supposed to be returned to.
>>>>> Since 8171008 (AOT) [1], the check is:
>>>>>
>>>>> CodeHeap* CodeCache::get_code_heap(const CodeBlob* cb) {
>>>>> assert(cb != NULL, "CodeBlob is null");
>>>>> FOR_ALL_HEAPS(heap) {
>>>>> - if ((*heap)->contains(cb)) {
>>>>> + if ((*heap)->contains(cb->code_begin())) {
>>>>> return *heap;
>>>>> }
>>>>> }
>>>>>
>>>>> The blob 'cb' can be returned to the wrong heap if, for example:
>>>>> - 'cb' is at the end code heap A and
>>>>> - the size of the code contained in 'cb' is 0 (therefore code_begin()
>>>>> returns the address after 'cb', i.e., the first address in code
>>>>> heap B).
>>>>>
>>>>> The fix proposes to restore CodeCache::get_code_heap() to its pre-AOT
>>>>> state (as I'm not aware of the reasons why AOT changed that check). I
>>>>> also propose to add some guarantees after allocation/deallocation in
>>>>> the code heap to possibly easier catch this or related problems in
>>>>> the future.
>>>>>
>>>>> The regression test I propose achieves the above condition and
>>>>> results in a crash. The regression test works only with product
>>>>> builds, because in a product build a BufferBlob fits into one segment
>>>>> whereas in a fastdebug build it does not.
>>>>>
>>>>> The test needs to set the CodeCacheMinBlockLength flag to 1. The flag
>>>>> is currently develop and we would need to make it product for the
>>>>> test to work. (Other flags controlling the code cache, e.g.,
>>>>> CodeCacheExpansionSize, are also product.) I could also experiment
>>>>> with reproducing the problem with different block lengths/segment
>>>>> sizes, but that would most likely make the test more fragile (and
>>>>> CodeCacheSegmentSize is anyway develop as well).
>>>>>
>>>>> I tested the fix with JPRT, RBT is in progress.
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Best regards,
>>>>>
>>>>>
>>>>> Zoltan
>>>>>
>>>>> [1]
>>>>> http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/777aaa19c4b1#l116.71
>>>>>
>>>
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test

Zoltán Majó
Hi Vladimir,


On 02/07/2017 07:08 PM, Vladimir Kozlov wrote:
> Yes, this looks good.

thank you (and also Dean and Tobias) for the review.

>
> I think it is good to push this first and see if we hit this problem
> again later. The problem could be related to segmented codecache and
> how we manipulate free and used segments. So the fix may not solve it
> but at least it will close one issue you found.

OK, I'll push this once I have the results of RBT testing available and
they look good.

I'm also wondering if we hit this (or a similar problem) later. Maybe
the guarantees added with the patch will help to get a better
understanding of the problem later on.

Thank you!

Best regards,


Zoltan

>
> Thanks,
> Vladimir
>
> On 2/7/17 5:17 AM, Zoltán Majó wrote:
>> Hi Vladimir,
>>
>>
>> thank you for the feedback!
>>
>> On 02/06/2017 09:22 PM, Vladimir Kozlov wrote:
>>> On 2/6/17 9:18 AM, Zoltán Majó wrote:
>>>> [...]
>>>>
>>>> As we discussed offline, it seems that AOTCompiledMethods are
>>>> allocated
>>>> on the C heap (instead in the code cache). That is, only the code
>>>> of an
>>>> AOT method is allocated in the code cache. Could somebody from the AOT
>>>> team please confirm that?
>>>
>>> Yes, CodeBlob descriptors of AOT methods are allocated in C heap since
>>> AOT code is in RO memory (in loaded AOT library) and we can't put
>>> modifiable (addresses are unknown during AOT lib generation) CodeBlob
>>> there.
>>>
>>> It is different from nmethods where CodeBlob is "wrapper" of nmethod
>>> code in CodeCache. That is why we have to use cb->code_begin().
>>
>> Thank you for clarifying.
>>
>>>
>>> Sorry but your changes for get_code_heap() are incorrect for AOT code.
>>>
>>> There is an other similar code which could be problematic -
>>> CodeBlobIterator:
>>>
>>> http://hg.openjdk.java.net/jdk9/hs/hotspot/file/8d77eb1a669d/src/share/vm/code/codeCache.hpp#l298 
>>>
>>>
>>
>> No problem, thank you for pointing me to the other code location.
>>
>>>
>>> Note, CodeHeap::contains() is virtual and aotCodeHeap overwrite it but
>>> still accept an address. Based on usage cases (how many) we may change
>>> it to accept CodeBlob type instead of address and do appropriate
>>> checks depending on AOT or not AOT heap. Or we add an other CodeHeap
>>> method contains_blob(CodeBlob *) which do that.
>>
>> OK, I added the CodeHeap::contains_blob(CodeBlob*) virtual method, as
>> you suggested. I updated both CodeCache::get_code_heap() and
>> CodeBlobIterator::CodeBlobIterator() to use that method instead of the
>> CodeHeap::contains(void*) method (which is also virtual, so we did not
>> increase the number of virtual calls by using contains_blob()).
>>
>> Here is the updated webrev:
>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.02/
>>
>>>
>>>>
>>>> If the above is true, we could add a virtual method
>>>> CodeBlob::blob_begin() to CodeBlob, which would return 'this'.
>>>> AOTCompiledMethod would then override blob_begin() to return
>>>> code_begin(). In CodeCache::get_code_heap() we would then use
>>>> ...contains(cb->blob_begin()). Would that make sense? (I'm sure we can
>>>> find some better for that method.)
>>>
>>> We worked very hard to avoid virtual calls in AOT code since we saw
>>> performance regression when scanning code cache. So we can't do
>>> virtual calls.
>>
>> I see. The solution in webrev.02 above does not increase the number of
>> virtual calls (unless I miss something, of course).
>>
>> I tested the new patch with JPRT, all tests pass. I'll start RBT soon.
>>
>> Thank you!
>>
>> Best regards,
>>
>>
>> Zoltan
>>
>>>
>>> Thanks,
>>> Vladimir
>>>
>>>>
>>>>> - I'm not sure if we should make CodeCacheMinBlockLength a product
>>>>> flag because we would then also need more testing with different flag
>>>>> values. What about making it diagnostic or experimental?
>>>>
>>>> Let's make it diagnostic then.
>>>>
>>>> Here is the updated webrev:
>>>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.01/
>>>>
>>>> JPRT testing is in progress.
>>>>
>>>> Thank you!
>>>>
>>>> Best regards,
>>>>
>>>>
>>>> Zoltan
>>>>
>>>>>
>>>>> Thanks,
>>>>> Tobias
>>>>>
>>>>> On 06.02.2017 16:29, Zoltán Majó wrote:
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> please review the fix for 8173151.
>>>>>>
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8173151
>>>>>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.00/
>>>>>>
>>>>>> The crash reported in the bug is caused by the corruption of the
>>>>>> 'non-profiled nmethods' code heap. CodeHeap::_freelist for that code
>>>>>> heap points to an address one segment before the heap's address
>>>>>> space. The sweeper starts iterating through the code heap from the
>>>>>> beginning of the heap's address space. Thus, the sweeper assumes
>>>>>> that
>>>>>> the first item in the code heap is a HeapBlock/FreeBlock (with the
>>>>>> appropriate length and usage information). However, that is not the
>>>>>> case, as the first item in the heap is actually *before* the
>>>>>> heap. So
>>>>>> the sweeper crashes.
>>>>>>
>>>>>> This is a hard-to-reproduce problem (I managed to reproduce it only
>>>>>> once in 350 iterations, each iteration taking ~25 minutes). So the
>>>>>> fix I propose is based on core file debugging and source code
>>>>>> investigation. But I managed to write a regression test that
>>>>>> triggers
>>>>>> a problem similar to the original problem.
>>>>>>
>>>>>> I think that problem happens because a CodeBlob allocated in one
>>>>>> code
>>>>>> heap (A) is returned to a different code heap (B). When the CodeBlob
>>>>>> is returned B, it is added to B's freelist. However, as the CodeBlob
>>>>>> was allocated in A, the freelist of B now points into A (i.e., B is
>>>>>> corrupted).
>>>>>>
>>>>>> The code cache uses CodeCache::get_code_heap(const CodeBlob* cb) to
>>>>>> determine to which code heap a 'cb' is supposed to be returned to.
>>>>>> Since 8171008 (AOT) [1], the check is:
>>>>>>
>>>>>> CodeHeap* CodeCache::get_code_heap(const CodeBlob* cb) {
>>>>>> assert(cb != NULL, "CodeBlob is null");
>>>>>> FOR_ALL_HEAPS(heap) {
>>>>>> - if ((*heap)->contains(cb)) {
>>>>>> + if ((*heap)->contains(cb->code_begin())) {
>>>>>> return *heap;
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> The blob 'cb' can be returned to the wrong heap if, for example:
>>>>>> - 'cb' is at the end code heap A and
>>>>>> - the size of the code contained in 'cb' is 0 (therefore
>>>>>> code_begin()
>>>>>> returns the address after 'cb', i.e., the first address in code
>>>>>> heap B).
>>>>>>
>>>>>> The fix proposes to restore CodeCache::get_code_heap() to its
>>>>>> pre-AOT
>>>>>> state (as I'm not aware of the reasons why AOT changed that
>>>>>> check). I
>>>>>> also propose to add some guarantees after allocation/deallocation in
>>>>>> the code heap to possibly easier catch this or related problems in
>>>>>> the future.
>>>>>>
>>>>>> The regression test I propose achieves the above condition and
>>>>>> results in a crash. The regression test works only with product
>>>>>> builds, because in a product build a BufferBlob fits into one
>>>>>> segment
>>>>>> whereas in a fastdebug build it does not.
>>>>>>
>>>>>> The test needs to set the CodeCacheMinBlockLength flag to 1. The
>>>>>> flag
>>>>>> is currently develop and we would need to make it product for the
>>>>>> test to work. (Other flags controlling the code cache, e.g.,
>>>>>> CodeCacheExpansionSize, are also product.) I could also experiment
>>>>>> with reproducing the problem with different block lengths/segment
>>>>>> sizes, but that would most likely make the test more fragile (and
>>>>>> CodeCacheSegmentSize is anyway develop as well).
>>>>>>
>>>>>> I tested the fix with JPRT, RBT is in progress.
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>>
>>>>>> Zoltan
>>>>>>
>>>>>> [1]
>>>>>> http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/777aaa19c4b1#l116.71 
>>>>>>
>>>>>>
>>>>
>>

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test

Zoltán Majó
In reply to this post by Vladimir Kozlov
Hi Vladimir,


On 02/07/2017 06:27 PM, Vladimir Kozlov wrote:
> Hi, Zoltan
>
> Is it possible that the problem exist in free list maintenance code?
> Ot allocation?
>
> Or free size at the end of heap calculated incorrectly when CodeBlob
> is allocated there and it can cross boundary?

I'm not sure. AOT did not change allocation / free list management code.
It only changed CodeCache::get_code_heap() to use cb->code_begin()
instead of cb directly.

Also, the failure with RandomAllocationTest appeared only after AOT was
pushed. Otherwise code cache management was not changed recently and
we've been running RandomAllocationTest for two years now. That makes it
likely that we would have hit this issue earlier.

>
> Can you look in core file that all 3 heaps have consistent start
> address and size so that next heap start at correct address. Like
> _number_of_committed_segments, _number_of_reserved_segments,
> _next_segment
>
> I am concern that may be there is a code which check status of next
> segment and use it regardless if it belong to current heap or not.

Thank you for your suggestions!

The test failed with -XX:+TieredCompilation -XX:TieredStopAtLevel=1 and
AOT disabled. So there are only two heaps. Here is the information for
about both:


(1) non-profiled nmethods heap

CodeHeap 'non-profiled nmethods': size=238384Kb used=178496Kb
max_used=224080Kb free=59887Kb
  bounds [0x000000011b20c000, 0x0000000129ad8000, 0x0000000129ad8000]

Length of segment: 0xE8CC000 = 0x1D1980 segments

_next_segment:                          0x00000000001d13e6 -- < 0x1D1980
so within current heap
_number_of_committed_segments:          0x00000000001d1980 -- OK
_number_of_reserved_segments:           0x00000000001d1980 -- OK
_freelist_segments:                     0x00000000000749e1 -- < 0x1D1980
so fits into current heap
_freelist:                              0x000000011b20bf80 -- points
into other code heap


(2) non nmethod code heap

CodeHeap 'non-nmethods': size=7376Kb used=5822Kb max_used=7376Kb free=1553Kb
  bounds [0x000000011aad8000, 0x000000011b20c000, 0x000000011b20c000]

Length of segment: 0x734000 = 0xe680 segments -- OK

_next_segment:                          0x000000000000e680 -- size of
current heap
_number_of_committed_segments:          0x000000000000e680 -- size of
current heap
_number_of_reserved_segments:           0x000000000000e680 -- size of
current heap
_freelist_segments:                     0x000000000000308d -- < 0xe680
so fits into current heap
_freelist:                              0x000000011ada6780 -- points
into the current code heap


There are two (possibly) suspicious elements:
- For (1), the _freelist points to into (2) -- we've discussed this before.
- For (2), _next_segment, _number_of_committed_segments, and
_number_of_reserved_segments is equal to the size of the code heap. That
is, _next_segment of (2) points into (1).

So I checked places where _next_segment is modified. Other than cases
when _next_segment is initialized to 0, only CodeHeap::allocate()
changes _next_segment.

http://hg.openjdk.java.net/jdk9/hs/hotspot/file/9b2ab23d5676/src/share/vm/memory/heap.cpp#l202

But I don't think an allocation can cause the code heap to overflow
there. An overflow would imply that
- (a) the number of committed segments is equal to the size of the
memory space reserved for the code heap (in segments) and
- (b) the newly allocated block points to (or after)
_number_of_committed_segments.

We can return a block pointing to _number_of_committed_segments if
_next_segment==_number_of_committed_segments and number_of_segments <= 0
so that the condition

_next_segment + number_of_segments <= _number_of_committed_segments

holds. But we have an assert before that checks that number_of_segments
 >= sizeof(FreeBlock) (which is in turn > 0).

Unfortunately, I don't see other ways the heap can become corrupted. But
let's hope the guarantees the patch plans to add help catch this problem
in the future (in case the patch does not already fix the problem).

Thank you!

Best regards,


Zoltan


>
> Thanks,
> Vladimir
>
> On 2/7/17 5:17 AM, Zoltán Majó wrote:
>> Hi Dean,
>> Hi Tobias,
>>
>>
>> thank you for the feedback!
>>
>> On 02/07/2017 07:01 AM, Tobias Hartmann wrote:
>>> Hi Dean,
>>>
>>> On 07.02.2017 00:14, [hidden email] wrote:
>>>> When do we allocate a CodeBlob with a code size of 0?  Is it really
>>>> useful?  Would having a minimum code size of 1 fix the problem?
>>> Yes, I've discussed that with Zoltan offline and I agree that we
>>> should enforce a minimum code size of 1 and fix the stress test(s)
>>> accordingly. It shouldn't be necessary to be able to create regular
>>> CodeBlobs with code size 0.
>>
>> It is indeed a valid point that code sizes of 0 are not useful in
>> practice.
>>
>> However, the problem is a bit more complicated than the regression test
>> (and my description) suggests it to be.  The regression test works only
>> if CodeCacheMinBlockLength set to 1.  Only that way is it possible to
>> allocate BufferBlobs with size == 1 segment and set
>> CodeBlob::_code_begin to point to the next segment (which is possibly in
>> the next code heap).
>>
>> For both crashes I've seen, however, CodeCacheMinBlockLength was equal
>> to 4.  With that CodeCacheMinBlockLength value, every allocation in the
>> code heap is at least 4 segments long.  It is therefore impossible to
>> create a BufferBlob whose _code_begin points "outside" of the space
>> reserved for that BufferBlob.  (Because BufferBlob::_code_begin will
>> either point to the beginning of second segment of the BufferBlob --
>> with product -- or somewhere into the second segment -- with fastdebug.)
>>
>> Other blob types can have their _code_begin point to the boundary of 4
>> segments.  E.g., nmethods are between 2 and 3 segments long.  If the
>> relocation information in the nmethod fills up the space to the 4
>> segment boundary, _code_begin will be exactly at the segment boundary.
>>
>> But I don't know the exact sequence of steps lead to the code heap
>> corruption (because the crash reproduced only twice so far).  I just
>> know that the first free block in the (corrupted) code heap is of length
>> 8 segments (the contents of the first freelist item and of the segmap
>> both confirm this); the heap is corrupted because the freelist starts
>> before the code heap.
>>
>> What sequence of steps lead to this?  Some nmethod
>> allocations/deallocations interleaved with BufferBlob
>> allocations/deallocations (possibly of size 0)?  I'm not sure.  In what
>> way were blocks in the heap merged/split by the
>> allocations/deallocations leading to the crash?  I don't know.
>>
>> Did allocation/deallocation of BufferBlobs of size 0 play a role at
>> all?  I don't know either and I won't be able to tell unless the problem
>> reproduces more often.  However, by using code size == 0 it is possible
>> to write a regression test that triggers a real problem (which is
>> hopefully related to the crashes we've been seeing).
>>
>> So I'd like to allow code sizes of 0 for now and attempt to fix
>> CodeCache::get_code_heap() and the CodeBlobIterator constructor the way
>> Vladimir suggested.  If that turns out to be difficult/risky or require
>> large code changes, I'll propose a fix instead that disallows code sizes
>> of 1 and fixes the test(s) -- just as you suggested.
>>
>> Best regards,
>>
>>
>> Zoltan
>>
>>> Best regards,
>>> Tobias
>>>
>>>> dl
>>>>
>>>>
>>>> On 2/6/17 7:29 AM, Zoltán Majó wrote:
>>>>> Hi,
>>>>>
>>>>>
>>>>> please review the fix for 8173151.
>>>>>
>>>>> https://bugs.openjdk.java.net/browse/JDK-8173151
>>>>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.00/
>>>>>
>>>>> The crash reported in the bug is caused by the corruption of the
>>>>> 'non-profiled nmethods' code heap. CodeHeap::_freelist for that code
>>>>> heap points to an address one segment before the heap's address
>>>>> space. The sweeper starts iterating through the code heap from the
>>>>> beginning of the heap's address space. Thus, the sweeper assumes
>>>>> that the first item in the code heap is a HeapBlock/FreeBlock (with
>>>>> the appropriate length and usage information). However, that is not
>>>>> the case, as the first item in the heap is actually *before* the
>>>>> heap. So the sweeper crashes.
>>>>>
>>>>> This is a hard-to-reproduce problem (I managed to reproduce it only
>>>>> once in 350 iterations, each iteration taking ~25 minutes). So the
>>>>> fix I propose is based on core file debugging and source code
>>>>> investigation. But I managed to write a regression test that
>>>>> triggers a problem similar to the original problem.
>>>>>
>>>>> I think that problem happens because a CodeBlob allocated in one
>>>>> code heap (A) is returned to a different code heap (B). When the
>>>>> CodeBlob is returned B, it is added to B's freelist. However, as the
>>>>> CodeBlob was allocated in A, the freelist of B now points into A
>>>>> (i.e., B is corrupted).
>>>>>
>>>>> The code cache uses CodeCache::get_code_heap(const CodeBlob* cb) to
>>>>> determine to which code heap a 'cb' is supposed to be returned to.
>>>>> Since 8171008 (AOT) [1], the check is:
>>>>>
>>>>> CodeHeap* CodeCache::get_code_heap(const CodeBlob* cb) {
>>>>> assert(cb != NULL, "CodeBlob is null");
>>>>> FOR_ALL_HEAPS(heap) {
>>>>> - if ((*heap)->contains(cb)) {
>>>>> + if ((*heap)->contains(cb->code_begin())) {
>>>>> return *heap;
>>>>> }
>>>>> }
>>>>>
>>>>> The blob 'cb' can be returned to the wrong heap if, for example:
>>>>> - 'cb' is at the end code heap A and
>>>>> - the size of the code contained in 'cb' is 0 (therefore
>>>>> code_begin() returns the address after 'cb', i.e., the first address
>>>>> in code heap B).
>>>>>
>>>>> The fix proposes to restore CodeCache::get_code_heap() to its
>>>>> pre-AOT state (as I'm not aware of the reasons why AOT changed that
>>>>> check). I also propose to add some guarantees after
>>>>> allocation/deallocation in the code heap to possibly easier catch
>>>>> this or related problems in the future.
>>>>>
>>>>> The regression test I propose achieves the above condition and
>>>>> results in a crash. The regression test works only with product
>>>>> builds, because in a product build a BufferBlob fits into one
>>>>> segment whereas in a fastdebug build it does not.
>>>>>
>>>>> The test needs to set the CodeCacheMinBlockLength flag to 1. The
>>>>> flag is currently develop and we would need to make it product for
>>>>> the test to work. (Other flags controlling the code cache, e.g.,
>>>>> CodeCacheExpansionSize, are also product.) I could also experiment
>>>>> with reproducing the problem with different block lengths/segment
>>>>> sizes, but that would most likely make the test more fragile (and
>>>>> CodeCacheSegmentSize is anyway develop as well).
>>>>>
>>>>> I tested the fix with JPRT, RBT is in progress.
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Best regards,
>>>>>
>>>>>
>>>>> Zoltan
>>>>>
>>>>> [1]
>>>>> http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/777aaa19c4b1#l116.71
>>>>>
>>

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test

Vladimir Kozlov
Thank you for looking on this, Zoltan

 > Also, the failure with RandomAllocationTest appeared only after AOT was
 > pushed. Otherwise code cache management was not changed recently and
 > we've been running RandomAllocationTest for two years now. That makes it
 > likely that we would have hit this issue earlier.

Good point.

Looking on the RandomAllocationTest test and it uses Random to generate
size of CodeBlob and indeed could be 0 (+ header size). So lets hope
this change fix the problem.

Thanks,
Vladimir

On 2/8/17 4:39 AM, Zoltán Majó wrote:

> Hi Vladimir,
>
>
> On 02/07/2017 06:27 PM, Vladimir Kozlov wrote:
>> Hi, Zoltan
>>
>> Is it possible that the problem exist in free list maintenance code?
>> Ot allocation?
>>
>> Or free size at the end of heap calculated incorrectly when CodeBlob
>> is allocated there and it can cross boundary?
>
> I'm not sure. AOT did not change allocation / free list management code.
> It only changed CodeCache::get_code_heap() to use cb->code_begin()
> instead of cb directly.
>
> Also, the failure with RandomAllocationTest appeared only after AOT was
> pushed. Otherwise code cache management was not changed recently and
> we've been running RandomAllocationTest for two years now. That makes it
> likely that we would have hit this issue earlier.
>
>>
>> Can you look in core file that all 3 heaps have consistent start
>> address and size so that next heap start at correct address. Like
>> _number_of_committed_segments, _number_of_reserved_segments,
>> _next_segment
>>
>> I am concern that may be there is a code which check status of next
>> segment and use it regardless if it belong to current heap or not.
>
> Thank you for your suggestions!
>
> The test failed with -XX:+TieredCompilation -XX:TieredStopAtLevel=1 and
> AOT disabled. So there are only two heaps. Here is the information for
> about both:
>
>
> (1) non-profiled nmethods heap
>
> CodeHeap 'non-profiled nmethods': size=238384Kb used=178496Kb
> max_used=224080Kb free=59887Kb
>  bounds [0x000000011b20c000, 0x0000000129ad8000, 0x0000000129ad8000]
>
> Length of segment: 0xE8CC000 = 0x1D1980 segments
>
> _next_segment:                          0x00000000001d13e6 -- < 0x1D1980
> so within current heap
> _number_of_committed_segments:          0x00000000001d1980 -- OK
> _number_of_reserved_segments:           0x00000000001d1980 -- OK
> _freelist_segments:                     0x00000000000749e1 -- < 0x1D1980
> so fits into current heap
> _freelist:                              0x000000011b20bf80 -- points
> into other code heap
>
>
> (2) non nmethod code heap
>
> CodeHeap 'non-nmethods': size=7376Kb used=5822Kb max_used=7376Kb
> free=1553Kb
>  bounds [0x000000011aad8000, 0x000000011b20c000, 0x000000011b20c000]
>
> Length of segment: 0x734000 = 0xe680 segments -- OK
>
> _next_segment:                          0x000000000000e680 -- size of
> current heap
> _number_of_committed_segments:          0x000000000000e680 -- size of
> current heap
> _number_of_reserved_segments:           0x000000000000e680 -- size of
> current heap
> _freelist_segments:                     0x000000000000308d -- < 0xe680
> so fits into current heap
> _freelist:                              0x000000011ada6780 -- points
> into the current code heap
>
>
> There are two (possibly) suspicious elements:
> - For (1), the _freelist points to into (2) -- we've discussed this before.
> - For (2), _next_segment, _number_of_committed_segments, and
> _number_of_reserved_segments is equal to the size of the code heap. That
> is, _next_segment of (2) points into (1).
>
> So I checked places where _next_segment is modified. Other than cases
> when _next_segment is initialized to 0, only CodeHeap::allocate()
> changes _next_segment.
>
> http://hg.openjdk.java.net/jdk9/hs/hotspot/file/9b2ab23d5676/src/share/vm/memory/heap.cpp#l202
>
>
> But I don't think an allocation can cause the code heap to overflow
> there. An overflow would imply that
> - (a) the number of committed segments is equal to the size of the
> memory space reserved for the code heap (in segments) and
> - (b) the newly allocated block points to (or after)
> _number_of_committed_segments.
>
> We can return a block pointing to _number_of_committed_segments if
> _next_segment==_number_of_committed_segments and number_of_segments <= 0
> so that the condition
>
> _next_segment + number_of_segments <= _number_of_committed_segments
>
> holds. But we have an assert before that checks that number_of_segments
>>= sizeof(FreeBlock) (which is in turn > 0).
>
> Unfortunately, I don't see other ways the heap can become corrupted. But
> let's hope the guarantees the patch plans to add help catch this problem
> in the future (in case the patch does not already fix the problem).
>
> Thank you!
>
> Best regards,
>
>
> Zoltan
>
>
>>
>> Thanks,
>> Vladimir
>>
>> On 2/7/17 5:17 AM, Zoltán Majó wrote:
>>> Hi Dean,
>>> Hi Tobias,
>>>
>>>
>>> thank you for the feedback!
>>>
>>> On 02/07/2017 07:01 AM, Tobias Hartmann wrote:
>>>> Hi Dean,
>>>>
>>>> On 07.02.2017 00:14, [hidden email] wrote:
>>>>> When do we allocate a CodeBlob with a code size of 0?  Is it really
>>>>> useful?  Would having a minimum code size of 1 fix the problem?
>>>> Yes, I've discussed that with Zoltan offline and I agree that we
>>>> should enforce a minimum code size of 1 and fix the stress test(s)
>>>> accordingly. It shouldn't be necessary to be able to create regular
>>>> CodeBlobs with code size 0.
>>>
>>> It is indeed a valid point that code sizes of 0 are not useful in
>>> practice.
>>>
>>> However, the problem is a bit more complicated than the regression test
>>> (and my description) suggests it to be.  The regression test works only
>>> if CodeCacheMinBlockLength set to 1.  Only that way is it possible to
>>> allocate BufferBlobs with size == 1 segment and set
>>> CodeBlob::_code_begin to point to the next segment (which is possibly in
>>> the next code heap).
>>>
>>> For both crashes I've seen, however, CodeCacheMinBlockLength was equal
>>> to 4.  With that CodeCacheMinBlockLength value, every allocation in the
>>> code heap is at least 4 segments long.  It is therefore impossible to
>>> create a BufferBlob whose _code_begin points "outside" of the space
>>> reserved for that BufferBlob.  (Because BufferBlob::_code_begin will
>>> either point to the beginning of second segment of the BufferBlob --
>>> with product -- or somewhere into the second segment -- with fastdebug.)
>>>
>>> Other blob types can have their _code_begin point to the boundary of 4
>>> segments.  E.g., nmethods are between 2 and 3 segments long.  If the
>>> relocation information in the nmethod fills up the space to the 4
>>> segment boundary, _code_begin will be exactly at the segment boundary.
>>>
>>> But I don't know the exact sequence of steps lead to the code heap
>>> corruption (because the crash reproduced only twice so far).  I just
>>> know that the first free block in the (corrupted) code heap is of length
>>> 8 segments (the contents of the first freelist item and of the segmap
>>> both confirm this); the heap is corrupted because the freelist starts
>>> before the code heap.
>>>
>>> What sequence of steps lead to this?  Some nmethod
>>> allocations/deallocations interleaved with BufferBlob
>>> allocations/deallocations (possibly of size 0)?  I'm not sure.  In what
>>> way were blocks in the heap merged/split by the
>>> allocations/deallocations leading to the crash?  I don't know.
>>>
>>> Did allocation/deallocation of BufferBlobs of size 0 play a role at
>>> all?  I don't know either and I won't be able to tell unless the problem
>>> reproduces more often.  However, by using code size == 0 it is possible
>>> to write a regression test that triggers a real problem (which is
>>> hopefully related to the crashes we've been seeing).
>>>
>>> So I'd like to allow code sizes of 0 for now and attempt to fix
>>> CodeCache::get_code_heap() and the CodeBlobIterator constructor the way
>>> Vladimir suggested.  If that turns out to be difficult/risky or require
>>> large code changes, I'll propose a fix instead that disallows code sizes
>>> of 1 and fixes the test(s) -- just as you suggested.
>>>
>>> Best regards,
>>>
>>>
>>> Zoltan
>>>
>>>> Best regards,
>>>> Tobias
>>>>
>>>>> dl
>>>>>
>>>>>
>>>>> On 2/6/17 7:29 AM, Zoltán Majó wrote:
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> please review the fix for 8173151.
>>>>>>
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8173151
>>>>>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.00/
>>>>>>
>>>>>> The crash reported in the bug is caused by the corruption of the
>>>>>> 'non-profiled nmethods' code heap. CodeHeap::_freelist for that code
>>>>>> heap points to an address one segment before the heap's address
>>>>>> space. The sweeper starts iterating through the code heap from the
>>>>>> beginning of the heap's address space. Thus, the sweeper assumes
>>>>>> that the first item in the code heap is a HeapBlock/FreeBlock (with
>>>>>> the appropriate length and usage information). However, that is not
>>>>>> the case, as the first item in the heap is actually *before* the
>>>>>> heap. So the sweeper crashes.
>>>>>>
>>>>>> This is a hard-to-reproduce problem (I managed to reproduce it only
>>>>>> once in 350 iterations, each iteration taking ~25 minutes). So the
>>>>>> fix I propose is based on core file debugging and source code
>>>>>> investigation. But I managed to write a regression test that
>>>>>> triggers a problem similar to the original problem.
>>>>>>
>>>>>> I think that problem happens because a CodeBlob allocated in one
>>>>>> code heap (A) is returned to a different code heap (B). When the
>>>>>> CodeBlob is returned B, it is added to B's freelist. However, as the
>>>>>> CodeBlob was allocated in A, the freelist of B now points into A
>>>>>> (i.e., B is corrupted).
>>>>>>
>>>>>> The code cache uses CodeCache::get_code_heap(const CodeBlob* cb) to
>>>>>> determine to which code heap a 'cb' is supposed to be returned to.
>>>>>> Since 8171008 (AOT) [1], the check is:
>>>>>>
>>>>>> CodeHeap* CodeCache::get_code_heap(const CodeBlob* cb) {
>>>>>> assert(cb != NULL, "CodeBlob is null");
>>>>>> FOR_ALL_HEAPS(heap) {
>>>>>> - if ((*heap)->contains(cb)) {
>>>>>> + if ((*heap)->contains(cb->code_begin())) {
>>>>>> return *heap;
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> The blob 'cb' can be returned to the wrong heap if, for example:
>>>>>> - 'cb' is at the end code heap A and
>>>>>> - the size of the code contained in 'cb' is 0 (therefore
>>>>>> code_begin() returns the address after 'cb', i.e., the first address
>>>>>> in code heap B).
>>>>>>
>>>>>> The fix proposes to restore CodeCache::get_code_heap() to its
>>>>>> pre-AOT state (as I'm not aware of the reasons why AOT changed that
>>>>>> check). I also propose to add some guarantees after
>>>>>> allocation/deallocation in the code heap to possibly easier catch
>>>>>> this or related problems in the future.
>>>>>>
>>>>>> The regression test I propose achieves the above condition and
>>>>>> results in a crash. The regression test works only with product
>>>>>> builds, because in a product build a BufferBlob fits into one
>>>>>> segment whereas in a fastdebug build it does not.
>>>>>>
>>>>>> The test needs to set the CodeCacheMinBlockLength flag to 1. The
>>>>>> flag is currently develop and we would need to make it product for
>>>>>> the test to work. (Other flags controlling the code cache, e.g.,
>>>>>> CodeCacheExpansionSize, are also product.) I could also experiment
>>>>>> with reproducing the problem with different block lengths/segment
>>>>>> sizes, but that would most likely make the test more fragile (and
>>>>>> CodeCacheSegmentSize is anyway develop as well).
>>>>>>
>>>>>> I tested the fix with JPRT, RBT is in progress.
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>>
>>>>>> Zoltan
>>>>>>
>>>>>> [1]
>>>>>> http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/777aaa19c4b1#l116.71
>>>>>>
>>>
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test

Zoltán Majó


On 02/08/2017 07:33 PM, Vladimir Kozlov wrote:

> Thank you for looking on this, Zoltan
>
> > Also, the failure with RandomAllocationTest appeared only after AOT was
> > pushed. Otherwise code cache management was not changed recently and
> > we've been running RandomAllocationTest for two years now. That
> makes it
> > likely that we would have hit this issue earlier.
>
> Good point.
>
> Looking on the RandomAllocationTest test and it uses Random to
> generate size of CodeBlob and indeed could be 0 (+ header size). So
> lets hope this change fix the problem.

Thank you, Vladimir!

Best regards,


Zoltan

>
> Thanks,
> Vladimir
>
> On 2/8/17 4:39 AM, Zoltán Majó wrote:
>> Hi Vladimir,
>>
>>
>> On 02/07/2017 06:27 PM, Vladimir Kozlov wrote:
>>> Hi, Zoltan
>>>
>>> Is it possible that the problem exist in free list maintenance code?
>>> Ot allocation?
>>>
>>> Or free size at the end of heap calculated incorrectly when CodeBlob
>>> is allocated there and it can cross boundary?
>>
>> I'm not sure. AOT did not change allocation / free list management code.
>> It only changed CodeCache::get_code_heap() to use cb->code_begin()
>> instead of cb directly.
>>
>> Also, the failure with RandomAllocationTest appeared only after AOT was
>> pushed. Otherwise code cache management was not changed recently and
>> we've been running RandomAllocationTest for two years now. That makes it
>> likely that we would have hit this issue earlier.
>>
>>>
>>> Can you look in core file that all 3 heaps have consistent start
>>> address and size so that next heap start at correct address. Like
>>> _number_of_committed_segments, _number_of_reserved_segments,
>>> _next_segment
>>>
>>> I am concern that may be there is a code which check status of next
>>> segment and use it regardless if it belong to current heap or not.
>>
>> Thank you for your suggestions!
>>
>> The test failed with -XX:+TieredCompilation -XX:TieredStopAtLevel=1 and
>> AOT disabled. So there are only two heaps. Here is the information for
>> about both:
>>
>>
>> (1) non-profiled nmethods heap
>>
>> CodeHeap 'non-profiled nmethods': size=238384Kb used=178496Kb
>> max_used=224080Kb free=59887Kb
>>  bounds [0x000000011b20c000, 0x0000000129ad8000, 0x0000000129ad8000]
>>
>> Length of segment: 0xE8CC000 = 0x1D1980 segments
>>
>> _next_segment:                          0x00000000001d13e6 -- < 0x1D1980
>> so within current heap
>> _number_of_committed_segments:          0x00000000001d1980 -- OK
>> _number_of_reserved_segments:           0x00000000001d1980 -- OK
>> _freelist_segments:                     0x00000000000749e1 -- < 0x1D1980
>> so fits into current heap
>> _freelist:                              0x000000011b20bf80 -- points
>> into other code heap
>>
>>
>> (2) non nmethod code heap
>>
>> CodeHeap 'non-nmethods': size=7376Kb used=5822Kb max_used=7376Kb
>> free=1553Kb
>>  bounds [0x000000011aad8000, 0x000000011b20c000, 0x000000011b20c000]
>>
>> Length of segment: 0x734000 = 0xe680 segments -- OK
>>
>> _next_segment:                          0x000000000000e680 -- size of
>> current heap
>> _number_of_committed_segments:          0x000000000000e680 -- size of
>> current heap
>> _number_of_reserved_segments:           0x000000000000e680 -- size of
>> current heap
>> _freelist_segments:                     0x000000000000308d -- < 0xe680
>> so fits into current heap
>> _freelist:                              0x000000011ada6780 -- points
>> into the current code heap
>>
>>
>> There are two (possibly) suspicious elements:
>> - For (1), the _freelist points to into (2) -- we've discussed this
>> before.
>> - For (2), _next_segment, _number_of_committed_segments, and
>> _number_of_reserved_segments is equal to the size of the code heap. That
>> is, _next_segment of (2) points into (1).
>>
>> So I checked places where _next_segment is modified. Other than cases
>> when _next_segment is initialized to 0, only CodeHeap::allocate()
>> changes _next_segment.
>>
>> http://hg.openjdk.java.net/jdk9/hs/hotspot/file/9b2ab23d5676/src/share/vm/memory/heap.cpp#l202 
>>
>>
>>
>> But I don't think an allocation can cause the code heap to overflow
>> there. An overflow would imply that
>> - (a) the number of committed segments is equal to the size of the
>> memory space reserved for the code heap (in segments) and
>> - (b) the newly allocated block points to (or after)
>> _number_of_committed_segments.
>>
>> We can return a block pointing to _number_of_committed_segments if
>> _next_segment==_number_of_committed_segments and number_of_segments <= 0
>> so that the condition
>>
>> _next_segment + number_of_segments <= _number_of_committed_segments
>>
>> holds. But we have an assert before that checks that number_of_segments
>>> = sizeof(FreeBlock) (which is in turn > 0).
>>
>> Unfortunately, I don't see other ways the heap can become corrupted. But
>> let's hope the guarantees the patch plans to add help catch this problem
>> in the future (in case the patch does not already fix the problem).
>>
>> Thank you!
>>
>> Best regards,
>>
>>
>> Zoltan
>>
>>
>>>
>>> Thanks,
>>> Vladimir
>>>
>>> On 2/7/17 5:17 AM, Zoltán Majó wrote:
>>>> Hi Dean,
>>>> Hi Tobias,
>>>>
>>>>
>>>> thank you for the feedback!
>>>>
>>>> On 02/07/2017 07:01 AM, Tobias Hartmann wrote:
>>>>> Hi Dean,
>>>>>
>>>>> On 07.02.2017 00:14, [hidden email] wrote:
>>>>>> When do we allocate a CodeBlob with a code size of 0?  Is it really
>>>>>> useful?  Would having a minimum code size of 1 fix the problem?
>>>>> Yes, I've discussed that with Zoltan offline and I agree that we
>>>>> should enforce a minimum code size of 1 and fix the stress test(s)
>>>>> accordingly. It shouldn't be necessary to be able to create regular
>>>>> CodeBlobs with code size 0.
>>>>
>>>> It is indeed a valid point that code sizes of 0 are not useful in
>>>> practice.
>>>>
>>>> However, the problem is a bit more complicated than the regression
>>>> test
>>>> (and my description) suggests it to be.  The regression test works
>>>> only
>>>> if CodeCacheMinBlockLength set to 1.  Only that way is it possible to
>>>> allocate BufferBlobs with size == 1 segment and set
>>>> CodeBlob::_code_begin to point to the next segment (which is
>>>> possibly in
>>>> the next code heap).
>>>>
>>>> For both crashes I've seen, however, CodeCacheMinBlockLength was equal
>>>> to 4.  With that CodeCacheMinBlockLength value, every allocation in
>>>> the
>>>> code heap is at least 4 segments long.  It is therefore impossible to
>>>> create a BufferBlob whose _code_begin points "outside" of the space
>>>> reserved for that BufferBlob.  (Because BufferBlob::_code_begin will
>>>> either point to the beginning of second segment of the BufferBlob --
>>>> with product -- or somewhere into the second segment -- with
>>>> fastdebug.)
>>>>
>>>> Other blob types can have their _code_begin point to the boundary of 4
>>>> segments.  E.g., nmethods are between 2 and 3 segments long.  If the
>>>> relocation information in the nmethod fills up the space to the 4
>>>> segment boundary, _code_begin will be exactly at the segment boundary.
>>>>
>>>> But I don't know the exact sequence of steps lead to the code heap
>>>> corruption (because the crash reproduced only twice so far).  I just
>>>> know that the first free block in the (corrupted) code heap is of
>>>> length
>>>> 8 segments (the contents of the first freelist item and of the segmap
>>>> both confirm this); the heap is corrupted because the freelist starts
>>>> before the code heap.
>>>>
>>>> What sequence of steps lead to this?  Some nmethod
>>>> allocations/deallocations interleaved with BufferBlob
>>>> allocations/deallocations (possibly of size 0)?  I'm not sure.  In
>>>> what
>>>> way were blocks in the heap merged/split by the
>>>> allocations/deallocations leading to the crash?  I don't know.
>>>>
>>>> Did allocation/deallocation of BufferBlobs of size 0 play a role at
>>>> all?  I don't know either and I won't be able to tell unless the
>>>> problem
>>>> reproduces more often.  However, by using code size == 0 it is
>>>> possible
>>>> to write a regression test that triggers a real problem (which is
>>>> hopefully related to the crashes we've been seeing).
>>>>
>>>> So I'd like to allow code sizes of 0 for now and attempt to fix
>>>> CodeCache::get_code_heap() and the CodeBlobIterator constructor the
>>>> way
>>>> Vladimir suggested.  If that turns out to be difficult/risky or
>>>> require
>>>> large code changes, I'll propose a fix instead that disallows code
>>>> sizes
>>>> of 1 and fixes the test(s) -- just as you suggested.
>>>>
>>>> Best regards,
>>>>
>>>>
>>>> Zoltan
>>>>
>>>>> Best regards,
>>>>> Tobias
>>>>>
>>>>>> dl
>>>>>>
>>>>>>
>>>>>> On 2/6/17 7:29 AM, Zoltán Majó wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>>
>>>>>>> please review the fix for 8173151.
>>>>>>>
>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8173151
>>>>>>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.00/
>>>>>>>
>>>>>>> The crash reported in the bug is caused by the corruption of the
>>>>>>> 'non-profiled nmethods' code heap. CodeHeap::_freelist for that
>>>>>>> code
>>>>>>> heap points to an address one segment before the heap's address
>>>>>>> space. The sweeper starts iterating through the code heap from the
>>>>>>> beginning of the heap's address space. Thus, the sweeper assumes
>>>>>>> that the first item in the code heap is a HeapBlock/FreeBlock (with
>>>>>>> the appropriate length and usage information). However, that is not
>>>>>>> the case, as the first item in the heap is actually *before* the
>>>>>>> heap. So the sweeper crashes.
>>>>>>>
>>>>>>> This is a hard-to-reproduce problem (I managed to reproduce it only
>>>>>>> once in 350 iterations, each iteration taking ~25 minutes). So the
>>>>>>> fix I propose is based on core file debugging and source code
>>>>>>> investigation. But I managed to write a regression test that
>>>>>>> triggers a problem similar to the original problem.
>>>>>>>
>>>>>>> I think that problem happens because a CodeBlob allocated in one
>>>>>>> code heap (A) is returned to a different code heap (B). When the
>>>>>>> CodeBlob is returned B, it is added to B's freelist. However, as
>>>>>>> the
>>>>>>> CodeBlob was allocated in A, the freelist of B now points into A
>>>>>>> (i.e., B is corrupted).
>>>>>>>
>>>>>>> The code cache uses CodeCache::get_code_heap(const CodeBlob* cb) to
>>>>>>> determine to which code heap a 'cb' is supposed to be returned to.
>>>>>>> Since 8171008 (AOT) [1], the check is:
>>>>>>>
>>>>>>> CodeHeap* CodeCache::get_code_heap(const CodeBlob* cb) {
>>>>>>> assert(cb != NULL, "CodeBlob is null");
>>>>>>> FOR_ALL_HEAPS(heap) {
>>>>>>> - if ((*heap)->contains(cb)) {
>>>>>>> + if ((*heap)->contains(cb->code_begin())) {
>>>>>>> return *heap;
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> The blob 'cb' can be returned to the wrong heap if, for example:
>>>>>>> - 'cb' is at the end code heap A and
>>>>>>> - the size of the code contained in 'cb' is 0 (therefore
>>>>>>> code_begin() returns the address after 'cb', i.e., the first
>>>>>>> address
>>>>>>> in code heap B).
>>>>>>>
>>>>>>> The fix proposes to restore CodeCache::get_code_heap() to its
>>>>>>> pre-AOT state (as I'm not aware of the reasons why AOT changed that
>>>>>>> check). I also propose to add some guarantees after
>>>>>>> allocation/deallocation in the code heap to possibly easier catch
>>>>>>> this or related problems in the future.
>>>>>>>
>>>>>>> The regression test I propose achieves the above condition and
>>>>>>> results in a crash. The regression test works only with product
>>>>>>> builds, because in a product build a BufferBlob fits into one
>>>>>>> segment whereas in a fastdebug build it does not.
>>>>>>>
>>>>>>> The test needs to set the CodeCacheMinBlockLength flag to 1. The
>>>>>>> flag is currently develop and we would need to make it product for
>>>>>>> the test to work. (Other flags controlling the code cache, e.g.,
>>>>>>> CodeCacheExpansionSize, are also product.) I could also experiment
>>>>>>> with reproducing the problem with different block lengths/segment
>>>>>>> sizes, but that would most likely make the test more fragile (and
>>>>>>> CodeCacheSegmentSize is anyway develop as well).
>>>>>>>
>>>>>>> I tested the fix with JPRT, RBT is in progress.
>>>>>>>
>>>>>>> Thank you!
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>>
>>>>>>> Zoltan
>>>>>>>
>>>>>>> [1]
>>>>>>> http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/777aaa19c4b1#l116.71 
>>>>>>>
>>>>>>>
>>>>
>>

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test

Rickard Bäckman
In reply to this post by Zoltán Majó
Sorry for being a bit late here...

The reason for the change in the contains() check from cb to
cb->code_begin() is that with the changes for AOT we no longer require
the metadata (AOTCompiledMethod *) and code to be in one blob.

AOT does things like:

code-heap-start
code-method-1
code-method-2
code-method-3
code-method-4
code-heap-end

and when we link those methods we create AOTCompiledMethods which just
have pointers to code-method-1.

The AOTCodeHeap then knows the boundaries of the code heap.
Checking if the heap contains the metadata (AOTCompiledMethod *) will
fail. While checking if it contains the code will succeed.

/R

On 02/06, Zoltán Majó wrote:

> Hi,
>
>
> please review the fix for 8173151.
>
> https://bugs.openjdk.java.net/browse/JDK-8173151
> http://cr.openjdk.java.net/~zmajo/8173151/webrev.00/
>
> The crash reported in the bug is caused by the corruption of the
> 'non-profiled nmethods' code heap. CodeHeap::_freelist for that code
> heap points to an address one segment before the heap's address
> space. The sweeper starts iterating through the code heap from the
> beginning of the heap's address space. Thus, the sweeper assumes
> that the first item in the code heap is a HeapBlock/FreeBlock (with
> the appropriate length and usage information). However, that is not
> the case, as the first item in the heap is actually *before* the
> heap. So the sweeper crashes.
>
> This is a hard-to-reproduce problem (I managed to reproduce it only
> once in 350 iterations, each iteration taking ~25 minutes). So the
> fix I propose is based on core file debugging and source code
> investigation. But I managed to write a regression test that
> triggers a problem similar to the original problem.
>
> I think that problem happens because a CodeBlob allocated in one
> code heap (A) is returned to a different code heap (B). When the
> CodeBlob is returned B, it is added to B's freelist. However, as the
> CodeBlob was allocated in A, the freelist of B now points into A
> (i.e., B is corrupted).
>
> The code cache uses CodeCache::get_code_heap(const CodeBlob* cb) to
> determine to which code heap a 'cb' is supposed to be returned to.
> Since 8171008 (AOT) [1], the check is:
>
> CodeHeap* CodeCache::get_code_heap(const CodeBlob* cb) {
> assert(cb != NULL, "CodeBlob is null");
> FOR_ALL_HEAPS(heap) {
> - if ((*heap)->contains(cb)) {
> + if ((*heap)->contains(cb->code_begin())) {
> return *heap;
> }
> }
>
> The blob 'cb' can be returned to the wrong heap if, for example:
> - 'cb' is at the end code heap A and
> - the size of the code contained in 'cb' is 0 (therefore
> code_begin() returns the address after 'cb', i.e., the first address
> in code heap B).
>
> The fix proposes to restore CodeCache::get_code_heap() to its
> pre-AOT state (as I'm not aware of the reasons why AOT changed that
> check). I also propose to add some guarantees after
> allocation/deallocation in the code heap to possibly easier catch
> this or related problems in the future.
>
> The regression test I propose achieves the above condition and
> results in a crash. The regression test works only with product
> builds, because in a product build a BufferBlob fits into one
> segment whereas in a fastdebug build it does not.
>
> The test needs to set the CodeCacheMinBlockLength flag to 1. The
> flag is currently develop and we would need to make it product for
> the test to work. (Other flags controlling the code cache, e.g.,
> CodeCacheExpansionSize, are also product.) I could also experiment
> with reproducing the problem with different block lengths/segment
> sizes, but that would most likely make the test more fragile (and
> CodeCacheSegmentSize is anyway develop as well).
>
> I tested the fix with JPRT, RBT is in progress.
>
> Thank you!
>
> Best regards,
>
>
> Zoltan
>
> [1] http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/777aaa19c4b1#l116.71
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test

Zoltán Majó
Hi Rickard,


On 02/10/2017 11:34 AM, Rickard Bäckman wrote:
> Sorry for being a bit late here...

thank you for the clarification. Unfortunately, I was not able to
mention you as a reviewer in the changeset, as pushing it has already
succeeded by the time I've received your message.

>
> The reason for the change in the contains() check from cb to
> cb->code_begin() is that with the changes for AOT we no longer require
> the metadata (AOTCompiledMethod *) and code to be in one blob.
>
> AOT does things like:
>
> code-heap-start
> code-method-1
> code-method-2
> code-method-3
> code-method-4
> code-heap-end
>
> and when we link those methods we create AOTCompiledMethods which just
> have pointers to code-method-1.
>
> The AOTCodeHeap then knows the boundaries of the code heap.
> Checking if the heap contains the metadata (AOTCompiledMethod *) will
> fail. While checking if it contains the code will succeed.

That was my understanding as well, I think: The changeset keeps the
inclusion test based on code_begin() for AOT-compiled methods

http://hg.openjdk.java.net/jdk9/hs/hotspot/rev/ed9e29cd84f5#l1.8

and changes the inclusion test only for other code blob types.

http://hg.openjdk.java.net/jdk9/hs/hotspot/rev/ed9e29cd84f5#l5.9

Thank you!

Best regards,


Zoltan

>
> /R
>
> On 02/06, Zoltán Majó wrote:
>> Hi,
>>
>>
>> please review the fix for 8173151.
>>
>> https://bugs.openjdk.java.net/browse/JDK-8173151
>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.00/
>>
>> The crash reported in the bug is caused by the corruption of the
>> 'non-profiled nmethods' code heap. CodeHeap::_freelist for that code
>> heap points to an address one segment before the heap's address
>> space. The sweeper starts iterating through the code heap from the
>> beginning of the heap's address space. Thus, the sweeper assumes
>> that the first item in the code heap is a HeapBlock/FreeBlock (with
>> the appropriate length and usage information). However, that is not
>> the case, as the first item in the heap is actually *before* the
>> heap. So the sweeper crashes.
>>
>> This is a hard-to-reproduce problem (I managed to reproduce it only
>> once in 350 iterations, each iteration taking ~25 minutes). So the
>> fix I propose is based on core file debugging and source code
>> investigation. But I managed to write a regression test that
>> triggers a problem similar to the original problem.
>>
>> I think that problem happens because a CodeBlob allocated in one
>> code heap (A) is returned to a different code heap (B). When the
>> CodeBlob is returned B, it is added to B's freelist. However, as the
>> CodeBlob was allocated in A, the freelist of B now points into A
>> (i.e., B is corrupted).
>>
>> The code cache uses CodeCache::get_code_heap(const CodeBlob* cb) to
>> determine to which code heap a 'cb' is supposed to be returned to.
>> Since 8171008 (AOT) [1], the check is:
>>
>> CodeHeap* CodeCache::get_code_heap(const CodeBlob* cb) {
>> assert(cb != NULL, "CodeBlob is null");
>> FOR_ALL_HEAPS(heap) {
>> - if ((*heap)->contains(cb)) {
>> + if ((*heap)->contains(cb->code_begin())) {
>> return *heap;
>> }
>> }
>>
>> The blob 'cb' can be returned to the wrong heap if, for example:
>> - 'cb' is at the end code heap A and
>> - the size of the code contained in 'cb' is 0 (therefore
>> code_begin() returns the address after 'cb', i.e., the first address
>> in code heap B).
>>
>> The fix proposes to restore CodeCache::get_code_heap() to its
>> pre-AOT state (as I'm not aware of the reasons why AOT changed that
>> check). I also propose to add some guarantees after
>> allocation/deallocation in the code heap to possibly easier catch
>> this or related problems in the future.
>>
>> The regression test I propose achieves the above condition and
>> results in a crash. The regression test works only with product
>> builds, because in a product build a BufferBlob fits into one
>> segment whereas in a fastdebug build it does not.
>>
>> The test needs to set the CodeCacheMinBlockLength flag to 1. The
>> flag is currently develop and we would need to make it product for
>> the test to work. (Other flags controlling the code cache, e.g.,
>> CodeCacheExpansionSize, are also product.) I could also experiment
>> with reproducing the problem with different block lengths/segment
>> sizes, but that would most likely make the test more fragile (and
>> CodeCacheSegmentSize is anyway develop as well).
>>
>> I tested the fix with JPRT, RBT is in progress.
>>
>> Thank you!
>>
>> Best regards,
>>
>>
>> Zoltan
>>
>> [1] http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/777aaa19c4b1#l116.71
>>

Loading...