RFR(XS): 8205195 NestedThreadsListHandleInErrorHandlingTest fails because hs_err doesn't contain _nested_thread_list_max

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

RFR(XS): 8205195 NestedThreadsListHandleInErrorHandlingTest fails because hs_err doesn't contain _nested_thread_list_max

Daniel D. Daugherty
Greetings,

I have a fix for a recent (very rare) Thread-SMR related test failure.

Since the fix is related to the ErrorHandling tests and affects hs_err_pid
file generation, this code review is being sent to both the Runtime and
the Serviceability teams. Please make sure you reply-all to any responses
so we have complete review threads on both aliases.

Bug URL: https://bugs.openjdk.java.net/browse/JDK-8205195

Webrev URL: http://cr.openjdk.java.net/~dcubed/8205195-webrev/0-for-jdk-jdk/

The bug itself contains analysis about the root cause of the bug and
the comment updates to the code describes the no win scenario that the
hs_err_pid file generation code is in. Of course, I also have a comment
where I was able to harden the ErrorHandling tests. I did manage to
resist the urge to mention the "Kobiyashi Maru" [1] in the new comments.

Testing: Mach5 builds-tier1,jdk-tier1,jdk-tier2,hs-tier1,hs-tier2,hs-tier3
          on the usual Oracle platforms.

Thanks, in advance, for any comments, questions or suggestions.

Dan

[1] https://www.urbandictionary.com/define.php?term=Kobayashi%20Maru

Reply | Threaded
Open this post in threaded view
|

Re: RFR(XS): 8205195 NestedThreadsListHandleInErrorHandlingTest fails because hs_err doesn't contain _nested_thread_list_max

Daniel D. Daugherty
Thomas,

Thanks for the quick review.


On 6/21/18 2:53 AM, Thomas Stüfe wrote:
> Hi Daniel,
>
> yes, that is annoying.
>
> I am okay with your fix, if you want to push it in this form.

Thanks.


> But preparing the test crash in this way feels weird since the whole
> point of this exercise is to test error handling in close-to-real
> scenarios... but it sure does not hurt in this case.

Yup. This is definitely weird, but my goal is to reduce testing noise.
I do acknowledge that the real world use of error reporting may run
into this failure mode which will suppress a section of hs_err_pid
output.


> Also, note that at different places we decide differently, see e.g.
> the "printing heap information" STEP - we omit locking Heap_lock in
> VMError::report() and only lock it in VMError::print_vm_info() (where
> we have no secondary signal handling and must not crash). So, in that
> case we are okay with risking a secondary crash in error handling.
> Probably there are just no regression tests for the heap information
> printout whose intermittent fails could annoy us :)

Yup. I recognized that when I wrote the Thread-SMR tests I was making
them picky enough to possible run into failure modes we never would
have detected before.


> My feeling is that I would like to see a solution at the test
> framework side. Maybe, if a test is marked as "may fail rarely and
> thats okay", the test framework could retry the test and only fail if
> the error happens again.

We currently don't have a way of tagging a test like that and I'm
not convinced that I would really want us to do that. However, this
particular bug truly falls into a no win scenario and that's a
different situation than I've encountered before.

Again, thanks for the review.

Dan


>
> Thanks, Thomas
>
>
>
> On Thu, Jun 21, 2018 at 2:18 AM, Daniel D. Daugherty
> <[hidden email]> wrote:
>> Greetings,
>>
>> I have a fix for a recent (very rare) Thread-SMR related test failure.
>>
>> Since the fix is related to the ErrorHandling tests and affects hs_err_pid
>> file generation, this code review is being sent to both the Runtime and
>> the Serviceability teams. Please make sure you reply-all to any responses
>> so we have complete review threads on both aliases.
>>
>> Bug URL: https://bugs.openjdk.java.net/browse/JDK-8205195
>>
>> Webrev URL: http://cr.openjdk.java.net/~dcubed/8205195-webrev/0-for-jdk-jdk/
>>
>> The bug itself contains analysis about the root cause of the bug and
>> the comment updates to the code describes the no win scenario that the
>> hs_err_pid file generation code is in. Of course, I also have a comment
>> where I was able to harden the ErrorHandling tests. I did manage to
>> resist the urge to mention the "Kobiyashi Maru" [1] in the new comments.
>>
>> Testing: Mach5 builds-tier1,jdk-tier1,jdk-tier2,hs-tier1,hs-tier2,hs-tier3
>>           on the usual Oracle platforms.
>>
>> Thanks, in advance, for any comments, questions or suggestions.
>>
>> Dan
>>
>> [1] https://www.urbandictionary.com/define.php?term=Kobayashi%20Maru
>>

Reply | Threaded
Open this post in threaded view
|

Re: RFR(XS): 8205195 NestedThreadsListHandleInErrorHandlingTest fails because hs_err doesn't contain _nested_thread_list_max

Daniel D. Daugherty
May I please have a second reviewer on this one...

Dan


On 6/21/18 8:57 AM, Daniel D. Daugherty wrote:

> Thomas,
>
> Thanks for the quick review.
>
>
> On 6/21/18 2:53 AM, Thomas Stüfe wrote:
>> Hi Daniel,
>>
>> yes, that is annoying.
>>
>> I am okay with your fix, if you want to push it in this form.
>
> Thanks.
>
>
>> But preparing the test crash in this way feels weird since the whole
>> point of this exercise is to test error handling in close-to-real
>> scenarios... but it sure does not hurt in this case.
>
> Yup. This is definitely weird, but my goal is to reduce testing noise.
> I do acknowledge that the real world use of error reporting may run
> into this failure mode which will suppress a section of hs_err_pid
> output.
>
>
>> Also, note that at different places we decide differently, see e.g.
>> the "printing heap information" STEP - we omit locking Heap_lock in
>> VMError::report() and only lock it in VMError::print_vm_info() (where
>> we have no secondary signal handling and must not crash). So, in that
>> case we are okay with risking a secondary crash in error handling.
>> Probably there are just no regression tests for the heap information
>> printout whose intermittent fails could annoy us :)
>
> Yup. I recognized that when I wrote the Thread-SMR tests I was making
> them picky enough to possible run into failure modes we never would
> have detected before.
>
>
>> My feeling is that I would like to see a solution at the test
>> framework side. Maybe, if a test is marked as "may fail rarely and
>> thats okay", the test framework could retry the test and only fail if
>> the error happens again.
>
> We currently don't have a way of tagging a test like that and I'm
> not convinced that I would really want us to do that. However, this
> particular bug truly falls into a no win scenario and that's a
> different situation than I've encountered before.
>
> Again, thanks for the review.
>
> Dan
>
>
>>
>> Thanks, Thomas
>>
>>
>>
>> On Thu, Jun 21, 2018 at 2:18 AM, Daniel D. Daugherty
>> <[hidden email]> wrote:
>>> Greetings,
>>>
>>> I have a fix for a recent (very rare) Thread-SMR related test failure.
>>>
>>> Since the fix is related to the ErrorHandling tests and affects
>>> hs_err_pid
>>> file generation, this code review is being sent to both the Runtime and
>>> the Serviceability teams. Please make sure you reply-all to any
>>> responses
>>> so we have complete review threads on both aliases.
>>>
>>> Bug URL: https://bugs.openjdk.java.net/browse/JDK-8205195
>>>
>>> Webrev URL:
>>> http://cr.openjdk.java.net/~dcubed/8205195-webrev/0-for-jdk-jdk/
>>>
>>> The bug itself contains analysis about the root cause of the bug and
>>> the comment updates to the code describes the no win scenario that the
>>> hs_err_pid file generation code is in. Of course, I also have a comment
>>> where I was able to harden the ErrorHandling tests. I did manage to
>>> resist the urge to mention the "Kobiyashi Maru" [1] in the new
>>> comments.
>>>
>>> Testing: Mach5
>>> builds-tier1,jdk-tier1,jdk-tier2,hs-tier1,hs-tier2,hs-tier3
>>>           on the usual Oracle platforms.
>>>
>>> Thanks, in advance, for any comments, questions or suggestions.
>>>
>>> Dan
>>>
>>> [1] https://www.urbandictionary.com/define.php?term=Kobayashi%20Maru
>>>
>
>

Reply | Threaded
Open this post in threaded view
|

Re: RFR(XS): 8205195 NestedThreadsListHandleInErrorHandlingTest fails because hs_err doesn't contain _nested_thread_list_max

serguei.spitsyn@oracle.com
In reply to this post by Daniel D. Daugherty
Hi Dan,

The fix looks Okay to me.


http://cr.openjdk.java.net/~dcubed/8205195-webrev/0-for-jdk-jdk/src/hotspot/share/runtime/threadSMR.cpp.udiff.html

A typo in comment - "during" occurred twice:
  + // Note: Not grabbing the Threads_lock during during error reporting


Thanks,
Serguei


On 6/20/18 17:18, Daniel D. Daugherty wrote:
Greetings,

I have a fix for a recent (very rare) Thread-SMR related test failure.

Since the fix is related to the ErrorHandling tests and affects hs_err_pid
file generation, this code review is being sent to both the Runtime and
the Serviceability teams. Please make sure you reply-all to any responses
so we have complete review threads on both aliases.

Bug URL: https://bugs.openjdk.java.net/browse/JDK-8205195

Webrev URL: http://cr.openjdk.java.net/~dcubed/8205195-webrev/0-for-jdk-jdk/

The bug itself contains analysis about the root cause of the bug and
the comment updates to the code describes the no win scenario that the
hs_err_pid file generation code is in. Of course, I also have a comment
where I was able to harden the ErrorHandling tests. I did manage to
resist the urge to mention the "Kobiyashi Maru" [1] in the new comments.

Testing: Mach5 builds-tier1,jdk-tier1,jdk-tier2,hs-tier1,hs-tier2,hs-tier3
         on the usual Oracle platforms.

Thanks, in advance, for any comments, questions or suggestions.

Dan

[1] https://www.urbandictionary.com/define.php?term=Kobayashi%20Maru


Reply | Threaded
Open this post in threaded view
|

Re: RFR(XS): 8205195 NestedThreadsListHandleInErrorHandlingTest fails because hs_err doesn't contain _nested_thread_list_max

Daniel D. Daugherty
On 6/21/18 8:21 PM, [hidden email] wrote:
Hi Dan,

The fix looks Okay to me.
 
Thanks for the review!


http://cr.openjdk.java.net/~dcubed/8205195-webrev/0-for-jdk-jdk/src/hotspot/share/runtime/threadSMR.cpp.udiff.html

A typo in comment - "during" occurred twice:
  + // Note: Not grabbing the Threads_lock during during error reporting
 
Nice catch. Will fix that before I push (and reformat the comment).
I read that comment at least three times before I sent out the
review... sigh...

Dan




Thanks,
Serguei


On 6/20/18 17:18, Daniel D. Daugherty wrote:
Greetings,

I have a fix for a recent (very rare) Thread-SMR related test failure.

Since the fix is related to the ErrorHandling tests and affects hs_err_pid
file generation, this code review is being sent to both the Runtime and
the Serviceability teams. Please make sure you reply-all to any responses
so we have complete review threads on both aliases.

Bug URL: https://bugs.openjdk.java.net/browse/JDK-8205195

Webrev URL: http://cr.openjdk.java.net/~dcubed/8205195-webrev/0-for-jdk-jdk/

The bug itself contains analysis about the root cause of the bug and
the comment updates to the code describes the no win scenario that the
hs_err_pid file generation code is in. Of course, I also have a comment
where I was able to harden the ErrorHandling tests. I did manage to
resist the urge to mention the "Kobiyashi Maru" [1] in the new comments.

Testing: Mach5 builds-tier1,jdk-tier1,jdk-tier2,hs-tier1,hs-tier2,hs-tier3
         on the usual Oracle platforms.

Thanks, in advance, for any comments, questions or suggestions.

Dan

[1] https://www.urbandictionary.com/define.php?term=Kobayashi%20Maru



Reply | Threaded
Open this post in threaded view
|

Re: RFR(XS): 8205195 NestedThreadsListHandleInErrorHandlingTest fails because hs_err doesn't contain _nested_thread_list_max

Daniel D. Daugherty
In reply to this post by Daniel D. Daugherty
On 6/25/18 1:57 AM, David Holmes wrote:

> Hi Dan,
>
> On 21/06/2018 10:18 AM, Daniel D. Daugherty wrote:
>> Greetings,
>>
>> I have a fix for a recent (very rare) Thread-SMR related test failure.
>>
>> Since the fix is related to the ErrorHandling tests and affects
>> hs_err_pid
>> file generation, this code review is being sent to both the Runtime and
>> the Serviceability teams. Please make sure you reply-all to any
>> responses
>> so we have complete review threads on both aliases.
>>
>> Bug URL: https://bugs.openjdk.java.net/browse/JDK-8205195
>>
>> Webrev URL:
>> http://cr.openjdk.java.net/~dcubed/8205195-webrev/0-for-jdk-jdk/
>>
>> The bug itself contains analysis about the root cause of the bug and
>> the comment updates to the code describes the no win scenario that the
>> hs_err_pid file generation code is in. Of course, I also have a comment
>> where I was able to harden the ErrorHandling tests. I did manage to
>> resist the urge to mention the "Kobiyashi Maru" [1] in the new comments.
>>
>> Testing: Mach5
>> builds-tier1,jdk-tier1,jdk-tier2,hs-tier1,hs-tier2,hs-tier3
>>           on the usual Oracle platforms.
>>
>> Thanks, in advance, for any comments, questions or suggestions.
>
> I don't quite follow the fix.

I think this comment explains what the fix is _trying_ to do:

1702   // We grab Threads_lock to keep ThreadsSMRSupport::print_info_on()
1703   // from racing with Threads::add() or Threads::remove() as we
1704   // generate the hs_err_pid file. This makes our ErrorHandling tests
1705   // more stable.


> Won't you self-deadlock on acquiring the Threads_lock in the secondary
> error handler test, due to the recursive call to controlled_crash ?

I missed that possibility, but now I'm puzzled why my testing didn't
reveal this situation. We have a test for secondary error handling
and it should have self-deadlocked (and failed). I'll investigate.

Dan


>
> David
>
>> Dan
>>
>> [1] https://www.urbandictionary.com/define.php?term=Kobayashi%20Maru
>>

Reply | Threaded
Open this post in threaded view
|

Re: RFR(XS): 8205195 NestedThreadsListHandleInErrorHandlingTest fails because hs_err doesn't contain _nested_thread_list_max

Daniel D. Daugherty
David,

Thanks for your eagle eyes! More below...


On 6/25/18 9:27 AM, Daniel D. Daugherty wrote:

> On 6/25/18 1:57 AM, David Holmes wrote:
>> Hi Dan,
>>
>> On 21/06/2018 10:18 AM, Daniel D. Daugherty wrote:
>>> Greetings,
>>>
>>> I have a fix for a recent (very rare) Thread-SMR related test failure.
>>>
>>> Since the fix is related to the ErrorHandling tests and affects
>>> hs_err_pid
>>> file generation, this code review is being sent to both the Runtime and
>>> the Serviceability teams. Please make sure you reply-all to any
>>> responses
>>> so we have complete review threads on both aliases.
>>>
>>> Bug URL: https://bugs.openjdk.java.net/browse/JDK-8205195
>>>
>>> Webrev URL:
>>> http://cr.openjdk.java.net/~dcubed/8205195-webrev/0-for-jdk-jdk/
>>>
>>> The bug itself contains analysis about the root cause of the bug and
>>> the comment updates to the code describes the no win scenario that the
>>> hs_err_pid file generation code is in. Of course, I also have a comment
>>> where I was able to harden the ErrorHandling tests. I did manage to
>>> resist the urge to mention the "Kobiyashi Maru" [1] in the new
>>> comments.
>>>
>>> Testing: Mach5
>>> builds-tier1,jdk-tier1,jdk-tier2,hs-tier1,hs-tier2,hs-tier3
>>>           on the usual Oracle platforms.
>>>
>>> Thanks, in advance, for any comments, questions or suggestions.
>>
>> I don't quite follow the fix.
>
> I think this comment explains what the fix is _trying_ to do:
>
> 1702   // We grab Threads_lock to keep ThreadsSMRSupport::print_info_on()
> 1703   // from racing with Threads::add() or Threads::remove() as we
> 1704   // generate the hs_err_pid file. This makes our ErrorHandling
> tests
> 1705   // more stable.
>
>
>> Won't you self-deadlock on acquiring the Threads_lock in the
>> secondary error handler test, due to the recursive call to
>> controlled_crash ?
>
> I missed that possibility, but now I'm puzzled why my testing didn't
> reveal this situation. We have a test for secondary error handling
> and it should have self-deadlocked (and failed). I'll investigate.

The ErrorHandling tests don't run with 'release' bits so no self-deadlock.
The 'fastdebug' and 'slowdebug' versions will fail an assertion that gets
hidden by the fact that it is a secondary failure mode.

I've filed the following bug:

JDK-8205648 fix for 8205195 breaks secondary error handling
https://bugs.openjdk.java.net/browse/JDK-8205648

Dan


>
> Dan
>
>
>>
>> David
>>
>>> Dan
>>>
>>> [1] https://www.urbandictionary.com/define.php?term=Kobayashi%20Maru
>>>
>
>

Reply | Threaded
Open this post in threaded view
|

Re: RFR(XS): 8205195 NestedThreadsListHandleInErrorHandlingTest fails because hs_err doesn't contain _nested_thread_list_max

David Holmes
On 26/06/2018 11:32 AM, Daniel D. Daugherty wrote:

> David,
>
> Thanks for your eagle eyes! More below...
>
>
> On 6/25/18 9:27 AM, Daniel D. Daugherty wrote:
>> On 6/25/18 1:57 AM, David Holmes wrote:
>>> Hi Dan,
>>>
>>> On 21/06/2018 10:18 AM, Daniel D. Daugherty wrote:
>>>> Greetings,
>>>>
>>>> I have a fix for a recent (very rare) Thread-SMR related test failure.
>>>>
>>>> Since the fix is related to the ErrorHandling tests and affects
>>>> hs_err_pid
>>>> file generation, this code review is being sent to both the Runtime and
>>>> the Serviceability teams. Please make sure you reply-all to any
>>>> responses
>>>> so we have complete review threads on both aliases.
>>>>
>>>> Bug URL: https://bugs.openjdk.java.net/browse/JDK-8205195
>>>>
>>>> Webrev URL:
>>>> http://cr.openjdk.java.net/~dcubed/8205195-webrev/0-for-jdk-jdk/
>>>>
>>>> The bug itself contains analysis about the root cause of the bug and
>>>> the comment updates to the code describes the no win scenario that the
>>>> hs_err_pid file generation code is in. Of course, I also have a comment
>>>> where I was able to harden the ErrorHandling tests. I did manage to
>>>> resist the urge to mention the "Kobiyashi Maru" [1] in the new
>>>> comments.
>>>>
>>>> Testing: Mach5
>>>> builds-tier1,jdk-tier1,jdk-tier2,hs-tier1,hs-tier2,hs-tier3
>>>>           on the usual Oracle platforms.
>>>>
>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>
>>> I don't quite follow the fix.
>>
>> I think this comment explains what the fix is _trying_ to do:
>>
>> 1702   // We grab Threads_lock to keep ThreadsSMRSupport::print_info_on()
>> 1703   // from racing with Threads::add() or Threads::remove() as we
>> 1704   // generate the hs_err_pid file. This makes our ErrorHandling
>> tests
>> 1705   // more stable.
>>
>>
>>> Won't you self-deadlock on acquiring the Threads_lock in the
>>> secondary error handler test, due to the recursive call to
>>> controlled_crash ?
>>
>> I missed that possibility, but now I'm puzzled why my testing didn't
>> reveal this situation. We have a test for secondary error handling
>> and it should have self-deadlocked (and failed). I'll investigate.
>
> The ErrorHandling tests don't run with 'release' bits so no self-deadlock.
> The 'fastdebug' and 'slowdebug' versions will fail an assertion that gets
> hidden by the fact that it is a secondary failure mode.

Ah - of course.

> I've filed the following bug:
>
> JDK-8205648 fix for 8205195 breaks secondary error handling
> https://bugs.openjdk.java.net/browse/JDK-8205648

Thanks,
David

> Dan
>
>
>>
>> Dan
>>
>>
>>>
>>> David
>>>
>>>> Dan
>>>>
>>>> [1] https://www.urbandictionary.com/define.php?term=Kobayashi%20Maru
>>>>
>>
>>
>