RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

classic Classic list List threaded Threaded
15 messages Options
Reply | Threaded
Open this post in threaded view
|

RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Thomas Stüfe-2
Dear all,

please review (or re-review) this enhancement. It was already reviewed
extensively for JDK9 in Oct 16, but pushed back to JDK10. Now that JDK10
repo is open, I would like to close this issue.

In short, this enhancement lets the hotspot cancel hanging error reporting
steps (when writing the hs-err file), in order to allow subsequent error
reporting steps to run. In our JVM variant, this has been active for a long
time and has proven to make error reporting more stable in the face of
deadlocks or file system stalls.

Issue:
https://bugs.openjdk.java.net/browse/JDK-8166944

Webrev:
http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/

Old Diskussion for JDK9:
http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021582.html

The webrev is unchanged from version 2 discussed in october, just got
rebased atop of JDK10/hs.

Thank you,

Kind Regards, Thomas
Reply | Threaded
Open this post in threaded view
|

Re: RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Chris Plummer
Hi Thomas,

Mostly looks good. A few minor comments:

Most files need 2017 copyright updates.

set_to_now() and get_timestamp() should have names that tie them
together better. Maybe set_step_start_time() and get_step_start_time().
Also, they should probably be methods of VMError and the statics should
all be fields.

In the first review I had mentioned that you only are allowing for one
step to hang, and still be able to continue with subsequent steps:

1509     // A step times out after half of the total timeout. Steps are
mostly fast unless they
1510     // hang for some reason, so this simple rule allows for one
hanging step and still
1511     // leaves time enough for the rest of the steps to finish.
1512     const jlong end = step_start_time_l + ErrorLogTimeout * 1000 / 2;

You mentioned that for your SAP version you divide by 4, allowing for up
to 3 steps to hang. Is there a reason not to do this for the open
version? In fact it looks like TestUnresponsiveErrorHandler is relying
on that being the case, since it will hang 3 times. Related to this,
TimeoutInErrorHandlingTest.java is only checking for one instance of
step timeout pattern. Probably it should check for 3 instances, and also
one instance of the overall timeout pattern:

1272         st->print_cr("------ Timout during error reporting after "
INT64_FORMAT "ms. ------",

thanks,

Chris

On 2/2/17 3:59 AM, Thomas Stüfe wrote:

> Dear all,
>
> please review (or re-review) this enhancement. It was already reviewed
> extensively for JDK9 in Oct 16, but pushed back to JDK10. Now that
> JDK10 repo is open, I would like to close this issue.
>
> In short, this enhancement lets the hotspot cancel hanging error
> reporting steps (when writing the hs-err file), in order to allow
> subsequent error reporting steps to run. In our JVM variant, this has
> been active for a long time and has proven to make error reporting
> more stable in the face of deadlocks or file system stalls.
>
> Issue:
> https://bugs.openjdk.java.net/browse/JDK-8166944
>
> Webrev:
> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/ 
> <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/>
>
> Old Diskussion for JDK9:
> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021582.html
>
> The webrev is unchanged from version 2 discussed in october, just got
> rebased atop of JDK10/hs.
>
> Thank you,
>
> Kind Regards, Thomas



Reply | Threaded
Open this post in threaded view
|

Re: RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

David Holmes
In reply to this post by Thomas Stüfe-2
Hi Thomas,

Thanks for picking this back up. Overall this looks okay. I have a few
specific comments/suggestions below to add to Chris's.

David
-----

src/os/windows/vm/vmError_windows.cpp

Typo: canceleation

---

src/share/vm/utilities/vmError.cpp

As previously discussed Atomic::load/store for jlong do not depend on
supports_cx8. The commentary in atomic.hpp should really refer to
'atomic read-modify-write operations on jlongs' - as every platform must
support atomic 64-bit load/store to support Java volatile long/double.
So set_to_now and get_timestamp can be simplified, or even removed
completely.

Does the timestamp really need to be the system time in millis or do you
just want to record relative elapsed time? If the latter then I suggest
using javaTimeNanos()/100000 to avoid potential glitches with the system
time changing.

382   // TestUnresponsiveErrorHandler: three times to trigger first a
step timeout, then
383   // the global error reporting timeout.

I'm a little confused as to why three steps are needed to test two things??

1217     reporting_started();
1218     set_to_now(&reporting_start_time);

I think the setting of reporting_start_time should be internalized in
reporting_started.

Nit: Lets -> Let's  (it is a contraction of 'let us')

1272         st->print_cr("------ Timout during error reporting after "
INT64_FORMAT "ms. ------",

Typo: timout

Also I gather we are potentially racing with the WatcherThread here, so
no guarantee these final print statements will be seen?

1502       VMError::interrupt_reporting_thread();

Nit: you're inside a VMError function so shouldn't need the VMError prefix.

---

src/share/vm/utilities/vmError.hpp

Typo: dependend

+   // Called by WatcherThread to check if the currently running error
reporting did timeout.
+   //  Returns true if reporting did hit the global ErrorLogTimeout.

This doesn't quite read right to me. How about:

// Called by the WatcherThread to check if error reporting has timed-out.
// Returns true if error reporting has not completed within the
ErrorLogTimeout limit

---

test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java

I think using @requires and runtime Platform checks is redundant. I
prefer @requires for both debug build testing and not-windows.

On 2/02/2017 9:59 PM, Thomas Stüfe wrote:

> Dear all,
>
> please review (or re-review) this enhancement. It was already reviewed
> extensively for JDK9 in Oct 16, but pushed back to JDK10. Now that JDK10
> repo is open, I would like to close this issue.
>
> In short, this enhancement lets the hotspot cancel hanging error
> reporting steps (when writing the hs-err file), in order to allow
> subsequent error reporting steps to run. In our JVM variant, this has
> been active for a long time and has proven to make error reporting more
> stable in the face of deadlocks or file system stalls.
>
> Issue:
> https://bugs.openjdk.java.net/browse/JDK-8166944
>
> Webrev:
> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/
>
> Old Diskussion for JDK9:
> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021582.html
>
> The webrev is unchanged from version 2 discussed in october, just got
> rebased atop of JDK10/hs.
>
> Thank you,
>
> Kind Regards, Thomas
Reply | Threaded
Open this post in threaded view
|

Re: RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Thomas Stüfe-2
In reply to this post by Chris Plummer
Hi Chris,

Thanks again for your review, and sorry for forgetting parts of your input.
Will prepare a new webrev and incorporate yours and Davids input.

Kind Regards, Thomas

On Fri, Feb 3, 2017 at 12:11 AM, Chris Plummer <[hidden email]>
wrote:

> Hi Thomas,
>
> Mostly looks good. A few minor comments:
>
> Most files need 2017 copyright updates.
>
> set_to_now() and get_timestamp() should have names that tie them together
> better. Maybe set_step_start_time() and get_step_start_time(). Also, they
> should probably be methods of VMError and the statics should all be fields.
>
> In the first review I had mentioned that you only are allowing for one
> step to hang, and still be able to continue with subsequent steps:
>
> 1509     // A step times out after half of the total timeout. Steps are
> mostly fast unless they
> 1510     // hang for some reason, so this simple rule allows for one
> hanging step and still
> 1511     // leaves time enough for the rest of the steps to finish.
> 1512     const jlong end = step_start_time_l + ErrorLogTimeout * 1000 / 2;
>
> You mentioned that for your SAP version you divide by 4, allowing for up
> to 3 steps to hang. Is there a reason not to do this for the open version?
> In fact it looks like TestUnresponsiveErrorHandler is relying on that being
> the case, since it will hang 3 times. Related to this,
> TimeoutInErrorHandlingTest.java is only checking for one instance of step
> timeout pattern. Probably it should check for 3 instances, and also one
> instance of the overall timeout pattern:
>
> 1272         st->print_cr("------ Timout during error reporting after "
> INT64_FORMAT "ms. ------",
>
> thanks,
>
> Chris
>
> On 2/2/17 3:59 AM, Thomas Stüfe wrote:
>
>> Dear all,
>>
>> please review (or re-review) this enhancement. It was already reviewed
>> extensively for JDK9 in Oct 16, but pushed back to JDK10. Now that JDK10
>> repo is open, I would like to close this issue.
>>
>> In short, this enhancement lets the hotspot cancel hanging error
>> reporting steps (when writing the hs-err file), in order to allow
>> subsequent error reporting steps to run. In our JVM variant, this has been
>> active for a long time and has proven to make error reporting more stable
>> in the face of deadlocks or file system stalls.
>>
>> Issue:
>> https://bugs.openjdk.java.net/browse/JDK-8166944
>>
>> Webrev:
>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>> Error-Reporting/jdk10-webrev.02/webrev/ <<a href="http://cr.openjdk.java.net/%7">http://cr.openjdk.java.net/%7
>> Estuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/>
>>
>> Old Diskussion for JDK9:
>> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2
>> 016-October/021582.html
>>
>> The webrev is unchanged from version 2 discussed in october, just got
>> rebased atop of JDK10/hs.
>>
>> Thank you,
>>
>> Kind Regards, Thomas
>>
>
>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Thomas Stüfe-2
In reply to this post by David Holmes
Hi David,

thanks for your review! I will prepare a new webrev. Please find some more
remarks inline.

On Fri, Feb 3, 2017 at 3:01 AM, David Holmes <[hidden email]>
wrote:

> Hi Thomas,
>
> Thanks for picking this back up. Overall this looks okay. I have a few
> specific comments/suggestions below to add to Chris's.
>
> David
> -----
>
> src/os/windows/vm/vmError_windows.cpp
>
> Typo: canceleation
>
> ---
>
> src/share/vm/utilities/vmError.cpp
>
> As previously discussed Atomic::load/store for jlong do not depend on
> supports_cx8. The commentary in atomic.hpp should really refer to 'atomic
> read-modify-write operations on jlongs' - as every platform must support
> atomic 64-bit load/store to support Java volatile long/double. So
> set_to_now and get_timestamp can be simplified, or even removed completely.
>
> Does the timestamp really need to be the system time in millis or do you
> just want to record relative elapsed time? If the latter then I suggest
> using javaTimeNanos()/100000 to avoid potential glitches with the system
> time changing.
>

Okay, I will do this. I would have liked to avoid using high resolution
timers because of potential initialization problems (on some platforms the
underlying system APIs need to be dlsymed first) but after looking at the
implementations, I think this will be probably okay.


>
> 382   // TestUnresponsiveErrorHandler: three times to trigger first a step
> timeout, then
> 383   // the global error reporting timeout.
>
> I'm a little confused as to why three steps are needed to test two things??
>
>
TestUnresponsiveErrorHandler should trigger the step timeouts, then the
final global error reporting timeout. In that version, a step timeout was
1/2 of the global timeout, so in order to be safe I need to hang three
times, two times to trigger the step timeouts, a third time to trigger the
global timeout. If we lower the step timeout to 1/4 (like we did at SAP and
which Chris found reasonable), we will even need to execute the hanging
step 5 times. But I will wrap this into a small macro to make it easier on
the eye.


> 1217     reporting_started();
> 1218     set_to_now(&reporting_start_time);
>
> I think the setting of reporting_start_time should be internalized in
> reporting_started.
>
> Nit: Lets -> Let's  (it is a contraction of 'let us')
>
> 1272         st->print_cr("------ Timout during error reporting after "
> INT64_FORMAT "ms. ------",
>
> Typo: timout
>
> Also I gather we are potentially racing with the WatcherThread here, so no
> guarantee these final print statements will be seen?
>
> 1502       VMError::interrupt_reporting_thread();
>
> Nit: you're inside a VMError function so shouldn't need the VMError prefix.
>
> ---
>
> src/share/vm/utilities/vmError.hpp
>
> Typo: dependend
>
> +   // Called by WatcherThread to check if the currently running error
> reporting did timeout.
> +   //  Returns true if reporting did hit the global ErrorLogTimeout.
>
> This doesn't quite read right to me. How about:
>
> // Called by the WatcherThread to check if error reporting has timed-out.
> // Returns true if error reporting has not completed within the
> ErrorLogTimeout limit
>
> ---
>
> test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>
> I think using @requires and runtime Platform checks is redundant. I prefer
> @requires for both debug build testing and not-windows.
>
>
All suggestions make sense - as do Chris' suggestions - and I will
incorporate them into the new webrev. May take some days though.

Kind Regards, Thomas


>
> On 2/02/2017 9:59 PM, Thomas Stüfe wrote:
>
>> Dear all,
>>
>> please review (or re-review) this enhancement. It was already reviewed
>> extensively for JDK9 in Oct 16, but pushed back to JDK10. Now that JDK10
>> repo is open, I would like to close this issue.
>>
>> In short, this enhancement lets the hotspot cancel hanging error
>> reporting steps (when writing the hs-err file), in order to allow
>> subsequent error reporting steps to run. In our JVM variant, this has
>> been active for a long time and has proven to make error reporting more
>> stable in the face of deadlocks or file system stalls.
>>
>> Issue:
>> https://bugs.openjdk.java.net/browse/JDK-8166944
>>
>> Webrev:
>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>> Error-Reporting/jdk10-webrev.02/webrev/
>>
>> Old Diskussion for JDK9:
>> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2
>> 016-October/021582.html
>>
>> The webrev is unchanged from version 2 discussed in october, just got
>> rebased atop of JDK10/hs.
>>
>> Thank you,
>>
>> Kind Regards, Thomas
>>
>
Reply | Threaded
Open this post in threaded view
|

Re: RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Thomas Stüfe-2
Hi Chris,

On Fri, Feb 3, 2017 at 6:56 PM, Chris Plummer <[hidden email]>
wrote:

> On 2/3/17 4:52 AM, Thomas Stüfe wrote:
>
> Hi David,
>
> thanks for your review! I will prepare a new webrev. Please find some more
> remarks inline.
>
> On Fri, Feb 3, 2017 at 3:01 AM, David Holmes <[hidden email]>
> wrote:
>
>> Hi Thomas,
>>
>> Thanks for picking this back up. Overall this looks okay. I have a few
>> specific comments/suggestions below to add to Chris's.
>>
>> David
>> -----
>>
>> src/os/windows/vm/vmError_windows.cpp
>>
>> Typo: canceleation
>>
>> ---
>>
>> src/share/vm/utilities/vmError.cpp
>>
>> As previously discussed Atomic::load/store for jlong do not depend on
>> supports_cx8. The commentary in atomic.hpp should really refer to 'atomic
>> read-modify-write operations on jlongs' - as every platform must support
>> atomic 64-bit load/store to support Java volatile long/double. So
>> set_to_now and get_timestamp can be simplified, or even removed completely.
>>
>> Does the timestamp really need to be the system time in millis or do you
>> just want to record relative elapsed time? If the latter then I suggest
>> using javaTimeNanos()/100000 to avoid potential glitches with the system
>> time changing.
>>
>
> Okay, I will do this. I would have liked to avoid using high resolution
> timers because of potential initialization problems (on some platforms the
> underlying system APIs need to be dlsymed first) but after looking at the
> implementations, I think this will be probably okay.
>
>
>>
>> 382   // TestUnresponsiveErrorHandler: three times to trigger first a
>> step timeout, then
>> 383   // the global error reporting timeout.
>>
>> I'm a little confused as to why three steps are needed to test two
>> things??
>>
>>
> TestUnresponsiveErrorHandler should trigger the step timeouts, then the
> final global error reporting timeout. In that version, a step timeout was
> 1/2 of the global timeout, so in order to be safe I need to hang three
> times, two times to trigger the step timeouts, a third time to trigger the
> global timeout. If we lower the step timeout to 1/4 (like we did at SAP and
> which Chris found reasonable), we will even need to execute the hanging
> step 5 times. But I will wrap this into a small macro to make it easier on
> the eye.
>
> I think with the step timeout being 1/2 the global timeout, you would only
> need 2 steps to timeout in order to hit the global timeout because
> presumably you'll have spent at least a tiny amount of time in other steps
> already.
>
>
Yes, you are of course right.

Thomas


> Chris
>
>
>
>> 1217     reporting_started();
>> 1218     set_to_now(&reporting_start_time);
>>
>> I think the setting of reporting_start_time should be internalized in
>> reporting_started.
>>
>> Nit: Lets -> Let's  (it is a contraction of 'let us')
>>
>> 1272         st->print_cr("------ Timout during error reporting after "
>> INT64_FORMAT "ms. ------",
>>
>> Typo: timout
>>
>> Also I gather we are potentially racing with the WatcherThread here, so
>> no guarantee these final print statements will be seen?
>>
>> 1502       VMError::interrupt_reporting_thread();
>>
>> Nit: you're inside a VMError function so shouldn't need the VMError
>> prefix.
>>
>> ---
>>
>> src/share/vm/utilities/vmError.hpp
>>
>> Typo: dependend
>>
>> +   // Called by WatcherThread to check if the currently running error
>> reporting did timeout.
>> +   //  Returns true if reporting did hit the global ErrorLogTimeout.
>>
>> This doesn't quite read right to me. How about:
>>
>> // Called by the WatcherThread to check if error reporting has timed-out.
>> // Returns true if error reporting has not completed within the
>> ErrorLogTimeout limit
>>
>> ---
>>
>> test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>>
>> I think using @requires and runtime Platform checks is redundant. I
>> prefer @requires for both debug build testing and not-windows.
>>
>>
> All suggestions make sense - as do Chris' suggestions - and I will
> incorporate them into the new webrev. May take some days though.
>
> Kind Regards, Thomas
>
>
>>
>> On 2/02/2017 9:59 PM, Thomas Stüfe wrote:
>>
>>> Dear all,
>>>
>>> please review (or re-review) this enhancement. It was already reviewed
>>> extensively for JDK9 in Oct 16, but pushed back to JDK10. Now that JDK10
>>> repo is open, I would like to close this issue.
>>>
>>> In short, this enhancement lets the hotspot cancel hanging error
>>> reporting steps (when writing the hs-err file), in order to allow
>>> subsequent error reporting steps to run. In our JVM variant, this has
>>> been active for a long time and has proven to make error reporting more
>>> stable in the face of deadlocks or file system stalls.
>>>
>>> Issue:
>>> https://bugs.openjdk.java.net/browse/JDK-8166944
>>>
>>> Webrev:
>>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-E
>>> rror-Reporting/jdk10-webrev.02/webrev/
>>>
>>> Old Diskussion for JDK9:
>>> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2
>>> 016-October/021582.html
>>>
>>> The webrev is unchanged from version 2 discussed in october, just got
>>> rebased atop of JDK10/hs.
>>>
>>> Thank you,
>>>
>>> Kind Regards, Thomas
>>>
>>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

David Holmes
On 4/02/2017 6:50 PM, Thomas Stüfe wrote:

> Hi Chris,
>
> On Fri, Feb 3, 2017 at 6:56 PM, Chris Plummer <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     On 2/3/17 4:52 AM, Thomas Stüfe wrote:
>>     Hi David,
>>
>>     thanks for your review! I will prepare a new webrev. Please find
>>     some more remarks inline.
>>
>>     On Fri, Feb 3, 2017 at 3:01 AM, David Holmes
>>     <[hidden email] <mailto:[hidden email]>> wrote:
>>
>>         Hi Thomas,
>>
>>         Thanks for picking this back up. Overall this looks okay. I
>>         have a few specific comments/suggestions below to add to Chris's.
>>
>>         David
>>         -----
>>
>>         src/os/windows/vm/vmError_windows.cpp
>>
>>         Typo: canceleation
>>
>>         ---
>>
>>         src/share/vm/utilities/vmError.cpp
>>
>>         As previously discussed Atomic::load/store for jlong do not
>>         depend on supports_cx8. The commentary in atomic.hpp should
>>         really refer to 'atomic read-modify-write operations on
>>         jlongs' - as every platform must support atomic 64-bit
>>         load/store to support Java volatile long/double. So set_to_now
>>         and get_timestamp can be simplified, or even removed completely.
>>
>>         Does the timestamp really need to be the system time in millis
>>         or do you just want to record relative elapsed time? If the
>>         latter then I suggest using javaTimeNanos()/100000 to avoid
>>         potential glitches with the system time changing.
>>
>>
>>     Okay, I will do this. I would have liked to avoid using high
>>     resolution timers because of potential initialization problems (on
>>     some platforms the underlying system APIs need to be dlsymed
>>     first) but after looking at the implementations, I think this will
>>     be probably okay.
>>
>>
>>
>>         382   // TestUnresponsiveErrorHandler: three times to trigger
>>         first a step timeout, then
>>         383   // the global error reporting timeout.
>>
>>         I'm a little confused as to why three steps are needed to test
>>         two things??
>>
>>
>>     TestUnresponsiveErrorHandler should trigger the step timeouts,
>>     then the final global error reporting timeout. In that version, a
>>     step timeout was 1/2 of the global timeout, so in order to be safe
>>     I need to hang three times, two times to trigger the step
>>     timeouts, a third time to trigger the global timeout. If we lower
>>     the step timeout to 1/4 (like we did at SAP and which Chris found
>>     reasonable), we will even need to execute the hanging step 5
>>     times. But I will wrap this into a small macro to make it easier
>>     on the eye.
>     I think with the step timeout being 1/2 the global timeout, you
>     would only need 2 steps to timeout in order to hit the global
>     timeout because presumably you'll have spent at least a tiny amount
>     of time in other steps already.
>
>
> Yes, you are of course right.

Yes but depending on how time is measured that "tiny amount" may or may
not be observable. It has to be at least a millisecond if using
currentTimeMillis.

David
-----

> Thomas
>
>
>     Chris
>
>>
>>
>>         1217     reporting_started();
>>         1218     set_to_now(&reporting_start_time);
>>
>>         I think the setting of reporting_start_time should be
>>         internalized in reporting_started.
>>
>>         Nit: Lets -> Let's  (it is a contraction of 'let us')
>>
>>         1272         st->print_cr("------ Timout during error
>>         reporting after " INT64_FORMAT "ms. ------",
>>
>>         Typo: timout
>>
>>         Also I gather we are potentially racing with the WatcherThread
>>         here, so no guarantee these final print statements will be seen?
>>
>>         1502       VMError::interrupt_reporting_thread();
>>
>>         Nit: you're inside a VMError function so shouldn't need the
>>         VMError prefix.
>>
>>         ---
>>
>>         src/share/vm/utilities/vmError.hpp
>>
>>         Typo: dependend
>>
>>         +   // Called by WatcherThread to check if the currently
>>         running error reporting did timeout.
>>         +   //  Returns true if reporting did hit the global
>>         ErrorLogTimeout.
>>
>>         This doesn't quite read right to me. How about:
>>
>>         // Called by the WatcherThread to check if error reporting has
>>         timed-out.
>>         // Returns true if error reporting has not completed within
>>         the ErrorLogTimeout limit
>>
>>         ---
>>
>>         test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>>
>>         I think using @requires and runtime Platform checks is
>>         redundant. I prefer @requires for both debug build testing and
>>         not-windows.
>>
>>
>>     All suggestions make sense - as do Chris' suggestions - and I will
>>     incorporate them into the new webrev. May take some days though.
>>
>>     Kind Regards, Thomas
>>
>>
>>
>>         On 2/02/2017 9:59 PM, Thomas Stüfe wrote:
>>
>>             Dear all,
>>
>>             please review (or re-review) this enhancement. It was
>>             already reviewed
>>             extensively for JDK9 in Oct 16, but pushed back to JDK10.
>>             Now that JDK10
>>             repo is open, I would like to close this issue.
>>
>>             In short, this enhancement lets the hotspot cancel hanging
>>             error
>>             reporting steps (when writing the hs-err file), in order
>>             to allow
>>             subsequent error reporting steps to run. In our JVM
>>             variant, this has
>>             been active for a long time and has proven to make error
>>             reporting more
>>             stable in the face of deadlocks or file system stalls.
>>
>>             Issue:
>>             https://bugs.openjdk.java.net/browse/JDK-8166944
>>             <https://bugs.openjdk.java.net/browse/JDK-8166944>
>>
>>             Webrev:
>>             http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/
>>             <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/>
>>
>>             Old Diskussion for JDK9:
>>             http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021582.html
>>             <http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021582.html>
>>
>>             The webrev is unchanged from version 2 discussed in
>>             october, just got
>>             rebased atop of JDK10/hs.
>>
>>             Thank you,
>>
>>             Kind Regards, Thomas
>>
>>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Thomas Stüfe-2
In reply to this post by Thomas Stüfe-2
Hi Chris, David,

Thanks for your reviews. I hope I addressed all of your issues. See further
details below.

new webrev:

http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03/webrev/

delta to the last one:

http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03-delta/webrev/


@Chris:

> Most files need 2017 copyright updates.

Done.

>set_to_now() and get_timestamp() should have names that tie them together
better. Maybe set_step_start_time() and get_step_start_time(). Also, they
>should probably be methods of VMError and the statics should all be fields.

Done. Added an accessor for every timeout value and a setter too (named
"record" to indicate that instead of passing a value it uses the current
timestamp).

I'm usually not a big fan of putting internals into a header, so I tend to
hide internals in a file-local static function. But if this is the way you
prefer it, this is fine too.

>In the first review I had mentioned that you only are allowing for one
step to hang, and still be able to continue with subsequent steps:

>1509     // A step times out after half of the total timeout. Steps are
mostly fast unless they
>1510     // hang for some reason, so this simple rule allows for one
hanging step and still
>1511     // leaves time enough for the rest of the steps to finish.
>1512     const jlong end = step_start_time_l + ErrorLogTimeout * 1000 / 2;
>
>You mentioned that for your SAP version you divide by 4, allowing for up
to 3 steps to hang. Is there a reason not to do this for the open version?
In fact it >looks like TestUnresponsiveErrorHandler is relying on that
being the case, since it will hang 3 times.

I changed this to the value we use at SAP (make the little timeouts a
quarter of the final timeout value). I now need to have five
timeout-test-steps in the error handler instead of the three earlier. I hid
them in a nice macro, so I hope they are not too much of an eye sore.

>Related to this, TimeoutInErrorHandlingTest.java is >only checking for one
instance of step timeout pattern. Probably it should check for 3 instances,
and >also one instance of the overall timeout pattern:
>
>1272         st->print_cr("------ Timout during error reporting after "
INT64_FORMAT "ms. ------",

I intentionally left this test to be very basic - see the comment in the
test. Basically, I am afraid of giving you guys a test which sporadically
fails if the timing is bad.

 As David pointed out, the final error message cannot be relied on - it
should appear in 99% of all cases but may not because error reporting
thread and WatcherThread race. I explicitly avoided adding synchronization
there, because it is better to sometimes miss output in the hs-err file
than for the VM to hang in error shutdown.

For the same reason I do not test for all the little timeout messages to
appear in the error log. I now just test for two messages of the
theoretical three or four max, to see that step timeout works repeatedly.

@David:

>src/os/windows/vm/vmError_windows.cpp

>Typo: canceleation

Done.

>As previously discussed Atomic::load/store for jlong do not depend on
supports_cx8. The commentary in atomic.hpp should really refer to 'atomic
read->modify-write operations on jlongs' - as every platform must support
atomic 64-bit load/store to support Java volatile long/double. So
set_to_now and >get_timestamp can be simplified, or even removed completely.

Sure! I removed the cx8 dependency, which makes the code simpler, but
decided to keep the accessor functions (renamed as get_.._start_time and
record_.._start_time)

>Does the timestamp really need to be the system time in millis or do you
just want to record relative elapsed time? If the latter then I suggest
using >javaTimeNanos()/100000 to avoid potential glitches with the system
time changing.

Did this. This exchanges the possibility of time glitches with possible
initialization issues (see non-apple bsd path and linux, e.g.) where the hr
timer api is dynamically loaded. But I agree, this is still better. The
latter is less probable to hurt, only if we crash at exactly the right time
during VM initialization.

>382   // TestUnresponsiveErrorHandler: three times to trigger first a step
timeout, then
>383   // the global error reporting timeout.

>I'm a little confused as to why three steps are needed to test two things??

See my earlier explanations. Now we have the 1:4 ratio between little/final
timeout values, so we need five steps.

>1272         st->print_cr("------ Timout during error reporting after "
INT64_FORMAT "ms. ------",

>Typo: timout

Done.

>Also I gather we are potentially racing with the WatcherThread here, so no
guarantee these final print statements will be seen?

Yes. I decided against synchronization here. The only point of this timeout
is to write a final message into the error log. If this is lost, it is not
a big deal. Note that we also have the final error message on stderr done
by the WatcherThread.

>1502       VMError::interrupt_reporting_thread();

>Nit: you're inside a VMError function so shouldn't need the VMError prefix.

Done.

>src/share/vm/utilities/vmError.hpp

>Typo: dependend

Done (comments changed).

> This doesn't quite read right to me. How about:..

Thanks for your suggestion, I took your version.

>test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java

>I think using @requires and runtime Platform checks is redundant. I prefer
@requires for both debug build testing and not-windows.

Done. Thanks for your suggestion.

---

Kind Regards, Thomas


On Thu, Feb 2, 2017 at 12:59 PM, Thomas Stüfe <[hidden email]>
wrote:

> Dear all,
>
> please review (or re-review) this enhancement. It was already reviewed
> extensively for JDK9 in Oct 16, but pushed back to JDK10. Now that JDK10
> repo is open, I would like to close this issue.
>
> In short, this enhancement lets the hotspot cancel hanging error reporting
> steps (when writing the hs-err file), in order to allow subsequent error
> reporting steps to run. In our JVM variant, this has been active for a long
> time and has proven to make error reporting more stable in the face of
> deadlocks or file system stalls.
>
> Issue:
> https://bugs.openjdk.java.net/browse/JDK-8166944
>
> Webrev:
> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
> Error-Reporting/jdk10-webrev.02/webrev/
>
> Old Diskussion for JDK9:
> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2
> 016-October/021582.html
>
> The webrev is unchanged from version 2 discussed in october, just got
> rebased atop of JDK10/hs.
>
> Thank you,
>
> Kind Regards, Thomas
>
Reply | Threaded
Open this post in threaded view
|

Re: RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

David Holmes
Hi Thomas,

Other than the remaining typos Chris pointed out this seems fine.

Thanks,
David

PS. We should be getting rid of many of the runtime loading cases eg
clock_gettime, in JDK 10.

PPS. Running this through JPRT just in case we overlooked something.
Also need to see whether we need to exclude the test from
hotspot_fast_runtime in TEST.groups.

On 6/02/2017 9:18 PM, Thomas Stüfe wrote:

> Hi Chris, David,
>
> Thanks for your reviews. I hope I addressed all of your issues. See
> further details below.
>
> new webrev:
>
> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03/webrev/
>
> delta to the last one:
>
> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03-delta/webrev/
>
>
> @Chris:
>
>> Most files need 2017 copyright updates.
>
> Done.
>
>>set_to_now() and get_timestamp() should have names that tie them together better. Maybe set_step_start_time() and get_step_start_time(). Also, they >should probably be methods of VMError and the statics should all be fields.
>
> Done. Added an accessor for every timeout value and a setter too (named
> "record" to indicate that instead of passing a value it uses the current
> timestamp).
>
> I'm usually not a big fan of putting internals into a header, so I tend
> to hide internals in a file-local static function. But if this is the
> way you prefer it, this is fine too.
>
>>In the first review I had mentioned that you only are allowing for one step to hang, and still be able to continue with subsequent steps:
>
>>1509     // A step times out after half of the total timeout. Steps are mostly fast unless they
>>1510     // hang for some reason, so this simple rule allows for one hanging step and still
>>1511     // leaves time enough for the rest of the steps to finish.
>>1512     const jlong end = step_start_time_l + ErrorLogTimeout * 1000 / 2;
>>
>>You mentioned that for your SAP version you divide by 4, allowing for up to 3 steps to hang. Is there a reason not to do this for the open version? In fact it >looks like TestUnresponsiveErrorHandler is relying on that being the case, since it will hang 3 times.
>
> I changed this to the value we use at SAP (make the little timeouts a
> quarter of the final timeout value). I now need to have five
> timeout-test-steps in the error handler instead of the three earlier. I
> hid them in a nice macro, so I hope they are not too much of an eye sore.
>
>>Related to this, TimeoutInErrorHandlingTest.java is >only checking for one instance of step timeout pattern. Probably
> it should check for 3 instances, and >also one instance of the overall
> timeout pattern:
>>
>>1272         st->print_cr("------ Timout during error reporting after " INT64_FORMAT "ms. ------",
>
> I intentionally left this test to be very basic - see the comment in the
> test. Basically, I am afraid of giving you guys a test which
> sporadically fails if the timing is bad.
>
>  As David pointed out, the final error message cannot be relied on - it
> should appear in 99% of all cases but may not because error reporting
> thread and WatcherThread race. I explicitly avoided adding
> synchronization there, because it is better to sometimes miss output in
> the hs-err file than for the VM to hang in error shutdown.
>
> For the same reason I do not test for all the little timeout messages to
> appear in the error log. I now just test for two messages of the
> theoretical three or four max, to see that step timeout works repeatedly.
>
> @David:
>
>>src/os/windows/vm/vmError_windows.cpp
>
>>Typo: canceleation
>
> Done.
>
>>As previously discussed Atomic::load/store for jlong do not depend on supports_cx8. The commentary in atomic.hpp should really refer to 'atomic read->modify-write operations on jlongs' - as every platform must support atomic 64-bit load/store to support Java volatile long/double. So set_to_now and >get_timestamp can be simplified, or even removed completely.
>
> Sure! I removed the cx8 dependency, which makes the code simpler, but
> decided to keep the accessor functions (renamed as get_.._start_time and
> record_.._start_time)
>
>>Does the timestamp really need to be the system time in millis or do you just want to record relative elapsed time? If the latter then I suggest using >javaTimeNanos()/100000 to avoid potential glitches with the system time changing.
>
> Did this. This exchanges the possibility of time glitches with possible
> initialization issues (see non-apple bsd path and linux, e.g.) where the
> hr timer api is dynamically loaded. But I agree, this is still better.
> The latter is less probable to hurt, only if we crash at exactly the
> right time during VM initialization.
>
>>382   // TestUnresponsiveErrorHandler: three times to trigger first a step timeout, then
>>383   // the global error reporting timeout.
>
>>I'm a little confused as to why three steps are needed to test two things??
>
> See my earlier explanations. Now we have the 1:4 ratio between
> little/final timeout values, so we need five steps.
>
>>1272         st->print_cr("------ Timout during error reporting after " INT64_FORMAT "ms. ------",
>
>>Typo: timout
>
> Done.
>
>>Also I gather we are potentially racing with the WatcherThread here, so
> no guarantee these final print statements will be seen?
>
> Yes. I decided against synchronization here. The only point of this
> timeout is to write a final message into the error log. If this is lost,
> it is not a big deal. Note that we also have the final error message on
> stderr done by the WatcherThread.
>
>>1502       VMError::interrupt_reporting_thread();
>
>>Nit: you're inside a VMError function so shouldn't need the VMError prefix.
>
> Done.
>
>>src/share/vm/utilities/vmError.hpp
>
>>Typo: dependend
>
> Done (comments changed).
>
>> This doesn't quite read right to me. How about:..
>
> Thanks for your suggestion, I took your version.
>
>>test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>
>>I think using @requires and runtime Platform checks is redundant. I prefer @requires for both debug build testing and not-windows.
>
> Done. Thanks for your suggestion.
>
> ---
>
> Kind Regards, Thomas
>
>
> On Thu, Feb 2, 2017 at 12:59 PM, Thomas Stüfe <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     Dear all,
>
>     please review (or re-review) this enhancement. It was already
>     reviewed extensively for JDK9 in Oct 16, but pushed back to JDK10.
>     Now that JDK10 repo is open, I would like to close this issue.
>
>     In short, this enhancement lets the hotspot cancel hanging error
>     reporting steps (when writing the hs-err file), in order to allow
>     subsequent error reporting steps to run. In our JVM variant, this
>     has been active for a long time and has proven to make error
>     reporting more stable in the face of deadlocks or file system stalls.
>
>     Issue:
>     https://bugs.openjdk.java.net/browse/JDK-8166944
>     <https://bugs.openjdk.java.net/browse/JDK-8166944>
>
>     Webrev:
>     http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/
>     <http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/>
>
>     Old Diskussion for JDK9:
>     http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021582.html
>     <http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021582.html>
>
>     The webrev is unchanged from version 2 discussed in october, just
>     got rebased atop of JDK10/hs.
>
>     Thank you,
>
>     Kind Regards, Thomas
>
>
Reply | Threaded
Open this post in threaded view
|

Re: RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Dmitry Dmitriev
In reply to this post by Thomas Stüfe-2
Hello Thomas,

Sorry for the late review. Few small comments about test:
1) Can you put FileInputStream and BufferedReader usage in the
try-with-resources block and remove br.close() on line 124?

2) Replace several empty lines with the one line: lines 45-46, lines 105-107

3) Remove last empty line(135)

Thank you!

Dmitry

On 06.02.2017 14:18, Thomas Stüfe wrote:

> Hi Chris, David,
>
> Thanks for your reviews. I hope I addressed all of your issues. See further
> details below.
>
> new webrev:
>
> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03/webrev/
>
> delta to the last one:
>
> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03-delta/webrev/
>
>
> @Chris:
>
>> Most files need 2017 copyright updates.
> Done.
>
>> set_to_now() and get_timestamp() should have names that tie them together
> better. Maybe set_step_start_time() and get_step_start_time(). Also, they
>> should probably be methods of VMError and the statics should all be fields.
> Done. Added an accessor for every timeout value and a setter too (named
> "record" to indicate that instead of passing a value it uses the current
> timestamp).
>
> I'm usually not a big fan of putting internals into a header, so I tend to
> hide internals in a file-local static function. But if this is the way you
> prefer it, this is fine too.
>
>> In the first review I had mentioned that you only are allowing for one
> step to hang, and still be able to continue with subsequent steps:
>
>> 1509     // A step times out after half of the total timeout. Steps are
> mostly fast unless they
>> 1510     // hang for some reason, so this simple rule allows for one
> hanging step and still
>> 1511     // leaves time enough for the rest of the steps to finish.
>> 1512     const jlong end = step_start_time_l + ErrorLogTimeout * 1000 / 2;
>>
>> You mentioned that for your SAP version you divide by 4, allowing for up
> to 3 steps to hang. Is there a reason not to do this for the open version?
> In fact it >looks like TestUnresponsiveErrorHandler is relying on that
> being the case, since it will hang 3 times.
>
> I changed this to the value we use at SAP (make the little timeouts a
> quarter of the final timeout value). I now need to have five
> timeout-test-steps in the error handler instead of the three earlier. I hid
> them in a nice macro, so I hope they are not too much of an eye sore.
>
>> Related to this, TimeoutInErrorHandlingTest.java is >only checking for one
> instance of step timeout pattern. Probably it should check for 3 instances,
> and >also one instance of the overall timeout pattern:
>> 1272         st->print_cr("------ Timout during error reporting after "
> INT64_FORMAT "ms. ------",
>
> I intentionally left this test to be very basic - see the comment in the
> test. Basically, I am afraid of giving you guys a test which sporadically
> fails if the timing is bad.
>
>   As David pointed out, the final error message cannot be relied on - it
> should appear in 99% of all cases but may not because error reporting
> thread and WatcherThread race. I explicitly avoided adding synchronization
> there, because it is better to sometimes miss output in the hs-err file
> than for the VM to hang in error shutdown.
>
> For the same reason I do not test for all the little timeout messages to
> appear in the error log. I now just test for two messages of the
> theoretical three or four max, to see that step timeout works repeatedly.
>
> @David:
>
>> src/os/windows/vm/vmError_windows.cpp
>> Typo: canceleation
> Done.
>
>> As previously discussed Atomic::load/store for jlong do not depend on
> supports_cx8. The commentary in atomic.hpp should really refer to 'atomic
> read->modify-write operations on jlongs' - as every platform must support
> atomic 64-bit load/store to support Java volatile long/double. So
> set_to_now and >get_timestamp can be simplified, or even removed completely.
>
> Sure! I removed the cx8 dependency, which makes the code simpler, but
> decided to keep the accessor functions (renamed as get_.._start_time and
> record_.._start_time)
>
>> Does the timestamp really need to be the system time in millis or do you
> just want to record relative elapsed time? If the latter then I suggest
> using >javaTimeNanos()/100000 to avoid potential glitches with the system
> time changing.
>
> Did this. This exchanges the possibility of time glitches with possible
> initialization issues (see non-apple bsd path and linux, e.g.) where the hr
> timer api is dynamically loaded. But I agree, this is still better. The
> latter is less probable to hurt, only if we crash at exactly the right time
> during VM initialization.
>
>> 382   // TestUnresponsiveErrorHandler: three times to trigger first a step
> timeout, then
>> 383   // the global error reporting timeout.
>> I'm a little confused as to why three steps are needed to test two things??
> See my earlier explanations. Now we have the 1:4 ratio between little/final
> timeout values, so we need five steps.
>
>> 1272         st->print_cr("------ Timout during error reporting after "
> INT64_FORMAT "ms. ------",
>
>> Typo: timout
> Done.
>
>> Also I gather we are potentially racing with the WatcherThread here, so no
> guarantee these final print statements will be seen?
>
> Yes. I decided against synchronization here. The only point of this timeout
> is to write a final message into the error log. If this is lost, it is not
> a big deal. Note that we also have the final error message on stderr done
> by the WatcherThread.
>
>> 1502       VMError::interrupt_reporting_thread();
>> Nit: you're inside a VMError function so shouldn't need the VMError prefix.
> Done.
>
>> src/share/vm/utilities/vmError.hpp
>> Typo: dependend
> Done (comments changed).
>
>> This doesn't quite read right to me. How about:..
> Thanks for your suggestion, I took your version.
>
>> test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>> I think using @requires and runtime Platform checks is redundant. I prefer
> @requires for both debug build testing and not-windows.
>
> Done. Thanks for your suggestion.
>
> ---
>
> Kind Regards, Thomas
>
>
> On Thu, Feb 2, 2017 at 12:59 PM, Thomas Stüfe <[hidden email]>
> wrote:
>
>> Dear all,
>>
>> please review (or re-review) this enhancement. It was already reviewed
>> extensively for JDK9 in Oct 16, but pushed back to JDK10. Now that JDK10
>> repo is open, I would like to close this issue.
>>
>> In short, this enhancement lets the hotspot cancel hanging error reporting
>> steps (when writing the hs-err file), in order to allow subsequent error
>> reporting steps to run. In our JVM variant, this has been active for a long
>> time and has proven to make error reporting more stable in the face of
>> deadlocks or file system stalls.
>>
>> Issue:
>> https://bugs.openjdk.java.net/browse/JDK-8166944
>>
>> Webrev:
>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>> Error-Reporting/jdk10-webrev.02/webrev/
>>
>> Old Diskussion for JDK9:
>> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2
>> 016-October/021582.html
>>
>> The webrev is unchanged from version 2 discussed in october, just got
>> rebased atop of JDK10/hs.
>>
>> Thank you,
>>
>> Kind Regards, Thomas
>>

Reply | Threaded
Open this post in threaded view
|

Re: RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

David Holmes
In reply to this post by David Holmes
Hi Thomas,

The test is quite a slow one and so we would like to exclude it from the
hotspot_fast_runtime group in TEST.groups (there are other ErrorHandling
tests excluded there already).

Thanks,
David

On 7/02/2017 3:58 PM, David Holmes wrote:

> Hi Thomas,
>
> Other than the remaining typos Chris pointed out this seems fine.
>
> Thanks,
> David
>
> PS. We should be getting rid of many of the runtime loading cases eg
> clock_gettime, in JDK 10.
>
> PPS. Running this through JPRT just in case we overlooked something.
> Also need to see whether we need to exclude the test from
> hotspot_fast_runtime in TEST.groups.
>
> On 6/02/2017 9:18 PM, Thomas Stüfe wrote:
>> Hi Chris, David,
>>
>> Thanks for your reviews. I hope I addressed all of your issues. See
>> further details below.
>>
>> new webrev:
>>
>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03/webrev/
>>
>>
>> delta to the last one:
>>
>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03-delta/webrev/
>>
>>
>>
>> @Chris:
>>
>>> Most files need 2017 copyright updates.
>>
>> Done.
>>
>>> set_to_now() and get_timestamp() should have names that tie them
>>> together better. Maybe set_step_start_time() and
>>> get_step_start_time(). Also, they >should probably be methods of
>>> VMError and the statics should all be fields.
>>
>> Done. Added an accessor for every timeout value and a setter too (named
>> "record" to indicate that instead of passing a value it uses the current
>> timestamp).
>>
>> I'm usually not a big fan of putting internals into a header, so I tend
>> to hide internals in a file-local static function. But if this is the
>> way you prefer it, this is fine too.
>>
>>> In the first review I had mentioned that you only are allowing for
>>> one step to hang, and still be able to continue with subsequent steps:
>>
>>> 1509     // A step times out after half of the total timeout. Steps
>>> are mostly fast unless they
>>> 1510     // hang for some reason, so this simple rule allows for one
>>> hanging step and still
>>> 1511     // leaves time enough for the rest of the steps to finish.
>>> 1512     const jlong end = step_start_time_l + ErrorLogTimeout * 1000
>>> / 2;
>>>
>>> You mentioned that for your SAP version you divide by 4, allowing for
>>> up to 3 steps to hang. Is there a reason not to do this for the open
>>> version? In fact it >looks like TestUnresponsiveErrorHandler is
>>> relying on that being the case, since it will hang 3 times.
>>
>> I changed this to the value we use at SAP (make the little timeouts a
>> quarter of the final timeout value). I now need to have five
>> timeout-test-steps in the error handler instead of the three earlier. I
>> hid them in a nice macro, so I hope they are not too much of an eye sore.
>>
>>> Related to this, TimeoutInErrorHandlingTest.java is >only checking
>>> for one instance of step timeout pattern. Probably
>> it should check for 3 instances, and >also one instance of the overall
>> timeout pattern:
>>>
>>> 1272         st->print_cr("------ Timout during error reporting after
>>> " INT64_FORMAT "ms. ------",
>>
>> I intentionally left this test to be very basic - see the comment in the
>> test. Basically, I am afraid of giving you guys a test which
>> sporadically fails if the timing is bad.
>>
>>  As David pointed out, the final error message cannot be relied on - it
>> should appear in 99% of all cases but may not because error reporting
>> thread and WatcherThread race. I explicitly avoided adding
>> synchronization there, because it is better to sometimes miss output in
>> the hs-err file than for the VM to hang in error shutdown.
>>
>> For the same reason I do not test for all the little timeout messages to
>> appear in the error log. I now just test for two messages of the
>> theoretical three or four max, to see that step timeout works repeatedly.
>>
>> @David:
>>
>>> src/os/windows/vm/vmError_windows.cpp
>>
>>> Typo: canceleation
>>
>> Done.
>>
>>> As previously discussed Atomic::load/store for jlong do not depend on
>>> supports_cx8. The commentary in atomic.hpp should really refer to
>>> 'atomic read->modify-write operations on jlongs' - as every platform
>>> must support atomic 64-bit load/store to support Java volatile
>>> long/double. So set_to_now and >get_timestamp can be simplified, or
>>> even removed completely.
>>
>> Sure! I removed the cx8 dependency, which makes the code simpler, but
>> decided to keep the accessor functions (renamed as get_.._start_time and
>> record_.._start_time)
>>
>>> Does the timestamp really need to be the system time in millis or do
>>> you just want to record relative elapsed time? If the latter then I
>>> suggest using >javaTimeNanos()/100000 to avoid potential glitches
>>> with the system time changing.
>>
>> Did this. This exchanges the possibility of time glitches with possible
>> initialization issues (see non-apple bsd path and linux, e.g.) where the
>> hr timer api is dynamically loaded. But I agree, this is still better.
>> The latter is less probable to hurt, only if we crash at exactly the
>> right time during VM initialization.
>>
>>> 382   // TestUnresponsiveErrorHandler: three times to trigger first a
>>> step timeout, then
>>> 383   // the global error reporting timeout.
>>
>>> I'm a little confused as to why three steps are needed to test two
>>> things??
>>
>> See my earlier explanations. Now we have the 1:4 ratio between
>> little/final timeout values, so we need five steps.
>>
>>> 1272         st->print_cr("------ Timout during error reporting after
>>> " INT64_FORMAT "ms. ------",
>>
>>> Typo: timout
>>
>> Done.
>>
>>> Also I gather we are potentially racing with the WatcherThread here, so
>> no guarantee these final print statements will be seen?
>>
>> Yes. I decided against synchronization here. The only point of this
>> timeout is to write a final message into the error log. If this is lost,
>> it is not a big deal. Note that we also have the final error message on
>> stderr done by the WatcherThread.
>>
>>> 1502       VMError::interrupt_reporting_thread();
>>
>>> Nit: you're inside a VMError function so shouldn't need the VMError
>>> prefix.
>>
>> Done.
>>
>>> src/share/vm/utilities/vmError.hpp
>>
>>> Typo: dependend
>>
>> Done (comments changed).
>>
>>> This doesn't quite read right to me. How about:..
>>
>> Thanks for your suggestion, I took your version.
>>
>>> test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>>
>>> I think using @requires and runtime Platform checks is redundant. I
>>> prefer @requires for both debug build testing and not-windows.
>>
>> Done. Thanks for your suggestion.
>>
>> ---
>>
>> Kind Regards, Thomas
>>
>>
>> On Thu, Feb 2, 2017 at 12:59 PM, Thomas Stüfe <[hidden email]
>> <mailto:[hidden email]>> wrote:
>>
>>     Dear all,
>>
>>     please review (or re-review) this enhancement. It was already
>>     reviewed extensively for JDK9 in Oct 16, but pushed back to JDK10.
>>     Now that JDK10 repo is open, I would like to close this issue.
>>
>>     In short, this enhancement lets the hotspot cancel hanging error
>>     reporting steps (when writing the hs-err file), in order to allow
>>     subsequent error reporting steps to run. In our JVM variant, this
>>     has been active for a long time and has proven to make error
>>     reporting more stable in the face of deadlocks or file system stalls.
>>
>>     Issue:
>>     https://bugs.openjdk.java.net/browse/JDK-8166944
>>     <https://bugs.openjdk.java.net/browse/JDK-8166944>
>>
>>     Webrev:
>>
>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/
>>
>>
>> <http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/>
>>
>>
>>     Old Diskussion for JDK9:
>>
>> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021582.html
>>
>>
>> <http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021582.html>
>>
>>
>>     The webrev is unchanged from version 2 discussed in october, just
>>     got rebased atop of JDK10/hs.
>>
>>     Thank you,
>>
>>     Kind Regards, Thomas
>>
>>
Reply | Threaded
Open this post in threaded view
|

Re: RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Thomas Stüfe-2
In reply to this post by Thomas Stüfe-2
Hi Chris, David,

thanks for your patience. Let's wrap this up.

New webrev:
http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.04/webrev/

Delta to last webrev:
http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.04-delta/webrev/

Almost identical to the last version, only the typos Chris found were fixed.

@David:

>The test is quite a slow one and so we would like to exclude it from the
>hotspot_fast_runtime group in TEST.groups (there are other ErrorHandling
tests >excluded there already).

This is fine by me. Do you need me to do something?

Kind Regards, Thomas



On Tue, Feb 7, 2017 at 2:04 AM, Chris Plummer <[hidden email]>
wrote:

> Hi Tomas,
>
> vmError_windows.cpp: Getting closer. Should be "cancellation".
>
> vmError.cpp: Should be "let's"
>
>  389   // global timeout, lets execute the timeout step five times.
>
> And in response to your comment:
>
> I'm usually not a big fan of putting internals into a header, so I tend to
> hide internals in a file-local static function. But if this is the way you
> prefer it, this is fine too.
>
>
> I'm no hotspot coding conventions expert, but I have frequently seen
> statics declared in the class scope (private or protected), and don't see
> much in the way of statics declared in .cpp files (although I see plenty of
> static helper functions in .cpp files). It seems more consistent to declare
> them in the class scope, but I could be wrong here. I'm not personally
> attached to one way or another.
>
> Other than that it looks good. No need for another review.
>
> thanks,
>
> Chris
>
>
> On 2/6/17 3:18 AM, Thomas Stüfe wrote:
>
> Hi Chris, David,
>
> Thanks for your reviews. I hope I addressed all of your issues. See
> further details below.
>
> new webrev:
>
> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-
> Hanging-Error-Reporting/jdk10-webrev.03/webrev/
>
> delta to the last one:
>
> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-
> Hanging-Error-Reporting/jdk10-webrev.03-delta/webrev/
>
>
> @Chris:
>
> > Most files need 2017 copyright updates.
>
> Done.
>
> >set_to_now() and get_timestamp() should have names that tie them together
> better. Maybe set_step_start_time() and get_step_start_time(). Also, they
> >should probably be methods of VMError and the statics should all be fields.
>
> Done. Added an accessor for every timeout value and a setter too (named
> "record" to indicate that instead of passing a value it uses the current
> timestamp).
>
> I'm usually not a big fan of putting internals into a header, so I tend to
> hide internals in a file-local static function. But if this is the way you
> prefer it, this is fine too.
>
> >In the first review I had mentioned that you only are allowing for one
> step to hang, and still be able to continue with subsequent steps:
>
> >1509     // A step times out after half of the total timeout. Steps are
> mostly fast unless they
> >1510     // hang for some reason, so this simple rule allows for one
> hanging step and still
> >1511     // leaves time enough for the rest of the steps to finish.
> >1512     const jlong end = step_start_time_l + ErrorLogTimeout * 1000 / 2;
> >
> >You mentioned that for your SAP version you divide by 4, allowing for up
> to 3 steps to hang. Is there a reason not to do this for the open version?
> In fact it >looks like TestUnresponsiveErrorHandler is relying on that
> being the case, since it will hang 3 times.
>
> I changed this to the value we use at SAP (make the little timeouts a
> quarter of the final timeout value). I now need to have five
> timeout-test-steps in the error handler instead of the three earlier. I hid
> them in a nice macro, so I hope they are not too much of an eye sore.
>
> >Related to this, TimeoutInErrorHandlingTest.java is >only checking for
> one instance of step timeout pattern. Probably it should check for 3
> instances, and >also one instance of the overall timeout pattern:
> >
> >1272         st->print_cr("------ Timout during error reporting after "
> INT64_FORMAT "ms. ------",
>
> I intentionally left this test to be very basic - see the comment in the
> test. Basically, I am afraid of giving you guys a test which sporadically
> fails if the timing is bad.
>
>  As David pointed out, the final error message cannot be relied on - it
> should appear in 99% of all cases but may not because error reporting
> thread and WatcherThread race. I explicitly avoided adding synchronization
> there, because it is better to sometimes miss output in the hs-err file
> than for the VM to hang in error shutdown.
>
> For the same reason I do not test for all the little timeout messages to
> appear in the error log. I now just test for two messages of the
> theoretical three or four max, to see that step timeout works repeatedly.
>
> @David:
>
> >src/os/windows/vm/vmError_windows.cpp
>
> >Typo: canceleation
>
> Done.
>
> >As previously discussed Atomic::load/store for jlong do not depend on
> supports_cx8. The commentary in atomic.hpp should really refer to 'atomic
> read->modify-write operations on jlongs' - as every platform must support
> atomic 64-bit load/store to support Java volatile long/double. So
> set_to_now and >get_timestamp can be simplified, or even removed completely.
>
> Sure! I removed the cx8 dependency, which makes the code simpler, but
> decided to keep the accessor functions (renamed as get_.._start_time and
> record_.._start_time)
>
> >Does the timestamp really need to be the system time in millis or do you
> just want to record relative elapsed time? If the latter then I suggest
> using >javaTimeNanos()/100000 to avoid potential glitches with the system
> time changing.
>
> Did this. This exchanges the possibility of time glitches with possible
> initialization issues (see non-apple bsd path and linux, e.g.) where the hr
> timer api is dynamically loaded. But I agree, this is still better. The
> latter is less probable to hurt, only if we crash at exactly the right time
> during VM initialization.
>
> >382   // TestUnresponsiveErrorHandler: three times to trigger first a
> step timeout, then
> >383   // the global error reporting timeout.
>
> >I'm a little confused as to why three steps are needed to test two
> things??
>
> See my earlier explanations. Now we have the 1:4 ratio between
> little/final timeout values, so we need five steps.
>
> >1272         st->print_cr("------ Timout during error reporting after "
> INT64_FORMAT "ms. ------",
>
> >Typo: timout
>
> Done.
>
> >Also I gather we are potentially racing with the WatcherThread here, so
> no guarantee these final print statements will be seen?
>
> Yes. I decided against synchronization here. The only point of this
> timeout is to write a final message into the error log. If this is lost, it
> is not a big deal. Note that we also have the final error message on stderr
> done by the WatcherThread.
>
> >1502       VMError::interrupt_reporting_thread();
>
> >Nit: you're inside a VMError function so shouldn't need the VMError
> prefix.
>
> Done.
>
> >src/share/vm/utilities/vmError.hpp
>
> >Typo: dependend
>
> Done (comments changed).
>
> > This doesn't quite read right to me. How about:..
>
> Thanks for your suggestion, I took your version.
>
> >test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>
> >I think using @requires and runtime Platform checks is redundant. I
> prefer @requires for both debug build testing and not-windows.
>
> Done. Thanks for your suggestion.
>
> ---
>
> Kind Regards, Thomas
>
>
> On Thu, Feb 2, 2017 at 12:59 PM, Thomas Stüfe <[hidden email]>
> wrote:
>
>> Dear all,
>>
>> please review (or re-review) this enhancement. It was already reviewed
>> extensively for JDK9 in Oct 16, but pushed back to JDK10. Now that JDK10
>> repo is open, I would like to close this issue.
>>
>> In short, this enhancement lets the hotspot cancel hanging error
>> reporting steps (when writing the hs-err file), in order to allow
>> subsequent error reporting steps to run. In our JVM variant, this has been
>> active for a long time and has proven to make error reporting more stable
>> in the face of deadlocks or file system stalls.
>>
>> Issue:
>> https://bugs.openjdk.java.net/browse/JDK-8166944
>>
>> Webrev:
>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-E
>> rror-Reporting/jdk10-webrev.02/webrev/
>>
>> Old Diskussion for JDK9:
>> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2
>> 016-October/021582.html
>>
>> The webrev is unchanged from version 2 discussed in october, just got
>> rebased atop of JDK10/hs.
>>
>> Thank you,
>>
>> Kind Regards, Thomas
>>
>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Thomas Stüfe-2
Hi Chris,

thank you for the detailed explanation, I modified TEST.groups in hotspot
accordingly. I modified the webrev.04 in place.

Kind Regards, Thomas

On Wed, Feb 22, 2017 at 7:44 PM, Chris Plummer <[hidden email]>
wrote:

> On 2/22/17 6:35 AM, Thomas Stüfe wrote:
>
> Hi Chris, David,
>
> thanks for your patience. Let's wrap this up.
>
> New webrev: http://cr.openjdk.java.net/~stuefe/webrevs/
> 8166944-Hanging-Error-Reporting/jdk10-webrev.04/webrev/
>
> Delta to last webrev: http://cr.openjdk.java.net/~stuefe/webrevs/
> 8166944-Hanging-Error-Reporting/jdk10-webrev.04-delta/webrev/
>
> Almost identical to the last version, only the typos Chris found were
> fixed.
>
> Looks good!
>
>
> @David:
>
> >The test is quite a slow one and so we would like to exclude it from the
> >hotspot_fast_runtime group in TEST.groups (there are other ErrorHandling
> tests >excluded there already).
>
> This is fine by me. Do you need me to do something?
>
> What David said above. Look in hotspot/test/TEST.groups. Find
> hotspot_fast_runtime. Add the test to the list of tests to not run by
> putting a '-' in front of it. Currently there is:
>
> hotspot_fast_runtime = \
>   runtime/ \
>  -runtime/6626217/Test6626217.sh \
>  -runtime/7100935 \
>  -runtime/7158988/FieldMonitor.java \
>  -runtime/CommandLine/OptionsValidation/TestOptionsWithRanges.java \
>  -runtime/CommandLine/PrintGCApplicationConcurrentTime.java \
>  -runtime/ConstantPool/IntfMethod.java \
>  -runtime/ErrorHandling/CreateCoredumpOnCrash.java \
>  -runtime/ErrorHandling/ErrorHandler.java \
>  -runtime/logging/MonitorMismatchTest.java \
>  -runtime/memory/ReserveMemory.java \
> ...
>
> So that means all tests in the runtime directory are included in the
> group, except the tests that follow with the '-' prefix. You can see there
> are already a couple of ErrorHandling tests being excluded, so just add
> yours after them.
>
> cheers,
>
> Chris
>
>
> Kind Regards, Thomas
>
>
>
> On Tue, Feb 7, 2017 at 2:04 AM, Chris Plummer <[hidden email]>
> wrote:
>
>> Hi Tomas,
>>
>> vmError_windows.cpp: Getting closer. Should be "cancellation".
>>
>> vmError.cpp: Should be "let's"
>>
>>  389   // global timeout, lets execute the timeout step five times.
>>
>> And in response to your comment:
>>
>> I'm usually not a big fan of putting internals into a header, so I tend
>> to hide internals in a file-local static function. But if this is the way
>> you prefer it, this is fine too.
>>
>>
>> I'm no hotspot coding conventions expert, but I have frequently seen
>> statics declared in the class scope (private or protected), and don't see
>> much in the way of statics declared in .cpp files (although I see plenty of
>> static helper functions in .cpp files). It seems more consistent to declare
>> them in the class scope, but I could be wrong here. I'm not personally
>> attached to one way or another.
>>
>> Other than that it looks good. No need for another review.
>>
>> thanks,
>>
>> Chris
>>
>>
>> On 2/6/17 3:18 AM, Thomas Stüfe wrote:
>>
>> Hi Chris, David,
>>
>> Thanks for your reviews. I hope I addressed all of your issues. See
>> further details below.
>>
>> new webrev:
>>
>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>> Error-Reporting/jdk10-webrev.03/webrev/
>>
>> delta to the last one:
>>
>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>> Error-Reporting/jdk10-webrev.03-delta/webrev/
>>
>>
>> @Chris:
>>
>> > Most files need 2017 copyright updates.
>>
>> Done.
>>
>> >set_to_now() and get_timestamp() should have names that tie them
>> together better. Maybe set_step_start_time() and get_step_start_time().
>> Also, they >should probably be methods of VMError and the statics should
>> all be fields.
>>
>> Done. Added an accessor for every timeout value and a setter too (named
>> "record" to indicate that instead of passing a value it uses the current
>> timestamp).
>>
>> I'm usually not a big fan of putting internals into a header, so I tend
>> to hide internals in a file-local static function. But if this is the way
>> you prefer it, this is fine too.
>>
>> >In the first review I had mentioned that you only are allowing for one
>> step to hang, and still be able to continue with subsequent steps:
>>
>> >1509     // A step times out after half of the total timeout. Steps are
>> mostly fast unless they
>> >1510     // hang for some reason, so this simple rule allows for one
>> hanging step and still
>> >1511     // leaves time enough for the rest of the steps to finish.
>> >1512     const jlong end = step_start_time_l + ErrorLogTimeout * 1000 /
>> 2;
>> >
>> >You mentioned that for your SAP version you divide by 4, allowing for up
>> to 3 steps to hang. Is there a reason not to do this for the open version?
>> In fact it >looks like TestUnresponsiveErrorHandler is relying on that
>> being the case, since it will hang 3 times.
>>
>> I changed this to the value we use at SAP (make the little timeouts a
>> quarter of the final timeout value). I now need to have five
>> timeout-test-steps in the error handler instead of the three earlier. I hid
>> them in a nice macro, so I hope they are not too much of an eye sore.
>>
>> >Related to this, TimeoutInErrorHandlingTest.java is >only checking for
>> one instance of step timeout pattern. Probably it should check for 3
>> instances, and >also one instance of the overall timeout pattern:
>> >
>> >1272         st->print_cr("------ Timout during error reporting after "
>> INT64_FORMAT "ms. ------",
>>
>> I intentionally left this test to be very basic - see the comment in the
>> test. Basically, I am afraid of giving you guys a test which sporadically
>> fails if the timing is bad.
>>
>>  As David pointed out, the final error message cannot be relied on - it
>> should appear in 99% of all cases but may not because error reporting
>> thread and WatcherThread race. I explicitly avoided adding synchronization
>> there, because it is better to sometimes miss output in the hs-err file
>> than for the VM to hang in error shutdown.
>>
>> For the same reason I do not test for all the little timeout messages to
>> appear in the error log. I now just test for two messages of the
>> theoretical three or four max, to see that step timeout works repeatedly.
>>
>> @David:
>>
>> >src/os/windows/vm/vmError_windows.cpp
>>
>> >Typo: canceleation
>>
>> Done.
>>
>> >As previously discussed Atomic::load/store for jlong do not depend on
>> supports_cx8. The commentary in atomic.hpp should really refer to 'atomic
>> read->modify-write operations on jlongs' - as every platform must support
>> atomic 64-bit load/store to support Java volatile long/double. So
>> set_to_now and >get_timestamp can be simplified, or even removed completely.
>>
>> Sure! I removed the cx8 dependency, which makes the code simpler, but
>> decided to keep the accessor functions (renamed as get_.._start_time and
>> record_.._start_time)
>>
>> >Does the timestamp really need to be the system time in millis or do you
>> just want to record relative elapsed time? If the latter then I suggest
>> using >javaTimeNanos()/100000 to avoid potential glitches with the system
>> time changing.
>>
>> Did this. This exchanges the possibility of time glitches with possible
>> initialization issues (see non-apple bsd path and linux, e.g.) where the hr
>> timer api is dynamically loaded. But I agree, this is still better. The
>> latter is less probable to hurt, only if we crash at exactly the right time
>> during VM initialization.
>>
>> >382   // TestUnresponsiveErrorHandler: three times to trigger first a
>> step timeout, then
>> >383   // the global error reporting timeout.
>>
>> >I'm a little confused as to why three steps are needed to test two
>> things??
>>
>> See my earlier explanations. Now we have the 1:4 ratio between
>> little/final timeout values, so we need five steps.
>>
>> >1272         st->print_cr("------ Timout during error reporting after "
>> INT64_FORMAT "ms. ------",
>>
>> >Typo: timout
>>
>> Done.
>>
>> >Also I gather we are potentially racing with the WatcherThread here, so
>> no guarantee these final print statements will be seen?
>>
>> Yes. I decided against synchronization here. The only point of this
>> timeout is to write a final message into the error log. If this is lost, it
>> is not a big deal. Note that we also have the final error message on stderr
>> done by the WatcherThread.
>>
>> >1502       VMError::interrupt_reporting_thread();
>>
>> >Nit: you're inside a VMError function so shouldn't need the VMError
>> prefix.
>>
>> Done.
>>
>> >src/share/vm/utilities/vmError.hpp
>>
>> >Typo: dependend
>>
>> Done (comments changed).
>>
>> > This doesn't quite read right to me. How about:..
>>
>> Thanks for your suggestion, I took your version.
>>
>> >test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>>
>> >I think using @requires and runtime Platform checks is redundant. I
>> prefer @requires for both debug build testing and not-windows.
>>
>> Done. Thanks for your suggestion.
>>
>> ---
>>
>> Kind Regards, Thomas
>>
>>
>> On Thu, Feb 2, 2017 at 12:59 PM, Thomas Stüfe <[hidden email]>
>> wrote:
>>
>>> Dear all,
>>>
>>> please review (or re-review) this enhancement. It was already reviewed
>>> extensively for JDK9 in Oct 16, but pushed back to JDK10. Now that JDK10
>>> repo is open, I would like to close this issue.
>>>
>>> In short, this enhancement lets the hotspot cancel hanging error
>>> reporting steps (when writing the hs-err file), in order to allow
>>> subsequent error reporting steps to run. In our JVM variant, this has been
>>> active for a long time and has proven to make error reporting more stable
>>> in the face of deadlocks or file system stalls.
>>>
>>> Issue:
>>> https://bugs.openjdk.java.net/browse/JDK-8166944
>>>
>>> Webrev:
>>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-E
>>> rror-Reporting/jdk10-webrev.02/webrev/
>>>
>>> Old Diskussion for JDK9:
>>> http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2
>>> 016-October/021582.html
>>>
>>> The webrev is unchanged from version 2 discussed in october, just got
>>> rebased atop of JDK10/hs.
>>>
>>> Thank you,
>>>
>>> Kind Regards, Thomas
>>>
>>
>>
>>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

David Holmes
Looks good Thomas!

Thanks,
David

On 23/02/2017 5:32 PM, Thomas Stüfe wrote:

> Hi Chris,
>
> thank you for the detailed explanation, I modified TEST.groups in
> hotspot accordingly. I modified the webrev.04 in place.
>
> Kind Regards, Thomas
>
> On Wed, Feb 22, 2017 at 7:44 PM, Chris Plummer <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     On 2/22/17 6:35 AM, Thomas Stüfe wrote:
>>     Hi Chris, David,
>>
>>     thanks for your patience. Let's wrap this up.
>>
>>     New
>>     webrev: http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.04/webrev/
>>     <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.04/webrev/>
>>
>>     Delta to last
>>     webrev: http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.04-delta/webrev/
>>     <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.04-delta/webrev/>
>>
>>     Almost identical to the last version, only the typos Chris found
>>     were fixed.
>     Looks good!
>>
>>     @David:
>>
>>     >The test is quite a slow one and so we would like to exclude it from the
>>     >hotspot_fast_runtime group in TEST.groups (there are other
>>     ErrorHandling tests >excluded there already).
>>
>>     This is fine by me. Do you need me to do something?
>     What David said above. Look in hotspot/test/TEST.groups. Find
>     hotspot_fast_runtime. Add the test to the list of tests to not run
>     by putting a '-' in front of it. Currently there is:
>
>     hotspot_fast_runtime = \
>       runtime/ \
>      -runtime/6626217/Test6626217.sh \
>      -runtime/7100935 \
>      -runtime/7158988/FieldMonitor.java \
>      -runtime/CommandLine/OptionsValidation/TestOptionsWithRanges.java \
>      -runtime/CommandLine/PrintGCApplicationConcurrentTime.java \
>      -runtime/ConstantPool/IntfMethod.java \
>      -runtime/ErrorHandling/CreateCoredumpOnCrash.java \
>      -runtime/ErrorHandling/ErrorHandler.java \
>      -runtime/logging/MonitorMismatchTest.java \
>      -runtime/memory/ReserveMemory.java \
>     ...
>
>     So that means all tests in the runtime directory are included in the
>     group, except the tests that follow with the '-' prefix. You can see
>     there are already a couple of ErrorHandling tests being excluded, so
>     just add yours after them.
>
>     cheers,
>
>     Chris
>
>>
>>     Kind Regards, Thomas
>>
>>
>>
>>     On Tue, Feb 7, 2017 at 2:04 AM, Chris Plummer
>>     <[hidden email] <mailto:[hidden email]>> wrote:
>>
>>         Hi Tomas,
>>
>>         vmError_windows.cpp: Getting closer. Should be "cancellation".
>>
>>         vmError.cpp: Should be "let's"
>>
>>          389   // global timeout, lets execute the timeout step five
>>         times.
>>
>>         And in response to your comment:
>>
>>>         I'm usually not a big fan of putting internals into a header,
>>>         so I tend to hide internals in a file-local static function.
>>>         But if this is the way you prefer it, this is fine too.
>>
>>         I'm no hotspot coding conventions expert, but I have
>>         frequently seen statics declared in the class scope (private
>>         or protected), and don't see much in the way of statics
>>         declared in .cpp files (although I see plenty of static helper
>>         functions in .cpp files). It seems more consistent to declare
>>         them in the class scope, but I could be wrong here. I'm not
>>         personally attached to one way or another.
>>
>>         Other than that it looks good. No need for another review.
>>
>>         thanks,
>>
>>         Chris
>>
>>
>>         On 2/6/17 3:18 AM, Thomas Stüfe wrote:
>>>         Hi Chris, David,
>>>
>>>         Thanks for your reviews. I hope I addressed all of your
>>>         issues. See further details below.
>>>
>>>         new webrev:
>>>
>>>         http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03/webrev/
>>>         <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03/webrev/>
>>>
>>>         delta to the last one:
>>>
>>>         http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03-delta/webrev/
>>>         <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.03-delta/webrev/>
>>>
>>>
>>>         @Chris:
>>>
>>>         > Most files need 2017 copyright updates.
>>>
>>>         Done.
>>>
>>>         >set_to_now() and get_timestamp() should have names that tie them together
>>>         better. Maybe set_step_start_time() and
>>>         get_step_start_time(). Also, they >should probably be methods
>>>         of VMError and the statics should all be fields.
>>>
>>>         Done. Added an accessor for every timeout value and a setter
>>>         too (named "record" to indicate that instead of passing a
>>>         value it uses the current timestamp).
>>>
>>>         I'm usually not a big fan of putting internals into a header,
>>>         so I tend to hide internals in a file-local static function.
>>>         But if this is the way you prefer it, this is fine too.
>>>
>>>         >In the first review I had mentioned that you only are allowing for one
>>>         step to hang, and still be able to continue with subsequent
>>>         steps:
>>>
>>>         >1509     // A step times out after half of the total timeout. Steps are
>>>         mostly fast unless they
>>>         >1510     // hang for some reason, so this simple rule allows for one
>>>         hanging step and still
>>>         >1511     // leaves time enough for the rest of the steps to finish.
>>>         >1512     const jlong end = step_start_time_l + ErrorLogTimeout *
>>>         1000 / 2;
>>>         >
>>>         >You mentioned that for your SAP version you divide by 4, allowing for up
>>>         to 3 steps to hang. Is there a reason not to do this for the
>>>         open version? In fact it >looks like
>>>         TestUnresponsiveErrorHandler is relying on that being the
>>>         case, since it will hang 3 times.
>>>
>>>         I changed this to the value we use at SAP (make the little
>>>         timeouts a quarter of the final timeout value). I now need to
>>>         have five timeout-test-steps in the error handler instead of
>>>         the three earlier. I hid them in a nice macro, so I hope they
>>>         are not too much of an eye sore.
>>>
>>>         >Related to this, TimeoutInErrorHandlingTest.java is >only checking for
>>>         one instance of step timeout pattern. Probably it should
>>>         check for 3 instances, and >also one instance of the overall
>>>         timeout pattern:
>>>         >
>>>         >1272         st->print_cr("------ Timout during error reporting after "
>>>         INT64_FORMAT "ms. ------",
>>>
>>>         I intentionally left this test to be very basic - see the
>>>         comment in the test. Basically, I am afraid of giving you
>>>         guys a test which sporadically fails if the timing is bad.
>>>
>>>          As David pointed out, the final error message cannot be
>>>         relied on - it should appear in 99% of all cases but may not
>>>         because error reporting thread and WatcherThread race. I
>>>         explicitly avoided adding synchronization there, because it
>>>         is better to sometimes miss output in the hs-err file than
>>>         for the VM to hang in error shutdown.
>>>
>>>         For the same reason I do not test for all the little timeout
>>>         messages to appear in the error log. I now just test for two
>>>         messages of the theoretical three or four max, to see that
>>>         step timeout works repeatedly.
>>>
>>>         @David:
>>>
>>>         >src/os/windows/vm/vmError_windows.cpp
>>>
>>>         >Typo: canceleation
>>>
>>>         Done.
>>>
>>>         >As previously discussed Atomic::load/store for jlong do not depend on
>>>         supports_cx8. The commentary in atomic.hpp should really
>>>         refer to 'atomic read->modify-write operations on jlongs' -
>>>         as every platform must support atomic 64-bit load/store to
>>>         support Java volatile long/double. So set_to_now and
>>>         >get_timestamp can be simplified, or even removed completely.
>>>
>>>         Sure! I removed the cx8 dependency, which makes the code
>>>         simpler, but decided to keep the accessor functions (renamed
>>>         as get_.._start_time and record_.._start_time)
>>>
>>>         >Does the timestamp really need to be the system time in millis or do
>>>         you just want to record relative elapsed time? If the latter
>>>         then I suggest using >javaTimeNanos()/100000 to avoid
>>>         potential glitches with the system time changing.
>>>
>>>         Did this. This exchanges the possibility of time glitches
>>>         with possible initialization issues (see non-apple bsd path
>>>         and linux, e.g.) where the hr timer api is dynamically
>>>         loaded. But I agree, this is still better. The latter is less
>>>         probable to hurt, only if we crash at exactly the right time
>>>         during VM initialization.
>>>
>>>         >382   // TestUnresponsiveErrorHandler: three times to trigger first a
>>>         step timeout, then
>>>         >383   // the global error reporting timeout.
>>>
>>>         >I'm a little confused as to why three steps are needed to test two things??
>>>
>>>         See my earlier explanations. Now we have the 1:4 ratio
>>>         between little/final timeout values, so we need five steps.
>>>
>>>         >1272         st->print_cr("------ Timout during error reporting after "
>>>         INT64_FORMAT "ms. ------",
>>>
>>>         >Typo: timout
>>>
>>>         Done.
>>>
>>>         >Also I gather we are potentially racing with the
>>>         WatcherThread here, so no guarantee these final print
>>>         statements will be seen?
>>>
>>>         Yes. I decided against synchronization here. The only point
>>>         of this timeout is to write a final message into the error
>>>         log. If this is lost, it is not a big deal. Note that we also
>>>         have the final error message on stderr done by the WatcherThread.
>>>
>>>         >1502       VMError::interrupt_reporting_thread();
>>>
>>>         >Nit: you're inside a VMError function so shouldn't need the VMError prefix.
>>>
>>>         Done.
>>>
>>>         >src/share/vm/utilities/vmError.hpp
>>>
>>>         >Typo: dependend
>>>
>>>         Done (comments changed).
>>>
>>>         > This doesn't quite read right to me. How about:..
>>>
>>>         Thanks for your suggestion, I took your version.
>>>
>>>         >test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>>>
>>>         >I think using @requires and runtime Platform checks is redundant. I prefer
>>>         @requires for both debug build testing and not-windows.
>>>
>>>         Done. Thanks for your suggestion.
>>>
>>>         ---
>>>
>>>         Kind Regards, Thomas
>>>
>>>
>>>         On Thu, Feb 2, 2017 at 12:59 PM, Thomas Stüfe
>>>         <[hidden email] <mailto:[hidden email]>> wrote:
>>>
>>>             Dear all,
>>>
>>>             please review (or re-review) this enhancement. It was
>>>             already reviewed extensively for JDK9 in Oct 16, but
>>>             pushed back to JDK10. Now that JDK10 repo is open, I
>>>             would like to close this issue.
>>>
>>>             In short, this enhancement lets the hotspot cancel
>>>             hanging error reporting steps (when writing the hs-err
>>>             file), in order to allow subsequent error reporting steps
>>>             to run. In our JVM variant, this has been active for a
>>>             long time and has proven to make error reporting more
>>>             stable in the face of deadlocks or file system stalls.
>>>
>>>             Issue:
>>>             https://bugs.openjdk.java.net/browse/JDK-8166944
>>>             <https://bugs.openjdk.java.net/browse/JDK-8166944>
>>>
>>>             Webrev:
>>>             http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/
>>>             <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/>
>>>
>>>             Old Diskussion for JDK9:
>>>             http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021582.html
>>>             <http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2016-October/021582.html>
>>>
>>>             The webrev is unchanged from version 2 discussed in
>>>             october, just got rebased atop of JDK10/hs.
>>>
>>>             Thank you,
>>>
>>>             Kind Regards, Thomas
>>>
>>>
>>
>>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: RFR(s) (JDK10): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Thomas Stüfe-2
Thank you both! As always, I need a sponsor :)

..Thomas

On Thu, Feb 23, 2017 at 8:38 AM, David Holmes <[hidden email]>
wrote:

> Looks good Thomas!
>
> Thanks,
> David
>
> On 23/02/2017 5:32 PM, Thomas Stüfe wrote:
>
>> Hi Chris,
>>
>> thank you for the detailed explanation, I modified TEST.groups in
>> hotspot accordingly. I modified the webrev.04 in place.
>>
>> Kind Regards, Thomas
>>
>> On Wed, Feb 22, 2017 at 7:44 PM, Chris Plummer <[hidden email]
>> <mailto:[hidden email]>> wrote:
>>
>>     On 2/22/17 6:35 AM, Thomas Stüfe wrote:
>>
>>>     Hi Chris, David,
>>>
>>>     thanks for your patience. Let's wrap this up.
>>>
>>>     New
>>>     webrev: http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>>> Error-Reporting/jdk10-webrev.04/webrev/
>>>     <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hangin
>>> g-Error-Reporting/jdk10-webrev.04/webrev/>
>>>
>>>     Delta to last
>>>     webrev: http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>>> Error-Reporting/jdk10-webrev.04-delta/webrev/
>>>     <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hangin
>>> g-Error-Reporting/jdk10-webrev.04-delta/webrev/>
>>>
>>>
>>>     Almost identical to the last version, only the typos Chris found
>>>     were fixed.
>>>
>>     Looks good!
>>
>>>
>>>     @David:
>>>
>>>     >The test is quite a slow one and so we would like to exclude it
>>> from the
>>>     >hotspot_fast_runtime group in TEST.groups (there are other
>>>     ErrorHandling tests >excluded there already).
>>>
>>>     This is fine by me. Do you need me to do something?
>>>
>>     What David said above. Look in hotspot/test/TEST.groups. Find
>>     hotspot_fast_runtime. Add the test to the list of tests to not run
>>     by putting a '-' in front of it. Currently there is:
>>
>>     hotspot_fast_runtime = \
>>       runtime/ \
>>      -runtime/6626217/Test6626217.sh \
>>      -runtime/7100935 \
>>      -runtime/7158988/FieldMonitor.java \
>>      -runtime/CommandLine/OptionsValidation/TestOptionsWithRanges.java \
>>      -runtime/CommandLine/PrintGCApplicationConcurrentTime.java \
>>      -runtime/ConstantPool/IntfMethod.java \
>>      -runtime/ErrorHandling/CreateCoredumpOnCrash.java \
>>      -runtime/ErrorHandling/ErrorHandler.java \
>>      -runtime/logging/MonitorMismatchTest.java \
>>      -runtime/memory/ReserveMemory.java \
>>     ...
>>
>>     So that means all tests in the runtime directory are included in the
>>     group, except the tests that follow with the '-' prefix. You can see
>>     there are already a couple of ErrorHandling tests being excluded, so
>>     just add yours after them.
>>
>>     cheers,
>>
>>     Chris
>>
>>
>>>     Kind Regards, Thomas
>>>
>>>
>>>
>>>     On Tue, Feb 7, 2017 at 2:04 AM, Chris Plummer
>>>     <[hidden email] <mailto:[hidden email]>> wrote:
>>>
>>>         Hi Tomas,
>>>
>>>         vmError_windows.cpp: Getting closer. Should be "cancellation".
>>>
>>>         vmError.cpp: Should be "let's"
>>>
>>>          389   // global timeout, lets execute the timeout step five
>>>         times.
>>>
>>>         And in response to your comment:
>>>
>>>         I'm usually not a big fan of putting internals into a header,
>>>>         so I tend to hide internals in a file-local static function.
>>>>         But if this is the way you prefer it, this is fine too.
>>>>
>>>
>>>         I'm no hotspot coding conventions expert, but I have
>>>         frequently seen statics declared in the class scope (private
>>>         or protected), and don't see much in the way of statics
>>>         declared in .cpp files (although I see plenty of static helper
>>>         functions in .cpp files). It seems more consistent to declare
>>>         them in the class scope, but I could be wrong here. I'm not
>>>         personally attached to one way or another.
>>>
>>>         Other than that it looks good. No need for another review.
>>>
>>>         thanks,
>>>
>>>         Chris
>>>
>>>
>>>         On 2/6/17 3:18 AM, Thomas Stüfe wrote:
>>>
>>>>         Hi Chris, David,
>>>>
>>>>         Thanks for your reviews. I hope I addressed all of your
>>>>         issues. See further details below.
>>>>
>>>>         new webrev:
>>>>
>>>>         http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>>>> Error-Reporting/jdk10-webrev.03/webrev/
>>>>         <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hangin
>>>> g-Error-Reporting/jdk10-webrev.03/webrev/>
>>>>
>>>>         delta to the last one:
>>>>
>>>>         http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>>>> Error-Reporting/jdk10-webrev.03-delta/webrev/
>>>>         <http://cr.openjdk.java.net/%7Estuefe/webrevs/8166944-Hangin
>>>> g-Error-Reporting/jdk10-webrev.03-delta/webrev/>
>>>>
>>>>
>>>>
>>>>         @Chris:
>>>>
>>>>         > Most files need 2017 copyright updates.
>>>>
>>>>         Done.
>>>>
>>>>         >set_to_now() and get_timestamp() should have names that tie
>>>> them together
>>>>         better. Maybe set_step_start_time() and
>>>>         get_step_start_time(). Also, they >should probably be methods
>>>>         of VMError and the statics should all be fields.
>>>>
>>>>         Done. Added an accessor for every timeout value and a setter
>>>>         too (named "record" to indicate that instead of passing a
>>>>         value it uses the current timestamp).
>>>>
>>>>         I'm usually not a big fan of putting internals into a header,
>>>>         so I tend to hide internals in a file-local static function.
>>>>         But if this is the way you prefer it, this is fine too.
>>>>
>>>>         >In the first review I had mentioned that you only are allowing
>>>> for one
>>>>         step to hang, and still be able to continue with subsequent
>>>>         steps:
>>>>
>>>>         >1509     // A step times out after half of the total timeout.
>>>> Steps are
>>>>         mostly fast unless they
>>>>         >1510     // hang for some reason, so this simple rule allows
>>>> for one
>>>>         hanging step and still
>>>>         >1511     // leaves time enough for the rest of the steps to
>>>> finish.
>>>>         >1512     const jlong end = step_start_time_l + ErrorLogTimeout
>>>> *
>>>>         1000 / 2;
>>>>         >
>>>>         >You mentioned that for your SAP version you divide by 4,
>>>> allowing for up
>>>>         to 3 steps to hang. Is there a reason not to do this for the
>>>>         open version? In fact it >looks like
>>>>         TestUnresponsiveErrorHandler is relying on that being the
>>>>         case, since it will hang 3 times.
>>>>
>>>>         I changed this to the value we use at SAP (make the little
>>>>         timeouts a quarter of the final timeout value). I now need to
>>>>         have five timeout-test-steps in the error handler instead of
>>>>         the three earlier. I hid them in a nice macro, so I hope they
>>>>         are not too much of an eye sore.
>>>>
>>>>         >Related to this, TimeoutInErrorHandlingTest.java is >only
>>>> checking for
>>>>         one instance of step timeout pattern. Probably it should
>>>>         check for 3 instances, and >also one instance of the overall
>>>>         timeout pattern:
>>>>         >
>>>>         >1272         st->print_cr("------ Timout during error
>>>> reporting after "
>>>>         INT64_FORMAT "ms. ------",
>>>>
>>>>         I intentionally left this test to be very basic - see the
>>>>         comment in the test. Basically, I am afraid of giving you
>>>>         guys a test which sporadically fails if the timing is bad.
>>>>
>>>>          As David pointed out, the final error message cannot be
>>>>         relied on - it should appear in 99% of all cases but may not
>>>>         because error reporting thread and WatcherThread race. I
>>>>         explicitly avoided adding synchronization there, because it
>>>>         is better to sometimes miss output in the hs-err file than
>>>>         for the VM to hang in error shutdown.
>>>>
>>>>         For the same reason I do not test for all the little timeout
>>>>         messages to appear in the error log. I now just test for two
>>>>         messages of the theoretical three or four max, to see that
>>>>         step timeout works repeatedly.
>>>>
>>>>         @David:
>>>>
>>>>         >src/os/windows/vm/vmError_windows.cpp
>>>>
>>>>         >Typo: canceleation
>>>>
>>>>         Done.
>>>>
>>>>         >As previously discussed Atomic::load/store for jlong do not
>>>> depend on
>>>>         supports_cx8. The commentary in atomic.hpp should really
>>>>         refer to 'atomic read->modify-write operations on jlongs' -
>>>>         as every platform must support atomic 64-bit load/store to
>>>>         support Java volatile long/double. So set_to_now and
>>>>         >get_timestamp can be simplified, or even removed completely.
>>>>
>>>>         Sure! I removed the cx8 dependency, which makes the code
>>>>         simpler, but decided to keep the accessor functions (renamed
>>>>         as get_.._start_time and record_.._start_time)
>>>>
>>>>         >Does the timestamp really need to be the system time in millis
>>>> or do
>>>>         you just want to record relative elapsed time? If the latter
>>>>         then I suggest using >javaTimeNanos()/100000 to avoid
>>>>         potential glitches with the system time changing.
>>>>
>>>>         Did this. This exchanges the possibility of time glitches
>>>>         with possible initialization issues (see non-apple bsd path
>>>>         and linux, e.g.) where the hr timer api is dynamically
>>>>         loaded. But I agree, this is still better. The latter is less
>>>>         probable to hurt, only if we crash at exactly the right time
>>>>         during VM initialization.
>>>>
>>>>         >382   // TestUnresponsiveErrorHandler: three times to trigger
>>>> first a
>>>>         step timeout, then
>>>>         >383   // the global error reporting timeout.
>>>>
>>>>         >I'm a little confused as to why three steps are needed to test
>>>> two things??
>>>>
>>>>         See my earlier explanations. Now we have the 1:4 ratio
>>>>         between little/final timeout values, so we need five steps.
>>>>
>>>>         >1272         st->print_cr("------ Timout during error
>>>> reporting after "
>>>>         INT64_FORMAT "ms. ------",
>>>>
>>>>         >Typo: timout
>>>>
>>>>         Done.
>>>>
>>>>         >Also I gather we are potentially racing with the
>>>>         WatcherThread here, so no guarantee these final print
>>>>         statements will be seen?
>>>>
>>>>         Yes. I decided against synchronization here. The only point
>>>>         of this timeout is to write a final message into the error
>>>>         log. If this is lost, it is not a big deal. Note that we also
>>>>         have the final error message on stderr done by the
>>>> WatcherThread.
>>>>
>>>>         >1502       VMError::interrupt_reporting_thread();
>>>>
>>>>         >Nit: you're inside a VMError function so shouldn't need the
>>>> VMError prefix.
>>>>
>>>>         Done.
>>>>
>>>>         >src/share/vm/utilities/vmError.hpp
>>>>
>>>>         >Typo: dependend
>>>>
>>>>         Done (comments changed).
>>>>
>>>>         > This doesn't quite read right to me. How about:..
>>>>
>>>>         Thanks for your suggestion, I took your version.
>>>>
>>>>         >test/runtime/ErrorHandling/TimeoutInErrorHandlingTest.java
>>>>
>>>>         >I think using @requires and runtime Platform checks is
>>>> redundant. I prefer
>>>>         @requires for both debug build testing and not-windows.
>>>>
>>>>         Done. Thanks for your suggestion.
>>>>
>>>>         ---
>>>>
>>>>         Kind Regards, Thomas
>>>>
>>>>
>>>>         On Thu, Feb 2, 2017 at 12:59 PM, Thomas Stüfe
>>>>         <[hidden email] <mailto:[hidden email]>>
>>>> wrote:
>>>>
>>>>             Dear all,
>>>>
>>>>             please review (or re-review) this enhancement. It was
>>>>             already reviewed extensively for JDK9 in Oct 16, but
>>>>             pushed back to JDK10. Now that JDK10 repo is open, I
>>>>             would like to close this issue.
>>>>
>>>>             In short, this enhancement lets the hotspot cancel
>>>>             hanging error reporting steps (when writing the hs-err
>>>>             file), in order to allow subsequent error reporting steps
>>>>             to run. In our JVM variant, this has been active for a
>>>>             long time and has proven to make error reporting more
>>>>             stable in the face of deadlocks or file system stalls.
>>>>
>>>>             Issue:
>>>>             https://bugs.openjdk.java.net/browse/JDK-8166944
>>>>             <https://bugs.openjdk.java.net/browse/JDK-8166944>
>>>>
>>>>             Webrev:
>>>>             http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>>>> Error-Reporting/jdk10-webrev.02/webrev/
>>>>             <<a href="http://cr.openjdk.java.net/%7">http://cr.openjdk.java.net/%7
>>>> Estuefe/webrevs/8166944-Hanging-Error-Reporting/jdk10-webrev.02/webrev/
>>>> >
>>>>
>>>>             Old Diskussion for JDK9:
>>>>             http://mail.openjdk.java.net/p
>>>> ipermail/hotspot-runtime-dev/2016-October/021582.html
>>>>             <http://mail.openjdk.java.net/
>>>> pipermail/hotspot-runtime-dev/2016-October/021582.html>
>>>>
>>>>             The webrev is unchanged from version 2 discussed in
>>>>             october, just got rebased atop of JDK10/hs.
>>>>
>>>>             Thank you,
>>>>
>>>>             Kind Regards, Thomas
>>>>
>>>>
>>>>
>>>
>>>
>>
>>