RFR: 8262094: Handshake timeout scaled wrong

classic Classic list List threaded Threaded
15 messages Options
Reply | Threaded
Open this post in threaded view
|

RFR: 8262094: Handshake timeout scaled wrong

Robbin Ehn-2
Parameter HandshakeTimeout is in milliseconds.
Internally we use nanoseconds.
HandshakeTimeout must be scaled to nanoseconds.

Passes T1

-------------

Commit messages:
 - Use millis_to_nanos

Changes: https://git.openjdk.java.net/jdk/pull/2666/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=2666&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8262094
  Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2666.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2666/head:pull/2666

PR: https://git.openjdk.java.net/jdk/pull/2666
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262094: Handshake timeout scaled wrong

David Holmes-2
On Sun, 21 Feb 2021 18:46:46 GMT, Robbin Ehn <[hidden email]> wrote:

> Parameter HandshakeTimeout is in milliseconds.
> Internally we use nanoseconds.
> HandshakeTimeout must be scaled to nanoseconds.
>
> Passes T1

Hi Robbin,

This change seems much clearer, but to be clear the actual result is the same as the "counter" tracks nanoseconds.

This counts as a trivial change IMO.

Thanks,
David

-------------

Marked as reviewed by dholmes (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/2666
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262094: Handshake timeout scaled wrong

Robbin Ehn-2
On Mon, 22 Feb 2021 07:46:47 GMT, David Holmes <[hidden email]> wrote:

> Hi Robbin,
>
> This change seems much clearer, but to be clear the actual result is the same as the "counter" tracks nanoseconds.
>
> This counts as a trivial change IMO.
>
> Thanks,
> David

Hi David, unfortunately I don't I agree that this is trivial:

`open/src/hotspot/os/windows/os_windows.cpp:jlong os::elapsed_frequency() {
open/src/hotspot/os/windows/os_windows.cpp-  return performance_frequency;
open/src/hotspot/os/windows/os_windows.cpp-}`

This value comes from QueryPerformanceFrequency, which depends on what resolution the Windows thinks is reasonable given what hardware support there are.

In a virtualized env. this causes a 20.000 ms timeout to timeout after 2 s for me due to low resolution (windows only).
(all other platforms hardcoded to NANASECONDS)

We have two more suspective uses here, which I have not investigated (but we have seem wired timeouts with both ZGC and G1, but I have seen non with parGC):
`open/src/hotspot/share/gc/z/zMark.cpp:      _timeout(_start.value() + TimeHelper::micros_to_counter(timeout_in_micros))`

`open/src/hotspot/share/gc/g1/g1ServiceThread.cpp:  jlong delay = TimeHelper::millis_to_counter(delay_ms);`

Thanks

-------------

PR: https://git.openjdk.java.net/jdk/pull/2666
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262094: Handshake timeout scaled wrong

Stefan Karlsson-3
In reply to this post by Robbin Ehn-2
On Sun, 21 Feb 2021 18:46:46 GMT, Robbin Ehn <[hidden email]> wrote:

> Parameter HandshakeTimeout is in milliseconds.
> Internally we use nanoseconds.
> HandshakeTimeout must be scaled to nanoseconds.
>
> Passes T1

Why are we using os::javaTimeNanos instead os::elapsed_counter()? I think we've tried to limit the usage of to places that explicitly interact with the Java side. For example, the reference processor uses os::javaTimeNanos because it interacts with java.lang.ref.SoftReference.clock. Most other places in the JVM uses os::elapsed_counter().

-------------

Changes requested by stefank (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/2666
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262094: Handshake timeout scaled wrong

Stefan Johansson
In reply to this post by Robbin Ehn-2
On Sun, 21 Feb 2021 18:46:46 GMT, Robbin Ehn <[hidden email]> wrote:

> Parameter HandshakeTimeout is in milliseconds.
> Internally we use nanoseconds.
> HandshakeTimeout must be scaled to nanoseconds.
>
> Passes T1

`G1ServiceThread` is using `os::elapsed_counter()` but if the `frequency` used in `TimeHelper::millis_to_counter(...)` is wrong I guess we could be in trouble. @robehn, is your theory that the calculated delay in some cases could be extremely long and because of that cause the timeouts we see?

-------------

PR: https://git.openjdk.java.net/jdk/pull/2666
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262094: Handshake timeout scaled wrong

Robbin Ehn-2
In reply to this post by Stefan Karlsson-3
On Mon, 22 Feb 2021 09:01:16 GMT, Stefan Karlsson <[hidden email]> wrote:

> Why are we using os::javaTimeNanos instead os::elapsed_counter()? I think we've tried to limit the usage of to places that explicitly interact with the Java side. For example, the reference processor uses os::javaTimeNanos because it interacts with java.lang.ref.SoftReference.clock. Most other places in the JVM uses os::elapsed_counter().

Both uses a 'java type', jlong.
For logging and logic for yielding we want a humanly understand unit that is absolute.
Since things goes pretty fast nowadays nanseconds seems reasonable.
Instead of to converting the elapsed_counter to nanos, it's much simpler to just get the nanos.
But yes it would preferably with a os::timeNanos().

-------------

PR: https://git.openjdk.java.net/jdk/pull/2666
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262094: Handshake timeout scaled wrong

Robbin Ehn-2
On Mon, 22 Feb 2021 11:02:04 GMT, Robbin Ehn <[hidden email]> wrote:

>> Why are we using os::javaTimeNanos instead os::elapsed_counter()? I think we've tried to limit the usage of to places that explicitly interact with the Java side. For example, the reference processor uses os::javaTimeNanos because it interacts with java.lang.ref.SoftReference.clock. Most other places in the JVM uses os::elapsed_counter().
>
>> Why are we using os::javaTimeNanos instead os::elapsed_counter()? I think we've tried to limit the usage of to places that explicitly interact with the Java side. For example, the reference processor uses os::javaTimeNanos because it interacts with java.lang.ref.SoftReference.clock. Most other places in the JVM uses os::elapsed_counter().
>
> Both uses a 'java type', jlong.
> For logging and logic for yielding we want a humanly understand unit that is absolute.
> Since things goes pretty fast nowadays nanseconds seems reasonable.
> Instead of to converting the elapsed_counter to nanos, it's much simpler to just get the nanos.
> But yes it would preferably with a os::time_nanos().

> `G1ServiceThread` is using `os::elapsed_counter()` but if the `frequency` used in `TimeHelper::millis_to_counter(...)` is wrong I guess we could be in trouble. @robehn, is your theory that the calculated delay in some cases could be extremely long and because of that cause the timeouts we see?

The logging during handshake timeout doesn't work on windows for some reason, when I locally fixed the logging I notice that timeout was actually 2 seconds instead of 20 seconds. But that only seems to apply to some windows versions/(virtual hardware).

During some tests I have seem results such as:
"Error: time between events 0 ns"
When looking at the code the events are serialized with a mutex, so some time must have passed but the granularity of the system is to coarse. I don't know what else this can cause.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2666
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262094: Handshake timeout scaled wrong

David Holmes
In reply to this post by Robbin Ehn-2
Hi Robbin,

On 22/02/2021 6:19 pm, Robbin Ehn wrote:

> On Mon, 22 Feb 2021 07:46:47 GMT, David Holmes <[hidden email]> wrote:
>
>> Hi Robbin,
>>
>> This change seems much clearer, but to be clear the actual result is the same as the "counter" tracks nanoseconds.
>>
>> This counts as a trivial change IMO.
>>
>> Thanks,
>> David
>
> Hi David, unfortunately I don't I agree that this is trivial:
>
> `open/src/hotspot/os/windows/os_windows.cpp:jlong os::elapsed_frequency() {
> open/src/hotspot/os/windows/os_windows.cpp-  return performance_frequency;
> open/src/hotspot/os/windows/os_windows.cpp-}`
>
> This value comes from QueryPerformanceFrequency, which depends on what resolution the Windows thinks is reasonable given what hardware support there are.

Yep mea culpa I completely overlooked Windows. So on Windows this change
actually alters the timeout. So a simple fix but not trivial.

> In a virtualized env. this causes a 20.000 ms timeout to timeout after 2 s for me due to low resolution (windows only).
> (all other platforms hardcoded to NANASECONDS)
>
> We have two more suspective uses here, which I have not investigated (but we have seem wired timeouts with both ZGC and G1, but I have seen non with parGC):
> `open/src/hotspot/share/gc/z/zMark.cpp:      _timeout(_start.value() + TimeHelper::micros_to_counter(timeout_in_micros))`
>
> `open/src/hotspot/share/gc/g1/g1ServiceThread.cpp:  jlong delay = TimeHelper::millis_to_counter(delay_ms);`

These should only be heuristics in any case, but I don't understand why
we would want to express a delay/timeout this way.

Cheers,
David
-----

> Thanks
>
> -------------
>
> PR: https://git.openjdk.java.net/jdk/pull/2666
>
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262094: Handshake timeout scaled wrong

David Holmes
In reply to this post by Stefan Karlsson-3
On 22/02/2021 7:03 pm, Stefan Karlsson wrote:
> On Sun, 21 Feb 2021 18:46:46 GMT, Robbin Ehn <[hidden email]> wrote:
>
>> Parameter HandshakeTimeout is in milliseconds.
>> Internally we use nanoseconds.
>> HandshakeTimeout must be scaled to nanoseconds.
>>
>> Passes T1
>
> Why are we using os::javaTimeNanos instead os::elapsed_counter()? I think we've tried to limit the usage of to places that explicitly interact with the Java side. For example, the reference processor uses os::javaTimeNanos because it interacts with java.lang.ref.SoftReference.clock. Most other places in the JVM uses os::elapsed_counter().

Runtime code (threading and sync) uses actual time units (ie
javaTimeNanos) rather than os::elapsed_counter(). And elapsed_counter()
uses javaTimeNanos() on all platforms but Windows. It's not really about
interaction with Java, just about semantics. A timeout/delay in "ticks"
would scale with CPU speed if "ticks" presented that, but
os::elapsed_counter() doesn't - so a timeout/delay of N "counts" is
somewhat arbitrary in duration. Conversely, of course, any timeout/delay
expressed in ms or ns is somewhat arbitrary in terms of how much code
may have executed in that time. But these timeouts are heuristics anyway.

Cheers,
David


> -------------
>
> Changes requested by stefank (Reviewer).
>
> PR: https://git.openjdk.java.net/jdk/pull/2666
>
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262094: Handshake timeout scaled wrong

Stefan Karlsson-3
In reply to this post by Robbin Ehn-2
On Mon, 22 Feb 2021 11:07:42 GMT, Robbin Ehn <[hidden email]> wrote:

>>> Why are we using os::javaTimeNanos instead os::elapsed_counter()? I think we've tried to limit the usage of to places that explicitly interact with the Java side. For example, the reference processor uses os::javaTimeNanos because it interacts with java.lang.ref.SoftReference.clock. Most other places in the JVM uses os::elapsed_counter().
>>
>> Both uses a 'java type', jlong.
>> For logging and logic for yielding we want a humanly understand unit that is absolute.
>> Since things goes pretty fast nowadays nanseconds seems reasonable.
>> Instead of to converting the elapsed_counter to nanos, it's much simpler to just get the nanos.
>> But yes it would preferably with a os::time_nanos().
>
>> `G1ServiceThread` is using `os::elapsed_counter()` but if the `frequency` used in `TimeHelper::millis_to_counter(...)` is wrong I guess we could be in trouble. @robehn, is your theory that the calculated delay in some cases could be extremely long and because of that cause the timeouts we see?
>
> The logging during handshake timeout doesn't work on windows for some reason, when I locally fixed the logging I notice that timeout was actually 2 seconds instead of 20 seconds. But that only seems to apply to some windows versions/(virtual hardware).
>
> During some tests I have seem results such as:
> "Error: time between events 0 ns"
> When looking at the code the events are serialized with a mutex, so some time must have passed but the granularity of the system is to coarse. I don't know what else this can cause.

> _Mailing list message from [David Holmes](mailto:[hidden email]) on [hotspot-runtime-dev](mailto:[hidden email]):_
>
> On 22/02/2021 7:03 pm, Stefan Karlsson wrote:
>
> > On Sun, 21 Feb 2021 18:46:46 GMT, Robbin Ehn <rehn at openjdk.org> wrote:
> > > Parameter HandshakeTimeout is in milliseconds.
> > > Internally we use nanoseconds.
> > > HandshakeTimeout must be scaled to nanoseconds.
> > > Passes T1
> >
> >
> > Why are we using os::javaTimeNanos instead os::elapsed_counter()? I think we've tried to limit the usage of to places that explicitly interact with the Java side. For example, the reference processor uses os::javaTimeNanos because it interacts with java.lang.ref.SoftReference.clock. Most other places in the JVM uses os::elapsed_counter().
>
> Runtime code (threading and sync) uses actual time units (ie
> javaTimeNanos) rather than os::elapsed_counter(). And elapsed_counter()
> uses javaTimeNanos() on all platforms but Windows.

I see. In the GC code we've tried to unify towards using one time source (elapsed_counter/elapsedTime/Ticks all use counters), and then convert to nanos/micros/millis when needed. It's a bit unfortunate that we don't use the same time source throughout the JVM.

 It's not really about
> interaction with Java, just about semantics.

From the GC's point-of-view it is. We use os::javaTimeNanos() just were we are forced to use it because interactions with Java. Other use-cases uses the counter based source.

 A timeout/delay in "ticks"
> would scale with CPU speed if "ticks" presented that, but
> os::elapsed_counter() doesn't - so a timeout/delay of N "counts" is
> somewhat arbitrary in duration.

For those cases you wouldn't pass down counter, but use TimeHelper::counter_to_millis (or equivalent).

But OK, given that Runtime already uses os::javaTimeNanos() I'll back off.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2666
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262094: Handshake timeout scaled wrong

Robbin Ehn-2
On Mon, 22 Feb 2021 12:42:40 GMT, Stefan Karlsson <[hidden email]> wrote:

>>> `G1ServiceThread` is using `os::elapsed_counter()` but if the `frequency` used in `TimeHelper::millis_to_counter(...)` is wrong I guess we could be in trouble. @robehn, is your theory that the calculated delay in some cases could be extremely long and because of that cause the timeouts we see?
>>
>> The logging during handshake timeout doesn't work on windows for some reason, when I locally fixed the logging I notice that timeout was actually 2 seconds instead of 20 seconds. But that only seems to apply to some windows versions/(virtual hardware).
>>
>> During some tests I have seem results such as:
>> "Error: time between events 0 ns"
>> When looking at the code the events are serialized with a mutex, so some time must have passed but the granularity of the system is to coarse. I don't know what else this can cause.
>
>> _Mailing list message from [David Holmes](mailto:[hidden email]) on [hotspot-runtime-dev](mailto:[hidden email]):_
>>
>> On 22/02/2021 7:03 pm, Stefan Karlsson wrote:
>>
>> > On Sun, 21 Feb 2021 18:46:46 GMT, Robbin Ehn <rehn at openjdk.org> wrote:
>> > > Parameter HandshakeTimeout is in milliseconds.
>> > > Internally we use nanoseconds.
>> > > HandshakeTimeout must be scaled to nanoseconds.
>> > > Passes T1
>> >
>> >
>> > Why are we using os::javaTimeNanos instead os::elapsed_counter()? I think we've tried to limit the usage of to places that explicitly interact with the Java side. For example, the reference processor uses os::javaTimeNanos because it interacts with java.lang.ref.SoftReference.clock. Most other places in the JVM uses os::elapsed_counter().
>>
>> Runtime code (threading and sync) uses actual time units (ie
>> javaTimeNanos) rather than os::elapsed_counter(). And elapsed_counter()
>> uses javaTimeNanos() on all platforms but Windows.
>
> I see. In the GC code we've tried to unify towards using one time source (elapsed_counter/elapsedTime/Ticks all use counters), and then convert to nanos/micros/millis when needed. It's a bit unfortunate that we don't use the same time source throughout the JVM.
>
>  It's not really about
>> interaction with Java, just about semantics.
>
> From the GC's point-of-view it is. We use os::javaTimeNanos() just were we are forced to use it because interactions with Java. Other use-cases uses the counter based source.
>
>  A timeout/delay in "ticks"
>> would scale with CPU speed if "ticks" presented that, but
>> os::elapsed_counter() doesn't - so a timeout/delay of N "counts" is
>> somewhat arbitrary in duration.
>
> For those cases you wouldn't pass down counter, but use TimeHelper::counter_to_millis (or equivalent).
>
> But OK, given that Runtime already uses os::javaTimeNanos() I'll back off.

I agree with that we should use the same time source.
But here I just want to fixed this bug so the time-out is working as intended on windows.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2666
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262094: Handshake timeout scaled wrong

Robbin Ehn
In reply to this post by David Holmes
Hi David,

>
> > In a virtualized env. this causes a 20.000 ms timeout to timeout
> > after 2 s for me due to low resolution (windows only).
> > (all other platforms hardcoded to NANASECONDS)
> >
> > We have two more suspective uses here, which I have not
> > investigated (but we have seem wired timeouts with both ZGC and G1,
> > but I have seen non with parGC):
> > `open/src/hotspot/share/gc/z/zMark.cpp:     
> > _timeout(_start.value() +
> > TimeHelper::micros_to_counter(timeout_in_micros))`
> >
> > `open/src/hotspot/share/gc/g1/g1ServiceThread.cpp:  jlong delay =
> > TimeHelper::millis_to_counter(delay_ms);`
>

FYI: I looked at these and they are fine.


> These should only be heuristics in any case, but I don't understand
> why
> we would want to express a delay/timeout this way.

I don't either.

Thanks, Robbin

>
> Cheers,
> David
> -----
>
> > Thanks
> >
> > -------------
> >
> > PR: https://git.openjdk.java.net/jdk/pull/2666
> >

Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262094: Handshake timeout scaled wrong

Stefan Karlsson-3
In reply to this post by Robbin Ehn-2
On Sun, 21 Feb 2021 18:46:46 GMT, Robbin Ehn <[hidden email]> wrote:

> Parameter HandshakeTimeout is in milliseconds.
> Internally we use nanoseconds.
> HandshakeTimeout must be scaled to nanoseconds.
>
> Passes T1

Marked as reviewed by stefank (Reviewer).

-------------

PR: https://git.openjdk.java.net/jdk/pull/2666
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262094: Handshake timeout scaled wrong

Robbin Ehn-2
On Tue, 23 Feb 2021 07:37:54 GMT, Stefan Karlsson <[hidden email]> wrote:

>> Parameter HandshakeTimeout is in milliseconds.
>> Internally we use nanoseconds.
>> HandshakeTimeout must be scaled to nanoseconds.
>>
>> Passes T1
>
> Marked as reviewed by stefank (Reviewer).

Thanks David and Stefan!

-------------

PR: https://git.openjdk.java.net/jdk/pull/2666
Reply | Threaded
Open this post in threaded view
|

Integrated: 8262094: Handshake timeout scaled wrong

Robbin Ehn-2
In reply to this post by Robbin Ehn-2
On Sun, 21 Feb 2021 18:46:46 GMT, Robbin Ehn <[hidden email]> wrote:

> Parameter HandshakeTimeout is in milliseconds.
> Internally we use nanoseconds.
> HandshakeTimeout must be scaled to nanoseconds.
>
> Passes T1

This pull request has now been integrated.

Changeset: 9d9bedd0
Author:    Robbin Ehn <[hidden email]>
URL:       https://git.openjdk.java.net/jdk/commit/9d9bedd0
Stats:     2 lines in 1 file changed: 1 ins; 0 del; 1 mod

8262094: Handshake timeout scaled wrong

Reviewed-by: dholmes, stefank

-------------

PR: https://git.openjdk.java.net/jdk/pull/2666