Discussion: 8172978: Remove Interpreter TOS optimization

Discussion: 8172978: Remove Interpreter TOS optimization

Max Ockner
Hello all,

We have filed a bug to remove the interpreter stack caching optimization
for jdk10.  Ideally we can make this change *early* during the jdk10
development cycle. See below for justification:

Bug: https://bugs.openjdk.java.net/browse/JDK-8172978

Stack caching has been around for a long time and is intended to replace
some of the load/store (pop/push) operations with corresponding register
operations. The need for this optimization arose before hardware caching
could adequately lessen the burden of memory access. We have reevaluated the
JVM stack caching optimization and have found that it has a high memory
footprint and is very costly to maintain, but does not provide
significant measurable or theoretical benefit for us when used with
modern hardware.

Minimal Theoretical Benefit.
Because modern hardware does not slap us with the same cost for
accessing memory as it once did, the benefit of replacing memory access
with register access is far less dramatic now than it once was.
Additionally, the interpreter runs for a relatively short time before
relevant code sections are compiled. When the VM starts running compiled
code instead of interpreted code, performance should begin to move
asymptotically towards that of compiled code, diluting any performance
penalties from the interpreter to small performance variations.

No Measurable Benefit.
Please see the results files attached to the bug report.  This change was
adapted for x86 and sparc, and interpreter performance was measured with
SPECjvm98 (run with -Xint).  No significant decrease in performance was
observed.

Memory footprint and code complexity.
Stack caching in the JVM is implemented by switching the instruction
look-up table depending on the tos (top-of-stack) state. At any moment
there is an active table consisting of one dispatch table for each
of the 10 tos states.  When we enter a safepoint, we copy all 10
safepoint dispatch tables into the active table.  The additional entry
code makes this copy less efficient and makes any work in the
interpreter harder to debug.

If we remove this optimization, we will:
   - decrease memory usage in the interpreter,
   - eliminate wasteful memory transactions during safepoints,
   - decrease code complexity (a lot).

Please let me know what you think.
Thanks,
Max


Re: Discussion: 8172978: Remove Interpreter TOS optimization

Christian Thalinger-4
Yes, that’s a good idea.  And with AOT it should be even less of a problem.

> On Feb 15, 2017, at 12:18 PM, Max Ockner <[hidden email]> wrote:


Re: Discussion: 8172978: Remove Interpreter TOS optimization

Daniel D. Daugherty
In reply to this post by Max Ockner
Hi Max,

Added a note to your bug. Interesting idea, but I think your data is
a bit incomplete at the moment.

Dan


On 2/15/17 3:18 PM, Max Ockner wrote:



Re: Discussion: 8172978: Remove Interpreter TOS optimization

Claes Redestad
Hi,

I've seen that Max has run plenty of tests on our internal performance
infrastructure, and everything I've seen there corroborates the idea
that this removal is OK from a performance point of view: the
footprint improvements are small but significant, and any negative
performance impact on throughput benchmarks is at noise levels even
with -Xint (it appears many benchmarks time out with this setting
both before and after, though; Max, let's discuss offline how to
deal with that :-))

I expect this will be tested more thoroughly once adapted to all
platforms (which I assume is the intent?), but see no concern from
a performance testing point of view: Do it!

Thanks!

/Claes

On 2017-02-16 16:40, Daniel D. Daugherty wrote:


Re: Discussion: 8172978: Remove Interpreter TOS optimization

Daniel D. Daugherty
If Claes is happy with the perf testing, then I'm happy. :-)

Dan


On 2/18/17 3:46 AM, Claes Redestad wrote:



Re: Discussion: 8172978: Remove Interpreter TOS optimization

coleen.phillimore
When Max gets back from the long weekend, he'll post the platforms in
your bug.

It's amazing that for -Xint there's no significant difference. I've
seen a 15% slowdown in -Xint performance cause a 2% slowdown with the
server compiler, but that was before tiered compilation.

The reason for this query was to see what developers for the other
platform ports think, since this change would affect all of the platforms.

Thanks,
Coleen

On 2/18/17 10:50 AM, Daniel D. Daugherty wrote:



Re: Discussion: 8172978: Remove Interpreter TOS optimization

Ioi Lam
I think it's worthwhile to hand-craft a micro benchmark with lots of stack operations and see if there's any performance difference in -Xint mode. Also, run it on a wide range of architectures, such as 10-year-old x86 vs. the latest x86, ARM, etc.

That will give us more insights than the results from large complicated benchmarks.
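A sketch of such a microbenchmark (hypothetical, and not a rigorous harness like JMH; just a stack-heavy loop to time under -Xint before and after the change) could look like:

```java
// Hypothetical stack-heavy microbenchmark for -Xint comparisons.
// Each loop iteration forces many operand-stack push/pop pairs,
// which is exactly where TOS caching would matter most.
public class TosBench {
    static long kernel(long n) {
        long acc = 0;
        for (long i = 0; i < n; i++) {
            // deliberately stack-shuffling arithmetic
            acc += ((i * 3 + 1) ^ (i >> 2)) - (i % 7);
        }
        return acc;
    }

    public static void main(String[] args) {
        long n = args.length > 0 ? Long.parseLong(args[0]) : 100_000_000L;
        long t0 = System.nanoTime();
        long result = kernel(n);
        long t1 = System.nanoTime();
        System.out.println("result=" + result
                + " time_ms=" + (t1 - t0) / 1_000_000);
    }
}
```

Running `java -Xint TosBench` on each architecture, with and without the patch, would isolate the interpreter's dispatch cost from JIT effects.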

Ioi

> On Feb 18, 2017, at 8:14 AM, [hidden email] wrote:


Re: Discussion: 8172978: Remove Interpreter TOS optimization

coleen.phillimore
In reply to this post by coleen.phillimore


On 2/18/17 11:14 AM, [hidden email] wrote:
> When Max gets back from the long weekend, he'll post the platforms in
> your bug.
>
> It's amazing that for -Xint there's no significant difference. I've
> seen -Xint performance of 15% slower cause a 2% slowdown with server
> but that was before tiered compilation.

I should clarify this.  I've seen this slowdown for *different*
interpreter optimizations, which *can* affect server performance.  I was
measuring specjvm98 on linux x64.   If there's no significant difference
for this TOS optimization, there is no chance of a degradation in
overall performance.

Coleen



Re: Discussion: 8172978: Remove Interpreter TOS optimization

Volker Simonis
Hi,

besides the fact that this of course means some work for us :) I
currently don't see any problems for our porting platforms (ppc64 and
s390x).

Are there any webrevs available, so we can see how big the changes are
and maybe do some benchmarking of our own?

Thanks,
Volker


On Sun, Feb 19, 2017 at 11:11 PM,  <[hidden email]> wrote:

>
>
> On 2/18/17 11:14 AM, [hidden email] wrote:
>>
>> When Max gets back from the long weekend, he'll post the platforms in your
>> bug.
>>
>> It's amazing that for -Xint there's no significant difference. I've seen
>> -Xint performance of 15% slower cause a 2% slowdown with server but that was
>> before tiered compilation.
>
>
> I should clarify this.  I've seen this slowdown for *different* interpreter
> optimizations, which *can* affect server performance.  I was measuring
> specjvm98 on linux x64.   If there's no significant difference for this TOS
> optimization, there is no chance of a degredation in overall performance.
>
> Coleen
Re: Discussion: 8172978: Remove Interpreter TOS optimization

Max Ockner
Hi Volker,
I have attached the patch that I have been testing.
Thanks,
Max

On 2/20/2017 5:45 AM, Volker Simonis wrote:

> Hi,
>
> besides the fact that this of course means some work for us :) I
> currently don't see any problems for our porting platforms (ppc64 and
> s390x).
>
> Are there any webrevs available, so we can see how big they are and
> maybe do some own benchmarking?
>
> Thanks,
> Volker
>


remove_tos.patch (4K) Download Attachment

RE: Discussion: 8172978: Remove Interpreter TOS optimization

Doerr, Martin
Hi Max,

thank you very much for sharing your results and for sending the patch.

I guess it covers the most relevant cases, but not all of them. I think it'd be better to modify dispatch_next instead of dispatch_epilog on x86.
(dispatch_next is also used by generate_return_entry_for and generate_deopt_entry_for.)

On s390, I'm using dispatch_next with:
  if (!EnableTosCache) {
    push(state);
    state = vtos;
  }
  dispatch_base(state, Interpreter::dispatch_table(state));

I also added an assertion to dispatch_base in order to make sure I'm hitting all dispatch usages:
assert(EnableTosCache || state == vtos, "sanity");

Unfortunately, the performance results of SPECjvm98 with -Xint seem to drop significantly with -XX:-EnableTosCache on both PPC64 and s390.
But we need to perform more measurements to get more reliable results.

Best regards,
Martin


-----Original Message-----
From: hotspot-dev [mailto:[hidden email]] On Behalf Of Max Ockner
Sent: Donnerstag, 23. Februar 2017 22:21
To: [hidden email]
Subject: Re: Discussion: 8172978: Remove Interpreter TOS optimization



RE: Discussion: 8172978: Remove Interpreter TOS optimization

Doerr, Martin
In reply to this post by Max Ockner
Hi,

I've run jvm98 with -Xint on several PPC64 machines and on a recent s390x machine.

Surprisingly, disabling the TOS optimization does not hurt on older hardware (Power 5 and 6).
Some sub-benchmarks don't seem to suffer at all, or even benefit.
But it really hurts on recent Power 8.

Measured performance change on AIX 7.1 on Power 8, -XX:+EnableTosCache vs. -XX:-EnableTosCache:
  Compress:  38989 vs. 47686  (-22%)
  Jess:      11256 vs. 11849  (-5%)
  Raytrace:  17647 vs. 18300  (-4%)
  Db:        22713 vs. 25181  (-11%)
  Javac:     14554 vs. 15130  (-4%)
These sub benchmarks are relatively stable. It's possible to reproduce these results.

We lose about 3% on recent s390x hardware (z13).

I initially liked the idea of removing the optimization, but these numbers speak against doing it.

Best regards,
Martin


-----Original Message-----
From: Doerr, Martin
Sent: Freitag, 24. Februar 2017 17:41
To: 'Max Ockner' <[hidden email]>; [hidden email]
Subject: RE: Discussion: 8172978: Remove Interpreter TOS optimization



Re: Discussion: 8172978: Remove Interpreter TOS optimization

Andrew Haley
On 28/02/17 15:00, Doerr, Martin wrote:
> I first liked the idea to remove the optimization, but these numbers speak against doing it.

Looks like it.  I haven't investigated making such a change on
AArch64, but I could if more data were needed.

Andrew.


Re: Discussion: 8172978: Remove Interpreter TOS optimization

Aleksey Shipilev-4
In reply to this post by Doerr, Martin
On 02/28/2017 04:00 PM, Doerr, Martin wrote:
> I've ran jvm98 with -Xint on several PPC64 machines and on a recent s390x
> machine.
>
> Surprisingly, disabling of the Tos optimization does not hurt on older
> hardware (Power 5 and 6). Seems like some sub benchmarks don't suffer at all
> or even benefit. But it really hurts on recent Power 8.

I don't think it makes sense to run performance tests with -Xint alone. Of
course removing an interpreter optimization affects interpreter performance.
The real question one should ask is whether turning off an interpreter
optimization affects peak performance and time-to-performance when the
compilers are enabled.

That's because in 2017 we should not expect that users who need performance
would run with interpreter only. And removing complexity from interpreter
without sacrificing the performance in tiered/compiled mode is certainly a plus
in my book.

Thanks,
-Aleksey


RE: Discussion: 8172978: Remove Interpreter TOS optimization

Doerr, Martin
Hi Aleksey,

of course, the peak performance is not really affected by this change.
Ideally, one should measure startup performance. However, these measurements are highly unstable, so one would need to spend more effort and perform many runs.
That's why we use interpreter-only tests: they are very stable and give us a hint about what's happening.

The assumption behind this proposal was that interpreter performance would not suffer much on modern hardware.
The -Xint benchmark is able to show that this is not true.

Best regards,
Martin


-----Original Message-----
From: Aleksey Shipilev [mailto:[hidden email]]
Sent: Dienstag, 28. Februar 2017 16:14
To: Doerr, Martin <[hidden email]>; Max Ockner <[hidden email]>; [hidden email]
Subject: Re: Discussion: 8172978: Remove Interpreter TOS optimization



Re: Discussion: 8172978: Remove Interpreter TOS optimization

Aleksey Shipilev-4
On 02/28/2017 04:23 PM, Doerr, Martin wrote:
> The assumption behind this proposal was that interpreter performance would
> not suffer much on modern hardware. The -Xint benchmark is able to show that
> this is not true.

Ah, I see. I can't fathom why this is a success metric then.

-Aleksey


RE: Discussion: 8172978: Remove Interpreter TOS optimization

Doerr, Martin
It would be a success metric the other way round:
If we didn't lose interpreter performance, we could have been certain not to lose startup performance.

Now we don't know how large the impact on startup performance is, but it may be affected.


-----Original Message-----
From: Aleksey Shipilev [mailto:[hidden email]]
Sent: Dienstag, 28. Februar 2017 16:25
To: Doerr, Martin <[hidden email]>; Max Ockner <[hidden email]>; [hidden email]
Subject: Re: Discussion: 8172978: Remove Interpreter TOS optimization



RE: Discussion: 8172978: Remove Interpreter TOS optimization

Doerr, Martin
In reply to this post by Andrew Haley
Hi Andrew,

I think it would be interesting to see how it impacts more platforms.

Thanks and best regards,
Martin


-----Original Message-----
From: Andrew Haley [mailto:[hidden email]]
Sent: Dienstag, 28. Februar 2017 16:05
To: Doerr, Martin <[hidden email]>; Max Ockner <[hidden email]>; [hidden email]
Subject: Re: Discussion: 8172978: Remove Interpreter TOS optimization

