RFR: 8264054: Bad XMM performance on java.lang.MathBench.sqrtDouble

classic Classic list List threaded Threaded
18 messages Options
Reply | Threaded
Open this post in threaded view
|

RFR: 8264054: Bad XMM performance on java.lang.MathBench.sqrtDouble

Sandhya Viswanathan
For the j.l.Math JMH at https://github.com/openjdk/jmh-jdk-microbenchmarks/blob/master/micros-jdk11/src/main/java/org/openjdk/bench/java/lang/MathBench.java, the performance for sqrt benchmark could be improved.  Thanks a lot to Eric Caspole for finding this issue.

Benchmark:
    @Benchmark
    public double  sqrtDouble() {
        return  Math.sqrt(double4Dot1);
    }

Current code generated (linux format) by c2 JIT is:
     vsqrtsd 0x50(%r10),%xmm0,%xmm0

The vsqrtsd instruction operation is specified as below:
     VSQRTSD (VEX.128 encoded version)
     DEST[63:0] := SQRT(SRC2[63:0])
     DEST[127:64] := SRC1[127:64]
     DEST[MAXVL-1:128] := 0

The upper 127:64 bits are set from previous contents of xmm0. As the destination xmm0 register was not initialized prior to use by c2 JIT, this causes stall and lower performance.

By adding xmm0 initialization prior to use, the performance of the above benchmark improves significantly.

Code generated after patch:
     vxorpd %xmm0,%xmm0,%xmm0
     vsqrtsd 0x50(%r10),%xmm0,%xmm0

Performance before patch:
Benchmark                    Mode  Cnt       Score      Error   Units
MathBench.sqrtDouble        thrpt    8  193612.396 ±  95.807  ops/ms

Performance after patch:
MathBench.sqrtDouble        thrpt    8  276388.024 ± 846.372  ops/ms

Best Regards,
Sandhya

-------------

Commit messages:
 - No xor needed when src and dest are same
 - 8264054: Bad XMM performance on java.lang.MathBench.sqrtDouble

Changes: https://git.openjdk.java.net/jdk/pull/3256/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=3256&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8264054
  Stats: 16 lines in 1 file changed: 11 ins; 2 del; 3 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3256.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3256/head:pull/3256

PR: https://git.openjdk.java.net/jdk/pull/3256
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264054: Bad XMM performance on java.lang.MathBench.sqrtDouble

Nils Eliasson-2
On Tue, 30 Mar 2021 00:38:22 GMT, Sandhya Viswanathan <[hidden email]> wrote:

> For the j.l.Math JMH at https://github.com/openjdk/jmh-jdk-microbenchmarks/blob/master/micros-jdk11/src/main/java/org/openjdk/bench/java/lang/MathBench.java, the performance for sqrt benchmark could be improved.  Thanks a lot to Eric Caspole for finding this issue.
>
> Benchmark:
>     @Benchmark
>     public double  sqrtDouble() {
>         return  Math.sqrt(double4Dot1);
>     }
>
> Current code generated (linux format) by c2 JIT is:
>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>
> The vsqrtsd instruction operation is specified as below:
>      VSQRTSD (VEX.128 encoded version)
>      DEST[63:0] := SQRT(SRC2[63:0])
>      DEST[127:64] := SRC1[127:64]
>      DEST[MAXVL-1:128] := 0
>
> The upper 127:64 bits are set from previous contents of xmm0. As the destination xmm0 register was not initialized prior to use by c2 JIT, this causes stall and lower performance.
>
> By adding xmm0 initialization prior to use, the performance of the above benchmark improves significantly.
>
> Code generated after patch:
>      vxorpd %xmm0,%xmm0,%xmm0
>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>
> Performance before patch:
> Benchmark                    Mode  Cnt       Score      Error   Units
> MathBench.sqrtDouble        thrpt    8  193612.396 ±  95.807  ops/ms
>
> Performance after patch:
> MathBench.sqrtDouble        thrpt    8  276388.024 ± 846.372  ops/ms
>
> Best Regards,
> Sandhya

In the bug report you are mentioning new test micro benchmark test cases. Have you already contributed those, or would you like to add those to this PR?

-------------

PR: https://git.openjdk.java.net/jdk/pull/3256
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264054: Bad XMM performance on java.lang.MathBench.sqrtDouble

Eric Caspole-2
On Tue, 30 Mar 2021 12:27:27 GMT, Nils Eliasson <[hidden email]> wrote:

>> For the j.l.Math JMH at https://github.com/openjdk/jmh-jdk-microbenchmarks/blob/master/micros-jdk11/src/main/java/org/openjdk/bench/java/lang/MathBench.java, the performance for sqrt benchmark could be improved.  Thanks a lot to Eric Caspole for finding this issue.
>>
>> Benchmark:
>>     @Benchmark
>>     public double  sqrtDouble() {
>>         return  Math.sqrt(double4Dot1);
>>     }
>>
>> Current code generated (linux format) by c2 JIT is:
>>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>>
>> The vsqrtsd instruction operation is specified as below:
>>      VSQRTSD (VEX.128 encoded version)
>>      DEST[63:0] := SQRT(SRC2[63:0])
>>      DEST[127:64] := SRC1[127:64]
>>      DEST[MAXVL-1:128] := 0
>>
>> The upper 127:64 bits are set from previous contents of xmm0. As the destination xmm0 register was not initialized prior to use by c2 JIT, this causes stall and lower performance.
>>
>> By adding xmm0 initialization prior to use, the performance of the above benchmark improves significantly.
>>
>> Code generated after patch:
>>      vxorpd %xmm0,%xmm0,%xmm0
>>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>>
>> Performance before patch:
>> Benchmark                    Mode  Cnt       Score      Error   Units
>> MathBench.sqrtDouble        thrpt    8  193612.396 ±  95.807  ops/ms
>>
>> Performance after patch:
>> MathBench.sqrtDouble        thrpt    8  276388.024 ± 846.372  ops/ms
>>
>> Best Regards,
>> Sandhya
>
> In the bug report you are mentioning new test micro benchmark test cases. Have you already contributed those, or would you like to add those to this PR?

I added these micros the the  https://github.com/openjdk/jmh-jdk-microbenchmarks a couple weeks ago, before we knew about this problem, because it is easy to work on JMH with Maven.  But it would be best to contribute these micros into the JDK repo at the same time as this patch. I will get it together for the JDK repo and connect with Sandhya.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3256
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264054: Bad XMM performance on java.lang.MathBench.sqrtDouble

Vladimir Kozlov-2
In reply to this post by Sandhya Viswanathan
On Tue, 30 Mar 2021 00:38:22 GMT, Sandhya Viswanathan <[hidden email]> wrote:

> For the j.l.Math JMH at https://github.com/openjdk/jmh-jdk-microbenchmarks/blob/master/micros-jdk11/src/main/java/org/openjdk/bench/java/lang/MathBench.java, the performance for sqrt benchmark could be improved.  Thanks a lot to Eric Caspole for finding this issue.
>
> Benchmark:
>     @Benchmark
>     public double  sqrtDouble() {
>         return  Math.sqrt(double4Dot1);
>     }
>
> Current code generated (linux format) by c2 JIT is:
>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>
> The vsqrtsd instruction operation is specified as below:
>      VSQRTSD (VEX.128 encoded version)
>      DEST[63:0] := SQRT(SRC2[63:0])
>      DEST[127:64] := SRC1[127:64]
>      DEST[MAXVL-1:128] := 0
>
> The upper 127:64 bits are set from previous contents of xmm0. As the destination xmm0 register was not initialized prior to use by c2 JIT, this causes stall and lower performance.
>
> By adding xmm0 initialization prior to use, the performance of the above benchmark improves significantly.
>
> Code generated after patch:
>      vxorpd %xmm0,%xmm0,%xmm0
>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>
> Performance before patch:
> Benchmark                    Mode  Cnt       Score      Error   Units
> MathBench.sqrtDouble        thrpt    8  193612.396 ±  95.807  ops/ms
>
> Performance after patch:
> MathBench.sqrtDouble        thrpt    8  276388.024 ± 846.372  ops/ms
>
> Best Regards,
> Sandhya

What is faster: `xor(xmm) + sqrt(xmm, mem)` or `mov(xmm, mem) + sqrt(xmm,xmm)` ?

Do we have other instructions which may suffer stall too if `dst` register is not zeroed?

-------------

PR: https://git.openjdk.java.net/jdk/pull/3256
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264054: Bad XMM performance on java.lang.MathBench.sqrtDouble

Vladimir Ivanov-2
In reply to this post by Sandhya Viswanathan
On Tue, 30 Mar 2021 00:38:22 GMT, Sandhya Viswanathan <[hidden email]> wrote:

> For the j.l.Math JMH at https://github.com/openjdk/jmh-jdk-microbenchmarks/blob/master/micros-jdk11/src/main/java/org/openjdk/bench/java/lang/MathBench.java, the performance for sqrt benchmark could be improved.  Thanks a lot to Eric Caspole for finding this issue.
>
> Benchmark:
>     @Benchmark
>     public double  sqrtDouble() {
>         return  Math.sqrt(double4Dot1);
>     }
>
> Current code generated (linux format) by c2 JIT is:
>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>
> The vsqrtsd instruction operation is specified as below:
>      VSQRTSD (VEX.128 encoded version)
>      DEST[63:0] := SQRT(SRC2[63:0])
>      DEST[127:64] := SRC1[127:64]
>      DEST[MAXVL-1:128] := 0
>
> The upper 127:64 bits are set from previous contents of xmm0. As the destination xmm0 register was not initialized prior to use by c2 JIT, this causes stall and lower performance.
>
> By adding xmm0 initialization prior to use, the performance of the above benchmark improves significantly.
>
> Code generated after patch:
>      vxorpd %xmm0,%xmm0,%xmm0
>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>
> Performance before patch:
> Benchmark                    Mode  Cnt       Score      Error   Units
> MathBench.sqrtDouble        thrpt    8  193612.396 ±  95.807  ops/ms
>
> Performance after patch:
> MathBench.sqrtDouble        thrpt    8  276388.024 ± 846.372  ops/ms
>
> Best Regards,
> Sandhya

src/hotspot/cpu/x86/x86.ad line 3252:

> 3250:   predicate(UseSSE>=1);
> 3251:   match(Set dst (SqrtF (LoadF src)));
> 3252:   effect(TEMP dst);

Why do you declare `dst` as `TEMP` (here and in other places)?

-------------

PR: https://git.openjdk.java.net/jdk/pull/3256
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264054: Bad XMM performance on java.lang.MathBench.sqrtDouble

Sandhya Viswanathan
In reply to this post by Vladimir Kozlov-2
On Tue, 30 Mar 2021 17:53:18 GMT, Vladimir Kozlov <[hidden email]> wrote:

>> For the j.l.Math JMH at https://github.com/openjdk/jmh-jdk-microbenchmarks/blob/master/micros-jdk11/src/main/java/org/openjdk/bench/java/lang/MathBench.java, the performance for sqrt benchmark could be improved.  Thanks a lot to Eric Caspole for finding this issue.
>>
>> Benchmark:
>>     @Benchmark
>>     public double  sqrtDouble() {
>>         return  Math.sqrt(double4Dot1);
>>     }
>>
>> Current code generated (linux format) by c2 JIT is:
>>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>>
>> The vsqrtsd instruction operation is specified as below:
>>      VSQRTSD (VEX.128 encoded version)
>>      DEST[63:0] := SQRT(SRC2[63:0])
>>      DEST[127:64] := SRC1[127:64]
>>      DEST[MAXVL-1:128] := 0
>>
>> The upper 127:64 bits are set from previous contents of xmm0. As the destination xmm0 register was not initialized prior to use by c2 JIT, this causes stall and lower performance.
>>
>> By adding xmm0 initialization prior to use, the performance of the above benchmark improves significantly.
>>
>> Code generated after patch:
>>      vxorpd %xmm0,%xmm0,%xmm0
>>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>>
>> Performance before patch:
>> Benchmark                    Mode  Cnt       Score      Error   Units
>> MathBench.sqrtDouble        thrpt    8  193612.396 ±  95.807  ops/ms
>>
>> Performance after patch:
>> MathBench.sqrtDouble        thrpt    8  276388.024 ± 846.372  ops/ms
>>
>> Best Regards,
>> Sandhya
>
> What is faster: `xor(xmm) + sqrt(xmm, mem)` or `mov(xmm, mem) + sqrt(xmm,xmm)` ?
>
> Do we have other instructions which may suffer stall too if `dst` register is not zeroed?

@vnkozlov  Both xor(xmm) + sqrt(xmm, mem) or mov(xmm, mem) + sqrt(xmm,xmm) have same performance. I could look into simplifying the patch by just keeping the register version of these rules.

I looked through x86.ad for other usages of unary instructions but didn't find any other instances. We implement negate and abs using binary instructions.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3256
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264054: Bad XMM performance on java.lang.MathBench.sqrtDouble

Sandhya Viswanathan
In reply to this post by Vladimir Ivanov-2
On Tue, 30 Mar 2021 18:22:02 GMT, Vladimir Ivanov <[hidden email]> wrote:

>> For the j.l.Math JMH at https://github.com/openjdk/jmh-jdk-microbenchmarks/blob/master/micros-jdk11/src/main/java/org/openjdk/bench/java/lang/MathBench.java, the performance for sqrt benchmark could be improved.  Thanks a lot to Eric Caspole for finding this issue.
>>
>> Benchmark:
>>     @Benchmark
>>     public double  sqrtDouble() {
>>         return  Math.sqrt(double4Dot1);
>>     }
>>
>> Current code generated (linux format) by c2 JIT is:
>>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>>
>> The vsqrtsd instruction operation is specified as below:
>>      VSQRTSD (VEX.128 encoded version)
>>      DEST[63:0] := SQRT(SRC2[63:0])
>>      DEST[127:64] := SRC1[127:64]
>>      DEST[MAXVL-1:128] := 0
>>
>> The upper 127:64 bits are set from previous contents of xmm0. As the destination xmm0 register was not initialized prior to use by c2 JIT, this causes stall and lower performance.
>>
>> By adding xmm0 initialization prior to use, the performance of the above benchmark improves significantly.
>>
>> Code generated after patch:
>>      vxorpd %xmm0,%xmm0,%xmm0
>>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>>
>> Performance before patch:
>> Benchmark                    Mode  Cnt       Score      Error   Units
>> MathBench.sqrtDouble        thrpt    8  193612.396 ±  95.807  ops/ms
>>
>> Performance after patch:
>> MathBench.sqrtDouble        thrpt    8  276388.024 ± 846.372  ops/ms
>>
>> Best Regards,
>> Sandhya
>
> src/hotspot/cpu/x86/x86.ad line 3252:
>
>> 3250:   predicate(UseSSE>=1);
>> 3251:   match(Set dst (SqrtF (LoadF src)));
>> 3252:   effect(TEMP dst);
>
> Why do you declare `dst` as `TEMP` (here and in other places)?

Good point. I was being extra cautious that the register in address should not be overwritten by the xorps. But of course the address cannot have xmm register here so TEMP dst is not needed.  I will update the patch accordingly.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3256
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264054: Bad XMM performance on java.lang.MathBench.sqrtDouble [v2]

Sandhya Viswanathan
In reply to this post by Sandhya Viswanathan
> For the j.l.Math JMH at https://github.com/openjdk/jmh-jdk-microbenchmarks/blob/master/micros-jdk11/src/main/java/org/openjdk/bench/java/lang/MathBench.java, the performance for sqrt benchmark could be improved.  Thanks a lot to Eric Caspole for finding this issue.
>
> Benchmark:
>     @Benchmark
>     public double  sqrtDouble() {
>         return  Math.sqrt(double4Dot1);
>     }
>
> Current code generated (linux format) by c2 JIT is:
>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>
> The vsqrtsd instruction operation is specified as below:
>      VSQRTSD (VEX.128 encoded version)
>      DEST[63:0] := SQRT(SRC2[63:0])
>      DEST[127:64] := SRC1[127:64]
>      DEST[MAXVL-1:128] := 0
>
> The upper 127:64 bits are set from previous contents of xmm0. As the destination xmm0 register was not initialized prior to use by c2 JIT, this causes stall and lower performance.
>
> By adding xmm0 initialization prior to use, the performance of the above benchmark improves significantly.
>
> Code generated after patch:
>      vxorpd %xmm0,%xmm0,%xmm0
>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>
> Performance before patch:
> Benchmark                    Mode  Cnt       Score      Error   Units
> MathBench.sqrtDouble        thrpt    8  193612.396 ±  95.807  ops/ms
>
> Performance after patch:
> MathBench.sqrtDouble        thrpt    8  276388.024 ± 846.372  ops/ms
>
> Best Regards,
> Sandhya

Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:

  Implement review comments

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/3256/files
  - new: https://git.openjdk.java.net/jdk/pull/3256/files/2910e835..bbee5088

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=3256&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=3256&range=00-01

  Stats: 952 lines in 3 files changed: 875 ins; 60 del; 17 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3256.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3256/head:pull/3256

PR: https://git.openjdk.java.net/jdk/pull/3256
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264054: Bad XMM performance on java.lang.MathBench.sqrtDouble [v3]

Sandhya Viswanathan
In reply to this post by Sandhya Viswanathan
> For the j.l.Math JMH at https://github.com/openjdk/jmh-jdk-microbenchmarks/blob/master/micros-jdk11/src/main/java/org/openjdk/bench/java/lang/MathBench.java, the performance for sqrt benchmark could be improved.  Thanks a lot to Eric Caspole for finding this issue.
>
> Benchmark:
>     @Benchmark
>     public double  sqrtDouble() {
>         return  Math.sqrt(double4Dot1);
>     }
>
> Current code generated (linux format) by c2 JIT is:
>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>
> The vsqrtsd instruction operation is specified as below:
>      VSQRTSD (VEX.128 encoded version)
>      DEST[63:0] := SQRT(SRC2[63:0])
>      DEST[127:64] := SRC1[127:64]
>      DEST[MAXVL-1:128] := 0
>
> The upper 127:64 bits are set from previous contents of xmm0. As the destination xmm0 register was not initialized prior to use by c2 JIT, this causes stall and lower performance.
>
> By adding xmm0 initialization prior to use, the performance of the above benchmark improves significantly.
>
> Code generated after patch:
>      vxorpd %xmm0,%xmm0,%xmm0
>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>
> Performance before patch:
> Benchmark                    Mode  Cnt       Score      Error   Units
> MathBench.sqrtDouble        thrpt    8  193612.396 ±  95.807  ops/ms
>
> Performance after patch:
> MathBench.sqrtDouble        thrpt    8  276388.024 ± 846.372  ops/ms
>
> Best Regards,
> Sandhya

Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:

  Correct permissions on StrictMathBench.java

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/3256/files
  - new: https://git.openjdk.java.net/jdk/pull/3256/files/bbee5088..2cfa863b

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=3256&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=3256&range=01-02

  Stats: 0 lines in 1 file changed: 0 ins; 0 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3256.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3256/head:pull/3256

PR: https://git.openjdk.java.net/jdk/pull/3256
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264054: Bad XMM performance on java.lang.MathBench.sqrtDouble [v3]

Sandhya Viswanathan
In reply to this post by Nils Eliasson-2
On Tue, 30 Mar 2021 12:27:27 GMT, Nils Eliasson <[hidden email]> wrote:

>> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Correct permissions on StrictMathBench.java
>
> In the bug report you are mentioning new micro benchmark test cases. Have you already contributed those, or would you like to add those to this PR?

@neliasso @vnkozlov @iwanowww All the review comments are implemented.  
The MathBench.java and StrictMathBench.java contribution is from Eric Caspole and Charlie Hunt.
Please let me know if any other changes are needed.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3256
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264054: Bad XMM performance on java.lang.MathBench.sqrtDouble [v3]

Vladimir Kozlov-2
In reply to this post by Sandhya Viswanathan
On Wed, 31 Mar 2021 00:39:45 GMT, Sandhya Viswanathan <[hidden email]> wrote:

>> For the j.l.Math JMH at https://github.com/openjdk/jmh-jdk-microbenchmarks/blob/master/micros-jdk11/src/main/java/org/openjdk/bench/java/lang/MathBench.java, the performance for sqrt benchmark could be improved.  Thanks a lot to Eric Caspole for finding this issue.
>>
>> Benchmark:
>>     @Benchmark
>>     public double  sqrtDouble() {
>>         return  Math.sqrt(double4Dot1);
>>     }
>>
>> Current code generated (linux format) by c2 JIT is:
>>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>>
>> The vsqrtsd instruction operation is specified as below:
>>      VSQRTSD (VEX.128 encoded version)
>>      DEST[63:0] := SQRT(SRC2[63:0])
>>      DEST[127:64] := SRC1[127:64]
>>      DEST[MAXVL-1:128] := 0
>>
>> The upper 127:64 bits are set from previous contents of xmm0. As the destination xmm0 register was not initialized prior to use by c2 JIT, this causes stall and lower performance.
>>
>> By adding xmm0 initialization prior to use, the performance of the above benchmark improves significantly.
>>
>> Code generated after patch:
>>      vxorpd %xmm0,%xmm0,%xmm0
>>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>>
>> Performance before patch:
>> Benchmark                    Mode  Cnt       Score      Error   Units
>> MathBench.sqrtDouble        thrpt    8  193612.396 ±  95.807  ops/ms
>>
>> Performance after patch:
>> MathBench.sqrtDouble        thrpt    8  276388.024 ± 846.372  ops/ms
>>
>> Best Regards,
>> Sandhya
>
> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>
>   Correct permissions on StrictMathBench.java

Nice solution - let C2 generate the best value loading instruction. I assume the improvement is the same with this update.

Thank you for checking other instructions.

-------------

Marked as reviewed by kvn (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/3256
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264054: Bad XMM performance on java.lang.MathBench.sqrtDouble [v3]

Sandhya Viswanathan
On Wed, 31 Mar 2021 00:46:10 GMT, Vladimir Kozlov <[hidden email]> wrote:

>> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Correct permissions on StrictMathBench.java
>
> Nice solution - let C2 generate the best value loading instruction. I assume the improvement is the same with this update.
>
> Thank you for checking other instructions.

@vnkozlov Yes the performance improvement is the same with the updated patch.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3256
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264054: Bad XMM performance on java.lang.MathBench.sqrtDouble [v3]

Vladimir Ivanov-2
In reply to this post by Sandhya Viswanathan
On Wed, 31 Mar 2021 00:39:45 GMT, Sandhya Viswanathan <[hidden email]> wrote:

>> For the j.l.Math JMH at https://github.com/openjdk/jmh-jdk-microbenchmarks/blob/master/micros-jdk11/src/main/java/org/openjdk/bench/java/lang/MathBench.java, the performance for sqrt benchmark could be improved.  Thanks a lot to Eric Caspole for finding this issue.
>>
>> Benchmark:
>>     @Benchmark
>>     public double  sqrtDouble() {
>>         return  Math.sqrt(double4Dot1);
>>     }
>>
>> Current code generated (linux format) by c2 JIT is:
>>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>>
>> The vsqrtsd instruction operation is specified as below:
>>      VSQRTSD (VEX.128 encoded version)
>>      DEST[63:0] := SQRT(SRC2[63:0])
>>      DEST[127:64] := SRC1[127:64]
>>      DEST[MAXVL-1:128] := 0
>>
>> The upper 127:64 bits are set from previous contents of xmm0. As the destination xmm0 register was not initialized prior to use by c2 JIT, this causes stall and lower performance.
>>
>> By adding xmm0 initialization prior to use, the performance of the above benchmark improves significantly.
>>
>> Code generated after patch:
>>      vxorpd %xmm0,%xmm0,%xmm0
>>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>>
>> Performance before patch:
>> Benchmark                    Mode  Cnt       Score      Error   Units
>> MathBench.sqrtDouble        thrpt    8  193612.396 ±  95.807  ops/ms
>>
>> Performance after patch:
>> MathBench.sqrtDouble        thrpt    8  276388.024 ± 846.372  ops/ms
>>
>> Best Regards,
>> Sandhya
>
> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>
>   Correct permissions on StrictMathBench.java

Looks good.

src/hotspot/cpu/x86/x86.ad line 3235:

> 3233: %}
> 3234:
> 3235: instruct sqrtF_reg(regF dst, regF src) %{

Would be helpful to have a comment describing why there are only reg-to-reg variants kept for `SqrtF`/`SqrtD`.

-------------

Marked as reviewed by vlivanov (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/3256
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264054: Bad XMM performance on java.lang.MathBench.sqrtDouble [v3]

Nils Eliasson-2
In reply to this post by Sandhya Viswanathan
On Wed, 31 Mar 2021 00:39:45 GMT, Sandhya Viswanathan <[hidden email]> wrote:

>> For the j.l.Math JMH at https://github.com/openjdk/jmh-jdk-microbenchmarks/blob/master/micros-jdk11/src/main/java/org/openjdk/bench/java/lang/MathBench.java, the performance for sqrt benchmark could be improved.  Thanks a lot to Eric Caspole for finding this issue.
>>
>> Benchmark:
>>     @Benchmark
>>     public double  sqrtDouble() {
>>         return  Math.sqrt(double4Dot1);
>>     }
>>
>> Current code generated (linux format) by c2 JIT is:
>>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>>
>> The vsqrtsd instruction operation is specified as below:
>>      VSQRTSD (VEX.128 encoded version)
>>      DEST[63:0] := SQRT(SRC2[63:0])
>>      DEST[127:64] := SRC1[127:64]
>>      DEST[MAXVL-1:128] := 0
>>
>> The upper 127:64 bits are set from previous contents of xmm0. As the destination xmm0 register was not initialized prior to use by c2 JIT, this causes stall and lower performance.
>>
>> By adding xmm0 initialization prior to use, the performance of the above benchmark improves significantly.
>>
>> Code generated after patch:
>>      vxorpd %xmm0,%xmm0,%xmm0
>>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>>
>> Performance before patch:
>> Benchmark                    Mode  Cnt       Score      Error   Units
>> MathBench.sqrtDouble        thrpt    8  193612.396 ±  95.807  ops/ms
>>
>> Performance after patch:
>> MathBench.sqrtDouble        thrpt    8  276388.024 ± 846.372  ops/ms
>>
>> Best Regards,
>> Sandhya
>
> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>
>   Correct permissions on StrictMathBench.java

Thanks for adding the test.

Looks good.

-------------

Marked as reviewed by neliasso (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/3256
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264054: Bad XMM performance on java.lang.MathBench.sqrtDouble [v4]

Sandhya Viswanathan
In reply to this post by Sandhya Viswanathan
> For the j.l.Math JMH at https://github.com/openjdk/jmh-jdk-microbenchmarks/blob/master/micros-jdk11/src/main/java/org/openjdk/bench/java/lang/MathBench.java, the performance for sqrt benchmark could be improved.  Thanks a lot to Eric Caspole for finding this issue.
>
> Benchmark:
>     @Benchmark
>     public double  sqrtDouble() {
>         return  Math.sqrt(double4Dot1);
>     }
>
> Current code generated (linux format) by c2 JIT is:
>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>
> The vsqrtsd instruction operation is specified as below:
>      VSQRTSD (VEX.128 encoded version)
>      DEST[63:0] := SQRT(SRC2[63:0])
>      DEST[127:64] := SRC1[127:64]
>      DEST[MAXVL-1:128] := 0
>
> The upper 127:64 bits are set from previous contents of xmm0. As the destination xmm0 register was not initialized prior to use by c2 JIT, this causes stall and lower performance.
>
> By adding xmm0 initialization prior to use, the performance of the above benchmark improves significantly.
>
> Code generated after patch:
>      vxorpd %xmm0,%xmm0,%xmm0
>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>
> Performance before patch:
> Benchmark                    Mode  Cnt       Score      Error   Units
> MathBench.sqrtDouble        thrpt    8  193612.396 ±  95.807  ops/ms
>
> Performance after patch:
> MathBench.sqrtDouble        thrpt    8  276388.024 ± 846.372  ops/ms
>
> Best Regards,
> Sandhya

Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:

  Added comments for sqrtF_reg and sqrtD_reg rules

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/3256/files
  - new: https://git.openjdk.java.net/jdk/pull/3256/files/2cfa863b..903b7c41

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=3256&range=03
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=3256&range=02-03

  Stats: 4 lines in 1 file changed: 4 ins; 0 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3256.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3256/head:pull/3256

PR: https://git.openjdk.java.net/jdk/pull/3256
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264054: Bad XMM performance on java.lang.MathBench.sqrtDouble [v4]

Sandhya Viswanathan
In reply to this post by Vladimir Ivanov-2
On Wed, 31 Mar 2021 08:13:17 GMT, Vladimir Ivanov <[hidden email]> wrote:

>> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Added comments for sqrtF_reg and sqrtD_reg rules
>
> src/hotspot/cpu/x86/x86.ad line 3235:
>
>> 3233: %}
>> 3234:
>> 3235: instruct sqrtF_reg(regF dst, regF src) %{
>
> Would be helpful to have a comment describing why there are only reg-to-reg variants kept for `SqrtF`/`SqrtD`.

Done, added comments for the sqrt rules.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3256
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264054: Bad XMM performance on java.lang.MathBench.sqrtDouble [v5]

Sandhya Viswanathan
In reply to this post by Sandhya Viswanathan
> For the j.l.Math JMH at https://github.com/openjdk/jmh-jdk-microbenchmarks/blob/master/micros-jdk11/src/main/java/org/openjdk/bench/java/lang/MathBench.java, the performance for sqrt benchmark could be improved.  Thanks a lot to Eric Caspole for finding this issue.
>
> Benchmark:
>     @Benchmark
>     public double  sqrtDouble() {
>         return  Math.sqrt(double4Dot1);
>     }
>
> Current code generated (linux format) by c2 JIT is:
>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>
> The vsqrtsd instruction operation is specified as below:
>      VSQRTSD (VEX.128 encoded version)
>      DEST[63:0] := SQRT(SRC2[63:0])
>      DEST[127:64] := SRC1[127:64]
>      DEST[MAXVL-1:128] := 0
>
> The upper 127:64 bits are set from previous contents of xmm0. As the destination xmm0 register was not initialized prior to use by c2 JIT, this causes stall and lower performance.
>
> By adding xmm0 initialization prior to use, the performance of the above benchmark improves significantly.
>
> Code generated after patch:
>      vxorpd %xmm0,%xmm0,%xmm0
>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>
> Performance before patch:
> Benchmark                    Mode  Cnt       Score      Error   Units
> MathBench.sqrtDouble        thrpt    8  193612.396 ±  95.807  ops/ms
>
> Performance after patch:
> MathBench.sqrtDouble        thrpt    8  276388.024 ± 846.372  ops/ms
>
> Best Regards,
> Sandhya

Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:

  Corrected typo

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/3256/files
  - new: https://git.openjdk.java.net/jdk/pull/3256/files/903b7c41..9e90b452

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=3256&range=04
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=3256&range=03-04

  Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3256.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3256/head:pull/3256

PR: https://git.openjdk.java.net/jdk/pull/3256
Reply | Threaded
Open this post in threaded view
|

Integrated: 8264054: Bad XMM performance on java.lang.MathBench.sqrtDouble

Sandhya Viswanathan
In reply to this post by Sandhya Viswanathan
On Tue, 30 Mar 2021 00:38:22 GMT, Sandhya Viswanathan <[hidden email]> wrote:

> For the j.l.Math JMH at https://github.com/openjdk/jmh-jdk-microbenchmarks/blob/master/micros-jdk11/src/main/java/org/openjdk/bench/java/lang/MathBench.java, the performance for sqrt benchmark could be improved.  Thanks a lot to Eric Caspole for finding this issue.
>
> Benchmark:
>     @Benchmark
>     public double  sqrtDouble() {
>         return  Math.sqrt(double4Dot1);
>     }
>
> Current code generated (linux format) by c2 JIT is:
>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>
> The vsqrtsd instruction operation is specified as below:
>      VSQRTSD (VEX.128 encoded version)
>      DEST[63:0] := SQRT(SRC2[63:0])
>      DEST[127:64] := SRC1[127:64]
>      DEST[MAXVL-1:128] := 0
>
> The upper 127:64 bits are set from previous contents of xmm0. As the destination xmm0 register was not initialized prior to use by c2 JIT, this causes stall and lower performance.
>
> By adding xmm0 initialization prior to use, the performance of the above benchmark improves significantly.
>
> Code generated after patch:
>      vxorpd %xmm0,%xmm0,%xmm0
>      vsqrtsd 0x50(%r10),%xmm0,%xmm0
>
> Performance before patch:
> Benchmark                    Mode  Cnt       Score      Error   Units
> MathBench.sqrtDouble        thrpt    8  193612.396 ±  95.807  ops/ms
>
> Performance after patch:
> MathBench.sqrtDouble        thrpt    8  276388.024 ± 846.372  ops/ms
>
> Best Regards,
> Sandhya

This pull request has now been integrated.

Changeset: 52d8a229
Author:    Sandhya Viswanathan <[hidden email]>
URL:       https://git.openjdk.java.net/jdk/commit/52d8a229
Stats:     947 lines in 3 files changed: 879 ins; 51 del; 17 mod

8264054: Bad XMM performance on java.lang.MathBench.sqrtDouble

Co-authored-by: Eric Caspole <[hidden email]>
Co-authored-by: Charlie Hunt <[hidden email]>
Reviewed-by: neliasso, kvn, vlivanov

-------------

PR: https://git.openjdk.java.net/jdk/pull/3256