FW: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

FW: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics

Gustavo Serra Scalet


-----Original Message-----
From: ppc-aix-port-dev [mailto:[hidden email]] On Behalf Of Gustavo Serra Scalet
Sent: terça-feira, 8 de agosto de 2017 17:19
To: [hidden email]
Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics

Hi,

Could you please review this specific PPC64 change to hotspot? By implementing these intrinsics I noticed a small improvement with microbenchmarks analysis. On SpecJVM2008's crypto.rsa benchmark, only when backporting to JDK8 an improvement was noticed.

JBS: https://bugs.openjdk.java.net/browse/JDK-8185976
Webrev: https://gut.github.io/openjdk/webrev/JDK-8185976/webrev/

Motivation for this implementation: https://twitter.com/ijuma/status/698309312498835457

Best regards,
Gustavo Serra Scalet
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics

Doerr, Martin
Hi Gustavo,

thanks for the webrev and for sharing performance numbers. I've found a couple of things which should get addressed.

First of all, C2 does not perform sign extend when calling stubs. The int parms need to get zero/sign extended. (Could even be done without extra instructions by replacing sldi -> rldicl, cmpdi -> extsw_ in some cases.)

macroAssembler_ppc.cpp:
- Indentation should be 2 spaces.

stubGenerator_ppc:cpp:
- or_, addi_ should get replaced by orr, addi when CR0 result is not needed.
- Where is lplw initialized?
- I believe that the updating load/store instructions e.g. lwzu don't perform well on some processors. At least using stwu 2 times in the loop doesn't make sense.
- Note: It should be possible to use 8 byte instead of 4 byte instructions: MacroAssembler::multiply64, addc, adde. But I'm not requesting to change that because I guess it would make the code very complicated, especially when supporting both endianess versions.
- The squareToLen stub implementation is very close the Java implementation. So it'd be interesting to understand what C2 doesn't do as well as the hand written assembly code. Do you know that? (Not absolutely necessary for accepting this change as long as the stub is measurably faster.)

Best regards,
Martin


-----Original Message-----
From: hotspot-compiler-dev [mailto:[hidden email]] On Behalf Of Gustavo Serra Scalet
Sent: Donnerstag, 10. August 2017 19:22
To: '[hidden email]' <[hidden email]>
Subject: FW: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics



-----Original Message-----
From: ppc-aix-port-dev [mailto:[hidden email]] On Behalf Of Gustavo Serra Scalet
Sent: terça-feira, 8 de agosto de 2017 17:19
To: [hidden email]
Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics

Hi,

Could you please review this specific PPC64 change to hotspot? By implementing these intrinsics I noticed a small improvement with microbenchmarks analysis. On SpecJVM2008's crypto.rsa benchmark, only when backporting to JDK8 an improvement was noticed.

JBS: https://bugs.openjdk.java.net/browse/JDK-8185976
Webrev: https://gut.github.io/openjdk/webrev/JDK-8185976/webrev/

Motivation for this implementation: https://twitter.com/ijuma/status/698309312498835457

Best regards,
Gustavo Serra Scalet
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: FW: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics

Andrew Haley
In reply to this post by Gustavo Serra Scalet
On 10/08/17 18:22, Gustavo Serra Scalet wrote:
> Could you please review this specific PPC64 change to hotspot? By implementing these intrinsics I noticed a small improvement with microbenchmarks analysis. On SpecJVM2008's crypto.rsa benchmark, only when backporting to JDK8 an improvement was noticed.

I'm surprised it's worth bothering.  JDK-8145913 surely does most of the
work for RSA.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics

Gustavo Serra Scalet
In reply to this post by Doerr, Martin
Hi Martin,

Thanks for dedicated review. It took me a while to be able to work on this but I hope to have your points solved. Please check below the review as well as my comments quoting your email:
https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.01/

> -----Original Message-----
> First of all, C2 does not perform sign extend when calling stubs. The
> int parms need to get zero/sign extended. (Could even be done without
> extra instructions by replacing sldi -> rldicl, cmpdi -> extsw_ in some
> cases.)

Does it make a difference on my case?

I guess you are talking about mulAdd preparation code. The only aspect I found about him is to force the cast from 32 bits -> 64 bits by cleaning higher bits. Offset is a signed integer but it can't be negative anyway.

So I changed from:
sldi   (R5_ARG3, R5_ARG3, 2);

to:
rldicl (R5_ARG3, R5_ARG3, 2, 32);  // always positive


> macroAssembler_ppc.cpp:
> - Indentation should be 2 spaces.

Done


> stubGenerator_ppc:cpp:
> - or_, addi_ should get replaced by orr, addi when CR0 result is not
> needed.

Done

> - Where is lplw initialized?

It should be initialized with 0, I missed that...

> - I believe that the updating load/store instructions e.g. lwzu don't
> perform well on some processors. At least using stwu 2 times in the loop
> doesn't make sense.

You are right. I could manipulate the bits differently and ended up with a single stdu in the loop. Neat! Although I could not reduce the total number of instructions.

> - Note: It should be possible to use 8 byte instead of 4 byte
> instructions: MacroAssembler::multiply64, addc, adde. But I'm not
> requesting to change that because I guess it would make the code very
> complicated, especially when supporting both endianess versions.

Yes, that would require a new analysis on this code. May we consider it next? As you said, I prefer having an initial version that looks as simple as the original java code.
 
> - The squareToLen stub implementation is very close the Java
> implementation. So it'd be interesting to understand what C2 doesn't do
> as well as the hand written assembly code. Do you know that? (Not
> absolutely necessary for accepting this change as long as the stub is
> measurably faster.)

I don't know either. Basically I chose doing it because I noticed some performance gain on SpecJVM2008 when analyzing X64. Then, taking a closer look, I didn't notice any AVX or some special instructions on X64 so I decided to try it on ppc64 by using some basic assembly.

Thanks

>
> Best regards,
> Martin
>
>
> -----Original Message-----
> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-
> [hidden email]] On Behalf Of Gustavo Serra Scalet
> Sent: Donnerstag, 10. August 2017 19:22
> To: '[hidden email]' <hotspot-compiler-
> [hidden email]>
> Subject: FW: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> SquareToLen intrinsics
>
>
>
> -----Original Message-----
> From: ppc-aix-port-dev [mailto:ppc-aix-port-dev-
> [hidden email]] On Behalf Of Gustavo Serra Scalet
> Sent: terça-feira, 8 de agosto de 2017 17:19
> To: [hidden email]
> Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen
> intrinsics
>
> Hi,
>
> Could you please review this specific PPC64 change to hotspot? By
> implementing these intrinsics I noticed a small improvement with
> microbenchmarks analysis. On SpecJVM2008's crypto.rsa benchmark, only
> when backporting to JDK8 an improvement was noticed.
>
> JBS: https://bugs.openjdk.java.net/browse/JDK-8185976
> Webrev: https://gut.github.io/openjdk/webrev/JDK-8185976/webrev/
>
> Motivation for this implementation:
> https://twitter.com/ijuma/status/698309312498835457
>
> Best regards,
> Gustavo Serra Scalet
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics

Doerr, Martin
Hi Gustavo,

thanks for the update.

1. Sign extending offset and len
Right, sign and zero extending is equivalent for offset and len because they are guaranteed to be >=0 (by checks in Java). But you can only rely on bit 32 (IBM notation) to be 0. Bit 0-31 may contain garbage.
rldicl was incorrect. My mistake, sorry for that. Correct would be rldic which also clears the least significant bits.
len should also get fixed e.g. by replacing cmpdi by extsw_ in muladd.

2. Using 8 byte instructions for int
The code which feeds stdu is endianess specific. Doesn't work on all PPC64 platforms.

3.Regarding Andrew's point: Superseded by Montgomery?
The Montgomery change got backported to jdk8u (JDK-8150152  in 8u102). I'd expect the performance improvement of these intrinsics to be irrelevant for crypto.rsa. Did you measure with an older jdk8 release?
(I think the change is still acceptable as the intrinsics could be used elsewhere and the implementation also exists on other platforms.)

Best regards,
Martin


-----Original Message-----
From: Gustavo Serra Scalet [mailto:[hidden email]]
Sent: Mittwoch, 16. August 2017 18:50
To: Doerr, Martin <[hidden email]>; '[hidden email]' <[hidden email]>
Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics

Hi Martin,

Thanks for dedicated review. It took me a while to be able to work on this but I hope to have your points solved. Please check below the review as well as my comments quoting your email:
https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.01/

> -----Original Message-----
> First of all, C2 does not perform sign extend when calling stubs. The
> int parms need to get zero/sign extended. (Could even be done without
> extra instructions by replacing sldi -> rldicl, cmpdi -> extsw_ in some
> cases.)

Does it make a difference on my case?

I guess you are talking about mulAdd preparation code. The only aspect I found about him is to force the cast from 32 bits -> 64 bits by cleaning higher bits. Offset is a signed integer but it can't be negative anyway.

So I changed from:
sldi   (R5_ARG3, R5_ARG3, 2);

to:
rldicl (R5_ARG3, R5_ARG3, 2, 32);  // always positive


> macroAssembler_ppc.cpp:
> - Indentation should be 2 spaces.

Done


> stubGenerator_ppc:cpp:
> - or_, addi_ should get replaced by orr, addi when CR0 result is not
> needed.

Done

> - Where is lplw initialized?

It should be initialized with 0, I missed that...

> - I believe that the updating load/store instructions e.g. lwzu don't
> perform well on some processors. At least using stwu 2 times in the loop
> doesn't make sense.

You are right. I could manipulate the bits differently and ended up with a single stdu in the loop. Neat! Although I could not reduce the total number of instructions.

> - Note: It should be possible to use 8 byte instead of 4 byte
> instructions: MacroAssembler::multiply64, addc, adde. But I'm not
> requesting to change that because I guess it would make the code very
> complicated, especially when supporting both endianess versions.

Yes, that would require a new analysis on this code. May we consider it next? As you said, I prefer having an initial version that looks as simple as the original java code.
 
> - The squareToLen stub implementation is very close the Java
> implementation. So it'd be interesting to understand what C2 doesn't do
> as well as the hand written assembly code. Do you know that? (Not
> absolutely necessary for accepting this change as long as the stub is
> measurably faster.)

I don't know either. Basically I chose doing it because I noticed some performance gain on SpecJVM2008 when analyzing X64. Then, taking a closer look, I didn't notice any AVX or some special instructions on X64 so I decided to try it on ppc64 by using some basic assembly.

Thanks

>
> Best regards,
> Martin
>
>
> -----Original Message-----
> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-
> [hidden email]] On Behalf Of Gustavo Serra Scalet
> Sent: Donnerstag, 10. August 2017 19:22
> To: '[hidden email]' <hotspot-compiler-
> [hidden email]>
> Subject: FW: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> SquareToLen intrinsics
>
>
>
> -----Original Message-----
> From: ppc-aix-port-dev [mailto:ppc-aix-port-dev-
> [hidden email]] On Behalf Of Gustavo Serra Scalet
> Sent: terça-feira, 8 de agosto de 2017 17:19
> To: [hidden email]
> Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen
> intrinsics
>
> Hi,
>
> Could you please review this specific PPC64 change to hotspot? By
> implementing these intrinsics I noticed a small improvement with
> microbenchmarks analysis. On SpecJVM2008's crypto.rsa benchmark, only
> when backporting to JDK8 an improvement was noticed.
>
> JBS: https://bugs.openjdk.java.net/browse/JDK-8185976
> Webrev: https://gut.github.io/openjdk/webrev/JDK-8185976/webrev/
>
> Motivation for this implementation:
> https://twitter.com/ijuma/status/698309312498835457
>
> Best regards,
> Gustavo Serra Scalet
Loading...