Quantcast

RFR(S): 8178811: Minimize the AVX <-> SSE transition penalty on x86

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RFR(S): 8178811: Minimize the AVX <-> SSE transition penalty on x86

Deshpande, Vivek R

Hi All

 

This fix minimizes the AVX to SSE and SSE to AVX transition penalty through generation of vzeroupper instruction. With this patch we see zero transitions with penalty per SPECjbb2015 jOPS on BDW and a significant reduction on SKX CPU event vector width mismatch from 65 to 0.01 per SPECjbb2015 jOPS. We have also implemented an enhancement to disable vzeroupper generation for Knights family where the instruction has high penalty and is not recommended. The option UseVzeroupper is used to control generation of vzeroupper instruction and gets set to false on the Knights family. 
We observed ~3% gain on SPECJvm2008 composite result on Skylake.

Webrev:

http://cr.openjdk.java.net/~vdeshpande/8178811/webrev.00/

I have also updated the JBS entry.

https://bugs.openjdk.java.net/browse/JDK-8178811

Would you please review and sponsor it.

 

Regards,

Vivek

 

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR(S): 8178811: Minimize the AVX <-> SSE transition penalty on x86

Vladimir Kozlov
Hi Vivek,

Did you pinpoint particular change which helps SPECjbb2015?

 From what I remember it is mostly during call to runtime or JNI calls.

   // AVX instruction which is used to clear upper 128 bits of YMM registers and
   // to avoid transaction penalty between AVX and SSE states. There is no
   // penalty if legacy SSE instructions are encoded using VEX prefix because
   // they always clear upper 128 bits. It should be used before calling
   // runtime code and native libraries.
   void vzeroupper();

So why you added vzeroupper() into arraycopy and other stubs? If AVX is supported all instructions should be encoded with VEX prefix. Or the statement in the comment is incorrect?

About UseVzeroupper flag. We should avoid adding new product flags if possible.
IT only make sense if you want to do experiments to check performance with and without it. But we already know that it is needed on most CPUs.

I would suggest to add VM_Version::supports_vzeroupper() { return (_features & CPU_VZEROUPPER) != 0; }
Set CPU_VZEROUPPE if AVX is supported and clear it for Knights CPU. The add check inside assembler instruction:

   void Assembler::vzeroupper() {
-   assert(VM_Version::supports_avx(), "");
+   if (VM_Version::supports_vzeroupper()) {

It will allow to avoid supports_avx() checks on each vzeroupper() call.

You missed check in MachEpilogNode::emit() in x86_64.ad

Note, I used (C->max_vector_size() > 16) checks in .ad files because of the same reasons as above:

   // There is no penalty if legacy SSE instructions are encoded using VEX prefix because
   // they always clear upper 128 bits.

I thought if vectors are small and we use VEX prefix then upper bits will be 0 anyway so you don't need vzeroupper().

Thanks,
Vladimir

On 4/14/17 11:17 AM, Deshpande, Vivek R wrote:

> Hi All
>
>
>
> This fix minimizes the AVX to SSE and SSE to AVX transition penalty through generation of vzeroupper instruction. With this patch we see zero transitions with penalty per SPECjbb2015 jOPS on BDW and a
> significant reduction on SKX CPU event vector width mismatch from 65 to 0.01 per SPECjbb2015 jOPS. We have also implemented an enhancement to disable vzeroupper generation for Knights family where the
> instruction has high penalty and is not recommended. The option UseVzeroupper is used to control generation of vzeroupper instruction and gets set to false on the Knights family.
> We observed ~3% gain on SPECJvm2008 composite result on Skylake.
>
> Webrev:
>
> http://cr.openjdk.java.net/~vdeshpande/8178811/webrev.00/
>
> I have also updated the JBS entry.
>
> https://bugs.openjdk.java.net/browse/JDK-8178811
>
> Would you please review and sponsor it.
>
>
>
> Regards,
>
> Vivek
>
>
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR(S): 8178811: Minimize the AVX <-> SSE transition penalty on x86

Andrew Haley
In reply to this post by Deshpande, Vivek R
Hi,

On 14/04/17 19:17, Deshpande, Vivek R wrote:

> This fix minimizes the AVX to SSE and SSE to AVX transition penalty
> through generation of vzeroupper instruction.

This looks like a performance enhancement, rather than a bug.  Is JDK
9 still open for performance enhancement?  I thought we were in
rampdown, but the report is targeted to 9.

Andrew.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR(S): 8178811: Minimize the AVX <-> SSE transition penalty on x86

Andrew Haley
I'm sorry: my reply sounded terribly mean spirited but I didn't intend
it to be.

3% on an important benchmark is certainly worth having, and I'd
support the patch for that reason.  I'm just confused about what is
allowed: for example, there are a couple of missing vector patterns in
AArch64.ad, but I haven't posted the fix because we're in rampdown.

Andrew.
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR(S): 8178811: Minimize the AVX <-> SSE transition penalty on x86

Vladimir Kozlov
In reply to this post by Andrew Haley
What is important is 'Fix version'. 'Affected version' could be all
current and previous releases.

 From the beginning I thought it is targeting JDK 10. To avoid confusion
I set 'Fix version' to 10.

It may go into 9 Update as other small enchantments.

Regards,
Vladimir

On 4/16/17 3:03 AM, Andrew Haley wrote:

> Hi,
>
> On 14/04/17 19:17, Deshpande, Vivek R wrote:
>
>> This fix minimizes the AVX to SSE and SSE to AVX transition penalty
>> through generation of vzeroupper instruction.
>
> This looks like a performance enhancement, rather than a bug.  Is JDK
> 9 still open for performance enhancement?  I thought we were in
> rampdown, but the report is targeted to 9.
>
> Andrew.
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR(S): 8178811: Minimize the AVX <-> SSE transition penalty on x86

Tobias Hartmann-2
Hi,

this issue slipped through our triaging process because the subcomponent was not set to "compiler". I converted it to an enhancement and targeted it to JDK 10.

As Vladimir said, we can later backport it to a JDK 9 update release.

Thanks,
Tobias

On 17.04.2017 06:30, Vladimir Kozlov wrote:

> What is important is 'Fix version'. 'Affected version' could be all current and previous releases.
>
> From the beginning I thought it is targeting JDK 10. To avoid confusion I set 'Fix version' to 10.
>
> It may go into 9 Update as other small enchantments.
>
> Regards,
> Vladimir
>
> On 4/16/17 3:03 AM, Andrew Haley wrote:
>> Hi,
>>
>> On 14/04/17 19:17, Deshpande, Vivek R wrote:
>>
>>> This fix minimizes the AVX to SSE and SSE to AVX transition penalty
>>> through generation of vzeroupper instruction.
>>
>> This looks like a performance enhancement, rather than a bug.  Is JDK
>> 9 still open for performance enhancement?  I thought we were in
>> rampdown, but the report is targeted to 9.
>>
>> Andrew.
>>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RE: RFR(S): 8178811: Minimize the AVX <-> SSE transition penalty on x86

Deshpande, Vivek R
In reply to this post by Vladimir Kozlov
HI Vladimir

We added almost all the vzeroupper instructions by analyzing SPECjbb2015.

We need vzeroupper after we execute avx2/avx-512 and before transition to SSE to avoid penalty of saving and restoring higher bits in YMM/ZMM registers. Since JVM is SSE compiled, mixing of AVX and SSE code happens in C1 and interpreter with usage of intrinsics, so I have added vzerouppers at these transitions in the stubs.

I think max_vector_size() > 16 would be with auto vectorization, but we need to have vzeroupper with stubs and intrinsics.

I have made all the changes which you suggested. Please take a look at the patch and let me know your thoughts.

The updated Webrev is here:
http://cr.openjdk.java.net/~vdeshpande/8178811/webrev.01/

Thank you.
Regards,
Vivek

-----Original Message-----
From: Vladimir Kozlov [mailto:[hidden email]]
Sent: Friday, April 14, 2017 12:48 PM
To: Deshpande, Vivek R; hotspot compiler
Cc: Viswanathan, Sandhya
Subject: Re: RFR(S): 8178811: Minimize the AVX <-> SSE transition penalty on x86

Hi Vivek,

Did you pinpoint particular change which helps SPECjbb2015?

 From what I remember it is mostly during call to runtime or JNI calls.

   // AVX instruction which is used to clear upper 128 bits of YMM registers and
   // to avoid transaction penalty between AVX and SSE states. There is no
   // penalty if legacy SSE instructions are encoded using VEX prefix because
   // they always clear upper 128 bits. It should be used before calling
   // runtime code and native libraries.
   void vzeroupper();

So why you added vzeroupper() into arraycopy and other stubs? If AVX is supported all instructions should be encoded with VEX prefix. Or the statement in the comment is incorrect?

About UseVzeroupper flag. We should avoid adding new product flags if possible.
IT only make sense if you want to do experiments to check performance with and without it. But we already know that it is needed on most CPUs.

I would suggest to add VM_Version::supports_vzeroupper() { return (_features & CPU_VZEROUPPER) != 0; } Set CPU_VZEROUPPE if AVX is supported and clear it for Knights CPU. The add check inside assembler instruction:

   void Assembler::vzeroupper() {
-   assert(VM_Version::supports_avx(), "");
+   if (VM_Version::supports_vzeroupper()) {

It will allow to avoid supports_avx() checks on each vzeroupper() call.

You missed check in MachEpilogNode::emit() in x86_64.ad

Note, I used (C->max_vector_size() > 16) checks in .ad files because of the same reasons as above:

   // There is no penalty if legacy SSE instructions are encoded using VEX prefix because
   // they always clear upper 128 bits.

I thought if vectors are small and we use VEX prefix then upper bits will be 0 anyway so you don't need vzeroupper().

Thanks,
Vladimir

On 4/14/17 11:17 AM, Deshpande, Vivek R wrote:

> Hi All
>
>
>
> This fix minimizes the AVX to SSE and SSE to AVX transition penalty
> through generation of vzeroupper instruction. With this patch we see
> zero transitions with penalty per SPECjbb2015 jOPS on BDW and a significant reduction on SKX CPU event vector width mismatch from 65 to 0.01 per SPECjbb2015 jOPS. We have also implemented an enhancement to disable vzeroupper generation for Knights family where the instruction has high penalty and is not recommended. The option UseVzeroupper is used to control generation of vzeroupper instruction and gets set to false on the Knights family.
> We observed ~3% gain on SPECJvm2008 composite result on Skylake.
>
> Webrev:
>
> http://cr.openjdk.java.net/~vdeshpande/8178811/webrev.00/
>
> I have also updated the JBS entry.
>
> https://bugs.openjdk.java.net/browse/JDK-8178811
>
> Would you please review and sponsor it.
>
>
>
> Regards,
>
> Vivek
>
>
>
Loading...