RFR: 8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors

classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

RFR: 8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors

Sandhya Viswanathan
The slice and unslice intrinsics for 256-bit byte/short vectors can be implemented for x86 platforms supporting AVX2 using a sequence of instructions.

JBS: https://bugs.openjdk.java.net/browse/JDK-8261542

The PerfSliceOrigin.java jmh test attached to the JBS shows the following performance on AVX2 platform.

Before:
Benchmark                                 (size)   Mode  Cnt   Score   Error   Units
PerfSliceOrigin.vectorSliceOrigin           1024  thrpt    5  18.887 ± 1.128  ops/ms
PerfSliceOrigin.vectorSliceUnsliceOrigin    1024  thrpt    5   9.374 ± 0.370  ops/ms

After:
Benchmark                                 (size)   Mode  Cnt      Score     Error   Units
PerfSliceOrigin.vectorSliceOrigin           1024  thrpt    5  13861.420 ±  19.071  ops/ms
PerfSliceOrigin.vectorSliceUnsliceOrigin    1024  thrpt    5   7895.199 ± 142.580  ops/ms

-------------

Commit messages:
 - 8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors

Changes: https://git.openjdk.java.net/jdk/pull/2520/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=2520&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8261542
  Stats: 119 lines in 7 files changed: 99 ins; 5 del; 15 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2520.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2520/head:pull/2520

PR: https://git.openjdk.java.net/jdk/pull/2520
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors

Sandhya Viswanathan
On Thu, 18 Feb 2021 19:14:37 GMT, Vladimir Kozlov <[hidden email]> wrote:

>> The slice and unslice intrinsics for 256-bit byte/short vectors can be implemented for x86 platforms supporting AVX2 using a sequence of instructions.
>>
>> JBS: https://bugs.openjdk.java.net/browse/JDK-8261542
>>
>> The PerfSliceOrigin.java jmh test attached to the JBS shows the following performance on AVX2 platform.
>>
>> Before:
>> Benchmark                                 (size)   Mode  Cnt   Score   Error   Units
>> PerfSliceOrigin.vectorSliceOrigin           1024  thrpt    5  18.887 ± 1.128  ops/ms
>> PerfSliceOrigin.vectorSliceUnsliceOrigin    1024  thrpt    5   9.374 ± 0.370  ops/ms
>>
>> After:
>> Benchmark                                 (size)   Mode  Cnt      Score     Error   Units
>> PerfSliceOrigin.vectorSliceOrigin           1024  thrpt    5  13861.420 ±  19.071  ops/ms
>> PerfSliceOrigin.vectorSliceUnsliceOrigin    1024  thrpt    5   7895.199 ± 142.580  ops/ms
>
> Please, add a test which verifies correctness of results when this code is used. If we don't have it already.

@vnkozlov thanks a lot for the review.  
The test for slice and unslice are already part of test/jdk/jdk/incubator/vector/Byte256VectorTests.java and Short256VectorTests.java.

> src/hotspot/cpu/x86/x86.ad line 7550:
>
>> 7548:     // only byte shuffle instruction available on these platforms
>> 7549:     int vlen_in_bytes = vector_length_in_bytes(this);
>> 7550:     if (UseAVX == 0) {
>
> This code will not be executed with vector length 16 because match_rule_supported_vector() bailout with (size_in_bits == 256 && UseAVX < 2).

Yes you are right, but this code will execute for vector length 16 when UseAVX ==2.
 It will also execure for vector length 16 when UseAVX == 3 &&
!VM_Version::supports_avx512bw.

> src/hotspot/cpu/x86/x86.ad line 7506:
>
>> 7504: instruct rearrangeB_avx(legVec dst, legVec src, vec shuffle, legVec vtmp1, legVec vtmp2, rRegP scratch) %{
>> 7505:   predicate(vector_element_basic_type(n) == T_BYTE &&
>> 7506:             vector_length(n) == 32 && !VM_Version::supports_avx512_vbmi());
>
> Predicate matches bail-out condition in match_rule_supported_vector(). Does it mean this code never used before?
> So you are implementing it now. Right?

Yes, this rule was not used before and I am implementing it now.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2520
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors

Vladimir Kozlov-2
In reply to this post by Sandhya Viswanathan
On Thu, 18 Feb 2021 19:14:37 GMT, Vladimir Kozlov <[hidden email]> wrote:

>> The slice and unslice intrinsics for 256-bit byte/short vectors can be implemented for x86 platforms supporting AVX2 using a sequence of instructions.
>>
>> JBS: https://bugs.openjdk.java.net/browse/JDK-8261542
>>
>> The PerfSliceOrigin.java jmh test attached to the JBS shows the following performance on AVX2 platform.
>>
>> Before:
>> Benchmark                                 (size)   Mode  Cnt   Score   Error   Units
>> PerfSliceOrigin.vectorSliceOrigin           1024  thrpt    5  18.887 ± 1.128  ops/ms
>> PerfSliceOrigin.vectorSliceUnsliceOrigin    1024  thrpt    5   9.374 ± 0.370  ops/ms
>>
>> After:
>> Benchmark                                 (size)   Mode  Cnt      Score     Error   Units
>> PerfSliceOrigin.vectorSliceOrigin           1024  thrpt    5  13861.420 ±  19.071  ops/ms
>> PerfSliceOrigin.vectorSliceUnsliceOrigin    1024  thrpt    5   7895.199 ± 142.580  ops/ms
>
> Please, add a test which verifies correctness of results when this code is used. If we don't have it already.

> @vnkozlov thanks a lot for the review.
> The test for slice and unslice are already part of test/jdk/jdk/incubator/vector/Byte256VectorTests.java and Short256VectorTests.java.

Good.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2520
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors

Vladimir Kozlov-2
In reply to this post by Sandhya Viswanathan
On Thu, 18 Feb 2021 21:21:49 GMT, Sandhya Viswanathan <[hidden email]> wrote:

>> src/hotspot/cpu/x86/x86.ad line 7550:
>>
>>> 7548:     // only byte shuffle instruction available on these platforms
>>> 7549:     int vlen_in_bytes = vector_length_in_bytes(this);
>>> 7550:     if (UseAVX == 0) {
>>
>> This code will not be executed with vector length 16 because match_rule_supported_vector() bailout with (size_in_bits == 256 && UseAVX < 2).
>
> Yes you are right, but this code will execute for vector length 16 when UseAVX ==2.
>  It will also execure for vector length 16 when UseAVX == 3 &&
> !VM_Version::supports_avx512bw.

Next assert checks <= 16 when code is guarded by (UseAVX == 0). It is not (UseAVX ==2).
Also } else { case is for UseAVX > 0 which includes AVX=1 but vpaddb() (avx3) is used there.
Seems UseAVX checks wrong here.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2520
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors [v2]

Sandhya Viswanathan
In reply to this post by Sandhya Viswanathan
> The slice and unslice intrinsics for 256-bit byte/short vectors can be implemented for x86 platforms supporting AVX2 using a sequence of instructions.
>
> JBS: https://bugs.openjdk.java.net/browse/JDK-8261542
>
> The PerfSliceOrigin.java jmh test attached to the JBS shows the following performance on AVX2 platform.
>
> Before:
> Benchmark                                 (size)   Mode  Cnt   Score   Error   Units
> PerfSliceOrigin.vectorSliceOrigin           1024  thrpt    5  18.887 ± 1.128  ops/ms
> PerfSliceOrigin.vectorSliceUnsliceOrigin    1024  thrpt    5   9.374 ± 0.370  ops/ms
>
> After:
> Benchmark                                 (size)   Mode  Cnt      Score     Error   Units
> PerfSliceOrigin.vectorSliceOrigin           1024  thrpt    5  13861.420 ±  19.071  ops/ms
> PerfSliceOrigin.vectorSliceUnsliceOrigin    1024  thrpt    5   7895.199 ± 142.580  ops/ms

Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:

  corrected assert

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/2520/files
  - new: https://git.openjdk.java.net/jdk/pull/2520/files/77324374..55165cb5

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2520&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2520&range=00-01

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2520.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2520/head:pull/2520

PR: https://git.openjdk.java.net/jdk/pull/2520
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors [v3]

Sandhya Viswanathan
In reply to this post by Sandhya Viswanathan
> The slice and unslice intrinsics for 256-bit byte/short vectors can be implemented for x86 platforms supporting AVX2 using a sequence of instructions.
>
> JBS: https://bugs.openjdk.java.net/browse/JDK-8261542
>
> The PerfSliceOrigin.java jmh test attached to the JBS shows the following performance on AVX2 platform.
>
> Before:
> Benchmark                                 (size)   Mode  Cnt   Score   Error   Units
> PerfSliceOrigin.vectorSliceOrigin           1024  thrpt    5  18.887 ± 1.128  ops/ms
> PerfSliceOrigin.vectorSliceUnsliceOrigin    1024  thrpt    5   9.374 ± 0.370  ops/ms
>
> After:
> Benchmark                                 (size)   Mode  Cnt      Score     Error   Units
> PerfSliceOrigin.vectorSliceOrigin           1024  thrpt    5  13861.420 ±  19.071  ops/ms
> PerfSliceOrigin.vectorSliceUnsliceOrigin    1024  thrpt    5   7895.199 ± 142.580  ops/ms

Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:

  corrected assert

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/2520/files
  - new: https://git.openjdk.java.net/jdk/pull/2520/files/55165cb5..fa13679a

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2520&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2520&range=01-02

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2520.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2520/head:pull/2520

PR: https://git.openjdk.java.net/jdk/pull/2520
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors [v3]

Vladimir Kozlov-2
On Fri, 19 Feb 2021 01:23:04 GMT, Sandhya Viswanathan <[hidden email]> wrote:

>> The slice and unslice intrinsics for 256-bit byte/short vectors can be implemented for x86 platforms supporting AVX2 using a sequence of instructions.
>>
>> JBS: https://bugs.openjdk.java.net/browse/JDK-8261542
>>
>> The PerfSliceOrigin.java jmh test attached to the JBS shows the following performance on AVX2 platform.
>>
>> Before:
>> Benchmark                                 (size)   Mode  Cnt   Score   Error   Units
>> PerfSliceOrigin.vectorSliceOrigin           1024  thrpt    5  18.887 ± 1.128  ops/ms
>> PerfSliceOrigin.vectorSliceUnsliceOrigin    1024  thrpt    5   9.374 ± 0.370  ops/ms
>>
>> After:
>> Benchmark                                 (size)   Mode  Cnt      Score     Error   Units
>> PerfSliceOrigin.vectorSliceOrigin           1024  thrpt    5  13861.420 ±  19.071  ops/ms
>> PerfSliceOrigin.vectorSliceUnsliceOrigin    1024  thrpt    5   7895.199 ± 142.580  ops/ms
>
> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>
>   corrected assert

src/hotspot/cpu/x86/x86.ad line 1695:

> 1693:       if(vlen == 2) {
> 1694:         return false; // Implementation limitation due to how shuffle is loaded
> 1695:       } else if (size_in_bits == 256 && UseAVX < 2) {

Should this be >= 256?

-------------

PR: https://git.openjdk.java.net/jdk/pull/2520
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors [v3]

Vladimir Kozlov-2
In reply to this post by Vladimir Kozlov-2
On Fri, 19 Feb 2021 01:27:57 GMT, Sandhya Viswanathan <[hidden email]> wrote:

>> Next assert checks <= 16 when code is guarded by (UseAVX == 0). It is not (UseAVX ==2).
>> Also } else { case is for UseAVX > 0 which includes AVX=1 but vpaddb() (avx3) is used there.
>> Seems UseAVX checks wrong here.
>
> The assert checks for  vlen_in_bytes <= 16 (128 bits) and so is a correct check for UseAVX=0.
> vpaddb is supported on  AVX1/AVX2 as well.
> vpaddb is supported on AVX1 for up to 128 bit and
> on AVX2 for upto 256 bit and
> on AVX3 (512) for upto 512 bit vectors.
> I have tested this for UseAVX=0, UseAVX=1, UseAVX=2, UseAVX=3 platform.
>
> The check is for UseAVX as with any flavor of AVX, we can use less number of instructions to do this operation.
> This is because AVX allows destination to be separate from both the sources.
>
> Please let me know if I am missing something.

My bad - I missed that size is in bytes in assert. The assert is correct, as you said.
And `} else {` part works for AVX1 because of match_rule_supported_vector() bailout 256-bit case.
May be add assert(UseAVX > 1 || vlen_in_bytes <= 16, ).

I only have one question left - about check >= 256 in  match_rule_supported_vector()

-------------

PR: https://git.openjdk.java.net/jdk/pull/2520
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors [v3]

Sandhya Viswanathan
In reply to this post by Vladimir Kozlov-2
On Fri, 19 Feb 2021 01:56:19 GMT, Vladimir Kozlov <[hidden email]> wrote:

>> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   corrected assert
>
> src/hotspot/cpu/x86/x86.ad line 1695:
>
>> 1693:       if(vlen == 2) {
>> 1694:         return false; // Implementation limitation due to how shuffle is loaded
>> 1695:       } else if (size_in_bits == 256 && UseAVX < 2) {
>
> Should this be >= 256?

The general >= 256 part is taken care of early on in match_rule_supported_vector as below:
  if (!vector_size_supported(bt, vlen)) {
    return false;
  }
The only additional check that is being done here is for float and double  256 bit vectors that are supported on AVX=1 and will pass the vector_size_supported check.
This is because the VectorLoadShuffle cannot be performed for 256 bit vectors on AVX1 platform as it needs "integer" 256 bit instructions which are only available on AVX2.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2520
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors [v4]

Sandhya Viswanathan
In reply to this post by Sandhya Viswanathan
> The slice and unslice intrinsics for 256-bit byte/short vectors can be implemented for x86 platforms supporting AVX2 using a sequence of instructions.
>
> JBS: https://bugs.openjdk.java.net/browse/JDK-8261542
>
> The PerfSliceOrigin.java jmh test attached to the JBS shows the following performance on AVX2 platform.
>
> Before:
> Benchmark                                 (size)   Mode  Cnt   Score   Error   Units
> PerfSliceOrigin.vectorSliceOrigin           1024  thrpt    5  18.887 ± 1.128  ops/ms
> PerfSliceOrigin.vectorSliceUnsliceOrigin    1024  thrpt    5   9.374 ± 0.370  ops/ms
>
> After:
> Benchmark                                 (size)   Mode  Cnt      Score     Error   Units
> PerfSliceOrigin.vectorSliceOrigin           1024  thrpt    5  13861.420 ±  19.071  ops/ms
> PerfSliceOrigin.vectorSliceUnsliceOrigin    1024  thrpt    5   7895.199 ± 142.580  ops/ms

Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:

  add assert on else path

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/2520/files
  - new: https://git.openjdk.java.net/jdk/pull/2520/files/fa13679a..ad3ab2b1

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2520&range=03
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2520&range=02-03

  Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2520.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2520/head:pull/2520

PR: https://git.openjdk.java.net/jdk/pull/2520
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors [v4]

Sandhya Viswanathan
In reply to this post by Vladimir Kozlov-2
On Fri, 19 Feb 2021 02:05:02 GMT, Vladimir Kozlov <[hidden email]> wrote:

>> The assert checks for  vlen_in_bytes <= 16 (128 bits) and so is a correct check for UseAVX=0.
>> vpaddb is supported on  AVX1/AVX2 as well.
>> vpaddb is supported on AVX1 for up to 128 bit and
>> on AVX2 for upto 256 bit and
>> on AVX3 (512) for upto 512 bit vectors.
>> I have tested this for UseAVX=0, UseAVX=1, UseAVX=2, UseAVX=3 platform.
>>
>> The check is for UseAVX as with any flavor of AVX, we can use less number of instructions to do this operation.
>> This is because AVX allows destination to be separate from both the sources.
>>
>> Please let me know if I am missing something.
>
> My bad - I missed that size is in bytes in assert. The assert is correct, as you said.
> And `} else {` part works for AVX1 because of match_rule_supported_vector() bailout 256-bit case.
> May be add assert(UseAVX > 1 || vlen_in_bytes <= 16, ).
>
> I only have one question left - about check >= 256 in  match_rule_supported_vector()

Added the following assert on else path:
+      assert(UseAVX > 1 || vlen_in_bytes <= 16, "required");

-------------

PR: https://git.openjdk.java.net/jdk/pull/2520
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors [v4]

Nils Eliasson-2
In reply to this post by Sandhya Viswanathan
On Fri, 19 Feb 2021 03:20:59 GMT, Sandhya Viswanathan <[hidden email]> wrote:

>> The slice and unslice intrinsics for 256-bit byte/short vectors can be implemented for x86 platforms supporting AVX2 using a sequence of instructions.
>>
>> JBS: https://bugs.openjdk.java.net/browse/JDK-8261542
>>
>> The PerfSliceOrigin.java jmh test attached to the JBS shows the following performance on AVX2 platform.
>>
>> Before:
>> Benchmark                                 (size)   Mode  Cnt   Score   Error   Units
>> PerfSliceOrigin.vectorSliceOrigin           1024  thrpt    5  18.887 ± 1.128  ops/ms
>> PerfSliceOrigin.vectorSliceUnsliceOrigin    1024  thrpt    5   9.374 ± 0.370  ops/ms
>>
>> After:
>> Benchmark                                 (size)   Mode  Cnt      Score     Error   Units
>> PerfSliceOrigin.vectorSliceOrigin           1024  thrpt    5  13861.420 ±  19.071  ops/ms
>> PerfSliceOrigin.vectorSliceUnsliceOrigin    1024  thrpt    5   7895.199 ± 142.580  ops/ms
>
> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>
>   add assert on else path

Looks good.

-------------

Marked as reviewed by neliasso (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/2520
Reply | Threaded
Open this post in threaded view
|

Integrated: 8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors

Sandhya Viswanathan
In reply to this post by Sandhya Viswanathan
On Thu, 11 Feb 2021 02:37:35 GMT, Sandhya Viswanathan <[hidden email]> wrote:

> The slice and unslice intrinsics for 256-bit byte/short vectors can be implemented for x86 platforms supporting AVX2 using a sequence of instructions.
>
> JBS: https://bugs.openjdk.java.net/browse/JDK-8261542
>
> The PerfSliceOrigin.java jmh test attached to the JBS shows the following performance on AVX2 platform.
>
> Before:
> Benchmark                                 (size)   Mode  Cnt   Score   Error   Units
> PerfSliceOrigin.vectorSliceOrigin           1024  thrpt    5  18.887 ± 1.128  ops/ms
> PerfSliceOrigin.vectorSliceUnsliceOrigin    1024  thrpt    5   9.374 ± 0.370  ops/ms
>
> After:
> Benchmark                                 (size)   Mode  Cnt      Score     Error   Units
> PerfSliceOrigin.vectorSliceOrigin           1024  thrpt    5  13861.420 ±  19.071  ops/ms
> PerfSliceOrigin.vectorSliceUnsliceOrigin    1024  thrpt    5   7895.199 ± 142.580  ops/ms

This pull request has now been integrated.

Changeset: c53acc2a
Author:    Sandhya Viswanathan <[hidden email]>
URL:       https://git.openjdk.java.net/jdk/commit/c53acc2a
Stats:     120 lines in 7 files changed: 100 ins; 5 del; 15 mod

8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors

Reviewed-by: kvn, neliasso

-------------

PR: https://git.openjdk.java.net/jdk/pull/2520