Re: RFR: 8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors

Vladimir Kozlov-2
On Thu, 11 Feb 2021 02:37:35 GMT, Sandhya Viswanathan <[hidden email]> wrote:

> The slice and unslice intrinsics for 256-bit byte/short vectors can be implemented for x86 platforms supporting AVX2 using a sequence of instructions.
>
> JBS: https://bugs.openjdk.java.net/browse/JDK-8261542
>
> The PerfSliceOrigin.java jmh test attached to the JBS shows the following performance on AVX2 platform.
>
> Before:
> Benchmark                                 (size)   Mode  Cnt   Score   Error   Units
> PerfSliceOrigin.vectorSliceOrigin           1024  thrpt    5  18.887 ± 1.128  ops/ms
> PerfSliceOrigin.vectorSliceUnsliceOrigin    1024  thrpt    5   9.374 ± 0.370  ops/ms
>
> After:
> Benchmark                                 (size)   Mode  Cnt      Score     Error   Units
> PerfSliceOrigin.vectorSliceOrigin           1024  thrpt    5  13861.420 ±  19.071  ops/ms
> PerfSliceOrigin.vectorSliceUnsliceOrigin    1024  thrpt    5   7895.199 ± 142.580  ops/ms

Please, add a test which verifies correctness of results when this code is used. If we don't have it already.

src/hotspot/cpu/x86/x86.ad line 7506:

> 7504: instruct rearrangeB_avx(legVec dst, legVec src, vec shuffle, legVec vtmp1, legVec vtmp2, rRegP scratch) %{
> 7505:   predicate(vector_element_basic_type(n) == T_BYTE &&
> 7506:             vector_length(n) == 32 && !VM_Version::supports_avx512_vbmi());

Predicate matches bail-out condition in match_rule_supported_vector(). Does it mean this code never used before?
So you are implementing it now. Right?

src/hotspot/cpu/x86/x86.ad line 7550:

> 7548:     // only byte shuffle instruction available on these platforms
> 7549:     int vlen_in_bytes = vector_length_in_bytes(this);
> 7550:     if (UseAVX == 0) {

This code will not be executed with vector length 16 because match_rule_supported_vector() bailout with (size_in_bits == 256 && UseAVX < 2).

-------------

Changes requested by kvn (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/2520
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors [v3]

Sandhya Viswanathan
On Thu, 18 Feb 2021 23:31:28 GMT, Vladimir Kozlov <[hidden email]> wrote:

>> Yes you are right, but this code will execute for vector length 16 when UseAVX ==2.
>>  It will also execure for vector length 16 when UseAVX == 3 &&
>> !VM_Version::supports_avx512bw.
>
> Next assert checks <= 16 when code is guarded by (UseAVX == 0). It is not (UseAVX ==2).
> Also } else { case is for UseAVX > 0 which includes AVX=1 but vpaddb() (avx3) is used there.
> Seems UseAVX checks wrong here.

The assert checks for  vlen_in_bytes <= 16 (128 bits) and so is a correct check for UseAVX=0.
vpaddb is supported on  AVX1/AVX2 as well.
vpaddb is supported on AVX1 for up to 128 bit and
on AVX2 for upto 256 bit and
on AVX3 (512) for upto 512 bit vectors.
I have tested this for UseAVX=0, UseAVX=1, UseAVX=2, UseAVX=3 platform.

The check is for UseAVX as with any flavor of AVX, we can use less number of instructions to do this operation.
This is because AVX allows destination to be separate from both the sources.

Please let me know if I am missing something.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2520
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors [v3]

Vladimir Kozlov-2
On Fri, 19 Feb 2021 02:30:58 GMT, Sandhya Viswanathan <[hidden email]> wrote:

>> src/hotspot/cpu/x86/x86.ad line 1695:
>>
>>> 1693:       if(vlen == 2) {
>>> 1694:         return false; // Implementation limitation due to how shuffle is loaded
>>> 1695:       } else if (size_in_bits == 256 && UseAVX < 2) {
>>
>> Should this be >= 256?
>
> The general >= 256 part is taken care of early on in match_rule_supported_vector as below:
>   if (!vector_size_supported(bt, vlen)) {
>     return false;
>   }
> The only additional check that is being done here is for float and double  256 bit vectors that are supported on AVX=1 and will pass the vector_size_supported check.
> This is because the VectorLoadShuffle cannot be performed for 256 bit vectors on AVX1 platform as it needs "integer" 256 bit instructions which are only available on AVX2.

Okay.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2520
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors [v4]

Vladimir Kozlov-2
In reply to this post by Sandhya Viswanathan
On Fri, 19 Feb 2021 03:17:56 GMT, Sandhya Viswanathan <[hidden email]> wrote:

>> My bad - I missed that size is in bytes in assert. The assert is correct, as you said.
>> And `} else {` part works for AVX1 because of match_rule_supported_vector() bailout 256-bit case.
>> May be add assert(UseAVX > 1 || vlen_in_bytes <= 16, ).
>>
>> I only have one question left - about check >= 256 in  match_rule_supported_vector()
>
> Added the following assert on else path:
> +      assert(UseAVX > 1 || vlen_in_bytes <= 16, "required");

Good.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2520