The slice and unslice intrinsics for 256-bit byte/short vectors can be implemented for x86 platforms supporting AVX2 using a sequence of instructions.
JBS: https://bugs.openjdk.java.net/browse/JDK-8261542 The PerfSliceOrigin.java jmh test attached to the JBS shows the following performance on AVX2 platform. Before: Benchmark (size) Mode Cnt Score Error Units PerfSliceOrigin.vectorSliceOrigin 1024 thrpt 5 18.887 ± 1.128 ops/ms PerfSliceOrigin.vectorSliceUnsliceOrigin 1024 thrpt 5 9.374 ± 0.370 ops/ms After: Benchmark (size) Mode Cnt Score Error Units PerfSliceOrigin.vectorSliceOrigin 1024 thrpt 5 13861.420 ± 19.071 ops/ms PerfSliceOrigin.vectorSliceUnsliceOrigin 1024 thrpt 5 7895.199 ± 142.580 ops/ms ------------- Commit messages: - 8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors Changes: https://git.openjdk.java.net/jdk/pull/2520/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=2520&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8261542 Stats: 119 lines in 7 files changed: 99 ins; 5 del; 15 mod Patch: https://git.openjdk.java.net/jdk/pull/2520.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/2520/head:pull/2520 PR: https://git.openjdk.java.net/jdk/pull/2520 |
On Thu, 18 Feb 2021 19:14:37 GMT, Vladimir Kozlov <[hidden email]> wrote:
>> The slice and unslice intrinsics for 256-bit byte/short vectors can be implemented for x86 platforms supporting AVX2 using a sequence of instructions. >> >> JBS: https://bugs.openjdk.java.net/browse/JDK-8261542 >> >> The PerfSliceOrigin.java jmh test attached to the JBS shows the following performance on AVX2 platform. >> >> Before: >> Benchmark (size) Mode Cnt Score Error Units >> PerfSliceOrigin.vectorSliceOrigin 1024 thrpt 5 18.887 ± 1.128 ops/ms >> PerfSliceOrigin.vectorSliceUnsliceOrigin 1024 thrpt 5 9.374 ± 0.370 ops/ms >> >> After: >> Benchmark (size) Mode Cnt Score Error Units >> PerfSliceOrigin.vectorSliceOrigin 1024 thrpt 5 13861.420 ± 19.071 ops/ms >> PerfSliceOrigin.vectorSliceUnsliceOrigin 1024 thrpt 5 7895.199 ± 142.580 ops/ms > > Please, add a test which verifies correctness of results when this code is used. If we don't have it already. @vnkozlov thanks a lot for the review. The test for slice and unslice are already part of test/jdk/jdk/incubator/vector/Byte256VectorTests.java and Short256VectorTests.java. > src/hotspot/cpu/x86/x86.ad line 7550: > >> 7548: // only byte shuffle instruction available on these platforms >> 7549: int vlen_in_bytes = vector_length_in_bytes(this); >> 7550: if (UseAVX == 0) { > > This code will not be executed with vector length 16 because match_rule_supported_vector() bailout with (size_in_bits == 256 && UseAVX < 2). Yes you are right, but this code will execute for vector length 16 when UseAVX ==2. It will also execure for vector length 16 when UseAVX == 3 && !VM_Version::supports_avx512bw. > src/hotspot/cpu/x86/x86.ad line 7506: > >> 7504: instruct rearrangeB_avx(legVec dst, legVec src, vec shuffle, legVec vtmp1, legVec vtmp2, rRegP scratch) %{ >> 7505: predicate(vector_element_basic_type(n) == T_BYTE && >> 7506: vector_length(n) == 32 && !VM_Version::supports_avx512_vbmi()); > > Predicate matches bail-out condition in match_rule_supported_vector(). Does it mean this code never used before? > So you are implementing it now. Right? Yes, this rule was not used before and I am implementing it now. ------------- PR: https://git.openjdk.java.net/jdk/pull/2520 |
In reply to this post by Sandhya Viswanathan
On Thu, 18 Feb 2021 19:14:37 GMT, Vladimir Kozlov <[hidden email]> wrote:
>> The slice and unslice intrinsics for 256-bit byte/short vectors can be implemented for x86 platforms supporting AVX2 using a sequence of instructions. >> >> JBS: https://bugs.openjdk.java.net/browse/JDK-8261542 >> >> The PerfSliceOrigin.java jmh test attached to the JBS shows the following performance on AVX2 platform. >> >> Before: >> Benchmark (size) Mode Cnt Score Error Units >> PerfSliceOrigin.vectorSliceOrigin 1024 thrpt 5 18.887 ± 1.128 ops/ms >> PerfSliceOrigin.vectorSliceUnsliceOrigin 1024 thrpt 5 9.374 ± 0.370 ops/ms >> >> After: >> Benchmark (size) Mode Cnt Score Error Units >> PerfSliceOrigin.vectorSliceOrigin 1024 thrpt 5 13861.420 ± 19.071 ops/ms >> PerfSliceOrigin.vectorSliceUnsliceOrigin 1024 thrpt 5 7895.199 ± 142.580 ops/ms > > Please, add a test which verifies correctness of results when this code is used. If we don't have it already. > @vnkozlov thanks a lot for the review. > The test for slice and unslice are already part of test/jdk/jdk/incubator/vector/Byte256VectorTests.java and Short256VectorTests.java. Good. ------------- PR: https://git.openjdk.java.net/jdk/pull/2520 |
In reply to this post by Sandhya Viswanathan
On Thu, 18 Feb 2021 21:21:49 GMT, Sandhya Viswanathan <[hidden email]> wrote:
>> src/hotspot/cpu/x86/x86.ad line 7550: >> >>> 7548: // only byte shuffle instruction available on these platforms >>> 7549: int vlen_in_bytes = vector_length_in_bytes(this); >>> 7550: if (UseAVX == 0) { >> >> This code will not be executed with vector length 16 because match_rule_supported_vector() bailout with (size_in_bits == 256 && UseAVX < 2). > > Yes you are right, but this code will execute for vector length 16 when UseAVX ==2. > It will also execure for vector length 16 when UseAVX == 3 && > !VM_Version::supports_avx512bw. Next assert checks <= 16 when code is guarded by (UseAVX == 0). It is not (UseAVX ==2). Also } else { case is for UseAVX > 0 which includes AVX=1 but vpaddb() (avx3) is used there. Seems UseAVX checks wrong here. ------------- PR: https://git.openjdk.java.net/jdk/pull/2520 |
In reply to this post by Sandhya Viswanathan
> The slice and unslice intrinsics for 256-bit byte/short vectors can be implemented for x86 platforms supporting AVX2 using a sequence of instructions.
> > JBS: https://bugs.openjdk.java.net/browse/JDK-8261542 > > The PerfSliceOrigin.java jmh test attached to the JBS shows the following performance on AVX2 platform. > > Before: > Benchmark (size) Mode Cnt Score Error Units > PerfSliceOrigin.vectorSliceOrigin 1024 thrpt 5 18.887 ± 1.128 ops/ms > PerfSliceOrigin.vectorSliceUnsliceOrigin 1024 thrpt 5 9.374 ± 0.370 ops/ms > > After: > Benchmark (size) Mode Cnt Score Error Units > PerfSliceOrigin.vectorSliceOrigin 1024 thrpt 5 13861.420 ± 19.071 ops/ms > PerfSliceOrigin.vectorSliceUnsliceOrigin 1024 thrpt 5 7895.199 ± 142.580 ops/ms Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: corrected assert ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/2520/files - new: https://git.openjdk.java.net/jdk/pull/2520/files/77324374..55165cb5 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2520&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2520&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/2520.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/2520/head:pull/2520 PR: https://git.openjdk.java.net/jdk/pull/2520 |
In reply to this post by Sandhya Viswanathan
> The slice and unslice intrinsics for 256-bit byte/short vectors can be implemented for x86 platforms supporting AVX2 using a sequence of instructions.
> > JBS: https://bugs.openjdk.java.net/browse/JDK-8261542 > > The PerfSliceOrigin.java jmh test attached to the JBS shows the following performance on AVX2 platform. > > Before: > Benchmark (size) Mode Cnt Score Error Units > PerfSliceOrigin.vectorSliceOrigin 1024 thrpt 5 18.887 ± 1.128 ops/ms > PerfSliceOrigin.vectorSliceUnsliceOrigin 1024 thrpt 5 9.374 ± 0.370 ops/ms > > After: > Benchmark (size) Mode Cnt Score Error Units > PerfSliceOrigin.vectorSliceOrigin 1024 thrpt 5 13861.420 ± 19.071 ops/ms > PerfSliceOrigin.vectorSliceUnsliceOrigin 1024 thrpt 5 7895.199 ± 142.580 ops/ms Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: corrected assert ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/2520/files - new: https://git.openjdk.java.net/jdk/pull/2520/files/55165cb5..fa13679a Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2520&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2520&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/2520.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/2520/head:pull/2520 PR: https://git.openjdk.java.net/jdk/pull/2520 |
On Fri, 19 Feb 2021 01:23:04 GMT, Sandhya Viswanathan <[hidden email]> wrote:
>> The slice and unslice intrinsics for 256-bit byte/short vectors can be implemented for x86 platforms supporting AVX2 using a sequence of instructions. >> >> JBS: https://bugs.openjdk.java.net/browse/JDK-8261542 >> >> The PerfSliceOrigin.java jmh test attached to the JBS shows the following performance on AVX2 platform. >> >> Before: >> Benchmark (size) Mode Cnt Score Error Units >> PerfSliceOrigin.vectorSliceOrigin 1024 thrpt 5 18.887 ± 1.128 ops/ms >> PerfSliceOrigin.vectorSliceUnsliceOrigin 1024 thrpt 5 9.374 ± 0.370 ops/ms >> >> After: >> Benchmark (size) Mode Cnt Score Error Units >> PerfSliceOrigin.vectorSliceOrigin 1024 thrpt 5 13861.420 ± 19.071 ops/ms >> PerfSliceOrigin.vectorSliceUnsliceOrigin 1024 thrpt 5 7895.199 ± 142.580 ops/ms > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > corrected assert src/hotspot/cpu/x86/x86.ad line 1695: > 1693: if(vlen == 2) { > 1694: return false; // Implementation limitation due to how shuffle is loaded > 1695: } else if (size_in_bits == 256 && UseAVX < 2) { Should this be >= 256? ------------- PR: https://git.openjdk.java.net/jdk/pull/2520 |
In reply to this post by Vladimir Kozlov-2
On Fri, 19 Feb 2021 01:27:57 GMT, Sandhya Viswanathan <[hidden email]> wrote:
>> Next assert checks <= 16 when code is guarded by (UseAVX == 0). It is not (UseAVX ==2). >> Also } else { case is for UseAVX > 0 which includes AVX=1 but vpaddb() (avx3) is used there. >> Seems UseAVX checks wrong here. > > The assert checks for vlen_in_bytes <= 16 (128 bits) and so is a correct check for UseAVX=0. > vpaddb is supported on AVX1/AVX2 as well. > vpaddb is supported on AVX1 for up to 128 bit and > on AVX2 for upto 256 bit and > on AVX3 (512) for upto 512 bit vectors. > I have tested this for UseAVX=0, UseAVX=1, UseAVX=2, UseAVX=3 platform. > > The check is for UseAVX as with any flavor of AVX, we can use less number of instructions to do this operation. > This is because AVX allows destination to be separate from both the sources. > > Please let me know if I am missing something. My bad - I missed that size is in bytes in assert. The assert is correct, as you said. And `} else {` part works for AVX1 because of match_rule_supported_vector() bailout 256-bit case. May be add assert(UseAVX > 1 || vlen_in_bytes <= 16, ). I only have one question left - about check >= 256 in match_rule_supported_vector() ------------- PR: https://git.openjdk.java.net/jdk/pull/2520 |
In reply to this post by Vladimir Kozlov-2
On Fri, 19 Feb 2021 01:56:19 GMT, Vladimir Kozlov <[hidden email]> wrote:
>> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: >> >> corrected assert > > src/hotspot/cpu/x86/x86.ad line 1695: > >> 1693: if(vlen == 2) { >> 1694: return false; // Implementation limitation due to how shuffle is loaded >> 1695: } else if (size_in_bits == 256 && UseAVX < 2) { > > Should this be >= 256? The general >= 256 part is taken care of early on in match_rule_supported_vector as below: if (!vector_size_supported(bt, vlen)) { return false; } The only additional check that is being done here is for float and double 256 bit vectors that are supported on AVX=1 and will pass the vector_size_supported check. This is because the VectorLoadShuffle cannot be performed for 256 bit vectors on AVX1 platform as it needs "integer" 256 bit instructions which are only available on AVX2. ------------- PR: https://git.openjdk.java.net/jdk/pull/2520 |
In reply to this post by Sandhya Viswanathan
> The slice and unslice intrinsics for 256-bit byte/short vectors can be implemented for x86 platforms supporting AVX2 using a sequence of instructions.
> > JBS: https://bugs.openjdk.java.net/browse/JDK-8261542 > > The PerfSliceOrigin.java jmh test attached to the JBS shows the following performance on AVX2 platform. > > Before: > Benchmark (size) Mode Cnt Score Error Units > PerfSliceOrigin.vectorSliceOrigin 1024 thrpt 5 18.887 ± 1.128 ops/ms > PerfSliceOrigin.vectorSliceUnsliceOrigin 1024 thrpt 5 9.374 ± 0.370 ops/ms > > After: > Benchmark (size) Mode Cnt Score Error Units > PerfSliceOrigin.vectorSliceOrigin 1024 thrpt 5 13861.420 ± 19.071 ops/ms > PerfSliceOrigin.vectorSliceUnsliceOrigin 1024 thrpt 5 7895.199 ± 142.580 ops/ms Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: add assert on else path ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/2520/files - new: https://git.openjdk.java.net/jdk/pull/2520/files/fa13679a..ad3ab2b1 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2520&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2520&range=02-03 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/2520.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/2520/head:pull/2520 PR: https://git.openjdk.java.net/jdk/pull/2520 |
In reply to this post by Vladimir Kozlov-2
On Fri, 19 Feb 2021 02:05:02 GMT, Vladimir Kozlov <[hidden email]> wrote:
>> The assert checks for vlen_in_bytes <= 16 (128 bits) and so is a correct check for UseAVX=0. >> vpaddb is supported on AVX1/AVX2 as well. >> vpaddb is supported on AVX1 for up to 128 bit and >> on AVX2 for upto 256 bit and >> on AVX3 (512) for upto 512 bit vectors. >> I have tested this for UseAVX=0, UseAVX=1, UseAVX=2, UseAVX=3 platform. >> >> The check is for UseAVX as with any flavor of AVX, we can use less number of instructions to do this operation. >> This is because AVX allows destination to be separate from both the sources. >> >> Please let me know if I am missing something. > > My bad - I missed that size is in bytes in assert. The assert is correct, as you said. > And `} else {` part works for AVX1 because of match_rule_supported_vector() bailout 256-bit case. > May be add assert(UseAVX > 1 || vlen_in_bytes <= 16, ). > > I only have one question left - about check >= 256 in match_rule_supported_vector() Added the following assert on else path: + assert(UseAVX > 1 || vlen_in_bytes <= 16, "required"); ------------- PR: https://git.openjdk.java.net/jdk/pull/2520 |
In reply to this post by Sandhya Viswanathan
On Fri, 19 Feb 2021 03:20:59 GMT, Sandhya Viswanathan <[hidden email]> wrote:
>> The slice and unslice intrinsics for 256-bit byte/short vectors can be implemented for x86 platforms supporting AVX2 using a sequence of instructions. >> >> JBS: https://bugs.openjdk.java.net/browse/JDK-8261542 >> >> The PerfSliceOrigin.java jmh test attached to the JBS shows the following performance on AVX2 platform. >> >> Before: >> Benchmark (size) Mode Cnt Score Error Units >> PerfSliceOrigin.vectorSliceOrigin 1024 thrpt 5 18.887 ± 1.128 ops/ms >> PerfSliceOrigin.vectorSliceUnsliceOrigin 1024 thrpt 5 9.374 ± 0.370 ops/ms >> >> After: >> Benchmark (size) Mode Cnt Score Error Units >> PerfSliceOrigin.vectorSliceOrigin 1024 thrpt 5 13861.420 ± 19.071 ops/ms >> PerfSliceOrigin.vectorSliceUnsliceOrigin 1024 thrpt 5 7895.199 ± 142.580 ops/ms > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > add assert on else path Looks good. ------------- Marked as reviewed by neliasso (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/2520 |
In reply to this post by Sandhya Viswanathan
On Thu, 11 Feb 2021 02:37:35 GMT, Sandhya Viswanathan <[hidden email]> wrote:
> The slice and unslice intrinsics for 256-bit byte/short vectors can be implemented for x86 platforms supporting AVX2 using a sequence of instructions. > > JBS: https://bugs.openjdk.java.net/browse/JDK-8261542 > > The PerfSliceOrigin.java jmh test attached to the JBS shows the following performance on AVX2 platform. > > Before: > Benchmark (size) Mode Cnt Score Error Units > PerfSliceOrigin.vectorSliceOrigin 1024 thrpt 5 18.887 ± 1.128 ops/ms > PerfSliceOrigin.vectorSliceUnsliceOrigin 1024 thrpt 5 9.374 ± 0.370 ops/ms > > After: > Benchmark (size) Mode Cnt Score Error Units > PerfSliceOrigin.vectorSliceOrigin 1024 thrpt 5 13861.420 ± 19.071 ops/ms > PerfSliceOrigin.vectorSliceUnsliceOrigin 1024 thrpt 5 7895.199 ± 142.580 ops/ms This pull request has now been integrated. Changeset: c53acc2a Author: Sandhya Viswanathan <[hidden email]> URL: https://git.openjdk.java.net/jdk/commit/c53acc2a Stats: 120 lines in 7 files changed: 100 ins; 5 del; 15 mod 8261542: X86 slice and unslice intrinsics for 256-bit byte/short vectors Reviewed-by: kvn, neliasso ------------- PR: https://git.openjdk.java.net/jdk/pull/2520 |
Free forum by Nabble | Edit this page |