Since the vector bitwise `"andNot"` is implemented with `"v1.and(v2.xor(-1))"`, the generated code with SVE looks like:

  mov     z16.b, #-1
  eor     z17.d, z20.d, z16.d
  and     z18.d, z18.d, z17.d

This could be improved with a single instruction:

  bic     z16.d, z16.d, z18.d

Similarly, the following optimization for NEON is also needed:

  not     v21.16b, v21.16b
  and     v21.16b, v21.16b, v18.16b  ==>  bic v21.16b, v18.16b, v21.16b

This patch also adds the following optimization to vector `"not"` for SVE, which has already been added for NEON:

  mov     z16.b, #-1
  eor     z17.d, z20.d, z16.d        ==>  not z17.d, p7/m, z20.d

Performance can improve by about `16% ~ 36%` with NEON for the `"AND_NOT"` benchmark [1].

[1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/jdk/incubator/vector/ByteMaxVector.java#L343

Tested tier1 and jdk:tier3.

-------------

Commit messages:
 - 8264352: AArch64: Optimize vector "not/andNot" for NEON and SVE

Changes: https://git.openjdk.java.net/jdk/pull/3370/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=3370&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8264352
  Stats: 219 lines in 7 files changed: 185 ins; 0 del; 34 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3370.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3370/head:pull/3370

PR: https://git.openjdk.java.net/jdk/pull/3370
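The rewrite described above rests on two bitwise identities: XOR with an all-ones value is a bitwise NOT, and AND with a NOT-ed operand is a "bit clear". The following is a minimal scalar sketch of those identities for a single byte lane. It is only an illustration: the class and method names (`AndNotIdentity`, `andNotViaXor`, and so on) are hypothetical, and the code uses neither the Vector API nor the actual matching rules added by the patch.

```java
public class AndNotIdentity {
    // "not" in the form the description says is generated: XOR with all ones (-1).
    static byte notViaXor(byte b) {
        return (byte) (b ^ -1);
    }

    // The same operation written as a single bitwise NOT.
    static byte notDirect(byte b) {
        return (byte) ~b;
    }

    // "andNot" in the form v1.and(v2.xor(-1)), applied to one lane.
    static byte andNotViaXor(byte a, byte b) {
        return (byte) (a & (b ^ -1));
    }

    // The same operation written as one "bit clear": a AND (NOT b).
    static byte andNotDirect(byte a, byte b) {
        return (byte) (a & ~b);
    }

    // Exhaustively compare both forms over every byte value (and every byte pair).
    static boolean formsAgree() {
        for (int a = Byte.MIN_VALUE; a <= Byte.MAX_VALUE; a++) {
            if (notViaXor((byte) a) != notDirect((byte) a)) {
                return false;
            }
            for (int b = Byte.MIN_VALUE; b <= Byte.MAX_VALUE; b++) {
                if (andNotViaXor((byte) a, (byte) b) != andNotDirect((byte) a, (byte) b)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("forms agree: " + formsAgree());
    }
}
```

Because the two forms agree for every input, a compiler is free to replace the longer sequence with the single instruction, provided the target has one.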
On Wed, 7 Apr 2021 05:53:46 GMT, Xiaohong Gong <[hidden email]> wrote:
> Since the vector bitwise `"andNot"` is implemented with `"v1.and(v2.xor(-1))"` [...]

Looks OK. Is there any test code for this in mainline?

On Wed, 7 Apr 2021 08:31:19 GMT, Andrew Haley <[hidden email]> wrote:
> Looks OK. Is there any test code for this in mainline?

Hi @theRealAph, thanks for looking at this PR. Yes, the Vector API jtreg tests already cover the `NOT` and `AND_NOT` opcodes. Please see the tests for the byte vector:
https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/ByteMaxVectorTests.java#L1708
and
https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/ByteMaxVectorTests.java#L4602

On Wed, 7 Apr 2021 05:53:46 GMT, Xiaohong Gong <[hidden email]> wrote:
> Since the vector bitwise `"andNot"` is implemented with `"v1.and(v2.xor(-1))"` [...]

Marked as reviewed by aph (Reviewer).

On Wed, 7 Apr 2021 05:53:46 GMT, Xiaohong Gong <[hidden email]> wrote:
> Since the vector bitwise `"andNot"` is implemented with `"v1.and(v2.xor(-1))"` [...]

Looks good.

-------------

Marked as reviewed by njian (Committer).

On Wed, 7 Apr 2021 09:03:55 GMT, Andrew Haley <[hidden email]> wrote:
>> Since the vector bitwise `"andNot"` is implemented with `"v1.and(v2.xor(-1))"` [...]
>
> Marked as reviewed by aph (Reviewer).

Thanks for the review, @theRealAph and @nsjian!

On Wed, 7 Apr 2021 05:53:46 GMT, Xiaohong Gong <[hidden email]> wrote:
> Since the vector bitwise `"andNot"` is implemented with `"v1.and(v2.xor(-1))"` [...]

This pull request has now been integrated.

Changeset: e89542fb
Author:    Xiaohong Gong <[hidden email]>
Committer: Ningsheng Jian <[hidden email]>
URL:       https://git.openjdk.java.net/jdk/commit/e89542fb
Stats:     219 lines in 7 files changed: 185 ins; 0 del; 34 mod

8264352: AArch64: Optimize vector "not/andNot" for NEON and SVE

Reviewed-by: aph, njian