In Vector API NEON implementation, we use a vector register to represent vector mask, where an element value of -1 is a true mask and an element value of 0 is a false mask. The allTrue() api is used to check whether all the elements of current mask are set.
Currently, the AArch64 NEON allTrue implementation looks like: andr $tmp, T16B $src1, $src2\t# src2 is maskAllTrue notr $tmp, T16B, $tmp addv $tmp, T16B, $tmp umov $dst, $tmp, B, 0 cmp $dst, 0 cset $dst where $src2 is a preset all true (-1) constant. We can optimize it to the code sequence like below, to check whether all bits are set: uminv $tmp, T16B, $src1 umov $dst, $tmp, B, 0 cmp $dst, 0xff cset $dst With this codegen improvement, we can see about 8%~70% performance uplift on different machines for Alibaba's Vector API bigdata benchmarks [1][2]. Tested with tier1 and vector api jtreg tests. [1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/bigdata/BooleanArrayCheck.java#L61 [2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/bigdata/ValueRangeCheckAndCastL2I.java#L93 ------------- Commit messages: - 8264409: AArch64: generate better code for Vector API allTrue Changes: https://git.openjdk.java.net/jdk/pull/3302/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=3302&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8264409 Stats: 409 lines in 5 files changed: 13 ins; 12 del; 384 mod Patch: https://git.openjdk.java.net/jdk/pull/3302.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/3302/head:pull/3302 PR: https://git.openjdk.java.net/jdk/pull/3302 |
On Thu, 1 Apr 2021 07:58:07 GMT, Ningsheng Jian <[hidden email]> wrote:
> In Vector API NEON implementation, we use a vector register to represent vector mask, where an element value of -1 is a true mask and an element value of 0 is a false mask. The allTrue() api is used to check whether all the elements of current mask are set. > > Currently, the AArch64 NEON allTrue implementation looks like: > > andr $tmp, T16B $src1, $src2\t# src2 is maskAllTrue > notr $tmp, T16B, $tmp > addv $tmp, T16B, $tmp > umov $dst, $tmp, B, 0 > cmp $dst, 0 > cset $dst > > where $src2 is a preset all true (-1) constant. We can optimize it to the code sequence like below, to check whether all bits are set: > > uminv $tmp, T16B, $src1 > umov $dst, $tmp, B, 0 > cmp $dst, 0xff > cset $dst > > With this codegen improvement, we can see about 8%~70% performance uplift on different machines for Alibaba's Vector API bigdata benchmarks [1][2]. > > Tested with tier1 and vector api jtreg tests. > > [1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/bigdata/BooleanArrayCheck.java#L61 > [2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/bigdata/ValueRangeCheckAndCastL2I.java#L93 It's a clever trick to use uminv for this specific case. The patch looks good to me. ------------- Marked as reviewed by adinn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/3302 |
In reply to this post by Ningsheng Jian-2
On Thu, 1 Apr 2021 07:58:07 GMT, Ningsheng Jian <[hidden email]> wrote:
> In Vector API NEON implementation, we use a vector register to represent vector mask, where an element value of -1 is a true mask and an element value of 0 is a false mask. The allTrue() api is used to check whether all the elements of current mask are set. > > Currently, the AArch64 NEON allTrue implementation looks like: > > andr $tmp, T16B $src1, $src2\t# src2 is maskAllTrue > notr $tmp, T16B, $tmp > addv $tmp, T16B, $tmp > umov $dst, $tmp, B, 0 > cmp $dst, 0 > cset $dst > > where $src2 is a preset all true (-1) constant. We can optimize it to the code sequence like below, to check whether all bits are set: > > uminv $tmp, T16B, $src1 > umov $dst, $tmp, B, 0 > cmp $dst, 0xff > cset $dst > > With this codegen improvement, we can see about 8%~70% performance uplift on different machines for Alibaba's Vector API bigdata benchmarks [1][2]. > > Tested with tier1 and vector api jtreg tests. > > [1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/bigdata/BooleanArrayCheck.java#L61 > [2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/bigdata/ValueRangeCheckAndCastL2I.java#L93 Overall looks good to me. (not a reviewer) src/hotspot/cpu/aarch64/aarch64_neon.ad line 3571: > 3569: format %{ "uminv $tmp, T8B, $src1\n\t" > 3570: "umov $dst, $tmp, B, 0\n\t" > 3571: "cmp $dst, 0xff\n\t" I think we should write "#0xff" here. But it looks that all other immediates in format field of aarch64_neon.ad lose the number sign as well. ------------- Marked as reviewed by pli (Committer). PR: https://git.openjdk.java.net/jdk/pull/3302 |
In reply to this post by Andrew Dinn-2
On Thu, 1 Apr 2021 09:02:39 GMT, Andrew Dinn <[hidden email]> wrote:
> It's a clever trick to use uminv for this specific case. > The patch looks good to me. Thank you for the review! @adinn ------------- PR: https://git.openjdk.java.net/jdk/pull/3302 |
In reply to this post by Pengfei Li
On Fri, 2 Apr 2021 02:26:52 GMT, Pengfei Li <[hidden email]> wrote:
>> In Vector API NEON implementation, we use a vector register to represent vector mask, where an element value of -1 is a true mask and an element value of 0 is a false mask. The allTrue() api is used to check whether all the elements of current mask are set. >> >> Currently, the AArch64 NEON allTrue implementation looks like: >> >> andr $tmp, T16B $src1, $src2\t# src2 is maskAllTrue >> notr $tmp, T16B, $tmp >> addv $tmp, T16B, $tmp >> umov $dst, $tmp, B, 0 >> cmp $dst, 0 >> cset $dst >> >> where $src2 is a preset all true (-1) constant. We can optimize it to the code sequence like below, to check whether all bits are set: >> >> uminv $tmp, T16B, $src1 >> umov $dst, $tmp, B, 0 >> cmp $dst, 0xff >> cset $dst >> >> With this codegen improvement, we can see about 8%~70% performance uplift on different machines for Alibaba's Vector API bigdata benchmarks [1][2]. >> >> Tested with tier1 and vector api jtreg tests. >> >> [1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/bigdata/BooleanArrayCheck.java#L61 >> [2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/bigdata/ValueRangeCheckAndCastL2I.java#L93 > > src/hotspot/cpu/aarch64/aarch64_neon.ad line 3571: > >> 3569: format %{ "uminv $tmp, T8B, $src1\n\t" >> 3570: "umov $dst, $tmp, B, 0\n\t" >> 3571: "cmp $dst, 0xff\n\t" > > I think we should write "#0xff" here. But it looks that all other immediates in format field of aarch64_neon.ad lose the number sign as well. Thanks for the review, but I think both are ok. ------------- PR: https://git.openjdk.java.net/jdk/pull/3302 |
In reply to this post by Ningsheng Jian-2
On Thu, 1 Apr 2021 07:58:07 GMT, Ningsheng Jian <[hidden email]> wrote:
> In Vector API NEON implementation, we use a vector register to represent vector mask, where an element value of -1 is a true mask and an element value of 0 is a false mask. The allTrue() api is used to check whether all the elements of current mask are set. > > Currently, the AArch64 NEON allTrue implementation looks like: > > andr $tmp, T16B $src1, $src2\t# src2 is maskAllTrue > notr $tmp, T16B, $tmp > addv $tmp, T16B, $tmp > umov $dst, $tmp, B, 0 > cmp $dst, 0 > cset $dst > > where $src2 is a preset all true (-1) constant. We can optimize it to the code sequence like below, to check whether all bits are set: > > uminv $tmp, T16B, $src1 > umov $dst, $tmp, B, 0 > cmp $dst, 0xff > cset $dst > > With this codegen improvement, we can see about 8%~70% performance uplift on different machines for Alibaba's Vector API bigdata benchmarks [1][2]. > > Tested with tier1 and vector api jtreg tests. > > [1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/bigdata/BooleanArrayCheck.java#L61 > [2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/bigdata/ValueRangeCheckAndCastL2I.java#L93 This pull request has now been integrated. Changeset: 0935eaa4 Author: Ningsheng Jian <[hidden email]> URL: https://git.openjdk.java.net/jdk/commit/0935eaa4 Stats: 409 lines in 5 files changed: 13 ins; 12 del; 384 mod 8264409: AArch64: generate better code for Vector API allTrue Reviewed-by: adinn, pli ------------- PR: https://git.openjdk.java.net/jdk/pull/3302 |
Free forum by Nabble | Edit this page |