RFR: 8264409: AArch64: generate better code for Vector API allTrue

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

RFR: 8264409: AArch64: generate better code for Vector API allTrue

Ningsheng Jian-2
In Vector API NEON implementation, we use a vector register to represent vector mask, where an element value of -1 is a true mask and an element value of 0 is a false mask. The allTrue() api is used to check whether all the elements of current mask are set.

Currently, the AArch64 NEON allTrue implementation looks like:

  andr  $tmp, T16B $src1, $src2\t# src2 is maskAllTrue
  notr  $tmp, T16B, $tmp
  addv  $tmp, T16B, $tmp
  umov  $dst, $tmp, B, 0
  cmp   $dst, 0
  cset  $dst

where $src2 is a preset all true (-1) constant. We can optimize it to the code sequence like below, to check whether all bits are set:

  uminv $tmp, T16B, $src1
  umov  $dst, $tmp, B, 0
  cmp   $dst, 0xff
  cset  $dst

With this codegen improvement, we can see about 8%~70% performance uplift on different machines for Alibaba's Vector API bigdata benchmarks [1][2].

Tested with tier1 and vector api jtreg tests.

[1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/bigdata/BooleanArrayCheck.java#L61
[2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/bigdata/ValueRangeCheckAndCastL2I.java#L93

-------------

Commit messages:
 - 8264409: AArch64: generate better code for Vector API allTrue

Changes: https://git.openjdk.java.net/jdk/pull/3302/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=3302&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8264409
  Stats: 409 lines in 5 files changed: 13 ins; 12 del; 384 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3302.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3302/head:pull/3302

PR: https://git.openjdk.java.net/jdk/pull/3302
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264409: AArch64: generate better code for Vector API allTrue

Andrew Dinn-2
On Thu, 1 Apr 2021 07:58:07 GMT, Ningsheng Jian <[hidden email]> wrote:

> In Vector API NEON implementation, we use a vector register to represent vector mask, where an element value of -1 is a true mask and an element value of 0 is a false mask. The allTrue() api is used to check whether all the elements of current mask are set.
>
> Currently, the AArch64 NEON allTrue implementation looks like:
>
>   andr  $tmp, T16B $src1, $src2\t# src2 is maskAllTrue
>   notr  $tmp, T16B, $tmp
>   addv  $tmp, T16B, $tmp
>   umov  $dst, $tmp, B, 0
>   cmp   $dst, 0
>   cset  $dst
>
> where $src2 is a preset all true (-1) constant. We can optimize it to the code sequence like below, to check whether all bits are set:
>
>   uminv $tmp, T16B, $src1
>   umov  $dst, $tmp, B, 0
>   cmp   $dst, 0xff
>   cset  $dst
>
> With this codegen improvement, we can see about 8%~70% performance uplift on different machines for Alibaba's Vector API bigdata benchmarks [1][2].
>
> Tested with tier1 and vector api jtreg tests.
>
> [1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/bigdata/BooleanArrayCheck.java#L61
> [2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/bigdata/ValueRangeCheckAndCastL2I.java#L93

It's a clever trick to use uminv for this specific case.
The patch looks good to me.

-------------

Marked as reviewed by adinn (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/3302
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264409: AArch64: generate better code for Vector API allTrue

Pengfei Li
In reply to this post by Ningsheng Jian-2
On Thu, 1 Apr 2021 07:58:07 GMT, Ningsheng Jian <[hidden email]> wrote:

> In Vector API NEON implementation, we use a vector register to represent vector mask, where an element value of -1 is a true mask and an element value of 0 is a false mask. The allTrue() api is used to check whether all the elements of current mask are set.
>
> Currently, the AArch64 NEON allTrue implementation looks like:
>
>   andr  $tmp, T16B $src1, $src2\t# src2 is maskAllTrue
>   notr  $tmp, T16B, $tmp
>   addv  $tmp, T16B, $tmp
>   umov  $dst, $tmp, B, 0
>   cmp   $dst, 0
>   cset  $dst
>
> where $src2 is a preset all true (-1) constant. We can optimize it to the code sequence like below, to check whether all bits are set:
>
>   uminv $tmp, T16B, $src1
>   umov  $dst, $tmp, B, 0
>   cmp   $dst, 0xff
>   cset  $dst
>
> With this codegen improvement, we can see about 8%~70% performance uplift on different machines for Alibaba's Vector API bigdata benchmarks [1][2].
>
> Tested with tier1 and vector api jtreg tests.
>
> [1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/bigdata/BooleanArrayCheck.java#L61
> [2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/bigdata/ValueRangeCheckAndCastL2I.java#L93

Overall looks good to me. (not a reviewer)

src/hotspot/cpu/aarch64/aarch64_neon.ad line 3571:

> 3569:   format %{ "uminv $tmp, T8B, $src1\n\t"
> 3570:             "umov  $dst, $tmp, B, 0\n\t"
> 3571:             "cmp   $dst, 0xff\n\t"

I think we should write "#0xff" here. But it looks that all other immediates in format field of aarch64_neon.ad lose the number sign as well.

-------------

Marked as reviewed by pli (Committer).

PR: https://git.openjdk.java.net/jdk/pull/3302
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264409: AArch64: generate better code for Vector API allTrue

Ningsheng Jian-2
In reply to this post by Andrew Dinn-2
On Thu, 1 Apr 2021 09:02:39 GMT, Andrew Dinn <[hidden email]> wrote:

> It's a clever trick to use uminv for this specific case.
> The patch looks good to me.

Thank you for the review! @adinn

-------------

PR: https://git.openjdk.java.net/jdk/pull/3302
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264409: AArch64: generate better code for Vector API allTrue

Ningsheng Jian-2
In reply to this post by Pengfei Li
On Fri, 2 Apr 2021 02:26:52 GMT, Pengfei Li <[hidden email]> wrote:

>> In Vector API NEON implementation, we use a vector register to represent vector mask, where an element value of -1 is a true mask and an element value of 0 is a false mask. The allTrue() api is used to check whether all the elements of current mask are set.
>>
>> Currently, the AArch64 NEON allTrue implementation looks like:
>>
>>   andr  $tmp, T16B $src1, $src2\t# src2 is maskAllTrue
>>   notr  $tmp, T16B, $tmp
>>   addv  $tmp, T16B, $tmp
>>   umov  $dst, $tmp, B, 0
>>   cmp   $dst, 0
>>   cset  $dst
>>
>> where $src2 is a preset all true (-1) constant. We can optimize it to the code sequence like below, to check whether all bits are set:
>>
>>   uminv $tmp, T16B, $src1
>>   umov  $dst, $tmp, B, 0
>>   cmp   $dst, 0xff
>>   cset  $dst
>>
>> With this codegen improvement, we can see about 8%~70% performance uplift on different machines for Alibaba's Vector API bigdata benchmarks [1][2].
>>
>> Tested with tier1 and vector api jtreg tests.
>>
>> [1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/bigdata/BooleanArrayCheck.java#L61
>> [2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/bigdata/ValueRangeCheckAndCastL2I.java#L93
>
> src/hotspot/cpu/aarch64/aarch64_neon.ad line 3571:
>
>> 3569:   format %{ "uminv $tmp, T8B, $src1\n\t"
>> 3570:             "umov  $dst, $tmp, B, 0\n\t"
>> 3571:             "cmp   $dst, 0xff\n\t"
>
> I think we should write "#0xff" here. But it looks that all other immediates in format field of aarch64_neon.ad lose the number sign as well.

Thanks for the review, but I think both are ok.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3302
Reply | Threaded
Open this post in threaded view
|

Integrated: 8264409: AArch64: generate better code for Vector API allTrue

Ningsheng Jian-2
In reply to this post by Ningsheng Jian-2
On Thu, 1 Apr 2021 07:58:07 GMT, Ningsheng Jian <[hidden email]> wrote:

> In Vector API NEON implementation, we use a vector register to represent vector mask, where an element value of -1 is a true mask and an element value of 0 is a false mask. The allTrue() api is used to check whether all the elements of current mask are set.
>
> Currently, the AArch64 NEON allTrue implementation looks like:
>
>   andr  $tmp, T16B $src1, $src2\t# src2 is maskAllTrue
>   notr  $tmp, T16B, $tmp
>   addv  $tmp, T16B, $tmp
>   umov  $dst, $tmp, B, 0
>   cmp   $dst, 0
>   cset  $dst
>
> where $src2 is a preset all true (-1) constant. We can optimize it to the code sequence like below, to check whether all bits are set:
>
>   uminv $tmp, T16B, $src1
>   umov  $dst, $tmp, B, 0
>   cmp   $dst, 0xff
>   cset  $dst
>
> With this codegen improvement, we can see about 8%~70% performance uplift on different machines for Alibaba's Vector API bigdata benchmarks [1][2].
>
> Tested with tier1 and vector api jtreg tests.
>
> [1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/bigdata/BooleanArrayCheck.java#L61
> [2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/bigdata/ValueRangeCheckAndCastL2I.java#L93

This pull request has now been integrated.

Changeset: 0935eaa4
Author:    Ningsheng Jian <[hidden email]>
URL:       https://git.openjdk.java.net/jdk/commit/0935eaa4
Stats:     409 lines in 5 files changed: 13 ins; 12 del; 384 mod

8264409: AArch64: generate better code for Vector API allTrue

Reviewed-by: adinn, pli

-------------

PR: https://git.openjdk.java.net/jdk/pull/3302