RFR: 8262916: Merge LShiftCntV and RShiftCntV into a single node

classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

RFR: 8262916: Merge LShiftCntV and RShiftCntV into a single node

Eric Liu-3
The vector shift count was defined by two separate nodes(LShiftCntV and
RShiftCntV), which would prevent them from being shared when the shift
counts are the same.

public static void test_shiftv(int sh) {
    for (int i = 0; i < N; i+=1) {
        a0[i] = a1[i] << sh;
        b0[i] = b1[i] >> sh;
    }
}

Given the example above, by merging the same shift counts into one
node, they could be shared by shift nodes(RShiftV or LShiftV) like
below:

Before:
1184  LShiftCntV  === _  1189  [[ 1185  ... ]]
1190  RShiftCntV  === _  1189  [[ 1191  ... ]]
1185  LShiftVI  === _  1181  1184  [[ 1186 ]]
1191  RShiftVI  === _  1187  1190  [[ 1192 ]]

After:
1190  ShiftCntV  === _  1189  [[ 1191 1204  ... ]]
1204  LShiftVI  === _  1211  1190  [[ 1203 ]]
1191  RShiftVI  === _  1187  1190  [[ 1192 ]]

The final code could remove one redundant “dup”(scalar->vector),
with one register saved.

Before:
        dup     v16.16b, w12
        dup     v17.16b, w12
        ...
        ldr     q18, [x13, #16]
        sshl    v18.4s, v18.4s, v16.4s
        add     x18, x16, x12           ; iaload

        add     x4, x15, x12
        str     q18, [x4, #16]          ; iastore

        ldr     q18, [x18, #16]
        add     x12, x14, x12
        neg     v19.16b, v17.16b
        sshl    v18.4s, v18.4s, v19.4s
        str     q18, [x12, #16]         ; iastore

After:
        dup v16.16b, w11
        ...
        ldr q17, [x13, #16]
        sshl v17.4s, v17.4s, v16.4s
        add x2, x22, x11            ; iaload

        add x4, x16, x11
        str q17, [x4, #16]          ; iastore

        ldr q17, [x2, #16]
        add x11, x21, x11
        neg v18.16b, v16.16b
        sshl v17.4s, v17.4s, v18.4s
        str q17, [x11, #16]         ; iastore

-------------

Commit messages:
 - 8262916: Merge LShiftCntV and RShiftCntV into a single node

Changes: https://git.openjdk.java.net/jdk/pull/3371/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=3371&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8262916
  Stats: 1494 lines in 14 files changed: 988 ins; 282 del; 224 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3371.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3371/head:pull/3371

PR: https://git.openjdk.java.net/jdk/pull/3371
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262916: Merge LShiftCntV and RShiftCntV into a single node

Dean Long-2
On Wed, 7 Apr 2021 07:28:08 GMT, Eric Liu <[hidden email]> wrote:

> The vector shift count was defined by two separate nodes(LShiftCntV and
> RShiftCntV), which would prevent them from being shared when the shift
> counts are the same.
>
> public static void test_shiftv(int sh) {
>     for (int i = 0; i < N; i+=1) {
>         a0[i] = a1[i] << sh;
>         b0[i] = b1[i] >> sh;
>     }
> }
>
> Given the example above, by merging the same shift counts into one
> node, they could be shared by shift nodes(RShiftV or LShiftV) like
> below:
>
> Before:
> 1184  LShiftCntV  === _  1189  [[ 1185  ... ]]
> 1190  RShiftCntV  === _  1189  [[ 1191  ... ]]
> 1185  LShiftVI  === _  1181  1184  [[ 1186 ]]
> 1191  RShiftVI  === _  1187  1190  [[ 1192 ]]
>
> After:
> 1190  ShiftCntV  === _  1189  [[ 1191 1204  ... ]]
> 1204  LShiftVI  === _  1211  1190  [[ 1203 ]]
> 1191  RShiftVI  === _  1187  1190  [[ 1192 ]]
>
> The final code could remove one redundant “dup”(scalar->vector),
> with one register saved.
>
> Before:
>         dup     v16.16b, w12
>         dup     v17.16b, w12
>         ...
>         ldr     q18, [x13, #16]
>         sshl    v18.4s, v18.4s, v16.4s
>         add     x18, x16, x12           ; iaload
>
>         add     x4, x15, x12
>         str     q18, [x4, #16]          ; iastore
>
>         ldr     q18, [x18, #16]
>         add     x12, x14, x12
>         neg     v19.16b, v17.16b
>         sshl    v18.4s, v18.4s, v19.4s
>         str     q18, [x12, #16]         ; iastore
>
> After:
>         dup v16.16b, w11
>         ...
>         ldr q17, [x13, #16]
>         sshl v17.4s, v17.4s, v16.4s
>         add x2, x22, x11            ; iaload
>
>         add x4, x16, x11
>         str q17, [x4, #16]          ; iastore
>
>         ldr q17, [x2, #16]
>         add x11, x21, x11
>         neg v18.16b, v16.16b
>         sshl v17.4s, v17.4s, v18.4s
>         str q17, [x11, #16]         ; iastore

You should be able to do this without introducing a new node type.  You could change the shift rules to match a vector register like x86.ad and aarch64_sve.ad already do.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3371
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262916: Merge LShiftCntV and RShiftCntV into a single node

Eric Liu-3
On Thu, 8 Apr 2021 05:14:52 GMT, Dean Long <[hidden email]> wrote:

> You should be able to do this without introducing a new node type.

Do you mean just matching them with ReplicateBNode? Indeed they do generate the same instruction. I was wondering about the same but I'm not sure if ReplicateBNode has the same semantic with ShiftCntV in middle-end.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3371
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262916: Merge LShiftCntV and RShiftCntV into a single node

Vladimir Ivanov-2
In reply to this post by Dean Long-2
On Thu, 8 Apr 2021 05:14:52 GMT, Dean Long <[hidden email]> wrote:

> You should be able to do this without introducing a new node type. You could change the shift rules to match a vector register like x86.ad and aarch64_sve.ad already do.

Not sure what you refer to in x86.ad: vector shifts with variable scalar count require the scalar to be placed in XMM register. `ShiftCntV` handles register-to-register move between different register classes (`RegI` and `Vec*`).

Do you suggest reusing some existing vector node (like `Replicate`) to covert the scalar index to vector first? It would have slightly different behavior on x86.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3371
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262916: Merge LShiftCntV and RShiftCntV into a single node

Vladimir Ivanov-2
On Thu, 8 Apr 2021 11:04:49 GMT, Vladimir Ivanov <[hidden email]> wrote:

>> You should be able to do this without introducing a new node type.  You could change the shift rules to match a vector register like x86.ad and aarch64_sve.ad already do.
>
>> You should be able to do this without introducing a new node type. You could change the shift rules to match a vector register like x86.ad and aarch64_sve.ad already do.
>
> Not sure what you refer to in x86.ad: vector shifts with variable scalar count require the scalar to be placed in XMM register. `ShiftCntV` handles register-to-register move between different register classes (`RegI` and `Vec*`).
>
> Do you suggest reusing some existing vector node (like `Replicate`) to covert the scalar index to vector first? It would have slightly different behavior on x86.

Regarding the proposed change itself (`LShiftCntV/RShiftCntV => ShiftCntV`).

Not sure how important it is, but it has an unfortunate change in generated code for right vector shifts on AArch32: instead of sharing the result of index negation at all use sites, negation is performed at every use site now.

As a consequence, in an auto-vectorized loop it will lead to:
- 1 instruction per loop iteration (multiplied by unrolling factor);
- no way to hoist the negation of loop invariant index.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3371
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262916: Merge LShiftCntV and RShiftCntV into a single node

Eric Liu-3
On Thu, 8 Apr 2021 11:19:27 GMT, Vladimir Ivanov <[hidden email]> wrote:

> Regarding the proposed change itself (`LShiftCntV/RShiftCntV => ShiftCntV`).
>
> Not sure how important it is, but it has an unfortunate change in generated code for right vector shifts on AArch32: instead of sharing the result of index negation at all use sites, negation is performed at every use site now.
>
> As a consequence, in an auto-vectorized loop it will lead to:
>
> * 1 instruction per loop iteration (multiplied by unrolling factor);
> * no way to hoist the negation of loop invariant index.

Thanks for your feedback!

I have checked the generated code on AArch32 and it's a shame that 'vneg' is at every use point.

Before:
0xf46c8338:   add fp, r7, fp
0xf46c833c:   vshl.u16  d9, d9, d8
0xf46c8340:   vstr  d9, [fp, #12]
0xf46c8344:   vshl.u16  d9, d10, d8
0xf46c8348:   vstr  d9, [fp, #20]
0xf46c834c:   vshl.u16  d9, d11, d8
0xf46c8350:   vstr  d9, [fp, #28]

After:
0xf4aa1328:   add fp, r7, fp
0xf4aa132c:   vneg.s8 d13, d8
0xf4aa1330:   vshl.u16  d9, d9, d13
0xf4aa1334:   vstr  d9, [fp, #12]
0xf4aa1338:   vneg.s8 d9, d8
0xf4aa133c:   vshl.u16  d9, d10, d9
0xf4aa1340:   vstr  d9, [fp, #20]
0xf4aa1344:   vneg.s8 d9, d8
0xf4aa1348:   vshl.u16  d9, d11, d9
0xf4aa134c:   vstr  d9, [fp, #28]  

I suppose it's more like a trade off that either remaining those two R/LShiftCntV nodes and change AArch64 and X86  to what AArch32 dose, or merging them and accept this defect on AArch32.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3371
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262916: Merge LShiftCntV and RShiftCntV into a single node

Vladimir Ivanov-2
On Thu, 8 Apr 2021 11:52:02 GMT, Eric Liu <[hidden email]> wrote:

>> Regarding the proposed change itself (`LShiftCntV/RShiftCntV => ShiftCntV`).
>>
>> Not sure how important it is, but it has an unfortunate change in generated code for right vector shifts on AArch32: instead of sharing the result of index negation at all use sites, negation is performed at every use site now.
>>
>> As a consequence, in an auto-vectorized loop it will lead to:
>> - 1 instruction per loop iteration (multiplied by unrolling factor);
>> - no way to hoist the negation of loop invariant index.
>
>> Regarding the proposed change itself (`LShiftCntV/RShiftCntV => ShiftCntV`).
>>
>> Not sure how important it is, but it has an unfortunate change in generated code for right vector shifts on AArch32: instead of sharing the result of index negation at all use sites, negation is performed at every use site now.
>>
>> As a consequence, in an auto-vectorized loop it will lead to:
>>
>> * 1 instruction per loop iteration (multiplied by unrolling factor);
>> * no way to hoist the negation of loop invariant index.
>
> Thanks for your feedback!
>
> I have checked the generated code on AArch32 and it's a shame that 'vneg' is at every use point.
>
> Before:
> 0xf46c8338:   add fp, r7, fp
> 0xf46c833c:   vshl.u16  d9, d9, d8
> 0xf46c8340:   vstr  d9, [fp, #12]
> 0xf46c8344:   vshl.u16  d9, d10, d8
> 0xf46c8348:   vstr  d9, [fp, #20]
> 0xf46c834c:   vshl.u16  d9, d11, d8
> 0xf46c8350:   vstr  d9, [fp, #28]
>
> After:
> 0xf4aa1328:   add fp, r7, fp
> 0xf4aa132c:   vneg.s8 d13, d8
> 0xf4aa1330:   vshl.u16  d9, d9, d13
> 0xf4aa1334:   vstr  d9, [fp, #12]
> 0xf4aa1338:   vneg.s8 d9, d8
> 0xf4aa133c:   vshl.u16  d9, d10, d9
> 0xf4aa1340:   vstr  d9, [fp, #20]
> 0xf4aa1344:   vneg.s8 d9, d8
> 0xf4aa1348:   vshl.u16  d9, d11, d9
> 0xf4aa134c:   vstr  d9, [fp, #28]  
>
> I suppose it's more like a trade off that either remaining those two R/LShiftCntV nodes and change AArch64 and X86  to what AArch32 dose, or merging them and accept this defect on AArch32.

`LShiftCntV`/`RShiftCntV` were added specifically for AArch32 and other platforms don't need/benefit from such separation.

Ultimately, I'd like to hear what AArch32 port maintainers think about it, but if there are concerns about performance impact expressed, as an alternative a platform-specific logic can be introduced that adjusts Ideal IR shape for `URShiftV`/`RShiftV` nodes accordingly.

(Not sure whether it is a good trade-off w.r.t. code complexity though.)

-------------

PR: https://git.openjdk.java.net/jdk/pull/3371
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262916: Merge LShiftCntV and RShiftCntV into a single node

Dean Long-2
In reply to this post by Eric Liu-3
On Thu, 8 Apr 2021 11:52:02 GMT, Eric Liu <[hidden email]> wrote:

>> Regarding the proposed change itself (`LShiftCntV/RShiftCntV => ShiftCntV`).
>>
>> Not sure how important it is, but it has an unfortunate change in generated code for right vector shifts on AArch32: instead of sharing the result of index negation at all use sites, negation is performed at every use site now.
>>
>> As a consequence, in an auto-vectorized loop it will lead to:
>> - 1 instruction per loop iteration (multiplied by unrolling factor);
>> - no way to hoist the negation of loop invariant index.
>
>> Regarding the proposed change itself (`LShiftCntV/RShiftCntV => ShiftCntV`).
>>
>> Not sure how important it is, but it has an unfortunate change in generated code for right vector shifts on AArch32: instead of sharing the result of index negation at all use sites, negation is performed at every use site now.
>>
>> As a consequence, in an auto-vectorized loop it will lead to:
>>
>> * 1 instruction per loop iteration (multiplied by unrolling factor);
>> * no way to hoist the negation of loop invariant index.
>
> Thanks for your feedback!
>
> I have checked the generated code on AArch32 and it's a shame that 'vneg' is at every use point.
>
> Before:
> 0xf46c8338:   add fp, r7, fp
> 0xf46c833c:   vshl.u16  d9, d9, d8
> 0xf46c8340:   vstr  d9, [fp, #12]
> 0xf46c8344:   vshl.u16  d9, d10, d8
> 0xf46c8348:   vstr  d9, [fp, #20]
> 0xf46c834c:   vshl.u16  d9, d11, d8
> 0xf46c8350:   vstr  d9, [fp, #28]
>
> After:
> 0xf4aa1328:   add fp, r7, fp
> 0xf4aa132c:   vneg.s8 d13, d8
> 0xf4aa1330:   vshl.u16  d9, d9, d13
> 0xf4aa1334:   vstr  d9, [fp, #12]
> 0xf4aa1338:   vneg.s8 d9, d8
> 0xf4aa133c:   vshl.u16  d9, d10, d9
> 0xf4aa1340:   vstr  d9, [fp, #20]
> 0xf4aa1344:   vneg.s8 d9, d8
> 0xf4aa1348:   vshl.u16  d9, d11, d9
> 0xf4aa134c:   vstr  d9, [fp, #28]  
>
> I suppose it's more like a trade off that either remaining those two R/LShiftCntV nodes and change AArch64 and X86  to what AArch32 dose, or merging them and accept this defect on AArch32.

@theRealELiu @iwanowww Yes, but I was thinking of the `vshiftcnt` rules and didn't realize the `replicate` rules are virtually identical.  Now that I look again, it appears that aarch64 is already doing the same as x86 (converting the scalar shift count to a vector register, and matching the vector register in the rule).  So why is it that x86 can share the shift register, but aarch64 cannot?  Is it because x86 has combined left- and right-shift into the same rule?

-------------

PR: https://git.openjdk.java.net/jdk/pull/3371
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262916: Merge LShiftCntV and RShiftCntV into a single node

Eric Liu-3
In reply to this post by Eric Liu-3
On Thu, 8 Apr 2021 11:52:02 GMT, Eric Liu <[hidden email]> wrote:

>> Regarding the proposed change itself (`LShiftCntV/RShiftCntV => ShiftCntV`).
>>
>> Not sure how important it is, but it has an unfortunate change in generated code for right vector shifts on AArch32: instead of sharing the result of index negation at all use sites, negation is performed at every use site now.
>>
>> As a consequence, in an auto-vectorized loop it will lead to:
>> - 1 instruction per loop iteration (multiplied by unrolling factor);
>> - no way to hoist the negation of loop invariant index.
>
>> Regarding the proposed change itself (`LShiftCntV/RShiftCntV => ShiftCntV`).
>>
>> Not sure how important it is, but it has an unfortunate change in generated code for right vector shifts on AArch32: instead of sharing the result of index negation at all use sites, negation is performed at every use site now.
>>
>> As a consequence, in an auto-vectorized loop it will lead to:
>>
>> * 1 instruction per loop iteration (multiplied by unrolling factor);
>> * no way to hoist the negation of loop invariant index.
>
> Thanks for your feedback!
>
> I have checked the generated code on AArch32 and it's a shame that 'vneg' is at every use point.
>
> Before:
> 0xf46c8338:   add fp, r7, fp
> 0xf46c833c:   vshl.u16  d9, d9, d8
> 0xf46c8340:   vstr  d9, [fp, #12]
> 0xf46c8344:   vshl.u16  d9, d10, d8
> 0xf46c8348:   vstr  d9, [fp, #20]
> 0xf46c834c:   vshl.u16  d9, d11, d8
> 0xf46c8350:   vstr  d9, [fp, #28]
>
> After:
> 0xf4aa1328:   add fp, r7, fp
> 0xf4aa132c:   vneg.s8 d13, d8
> 0xf4aa1330:   vshl.u16  d9, d9, d13
> 0xf4aa1334:   vstr  d9, [fp, #12]
> 0xf4aa1338:   vneg.s8 d9, d8
> 0xf4aa133c:   vshl.u16  d9, d10, d9
> 0xf4aa1340:   vstr  d9, [fp, #20]
> 0xf4aa1344:   vneg.s8 d9, d8
> 0xf4aa1348:   vshl.u16  d9, d11, d9
> 0xf4aa134c:   vstr  d9, [fp, #28]  
>
> I suppose it's more like a trade off that either remaining those two R/LShiftCntV nodes and change AArch64 and X86  to what AArch32 dose, or merging them and accept this defect on AArch32.

> @theRealELiu @iwanowww Yes, but I was thinking of the `vshiftcnt` rules and didn't realize the `replicate` rules are virtually identical. Now that I look again, it appears that aarch64 is already doing the same as x86 (converting the scalar shift count to a vector register, and matching the vector register in the rule). So why is it that x86 can share the shift register, but aarch64 cannot? Is it because x86 has combined left- and right-shift into the same rule?

I suppose x86 has the different kinds of vector shift instructions[1], but AArch64 and AArch32 only have the left one. For this reason, arm should use a "neg-leftshift" pair to implement right shift.

[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp#L1234

-------------

PR: https://git.openjdk.java.net/jdk/pull/3371
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262916: Merge LShiftCntV and RShiftCntV into a single node

Eric Liu-3
In reply to this post by Vladimir Ivanov-2
On Thu, 8 Apr 2021 13:37:31 GMT, Vladimir Ivanov <[hidden email]> wrote:

> `LShiftCntV`/`RShiftCntV` were added specifically for AArch32 and other platforms don't need/benefit from such separation.

I think AArch64 shared the same problem with AArch32 since it only has left shift instruction as well. I checked the generated code on AArch64 with latest master. Test case is TestCharVect2::test_srav[1].

  0x0000ffffa9106c68:   ldr     q17, [x15, #16]
  0x0000ffffa9106c6c:   add     x14, x10, x14
  0x0000ffffa9106c70:   neg     v18.16b, v16.16b
  0x0000ffffa9106c74:   ushl    v17.8h, v17.8h, v18.8h
  0x0000ffffa9106c78:   str     q17, [x14, #16]
  0x0000ffffa9106c7c:   ldr     q17, [x15, #32]
  0x0000ffffa9106c80:   neg     v18.16b, v16.16b
  0x0000ffffa9106c84:   ushl    v17.8h, v17.8h, v18.8h
  0x0000ffffa9106c88:   str     q17, [x14, #32]
  0x0000ffffa9106c8c:   ldr     q17, [x15, #48]
  0x0000ffffa9106c90:   neg     v18.16b, v16.16b
  0x0000ffffa9106c94:   ushl    v17.8h, v17.8h, v18.8h
  0x0000ffffa9106c98:   str     q17, [x14, #48]
  0x0000ffffa9106c9c:   ldr     q17, [x15, #64]
  0x0000ffffa9106ca0:   neg     v18.16b, v16.16b
  0x0000ffffa9106ca4:   ushl    v17.8h, v17.8h, v18.8h
  0x0000ffffa9106ca8:   str     q17, [x14, #64]
  0x0000ffffa9106cac:   ldr     q17, [x15, #80]
  0x0000ffffa9106cb0:   neg     v18.16b, v16.16b
  0x0000ffffa9106cb4:   ushl    v17.8h, v17.8h, v18.8h

It seems that keeping those two RShiftCntV and LShiftCntV is friendly to AArch32/64 in this case, but AArch64 should changed to what AArch32 dose. @theRealAph

[1] https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/codegen/TestCharVect2.java#L1215

-------------

PR: https://git.openjdk.java.net/jdk/pull/3371
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262916: Merge LShiftCntV and RShiftCntV into a single node

Andrew Haley-2
On Fri, 9 Apr 2021 08:50:49 GMT, Eric Liu <[hidden email]> wrote:

>
> It seems that keeping those two RShiftCntV and LShiftCntV is friendly to AArch32/64 in this case, but AArch64 should changed to what AArch32 dose. @theRealAph

Thanks, but it's been a while since I looked at the vector code. Can you point me to the AArch32 patterns in question, to show me the AArch64 changes needed? Thanks.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3371
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262916: Merge LShiftCntV and RShiftCntV into a single node

Eric Liu-3
On Fri, 9 Apr 2021 09:26:19 GMT, Andrew Haley <[hidden email]> wrote:

> > It seems that keeping those two RShiftCntV and LShiftCntV is friendly to AArch32/64 in this case, but AArch64 should changed to what AArch32 dose. @theRealAph
>
> Thanks, but it's been a while since I looked at the vector code. Can you point me to the AArch32 patterns in question, to show me the AArch64 changes needed? Thanks.

AArch32 combinates 'neg' with 'dup' in RShiftCntV[1], which AArch64  has a single 'dup' only[2] and generates 'neg' before every use sites, e.g. RShiftVB[3].

[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/arm/arm.ad#L10451
[2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_neon.ad#L5117
[3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_neon.ad#L5188

-------------

PR: https://git.openjdk.java.net/jdk/pull/3371