The vector shift count was defined by two separate nodes (LShiftCntV and
RShiftCntV), which would prevent them from being shared when the shift
counts are the same.

    public static void test_shiftv(int sh) {
        for (int i = 0; i < N; i += 1) {
            a0[i] = a1[i] << sh;
            b0[i] = b1[i] >> sh;
        }
    }

Given the example above, by merging identical shift counts into one node,
the count can be shared by the shift nodes (RShiftV or LShiftV) as shown
below.

Before:
    1184  LShiftCntV  === _ 1189  [[ 1185 ... ]]
    1190  RShiftCntV  === _ 1189  [[ 1191 ... ]]
    1185  LShiftVI    === _ 1181 1184  [[ 1186 ]]
    1191  RShiftVI    === _ 1187 1190  [[ 1192 ]]

After:
    1190  ShiftCntV   === _ 1189  [[ 1191 1204 ... ]]
    1204  LShiftVI    === _ 1211 1190  [[ 1203 ]]
    1191  RShiftVI    === _ 1187 1190  [[ 1192 ]]

The final code removes one redundant "dup" (scalar->vector), saving one
register.

Before:
    dup   v16.16b, w12
    dup   v17.16b, w12
    ...
    ldr   q18, [x13, #16]
    sshl  v18.4s, v18.4s, v16.4s
    add   x18, x16, x12        ; iaload

    add   x4, x15, x12
    str   q18, [x4, #16]       ; iastore

    ldr   q18, [x18, #16]
    add   x12, x14, x12
    neg   v19.16b, v17.16b
    sshl  v18.4s, v18.4s, v19.4s
    str   q18, [x12, #16]      ; iastore

After:
    dup   v16.16b, w11
    ...
    ldr   q17, [x13, #16]
    sshl  v17.4s, v17.4s, v16.4s
    add   x2, x22, x11         ; iaload

    add   x4, x16, x11
    str   q17, [x4, #16]       ; iastore

    ldr   q17, [x2, #16]
    add   x11, x21, x11
    neg   v18.16b, v16.16b
    sshl  v17.4s, v17.4s, v18.4s
    str   q17, [x11, #16]      ; iastore

-------------

Commit messages:
 - 8262916: Merge LShiftCntV and RShiftCntV into a single node

Changes: https://git.openjdk.java.net/jdk/pull/3371/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=3371&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8262916
  Stats: 1494 lines in 14 files changed: 988 ins; 282 del; 224 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3371.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3371/head:pull/3371

PR: https://git.openjdk.java.net/jdk/pull/3371
On Wed, 7 Apr 2021 07:28:08 GMT, Eric Liu <[hidden email]> wrote:
> The vector shift count was defined by two separate nodes(LShiftCntV and
> RShiftCntV), which would prevent them from being shared when the shift
> counts are the same.
>
> public static void test_shiftv(int sh) {
>     for (int i = 0; i < N; i+=1) {
>         a0[i] = a1[i] << sh;
>         b0[i] = b1[i] >> sh;
>     }
> }
>
> Given the example above, by merging the same shift counts into one
> node, they could be shared by shift nodes(RShiftV or LShiftV) like
> below:
>
> Before:
> 1184 LShiftCntV === _ 1189 [[ 1185 ... ]]
> 1190 RShiftCntV === _ 1189 [[ 1191 ... ]]
> 1185 LShiftVI === _ 1181 1184 [[ 1186 ]]
> 1191 RShiftVI === _ 1187 1190 [[ 1192 ]]
>
> After:
> 1190 ShiftCntV === _ 1189 [[ 1191 1204 ... ]]
> 1204 LShiftVI === _ 1211 1190 [[ 1203 ]]
> 1191 RShiftVI === _ 1187 1190 [[ 1192 ]]
>
> The final code could remove one redundant "dup" (scalar->vector),
> with one register saved.
>
> Before:
> dup v16.16b, w12
> dup v17.16b, w12
> ...
> ldr q18, [x13, #16]
> sshl v18.4s, v18.4s, v16.4s
> add x18, x16, x12 ; iaload
>
> add x4, x15, x12
> str q18, [x4, #16] ; iastore
>
> ldr q18, [x18, #16]
> add x12, x14, x12
> neg v19.16b, v17.16b
> sshl v18.4s, v18.4s, v19.4s
> str q18, [x12, #16] ; iastore
>
> After:
> dup v16.16b, w11
> ...
> ldr q17, [x13, #16]
> sshl v17.4s, v17.4s, v16.4s
> add x2, x22, x11 ; iaload
>
> add x4, x16, x11
> str q17, [x4, #16] ; iastore
>
> ldr q17, [x2, #16]
> add x11, x21, x11
> neg v18.16b, v16.16b
> sshl v17.4s, v17.4s, v18.4s
> str q17, [x11, #16] ; iastore

You should be able to do this without introducing a new node type. You
could change the shift rules to match a vector register like x86.ad and
aarch64_sve.ad already do.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3371
On Thu, 8 Apr 2021 05:14:52 GMT, Dean Long <[hidden email]> wrote:
> You should be able to do this without introducing a new node type.

Do you mean just matching them with ReplicateBNode? Indeed, they generate
the same instruction. I was wondering about the same thing, but I'm not
sure whether ReplicateBNode has the same semantics as ShiftCntV in the
middle end.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3371
On Thu, 8 Apr 2021 05:14:52 GMT, Dean Long <[hidden email]> wrote:
> You should be able to do this without introducing a new node type. You
> could change the shift rules to match a vector register like x86.ad and
> aarch64_sve.ad already do.

Not sure what you refer to in x86.ad: vector shifts with a variable scalar
count require the scalar to be placed in an XMM register. `ShiftCntV`
handles the register-to-register move between different register classes
(`RegI` and `Vec*`).

Do you suggest reusing some existing vector node (like `Replicate`) to
convert the scalar shift count to a vector first? It would have slightly
different behavior on x86.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3371
On Thu, 8 Apr 2021 11:04:49 GMT, Vladimir Ivanov <[hidden email]> wrote:
>> You should be able to do this without introducing a new node type. You
>> could change the shift rules to match a vector register like x86.ad and
>> aarch64_sve.ad already do.
>
>> You should be able to do this without introducing a new node type. You
>> could change the shift rules to match a vector register like x86.ad and
>> aarch64_sve.ad already do.
>
> Not sure what you refer to in x86.ad: vector shifts with variable scalar
> count require the scalar to be placed in XMM register. `ShiftCntV`
> handles register-to-register move between different register classes
> (`RegI` and `Vec*`).
>
> Do you suggest reusing some existing vector node (like `Replicate`) to
> covert the scalar index to vector first? It would have slightly
> different behavior on x86.

Regarding the proposed change itself (`LShiftCntV/RShiftCntV => ShiftCntV`).

Not sure how important it is, but it has an unfortunate change in generated
code for right vector shifts on AArch32: instead of sharing the result of
index negation at all use sites, negation is performed at every use site now.

As a consequence, in an auto-vectorized loop it will lead to:
 - 1 instruction per loop iteration (multiplied by unrolling factor);
 - no way to hoist the negation of loop invariant index.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3371
On Thu, 8 Apr 2021 11:19:27 GMT, Vladimir Ivanov <[hidden email]> wrote:
> Regarding the proposed change itself (`LShiftCntV/RShiftCntV => ShiftCntV`).
>
> Not sure how important it is, but it has an unfortunate change in generated
> code for right vector shifts on AArch32: instead of sharing the result of
> index negation at all use sites, negation is performed at every use site now.
>
> As a consequence, in an auto-vectorized loop it will lead to:
>
> * 1 instruction per loop iteration (multiplied by unrolling factor);
> * no way to hoist the negation of loop invariant index.

Thanks for your feedback!

I have checked the generated code on AArch32 and it's a shame that 'vneg'
is emitted at every use site.

Before:
    0xf46c8338: add       fp, r7, fp
    0xf46c833c: vshl.u16  d9, d9, d8
    0xf46c8340: vstr      d9, [fp, #12]
    0xf46c8344: vshl.u16  d9, d10, d8
    0xf46c8348: vstr      d9, [fp, #20]
    0xf46c834c: vshl.u16  d9, d11, d8
    0xf46c8350: vstr      d9, [fp, #28]

After:
    0xf4aa1328: add       fp, r7, fp
    0xf4aa132c: vneg.s8   d13, d8
    0xf4aa1330: vshl.u16  d9, d9, d13
    0xf4aa1334: vstr      d9, [fp, #12]
    0xf4aa1338: vneg.s8   d9, d8
    0xf4aa133c: vshl.u16  d9, d10, d9
    0xf4aa1340: vstr      d9, [fp, #20]
    0xf4aa1344: vneg.s8   d9, d8
    0xf4aa1348: vshl.u16  d9, d11, d9
    0xf4aa134c: vstr      d9, [fp, #28]

I suppose it's a trade-off: either keep the two R/LShiftCntV nodes and
change AArch64 and x86 to do what AArch32 does, or merge them and accept
this regression on AArch32.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3371
On Thu, 8 Apr 2021 11:52:02 GMT, Eric Liu <[hidden email]> wrote:
>> Regarding the proposed change itself (`LShiftCntV/RShiftCntV => ShiftCntV`). >> >> Not sure how important it is, but it has an unfortunate change in generated code for right vector shifts on AArch32: instead of sharing the result of index negation at all use sites, negation is performed at every use site now. >> >> As a consequence, in an auto-vectorized loop it will lead to: >> - 1 instruction per loop iteration (multiplied by unrolling factor); >> - no way to hoist the negation of loop invariant index. > >> Regarding the proposed change itself (`LShiftCntV/RShiftCntV => ShiftCntV`). >> >> Not sure how important it is, but it has an unfortunate change in generated code for right vector shifts on AArch32: instead of sharing the result of index negation at all use sites, negation is performed at every use site now. >> >> As a consequence, in an auto-vectorized loop it will lead to: >> >> * 1 instruction per loop iteration (multiplied by unrolling factor); >> * no way to hoist the negation of loop invariant index. > > Thanks for your feedback! > > I have checked the generated code on AArch32 and it's a shame that 'vneg' is at every use point. > > Before: > 0xf46c8338: add fp, r7, fp > 0xf46c833c: vshl.u16 d9, d9, d8 > 0xf46c8340: vstr d9, [fp, #12] > 0xf46c8344: vshl.u16 d9, d10, d8 > 0xf46c8348: vstr d9, [fp, #20] > 0xf46c834c: vshl.u16 d9, d11, d8 > 0xf46c8350: vstr d9, [fp, #28] > > After: > 0xf4aa1328: add fp, r7, fp > 0xf4aa132c: vneg.s8 d13, d8 > 0xf4aa1330: vshl.u16 d9, d9, d13 > 0xf4aa1334: vstr d9, [fp, #12] > 0xf4aa1338: vneg.s8 d9, d8 > 0xf4aa133c: vshl.u16 d9, d10, d9 > 0xf4aa1340: vstr d9, [fp, #20] > 0xf4aa1344: vneg.s8 d9, d8 > 0xf4aa1348: vshl.u16 d9, d11, d9 > 0xf4aa134c: vstr d9, [fp, #28] > > I suppose it's more like a trade off that either remaining those two R/LShiftCntV nodes and change AArch64 and X86 to what AArch32 dose, or merging them and accept this defect on AArch32. 
`LShiftCntV`/`RShiftCntV` were added specifically for AArch32; other
platforms don't need or benefit from such a separation.

Ultimately, I'd like to hear what the AArch32 port maintainers think about
it. If concerns about the performance impact are raised, an alternative is
to introduce platform-specific logic that adjusts the Ideal IR shape for
`URShiftV`/`RShiftV` nodes accordingly. (Not sure whether that is a good
trade-off w.r.t. code complexity, though.)

-------------

PR: https://git.openjdk.java.net/jdk/pull/3371
On Thu, 8 Apr 2021 11:52:02 GMT, Eric Liu <[hidden email]> wrote:
>> Regarding the proposed change itself (`LShiftCntV/RShiftCntV => ShiftCntV`). >> >> Not sure how important it is, but it has an unfortunate change in generated code for right vector shifts on AArch32: instead of sharing the result of index negation at all use sites, negation is performed at every use site now. >> >> As a consequence, in an auto-vectorized loop it will lead to: >> - 1 instruction per loop iteration (multiplied by unrolling factor); >> - no way to hoist the negation of loop invariant index. > >> Regarding the proposed change itself (`LShiftCntV/RShiftCntV => ShiftCntV`). >> >> Not sure how important it is, but it has an unfortunate change in generated code for right vector shifts on AArch32: instead of sharing the result of index negation at all use sites, negation is performed at every use site now. >> >> As a consequence, in an auto-vectorized loop it will lead to: >> >> * 1 instruction per loop iteration (multiplied by unrolling factor); >> * no way to hoist the negation of loop invariant index. > > Thanks for your feedback! > > I have checked the generated code on AArch32 and it's a shame that 'vneg' is at every use point. > > Before: > 0xf46c8338: add fp, r7, fp > 0xf46c833c: vshl.u16 d9, d9, d8 > 0xf46c8340: vstr d9, [fp, #12] > 0xf46c8344: vshl.u16 d9, d10, d8 > 0xf46c8348: vstr d9, [fp, #20] > 0xf46c834c: vshl.u16 d9, d11, d8 > 0xf46c8350: vstr d9, [fp, #28] > > After: > 0xf4aa1328: add fp, r7, fp > 0xf4aa132c: vneg.s8 d13, d8 > 0xf4aa1330: vshl.u16 d9, d9, d13 > 0xf4aa1334: vstr d9, [fp, #12] > 0xf4aa1338: vneg.s8 d9, d8 > 0xf4aa133c: vshl.u16 d9, d10, d9 > 0xf4aa1340: vstr d9, [fp, #20] > 0xf4aa1344: vneg.s8 d9, d8 > 0xf4aa1348: vshl.u16 d9, d11, d9 > 0xf4aa134c: vstr d9, [fp, #28] > > I suppose it's more like a trade off that either remaining those two R/LShiftCntV nodes and change AArch64 and X86 to what AArch32 dose, or merging them and accept this defect on AArch32. 
@theRealELiu @iwanowww Yes, but I was thinking of the `vshiftcnt` rules
and didn't realize the `replicate` rules are virtually identical. Now that
I look again, it appears that aarch64 is already doing the same as x86
(converting the scalar shift count to a vector register, and matching the
vector register in the rule). So why is it that x86 can share the shift
register, but aarch64 cannot? Is it because x86 has combined left- and
right-shift into the same rule?

-------------

PR: https://git.openjdk.java.net/jdk/pull/3371
On Thu, 8 Apr 2021 11:52:02 GMT, Eric Liu <[hidden email]> wrote:
>> Regarding the proposed change itself (`LShiftCntV/RShiftCntV => ShiftCntV`). >> >> Not sure how important it is, but it has an unfortunate change in generated code for right vector shifts on AArch32: instead of sharing the result of index negation at all use sites, negation is performed at every use site now. >> >> As a consequence, in an auto-vectorized loop it will lead to: >> - 1 instruction per loop iteration (multiplied by unrolling factor); >> - no way to hoist the negation of loop invariant index. > >> Regarding the proposed change itself (`LShiftCntV/RShiftCntV => ShiftCntV`). >> >> Not sure how important it is, but it has an unfortunate change in generated code for right vector shifts on AArch32: instead of sharing the result of index negation at all use sites, negation is performed at every use site now. >> >> As a consequence, in an auto-vectorized loop it will lead to: >> >> * 1 instruction per loop iteration (multiplied by unrolling factor); >> * no way to hoist the negation of loop invariant index. > > Thanks for your feedback! > > I have checked the generated code on AArch32 and it's a shame that 'vneg' is at every use point. > > Before: > 0xf46c8338: add fp, r7, fp > 0xf46c833c: vshl.u16 d9, d9, d8 > 0xf46c8340: vstr d9, [fp, #12] > 0xf46c8344: vshl.u16 d9, d10, d8 > 0xf46c8348: vstr d9, [fp, #20] > 0xf46c834c: vshl.u16 d9, d11, d8 > 0xf46c8350: vstr d9, [fp, #28] > > After: > 0xf4aa1328: add fp, r7, fp > 0xf4aa132c: vneg.s8 d13, d8 > 0xf4aa1330: vshl.u16 d9, d9, d13 > 0xf4aa1334: vstr d9, [fp, #12] > 0xf4aa1338: vneg.s8 d9, d8 > 0xf4aa133c: vshl.u16 d9, d10, d9 > 0xf4aa1340: vstr d9, [fp, #20] > 0xf4aa1344: vneg.s8 d9, d8 > 0xf4aa1348: vshl.u16 d9, d11, d9 > 0xf4aa134c: vstr d9, [fp, #28] > > I suppose it's more like a trade off that either remaining those two R/LShiftCntV nodes and change AArch64 and X86 to what AArch32 dose, or merging them and accept this defect on AArch32. 
> @theRealELiu @iwanowww Yes, but I was thinking of the `vshiftcnt` rules
> and didn't realize the `replicate` rules are virtually identical. Now
> that I look again, it appears that aarch64 is already doing the same as
> x86 (converting the scalar shift count to a vector register, and
> matching the vector register in the rule). So why is it that x86 can
> share the shift register, but aarch64 cannot? Is it because x86 has
> combined left- and right-shift into the same rule?

I suppose it's because x86 has distinct vector shift instructions for each
direction[1], while AArch64 and AArch32 only have a left shift. For this
reason, Arm has to use a "neg + left-shift" pair to implement right shift.

[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp#L1234

-------------

PR: https://git.openjdk.java.net/jdk/pull/3371
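The negative-count behavior being discussed can be modeled in scalar Java. The following is a hypothetical one-lane sketch of Arm's shift-by-register semantics (simplified: the real SSHL/VSHL also clamps counts at or beyond the element width), not JDK code:

```java
// Hypothetical one-lane model of AArch64 SSHL / AArch32 VSHL:
// a non-negative count shifts left, a negative count shifts right
// (arithmetic). This is why C2's right vector shifts on Arm negate
// the broadcast count before the shift instruction.
public class SshlModel {
    static int sshl(int x, int count) {
        return count >= 0 ? x << count : x >> -count;
    }

    public static void main(String[] args) {
        System.out.println(sshl(8, 2));   // left shift by 2
        System.out.println(sshl(-8, -2)); // arithmetic right shift by 2
    }
}
```

In this model a right shift is just a left shift by a negated count, which is where the extra `neg` in the generated code comes from.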
On Thu, 8 Apr 2021 13:37:31 GMT, Vladimir Ivanov <[hidden email]> wrote:
> `LShiftCntV`/`RShiftCntV` were added specifically for AArch32 and other
> platforms don't need/benefit from such separation.

I think AArch64 shares the same problem as AArch32, since it also has only
a left shift instruction.

I checked the generated code on AArch64 with the latest master. The test
case is TestCharVect2::test_srav[1].

    0x0000ffffa9106c68: ldr    q17, [x15, #16]
    0x0000ffffa9106c6c: add    x14, x10, x14
    0x0000ffffa9106c70: neg    v18.16b, v16.16b
    0x0000ffffa9106c74: ushl   v17.8h, v17.8h, v18.8h
    0x0000ffffa9106c78: str    q17, [x14, #16]
    0x0000ffffa9106c7c: ldr    q17, [x15, #32]
    0x0000ffffa9106c80: neg    v18.16b, v16.16b
    0x0000ffffa9106c84: ushl   v17.8h, v17.8h, v18.8h
    0x0000ffffa9106c88: str    q17, [x14, #32]
    0x0000ffffa9106c8c: ldr    q17, [x15, #48]
    0x0000ffffa9106c90: neg    v18.16b, v16.16b
    0x0000ffffa9106c94: ushl   v17.8h, v17.8h, v18.8h
    0x0000ffffa9106c98: str    q17, [x14, #48]
    0x0000ffffa9106c9c: ldr    q17, [x15, #64]
    0x0000ffffa9106ca0: neg    v18.16b, v16.16b
    0x0000ffffa9106ca4: ushl   v17.8h, v17.8h, v18.8h
    0x0000ffffa9106ca8: str    q17, [x14, #64]
    0x0000ffffa9106cac: ldr    q17, [x15, #80]
    0x0000ffffa9106cb0: neg    v18.16b, v16.16b
    0x0000ffffa9106cb4: ushl   v17.8h, v17.8h, v18.8h

It seems that keeping the two RShiftCntV and LShiftCntV nodes is friendly
to AArch32/64 in this case, but AArch64 should be changed to do what
AArch32 does. @theRealAph

[1] https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/codegen/TestCharVect2.java#L1215

-------------

PR: https://git.openjdk.java.net/jdk/pull/3371
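For context, the kind of loop being compiled here looks roughly like the following. This is a minimal sketch modeled on the cited test, not the actual jtreg source; the method and array names are illustrative:

```java
// A variable right shift over a char array. C2 auto-vectorizes this
// with a broadcast shift count, and in the quoted AArch64 code the
// `neg` of that count is re-emitted before every `ushl`.
public class SravSketch {
    static void shiftRight(char[] dst, char[] src, int b) {
        for (int i = 0; i < dst.length; i++) {
            dst[i] = (char) (src[i] >> b);
        }
    }

    public static void main(String[] args) {
        char[] src = {8, 16, 1024};
        char[] dst = new char[src.length];
        shiftRight(dst, src, 2);
        // char values are non-negative, so >> behaves like an
        // unsigned shift here
        System.out.println((int) dst[0] + " " + (int) dst[1] + " " + (int) dst[2]);
    }
}
```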
On Fri, 9 Apr 2021 08:50:49 GMT, Eric Liu <[hidden email]> wrote:
> > It seems that keeping those two RShiftCntV and LShiftCntV is friendly
> > to AArch32/64 in this case, but AArch64 should changed to what AArch32
> > dose. @theRealAph

Thanks, but it's been a while since I looked at the vector code. Can you
point me to the AArch32 patterns in question, to show me the AArch64
changes needed?

-------------

PR: https://git.openjdk.java.net/jdk/pull/3371
On Fri, 9 Apr 2021 09:26:19 GMT, Andrew Haley <[hidden email]> wrote:
> > It seems that keeping those two RShiftCntV and LShiftCntV is friendly
> > to AArch32/64 in this case, but AArch64 should changed to what AArch32
> > dose. @theRealAph
>
> Thanks, but it's been a while since I looked at the vector code. Can you
> point me to the AArch32 patterns in question, to show me the AArch64
> changes needed? Thanks.

AArch32 combines 'neg' with 'dup' in RShiftCntV[1], whereas AArch64 has
only a single 'dup'[2] and generates 'neg' before every use site, e.g.
RShiftVB[3].

[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/arm/arm.ad#L10451
[2] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_neon.ad#L5117
[3] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_neon.ad#L5188

-------------

PR: https://git.openjdk.java.net/jdk/pull/3371
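The practical difference between the two patterns can be sketched in plain scalar Java. This is a hypothetical model, not the actual .ad rules: fusing the negation into the count materialization lets it be hoisted out of the loop, while a plain 'dup' forces a negation at every shift.

```java
public class NegPlacement {
    // per-lane model of vshl/sshl: a negative count means right shift
    static int sshl(int x, int c) { return c >= 0 ? x << c : x >> -c; }

    // AArch32-style: RShiftCntV materializes -sh once, outside the loop
    static int[] fusedNeg(int[] a, int sh) {
        int cnt = -sh;                      // dup + neg, hoisted
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) r[i] = sshl(a[i], cnt);
        return r;
    }

    // AArch64-style with a plain dup: neg re-done at every use site
    static int[] perUseNeg(int[] a, int sh) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) r[i] = sshl(a[i], -sh);
        return r;
    }

    public static void main(String[] args) {
        int[] a = {32, -64, 7};
        // both strategies compute the same result; only the placement
        // of the negation (and thus the instruction count) differs
        System.out.println(java.util.Arrays.equals(
                fusedNeg(a, 3), perUseNeg(a, 3)));
    }
}
```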