RFR: 8262355: Support for AVX-512 opmask register allocation.

classic Classic list List threaded Threaded
120 messages Options
1234 ... 6
Reply | Threaded
Open this post in threaded view
|

RFR: 8262355: Support for AVX-512 opmask register allocation.

Jatin Bhateja
AVX-512 added 8 new 64 bit opmask registers[1] . These registers allow conditional execution and efficient merging of destination operands. At present cross instruction mask propagation is being done either using a GPR (e.g. vmask_gen patterns in x86.ad) or a vector register (for propagating results of a vector comparison or vector load mask operations).

This base patch extends the register allocator to support allocation of opmask registers. This will facilitate mask propagation across instructions and thus enable emitting efficient instruction sequence over X86 targets supporting AVX-512 feature.

We intend to build a robust optimization framework[2] based on this patch to emit optimized instruction sequence for masked/predicated vector operation for X86 targets supporting AVX-512.

Please review and share your feedback.

Summary of changes:

1) AD side changes: New register definitions, register classes, allocation classes, operand definitions and spill code handling for opmask registers.

2) Runtime: Save/restoration for opmask registers in 32 and 64 bit JVM.
   a) For 64 bit JVM we were anyways reserving the space in the frame layout but earlier were not saving and restoring at designated offset(1088), hence no extra space overhead apart from save/restore cost.
   b) For 32 bit JVM: Additional 64 byte are allocated apart from FXSTORE area on the lines of storage for ZMM(16-31) and YMM-Hi bank. There are few regressions due to extra space allocation which we are investigating.

3) Replacing all the hard-coded opmask references from macro-assembly routines: Pulling out the opmask occurrences all the way up to instruction pattern and adding an unbounded opmask operand for them. This exposes these operands to RA and scheduler; this will automatically facilitate spilling of live opmask registers across call sites.

4) Register class initializations related to Op_RegVMask during matcher startup.

5) Handling for mask generating node:  Currently VectorMaskGen node uses a GPR to propagate mask across mask generating DEF instruction to its USER instructions. There are other mask generating nodes like VectorCmpMask, VectorLoadMask which are not handled as the part of this patch. Conditional overriding of two routines, ideal_reg and bottom_type for mask generating IDEAL nodes and modifying the instruction patterns to have new opmask operands enables instruction selector to associate opmask register class with USE/DEF operands  for such MachNodes. This will constrain  the allocation set for these operands to opmask registers(K1-K7).

6) Special handling for setting a flag in PhiNode during construction in case any of its incoming node is a mask generating node, this flag is then checked to return appropriate ideal_reg and bottom_type corresponding to an opmask registers.

[1] : Section 15.1.3 :  https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-software-developers-manual-volume-1-basic-architecture.html
[2] : http://cr.openjdk.java.net/~jbhateja/avx512_masked_operation_optimization/AVX-512_RA_Opmask_Support_VectorMask_Optimizations.pdf

-------------

Commit messages:
 - 8262355: Support for AVX-512 opmask register allocation.

Changes: https://git.openjdk.java.net/jdk/pull/2768/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=2768&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8262355
  Stats: 1040 lines in 33 files changed: 767 ins; 13 del; 260 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2768.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2768/head:pull/2768

PR: https://git.openjdk.java.net/jdk/pull/2768
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262355: Support for AVX-512 opmask register allocation. [v2]

Jatin Bhateja
> AVX-512 added 8 new 64 bit opmask registers[1] . These registers allow conditional execution and efficient merging of destination operands. At present cross instruction mask propagation is being done either using a GPR (e.g. vmask_gen patterns in x86.ad) or a vector register (for propagating results of a vector comparison or vector load mask operations).
>
> This base patch extends the register allocator to support allocation of opmask registers. This will facilitate mask propagation across instructions and thus enable emitting efficient instruction sequence over X86 targets supporting AVX-512 feature.
>
> We intend to build a robust optimization framework[2] based on this patch to emit optimized instruction sequence for masked/predicated vector operation for X86 targets supporting AVX-512.
>
> Please review and share your feedback.
>
> Summary of changes:
>
> 1) AD side changes: New register definitions, register classes, allocation classes, operand definitions and spill code handling for opmask registers.
>
> 2) Runtime: Save/restoration for opmask registers in 32 and 64 bit JVM.
>    a) For 64 bit JVM we were anyways reserving the space in the frame layout but earlier were not saving and restoring at designated offset(1088), hence no extra space overhead apart from save/restore cost.
>    b) For 32 bit JVM: Additional 64 byte are allocated apart from FXSTORE area on the lines of storage for ZMM(16-31) and YMM-Hi bank. There are few regressions due to extra space allocation which we are investigating.
>
> 3) Replacing all the hard-coded opmask references from macro-assembly routines: Pulling out the opmask occurrences all the way up to instruction pattern and adding an unbounded opmask operand for them. This exposes these operands to RA and scheduler; this will automatically facilitate spilling of live opmask registers across call sites.
>
> 4) Register class initializations related to Op_RegVMask during matcher startup.
>
> 5) Handling for mask generating node:  Currently VectorMaskGen node uses a GPR to propagate mask across mask generating DEF instruction to its USER instructions. There are other mask generating nodes like VectorCmpMask, VectorLoadMask which are not handled as the part of this patch. Conditional overriding of two routines, ideal_reg and bottom_type for mask generating IDEAL nodes and modifying the instruction patterns to have new opmask operands enables instruction selector to associate opmask register class with USE/DEF operands  for such MachNodes. This will constrain  the allocation set for these operands to opmask registers(K1-K7).
>
> 6) Special handling for setting a flag in PhiNode during construction in case any of its incoming node is a mask generating node, this flag is then checked to return appropriate ideal_reg and bottom_type corresponding to an opmask registers.
>
> [1] : Section 15.1.3 :  https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-software-developers-manual-volume-1-basic-architecture.html
> [2] : http://cr.openjdk.java.net/~jbhateja/avx512_masked_operation_optimization/AVX-512_RA_Opmask_Support_VectorMask_Optimizations.pdf

Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:

  8262355: Fix for AARCH64 build failure.

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/2768/files
  - new: https://git.openjdk.java.net/jdk/pull/2768/files/9e1c3e0d..69003aa6

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2768&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2768&range=00-01

  Stats: 11 lines in 5 files changed: 1 ins; 2 del; 8 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2768.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2768/head:pull/2768

PR: https://git.openjdk.java.net/jdk/pull/2768
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262355: Support for AVX-512 opmask register allocation.

Vladimir Ivanov
In reply to this post by Jatin Bhateja
Good work, Jatin!

I'd like to focus high-level aspects first.

There's a significant amount of crux coming from the fact that masks
don't have their own type. Reusing TypeLong::LONG for k-registers may
look appealing at first, but then all the places where RegVMask matters
have to handle the types specially. Why not introduce a dedicated type
for masks?

Also, my understanding is AArch64/SVE allows predicate registers to be
larger than 64-bit, so TypeLong::LONG won't work there and a dedicated
representation will be needed.

Second question is about x86 and different mask representations it has:
AVX-512 introduces predicate registers, but AVX/AVX2 keep masks in wide
vector registers. Are we fine with leaking this difference into Ideal IR
and specifying what shape a mask value has during Ideal construction?

Regarding the patch itself, RegisterSaver support and related changes
can be integrated separately.

Best regards,
Vladimir Ivanov

On 28.02.2021 21:40, Jatin Bhateja wrote:

> AVX-512 added 8 new 64 bit opmask registers[1] . These registers allow conditional execution and efficient merging of destination operands. At present cross instruction mask propagation is being done either using a GPR (e.g. vmask_gen patterns in x86.ad) or a vector register (for propagating results of a vector comparison or vector load mask operations).
>
> This base patch extends the register allocator to support allocation of opmask registers. This will facilitate mask propagation across instructions and thus enable emitting efficient instruction sequence over X86 targets supporting AVX-512 feature.
>
> We intend to build a robust optimization framework[2] based on this patch to emit optimized instruction sequence for masked/predicated vector operation for X86 targets supporting AVX-512.
>
> Please review and share your feedback.
>
> Summary of changes:
>
> 1) AD side changes: New register definitions, register classes, allocation classes, operand definitions and spill code handling for opmask registers.
>
> 2) Runtime: Save/restoration for opmask registers in 32 and 64 bit JVM.
>     a) For 64 bit JVM we were anyways reserving the space in the frame layout but earlier were not saving and restoring at designated offset(1088), hence no extra space overhead apart from save/restore cost.
>     b) For 32 bit JVM: Additional 64 byte are allocated apart from FXSTORE area on the lines of storage for ZMM(16-31) and YMM-Hi bank. There are few regressions due to extra space allocation which we are investigating.
>
> 3) Replacing all the hard-coded opmask references from macro-assembly routines: Pulling out the opmask occurrences all the way up to instruction pattern and adding an unbounded opmask operand for them. This exposes these operands to RA and scheduler; this will automatically facilitate spilling of live opmask registers across call sites.
>
> 4) Register class initializations related to Op_RegVMask during matcher startup.
>
> 5) Handling for mask generating node:  Currently VectorMaskGen node uses a GPR to propagate mask across mask generating DEF instruction to its USER instructions. There are other mask generating nodes like VectorCmpMask, VectorLoadMask which are not handled as the part of this patch. Conditional overriding of two routines, ideal_reg and bottom_type for mask generating IDEAL nodes and modifying the instruction patterns to have new opmask operands enables instruction selector to associate opmask register class with USE/DEF operands  for such MachNodes. This will constrain  the allocation set for these operands to opmask registers(K1-K7).
>
> 6) Special handling for setting a flag in PhiNode during construction in case any of its incoming node is a mask generating node, this flag is then checked to return appropriate ideal_reg and bottom_type corresponding to an opmask registers.
>
> [1] : Section 15.1.3 :  https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-software-developers-manual-volume-1-basic-architecture.html
> [2] : http://cr.openjdk.java.net/~jbhateja/avx512_masked_operation_optimization/AVX-512_RA_Opmask_Support_VectorMask_Optimizations.pdf
>
> -------------
>
> Commit messages:
>   - 8262355: Support for AVX-512 opmask register allocation.
>
> Changes: https://git.openjdk.java.net/jdk/pull/2768/files
>   Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=2768&range=00
>    Issue: https://bugs.openjdk.java.net/browse/JDK-8262355
>    Stats: 1040 lines in 33 files changed: 767 ins; 13 del; 260 mod
>    Patch: https://git.openjdk.java.net/jdk/pull/2768.diff
>    Fetch: git fetch https://git.openjdk.java.net/jdk pull/2768/head:pull/2768
>
> PR: https://git.openjdk.java.net/jdk/pull/2768
>
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262355: Support for AVX-512 opmask register allocation. [v2]

Vladimir Ivanov-2
In reply to this post by Jatin Bhateja
On Sun, 28 Feb 2021 19:09:00 GMT, Jatin Bhateja <[hidden email]> wrote:

>> AVX-512 added 8 new 64 bit opmask registers[1] . These registers allow conditional execution and efficient merging of destination operands. At present cross instruction mask propagation is being done either using a GPR (e.g. vmask_gen patterns in x86.ad) or a vector register (for propagating results of a vector comparison or vector load mask operations).
>>
>> This base patch extends the register allocator to support allocation of opmask registers. This will facilitate mask propagation across instructions and thus enable emitting efficient instruction sequence over X86 targets supporting AVX-512 feature.
>>
>> We intend to build a robust optimization framework[2] based on this patch to emit optimized instruction sequence for masked/predicated vector operation for X86 targets supporting AVX-512.
>>
>> Please review and share your feedback.
>>
>> Summary of changes:
>>
>> 1) AD side changes: New register definitions, register classes, allocation classes, operand definitions and spill code handling for opmask registers.
>>
>> 2) Runtime: Save/restoration for opmask registers in 32 and 64 bit JVM.
>>    a) For 64 bit JVM we were anyways reserving the space in the frame layout but earlier were not saving and restoring at designated offset(1088), hence no extra space overhead apart from save/restore cost.
>>    b) For 32 bit JVM: Additional 64 byte are allocated apart from FXSTORE area on the lines of storage for ZMM(16-31) and YMM-Hi bank. There are few regressions due to extra space allocation which we are investigating.
>>
>> 3) Replacing all the hard-coded opmask references from macro-assembly routines: Pulling out the opmask occurrences all the way up to instruction pattern and adding an unbounded opmask operand for them. This exposes these operands to RA and scheduler; this will automatically facilitate spilling of live opmask registers across call sites.
>>
>> 4) Register class initializations related to Op_RegVMask during matcher startup.
>>
>> 5) Handling for mask generating node:  Currently VectorMaskGen node uses a GPR to propagate mask across mask generating DEF instruction to its USER instructions. There are other mask generating nodes like VectorCmpMask, VectorLoadMask which are not handled as the part of this patch. Conditional overriding of two routines, ideal_reg and bottom_type for mask generating IDEAL nodes and modifying the instruction patterns to have new opmask operands enables instruction selector to associate opmask register class with USE/DEF operands  for such MachNodes. This will constrain  the allocation set for these operands to opmask registers(K1-K7).
>>
>> 6) Special handling for setting a flag in PhiNode during construction in case any of its incoming node is a mask generating node, this flag is then checked to return appropriate ideal_reg and bottom_type corresponding to an opmask registers.
>>
>> [1] : Section 15.1.3 :  https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-software-developers-manual-volume-1-basic-architecture.html
>> [2] : http://cr.openjdk.java.net/~jbhateja/avx512_masked_operation_optimization/AVX-512_RA_Opmask_Support_VectorMask_Optimizations.pdf
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
>   8262355: Fix for AARCH64 build failure.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.hpp line 35:

> 33:   // special instructions for EVEX
> 34:   void setvectmask(Register dst, Register src, KRegister mask = k1);
> 35:   void restorevectmask(KRegister mask = k1);

I don't see much sense in default values here. The register should be explicitly specified at call sites.

src/hotspot/cpu/x86/gc/z/zBarrierSetAssembler_x86.cpp line 149:

> 147:   __ movdqu(Address(rsp, xmm_size * 0), xmm0);
> 148:
> 149:   const int opmask_size = 8;

The code in `ZBarrierSetAssembler::load_at` doesn't handle vector registers. Why should it care about predicate registers?

src/hotspot/cpu/x86/gc/z/zBarrierSetAssembler_x86.cpp line 505:

> 503:     caller_saved.Insert(OptoReg::as_OptoReg(r10->as_VMReg()));
> 504:     caller_saved.Insert(OptoReg::as_OptoReg(r11->as_VMReg()));
> 505:     if (UseAVX > 2) {

Similar question here: `ZSaveLiveRegisters` handles GP registers and vector registers differently. Why predicate registers are handled the same way GP registers are and not as vector registers?

-------------

PR: https://git.openjdk.java.net/jdk/pull/2768
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262355: Support for AVX-512 opmask register allocation.

Ningsheng Jian-2
In reply to this post by Jatin Bhateja
On Sun, 28 Feb 2021 18:35:11 GMT, Jatin Bhateja <[hidden email]> wrote:

> AVX-512 added 8 new 64 bit opmask registers[1] . These registers allow conditional execution and efficient merging of destination operands. At present cross instruction mask propagation is being done either using a GPR (e.g. vmask_gen patterns in x86.ad) or a vector register (for propagating results of a vector comparison or vector load mask operations).
>
> This base patch extends the register allocator to support allocation of opmask registers. This will facilitate mask propagation across instructions and thus enable emitting efficient instruction sequence over X86 targets supporting AVX-512 feature.
>
> We intend to build a robust optimization framework[2] based on this patch to emit optimized instruction sequence for masked/predicated vector operation for X86 targets supporting AVX-512.
>
> Please review and share your feedback.
>
> Summary of changes:
>
> 1) AD side changes: New register definitions, register classes, allocation classes, operand definitions and spill code handling for opmask registers.
>
> 2) Runtime: Save/restoration for opmask registers in 32 and 64 bit JVM.
>    a) For 64 bit JVM we were anyways reserving the space in the frame layout but earlier were not saving and restoring at designated offset(1088), hence no extra space overhead apart from save/restore cost.
>    b) For 32 bit JVM: Additional 64 byte are allocated apart from FXSTORE area on the lines of storage for ZMM(16-31) and YMM-Hi bank. There are few regressions due to extra space allocation which we are investigating.
>
> 3) Replacing all the hard-coded opmask references from macro-assembly routines: Pulling out the opmask occurrences all the way up to instruction pattern and adding an unbounded opmask operand for them. This exposes these operands to RA and scheduler; this will automatically facilitate spilling of live opmask registers across call sites.
>
> 4) Register class initializations related to Op_RegVMask during matcher startup.
>
> 5) Handling for mask generating node:  Currently VectorMaskGen node uses a GPR to propagate mask across mask generating DEF instruction to its USER instructions. There are other mask generating nodes like VectorCmpMask, VectorLoadMask which are not handled as the part of this patch. Conditional overriding of two routines, ideal_reg and bottom_type for mask generating IDEAL nodes and modifying the instruction patterns to have new opmask operands enables instruction selector to associate opmask register class with USE/DEF operands  for such MachNodes. This will constrain  the allocation set for these operands to opmask registers(K1-K7).
>
> 6) Special handling for setting a flag in PhiNode during construction in case any of its incoming node is a mask generating node, this flag is then checked to return appropriate ideal_reg and bottom_type corresponding to an opmask registers.
>
> [1] : Section 15.1.3 :  https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-software-developers-manual-volume-1-basic-architecture.html
> [2] : http://cr.openjdk.java.net/~jbhateja/avx512_masked_operation_optimization/AVX-512_RA_Opmask_Support_VectorMask_Optimizations.pdf

Hi,

@XiaohongGong also posted Arm SVE predicate register allocation support in panama-vector together with other commits about vector masking support: https://github.com/openjdk/panama-vector/pull/40 last week before this PR. The predicate register allocation part has been tested for some time internally and could be separated from that PR (https://github.com/openjdk/panama-vector/pull/40/commits/e658f4d189c21dcd3668fcafee25d9b678cd3640). If it helps, we can also propose a patch here in openjdk/jdk.

> I'd like to focus high-level aspects first.
>
> There's a significant amount of crux coming from the fact that masks
> don't have their own type. Reusing TypeLong::LONG for k-registers may
> look appealing at first, but then all the places where RegVMask matters
> have to handle the types specially. Why not introduce a dedicated type
> for masks?
>

I agree that a dedicate type sounds more reasonable, which is covered by @XiaohongGong 's patch, see: https://github.com/openjdk/panama-vector/pull/40/commits/3f69d40f08868062e2cc144b3b757dcbaa2db2d1

> Also, my understanding is AArch64/SVE allows predicate registers to be
> larger than 64-bit, so TypeLong::LONG won't work there and a dedicated
> representation will be needed.
>

Yes, AArch64/SVE predicate registers could be larger. I see in Jatin's patch, it has arch dependent type Matcher::predicate_reg_type(), that looks hacky and workable. But I would still prefer a dedicate type, which looks cleaner. Would a dedicate type also work for k-register?

Thanks,
Ningsheng

-------------

PR: https://git.openjdk.java.net/jdk/pull/2768
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262355: Support for AVX-512 opmask register allocation.

Xiaohong Gong
On Wed, 3 Mar 2021 01:39:11 GMT, Ningsheng Jian <[hidden email]> wrote:

>> AVX-512 added 8 new 64 bit opmask registers[1] . These registers allow conditional execution and efficient merging of destination operands. At present cross instruction mask propagation is being done either using a GPR (e.g. vmask_gen patterns in x86.ad) or a vector register (for propagating results of a vector comparison or vector load mask operations).
>>
>> This base patch extends the register allocator to support allocation of opmask registers. This will facilitate mask propagation across instructions and thus enable emitting efficient instruction sequence over X86 targets supporting AVX-512 feature.
>>
>> We intend to build a robust optimization framework[2] based on this patch to emit optimized instruction sequence for masked/predicated vector operation for X86 targets supporting AVX-512.
>>
>> Please review and share your feedback.
>>
>> Summary of changes:
>>
>> 1) AD side changes: New register definitions, register classes, allocation classes, operand definitions and spill code handling for opmask registers.
>>
>> 2) Runtime: Save/restoration for opmask registers in 32 and 64 bit JVM.
>>    a) For 64 bit JVM we were anyways reserving the space in the frame layout but earlier were not saving and restoring at designated offset(1088), hence no extra space overhead apart from save/restore cost.
>>    b) For 32 bit JVM: Additional 64 byte are allocated apart from FXSTORE area on the lines of storage for ZMM(16-31) and YMM-Hi bank. There are few regressions due to extra space allocation which we are investigating.
>>
>> 3) Replacing all the hard-coded opmask references from macro-assembly routines: Pulling out the opmask occurrences all the way up to instruction pattern and adding an unbounded opmask operand for them. This exposes these operands to RA and scheduler; this will automatically facilitate spilling of live opmask registers across call sites.
>>
>> 4) Register class initializations related to Op_RegVMask during matcher startup.
>>
>> 5) Handling for mask generating node:  Currently VectorMaskGen node uses a GPR to propagate mask across mask generating DEF instruction to its USER instructions. There are other mask generating nodes like VectorCmpMask, VectorLoadMask which are not handled as the part of this patch. Conditional overriding of two routines, ideal_reg and bottom_type for mask generating IDEAL nodes and modifying the instruction patterns to have new opmask operands enables instruction selector to associate opmask register class with USE/DEF operands  for such MachNodes. This will constrain  the allocation set for these operands to opmask registers(K1-K7).
>>
>> 6) Special handling for setting a flag in PhiNode during construction in case any of its incoming node is a mask generating node, this flag is then checked to return appropriate ideal_reg and bottom_type corresponding to an opmask registers.
>>
>> [1] : Section 15.1.3 :  https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-software-developers-manual-volume-1-basic-architecture.html
>> [2] : http://cr.openjdk.java.net/~jbhateja/avx512_masked_operation_optimization/AVX-512_RA_Opmask_Support_VectorMask_Optimizations.pdf
>
> Hi,
>
> @XiaohongGong also posted Arm SVE predicate register allocation support in panama-vector together with other commits about vector masking support: https://github.com/openjdk/panama-vector/pull/40 last week before this PR. The predicate register allocation part has been tested for some time internally and could be separated from that PR (https://github.com/openjdk/panama-vector/pull/40/commits/e658f4d189c21dcd3668fcafee25d9b678cd3640). If it helps, we can also propose a patch here in openjdk/jdk.
>
>> I'd like to focus high-level aspects first.
>>
>> There's a significant amount of crux coming from the fact that masks
>> don't have their own type. Reusing TypeLong::LONG for k-registers may
>> look appealing at first, but then all the places where RegVMask matters
>> have to handle the types specially. Why not introduce a dedicated type
>> for masks?
>>
>
> I agree that a dedicate type sounds more reasonable, which is covered by @XiaohongGong 's patch, see: https://github.com/openjdk/panama-vector/pull/40/commits/3f69d40f08868062e2cc144b3b757dcbaa2db2d1
>
>> Also, my understanding is AArch64/SVE allows predicate registers to be
>> larger than 64-bit, so TypeLong::LONG won't work there and a dedicated
>> representation will be needed.
>>
>
> Yes, AArch64/SVE predicate registers could be larger. I see in Jatin's patch, it has arch dependent type Matcher::predicate_reg_type(), that looks hacky and workable. But I would still prefer a dedicate type, which looks cleaner. Would a dedicate type also work for k-register?
>
> Thanks,
> Ningsheng

> Second question is about x86 and different mask representations it has:
> AVX-512 introduces predicate registers, but AVX/AVX2 keep masks in wide
> vector registers. Are we fine with leaking this difference into Ideal IR
> and specifying what shape a mask value has during Ideal construction?

Yes, as @nsjian mentioned above, we added a new mask type mapped to a predicate register. Besides, to make a difference with the old vector IRs that uses vector registers for mask on other platforms, we also added a new abstract IR (`VectorMaskNode`) to represent the mask on SVE. All the mask generation IRs are extended from it. Please see codes: https://github.com/openjdk/panama-vector/pull/40/commits/3f69d40f08868062e2cc144b3b757dcbaa2db2d1 .

-------------

PR: https://git.openjdk.java.net/jdk/pull/2768
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262355: Support for AVX-512 opmask register allocation. [v2]

Ningsheng Jian-2
In reply to this post by Jatin Bhateja
On Sun, 28 Feb 2021 19:09:00 GMT, Jatin Bhateja <[hidden email]> wrote:

>> AVX-512 added 8 new 64 bit opmask registers[1] . These registers allow conditional execution and efficient merging of destination operands. At present cross instruction mask propagation is being done either using a GPR (e.g. vmask_gen patterns in x86.ad) or a vector register (for propagating results of a vector comparison or vector load mask operations).
>>
>> This base patch extends the register allocator to support allocation of opmask registers. This will facilitate mask propagation across instructions and thus enable emitting efficient instruction sequence over X86 targets supporting AVX-512 feature.
>>
>> We intend to build a robust optimization framework[2] based on this patch to emit optimized instruction sequence for masked/predicated vector operation for X86 targets supporting AVX-512.
>>
>> Please review and share your feedback.
>>
>> Summary of changes:
>>
>> 1) AD side changes: New register definitions, register classes, allocation classes, operand definitions and spill code handling for opmask registers.
>>
>> 2) Runtime: Save/restoration for opmask registers in 32 and 64 bit JVM.
>>    a) For 64 bit JVM we were anyways reserving the space in the frame layout but earlier were not saving and restoring at designated offset(1088), hence no extra space overhead apart from save/restore cost.
>>    b) For 32 bit JVM: Additional 64 byte are allocated apart from FXSTORE area on the lines of storage for ZMM(16-31) and YMM-Hi bank. There are few regressions due to extra space allocation which we are investigating.
>>
>> 3) Replacing all the hard-coded opmask references from macro-assembly routines: Pulling out the opmask occurrences all the way up to instruction pattern and adding an unbounded opmask operand for them. This exposes these operands to RA and scheduler; this will automatically facilitate spilling of live opmask registers across call sites.
>>
>> 4) Register class initializations related to Op_RegVMask during matcher startup.
>>
>> 5) Handling for mask generating node:  Currently VectorMaskGen node uses a GPR to propagate mask across mask generating DEF instruction to its USER instructions. There are other mask generating nodes like VectorCmpMask, VectorLoadMask which are not handled as the part of this patch. Conditional overriding of two routines, ideal_reg and bottom_type for mask generating IDEAL nodes and modifying the instruction patterns to have new opmask operands enables instruction selector to associate opmask register class with USE/DEF operands  for such MachNodes. This will constrain  the allocation set for these operands to opmask registers(K1-K7).
>>
>> 6) Special handling for setting a flag in PhiNode during construction in case any of its incoming node is a mask generating node, this flag is then checked to return appropriate ideal_reg and bottom_type corresponding to an opmask registers.
>>
>> [1] : Section 15.1.3 :  https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-software-developers-manual-volume-1-basic-architecture.html
>> [2] : http://cr.openjdk.java.net/~jbhateja/avx512_masked_operation_optimization/AVX-512_RA_Opmask_Support_VectorMask_Optimizations.pdf
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
>   8262355: Fix for AARCH64 build failure.

src/hotspot/share/opto/chaitin.cpp line 900:

> 898: #if defined(IA32) || defined(AMD64)
> 899:         case Op_RegVMask:
> 900: #endif

I think this #ifdef is not necessary here, and should not put it together with Op_RegL here. For other arch, at least for Arm SVE, the num_regs for predicate reg is not 2. In https://github.com/openjdk/panama-vector/pull/40/commits/e658f4d189c21dcd3668fcafee25d9b678cd3640, we define a new enum SlotsPerRegVMask. Do you think that works for x86 as well?

src/hotspot/share/opto/node.hpp line 1025:

> 1023:   bool dominates(Node* sub, Node_List &nlist);
> 1024:
> 1025:   virtual const void* meta_data() const { return NULL; }

Is this being used?

src/hotspot/share/opto/regmask.cpp line 72:

> 70: #if defined(AMD64) || defined(IA32)
> 71:     case Op_RegVMask:
> 72: #endif

I think, it would be better we have less #ifdef in the shared code. So SlotsPerVMask? :-)

-------------

PR: https://git.openjdk.java.net/jdk/pull/2768
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262355: Support for AVX-512 opmask register allocation.

Jatin Bhateja
In reply to this post by Xiaohong Gong
On Wed, 3 Mar 2021 01:52:33 GMT, Xiaohong Gong <[hidden email]> wrote:

>> Hi,
>>
>> @XiaohongGong also posted Arm SVE predicate register allocation support in panama-vector together with other commits about vector masking support: https://github.com/openjdk/panama-vector/pull/40 last week before this PR. The predicate register allocation part has been tested for some time internally and could be separated from that PR (https://github.com/openjdk/panama-vector/pull/40/commits/e658f4d189c21dcd3668fcafee25d9b678cd3640). If it helps, we can also propose a patch here in openjdk/jdk.
>>
>>> I'd like to focus high-level aspects first.
>>>
>>> There's a significant amount of crux coming from the fact that masks
>>> don't have their own type. Reusing TypeLong::LONG for k-registers may
>>> look appealing at first, but then all the places where RegVMask matters
>>> have to handle the types specially. Why not introduce a dedicated type
>>> for masks?
>>>
>>
>> I agree that a dedicate type sounds more reasonable, which is covered by @XiaohongGong 's patch, see: https://github.com/openjdk/panama-vector/pull/40/commits/3f69d40f08868062e2cc144b3b757dcbaa2db2d1
>>
>>> Also, my understanding is AArch64/SVE allows predicate registers to be
>>> larger than 64-bit, so TypeLong::LONG won't work there and a dedicated
>>> representation will be needed.
>>>
>>
>> Yes, AArch64/SVE predicate registers could be larger. I see in Jatin's patch, it has arch dependent type Matcher::predicate_reg_type(), that looks hacky and workable. But I would still prefer a dedicate type, which looks cleaner. Would a dedicate type also work for k-register?
>>
>> Thanks,
>> Ningsheng
>
>> Second question is about x86 and different mask representations it has:
>> AVX-512 introduces predicate registers, but AVX/AVX2 keep masks in wide
>> vector registers. Are we fine with leaking this difference into Ideal IR
>> and specifying what shape a mask value has during Ideal construction?
>
> Yes, as @nsjian mentioned above, we added a new mask type mapped to a predicate register. Besides, to make a difference with the old vector IRs that uses vector registers for mask on other platforms, we also added a new abstract IR (`VectorMaskNode`) to represent the mask on SVE. All the mask generation IRs are extended from it. Please see codes: https://github.com/openjdk/panama-vector/pull/40/commits/3f69d40f08868062e2cc144b3b757dcbaa2db2d1 .

> _Mailing list message from [Vladimir Ivanov](mailto:[hidden email]) on [hotspot-compiler-dev](mailto:[hidden email]):_
>
> Good work, Jatin!
>
> I'd like to focus high-level aspects first.
>
> There's a significant amount of crux coming from the fact that masks
> don't have their own type. Reusing TypeLong::LONG for k-registers may
> look appealing at first, but then all the places where RegVMask matters
> have to handle the types specially. Why not introduce a dedicated type
> for masks?
>
> Also, my understanding is AArch64/SVE allows predicate registers to be
> larger than 64-bit, so TypeLong::LONG won't work there and a dedicated
> representation will be needed.

 Current register allocation framework can perform allocation at the granularity of 32 bit. Thus in order to allocate a 64 bit register we  reserve 2 bits from the register mask of its corresponding live range. Spilling code (MachSpillCopyNode::implementation) also is sensitive to this since a 32 bit def in 64 bit mode spill only 32 bit value.

 Opmask register is special in a way such that usable register portion could be 8,16,32 or 64 bit wide depending on the lane type and  vector size. Thus in an optimal implementation both allocator and spill code may allocate and spill only the usable portion of the opmask register. This may not be possible in current allocation frame work.

 Keeping this added complexity out of implementation, existing patch performs both allocation and spilling at 64 bit granularity.  This is why a LONG type is sufficient to represent an Opmask register for X86.

 I agree that ARM SVE may have to create a new mask Type since performing the spill and allocation at widest possible mask will be costly.  Thus Matcher::perdicate_reg_type() can be used to return the LONG type for X86 and new mask type for ARM SVE. This will prevent  any modification in target independent IR.

 Also for X86 a mask generating node may have different Ideal type and register for non-AVX512 targets.

 Please let me know if there is any disconnect in my understanding here.
>
> Second question is about x86 and different mask representations it has:
> AVX-512 introduces predicate registers, but AVX/AVX2 keep masks in wide
> vector registers. Are we fine with leaking this difference into Ideal IR
> and specifying what shape a mask value has during Ideal construction?
>

For AVX-512 targets any mask generating node will have LONG type and a vector type for non-AVX512 case.
This will pass down from IDEAL IR to Machine IR and thus allocator will take allocation call based on the ideal register corresponding to the type.

Please elaborate what adverse implication do you see with this approach.

> Regarding the patch itself, RegisterSaver support and related changes
> can be integrated separately.
>
> Best regards,
> Vladimir Ivanov
>
> On 28.02.2021 21:40, Jatin Bhateja wrote:

-------------

PR: https://git.openjdk.java.net/jdk/pull/2768
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262355: Support for AVX-512 opmask register allocation.

Vladimir Ivanov

>> There's a significant amount of crux coming from the fact that masks
>> don't have their own type. Reusing TypeLong::LONG for k-registers may
>> look appealing at first, but then all the places where RegVMask matters
>> have to handle the types specially. Why not introduce a dedicated type
>> for masks?
>>
>> Also, my understanding is AArch64/SVE allows predicate registers to be
>> larger than 64-bit, so TypeLong::LONG won't work there and a dedicated
>> representation will be needed.
>
>   Current register allocation framework can perform allocation at the granularity of 32 bit. Thus in order to allocate a 64 bit register we  reserve 2 bits from the register mask of its corresponding live range. Spilling code (MachSpillCopyNode::implementation) also is sensitive to this since a 32 bit def in 64 bit mode spill only 32 bit value.
>
>   Opmask register is special in a way such that usable register portion could be 8,16,32 or 64 bit wide depending on the lane type and  vector size. Thus in an optimal implementation both allocator and spill code may allocate and spill only the usable portion of the opmask register. This may not be possible in current allocation frame work.

Yes, it may be attractive in the future, but I don't see it as something
important enough for the first versions.

>   Keeping this added complexity out of implementation, existing patch performs both allocation and spilling at 64 bit granularity.  This is why a LONG type is sufficient to represent an Opmask register for X86.

That's fine with me.

>   I agree that ARM SVE may have to create a new mask Type since performing the spill and allocation at widest possible mask will be costly.  Thus Matcher::perdicate_reg_type() can be used to return the LONG type for X86 and new mask type for ARM SVE. This will prevent  any modification in target independent IR.

Matcher::perdicate_reg_type() is not enough to hide the
platform-specific choice. The choice of TypeLong introduces significant
complexity in shared code (e.g, PhiNode and MachSpillCopyNode-related
changes). Moreover, SVE will have to choose an alternative and more
generic representation. Hence, it makes the ad-hoc changes needed for
RegVMask+TypeLong even less attractive.

Considering there's active work going on on SVE support, I'm in favor of
collaborating on unified representation between platforms and rely on it
in the first version.

>   Also for X86 a mask generating node may have different Ideal type and register for non-AVX512 targets.

>> Second question is about x86 and different mask representations it has:
>> AVX-512 introduces predicate registers, but AVX/AVX2 keep masks in wide
>> vector registers. Are we fine with leaking this difference into Ideal IR
>> and specifying what shape a mask value has during Ideal construction?
>>
>
> For AVX-512 targets any mask generating node will have LONG type and a vector type for non-AVX512 case.
> This will pass down from IDEAL IR to Machine IR and thus allocator will take allocation call based on the ideal register corresponding to the type.

What I'd like to avoid is the situation when different Ideal IR shapes
are needed to work with different mask representations.

Customizing node types fits that goal well, but I don't fully understand
all the implications yet, in particular:

   (1) having a node which can be of type TypeVect or TypeLong may be
problematic for some existing code;

   (2) as of now, there's nothing which forbids pre-AVX512 and AVX512
code to be mixed in a single compilation unit. Is it feasible to require
only a single mask representation to be used at runtime?

Best regards,
Vladimir Ivanov
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262355: Support for AVX-512 opmask register allocation.

John Rose-3
On Mar 3, 2021, at 4:22 AM, Vladimir Ivanov <[hidden email]> wrote:
>
> Matcher::perdicate_reg_type() is not enough to hide the platform-specific choice. The choice of TypeLong introduces significant complexity in shared code (e.g, PhiNode and MachSpillCopyNode-related changes). Moreover, SVE will have to choose an alternative and more generic representation. Hence, it makes the ad-hoc changes needed for RegVMask+TypeLong even less attractive.
>
> Considering there's active work going on on SVE support, I'm in favor of collaborating on unified representation between platforms and rely on it in the first version.

I agree.  Would it help to have TypeMask as a #define of TypeLong,
in the first versions?  (I’m thinking of TypeX as a similar pseudo-type.)
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262355: Support for AVX-512 opmask register allocation.

Vladimir Ivanov

>> Matcher::perdicate_reg_type() is not enough to hide the
>> platform-specific choice. The choice of TypeLong introduces
>> significant complexity in shared code (e.g, PhiNode and
>> MachSpillCopyNode-related changes). Moreover, SVE will have to choose
>> an alternative and more generic representation. Hence, it makes the
>> ad-hoc changes needed for RegVMask+TypeLong even less attractive.
>>
>> Considering there's active work going on on SVE support, I'm in favor
>> of collaborating on unified representation between platforms and rely
>> on it in the first version.
>
> I agree.  Would it help to have TypeMask as a #define of TypeLong,
> in the first versions?  (I’m thinking of TypeX as a similar pseudo-type.)

Unfortunately, I don't see how a macro could help here. It would still
be an instance of TypeLong and the code would have to explicitly
dispatch between mask and non-mask cases when choosing appropriate ideal
register.

I find TypeVMask which is proposed as part of SVE support (Ningsheng
mentioned it earlier) a much better alternative:
 
https://github.com/openjdk/panama-vector/pull/40/commits/3f69d40f08868062e2cc144b3b757dcbaa2db2d1

Also, a comment on process: Ningsheng and Xiaohong started with RFC on
panama-dev which looks more convenient. It allows to commit earlier and
iterate quicker compared to jdk. It makes sense to come up with missing
primitives first which satisfy both AVX-512 and SVE requirements. Once
they are available, proper support can follow. And alternative
implementations can happily coexist in Panama repo while the
implementations converge.

Best regards,
Vladimir Ivanov
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262355: Support for AVX-512 opmask register allocation. [v3]

Jatin Bhateja
In reply to this post by Jatin Bhateja
> AVX-512 added 8 new 64 bit opmask registers[1] . These registers allow conditional execution and efficient merging of destination operands. At present cross instruction mask propagation is being done either using a GPR (e.g. vmask_gen patterns in x86.ad) or a vector register (for propagating results of a vector comparison or vector load mask operations).
>
> This base patch extends the register allocator to support allocation of opmask registers. This will facilitate mask propagation across instructions and thus enable emitting efficient instruction sequence over X86 targets supporting AVX-512 feature.
>
> We intend to build a robust optimization framework[2] based on this patch to emit optimized instruction sequence for masked/predicated vector operation for X86 targets supporting AVX-512.
>
> Please review and share your feedback.
>
> Summary of changes:
>
> 1) AD side changes: New register definitions, register classes, allocation classes, operand definitions and spill code handling for opmask registers.
>
> 2) Runtime: Save/restoration for opmask registers in 32 and 64 bit JVM.
>    a) For 64 bit JVM we were anyways reserving the space in the frame layout but earlier were not saving and restoring at designated offset(1088), hence no extra space overhead apart from save/restore cost.
>    b) For 32 bit JVM: Additional 64 byte are allocated apart from FXSTORE area on the lines of storage for ZMM(16-31) and YMM-Hi bank. There are few regressions due to extra space allocation which we are investigating.
>
> 3) Replacing all the hard-coded opmask references from macro-assembly routines: Pulling out the opmask occurrences all the way up to instruction pattern and adding an unbounded opmask operand for them. This exposes these operands to RA and scheduler; this will automatically facilitate spilling of live opmask registers across call sites.
>
> 4) Register class initializations related to Op_RegVMask during matcher startup.
>
> 5) Handling for mask generating node:  Currently VectorMaskGen node uses a GPR to propagate mask across mask generating DEF instruction to its USER instructions. There are other mask generating nodes like VectorCmpMask, VectorLoadMask which are not handled as the part of this patch. Conditional overriding of two routines, ideal_reg and bottom_type for mask generating IDEAL nodes and modifying the instruction patterns to have new opmask operands enables instruction selector to associate opmask register class with USE/DEF operands  for such MachNodes. This will constrain  the allocation set for these operands to opmask registers(K1-K7).
>
> 6) Special handling for setting a flag in PhiNode during construction in case any of its incoming node is a mask generating node, this flag is then checked to return appropriate ideal_reg and bottom_type corresponding to an opmask registers.
>
> [1] : Section 15.1.3 :  https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-software-developers-manual-volume-1-basic-architecture.html
> [2] : http://cr.openjdk.java.net/~jbhateja/avx512_masked_operation_optimization/AVX-512_RA_Opmask_Support_VectorMask_Optimizations.pdf

Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:

  8262355: Creating a new ideal type TypeVectMask for mask generating nodes.

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/2768/files
  - new: https://git.openjdk.java.net/jdk/pull/2768/files/69003aa6..ffacbc62

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2768&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2768&range=01-02

  Stats: 270 lines in 24 files changed: 53 ins; 197 del; 20 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2768.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2768/head:pull/2768

PR: https://git.openjdk.java.net/jdk/pull/2768
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262355: Support for AVX-512 opmask register allocation.

Jatin Bhateja
In reply to this post by Ningsheng Jian-2
On Wed, 3 Mar 2021 01:39:11 GMT, Ningsheng Jian <[hidden email]> wrote:

>> AVX-512 added 8 new 64 bit opmask registers[1] . These registers allow conditional execution and efficient merging of destination operands. At present cross instruction mask propagation is being done either using a GPR (e.g. vmask_gen patterns in x86.ad) or a vector register (for propagating results of a vector comparison or vector load mask operations).
>>
>> This base patch extends the register allocator to support allocation of opmask registers. This will facilitate mask propagation across instructions and thus enable emitting efficient instruction sequence over X86 targets supporting AVX-512 feature.
>>
>> We intend to build a robust optimization framework[2] based on this patch to emit optimized instruction sequence for masked/predicated vector operation for X86 targets supporting AVX-512.
>>
>> Please review and share your feedback.
>>
>> Summary of changes:
>>
>> 1) AD side changes: New register definitions, register classes, allocation classes, operand definitions and spill code handling for opmask registers.
>>
>> 2) Runtime: Save/restoration for opmask registers in 32 and 64 bit JVM.
>>    a) For 64 bit JVM we were anyways reserving the space in the frame layout but earlier were not saving and restoring at designated offset(1088), hence no extra space overhead apart from save/restore cost.
>>    b) For 32 bit JVM: Additional 64 byte are allocated apart from FXSTORE area on the lines of storage for ZMM(16-31) and YMM-Hi bank. There are few regressions due to extra space allocation which we are investigating.
>>
>> 3) Replacing all the hard-coded opmask references from macro-assembly routines: Pulling out the opmask occurrences all the way up to instruction pattern and adding an unbounded opmask operand for them. This exposes these operands to RA and scheduler; this will automatically facilitate spilling of live opmask registers across call sites.
>>
>> 4) Register class initializations related to Op_RegVMask during matcher startup.
>>
>> 5) Handling for mask generating node:  Currently VectorMaskGen node uses a GPR to propagate mask across mask generating DEF instruction to its USER instructions. There are other mask generating nodes like VectorCmpMask, VectorLoadMask which are not handled as the part of this patch. Conditional overriding of two routines, ideal_reg and bottom_type for mask generating IDEAL nodes and modifying the instruction patterns to have new opmask operands enables instruction selector to associate opmask register class with USE/DEF operands  for such MachNodes. This will constrain  the allocation set for these operands to opmask registers(K1-K7).
>>
>> 6) Special handling for setting a flag in PhiNode during construction in case any of its incoming node is a mask generating node, this flag is then checked to return appropriate ideal_reg and bottom_type corresponding to an opmask registers.
>>
>> [1] : Section 15.1.3 :  https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-software-developers-manual-volume-1-basic-architecture.html
>> [2] : http://cr.openjdk.java.net/~jbhateja/avx512_masked_operation_optimization/AVX-512_RA_Opmask_Support_VectorMask_Optimizations.pdf
>
> Hi,
>
> @XiaohongGong also posted Arm SVE predicate register allocation support in panama-vector together with other commits about vector masking support: https://github.com/openjdk/panama-vector/pull/40 last week before this PR. The predicate register allocation part has been tested for some time internally and could be separated from that PR (https://github.com/openjdk/panama-vector/pull/40/commits/e658f4d189c21dcd3668fcafee25d9b678cd3640). If it helps, we can also propose a patch here in openjdk/jdk.
>
>> I'd like to focus high-level aspects first.
>>
>> There's a significant amount of crux coming from the fact that masks
>> don't have their own type. Reusing TypeLong::LONG for k-registers may
>> look appealing at first, but then all the places where RegVMask matters
>> have to handle the types specially. Why not introduce a dedicated type
>> for masks?
>>
>
> I agree that a dedicate type sounds more reasonable, which is covered by @XiaohongGong 's patch, see: https://github.com/openjdk/panama-vector/pull/40/commits/3f69d40f08868062e2cc144b3b757dcbaa2db2d1
>
>> Also, my understanding is AArch64/SVE allows predicate registers to be
>> larger than 64-bit, so TypeLong::LONG won't work there and a dedicated
>> representation will be needed.
>>
>
> Yes, AArch64/SVE predicate registers could be larger. I see in Jatin's patch, it has arch dependent type Matcher::predicate_reg_type(), that looks hacky and workable. But I would still prefer a dedicate type, which looks cleaner. Would a dedicate type also work for k-register?
>
> Thanks,
> Ningsheng

@nsjian  @iwanowww  , all your outstanding comments have been resolved.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2768
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262355: Support for AVX-512 opmask register allocation. [v2]

Jatin Bhateja
In reply to this post by Vladimir Ivanov-2
On Tue, 2 Mar 2021 11:43:31 GMT, Vladimir Ivanov <[hidden email]> wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   8262355: Fix for AARCH64 build failure.
>
> src/hotspot/cpu/x86/gc/z/zBarrierSetAssembler_x86.cpp line 505:
>
>> 503:     caller_saved.Insert(OptoReg::as_OptoReg(r10->as_VMReg()));
>> 504:     caller_saved.Insert(OptoReg::as_OptoReg(r11->as_VMReg()));
>> 505:     if (UseAVX > 2) {
>
> Similar question here: `ZSaveLiveRegisters` handles GP registers and vector registers differently. Why predicate registers are handled the same way GP registers are and not as vector registers?

Hi Vladimir,
ZGC load barriers embeds a CALL instruction which is not exposed to compiler, thus opmask registers which are caller saved as per X86 ABI are saved before making the call and later restored back.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2768
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262355: Support for AVX-512 opmask register allocation. [v4]

Jatin Bhateja
In reply to this post by Jatin Bhateja
> AVX-512 added 8 new 64 bit opmask registers[1] . These registers allow conditional execution and efficient merging of destination operands. At present cross instruction mask propagation is being done either using a GPR (e.g. vmask_gen patterns in x86.ad) or a vector register (for propagating results of a vector comparison or vector load mask operations).
>
> This base patch extends the register allocator to support allocation of opmask registers. This will facilitate mask propagation across instructions and thus enable emitting efficient instruction sequence over X86 targets supporting AVX-512 feature.
>
> We intend to build a robust optimization framework[2] based on this patch to emit optimized instruction sequence for masked/predicated vector operation for X86 targets supporting AVX-512.
>
> Please review and share your feedback.
>
> Summary of changes:
>
> 1) AD side changes: New register definitions, register classes, allocation classes, operand definitions and spill code handling for opmask registers.
>
> 2) Runtime: Save/restoration for opmask registers in 32 and 64 bit JVM.
>    a) For 64 bit JVM we were anyways reserving the space in the frame layout but earlier were not saving and restoring at designated offset(1088), hence no extra space overhead apart from save/restore cost.
>    b) For 32 bit JVM: Additional 64 byte are allocated apart from FXSTORE area on the lines of storage for ZMM(16-31) and YMM-Hi bank. There are few regressions due to extra space allocation which we are investigating.
>
> 3) Replacing all the hard-coded opmask references from macro-assembly routines: Pulling out the opmask occurrences all the way up to instruction pattern and adding an unbounded opmask operand for them. This exposes these operands to RA and scheduler; this will automatically facilitate spilling of live opmask registers across call sites.
>
> 4) Register class initializations related to Op_RegVMask during matcher startup.
>
> 5) Handling for mask generating node:  Currently VectorMaskGen node uses a GPR to propagate mask across mask generating DEF instruction to its USER instructions. There are other mask generating nodes like VectorCmpMask, VectorLoadMask which are not handled as the part of this patch. Conditional overriding of two routines, ideal_reg and bottom_type for mask generating IDEAL nodes and modifying the instruction patterns to have new opmask operands enables instruction selector to associate opmask register class with USE/DEF operands  for such MachNodes. This will constrain  the allocation set for these operands to opmask registers(K1-K7).
>
> 6) Special handling for setting a flag in PhiNode during construction in case any of its incoming node is a mask generating node, this flag is then checked to return appropriate ideal_reg and bottom_type corresponding to an opmask registers.
>
> [1] : Section 15.1.3 :  https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-software-developers-manual-volume-1-basic-architecture.html
> [2] : http://cr.openjdk.java.net/~jbhateja/avx512_masked_operation_optimization/AVX-512_RA_Opmask_Support_VectorMask_Optimizations.pdf

Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:

  8262355: Some synthetic changes for cleanup.

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/2768/files
  - new: https://git.openjdk.java.net/jdk/pull/2768/files/ffacbc62..4eaa3d8c

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2768&range=03
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2768&range=02-03

  Stats: 19 lines in 7 files changed: 4 ins; 10 del; 5 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2768.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2768/head:pull/2768

PR: https://git.openjdk.java.net/jdk/pull/2768
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262355: Support for AVX-512 opmask register allocation. [v4]

Vladimir Ivanov-2
On Fri, 5 Mar 2021 11:26:11 GMT, Jatin Bhateja <[hidden email]> wrote:

>> AVX-512 added 8 new 64 bit opmask registers[1] . These registers allow conditional execution and efficient merging of destination operands. At present cross instruction mask propagation is being done either using a GPR (e.g. vmask_gen patterns in x86.ad) or a vector register (for propagating results of a vector comparison or vector load mask operations).
>>
>> This base patch extends the register allocator to support allocation of opmask registers. This will facilitate mask propagation across instructions and thus enable emitting efficient instruction sequence over X86 targets supporting AVX-512 feature.
>>
>> We intend to build a robust optimization framework[2] based on this patch to emit optimized instruction sequence for masked/predicated vector operation for X86 targets supporting AVX-512.
>>
>> Please review and share your feedback.
>>
>> Summary of changes:
>>
>> 1) AD side changes: New register definitions, register classes, allocation classes, operand definitions and spill code handling for opmask registers.
>>
>> 2) Runtime: Save/restoration for opmask registers in 32 and 64 bit JVM.
>>    a) For 64 bit JVM we were anyways reserving the space in the frame layout but earlier were not saving and restoring at designated offset(1088), hence no extra space overhead apart from save/restore cost.
>>    b) For 32 bit JVM: Additional 64 byte are allocated apart from FXSTORE area on the lines of storage for ZMM(16-31) and YMM-Hi bank. There are few regressions due to extra space allocation which we are investigating.
>>
>> 3) Replacing all the hard-coded opmask references from macro-assembly routines: Pulling out the opmask occurrences all the way up to instruction pattern and adding an unbounded opmask operand for them. This exposes these operands to RA and scheduler; this will automatically facilitate spilling of live opmask registers across call sites.
>>
>> 4) Register class initializations related to Op_RegVMask during matcher startup.
>>
>> 5) Handling for mask generating node:  Currently VectorMaskGen node uses a GPR to propagate mask across mask generating DEF instruction to its USER instructions. There are other mask generating nodes like VectorCmpMask, VectorLoadMask which are not handled as the part of this patch. Conditional overriding of two routines, ideal_reg and bottom_type for mask generating IDEAL nodes and modifying the instruction patterns to have new opmask operands enables instruction selector to associate opmask register class with USE/DEF operands  for such MachNodes. This will constrain  the allocation set for these operands to opmask registers(K1-K7).
>>
>> 6) Special handling for setting a flag in PhiNode during construction in case any of its incoming node is a mask generating node, this flag is then checked to return appropriate ideal_reg and bottom_type corresponding to an opmask registers.
>>
>> [1] : Section 15.1.3 :  https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-software-developers-manual-volume-1-basic-architecture.html
>> [2] : http://cr.openjdk.java.net/~jbhateja/avx512_masked_operation_optimization/AVX-512_RA_Opmask_Support_VectorMask_Optimizations.pdf
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
>   8262355: Some synthetic changes for cleanup.

It looks much better now.

src/hotspot/cpu/x86/register_x86.hpp line 260:

> 258:
> 259:     number_of_registers = RegisterImpl::number_of_registers * RegisterImpl::max_slots_per_register +
> 260:       2 * FloatRegisterImpl::number_of_registers + NOT_LP64(8) LP64_ONLY(0) +

Can you elaborate, please, why 8 more slots are needed on x86_32?

src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp line 110:

> 108:     DEF_YMM_OFFS(0),
> 109:     DEF_YMM_OFFS(1),
> 110:     // 2..7 are implied in range usage

The comment is wrongly placed: it reads as if `2..7` were related to `DEF_YMM_OFFS` while it refers to `DEF_OPMASK_OFFS`.

Also, what about moving the declaration after `DEF_ZMM_UPPER_OFFS`? It's a bit weird to see `OPMASK` between `YMM` and `ZMM` declarations.

src/hotspot/cpu/x86/vmreg_x86.cpp line 75:

> 73: #define STACK_TYPE 3
> 74:
> 75: //TODO: Case for KRegisters

Please, elaborate what is missing and how the missing code manifests at runtime.

src/hotspot/cpu/x86/x86_64.ad line 428:

> 426:   _INT_NO_RCX_REG_mask.Remove(OptoReg::as_OptoReg(rcx->as_VMReg()));
> 427:
> 428:   if(Matcher::has_predicated_vectors()) {

Missing space between `if` and the bracket.

src/hotspot/cpu/x86/x86_64.ad line 1407:

> 1405:       if ((src_first & 1) == 0 && src_first + 1 == src_second &&
> 1406:           (dst_first & 1) == 0 && dst_first + 1 == dst_second) {
> 1407:       // 64-bit

`// 64-bit` identation is wrong.

src/hotspot/cpu/x86/x86_64.ad line 1577:

> 1575:       }
> 1576:       return 0;
> 1577:     } else if(dst_first_rc == rc_float) {

Missing a space between if and the bracket.

src/hotspot/share/opto/regmask.hpp line 108:

> 106:          SlotsPerVecY = 8,
> 107:          SlotsPerVecZ = 16,
> 108:          SlotsPerRegVmask = X86_ONLY(2) NOT_X86(1)

Should it be named `SlotsPerRegVMask` instead?

src/hotspot/share/opto/type.cpp line 664:

> 662:   _zero_type[T_CONFLICT]= NULL;
> 663:
> 664: #if defined(AMD64) || defined(IA32)

You can use `X86` instead.

src/hotspot/share/opto/matcher.cpp line 995:

> 993:     idealreg2regmask[Op_RegVMask] = regmask_for_ideal_register(Op_RegVMask, ret);
> 994:   } else {
> 995:     idealreg2regmask[Op_RegVMask] = idealreg2regmask[Op_RegI];

How `idealreg2regmask[Op_RegVMask]` is used when `Matcher::has_predicated_vectors() == false`?

-------------

PR: https://git.openjdk.java.net/jdk/pull/2768
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262355: Support for AVX-512 opmask register allocation. [v2]

Vladimir Ivanov-2
In reply to this post by Jatin Bhateja
On Fri, 5 Mar 2021 10:00:37 GMT, Jatin Bhateja <[hidden email]> wrote:

>> src/hotspot/cpu/x86/gc/z/zBarrierSetAssembler_x86.cpp line 505:
>>
>>> 503:     caller_saved.Insert(OptoReg::as_OptoReg(r10->as_VMReg()));
>>> 504:     caller_saved.Insert(OptoReg::as_OptoReg(r11->as_VMReg()));
>>> 505:     if (UseAVX > 2) {
>>
>> Similar question here: `ZSaveLiveRegisters` handles GP registers and vector registers differently. Why predicate registers are handled the same way GP registers are and not as vector registers?
>
> Hi Vladimir,
> ZGC load barriers embeds a CALL instruction which is not exposed to compiler, thus opmask registers which are caller saved as per X86 ABI are saved before making the call and later restored back.

I took a closer look at the relevant code and now I better understand what confuses me here.

Through vector registers are caller saved, they aren't added into caller_saved, but just unconditionally spilled.
Why can't you do the same for KRegisters? `caller_saved.Member(opto_reg)` should always be true for KRegister, shouldn't it?

-------------

PR: https://git.openjdk.java.net/jdk/pull/2768
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262355: Support for AVX-512 opmask register allocation. [v4]

Ningsheng Jian-2
In reply to this post by Jatin Bhateja
On Fri, 5 Mar 2021 11:26:11 GMT, Jatin Bhateja <[hidden email]> wrote:

>> AVX-512 added 8 new 64 bit opmask registers[1] . These registers allow conditional execution and efficient merging of destination operands. At present cross instruction mask propagation is being done either using a GPR (e.g. vmask_gen patterns in x86.ad) or a vector register (for propagating results of a vector comparison or vector load mask operations).
>>
>> This base patch extends the register allocator to support allocation of opmask registers. This will facilitate mask propagation across instructions and thus enable emitting efficient instruction sequence over X86 targets supporting AVX-512 feature.
>>
>> We intend to build a robust optimization framework[2] based on this patch to emit optimized instruction sequence for masked/predicated vector operation for X86 targets supporting AVX-512.
>>
>> Please review and share your feedback.
>>
>> Summary of changes:
>>
>> 1) AD side changes: New register definitions, register classes, allocation classes, operand definitions and spill code handling for opmask registers.
>>
>> 2) Runtime: Save/restoration for opmask registers in 32 and 64 bit JVM.
>>    a) For 64 bit JVM we were anyways reserving the space in the frame layout but earlier were not saving and restoring at designated offset(1088), hence no extra space overhead apart from save/restore cost.
>>    b) For 32 bit JVM: Additional 64 byte are allocated apart from FXSTORE area on the lines of storage for ZMM(16-31) and YMM-Hi bank. There are few regressions due to extra space allocation which we are investigating.
>>
>> 3) Replacing all the hard-coded opmask references from macro-assembly routines: Pulling out the opmask occurrences all the way up to instruction pattern and adding an unbounded opmask operand for them. This exposes these operands to RA and scheduler; this will automatically facilitate spilling of live opmask registers across call sites.
>>
>> 4) Register class initializations related to Op_RegVMask during matcher startup.
>>
>> 5) Handling for mask generating node:  Currently VectorMaskGen node uses a GPR to propagate mask across mask generating DEF instruction to its USER instructions. There are other mask generating nodes like VectorCmpMask, VectorLoadMask which are not handled as the part of this patch. Conditional overriding of two routines, ideal_reg and bottom_type for mask generating IDEAL nodes and modifying the instruction patterns to have new opmask operands enables instruction selector to associate opmask register class with USE/DEF operands  for such MachNodes. This will constrain  the allocation set for these operands to opmask registers(K1-K7).
>>
>> 6) Special handling for setting a flag in PhiNode during construction in case any of its incoming node is a mask generating node, this flag is then checked to return appropriate ideal_reg and bottom_type corresponding to an opmask registers.
>>
>> [1] : Section 15.1.3 :  https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-software-developers-manual-volume-1-basic-architecture.html
>> [2] : http://cr.openjdk.java.net/~jbhateja/avx512_masked_operation_optimization/AVX-512_RA_Opmask_Support_VectorMask_Optimizations.pdf
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
>   8262355: Some synthetic changes for cleanup.

Hi Jatin,

Thanks for the work. Unfortunately, it seems that there are some conflicts (e.g. type definition) with our work at https://github.com/openjdk/panama-vector/pull/40. Could you please help to take a look at https://github.com/openjdk/panama-vector/pull/40, so that we could improve that as well and reduce the potential merge effort?

Thanks,
Ningsheng

src/hotspot/share/opto/type.cpp line 74:

> 72:   { Bad,             T_ILLEGAL,    "vectorz:",      false, 0,                    relocInfo::none          },  // VectorZ
> 73: #elif defined(S390)
> 74:   { Bad,             T_ILLEGAL,    "vectorm:",      false, Op_RegVMask,          relocInfo::none          },  // VectorM.

The name "vectorm" here looks like a kind of vector, is it? Also, since this line is the same in all the #define-else blocks, perhaps you can move it out of the #define block?

src/hotspot/share/opto/type.hpp line 306:

> 304:   const TypeVect   *isa_vect() const;            // Returns NULL if not a Vector
> 305:   const TypeVectMask *is_vectmask() const;       // Predicate/Mask Vector
> 306:   const TypeVectMask *isa_vectmask() const;      // Returns NULL if not a Predicate/Mask Vector

"if not a Predicate/Mask Vector" --> "if not a Vector Predicate/Mask"?

src/hotspot/share/opto/type.hpp line 859:

> 857: public:
> 858:   friend class TypeVect;
> 859:   TypeVectMask(const Type* elem, uint length) : TypeVect(VectorM, elem, length) {}

Any reason why TypeVectMask is a subclass of TypeVect? Do existing helper functions in TypeVect also work for TypeVectMask, e.g. xmeet.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2768
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262355: Support for AVX-512 opmask register allocation. [v5]

Jatin Bhateja
In reply to this post by Jatin Bhateja
> AVX-512 added 8 new 64 bit opmask registers[1] . These registers allow conditional execution and efficient merging of destination operands. At present cross instruction mask propagation is being done either using a GPR (e.g. vmask_gen patterns in x86.ad) or a vector register (for propagating results of a vector comparison or vector load mask operations).
>
> This base patch extends the register allocator to support allocation of opmask registers. This will facilitate mask propagation across instructions and thus enable emitting efficient instruction sequence over X86 targets supporting AVX-512 feature.
>
> We intend to build a robust optimization framework[2] based on this patch to emit optimized instruction sequence for masked/predicated vector operation for X86 targets supporting AVX-512.
>
> Please review and share your feedback.
>
> Summary of changes:
>
> 1) AD side changes: New register definitions, register classes, allocation classes, operand definitions and spill code handling for opmask registers.
>
> 2) Runtime: Save/restoration for opmask registers in 32 and 64 bit JVM.
>    a) For 64 bit JVM we were anyways reserving the space in the frame layout but earlier were not saving and restoring at designated offset(1088), hence no extra space overhead apart from save/restore cost.
>    b) For 32 bit JVM: Additional 64 byte are allocated apart from FXSTORE area on the lines of storage for ZMM(16-31) and YMM-Hi bank. There are few regressions due to extra space allocation which we are investigating.
>
> 3) Replacing all the hard-coded opmask references from macro-assembly routines: Pulling out the opmask occurrences all the way up to instruction pattern and adding an unbounded opmask operand for them. This exposes these operands to RA and scheduler; this will automatically facilitate spilling of live opmask registers across call sites.
>
> 4) Register class initializations related to Op_RegVMask during matcher startup.
>
> 5) Handling for mask generating node:  Currently VectorMaskGen node uses a GPR to propagate mask across mask generating DEF instruction to its USER instructions. There are other mask generating nodes like VectorCmpMask, VectorLoadMask which are not handled as the part of this patch. Conditional overriding of two routines, ideal_reg and bottom_type for mask generating IDEAL nodes and modifying the instruction patterns to have new opmask operands enables instruction selector to associate opmask register class with USE/DEF operands  for such MachNodes. This will constrain  the allocation set for these operands to opmask registers(K1-K7).
>
> 6) Special handling for setting a flag in PhiNode during construction in case any of its incoming node is a mask generating node, this flag is then checked to return appropriate ideal_reg and bottom_type corresponding to an opmask registers.
>
> [1] : Section 15.1.3 :  https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-software-developers-manual-volume-1-basic-architecture.html
> [2] : http://cr.openjdk.java.net/~jbhateja/avx512_masked_operation_optimization/AVX-512_RA_Opmask_Support_VectorMask_Optimizations.pdf

Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:

  8262355 : Review comments resolution and deopt handling for mask generating nodes reconstruction.

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/2768/files
  - new: https://git.openjdk.java.net/jdk/pull/2768/files/4eaa3d8c..4fadca52

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2768&range=04
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2768&range=03-04

  Stats: 140 lines in 14 files changed: 73 ins; 28 del; 39 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2768.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2768/head:pull/2768

PR: https://git.openjdk.java.net/jdk/pull/2768
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8262355: Support for AVX-512 opmask register allocation. [v4]

Jatin Bhateja
In reply to this post by Vladimir Ivanov-2
On Fri, 5 Mar 2021 13:49:41 GMT, Vladimir Ivanov <[hidden email]> wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   8262355: Some synthetic changes for cleanup.
>
> src/hotspot/cpu/x86/register_x86.hpp line 260:
>
>> 258:
>> 259:     number_of_registers = RegisterImpl::number_of_registers * RegisterImpl::max_slots_per_register +
>> 260:       2 * FloatRegisterImpl::number_of_registers + NOT_LP64(8) LP64_ONLY(0) +
>
> Can you elaborate, please, why 8 more slots are needed on x86_32?

This is to cover for additional FILL0-7 definitions present only for 32 bit JVM

> src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp line 110:
>
>> 108:     DEF_YMM_OFFS(0),
>> 109:     DEF_YMM_OFFS(1),
>> 110:     // 2..7 are implied in range usage
>
> The comment is wrongly placed: it reads as if `2..7` were related to `DEF_YMM_OFFS` while it refers to `DEF_OPMASK_OFFS`.
>
> Also, what about moving the declaration after `DEF_ZMM_UPPER_OFFS`? It's a bit weird to see `OPMASK` between `YMM` and `ZMM` declarations.

To make it consistent with XSAVE stack layout dumping opmask at 1088 offset, anyways space was reserved in the frame, just that we weren't using it.

> src/hotspot/cpu/x86/vmreg_x86.cpp line 75:
>
>> 73: #define STACK_TYPE 3
>> 74:
>> 75: //TODO: Case for KRegisters
>
> Please, elaborate what is missing and how the missing code manifests at runtime.

Would seek your suggestion here. Can it wait for base patch

-------------

PR: https://git.openjdk.java.net/jdk/pull/2768
1234 ... 6