[10] RFR: 8189176 - AARCH64: Improve _updateBytesCRC32 intrinsic

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

[10] RFR: 8189176 - AARCH64: Improve _updateBytesCRC32 intrinsic

Dmitry Chuyko-2
Hello,

Please review an improvement of CRC32 calculation on AArch64.

MacroAssembler::kernel_crc32 gets table registers that are not used on
-XX:+UseCRC32 path. They can be used to make neighbor loads and CRC
calculations independent. Adding prologue and epilogue for main by-64
loop makes it applicable starting from len=128 so additional by-32 loop
is added for smaller lengths.

rfe: https://bugs.openjdk.java.net/browse/JDK-8189176
webrev: http://cr.openjdk.java.net/~dchuyko/8189176/webrev.00/
benchmark: http://cr.openjdk.java.net/~dchuyko/8189176/crc32/CRC32Bench.java

Results for T88 and A53 are good, but splitting pair loads may slow down
other CPUs so measurements on different HW are highly welcome.

-Dmitry

Reply | Threaded
Open this post in threaded view
|

RE: [10] RFR: 8189176 - AARCH64: Improve _updateBytesCRC32 intrinsic

White, Derek
Hi Dmitry,

The code looks good.

I have one suggestion for MacroAssembler::kernel_crc32(). It's a matter of taste, so it really is just a suggestion:
 - The use of temp registers in the UseCRC32 case is kind of muddled, using tmp, and table0..table3 as temp registers, and the name "table" is confusing in this case.
 - Maybe it would be cleaner to refactor the UseCRC32 code into a separate kernel_crc32_using_crc32() subroutine (static or macro?). This would accept the main args and 4 registers for temps. The caller can supply some combination of table or tmp registers.
- This would shrink the size of kernel_crc32() by a lot too.
- The next person to touch the UseNeon code could factor that out as well 😊

This obviously would apply to kernel_crc32c as well.

Thanks!
 - Derek

> -----Original Message-----
> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-
> [hidden email]] On Behalf Of Dmitry Chuyko
> Sent: Wednesday, October 11, 2017 12:31 PM
> To: [hidden email]
> Subject: [10] RFR: 8189176 - AARCH64: Improve _updateBytesCRC32 intrinsic
>
> Hello,
>
> Please review an improvement of CRC32 calculation on AArch64.
>
> MacroAssembler::kernel_crc32 gets table registers that are not used on
> -XX:+UseCRC32 path. They can be used to make neighbor loads and CRC
> calculations independent. Adding prologue and epilogue for main by-64 loop
> makes it applicable starting from len=128 so additional by-32 loop is added
> for smaller lengths.
>
> rfe: https://bugs.openjdk.java.net/browse/JDK-8189176
> webrev: http://cr.openjdk.java.net/~dchuyko/8189176/webrev.00/
> benchmark:
> http://cr.openjdk.java.net/~dchuyko/8189176/crc32/CRC32Bench.java
>
> Results for T88 and A53 are good, but splitting pair loads may slow down
> other CPUs so measurements on different HW are highly welcome.
>
> -Dmitry

Reply | Threaded
Open this post in threaded view
|

Re: [10] RFR: 8189176 - AARCH64: Improve _updateBytesCRC32 intrinsic

Andrew Haley
In reply to this post by Dmitry Chuyko-2
On 11/10/17 17:30, Dmitry Chuyko wrote:
> Results for T88 and A53 are good, but splitting pair loads may slow down
> other CPUs so measurements on different HW are highly welcome.

Ah, yes.  OK, so I should do some measurements here.  Please remind me
offlist if I don't respond in a few days.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Reply | Threaded
Open this post in threaded view
|

Re: [10] RFR: 8189176 - AARCH64: Improve _updateBytesCRC32 intrinsic

Andrew Haley
In reply to this post by White, Derek
On 27/10/17 23:31, White, Derek wrote:
>  - The use of temp registers in the UseCRC32 case is kind of muddled, using tmp, and table0..table3 as temp registers, and the name "table" is confusing in this case.
>  - Maybe it would be cleaner to refactor the UseCRC32 code into a separate kernel_crc32_using_crc32() subroutine (static or macro?). This would accept the main args and 4 registers for temps. The caller can supply some combination of table or tmp registers.
> - This would shrink the size of kernel_crc32() by a lot too.

That would be nice.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Reply | Threaded
Open this post in threaded view
|

Re: [10] RFR: 8189176 - AARCH64: Improve _updateBytesCRC32 intrinsic

Dmitry Chuyko-2
As per Derek's suggestion I made private
MacroAssembler::kernel_crc32_using_crc32(). Register naming became more
clear, numeric results are the same.

webrev: http://cr.openjdk.java.net/~dchuyko/8189176/webrev.01/

-Dmitry


On 10/28/2017 10:52 AM, Andrew Haley wrote:
> On 27/10/17 23:31, White, Derek wrote:
>>   - The use of temp registers in the UseCRC32 case is kind of muddled, using tmp, and table0..table3 as temp registers, and the name "table" is confusing in this case.
>>   - Maybe it would be cleaner to refactor the UseCRC32 code into a separate kernel_crc32_using_crc32() subroutine (static or macro?). This would accept the main args and 4 registers for temps. The caller can supply some combination of table or tmp registers.
>> - This would shrink the size of kernel_crc32() by a lot too.
> That would be nice.
>

Reply | Threaded
Open this post in threaded view
|

RE: [10] RFR: 8189176 - AARCH64: Improve _updateBytesCRC32 intrinsic

White, Derek
Hi Dmitry,

That looks good to me.

Thanks!

 - Derek


> -----Original Message-----
> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-
> [hidden email]] On Behalf Of Dmitry Chuyko
> Sent: Wednesday, November 01, 2017 12:25 PM
> To: [hidden email]
> Subject: Re: [10] RFR: 8189176 - AARCH64: Improve _updateBytesCRC32
> intrinsic
>
> As per Derek's suggestion I made private
> MacroAssembler::kernel_crc32_using_crc32(). Register naming became more
> clear, numeric results are the same.
>
> webrev: http://cr.openjdk.java.net/~dchuyko/8189176/webrev.01/
>
> -Dmitry
>
>
> On 10/28/2017 10:52 AM, Andrew Haley wrote:
> > On 27/10/17 23:31, White, Derek wrote:
> >>   - The use of temp registers in the UseCRC32 case is kind of muddled,
> using tmp, and table0..table3 as temp registers, and the name "table" is
> confusing in this case.
> >>   - Maybe it would be cleaner to refactor the UseCRC32 code into a
> separate kernel_crc32_using_crc32() subroutine (static or macro?). This
> would accept the main args and 4 registers for temps. The caller can supply
> some combination of table or tmp registers.
> >> - This would shrink the size of kernel_crc32() by a lot too.
> > That would be nice.
> >

Reply | Threaded
Open this post in threaded view
|

Re: [10] RFR: 8189176 - AARCH64: Improve _updateBytesCRC32 intrinsic

Andrew Haley
In reply to this post by Dmitry Chuyko-2
On 01/11/17 16:24, Dmitry Chuyko wrote:
> As per Derek's suggestion I made private
> MacroAssembler::kernel_crc32_using_crc32(). Register naming became more
> clear, numeric results are the same.
>
> webrev: http://cr.openjdk.java.net/~dchuyko/8189176/webrev.01/

It looks fine.  One small thing to tidy up while you're working on it:
ORNW should be MOVN.

Thanks.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Reply | Threaded
Open this post in threaded view
|

Re: [10] RFR: 8189176 - AARCH64: Improve _updateBytesCRC32 intrinsic

Dmitry Chuyko-2
On 11/02/2017 12:33 PM, Andrew Haley wrote:
> On 01/11/17 16:24, Dmitry Chuyko wrote:
>> As per Derek's suggestion I made private
>> MacroAssembler::kernel_crc32_using_crc32(). Register naming became more
>> clear, numeric results are the same.
>>
>> webrev: http://cr.openjdk.java.net/~dchuyko/8189176/webrev.01/
> It looks fine.  One small thing to tidy up while you're working on it:
> ORNW should be MOVN.
MVN I suppose. It hasn't been introduced in assembler file, now
implemented as ORN[W] using zero register, successfully decoded back by
hsdis. Other usages are also updated.

http://cr.openjdk.java.net/~dchuyko/8189176/webrev.02/

-Dmitry
>
> Thanks.
>

Reply | Threaded
Open this post in threaded view
|

Re: [10] RFR: 8189176 - AARCH64: Improve _updateBytesCRC32 intrinsic

Andrew Haley
On 02/11/17 14:35, Dmitry Chuyko wrote:
> MVN I suppose. It hasn't been introduced in assembler file, now
> implemented as ORN[W] using zero register, successfully decoded back by
> hsdis. Other usages are also updated.
>
> http://cr.openjdk.java.net/~dchuyko/8189176/webrev.02/

Thanks for doing this.  It's not a big thing, but a worthy cleanup.
OK.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Reply | Threaded
Open this post in threaded view
|

RE: [10] RFR: 8189176 - AARCH64: Improve _updateBytesCRC32 intrinsic

White, Derek
Looks good to me.

 - Derek

> -----Original Message-----
> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-
> [hidden email]] On Behalf Of Andrew Haley
> Sent: Thursday, November 02, 2017 11:04 AM
> To: Dmitry Chuyko <[hidden email]>; hotspot-compiler-
> [hidden email]
> Subject: Re: [10] RFR: 8189176 - AARCH64: Improve _updateBytesCRC32
> intrinsic
>
> On 02/11/17 14:35, Dmitry Chuyko wrote:
> > MVN I suppose. It hasn't been introduced in assembler file, now
> > implemented as ORN[W] using zero register, successfully decoded back
> > by hsdis. Other usages are also updated.
> >
> > http://cr.openjdk.java.net/~dchuyko/8189176/webrev.02/
>
> Thanks for doing this.  It's not a big thing, but a worthy cleanup.
> OK.
>
> --
> Andrew Haley
> Java Platform Lead Engineer
> Red Hat UK Ltd. <https://www.redhat.com>
> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671