[10] RFR (S): 8189177 - AARCH64: Improve _updateBytesCRC32C intrinsic

Dmitry Chuyko-2
Hello,

Please review an improvement to the CRC32C calculation on AArch64. It is
done much like the change for JDK-8189176 described in [1].

MacroAssembler::kernel_crc32c receives table registers that it does not
use. They can be used to make neighboring loads and CRC calculations
independent. Adding a prologue and an epilogue to the main by-64 loop
makes it applicable starting from len=128, so an additional by-32 loop
is added for smaller lengths.
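
The loop arrangement described above can be sketched in plain Java. This
is only an illustration of the prologue / by-64 loop / epilogue / by-32
loop / byte tail structure, not the code in the webrev. The class name
Crc32cPipelineSketch and the helpers crcByte, crcWord and matchesJdk are
hypothetical, and crcWord is a bitwise software stand-in for an 8-byte
hardware CRC32C step.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.CRC32C;

final class Crc32cPipelineSketch {
    // Reflected CRC-32C (Castagnoli) polynomial.
    private static final int POLY = 0x82F63B78;

    // Software stand-in for a one-byte CRC32C step.
    static int crcByte(int crc, int b) {
        crc ^= (b & 0xFF);
        for (int i = 0; i < 8; i++) {
            crc = (crc >>> 1) ^ (POLY & -(crc & 1));
        }
        return crc;
    }

    // Software stand-in for an 8-byte CRC32C step (little-endian word).
    static int crcWord(int crc, long w) {
        for (int i = 0; i < 8; i++) {
            crc = crcByte(crc, (int) (w >>> (8 * i)));
        }
        return crc;
    }

    static int crc32c(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        int crc = 0xFFFFFFFF;
        int pos = 0;
        int len = data.length;

        if (len >= 128) {
            // Prologue: load the first 64 bytes without computing anything.
            long[] cur = new long[8];
            for (int i = 0; i < 8; i++) cur[i] = buf.getLong(pos + 8 * i);
            pos += 64;
            len -= 64;
            // Main by-64 loop: each load targets a separate temporary, so it
            // does not depend on the CRC step that consumes the previous word.
            while (len >= 64) {
                for (int i = 0; i < 8; i++) {
                    long next = buf.getLong(pos + 8 * i);
                    crc = crcWord(crc, cur[i]);
                    cur[i] = next;
                }
                pos += 64;
                len -= 64;
            }
            // Epilogue: consume the last block that was loaded.
            for (int i = 0; i < 8; i++) crc = crcWord(crc, cur[i]);
        }
        // By-32 loop for what the pipelined loop cannot take.
        while (len >= 32) {
            for (int i = 0; i < 4; i++) crc = crcWord(crc, buf.getLong(pos + 8 * i));
            pos += 32;
            len -= 32;
        }
        // Remaining bytes, one at a time.
        while (len > 0) {
            crc = crcByte(crc, data[pos++]);
            len--;
        }
        return ~crc;
    }

    // Compares the sketch with java.util.zip.CRC32C (JDK 9 or later).
    static boolean matchesJdk(byte[] data) {
        CRC32C ref = new CRC32C();
        ref.update(data, 0, data.length);
        return ref.getValue() == (crc32c(data) & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        boolean ok = true;
        for (int n = 0; n <= 300; n++) {
            byte[] data = new byte[n];
            for (int i = 0; i < n; i++) data[i] = (byte) (i * 31 + 7);
            ok &= matchesJdk(data);
        }
        System.out.println("all lengths match: " + ok);
    }
}
```

The prologue and epilogue together account for one 64-byte block and the
loop body needs at least one more, which is why this arrangement only
applies from 128 bytes upward.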

rfe: https://bugs.openjdk.java.net/browse/JDK-8189177
webrev: http://cr.openjdk.java.net/~dchuyko/8189177/webrev.00/
benchmark:
http://cr.openjdk.java.net/~dchuyko/8189177/crc32c/CRC32CBench.java

Results for T88 and A53 [2] are similar to those for the CRC32 change
(good), but again, splitting pair loads may slow down other CPUs, so
measurements on different hardware are welcome.
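
For a quick look at other hardware, a rough probe along these lines could
time java.util.zip.CRC32C around the 32, 64 and 128 byte boundaries. It
is not the linked CRC32CBench.java and is no substitute for a proper
benchmark harness. The class name Crc32cLengthProbe, the chosen lengths
and the iteration counts are all assumptions made for illustration.

```java
import java.util.zip.CRC32C;

final class Crc32cLengthProbe {
    // Checksums the first len bytes through the public CRC32C API.
    static long checksum(byte[] data, int len) {
        CRC32C crc = new CRC32C();
        crc.update(data, 0, len);
        return crc.getValue();
    }

    // Average wall-clock nanoseconds per checksum call (coarse, not statistically rigorous).
    static double nanosPerCall(byte[] data, int len, int iterations) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += checksum(data, len);
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.println(); // keeps the results observable
        return (double) elapsed / iterations;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1024];
        for (int i = 0; i < data.length; i++) data[i] = (byte) (i * 31);
        int[] lengths = {16, 31, 32, 63, 64, 127, 128, 256, 1024};
        for (int len : lengths) {
            nanosPerCall(data, len, 200_000); // warm-up so the JIT can compile the path
            System.out.printf("len=%4d  %.1f ns/call%n", len, nanosPerCall(data, len, 1_000_000));
        }
    }
}
```

Numbers from a loop like this are only indicative, so any surprising
result should be confirmed with the real benchmark.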

-Dmitry

[1]
http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2017-October/027225.html
[2]
https://bugs.openjdk.java.net/browse/JDK-8189177?focusedCommentId=14124535&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14124535

Re: [10] RFR (S): 8189177 - AARCH64: Improve _updateBytesCRC32C intrinsic

Dmitry Chuyko-2
As with CRC32, I added a private
MacroAssembler::kernel_crc32c_using_crc32c().

webrev: http://cr.openjdk.java.net/~dchuyko/8189177/webrev.01/

-Dmitry


On 10/20/2017 08:45 PM, Dmitry Chuyko wrote:

> Hello,
>
> Please review an improvement of CRC32C calculation on AArch64. It is
> done pretty similar to a change for JDK-8189176 described in [1].
>
> MacroAssembler::kernel_crc32c gets unused table registers. They can be
> used to make neighbor loads and CRC calculations independent. Adding
> prologue and epilogue for main by-64 loop makes it applicable starting
> from len=128 so additional by-32 loop is added for smaller lengths.
>
> rfe: https://bugs.openjdk.java.net/browse/JDK-8189177
> webrev: http://cr.openjdk.java.net/~dchuyko/8189177/webrev.00/
> benchmark:
> http://cr.openjdk.java.net/~dchuyko/8189177/crc32c/CRC32CBench.java
>
> Results for T88 and A53 [2] are similar to CRC32 change (good), but
> again splitting pair loads may slow down other CPUs so measurements on
> different HW are welcome.
>
> -Dmitry
>
> [1]
> http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2017-October/027225.html
> [2]
> https://bugs.openjdk.java.net/browse/JDK-8189177?focusedCommentId=14124535&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14124535
>

RE: [10] RFR (S): 8189177 - AARCH64: Improve _updateBytesCRC32C intrinsic

White, Derek
Hi Dmitry,

This looks good!

Thanks,

 - Derek

> -----Original Message-----
> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-
> [hidden email]] On Behalf Of Dmitry Chuyko
> Sent: Thursday, November 02, 2017 5:07 PM
> To: [hidden email]
> Subject: Re: [10] RFR (S): 8189177 - AARCH64: Improve _updateBytesCRC32C
> intrinsic
>
> Similar to CRC32 I added private
> MacroAssembler::kernel_crc32c_using_crc32c().
>
> webrev: http://cr.openjdk.java.net/~dchuyko/8189177/webrev.01/
>
> -Dmitry
>
>
> On 10/20/2017 08:45 PM, Dmitry Chuyko wrote:
> > Hello,
> >
> > Please review an improvement of CRC32C calculation on AArch64. It is
> > done pretty similar to a change for JDK-8189176 described in [1].
> >
> > MacroAssembler::kernel_crc32c gets unused table registers. They can be
> > used to make neighbor loads and CRC calculations independent. Adding
> > prologue and epilogue for main by-64 loop makes it applicable starting
> > from len=128 so additional by-32 loop is added for smaller lengths.
> >
> > rfe: https://bugs.openjdk.java.net/browse/JDK-8189177
> > webrev: http://cr.openjdk.java.net/~dchuyko/8189177/webrev.00/
> > benchmark:
> > http://cr.openjdk.java.net/~dchuyko/8189177/crc32c/CRC32CBench.java
> >
> > Results for T88 and A53 [2] are similar to CRC32 change (good), but
> > again splitting pair loads may slow down other CPUs so measurements on
> > different HW are welcome.
> >
> > -Dmitry
> >
> > [1]
> > http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2017-October/027225.html
> > [2]
> > https://bugs.openjdk.java.net/browse/JDK-8189177?focusedCommentId=14124535&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14124535
> >

Re: [10] RFR (S): 8189177 - AARCH64: Improve _updateBytesCRC32C intrinsic

Dmitry Samersoff-3
Dmitry,

Looks good to me.

-Dmitry


On 11/08/2017 01:34 AM, White, Derek wrote:

> Hi Dmitry,
>
> This looks good!
>
> Thanks,
>
>  - Derek
>
>> -----Original Message-----
>> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-
>> [hidden email]] On Behalf Of Dmitry Chuyko
>> Sent: Thursday, November 02, 2017 5:07 PM
>> To: [hidden email]
>> Subject: Re: [10] RFR (S): 8189177 - AARCH64: Improve _updateBytesCRC32C
>> intrinsic
>>
>> Similar to CRC32 I added private
>> MacroAssembler::kernel_crc32c_using_crc32c().
>>
>> webrev: http://cr.openjdk.java.net/~dchuyko/8189177/webrev.01/
>>
>> -Dmitry
>>
>>
>> On 10/20/2017 08:45 PM, Dmitry Chuyko wrote:
>>> Hello,
>>>
>>> Please review an improvement of CRC32C calculation on AArch64. It is
>>> done pretty similar to a change for JDK-8189176 described in [1].
>>>
>>> MacroAssembler::kernel_crc32c gets unused table registers. They can be
>>> used to make neighbor loads and CRC calculations independent. Adding
>>> prologue and epilogue for main by-64 loop makes it applicable starting
>>> from len=128 so additional by-32 loop is added for smaller lengths.
>>>
>>> rfe: https://bugs.openjdk.java.net/browse/JDK-8189177
>>> webrev: http://cr.openjdk.java.net/~dchuyko/8189177/webrev.00/
>>> benchmark:
>>> http://cr.openjdk.java.net/~dchuyko/8189177/crc32c/CRC32CBench.java
>>>
>>> Results for T88 and A53 [2] are similar to CRC32 change (good), but
>>> again splitting pair loads may slow down other CPUs so measurements on
>>> different HW are welcome.
>>>
>>> -Dmitry
>>>
>>> [1]
>>> http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2017-October/027225.html
>>> [2]
>>> https://bugs.openjdk.java.net/browse/JDK-8189177?focusedCommentId=14124535&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14124535
>>>
>