Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v6]

Andrew Haley-2
On Wed, 7 Apr 2021 05:51:02 GMT, Dong Bo <[hidden email]> wrote:

>> JDK-8248188 added an IntrinsicCandidate and API for Base64 decoding.
>> Base64 decoding can be improved on AArch64 with ld4/tbl/tbx/st3; the basic idea is described at http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
>>
>> The patch passed jtreg tier1-3 tests with a linux-aarch64-server-fastdebug build.
>> The tests in `test/jdk/java/util/Base64/` and `compiler/intrinsics/base64/TestBase64.java` were also run specifically to check the correctness of the implementation.
>>
>> There can be illegal characters at the start of the input if the data is MIME encoded.
>> There is no benefit in using SIMD for this case, so the stub currently uses non-SIMD instructions for MIME-encoded data.
>>
>> A JMH micro-benchmark, Base64Decode.java, is added for performance testing.
>> With different input lengths (upper-bounded by the parameter `maxNumBytes` in the JMH micro-benchmark),
>> we see ~2.5x improvement with long inputs and no regression with short inputs for raw Base64 decoding, and minor improvements (~10.95%) for MIME on Kunpeng916.
>>
>> The Base64Decode.java JMH micro-benchmark results:
>>
>> Benchmark                          (lineSize)  (maxNumBytes)  Mode  Cnt       Score       Error  Units
>>
>> # Kunpeng916, intrinsic
>> Base64Decode.testBase64Decode               4              1  avgt    5      48.614 ±     0.609  ns/op
>> Base64Decode.testBase64Decode               4              3  avgt    5      58.199 ±     1.650  ns/op
>> Base64Decode.testBase64Decode               4              7  avgt    5      69.400 ±     0.931  ns/op
>> Base64Decode.testBase64Decode               4             32  avgt    5      96.818 ±     1.687  ns/op
>> Base64Decode.testBase64Decode               4             64  avgt    5     122.856 ±     9.217  ns/op
>> Base64Decode.testBase64Decode               4             80  avgt    5     130.935 ±     1.667  ns/op
>> Base64Decode.testBase64Decode               4             96  avgt    5     143.627 ±     1.751  ns/op
>> Base64Decode.testBase64Decode               4            112  avgt    5     152.311 ±     1.178  ns/op
>> Base64Decode.testBase64Decode               4            512  avgt    5     342.631 ±     0.584  ns/op
>> Base64Decode.testBase64Decode               4           1000  avgt    5     573.635 ±     1.050  ns/op
>> Base64Decode.testBase64Decode               4          20000  avgt    5    9534.136 ±    45.172  ns/op
>> Base64Decode.testBase64Decode               4          50000  avgt    5   22718.726 ±   192.070  ns/op
>> Base64Decode.testBase64MIMEDecode           4              1  avgt   10      63.558 ±     0.336  ns/op
>> Base64Decode.testBase64MIMEDecode           4              3  avgt   10      82.504 ±     0.848  ns/op
>> Base64Decode.testBase64MIMEDecode           4              7  avgt   10     120.591 ±     0.608  ns/op
>> Base64Decode.testBase64MIMEDecode           4             32  avgt   10     324.314 ±     6.236  ns/op
>> Base64Decode.testBase64MIMEDecode           4             64  avgt   10     532.678 ±     4.670  ns/op
>> Base64Decode.testBase64MIMEDecode           4             80  avgt   10     678.126 ±     4.324  ns/op
>> Base64Decode.testBase64MIMEDecode           4             96  avgt   10     771.603 ±     6.393  ns/op
>> Base64Decode.testBase64MIMEDecode           4            112  avgt   10     889.608 ±     0.759  ns/op
>> Base64Decode.testBase64MIMEDecode           4            512  avgt   10    3663.557 ±     3.422  ns/op
>> Base64Decode.testBase64MIMEDecode           4           1000  avgt   10    7017.784 ±     9.128  ns/op
>> Base64Decode.testBase64MIMEDecode           4          20000  avgt   10  128670.660 ±  7951.521  ns/op
>> Base64Decode.testBase64MIMEDecode           4          50000  avgt   10  317113.667 ±   161.758  ns/op
>>
>> # Kunpeng916, default
>> Base64Decode.testBase64Decode               4              1  avgt    5      48.455 ±     0.571  ns/op
>> Base64Decode.testBase64Decode               4              3  avgt    5      57.937 ±     0.505  ns/op
>> Base64Decode.testBase64Decode               4              7  avgt    5      73.823 ±     1.452  ns/op
>> Base64Decode.testBase64Decode               4             32  avgt    5     106.484 ±     1.243  ns/op
>> Base64Decode.testBase64Decode               4             64  avgt    5     141.004 ±     1.188  ns/op
>> Base64Decode.testBase64Decode               4             80  avgt    5     156.284 ±     0.572  ns/op
>> Base64Decode.testBase64Decode               4             96  avgt    5     174.137 ±     0.177  ns/op
>> Base64Decode.testBase64Decode               4            112  avgt    5     188.445 ±     0.572  ns/op
>> Base64Decode.testBase64Decode               4            512  avgt    5     610.847 ±     1.559  ns/op
>> Base64Decode.testBase64Decode               4           1000  avgt    5    1155.368 ±     0.813  ns/op
>> Base64Decode.testBase64Decode               4          20000  avgt    5   19751.477 ±    24.669  ns/op
>> Base64Decode.testBase64Decode               4          50000  avgt    5   50046.586 ±   523.155  ns/op
>> Base64Decode.testBase64MIMEDecode           4              1  avgt   10      64.130 ±     0.238  ns/op
>> Base64Decode.testBase64MIMEDecode           4              3  avgt   10      82.096 ±     0.205  ns/op
>> Base64Decode.testBase64MIMEDecode           4              7  avgt   10     118.849 ±     0.610  ns/op
>> Base64Decode.testBase64MIMEDecode           4             32  avgt   10     331.177 ±     4.732  ns/op
>> Base64Decode.testBase64MIMEDecode           4             64  avgt   10     549.117 ±     0.177  ns/op
>> Base64Decode.testBase64MIMEDecode           4             80  avgt   10     702.951 ±     4.572  ns/op
>> Base64Decode.testBase64MIMEDecode           4             96  avgt   10     799.566 ±     0.301  ns/op
>> Base64Decode.testBase64MIMEDecode           4            112  avgt   10     923.749 ±     0.389  ns/op
>> Base64Decode.testBase64MIMEDecode           4            512  avgt   10    4000.725 ±     2.519  ns/op
>> Base64Decode.testBase64MIMEDecode           4           1000  avgt   10    7674.994 ±     9.281  ns/op
>> Base64Decode.testBase64MIMEDecode           4          20000  avgt   10  142059.001 ±   157.920  ns/op
>> Base64Decode.testBase64MIMEDecode           4          50000  avgt   10  355698.369 ±   216.542  ns/op
>
> Dong Bo has updated the pull request incrementally with one additional commit since the last revision:
>
>   fix misleading annotations
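
For context on the MIME case mentioned in the quoted description, here is a minimal Java sketch of how the two `java.util.Base64` decoders treat characters outside the Base64 alphabet. It uses only the public API; the class and method names (`MimeVsBasicDecode`, `basicAccepts`, `mimeDecode`) are made up for illustration and are not part of the patch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical illustration class; not taken from the patch under review.
class MimeVsBasicDecode {
    // Returns true if the basic (RFC 4648) decoder accepts the input as-is.
    static boolean basicAccepts(String s) {
        try {
            Base64.getDecoder().decode(s);
            return true;
        } catch (IllegalArgumentException e) {
            // The basic decoder rejects any character outside the alphabet.
            return false;
        }
    }

    // The MIME decoder skips line separators and other non-alphabet characters.
    static String mimeDecode(String s) {
        return new String(Base64.getMimeDecoder().decode(s), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "hello" encoded, with a line separator at the very start of the input.
        String input = "\r\naGVsbG8=";
        System.out.println("basic decoder accepts input: " + basicAccepts(input));
        System.out.println("MIME decoder result: " + mimeDecode(input));
    }
}
```

Because the MIME decoder has to skip such characters wherever they occur, a fixed-width vector load cannot assume that every lane holds a valid Base64 character, which is presumably why the stub falls back to scalar code for MIME input.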

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5829:

> 5827:     __ strb(r14, __ post(dst, 1));
> 5828:     __ strb(r15, __ post(dst, 1));
> 5829:     __ strb(r13, __ post(dst, 1));

I think this sequence should be 4 BFMs, STRW, BFM, STRW. That's the best we can do, I think.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3228

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v6]

Andrew Haley-2
On Wed, 7 Apr 2021 09:50:45 GMT, Andrew Haley <[hidden email]> wrote:

>> Dong Bo has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   fix misleading annotations
>
> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5829:
>
>> 5827:     __ strb(r14, __ post(dst, 1));
>> 5828:     __ strb(r15, __ post(dst, 1));
>> 5829:     __ strb(r13, __ post(dst, 1));
>
> I think this sequence should be 4 BFMs, STRW, BFM, STRW. That's the best we can do, I think.

Sorry, that's not quite right, but you get the idea: let's not generate unnecessary memory traffic.
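
The general idea of trading byte stores for wider stores can be sketched outside the stub. The following Java example is an illustration only, not the AArch64 sequence under discussion: plain shifts and ORs stand in for bit-field moves, `ByteBuffer.putInt` stands in for a 32-bit store, and all names (`PackedStoreSketch`, `pack24`, `decodeBytewise`, `decodeWordwise`) are hypothetical. It assumes exactly 16 valid Base64 characters with no padding.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not the code from stubGenerator_aarch64.cpp.
class PackedStoreSketch {
    static final String ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Map Base64 characters to their 6-bit values (no padding or error handling).
    static int[] sextets(String s) {
        int[] v = new int[s.length()];
        for (int i = 0; i < v.length; i++) {
            v[i] = ALPHABET.indexOf(s.charAt(i));
        }
        return v;
    }

    // Merge four 6-bit values into one 24-bit value without touching memory.
    static int pack24(int[] v, int i) {
        return (v[i] << 18) | (v[i + 1] << 12) | (v[i + 2] << 6) | v[i + 3];
    }

    // Variant 1: every output byte gets its own store (12 stores for 12 bytes).
    static byte[] decodeBytewise(int[] v) {
        byte[] dst = new byte[12];
        for (int g = 0; g < 4; g++) {
            int t = pack24(v, 4 * g);
            dst[3 * g]     = (byte) (t >>> 16);
            dst[3 * g + 1] = (byte) (t >>> 8);
            dst[3 * g + 2] = (byte) t;
        }
        return dst;
    }

    // Variant 2: combine four 24-bit groups into three 32-bit words first,
    // then issue three word-sized stores.
    static byte[] decodeWordwise(int[] v) {
        int a = pack24(v, 0), b = pack24(v, 4), c = pack24(v, 8), d = pack24(v, 12);
        ByteBuffer dst = ByteBuffer.allocate(12).order(ByteOrder.BIG_ENDIAN);
        dst.putInt((a << 8) | (b >>> 16));   // bytes a2 a1 a0 b2
        dst.putInt((b << 16) | (c >>> 8));   // bytes b1 b0 c2 c1
        dst.putInt((c << 24) | d);           // bytes c0 d2 d1 d0
        return dst.array();
    }

    public static void main(String[] args) {
        int[] v = sextets("aGVsbG8gd29ybGQh");
        System.out.println(new String(decodeWordwise(v), StandardCharsets.US_ASCII));
    }
}
```

Both variants produce the same bytes; the second simply does more work in registers before writing to memory. Whether that pays off in the actual stub depends on the instruction sequence and the microarchitecture.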
