RFR: 8256245: AArch64: Implement Base64 decoding intrinsic


Dong Bo
In JDK-8248188, an IntrinsicCandidate and API were added for Base64 decoding.
Base64 decoding can be improved on AArch64 with ld4/tbl/tbx/st3; the basic idea is described at http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
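
As a rough illustration of the data flow behind the ld4/tbl/tbx/st3 idea, here is a scalar Java sketch; it is not the stub code from the patch. ld4 de-interleaves each group of four input characters into four registers, the table lookups (tbl/tbx) map characters to 6-bit values, and st3 interleaves three output byte streams back into memory. The class name `QuadDecodeSketch` and the `indexOf`-based lookup are assumptions made for this example, and it only handles unpadded input whose length is a multiple of four.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical scalar model of the ld4 -> lookup -> combine -> st3 pipeline.
final class QuadDecodeSketch {
    private static final String ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Decodes unpadded Base64 whose length is a multiple of 4.
    static byte[] decode(byte[] src) {
        if (src.length % 4 != 0) {
            throw new IllegalArgumentException("length must be a multiple of 4");
        }
        int groups = src.length / 4;
        // Like ld4: split the input into four streams, one per character position,
        // and translate each character to its 6-bit value (the tbl/tbx step).
        int[] a = new int[groups], b = new int[groups];
        int[] c = new int[groups], d = new int[groups];
        for (int i = 0; i < groups; i++) {
            a[i] = ALPHABET.indexOf(src[4 * i]);
            b[i] = ALPHABET.indexOf(src[4 * i + 1]);
            c[i] = ALPHABET.indexOf(src[4 * i + 2]);
            d[i] = ALPHABET.indexOf(src[4 * i + 3]);
        }
        byte[] dst = new byte[groups * 3];
        for (int i = 0; i < groups; i++) {
            // indexOf returns -1 for characters outside the alphabet.
            if ((a[i] | b[i] | c[i] | d[i]) < 0) {
                throw new IllegalArgumentException("illegal character in group " + i);
            }
            // Like st3: three output streams are interleaved back into memory.
            dst[3 * i]     = (byte) ((a[i] << 2) | (b[i] >> 4));
            dst[3 * i + 1] = (byte) ((b[i] << 4) | (c[i] >> 2));
            dst[3 * i + 2] = (byte) ((c[i] << 6) | d[i]);
        }
        return dst;
    }

    public static void main(String[] args) {
        byte[] out = decode("SGVsbG8h".getBytes(StandardCharsets.US_ASCII));
        System.out.println(new String(out, StandardCharsets.US_ASCII));
    }
}
```

In the SIMD version each of the four streams would live in one vector register, so the shifts and ORs above apply to many groups at once.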

The patch passed jtreg tier1-3 tests with a linux-aarch64-server-fastdebug build.
The tests in `test/jdk/java/util/Base64/` and `compiler/intrinsics/base64/TestBase64.java` were also run specifically to check the correctness of the implementation.

There can be illegal characters at the start of the input if the data is MIME encoded.
There would be no benefit to using SIMD in this case, so for now the stub uses non-SIMD instructions for MIME-encoded data.
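
For context, the difference is visible at the Java API level. The sketch below uses only the public `java.util.Base64` API to show that the MIME decoder skips characters outside the Base64 alphabet, while the basic decoder rejects them. The class and method names are made up for this example, and nothing here reflects how the stub itself is written.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical demo class; only java.util.Base64 is real API here.
final class MimeDecodeSketch {
    // The MIME decoder ignores line separators and other non-alphabet characters.
    static String decodeMime(String s) {
        return new String(Base64.getMimeDecoder().decode(s), StandardCharsets.US_ASCII);
    }

    // The basic decoder throws IllegalArgumentException on the same input.
    static boolean basicRejects(String s) {
        try {
            Base64.getDecoder().decode(s);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String input = "\r\nSGVs\r\nbG8h"; // illegal characters at the start and in the middle
        System.out.println(decodeMime(input));
        System.out.println(basicRejects(input));
    }
}
```

Because any input position may hold a character that has to be skipped, a fixed-stride vector load cannot assume that four consecutive bytes form one Base64 group.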

A JMH micro-benchmark, Base64Decode.java, is added for performance testing.
With different input lengths (upper-bounded by the parameter `maxNumBytes` in the JMH micro-benchmark),
we see ~2.5x improvements with long inputs and no regression with short inputs for raw Base64 decoding, and minor improvements (~10.95%) for MIME on Kunpeng916.

The Base64Decode.java JMH micro-benchmark results:

Benchmark                          (lineSize)  (maxNumBytes)  Mode  Cnt       Score       Error  Units

# Kunpeng916, intrinsic
Base64Decode.testBase64Decode               4              1  avgt    5      48.614 ±     0.609  ns/op
Base64Decode.testBase64Decode               4              3  avgt    5      58.199 ±     1.650  ns/op
Base64Decode.testBase64Decode               4              7  avgt    5      69.400 ±     0.931  ns/op
Base64Decode.testBase64Decode               4             32  avgt    5      96.818 ±     1.687  ns/op
Base64Decode.testBase64Decode               4             64  avgt    5     122.856 ±     9.217  ns/op
Base64Decode.testBase64Decode               4             80  avgt    5     130.935 ±     1.667  ns/op
Base64Decode.testBase64Decode               4             96  avgt    5     143.627 ±     1.751  ns/op
Base64Decode.testBase64Decode               4            112  avgt    5     152.311 ±     1.178  ns/op
Base64Decode.testBase64Decode               4            512  avgt    5     342.631 ±     0.584  ns/op
Base64Decode.testBase64Decode               4           1000  avgt    5     573.635 ±     1.050  ns/op
Base64Decode.testBase64Decode               4          20000  avgt    5    9534.136 ±    45.172  ns/op
Base64Decode.testBase64Decode               4          50000  avgt    5   22718.726 ±   192.070  ns/op
Base64Decode.testBase64MIMEDecode           4              1  avgt   10      63.558 ±    0.336  ns/op
Base64Decode.testBase64MIMEDecode           4              3  avgt   10      82.504 ±    0.848  ns/op
Base64Decode.testBase64MIMEDecode           4              7  avgt   10     120.591 ±    0.608  ns/op
Base64Decode.testBase64MIMEDecode           4             32  avgt   10     324.314 ±    6.236  ns/op
Base64Decode.testBase64MIMEDecode           4             64  avgt   10     532.678 ±    4.670  ns/op
Base64Decode.testBase64MIMEDecode           4             80  avgt   10     678.126 ±    4.324  ns/op
Base64Decode.testBase64MIMEDecode           4             96  avgt   10     771.603 ±    6.393  ns/op
Base64Decode.testBase64MIMEDecode           4            112  avgt   10     889.608 ±   0.759  ns/op
Base64Decode.testBase64MIMEDecode           4            512  avgt   10    3663.557 ±    3.422  ns/op
Base64Decode.testBase64MIMEDecode           4           1000  avgt   10    7017.784 ±    9.128  ns/op
Base64Decode.testBase64MIMEDecode           4          20000  avgt   10  128670.660 ± 7951.521  ns/op
Base64Decode.testBase64MIMEDecode           4          50000  avgt   10  317113.667 ±  161.758  ns/op

# Kunpeng916, default
Base64Decode.testBase64Decode               4              1  avgt    5      48.455 ±   0.571  ns/op
Base64Decode.testBase64Decode               4              3  avgt    5      57.937 ±   0.505  ns/op
Base64Decode.testBase64Decode               4              7  avgt    5      73.823 ±   1.452  ns/op
Base64Decode.testBase64Decode               4             32  avgt    5     106.484 ±   1.243  ns/op
Base64Decode.testBase64Decode               4             64  avgt    5     141.004 ±   1.188  ns/op
Base64Decode.testBase64Decode               4             80  avgt    5     156.284 ±   0.572  ns/op
Base64Decode.testBase64Decode               4             96  avgt    5     174.137 ±   0.177  ns/op
Base64Decode.testBase64Decode               4            112  avgt    5     188.445 ±   0.572  ns/op
Base64Decode.testBase64Decode               4            512  avgt    5     610.847 ±   1.559  ns/op
Base64Decode.testBase64Decode               4           1000  avgt    5    1155.368 ±   0.813  ns/op
Base64Decode.testBase64Decode               4          20000  avgt    5   19751.477 ±  24.669  ns/op
Base64Decode.testBase64Decode               4          50000  avgt    5   50046.586 ± 523.155  ns/op
Base64Decode.testBase64MIMEDecode           4              1  avgt   10      64.130 ±   0.238  ns/op
Base64Decode.testBase64MIMEDecode           4              3  avgt   10      82.096 ±   0.205  ns/op
Base64Decode.testBase64MIMEDecode           4              7  avgt   10     118.849 ±   0.610  ns/op
Base64Decode.testBase64MIMEDecode           4             32  avgt   10     331.177 ±   4.732  ns/op
Base64Decode.testBase64MIMEDecode           4             64  avgt   10     549.117 ±   0.177  ns/op
Base64Decode.testBase64MIMEDecode           4             80  avgt   10     702.951 ±   4.572  ns/op
Base64Decode.testBase64MIMEDecode           4             96  avgt   10     799.566 ±   0.301  ns/op
Base64Decode.testBase64MIMEDecode           4            112  avgt   10     923.749 ±   0.389  ns/op
Base64Decode.testBase64MIMEDecode           4            512  avgt   10    4000.725 ±   2.519  ns/op
Base64Decode.testBase64MIMEDecode           4           1000  avgt   10    7674.994 ±   9.281  ns/op
Base64Decode.testBase64MIMEDecode           4          20000  avgt   10  142059.001 ± 157.920  ns/op
Base64Decode.testBase64MIMEDecode           4          50000  avgt   10  355698.369 ± 216.542  ns/op
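
To read the two tables against each other, the ratio of the default score to the intrinsic score can be computed per row. The small sketch below does this for the `maxNumBytes = 50000` rows only, with the numbers copied from the tables above; the class and method names are hypothetical. For these two rows it works out to roughly 2.2x for raw decoding and about 10.8% less time for MIME decoding.

```java
// Hypothetical helper for comparing two JMH avgt scores (lower is better).
final class SpeedupSketch {
    // How many times faster the intrinsic run is than the default run.
    static double speedup(double defaultNs, double intrinsicNs) {
        return defaultNs / intrinsicNs;
    }

    // Percentage of time saved relative to the default run.
    static double percentTimeSaved(double defaultNs, double intrinsicNs) {
        return (defaultNs - intrinsicNs) / defaultNs * 100.0;
    }

    public static void main(String[] args) {
        // testBase64Decode, maxNumBytes = 50000
        System.out.printf("raw:  %.2fx%n", speedup(50046.586, 22718.726));
        // testBase64MIMEDecode, maxNumBytes = 50000
        System.out.printf("MIME: %.2f%% less time%n", percentTimeSaved(355698.369, 317113.667));
    }
}
```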

-------------

Commit messages:
 - 8256245: AArch64: Implement Base64 decoding intrinsic

Changes: https://git.openjdk.java.net/jdk/pull/3228/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=3228&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8256245
  Stats: 410 lines in 3 files changed: 410 ins; 0 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3228.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3228/head:pull/3228

PR: https://git.openjdk.java.net/jdk/pull/3228

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

Andrew Haley-2
On Sat, 27 Mar 2021 08:58:03 GMT, Dong Bo <[hidden email]> wrote:

> In JDK-8248188, an IntrinsicCandidate and API were added for Base64 decoding.
> Base64 decoding can be improved on AArch64 with ld4/tbl/tbx/st3; the basic idea is described at http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.

Firstly, I wonder how important this is for most applications. I don't actually know, but let's put that to one side.

There's a lot of unrolling, particularly in the non-SIMD case. Please consider taking out some of the unrolling; I suspect it'd not increase time by very much but would greatly reduce the code cache pollution.


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

Andrew Haley-2
In reply to this post by Dong Bo
On Sat, 27 Mar 2021 08:58:03 GMT, Dong Bo <[hidden email]> wrote:

> In JDK-8248188, an IntrinsicCandidate and API were added for Base64 decoding.
> Base64 decoding can be improved on AArch64 with ld4/tbl/tbx/st3; the basic idea is described at http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5578:

> 5576:   void generate_base64_decode_nosimdround(Register src, Register dst,
> 5577:         Register nosimd_codec, Label &Exit)
> 5578:   {

We'd want enter/leave here so profiling tools can walk the stack.

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5728:

> 5726:
> 5727:     static const uint8_t fromBase64ForNoSIMD[256] = {
> 5728:       255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u,

There seems to be no documentation of these magic tables of constants.
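
As a sketch of the kind of documentation being asked for: a table like this is typically a 256-entry map from input byte to 6-bit Base64 value, with a sentinel for bytes that are not in the alphabet. The quoted row is consistent with 255u being that sentinel (the first 16 entries cover control characters), but that reading is an assumption, and the Java below is a hypothetical reconstruction for the standard alphabet, not the table from the patch.

```java
import java.util.Arrays;

// Hypothetical reconstruction of a byte -> 6-bit value decode table.
final class DecodeTableSketch {
    static final int INVALID = 255; // assumed sentinel for non-Base64 bytes

    static int[] buildTable() {
        String alphabet =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
        int[] table = new int[256];
        Arrays.fill(table, INVALID);
        // Each alphabet character maps to its position, i.e. its 6-bit value.
        for (int i = 0; i < alphabet.length(); i++) {
            table[alphabet.charAt(i)] = i;
        }
        return table;
    }

    public static void main(String[] args) {
        int[] table = buildTable();
        // Print 16 entries per row, in the same layout as the quoted source line.
        for (int row = 0; row < 16; row++) {
            StringBuilder sb = new StringBuilder();
            for (int col = 0; col < 16; col++) {
                sb.append(String.format("%3du, ", table[row * 16 + col]));
            }
            System.out.println(sb.toString().trim());
        }
    }
}
```

Generating the table from the alphabet, or at least stating the rule in a comment next to it, would let a reader check the constants without counting entries by hand.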


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

Andrew Haley-2
In reply to this post by Andrew Haley-2
On Sat, 27 Mar 2021 09:53:37 GMT, Andrew Haley <[hidden email]> wrote:

>> In JDK-8248188, an IntrinsicCandidate and API were added for Base64 decoding.
>> Base64 decoding can be improved on AArch64 with ld4/tbl/tbx/st3; the basic idea is described at http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
>
> Firstly, I wonder how important this is for most applications. I don't actually know, but let's put that to one side.
>
> There's a lot of unrolling, particularly in the non-SIMD case. Please consider taking out some of the unrolling; I suspect it'd not increase time by very much but would greatly reduce the code cache pollution. It's very tempting to unroll everything to make a benchmark run quickly, but we have to take a balanced approach.

Please consider losing the non-SIMD case. It doesn't result in any significant gain.

> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5728:
>
>> 5726:
>> 5727:     static const uint8_t fromBase64ForNoSIMD[256] = {
>> 5728:       255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u, 255u,
>
> There seems to be no documentation of these magic tables of constants.

We're either going to need a proper description of the algorithm here or a permalink to one.


Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

Nick Gasson
In reply to this post by Dong Bo
On Sat, 27 Mar 2021 08:58:03 GMT, Dong Bo <[hidden email]> wrote:

> In JDK-8248188, an IntrinsicCandidate and API were added for Base64 decoding.
> Base64 decoding can be improved on AArch64 with ld4/tbl/tbx/st3; the basic idea is described at http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
> Base64Decode.testBase64MIMEDecode           4            512  avgt   10    4000.725 ±   2.519  ns/op
> Base64Decode.testBase64MIMEDecode           4           1000  avgt   10    7674.994 ±   9.281  ns/op
> Base64Decode.testBase64MIMEDecode           4          20000  avgt   10  142059.001 ± 157.920  ns/op
> Base64Decode.testBase64MIMEDecode           4          50000  avgt   10  355698.369 ± 216.542  ns/op
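
The headline ratios can be recomputed from the rows quoted above. The following is a minimal sketch, not part of the patch; the class and method names are hypothetical, and the inputs are the `maxNumBytes = 50000` scores copied from the "default" and "intrinsic" tables.

```java
final class SpeedupFromTable {
    // Ratio of baseline time to intrinsic time; values above 1.0 mean the intrinsic is faster.
    static double speedup(double baselineNsPerOp, double intrinsicNsPerOp) {
        return baselineNsPerOp / intrinsicNsPerOp;
    }

    // Percentage by which the intrinsic lowers the time per operation.
    static double reductionPercent(double baselineNsPerOp, double intrinsicNsPerOp) {
        return 100.0 * (baselineNsPerOp - intrinsicNsPerOp) / baselineNsPerOp;
    }

    public static void main(String[] args) {
        // Scores copied from the maxNumBytes = 50000 rows ("default" vs "intrinsic").
        System.out.printf("raw:  %.2fx%n", speedup(50046.586, 22718.726));
        System.out.printf("MIME: %.2f%% lower%n", reductionPercent(355698.369, 317113.667));
    }
}
```

For the 50000-byte rows this works out to roughly 2.2x for raw decoding and a little under 11% lower time for MIME decoding.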

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5624:

> 5622:     __ ld4(in0, in1, in2, in3, arrangement, __ post(src, 4 * size));
> 5623:
> 5624:     // we need unsigned saturationg substract, to make sure all input values

"saturating subtract"

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5649:

> 5647:     __ orr(decL3, arrangement, decL3, decH3);
> 5648:
> 5649:     // check iilegal inputs, value larger than 63 (maximum of 6 bits)

"illegal inputs". Are there existing jtreg tests that cover these cases?

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5772:

> 5770:     // The value of index 64 is set to 0, so that we know that we already get the
> 5771:     // decoded data with the 1st lookup.
> 5772:     static const uint8_t fromBase64ForSIMD[128] = {

This table and the one below seem to be identical to first half of the NoSIMD tables. Can't you just use one set of 256-entry tables for both SIMD and non-SIMD algorithms?

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5803:

> 5801:     Register dst   = c_rarg3;  // dest array
> 5802:     Register doff  = c_rarg4;  // position for writing to dest array
> 5803:     Register isURL = c_rarg5;  // Base64 or URL chracter set

"character set"

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5830:

> 5828:
> 5829:     // The 1st character of the input can be illegal if the data is MIME encoded.
> 5830:     // We can not benefits from SIMD for this case. The max line size of MIME

"cannot benefit"

-------------

PR: https://git.openjdk.java.net/jdk/pull/3228

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

Nick Gasson
In reply to this post by Andrew Haley-2
On Sat, 27 Mar 2021 09:53:37 GMT, Andrew Haley <[hidden email]> wrote:

>
> There's a lot of unrolling, particularly in the non-SIMD case. Please consider taking out some of the unrolling; I suspect it'd not increase time by very much but would greatly reduce the code cache pollution. It's very tempting to unroll everything to make a benchmark run quickly, but we have to take a balanced approach.

But there's only ever one of these generated at startup, right? It's not like the string intrinsics that are expanded at every call site.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3228

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

Dong Bo
On Mon, 29 Mar 2021 03:15:57 GMT, Nick Gasson <[hidden email]> wrote:

>> Firstly, I wonder how important this is for most applications. I don't actually know, but let's put that to one side.
>>
>> There's a lot of unrolling, particularly in the non-SIMD case. Please consider taking out some of the unrolling; I suspect it'd not increase time by very much but would greatly reduce the code cache pollution. It's very tempting to unroll everything to make a benchmark run quickly, but we have to take a balanced approach.
>
> But there's only ever one of these generated at startup, right? It's not like the string intrinsics that are expanded at every call site.

Thanks for the comments.

> Firstly, I wonder how important this is for most applications. I don't actually know, but let's put that to one side.
>

As stated in JEP 135, Base64 is frequently used to encode binary/octet sequences that are transmitted as textual data.
It is commonly used by applications that use Multipurpose Internet Mail Extensions (MIME), for encoding passwords in HTTP headers, for message digests, etc.
 
> There's a lot of unrolling, particularly in the non-SIMD case. Please consider taking out some of the unrolling; I suspect it'd not increase time by very much but would greatly reduce the code cache pollution. It's very tempting to unroll everything to make a benchmark run quickly, but we have to take a balanced approach.
>

There is no code unrolling in the non-SIMD case. The instructions just load, process, and store data within loops.
About half of the code size is the error handling in the SIMD case:
    // handle illegal input
    if (size == 16) {
      Label ErrorInLowerHalf;
      __ umov(rscratch1, in2, __ D, 0);
      __ cbnz(rscratch1, ErrorInLowerHalf);

      // illegal input is in higher half, store the lower half now.
      __ st3(out0, out1, out2, __ T8B, __ post(dst, 24));

      for (int i = 8; i < 15; i++) {
        __ umov(rscratch2, in2, __ B, (u1) i);
        __ cbnz(rscratch2, Exit);
        __ umov(r10, out0, __ B, (u1) i);
        __ umov(r11, out1, __ B, (u1) i);
        __ umov(r12, out2, __ B, (u1) i);
        __ strb(r10, __ post(dst, 1));
        __ strb(r11, __ post(dst, 1));
        __ strb(r12, __ post(dst, 1));
      }
      __ b(Exit);
I think I can rewrite this part as loops.
With an initial implementation, the code size is reduced by almost half (1312B -> 748B). Sounds OK to you?

> Please consider losing the non-SIMD case. It doesn't result in any significant gain.
>

The non-SIMD case is useful for MIME decoding performance.
MIME Base64-encoded data is arranged in lines (the line size can be set by the user, with a maximum of 76 bytes).
Newline characters, e.g. `\r\n`, are illegal Base64 characters but are ignored by MIME decoding.
The SIMD case works as `load data -> two vector table lookups -> combining -> error detection -> store data`.
When SIMD is used for MIME decoding, the 1st byte of the input may be a newline character.
The SIMD case then executes a lot of wasted code before it can detect the error and exit, whereas the non-SIMD case needs only a few ldrs, orrs, and strs for error detection.
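
The difference in how the two decoders treat line separators can be seen from the Java side. The following is a small illustration using only the public `java.util.Base64` API; the class and helper names are hypothetical and are not part of the patch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class MimeNewlineDemo {
    // Decode with the MIME decoder, which skips characters outside the Base64 alphabet.
    static String decodeMime(String encoded) {
        return new String(Base64.getMimeDecoder().decode(encoded), StandardCharsets.US_ASCII);
    }

    // Returns true if the basic (non-MIME) decoder rejects the input.
    static boolean basicRejects(String encoded) {
        try {
            Base64.getDecoder().decode(encoded);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String wrapped = "aGVs\r\nbG8=";  // "hello", split across two lines
        System.out.println(decodeMime(wrapped));    // newline is skipped
        System.out.println(basicRejects(wrapped));  // newline is an illegal character here
    }
}
```

Because the MIME decoder has to step over such characters, a decode stub that is handed MIME data can hit an illegal byte at the very start of a block, which is the situation described above.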

-------------

PR: https://git.openjdk.java.net/jdk/pull/3228

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

Dong Bo
In reply to this post by Nick Gasson
On Mon, 29 Mar 2021 03:15:57 GMT, Nick Gasson <[hidden email]> wrote:

>> Firstly, I wonder how important this is for most applications. I don't actually know, but let's put that to one side.
>>
>> There's a lot of unrolling, particularly in the non-SIMD case. Please consider taking out some of the unrolling; I suspect it'd not increase time by very much but would greatly reduce the code cache pollution. It's very tempting to unroll everything to make a benchmark run quickly, but we have to take a balanced approach.
>
> But there's only ever one of these generated at startup, right? It's not like the string intrinsics that are expanded at every call site.

@nick-arm Thank you for watching this.

> That probably ought to go around the whole routine in generate_base64_decodeBlock rather than here?
>

There are two non-SIMD blocks in this intrinsic.
The 1st is at the beginning, mainly to route MIME decoding to non-SIMD processing because of the performance issue described earlier.
The 2nd is at the end, to handle trailing input. So I guess we need generate_base64_decode_nosimdround here.
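
As a rough scalar model of what one non-SIMD round does (four loads, a combined error check, three stores), the following Java sketch may help. It is not the stub code: the class name, method name, and table construction are assumptions for illustration, with 255 marking illegal characters as in the non-SIMD tables discussed in this thread. Padding (`=`) is not handled here and is treated as illegal.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ScalarRoundSketch {
    static final int ILLEGAL = 255;
    // Hypothetical 256-entry decode table: 255 for every character outside the alphabet.
    static final int[] TABLE = new int[256];
    static {
        Arrays.fill(TABLE, ILLEGAL);
        String alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
        for (int v = 0; v < 64; v++) {
            TABLE[alphabet.charAt(v)] = v;
        }
    }

    // Decodes 4 characters at src[sp..sp+3] into 3 bytes at dst[dp..dp+2].
    // Returns false, writing nothing, if any of the 4 characters is illegal.
    static boolean decodeRound(byte[] src, int sp, byte[] dst, int dp) {
        int b0 = TABLE[src[sp] & 0xff];
        int b1 = TABLE[src[sp + 1] & 0xff];
        int b2 = TABLE[src[sp + 2] & 0xff];
        int b3 = TABLE[src[sp + 3] & 0xff];
        // Legal values fit in 6 bits; 255 sets the top two bits, so one OR covers all four loads.
        if (((b0 | b1 | b2 | b3) & 0xC0) != 0) {
            return false;
        }
        int bits = (b0 << 18) | (b1 << 12) | (b2 << 6) | b3;
        dst[dp] = (byte) (bits >> 16);
        dst[dp + 1] = (byte) (bits >> 8);
        dst[dp + 2] = (byte) bits;
        return true;
    }

    public static void main(String[] args) {
        byte[] out = new byte[3];
        boolean ok = decodeRound("aGVs".getBytes(StandardCharsets.US_ASCII), 0, out, 0);
        System.out.println(ok + " " + new String(out, StandardCharsets.US_ASCII));
    }
}
```

An illegal first byte, such as a MIME newline, is caught after only four table loads, which is why this path is cheap for input that fails early.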

>  "illegal inputs". Are there existing jtreg tests that cover these cases?
>

Yes, they are covered by `test/hotspot/jtreg/compiler/intrinsics/base64/TestBase64.java`.

> This table and the one below seem to be identical to first half of the NoSIMD tables. Can't you just use one set of 256-entry tables for both SIMD and non-SIMD algorithms?
>
They are not identical: `*ForSIMD[64] == 0`, while `*forNoSIMD[64] == 255`.
In the SIMD case, `*ForSIMD[64]` acts as a pivot: when performing the 2nd lookup, it tells us that the decoded data was already obtained by the 1st lookup.
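
To make the role of the pivot concrete, here is a scalar Java model of a two-step lookup of this kind. It is a sketch under stated assumptions, not the stub itself: the class name and the table layout (characters below 64 stored at their own index, entry 64 as the pivot, characters from 64 upward stored at index `c + 1`) are chosen for illustration. The lookups imitate `tbl` (an out-of-range index yields 0) and `tbx` (an out-of-range index leaves the destination unchanged), and the subtraction imitates an unsigned saturating subtract of 63.

```java
import java.util.Arrays;

final class TwoStepLookupSketch {
    static final int ILLEGAL = 255;
    // Assumed layout: [0..63] lower half, [64] pivot, [65..127] upper half shifted by one.
    static final int[] TABLE = buildTable();

    static int[] buildTable() {
        int[] t = new int[128];
        Arrays.fill(t, ILLEGAL);
        String alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
        for (int v = 0; v < 64; v++) {
            int c = alphabet.charAt(v);
            t[c < 64 ? c : c + 1] = v;
        }
        t[64] = 0; // pivot: contributes nothing to the OR when the 1st lookup already hit
        return t;
    }

    // Returns the 6-bit value for input byte c (0..255); any result above 63 means illegal input.
    static int decode(int c) {
        // 1st lookup over the lower half; out-of-range indices yield 0 (tbl-like).
        int low = c < 64 ? TABLE[c] : 0;
        // Saturating subtract: every input in [0, 63] maps to index 0, the pivot.
        int idx = Math.max(c - 63, 0);
        // 2nd lookup over the upper half; out-of-range indices keep idx itself (tbx-like).
        int high = idx < 64 ? TABLE[64 + idx] : idx;
        return low | high;
    }

    public static void main(String[] args) {
        System.out.println(decode('A') + " " + decode('/') + " " + decode('z'));
        System.out.println(decode('\r') > 63); // illegal character
    }
}
```

If the pivot entry were 255, as in the non-SIMD table, the OR would flag every character below 64 as illegal, which is why the two tables cannot simply be shared in this model.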

-------------

PR: https://git.openjdk.java.net/jdk/pull/3228

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

Andrew Haley-2
In reply to this post by Nick Gasson
On Mon, 29 Mar 2021 03:15:57 GMT, Nick Gasson <[hidden email]> wrote:

> > There's a lot of unrolling, particularly in the non-SIMD case. Please consider taking out some of the unrolling; I suspect it'd not increase time by very much but would greatly reduce the code cache pollution. It's very tempting to unroll everything to make a benchmark run quickly, but we have to take a balanced approach.
>
> But there's only ever one of these generated at startup, right? It's not like the string intrinsics that are expanded at every call site.

I'm talking about icache pollution. This stuff could be quite small.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3228

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

Andrew Haley-2
In reply to this post by Dong Bo
On Mon, 29 Mar 2021 03:28:54 GMT, Dong Bo <[hidden email]> wrote:

> I think I can rewrite this part as loops.
> With an initial implementation, the code size is reduced by almost half (1312B -> 748B). Sounds OK to you?

Sounds great, but I'm still somewhat concerned that the non-SIMD case only offers 3-12% performance gain. Make it just 748 bytes, and therefore not icache-hostile, then perhaps the balance of risk and reward is justified.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3228

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v2]

Dong Bo
In reply to this post by Dong Bo
> In JDK-8248188, an IntrinsicCandidate and API were added for Base64 decoding.
> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3; the basic idea is described at http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
>
> The patch passed jtreg tier1-3 tests with a linux-aarch64-server-fastdebug build.
> The tests in `test/jdk/java/util/Base64/` and `compiler/intrinsics/base64/TestBase64.java` were run specifically to check the correctness of the implementation.
>
> There can be illegal characters at the start of the input if the data is MIME encoded.
> There would be no benefit in using SIMD for this case, so the stub now uses non-SIMD instructions for MIME-encoded data.
>
> A JMH micro, Base64Decode.java, is added for performance testing.
> With different input lengths (upper-bounded by the parameter `maxNumBytes` in the JMH micro),
> we see ~2.5x improvements with long inputs and no regression with short inputs for raw Base64 decoding, and minor improvements (~10.95%) for MIME on Kunpeng916.
>
> The Base64Decode.java JMH micro-benchmark results:
>
> Benchmark                          (lineSize)  (maxNumBytes)  Mode  Cnt       Score       Error  Units
>
> # Kunpeng916, intrinsic
> Base64Decode.testBase64Decode               4              1  avgt    5      48.614 ±     0.609  ns/op
> Base64Decode.testBase64Decode               4              3  avgt    5      58.199 ±     1.650  ns/op
> Base64Decode.testBase64Decode               4              7  avgt    5      69.400 ±     0.931  ns/op
> Base64Decode.testBase64Decode               4             32  avgt    5      96.818 ±     1.687  ns/op
> Base64Decode.testBase64Decode               4             64  avgt    5     122.856 ±     9.217  ns/op
> Base64Decode.testBase64Decode               4             80  avgt    5     130.935 ±     1.667  ns/op
> Base64Decode.testBase64Decode               4             96  avgt    5     143.627 ±     1.751  ns/op
> Base64Decode.testBase64Decode               4            112  avgt    5     152.311 ±     1.178  ns/op
> Base64Decode.testBase64Decode               4            512  avgt    5     342.631 ±     0.584  ns/op
> Base64Decode.testBase64Decode               4           1000  avgt    5     573.635 ±     1.050  ns/op
> Base64Decode.testBase64Decode               4          20000  avgt    5    9534.136 ±    45.172  ns/op
> Base64Decode.testBase64Decode               4          50000  avgt    5   22718.726 ±   192.070  ns/op
> Base64Decode.testBase64MIMEDecode           4              1  avgt   10      63.558 ±    0.336  ns/op
> Base64Decode.testBase64MIMEDecode           4              3  avgt   10      82.504 ±    0.848  ns/op
> Base64Decode.testBase64MIMEDecode           4              7  avgt   10     120.591 ±    0.608  ns/op
> Base64Decode.testBase64MIMEDecode           4             32  avgt   10     324.314 ±    6.236  ns/op
> Base64Decode.testBase64MIMEDecode           4             64  avgt   10     532.678 ±    4.670  ns/op
> Base64Decode.testBase64MIMEDecode           4             80  avgt   10     678.126 ±    4.324  ns/op
> Base64Decode.testBase64MIMEDecode           4             96  avgt   10     771.603 ±    6.393  ns/op
> Base64Decode.testBase64MIMEDecode           4            112  avgt   10     889.608 ±   0.759  ns/op
> Base64Decode.testBase64MIMEDecode           4            512  avgt   10    3663.557 ±    3.422  ns/op
> Base64Decode.testBase64MIMEDecode           4           1000  avgt   10    7017.784 ±    9.128  ns/op
> Base64Decode.testBase64MIMEDecode           4          20000  avgt   10  128670.660 ± 7951.521  ns/op
> Base64Decode.testBase64MIMEDecode           4          50000  avgt   10  317113.667 ±  161.758  ns/op
>
> # Kunpeng916, default
> Base64Decode.testBase64Decode               4              1  avgt    5      48.455 ±   0.571  ns/op
> Base64Decode.testBase64Decode               4              3  avgt    5      57.937 ±   0.505  ns/op
> Base64Decode.testBase64Decode               4              7  avgt    5      73.823 ±   1.452  ns/op
> Base64Decode.testBase64Decode               4             32  avgt    5     106.484 ±   1.243  ns/op
> Base64Decode.testBase64Decode               4             64  avgt    5     141.004 ±   1.188  ns/op
> Base64Decode.testBase64Decode               4             80  avgt    5     156.284 ±   0.572  ns/op
> Base64Decode.testBase64Decode               4             96  avgt    5     174.137 ±   0.177  ns/op
> Base64Decode.testBase64Decode               4            112  avgt    5     188.445 ±   0.572  ns/op
> Base64Decode.testBase64Decode               4            512  avgt    5     610.847 ±   1.559  ns/op
> Base64Decode.testBase64Decode               4           1000  avgt    5    1155.368 ±   0.813  ns/op
> Base64Decode.testBase64Decode               4          20000  avgt    5   19751.477 ±  24.669  ns/op
> Base64Decode.testBase64Decode               4          50000  avgt    5   50046.586 ± 523.155  ns/op
> Base64Decode.testBase64MIMEDecode           4              1  avgt   10      64.130 ±   0.238  ns/op
> Base64Decode.testBase64MIMEDecode           4              3  avgt   10      82.096 ±   0.205  ns/op
> Base64Decode.testBase64MIMEDecode           4              7  avgt   10     118.849 ±   0.610  ns/op
> Base64Decode.testBase64MIMEDecode           4             32  avgt   10     331.177 ±   4.732  ns/op
> Base64Decode.testBase64MIMEDecode           4             64  avgt   10     549.117 ±   0.177  ns/op
> Base64Decode.testBase64MIMEDecode           4             80  avgt   10     702.951 ±   4.572  ns/op
> Base64Decode.testBase64MIMEDecode           4             96  avgt   10     799.566 ±   0.301  ns/op
> Base64Decode.testBase64MIMEDecode           4            112  avgt   10     923.749 ±   0.389  ns/op
> Base64Decode.testBase64MIMEDecode           4            512  avgt   10    4000.725 ±   2.519  ns/op
> Base64Decode.testBase64MIMEDecode           4           1000  avgt   10    7674.994 ±   9.281  ns/op
> Base64Decode.testBase64MIMEDecode           4          20000  avgt   10  142059.001 ± 157.920  ns/op
> Base64Decode.testBase64MIMEDecode           4          50000  avgt   10  355698.369 ± 216.542  ns/op

Dong Bo has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision:

 - trivial fixes
 - Handling error in SIMD case with loops, combining two non-SIMD cases into one code blob, addressing other comments
 - Merge branch 'master' into aarch64.base64.decode
 - 8256245: AArch64: Implement Base64 decoding intrinsic

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/3228/files
  - new: https://git.openjdk.java.net/jdk/pull/3228/files/8a898aec..e658ebf4

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=3228&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=3228&range=00-01

  Stats: 9524 lines in 363 files changed: 7727 ins; 450 del; 1347 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3228.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3228/head:pull/3228

PR: https://git.openjdk.java.net/jdk/pull/3228

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

Dong Bo
In reply to this post by Andrew Haley-2
On Mon, 29 Mar 2021 08:38:59 GMT, Andrew Haley <[hidden email]> wrote:

> > With an initial implementation, the code size is reduced by almost half (1312B -> 748B). Sounds OK to you?
>
> Sounds great, but I'm still somewhat concerned that the non-SIMD case only offers 3-12% performance gain. Make it just 748 bytes, and therefore not icache-hostile, then perhaps the balance of risk and reward is justified.

Hi, @theRealAph @nick-arm

The code is updated. The error handling in the SIMD case was rewritten as loops.

The two non-SIMD code blocks were also combined into one.
Because there is only one non-SIMD loop now, it has been moved into `generate_base64_decodeBlock`.
The size of the stub is 692 bytes; the non-SIMD loop takes about 92 bytes, if my calculation is right.

Verified with the tests in `test/jdk/java/util/Base64/` and `compiler/intrinsics/base64/TestBase64.java`.
Compared with the previous implementation, the performance changes are negligible.

Other comments are addressed too. Thanks.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3228
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v3]

Dong Bo
In reply to this post by Dong Bo
> In JDK-8248188, an IntrinsicCandidate and API were added for Base64 decoding.
> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3; the basic idea is described at http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
>
> The patch passed jtreg tier1-3 tests with a linux-aarch64-server-fastdebug build.
> The tests in `test/jdk/java/util/Base64/` and `compiler/intrinsics/base64/TestBase64.java` were run specifically to check the correctness of the implementation.
>
> There can be illegal characters at the start of the input if the data is MIME encoded.
> There would be no benefit in using SIMD for this case, so the stub now uses non-SIMD instructions for MIME-encoded data.
>
> A JMH micro, Base64Decode.java, is added for performance testing.
> With different input lengths (upper-bounded by the parameter `maxNumBytes` in the JMH micro),
> we see ~2.5x improvements with long inputs and no regression with short inputs for raw Base64 decoding, and minor improvements (~10.95%) for MIME on Kunpeng916.
>
> The Base64Decode.java JMH micro-benchmark results:
>
> Benchmark                          (lineSize)  (maxNumBytes)  Mode  Cnt       Score       Error  Units
>
> # Kunpeng916, intrinsic
> Base64Decode.testBase64Decode               4              1  avgt    5      48.614 ±     0.609  ns/op
> Base64Decode.testBase64Decode               4              3  avgt    5      58.199 ±     1.650  ns/op
> Base64Decode.testBase64Decode               4              7  avgt    5      69.400 ±     0.931  ns/op
> Base64Decode.testBase64Decode               4             32  avgt    5      96.818 ±     1.687  ns/op
> Base64Decode.testBase64Decode               4             64  avgt    5     122.856 ±     9.217  ns/op
> Base64Decode.testBase64Decode               4             80  avgt    5     130.935 ±     1.667  ns/op
> Base64Decode.testBase64Decode               4             96  avgt    5     143.627 ±     1.751  ns/op
> Base64Decode.testBase64Decode               4            112  avgt    5     152.311 ±     1.178  ns/op
> Base64Decode.testBase64Decode               4            512  avgt    5     342.631 ±     0.584  ns/op
> Base64Decode.testBase64Decode               4           1000  avgt    5     573.635 ±     1.050  ns/op
> Base64Decode.testBase64Decode               4          20000  avgt    5    9534.136 ±    45.172  ns/op
> Base64Decode.testBase64Decode               4          50000  avgt    5   22718.726 ±   192.070  ns/op
> Base64Decode.testBase64MIMEDecode           4              1  avgt   10      63.558 ±    0.336  ns/op
> Base64Decode.testBase64MIMEDecode           4              3  avgt   10      82.504 ±    0.848  ns/op
> Base64Decode.testBase64MIMEDecode           4              7  avgt   10     120.591 ±    0.608  ns/op
> Base64Decode.testBase64MIMEDecode           4             32  avgt   10     324.314 ±    6.236  ns/op
> Base64Decode.testBase64MIMEDecode           4             64  avgt   10     532.678 ±    4.670  ns/op
> Base64Decode.testBase64MIMEDecode           4             80  avgt   10     678.126 ±    4.324  ns/op
> Base64Decode.testBase64MIMEDecode           4             96  avgt   10     771.603 ±    6.393  ns/op
> Base64Decode.testBase64MIMEDecode           4            112  avgt   10     889.608 ±   0.759  ns/op
> Base64Decode.testBase64MIMEDecode           4            512  avgt   10    3663.557 ±    3.422  ns/op
> Base64Decode.testBase64MIMEDecode           4           1000  avgt   10    7017.784 ±    9.128  ns/op
> Base64Decode.testBase64MIMEDecode           4          20000  avgt   10  128670.660 ± 7951.521  ns/op
> Base64Decode.testBase64MIMEDecode           4          50000  avgt   10  317113.667 ±  161.758  ns/op
>
> # Kunpeng916, default
> Base64Decode.testBase64Decode               4              1  avgt    5      48.455 ±   0.571  ns/op
> Base64Decode.testBase64Decode               4              3  avgt    5      57.937 ±   0.505  ns/op
> Base64Decode.testBase64Decode               4              7  avgt    5      73.823 ±   1.452  ns/op
> Base64Decode.testBase64Decode               4             32  avgt    5     106.484 ±   1.243  ns/op
> Base64Decode.testBase64Decode               4             64  avgt    5     141.004 ±   1.188  ns/op
> Base64Decode.testBase64Decode               4             80  avgt    5     156.284 ±   0.572  ns/op
> Base64Decode.testBase64Decode               4             96  avgt    5     174.137 ±   0.177  ns/op
> Base64Decode.testBase64Decode               4            112  avgt    5     188.445 ±   0.572  ns/op
> Base64Decode.testBase64Decode               4            512  avgt    5     610.847 ±   1.559  ns/op
> Base64Decode.testBase64Decode               4           1000  avgt    5    1155.368 ±   0.813  ns/op
> Base64Decode.testBase64Decode               4          20000  avgt    5   19751.477 ±  24.669  ns/op
> Base64Decode.testBase64Decode               4          50000  avgt    5   50046.586 ± 523.155  ns/op
> Base64Decode.testBase64MIMEDecode           4              1  avgt   10      64.130 ±   0.238  ns/op
> Base64Decode.testBase64MIMEDecode           4              3  avgt   10      82.096 ±   0.205  ns/op
> Base64Decode.testBase64MIMEDecode           4              7  avgt   10     118.849 ±   0.610  ns/op
> Base64Decode.testBase64MIMEDecode           4             32  avgt   10     331.177 ±   4.732  ns/op
> Base64Decode.testBase64MIMEDecode           4             64  avgt   10     549.117 ±   0.177  ns/op
> Base64Decode.testBase64MIMEDecode           4             80  avgt   10     702.951 ±   4.572  ns/op
> Base64Decode.testBase64MIMEDecode           4             96  avgt   10     799.566 ±   0.301  ns/op
> Base64Decode.testBase64MIMEDecode           4            112  avgt   10     923.749 ±   0.389  ns/op
> Base64Decode.testBase64MIMEDecode           4            512  avgt   10    4000.725 ±   2.519  ns/op
> Base64Decode.testBase64MIMEDecode           4           1000  avgt   10    7674.994 ±   9.281  ns/op
> Base64Decode.testBase64MIMEDecode           4          20000  avgt   10  142059.001 ± 157.920  ns/op
> Base64Decode.testBase64MIMEDecode           4          50000  avgt   10  355698.369 ± 216.542  ns/op

Dong Bo has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision:

 - Merge branch 'master' into aarch64.base64.decode
 - copyright
 - trivial fixes
 - Handling error in SIMD case with loops, combining two non-SIMD cases into one code blob, addressing other comments
 - Merge branch 'master' into aarch64.base64.decode
 - 8256245: AArch64: Implement Base64 decoding intrinsic

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/3228/files
  - new: https://git.openjdk.java.net/jdk/pull/3228/files/e658ebf4..16ebc471

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=3228&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=3228&range=01-02

  Stats: 7270 lines in 287 files changed: 5225 ins; 950 del; 1095 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3228.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3228/head:pull/3228

PR: https://git.openjdk.java.net/jdk/pull/3228

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

Dong Bo
In reply to this post by Dong Bo
On Tue, 30 Mar 2021 03:24:16 GMT, Dong Bo <[hidden email]> wrote:

>>> I think I can rewrite this part as loops.
>>> With an initial implementation, the code size is reduced by almost half (1312B -> 748B). Sounds OK to you?
>>
>> Sounds great, but I'm still somewhat concerned that the non-SIMD case only offers 3-12% performance gain. Make it just 748 bytes, and therefore not icache-hostile, then perhaps the balance of risk and reward is justified.
>
> Hi, @theRealAph @nick-arm
>
> The code is updated. The error handling in the SIMD case was rewritten as loops.
>
> The two non-SIMD code blocks were also combined into one.
> Because there is only one non-SIMD loop now, it has been moved into `generate_base64_decodeBlock`.
> The size of the stub is 692 bytes; the non-SIMD loop takes about 92 bytes, if my calculation is right.
>
> Verified with the tests in `test/jdk/java/util/Base64/` and `compiler/intrinsics/base64/TestBase64.java`.
> Compared with the previous implementation, the performance changes are negligible.
>
> Other comments are addressed too. Thanks.

PING... Any suggestions on the updated commit?

-------------

PR: https://git.openjdk.java.net/jdk/pull/3228

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v3]

Andrew Haley-2
In reply to this post by Dong Bo
On Fri, 2 Apr 2021 03:10:57 GMT, Dong Bo <[hidden email]> wrote:

>> In JDK-8248188, an IntrinsicCandidate and API were added for Base64 decoding.
>> Base64 decoding can be improved on aarch64 with ld4/tbl/tbx/st3; the basic idea is described at http://0x80.pl/articles/base64-simd-neon.html#encoding-quadwords.
>>
>> The patch passed jtreg tier1-3 tests with a linux-aarch64-server-fastdebug build.
>> The tests in `test/jdk/java/util/Base64/` and `compiler/intrinsics/base64/TestBase64.java` were run specifically to check the correctness of the implementation.
>>
>> There can be illegal characters at the start of the input if the data is MIME encoded.
>> There would be no benefit in using SIMD for this case, so the stub now uses non-SIMD instructions for MIME-encoded data.
>>
>> A JMH micro, Base64Decode.java, is added for performance testing.
>> With different input lengths (upper-bounded by the parameter `maxNumBytes` in the JMH micro),
>> we see ~2.5x improvements with long inputs and no regression with short inputs for raw Base64 decoding, and minor improvements (~10.95%) for MIME on Kunpeng916.
>>
>> The Base64Decode.java JMH micro-benchmark results:
>>
>> Benchmark                          (lineSize)  (maxNumBytes)  Mode  Cnt       Score       Error  Units
>>
>> # Kunpeng916, intrinsic
>> Base64Decode.testBase64Decode               4              1  avgt    5      48.614 ±     0.609  ns/op
>> Base64Decode.testBase64Decode               4              3  avgt    5      58.199 ±     1.650  ns/op
>> Base64Decode.testBase64Decode               4              7  avgt    5      69.400 ±     0.931  ns/op
>> Base64Decode.testBase64Decode               4             32  avgt    5      96.818 ±     1.687  ns/op
>> Base64Decode.testBase64Decode               4             64  avgt    5     122.856 ±     9.217  ns/op
>> Base64Decode.testBase64Decode               4             80  avgt    5     130.935 ±     1.667  ns/op
>> Base64Decode.testBase64Decode               4             96  avgt    5     143.627 ±     1.751  ns/op
>> Base64Decode.testBase64Decode               4            112  avgt    5     152.311 ±     1.178  ns/op
>> Base64Decode.testBase64Decode               4            512  avgt    5     342.631 ±     0.584  ns/op
>> Base64Decode.testBase64Decode               4           1000  avgt    5     573.635 ±     1.050  ns/op
>> Base64Decode.testBase64Decode               4          20000  avgt    5    9534.136 ±    45.172  ns/op
>> Base64Decode.testBase64Decode               4          50000  avgt    5   22718.726 ±   192.070  ns/op
>> Base64Decode.testBase64MIMEDecode           4              1  avgt   10      63.558 ±    0.336  ns/op
>> Base64Decode.testBase64MIMEDecode           4              3  avgt   10      82.504 ±    0.848  ns/op
>> Base64Decode.testBase64MIMEDecode           4              7  avgt   10     120.591 ±    0.608  ns/op
>> Base64Decode.testBase64MIMEDecode           4             32  avgt   10     324.314 ±    6.236  ns/op
>> Base64Decode.testBase64MIMEDecode           4             64  avgt   10     532.678 ±    4.670  ns/op
>> Base64Decode.testBase64MIMEDecode           4             80  avgt   10     678.126 ±    4.324  ns/op
>> Base64Decode.testBase64MIMEDecode           4             96  avgt   10     771.603 ±    6.393  ns/op
>> Base64Decode.testBase64MIMEDecode           4            112  avgt   10     889.608 ±   0.759  ns/op
>> Base64Decode.testBase64MIMEDecode           4            512  avgt   10    3663.557 ±    3.422  ns/op
>> Base64Decode.testBase64MIMEDecode           4           1000  avgt   10    7017.784 ±    9.128  ns/op
>> Base64Decode.testBase64MIMEDecode           4          20000  avgt   10  128670.660 ± 7951.521  ns/op
>> Base64Decode.testBase64MIMEDecode           4          50000  avgt   10  317113.667 ±  161.758  ns/op
>>
>> # Kunpeng916, default
>> Base64Decode.testBase64Decode               4              1  avgt    5      48.455 ±   0.571  ns/op
>> Base64Decode.testBase64Decode               4              3  avgt    5      57.937 ±   0.505  ns/op
>> Base64Decode.testBase64Decode               4              7  avgt    5      73.823 ±   1.452  ns/op
>> Base64Decode.testBase64Decode               4             32  avgt    5     106.484 ±   1.243  ns/op
>> Base64Decode.testBase64Decode               4             64  avgt    5     141.004 ±   1.188  ns/op
>> Base64Decode.testBase64Decode               4             80  avgt    5     156.284 ±   0.572  ns/op
>> Base64Decode.testBase64Decode               4             96  avgt    5     174.137 ±   0.177  ns/op
>> Base64Decode.testBase64Decode               4            112  avgt    5     188.445 ±   0.572  ns/op
>> Base64Decode.testBase64Decode               4            512  avgt    5     610.847 ±   1.559  ns/op
>> Base64Decode.testBase64Decode               4           1000  avgt    5    1155.368 ±   0.813  ns/op
>> Base64Decode.testBase64Decode               4          20000  avgt    5   19751.477 ±  24.669  ns/op
>> Base64Decode.testBase64Decode               4          50000  avgt    5   50046.586 ± 523.155  ns/op
>> Base64Decode.testBase64MIMEDecode           4              1  avgt   10      64.130 ±   0.238  ns/op
>> Base64Decode.testBase64MIMEDecode           4              3  avgt   10      82.096 ±   0.205  ns/op
>> Base64Decode.testBase64MIMEDecode           4              7  avgt   10     118.849 ±   0.610  ns/op
>> Base64Decode.testBase64MIMEDecode           4             32  avgt   10     331.177 ±   4.732  ns/op
>> Base64Decode.testBase64MIMEDecode           4             64  avgt   10     549.117 ±   0.177  ns/op
>> Base64Decode.testBase64MIMEDecode           4             80  avgt   10     702.951 ±   4.572  ns/op
>> Base64Decode.testBase64MIMEDecode           4             96  avgt   10     799.566 ±   0.301  ns/op
>> Base64Decode.testBase64MIMEDecode           4            112  avgt   10     923.749 ±   0.389  ns/op
>> Base64Decode.testBase64MIMEDecode           4            512  avgt   10    4000.725 ±   2.519  ns/op
>> Base64Decode.testBase64MIMEDecode           4           1000  avgt   10    7674.994 ±   9.281  ns/op
>> Base64Decode.testBase64MIMEDecode           4          20000  avgt   10  142059.001 ± 157.920  ns/op
>> Base64Decode.testBase64MIMEDecode           4          50000  avgt   10  355698.369 ± 216.542  ns/op
>
> Dong Bo has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision:
>
>  - Merge branch 'master' into aarch64.base64.decode
>  - copyright
>  - trivial fixes
>  - Handling error in SIMD case with loops, combining two non-SIMD cases into one code blob, addressing other comments
>  - Merge branch 'master' into aarch64.base64.decode
>  - 8256245: AArch64: Implement Base64 decoding intrinsic

test/micro/org/openjdk/bench/java/util/Base64Decode.java line 85:

> 83:         }
> 84:     }
> 85:

Are there any existing test cases for failing inputs?

-------------

PR: https://git.openjdk.java.net/jdk/pull/3228

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v3]

Andrew Haley-2
In reply to this post by Dong Bo
On Fri, 2 Apr 2021 03:10:57 GMT, Dong Bo <[hidden email]> wrote:

> Dong Bo has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision:
>
>  - Merge branch 'master' into aarch64.base64.decode
>  - copyright
>  - trivial fixes
>  - Handling error in SIMD case with loops, combining two non-SIMD cases into one code blob, addressing other comments
>  - Merge branch 'master' into aarch64.base64.decode
>  - 8256245: AArch64: Implement Base64 decoding intrinsic

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 5802:

> 5800:     // The 1st character of the input can be illegal if the data is MIME encoded.
> 5801:     // We cannot benefits from SIMD for this case. The max line size of MIME
> 5802:     // encoding is 76, with the PreProcess80B blob, we actually use no-simd

"cannot benefit"

-------------

PR: https://git.openjdk.java.net/jdk/pull/3228

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v4]

Dong Bo
In reply to this post by Dong Bo

Dong Bo has updated the pull request incrementally with one additional commit since the last revision:

  load data with one ldrw, add JMH tests for error inputs

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/3228/files
  - new: https://git.openjdk.java.net/jdk/pull/3228/files/16ebc471..54a75f05

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=3228&range=03
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=3228&range=02-03

  Stats: 37 lines in 2 files changed: 30 ins; 0 del; 7 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3228.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3228/head:pull/3228

PR: https://git.openjdk.java.net/jdk/pull/3228

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic [v5]

Dong Bo
In reply to this post by Dong Bo

Dong Bo has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits:

 - conflicts resolved
 - Merge branch 'master' of https://git.openjdk.java.net/jdk into aarch64.base64.decode
 - resovling conflicts
 - load data with one ldrw, add JMH tests for error inputs
 - Merge branch 'master' into aarch64.base64.decode
 - copyright
 - trivial fixes
 - Handling error in SIMD case with loops, combining two non-SIMD cases into one code blob, addressing other comments
 - Merge branch 'master' into aarch64.base64.decode
 - 8256245: AArch64: Implement Base64 decoding intrinsic

-------------

Changes: https://git.openjdk.java.net/jdk/pull/3228/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=3228&range=04
  Stats: 438 lines in 3 files changed: 438 ins; 0 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3228.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3228/head:pull/3228

PR: https://git.openjdk.java.net/jdk/pull/3228

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

Dong Bo
In reply to this post by Dong Bo
On Fri, 2 Apr 2021 10:17:57 GMT, Andrew Haley <[hidden email]> wrote:

>> PING... Any suggestions on the updated commit?
>
>> PING... Any suggestions on the updated commit?
>
> Once you reply to the comments, sure.

>
> Are there any existing test cases for failing inputs?
>
I added one; the illegal character is injected at a parameterized index of the encoded data.
There is no big difference for small error-injection indexes; it seems most of the time is taken by exception handling.
With the error at index 20000, we see the expected ~2x performance improvement. The JMH tests:
Benchmark                                     (errorIndex)  (lineSize)  (maxNumBytes)  Mode  Cnt      Score     Error  Units
### Kunpeng 916, intrinsic, tested with `-jar benchmarks.jar testBase64WithErrorInputsDecode -p errorIndex=3,64,144,208,272,1000,20000 -p maxNumBytes=1`
Base64Decode.testBase64WithErrorInputsDecode             3           4              1  avgt   10   3696.151 ± 202.783  ns/op
Base64Decode.testBase64WithErrorInputsDecode            64           4              1  avgt   10   3899.269 ± 178.289  ns/op
Base64Decode.testBase64WithErrorInputsDecode           144           4              1  avgt   10   3902.022 ± 163.611  ns/op
Base64Decode.testBase64WithErrorInputsDecode           208           4              1  avgt   10   3982.423 ± 256.638  ns/op
Base64Decode.testBase64WithErrorInputsDecode           272           4              1  avgt   10   3984.545 ± 144.282  ns/op
Base64Decode.testBase64WithErrorInputsDecode          1000           4              1  avgt   10   4532.959 ± 310.068  ns/op
Base64Decode.testBase64WithErrorInputsDecode         20000           4              1  avgt   10  17578.148 ± 631.600  ns/op
### Kunpeng 916, default, tested with `-XX:-UseBASE64Intrinsics -jar benchmarks.jar testBase64WithErrorInputsDecode -p errorIndex=3,64,144,208,272,1000,20000 -p maxNumBytes=1`
Base64Decode.testBase64WithErrorInputsDecode             3           4              1  avgt   10   3760.330 ± 261.672  ns/op
Base64Decode.testBase64WithErrorInputsDecode            64           4              1  avgt   10   3900.326 ± 121.632  ns/op
Base64Decode.testBase64WithErrorInputsDecode           144           4              1  avgt   10   4041.428 ± 174.435  ns/op
Base64Decode.testBase64WithErrorInputsDecode           208           4              1  avgt   10   4177.670 ± 214.433  ns/op
Base64Decode.testBase64WithErrorInputsDecode           272           4              1  avgt   10   4324.020 ± 106.826  ns/op
Base64Decode.testBase64WithErrorInputsDecode          1000           4              1  avgt   10   5476.469 ± 171.647  ns/op
Base64Decode.testBase64WithErrorInputsDecode         20000           4              1  avgt   10  34163.743 ± 162.263  ns/op
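
For readers who want to reproduce the failing-input case outside JMH, here is a minimal sketch of the error-injection idea using only the public `java.util.Base64` API. It is not the code added to Base64Decode.java; the class name `ErrorInputSketch`, the helper names, the 30000-byte payload and the choice of `*` as the illegal character are all assumptions made for illustration.

```java
import java.util.Base64;

public class ErrorInputSketch {
    // Encode numBytes of deterministic data, then overwrite one encoded
    // character with '*', which is not part of the Base64 alphabet.
    // (Hypothetical helper, not taken from the JMH micro.)
    static byte[] encodeWithError(int numBytes, int errorIndex) {
        byte[] src = new byte[numBytes];
        for (int i = 0; i < numBytes; i++) {
            src[i] = (byte) i;
        }
        byte[] encoded = Base64.getEncoder().encode(src);
        encoded[errorIndex] = (byte) '*';
        return encoded;
    }

    // Returns true if the basic decoder rejects the input.
    static boolean decodeFails(byte[] encoded) {
        try {
            Base64.getDecoder().decode(encoded);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Same error indexes as the -p errorIndex=... list above.
        int[] errorIndexes = {3, 64, 144, 208, 272, 1000, 20000};
        for (int errorIndex : errorIndexes) {
            // 30000 source bytes encode to 40000 characters, so every index fits.
            byte[] bad = encodeWithError(30000, errorIndex);
            System.out.println("errorIndex=" + errorIndex
                    + " rejected=" + decodeFails(bad));
        }
    }
}
```

Each call pays for constructing and throwing an `IllegalArgumentException`, which is consistent with the observation that small error indexes are dominated by exception handling.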

>
> Your test results suggest that it isn't useful for that, surely?
>
The results suggest the non-SIMD code provides a ~11.9% improvement for MIME decoding.
Furthermore, according to local tests, we may see a ~30% performance regression for MIME decoding without the non-SIMD code.

In the worst case, a MIME line has only 4 base64-encoded characters followed by a line separator made up of illegal characters, e.g. `\r\n`.
When the intrinsic encounters an illegal character (`\r`), it has to exit.
Then the Java code will pass the next illegal source byte (`\n`) to the intrinsic.
With only SIMD code, it would execute too many wasted instructions before it could detect the error.
With non-SIMD code, the intrinsic executes only one non-SIMD round for this error input.
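
To make the worst case concrete, the following sketch builds such an input with the public `java.util.Base64` API: 4 encoded characters per line, separated by `\r\n`. The class and helper names are hypothetical, and the sketch only shows the shape of the data; it does not measure or model the intrinsic itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class MimeWorstCaseSketch {
    // Encode with 4 Base64 characters per line and "\r\n" between lines.
    static byte[] worstCaseMime(byte[] payload) {
        return Base64.getMimeEncoder(4, new byte[] {'\r', '\n'}).encode(payload);
    }

    // Count '\r' and '\n' bytes, which lie outside the Base64 alphabet.
    static int countLineSeparatorBytes(byte[] data) {
        int n = 0;
        for (byte b : data) {
            if (b == '\r' || b == '\n') {
                n++;
            }
        }
        return n;
    }

    // True if the basic (non-MIME) decoder rejects the input.
    static boolean basicDecoderRejects(byte[] data) {
        try {
            Base64.getDecoder().decode(data);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // 12 payload bytes encode to 16 Base64 characters, i.e. 4 lines of 4.
        byte[] payload = "0123456789ab".getBytes(StandardCharsets.US_ASCII);
        byte[] mime = worstCaseMime(payload);

        System.out.println("encoded length: " + mime.length);
        System.out.println("separator bytes: " + countLineSeparatorBytes(mime));
        System.out.println("basic decoder rejects: " + basicDecoderRejects(mime));

        // The MIME decoder skips the separators and recovers the payload.
        byte[] decoded = Base64.getMimeDecoder().decode(mime);
        System.out.println("round trip ok: " + Arrays.equals(payload, decoded));
    }
}
```

In input of this shape every run of 4 valid characters is followed by two bytes that force an exit from the intrinsic, which is the situation where a long SIMD setup would be wasted.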

>
> Four loads and four post increments rather than one load and a few BFMs? Why?
>
Nice suggestion. Done, thanks.
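
As a rough Java analogy for the change discussed here (one 32-bit load followed by bit-field extraction, instead of four separate byte loads), consider the sketch below. The real stub is AArch64 assembly generated in stubGenerator_aarch64.cpp; the class and method names here are hypothetical, and little-endian byte order is assumed so that the first source byte ends up in the low 8 bits.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class OneLoadSketch {
    // Four separate byte loads, one per input character.
    static int[] fourLoads(byte[] src, int off) {
        return new int[] {
            src[off] & 0xff,
            src[off + 1] & 0xff,
            src[off + 2] & 0xff,
            src[off + 3] & 0xff
        };
    }

    // One 32-bit little-endian load, then shifts and masks to pull out
    // each byte (the Java counterpart of bit-field extraction).
    static int[] oneLoad(byte[] src, int off) {
        int w = ByteBuffer.wrap(src).order(ByteOrder.LITTLE_ENDIAN).getInt(off);
        return new int[] {
            w & 0xff,
            (w >>> 8) & 0xff,
            (w >>> 16) & 0xff,
            (w >>> 24) & 0xff
        };
    }

    public static void main(String[] args) {
        byte[] src = {'T', 'W', 'F', 'u'};
        System.out.println(Arrays.toString(fourLoads(src, 0)));
        System.out.println(Arrays.toString(oneLoad(src, 0)));
    }
}
```

Both methods yield the same four values; the second touches memory once and does the rest in registers.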

-------------

PR: https://git.openjdk.java.net/jdk/pull/3228

Re: RFR: 8256245: AArch64: Implement Base64 decoding intrinsic

Andrew Haley-2
In reply to this post by Dong Bo
On Sat, 27 Mar 2021 08:58:03 GMT, Dong Bo <[hidden email]> wrote:

> There can be illegal characters at the start of the input if the data is MIME encoded.
> There would be no benefit in using SIMD for this case, so the stub now uses non-SIMD instructions for MIME-encoded data.

What is the reasoning here? Sure, there can be illegal characters at the start, but what if there are not? The generic logic uses decodeBlock() even in the MIME case, because we don't know that there certainly will be illegal characters.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3228