[10] RFR: 8187472 - AARCH64: array_equals intrinsic doesn't use prefetch for large arrays

classic Classic list List threaded Threaded
15 messages Options
Reply | Threaded
Open this post in threaded view
|

[10] RFR: 8187472 - AARCH64: array_equals intrinsic doesn't use prefetch for large arrays

Dmitrij Pochepko-2
Hi,

as part of JEP “Improve performance of String and Array operations on
AArch64” I wanted to send out a pre-review for some of the improved
intrinsics to get early feedback. This is the first in a row.

Please pre-review patch for 8187472 - “AARCH64: array_equals intrinsic
doesn't use prefetch for large arrays” which improves large array
handling (small arrays are unaffected).

In short, this patch uses large (64 byte) loop with prefetch instruction
to handle large arrays, which is done in a stub. I can observe
performance boost on systems without h/w prefetcher up to x6. System
with hardware prefetching (Cortex A53 and some very modern ones) also
benefit from this patch (15% improvement).

I've tried a number of different versions (attached to JDK-8187472) with
different load instructions (ldr/ldp/<simd>), slightly different code
shapes, different data dependencies across registers, alignments, e.t.c.
Version presented in webrev (version 2.6d from JDK-8187472 attachments)
is the simplest from the fast ones (as measured on 3 systems available
for testing).

I've used this simple benchmark to measure performance:
http://cr.openjdk.java.net/~dpochepk/8187472/ArrayEqualsBench.java

Chart for ThunderX:
http://cr.openjdk.java.net/~dpochepk/8187472/ThunderX.png

Chart for Cortex A53(R-Pi):
http://cr.openjdk.java.net/~dpochepk/8187472/R-Pi.png

Raw numbers for ThunderX:
http://cr.openjdk.java.net/~dpochepk/8187472/ThunderX.results.txt

Raw numbers for R-Pi:
http://cr.openjdk.java.net/~dpochepk/8187472/R-Pi.results.txt

webrev: http://cr.openjdk.java.net/~dpochepk/8187472/webrev.01/

Testing: I've run existing jtreg test
(java/util/Arrays/ArraysEqCmpTest.java) in both Xmixed and Xcomp and
found no regressions.

Any additional numbers on other systems are welcome, as well as early
feedback on the code.

Thanks,

Dmitrij
Reply | Threaded
Open this post in threaded view
|

Re: [aarch64-port-dev ] [10] RFR: 8187472 - AARCH64: array_equals intrinsic doesn't use prefetch for large arrays

Andrew Haley
On 30/10/17 15:42, Dmitrij Pochepko wrote:
> Any additional numbers on other systems are welcome, as well as early
> feedback on the code.

I take it that the small comparisons are unaffected.  The small
comparisons are very common, so they shouldn't be ignored.

The patch seems unobjectionable, but it's extremely hard to test
this stuff.

Why is this change:

@@ -16154,7 +16154,7 @@
   ins_pipe(pipe_class_memory);
 %}

-instruct string_equalsL(iRegP_R1 str1, iRegP_R3 str2, iRegI_R4 cnt,
+instruct string_equalsL(iRegP_R1 str1, iRegP_R3 str2, iRegP_R4 cnt,
                         iRegI_R0 result, rFlagsReg cr)
 %{
   predicate(((StrEqualsNode*)n)->encoding() == StrIntrinsicNode::LL);

It seems very odd to me.

Was a vertor-based implementation considered?

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Reply | Threaded
Open this post in threaded view
|

Re: [aarch64-port-dev ] [10] RFR: 8187472 - AARCH64: array_equals intrinsic doesn't use prefetch for large arrays

Dmitrij Pochepko-2


On 30.10.2017 19:13, Andrew Haley wrote:
> On 30/10/17 15:42, Dmitrij Pochepko wrote:
>> Any additional numbers on other systems are welcome, as well as early
>> feedback on the code.
> I take it that the small comparisons are unaffected.  The small
> comparisons are very common, so they shouldn't be ignored.
>
> The patch seems unobjectionable, but it's extremely hard to test
> this stuff.
Well, I've actually used small brute force test which generates all
cases for arrays length from 1 to N(parameter) to test it, because I
couldn't find better way.

i.e.:
case 0: equal arrays
case 1: arrays different in 1st symbol
...
case N: arrays different in (N-1)th symbol


And this test passed. However, I don't think such test should be added
to jtreg testbase, because it takes long time to run, so, I assume
existing array equals test is enough.


>
> Why is this change:
>
> @@ -16154,7 +16154,7 @@
>     ins_pipe(pipe_class_memory);
>   %}
>
> -instruct string_equalsL(iRegP_R1 str1, iRegP_R3 str2, iRegI_R4 cnt,
> +instruct string_equalsL(iRegP_R1 str1, iRegP_R3 str2, iRegP_R4 cnt,
>                           iRegI_R0 result, rFlagsReg cr)
>   %{
>     predicate(((StrEqualsNode*)n)->encoding() == StrIntrinsicNode::LL);
>
> It seems very odd to me.
You're right. It's leftover from previous versions. It can be reverted
back to iRegI_R4.
>
> Was a vertor-based implementation considered?
>
Yes.
I've tried simd loads(even aligned ones to be sure that alignment is not
an issue). simd versions were attached into JDK-8187472 as
  - v5.0(simd loads, 16-byte address alignment, 64 bytes per 1 loop
iteration)
  - v7.0(simd loads, 16-byte alignment, 64 bytes per 1 loop iteration)
  - v9.0(simd loads, 64 byte alignment, 128 bytes per 1 loop iteration).

I've measured it on ThunderX and found while best non-simd version
handles 1000000 bytes arrays in ~295 microseconds, simd versions had
numbers about ~355 microseconds.

Thanks,
Dmitrij
Reply | Threaded
Open this post in threaded view
|

Re: [aarch64-port-dev ] [10] RFR: 8187472 - AARCH64: array_equals intrinsic doesn't use prefetch for large arrays

Andrew Haley
On 30/10/17 16:43, Dmitrij Pochepko wrote:

> I've tried simd loads(even aligned ones to be sure that alignment is not
> an issue). simd versions were attached into JDK-8187472 as
>   - v5.0(simd loads, 16-byte address alignment, 64 bytes per 1 loop
> iteration)
>   - v7.0(simd loads, 16-byte alignment, 64 bytes per 1 loop iteration)
>   - v9.0(simd loads, 64 byte alignment, 128 bytes per 1 loop iteration).
>
> I've measured it on ThunderX and found while best non-simd version
> handles 1000000 bytes arrays in ~295 microseconds, simd versions had
> numbers about ~355 microseconds.

I'm rather reluctant to accept non-SIMD intrinsics because I expect
SIMD performance to improve, and I expect SIMD to be the future.  The
same is true of implementations which avoid the use of ldp.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Reply | Threaded
Open this post in threaded view
|

Re: [aarch64-port-dev ] [10] RFR: 8187472 - AARCH64: array_equals intrinsic doesn't use prefetch for large arrays

Dmitrij Pochepko-2


On 30.10.2017 20:30, Andrew Haley wrote:

> On 30/10/17 16:43, Dmitrij Pochepko wrote:
>> I've tried simd loads(even aligned ones to be sure that alignment is not
>> an issue). simd versions were attached into JDK-8187472 as
>>    - v5.0(simd loads, 16-byte address alignment, 64 bytes per 1 loop
>> iteration)
>>    - v7.0(simd loads, 16-byte alignment, 64 bytes per 1 loop iteration)
>>    - v9.0(simd loads, 64 byte alignment, 128 bytes per 1 loop iteration).
>>
>> I've measured it on ThunderX and found while best non-simd version
>> handles 1000000 bytes arrays in ~295 microseconds, simd versions had
>> numbers about ~355 microseconds.
> I'm rather reluctant to accept non-SIMD intrinsics because I expect
> SIMD performance to improve, and I expect SIMD to be the future.  The
> same is true of implementations which avoid the use of ldp.
>
I also expected NEON to be faster on very new designs. Since I have a
SIMD version of this intrinsic that I can merge into stub under an if
with new option (like UseSIMDForArrayEquals with default value set to
false, almost the same as existing UseSIMDForMemoryOps, which is used in
array copy intrinsic) if you want, but it is slower for the CPUs we have
access to and likely not going to be the default. This way we'll have a
fast version and a SIMD version.

I am hesitant if it is best to do this, or keep a single, simple, and
fastest version for now for this intrinsic, and get back to it when SVE
becomes widely available.

What do you think?

Note that other intrinsics that are in the works will use SIMD.

Thanks,
Dmitrij
Reply | Threaded
Open this post in threaded view
|

Re: [aarch64-port-dev ] [10] RFR: 8187472 - AARCH64: array_equals intrinsic doesn't use prefetch for large arrays

Andrew Haley
On 30/10/17 18:03, Dmitrij Pochepko wrote:
> I am hesitant if it is best to do this, or keep a single, simple, and
> fastest version for now for this intrinsic, and get back to it when SVE
> becomes widely available.
>
> What do you think?

Do it now, or we'll have merge problems later.

> Note that other intrinsics that are in the works will use SIMD.

OK, thanks.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Reply | Threaded
Open this post in threaded view
|

Re: [aarch64-port-dev ] [10] RFR: 8187472 - AARCH64: array_equals intrinsic doesn't use prefetch for large arrays

Dmitrij Pochepko-2

ok, I'll create a merged version before sending this to review. Thanks for the feedback!

21:06, 30 октября 2017 г., Andrew Haley <[hidden email]>:

On 30/10/17 18:03, Dmitrij Pochepko wrote:

 I am hesitant if it is best to do this, or keep a single, simple, and
 fastest version for now for this intrinsic, and get back to it when SVE
 becomes widely available.

 What do you think?


Do it now, or we'll have merge problems later.

 Note that other intrinsics that are in the works will use SIMD.


OK, thanks.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


--
Отправлено из мобильного приложения Яндекс.Почты
Reply | Threaded
Open this post in threaded view
|

Re: [aarch64-port-dev ] [10] RFR: 8187472 - AARCH64: array_equals intrinsic doesn't use prefetch for large arrays

Dmitrij Pochepko-2

Hi,

please take a look at merged simd/non-simd version.

I've set ergonomics to use non-simd version on ThunderX, because simd is about 15% slower there.

R-Pi shows about the same results for both versions.


webrev: http://cr.openjdk.java.net/~dpochepk/8187472/webrev.02/


Thanks,

Dmitrij


On 30.10.2017 22:18, [hidden email] wrote:

ok, I'll create a merged version before sending this to review. Thanks for the feedback!

21:06, 30 октября 2017 г., Andrew Haley [hidden email]:

On 30/10/17 18:03, Dmitrij Pochepko wrote:

 I am hesitant if it is best to do this, or keep a single, simple, and
 fastest version for now for this intrinsic, and get back to it when SVE
 becomes widely available.

 What do you think?


Do it now, or we'll have merge problems later.

 Note that other intrinsics that are in the works will use SIMD.


OK, thanks.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


--
Отправлено из мобильного приложения Яндекс.Почты

Reply | Threaded
Open this post in threaded view
|

Re: [aarch64-port-dev ] [10] RFR: 8187472 - AARCH64: array_equals intrinsic doesn't use prefetch for large arrays

Andrew Haley
On 10/11/17 14:19, Dmitrij Pochepko wrote:

> please take a look at merged simd/non-simd version.

+    if (UseSIMDForArrayEquals) {
+      __ ld1(v0, v1, v2, v3, __ T2D, Address(__ post(a1, 64)));
+      __ ld1(v4, v5, v6, v7, __ T2D, Address(__ post(a2, 64)));

Is this post-increment correct?  It should be a multiple of wordSize.

Also, the indentation in generate_large_array_equals is wrong.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Reply | Threaded
Open this post in threaded view
|

Re: [aarch64-port-dev ] [10] RFR: 8187472 - AARCH64: array_equals intrinsic doesn't use prefetch for large arrays

Dmitrij Pochepko-2


On 10.11.2017 17:28, Andrew Haley wrote:
> On 10/11/17 14:19, Dmitrij Pochepko wrote:
>
>> please take a look at merged simd/non-simd version.
> +    if (UseSIMDForArrayEquals) {
> +      __ ld1(v0, v1, v2, v3, __ T2D, Address(__ post(a1, 64)));
> +      __ ld1(v4, v5, v6, v7, __ T2D, Address(__ post(a2, 64)));
>
> Is this post-increment correct?  It should be a multiple of wordSize.
It's correct, since we're loading 4x128-bit registers, which is 64
bytes. But I've changed it to "4 * 2 * wordSize", which has the same value.
>
> Also, the indentation in generate_large_array_equals is wrong.
>
Thank you for noticing it.

Please take a look at
http://cr.openjdk.java.net/~dpochepk/8187472/webrev.03/


Thanks,
Dmitrij
Reply | Threaded
Open this post in threaded view
|

Re: [aarch64-port-dev ] [10] RFR: 8187472 - AARCH64: array_equals intrinsic doesn't use prefetch for large arrays

Ningsheng Jian
Hi Dmitrij,

I can see regressions on some hardware with your patch.

Could you please try the micro-benchmark below on your systems?

http://people.linaro.org/~ningsheng.jian/test/StringOps.java

$ java -jar ./target/benchmarks.jar "StringEqualsLarge" -wi 10 -f 3 -i 10

Seems that your patch affects the inlining.

Thanks,
Ningsheng

On 10 November 2017 at 22:52, Dmitrij Pochepko
<[hidden email]> wrote:

>
>
> On 10.11.2017 17:28, Andrew Haley wrote:
>>
>> On 10/11/17 14:19, Dmitrij Pochepko wrote:
>>
>>> please take a look at merged simd/non-simd version.
>>
>> +    if (UseSIMDForArrayEquals) {
>> +      __ ld1(v0, v1, v2, v3, __ T2D, Address(__ post(a1, 64)));
>> +      __ ld1(v4, v5, v6, v7, __ T2D, Address(__ post(a2, 64)));
>>
>> Is this post-increment correct?  It should be a multiple of wordSize.
>
> It's correct, since we're loading 4x128-bit registers, which is 64 bytes.
> But I've changed it to "4 * 2 * wordSize", which has the same value.
>>
>>
>> Also, the indentation in generate_large_array_equals is wrong.
>>
> Thank you for noticing it.
>
> Please take a look at
> http://cr.openjdk.java.net/~dpochepk/8187472/webrev.03/
>
>
> Thanks,
> Dmitrij
Reply | Threaded
Open this post in threaded view
|

Re: [aarch64-port-dev ] [10] RFR: 8187472 - AARCH64: array_equals intrinsic doesn't use prefetch for large arrays

Andrew Haley
On 15/11/17 08:34, Ningsheng Jian wrote:

> Hi Dmitrij,
>
> I can see regressions on some hardware with your patch.
>
> Could you please try the micro-benchmark below on your systems?
>
> http://people.linaro.org/~ningsheng.jian/test/StringOps.java
>
> $ java -jar ./target/benchmarks.jar "StringEqualsLarge" -wi 10 -f 3 -i 10
>
> Seems that your patch affects the inlining.

Before the patch:

StringOps.jmhTimeStringEqualsIgnoreCaseLarge         avgt   10  2984.638 ? 12.899  ns/op
StringOps.jmhTimeStringEqualsIgnoreCaseSmall         avgt   10   319.434 ?  0.434  ns/op
StringOps.jmhTimeStringEqualsLarge                   avgt   10   146.122 ?  0.638  ns/op
StringOps.jmhTimeStringEqualsSmall                   avgt   10    45.998 ?  0.444  ns/op

After:
                                     Mode  Cnt     Score   Error  Units
StringOps.jmhTimeStringEqualsIgnoreCaseLarge  avgt   10  2982.222 ? 3.683  ns/op
StringOps.jmhTimeStringEqualsIgnoreCaseSmall  avgt   10   319.671 ? 0.902  ns/op
StringOps.jmhTimeStringEqualsLarge            avgt   10  1137.348 ? 1.088  ns/op
StringOps.jmhTimeStringEqualsSmall            avgt   10   194.698 ? 0.857  ns/op

Gosh.  I think that's a very bad patch.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Reply | Threaded
Open this post in threaded view
|

Re: [aarch64-port-dev ] [10] RFR: 8187472 - AARCH64: array_equals intrinsic doesn't use prefetch for large arrays

Andrew Haley
On 15/11/17 10:34, Andrew Haley wrote:
> Gosh.  I think that's a very bad patch.

In fact it's even worse than I thought.  It causes crashes in testing.
Nobody seems to have noticed that the .ad file is wrong.

Note this:

instruct string_equalsL(iRegP_R1 str1, iRegP_R3 str2, iRegI_R4 cnt,
                        iRegI_R0 result, rFlagsReg cr)

instruct array_equalsB(iRegP_R1 ary1, iRegP_R2 ary2, iRegI_R0 result,
                      iRegP_R4 tmp, rFlagsReg cr)

Spot the difference?  str2 is in R3, ary2 is in R2.  That means you
can't call the same stub from both of these patterns.

If you're adding a stub call you must assert that the passed registers
are correct.  Add the assertions, fix the register usage, and test it
properly, both with and without UseSIMDForArrayEquals.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Reply | Threaded
Open this post in threaded view
|

Re: [aarch64-port-dev ] [10] RFR: 8187472 - AARCH64: array_equals intrinsic doesn't use prefetch for large arrays

Dmitrij Pochepko-2
Hi,

Ningsheng, Andrew,

thank you for looking into in. Your pre-review of this patch helps a
lot. I was originally focused on Arrays::equals, without paying too much
attention to String::equals, so, haven't found these issues.


The cause of this slowdown was a small bug Andrew mentioned on first
webrev(iRegP/iRegI change). webrev.03 you were using has leftover of
this problem
(http://cr.openjdk.java.net/~dpochepk/8187472/webrev.03/src/hotspot/cpu/aarch64/aarch64.ad.udiff.html).
And it caused this intrinsic to be ignored by matcher, so, no intrinsic
was used at all.


I changed back iRegP to iRegI and it fixed slowdown. I also changed R3
to R2 to keep 2nd string in .ad file to fix 2nd problem Andrew found and
it seems to work fine.

new webrev (only .ad file was changed):
http://cr.openjdk.java.net/~dpochepk/8187472/webrev.04/



Ningsheng's benchmark numbers for this patch on ThunderX:


Before:

Benchmark                           Mode  Cnt    Score   Error Units
StringOps.jmhTimeStringEqualsLarge  avgt   10  235.631 ? 0.054 ns/op


After:

Benchmark                           Mode  Cnt    Score   Error Units
StringOps.jmhTimeStringEqualsLarge  avgt   10  216.660 ? 0.304 ns/op


I still have to test this patch a bit more to be sure no other problems
appears.


Thanks,

Dmitrij


On 15.11.2017 15:16, Andrew Haley wrote:

> On 15/11/17 10:34, Andrew Haley wrote:
>> Gosh.  I think that's a very bad patch.
> In fact it's even worse than I thought.  It causes crashes in testing.
> Nobody seems to have noticed that the .ad file is wrong.
>
> Note this:
>
> instruct string_equalsL(iRegP_R1 str1, iRegP_R3 str2, iRegI_R4 cnt,
>                          iRegI_R0 result, rFlagsReg cr)
>
> instruct array_equalsB(iRegP_R1 ary1, iRegP_R2 ary2, iRegI_R0 result,
>                        iRegP_R4 tmp, rFlagsReg cr)
>
> Spot the difference?  str2 is in R3, ary2 is in R2.  That means you
> can't call the same stub from both of these patterns.
>
> If you're adding a stub call you must assert that the passed registers
> are correct.  Add the assertions, fix the register usage, and test it
> properly, both with and without UseSIMDForArrayEquals.
>

Reply | Threaded
Open this post in threaded view
|

Re: [aarch64-port-dev ] [10] RFR: 8187472 - AARCH64: array_equals intrinsic doesn't use prefetch for large arrays

Ningsheng Jian
Hi Dmitrij,

Your code below (T2D) in stubGenerator_aarch64.cpp will trigger
assertion in assembler on debug build.

3935       __ eor(v0, __ T2D, v0, v4);
3936       __ eor(v1, __ T2D, v1, v5);

I ran your benchmarks in some platforms. Will share the benchmark data
to you in a separate mail.

Thanks,
Ningsheng

On 16 November 2017 at 05:31, Dmitrij Pochepko
<[hidden email]> wrote:

> Hi,
>
> Ningsheng, Andrew,
>
> thank you for looking into in. Your pre-review of this patch helps a lot. I
> was originally focused on Arrays::equals, without paying too much attention
> to String::equals, so, haven't found these issues.
>
>
> The cause of this slowdown was a small bug Andrew mentioned on first
> webrev(iRegP/iRegI change). webrev.03 you were using has leftover of this
> problem
> (http://cr.openjdk.java.net/~dpochepk/8187472/webrev.03/src/hotspot/cpu/aarch64/aarch64.ad.udiff.html).
> And it caused this intrinsic to be ignored by matcher, so, no intrinsic was
> used at all.
>
>
> I changed back iRegP to iRegI and it fixed slowdown. I also changed R3 to R2
> to keep 2nd string in .ad file to fix 2nd problem Andrew found and it seems
> to work fine.
>
> new webrev (only .ad file was changed):
> http://cr.openjdk.java.net/~dpochepk/8187472/webrev.04/
>
>
>
> Ningsheng's benchmark numbers for this patch on ThunderX:
>
>
> Before:
>
> Benchmark                           Mode  Cnt    Score   Error Units
> StringOps.jmhTimeStringEqualsLarge  avgt   10  235.631 ? 0.054 ns/op
>
>
> After:
>
> Benchmark                           Mode  Cnt    Score   Error Units
> StringOps.jmhTimeStringEqualsLarge  avgt   10  216.660 ? 0.304 ns/op
>
>
> I still have to test this patch a bit more to be sure no other problems
> appears.
>
>
> Thanks,
>
> Dmitrij
>
>
>
> On 15.11.2017 15:16, Andrew Haley wrote:
>>
>> On 15/11/17 10:34, Andrew Haley wrote:
>>>
>>> Gosh.  I think that's a very bad patch.
>>
>> In fact it's even worse than I thought.  It causes crashes in testing.
>> Nobody seems to have noticed that the .ad file is wrong.
>>
>> Note this:
>>
>> instruct string_equalsL(iRegP_R1 str1, iRegP_R3 str2, iRegI_R4 cnt,
>>                          iRegI_R0 result, rFlagsReg cr)
>>
>> instruct array_equalsB(iRegP_R1 ary1, iRegP_R2 ary2, iRegI_R0 result,
>>                        iRegP_R4 tmp, rFlagsReg cr)
>>
>> Spot the difference?  str2 is in R3, ary2 is in R2.  That means you
>> can't call the same stub from both of these patterns.
>>
>> If you're adding a stub call you must assert that the passed registers
>> are correct.  Add the assertions, fix the register usage, and test it
>> properly, both with and without UseSIMDForArrayEquals.
>>
>