RFR: 8179444: AArch64: Put zero_words on a diet

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

RFR: 8179444: AArch64: Put zero_words on a diet

Andrew Haley
New version, corrected:


The code we generate for ClearArray in C2 is much too verbose.  It
looks like this:

  0x000003ffad2213a4: cbz x11, 0x000003ffad22140c
  0x000003ffad2213a8: tbz w10, #3, 0x000003ffad2213b4
  0x000003ffad2213ac: str xzr, [x10],#8
  0x000003ffad2213b0: sub x11, x11, #0x1
  0x000003ffad2213b4: subs xscratch1, x11, #0x20
  0x000003ffad2213b8: b.lt 0x000003ffad2213c0
  0x000003ffad2213bc: bl Stub::zero_longs  ;   {external_word}
  0x000003ffad2213c0: and xscratch1, x11, #0xe
  0x000003ffad2213c4: sub x11, x11, xscratch1
  0x000003ffad2213c8: add x10, x10, xscratch1, lsl #3
  0x000003ffad2213cc: adr xscratch2, 0x000003ffad2213fc
  0x000003ffad2213d0: sub xscratch2, xscratch2, xscratch1, lsl #1
  0x000003ffad2213d4: br xscratch2
  0x000003ffad2213d8: add x10, x10, #0x80
  0x000003ffad2213dc: stp xzr, xzr, [x10,#-128]
  0x000003ffad2213e0: stp xzr, xzr, [x10,#-112]
  0x000003ffad2213e4: stp xzr, xzr, [x10,#-96]
  0x000003ffad2213e8: stp xzr, xzr, [x10,#-80]
  0x000003ffad2213ec: stp xzr, xzr, [x10,#-64]
  0x000003ffad2213f0: stp xzr, xzr, [x10,#-48]
  0x000003ffad2213f4: stp xzr, xzr, [x10,#-32]
  0x000003ffad2213f8: stp xzr, xzr, [x10,#-16]
  0x000003ffad2213fc: subs x11, x11, #0x10
  0x000003ffad221400: b.ge 0x000003ffad2213d8
  0x000003ffad221404: tbz w11, #0, 0x000003ffad22140c
  0x000003ffad221408: str xzr, [x10],#8

This patch takes much of this code and puts it into a stub.  The new
version of ClearArray is:

  0x000003ff8022b7b0: cmp       x11, #0x8
  0x000003ff8022b7b4: b.cc      0x000003ff8022b7bc
  0x000003ff8022b7b8: bl        Stub::zero_blocks  ;   {runtime_call StubRoutines (2)}
  0x000003ff8022b7bc: tbz       w11, #2, 0x000003ff8022b7c8
  0x000003ff8022b7c0: stp       xzr, xzr, [x10],#16
  0x000003ff8022b7c4: stp       xzr, xzr, [x10],#16
  0x000003ff8022b7c8: tbz       w11, #1, 0x000003ff8022b7d0
  0x000003ff8022b7cc: stp       xzr, xzr, [x10],#16
  0x000003ff8022b7d0: tbz       w11, #0, 0x000003ff8022b7d8
  0x000003ff8022b7d4: str       xzr, [x10]

... which I hope you'll agree is much better.

The idea is to handle array sizes of 0-7 words inline, so small arrays
are got out of the way very quickly, and handle anything larger in
Stub::zero_blocks.  I wanted to make sure that there is no significant
loss of performance, and I have attached the results of the benchmark
I used, which does no more than create an array of ints of various
sizes.  There are winners and losers, but nothing is changed by very
much, and the code cache usage of each ClearArray goes down from 104
to 40 bytes.

http://cr.openjdk.java.net/~aph/8179444-2/

OK?

Andrew.


Before:

Benchmark             (size)  Mode  Cnt    Score   Error  Units

CreateArray.newArray       5  avgt   10   48.273 ?   1.679  ns/op
CreateArray.newArray       7  avgt   10   48.915 ?   0.793  ns/op
CreateArray.newArray      10  avgt   10   49.826 ?   0.868  ns/op
CreateArray.newArray      15  avgt   10   52.582 ?   0.521  ns/op
CreateArray.newArray      23  avgt   10   57.589 ?   0.670  ns/op
CreateArray.newArray      34  avgt   10   67.233 ?   0.984  ns/op
CreateArray.newArray      51  avgt   10  120.652 ?   2.018  ns/op
CreateArray.newArray      77  avgt   10  102.745 ?   1.034  ns/op
CreateArray.newArray     115  avgt   10  136.703 ?   1.067  ns/op
CreateArray.newArray     173  avgt   10  182.247 ?   1.093  ns/op
CreateArray.newArray     259  avgt   10  163.168 ?   5.967  ns/op
CreateArray.newArray     389  avgt   10  233.874 ?   3.400  ns/op
CreateArray.newArray     584  avgt   10  251.286 ?   4.892  ns/op
CreateArray.newArray     876  avgt   10  242.510 ?   0.520  ns/op
CreateArray.newArray    1314  avgt   10  382.846 ?   0.624  ns/op
CreateArray.newArray    1971  avgt   10  487.590 ?   1.409  ns/op



After:

Benchmark             (size)  Mode  Cnt    Score   Error  Units
CreateArray.newArray       5  avgt   10   47.208 ? 0.656  ns/op
CreateArray.newArray       7  avgt   10   47.838 ? 0.608  ns/op
CreateArray.newArray      10  avgt   10   48.798 ? 0.797  ns/op
CreateArray.newArray      15  avgt   10   51.981 ? 0.424  ns/op
CreateArray.newArray      23  avgt   10   56.614 ? 1.064  ns/op
CreateArray.newArray      34  avgt   10   65.986 ? 1.114  ns/op
CreateArray.newArray      51  avgt   10  119.811 ? 0.857  ns/op
CreateArray.newArray      77  avgt   10  101.694 ? 1.192  ns/op
CreateArray.newArray     115  avgt   10  137.169 ? 2.159  ns/op
CreateArray.newArray     173  avgt   10  185.815 ? 0.754  ns/op
CreateArray.newArray     259  avgt   10  163.305 ? 2.107  ns/op
CreateArray.newArray     389  avgt   10  234.049 ? 3.162  ns/op
CreateArray.newArray     584  avgt   10  250.729 ? 1.714  ns/op
CreateArray.newArray     876  avgt   10  242.921 ? 0.577  ns/op
CreateArray.newArray    1314  avgt   10  384.337 ? 1.465  ns/op
CreateArray.newArray    1971  avgt   10  486.948 ? 5.303  ns/op



Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8179444: AArch64: Put zero_words on a diet

Andrew Dinn
On 03/05/17 18:05, Andrew Haley wrote:
> New version, corrected:
> . . .
> http://cr.openjdk.java.net/~aph/8179444-2/
>
> OK?

The patch looks good (not an official review) except that I don't
understand one detail. Why does MacroAssembler::zero_words include this?


+    RuntimeAddress zero_blocks =
RuntimeAddress(StubRoutines::aarch64::zero_blocks());
+    assert(zero_blocks.target() != NULL, "zero_blocks stub has not been
generated");
+    if (StubRoutines::aarch64::complete()) {
+      trampoline_call(zero_blocks);
+    } else {
+      bl(zero_blocks);
+    }


regards,


Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8179444: AArch64: Put zero_words on a diet

Andrew Haley
On 05/05/17 10:47, Andrew Dinn wrote:

> On 03/05/17 18:05, Andrew Haley wrote:
>> New version, corrected:
>> . . .
>> http://cr.openjdk.java.net/~aph/8179444-2/
>>
>> OK?
>
> The patch looks good (not an official review) except that I don't
> understand one detail. Why does MacroAssembler::zero_words include this?
>
> +    RuntimeAddress zero_blocks =
> RuntimeAddress(StubRoutines::aarch64::zero_blocks());
> +    assert(zero_blocks.target() != NULL, "zero_blocks stub has not been
> generated");
> +    if (StubRoutines::aarch64::complete()) {
> +      trampoline_call(zero_blocks);
> +    } else {
> +      bl(zero_blocks);
> +    }

Trampoline calls only work from compiler-generated code, so we have to
do something different when we're generating the stubs.  I suppose I
could have had two versions of MacroAssembler::zero_words or added a
parameter to say we're generating stubs.  Would that be clearer?

Andrew.


Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8179444: AArch64: Put zero_words on a diet

Andrew Dinn
On 05/05/17 16:06, Andrew Haley wrote:

> On 05/05/17 10:47, Andrew Dinn wrote:
>> On 03/05/17 18:05, Andrew Haley wrote:
>>> New version, corrected:
>>> . . .
>>> http://cr.openjdk.java.net/~aph/8179444-2/
>>>
>>> OK?
>>
>> The patch looks good (not an official review) except that I don't
>> understand one detail. Why does MacroAssembler::zero_words include this?
>>
>> +    RuntimeAddress zero_blocks =
>> RuntimeAddress(StubRoutines::aarch64::zero_blocks());
>> +    assert(zero_blocks.target() != NULL, "zero_blocks stub has not been
>> generated");
>> +    if (StubRoutines::aarch64::complete()) {
>> +      trampoline_call(zero_blocks);
>> +    } else {
>> +      bl(zero_blocks);
>> +    }
>
> Trampoline calls only work from compiler-generated code, so we have to
> do something different when we're generating the stubs.  I suppose I
> could have had two versions of MacroAssembler::zero_words or added a
> parameter to say we're generating stubs.  Would that be clearer?

Ok, I see now. Two versions of the methods seems like overkill. A
comment before the if explaining what is going on is probably all that
is needed. For example:

+    // if stubs are complete then we are generating under
+    // the compiler so we need to use a trampoline_call
+    // otherwise we have to use a normal call
+    if (StubRoutines::aarch64::complete()) {
+      trampoline_call(zero_blocks);
+    } else {

I have not yet tested the patch. I will do so and report back.

regards,


Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8179444: AArch64: Put zero_words on a diet

Andrew Dinn
Ok, this ought not strictly to count as an official review (as I am not
a jdk9/10 reviewer) but I have i) eyeballed the code to my satisfaction
and ii)  successfully built and tested a patched jdk10 (by running java
Hello, javac Hello.java and netbeans).

Ship it!

regards,


Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8179444: AArch64: Put zero_words on a diet

Roland Westrelin-3
In reply to this post by Andrew Haley
Reply | Threaded
Open this post in threaded view
|

RE: RFR: 8179444: AArch64: Put zero_words on a diet

White, Derek
Hi Andrew,

Really nice seeing the smaller code! I had some questions, observations, and suggestions.

src/cpu/aarch64/vm/macroAssembler_aarch64.cpp:
 - zero_words(reg, reg):
   - Comment mentions that it's used by the ClearArray pattern, but it's also used by generate_fill().
   - The old zero_words(), via block_zero() and fill_words(), would align base to 16-bytes before doing stps, the new code doesn't. It may be worth conditionally pre-aligning if AvoidUnalignedAcesses is true. And could conditionally remove pre-aligning in zero_blocks.

Line  4999: This is pre-existing, but it's confusing - could you rename "ShortArraySize" to "SmallArraySize"?

- zero_words(reg, int):
 - I don't see a great way to handle unaligned accesses without blowing up the code size. So no real request here.
   - It looks like worst case is 15 unaligned stp instructions.
   - With some benchmarking someday I might argue for calling this constant count version for smaller copies.
   - Unless ClearArray only zero's from the first array element on - then we could guess if base is 16-byte unaligned by looking at the array header size.

-  zero_dcache_blocks():
  - Suggest a new comment that mentions that it's only called from zero_blocks()? Or if it's meant to be more general then add comments about the requirements for base alignment, and that cnt has to be >= 2*zva_length.
  - zero_blocks DOES align base to 16-bytes, so we don't need to check here? Or make it a runtime assert?

Thanks,

 - Derek


-----Original Message-----
From: hotspot-dev [mailto:[hidden email]] On Behalf Of Roland Westrelin
Sent: Friday, May 05, 2017 11:59 AM
To: Andrew Haley <[hidden email]>; hotspot-dev Source Developers <[hidden email]>
Subject: Re: RFR: 8179444: AArch64: Put zero_words on a diet


> http://cr.openjdk.java.net/~aph/8179444-2/

Looks ok to me.

Roland.
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8179444: AArch64: Put zero_words on a diet

Andrew Haley
Hi,

On 05/05/17 23:54, White, Derek wrote:

> src/cpu/aarch64/vm/macroAssembler_aarch64.cpp:
>  - zero_words(reg, reg):
>    - Comment mentions that it's used by the ClearArray pattern, but
> it's also used by generate_fill().

OK, but there's no contradiction there.  The comment just explains why
zero_words must be small.

>    - The old zero_words(), via block_zero() and fill_words(), would
> align base to 16-bytes before doing stps, the new code doesn't. It
> may be worth conditionally pre-aligning if AvoidUnalignedAcesses is
> true. And could conditionally remove pre-aligning in zero_blocks.

But that'd bloat zero_words, surely.  And I haven't see the new code
running any slower on any hardware,

> Line 4999: This is pre-existing, but it's confusing - could you
> rename "ShortArraySize" to "SmallArraySize"?

OK.

> - zero_words(reg, int):
>  - I don't see a great way to handle unaligned accesses without
> blowing up the code size. So no real request here.
>    - It looks like worst case is 15 unaligned stp instructions.
>    - With some benchmarking someday I might argue for calling this
> constant count version for smaller copies.

I don't understand this.

>    - Unless ClearArray only zero's from the first array element on -
> then we could guess if base is 16-byte unaligned by looking at the
> array header size.
>
> -  zero_dcache_blocks():

>   - Suggest a new comment that mentions that it's only called from
> zero_blocks()? Or if it's meant to be more general then add comments
> about the requirements for base alignment, and that cnt has to be >=
> 2*zva_length.

OK.

>   - zero_blocks DOES align base to 16-bytes, so we don't need to
> check here? Or make it a runtime assert?

I did wonder about that, but left it in as a safety net because its
cost is immeasurably small.

Thanks for such a detailed review.

Andrew.
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8179444: AArch64: Put zero_words on a diet

Andrew Haley
In reply to this post by White, Derek
On 05/05/17 23:54, White, Derek wrote:

>    - Unless ClearArray only zero's from the first array element on -
> then we could guess if base is 16-byte unaligned by looking at the
> array header size.

I think the heap is only HeapWord-aligned anyway.  I hope that in the
future there will be no hardware penalty for word-aligned STRD.  In
the meantime I'm not at all eager to put a load of special-purpose
tweaks into the AArch64 port.

Andrew.

Reply | Threaded
Open this post in threaded view
|

RE: RFR: 8179444: AArch64: Put zero_words on a diet

White, Derek
In reply to this post by Andrew Haley
Hi Andrew,

Looks OK then.

If I find a major performance issue on misaligned stp I'll propose a fix separately. In some HW I've seen a up to a 2x performance penalty on unaligned destinations in some memcpy-like code.

 - Derek

-----Original Message-----
From: Andrew Haley [mailto:[hidden email]]
Sent: Sunday, May 07, 2017 4:41 AM
To: White, Derek <[hidden email]>; Roland Westrelin <[hidden email]>; hotspot-dev Source Developers <[hidden email]>
Subject: Re: RFR: 8179444: AArch64: Put zero_words on a diet

Hi,

On 05/05/17 23:54, White, Derek wrote:

> src/cpu/aarch64/vm/macroAssembler_aarch64.cpp:
>  - zero_words(reg, reg):
>    - Comment mentions that it's used by the ClearArray pattern, but
> it's also used by generate_fill().

OK, but there's no contradiction there.  The comment just explains why zero_words must be small.

>    - The old zero_words(), via block_zero() and fill_words(), would
> align base to 16-bytes before doing stps, the new code doesn't. It may
> be worth conditionally pre-aligning if AvoidUnalignedAcesses is true.
> And could conditionally remove pre-aligning in zero_blocks.

But that'd bloat zero_words, surely.  And I haven't see the new code running any slower on any hardware,

> Line 4999: This is pre-existing, but it's confusing - could you rename
> "ShortArraySize" to "SmallArraySize"?

OK.

> - zero_words(reg, int):
>  - I don't see a great way to handle unaligned accesses without
> blowing up the code size. So no real request here.
>    - It looks like worst case is 15 unaligned stp instructions.
>    - With some benchmarking someday I might argue for calling this
> constant count version for smaller copies.

I don't understand this.

>    - Unless ClearArray only zero's from the first array element on -
> then we could guess if base is 16-byte unaligned by looking at the
> array header size.
>
> -  zero_dcache_blocks():

>   - Suggest a new comment that mentions that it's only called from
> zero_blocks()? Or if it's meant to be more general then add comments
> about the requirements for base alignment, and that cnt has to be >=
> 2*zva_length.

OK.

>   - zero_blocks DOES align base to 16-bytes, so we don't need to check
> here? Or make it a runtime assert?

I did wonder about that, but left it in as a safety net because its cost is immeasurably small.

Thanks for such a detailed review.

Andrew.