RFR: 8179444: AArch64: Put zero_words on a diet


Andrew Haley
The code we generate for ClearArray in C2 is much too verbose.  It
looks like this:

  0x000003ffad2213a4: cbz x11, 0x000003ffad22140c
  0x000003ffad2213a8: tbz w10, #3, 0x000003ffad2213b4
  0x000003ffad2213ac: str xzr, [x10],#8
  0x000003ffad2213b0: sub x11, x11, #0x1
  0x000003ffad2213b4: subs xscratch1, x11, #0x20
  0x000003ffad2213b8: b.lt 0x000003ffad2213c0
  0x000003ffad2213bc: bl Stub::zero_longs  ;   {external_word}
  0x000003ffad2213c0: and xscratch1, x11, #0xe
  0x000003ffad2213c4: sub x11, x11, xscratch1
  0x000003ffad2213c8: add x10, x10, xscratch1, lsl #3
  0x000003ffad2213cc: adr xscratch2, 0x000003ffad2213fc
  0x000003ffad2213d0: sub xscratch2, xscratch2, xscratch1, lsl #1
  0x000003ffad2213d4: br xscratch2
  0x000003ffad2213d8: add x10, x10, #0x80
  0x000003ffad2213dc: stp xzr, xzr, [x10,#-128]
  0x000003ffad2213e0: stp xzr, xzr, [x10,#-112]
  0x000003ffad2213e4: stp xzr, xzr, [x10,#-96]
  0x000003ffad2213e8: stp xzr, xzr, [x10,#-80]
  0x000003ffad2213ec: stp xzr, xzr, [x10,#-64]
  0x000003ffad2213f0: stp xzr, xzr, [x10,#-48]
  0x000003ffad2213f4: stp xzr, xzr, [x10,#-32]
  0x000003ffad2213f8: stp xzr, xzr, [x10,#-16]
  0x000003ffad2213fc: subs x11, x11, #0x10
  0x000003ffad221400: b.ge 0x000003ffad2213d8
  0x000003ffad221404: tbz w11, #0, 0x000003ffad22140c
  0x000003ffad221408: str xzr, [x10],#8

This patch takes much of this code and puts it into a stub.  The new
version of ClearArray is:

  0x000003ffad21088c: cmp x11, #0x8
  0x000003ffad210890: b.lt 0x000003ffad210898
  0x000003ffad210894: bl Stub::zero_blocks  ;   {runtime_call StubRoutines (2)}
  0x000003ffad210898: and xscratch1, x11, #0x6
  0x000003ffad21089c: adr xscratch2, 0x000003ffad2108b4
  0x000003ffad2108a0: sub xscratch2, xscratch2, xscratch1, lsl #1
  0x000003ffad2108a4: br xscratch2
  0x000003ffad2108a8: stp xzr, xzr, [x10],#16
  0x000003ffad2108ac: stp xzr, xzr, [x10],#16
  0x000003ffad2108b0: stp xzr, xzr, [x10],#16
  0x000003ffad2108b4: tbz w11, #0, 0x000003ffad2108bc
  0x000003ffad2108b8: str xzr, [x10]

The idea is to handle array sizes of 0-7 words inline, so that small
arrays are dealt with very quickly, and to handle anything larger in
Stub::zero_blocks.  I wanted to make sure that there is no significant
loss of performance, so I have attached the results of the benchmark
I used, which does no more than create arrays of ints of various
sizes.  There are winners and losers, but nothing changes by very
much, and the code cache usage of each ClearArray goes down from 104
to 48 bytes.
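
The control flow of the new sequence can be modelled roughly as follows.
This is only a sketch in Python, not HotSpot code: memory is a list of
words, base and cnt stand in for x10 and x11, and the behaviour of
zero_blocks here (clear whole 8-word blocks, hand back the advanced
pointer and a remainder below 8) is an assumption read off the listing,
not taken from the patch itself.

```python
# Toy model of the new inline ClearArray sequence (not HotSpot code).
# Memory is a list of words; base and cnt play the roles of x10 and x11.

BLOCK_WORDS = 8  # threshold taken from "cmp x11, #0x8" in the listing


def zero_blocks(mem, base, cnt):
    """Hypothetical stand-in for Stub::zero_blocks.

    Assumption: it clears whole 8-word blocks and returns the advanced
    pointer plus the leftover count (< 8) for the inline tail to finish.
    """
    while cnt >= BLOCK_WORDS:
        for i in range(BLOCK_WORDS):
            mem[base + i] = 0
        base += BLOCK_WORDS
        cnt -= BLOCK_WORDS
    return base, cnt


def clear_array(mem, base, cnt):
    """Zero cnt words of mem starting at index base."""
    if cnt >= BLOCK_WORDS:               # cmp x11, #8 ; b.lt tail ; bl stub
        base, cnt = zero_blocks(mem, base, cnt)

    # and xscratch1, x11, #6: the number of words handled by stp pairs.
    paired_words = cnt & 6
    # The computed branch (adr/sub/br) enters the run of three stp
    # instructions so that exactly paired_words // 2 of them execute.
    for _ in range(paired_words // 2):   # stp xzr, xzr, [x10], #16
        mem[base] = 0
        mem[base + 1] = 0
        base += 2

    if cnt & 1:                          # tbz w11, #0, done
        mem[base] = 0                    # str xzr, [x10]
```

With cnt below 8 the stub is never called, so a small array costs one
compare, one computed branch, at most three stp and at most one str.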

http://cr.openjdk.java.net/~aph/8179444/

OK?

Andrew.


Machine A:

Before:

Benchmark             (size)  Mode  Cnt    Score   Error  Units
CreateArray.newArray       5  avgt    5   48.221 ± 3.185  ns/op
CreateArray.newArray       7  avgt    5   48.853 ± 1.921  ns/op
CreateArray.newArray      10  avgt    5   49.963 ± 2.240  ns/op
CreateArray.newArray      15  avgt    5   52.538 ± 1.332  ns/op
CreateArray.newArray      23  avgt    5   57.289 ± 1.120  ns/op
CreateArray.newArray      34  avgt    5   67.091 ± 2.207  ns/op
CreateArray.newArray      51  avgt    5  119.948 ± 1.839  ns/op
CreateArray.newArray      77  avgt    5  101.851 ± 1.968  ns/op
CreateArray.newArray     115  avgt    5  142.568 ± 3.621  ns/op
CreateArray.newArray     173  avgt    5  180.204 ± 2.908  ns/op
CreateArray.newArray     259  avgt    5  170.446 ± 6.083  ns/op
CreateArray.newArray     389  avgt    5  231.124 ± 1.804  ns/op
CreateArray.newArray     584  avgt    5  248.411 ± 0.438  ns/op
CreateArray.newArray     876  avgt    5  241.776 ± 1.261  ns/op
CreateArray.newArray    1314  avgt    5  383.609 ± 1.363  ns/op
CreateArray.newArray    1971  avgt    5  483.217 ± 8.044  ns/op


After:

Benchmark             (size)  Mode  Cnt    Score   Error  Units
CreateArray.newArray       5  avgt    5   47.256 ± 1.511  ns/op
CreateArray.newArray       7  avgt    5   48.674 ± 1.046  ns/op
CreateArray.newArray      10  avgt    5   50.915 ± 2.581  ns/op
CreateArray.newArray      15  avgt    5   53.351 ± 6.562  ns/op
CreateArray.newArray      23  avgt    5   56.746 ± 3.820  ns/op
CreateArray.newArray      34  avgt    5   65.796 ± 3.357  ns/op
CreateArray.newArray      51  avgt    5  119.825 ± 2.268  ns/op
CreateArray.newArray      77  avgt    5  100.708 ± 1.647  ns/op
CreateArray.newArray     115  avgt    5  135.210 ± 2.844  ns/op
CreateArray.newArray     173  avgt    5  180.521 ± 1.373  ns/op
CreateArray.newArray     259  avgt    5  160.899 ± 2.677  ns/op
CreateArray.newArray     389  avgt    5  230.253 ± 1.412  ns/op
CreateArray.newArray     584  avgt    5  249.173 ± 2.827  ns/op
CreateArray.newArray     876  avgt    5  242.180 ± 0.991  ns/op
CreateArray.newArray    1314  avgt    5  385.272 ± 1.872  ns/op
CreateArray.newArray    1971  avgt    5  485.198 ± 3.196  ns/op


Machine B:

The timings for Machine B are very noisy with small array sizes, so
it's hard to conclude very much, but I don't think there is any
regression.

Before:

Benchmark             (size)  Mode  Cnt      Score    Error  Units
CreateArray.newArray       5  avgt    5     89.209 ± 11.640  ns/op
CreateArray.newArray       7  avgt    5     93.453 ±  2.113  ns/op
CreateArray.newArray      10  avgt    5     93.388 ± 21.406  ns/op
CreateArray.newArray      15  avgt    5    102.904 ± 23.075  ns/op
CreateArray.newArray      23  avgt    5    117.167 ± 19.673  ns/op
CreateArray.newArray      34  avgt    5    130.184 ±  1.042  ns/op
CreateArray.newArray      51  avgt    5    132.981 ±  8.446  ns/op
CreateArray.newArray      77  avgt    5    137.438 ±  5.723  ns/op
CreateArray.newArray     115  avgt    5    135.289 ±  3.393  ns/op
CreateArray.newArray     173  avgt    5    151.245 ±  8.469  ns/op
CreateArray.newArray     259  avgt    5    157.292 ±  2.087  ns/op
CreateArray.newArray     389  avgt    5    176.621 ±  3.741  ns/op
CreateArray.newArray     584  avgt    5    200.957 ±  6.825  ns/op
CreateArray.newArray     876  avgt    5    233.122 ±  3.508  ns/op
CreateArray.newArray    1314  avgt    5    280.525 ±  5.696  ns/op
CreateArray.newArray    1971  avgt    5    360.799 ±  8.859  ns/op


After:

Benchmark             (size)  Mode  Cnt    Score    Error  Units
CreateArray.newArray       5  avgt    5   90.168 ±  4.363  ns/op
CreateArray.newArray       7  avgt    5   88.221 ± 32.537  ns/op
CreateArray.newArray      10  avgt    5   97.991 ±  1.778  ns/op
CreateArray.newArray      15  avgt    5  102.441 ± 30.219  ns/op
CreateArray.newArray      23  avgt    5  120.875 ± 11.074  ns/op
CreateArray.newArray      34  avgt    5  130.916 ±  2.476  ns/op
CreateArray.newArray      51  avgt    5  134.765 ± 10.002  ns/op
CreateArray.newArray      77  avgt    5  138.228 ±  2.479  ns/op
CreateArray.newArray     115  avgt    5  135.907 ±  1.025  ns/op
CreateArray.newArray     173  avgt    5  150.318 ±  9.291  ns/op
CreateArray.newArray     259  avgt    5  156.671 ±  2.023  ns/op
CreateArray.newArray     389  avgt    5  175.735 ±  3.861  ns/op
CreateArray.newArray     584  avgt    5  206.501 ±  9.117  ns/op
CreateArray.newArray     876  avgt    5  233.676 ±  3.463  ns/op
CreateArray.newArray    1314  avgt    5  280.259 ±  4.131  ns/op
CreateArray.newArray    1971  avgt    5  360.037 ±  9.968  ns/op
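
One rough way to read the noisy Machine B numbers for small sizes is to
check whether the before and after error intervals overlap.  The sketch
below copies the five smallest sizes from the tables above.  Interval
overlap is only a crude heuristic chosen for illustration (it is not a
significance test), and the helper names are hypothetical.

```python
# (size, score, error) in ns/op, copied from the Machine B tables.
BEFORE = [
    (5, 89.209, 11.640),
    (7, 93.453, 2.113),
    (10, 93.388, 21.406),
    (15, 102.904, 23.075),
    (23, 117.167, 19.673),
]
AFTER = [
    (5, 90.168, 4.363),
    (7, 88.221, 32.537),
    (10, 97.991, 1.778),
    (15, 102.441, 30.219),
    (23, 120.875, 11.074),
]


def interval(score, error):
    """Return the (low, high) bounds of score +/- error."""
    return (score - error, score + error)


def overlapping_sizes(before, after):
    """Return the sizes whose before and after intervals overlap."""
    sizes = []
    for (size_b, score_b, err_b), (size_a, score_a, err_a) in zip(before, after):
        assert size_b == size_a, "rows must be paired by array size"
        lo_b, hi_b = interval(score_b, err_b)
        lo_a, hi_a = interval(score_a, err_a)
        if lo_b <= hi_a and lo_a <= hi_b:
            sizes.append(size_b)
    return sizes


print(overlapping_sizes(BEFORE, AFTER))  # prints [5, 7, 10, 15, 23]
```

Every one of these small sizes overlaps, which fits the reading that
the noise hides any difference but no regression is visible.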


Re: RFR: 8179444: AArch64: Put zero_words on a diet

Andrew Haley
On 28/04/17 20:28, Andrew Haley wrote:

> http://cr.openjdk.java.net/~aph/8179444/

I withdraw this patch.  I've found some more dead wood to cut out.

Andrew.