[10] RFR(S): JDK-8184943: AARCH64: Intrinsify hasNegatives


dmitrij.pochepko

Hi everyone,


Please review this small webrev [1] that implements an enhancement [2] which adds a has_negatives intrinsic to the AARCH64 OpenJDK port. This intrinsic performs better than C2-compiled code for every array size tried:

ThunderX T88: about 2% faster for array size = 1 and up to 8.5x faster for large arrays.

Cortex A53 (R-Pi): shows about the same numbers (really large sizes can't be tested properly there due to the small amount of available memory).


The intrinsified hasNegatives method checks whether the provided byte array contains any byte with a negative value (highest bit set). In general, the intrinsic does the following (with various minor optimizations):


1) Check whether the low bits of the array length are set (0x1, 0x2, 0x4, 0x8) and, for each bit that is set, issue the respective load instruction (ldrb, ldrh, ldrw, ldr) and reduce the remaining length accordingly. The remaining length is then 16*N. Proceed to 2).

2) If the remaining length >= 64, load data in a loop with 4 ldp instructions (16 bytes each), issuing prfm (prefetch hint) once per iteration if SoftwarePrefetchHintDistance >= 0. This new flag (SoftwarePrefetchHintDistance) is introduced to provide configurable software prefetching in dynamically compiled code. The flag can disable the software prefetch hint or set the prefetch distance. The default distance is 3 * dcache_line, which shows the best performance on the ARMv8 CPUs we have. The 64-byte loop runs until length < 64, then proceed to 3).

3) Run a simple 16-byte loading loop until the remaining length is 0.
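
For readers who prefer Java to AArch64 assembly, steps 1) to 3) can be sketched roughly as below. This is only an illustration, not the code in the webrev: the class and method names are hypothetical, the scalar loop is assumed to match the Java method being intrinsified, ByteBuffer reads stand in for ldrb/ldrh/ldrw/ldr/ldp, and the prfm hint is omitted because plain Java has no equivalent.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

class HasNegativesSketch {
    // A byte is negative exactly when its top bit is set.
    private static final long SIGN_MASK_64 = 0x8080808080808080L;

    // Plain scalar loop: the behaviour any faster version has to reproduce.
    static boolean hasNegativesScalar(byte[] ba, int off, int len) {
        for (int i = off; i < off + len; i++) {
            if (ba[i] < 0) return true;
        }
        return false;
    }

    // Blocked version following steps 1) to 3) of the description.
    static boolean hasNegativesBlocked(byte[] ba, int off, int len) {
        ByteBuffer buf = ByteBuffer.wrap(ba).order(ByteOrder.LITTLE_ENDIAN);
        int pos = off;
        int remaining = len;

        // Step 1: peel 1, 2, 4 and 8 bytes according to the low length bits.
        if ((remaining & 1) != 0) {                       // stands in for ldrb
            if ((buf.get(pos) & 0x80) != 0) return true;
            pos += 1; remaining -= 1;
        }
        if ((remaining & 2) != 0) {                       // stands in for ldrh
            if ((buf.getShort(pos) & 0x8080) != 0) return true;
            pos += 2; remaining -= 2;
        }
        if ((remaining & 4) != 0) {                       // stands in for ldrw
            if ((buf.getInt(pos) & 0x80808080) != 0) return true;
            pos += 4; remaining -= 4;
        }
        if ((remaining & 8) != 0) {                       // stands in for ldr
            if ((buf.getLong(pos) & SIGN_MASK_64) != 0) return true;
            pos += 8; remaining -= 8;
        }

        // Step 2: remaining is now a multiple of 16; consume 64 bytes per
        // iteration (four 16-byte pairs), OR-ing the words before one test.
        while (remaining >= 64) {
            long acc = 0;
            for (int i = 0; i < 64; i += 8) {
                acc |= buf.getLong(pos + i);
            }
            if ((acc & SIGN_MASK_64) != 0) return true;
            pos += 64; remaining -= 64;
        }

        // Step 3: simple 16-byte loop until nothing is left.
        while (remaining > 0) {
            long acc = buf.getLong(pos) | buf.getLong(pos + 8);
            if ((acc & SIGN_MASK_64) != 0) return true;
            pos += 16; remaining -= 16;
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] ascii = "plain ASCII text, longer than sixteen bytes"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] latin = ascii.clone();
        latin[20] = (byte) 0xE9;                          // one byte with the top bit set
        System.out.println(hasNegativesBlocked(ascii, 0, ascii.length)); // false
        System.out.println(hasNegativesBlocked(latin, 0, latin.length)); // true
    }
}
```

OR-ing several words together before a single mask test is one way to keep the number of branches per block low; whether the real intrinsic tests per load or per block is not stated above.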


Note: It was observed that the software prefetch hint improves performance on platforms that do not have hardware prefetching (ThunderX T88), but also on the platforms we have at hand which do have hardware prefetching (Cortex A53).


Performance testing:

A JMH-based microbenchmark was developed [3] to test the performance of this enhancement. The performance results on Cortex A53 [4] and ThunderX T88 [5] show that this intrinsic is on par with C2-compiled Java code for very small strings, and that the improvement grows with string length, starting from a string length of 3 and reaching up to 8x for long strings.

Functional testing:

Tested by running the hotspot jtreg tests on Cortex A53 and ThunderX T88 and comparing the results with those of a vanilla build. No regressions were observed. Specifically, the test hotspot/test/compiler/intrinsics/string/TestHasNegatives.java passed on both Cortex A53 and ThunderX T88.


[1] webrev: http://cr.openjdk.java.net/~dpochepk/8184943/webrev.01/
[2] CR: https://bugs.openjdk.java.net/browse/JDK-8184943
[3] JMH micro benchmark: http://cr.openjdk.java.net/~dpochepk/8184943/HasNegativesBenchmark/
[4] A53 graph: http://cr.openjdk.java.net/~dpochepk/8184943/Cortex_A53_comparison.png
[5] T88 graph: http://cr.openjdk.java.net/~dpochepk/8184943/ThunderX_comparison.png


I'll be happy to merge any suggestions for improving this intrinsic that come up during this review.


Thanks,
Dmitrij


Re: [10] RFR(S): JDK-8184943: AARCH64: Intrinsify hasNegatives

Andrew Haley
Hi,

On 20/07/17 11:03, Dmitrij Pochepko wrote:

> Please review this small webrev [1] that implements an enhancement [2] which adds has_negatives intrinsic to AARCH64 OpenJDK port. This intrinsic performs better than c2-compiled code for every array size tried:

Yay!  We're off to the races!

Yours:

Benchmark                       (length)  Mode  Cnt      Score   Error  Units
HasNegatives.loopingFastMethod         4  avgt    5   6680.619 ± 0.953  ns/op
HasNegatives.loopingFastMethod        31  avgt    5  12936.791 ± 1.599  ns/op
HasNegatives.loopingFastMethod        65  avgt    5  14604.253 ± 2.088  ns/op
HasNegatives.loopingFastMethod       101  avgt    5  19606.385 ± 7.751  ns/op
HasNegatives.loopingFastMethod       256  avgt    5  30858.498 ± 1.225  ns/op


Stuart's:

Benchmark                       (length)  Mode  Cnt      Score   Error  Units
HasNegatives.loopingFastMethod         4  avgt    5   5013.024 ± 0.572  ns/op
HasNegatives.loopingFastMethod        31  avgt    5   9186.044 ± 2.439  ns/op
HasNegatives.loopingFastMethod        65  avgt    5  13769.220 ± 1.879  ns/op
HasNegatives.loopingFastMethod       101  avgt    5  15854.385 ± 2.482  ns/op
HasNegatives.loopingFastMethod       256  avgt    5  26691.626 ± 3.523  ns/op

I didn't expect a big difference.  Note that the really important measurement
is on length ~31, which is very common.

Benchmark at http://cr.openjdk.java.net/~aph/HasNegativesBench/.  Test was on
APM.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: [aarch64-port-dev ] [10] RFR(S): JDK-8184943: AARCH64: Intrinsify hasNegatives

Stuart Monteith
Hi,
   I'm going to try the patches on all of my machines and compare them,
as well as do a proper visual review. The APM seems generally free of
vices, not needing a lot of fussy code.

BR,
   Stuart


Re: [10] RFR(S): JDK-8184943: AARCH64: Intrinsify hasNegatives

dmitrij.pochepko
In reply to this post by Andrew Haley
Hi,

Can you check large lengths, like 10000 and 100000? (I suppose this JMH
option will do it: -p length=10000,100000)

Thanks,
Dmitrij


Re: [10] RFR(S): JDK-8184943: AARCH64: Intrinsify hasNegatives

Andrew Haley
On 20/07/17 16:17, Dmitrij Pochepko wrote:
> can you check large length, like 10000, 100000   (I suppose this jmh
> options will do it: -p length=10000,100000)

stuart:

Benchmark                       (length)  Mode  Cnt         Score       Error  Units
HasNegatives.loopingFastMethod     10000  avgt    5    788432.952 ±   362.183  ns/op
HasNegatives.loopingFastMethod    100000  avgt    5  12401737.536 ± 17752.545  ns/op

dmitrij:

Benchmark                       (length)  Mode  Cnt         Score      Error  Units
HasNegatives.loopingFastMethod     10000  avgt    5    918447.832 ±  223.858  ns/op
HasNegatives.loopingFastMethod    100000  avgt    5  11745723.456 ± 7526.962  ns/op

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: [10] RFR(S): JDK-8184943: AARCH64: Intrinsify hasNegatives

dmitrij.pochepko
Thank you.

Interesting results.

In general, I see that Stuart's version is faster on smaller sizes. I
suppose that's due to the single 8-byte load at the start and the same
load at the end, which saves up to 3 loads plus a few CPU cycles. Also,
aligned access might help on some platforms. I also have a version of my
code with aligned access (attached to the CR as an alternative
implementation quite some time ago), but I don't seem to have a platform
that shows a large difference in this case, so I put that patch aside.

Btw: I also considered such an unconditional 8-byte load at the start,
but abandoned the idea since I wasn't sure it's safe. Say the array is
allocated at the border of an allocated region (so last array byte ==
last allocated region byte). Then hasNegatives is called with offset ==
array_length - 1 and len = 1 just to check the last byte. Is an 8-byte
load then issued at this address?


I also have the following results on ThunderX T88. They show a
significant improvement at lengths 10000 and 100000 (about 1.5x and
2.5x) compared to Stuart's implementation:

Mine:

Benchmark                            (length)  Mode  Cnt         Score          Error  Units
HasNegativesBench.loopingFastMethod         1  avgt    5      7555.169 ±       35.714  ns/op
HasNegativesBench.loopingFastMethod         4  avgt    5      9030.759 ±        7.614  ns/op
HasNegativesBench.loopingFastMethod        31  avgt    5     27586.010 ±       16.815  ns/op
HasNegativesBench.loopingFastMethod        65  avgt    5     40239.515 ±      564.833  ns/op
HasNegativesBench.loopingFastMethod       101  avgt    5     52673.495 ±      176.033  ns/op
HasNegativesBench.loopingFastMethod       256  avgt    5    111487.193 ±      551.301  ns/op
HasNegativesBench.loopingFastMethod      1000  avgt    5    392706.118 ±     1749.139  ns/op
HasNegativesBench.loopingFastMethod     10000  avgt    5   1274876.279 ±    11404.115  ns/op
HasNegativesBench.loopingFastMethod    100000  avgt    5  13627036.757 ±   129977.081  ns/op

Stuart's:
Benchmark                            (length)  Mode  Cnt         Score          Error  Units
HasNegativesBench.loopingFastMethod         1  avgt    5      7535.175 ±       50.769  ns/op
HasNegativesBench.loopingFastMethod         4  avgt    5      7526.599 ±        8.993  ns/op
HasNegativesBench.loopingFastMethod        31  avgt    5     18554.420 ±        1.448  ns/op
HasNegativesBench.loopingFastMethod        65  avgt    5     26607.388 ±       89.429  ns/op
HasNegativesBench.loopingFastMethod       101  avgt    5     32641.349 ±      168.976  ns/op
HasNegativesBench.loopingFastMethod       256  avgt    5     60745.493 ±      362.656  ns/op
HasNegativesBench.loopingFastMethod      1000  avgt    5    202915.691 ±     1103.984  ns/op
HasNegativesBench.loopingFastMethod     10000  avgt    5   1898428.471 ±    10022.381  ns/op
HasNegativesBench.loopingFastMethod    100000  avgt    5  33463429.058 ±   548791.811  ns/op
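
The "about 1.5x and 2.5x" figures can be cross-checked from the 10000 and 100000 rows of the two ThunderX T88 tables above. A small sketch follows; the class and method names are made up for illustration, and the scores are copied from those tables. The ratios come out at roughly 1.49 and 2.46, while the length 31 row goes the other way (roughly 0.67, i.e. Stuart's version is faster there).

```java
class SpeedupCheck {
    // Ratio of two average times in ns/op. A value above 1.0 means the
    // candidate needs less time than the baseline.
    static double speedup(double baselineNsPerOp, double candidateNsPerOp) {
        return baselineNsPerOp / candidateNsPerOp;
    }

    public static void main(String[] args) {
        // Baseline: Stuart's version. Candidate: Dmitrij's version.
        System.out.printf("length=10000:  %.2fx%n", speedup(1898428.471, 1274876.279));
        System.out.printf("length=100000: %.2fx%n", speedup(33463429.058, 13627036.757));
        System.out.printf("length=31:     %.2fx%n", speedup(18554.420, 27586.010));
    }
}
```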


And on R-Pi 3 (Cortex A53), with about the same improvement at large sizes:

Mine:

Benchmark                            (length)  Mode  Cnt         Score          Error  Units
HasNegativesBench.loopingFastMethod         1  avgt    5     15233.213 ±    10299.068  ns/op
HasNegativesBench.loopingFastMethod         4  avgt    5     28372.544 ±    22395.968  ns/op
HasNegativesBench.loopingFastMethod        31  avgt    5     54031.864 ±    41530.777  ns/op
HasNegativesBench.loopingFastMethod        65  avgt    5     60528.950 ±    23216.620  ns/op
HasNegativesBench.loopingFastMethod       101  avgt    5     68123.059 ±    31609.714  ns/op
HasNegativesBench.loopingFastMethod       256  avgt    5    130330.740 ±   109803.722  ns/op
HasNegativesBench.loopingFastMethod      1000  avgt    5    289047.106 ±   197153.259  ns/op
HasNegativesBench.loopingFastMethod     10000  avgt    5   3175862.063 ±  3126363.838  ns/op
HasNegativesBench.loopingFastMethod    100000  avgt    5  28595658.058 ± 15509202.529  ns/op

Stuart's:

Benchmark                            (length)  Mode  Cnt         Score          Error  Units
HasNegativesBench.loopingFastMethod         1  avgt    5     16068.939 ±    13611.338  ns/op
HasNegativesBench.loopingFastMethod         4  avgt    5     22888.871 ±    21902.553  ns/op
HasNegativesBench.loopingFastMethod        31  avgt    5     40784.842 ±    44233.928  ns/op
HasNegativesBench.loopingFastMethod        65  avgt    5     66288.469 ±    65255.857  ns/op
HasNegativesBench.loopingFastMethod       101  avgt    5     89416.174 ±    93875.338  ns/op
HasNegativesBench.loopingFastMethod       256  avgt    5    170013.296 ±    86799.999  ns/op
HasNegativesBench.loopingFastMethod      1000  avgt    5    635557.297 ±   141291.822  ns/op
HasNegativesBench.loopingFastMethod     10000  avgt    5   5368914.966 ±  7607076.827  ns/op
HasNegativesBench.loopingFastMethod    100000  avgt    5  47019213.416 ± 40360305.523  ns/op


Probably the best way would be to merge the large data loads from my
patch with Stuart's lightning-fast handling of small arrays.

I'll be happy to merge these ideas into one intrinsic that works fastest
on both small and large arrays, if Stuart does not mind. I could use
some help testing the final solution on some of the HW we don't have. I
also don't mind if Stuart wants to do the merge; then we'll help him
with testing on the h/w he doesn't have.


Thanks,

Dmitrij




Re: [10] RFR(S): JDK-8184943: AARCH64: Intrinsify hasNegatives

Andrew Haley
On 20/07/17 19:27, Dmitrij Pochepko wrote:
> Btw: I've also considered such unconditional 8-bytes load at start, but
> abandoned this idea since I wasn't sure if it's safe. Say, array is
> allocated at the border of allocated region(so, last array byte == last
> allocated region byte). Then hasNegatives is called with offset ==
> array_length - 1 and len = 1 just to check last byte, so, then 8-byte
> load is issued at this address?

It's certainly possible.  We can't read an address beyond our byte
array if there is any possibility that we're at the end of a page,
because we might hit a segfault.  We could test for that.
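
A test of that kind could look roughly like the sketch below. It is only an illustration of the idea, not necessarily what any patch does: the names are hypothetical, and the page size is passed in because it is platform-dependent (it must be a power of two for the mask to work). The reasoning is that the page holding the first valid byte is known to be mapped, so a wide load that stays inside that page cannot fault even if it reads a few bytes past the array.

```java
class PageBoundaryCheck {
    // Returns true if a load of loadBytes starting at address touches only
    // the page that contains address. pageSize must be a power of two.
    static boolean staysWithinPage(long address, int loadBytes, int pageSize) {
        long offsetInPage = address & (pageSize - 1);
        return offsetInPage + loadBytes <= pageSize;
    }

    public static void main(String[] args) {
        int pageSize = 4096; // assumed here purely for the example
        // 8-byte load ending exactly at the page end: safe.
        System.out.println(staysWithinPage(0x1FF8L, 8, pageSize)); // true
        // 8-byte load starting at the last byte of a page: crosses over.
        System.out.println(staysWithinPage(0x1FFFL, 8, pageSize)); // false
    }
}
```

When the check fails, the code would have to fall back to narrower loads that stay within the requested range.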

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: [10] RFR(S): JDK-8184943: AARCH64: Intrinsify hasNegatives

Andrew Haley
In reply to this post by dmitrij.pochepko
On 20/07/17 19:27, Dmitrij Pochepko wrote:
> Probably best way would be to merge large data loads from my patch and
> Stuart's lightning-fast small arrays handling.

Yes.

> I'll be happy to merge these ideas in one intrinsic that works fastest
> on small and large arrays if Stuart does not mind. I could use some help
> testing the final solution on some of the HW we don't have. I don't mind
> if Stuart want to merge it, then we'll help him with testing on h/w he
> doesn't have.

Have fun!  The performance to care about is small strings (< 31 bytes) and,
less commonly, very long ones.  Super-fast handling of small strings is
very important.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

RE: [10] RFR(S): JDK-8184943: AARCH64: Intrinsify hasNegatives

White, Derek
In reply to this post by Andrew Haley
Hi Andrew,

If this is a problem, there might be a problem in copy_memory (stubGenerator_aarch64.cpp) as well, where we may read past an array.

I went looking to see if there's some padding at the end of a heap region that we were counting on, but haven't found any yet.

 - Derek


Re: [10] RFR(S): JDK-8184943: AARCH64: Intrinsify hasNegatives

Andrew Haley
On 21/07/17 15:05, White, Derek wrote:

> If this is a problem, there might be a problem in copy_memory
> (stubGenerator_aarch64.cpp) as well, where we may read past an
> array.

I think that's the bit I wrote, and I'm fairly sure that we don't.

> I went looking for see if there's some padding at the end of a heap
> region that we were counting on, but didn't find any yet.

There isn't.  I know that because I once hit such a bug.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: [10] RFR(S): JDK-8184943: AARCH64: Intrinsify hasNegatives

dmitrij.pochepko
In reply to this post by Andrew Haley
Hi, 


Please review a new version of this RFR [1], which has been
significantly reworked.


Changes compared to the original posting:


- The 2 versions of the hasNegatives intrinsic were merged, which
results in good performance for both small and large arrays.


- The large-array case and the "at-the-end-of-mem-page" case were moved
to a stub to save code cache and help the register allocator.


Raw performance numbers for the original
hasNegativesBench.loopingFastMethod [2] are here [3], accompanied by
updated comparison charts for Raspberry Pi 3 [4] and ThunderX T88 [5].
In short, the intrinsified hasNegatives is 4x faster on T88 and 2.5x
faster on R-Pi for a 31-byte array, and up to 8 times faster on large
arrays.

I've also created a small and simple benchmark [6] which demonstrates
the performance difference for the String constructor on strings without
negative byte values. The raw results [7] show significantly increased
performance on ThunderX T88. The results can also be seen in the
comparison charts [8]. Due to the large number of allocations and GC
activity, this benchmark is not applicable to the R-Pi, which has 1GB of
system memory and an SD card as its main drive.
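
For context, the kind of String construction meant here can be illustrated as below. This is not the linked benchmark [6]; the class and method names are hypothetical. The sketch assumes, based on the JDK 9+ compact-strings implementation as far as I know it, that decoding a byte array into a String consults hasNegatives to decide whether the bytes can be kept as-is, so an all-ASCII input exercises the full-array scan.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiStringConstruction {
    // Builds a String from a byte array that contains no negative bytes,
    // so a hasNegatives-style scan has to look at every byte.
    static String decodeAscii(int length) {
        byte[] bytes = new byte[length];
        Arrays.fill(bytes, (byte) 'a');
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = decodeAscii(31);
        System.out.println(s.length()); // 31
    }
}
```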



This patch should be considered a patch with 2 contributors
([hidden email] and [hidden email] (openjdk
login dpochepk)).

I'd also like to thank Andrew Haley for early reviews and consulting.


No regressions were found via jtreg tests.

Thanks, 


Dmitrij


[1] Webrev: http://cr.openjdk.java.net/~dpochepk/8184943/webrev.02/
[2] http://cr.openjdk.java.net/~aph/HasNegativesBench/
[3] http://cr.openjdk.java.net/~dpochepk/8184943/perf_numbers.txt
[4] http://cr.openjdk.java.net/~dpochepk/8184943/Cortex_A53_comparison.png
[5] http://cr.openjdk.java.net/~dpochepk/8184943/ThunderX_comparison.png
[6] http://cr.openjdk.java.net/~dpochepk/8184943/StringConstructorBench.java
[7] http://cr.openjdk.java.net/~dpochepk/8184943/StringConstructorBench.txt
[8] http://cr.openjdk.java.net/~dpochepk/8184943/ThunderX-StringConstructor.png



Re: [10] RFR(S): JDK-8184943: AARCH64: Intrinsify hasNegatives

Andrew Haley
On 11/08/17 18:30, Dmitrij Pochepko wrote:
> This patch should be considered as patch with 2 contributors
> ([hidden email] and [hidden email] (openjdk
> login dpochepk)).
>
> Also I'd like to thank Andrew Haley for early
> reviews and consulting.
>
> No regressions were found via jtreg tests.

Good work.  I think we're done.  Thanks.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: [10] RFR(S): JDK-8184943: AARCH64: Intrinsify hasNegatives

Stuart Monteith
In reply to this post by dmitrij.pochepko
Thanks Dmitrij,
  I'll look at what you've done and try your patch on my machines.

BR,
   Stuart


Re: [10] RFR(S): JDK-8184943: AARCH64: Intrinsify hasNegatives

Stuart Monteith
Hello,
 Please find below hyperlinks to the jmh results - the graphs show the
performance relative to the "steam" method - compiled by C2. There is
an improvement for all platforms. With 100,000 bytes there is no
improvement, but that is an unlikely circumstance.

http://people.linaro.org/~stuart.monteith/hasneg-last/hasnegA-last.svg
http://people.linaro.org/~stuart.monteith/hasneg-last/hasnegB-last.svg
http://people.linaro.org/~stuart.monteith/hasneg-last/hasnegC-last.svg


BR,
   Stuart
