RFR(L): 8189793: [s390]: Improve String compress/inflate by exploiting vector instructions


RFR(L): 8189793: [s390]: Improve String compress/inflate by exploiting vector instructions

Schmidt, Lutz

Dear all,

 

I would like to request reviews for this s390-only enhancement:

 

Bug:    https://bugs.openjdk.java.net/browse/JDK-8189793

Webrev: http://cr.openjdk.java.net/~lucy/webrevs/8189793.00/index.html  

 

Vector instructions, which have been available on System z for a while (since z13), promise noticeable performance improvements. This enhancement improves the String Compress and String Inflate intrinsics by exploiting vector instructions, when available. For long strings, up to 2x performance improvement has been observed in micro-benchmarks.

 

Special care was taken to preserve good performance for short strings. All examined workloads showed a high ratio of short and very short strings.
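For readers unfamiliar with the two intrinsics: string_compress narrows UTF-16 chars to Latin-1 bytes, and string_inflate widens them back. The following scalar Java sketch shows the semantics the intrinsics accelerate (method names are illustrative, not the exact JDK-internal ones):

```java
public class StringCodingSketch {
    // Narrow UTF-16 chars to Latin-1 bytes; stop at the first char that
    // does not fit and report how many chars were successfully copied.
    static int compress(char[] src, byte[] dst, int len) {
        for (int i = 0; i < len; i++) {
            char c = src[i];
            if (c > 0xFF) return i; // non-Latin-1 char: bail out early
            dst[i] = (byte) c;
        }
        return len;
    }

    // Widen Latin-1 bytes back to UTF-16 chars (zero-extend each byte).
    static void inflate(byte[] src, char[] dst, int len) {
        for (int i = 0; i < len; i++) {
            dst[i] = (char) (src[i] & 0xFF);
        }
    }
}
```

The vectorized versions process a 16-byte block per iteration instead of a single element, which is where the reported 2x gain for long strings comes from.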

 

Thank you!

Lutz

 

 

 

 

Dr. Lutz Schmidt | SAP JVM | PI  SAP CP Core | T: +49 (6227) 7-42834

 


RE: RFR(L): 8189793: [s390]: Improve String compress/inflate by exploiting vector instructions

Doerr, Martin

Hi Lutz,

 

Thanks for working on vector-based enhancements and for providing this webrev.

 

assembler_s390:

-The changes in the assembler look good.

 

s390.ad:

-It doesn't make sense to load a constant len into a register, generate complex compare instructions for it, and still emit code for all cases. I assume that e.g. the 4-character cases usually have a constant length. If so, much better code could be generated for them by omitting everything around the simple instructions. (ppc64.ad already contains nodes for a constant needle length in the indexOf rules.)

 

macroAssembler_s390:

-Are you sure the prefetch instructions improve performance?

I remember that we had them in other String intrinsics but removed them again as they showed absolutely no performance gain.

-Comment: Using hardcoded vector registers is ok for now, but may need to get changed e.g. when using them for C2's SuperWord optimization.

-Comment: You could use the vperm instruction instead of vo+vn, but I'm ok with the current implementation because loading a mask is much more convenient than getting the permutation vector loaded (e.g. from constant pool or pc relative).

-So the new vector loop looks good to me.

-In my opinion, the size of each generated case should be proportionate to its performance benefit.

As intrinsics are not like stubs and may get inlined often, I can't shake the impression that generating such large code wastes valuable code cache space for questionable performance gain in real-world scenarios.
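The vo+vn check mentioned above can be modeled in scalar Java: OR a whole block of chars together, then AND the accumulator with a high-byte mask, so a single test per block detects any non-Latin-1 char. This is a hedged sketch of the idea only; the real code operates on 16-byte s390 vector registers:

```java
public class Latin1BlockCheck {
    // OR all chars of a block together (the "vo" step); any char with
    // bits set above 0xFF leaves a trace in the accumulator. One AND
    // with a 0xFF00 mask (the "vn" step) then decides the whole block.
    static boolean blockIsLatin1(char[] src, int from, int to) {
        int acc = 0;
        for (int i = from; i < to; i++) {
            acc |= src[i];
        }
        return (acc & 0xFF00) == 0;
    }
}
```

The per-lane mask can be generated in a register (vgmh), which avoids the memory load a vperm permutation vector would need.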

 

Best regards,

Martin

 

From: hotspot-compiler-dev [mailto:[hidden email]] On Behalf Of Schmidt, Lutz
Sent: Wednesday, 25 October 2017 12:02
To: [hidden email]
Subject: RFR(L): 8189793: [s390]: Improve String compress/inflate by exploiting vector instructions

 


 


Re: RFR(L): 8189793: [s390]: Improve String compress/inflate by exploiting vector instructions

Schmidt, Lutz

Hi Martin,

 

Thanks for reviewing my change!

 

This is a preliminary response just to let you know I'm working on the change. I'm putting a lot of effort into producing reliable performance measurement data. It turns out this is not easy (to be honest: almost impossible).

 

s390.ad:

You are absolutely right; the sequence load_const/string_compress makes no sense at all. But it does not hurt either – I could not find a single match in all the tests I ran. -> Match rule deleted.

 

macroAssembler_s390:

prefetch: I saw no impact, neither positive nor negative. Artificial micro-benchmarks will not benefit (the data is in the cache anyway). More complex benchmarks show measurement noise that covers any possible prefetch benefit. -> prefetch deleted.

Hardcoded vector registers: you are right. There are some design decisions pending, e.g. how many vector scratch registers?

Vperm instruction: using it is just another implementation variant that could save the vn vector instruction. On the other hand, loading the index vector is a costly memory access compared to vgmh. Given that we mostly deal with short strings, initialization effort matters.

Code size vs. performance: the old, well-known, often-discussed tradeoff. Starting from the existing implementation, I invested quite some time in optimizing the (len <= 8) cases. With every refinement step I saw some improvement (or believed to, given the measurement noise) – or discarded the step. Is the overall improvement worth the larger code size? -> a tradeoff, open for discussion.

 

Best Regards,

Lutz

 

 

 

On 25.10.2017, 21:08, "Doerr, Martin" <[hidden email]> wrote:

 


 


RE: RFR(L): 8189793: [s390]: Improve String compress/inflate by exploiting vector instructions

Lindenmaier, Goetz
Hi Lutz,

I have been looking at your change. I think it's a good idea to match
for constant string length; I did this for the ppc string intrinsics in the past.
I remember that the distribution of constant strings was quite uneven:
StrIndexOf appeared with constant lengths a lot; StrEquals and StrComp didn't.

Do you have any data on how often the new match rules match?

Actually, if a constant string gets deflated, a platform-independent
optimization could compute the result at compile time, but that's a different
issue ...

Best regards,
  Goetz.


> -----Original Message-----
> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-
> [hidden email]] On Behalf Of Schmidt, Lutz
> Sent: Friday, 27 October 2017 13:07
> To: Doerr, Martin <[hidden email]>; hotspot-compiler-
> [hidden email]
> Subject: Re: RFR(L): 8189793: [s390]: Improve String compress/inflate by
> exploiting vector instructions


Re: RFR(L): 8189793: [s390]: Improve String compress/inflate by exploiting vector instructions

Schmidt, Lutz
Hi Goetz,

I agree. Knowing the string length greatly helps to optimize the generated code, both in terms of size and performance. There are no compress calls with constant length, though. Constant strings are stored compressed at compile time.
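For a constant string, whether it fits into Latin-1 is itself a compile-time fact, which is why constant strings can be stored compressed up front. A hedged Java sketch of such a folding step (a hypothetical helper, not actual javac or HotSpot code):

```java
public class ConstFold {
    // Try to compress a compile-time-constant string: returns the
    // Latin-1 bytes on success, or null if any char needs two bytes
    // (in which case the string stays UTF-16 and nothing is folded).
    static byte[] tryConstantFoldCompress(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0xFF) return null;
            out[i] = (byte) c;
        }
        return out;
    }
}
```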

Here are some counters I found when running (a subset of) the SPECjvm2008 suite:
 string_compress match count:  11
 string_inflate  match count: 171
 string_inflate_const:
   len =  1: 10 matches
   len =  2:  2 matches
   len =  4:  3 matches
   len =  6:  9 matches
   len =  7:  3 matches
   len =  9:  4 matches
   len = 10:  5 matches
   len = 11: 15 matches
   len = 15:  1 matches
   len = 17:  1 matches
   len = 18:  2 matches
   len = 19:  2 matches
   len = 29:  1 matches
   len = 31:  1 matches

These (rather few) matches handle a lot of compress/inflate operations:
      n     #compress        #inflate
    <16       673 Mio        2895 Mio
   <256       207 Mio         704 Mio
  <4096       0.7 Mio         1.8 Mio
 >=4096       1.1 Mio         0.3 Mio


A short note on performance gains:
I have done a lot of performance testing in different settings. With complex tests like SPECjvm2008, the positive (or negative) effect of such low-level optimizations disappears in measurement noise. With a micro-benchmark that just compresses and inflates a string, some effect is visible:

My new, improved implementation of the intrinsics shows a slight performance advantage of 1..4% for short strings. Once the vector instructions kick in (at len >= 32), performance improves by 50..70% for string_compress and by 50..150% for string_inflate. Measurements show high variance, even though testing was done on a system with dedicated CPU resources and no concurrent load.

BTW, there is a new webrev at http://cr.openjdk.java.net/~lucy/webrevs/8189793.01/index.html 
 
In addition to the changes mentioned below, it contains two small but important fixes to the compress intrinsic:
1) z_bru(skipShortcut); is changed to z_brh(skipShortcut);
2) Code is added after the ScalarShortcut label to check for zero-length strings.
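The zero-length check (fix 2) matters because a shortcut path that unconditionally processes at least one element would touch memory out of bounds for len == 0. A minimal Java model of the guarded shortcut (illustrative only; the actual fix is in s390 assembly after the ScalarShortcut label):

```java
public class ShortcutGuard {
    // Shortcut-style compress loop: the do-while assumes len >= 1, so
    // the explicit len == 0 guard up front is what keeps it correct.
    static int compressGuarded(char[] src, byte[] dst, int len) {
        if (len == 0) return 0;      // the added zero-length check
        int i = 0;
        do {
            char c = src[i];
            if (c > 0xFF) return i;  // non-Latin-1: report progress
            dst[i] = (byte) c;
        } while (++i < len);
        return len;
    }
}
```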

Regards,
Lutz


 

On 03.11.2017, 12:44, "Lindenmaier, Goetz" <[hidden email]> wrote:



RE: RFR(L): 8189793: [s390]: Improve String compress/inflate by exploiting vector instructions

Lindenmaier, Goetz
Hi Lutz,

thanks for the numbers and for the two fixes.
Change looks good now.

The numbers indicate that deflating constant strings
at compile time would make sense, as well as optimizing
compress for large strings (assuming SPECjvm2008 is
representative, which I take to be good enough).

Best regards,
  Goetz.

> -----Original Message-----
> From: Schmidt, Lutz
> Sent: Friday, 3 November 2017 17:46
> To: Lindenmaier, Goetz <[hidden email]>; Doerr, Martin
> <[hidden email]>; [hidden email]
> Subject: Re: RFR(L): 8189793: [s390]: Improve String compress/inflate by
> exploiting vector instructions
>
> Hi Goetz,
>
> I agree. Knowing the string length greatly helps to optimize the generated
> code, both in terms of size and performance. There are no compress calls
> with constant length, though. Constant strings are stored compressed at
> compile time.
>
> Here are some counters I found when running (a subset of) the SPECjvm2008
> suite:
>  string_compress match count:  11
>  string_inflate  match count: 171
>  string_inflate_const:
>    len =  1: 10 matches
>    len =  2:  2 matches
>    len =  4:  3 matches
>    len =  6:  9 matches
>    len =  7:  3 matches
>    len =  9:  4 matches
>    len = 10:  5 matches
>    len = 11: 15 matches
>    len = 15:  1 matches
>    len = 17:  1 matches
>    len = 18:  2 matches
>    len = 19:  2 matches
>    len = 29:  1 matches
>    len = 31:  1 matches
>
> These (rather few) matches handle a lot of compress/inflate operations:
>       n     #compress        #inflate
>     <16       673 Mio        2895 Mio
>    <256       207 Mio         704 Mio
>   <4096       0.7 Mio         1.8 Mio
>  >=4096       1.1 Mio         0.3 Mio
>
>
> A short not on performance gains:
> I have done a lot of performance tests in different settings. With complex
> tests, like SPECjvm2008, the positive (or negative) effect of such low-level
> optimizations disappears in measurement noise. With a micro benchmark,
> just compressing and inflating a string, some effect is visible:
>
> My new, improved implementation of the intrinsics shows a slight
> performance advantage of 1..4% for short strings. Once the vector
> instructions kick in (at len >= 32), performance improves by 50..70% for
> string_compress and by 50..150% for string_inflate. Measurements show a
> high variance, despite testing was done on a system with dedicated cpu
> resources and with no concurrent load.
>
> BTW, there is a new webrev at
> http://cr.openjdk.java.net/~lucy/webrevs/8189793.01/index.html
>
> In addition to the changes mentioned below, it contains two minor,
> nevertheless important fixes to the compress intrinsic:
> 1) z_bru(skipShortcut); is changed to z_brh(skipShortcut);
> 2) Code is added after label ScalarShortcut to check for zero length strings.
>
> Regards,
> Lutz
>
>
>
>
> On 03.11.2017, 12:44, "Lindenmaier, Goetz" <[hidden email]>
> wrote:
>
>     Hi Lutz,
>
>     I have been looking at your change. I think it's a good idea to match
>     for constant string length. I did this for the ppc string intrinsics in the
>     past.
>     I remember that distribution of constant strings was quite uneven.
>     StrIndexOf appeared with constant lengths a lot, StrEquals and StrComp
>     didn't.
>
>     Do you have any data on how often the new match rules match?
>
>     Actually, if there is a constant string deflated, a platform independent
>     optimization could compute that at compile time, but that's a different
>     issue ...
>
>     Best regards,
>       Goetz.
>
>
>     > -----Original Message-----
>     > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-
>     > [hidden email]] On Behalf Of Schmidt, Lutz
>     > Sent: Freitag, 27. Oktober 2017 13:07
>     > To: Doerr, Martin <[hidden email]>; hotspot-compiler-
>     > [hidden email]
>     > Subject: Re: RFR(L): 8189793: [s390]: Improve String compress/inflate by
>     > exploiting vector instructions
>     >
>     > Hi Martin,
>     >
>     >
>     >
>     > Thanks for reviewing my change!
>     >
>     >
>     >
>     > This is a preliminary response just to let you know I’m working on the
>     > change. I’m putting a lot of effort in producing reliable performance
>     > measurement data. Turns out this is not easy (to be more honest: almost
>     > impossible).
>     >
>     >
>     >
>     > s390.ad:
>     >
>     > You are absolutely right, the sequence load_const/string_compress
> makes
>     > no sense at all. But it does not hurt either – I could not find one match in
> all
>     > tests I ran. -> Match rule deleted.
>     >
>     >
>     >
>     > macroAssembler_s390:
>     >
>     > prefetch: did not see impact, neither positive nor negative. Artificial
> micro
>     > benchmarks will not benefit (data is in cache anyway). More complex
>     > benchmarks show measurement noise which covers the possible
> prefetch
>     > benefit. -> prefetch deleted.
>     >
>     > Hardcoded vector registers: you are right. There are some design
> decisions
>     > pending, e.g. how many vector scratch registers?
>     >
>     > Vperm instruction: using that is just another implementation variant that
>     > could save the vn vector instruction. On the other hand, loading the
> index
>     > vector is a (compared to vgmh) costly memory access. Given the fact that
> we
>     > mostly deal with short strings, initialization effort is relevant.
>     >
>     > Code size vs. performance: the old, well known, often discussed
> tradeoff.
>     > Starting from the existing implementation, I invested quite some time in
>     > optimizing the (len <= 8) cases. With every refinement step I saw (or
>     > believed to see (measurement noise)) some improvement – or
> discarded it.
>     > Is the overall improvement worth the larger code size? -> tradeoff,
>     > discussion.
>     >
>     >
>     >
>     > Best Regards,
>     >
>     > Lutz
>     >
>     >
>     >
>     >
>     >
>     >
>     >
>     > On 25.10.2017, 21:08, "Doerr, Martin" <[hidden email]
>     > <mailto:[hidden email]> > wrote:
>     >
>     >
>     >
>     > Hi Lutz,
>     >
>     >
>     >
>     > thanks for working on vector-based enhancements and for providing this
>     > webrev.
>     >
>     >
>     >
>     > assembler_s390:
>     >
>     > -The changes in the assembler look good.
>     >
>     >
>     >
>     > s390.ad:
>     >
>     > -It doesn't make sense to load constant len to a register and generate
>     > complex compare instructions for it and still to emit code for all cases. I
>     > assume that e.g. the 4 characters cases usually have a constant length. If
> so,
>     > much better code could be generated for them by omitting all the stuff
>     > around the simple instructions. (ppc64.ad already contains nodes for
>     > constant length of needle in indexOf rules.)
>     >
>     >
>     >
>     > macroAssembler_s390:
>     >
>     > -Are you sure the prefetch instructions improve performance?
>     >
>     > I remember that we had them in other String intrinsics but removed
> them
>     > again as they showed absolutely no performance gain.
>     >
>     > -Comment: Using hardcoded vector registers is ok for now, but may need
> to
>     > get changed e.g. when using them for C2's SuperWord optimization.
>     >
>     > -Comment: You could use the vperm instruction instead of vo+vn, but I'm
> ok
>     > with the current implementation because loading a mask is much more
>     > convenient than getting the permutation vector loaded (e.g. from
> constant
>     > pool or pc relative).
>     >
>     > -So the new vector loop looks good to me.
>     >
>     > -In my opinion, the size of all the generated cases should be in
> relationship to
>     > their performance benefit.
>     >
>     > As intrinsics are not like stubs and may get inlined often, I can't get rid of
> the
>     > impression that generating so large code wastes valuable code cache
> space
>     > with questionable performance gain in real world scenarios.
>     >
>     >
>     >
>     > Best regards,
>     >
>     > Martin
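[Editor's note: a rough scalar model, for readers following the vo+vn discussion above. This is not the actual s390 instruction sequence; it only illustrates the kind of whole-chunk Latin-1 check a vector compress loop performs: OR all chars of a chunk together (the role of a vector OR such as vo), then test a high-byte mask on the result. The function name is illustrative.]

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar sketch of a vectorized "can this chunk be compressed?" check:
// OR all 16-bit chars together, then test the high-byte mask of the result.
// In a real vector loop, the OR runs across vector lanes in one instruction.
bool chunk_is_latin1(const uint16_t* chars, size_t n) {
  uint16_t accum = 0;
  for (size_t i = 0; i < n; i++) {
    accum |= chars[i];            // models an OR across vector lanes
  }
  return (accum & 0xFF00) == 0;   // any set high byte means "cannot compress"
}
```

One chunk-wide test replaces n per-character compares on the fast path, which is where the vector variant wins for long strings.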


Re: RFR(L): 8189793: [s390]: Improve String compress/inflate by exploiting vector instructions

Schmidt, Lutz
Thank you, Goetz!
Best Regards,
Lutz

 

On 13.11.2017, 12:48, "Lindenmaier, Goetz" <[hidden email]> wrote:

    Hi Lutz,
   
    thanks for the numbers and for the two fixes.
    Change looks good now.
   
    The numbers indicate that deflation of constant strings
    at compile time would make sense, as well as optimizing
    compress for large strings.  (if jvm2008 is representative,
    but I assume it's good enough).
   
    Best regards,
      Goetz.
   
    > -----Original Message-----
    > From: Schmidt, Lutz
    > Sent: Freitag, 3. November 2017 17:46
    > To: Lindenmaier, Goetz <[hidden email]>; Doerr, Martin
    > <[hidden email]>; [hidden email]
    > Subject: Re: RFR(L): 8189793: [s390]: Improve String compress/inflate by
    > exploiting vector instructions
    >
    > Hi Goetz,
    >
    > I agree. Knowing the string length greatly helps to optimize the generated
    > code, both in terms of size and performance. There are no compress calls
    > with constant length, though. Constant strings are stored compressed at
    > compile time.
    >
    > Here are some counters I found when running (a subset of) the SPECjvm2008
    > suite:
    >  string_compress match count:  11
    >  string_inflate  match count: 171
    >  string_inflate_const:
    >    len =  1: 10 matches
    >    len =  2:  2 matches
    >    len =  4:  3 matches
    >    len =  6:  9 matches
    >    len =  7:  3 matches
    >    len =  9:  4 matches
    >    len = 10:  5 matches
    >    len = 11: 15 matches
    >    len = 15:  1 matches
    >    len = 17:  1 matches
    >    len = 18:  2 matches
    >    len = 19:  2 matches
    >    len = 29:  1 matches
    >    len = 31:  1 matches
    >
    > These (rather few) matches handle a lot of compress/inflate operations:
    >       n     #compress        #inflate
    >     <16       673 Mio        2895 Mio
    >    <256       207 Mio         704 Mio
    >   <4096       0.7 Mio         1.8 Mio
    >  >=4096       1.1 Mio         0.3 Mio
    >
    >
>     > A short note on performance gains:
    > I have done a lot of performance tests in different settings. With complex
    > tests, like SPECjvm2008, the positive (or negative) effect of such low-level
    > optimizations disappears in measurement noise. With a micro benchmark,
    > just compressing and inflating a string, some effect is visible:
    >
    > My new, improved implementation of the intrinsics shows a slight
    > performance advantage of 1..4% for short strings. Once the vector
    > instructions kick in (at len >= 32), performance improves by 50..70% for
    > string_compress and by 50..150% for string_inflate. Measurements show a
>     > high variance, even though testing was done on a system with dedicated
>     > cpu resources and with no concurrent load.
    >
    > BTW, there is a new webrev at
    > http://cr.openjdk.java.net/~lucy/webrevs/8189793.01/index.html
    >
    > In addition to the changes mentioned below, it contains two minor,
    > nevertheless important fixes to the compress intrinsic:
    > 1) z_bru(skipShortcut); is changed to z_brh(skipShortcut);
    > 2) Code is added after label ScalarShortcut to check for zero length strings.
    >
    > Regards,
    > Lutz
    >
    >
    >
    >
    > On 03.11.2017, 12:44, "Lindenmaier, Goetz" <[hidden email]>
    > wrote:
    >
    >     Hi Lutz,
    >
    >     I have been looking at your change. I think it's a good idea to match
    >     for constant string length. I did this for the ppc string intrinsics in the
    >     past.
    >     I remember that distribution of constant strings was quite uneven.
    >     StrIndexOf appeared with constant lengths a lot, StrEquals and StrComp
    >     didn't.
    >
    >     Do you have any data on how often the new match rules match?
    >
    >     Actually, if there is a constant string deflated, a platform independent
    >     optimization could compute that at compile time, but that's a different
    >     issue ...
    >
    >     Best regards,
    >       Goetz.
    >
    >
    >     > -----Original Message-----
    >     > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-
    >     > [hidden email]] On Behalf Of Schmidt, Lutz
    >     > Sent: Freitag, 27. Oktober 2017 13:07
    >     > To: Doerr, Martin <[hidden email]>; hotspot-compiler-
    >     > [hidden email]
    >     > Subject: Re: RFR(L): 8189793: [s390]: Improve String compress/inflate by
    >     > exploiting vector instructions
    >     >
    >     > Hi Martin,
    >     >
    >     >
    >     >
    >     > Thanks for reviewing my change!
    >     >
    >     >
    >     >
    >     > This is a preliminary response just to let you know I’m working on the
    >     > change. I’m putting a lot of effort in producing reliable performance
    >     > measurement data. Turns out this is not easy (to be more honest: almost
    >     > impossible).
    >     >
    >     >
    >     >
    >     > s390.ad:
    >     >
>     > You are absolutely right, the sequence load_const/string_compress makes
>     > no sense at all. But it does not hurt either – I could not find one match
>     > in all tests I ran. -> Match rule deleted.
>     >
>     >
>     >
>     > macroAssembler_s390:
>     >
>     > prefetch: did not see impact, neither positive nor negative. Artificial
>     > micro benchmarks will not benefit (data is in cache anyway). More complex
>     > benchmarks show measurement noise which covers the possible prefetch
>     > benefit. -> prefetch deleted.
>     >
>     > Hardcoded vector registers: you are right. There are some design decisions
>     > pending, e.g. how many vector scratch registers?
>     >
>     > Vperm instruction: using that is just another implementation variant that
>     > could save the vn vector instruction. On the other hand, loading the index
>     > vector is a (compared to vgmh) costly memory access. Given the fact that
>     > we mostly deal with short strings, initialization effort is relevant.
>     >
>     > Code size vs. performance: the old, well known, often discussed tradeoff.
>     > Starting from the existing implementation, I invested quite some time in
>     > optimizing the (len <= 8) cases. With every refinement step I saw (or
>     > believed to see (measurement noise)) some improvement – or discarded it.
>     > Is the overall improvement worth the larger code size? -> tradeoff,
>     > discussion.
    >     >
    >     >
    >     >
    >     > Best Regards,
    >     >
    >     > Lutz
    >     >
    >     >
    >     >
    >     >
    >     >
    >     >
    >     >


Re: RFR(L): 8189793: [s390]: Improve String compress/inflate by exploiting vector instructions

Schmidt, Lutz
Hi all,

Following Martin's request, I have disabled the “special case handling” for very short strings, trading some potential performance gain for smaller generated code.

Please find the new, updated webrev here: http://cr.openjdk.java.net/~lucy/webrevs/8189793.02/index.html

Thanks and best regards,
Lutz
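
[Editor's note: for readers unfamiliar with what these intrinsics compute, here is a minimal scalar sketch, assuming the compact-strings Latin-1/UTF-16 coder semantics. The function names are hypothetical reference code; the real implementations are generated s390 assembly.]

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference for the two intrinsics under review.
// Compress packs UTF-16 chars into Latin-1 bytes and reports how many
// chars were packed before hitting one that does not fit; inflate
// zero-extends Latin-1 bytes back to UTF-16 chars.
size_t scalar_compress(const uint16_t* src, uint8_t* dst, size_t len) {
  for (size_t i = 0; i < len; i++) {
    if (src[i] > 0xFF) {
      return i;                  // first char with a non-zero high byte
    }
    dst[i] = (uint8_t)src[i];
  }
  return len;                    // zero-length input trivially succeeds
}

void scalar_inflate(const uint8_t* src, uint16_t* dst, size_t len) {
  for (size_t i = 0; i < len; i++) {
    dst[i] = src[i];             // zero-extend byte to char
  }
}
```

The len == 0 behavior matters: one of the fixes in webrev .01 was adding a zero-length check to the compress shortcut path.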


