RFR: 8261553: Efficient mask generation using BMI2 BZHI instruction


RFR: 8261553: Efficient mask generation using BMI2 BZHI instruction

Jatin Bhateja
The BMI2 BZHI instruction can be used to optimize the instruction sequence
used for mask generation at various places in the array copy stubs and in partial in-lining for small copy operations.

-------------

Commit messages:
 - 8261553: Efficient mask generation using BMI2 BZHI instruction.

Changes: https://git.openjdk.java.net/jdk/pull/2522/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=2522&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8261553
  Stats: 30 lines in 5 files changed: 8 ins; 11 del; 11 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2522.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2522/head:pull/2522

PR: https://git.openjdk.java.net/jdk/pull/2522

Re: RFR: 8261553: Efficient mask generation using BMI2 BZHI instruction

Claes Redestad
On Thu, 11 Feb 2021 08:31:40 GMT, Jatin Bhateja <[hidden email]> wrote:

> The BMI2 BZHI instruction can be used to optimize the instruction sequence
> used for mask generation at various places in the array copy stubs and in partial in-lining for small copy operations.

- Rather than removing the old code, I believe the code calling bzhiq needs to be in a branch checking `VM_Version::supports_bmi2`. Otherwise you'll hit asserts on older hardware without this extension
- Some demonstration of the performance benefit would be nice - either a new microbenchmark or a statistically significant result running some existing ones, e.g. `make test TEST=micro:ArrayCopy`

-------------

Changes requested by redestad (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/2522

Re: RFR: 8261553: Efficient mask generation using BMI2 BZHI instruction [v2]

Jatin Bhateja
On Thu, 11 Feb 2021 10:28:05 GMT, Claes Redestad <[hidden email]> wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   8261553: Adding BMI2 missing check for partial in-lining.
>
> - Rather than removing the old code, I believe the code calling bzhiq needs to be in a branch checking `VM_Version::supports_bmi2`. Otherwise you'll hit asserts on older hardware without this extension
> - Some demonstration of the performance benefit would be nice - either a new microbenchmark or a statistically significant result running some existing ones, e.g. `make test TEST=micro:ArrayCopy`

Hi Claes,

Here is the JMH performance data over CLX for arraycopy benchmarks:
http://cr.openjdk.java.net/~jbhateja/8261553/JMH_PERF_CLX_BASELINE.txt
http://cr.openjdk.java.net/~jbhateja/8261553/JMH_PERF_CLX_WITH_OPTS.txt

Regards,
Jatin

-------------

PR: https://git.openjdk.java.net/jdk/pull/2522

Re: RFR: 8261553: Efficient mask generation using BMI2 BZHI instruction [v2]

Jatin Bhateja
In reply to this post by Claes Redestad
On Thu, 11 Feb 2021 10:28:05 GMT, Claes Redestad <[hidden email]> wrote:

> * Rather than removing the old code, I believe the code calling bzhiq needs to be in a branch checking `VM_Version::supports_bmi2`. Otherwise you'll hit asserts on older hardware without this extension

Hi Claes, I added the missing safety check for BMI2. In general it is rare for a target that supports AVX-512 not to support BMI2.

> * Some demonstration of the performance benefit would be nice - either a new microbenchmark or a statistically significant result running some existing ones, e.g. `make test TEST=micro:ArrayCopy`

-------------

PR: https://git.openjdk.java.net/jdk/pull/2522

Re: RFR: 8261553: Efficient mask generation using BMI2 BZHI instruction [v2]

Claes Redestad
In reply to this post by Jatin Bhateja
On Thu, 11 Feb 2021 12:22:29 GMT, Jatin Bhateja <[hidden email]> wrote:

> Hi Claes,
>
> Here is the JMH performance data over CLX for arraycopy benchmarks:
> http://cr.openjdk.java.net/~jbhateja/8261553/JMH_PERF_CLX_BASELINE.txt
> http://cr.openjdk.java.net/~jbhateja/8261553/JMH_PERF_CLX_WITH_OPTS.txt
>
> Regards,
> Jatin

Thanks! Eyeballing the results, it looks like a mixed bag. There even seem to be a few regressions, such as this:

o.o.b.java.lang.ArrayCopyUnalignedSrc.testLong                   1200     N/A   avgt    2     61.663           ns/op
-->
o.o.b.java.lang.ArrayCopyUnalignedSrc.testLong                   1200     N/A   avgt    2     74.160           ns/op

-------------

PR: https://git.openjdk.java.net/jdk/pull/2522

Re: RFR: 8261553: Efficient mask generation using BMI2 BZHI instruction [v2]

Jatin Bhateja
On Thu, 11 Feb 2021 12:56:24 GMT, Claes Redestad <[hidden email]> wrote:

> > Hi Claes,
> > Here is the JMH performance data over CLX for arraycopy benchmarks:
> > http://cr.openjdk.java.net/~jbhateja/8261553/JMH_PERF_CLX_BASELINE.txt
> > http://cr.openjdk.java.net/~jbhateja/8261553/JMH_PERF_CLX_WITH_OPTS.txt
> > Regards,
> > Jatin
>
> Thanks! Eyeballing the results it looks like a mixed bag. There even seems to be a few regressions such as this:
>
> ```
> o.o.b.java.lang.ArrayCopyUnalignedSrc.testLong                   1200     N/A   avgt    2     61.663           ns/op
> -->
> o.o.b.java.lang.ArrayCopyUnalignedSrc.testLong                   1200     N/A   avgt    2     74.160           ns/op
> ```

Hi Claes, this could be run-to-run variation. In general we now have fewer instructions (one shift operation saved per mask computation) compared to the previous mask generation sequence, and thus it will always offer better execution latencies.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2522

Re: RFR: 8261553: Efficient mask generation using BMI2 BZHI instruction [v2]

Claes Redestad
On Thu, 11 Feb 2021 13:52:09 GMT, Jatin Bhateja <[hidden email]> wrote:

> Hi Claes, This could be a run to run variation, in general we are now having fewer number of instructions (one shift operation saved per mask computation) compared to previous masked generation sequence and thus it will always offer better execution latencies.

Run-to-run variation would be easy to rule out by running more forks and more iterations to attain statistically significant results. While the instruction manuals suggest latency should be better for this instruction on all CPUs where it's supported, it would be good if there was some clear proof - such as a significant benchmark win - to motivate the added complexity.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2522

Re: RFR: 8261553: Efficient mask generation using BMI2 BZHI instruction [v2]

Jatin Bhateja
On Thu, 11 Feb 2021 14:28:01 GMT, Claes Redestad <[hidden email]> wrote:

> > Hi Claes, This could be a run to run variation, in general we are now having fewer number of instructions (one shift operation saved per mask computation) compared to previous masked generation sequence and thus it will always offer better execution latencies.
>
> Run-to-run variation would be easy to rule out by running more forks and more iterations to attain statistically significant results. While the instruction manuals suggest latency should be better for this instruction on all CPUs where it's supported, it would be good if there was some clear proof - such as a significant benchmark win - to motivate the added complexity.

BASELINE:
Result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong":
  61.037 ns/op

Secondary result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong:·perf":
Perf stats:
--------------------------------------------------

         19,739.21 msec task-clock                #    0.389 CPUs utilized          
               646      context-switches          #    0.033 K/sec                  
                12      cpu-migrations            #    0.001 K/sec                  
               150      page-faults               #    0.008 K/sec                  
   74,59,83,59,139      cycles                    #    3.779 GHz                      (30.73%)
 1,78,78,79,19,117      instructions              #    2.40  insn per cycle           (38.48%)
   24,79,81,63,651      branches                  # 1256.289 M/sec                    (38.55%)
      32,24,89,924      branch-misses             #    1.30% of all branches          (38.62%)
   52,56,88,28,472      L1-dcache-loads           # 2663.167 M/sec                    (38.65%)
         39,00,969      L1-dcache-load-misses     #    0.01% of all L1-dcache hits    (38.57%)
          3,74,131      LLC-loads                 #    0.019 M/sec                    (30.77%)
            22,315      LLC-load-misses           #    5.96% of all LL-cache hits     (30.72%)
   <not supported>      L1-icache-loads                                            
         17,49,997      L1-icache-load-misses                                         (30.72%)
   52,91,41,70,636      dTLB-loads                # 2680.663 M/sec                    (30.69%)
             3,315      dTLB-load-misses          #    0.00% of all dTLB cache hits   (30.67%)
             4,674      iTLB-loads                #    0.237 K/sec                    (30.65%)
            33,746      iTLB-load-misses          #  721.99% of all iTLB cache hits   (30.63%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                  

      50.723759146 seconds time elapsed

      51.447054000 seconds user
       0.189949000 seconds sys


WITH OPT:
Result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong":
  74.356 ns/op

Secondary result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong:·perf":
Perf stats:
--------------------------------------------------

         19,741.09 msec task-clock                #    0.389 CPUs utilized          
               641      context-switches          #    0.032 K/sec                  
                17      cpu-migrations            #    0.001 K/sec                  
               164      page-faults               #    0.008 K/sec                  
   74,40,40,48,513      cycles                    #    3.769 GHz                      (30.81%)
 1,45,66,22,06,797      instructions              #    1.96  insn per cycle           (38.56%)
   20,31,28,43,577      branches                  # 1028.963 M/sec                    (38.65%)
         14,11,419      branch-misses             #    0.01% of all branches          (38.69%)
   43,07,86,33,662      L1-dcache-loads           # 2182.182 M/sec                    (38.72%)
         37,06,744      L1-dcache-load-misses     #    0.01% of all L1-dcache hits    (38.56%)
          1,34,292      LLC-loads                 #    0.007 M/sec                    (30.72%)
            30,627      LLC-load-misses           #   22.81% of all LL-cache hits     (30.68%)
   <not supported>      L1-icache-loads                                            
         14,49,145      L1-icache-load-misses                                         (30.65%)
   43,44,86,27,516      dTLB-loads                # 2200.924 M/sec                    (30.63%)
               218      dTLB-load-misses          #    0.00% of all dTLB cache hits   (30.63%)
             2,445      iTLB-loads                #    0.124 K/sec                    (30.63%)
            28,624      iTLB-load-misses          # 1170.72% of all iTLB cache hits   (30.63%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                  

      50.716083931 seconds time elapsed

      51.467300000 seconds user
       0.200390000 seconds sys


JMH perf data for ArrayCopyUnalignedSrc.testLong with a copy length of 1200 shows degradation in L1D accesses; it seems the benchmark got displaced from its sweet spot.

But there is a significant reduction in instruction count, and cycles are almost comparable. We are saving one shift per mask computation.

          OLD Sequence:
              0x00007f7fc1030ead:   movabs $0x1,%rax
              0x00007f7fc1030eb7:   shlx   %r8,%rax,%rax
              0x00007f7fc1030ebc:   dec    %rax
              0x00007f7fc1030ebf:   kmovq  %rax,%k2
          NEW Sequence:
              0x00007f775d030d51:   movabs $0xffffffffffffffff,%rax
              0x00007f775d030d5b:   bzhi   %r8,%rax,%rax
              0x00007f775d030d60:   kmovq  %rax,%k2
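For illustration, the two sequences can be modeled in plain Java (a minimal sketch; the real code is x86 assembly emitted by the HotSpot stubs, and the method names below are hypothetical). Java's `<<` on a `long` masks the shift count to 6 bits, matching SHLX, and `bzhiEmul` models BZHI for counts 0..63:

```java
public class MaskSequences {
    // Models BZHI: clear all bits of src at positions >= index (index 0..63).
    static long bzhiEmul(long src, int index) {
        return index == 0 ? 0L : src & (-1L >>> (64 - index));
    }

    // Old sequence: movabs $1 ; shlx by count ; dec.
    static long oldMask(int count) {
        long rax = 1L;
        rax = rax << count;   // shift count masked to 6 bits, like SHLX
        rax = rax - 1;
        return rax;
    }

    // New sequence: movabs $-1 ; bzhi by count.
    static long newMask(int count) {
        long rax = -1L;
        return bzhiEmul(rax, count);
    }

    public static void main(String[] args) {
        // Both sequences produce the same k-register mask for tail counts 0..63.
        for (int count = 0; count < 64; count++) {
            if (oldMask(count) != newMask(count))
                throw new AssertionError("mismatch at count " + count);
        }
        System.out.println("masks agree for counts 0..63");
    }
}
```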

-------------

PR: https://git.openjdk.java.net/jdk/pull/2522

Re: RFR: 8261553: Efficient mask generation using BMI2 BZHI instruction [v3]

Jatin Bhateja
In reply to this post by Jatin Bhateja
> The BMI2 BZHI instruction can be used to optimize the instruction sequence
> used for mask generation at various places in the array copy stubs and in partial in-lining for small copy operations.

Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:

  8261553 : Aligning main copy loop to prevent any penalty due to LSD and DSB misses.

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/2522/files
  - new: https://git.openjdk.java.net/jdk/pull/2522/files/84c9c2da..7012eed0

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2522&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2522&range=01-02

  Stats: 8 lines in 1 file changed: 4 ins; 0 del; 4 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2522.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2522/head:pull/2522

PR: https://git.openjdk.java.net/jdk/pull/2522

Re: RFR: 8261553: Efficient mask generation using BMI2 BZHI instruction [v3]

Jatin Bhateja
In reply to this post by Jatin Bhateja
On Fri, 12 Feb 2021 05:59:01 GMT, Jatin Bhateja <[hidden email]> wrote:

>>> Hi Claes, This could be a run to run variation, in general we are now having fewer number of instructions (one shift operation saved per mask computation) compared to previous masked generation sequence and thus it will always offer better execution latencies.
>>
>> Run-to-run variation would be easy to rule out by running more forks and more iterations to attain statistically significant results. While the instruction manuals suggest latency should be better for this instruction on all CPUs where it's supported, it would be good if there was some clear proof - such as a significant benchmark win - to motivate the added complexity.
>
>> > Hi Claes, This could be a run to run variation, in general we are now having fewer number of instructions (one shift operation saved per mask computation) compared to previous masked generation sequence and thus it will always offer better execution latencies.
>>
>> Run-to-run variation would be easy to rule out by running more forks and more iterations to attain statistically significant results. While the instruction manuals suggest latency should be better for this instruction on all CPUs where it's supported, it would be good if there was some clear proof - such as a significant benchmark win - to motivate the added complexity.
>
> BASELINE:
> Result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong":
>   61.037 ns/op
>
> Secondary result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong:·perf":
> Perf stats:
> --------------------------------------------------
>
>          19,739.21 msec task-clock                #    0.389 CPUs utilized          
>                646      context-switches          #    0.033 K/sec                  
>                 12      cpu-migrations            #    0.001 K/sec                  
>                150      page-faults               #    0.008 K/sec                  
>    74,59,83,59,139      cycles                    #    3.779 GHz                      (30.73%)
>  1,78,78,79,19,117      instructions              #    2.40  insn per cycle           (38.48%)
>    24,79,81,63,651      branches                  # 1256.289 M/sec                    (38.55%)
>       32,24,89,924      branch-misses             #    1.30% of all branches          (38.62%)
>    52,56,88,28,472      L1-dcache-loads           # 2663.167 M/sec                    (38.65%)
>          39,00,969      L1-dcache-load-misses     #    0.01% of all L1-dcache hits    (38.57%)
>           3,74,131      LLC-loads                 #    0.019 M/sec                    (30.77%)
>             22,315      LLC-load-misses           #    5.96% of all LL-cache hits     (30.72%)
>    <not supported>      L1-icache-loads                                            
>          17,49,997      L1-icache-load-misses                                         (30.72%)
>    52,91,41,70,636      dTLB-loads                # 2680.663 M/sec                    (30.69%)
>              3,315      dTLB-load-misses          #    0.00% of all dTLB cache hits   (30.67%)
>              4,674      iTLB-loads                #    0.237 K/sec                    (30.65%)
>             33,746      iTLB-load-misses          #  721.99% of all iTLB cache hits   (30.63%)
>    <not supported>      L1-dcache-prefetches                                        
>    <not supported>      L1-dcache-prefetch-misses                                  
>
>       50.723759146 seconds time elapsed
>
>       51.447054000 seconds user
>        0.189949000 seconds sys
>
>
> WITH OPT:
> Result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong":
>   74.356 ns/op
>
> Secondary result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong:·perf":
> Perf stats:
> --------------------------------------------------
>
>          19,741.09 msec task-clock                #    0.389 CPUs utilized          
>                641      context-switches          #    0.032 K/sec                  
>                 17      cpu-migrations            #    0.001 K/sec                  
>                164      page-faults               #    0.008 K/sec                  
>    74,40,40,48,513      cycles                    #    3.769 GHz                      (30.81%)
>  1,45,66,22,06,797      instructions              #    1.96  insn per cycle           (38.56%)
>    20,31,28,43,577      branches                  # 1028.963 M/sec                    (38.65%)
>          14,11,419      branch-misses             #    0.01% of all branches          (38.69%)
>    43,07,86,33,662      L1-dcache-loads           # 2182.182 M/sec                    (38.72%)
>          37,06,744      L1-dcache-load-misses     #    0.01% of all L1-dcache hits    (38.56%)
>           1,34,292      LLC-loads                 #    0.007 M/sec                    (30.72%)
>             30,627      LLC-load-misses           #   22.81% of all LL-cache hits     (30.68%)
>    <not supported>      L1-icache-loads                                            
>          14,49,145      L1-icache-load-misses                                         (30.65%)
>    43,44,86,27,516      dTLB-loads                # 2200.924 M/sec                    (30.63%)
>                218      dTLB-load-misses          #    0.00% of all dTLB cache hits   (30.63%)
>              2,445      iTLB-loads                #    0.124 K/sec                    (30.63%)
>             28,624      iTLB-load-misses          # 1170.72% of all iTLB cache hits   (30.63%)
>    <not supported>      L1-dcache-prefetches                                        
>    <not supported>      L1-dcache-prefetch-misses                                  
>
>       50.716083931 seconds time elapsed
>
>       51.467300000 seconds user
>        0.200390000 seconds sys
>
>
> JMH perf data for ArrayCopyUnalignedSrc.testLong with copy length  of 1200 shows degradation in LID accesses, it seems the benchmask got displaced from its sweet spot.
>
> But, there is a significant reduction in instruction count  and cycles are almost comparable.  We are saving one shift per mask computation.
>
>           OLD Sequence:
>               0x00007f7fc1030ead:   movabs $0x1,%rax
>               0x00007f7fc1030eb7:   shlx   %r8,%rax,%rax
>               0x00007f7fc1030ebc:   dec    %rax
>               0x00007f7fc1030ebf:   kmovq  %rax,%k2
>           NEW Sequence:
>               0x00007f775d030d51:   movabs $0xffffffffffffffff,%rax
>               0x00007f775d030d5b:   bzhi   %r8,%rax,%rax
>               0x00007f775d030d60:   kmovq  %rax,%k2

Further analysis of the perf degradation revealed that the new optimized instruction pattern disturbed the code alignment. This led to an increase in LSD misses and also reduced UOP caching in the DSB.
Aligning the copy loops at a 32-byte boundary prevents any adverse impact on UOP caching.
The NOPs used for padding add to the instruction count and thus may overshadow the code-size gains from the new mask generation sequence in the copy stubs.

Baseline:
 ArrayCopyAligned.testLong   Length : 1200    61 ns/op (approx)
 1,93,44,43,11,622      cycles                                                      
 4,59,57,99,78,727      instructions              #    2.38  insn per cycle        
 1,83,68,75,68,255      idq.dsb_uops                                                
 2,08,32,43,71,906      lsd.uops                                                    
   37,12,54,60,211      idq.mite_uops                                              

With Opt:
 ArrayCopyAligned.testLong   Length : 1200   74 ns/op (approx)
 1,93,51,25,94,766      cycles                                                      
 3,75,11,57,91,917      instructions              #    1.94  insn per cycle        
   48,67,58,25,566      idq.dsb_uops                                                
      19,46,13,236      lsd.uops                                                    
 2,87,42,95,74,280      idq.mite_uops                                              

With Opt + main loop alignment (nop):
 ArrayCopyAligned.testLong   Length : 1200   61 ns/op (approx)
 1,93,52,15,90,080      cycles                                                      
 4,60,89,14,06,528      instructions              #    2.38  insn per cycle        
 1,78,76,10,34,991      idq.dsb_uops                                                
 2,09,16,15,84,313      lsd.uops                                                    
   46,25,31,92,101      idq.mite_uops            


When computing the mask for partial in-lining of small copy calls (currently enabled for sub-word types with copy lengths below 32/64 bytes), the new optimized sequence should always offer a lower instruction count and a shorter latency path.
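As a rough scalar model of how such a mask drives a partial copy (a sketch only: the names `tailMask` and `maskedCopy` are illustrative, and the actual stub uses an EVEX-masked vector move driven by the k-register mask, not a scalar loop):

```java
public class PartialCopy {
    // BZHI-style mask: low 'len' bits set (len restricted to 0..63).
    static long tailMask(int len) {
        return len == 0 ? 0L : -1L >>> (64 - len);
    }

    // Scalar model of a masked vector copy: byte i is copied iff mask bit i is set.
    static void maskedCopy(byte[] src, byte[] dst, long mask) {
        for (int i = 0; i < src.length && i < 64; i++) {
            if ((mask & (1L << i)) != 0) {
                dst[i] = src[i];
            }
        }
    }

    public static void main(String[] args) {
        byte[] src = new byte[32];
        byte[] dst = new byte[32];
        for (int i = 0; i < src.length; i++) src[i] = (byte) (i + 1);
        maskedCopy(src, dst, tailMask(20));            // copy only the first 20 bytes
        System.out.println(dst[19] + " " + dst[20]);   // last copied byte, then an untouched zero
    }
}
```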


Baseline:
ArrayCopyAligned.testByte        Length : 20  avgt    2  2.635          ns/op
 1,97,76,75,18,052      cycles                                                      
 8,96,00,37,11,803      instructions              #    4.53  insn per cycle        
    2,71,83,79,035      idq.dsb_uops                                                
 7,54,82,43,63,409      lsd.uops                                                    
    3,92,55,74,395      idq.mite_uops                                              

ArrayCopyAligned.testByte        Length : 31  avgt    2  2.635          ns/op
 1,97,79,16,56,787      cycles                                                      
 8,96,13,15,69,780      instructions              #    4.53  insn per cycle        
    2,69,07,11,691      idq.dsb_uops                                                
 7,54,95,63,77,683      lsd.uops                                                    
    3,90,19,10,747      idq.mite_uops                                              

WithOpt:
ArrayCopyAligned.testByte        Length : 20  avgt    2  2.635          ns/op
 1,97,66,64,62,541      cycles                                                      
 8,92,03,95,00,236      instructions              #    4.51  insn per cycle        
    2,72,38,56,205      idq.dsb_uops                                                
 7,50,87,50,60,591      lsd.uops                                                    
    3,89,15,02,954      idq.mite_uops                                              

ArrayCopyAligned.testByte        Length : 31  avgt    2  2.635          ns/op
 1,97,54,21,61,110      cycles                                                      
 8,91,46,64,23,754      instructions              #    4.51  insn per cycle        
    2,78,12,19,544      idq.dsb_uops                                                
 7,50,35,88,95,843      lsd.uops                                                    
    3,90,41,97,276      idq.mite_uops                              
               

Following are the links to updated JMH perf data:
http://cr.openjdk.java.net/~jbhateja/8261553/JMH_PERF_CLX_BASELINE.txt
http://cr.openjdk.java.net/~jbhateja/8261553/JMH_PERF_CLX_WITH_OPTS_LOOP_ALIGN.txt

In general the gains are not significant for the copy stubs, but the new sequence offers an optimal latency path for mask computation.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2522

Re: RFR: 8261553: Efficient mask generation using BMI2 BZHI instruction [v3]

Claes Redestad
On Mon, 15 Feb 2021 18:51:58 GMT, Jatin Bhateja <[hidden email]> wrote:

>>> > Hi Claes, This could be a run to run variation, in general we are now having fewer number of instructions (one shift operation saved per mask computation) compared to previous masked generation sequence and thus it will always offer better execution latencies.
>>>
>>> Run-to-run variation would be easy to rule out by running more forks and more iterations to attain statistically significant results. While the instruction manuals suggest latency should be better for this instruction on all CPUs where it's supported, it would be good if there was some clear proof - such as a significant benchmark win - to motivate the added complexity.
>>
>> BASELINE:
>> Result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong":
>>   61.037 ns/op
>>
>> Secondary result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong:·perf":
>> Perf stats:
>> --------------------------------------------------
>>
>>          19,739.21 msec task-clock                #    0.389 CPUs utilized          
>>                646      context-switches          #    0.033 K/sec                  
>>                 12      cpu-migrations            #    0.001 K/sec                  
>>                150      page-faults               #    0.008 K/sec                  
>>    74,59,83,59,139      cycles                    #    3.779 GHz                      (30.73%)
>>  1,78,78,79,19,117      instructions              #    2.40  insn per cycle           (38.48%)
>>    24,79,81,63,651      branches                  # 1256.289 M/sec                    (38.55%)
>>       32,24,89,924      branch-misses             #    1.30% of all branches          (38.62%)
>>    52,56,88,28,472      L1-dcache-loads           # 2663.167 M/sec                    (38.65%)
>>          39,00,969      L1-dcache-load-misses     #    0.01% of all L1-dcache hits    (38.57%)
>>           3,74,131      LLC-loads                 #    0.019 M/sec                    (30.77%)
>>             22,315      LLC-load-misses           #    5.96% of all LL-cache hits     (30.72%)
>>    <not supported>      L1-icache-loads                                            
>>          17,49,997      L1-icache-load-misses                                         (30.72%)
>>    52,91,41,70,636      dTLB-loads                # 2680.663 M/sec                    (30.69%)
>>              3,315      dTLB-load-misses          #    0.00% of all dTLB cache hits   (30.67%)
>>              4,674      iTLB-loads                #    0.237 K/sec                    (30.65%)
>>             33,746      iTLB-load-misses          #  721.99% of all iTLB cache hits   (30.63%)
>>    <not supported>      L1-dcache-prefetches                                        
>>    <not supported>      L1-dcache-prefetch-misses                                  
>>
>>       50.723759146 seconds time elapsed
>>
>>       51.447054000 seconds user
>>        0.189949000 seconds sys
>>
>>
>> WITH OPT:
>> Result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong":
>>   74.356 ns/op
>>
>> Secondary result "org.openjdk.bench.java.lang.ArrayCopyUnalignedSrc.testLong:·perf":
>> Perf stats:
>> --------------------------------------------------
>>
>>          19,741.09 msec task-clock                #    0.389 CPUs utilized          
>>                641      context-switches          #    0.032 K/sec                  
>>                 17      cpu-migrations            #    0.001 K/sec                  
>>                164      page-faults               #    0.008 K/sec                  
>>    74,40,40,48,513      cycles                    #    3.769 GHz                      (30.81%)
>>  1,45,66,22,06,797      instructions              #    1.96  insn per cycle           (38.56%)
>>    20,31,28,43,577      branches                  # 1028.963 M/sec                    (38.65%)
>>          14,11,419      branch-misses             #    0.01% of all branches          (38.69%)
>>    43,07,86,33,662      L1-dcache-loads           # 2182.182 M/sec                    (38.72%)
>>          37,06,744      L1-dcache-load-misses     #    0.01% of all L1-dcache hits    (38.56%)
>>           1,34,292      LLC-loads                 #    0.007 M/sec                    (30.72%)
>>             30,627      LLC-load-misses           #   22.81% of all LL-cache hits     (30.68%)
>>    <not supported>      L1-icache-loads                                            
>>          14,49,145      L1-icache-load-misses                                         (30.65%)
>>    43,44,86,27,516      dTLB-loads                # 2200.924 M/sec                    (30.63%)
>>                218      dTLB-load-misses          #    0.00% of all dTLB cache hits   (30.63%)
>>              2,445      iTLB-loads                #    0.124 K/sec                    (30.63%)
>>             28,624      iTLB-load-misses          # 1170.72% of all iTLB cache hits   (30.63%)
>>    <not supported>      L1-dcache-prefetches                                        
>>    <not supported>      L1-dcache-prefetch-misses                                  
>>
>>       50.716083931 seconds time elapsed
>>
>>       51.467300000 seconds user
>>        0.200390000 seconds sys
>>
>>
>> JMH perf data for ArrayCopyUnalignedSrc.testLong with copy length  of 1200 shows degradation in LID accesses, it seems the benchmask got displaced from its sweet spot.
>>
>> But, there is a significant reduction in instruction count  and cycles are almost comparable.  We are saving one shift per mask computation.
>>
>>           OLD Sequence:
>>               0x00007f7fc1030ead:   movabs $0x1,%rax
>>               0x00007f7fc1030eb7:   shlx   %r8,%rax,%rax
>>               0x00007f7fc1030ebc:   dec    %rax
>>               0x00007f7fc1030ebf:   kmovq  %rax,%k2
>>           NEW Sequence:
>>               0x00007f775d030d51:   movabs $0xffffffffffffffff,%rax
>>               0x00007f775d030d5b:   bzhi   %r8,%rax,%rax
>>               0x00007f775d030d60:   kmovq  %rax,%k2
>
> Further analysis of perf degradation revealed that with new optimized instruction pattern, code alignment got disturbed. This led to increase in LSD misses, also it reduced the UOPs cashing in DSB.  
> Aligning copy loops at 32 byte boundary prevents any adverse impact on UOP caching.
> NOPs used for padding add up to the instruction count and thus may over shadow the code size gains due to new mask generation sequence in copy stubs.  
>
> Baseline:
>  ArrayCopyAligned.testLong   Length : 1200    61 ns/op (approx)
>  1,93,44,43,11,622      cycles                                                      
>  4,59,57,99,78,727      instructions              #    2.38  insn per cycle        
>  1,83,68,75,68,255      idq.dsb_uops                                                
>  2,08,32,43,71,906      lsd.uops                                                    
>    37,12,54,60,211      idq.mite_uops                                              
>
> With Opt:
>  ArrayCopyAligned.testLong   Length : 1200   74 ns/op (approx)
>  1,93,51,25,94,766      cycles                                                      
>  3,75,11,57,91,917      instructions              #    1.94  insn per cycle        
>    48,67,58,25,566      idq.dsb_uops                                                
>       19,46,13,236      lsd.uops                                                    
>  2,87,42,95,74,280      idq.mite_uops                                              
>
> With Opt + main loop alignment(nop):         61 ns/op (approx)
>  ArrayCopyAligned.testLong   Length : 1200
>  1,93,52,15,90,080      cycles                                                      
>  4,60,89,14,06,528      instructions              #    2.38  insn per cycle        
>  1,78,76,10,34,991      idq.dsb_uops                                                
>  2,09,16,15,84,313      lsd.uops                                                    
>    46,25,31,92,101      idq.mite_uops            
>
>
> While computing the mask for partial in-lining of small copy calls (currently enabled for sub-word types with copy lengths less than 32/64 bytes), the new optimized sequence should always offer a lower instruction count and a shorter latency path.
>
>
> Baseline:
> ArrayCopyAligned.testByte        Length : 20  avgt    2  2.635          ns/op
>  1,97,76,75,18,052      cycles                                                      
>  8,96,00,37,11,803      instructions              #    4.53  insn per cycle        
>     2,71,83,79,035      idq.dsb_uops                                                
>  7,54,82,43,63,409      lsd.uops                                                    
>     3,92,55,74,395      idq.mite_uops                                              
>
> ArrayCopyAligned.testByte        Length : 31  avgt    2  2.635          ns/op
>  1,97,79,16,56,787      cycles                                                      
>  8,96,13,15,69,780      instructions              #    4.53  insn per cycle        
>     2,69,07,11,691      idq.dsb_uops                                                
>  7,54,95,63,77,683      lsd.uops                                                    
>     3,90,19,10,747      idq.mite_uops                                              
>
> WithOpt:
> ArrayCopyAligned.testByte        Length : 20  avgt    2  2.635          ns/op
>  1,97,66,64,62,541      cycles                                                      
>  8,92,03,95,00,236      instructions              #    4.51  insn per cycle        
>     2,72,38,56,205      idq.dsb_uops                                                
>  7,50,87,50,60,591      lsd.uops                                                    
>     3,89,15,02,954      idq.mite_uops                                              
>
> ArrayCopyAligned.testByte        Length : 31  avgt    2  2.635          ns/op
>  1,97,54,21,61,110      cycles                                                      
>  8,91,46,64,23,754      instructions              #    4.51  insn per cycle        
>     2,78,12,19,544      idq.dsb_uops                                                
>  7,50,35,88,95,843      lsd.uops                                                    
>     3,90,41,97,276      idq.mite_uops                              
>                
>
> Following are the links to updated JMH perf data:
> http://cr.openjdk.java.net/~jbhateja/8261553/JMH_PERF_CLX_BASELINE.txt
> http://cr.openjdk.java.net/~jbhateja/8261553/JMH_PERF_CLX_WITH_OPTS_LOOP_ALIGN.txt
>
> In general the gains are not significant for the copy stubs, but the new sequence offers an optimal latency path for the mask computation.

Thanks for getting to the bottom of that regression.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2522
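For readers without the quoted disassembly at hand, both sequences above compute the same tail mask of `n` low set bits; a minimal C model of their semantics (`bzhi_u64` here is a hand-written model of the BZHI instruction as described in the Intel SDM, not a compiler intrinsic):

```c
#include <assert.h>
#include <stdint.h>

/* Old sequence: movabs $1 / shlx / dec -- three dependent instructions. */
static uint64_t mask_old(uint64_t n) {
    return (1ULL << n) - 1;           /* valid for n in [0, 63] */
}

/* New sequence: movabs $-1 / bzhi -- BZHI zeroes bits [63:n] of the
 * source in a single instruction. Modeled here in plain C. */
static uint64_t bzhi_u64(uint64_t src, uint64_t n) {
    n &= 0xff;                        /* BZHI takes its index from bits [7:0] */
    return n < 64 ? src & ((1ULL << n) - 1) : src;
}

static uint64_t mask_new(uint64_t n) {
    return bzhi_u64(~0ULL, n);
}
```

The saved instruction is the `dec`: BZHI clears the high bits directly instead of building `(1 << n) - 1` step by step.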

Re: RFR: 8261553: Efficient mask generation using BMI2 BZHI instruction [v2]

Nils Eliasson-2
In reply to this post by Jatin Bhateja
On Thu, 11 Feb 2021 12:25:53 GMT, Jatin Bhateja <[hidden email]> wrote:

>> BMI2 BZHI instruction can be used to optimize the instruction sequence
>> used for mask generation at various place in array copy stubs and partial in-lining for small copy operations.
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
>   8261553: Adding BMI2 missing check for partial in-lining.

Looks good.

-------------

Marked as reviewed by neliasso (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/2522
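The "missing check" commit reviewed here gates the BZHI path on CPU support, since emitting the instruction on pre-BMI2 hardware would hit asserts. A hedged C sketch of that dispatch pattern (the `cpu_supports_bmi2` flag and `tail_mask` name are illustrative stand-ins for HotSpot's `VM_Version::supports_bmi2()` check, not the actual stub-generator code):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for VM_Version::supports_bmi2(); in HotSpot this
 * is a CPU-feature query, here it is just a toggle for the sketch. */
static int cpu_supports_bmi2 = 1;

/* Tail-mask generation with the guard the commit adds; n in [0, 63]. */
static uint64_t tail_mask(uint64_t n) {
    if (cpu_supports_bmi2) {
        /* BZHI path, modeled in C: clear bits [63:n] of an all-ones source. */
        return ~0ULL & ((1ULL << n) - 1);
    }
    /* Pre-BMI2 fallback: the old movabs $1 / shlx / dec sequence. */
    return (1ULL << n) - 1;
}
```

Both branches produce the same mask; only the emitted instruction sequence differs, which is why the guard can be added without changing behavior.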

Re: RFR: 8261553: Efficient mask generation using BMI2 BZHI instruction [v2]

Jatin Bhateja
On Tue, 16 Feb 2021 21:40:58 GMT, Nils Eliasson <[hidden email]> wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   8261553: Adding BMI2 missing check for partial in-lining.
>
> Looks good.

> Thanks for getting to the bottom of that regression.

Thanks. Although there is no significant performance impact, the optimized instruction sequence still reduces complexity. Would it be OK to check this in? We have one reviewer's consent.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2522

Re: RFR: 8261553: Efficient mask generation using BMI2 BZHI instruction [v3]

Claes Redestad-2
In reply to this post by Jatin Bhateja
On Mon, 15 Feb 2021 18:55:02 GMT, Jatin Bhateja <[hidden email]> wrote:

>> BMI2 BZHI instruction can be used to optimize the instruction sequence
>> used for mask generation at various place in array copy stubs and partial in-lining for small copy operations.
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
>   8261553 : Aligning main copy loop to prevent any penalty due to LSD and DSB misses.

Marked as reviewed by redestad (Reviewer).

-------------

PR: https://git.openjdk.java.net/jdk/pull/2522
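The v3 fix reviewed here pads the copy-loop entry to a 32-byte boundary so the loop body maps cleanly into the DSB/LSD. The amount of NOP padding such an `align(32)` directive emits can be sketched as follows (the function name is illustrative, not HotSpot's actual API):

```c
#include <assert.h>
#include <stdint.h>

/* Number of NOP bytes needed to bring `pc` up to the next
 * `alignment`-byte boundary; `alignment` must be a power of two. */
static uint64_t padding_for(uint64_t pc, uint64_t alignment) {
    return (alignment - (pc & (alignment - 1))) & (alignment - 1);
}
```

As the thread notes, these padding NOPs are counted as executed instructions by perf, which is why the "With Opt + alignment" run shows an instruction count close to the baseline despite the shorter mask sequence.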

Integrated: 8261553: Efficient mask generation using BMI2 BZHI instruction

Jatin Bhateja
In reply to this post by Jatin Bhateja
On Thu, 11 Feb 2021 08:31:40 GMT, Jatin Bhateja <[hidden email]> wrote:

> BMI2 BZHI instruction can be used to optimize the instruction sequence
> used for mask generation at various place in array copy stubs and partial in-lining for small copy operations.

This pull request has now been integrated.

Changeset: cb84539d
Author:    Jatin Bhateja <[hidden email]>
URL:       https://git.openjdk.java.net/jdk/commit/cb84539d
Stats:     39 lines in 6 files changed: 12 ins; 11 del; 16 mod

8261553: Efficient mask generation using BMI2 BZHI instruction

Reviewed-by: redestad, neliasso

-------------

PR: https://git.openjdk.java.net/jdk/pull/2522