SuperWord loop optimization lost after method inlining

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

SuperWord loop optimization lost after method inlining

Nicolas Heutte
Hi all,

I am encountering a performance issue caused by the interaction between
method inlining and automatic vectorization.

Our application aggregates arrays intensively using a method named
ArrayFloatToArrayFloatVectorBinding.plus() with the following code:

    for (int i = 0; i < srcLen; ++i) {

            dstArray[i] += srcArray[i];

    }

When we microbenchmark this method we observe fast performance close to the
practical memory bandwidth and when we print the assembly code we observe
loop unrolling and automatic vectorization with SIMD instructions.

  0x000001ef4600abf0: vmovdqu 0x10(%r14,%r13,4),%ymm0

  0x000001ef4600abf7: vaddps 0x10(%rcx,%r13,4),%ymm0,%ymm0

  0x000001ef4600abfe: vmovdqu %ymm0,0x10(%r14,%r13,4)

  0x000001ef4600ac05: movslq %r13d,%r11

  0x000001ef4600ac08: vmovdqu 0x30(%r14,%r11,4),%ymm0

  0x000001ef4600ac0f: vaddps 0x30(%rcx,%r11,4),%ymm0,%ymm0

  0x000001ef4600ac16: vmovdqu %ymm0,0x30(%r14,%r11,4)

  0x000001ef4600ac1d: vmovdqu 0x50(%r14,%r11,4),%ymm0

  0x000001ef4600ac24: vaddps 0x50(%rcx,%r11,4),%ymm0,%ymm0

  0x000001ef4600ac2b: vmovdqu %ymm0,0x50(%r14,%r11,4)

  0x000001ef4600ac32: vmovdqu 0x70(%r14,%r11,4),%ymm0

  0x000001ef4600ac39: vaddps 0x70(%rcx,%r11,4),%ymm0,%ymm0

  0x000001ef4600ac40: vmovdqu %ymm0,0x70(%r14,%r11,4)

  0x000001ef4600ac47: vmovdqu 0x90(%r14,%r11,4),%ymm0

  0x000001ef4600ac51: vaddps 0x90(%rcx,%r11,4),%ymm0,%ymm0

  0x000001ef4600ac5b: vmovdqu %ymm0,0x90(%r14,%r11,4)

  0x000001ef4600ac65: vmovdqu 0xb0(%r14,%r11,4),%ymm0

  0x000001ef4600ac6f: vaddps 0xb0(%rcx,%r11,4),%ymm0,%ymm0

  0x000001ef4600ac79: vmovdqu %ymm0,0xb0(%r14,%r11,4)

  0x000001ef4600ac83: vmovdqu 0xd0(%r14,%r11,4),%ymm0

  0x000001ef4600ac8d: vaddps 0xd0(%rcx,%r11,4),%ymm0,%ymm0

  0x000001ef4600ac97: vmovdqu %ymm0,0xd0(%r14,%r11,4)

  0x000001ef4600aca1: vmovdqu 0xf0(%r14,%r11,4),%ymm0

  0x000001ef4600acab: vaddps 0xf0(%rcx,%r11,4),%ymm0,%ymm0

  0x000001ef4600acb5: vmovdqu %ymm0,0xf0(%r14,%r11,4)  ;*fastore
{reexecute=0 rethrow=0 return_oop=0}

                                                ; -
com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@61
(line 41)

  0x000001ef4600acbf: add    $0x40,%r13d        ;*iinc {reexecute=0
rethrow=0 return_oop=0}

                                                ; -
com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@62
(line 40)

  0x000001ef4600acc3: cmp    %eax,%r13d

  0x000001ef4600acc6: jl     0x000001ef4600abf0  ;*goto {reexecute=0
rethrow=0 return_oop=0}

                                                ; -
com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@65
(line 40)



In the real application, this method is actually inlined in a higher level
method named AVector.plus(). Unfortunately, the inlined version of the
aggregation code is not vectorized anymore:



  0x000001ef460180a0: cmp    %ebx,%r11d

  0x000001ef460180a3: jae    0x000001ef460180e6

  0x000001ef460180a5: vmovss 0x10(%r8,%r11,4),%xmm1  ;*faload {reexecute=0
rethrow=0 return_oop=0}

                                                ; -
com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@54
(line 41)

                                                ; -
com.qfs.vector.impl.AVector::plus@17 (line 204)

  0x000001ef460180ac: cmp    %ecx,%r11d

  0x000001ef460180af: jae    0x000001ef46018104

  0x000001ef460180b1: vaddss 0x10(%r9,%r11,4),%xmm1,%xmm1

  0x000001ef460180b8: vmovss %xmm1,0x10(%r8,%r11,4)  ;*fastore {reexecute=0
rethrow=0 return_oop=0}

                                                ; -
com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@61
(line 41)

                                                ; -
com.qfs.vector.impl.AVector::plus@17 (line 204)

  0x000001ef460180bf: inc    %r11d              ;*iinc {reexecute=0
rethrow=0 return_oop=0}

                                                ; -
com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@62
(line 40)

                                                ; -
com.qfs.vector.impl.AVector::plus@17 (line 204)

  0x000001ef460180c2: cmp    %r10d,%r11d

  0x000001ef460180c5: jl     0x000001ef460180a0  ;*goto {reexecute=0
rethrow=0 return_oop=0}

                                                ; -
com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@65
(line 40)

                                                ; -
com.qfs.vector.impl.AVector::plus@17 (line 204)



This causes a significant performance drop, compared to a run where we
explicitly disable the inlining and observe automatically vectorized code
again (
-XX:CompileCommand=dontinline,com/qfs/vector/binding/impl/ArrayFloatToArrayFloatVectorBinding.plus
).


How do you guys explain that behavior of the JIT compiler? Is this a known
and tracked issue, could it be fixed in the JVM? Can we do something in the
java code to prevent this from happening?


Best regards,

Nicolas Heutte
Reply | Threaded
Open this post in threaded view
|

Re: SuperWord loop optimization lost after method inlining

Vladimir Kozlov
Hi, Nicolas

Looks like, when inlined, the loop from ArrayFloatToArrayFloatVectorBinding::plus() was not optimized at all: it is not
unrolled and has range checks. Such loops are not vectorized (you need unrolling and no checks).

What Java version you are running? What HotSpot VM flags you are using when running application?

Run application with -XX:+LogCompilation and look on compilation data in hotspot_pid<PID>.log file for caller
AVector::plus().

VM also has several flags to trace loop optimizations but they are only available in debug VM build. If you have access
to such build run with -XX:+PrintCompilation -XX:+TraceLoopOpts flags.

Thanks,
Vladimir K

On 2/10/21 9:24 AM, Nicolas Heutte wrote:

> Hi all,
>
> I am encountering a performance issue caused by the interaction between
> method inlining and automatic vectorization.
>
> Our application aggregates arrays intensively using a method named
> ArrayFloatToArrayFloatVectorBinding.plus() with the following code:
>
>      for (int i = 0; i < srcLen; ++i) {
>
>              dstArray[i] += srcArray[i];
>
>      }
>
> When we microbenchmark this method we observe fast performance close to the
> practical memory bandwidth and when we print the assembly code we observe
> loop unrolling and automatic vectorization with SIMD instructions.
>
>    0x000001ef4600abf0: vmovdqu 0x10(%r14,%r13,4),%ymm0
>
>    0x000001ef4600abf7: vaddps 0x10(%rcx,%r13,4),%ymm0,%ymm0
>
>    0x000001ef4600abfe: vmovdqu %ymm0,0x10(%r14,%r13,4)
>
>    0x000001ef4600ac05: movslq %r13d,%r11
>
>    0x000001ef4600ac08: vmovdqu 0x30(%r14,%r11,4),%ymm0
>
>    0x000001ef4600ac0f: vaddps 0x30(%rcx,%r11,4),%ymm0,%ymm0
>
>    0x000001ef4600ac16: vmovdqu %ymm0,0x30(%r14,%r11,4)
>
>    0x000001ef4600ac1d: vmovdqu 0x50(%r14,%r11,4),%ymm0
>
>    0x000001ef4600ac24: vaddps 0x50(%rcx,%r11,4),%ymm0,%ymm0
>
>    0x000001ef4600ac2b: vmovdqu %ymm0,0x50(%r14,%r11,4)
>
>    0x000001ef4600ac32: vmovdqu 0x70(%r14,%r11,4),%ymm0
>
>    0x000001ef4600ac39: vaddps 0x70(%rcx,%r11,4),%ymm0,%ymm0
>
>    0x000001ef4600ac40: vmovdqu %ymm0,0x70(%r14,%r11,4)
>
>    0x000001ef4600ac47: vmovdqu 0x90(%r14,%r11,4),%ymm0
>
>    0x000001ef4600ac51: vaddps 0x90(%rcx,%r11,4),%ymm0,%ymm0
>
>    0x000001ef4600ac5b: vmovdqu %ymm0,0x90(%r14,%r11,4)
>
>    0x000001ef4600ac65: vmovdqu 0xb0(%r14,%r11,4),%ymm0
>
>    0x000001ef4600ac6f: vaddps 0xb0(%rcx,%r11,4),%ymm0,%ymm0
>
>    0x000001ef4600ac79: vmovdqu %ymm0,0xb0(%r14,%r11,4)
>
>    0x000001ef4600ac83: vmovdqu 0xd0(%r14,%r11,4),%ymm0
>
>    0x000001ef4600ac8d: vaddps 0xd0(%rcx,%r11,4),%ymm0,%ymm0
>
>    0x000001ef4600ac97: vmovdqu %ymm0,0xd0(%r14,%r11,4)
>
>    0x000001ef4600aca1: vmovdqu 0xf0(%r14,%r11,4),%ymm0
>
>    0x000001ef4600acab: vaddps 0xf0(%rcx,%r11,4),%ymm0,%ymm0
>
>    0x000001ef4600acb5: vmovdqu %ymm0,0xf0(%r14,%r11,4)  ;*fastore
> {reexecute=0 rethrow=0 return_oop=0}
>
>                                                  ; -
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@61
> (line 41)
>
>    0x000001ef4600acbf: add    $0x40,%r13d        ;*iinc {reexecute=0
> rethrow=0 return_oop=0}
>
>                                                  ; -
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@62
> (line 40)
>
>    0x000001ef4600acc3: cmp    %eax,%r13d
>
>    0x000001ef4600acc6: jl     0x000001ef4600abf0  ;*goto {reexecute=0
> rethrow=0 return_oop=0}
>
>                                                  ; -
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@65
> (line 40)
>
>
>
> In the real application, this method is actually inlined in a higher level
> method named AVector.plus(). Unfortunately, the inlined version of the
> aggregation code is not vectorized anymore:
>
>
>
>    0x000001ef460180a0: cmp    %ebx,%r11d
>
>    0x000001ef460180a3: jae    0x000001ef460180e6
>
>    0x000001ef460180a5: vmovss 0x10(%r8,%r11,4),%xmm1  ;*faload {reexecute=0
> rethrow=0 return_oop=0}
>
>                                                  ; -
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@54
> (line 41)
>
>                                                  ; -
> com.qfs.vector.impl.AVector::plus@17 (line 204)
>
>    0x000001ef460180ac: cmp    %ecx,%r11d
>
>    0x000001ef460180af: jae    0x000001ef46018104
>
>    0x000001ef460180b1: vaddss 0x10(%r9,%r11,4),%xmm1,%xmm1
>
>    0x000001ef460180b8: vmovss %xmm1,0x10(%r8,%r11,4)  ;*fastore {reexecute=0
> rethrow=0 return_oop=0}
>
>                                                  ; -
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@61
> (line 41)
>
>                                                  ; -
> com.qfs.vector.impl.AVector::plus@17 (line 204)
>
>    0x000001ef460180bf: inc    %r11d              ;*iinc {reexecute=0
> rethrow=0 return_oop=0}
>
>                                                  ; -
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@62
> (line 40)
>
>                                                  ; -
> com.qfs.vector.impl.AVector::plus@17 (line 204)
>
>    0x000001ef460180c2: cmp    %r10d,%r11d
>
>    0x000001ef460180c5: jl     0x000001ef460180a0  ;*goto {reexecute=0
> rethrow=0 return_oop=0}
>
>                                                  ; -
> com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@65
> (line 40)
>
>                                                  ; -
> com.qfs.vector.impl.AVector::plus@17 (line 204)
>
>
>
> This causes a significant performance drop, compared to a run where we
> explicitly disable the inlining and observe automatically vectorized code
> again (
> -XX:CompileCommand=dontinline,com/qfs/vector/binding/impl/ArrayFloatToArrayFloatVectorBinding.plus
> ).
>
>
> How do you guys explain that behavior of the JIT compiler? Is this a known
> and tracked issue, could it be fixed in the JVM? Can we do something in the
> java code to prevent this from happening?
>
>
> Best regards,
>
> Nicolas Heutte
>
Reply | Threaded
Open this post in threaded view
|

Re: SuperWord loop optimization lost after method inlining

Nicolas Heutte
Hi Vladimir,

Thank you for your help.

I'm currently running Java 11.0.9, and I did not use any VM flag of note.

I checked the content of the compilation log, and it seems that
ArrayFloatToArrayFloatVectorBinding::plus() was deoptimized in order to
allow AVector::plus() to be compiled:

<writer thread='11576'/>
<task_queued compile_id='17280' method='com.qfs.vector.impl.AVector plus
(Lcom/qfs/vector/IVector;)V' bytes='23' count='916' iicount='916' level='3'
stamp='7394.056' comment='tiered' hot_count='896'/>
<writer thread='15784'/>
<deoptimized thread='15784' reason='constraint' pc='0x00000296d0785b94'
compile_id='17257' compiler='c1' level='3'>
<jvms bci='65'
method='com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding
plus (Lcom/qfs/vector/IVector;Lcom/qfs/vector/IVector;)V' bytes='69'
count='909' backedge_count='155602' iicount='910'/>
</deoptimized>

The last compilation entry for AVector::plus() is:

<writer thread='16380'/>
<nmethod compile_id='17317' compiler='c2' level='4'
entry='0x00000296d6af32c0' size='1960' address='0x00000296d6af3110'
relocation_offset='376' insts_offset='432' stub_offset='1040'
scopes_data_offset='1152' scopes_pcs_offset='1592'
dependencies_offset='1880' nul_chk_table_offset='1896' oops_offset='1064'
metadata_offset='1072' method='com.qfs.vector.impl.AVector plus
(Lcom/qfs/vector/IVector;)V' bytes='23' count='172425' iicount='172425'
stamp='7394.199'/>
<make_not_entrant thread='16380' compile_id='17280' compiler='c1' level='2'
stamp='7394.199'/>
                              @ 1
com.qfs.vector.array.impl.ArrayFloatVector::getBindingId (4 bytes)   inline
(hot)
                               \-&gt; TypeProfile (14552/14552 counts) =
com/qfs/vector/array/impl/ArrayFloatVector
                              @ 7
com.qfs.vector.array.impl.ArrayFloatVector::getBindingId (4 bytes)   inline
(hot)
                               \-&gt; TypeProfile (14150/14150 counts) =
com/qfs/vector/array/impl/ArrayFloatVector
                              @ 10
com.qfs.vector.binding.impl.VectorBindings::getBinding (9 bytes)   inline
(hot)
                                @ 5
com.qfs.vector.binding.impl.VectorBindings$VectorBindingsProvider::getBinding
(22 bytes)   inline (hot)
                                  @ 3
com.qfs.vector.binding.impl.VectorBindings$VectorBindingsProvider::hasBinding
(34 bytes)   inline (hot)
                              @ 17
com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus (69
bytes)   inline (hot)
                               \-&gt; TypeProfile (14054/14054 counts) =
com/qfs/vector/binding/impl/ArrayFloatToArrayFloatVectorBinding
                                @ 12
com.qfs.vector.array.impl.ArrayFloatVector::size (6 bytes)   inline (hot)
                                @ 22
com.qfs.vector.impl.AVector::checkIndex (37 bytes)   inline (hot)
                                  @ 6
com.qfs.vector.array.impl.ArrayFloatVector::size (6 bytes)   inline (hot)
                                @ 27
com.qfs.vector.array.impl.ArrayFloatVector::getUnderlying (5 bytes)
accessor
                                @ 34
com.qfs.vector.array.impl.ArrayFloatVector::getUnderlying (5 bytes)
accessor
<writer thread='15896'/>

Unfortunately, I do not have access to a debug VM build, so I cannot run
the second test you recommend.

Best regards,
Nicolas Heutte

On Wed, Feb 10, 2021 at 7:36 PM Vladimir Kozlov <[hidden email]>
wrote:

> Hi, Nicolas
>
> Looks like, when inlined, the loop from
> ArrayFloatToArrayFloatVectorBinding::plus() was not optimized at all: it is
> not
> unrolled and has range checks. Such loops are not vectorized (you need
> unrolling and no checks).
>
> What Java version you are running? What HotSpot VM flags you are using
> when running application?
>
> Run application with -XX:+LogCompilation and look on compilation data in
> hotspot_pid<PID>.log file for caller
> AVector::plus().
>
> VM also has several flags to trace loop optimizations but they are only
> available in debug VM build. If you have access
> to such build run with -XX:+PrintCompilation -XX:+TraceLoopOpts flags.
>
> Thanks,
> Vladimir K
>
> On 2/10/21 9:24 AM, Nicolas Heutte wrote:
> > Hi all,
> >
> > I am encountering a performance issue caused by the interaction between
> > method inlining and automatic vectorization.
> >
> > Our application aggregates arrays intensively using a method named
> > ArrayFloatToArrayFloatVectorBinding.plus() with the following code:
> >
> >      for (int i = 0; i < srcLen; ++i) {
> >
> >              dstArray[i] += srcArray[i];
> >
> >      }
> >
> > When we microbenchmark this method we observe fast performance close to
> the
> > practical memory bandwidth and when we print the assembly code we observe
> > loop unrolling and automatic vectorization with SIMD instructions.
> >
> >    0x000001ef4600abf0: vmovdqu 0x10(%r14,%r13,4),%ymm0
> >
> >    0x000001ef4600abf7: vaddps 0x10(%rcx,%r13,4),%ymm0,%ymm0
> >
> >    0x000001ef4600abfe: vmovdqu %ymm0,0x10(%r14,%r13,4)
> >
> >    0x000001ef4600ac05: movslq %r13d,%r11
> >
> >    0x000001ef4600ac08: vmovdqu 0x30(%r14,%r11,4),%ymm0
> >
> >    0x000001ef4600ac0f: vaddps 0x30(%rcx,%r11,4),%ymm0,%ymm0
> >
> >    0x000001ef4600ac16: vmovdqu %ymm0,0x30(%r14,%r11,4)
> >
> >    0x000001ef4600ac1d: vmovdqu 0x50(%r14,%r11,4),%ymm0
> >
> >    0x000001ef4600ac24: vaddps 0x50(%rcx,%r11,4),%ymm0,%ymm0
> >
> >    0x000001ef4600ac2b: vmovdqu %ymm0,0x50(%r14,%r11,4)
> >
> >    0x000001ef4600ac32: vmovdqu 0x70(%r14,%r11,4),%ymm0
> >
> >    0x000001ef4600ac39: vaddps 0x70(%rcx,%r11,4),%ymm0,%ymm0
> >
> >    0x000001ef4600ac40: vmovdqu %ymm0,0x70(%r14,%r11,4)
> >
> >    0x000001ef4600ac47: vmovdqu 0x90(%r14,%r11,4),%ymm0
> >
> >    0x000001ef4600ac51: vaddps 0x90(%rcx,%r11,4),%ymm0,%ymm0
> >
> >    0x000001ef4600ac5b: vmovdqu %ymm0,0x90(%r14,%r11,4)
> >
> >    0x000001ef4600ac65: vmovdqu 0xb0(%r14,%r11,4),%ymm0
> >
> >    0x000001ef4600ac6f: vaddps 0xb0(%rcx,%r11,4),%ymm0,%ymm0
> >
> >    0x000001ef4600ac79: vmovdqu %ymm0,0xb0(%r14,%r11,4)
> >
> >    0x000001ef4600ac83: vmovdqu 0xd0(%r14,%r11,4),%ymm0
> >
> >    0x000001ef4600ac8d: vaddps 0xd0(%rcx,%r11,4),%ymm0,%ymm0
> >
> >    0x000001ef4600ac97: vmovdqu %ymm0,0xd0(%r14,%r11,4)
> >
> >    0x000001ef4600aca1: vmovdqu 0xf0(%r14,%r11,4),%ymm0
> >
> >    0x000001ef4600acab: vaddps 0xf0(%rcx,%r11,4),%ymm0,%ymm0
> >
> >    0x000001ef4600acb5: vmovdqu %ymm0,0xf0(%r14,%r11,4)  ;*fastore
> > {reexecute=0 rethrow=0 return_oop=0}
> >
> >                                                  ; -
> > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@61
> > (line 41)
> >
> >    0x000001ef4600acbf: add    $0x40,%r13d        ;*iinc {reexecute=0
> > rethrow=0 return_oop=0}
> >
> >                                                  ; -
> > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@62
> > (line 40)
> >
> >    0x000001ef4600acc3: cmp    %eax,%r13d
> >
> >    0x000001ef4600acc6: jl     0x000001ef4600abf0  ;*goto {reexecute=0
> > rethrow=0 return_oop=0}
> >
> >                                                  ; -
> > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@65
> > (line 40)
> >
> >
> >
> > In the real application, this method is actually inlined in a higher
> level
> > method named AVector.plus(). Unfortunately, the inlined version of the
> > aggregation code is not vectorized anymore:
> >
> >
> >
> >    0x000001ef460180a0: cmp    %ebx,%r11d
> >
> >    0x000001ef460180a3: jae    0x000001ef460180e6
> >
> >    0x000001ef460180a5: vmovss 0x10(%r8,%r11,4),%xmm1  ;*faload
> {reexecute=0
> > rethrow=0 return_oop=0}
> >
> >                                                  ; -
> > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@54
> > (line 41)
> >
> >                                                  ; -
> > com.qfs.vector.impl.AVector::plus@17 (line 204)
> >
> >    0x000001ef460180ac: cmp    %ecx,%r11d
> >
> >    0x000001ef460180af: jae    0x000001ef46018104
> >
> >    0x000001ef460180b1: vaddss 0x10(%r9,%r11,4),%xmm1,%xmm1
> >
> >    0x000001ef460180b8: vmovss %xmm1,0x10(%r8,%r11,4)  ;*fastore
> {reexecute=0
> > rethrow=0 return_oop=0}
> >
> >                                                  ; -
> > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@61
> > (line 41)
> >
> >                                                  ; -
> > com.qfs.vector.impl.AVector::plus@17 (line 204)
> >
> >    0x000001ef460180bf: inc    %r11d              ;*iinc {reexecute=0
> > rethrow=0 return_oop=0}
> >
> >                                                  ; -
> > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@62
> > (line 40)
> >
> >                                                  ; -
> > com.qfs.vector.impl.AVector::plus@17 (line 204)
> >
> >    0x000001ef460180c2: cmp    %r10d,%r11d
> >
> >    0x000001ef460180c5: jl     0x000001ef460180a0  ;*goto {reexecute=0
> > rethrow=0 return_oop=0}
> >
> >                                                  ; -
> > com.qfs.vector.binding.impl.ArrayFloatToArrayFloatVectorBinding::plus@65
> > (line 40)
> >
> >                                                  ; -
> > com.qfs.vector.impl.AVector::plus@17 (line 204)
> >
> >
> >
> > This causes a significant performance drop, compared to a run where we
> > explicitly disable the inlining and observe automatically vectorized code
> > again (
> >
> -XX:CompileCommand=dontinline,com/qfs/vector/binding/impl/ArrayFloatToArrayFloatVectorBinding.plus
> > ).
> >
> >
> > How do you guys explain that behavior of the JIT compiler? Is this a
> known
> > and tracked issue, could it be fixed in the JVM? Can we do something in
> the
> > java code to prevent this from happening?
> >
> >
> > Best regards,
> >
> > Nicolas Heutte
> >
>