Sum of integers optimization


Sum of integers optimization

Ionut
Hello All,

  I am playing with the example below (very trivial, just computing the sum of the integers 1...N):

@Benchmark
public long sum() {
    long sum = 0;
    for (int i = 1; i <= N; i++) {
        sum += i;
    }
    return sum;
}
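
For reference, a minimal JMH harness this snippet could sit in looks like the following (the class name and the value of N are assumptions on my part; N is guessed from the loop bound visible in the asm below):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class SumBench {
    // Assumed bound; the cmp against 0x5f5e0f2 below suggests roughly 100_000_000.
    int N = 100_000_000;

    @Benchmark
    public long sum() {
        long sum = 0;
        for (int i = 1; i <= N; i++) {
            sum += i;
        }
        return sum;
    }
}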


Generated asm on my machine (snapshot from the main scalar loop):
                      ......................................................
                 ↗  0x00007f4779bff060: movsxd r10,r11d
                 │  0x00007f4779bff063: add    rax,r10
  7.67%  24.83%  │  0x00007f4779bff066: add    rax,r10
  6.11%   3.64%  │  0x00007f4779bff069: add    rax,r10
  4.54%   3.71%  │  0x00007f4779bff06c: add    rax,r10
  6.12%   5.85%  │  0x00007f4779bff06f: add    rax,r10
  5.75%   4.21%  │  0x00007f4779bff072: add    rax,r10
  5.96%   4.38%  │  0x00007f4779bff075: add    rax,r10
  4.23%   3.63%  │  0x00007f4779bff078: add    rax,r10
  6.70%   6.32%  │  0x00007f4779bff07b: add    rax,r10
  7.40%   4.56%  │  0x00007f4779bff07e: add    rax,r10
  4.61%   3.31%  │  0x00007f4779bff081: add    rax,r10
  5.45%   5.24%  │  0x00007f4779bff084: add    rax,r10
  5.99%   5.14%  │  0x00007f4779bff087: add    rax,r10
  7.70%   5.36%  │  0x00007f4779bff08a: add    rax,r10
  5.17%   4.16%  │  0x00007f4779bff08d: add    rax,r10
  3.97%   3.83%  │  0x00007f4779bff090: add    rax,r10
  4.80%   3.97%  │  0x00007f4779bff093: add    rax,0x78
  5.92%   5.97%  │  0x00007f4779bff097: add    r11d,0x10
          0.01%  │  0x00007f4779bff09b: cmp    r11d,0x5f5e0f2
                 ╰  0x00007f4779bff0a2: jl     0x00007f4779bff060
                      ......................................................

Questions
- Would it be possible for the C2 JIT to perform a better optimization in this context, for example replacing the main loop (which might be costly) with the closed-form reduction N*(N+1)/2 (in this specific case)?
- Is there any context where C2 can already perform such an optimization that I am missing?
- If not, what prevents it from doing this?

Thanks
Ionut



Re: Sum of integers optimization

Krystal Mok
Hi Ionut,

tl;dr: C2's infrastructure for optimizing loops can be made a lot stronger, but from the current directions we can see around the OpenJDK community, it's very unlikely for C2 to receive a major infrastructural upgrade in the future. If you'd like to contribute to Graal to help optimize this kind of code, I'm sure a lot of us in the community would love that.

You're right about the code produced by C2. I just ran your example on JDK 9/macOS, and the main loop C2 produces is:

  0x0000000118ee6640: movslq %r11d,%r10         ;*i2l {reexecute=0 rethrow=0 return_oop=0}
                                                ; - XYZ::sum@12 (line 7)

  0x0000000118ee6643: add    %r10,%rax
  0x0000000118ee6646: add    %r10,%rax
  0x0000000118ee6649: add    %r10,%rax
  0x0000000118ee664c: add    %r10,%rax
  0x0000000118ee664f: add    %r10,%rax
  0x0000000118ee6652: add    %r10,%rax
  0x0000000118ee6655: add    %r10,%rax
  0x0000000118ee6658: add    %r10,%rax
  0x0000000118ee665b: add    %r10,%rax
  0x0000000118ee665e: add    %r10,%rax
  0x0000000118ee6661: add    %r10,%rax
  0x0000000118ee6664: add    %r10,%rax
  0x0000000118ee6667: add    %r10,%rax
  0x0000000118ee666a: add    %r10,%rax
  0x0000000118ee666d: add    %r10,%rax
  0x0000000118ee6670: add    %r10,%rax
  0x0000000118ee6673: add    $0x78,%rax         ;*ladd {reexecute=0 rethrow=0 return_oop=0}
                                                ; - XYZ::sum@13 (line 7)

  0x0000000118ee6677: add    $0x10,%r11d        ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                ; - XYZ::sum@15 (line 6)

Pretty much the same as what you saw.

It's certainly possible to tweak C2 or some other JIT compiler to optimize this test case better. I don't have a copy of Zing right now, but I believe its Falcon compiler will compile this down to the N*(N+1)/2 closed form you expected, since LLVM, which Falcon is based on, can compile this piece of C code:

#include <stdint.h>

int64_t sum(int n) {
  int64_t sum = 0;
  for (int32_t i = 1; i <= n; i++) {
    sum += i;
  }
  return sum;
}

Down to:

sum:
  test edi, edi
  jle .LBB0_1
  lea eax, [rdi - 1]
  add edi, -2
  imul rdi, rax
  shr rdi
  lea rax, [rdi + 2*rax + 1]
  ret
.LBB0_1:
  xor eax, eax
  ret
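
Reading that back by hand (my arithmetic, not anything the compiler prints): with %edi = n on entry, the lea gives rax = n - 1, the add gives edi = n - 2, the imul and shr give rdi = (n - 2) * (n - 1) / 2, and the final lea computes (n - 2) * (n - 1) / 2 + 2 * (n - 1) + 1 = (n^2 + n) / 2 = n * (n + 1) / 2, with the jle guard covering n <= 0.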

For this test case, C2 could at least do a few things to generate better code:
1. A better expression canonicalizer that flattens expression trees. The chain of adds you see in the resulting code is there because the 16x-unrolled sum += i is turned into:

// 120 == 0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 + 12 + 13 + 14 + 15
sum = ((((((((((((((((sum + i) + i) + i) + i) + i) + i) + i) + i) + i) + i) + i) + i) + i) + i) + i) + i) + 120

See how the additions involving i are skewed to the left, effectively degenerating the expression tree into a "linked list of additions". C2's value numbering, on its own, doesn't recognize that it can reassociate the expression into a flatter tree, e.g.

((((i + i) + (i + i)) + ((i + i) + (i + i))) + (((i + i) + (i + i)) + ((i + i) + (i + i)))) + sum + 120

in which case C2's value numbering would be able to turn it into:

t1 = i + i
t2 = t1 + t1
t3 = t2 + t2
t4 = t3 + t3
sum = t4 + sum + 120

and then into sum = (i << 4) + sum + 120.

This kind of reassociation would at least help make the loop body better, without involving any complicated loop optimizations. The "tree flattening" reassociation can actually be implemented by directly linearizing an expression tree into a C0*X + C1*Y + ... + Cn form.
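
A toy sketch of that linearization in plain Java (the Expr/Var/Const/Add types are made up purely for illustration; this is nothing like C2's real IR):

import java.util.HashMap;
import java.util.Map;

public class FlattenAdds {
    interface Expr {}
    record Var(String name) implements Expr {}
    record Const(long value) implements Expr {}
    record Add(Expr left, Expr right) implements Expr {}

    // Walk an add-only expression tree and collect it into
    // "coefficient per variable + folded constant" form.
    static void collect(Expr e, Map<String, Long> coeffs, long[] constant) {
        if (e instanceof Add a) {
            collect(a.left(), coeffs, constant);
            collect(a.right(), coeffs, constant);
        } else if (e instanceof Var v) {
            coeffs.merge(v.name(), 1L, Long::sum);
        } else if (e instanceof Const c) {
            constant[0] += c.value();
        }
    }

    public static void main(String[] args) {
        // ((((sum + i) + i) + i) + i) + 120  -- a 4x version of the skewed chain above
        Expr e = new Add(new Add(new Add(new Add(new Add(
                new Var("sum"), new Var("i")), new Var("i")), new Var("i")), new Var("i")),
                new Const(120));
        Map<String, Long> coeffs = new HashMap<>();
        long[] constant = {0};
        collect(e, coeffs, constant);
        System.out.println(coeffs + " + " + constant[0]); // e.g. {sum=1, i=4} + 120
    }
}

With the full 16x chain the coefficient of i comes out as 16, i.e. the (i << 4) mentioned above, and all the constants fold into the single +120.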

To get to the end goal of optimizing the whole loop into the N*(N+1)/2 form, you'd need more advanced loop analysis, e.g. something akin to LLVM's SCEV (scalar evolution), to recognize how "sum" is related to the loop induction variable.
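
Just to show the target shape, the closed form such an analysis would let the compiler substitute is something like this hand-written sketch (not compiler output):

static long sumClosedForm(int n) {
    if (n <= 0) {
        return 0;                    // the loop body never runs for n <= 0
    }
    return (long) n * (n + 1) / 2;   // sum of 1..n; the multiply is done in 64 bits
}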

BTW, Graal from graalvm-0.22 generates a straightforward loop for this case:

XYZ.sum (null)  [0x000000010cc091e0, 0x000000010cc09230]  80 bytes
[Entry Point]
[Verified Entry Point]
[Constants]
  # {method} {0x000000011ebfd768} 'sum' '()J' in 'XYZ'
  #           [sp+0x10]  (sp of caller)
  0x000000010cc091e0: nopl   0x0(%rax,%rax,1)
  0x000000010cc091e5: mov    $0x1,%r10d
  0x000000010cc091eb: mov    $0x0,%rax
  0x000000010cc091f2: jmpq   0x000000010cc0920f
  0x000000010cc091f7: nopw   0x0(%rax,%rax,1)   ;*if_icmpgt {reexecute=0 rethrow=0 return_oop=0}
                                                ; - XYZ::sum@7 (line 6)

  0x000000010cc09200: mov    %r10d,%r11d
  0x000000010cc09203: inc    %r11d              ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                ; - XYZ::sum@15 (line 6)

  0x000000010cc09206: movslq %r10d,%r10         ;*i2l {reexecute=0 rethrow=0 return_oop=0}
                                                ; - XYZ::sum@12 (line 7)

  0x000000010cc09209: add    %r10,%rax          ;*ladd {reexecute=0 rethrow=0 return_oop=0}
                                                ; - XYZ::sum@13 (line 7)

  0x000000010cc0920c: mov    %r11d,%r10d        ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                ; - XYZ::sum@18 (line 6)

  0x000000010cc0920f: cmp    $0x186a1,%r10d
  0x000000010cc09216: jl     0x000000010cc09200  ;*if_icmpgt {reexecute=0 rethrow=0 return_oop=0}
                                                ; - XYZ::sum@7 (line 6)

  0x000000010cc09218: test   %eax,-0x1d69218(%rip)        # 0x000000010aea0006
                                                ;   {poll_return}
  0x000000010cc0921e: vzeroupper 
  0x000000010cc09221: retq

- Kris
