RFR(L): 8186027: C2: loop strip mining

RFR(L): 8186027: C2: loop strip mining

Roland Westrelin-3

http://cr.openjdk.java.net/~roland/8186027/webrev.00/

This converts loop:

for (int i = start; i < stop; i += inc) {
  // body
}

to a loop nest:

i = start;
if (i < stop) {
  do {
    int next = MIN(stop, i+LoopStripMiningIter*inc);
    do {
      // body
      i += inc;
    } while (i < next);
    safepoint();
  } while (i < stop);
}

(It's actually:
int next = MIN(stop - i, LoopStripMiningIter*inc) + i;
to protect against overflows)
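To make the overflow concern concrete, here is a small standalone sketch (the method and class names are mine, not from the patch) comparing the naive and the overflow-safe computation of the inner-loop limit, and checking that the strip-mined nest visits the same indices as the flat loop:

```java
// Standalone sketch; names like safeNext/stripMinedSum are illustrative only.
public class StripMine {
    // Overflow-safe limit: the MIN is taken on the remaining distance,
    // so i + iters*inc is never formed.
    static int safeNext(int i, int stop, int inc, int iters) {
        return Math.min(stop - i, iters * inc) + i;
    }

    // Naive limit: i + iters*inc can wrap around near Integer.MAX_VALUE.
    static int naiveNext(int i, int stop, int inc, int iters) {
        return Math.min(stop, i + iters * inc);
    }

    // The strip-mined nest executes the body for exactly the same
    // indices as the flat loop.
    static long stripMinedSum(int start, int stop, int inc, int iters) {
        long sum = 0;
        int i = start;
        if (i < stop) {
            do {
                int next = safeNext(i, stop, inc, iters);
                do {
                    sum += i; // body
                    i += inc;
                } while (i < next);
                // safepoint() would go here
            } while (i < stop);
        }
        return sum;
    }

    static long flatSum(int start, int stop, int inc) {
        long sum = 0;
        for (int i = start; i < stop; i += inc) sum += i;
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(stripMinedSum(0, 10_000, 3, 1000) == flatSum(0, 10_000, 3)); // true
        // Naive version wraps to a negative value near MAX_VALUE...
        System.out.println(naiveNext(Integer.MAX_VALUE - 5, Integer.MAX_VALUE, 1, 1000) < 0); // true
        // ...while the safe version yields stop, as intended.
        System.out.println(safeNext(Integer.MAX_VALUE - 5, Integer.MAX_VALUE, 1, 1000)); // 2147483647
    }
}
```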

This should bring the best of running with UseCountedLoopSafepoints on
and running with it off: low time to safepoint with little to no impact
on throughput. That change was first pushed to the shenandoah repo
several months ago and we've been running with it enabled since.

The command line argument LoopStripMiningIter is the number of
iterations between safepoints. In practice, with an arbitrary
LoopStripMiningIter=1000, we observe time to safepoint on par with the
current -XX:+UseCountedLoopSafepoints and most performance regressions
due to -XX:+UseCountedLoopSafepoints gone. The exception is when an
inner counted loop runs for a low number of iterations on average (and
the compiler doesn't have an upper bound on the number of iterations).

This is enabled on the command line with:
-XX:+UseCountedLoopSafepoints -XX:LoopStripMiningIter=1000

In PhaseIdealLoop::is_counted_loop(), when loop strip mining is enabled,
for an inner loop, the compiler builds a skeleton outer loop around the
counted loop. The outer loop is kept as simple as possible so that the
required adjustments to the existing loop optimizations are not too
intrusive. The reason the outer loop is inserted early in the
optimization process is so that optimizations are not disrupted: an
alternate implementation could have kept the safepoint in the counted
loop until loop opts are over and then only have added the outer loop
and moved the safepoint to the outer loop. That would, for instance, have
prevented nodes referenced by the safepoint from being sunk out of the
loop.

The outer loop is a LoopNode with a backedge to a loop exit test and a
safepoint. The loop exit test is a CmpI with a new Opaque5Node. The
skeleton loop is populated with all required Phis after loop opts are
over during macro expansion. At that point only, the loop exit tests are
adjusted so the inner loop runs for at most LoopStripMiningIter. If the
compiler can prove the inner loop runs for no more than
LoopStripMiningIter then during macro expansion, the outer loop is
removed. The safepoint is removed only if the inner loop executes for
fewer than LoopStripMiningIterShortLoop iterations, so that if there are
several counted loops in a row, we still poll for safepoints regularly.
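As a rough model of that macro-expansion decision (my own sketch, not C2 code; the parameter names and result strings are illustrative only):

```java
// Hypothetical model of the decision described above: what happens to the
// skeleton outer loop and its safepoint once loop opts are over.
public class StripMineExpand {
    static String expand(long provenMaxIters, long stripMiningIter, long shortLoopIter) {
        if (provenMaxIters > stripMiningIter) {
            // Can't prove the inner loop is short enough: keep the nest.
            return "keep outer loop and safepoint";
        }
        if (provenMaxIters >= shortLoopIter) {
            // Outer loop is useless, but keep polling between loops.
            return "remove outer loop, keep safepoint";
        }
        // Very short loop: drop the poll as well.
        return "remove outer loop and safepoint";
    }

    public static void main(String[] args) {
        System.out.println(expand(10_000, 1000, 100)); // keep outer loop and safepoint
        System.out.println(expand(500, 1000, 100));    // remove outer loop, keep safepoint
        System.out.println(expand(50, 1000, 100));     // remove outer loop and safepoint
    }
}
```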

Until macro expansion, there can be only a few extra nodes in the outer
loop: nodes that would have sunk out of the inner loop and are kept in
the outer loop by the safepoint.

PhaseIdealLoop::clone_loop(), which is used by most loop opts, now has
several ways of cloning a counted loop. For loop unswitching, both inner
and outer loops need to be cloned. For unrolling, only the inner loop
needs to be cloned. For pre/post loops insertion, only the inner loop
needs to be cloned but the control flow must connect one of the inner
loop copies to the outer loop of the other copy.

Beyond verifying performance results with the usual benchmarks, when I
implemented this change I wrote test cases for (hopefully) every loop
optimization and verified by inspection of the generated code that each
loop opt triggers correctly with loop strip mining.

Roland.

Re: RFR(L): 8186027: C2: loop strip mining

Nils Eliasson
Hi Roland,

I have started reviewing and testing. I will sponsor your change when
the full review is completed.

Best Regards,

Nils


On 2017-10-03 15:19, Roland Westrelin wrote:



Re: RFR(L): 8186027: C2: loop strip mining

Roland Westrelin-3

> I have started reviewing and testing I will sponsor your change when the
> full review is completed.

Thanks!

Roland.

Re: RFR(L): 8186027: C2: loop strip mining

Nils Eliasson
Hi Roland,

Sorry for the delay.

First - It's a very impressive work you have done!

Currently your patch doesn't apply cleanly. The fix of JDK-8189067
changes loopopts.cpp.

I have run your code (based on jdk10 before JDK-8189067) through
testing. I encountered a minor build problem on solaris_x64 (patch
below), otherwise it was stable with no encountered test failures.

I have also run performance testing, with the conclusion that no
significant regression can be seen. In some benchmarks, like
scimark.sparse.large, which has a known safepointing issue
(https://bugs.openjdk.java.net/browse/JDK-8177704), very good results
can be seen.

scimark.sparse.large using G1:
-XX:-UseCountedLoopSafepoints (default) ~86 ops/m
-XX:+UseCountedLoopSafepoints  ~106 ops/m
-XX:+UseCountedLoopSafepoints -XX:LoopStripMiningIter=1000 ~111 ops/m

The positive results lead us to the conclusion that we would like
UseCountedLoopSafepoints to default to true, and LoopStripMiningIter
to default to 1000.

c2_globals.hpp:

-  product(bool, UseCountedLoopSafepoints, false,
+  product(bool, UseCountedLoopSafepoints, true,

-  product(uintx, LoopStripMiningIter, 0,
+  product(uintx, LoopStripMiningIter, 1000,

solaris_x64 complained about type conversion:

src/hotspot/share/opto/loopopts.cpp:
@@ -1729,7 +1729,7 @@
      Node* l = cl->outer_loop();
      Node* tail = cl->outer_loop_tail();
      IfNode* le = cl->outer_loop_end();
-    Node* sfpt = cl->outer_safepoint();
+    Node* sfpt = (Node*) cl->outer_safepoint();

src/hotspot/share/opto/opaquenode.cpp
@@ -144,7 +144,7 @@
    assert(iter_estimate > 0, "broken");
    if ((jlong)scaled_iters != scaled_iters_long || iter_estimate <=
short_scaled_iters) {
      // Remove outer loop and safepoint (too few iterations)
-    Node* outer_sfpt = inner_cl->outer_safepoint();
+    Node* outer_sfpt = (Node*) inner_cl->outer_safepoint();

In the TraceLoopOpts printout I suggest changing the space to an
underscore to conform with how the other printouts look:

"PreMainPost Loop: N153/N130 limit_check predicated counted [0,int),+1
(26 iters) has_sfpt strip mined"

loopnode.cpp:1867
- tty->print(" strip mined");
+ tty->print(" strip_mined");


When your patch is updated, I will do some additional functional
testing. Also, a second reviewer is required.


Best regards,

Nils Eliasson


On 2017-10-11 15:53, Roland Westrelin wrote:


Re: RFR(L): 8186027: C2: loop strip mining

Roland Westrelin-3

Hi Nils,

Thanks for going over the patch and testing it.
Here is an updated webrev:

http://cr.openjdk.java.net/~roland/8186027/webrev.01/

I also made the changes you suggested, except for:

> src/hotspot/share/opto/loopopts.cpp:
> @@ -1729,7 +1729,7 @@
>       Node* l = cl->outer_loop();
>       Node* tail = cl->outer_loop_tail();
>       IfNode* le = cl->outer_loop_end();
> -    Node* sfpt = cl->outer_safepoint();
> +    Node* sfpt = (Node*) cl->outer_safepoint();
>
> src/hotspot/share/opto/opaquenode.cpp
> @@ -144,7 +144,7 @@
>     assert(iter_estimate > 0, "broken");
>     if ((jlong)scaled_iters != scaled_iters_long || iter_estimate <=
> short_scaled_iters) {
>       // Remove outer loop and safepoint (too few iterations)
> -    Node* outer_sfpt = inner_cl->outer_safepoint();
> +    Node* outer_sfpt = (Node*) inner_cl->outer_safepoint();

for which I used the patch below instead (I ran the build with
precompiled headers disabled to verify that change).

Roland.

diff --git a/src/hotspot/share/opto/loopopts.cpp b/src/hotspot/share/opto/loopopts.cpp
--- a/src/hotspot/share/opto/loopopts.cpp
+++ b/src/hotspot/share/opto/loopopts.cpp
@@ -26,6 +26,7 @@
 #include "memory/allocation.inline.hpp"
 #include "memory/resourceArea.hpp"
 #include "opto/addnode.hpp"
+#include "opto/callnode.hpp"
 #include "opto/castnode.hpp"
 #include "opto/connode.hpp"
 #include "opto/castnode.hpp"
@@ -845,7 +846,6 @@
               assert(n_loop->_parent == outer_loop, "broken loop tree");
             }
 #endif
-            int count = phi->replace_edge(n, n->in(MemNode::Memory));
             assert(count > 0, "inconsistent phi");
 
             // Compute latest point this store can go
diff --git a/src/hotspot/share/opto/opaquenode.cpp b/src/hotspot/share/opto/opaquenode.cpp
--- a/src/hotspot/share/opto/opaquenode.cpp
+++ b/src/hotspot/share/opto/opaquenode.cpp
@@ -24,6 +24,7 @@
 
 #include "precompiled.hpp"
 #include "opto/addnode.hpp"
+#include "opto/callnode.hpp"
 #include "opto/cfgnode.hpp"
 #include "opto/connode.hpp"
 #include "opto/divnode.hpp"


Re: RFR(L): 8186027: C2: loop strip mining

Vladimir Kozlov
In reply to this post by Roland Westrelin-3
Roland,

Did you consider a less intrusive approach: adding a branch over the SafePoint with masking on the index variable?

   int mask = LoopStripMiningMask * inc; // simplified
   for (int i = start; i < stop; i += inc) {
      // body
      if (i & mask != 0) continue;
      safepoint;
   }
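For reference, a runnable version of this masking idea might look like the sketch below (my own code, with the operator precedence fixed, since `i & mask != 0` parses as `i & (mask != 0)` in Java/C, and a power-of-two mask so the test can actually hit zero; `safepointPolls` is just a counter standing in for the real safepoint machinery):

```java
// Sketch of the masking alternative with a stride of 1; for other strides
// the masked test may rarely (or never) evaluate to zero, which the
// "simplified" caveat above glosses over.
public class MaskedSafepoint {
    static int safepointPolls = 0;

    static void safepoint() { safepointPolls++; }

    static long run(int start, int stop, int inc) {
        long sum = 0;
        final int mask = (1 << 10) - 1; // poll roughly every 1024 iterations
        for (int i = start; i < stop; i += inc) {
            sum += i; // body
            if ((i & mask) != 0) continue;
            safepoint();
        }
        return sum;
    }

    public static void main(String[] args) {
        long sum = run(0, 10_000, 1);
        System.out.println(sum);            // 49995000
        System.out.println(safepointPolls); // 10 (i == 0, 1024, ..., 9216)
    }
}
```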

Or maybe do it inside the .ad file in a new SafePoint node implementation so that the ideal graph is not affected.
I am concerned that the suggested changes may affect Range Check elimination (you changed the limit to a variable value/flag), in addition to the complexity of the changes, which may affect the stability of C2.

Thanks,
Vladimir

On 10/3/17 6:19 AM, Roland Westrelin wrote:


Re: RFR(L): 8186027: C2: loop strip mining

Roland Westrelin-3

Hi Vladimir,

Thanks for looking at this.

> Did you consider less intrusive approach by adding branch over
> SafePoint with masking on index variable?
>
>    int mask = LoopStripMiningMask * inc; // simplified
>    for (int i = start; i < stop; i += inc) {
>       // body
>       if (i & mask != 0) continue;
>       safepoint;
>    }
>
> Or may be doing it inside .ad file in new SafePoint node
> implementation so that ideal graph is not affected.

We're looking for the best trade-off between latency and throughput: we
want the safepoint poll overhead to be entirely eliminated even when the
safepoint doesn't trigger.

> I am concern that suggested changes may affect Range Check elimination
> (you changed limit to variable value/flag) in addition to complexity
> of changes which may affect stability of C2.

The CountedLoop that is created with my patch is strictly identical to
the CountedLoop created today with -UseCountedLoopSafepoints. Bounds are
not changed at that time. They are left as they are today. The
difference, with loop strip mining, is that the counted loop has a
skeleton outer loop. The bounds of the counted loop are adjusted once
loop opts are over. If the counted loop has a predicate, the predicate
is moved out of loop just as it is today. The only difference with
today is that the predicate should be moved out of the outer loop. If
pre and post loops need to be created, then the only difference with
today is that the clones need to be moved out of the outer loop, and the
logic that locates the pre loop from the main loop needs to account for
the outer loop.

It's obviously a complex change, so if your primary concern is stability
then loop strip mining can be disabled by default. With strip mining
off, the patch is mostly some code refactoring and some logic that
never triggers.

Roland.

Re: RFR(L): 8186027: C2: loop strip mining

Vladimir Kozlov
I ran pre-integration testing with the latest webrev.01 and it passed.
But give me more time to look through the changes.

Thanks,
Vladimir

On 10/25/17 7:29 AM, Roland Westrelin wrote:


Re: RFR(L): 8186027: C2: loop strip mining

Roland Westrelin-3

> I ran pre-integration testing with latest webrev.01 and it passed.
> But, give me more time to look though changes.

Sure. Thanks for testing it.

Roland.

Re: RFR(L): 8186027: C2: loop strip mining

Vladimir Kozlov
In reply to this post by Vladimir Kozlov
First observations.

src/hotspot/share/opto/c2_globals.hpp

We have uint and int types for flags now. Don't use uintx, which is 64-bit.

src/hotspot/share/runtime/arguments.cpp

I agree that UseCountedLoopSafepoints should enable strip mining by default. I am concerned about
enabling UseCountedLoopSafepoints by default. I will look at the performance data later. But for
regular/nightly testing we need to add special testing with it both on and off.

src/hotspot/share/opto/loopnode.hpp

Should we just make the _loop_flags field type uint (32-bit) since we hit the 16-bit limit?

There is some confusion (because you did not have enough bits?) about which loops are marked as
strip_mined. I thought it was only the inner loop, but it looks like the outer (skeleton) loop is
also marked as such. I would suggest marking them differently.

I was thinking maybe we should create a new LoopNode subclass for the outer loop. Then you don't
need a special flag for it, and it will be obvious what they are in the Ideal Graph. The same for
the outer loop end node.

src/hotspot/share/opto/superword.cpp

Where does the next change come from?

+      if (t2->Opcode() == Op_AddI && t2 == _lp->as_CountedLoop()->incr()) continue; // don't mess
with the iv

Thanks,
Vladimir

On 10/27/17 8:50 AM, Vladimir Kozlov wrote:


Re: RFR(L): 8186027: C2: loop strip mining

Roland Westrelin-3

Hi Vladimir,

> Should we just make _loop_flags field type uint (32-bit) since we hit 16-bit limit?

We don't hit the limit with this change. I have some other changes for
which I had to change _loop_flags to uint. That's where the int -> uint
tweaks are coming from. I can remove them if you like as they are not
required. Sorry for the confusion.

> There is confusion (because you did not have enough bits?) about which loops are marked as
> strip_mined. I thought it is only inner loop but it looks like out (skeleton) loop also marked as
> such. I would suggest to mark them differently.

The way it works currently is:

Opcode() == Op_Loop && is_strip_mined() => outer loop
Opcode() == Op_CountedLoop && is_strip_mined() => inner loop

The outer loop can't be transformed to a counted loop so that scheme
shouldn't break.

> I was thinking may be we should create new Loop node subclass for outer loop. Then you don't need
> special flag for it and it will be obvious what they are in Ideal Graph. The same for outer loop end
> node.

Ok. That sounds like it could clean up the code a bit. Do you want me to
look into that?

> src/hotspot/share/opto/superword.cpp
>
> Where next change come from?
>
> +      if (t2->Opcode() == Op_AddI && t2 == _lp->as_CountedLoop()->incr()) continue; // don't mess
> with the iv

I saw a few cases where t2 is the increment of the CountedLoop
iv. SuperWord::opnd_positions_match() then swaps the edges of the AddI
and later CountedLoopEndNode::phi() fails because the edges of the iv's
AddI are not in the expected order anymore.

Roland.

Re: RFR(L): 8186027: C2: loop strip mining

Vladimir Kozlov


On 10/30/17 10:02 AM, Roland Westrelin wrote:
>
> Hi Vladimir,
>
>> Should we just make _loop_flags field type uint (32-bit) since we hit 16-bit limit?
>
> We don't hit the limit with this change. I have some other changes for
> which I had to change _loop_flags to uint. That's where the int -> uint
> tweaks are coming from. I can remove them if you like as they are not
> required. Sorry for the confusion.

Please, remove them. Changes are big already.

>
>> There is confusion (because you did not have enough bits?) about which loops are marked as
>> strip_mined. I thought it is only inner loop but it looks like out (skeleton) loop also marked as
>> such. I would suggest to mark them differently.
>
> The way it works currently is:
>
> Opcode() == Op_Loop && is_strip_mined() => outer loop
> Opcode() == Op_CountedLoop && is_strip_mined() => inner loop

Yes, I got it after 3rd read-through. It is not very obvious.

>
> The outer loop can't be transformed to a counted loop so that scheme
> shouldn't break.

Yes, it is correct code but it is hard to follow.

>
>> I was thinking may be we should create new Loop node subclass for outer loop. Then you don't need
>> special flag for it and it will be obvious what they are in Ideal Graph. The same for outer loop end
>> node.
>
> Ok. That sounds like it could clean up the code a bit. Do you want me to
> look into that?

Yes, please.

>
>> src/hotspot/share/opto/superword.cpp
>>
>> Where next change come from?
>>
>> +      if (t2->Opcode() == Op_AddI && t2 == _lp->as_CountedLoop()->incr()) continue; // don't mess
>> with the iv
>
> I saw a few cases where t2 is the increment of the CountedLoop
> iv. SuperWord::opnd_positions_match() then swaps the edges of the AddI
> and later CountedLoopEndNode::phi() fails because the edges of the iv's
> AddI are not in the expected order anymore.

Good. But why do you need to check Opcode() too? In counted loops it should be of int type.

Thanks,
Vladimir


Re: RFR(L): 8186027: C2: loop strip mining

Roland Westrelin-3

>>> src/hotspot/share/opto/superword.cpp
>>>
>>> Where next change come from?
>>>
>>> +      if (t2->Opcode() == Op_AddI && t2 == _lp->as_CountedLoop()->incr()) continue; // don't mess
>>> with the iv
>>
>> I saw a few cases where t2 is the increment of the CountedLoop
>> iv. SuperWord::opnd_positions_match() then swaps the edges of the AddI
>> and later CountedLoopEndNode::phi() fails because the edges of the iv's
>> AddI are not in the expected order anymore.
>
> Good. But why you need to check Opcode() too? In Counted loops it should be Int type.

It's not required for correctness, but _lp->as_CountedLoop()->incr() is
somewhat expensive because it has to follow several edges from the loop
head to the AddI, so the t2->Opcode() == Op_AddI check skips that call
when it's obvious it's not needed.

Roland.

Re: RFR(L): 8186027: C2: loop strip mining

Vladimir Kozlov
On 11/7/17 8:49 AM, Roland Westrelin wrote:


Got it. I thought Opcode() would be more expensive since it is virtual, but looking at the other one - it has a lot of checks.

Thanks,
Vladimir


Re: RFR(L): 8186027: C2: loop strip mining

Roland Westrelin-3
In reply to this post by Vladimir Kozlov

> I was thinking may be we should create new Loop node subclass for outer loop. Then you don't need
> special flag for it and it will be obvious what they are in Ideal Graph. The same for outer loop end
> node.

Here is a new webrev with new node types for the outer loop:

http://cr.openjdk.java.net/~roland/8186027/webrev.02/

Roland.

Re: RFR(L): 8186027: C2: loop strip mining

Vladimir Kozlov
Thank you, Roland

You answered all my comments.
I will run testing and push it if testing is clean.

Thanks,
Vladimir

On 11/20/17 2:27 AM, Roland Westrelin wrote:


Re: RFR(L): 8186027: C2: loop strip mining

Roland Westrelin-3

> You answered all my comments.
> I will run testing and push it if testing is clean.

Thanks for the review and for taking care of sponsoring.

On JBS, you asked about a performance regression:
SPECjvm2008-Sparse.small-G1 shows a significant regression of about 10%.

The generated code looks ok to me. Inner loop with strip mining off:

 0x00007f5d5b3698b0: mov    0x10(%r9,%rax,4),%r11d  ;*iaload {reexecute=0 rethrow=0 return_oop=0}
                                               ; - spec.benchmarks.scimark.SparseCompRow::matmult@65 (line 63)
 0x00007f5d5b3698b5: cmp    %r10d,%r11d
 0x00007f5d5b3698b8: jae    0x00007f5d5b3698f1
 0x00007f5d5b3698ba: vmovsd 0x10(%rcx,%rax,8),%xmm5  ;*daload {reexecute=0 rethrow=0 return_oop=0}
                                               ; - spec.benchmarks.scimark.SparseCompRow::matmult@70 (line 63)
 0x00007f5d5b3698c0: vmulsd 0x10(%r14,%r11,8),%xmm5,%xmm5
 0x00007f5d5b3698c7: vaddsd %xmm5,%xmm4,%xmm4  ;*dadd {reexecute=0 rethrow=0 return_oop=0}
                                               ; - spec.benchmarks.scimark.SparseCompRow::matmult@72 (line 63)
 0x00007f5d5b3698cb: inc    %eax               ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                               ; - spec.benchmarks.scimark.SparseCompRow::matmult@75 (line 62)
 0x00007f5d5b3698cd: cmp    %esi,%eax
 0x00007f5d5b3698cf: jl     0x00007f5d5b3698b0  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}

Inner loop and outer strip-mined loop, with strip mining on:

 0x00007f0a02f27289: mov    %ebx,%ebp
 0x00007f0a02f2728b: sub    %r9d,%ebp
 0x00007f0a02f2728e: cmp    %esi,%ebp
 0x00007f0a02f27290: cmovg  %esi,%ebp
 0x00007f0a02f27293: add    %r9d,%ebp
 0x00007f0a02f27296: data16 nopw 0x0(%rax,%rax,1)  ;*dload {reexecute=0 rethrow=0 return_oop=0}
                                               ; - spec.benchmarks.scimark.SparseCompRow::matmult@57 (line 63)
 0x00007f0a02f272a0: vmovq  %xmm4,%r13
 0x00007f0a02f272a5: mov    0x10(%r13,%r9,4),%r8d  ;*iaload {reexecute=0 rethrow=0 return_oop=0}
                                               ; - spec.benchmarks.scimark.SparseCompRow::matmult@65 (line 63)
 0x00007f0a02f272aa: cmp    %r11d,%r8d
 0x00007f0a02f272ad: jae    0x00007f0a02f272df
 0x00007f0a02f272af: vmovsd 0x10(%rcx,%r9,8),%xmm7  ;*daload {reexecute=0 rethrow=0 return_oop=0}
                                               ; - spec.benchmarks.scimark.SparseCompRow::matmult@70 (line 63)
 0x00007f0a02f272b6: vmovq  %xmm2,%r13
 0x00007f0a02f272bb: vmulsd 0x10(%r13,%r8,8),%xmm7,%xmm7
 0x00007f0a02f272c2: vaddsd %xmm7,%xmm6,%xmm6  ;*dadd {reexecute=0 rethrow=0 return_oop=0}
                                               ; - spec.benchmarks.scimark.SparseCompRow::matmult@72 (line 63)
 0x00007f0a02f272c6: inc    %r9d               ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                               ; - spec.benchmarks.scimark.SparseCompRow::matmult@75 (line 62)
 0x00007f0a02f272c9: cmp    %ebp,%r9d
 0x00007f0a02f272cc: jl     0x00007f0a02f272a0  ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                               ; - spec.benchmarks.scimark.SparseCompRow::matmult@78 (line 62)
 0x00007f0a02f272ce: mov    0x58(%r15),%r8     ; ImmutableOopMap{rcx=Oop rdx=Oop xmm0=Oop xmm2=Oop xmm4=Oop [0]=Oop }
                                               ;*goto {reexecute=1 rethrow=0 return_oop=0}
                                               ; - spec.benchmarks.scimark.SparseCompRow::matmult@78 (line 62)
 0x00007f0a02f272d2: test   %eax,(%r8)         ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                               ; - spec.benchmarks.scimark.SparseCompRow::matmult@78 (line 62)
                                               ;   {poll}
 0x00007f0a02f272d5: cmp    %ebx,%r9d
 0x00007f0a02f272d8: jl     0x00007f0a02f27289

There are extra spill instructions for some reason, but I think the
bigger problem here is that the inner loop runs for a small number of
iterations (~2), so the outer strip-mined loop is executed too
often. I have an experimental patch that, for loops that profiling shows
execute a small number of iterations, makes two copies: one with the
outer strip-mined loop, one without. The code picks one or the other at
runtime based on the actual number of iterations, so there's still some
overhead. I didn't include it here because it doesn't make the
regression go away entirely and it adds extra complexity.

Roland.
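For reference, the strip-mined loop nest that the generated code above implements can be sketched in plain Java, including the overflow-safe bound computation from the webrev (next = MIN(stop - i, iters*inc) + i, which the cmp/cmovg/add sequence at the top of the outer loop corresponds to). This is an illustration with made-up names, not the actual C2 transformation; counting the per-chunk safepoint polls shows why a short inner trip count makes the outer overhead significant.

```java
// Sketch of the strip-mined loop nest for a simple counted loop with
// inc == 1. The MIN computation and the safepoint poll run once per
// chunk, so when the inner loop only runs ~2 iterations, that overhead
// is paid every couple of iterations instead of once per thousand.
public class StripMineSketch {
    static int safepointPolls; // stands in for the {poll} in the outer loop

    static long sum(int start, int stop, int chunk) {
        safepointPolls = 0;
        long total = 0;
        int i = start;
        if (i < stop) {
            do {
                // Overflow-safe inner bound: MIN(stop - i, chunk) + i
                int next = Math.min(stop - i, chunk) + i;
                do {
                    total += i; // loop body
                    i += 1;
                } while (i < next);
                safepointPolls++; // safepoint() between chunks
            } while (i < stop);
        }
        return total;
    }

    public static void main(String[] args) {
        // Large chunk: one poll for the whole loop.
        System.out.println(sum(0, 10, 1000) + " polls=" + safepointPolls);
        // Tiny inner trip count: a poll every 2 iterations.
        System.out.println(sum(0, 10, 2) + " polls=" + safepointPolls);
    }
}
```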

Re: RFR(L): 8186027: C2: loop strip mining

Vladimir Kozlov
Thank you, Roland

I also ran pre-integration testing again, and I see that testing with
-Xcomp took longer with this fix than without it.

I am running testing again. But if this repeats, then together with the
Sparse.small regression it suggests to me that maybe we should keep this
optimization off by default, i.e. keep UseCountedLoopSafepoints false.

We may switch it on later, with additional changes that address the regressions.

What do you think?

Thanks,
Vladimir

On 11/22/17 6:57 AM, Roland Westrelin wrote:

>
>> You answered all my comments.
>> I will run testing and push it if testing is clean.
>
> Thanks for the review and for taking care of sponsoring.
>
> On JBS, you asked about a performance regression:
> SPECjvm2008-Sparse.small-G1 shows a significant regression of about 10%.
>
> The generated code looks ok to me. [...]
>
> There are extra spill instructions for some reason, but I think the
> bigger problem here is that the inner loop runs for a small number of
> iterations (~2), so the outer strip-mined loop is executed too
> often.
>
> Roland.
>

Re: RFR(L): 8186027: C2: loop strip mining

Roland Westrelin-3

Hi Vladimir,

> I am running testing again. But if this repeats, then together with the
> Sparse.small regression it suggests to me that maybe we should keep this
> optimization off by default, i.e. keep UseCountedLoopSafepoints false.
>
> We may switch it on later, with additional changes that address the regressions.
>
> What do you think?

If the inner loop runs for a small number of iterations and the compiler
can't statically prove it, I don't see a way to remove the overhead of
loop strip mining entirely. So I'm not optimistic the regression can be
fixed.

If loop strip mining defaults to false, would there be any regular
testing on your side?

It seems to me that it would make sense to enable loop strip mining
depending on which GC is used: it makes little sense for Parallel GC,
but we'll want it enabled for Shenandoah, for instance. Where does G1
fit? I can't really say, and I don't have a strong opinion. But as I
understand it, G1 was made the default under the assumption that users
would be ok trading throughput for better latency. Maybe that same
reasoning applies to loop strip mining?

Roland.

Re: RFR(L): 8186027: C2: loop strip mining

Andrew Haley
On 23/11/17 21:59, Nils Eliasson wrote:
> The current situation is that we get some extra performance with
> UseCountedLoopSafepoints off by default, but let some users have a bad
> experience when they encounter long time-to-safepoint pauses or failures
> (https://bugs.openjdk.java.net/browse/JDK-5014723). I would rather turn
> the tables and have loop strip mining on, and let power users experiment
> with turning it off for an uncertain performance boost.

That just sounds sensible to me, and very Java: we default to a reasonable
balance of throughput and predictability.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671