Intermittent JRuby json issue related to tiered or G1


Intermittent JRuby json issue related to tiered or G1

Charles Oliver Nutter-4
Hello again folks, we seem to have stumbled on an issue with either
tiered compilation or G1 or both.

https://github.com/jruby/jruby/issues/6554

We have been trying to track down a sporadic parse error in JRuby's
json library, and it now seems likely that it is an issue at the JVM
level. The issue above describes the problem that at least two of our
users have been seeing. The issue manifests as JRuby's json library
suddenly failing to parse code that worked earlier in the run.

The main user contributing details here, "Freaky", has been able to
confirm that either turning off tiered compilation or switching away
from G1 (to Parallel or Shenandoah) makes the problem go away, and has
provided a small reproduction:

https://github.com/Freaky/jruby-issue-6554

Freaky has also been trying to correlate JIT and GC logs to the time
of failure and has posted his findings on the JRuby issue above.
Freaky has been running the latest Java 15 to reproduce, but other
users are seeing the same issue on 11.
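For illustration, here is a minimal Ruby sketch of this kind of workload (hypothetical and simplified; Freaky's actual reproducer is in the repository linked above): parse the same document in a hot loop so the JIT compiles the parser at higher tiers, then keep checking that later parses still agree with the first one.

```ruby
require 'json'

# Hypothetical simplified workload: hammer the JSON parser so the JIT
# warms up, then verify that later parses stay consistent.
doc = { "name" => "jruby", "versions" => (1..50).to_a }.to_json

first = JSON.parse(doc)
10_000.times do |i|
  current = JSON.parse(doc)
  # Under the miscompile, a parse that worked earlier in the run would
  # suddenly raise a parse error or return a different structure.
  raise "divergence at iteration #{i}" unless current == first
end
puts "all parses consistent"
```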

I wanted to get some feedback here before opening an issue, mostly
because it seems to require both tiered and G1 to trigger. Anyone have
some cycles to help us investigate?

- Charlie

Re: Intermittent JRuby json issue related to tiered or G1

Aleksey Shipilev-4
On 2/16/21 7:48 PM, Charles Oliver Nutter wrote:
> I wanted to get some feedback here before opening an issue, mostly
> because it seems to require both tiered and G1 to trigger. Anyone have
> some cycles to help us investigate?

I am struggling to reproduce this: the reproducer requires fiddling with the actual JRuby installation
path, installing Bundler, installing gems, etc. It would help if you could provide a reproducer that
one can run from a blank machine.

As usual, running with fastdebug builds is the first step here: maybe VM would assert meaningfully,
and then you can search JIRA for the assert message. My own fastdebug builds are here:
https://builds.shipilev.net/

--
Thanks,
-Aleksey


Re: Intermittent JRuby json issue related to tiered or G1

Aleksey Shipilev-4
On 2/16/21 8:10 PM, Aleksey Shipilev wrote:
> As usual, running with fastdebug builds is the first step here: maybe VM would assert meaningfully,
> and then you can search JIRA for the assert message. My own fastdebug builds are here:
> https://builds.shipilev.net/

Look:

$ export JAVA_HOME=./jdk15-fastdebug/; export PATH=$JAVA_HOME/bin:$PATH
$ while true; do ~/Install/jruby-9.2.14.0/bin/jruby -J-Xmx512m ./test.rb; done
Warming up --------------------------------------
             JSON gem# To suppress the following error report, specify this argument
# after -XX: or in .hotspotrc:  SuppressErrorAt=/ifnode.cpp:952
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/home/buildbot/worker/build-jdk15u-linux/build/src/hotspot/share/opto/ifnode.cpp:952), pid=3795216, tid=3795432
#  assert(this_bool->_test.is_less() && !fail->_con) failed: incorrect test
#
# JRE version: OpenJDK Runtime Environment (15.0.2) (fastdebug build 15.0.2-testing+0-builds.shipilev.net-openjdk-jdk15-b50-20210206)
# Java VM: OpenJDK 64-Bit Server VM (fastdebug 15.0.2-testing+0-builds.shipilev.net-openjdk-jdk15-b50-20210206, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0xbf59ce]  IfNode::fold_compares_helper(ProjNode*, ProjNode*, ProjNode*, PhaseIterGVN*)+0x66e
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /home/shade/temp/jruby/jruby-issue-6554/core.3795216)
#
# An error report file with more information is saved as:
# /home/shade/temp/jruby/jruby-issue-6554/hs_err_pid3795216.log
#
# Compiler replay data is saved as:
# /home/shade/temp/jruby/jruby-issue-6554/replay_pid3795216.log
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp

------

This reliably reproduces with latest JDK 15 binaries from here:
   https://builds.shipilev.net/openjdk-jdk15/

There is a single hit in JIRA, for the bug that was fixed in JDK 9.

But! Neither JDK 16 nor JDK 17 fastdebug asserts, which gives us an opportunity to reverse-bisect
to find which change fixed it.

--
Thanks,
-Aleksey


Re: Intermittent JRuby json issue related to tiered or G1

Charles Oliver Nutter-4
On Tue, Feb 16, 2021 at 1:50 PM Aleksey Shipilev <[hidden email]> wrote:

>
> On 2/16/21 8:10 PM, Aleksey Shipilev wrote:
> > As usual, running with fastdebug builds is the first step here: maybe VM would assert meaningfully,
> > and then you can search JIRA for the assert message. My own fastdebug builds are here:
> > https://builds.shipilev.net/
>
> Look:
>
> $ export JAVA_HOME=./jdk15-fastdebug/; export PATH=$JAVA_HOME/bin:$PATH
> $ while true; do ~/Install/jruby-9.2.14.0/bin/jruby -J-Xmx512m ./test.rb; done
> Warming up --------------------------------------
>              JSON gem# To suppress the following error report, specify this argument
> # after -XX: or in .hotspotrc:  SuppressErrorAt=/ifnode.cpp:952
> #
> # A fatal error has been detected by the Java Runtime Environment:

Thanks for digging into this, Aleksey... I will try to remember to
throw a fastdebug build at these issues in the future. This one has
largely been driven by our users, since up until last week we had no
way to reproduce it.

Those users are also monitoring this thread.

I am a bit confused about your JDK9 reference. If it was fixed in 9
why does it reliably reproduce in 15? Perhaps I am misunderstanding
the lineage of the fix you are referring to.

- Charlie

Re: Intermittent JRuby json issue related to tiered or G1

Aleksey Shipilev-4
On 2/16/21 9:00 PM, Charles Oliver Nutter wrote:
> I am a bit confused about your JDK9 reference. If it was fixed in 9
> why does it reliably reproduce in 15? Perhaps I am misunderstanding
> the lineage of the fix you are referring to.

I am saying that there are no direct JIRA hits that could explain why this is happening. The only
hit I got is for a fix already in JDK 9, so this should not be happening again.

I am (slowly) bisecting between JDK 15 and JDK 16 to see which fix directly or accidentally fixed
it. Then we would know what we are dealing with.

--
Thanks,
-Aleksey


Re: Intermittent JRuby json issue related to tiered or G1

Aleksey Shipilev-4
On 2/16/21 9:05 PM, Aleksey Shipilev wrote:

> On 2/16/21 9:00 PM, Charles Oliver Nutter wrote:
>> I am a bit confused about your JDK9 reference. If it was fixed in 9
>> why does it reliably reproduce in 15? Perhaps I am misunderstanding
>> the lineage of the fix you are referring to.
>
> I am saying that there are no direct JIRA hits that could explain why this is happening. The only
> hit I got is for fix already in JDK 9, so it should not happen again.
>
> I am (slowly) bisecting between JDK 15 and JDK 16 to see which fix directly or accidentally fixed
> it. Then we would know what we are dealing with.

This thing is really hairy. The reverse bisect shows that this one:
   https://bugs.openjdk.java.net/browse/JDK-8257847

...makes the failure in fastdebug much less likely. This explains why I did not see the failures in
JDK 16 and JDK 17 yesterday. I have managed to reliably crash a recent JDK by promoting the assert
in question into a guarantee:

diff --git a/src/hotspot/share/opto/ifnode.cpp b/src/hotspot/share/opto/ifnode.cpp
index 29624765324..467d8f19276 100644
--- a/src/hotspot/share/opto/ifnode.cpp
+++ b/src/hotspot/share/opto/ifnode.cpp
@@ -948,7 +948,9 @@ bool IfNode::fold_compares_helper(ProjNode* proj, ProjNode* success, ProjNode* f
      assert((dom_bool->_test.is_less() && proj->_con) ||
             (dom_bool->_test.is_greater() && !proj->_con), "incorrect test");
      // this test was canonicalized
-    assert(this_bool->_test.is_less() && !fail->_con, "incorrect test");
+    guarantee(this_bool->_test.is_less() && !fail->_con, "incorrect test: dom_bool.test=%d proj._con=%d this_bool.test=%d fail._con=%d",
+           dom_bool->_test._test, proj->_con,
+           this_bool->_test._test, fail->_con);

      cond = (hi_test == BoolTest::le || hi_test == BoolTest::gt) ? BoolTest::gt : BoolTest::ge;

...which then fails with:

# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (ifnode.cpp:955), pid=2438111, tid=2438182
#  guarantee(this_bool->_test.is_less() && !fail->_con) failed: incorrect test: dom_bool.test=3 proj._con=1 this_bool.test=7 fail._con=1
#
# JRE version: OpenJDK Runtime Environment (17.0) (build 17-internal+0-adhoc.shade.jdk)
# Java VM: OpenJDK 64-Bit Server VM (17-internal+0-adhoc.shade.jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x7fc3ee]  IfNode::fold_compares_helper(ProjNode*, ProjNode*, ProjNode*, PhaseIterGVN*) [clone .part.0]+0x19e
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /home/shade/temp/jruby/jruby-issue-6554/core.2438111)
#
# An error report file with more information is saved as:
# /home/shade/temp/jruby/jruby-issue-6554/hs_err_pid2438111.log
#
# Compiler replay data is saved as:
# /home/shade/temp/jruby/jruby-issue-6554/replay_pid2438111.log


"this_bool.test=7" means the test is "GE". The downstream code does not expect this; it expects the
test to be canonicalized. This minimal patch bails out on discovering such a bad test:

diff --git a/src/hotspot/share/opto/ifnode.cpp b/src/hotspot/share/opto/ifnode.cpp
@@ -971,6 +973,9 @@ bool IfNode::fold_compares_helper(ProjNode* proj, ProjNode* success, ProjNode* f
          lo = igvn->transform(new AddINode(lo, igvn->intcon(1)));
          cond = BoolTest::ge;
        }
+    } else {
+      // Safety: something is broken, break away.
+      return false;
      }
    } else {
      const TypeInt* failtype = filtered_int_type(igvn, n, proj);
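
For intuition about why canonicalization matters here: the fold-compares optimization merges a pair of adjacent signed range checks into a single unsigned comparison, and that rewrite is only sound once both tests have been canonicalized into "less-than" shape, which is exactly what the assert above checks. A Ruby sketch of the idea (illustration only, not HotSpot's actual code; assumes 32-bit int inputs):

```ruby
# Two adjacent signed range checks, as a bounds check typically appears:
def in_bounds_two_tests?(i, len)
  return false if i < 0      # lower-bound check
  return false if i >= len   # upper-bound check
  true
end

# The folded form: one unsigned comparison. A negative i, reinterpreted
# as an unsigned 32-bit value, becomes huge and fails the single test.
def in_bounds_folded?(i, len)
  (i & 0xFFFFFFFF) < len
end
```

The two agree for 32-bit inputs; if one of the incoming tests is not in the canonical "less-than" form (e.g. the "GE" seen above), merging them this way would be wrong, hence the defensive bail-out.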


I think I'll submit two issues: one that codes fold_compares_helper more defensively, like in the
patch above (this would be backportable), and then a follow-up that targets the actual problem
(why we have an uncanonicalized test).


--
Thanks,
-Aleksey


Re: Intermittent JRuby json issue related to tiered or G1

Aleksey Shipilev-4
On 2/17/21 3:50 PM, Aleksey Shipilev wrote:
> I think I'll submit two issues: one that codes fold_compares_helper more defensively like in the
> patch above (this would be backportable)

This would be:
   https://bugs.openjdk.java.net/browse/JDK-8261912

> and then the follow-up that targets to address the actual
> problem (why do we have uncanonicalized test).

And this would be:
   https://bugs.openjdk.java.net/browse/JDK-8261914

--
Thanks,
-Aleksey


Re: Intermittent JRuby json issue related to tiered or G1

Charles Oliver Nutter-4
On Wed, Feb 17, 2021 at 9:14 AM Aleksey Shipilev <[hidden email]> wrote:

>
> On 2/17/21 3:50 PM, Aleksey Shipilev wrote:
> > I think I'll submit two issues: one that codes fold_compares_helper more defensively like in the
> > patch above (this would be backportable)
>
> This would be:
>    https://bugs.openjdk.java.net/browse/JDK-8261912
>
> > and then the follow-up that targets to address the actual
> > problem (why do we have uncanonicalized test).
>
> And this would be:
>    https://bugs.openjdk.java.net/browse/JDK-8261914

This is outstanding sleuthing, thank you Aleksey! I'm not happy that
there's a JVM issue but at least we can stop banging our heads against
this as a JRuby bug.

Given your discoveries, I would have felt safe saying this is purely a
tiered JIT issue... but our users have reported that switching away
from G1 also eliminates the problem. Were they just lucky? Could there
be a separate issue?

I'm trying to come up with a short-term workaround with minimal
impact. Switching GC may not be in the cards, but disabling tiered
compilation would probably be acceptable for production
environments... if that is sufficient.

As always, let me know if we can provide any more information. The
affected JRuby users and I will continue monitoring.

- Charlie

Re: Intermittent JRuby json issue related to tiered or G1

Aleksey Shipilev-4
On 2/17/21 6:28 PM, Charles Oliver Nutter wrote:
> Given your discoveries, I would have felt safe saying this is purely a
> tiered JIT issue... but our users have reported that switching away
> from G1 also eliminates the problem. Were they just lucky? Could there
> be a separate issue?

Having spent a day trying to reproduce the issue, I think the conditions under which this reproduces
are very flaky, and minor things can change the reproducibility. The bug is in C2 code, and that
code apparently takes a path that leads to the error under conditions that are not yet clear to me.
I think the profile taking different shapes affects it the most, which is what led me to the
unsuccessful reverse bisect: that minor profiling code adjustment dropped the incidence by about
10x for me! Argh.

Maybe switching the GC does indeed lead the application away from the bug, I have no firm evidence
about this. Note that all this time JDK 9+ was running with G1 without us knowing about this problem.

> I'm trying to come up with a short-term workaround with minimal
> impact. Switching GC may not be in the cards but disabling tiered
> compilation would probably be acceptable for production
> environments...if that is sufficient.

The short-term workaround should really come with:
   https://bugs.openjdk.java.net/browse/JDK-8261914

I plan to propose it for 16u and 11u backports once it hopefully lands in mainline.

I cannot see any diagnostic/experimental flag that would disable the problematic path in C2
specifically. So without JDK-8261914, we are left with several nuclear options:
  a) disabling C2 completely with -XX:TieredStopAtLevel=1, which *would* affect performance;
  b) falling back to 8u, which does not have this C2 code at all.

There is also the "usual" workaround of disabling compilation of the affected method with
-XX:CompileCommand=exclude,<...> -- but that rather hopes that only one method's compilation is affected.

>   As always let me know if we can provide any more information. The
> affected JRuby users and I will continue monitoring.

I have run thousands of iterations of the JRuby reproducer, and it works fine with the JDK-8261914
defensive patch. Without it, it fails about 1 time in 10 with the error that Freaky reported. So I am
pretty confident it is the same issue. You might want to test it yourselves today, if you can build
mainline JDK; or wait for the patches to land in mainline and pick up EA binaries to test.

--
Thanks,
-Aleksey