[10] RFR (M/L): 8137099: G1 needs to "upgrade" GC within the safepoint if it can't allocate during that safepoint to avoid OoME

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

[10] RFR (M/L): 8137099: G1 needs to "upgrade" GC within the safepoint if it can't allocate during that safepoint to avoid OoME

Thomas Schatzl
Hi all,

  can I have reviews for this change that fixes a race within the G1
collector that in conjunction with the GC locker causes it to OoME
prematurely?

The issue is that if a young gc of any kind (including gc locker caused
ones) fails to allocate enough memory, the scheduled full gc can be
cancelled by the gc locker, causing OoME in that thread. I.e. before
any full gc has been tried.

The solution is to, similar to other collectors (which corresponds to
the observation that they do not show this behavior), start a
compacting full gc in the same VM operation where the young GC has been
 started in to avoid this race to happen.

Axel Siebenborn (cc'ed) from SAP originally contributed a fix that
covered most situations; this change enhances it (and converts the
standalone test he wrote into a jtreg ones in addition to style
changes). I added us both as contributors.

Unfortunately the test has had to be put into the Problemlist, because
I did not manage to make it not to exhibit the failure in JDK-8192647
sometimes. I still see it as a useful addition as soon that one has
been fixed. For testing I manually checked the test output that the GC
type "upgrade" occurs as expected though.

Also, the test has been changed to be a stress test, and only execute
at most a certain amount of time.

CR:
https://bugs.openjdk.java.net/browse/JDK-8137099
Webrev:
http://cr.openjdk.java.net/~tschatzl/8137099/webrev/
Testing:
hs-tier1-3, manual execution of included test, checking output to be as
expected.

Thanks,
  Thomas

Reply | Threaded
Open this post in threaded view
|

RE: [10] RFR (M/L): 8137099: G1 needs to "upgrade" GC within the safepoint if it can't allocate during that safepoint to avoid OoME

Siebenborn, Axel
Hi Thomas,
thanks for bringing this up again .

For me the change looks good.

Thanks,
Axel
Reply | Threaded
Open this post in threaded view
|

Re: [10] RFR (M/L): 8137099: G1 needs to "upgrade" GC within the safepoint if it can't allocate during that safepoint to avoid OoME

Hohensee, Paul
In reply to this post by Thomas Schatzl
In attempt_allocation_humongous in g1CollectedHeap.cpp, there’s a note that says there’s code duplication between it and attempt_allocation_slow. There may be a case of version skew between the code at line 553 and the lack of similar code at line 972. If you don’t want to retry the humongous allocation attempt at that point, then the comment at line 968 should be updated.

Thanks,
Paul

On 12/4/17, 7:58 AM, "hotspot-gc-dev on behalf of Thomas Schatzl" <[hidden email] on behalf of [hidden email]> wrote:

    Hi all,
   
      can I have reviews for this change that fixes a race within the G1
    collector that in conjunction with the GC locker causes it to OoME
    prematurely?
   
    The issue is that if a young gc of any kind (including gc locker caused
    ones) fails to allocate enough memory, the scheduled full gc can be
    cancelled by the gc locker, causing OoME in that thread. I.e. before
    any full gc has been tried.
   
    The solution is to, similar to other collectors (which corresponds to
    the observation that they do not show this behavior), start a
    compacting full gc in the same VM operation where the young GC has been
     started in to avoid this race to happen.
   
    Axel Siebenborn (cc'ed) from SAP originally contributed a fix that
    covered most situations; this change enhances it (and converts the
    standalone test he wrote into a jtreg ones in addition to style
    changes). I added us both as contributors.
   
    Unfortunately the test has had to be put into the Problemlist, because
    I did not manage to make it not to exhibit the failure in JDK-8192647
    sometimes. I still see it as a useful addition as soon that one has
    been fixed. For testing I manually checked the test output that the GC
    type "upgrade" occurs as expected though.
   
    Also, the test has been changed to be a stress test, and only execute
    at most a certain amount of time.
   
    CR:
    https://bugs.openjdk.java.net/browse/JDK-8137099
    Webrev:
    http://cr.openjdk.java.net/~tschatzl/8137099/webrev/
    Testing:
    hs-tier1-3, manual execution of included test, checking output to be as
    expected.
   
    Thanks,
      Thomas
   
   

Reply | Threaded
Open this post in threaded view
|

Re: [10] RFR (M/L): 8137099: G1 needs to "upgrade" GC within the safepoint if it can't allocate during that safepoint to avoid OoME

Thomas Schatzl
Hi Paul,

  sorry for the late reply, there has been another issue keeping me
busy ;)

On Thu, 2017-12-07 at 00:44 +0000, Hohensee, Paul wrote:
> In attempt_allocation_humongous in g1CollectedHeap.cpp, there’s a
> note that says there’s code duplication between it and
> attempt_allocation_slow. There may be a case of version skew between
> the code at line 553 and the lack of similar code at line 972. If you
> don’t want to retry the humongous allocation attempt at that point,
> then the comment at line 968 should be updated.

  humongous allocation needs to be under the heap lock anyway, so we
just skip the inline allocation as it would be the same as the code at
the start of the loop.

I did add a comment, fixed some whitespace and had to move declarations
in the JNI c file to make it compile on some platforms:

New webrevs:
http://cr.openjdk.java.net/~tschatzl/8137099/webrev.0_to_1 (diff)
http://cr.openjdk.java.net/~tschatzl/8137099/webrev.1 (full)

There is some potential to do refactoring in the
attempt_allocation_slow/humongous methods (the part in "if
(should_try_gc)"), but I would like to do that in a different change.

Thanks,
  Thomas

Reply | Threaded
Open this post in threaded view
|

Re: [10] RFR (M/L): 8137099: G1 needs to "upgrade" GC within the safepoint if it can't allocate during that safepoint to avoid OoME

Thomas Schatzl
In reply to this post by Siebenborn, Axel
Hi,

On Wed, 2017-12-06 at 15:23 +0000, Siebenborn, Axel wrote:
> Hi Thomas,
> thanks for bringing this up again .
>
> For me the change looks good.
>

  thanks for your review.

Thomas

Reply | Threaded
Open this post in threaded view
|

Re: [10] RFR (M/L): 8137099: G1 needs to "upgrade" GC within the safepoint if it can't allocate during that safepoint to avoid OoME

Stefan Johansson
In reply to this post by Thomas Schatzl
Hi Thomas,

On 2017-12-11 12:07, Thomas Schatzl wrote:

> Hi Paul,
>
>    sorry for the late reply, there has been another issue keeping me
> busy ;)
>
> On Thu, 2017-12-07 at 00:44 +0000, Hohensee, Paul wrote:
>> In attempt_allocation_humongous in g1CollectedHeap.cpp, there’s a
>> note that says there’s code duplication between it and
>> attempt_allocation_slow. There may be a case of version skew between
>> the code at line 553 and the lack of similar code at line 972. If you
>> don’t want to retry the humongous allocation attempt at that point,
>> then the comment at line 968 should be updated.
>    humongous allocation needs to be under the heap lock anyway, so we
> just skip the inline allocation as it would be the same as the code at
> the start of the loop.
>
> I did add a comment, fixed some whitespace and had to move declarations
> in the JNI c file to make it compile on some platforms:
>
> New webrevs:
> http://cr.openjdk.java.net/~tschatzl/8137099/webrev.0_to_1 (diff)
> http://cr.openjdk.java.net/~tschatzl/8137099/webrev.1 (full)
Thanks for taking on this problem. I think the change looks really good,
just a few comments:
---
g1CollectedHeap.cpp
* The logging uses the thread pointer as identifier, I think it would be
more useful to use the thread name. Also this one log line is only
tagged with "gc" while the others use "gc,alloc":
527       log_trace(gc)(PTR_FORMAT " Unsuccessfully scheduled collection
allocating " SIZE_FORMAT " words",
528                     p2i(Thread::current()), word_size);
---
g1CollectedHeap.hpp
* Remove "friend class VM_G1CollectForAllocation;"
* Remove/change:
   // Callback from VM_G1CollectForAllocation operation.
   // This function does everything necessary/possible to satisfy a
   // failed allocation request (including collection, expansion, etc.)
   HeapWord* satisfy_failed_allocation(size_t word_size,
                                       AllocationContext_t context,
                                       bool* succeeded);
---
vm_operations_g1.hpp
* Remove "//     - VM_G1CollectForAllocation"

If you like you could also simplify the class hierarchy now, since
VM_G1OperationWithAllocRequest only has one sub-class.
---

Thanks,
Stefan

> There is some potential to do refactoring in the
> attempt_allocation_slow/humongous methods (the part in "if
> (should_try_gc)"), but I would like to do that in a different change.
>
> Thanks,
>    Thomas
>

Reply | Threaded
Open this post in threaded view
|

Re: [10] RFR (M/L): 8137099: G1 needs to "upgrade" GC within the safepoint if it can't allocate during that safepoint to avoid OoME

Thomas Schatzl
Hi Stefan,

  thanks for your review.

On Mon, 2017-12-11 at 14:19 +0100, Stefan Johansson wrote:

> Hi Thomas,
>
> On 2017-12-11 12:07, Thomas Schatzl wrote:
> > Hi Paul,
> >
> >    sorry for the late reply, there has been another issue keeping
> > me
> > busy ;)
> >
> > On Thu, 2017-12-07 at 00:44 +0000, Hohensee, Paul wrote:
> > > In attempt_allocation_humongous in g1CollectedHeap.cpp, there’s a
> > > note that says there’s code duplication between it and
> > > attempt_allocation_slow. There may be a case of version skew
> > > between the code at line 553 and the lack of similar code at line
> > > 972. If you don’t want to retry the humongous allocation attempt
> > > at that point, then the comment at line 968 should be updated.
> >
> >    humongous allocation needs to be under the heap lock anyway, so
> > we just skip the inline allocation as it would be the same as the
> > code at the start of the loop.
> >
> > I did add a comment, fixed some whitespace and had to move
> > declarations in the JNI c file to make it compile on some
> > platforms:
> >
> > New webrevs:
> > http://cr.openjdk.java.net/~tschatzl/8137099/webrev.0_to_1 (diff)
> > http://cr.openjdk.java.net/~tschatzl/8137099/webrev.1 (full)
>
> Thanks for taking on this problem. I think the change looks really
> good, just a few comments:
> ---
> g1CollectedHeap.cpp
> * The logging uses the thread pointer as identifier, I think it would
> be more useful to use the thread name. Also this one log line is
> only tagged with "gc" while the others use "gc,alloc":
> 527       log_trace(gc)(PTR_FORMAT " Unsuccessfully scheduled
> collection allocating " SIZE_FORMAT " words",
> 528                     p2i(Thread::current()), word_size);

done

> ---
> g1CollectedHeap.hpp
> * Remove "friend class VM_G1CollectForAllocation;"

Done

> * Remove/change:
>    // Callback from VM_G1CollectForAllocation operation.
>    // This function does everything necessary/possible to satisfy a
>    // failed allocation request (including collection, expansion,
> etc.)
>    HeapWord* satisfy_failed_allocation(size_t word_size,
>                                        AllocationContext_t context,
>                                        bool* succeeded);

Fine now :)

> ---
> vm_operations_g1.hpp
> * Remove "//     - VM_G1CollectForAllocation"
>
> If you like you could also simplify the class hierarchy now, since
> VM_G1OperationWithAllocRequest only has one sub-class.

Done.

- I changed the name of VM_G1IncCollectionPause to
VM_G1CollectForAllocation (the name of the operation that has been
deleted previously) because it is more fitting

- un-scrambled the parameters to the VM_G1CollectForAllocation
constructor

Testing with hs-tier1-5

New webrevs:
http://cr.openjdk.java.net/~tschatzl/8137099/webrev.1_to_2/ (diff)
http://cr.openjdk.java.net/~tschatzl/8137099/webrev.2/ (full)

Thanks,
  Thomas