RFR: 8259643: ZGC can return metaspace OOM prematurely

RFR: 8259643: ZGC can return metaspace OOM prematurely

Erik Österlund-3
There exists a race condition for ZGC metaspace allocations, where an allocation can throw OOM due to unbounded starvation from other threads. Towards the end of the allocation dance, we conceptually do this:

1. full_gc()
2. final_allocation_attempt()

And if we still fail at 2 after doing a full GC, we conclude that there isn't enough metaspace memory. However, if the thread gets preempted between 1 and 2, then an unbounded number of metaspace allocations from other threads can fill up the entire metaspace, making the final allocation attempt fail and hence throw. This can cause a situation where almost the entire metaspace is unreachable from roots, yet we throw OOM. I managed to reproduce this with the right sleeps.
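The race can be illustrated with a toy model (illustrative names only, not the HotSpot code): a full GC reclaims almost everything, but other threads refill the space before the preempted thread's final attempt, so the final attempt fails even though nearly all of metaspace is garbage again.

```cpp
#include <cstddef>

// Toy model of the premature-OOM race (illustrative only, not HotSpot code).
struct MetaspaceToy {
    size_t capacity = 100;
    size_t used = 100;        // metaspace is currently full
    size_t unreachable = 99;  // 99 units are garbage a full GC can reclaim

    bool allocate(size_t size) {
        if (used + size > capacity) return false;
        used += size;
        return true;
    }
    void full_gc() {          // step 1: full_gc()
        used -= unreachable;
        unreachable = 0;
    }
};
// Thread A runs full_gc(), is preempted while other threads fill the freed
// space (and drop their loaders again), then its final allocation fails.
```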

The way we deal with this particular issue for heap allocations is to have an allocation request queue and satisfy those allocations before others, preventing starvation. My solution to this metaspace OOM problem is to do essentially the same thing: have a queue of "critical" allocations that get precedence over normal metaspace allocations.
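The precedence rule can be sketched like this (hypothetical names, not the actual patch): while any critical allocation is queued, normal allocations back off, so freed memory goes to the starving threads first.

```cpp
#include <cstddef>

// Hypothetical sketch of the precedence rule between normal and "critical"
// metaspace allocations. Names are illustrative, not the HotSpot code.
struct MetaspaceModel {
    size_t free_units = 0;
    size_t critical_pending = 0;   // number of queued critical requests

    bool allocate(size_t size, bool critical) {
        // Normal allocations must not steal while critical ones are waiting.
        if (!critical && critical_pending > 0) return false;
        if (size > free_units) return false;
        free_units -= size;
        return true;
    }
};
```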

The solution should work for other concurrent GCs (which likely have the same issue), but I have only tried this with ZGC, so at this point only ZGC is hooked into the new API (for concurrently unloading GCs to manage critical metaspace allocations).

Passes ZGC tests in tiers 1-5, as well as the particular test that used to fail (with the JVM sleeps that make it fail deterministically).

-------------

Commit messages:
 - 8259643: ZGC can return metaspace OOM prematurely

Changes: https://git.openjdk.java.net/jdk/pull/2289/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=2289&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8259643
  Stats: 296 lines in 6 files changed: 275 ins; 17 del; 4 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2289.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2289/head:pull/2289

PR: https://git.openjdk.java.net/jdk/pull/2289

Re: RFR: 8259643: ZGC can return metaspace OOM prematurely [v2]

Erik Österlund-3

Erik Österlund has updated the pull request incrementally with one additional commit since the last revision:

  polish code alignment and rename register/unregister to add/remove

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/2289/files
  - new: https://git.openjdk.java.net/jdk/pull/2289/files/9d49103f..e5a0d080

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=2289&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=2289&range=00-01

  Stats: 12 lines in 2 files changed: 0 ins; 1 del; 11 mod
  Patch: https://git.openjdk.java.net/jdk/pull/2289.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/2289/head:pull/2289

PR: https://git.openjdk.java.net/jdk/pull/2289

Re: RFR: 8259643: ZGC can return metaspace OOM prematurely [v2]

Stefan Karlsson-3
On Thu, 28 Jan 2021 13:24:10 GMT, Erik Österlund <[hidden email]> wrote:


Marked as reviewed by stefank (Reviewer).

-------------

PR: https://git.openjdk.java.net/jdk/pull/2289

Re: RFR: 8259643: ZGC can return metaspace OOM prematurely [v2]

Per Liden-2
On Thu, 28 Jan 2021 13:24:10 GMT, Erik Österlund <[hidden email]> wrote:


Marked as reviewed by pliden (Reviewer).

-------------

PR: https://git.openjdk.java.net/jdk/pull/2289

Re: RFR: 8259643: ZGC can return metaspace OOM prematurely [v2]

Erik Österlund-3
On Thu, 28 Jan 2021 13:36:20 GMT, Per Liden <[hidden email]> wrote:


Thanks for the reviews @pliden and @stefank.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2289

Re: RFR: 8259643: ZGC can return metaspace OOM prematurely [v2]

Thomas Stuefe
On Thu, 28 Jan 2021 13:37:16 GMT, Erik Österlund <[hidden email]> wrote:


Hi Erik,

let's see if I understand the problem:

1. n threads allocate metaspace
2. thread A gets an allocation error (not HWM but a hard limit)
3. .. returns, (eventually) schedules a synchronous GC
4. .. GC runs, the CLDG is at some point purged, releasing metaspace pressure
5. other threads load classes, allocating metaspace concurrently, benefitting from the released pressure
6. thread A reattempts allocation, fails again

This is normally not a problem, no? Which thread exactly gets the OOM when the VM hovers that close to the limit does not really matter. But you write "This can cause a situation where almost the entire metaspace is unreachable from roots, yet we throw OOM." - so we could get OOMs even if most of the metaspace were vacant? That can only happen if, between (4) and (6), other threads not only allocate metaspace, but also then lose the loaders used for those allocations, too late for the GC at (4) to collect them but before (5). Collecting them would require another GC.

In other words, the contract is that we only throw an OOM if we really tried everything, but since the effects of the first GC are "stale", it does not count as a try?

Do you think this is a realistic problem?

Do I understand your patch right in that you divide allocations into two priority classes and add another lock, MetaspaceCritical_lock, which blocks normal allocations as long as critical allocations are queued?

Sorry if I am slow :)

---

One problem I see is that Metaspace::purge is not the full purge. Reclaiming metaspace happens in two stages:
1) in CLDG::purge, we delete all `ClassLoaderMetaspace` objects belonging to dead loaders. This releases all their metaspace to the freelists, optionally uncommitting portions of it (since JEP387).
2) in Metaspace::purge, we go through Metaspace and munmap any mappings which are now completely vacant.

The metaspace pressure release already happens in (1), so any concurrent thread allocating will benefit already.
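A toy model of the two stages described above (hypothetical names and granule size, not the HotSpot code): stage 1 returns dead loaders' memory to the freelists, which is where allocation pressure is actually released; stage 2 only unmaps regions that became completely vacant.

```cpp
#include <cstddef>

// Toy model of two-stage metaspace reclamation (illustrative only).
struct SpaceModel {
    size_t freelist_units = 0;          // reusable by any allocating thread
    size_t mapped_units = 10;           // address space backing the space
    static const size_t region = 4;     // hypothetical unmap granularity

    // Stage 1 (CLDG::purge): dead loaders' memory goes to the freelists.
    // This is the point where concurrent allocators start to benefit.
    void stage1_release_dead_loaders(size_t units) {
        freelist_units += units;
    }
    // Stage 2 (Metaspace::purge): only fully vacant regions are unmapped.
    void stage2_unmap_vacant() {
        size_t vacant = (freelist_units / region) * region;
        mapped_units -= vacant;
        freelist_units -= vacant;
    }
};
```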

---

Why do we even need a queue? Why could we not just let the first thread attempting a synchronous GC block the metaspace allocation path for all threads, including others running into a limit, until the GC is finished and it has had its first-allocation right served?

Thanks, Thomas

-------------

PR: https://git.openjdk.java.net/jdk/pull/2289

Re: RFR: 8259643: ZGC can return metaspace OOM prematurely [v2]

Erik Österlund-3
On Thu, 28 Jan 2021 16:23:12 GMT, Thomas Stuefe <[hidden email]> wrote:


Hi Thomas,

Thanks for chiming in! I will reply inline.

> Hi Erik,
>
> lets see if I understand the problem:
>
> 1 n threads allocate metaspace
> 2 thread A gets an allocation error (not HWM but a hard limit)
> 3 .. returns, (eventually) schedules a synchronous GC.
> 4 .. gc runs, the CLDG is at some point purged, releases metaspace pressure
> 5 other threads load classes, allocating metaspace concurrently, benefitting from the released pressure
> 6 thread A reattempts allocation, fails again.

That's the one.

> This is normally not a problem, no?

Indeed, and that was my gut feeling when the current handling was written. I wouldn't expect an actual application to ever hit this problem. Nevertheless, I think it's a soundness problem with completely unbounded starvation, even though it doesn't happen in real life. So I think I would still like to fix the issue.

It is definitely a bit arbitrary where we decide to draw the line of what we guarantee, and what the balance between exactness and maintainability should be. My aim is to try hard enough that whether a program that shouldn't even be close to OOM fails or not does not come down to luck (random sleeps), even though you have to be *very* unlucky for it to fail. But I am not interested in perfect guarantees either, down to the last allocation. My balance is that I allow throwing OOM prematurely if we are "really close" to a genuine down-to-the-last-allocation OOM, but if we are not "really close", then no sleep in the world should cause a failure.

> Which thread exactly gets the OOM if the VM hovers that close to the limit does not really matter. But you write "This can cause a situation where almost the entire metaspace is unreachable from roots, yet we throw OOM." - so we could get OOMs even if most of the Metaspace were vacant? This only can happen if, between (4) and (6), other threads not only allocate metaspace, but also then loose the loaders used for those allocations, to late for the GC at (4) to collect them but before (5). Collecting them would require another GC.

Right, and that is indeed what the test does. It loads chunks of 1000 classes and releases them, assuming that surely after releasing them, I can allocate more classes. Unless of course starvation ruins the day.

> In other words, the contract is that we only throw an OOM if we really tried everything, but since the effects of the first GC are "stale" it does not count as try?

The previous contract was that we try to allocate again after a full GC, and if that fails we give up. The issue is that this guarantee allows the GC to free up 99% of metaspace, yet still fail the allocation due to races with other threads doing the same thing. So at any given point, it might be that only 1% of metadata is reachable, yet an OOM can be produced if you are "unlucky".

> Do you think this is a realistic problem?

It is realistic enough that one stress test has failed in the real world. Yet I don't think any application out there will run into any issue. But I prefer having a sound solution where we can know that and not rely on probability.

> Do I understand your patch right in that you divide allocations in two priority classes, add another lock, MetaspaceCritical_lock, which blocks normal allocations as long as critical allocations are queued?

Yes, that's exactly right.

> Sorry if I am slow :)

Not at all!

> One problem I see is that Metaspace::purge is not the full purge. Reclaiming metaspace happens in two stages:
>
>     1. in CLDG::purge, we delete all `ClassLoaderMetaspace` objects belonging to dead loaders. This releases all their metaspace to the freelists, optionally uncommitting portions of it (since JEP387).
>
>     2. in Metaspace::purge, we go through Metaspace and munmap any mappings which are now completely vacant.
>
>
> The metaspace pressure release already happens in (1), so any concurrent thread allocating will benefit already.

Ah, I thought it was all done in (2). I can move the Big Fat Lock to cover all of CLDG::purge instead. What do you think? It basically just needs to cover the entire thing.

> Why do we even need a queue? Why could we not just let the first thread attempting a synchronous gc block metaspace allocation path for all threads, including others running into a limit, until the gc is finished and it had its first-allocation-right served?

Each "critical" allocation rides on one particular GC cycle, which denotes the make-or-break point of that allocation.
To prevent starvation, we have to satisfy all critical allocations whose make-or-break GC cycle is the one associated with the current purge() operation, before we release the lock in purge() and let new allocations in; otherwise we rely on luck again. However, the pending critical allocations may each have a different make-or-break GC cycle associated with them. So in purge() some of them must be satisfied, while others need not be, yet can happily have their allocations satisfied opportunistically if possible. We therefore need to order them so that the earliest arriving pending critical allocations are satisfied first, before subsequent critical allocations (possibly waiting for a later GC); otherwise we can again get premature OOM situations, where a thread releases a bunch of memory, expects to be able to allocate, yet fails due to races with other threads.
The queue basically ensures that this ordering is sound: the pending critical allocations whose make-or-break GC is the one running purge() are satisfied first, before (opportunistically) satisfying other critical allocations that are really waiting for the next GC to happen.
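The ordering argument above can be sketched as follows (hypothetical names and structure, not the actual patch): requests arrive in order, so the pending list is sorted by make-or-break GC cycle, and satisfying oldest-first under the purge() lock prevents a younger request from stealing an older one's memory.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of ordering critical allocations by the GC cycle
// they ride on. Illustrative only, not the HotSpot code.
struct CriticalRequest {
    uint64_t gc_cycle;      // the make-or-break GC cycle of this request
    size_t size;
    bool satisfied;
};

// Called under the purge() lock: satisfy pending requests oldest-first,
// stopping at the first failure so later requests cannot jump ahead.
inline void satisfy_in_order(std::vector<CriticalRequest>& pending,
                             size_t& free_units) {
    for (CriticalRequest& r : pending) {
        if (r.size > free_units) {
            break;  // r stays pending; younger requests must wait behind it
        }
        free_units -= r.size;
        r.satisfied = true;  // mandatory for the current cycle's requests,
                             // opportunistic for younger cycles
    }
}
```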

Thanks,


> Thanks, Thomas

-------------

PR: https://git.openjdk.java.net/jdk/pull/2289

Re: RFR: 8259643: ZGC can return metaspace OOM prematurely [v2]

Thomas Stuefe
On Thu, 28 Jan 2021 17:54:25 GMT, Erik Österlund <[hidden email]> wrote:


Hi Erik,

thanks for the extensive explanations!

One issue with your patch just came to me: the block-on-allocate may be too early. `Metaspace::allocate()` is a bit hot. I wonder about the performance impact of taking and releasing a lock on each individual allocation, even if it is uncontended. Ideally I'd like to keep this path as close to a simple pointer-bump allocation as possible (which it unfortunately isn't).

It is also not necessary: the majority of callers satisfy their allocation from already-committed arena-local memory. So they are well-behaved and not stealing from anyone; we would block them unnecessarily. I estimate only about 1 in 60 to 1 in 1000 calls would need that lock.

Allocation happens (roughly) in these steps:
1. try to allocate from the arena-local free block list
2. try to allocate from the arena-local current chunk without newly committing memory
3. try to enlarge the chunk in place and/or commit more chunk memory, and allocate from the current chunk
4. get a new chunk from the freelist

(1) and (2) don't bother anyone. The hot path is typically (2). From (3) onward, concurrently released memory could be used. So (1) and (2) can still happen before your block.

All that happens inside `MetaspaceArena::allocate`:

https://github.com/openjdk/jdk/blob/0675473486bc0ee321654d308b600874cf5ce41e/src/hotspot/share/memory/metaspace/metaspaceArena.cpp#L225

But code-wise, (2) and (3) are a bit entangled, so the code would have to be massaged a bit to clearly separate (2) from (3).
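The split Thomas suggests can be sketched like this (hypothetical names; the real code lives in `MetaspaceArena::allocate`): only the steps that can consume concurrently released memory go through the guarded slow path, while the arena-local fast path stays lock-free with respect to the critical-allocation lock.

```cpp
#include <cstddef>

// Sketch of keeping the arena-local fast path (steps 1-2) outside the
// critical-allocation lock and guarding only steps 3-4. Illustrative only.
struct ArenaToy {
    size_t local_free = 2;    // memory already committed to this arena
    size_t shared_free = 10;  // global memory (chunk freelist, new commits)
    int slow_path_calls = 0;  // trips through the guarded slow path

    bool allocate(size_t size) {
        if (size <= local_free) {   // steps (1)/(2): bothers no one, no lock
            local_free -= size;
            return true;
        }
        ++slow_path_calls;          // steps (3)/(4): this is where a
                                    // MetaspaceCritical_lock-style guard
                                    // would be taken
        if (size <= shared_free) {
            shared_free -= size;
            return true;
        }
        return false;
    }
};
```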

Please find more remarks inline.

> It is definitely a bit arbitrary though where we decide to draw the line of what we guarantee, and what the balance between exactness and maintainability should be. My aim is to try hard enough so we don't rely on luck (random sleeps) if a program that shouldn't be even close to OOM will fail or not, even though you have to be _very_ unlucky for it to fail. But I am not interested in prefect guarantees either, down to the last allocation. My balance is that I allow throwing OOM prematurely if we are "really close" to being down to the last allocation OOM, but if you are not "really close", then no sleep in the world should cause a failure.

I think I get this now. IIUC, the problem is that memory release happens with a delay, and we carry around a baggage of "potentially free" memory which needs a collector run to materialize. So many threads doing class loading and unloading drive up the metaspace use rate and increase the "potentially free" overhead, right? So the trick is to time collector runs right.

One concern I have is that if the customer runs with too tight a limit, we may bounce from full GC to full GC, always scraping the barrel just enough to keep going - maybe collecting some short-lived loaders - but never enough to get the VM into clear waters. I think this may be an issue today already. What is unclear to me is when it would be better to just give up and throw an OOM, to motivate the customer to increase the limits.

>
> > Which thread exactly gets the OOM if the VM hovers that close to the limit does not really matter. But you write "This can cause a situation where almost the entire metaspace is unreachable from roots, yet we throw OOM." - so we could get OOMs even if most of the Metaspace were vacant? This only can happen if, between (4) and (6), other threads not only allocate metaspace, but also then lose the loaders used for those allocations, too late for the GC at (4) to collect them but before (5). Collecting them would require another GC.
>
> Right, and that is indeed what the test does. It loads chunks of 1000 classes and releases them, assuming that surely after releasing them, I can allocate more classes. Unless of course starvation ruins the day.
>
> > In other words, the contract is that we only throw an OOM if we really tried everything, but since the effects of the first GC are "stale" it does not count as try?
>
> The previous contract was that we try to allocate again after a full GC, and if that fails we give up. The issue is that this guarantee allows the GC to free up 99% of metaspace, yet still fail the allocation due to races with other threads doing the same thing. So at any given point, it might be that only 1% of metadata is reachable, yet an OOM can be produced if you are "unlucky".
>
> > Do you think this is a realistic problem?
>
> It is realistic enough that one stress test has failed in the real world. Yet I don't think any application out there will run into any issue. But I prefer having a sound solution where we can know that and not rely on probability.
>
> > Do I understand your patch right in that you divide allocations in two priority classes, add another lock, MetaspaceCritical_lock, which blocks normal allocations as long as critical allocations are queued?
>
> Yes, that's exactly right.
>
> > Sorry if I am slow :)
>
> Not at all!
>
> > One problem I see is that Metaspace::purge is not the full purge. Reclaiming metaspace happens in two stages:
> > ```
> > 1. in CLDG::purge, we delete all `ClassLoaderMetaspace` objects belonging to dead loaders. This releases all their metaspace to the freelists, optionally uncommitting portions of it (since JEP387).
> >
> > 2. in Metaspace::purge, we go through Metaspace and munmap any mappings which are now completely vacant.
> > ```
> >
> >
> > The metaspace pressure release already happens in (1), so any concurrent thread allocating will benefit already.
>
> Ah. I thought it was all done in 2. I can move the Big Fat Lock to cover all of CLDG::purge instead. What do you think? It just needs to cover the entire thing basically.

Why not just cover the whole synchronous GC collect call? I'd put that barrier up as early as possible, to prevent as many threads as possible from entering the more expensive fail path. At that point we know we are near exhaustion. Any thread allocating could just as well wait inside MetaspaceArena::allocate. If the GC succeeds in releasing lots of memory, they will not have been disturbed much.

>
> > Why do we even need a queue? Why could we not just let the first thread attempting a synchronous gc block metaspace allocation path for all threads, including others running into a limit, until the gc is finished and it had its first-allocation-right served?
>
> Each "critical" allocation rides on one particular GC cycle, that denotes the make-or-break point of the allocation.

I feel like I should know this, but if multiple threads enter satisfy_failed_metadata_allocation around the same time and call a synchronous collect(), they would wait on the same GC, right? They won't start individual GCs for each thread?

> In order to prevent starvation, we have to satisfy all critical allocations whose make-or-break GC cycle is associated with the current purge() operation, before we release the lock in purge() and let new allocations in, or we will rely on luck again. However, the pending critical allocations may each have a different make-or-break GC cycle associated with them. So in purge() some of them need to be satisfied, while others do not, yet can happily have their allocations satisfied opportunistically if possible. So we need to make sure they are ordered, such that the earliest arriving pending critical allocations are satisfied first, before subsequent critical allocations (possibly waiting for a later GC), or we can get premature OOM situations again, where a thread releases a bunch of memory, expecting to be able to allocate, yet fails due to races with other threads.
> The queue basically ensures the ordering of critical allocation satisfaction is sound, so that the pending critical allocations with the associated make-or-break GC being the one running purge(), are satisfied first, before satisfying (opportunistically) other critical allocations, that are really waiting for the next GC to happen.
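[Editorial aside] Why the queue must be ordered can be shown with a small sketch (all names invented; this is a simplified model, not the actual patch): each pending request carries its make-or-break GC cycle, and purge() serves the queue strictly in arrival order. If a later-arriving request were served first, it could consume the memory an earlier request was entitled to, forcing a premature OOM.

```java
// Simplified model of an ordered critical-allocation queue.
// Request ids, cycle numbers, and sizes are invented for illustration.
class CriticalQueueSketch {
    record Request(int id, int makeOrBreakGc, long size) {}

    final java.util.ArrayDeque<Request> queue = new java.util.ArrayDeque<>();
    long free;

    CriticalQueueSketch(long free) { this.free = free; }

    // Serve pending requests strictly in arrival order. Returns the ids
    // of requests that must now throw OOM (bound to this cycle, unsatisfied).
    java.util.List<Integer> purge(int currentGc) {
        var oom = new java.util.ArrayList<Integer>();
        var it = queue.iterator();
        while (it.hasNext()) {
            Request r = it.next();
            if (r.size() <= free) {            // satisfied (opportunistically
                free -= r.size();              // for later-cycle requests)
                it.remove();
            } else if (r.makeOrBreakGc() <= currentGc) {
                oom.add(r.id());               // this was its last-chance GC
                it.remove();
            }                                  // else: wait for the next GC
        }
        return oom;
    }
}
```

With 10 units free and two 8-unit requests, FIFO order lets the first-cycle request succeed and the second-cycle request wait; reversing the order starves the first-cycle request into an OOM.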

I still don't get why the order of the critical allocations matter. I understand that even with your barrier, multiple threads can fail the initial allocation, enter the "satisfy_failed_metadata_allocation()" path, and now their allocation count as critical since if they fail again they will throw an OOM. But why does the order of the critical allocations among themselves matter? Why not just let the critical allocations trickle out unordered? Is the relation to the GC cycle not arbitrary?

>
> Thanks,
> /Erik
>

We need an "erik" pr command :)

Just FYI, I have very vague plans to extend usage of the metaspace allocator to other areas, to get back the cost of implementation. E.g. one candidate may be replacement of the current resource areas, which are just more primitive arena based allocators. This is very vague and not a high priority, but I am a bit interested in keeping the coding generic if it's not too much effort. But I don't see your patch causing any problems there.

Cheers, Thomas

-------------

PR: https://git.openjdk.java.net/jdk/pull/2289

Re: RFR: 8259643: ZGC can return metaspace OOM prematurely [v2]

Erik Österlund-3
On Fri, 29 Jan 2021 11:23:30 GMT, Thomas Stuefe <[hidden email]> wrote:


Hi Thomas,

Thanks for having a look at this.

> Hi Erik,
>
> thanks for the extensive explanations!
>
> One issue with your patch just came to me: the block-on-allocate may be too early. `Metaspace::allocate()` is a bit hot. I wonder about the performance impact of taking and releasing a lock on each atomic allocation, even if it's uncontended. Ideally I'd like to keep this path as close to a simple pointer bump allocation as possible (which it isn't unfortunately).

I have a global flag that denotes there being at least 1 critical allocation in the system. It is set by the first critical allocation, and cleared by the GC if all critical allocations could be satisfied. The global lock is only taken in Metaspace::allocate() if said flag is on. Normal apps should never see this flag being on. So the only overhead in the common case is to check the flag and see that no locking is required. I think that should be fast enough, right? And when you are in the critical mode, you obviously have more to worry about than lock contention, in terms of how the system is performing.
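[Editorial aside] The fast path Erik describes can be sketched like this (names invented; the real patch lives in HotSpot C++, so this is only a model of the idea): every allocation loads one global flag, and the `MetaspaceCritical_lock` is taken only when the flag indicates at least one critical allocation exists.

```java
// Model of the "check a flag, lock only in critical mode" fast path.
// All names are invented for illustration.
class CriticalFlagSketch {
    static final java.util.concurrent.atomic.AtomicBoolean hasCritical =
        new java.util.concurrent.atomic.AtomicBoolean(false);
    static final Object metaspaceCriticalLock = new Object();
    static int lockAcquisitions = 0;

    static long allocate(long size) {
        if (hasCritical.get()) {            // rare: critical mode is on
            synchronized (metaspaceCriticalLock) {
                lockAcquisitions++;
                // ... here normal allocations would wait until the queued
                // critical allocations have been satisfied by the GC
            }
        }
        return size; // stand-in for the actual arena allocation
    }
}
```

In the common case the only overhead is the single flag load; the lock is never touched until some thread has actually entered the critical path.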

> It is also not necessary: the majority of callers satisfy their allocation from already-committed arena local memory. So they are good and no thieves. We would block them unnecessary. I estimate only about 1:60 to 1:1000 calls would need that lock.
>
> Allocation happens (roughly) in these steps:
> 1 try allocate from arena local free block list
> 2 try allocate from arena local current chunk without newly committing memory
> 3 try enlarge the chunk in place and/or commit more chunk memory and allocate from current chunk
> 4 get a new chunk from the freelist
>
> (1) and (2) don't bother anyone. Hot path is typically (2). From (3) onward concurrently released memory could be used. So (1) and (2) can still happen before your block.
>
> All that happens inside `MetaspaceArena::allocate`:
>
> https://github.com/openjdk/jdk/blob/0675473486bc0ee321654d308b600874cf5ce41e/src/hotspot/share/memory/metaspace/metaspaceArena.cpp#L225
>
> But code-wise (2) and (3) are a bit entangled, so the code would have to be massaged a bit to clearly separate (2) from (3).

It might be possible to move it to a less hot path as you propose, but given that the fast path doesn't take said lock unless anyone really has critical allocations in the system, I don't think we need to worry about that. But if you still have performance concerns, I am open to suggestions of course.
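[Editorial aside] Thomas's four tiers can be modeled roughly as follows (the budgets and tier names are invented; see `MetaspaceArena::allocate` for the real, more entangled logic). Only tiers 3-4 can consume memory released by a concurrent GC, which is why a barrier between tier 2 and tier 3 would spare the hot path.

```java
// Simplified model of the four allocation tiers in MetaspaceArena::allocate.
// Field names and byte budgets are invented for illustration.
class ArenaSketch {
    long freeBlockBytes;        // tier 1: arena-local free block list
    long committedChunkBytes;   // tier 2: current chunk, already committed
    long commitBudgetBytes;     // tier 3: enlarge/commit more chunk memory
    long freelistChunkBytes;    // tier 4: take a new chunk from the freelist
    String lastTier;

    boolean allocate(long size) {
        if (size <= freeBlockBytes)      { freeBlockBytes -= size;      lastTier = "free-block"; return true; }
        if (size <= committedChunkBytes) { committedChunkBytes -= size; lastTier = "chunk";      return true; }
        // Tiers 3-4 may consume memory released by a concurrent GC;
        // this is where a critical-allocation barrier could sit.
        if (size <= commitBudgetBytes)   { commitBudgetBytes -= size;   lastTier = "commit";     return true; }
        if (size <= freelistChunkBytes)  { freelistChunkBytes -= size;  lastTier = "freelist";   return true; }
        return false;
    }
}
```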

> Please find more remarks inline.
>
> > Hi Thomas,
> > Thanks for chiming in! I will reply inline.
> > > Hi Erik,
> > > lets see if I understand the problem:
> > > 1 n threads allocate metaspace
> > > 2 thread A gets an allocation error (not HWM but a hard limit)
> > > 3 .. returns, (eventually) schedules a synchronous GC.
> > > 4 .. gc runs, the CLDG is at some point purged, releases metaspace pressure
> > > 5 other threads load classes, allocating metaspace concurrently, benefitting from the released pressure
> > > 6 thread A reattempts allocation, fails again.
> >
> >
> > That's the one.
> > > This is normally not a problem, no?
> >
> >
> > Indeed, and that was my gut feeling when the current handling was written. I wouldn't expect an actual application to ever hit this problem. Nevertheless, I think it's a soundness problem with completely unbounded starvation, even though it doesn't happen in real life. So I think I would still like to fix the issue.
> > It is definitely a bit arbitrary though where we decide to draw the line of what we guarantee, and what the balance between exactness and maintainability should be. My aim is to try hard enough that we don't rely on luck (random sleeps) to decide whether a program that shouldn't be anywhere close to OOM fails or not, even though you have to be _very_ unlucky for it to fail. But I am not interested in perfect guarantees either, down to the last allocation. My balance is that I allow throwing OOM prematurely if we are "really close" to being down to the last allocation OOM, but if you are not "really close", then no sleep in the world should cause a failure.
>
> I think I get this now. IIUC we have the problem that memory release happens delayed, and we carry around a baggage of "potentially free" memory which needs a collector run to materialize. So many threads jumping up and down and doing class loading and unloading drive up metaspace use rate and increase the "potentially free" overhead, right? So the thing is to time collector runs right.

Exactly.

> One concern I have is that if the customer runs with too tight a limit, we may bounce from full GC to full GC. Always scraping the barrel enough to just keep going - maybe collecting some short lived loaders - but not enough to get the VM into clear waters. I think this may be an issue today already. What is unclear to me is when it would be just better to give up and throw an OOM. To motivate the customer to increase the limits.

Right - this is a bit of a philosophical one. There is always a balance there between precision, code complexity, and when to put the VM out of its misery when it is performing poorly. We deal with the same trade-off really with heap allocations, which is why I am also solving the starvation problem in pretty much the same way: with a queue satisfied by the GC, and locking out starvation. Then we throw OOM in fairly similar conditions. What they have in common is that when they throw, we will have a live portion of metaspace that is "pretty much" full, and there is no point in continuing, while allowing unfortunate timings on a less full (in terms of temporal liveness) metaspace to always succeed.

One might argue that the trade-off should be moved in some direction, and that it is okay for it to be more or less exact, but I was hoping that by doing the same dance that heap OOM situations do, we can at least follow a trade-off that is pretty well established and has worked pretty well for heap OOM situations for many years. And I think heap OOM situations in the wild are far more common than metaspace OOM, so I don't think that the metaspace OOM mechanism needs to do better than what the heap OOM mechanism does. If that makes sense.

> > > Which thread exactly gets the OOM if the VM hovers that close to the limit does not really matter. But you write "This can cause a situation where almost the entire metaspace is unreachable from roots, yet we throw OOM." - so we could get OOMs even if most of the Metaspace were vacant? This only can happen if, between (4) and (6), other threads not only allocate metaspace, but also then lose the loaders used for those allocations, too late for the GC at (4) to collect them but before (5). Collecting them would require another GC.
> >
> >
> > Right, and that is indeed what the test does. It loads chunks of 1000 classes and releases them, assuming that surely after releasing them, I can allocate more classes. Unless of course starvation ruins the day.
> > > In other words, the contract is that we only throw an OOM if we really tried everything, but since the effects of the first GC are "stale" it does not count as try?
> >
> >
> > The previous contract was that we try to allocate again after a full GC, and if that fails we give up. The issue is that this guarantee allows the GC to free up 99% of metaspace, yet still fail the allocation due to races with other threads doing the same thing. So at any given point, it might be that only 1% of metadata is reachable, yet an OOM can be produced if you are "unlucky".
> > > Do you think this is a realistic problem?
> >
> >
> > It is realistic enough that one stress test has failed in the real world. Yet I don't think any application out there will run into any issue. But I prefer having a sound solution where we can know that and not rely on probability.
> > > Do I understand your patch right in that you divide allocations in two priority classes, add another lock, MetaspaceCritical_lock, which blocks normal allocations as long as critical allocations are queued?
> >
> >
> > Yes, that's exactly right.
> > > Sorry if I am slow :)
> >
> >
> > Not at all!
> > > One problem I see is that Metaspace::purge is not the full purge. Reclaiming metaspace happens in two stages:
> > > ```
> > > 1. in CLDG::purge, we delete all `ClassLoaderMetaspace` objects belonging to dead loaders. This releases all their metaspace to the freelists, optionally uncommitting portions of it (since JEP387).
> > >
> > > 2. in Metaspace::purge, we go through Metaspace and munmap any mappings which are now completely vacant.
> > > ```
> > >
> > >
> > > The metaspace pressure release already happens in (1), so any concurrent thread allocating will benefit already.
> >
> >
> > Ah. I thought it was all done in 2. I can move the Big Fat Lock to cover all of CLDG::purge instead. What do you think? It just needs to cover the entire thing basically.
>
> Why not just cover the whole synchronous GC collect call? I'd put that barrier up as early as possible, to prevent as many threads as possible from entering the more expensive fail path. At that point we know we are near exhaustion. Any thread allocating could just as well wait inside MetaspaceArena::allocate. If the GC succeeds in releasing lots of memory, they will not have been disturbed much.

Do you mean
1) why I don't hold the MetaspaceCritical_lock across the collect() call at the mutator side of the code, or
2) why I don't hold the MetaspaceCritical_lock across the entire GC cycle instead of purge?

I think you meant 2), so will answer that:
a) Keeping the lock across the entire GC cycle is rather problematic when it comes to constraining lock ranks. It would have to be above any lock ever needed in an entire GC cycle, yet we take a bunch of other locks that mutators hold at different points, interacting with class unloading. It would be very hard to find the right rank for this.
b) The GC goes in and out of safepoints, and sometimes needs to block out safepoints. Holding locks across code that both enters safepoints and blocks out safepoints is, in my experience, rather deadlock prone.
c) There is no need to keep the lock across more than the operation that frees metaspace memory, which in a concurrent GC always happens when safepoints are blocked out. If a mutator succeeds during a concurrent GC due to finding memory in a local free list or something, while another allocation failed and needs a critical allocation, then that is absolutely fine, as the successful allocation is never what causes the failing allocation to fail.
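Point (c) can be sketched in a few lines. This is a hypothetical shape of the purge-side critical section with made-up names, not the real ZGC/Metaspace code; the lock brackets only the phase that frees memory and satisfies queued requests:

```cpp
#include <cassert>
#include <mutex>

// Illustrative globals standing in for the HotSpot lock and free list.
std::mutex g_metaspace_critical_lock;
long g_free_words = 0;

// Hypothetical purge step: the critical lock is held only while metaspace
// is freed and pending critical requests get first claim on it, not across
// the whole GC cycle, which keeps lock ranking and safepoint interaction sane.
void purge_and_satisfy(long reclaimed_words) {
  std::lock_guard<std::mutex> guard(g_metaspace_critical_lock);
  g_free_words += reclaimed_words;  // release dead loaders' metaspace
  // ... walk the critical-allocation queue here, oldest request first ...
}
```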

> > > Why do we even need a queue? Why could we not just let the first thread attempting a synchronous gc block metaspace allocation path for all threads, including others running into a limit, until the gc is finished and it had its first-allocation-right served?
> >
> >
> > Each "critical" allocation rides on one particular GC cycle, that denotes the make-or-break point of the allocation.
>
> I feel like I should know this, but if multiple threads enter satisfy_failed_metadata_allocation around the same time and call a synchronous collect(), they would wait on the same GC, right? They won't start individual GCs for each thread?

The rule we have in ZGC to ensure what we want in terms of OOM situations, is that GCs that are *requested* before an equivalent GC *starts*, can be satisfied with the same GC cycle.

Let's go through what will likely happen in practice with my solution when you have, let's say 1000 concurrent calls to satisfy a failing metaspace allocation.

1. Thread 1 registers its allocation, sees it is the first one, and starts a metaspace GC.

2. GC starts running

3. Threads 2-1000 register their allocations, and see that there was already a critical allocation before them. This causes them to wait for the GC to purge, opportunistically, riding on that first GC.

4. The GC satisfies allocations. For the sake of example, let's say that allocations 1-500 could be satisfied, but not the rest.

5. Threads 2-500 who were waiting for purge to finish, wake up, and run off happily with their satisfied allocations.

6. Threads 501-1000 wake up seeing that their allocations didn't get satisfied. They now stop being opportunistic, and each requests a GC before finally giving up.

7. The first GC cycle finishes.

8. Thread 1 wakes up after the entire first GC cycle is done and sees its satisfied allocation, happily running off with it.

9. The next GC cycle starts

10. The next GC cycle successfully satisfies metadata allocations for threads 501-750, but not the rest.

11. The entire next GC cycle finishes, satisfying the GC requested by threads 2-1000, as they all *requested* a metaspace GC, *before* it started running. Therefore, no further GC will run.

12. Threads 501-750 wake up, happily running off with their satisfied allocations.

13. Threads 751-1000 wake up, grumpy about the lack of memory after their GC. They are all gonna throw.

So basically, if they can all get satisfied with 1 GC, then 1 GC will be enough. But we won't throw until a thread has had a full GC run *after* it was requested, but multiple threads can ride on the same GC there. In this example, threads 2-1000 all ride on the same GC.

Note though, that we would never allow a GC that is already running to satisfy a GC request that comes in while the GC is already running, as we then wouldn't catch situations when a thread releases a lot of memory, and then expects it to be available just after.
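The riding rule boils down to one predicate: a GC cycle may satisfy a request only if the request was registered before the cycle started. A tiny standalone model (the names and time-stamp representation are made up for illustration; this is not the ZGC code):

```cpp
#include <cassert>

// A GC cycle, identified by when it started and finished.
struct GCCycle {
  long start_time;
  long finish_time;
};

// True if 'cycle' may satisfy a request registered at 'request_time':
// the request must have been made before the cycle started. A cycle that
// was already running cannot count, or we would miss memory a thread just
// released and expects to be able to allocate from.
bool can_ride_on(long request_time, const GCCycle& cycle) {
  return request_time <= cycle.start_time;
}
```

In the example above, threads 2-1000 all registered before the second cycle started, so that one cycle satisfies their GC requests and no further GC runs.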

> > In order to prevent starvation, we have to satisfy all critical allocations who have their make-or-break GC cycle associated with the current purge() operation, before we release the lock in purge(), letting new allocations in, or we will rely on luck again. However, among the pending critical allocations, they will all possibly have different make-or-break GC cycles associated with them. So in purge() some of them need to be satisfied, and others do not, yet can happily get their allocations satisfied opportunistically if possible. So we need to make sure they are ordered somehow, such that the earliest arriving pending critical allocations are satisfied first, before subsequent critical allocations (possibly waiting for a later GC), or we can get premature OOM situations again, where a thread releases a bunch of memory, expecting to be able to allocate, yet fails due to races with various threads.
> > The queue basically ensures the ordering of critical allocation satisfaction is sound, so that the pending critical allocations with the associated make-or-break GC being the one running purge(), are satisfied first, before satisfying (opportunistically) other critical allocations, that are really waiting for the next GC to happen.
>
> I still don't get why the order of the critical allocations matters. I understand that even with your barrier, multiple threads can fail the initial allocation, enter the "satisfy_failed_metadata_allocation()" path, and now their allocation counts as critical, since if they fail again they will throw an OOM. But why does the order of the critical allocations among themselves matter? Why not just let the critical allocations trickle out unordered? Is the relation to the GC cycle not arbitrary?

If we don't preserve the order, we would miss situations where a thread releases a large chunk of metaspace (e.g. by releasing a class loader reference) and then expects that memory to be available. A critical allocation associated with a subsequent GC could starve a thread whose allocation is associated with the current GC cycle, causing a premature OOM for that thread, even though the later allocation does not really need the memory until the next GC cycle; satisfying them in the right order would satisfy both allocations.

One might argue it might be okay to throw in such a scenario with an unordered solution; we are pretty close to running out of memory anyway. But I'm not really sure what such a solution would look like in more detail, and writing this little list was easy enough, and for me easier to reason about, partly because we do the same dance in the GC to deal with heap OOM, which has been a success.
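The ordering argument can be made concrete with a toy model (made-up names, not the HotSpot code): purge walks the pending requests oldest-first, so the request whose GC cycle actually freed the memory gets first claim on it. An unordered walk could hand that memory to a younger request and throw a premature OOM.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A pending critical allocation, kept in arrival (queue) order.
struct Request {
  size_t word_size;
  bool   satisfied;
};

// Walk the queue head-first and let each request take its share of the
// words the GC just freed; older requests are served before younger ones.
void satisfy_in_order(std::vector<Request>& queue, size_t free_words) {
  for (Request& r : queue) {
    if (!r.satisfied && r.word_size <= free_words) {
      free_words -= r.word_size;
      r.satisfied = true;
    }
  }
}
```

With 100 free words and two queued requests of 60 words each, the older request is satisfied and the younger one waits for (or instigates) the next cycle, instead of racing for the memory.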

Thanks,
/Erik

^--- I know this will be treated as a PR bot command, but I can't give up on my slash!

> > Thanks,
> > /Erik
>
> We need an "erik" pr command :)

Indeed.

> Just FYI, I have very vague plans to extend usage of the metaspace allocator to other areas, to get back the cost of implementation. E.g. one candidate may be replacement of the current resource areas, which are just more primitive arena-based allocators. This is very vague and not a high priority, but I am a bit interested in keeping the coding generic if it's not too much effort. But I don't see your patch causing any problems there.

That sounds very interesting. We had internal discussions about riding on the MetaspaceExpand lock, which I believe would also work, but thought this really ought to be a separate thing so that we don't tie ourselves too tightly to the internal implementation of the allocator. Given that you might want to use the allocator for other stuff, it sounds like that is indeed the right choice.

> Cheers, Thomas

-------------

PR: https://git.openjdk.java.net/jdk/pull/2289

Re: RFR: 8259643: ZGC can return metaspace OOM prematurely [v2]

Thomas Stuefe
In reply to this post by Erik Österlund-3
On Thu, 28 Jan 2021 13:24:10 GMT, Erik Österlund <[hidden email]> wrote:

>> There exists a race condition for ZGC metaspace allocations, where an allocation can throw OOM due to unbounded starvation from other threads. Towards the end of the allocation dance, we conceptually do this:
>>
>> 1. full_gc()
>> 2. final_allocation_attempt()
>>
>> And if we still fail at 2 after doing a full GC, we conclude that there isn't enough metaspace memory. However, if the thread gets preempted between 1 and 2, then an unbounded number of metaspace allocations from other threads can fill up the entire metaspace, making the final allocation attempt fail and hence throw. This can cause a situation where almost the entire metaspace is unreachable from roots, yet we throw OOM. I managed to reproduce this with the right sleeps.
>>
>> The way we deal with this particular issue for heap allocations, is to have an allocation request queue, and satisfy those allocations before others, preventing starvation. My solution to this metaspace OOM problem will be to basically do exactly that - have a queue of "critical" allocations, that get precedence over normal metaspace allocations.
>>
>> The solution should work for other concurrent GCs (who likely have the same issue), but I only tried this with ZGC, so I am only hooking in ZGC to the new API (for concurrently unloading GCs to manage critical metaspace allocations) at this point.
>>
>> Passes ZGC tests from tier 1-5, and the particular test that failed (with the JVM sleeps that make it fail deterministically).
>
> Erik Österlund has updated the pull request incrementally with one additional commit since the last revision:
>
>   polish code alignment and rename register/unregister to add/remove

Hi Erik,

sorry for the delay, I got swamped.

thanks for your patient explanations. I think I understand most of it now, but I still have a number of questions. Also see the code remarks, though most are just more questions.

> > One issue with your patch just came to me: the block-on-allocate may be too early. `Metaspace::allocate()` is a bit hot. I wonder about the performance impact of pulling and releasing a lock on each atomic allocation, even if it's uncontended. Ideally I'd like to keep this path as close to a simple pointer bump allocation as possible (which it isn't, unfortunately).
>
> I have a global flag that denotes there being at least 1 critical allocation in the system. It is set by the first critical allocation, and cleared by the GC if all critical allocations could be satisfied. The global lock is only taken in Metaspace::allocate() if said flag is on. Normal apps should never see this flag being on. So the only overhead in the common case is to check the flag and see that no locking is required. I think that should be fast enough, right? And when you are in the critical mode, you obviously have more to worry about than lock contention, in terms of how the system is performing.

I overlooked the flag check. I think this is fine then in its current form.
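For readers following along, the flag-gated fast path described here can be sketched like this (simplified stand-ins; the real flag, lock, and queueing logic in HotSpot differ):

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Set by the first critical allocation, cleared by the GC once all critical
// allocations have been satisfied.
std::atomic<bool> g_has_critical_allocations{false};
std::mutex        g_critical_lock;

// Prologue of a metaspace allocation. Returns true if the slow (locked)
// path was taken; the common case is a single atomic load and no lock.
bool allocate_prologue() {
  if (g_has_critical_allocations.load(std::memory_order_acquire)) {
    // Rare path: serialize behind the pending critical allocations.
    std::lock_guard<std::mutex> guard(g_critical_lock);
    // ... the real code would queue behind the critical requests here ...
    return true;
  }
  return false;  // fast path: flag clear, lock never touched
}
```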

> >
> > I think I get this now. IIUC we have the problem that memory release happens delayed, and we carry around a baggage of "potentially free" memory which needs a collector run to materialize. So many threads jumping up and down and doing class loading and unloading drive up metaspace use rate and increase the "potentially free" overhead, right? So the thing is to time collector runs right.
>
> Exactly.
>
> > One concern I have is that if the customer runs with too tight a limit, we may bounce from full GC to full GC. Always scraping the barrel enough to just keep going - maybe collecting some short lived loaders - but not enough to get the VM into clear waters. I think this may be an issue today already. What is unclear to me is when it would be just better to give up and throw an OOM. To motivate the customer to increase the limits.
>
> Right - this is a bit of a philosophical one. There is always a balance there between precision, code complexity, and when to put the VM out of its misery when it is performing poorly. We deal with the same trade-off really with heap allocations, which is why I am also solving the starvation problem in pretty much the same way: with a queue satisfied by the GC, and locking out starvation. Then we throw OOM in fairly similar conditions. What they have in common is that when they throw, we will have a live portion of metaspace that is "pretty much" full, and there is no point in continuing, while allowing unfortunate timings on a less full (in terms of temporal liveness) metaspace to always succeed.
>
> One might argue that the trade-off should be moved in some direction, and that it is okay for it to be more or less exact, but I was hoping that by doing the same dance that heap OOM situations do, we can at least follow a trade-off that is pretty well established and has worked pretty well for heap OOM situations for many years. And I think heap OOM situations in the wild are far more common than metaspace OOM, so I don't think that the metaspace OOM mechanism needs to do better than what the heap OOM mechanism does. If that makes sense.

It makes sense. If it's an established, proven pattern, let's use it here too. It's not that complex.

I think there are differences between heap and metaspace in elasticity. Metaspace is way less "spongy"; the chance of recovering metaspace is slimmer than with the heap, so a Full GC is more expensive in relation to its effects. I think I have seen series of Full GCs on quite empty heaps when customers set MaxMetaspaceSize too low (people seem to like doing that). I'm worried about small loaders with short lifetimes which just allow the VM to reclaim enough to keep going. Reflection invocation loaders, or projects like JRuby. But I think this is not the norm, nor does it have anything to do with your patch. Typically we do just one futile GC and then throw OOM.

> >
> > Why not just cover the whole synchronous GC collect call? I'd put that barrier up as early as possible, to prevent as many threads as possible from entering the more expensive fail path. At that point we know we are near exhaustion. Any thread allocating could just as well wait inside MetaspaceArena::allocate. If the GC succeeds in releasing lots of memory, they will not have been disturbed much.
>
> Do you mean
>
> 1. why I don't hold the MetaspaceCritical_lock across the collect() call at the mutator side of the code, or
> 2. why I don't hold the MetaspaceCritical_lock across the entire GC cycle instead of purge?
>
> I think you meant 2), so will answer that:
> a) Keeping the lock across the entire GC cycle is rather problematic when it comes to constraining lock ranks. It would have to be above any lock ever needed in an entire GC cycle, yet we take a bunch of other locks that mutators hold at different points, interacting with class unloading. It would be very hard to find the right rank for this.
> b) The GC goes in and out of safepoints, and needs to sometimes block out safepoints. Holding locks in and out of safepoints while blocking in and out safepoints, is in my experience rather deadlock prone.
> c) There is no need to keep the lock across more than the operation that frees metaspace memory, which in a concurrent GC always happens when safepoints are blocked out. If a mutator succeeds during a concurrent GC due to finding memory in a local free list or something, while another allocation failed and needs a critical allocation, then that is absolutely fine, as the successful allocation is never what causes the failing allocation to fail.

Thanks, that makes sense. I was completely off track, thinking more in the direction of (1), wrt the current coding, not your patch. I thought that if the mutator thread locks across the collect call and the subsequent allocation attempt (which would have to be some sort of priority allocation, ignoring the lock), this would be a simple solution. But I think that has a number of holes, never mind the allocation request ordering.

>
> > > > Why do we even need a queue? Why could we not just let the first thread attempting a synchronous gc block metaspace allocation path for all threads, including others running into a limit, until the gc is finished and it had its first-allocation-right served?
> > >
> > >
> > > Each "critical" allocation rides on one particular GC cycle, that denotes the make-or-break point of the allocation.
> >
> >
> > I feel like I should know this, but if multiple threads enter satisfy_failed_metadata_allocation around the same time and call a synchronous collect(), they would wait on the same GC, right? They won't start individual GCs for each thread?
>
> The rule we have in ZGC to ensure what we want in terms of OOM situations, is that GCs that are _requested_ before an equivalent GC _starts_, can be satisfied with the same GC cycle.
>
> Let's go through what will likely happen in practice with my solution when you have, let's say 1000 concurrent calls to satisfy a failing metaspace allocation.
>
> 1. Thread 1 registers its allocation, sees it is the first one, and starts a metaspace GC.
> 2. GC starts running
> 3. Threads 2-1000 register their allocations, and see that there was already a critical allocation before them. This causes them to wait for the GC to purge, opportunistically, riding on that first GC.
> 4. The GC satisfies allocations. For the sake of example, let's say that allocations 1-500 could be satisfied, but not the rest.
> 5. Threads 2-500 who were waiting for purge to finish, wake up, and run off happily with their satisfied allocations.
> 6. Threads 501-1000 wake up seeing that their allocations didn't get satisfied. They now stop being opportunistic, and request a gc each before finally giving up.
> 7. The first GC cycle finishes.
> 8. Thread 1 wakes up after the entire first GC cycle is done and sees its satisfied allocation, happily running off with it.
> 9. The next GC cycle starts
> 10. The next GC cycle successfully satisfies metadata allocations for threads 501-750, but not the rest.
> 11. The entire next GC cycle finishes, satisfying the GC requested by threads 2-1000, as they all _requested_ a metaspace GC, _before_ it started running. Therefore, no further GC will run.
> 12. Threads 501-750 wake up, happily running off with their satisfied allocations.
> 13. Threads 751-1000 wake up, grumpy about the lack of memory after their GC. They are all gonna throw.
>
> So basically, if they can all get satisfied with 1 GC, then 1 GC will be enough. But we won't throw until a thread has had a full GC run _after_ it was requested, but multiple threads can ride on the same GC there. In this example, threads 2-1000 all ride on the same GC.

This is a nice, elegant solution. I had some trouble understanding your explanation: when you write "threads 2-1000 all ride on the same GC", this confused me, since threads 2-500 were lucky and were satisfied by the purge in GC cycle 1. So I would have expected threads 501-1000 to ride on the second GC, and threads 1-500 on the first. Unless by "ride" you mean "guaranteed to be processed"?

>
> Note though, that we would never allow a GC that is already running to satisfy a GC request that comes in while the GC is already running, as we then wouldn't catch situations when a thread releases a lot of memory, and then expects it to be available just after.
>
> > > In order to prevent starvation, we have to satisfy all critical allocations who have their make-or-break GC cycle associated with the current purge() operation, before we release the lock in purge(), letting new allocations in, or we will rely on luck again. However, among the pending critical allocations, they will all possibly have different make-or-break GC cycles associated with them. So in purge() some of them need to be satisfied, and others do not, yet can happily get their allocations satisfied opportunistically if possible. So we need to make sure they are ordered somehow, such that the earliest arriving pending critical allocations are satisfied first, before subsequent critical allocations (possibly waiting for a later GC), or we can get premature OOM situations again, where a thread releases a bunch of memory, expecting to be able to allocate, yet fails due to races with various threads.
> > > The queue basically ensures the ordering of critical allocation satisfaction is sound, so that the pending critical allocations with the associated make-or-break GC being the one running purge(), are satisfied first, before satisfying (opportunistically) other critical allocations, that are really waiting for the next GC to happen.
> >
> >
> > I still don't get why the order of the critical allocations matters. I understand that even with your barrier, multiple threads can fail the initial allocation, enter the "satisfy_failed_metadata_allocation()" path, and now their allocation counts as critical, since if they fail again they will throw an OOM. But why does the order of the critical allocations among themselves matter? Why not just let the critical allocations trickle out unordered? Is the relation to the GC cycle not arbitrary?
>
> If we don't preserve the order, we would miss situations where a thread releases a large chunk of metaspace (e.g. by releasing a class loader reference) and then expects that memory to be available. A critical allocation associated with a subsequent GC could starve a thread whose allocation is associated with the current GC cycle, causing a premature OOM for that thread, even though the later allocation does not really need the memory until the next GC cycle; satisfying them in the right order would satisfy both allocations.
>
> One might argue it might be okay to throw in such a scenario with an unordered solution. We are pretty close to running out of memory anyway I guess. But I'm not really sure what such a solution would look like in more detail, and thought writing this little list was easy enough, and for me easier to reason about, partially because we do the same dance in the GC to deal with heap OOM, which has been a success.

I think I get this now. It's still a brain teaser if you are not used to the pattern, but I think I got most of it. If the pattern is well established and proven, it makes sense.

Thanks for the great explanations!

>
> Thanks,
> /Erik
>
> ^--- I know this will be treated as a PR bot command, but I can't give up on my slash!
>

Alias it to integrate :)

>
> > Just FYI, I have very vague plans to extend usage of the metaspace allocator to other areas, to get back the cost of implementation. E.g. one candidate may be replacement of the current resource areas, which are just more primitive arena-based allocators. This is very vague and not a high priority, but I am a bit interested in keeping the coding generic if it's not too much effort. But I don't see your patch causing any problems there.
>
> That sounds very interesting. We had internal discussions about riding on the MetaspaceExpand lock which I believe would also work, but thought this really ought to be a separate thing so that we don't tie ourselves too tight to the internal implementation of the allocator. Given that you might want to use the allocator for other stuff, it sounds like that is indeed the right choice.

Yes, I think this is definitely better. The expand lock (I renamed it to be just the Metaspace_lock since it's also used for reclamation and other stuff) is used in a more fine-grained fashion. I cannot see it working in the same way as the critical lock, protecting the queue and preventing entry for non-priority allocations.

Thanks, Thomas

src/hotspot/share/memory/metaspaceCriticalAllocation.cpp line 37:

> 35:   ClassLoaderData*           _loader_data;
> 36:   size_t                     _word_size;
> 37:   Metaspace::MetadataType    _type;

These could be const, right? A const pointer in the case of the CLD.
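For illustration, the fields in question could be declared like this (with stand-in types so the sketch compiles on its own; the real class lives in metaspaceCriticalAllocation.cpp):

```cpp
#include <cassert>
#include <cstddef>

struct ClassLoaderData {};                    // stand-in for the HotSpot type
enum class MetadataType { NonClass, Class };  // stand-in for Metaspace::MetadataType

class MetadataAllocationRequest {
  ClassLoaderData* const _loader_data;  // const pointer: the pointer is fixed,
                                        // the CLD it points to stays mutable
  const size_t           _word_size;
  const MetadataType     _type;
 public:
  MetadataAllocationRequest(ClassLoaderData* cld, size_t word_size, MetadataType type)
      : _loader_data(cld), _word_size(word_size), _type(type) {}
  ClassLoaderData* loader_data() const { return _loader_data; }
  size_t word_size() const { return _word_size; }
  MetadataType type() const { return _type; }
};
```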

src/hotspot/share/memory/metaspaceCriticalAllocation.cpp line 40:

> 38:   MetadataAllocationRequest* _next;
> 39:   MetaWord*                  _result;
> 40:   bool                       _has_result;

I think `_has_result` is confusing, since I first mistook it for `_result != NULL`. Suggestion: `_processed` or `_handled`?

src/hotspot/share/memory/metaspaceCriticalAllocation.cpp line 170:

> 168:     return request.result();
> 169:   }
> 170:

I don't fully understand yet how in the second GC cycles the leftover critical allocations from the first cycles are handled:

100 threads allocate, run into the limit, and enter MetaspaceCriticalAllocation::allocate; each creates a request object on its stack; the requests are linked, and now they are all queued.

1) t1 (passing through try_allocate_critical) calls collect and waits.
2) t2-t100 enter try_allocate_critical and enter the wait loop in wait_for_purge().
3) The GC locks the critical lock, frees up metaspace, and calls MetaspaceCriticalAllocation::satisfy(), which attempts to allocate for all queued requests in the order they came in, since that's the queue. It satisfies, let's say, t1 and t2, but t3-t100 fail to allocate. All requests are marked as "have result" though. The GC unlocks the critical lock.
4) t1 gets control and returns != NULL.
5) t2 gets control, try_allocate_critical() returns true, and it returns != NULL.
6) t3 gets control, try_allocate_critical() returns false, and now it is the second GC instigator. It calls collect and waits.
7) t4 gets control, try_allocate_critical() returns false, and also calls collect?
8) t5-t100 too?

Why do we not keep t4-t100 in wait_for_purge()? They all now wait directly on the collect call? Also, any critical allocation not satisfied by this second GC cycle is now considered failing? Or is that the point?

Edit: wait, I did not read your description closely. This is case (6) in your list. Have I got that right?

src/hotspot/share/memory/metaspaceCriticalAllocation.cpp line 150:

> 148:     MetaWord* result = curr->loader_data()->metaspace_non_null()->allocate(curr->word_size(), curr->type());
> 149:     if (result == NULL) {
> 150:       result = curr->loader_data()->metaspace_non_null()->expand_and_allocate(curr->word_size(), curr->type());

I am not sure it makes sense to retry expanding here. The HWM at this point must be fully extended, since the hard limit is the problem, not the HWM. The HWM may get reduced at the end of the GC, e.g. in ZUnload::finish, but this is yet to come at this point, right?

src/hotspot/share/memory/metaspaceCriticalAllocation.cpp line 171:

> 169:   }
> 170:
> 171:   // Always perform a synchronous full GC before bailing

Can we have a comment here stating that this stalls the calling thread? It's only obvious if you know the collector details.

src/hotspot/share/memory/metaspaceCriticalAllocation.cpp line 126:

> 124:
> 125: void MetaspaceCriticalAllocation::wait_for_purge(MetadataAllocationRequest* request) {
> 126:   while (!request->has_result()) {

I'm unclear on how we could leave `wait` and still not be handled here. Since the lock is only notified at the end of Metaspace::purge, which itself is under critical lock protection, which must mean this request was queued before Metaspace::purge started.
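For reference, the quoted loop is the standard monitor-wait pattern: even if the only notify today happens at the end of Metaspace::purge, re-checking the predicate makes the code robust against spurious wakeups and future notify sites. A simplified standalone version using std primitives (the real code uses a HotSpot Monitor):

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

struct PurgeWaiter {
  std::mutex              mutex;
  std::condition_variable cv;
  bool                    has_result = false;

  // Block until purge marks this request as handled. The predicate overload
  // of wait() re-checks the flag on every wakeup, so a spurious wakeup
  // simply goes back to sleep.
  void wait_for_purge() {
    std::unique_lock<std::mutex> lock(mutex);
    cv.wait(lock, [this] { return has_result; });
  }

  // Called by the purging thread once the request has been processed.
  void mark_handled() {
    { std::lock_guard<std::mutex> guard(mutex); has_result = true; }
    cv.notify_all();
  }
};
```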

src/hotspot/share/memory/metaspaceCriticalAllocation.cpp line 143:

> 141:   bool all_satisfied = true;
> 142:   for (MetadataAllocationRequest* curr = _requests_head; curr != NULL; curr = curr->next()) {
> 143:     if (curr->result() != NULL) {

So this means the thread had no occasion to continue and leave wait_for_purge() until the second GC ran?

src/hotspot/share/memory/metaspaceCriticalAllocation.cpp line 114:

> 112: }
> 113:
> 114: bool MetaspaceCriticalAllocation::try_allocate_critical(MetadataAllocationRequest* request) {

I think this is a bit of a misnomer, since we do nothing here but wait for the purge. Call it `wait_for_purge` maybe, and merge the current `wait_for_purge` function into this function (since it's only called from here and quite small)?


Re: RFR: 8259643: ZGC can return metaspace OOM prematurely [v2]

Erik Österlund-3
On Wed, 3 Feb 2021 16:18:53 GMT, Thomas Stuefe <[hidden email]> wrote:

> Hi Erik,
>
> sorry for the delay, I got swamped.

Hi Thomas,

Sorry I also got swamped and still am swamped. Hope it is okay if I come back to this a bit later; there are higher priority items for me to deal with at the moment.

Thanks
-- Erik


Re: RFR: 8259643: ZGC can return metaspace OOM prematurely [v2]

Thomas Stuefe
In reply to this post by Erik Österlund-3
On Thu, 28 Jan 2021 13:24:10 GMT, Erik Österlund <[hidden email]> wrote:

>> There exists a race condition for ZGC metaspace allocations, where an allocation can throw OOM due to unbounded starvation from other threads. Towards the end of the allocation dance, we conceptually do this:
>>
>> 1. full_gc()
>> 2. final_allocation_attempt()
>>
>> And if we still fail at 2 after doing a full GC, we conclude that there isn't enough metaspace memory. However, if the thread gets preempted between 1 and 2, then an unbounded number of metaspace allocations from other threads can fill up the entire metaspace, making the final allocation attempt fail and hence throw. This can cause a situation where almost the entire metaspace is unreachable from roots, yet we throw OOM. I managed to reproduce this with the right sleeps.
>>
>> The way we deal with this particular issue for heap allocations, is to have an allocation request queue, and satisfy those allocations before others, preventing starvation. My solution to this metaspace OOM problem will be to basically do exactly that - have a queue of "critical" allocations, that get precedence over normal metaspace allocations.
>>
>> The solution should work for other concurrent GCs (who likely have the same issue), but I only tried this with ZGC, so I am only hooking in ZGC to the new API (for concurrently unloading GCs to manage critical metaspace allocations) at this point.
>>
>> Passes ZGC tests from tier 1-5, and the particular test that failed (with the JVM sleeps that make it fail deterministically).
>
> Erik Österlund has updated the pull request incrementally with one additional commit since the last revision:
>
>   polish code alignment and rename register/unregister to add/remove

Marked as reviewed by stuefe (Reviewer).


Re: RFR: 8259643: ZGC can return metaspace OOM prematurely [v2]

Thomas Stuefe
In reply to this post by Erik Österlund-3
On Mon, 8 Mar 2021 14:03:18 GMT, Erik Österlund <[hidden email]> wrote:

>> Hi Erik,
>>
>> sorry for the delay, I got swamped.
>>
>> thanks for your patient explanations. I think I understand most of it now, but I still have a number of questions. Also see the code remarks, though most are just more questions.
>>
>>> > One issue with your patch just came to me: the block-on-allocate may be too early. `Metaspace::allocate()` is a bit hot. I wonder about the performance impact of pulling and releasing a lock on each atomic allocation, even if it's uncontended. Ideally I'd like to keep this path as close to a simple pointer bump allocation as possible (which it isn't, unfortunately).
>>>
>>> I have a global flag that denotes there being at least 1 critical allocation in the system. It is set by the first critical allocation, and cleared by the GC if all critical allocations could be satisfied. The global lock is only taken in Metaspace::allocate() if said flag is on. Normal apps should never see this flag being on. So the only overhead in the common case is to check the flag and see that no locking is required. I think that should be fast enough, right? And when you are in the critical mode, you obviously have more to worry about than lock contention, in terms of how the system is performing.
>>
>> I overlooked the flag check. I think this is fine then in its current form.
>>
>>> >
>>> > I think I get this now. IIUC we have the problem that memory release happens delayed, and we carry around a baggage of "potentially free" memory which needs a collector run to materialize. So many threads jumping up and down and doing class loading and unloading drive up metaspace use rate and increase the "potentially free" overhead, right? So the thing is to time collector runs right.
>>>
>>> Exactly.
>>>
>>> > One concern I have is that if the customer runs with too tight a limit, we may bounce from full GC to full GC. Always scraping the barrel enough to just keep going - maybe collecting some short lived loaders - but not enough to get the VM into clear waters. I think this may be an issue today already. What is unclear to me is when it would be just better to give up and throw an OOM. To motivate the customer to increase the limits.
>>>
>>> Right - this is a bit of a philosophical one. There is always a balance between precision, code complexity, and when to put the VM out of its misery when it is performing poorly. We deal with the same trade-off for heap allocations, which is why I am solving the starvation problem in pretty much the same way: with a queue satisfied by the GC, and locking out starvation. Then we throw OOM under fairly similar conditions. What the two have in common is that when they throw, the live portion of metaspace is "pretty much" full and there is no point in continuing, while an allocation that merely hits unfortunate timing on a less full (in terms of temporal liveness) metaspace will always succeed.
>>>
>>> One might argue that the trade-off should be moved in some direction, and that it is okay for it to be more or less exact, but I was hoping that by doing the same dance that heap OOM situations do, we can at least follow a trade-off that is pretty well established and has worked pretty well for heap OOM situations for many years. And I think heap OOM situations in the wild are far more common than metaspace OOM, so I don't think that the metaspace OOM mechanism needs to do better than what the heap OOM mechanism does. If that makes sense.
>>
>> It makes sense. If it's an established, proven pattern, let's use it here too. It's not that complex.
>>
>> I think there are differences between heap and metaspace in elasticity. Metaspace is way less "spongy": the chance of recovering metaspace is slimmer than with the heap, so a Full GC is more expensive in relation to its effects. I think I have seen series of Full GCs on quite empty heaps when customers set MaxMetaspaceSize too low (people seem to like doing that). I'm worried about small loaders with short lifetimes which just allow the VM to reclaim enough to keep going - reflection invocation loaders, or projects like JRuby. But I think this is not the norm, nor does it have anything to do with your patch. Typically we do just one futile GC and then throw OOM.
>>
>>> >
>>> > Why not just cover the whole synchronous GC collect call? I'd put that barrier up as early as possible, to prevent as many threads as possible from entering the more expensive fail path. At that point we know we are near exhaustion. Any thread allocating could just as well wait inside MetaspaceArena::allocate. If the GC succeeds in releasing lots of memory, they will not have been disturbed much.
>>>
>>> Do you mean
>>>
>>> 1. why I don't hold the MetaspaceCritical_lock across the collect() call at the mutator side of the code, or
>>> 2. why I don't hold the MetaspaceCritical_lock across the entire GC cycle instead of purge?
>>>
>>> I think you meant 2), so will answer that:
>>> a) Keeping the lock across the entire GC cycle is rather problematic when it comes to constraining lock ranks. It would have to be above any lock ever needed in an entire GC cycle, yet we take a bunch of other locks that mutators hold at different point, interacting with class unloading. It would be very hard to find the right rank for this.
>>> b) The GC goes in and out of safepoints, and needs to sometimes block out safepoints. Holding locks in and out of safepoints while blocking in and out safepoints, is in my experience rather deadlock prone.
>>> c) There is no need to keep the lock across more than the operation that frees metaspace memory, which in a concurrent GC always happens when safepoints are blocked out. If a mutator succeeds during a concurrent GC due to finding memory in a local free list or something, while another allocation failed and needs a critical allocation, then that is absolutely fine, as the successful allocation is never what causes the failing allocation to fail.
>>
>> Thanks, that makes sense. I was completely off track, thinking more in the direction of (1), w.r.t. the current coding, not your patch. I thought that if the mutator thread locks across the collect call and the subsequent allocation attempt (which would have to be some sort of priority allocation, ignoring the lock), this would be a simple solution. But I think that has a number of holes, never mind the allocation request ordering.
>>
>>>
>>> > > > Why do we even need a queue? Why could we not just let the first thread attempting a synchronous gc block metaspace allocation path for all threads, including others running into a limit, until the gc is finished and it had its first-allocation-right served?
>>> > >
>>> > >
>>> > > Each "critical" allocation rides on one particular GC cycle, that denotes the make-or-break point of the allocation.
>>> >
>>> >
>>> > I feel like I should know this, but if multiple threads enter satisfy_failed_metadata_allocation around the same time and call a synchronous collect(), they would wait on the same GC, right? They won't start individual GCs for each thread?
>>>
>>> The rule we have in ZGC to ensure what we want in terms of OOM situations, is that GCs that are _requested_ before an equivalent GC _starts_, can be satisfied with the same GC cycle.
>>>
>>> Let's go through what will likely happen in practice with my solution when you have, let's say 1000 concurrent calls to satisfy a failing metaspace allocation.
>>>
>>> 1. Thread 1 registers its allocation, sees it is the first one, and starts a metaspace GC.
>>> 2. GC starts running
>>> 3. Threads 2-1000 register their allocations, and see that there was already a critical allocation before them. This causes them to wait for the GC to purge, opportunistically, riding on that first GC.
>>> 4. The GC satisfies allocations. For the sake of example, let's say that allocations 1-500 could be satisfied, but not the rest.
>>> 5. Threads 2-500 who were waiting for purge to finish, wake up, and run off happily with their satisfied allocations.
>>> 6. Threads 501-1000 wake up seeing that their allocations didn't get satisfied. They now stop being opportunistic, and each requests a GC before finally giving up.
>>> 7. The first GC cycle finishes.
>>> 8. Thread 1 wakes up after the entire first GC cycle is done and sees its satisfied allocation, happily running off with it.
>>> 9. The next GC cycle starts
>>> 10. The next GC cycle successfully satisfies metadata allocations for threads 501-750, but not the rest.
>>> 11. The entire next GC cycle finishes, satisfying the GC requested by threads 2-1000, as they all _requested_ a metaspace GC, _before_ it started running. Therefore, no further GC will run.
>>> 12. Threads 501-750 wake up, happily running off with their satisfied allocations.
>>> 13. Threads 751-1000 wake up, grumpy about the lack of memory after their GC. They are all gonna throw.
>>>
>>> So basically, if they can all get satisfied with 1 GC, then 1 GC will be enough. But we won't throw until a thread has had a full GC run _after_ it was requested, but multiple threads can ride on the same GC there. In this example, threads 2-1000 all ride on the same GC.
>>
>> This is a nice, elegant solution. I had some trouble understanding your explanation: when you write "threads 2-1000 all ride on the same GC", this confused me, since threads 2-500 were lucky and were satisfied by the purge in GC cycle 1. So I would have expected threads 501-1000 to ride on the second GC, and threads 1-500 on the first. Unless by "ride" you mean "guaranteed to be processed"?
>>
>>>
>>> Note though, that we would never allow a GC that is already running to satisfy a GC request that comes in while the GC is already running, as we then wouldn't catch situations when a thread releases a lot of memory, and then expects it to be available just after.
>>>
>>> > > In order to prevent starvation, we have to satisfy all critical allocations who have their make-or-break GC cycle associated with the current purge() operation, before we release the lock in purge(), letting new allocations in, or we will rely on luck again. However, among the pending critical allocations, they will all possibly have different make-or-break GC cycles associated with them. So in purge() some of them need to be satisfied, and others do not, yet can happily get their allocations satisfied opportunistically if possible. So we need to make sure they are ordered somehow, such that the earliest arriving pending critical allocations are satisfied first, before subsequent critical allocations (possibly waiting for a later GC), or we can get premature OOM situations again, where a thread releases a bunch of memory, expecting to be able to allocate, yet fails due to races with various threads.
>>> > > The queue basically ensures the ordering of critical allocation satisfaction is sound, so that the pending critical allocations with the associated make-or-break GC being the one running purge(), are satisfied first, before satisfying (opportunistically) other critical allocations, that are really waiting for the next GC to happen.
>>> >
>>> >
>>> > I still don't get why the order of the critical allocations matters. I understand that even with your barrier, multiple threads can fail the initial allocation, enter the "satisfy_failed_metadata_allocation()" path, and now their allocations count as critical, since if they fail again they will throw an OOM. But why does the order of the critical allocations among themselves matter? Why not just let the critical allocations trickle out unordered? Is the relation to the GC cycle not arbitrary?
>>>
>>> If we don't preserve the order, we would miss situations where a thread releases a large chunk of metaspace (e.g. by releasing a class loader reference) and then expects that memory to be available. A critical allocation associated with a subsequent GC could starve a thread whose make-or-break cycle is the current one, causing a premature OOM for that thread, even though the later allocation did not really need to be satisfied until the next GC cycle; satisfying them in arrival order would satisfy both allocations.
>>>
>>> One might argue that it is okay to throw in such a scenario with an unordered solution - we are pretty close to running out of memory anyway, I guess. But I'm not really sure what such a solution would look like in more detail, and writing this little list was easy enough, and for me easier to reason about, partially because we do the same dance in the GC to deal with heap OOM, which has been a success.
>>
>> I think I get this now. It's still a brain teaser if you are not used to the pattern, but I think I got most of it. If the pattern is well established and proven, it makes sense.
>>
>> Thanks for the great explanations!
>>
>>>
>>> Thanks,
>>> /Erik
>>>
>>> ^--- I know this will be treated as a PR bot command, but I can't give up on my slash!
>>>
>>
>> Alias it to integrate :)
>>
>>>
>>> > Just FYI, I have very vague plans to extend usage of the metaspace allocator to other areas. To get back the cost of implementation. Eg one candidate may be replacement of the current resource areas, which are just more primitive arena based allocators. This is very vague and not a high priority, but I am a bit interested in keeping the coding generic if its not too much effort. But I don't see your patch causing any problems there.
>>>
>>> That sounds very interesting. We had internal discussions about riding on the MetaspaceExpand lock, which I believe would also work, but thought this really ought to be a separate thing so that we don't tie ourselves too tightly to the internal implementation of the allocator. Given that you might want to use the allocator for other stuff, it sounds like that is indeed the right choice.
>>
>> Yes, I think this is definitely better. The expand lock (I renamed it to just Metaspace_lock since it is also used for reclamation and other stuff) is used in a more fine-grained fashion. I cannot see it working in the same way as the critical lock, protecting the queue and preventing entry for non-priority allocations.
>>
>> Thanks, Thomas
>
>> Hi Erik,
>>
>> sorry for the delay, I got swamped.
>
> Hi Thomas,
>
> Sorry I also got swamped and still am swamped. Hope it is okay if I come back to this a bit later; there are higher priority items for me to deal with at the moment.
>
> Thanks
> -- Erik

Hi Erik,

please feel free to commit this, and answer my question at your leisure, if at all. I am fine with your change as it is (now that I understand it :)

Cheers, Thomas

-------------

PR: https://git.openjdk.java.net/jdk/pull/2289

Re: RFR: 8259643: ZGC can return metaspace OOM prematurely [v2]

Erik Österlund-3
In reply to this post by Thomas Stuefe
On Mon, 8 Mar 2021 14:40:29 GMT, Thomas Stuefe <[hidden email]> wrote:

>> Erik Österlund has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   polish code alignment and rename register/unregister to add/remove
>
> Marked as reviewed by stuefe (Reviewer).

> Hi Erik,
>
> please feel free to commit this, and answer my question at your leisure, if at all. I am fine with your change as it is (now that I understand it :)
>
> Cheers, Thomas

Okay - thanks Thomas! :-)

-- Erik

-------------

PR: https://git.openjdk.java.net/jdk/pull/2289

Re: RFR: 8259643: ZGC can return metaspace OOM prematurely [v2]

Coleen Phillimore-3
On Tue, 9 Mar 2021 11:34:36 GMT, Erik Österlund <[hidden email]> wrote:

>> Marked as reviewed by stuefe (Reviewer).
>
>> Hi Erik,
>>
>> please feel free to commit this, and answer my question at your leisure, if at all. I am fine with your change as it is (now that I understand it :)
>>
>> Cheers, Thomas
>
> Okay - thanks Thomas! :-)
>
> -- Erik

This change is very interesting and looks fine to me for the record, but you didn't add the test.  I'm a bit greedy for class loading and unloading tests.  Even if it has timing dependencies to show the bug, I think it would be worth adding to jtreg.  Could you add it with a follow-on RFE?  Thanks!

-------------

PR: https://git.openjdk.java.net/jdk/pull/2289