Quantcast

RFR: 8168914: Crash in ClassLoaderData/JNIHandleBlock::oops_do during concurrent marking

classic Classic list List threaded Threaded
46 messages Options
123
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RFR: 8168914: Crash in ClassLoaderData/JNIHandleBlock::oops_do during concurrent marking

Erik Helin-2
Hi all,

this patch aims to solve the bug [0] by making it safe for a GC to
concurrently traverse the oops in a ClassLoaderData.

The problem right now is that ClassLoaderData::_handles is a
JNIHandleBlock. It is not safe for one thread to iterate over the oops
in a JNIHandleBlock while another thread concurrently adds a new oop to
the JNIHandleBlock.

My proposed solution is to replace JNIHandleBlock with another data
structure for ClassLoaderData. ClassLoaderData only needs a place to
store oops that a GC can iterate over, it has no use for the rest of the
methods in the JNIHandleBlock class. I have implemented a minimal
chunked linked list data structure (internal to ClassLoaderData) with
only two public methods:
- add (only executed by one thread at a time)
- oops_do (can be executed by any number of threads, possibly
            concurrently with a call to `add`)

ChunkedHandleList uses load_acquire and release_store to synchronize
access to the underlying chunks (arrays). Since ChunkedHandleList
utilizes the (rather) minimal requirements of its user
(ClassLoaderData), I decided to keep the class internal to
ClassLoaderData for now. If other find it useful elsewhere, the we can
try to create a more generalized version (this is trickier than it
appears at first sight, I tried ;)).

I also changed ClassLoaderData::remove_handle to write NULL to the oop*
(that is represented by a jobject), instead of using a sentinel oop as
done by JNIHandleBlock. The GC closures should be prepared to handle a
field in a Java object becoming NULL, so this should be fine.

Bug:
https://bugs.openjdk.java.net/browse/JDK-8168914

Webrev:
http://cr.openjdk.java.net/~ehelin/8168914/00/

Testing:
- Tier 1 (aka JPRT), 2, 3, 4, 5.

I would appreciate quite a few reviews on this patch, it contains a nice
mixture of:
- me venturing into the runtime code :)
- lock free code
- unreproducible bug (even though I'm sure of the root cause)

Thanks for everyone participating in the discussions around this bug and
potential solutions: Volker, Coleen, Mikael G, StefanK, Erik Ö and Jiangli.

Thanks!
Erik

[0]: https://bugs.openjdk.java.net/browse/JDK-8168914
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: 8168914: Crash in ClassLoaderData/JNIHandleBlock::oops_do during concurrent marking

David Holmes
Hi Erik,

On 15/02/2017 2:40 AM, Erik Helin wrote:

> Hi all,
>
> this patch aims to solve the bug [0] by making it safe for a GC to
> concurrently traverse the oops in a ClassLoaderData.
>
> The problem right now is that ClassLoaderData::_handles is a
> JNIHandleBlock. It is not safe for one thread to iterate over the oops
> in a JNIHandleBlock while another thread concurrently adds a new oop to
> the JNIHandleBlock.
>
> My proposed solution is to replace JNIHandleBlock with another data
> structure for ClassLoaderData. ClassLoaderData only needs a place to
> store oops that a GC can iterate over, it has no use for the rest of the
> methods in the JNIHandleBlock class. I have implemented a minimal
> chunked linked list data structure (internal to ClassLoaderData) with
> only two public methods:
> - add (only executed by one thread at a time)
> - oops_do (can be executed by any number of threads, possibly
>            concurrently with a call to `add`)
>
> ChunkedHandleList uses load_acquire and release_store to synchronize
> access to the underlying chunks (arrays). Since ChunkedHandleList
> utilizes the (rather) minimal requirements of its user
> (ClassLoaderData), I decided to keep the class internal to
> ClassLoaderData for now. If other find it useful elsewhere, the we can
> try to create a more generalized version (this is trickier than it
> appears at first sight, I tried ;)).
>
> I also changed ClassLoaderData::remove_handle to write NULL to the oop*
> (that is represented by a jobject), instead of using a sentinel oop as
> done by JNIHandleBlock. The GC closures should be prepared to handle a
> field in a Java object becoming NULL, so this should be fine.
>
> Bug:
> https://bugs.openjdk.java.net/browse/JDK-8168914
>
> Webrev:
> http://cr.openjdk.java.net/~ehelin/8168914/00/

oops_do can miss an oop that is in the process of being added - is that
a problem?

  140       if (c->data[i] != NULL) {
  141         f->do_oop(&c->data[i]);
  142       }

Given a slot can be nulled out concurrently at any time, is it worth
does this NULL check? The OopClosure will have to do its own NULL check
anyway.

  144     c = (Chunk*) OrderAccess::load_ptr_acquire((volatile
intptr_t*)&c->next);

This doesn't need to be a load-acquire. You already loaded 'c' via
load-acquire of _head (or chained therefrom) and its next field is set
prior to the setting of the _head that you read.

  624 jobject ClassLoaderData::add_handle(Handle h) {
  625   MutexLockerEx ml(metaspace_lock(),
Mutex::_no_safepoint_check_flag);
  626   return (jobject) _handles.add(h());
  627 }
  628
  629 void ClassLoaderData::remove_handle_unsafe(jobject h) {
  630   assert(_handles.contains((oop*) h), "Got unexpected handle "
PTR_FORMAT, p2i((oop*) h));
  631   *((oop*) h) = NULL;
  632 }

I'm a bit unclear on the typing here. Previously we use a JNI Handle
which was a jobject. But now we've simply stored an oop into a slot in
an array. We pass the address of that slot back as a jobject, but is it
really? I would also expect contains to take an oop not an oop* - it
seems to be asking "is this an address of a slot in our
ChunkedHandleList" rather than asking "is this oop in our
ChunkedHandleList". ??

A few minor comments:

Copyrights need updating.

src/share/vm/classfile/classLoaderData.hpp

177     // Only on thread ...

Typo: on -> one

---

src/share/vm/classfile/moduleEntry.cpp

90   // Create a JNI handle for the shared ProtectionDomain and save it
atomically.
  91   // If someone beats us setting the _pd cache, the created JNI
handle is destroyed.

These are no longer JNI Handles.

Thanks,
David
-----

> Testing:
> - Tier 1 (aka JPRT), 2, 3, 4, 5.
>
> I would appreciate quite a few reviews on this patch, it contains a nice
> mixture of:
> - me venturing into the runtime code :)
> - lock free code
> - unreproducible bug (even though I'm sure of the root cause)
>
> Thanks for everyone participating in the discussions around this bug and
> potential solutions: Volker, Coleen, Mikael G, StefanK, Erik Ö and Jiangli.
>
> Thanks!
> Erik
>
> [0]: https://bugs.openjdk.java.net/browse/JDK-8168914
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: 8168914: Crash in ClassLoaderData/JNIHandleBlock::oops_do during concurrent marking

Erik Helin-2
On 02/15/2017 02:48 AM, David Holmes wrote:
> Hi Erik,

Hi David,

thanks for having a look! Please see new patches at:
- incremental: http://cr.openjdk.java.net/~ehelin/8168914/00-01/
- full: http://cr.openjdk.java.net/~ehelin/8168914/01/

> oops_do can miss an oop that is in the process of being added - is that
> a problem?

Nope, that is not an issue. Even if `add_handle` and `oops_do` where
synchronized with each other (for example by using lock), you can still
have the scenario where a GC thread first runs (and finishes) `oops_do`
and then `add_handle` is called (a linearized execution).

This scenario is something the concurrent marking algorithms already
should be prepared to handle.

>  140       if (c->data[i] != NULL) {
>  141         f->do_oop(&c->data[i]);
>  142       }
>
> Given a slot can be nulled out concurrently at any time, is it worth
> does this NULL check? The OopClosure will have to do its own NULL check
> anyway.

Sure, this was mostly to be a bit consistent with JNIHandleBlock (it
does the same NULL check). Think of it as an optimization, there is no
reason to call `f->do_oop` if `c->data[i] == NULL`. Do you think it
reads better if the NULL check is removed? I don't have a strong opinion
on this one.

>  144     c = (Chunk*) OrderAccess::load_ptr_acquire((volatile
> intptr_t*)&c->next);
>
> This doesn't need to be a load-acquire. You already loaded 'c' via
> load-acquire of _head (or chained therefrom) and its next field is set
> prior to the setting of the _head that you read.

Agree. I just discussed this with Erik Ö as well, and we all agree on
this. I also removed the `volatile` specifier for `next`.

>  624 jobject ClassLoaderData::add_handle(Handle h) {
>  625   MutexLockerEx ml(metaspace_lock(), Mutex::_no_safepoint_check_flag);
>  626   return (jobject) _handles.add(h());
>  627 }
>  628
>  629 void ClassLoaderData::remove_handle_unsafe(jobject h) {
>  630   assert(_handles.contains((oop*) h), "Got unexpected handle "
> PTR_FORMAT, p2i((oop*) h));
>  631   *((oop*) h) = NULL;
>  632 }
>
> I'm a bit unclear on the typing here. Previously we use a JNI Handle
> which was a jobject. But now we've simply stored an oop into a slot in
> an array. We pass the address of that slot back as a jobject, but is it
> really?

So, this is a bit confusing, and I discussed this with Coleen a few days
ago. jobject was used originally used because that is what
`JNIHandleBlock::allocate_handle` returns. Using something other than
jobject means updating `ModuleEntry` and in particular users
`ModuleEntry::_pd`. This is not something we are keen on doing right now
for JDK 9, and Coleen is planning to clean up how the runtime handles
oops for 10 (or at least improve the situation).

I don't know what jobject is meant to represent, in the code it is just
an empty class (an opaque type).

ClassLoaderData::add_handle could just as well return an oop*, but AFAIU
we try to avoid oops and oop* in the runtime code (because that might
mean that some code forgot to tell the GC about the oop). Maybe I'm
wrong here?

 > I would also expect contains to take an oop not an oop* - it
 > seems to be asking "is this an address of a slot in our
 > ChunkedHandleList" rather than asking "is this oop in our
 > ChunkedHandleList". ??

I actually wanted the assert to check whether the passed jobject (which
really is an oop*) actually originated from a call to
ClassLoaderData::add_handle. I don't want any code to pass a jobject to
ClassLoaderData::remove_handle_unsafe that doesn't come from a call to
ClassLoaderData::add_handle.

> A few minor comments:
>
> Copyrights need updating.

Aaah, it is that time of the year again. Fixed.

> src/share/vm/classfile/classLoaderData.hpp
>
> 177     // Only on thread ...
>
> Typo: on -> one

Thanks, fixed.

> src/share/vm/classfile/moduleEntry.cpp
>
> 90   // Create a JNI handle for the shared ProtectionDomain and save it
> atomically.
>  91   // If someone beats us setting the _pd cache, the created JNI
> handle is destroyed.
>
> These are no longer JNI Handles.

Thanks, I updated the comment.

Thanks,
Erik

> Thanks,
> David
> -----
>
>> Testing:
>> - Tier 1 (aka JPRT), 2, 3, 4, 5.
>>
>> I would appreciate quite a few reviews on this patch, it contains a nice
>> mixture of:
>> - me venturing into the runtime code :)
>> - lock free code
>> - unreproducible bug (even though I'm sure of the root cause)
>>
>> Thanks for everyone participating in the discussions around this bug and
>> potential solutions: Volker, Coleen, Mikael G, StefanK, Erik Ö and
>> Jiangli.
>>
>> Thanks!
>> Erik
>>
>> [0]: https://bugs.openjdk.java.net/browse/JDK-8168914
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: 8168914: Crash in ClassLoaderData/JNIHandleBlock::oops_do during concurrent marking

Erik Österlund-2
Hi Erik,

I looked at the memory ordering aspect and I think it looks good as I
discussed with you, now that the unnecessary synchronization on the
immutable next value has been removed.

Thanks,
/Erik

On 2017-02-15 16:07, Erik Helin wrote:

> On 02/15/2017 02:48 AM, David Holmes wrote:
>> Hi Erik,
>
> Hi David,
>
> thanks for having a look! Please see new patches at:
> - incremental: http://cr.openjdk.java.net/~ehelin/8168914/00-01/
> - full: http://cr.openjdk.java.net/~ehelin/8168914/01/
>
>> oops_do can miss an oop that is in the process of being added - is that
>> a problem?
>
> Nope, that is not an issue. Even if `add_handle` and `oops_do` where
> synchronized with each other (for example by using lock), you can
> still have the scenario where a GC thread first runs (and finishes)
> `oops_do` and then `add_handle` is called (a linearized execution).
>
> This scenario is something the concurrent marking algorithms already
> should be prepared to handle.
>
>>  140       if (c->data[i] != NULL) {
>>  141         f->do_oop(&c->data[i]);
>>  142       }
>>
>> Given a slot can be nulled out concurrently at any time, is it worth
>> does this NULL check? The OopClosure will have to do its own NULL check
>> anyway.
>
> Sure, this was mostly to be a bit consistent with JNIHandleBlock (it
> does the same NULL check). Think of it as an optimization, there is no
> reason to call `f->do_oop` if `c->data[i] == NULL`. Do you think it
> reads better if the NULL check is removed? I don't have a strong
> opinion on this one.
>
>>  144     c = (Chunk*) OrderAccess::load_ptr_acquire((volatile
>> intptr_t*)&c->next);
>>
>> This doesn't need to be a load-acquire. You already loaded 'c' via
>> load-acquire of _head (or chained therefrom) and its next field is set
>> prior to the setting of the _head that you read.
>
> Agree. I just discussed this with Erik Ö as well, and we all agree on
> this. I also removed the `volatile` specifier for `next`.
>
>>  624 jobject ClassLoaderData::add_handle(Handle h) {
>>  625   MutexLockerEx ml(metaspace_lock(),
>> Mutex::_no_safepoint_check_flag);
>>  626   return (jobject) _handles.add(h());
>>  627 }
>>  628
>>  629 void ClassLoaderData::remove_handle_unsafe(jobject h) {
>>  630   assert(_handles.contains((oop*) h), "Got unexpected handle "
>> PTR_FORMAT, p2i((oop*) h));
>>  631   *((oop*) h) = NULL;
>>  632 }
>>
>> I'm a bit unclear on the typing here. Previously we use a JNI Handle
>> which was a jobject. But now we've simply stored an oop into a slot in
>> an array. We pass the address of that slot back as a jobject, but is it
>> really?
>
> So, this is a bit confusing, and I discussed this with Coleen a few
> days ago. jobject was used originally used because that is what
> `JNIHandleBlock::allocate_handle` returns. Using something other than
> jobject means updating `ModuleEntry` and in particular users
> `ModuleEntry::_pd`. This is not something we are keen on doing right
> now for JDK 9, and Coleen is planning to clean up how the runtime
> handles oops for 10 (or at least improve the situation).
>
> I don't know what jobject is meant to represent, in the code it is
> just an empty class (an opaque type).
>
> ClassLoaderData::add_handle could just as well return an oop*, but
> AFAIU we try to avoid oops and oop* in the runtime code (because that
> might mean that some code forgot to tell the GC about the oop). Maybe
> I'm wrong here?
>
> > I would also expect contains to take an oop not an oop* - it
> > seems to be asking "is this an address of a slot in our
> > ChunkedHandleList" rather than asking "is this oop in our
> > ChunkedHandleList". ??
>
> I actually wanted the assert to check whether the passed jobject
> (which really is an oop*) actually originated from a call to
> ClassLoaderData::add_handle. I don't want any code to pass a jobject
> to ClassLoaderData::remove_handle_unsafe that doesn't come from a call
> to ClassLoaderData::add_handle.
>
>> A few minor comments:
>>
>> Copyrights need updating.
>
> Aaah, it is that time of the year again. Fixed.
>
>> src/share/vm/classfile/classLoaderData.hpp
>>
>> 177     // Only on thread ...
>>
>> Typo: on -> one
>
> Thanks, fixed.
>
>> src/share/vm/classfile/moduleEntry.cpp
>>
>> 90   // Create a JNI handle for the shared ProtectionDomain and save it
>> atomically.
>>  91   // If someone beats us setting the _pd cache, the created JNI
>> handle is destroyed.
>>
>> These are no longer JNI Handles.
>
> Thanks, I updated the comment.
>
> Thanks,
> Erik
>
>> Thanks,
>> David
>> -----
>>
>>> Testing:
>>> - Tier 1 (aka JPRT), 2, 3, 4, 5.
>>>
>>> I would appreciate quite a few reviews on this patch, it contains a
>>> nice
>>> mixture of:
>>> - me venturing into the runtime code :)
>>> - lock free code
>>> - unreproducible bug (even though I'm sure of the root cause)
>>>
>>> Thanks for everyone participating in the discussions around this bug
>>> and
>>> potential solutions: Volker, Coleen, Mikael G, StefanK, Erik Ö and
>>> Jiangli.
>>>
>>> Thanks!
>>> Erik
>>>
>>> [0]: https://bugs.openjdk.java.net/browse/JDK-8168914

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: 8168914: Crash in ClassLoaderData/JNIHandleBlock::oops_do during concurrent marking

Thomas Schatzl
In reply to this post by Erik Helin-2
Hi,

On Wed, 2017-02-15 at 16:07 +0100, Erik Helin wrote:
> On 02/15/2017 02:48 AM, David Holmes wrote:
> >
> > Hi Erik,
> Hi David,
>
> thanks for having a look! Please see new patches at:
> - incremental: http://cr.openjdk.java.net/~ehelin/8168914/00-01/
> - full: http://cr.openjdk.java.net/~ehelin/8168914/01/
>

http://cr.openjdk.java.net/~ehelin/8168914/01/src/share/vm/classfile/cl
assLoaderData.hpp.frames.html
 177     // Only one thread can add a time, guarded by the
Metaspace_lock.

-> Only one thread can add elements at a time, ...

Also, I would somewhat prefer if the method actually asserted that
Metaspace_lock is owned by this thread if it has been explicitly
mentioned in the description.

 - same file:

 211   ChunkedHandleList _handles; // Handles to constant pool arrays,
Modules, etc, which
 212                               // have the same life cycle of the
corresponding ClassLoader.

-> ... have the same life cycle _as_ the corresponding class loader.

Looks good otherwise. I do not need a re-review for the comment
changes.

Thanks,
  Thomas
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: 8168914: Crash in ClassLoaderData/JNIHandleBlock::oops_do during concurrent marking

coleen.phillimore
In reply to this post by Erik Helin-2


On 2/15/17 10:07 AM, Erik Helin wrote:

> On 02/15/2017 02:48 AM, David Holmes wrote:
>> Hi Erik,
>
> Hi David,
>
> thanks for having a look! Please see new patches at:
> - incremental: http://cr.openjdk.java.net/~ehelin/8168914/00-01/
> - full: http://cr.openjdk.java.net/~ehelin/8168914/01/
>
>> oops_do can miss an oop that is in the process of being added - is that
>> a problem?
>
> Nope, that is not an issue. Even if `add_handle` and `oops_do` where
> synchronized with each other (for example by using lock), you can
> still have the scenario where a GC thread first runs (and finishes)
> `oops_do` and then `add_handle` is called (a linearized execution).
>
> This scenario is something the concurrent marking algorithms already
> should be prepared to handle.
>
>>  140       if (c->data[i] != NULL) {
>>  141         f->do_oop(&c->data[i]);
>>  142       }
>>
>> Given a slot can be nulled out concurrently at any time, is it worth
>> does this NULL check? The OopClosure will have to do its own NULL check
>> anyway.
>
> Sure, this was mostly to be a bit consistent with JNIHandleBlock (it
> does the same NULL check). Think of it as an optimization, there is no
> reason to call `f->do_oop` if `c->data[i] == NULL`. Do you think it
> reads better if the NULL check is removed? I don't have a strong
> opinion on this one.
>
>>  144     c = (Chunk*) OrderAccess::load_ptr_acquire((volatile
>> intptr_t*)&c->next);
>>
>> This doesn't need to be a load-acquire. You already loaded 'c' via
>> load-acquire of _head (or chained therefrom) and its next field is set
>> prior to the setting of the _head that you read.
>
> Agree. I just discussed this with Erik Ö as well, and we all agree on
> this. I also removed the `volatile` specifier for `next`.
>
>>  624 jobject ClassLoaderData::add_handle(Handle h) {
>>  625   MutexLockerEx ml(metaspace_lock(),
>> Mutex::_no_safepoint_check_flag);
>>  626   return (jobject) _handles.add(h());
>>  627 }
>>  628
>>  629 void ClassLoaderData::remove_handle_unsafe(jobject h) {
>>  630   assert(_handles.contains((oop*) h), "Got unexpected handle "
>> PTR_FORMAT, p2i((oop*) h));
>>  631   *((oop*) h) = NULL;
>>  632 }
>>
>> I'm a bit unclear on the typing here. Previously we use a JNI Handle
>> which was a jobject. But now we've simply stored an oop into a slot in
>> an array. We pass the address of that slot back as a jobject, but is it
>> really?
>
> So, this is a bit confusing, and I discussed this with Coleen a few
> days ago. jobject was used originally used because that is what
> `JNIHandleBlock::allocate_handle` returns. Using something other than
> jobject means updating `ModuleEntry` and in particular users
> `ModuleEntry::_pd`. This is not something we are keen on doing right
> now for JDK 9, and Coleen is planning to clean up how the runtime
> handles oops for 10 (or at least improve the situation).

Yes, these are not really jobject if you think of jobject and jweak as
being native handles to oops.  These really are indirect pointers to
oops that we use in the runtime.

We're still discussing how we want these to look for JDK10.  Your
suggestions are welcome - we'll include you on some email as we discuss
this (and openjdk also).

>
> I don't know what jobject is meant to represent, in the code it is
> just an empty class (an opaque type).
>
> ClassLoaderData::add_handle could just as well return an oop*, but
> AFAIU we try to avoid oops and oop* in the runtime code (because that
> might mean that some code forgot to tell the GC about the oop). Maybe
> I'm wrong here?

You are right.  We don't want to have oop* in the runtime code because
it also is unclear if it was supposed to be compressed or not.  We want
a special name for these.

Thanks,
Coleen

>
> > I would also expect contains to take an oop not an oop* - it
> > seems to be asking "is this an address of a slot in our
> > ChunkedHandleList" rather than asking "is this oop in our
> > ChunkedHandleList". ??
>
> I actually wanted the assert to check whether the passed jobject
> (which really is an oop*) actually originated from a call to
> ClassLoaderData::add_handle. I don't want any code to pass a jobject
> to ClassLoaderData::remove_handle_unsafe that doesn't come from a call
> to ClassLoaderData::add_handle.
>
>> A few minor comments:
>>
>> Copyrights need updating.
>
> Aaah, it is that time of the year again. Fixed.
>
>> src/share/vm/classfile/classLoaderData.hpp
>>
>> 177     // Only on thread ...
>>
>> Typo: on -> one
>
> Thanks, fixed.
>
>> src/share/vm/classfile/moduleEntry.cpp
>>
>> 90   // Create a JNI handle for the shared ProtectionDomain and save it
>> atomically.
>>  91   // If someone beats us setting the _pd cache, the created JNI
>> handle is destroyed.
>>
>> These are no longer JNI Handles.
>
> Thanks, I updated the comment.
>
> Thanks,
> Erik
>
>> Thanks,
>> David
>> -----
>>
>>> Testing:
>>> - Tier 1 (aka JPRT), 2, 3, 4, 5.
>>>
>>> I would appreciate quite a few reviews on this patch, it contains a
>>> nice
>>> mixture of:
>>> - me venturing into the runtime code :)
>>> - lock free code
>>> - unreproducible bug (even though I'm sure of the root cause)
>>>
>>> Thanks for everyone participating in the discussions around this bug
>>> and
>>> potential solutions: Volker, Coleen, Mikael G, StefanK, Erik Ö and
>>> Jiangli.
>>>
>>> Thanks!
>>> Erik
>>>
>>> [0]: https://bugs.openjdk.java.net/browse/JDK-8168914

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: 8168914: Crash in ClassLoaderData/JNIHandleBlock::oops_do during concurrent marking

coleen.phillimore
In reply to this post by Erik Helin-2


Hi Erik,

I thought you were going to hide the declaration of class
ChunkedHandleList in classLoaderData.cpp since it is only used internally?

And have a field in ClassLoaderData:   ChunkedHandleList*
_chunked_handle_list;   The class name "Chunk" is so overused!

Here ChunkedHandleList should not be a subtype of CHeapObj<mtClass>.  
It's a ValueObj.

I don't think it should be a struct either because even though it's
internal to ClassLoaderData, the fields could still be accessed outside
the class.

Why is

629 void ClassLoaderData::remove_handle_unsafe(jobject h) {


it called unsafe?  It is safe to zero the element in the block, isn't it?

One other reason to make it internal is that the CAPACITY of 64 is way
too big for the ClassLoaderData of anonymous classes which I think will
only have presently 1 (or 2 maybe) oops in it.   And there are zillions
of them.

Thanks,
Coleen


On 2/14/17 11:40 AM, Erik Helin wrote:

> Hi all,
>
> this patch aims to solve the bug [0] by making it safe for a GC to
> concurrently traverse the oops in a ClassLoaderData.
>
> The problem right now is that ClassLoaderData::_handles is a
> JNIHandleBlock. It is not safe for one thread to iterate over the oops
> in a JNIHandleBlock while another thread concurrently adds a new oop
> to the JNIHandleBlock.
>
> My proposed solution is to replace JNIHandleBlock with another data
> structure for ClassLoaderData. ClassLoaderData only needs a place to
> store oops that a GC can iterate over, it has no use for the rest of
> the methods in the JNIHandleBlock class. I have implemented a minimal
> chunked linked list data structure (internal to ClassLoaderData) with
> only two public methods:
> - add (only executed by one thread at a time)
> - oops_do (can be executed by any number of threads, possibly
>            concurrently with a call to `add`)
>
> ChunkedHandleList uses load_acquire and release_store to synchronize
> access to the underlying chunks (arrays). Since ChunkedHandleList
> utilizes the (rather) minimal requirements of its user
> (ClassLoaderData), I decided to keep the class internal to
> ClassLoaderData for now. If other find it useful elsewhere, the we can
> try to create a more generalized version (this is trickier than it
> appears at first sight, I tried ;)).
>
> I also changed ClassLoaderData::remove_handle to write NULL to the
> oop* (that is represented by a jobject), instead of using a sentinel
> oop as done by JNIHandleBlock. The GC closures should be prepared to
> handle a field in a Java object becoming NULL, so this should be fine.
>
> Bug:
> https://bugs.openjdk.java.net/browse/JDK-8168914
>
> Webrev:
> http://cr.openjdk.java.net/~ehelin/8168914/00/
>
> Testing:
> - Tier 1 (aka JPRT), 2, 3, 4, 5.
>
> I would appreciate quite a few reviews on this patch, it contains a
> nice mixture of:
> - me venturing into the runtime code :)
> - lock free code
> - unreproducible bug (even though I'm sure of the root cause)
>
> Thanks for everyone participating in the discussions around this bug
> and potential solutions: Volker, Coleen, Mikael G, StefanK, Erik Ö and
> Jiangli.
>
> Thanks!
> Erik
>
> [0]: https://bugs.openjdk.java.net/browse/JDK-8168914

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: 8168914: Crash in ClassLoaderData/JNIHandleBlock::oops_do during concurrent marking

David Holmes
In reply to this post by Erik Helin-2
On 16/02/2017 1:07 AM, Erik Helin wrote:
> On 02/15/2017 02:48 AM, David Holmes wrote:
>> Hi Erik,
>
> Hi David,
>
> thanks for having a look! Please see new patches at:
> - incremental: http://cr.openjdk.java.net/~ehelin/8168914/00-01/
> - full: http://cr.openjdk.java.net/~ehelin/8168914/01/

Updates look fine to me.

>> oops_do can miss an oop that is in the process of being added - is that
>> a problem?
>
> Nope, that is not an issue. Even if `add_handle` and `oops_do` where
> synchronized with each other (for example by using lock), you can still
> have the scenario where a GC thread first runs (and finishes) `oops_do`
> and then `add_handle` is called (a linearized execution).
>
> This scenario is something the concurrent marking algorithms already
> should be prepared to handle.

I'm unclear about the exact hand-over process for the oop. Before the
add the oop will be found through some existing path - presumably the
Handle - and after the add it will be found through oops_do. But at some
point that "existing path" will no longer have the oop, and it will be
assumed to be found via the oops_do. The success of that seems to depend
on what order the concurrent marking checks things. Or it may be that
once concurrent marking starts the Handle can't lose the oop without
concurrent marking being informed?

The scenario I'm concerned about is this:
- add() is executing concurrently with oops_do
- oops_do reads _head just before new Chunk is added
- oops_do thread is preempted
- add() completes and call chain unwinds and at some point the Handle is
released
- oops_do continues but has missed the new oop
- Handle traversal continues but the oop is no longer there

Thanks,
David
-----

>>  140       if (c->data[i] != NULL) {
>>  141         f->do_oop(&c->data[i]);
>>  142       }
>>
>> Given a slot can be nulled out concurrently at any time, is it worth
>> does this NULL check? The OopClosure will have to do its own NULL check
>> anyway.
>
> Sure, this was mostly to be a bit consistent with JNIHandleBlock (it
> does the same NULL check). Think of it as an optimization, there is no
> reason to call `f->do_oop` if `c->data[i] == NULL`. Do you think it
> reads better if the NULL check is removed? I don't have a strong opinion
> on this one.
>
>>  144     c = (Chunk*) OrderAccess::load_ptr_acquire((volatile
>> intptr_t*)&c->next);
>>
>> This doesn't need to be a load-acquire. You already loaded 'c' via
>> load-acquire of _head (or chained therefrom) and its next field is set
>> prior to the setting of the _head that you read.
>
> Agree. I just discussed this with Erik Ö as well, and we all agree on
> this. I also removed the `volatile` specifier for `next`.
>
>>  624 jobject ClassLoaderData::add_handle(Handle h) {
>>  625   MutexLockerEx ml(metaspace_lock(),
>> Mutex::_no_safepoint_check_flag);
>>  626   return (jobject) _handles.add(h());
>>  627 }
>>  628
>>  629 void ClassLoaderData::remove_handle_unsafe(jobject h) {
>>  630   assert(_handles.contains((oop*) h), "Got unexpected handle "
>> PTR_FORMAT, p2i((oop*) h));
>>  631   *((oop*) h) = NULL;
>>  632 }
>>
>> I'm a bit unclear on the typing here. Previously we use a JNI Handle
>> which was a jobject. But now we've simply stored an oop into a slot in
>> an array. We pass the address of that slot back as a jobject, but is it
>> really?
>
> So, this is a bit confusing, and I discussed this with Coleen a few days
> ago. jobject was used originally used because that is what
> `JNIHandleBlock::allocate_handle` returns. Using something other than
> jobject means updating `ModuleEntry` and in particular users
> `ModuleEntry::_pd`. This is not something we are keen on doing right now
> for JDK 9, and Coleen is planning to clean up how the runtime handles
> oops for 10 (or at least improve the situation).
>
> I don't know what jobject is meant to represent, in the code it is just
> an empty class (an opaque type).
>
> ClassLoaderData::add_handle could just as well return an oop*, but AFAIU
> we try to avoid oops and oop* in the runtime code (because that might
> mean that some code forgot to tell the GC about the oop). Maybe I'm
> wrong here?
>
>> I would also expect contains to take an oop not an oop* - it
>> seems to be asking "is this an address of a slot in our
>> ChunkedHandleList" rather than asking "is this oop in our
>> ChunkedHandleList". ??
>
> I actually wanted the assert to check whether the passed jobject (which
> really is an oop*) actually originated from a call to
> ClassLoaderData::add_handle. I don't want any code to pass a jobject to
> ClassLoaderData::remove_handle_unsafe that doesn't come from a call to
> ClassLoaderData::add_handle.
>
>> A few minor comments:
>>
>> Copyrights need updating.
>
> Aaah, it is that time of the year again. Fixed.
>
>> src/share/vm/classfile/classLoaderData.hpp
>>
>> 177     // Only on thread ...
>>
>> Typo: on -> one
>
> Thanks, fixed.
>
>> src/share/vm/classfile/moduleEntry.cpp
>>
>> 90   // Create a JNI handle for the shared ProtectionDomain and save it
>> atomically.
>>  91   // If someone beats us setting the _pd cache, the created JNI
>> handle is destroyed.
>>
>> These are no longer JNI Handles.
>
> Thanks, I updated the comment.
>
> Thanks,
> Erik
>
>> Thanks,
>> David
>> -----
>>
>>> Testing:
>>> - Tier 1 (aka JPRT), 2, 3, 4, 5.
>>>
>>> I would appreciate quite a few reviews on this patch, it contains a nice
>>> mixture of:
>>> - me venturing into the runtime code :)
>>> - lock free code
>>> - unreproducible bug (even though I'm sure of the root cause)
>>>
>>> Thanks for everyone participating in the discussions around this bug and
>>> potential solutions: Volker, Coleen, Mikael G, StefanK, Erik Ö and
>>> Jiangli.
>>>
>>> Thanks!
>>> Erik
>>>
>>> [0]: https://bugs.openjdk.java.net/browse/JDK-8168914
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: 8168914: Crash in ClassLoaderData/JNIHandleBlock::oops_do during concurrent marking

Kim Barrett
In reply to this post by Erik Helin-2
> On Feb 15, 2017, at 10:07 AM, Erik Helin <[hidden email]> wrote:
>
> On 02/15/2017 02:48 AM, David Holmes wrote:
>> Hi Erik,
>
> Hi David,
>
> thanks for having a look! Please see new patches at:
> - incremental: http://cr.openjdk.java.net/~ehelin/8168914/00-01/
> - full: http://cr.openjdk.java.net/~ehelin/8168914/01/

------------------------------------------------------------------------------
/src/share/vm/classfile/classLoaderData.cpp
 138     const juint size = OrderAccess::load_acquire(&c->size);

I think all but the first chunk have a constant size of the
maximum size, so no need for acquire except for the first.

------------------------------------------------------------------------------
/src/share/vm/classfile/classLoaderData.cpp
 173   VerifyContainsOopClosure cl(op);
 174   oops_do(&cl);

We don't care that this iterates over the entire list even if found
early?

------------------------------------------------------------------------------
/src/share/vm/classfile/classLoaderData.hpp
 162   struct ChunkedHandleList : public CHeapObj<mtClass> {

As used, this doesn't seem to need to be CHeapObj.

------------------------------------------------------------------------------
/src/share/vm/classfile/classLoaderData.hpp
 163     struct Chunk : public CHeapObj<mtClass> {
 164       static const size_t CAPACITY = 64;
 165       oop data[CAPACITY];

Large fixed capacity seems problematic?  I think there are lots of
"small" CLDs.  Don't anonymous classes have there own class loader?
A variable sized chunk with a capacity field doesn't seem like it
should introduce any new problems.

The whole Chunk definition could also be moved to the .cpp file, with
only a forward declaration here.

------------------------------------------------------------------------------
/src/share/vm/classfile/classLoaderData.cpp
 626   return (jobject) _handles.add(h());
and
 631   *((oop*) h) = NULL;

Having spent a bunch of time recently dealing with and cleaning up a
bunch of places that break the jobject abstraction (though by no means
all), I'm not happy to see new ones.

As Coleen said, there's discussion about how to refer to indirect
pointers to oops in runtime code.  jobject has been used as an answer
for that in some places, but they might need to be changed.  Hiding
the conversions as casts will just make that harder.  I'm thinking
some named conversion functions are in order, instead of bare casts.

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: 8168914: Crash in ClassLoaderData/JNIHandleBlock::oops_do during concurrent marking

David Holmes
Hi Kim,

If I may ...

On 16/02/2017 12:41 PM, Kim Barrett wrote:

>> On Feb 15, 2017, at 10:07 AM, Erik Helin <[hidden email]> wrote:
>>
>> On 02/15/2017 02:48 AM, David Holmes wrote:
>>> Hi Erik,
>>
>> Hi David,
>>
>> thanks for having a look! Please see new patches at:
>> - incremental: http://cr.openjdk.java.net/~ehelin/8168914/00-01/
>> - full: http://cr.openjdk.java.net/~ehelin/8168914/01/
>
> ------------------------------------------------------------------------------
> /src/share/vm/classfile/classLoaderData.cpp
>  138     const juint size = OrderAccess::load_acquire(&c->size);
>
> I think all but the first chunk have a constant size of the
> maximum size, so no need for acquire except for the first.

The size is incremented on each add, so this acquire sync's with the
release of the new size, so that we are sure to see the oop that was
just added.

Though given the races between add and oops_do it may not actually matter.

David
-----

> ------------------------------------------------------------------------------
> /src/share/vm/classfile/classLoaderData.cpp
>  173   VerifyContainsOopClosure cl(op);
>  174   oops_do(&cl);
>
> We don't care that this iterates over the entire list even if found
> early?
>
> ------------------------------------------------------------------------------
> /src/share/vm/classfile/classLoaderData.hpp
>  162   struct ChunkedHandleList : public CHeapObj<mtClass> {
>
> As used, this doesn't seem to need to be CHeapObj.
>
> ------------------------------------------------------------------------------
> /src/share/vm/classfile/classLoaderData.hpp
>  163     struct Chunk : public CHeapObj<mtClass> {
>  164       static const size_t CAPACITY = 64;
>  165       oop data[CAPACITY];
>
> Large fixed capacity seems problematic?  I think there are lots of
> "small" CLDs.  Don't anonymous classes have there own class loader?
> A variable sized chunk with a capacity field doesn't seem like it
> should introduce any new problems.
>
> The whole Chunk definition could also be moved to the .cpp file, with
> only a forward declaration here.
>
> ------------------------------------------------------------------------------
> /src/share/vm/classfile/classLoaderData.cpp
>  626   return (jobject) _handles.add(h());
> and
>  631   *((oop*) h) = NULL;
>
> Having spent a bunch of time recently dealing with and cleaning up a
> bunch of places that break the jobject abstraction (though by no means
> all), I'm not happy to see new ones.
>
> As Coleen said, there's discussion about how to refer to indirect
> pointers to oops in runtime code.  jobject has been used as an answer
> for that in some places, but they might need to be changed.  Hiding
> the conversions as casts will just make that harder.  I'm thinking
> some named conversion functions are in order, instead of bare casts.
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: 8168914: Crash in ClassLoaderData/JNIHandleBlock::oops_do during concurrent marking

Kim Barrett
> On Feb 15, 2017, at 10:31 PM, David Holmes <[hidden email]> wrote:
>
> Hi Kim,
>
> If I may ...
>
> On 16/02/2017 12:41 PM, Kim Barrett wrote:
>>> On Feb 15, 2017, at 10:07 AM, Erik Helin <[hidden email]> wrote:
>>>
>>> On 02/15/2017 02:48 AM, David Holmes wrote:
>>>> Hi Erik,
>>>
>>> Hi David,
>>>
>>> thanks for having a look! Please see new patches at:
>>> - incremental: http://cr.openjdk.java.net/~ehelin/8168914/00-01/
>>> - full: http://cr.openjdk.java.net/~ehelin/8168914/01/
>>
>> ------------------------------------------------------------------------------
>> /src/share/vm/classfile/classLoaderData.cpp
>> 138     const juint size = OrderAccess::load_acquire(&c->size);
>>
>> I think all but the first chunk have a constant size of the
>> maximum size, so no need for acquire except for the first.
>
> The size is incremented on each add, so this acquire sync's with the release of the new size, so that we are sure to see the oop that was just added.
>
> Though given the races between add and oops_do it may not actually matter.

By “first” I mean the first chunk.  only the first chunk’s size is subject to modification.  The release_store of _head that made the first chunk accessible will be after the setting of the next chunk’s size to its final (max) value.  The corresponding load_acquire of _head ensures any _next chunks are visibly complete.  It’s the same rationale for _next not requiring any special handling. At least, that’s how it looks to me right now; am I missing something.

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: 8168914: Crash in ClassLoaderData/JNIHandleBlock::oops_do during concurrent marking

David Holmes
On 16/02/2017 3:53 PM, Kim Barrett wrote:

>> On Feb 15, 2017, at 10:31 PM, David Holmes <[hidden email]> wrote:
>>
>> Hi Kim,
>>
>> If I may ...
>>
>> On 16/02/2017 12:41 PM, Kim Barrett wrote:
>>>> On Feb 15, 2017, at 10:07 AM, Erik Helin <[hidden email]> wrote:
>>>>
>>>> On 02/15/2017 02:48 AM, David Holmes wrote:
>>>>> Hi Erik,
>>>>
>>>> Hi David,
>>>>
>>>> thanks for having a look! Please see new patches at:
>>>> - incremental: http://cr.openjdk.java.net/~ehelin/8168914/00-01/
>>>> - full: http://cr.openjdk.java.net/~ehelin/8168914/01/
>>>
>>> ------------------------------------------------------------------------------
>>> /src/share/vm/classfile/classLoaderData.cpp
>>> 138     const juint size = OrderAccess::load_acquire(&c->size);
>>>
>>> I think all but the first chunk have a constant size of the
>>> maximum size, so no need for acquire except for the first.
>>
>> The size is incremented on each add, so this acquire sync's with the release of the new size, so that we are sure to see the oop that was just added.
>>
>> Though given the races between add and oops_do it may not actually matter.
>
> By “first” I mean the first chunk.  only the first chunk’s size is subject to modification.  The release_store of _head that made the first chunk accessible will be after the setting of the next chunk’s size to its final (max) value.  The corresponding load_acquire of _head ensures any _next chunks are visibly complete.  It’s the same rationale for _next not requiring any special handling. At least, that’s how it looks to me right now; am I missing something.

The _head chunk has its _size incremented on each add().

David

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: 8168914: Crash in ClassLoaderData/JNIHandleBlock::oops_do during concurrent marking

Kim Barrett
> On Feb 16, 2017, at 2:41 AM, David Holmes <[hidden email]> wrote:
>
> On 16/02/2017 3:53 PM, Kim Barrett wrote:
>>>> /src/share/vm/classfile/classLoaderData.cpp
>>>> 138     const juint size = OrderAccess::load_acquire(&c->size);
>>>>
>>>> I think all but the first chunk have a constant size of the
>>>> maximum size, so no need for acquire except for the first.
>>>
>>> The size is incremented on each add, so this acquire sync's with the release of the new size, so that we are sure to see the oop that was just added.
>>>
>>> Though given the races between add and oops_do it may not actually matter.
>>
>> By “first” I mean the first chunk.  only the first chunk’s size is subject to modification.  The release_store of _head that made the first chunk accessible will be after the setting of the next chunk’s size to its final (max) value.  The corresponding load_acquire of _head ensures any _next chunks are visibly complete.  It’s the same rationale for _next not requiring any special handling. At least, that’s how it looks to me right now; am I missing something.
>
> The _head chunk has its _size incremented on each add().
>
> David

Agreed, and a load_acquire is needed for the _head chunk.

I’m suggesting an ordinary load is sufficient for any _next chunks as we loop down the list.


Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: 8168914: Crash in ClassLoaderData/JNIHandleBlock::oops_do during concurrent marking

David Holmes
On 16/02/2017 5:46 PM, Kim Barrett wrote:

>> On Feb 16, 2017, at 2:41 AM, David Holmes <[hidden email]> wrote:
>>
>> On 16/02/2017 3:53 PM, Kim Barrett wrote:
>>>>> /src/share/vm/classfile/classLoaderData.cpp
>>>>> 138     const juint size = OrderAccess::load_acquire(&c->size);
>>>>>
>>>>> I think all but the first chunk have a constant size of the
>>>>> maximum size, so no need for acquire except for the first.
>>>>
>>>> The size is incremented on each add, so this acquire sync's with the release of the new size, so that we are sure to see the oop that was just added.
>>>>
>>>> Though given the races between add and oops_do it may not actually matter.
>>>
>>> By “first” I mean the first chunk.  only the first chunk’s size is subject to modification.  The release_store of _head that made the first chunk accessible will be after the setting of the next chunk’s size to its final (max) value.  The corresponding load_acquire of _head ensures any _next chunks are visibly complete.  It’s the same rationale for _next not requiring any special handling. At least, that’s how it looks to me right now; am I missing something.
>>
>> The _head chunk has its _size incremented on each add().
>>
>> David
>
> Agreed, and a load_acquire is needed for the _head chunk.
>
> I’m suggesting an ordinary load is sufficient for any _next chunks as we loop down the list.

Ah! Now I get what you mean - sorry. Yes only the first loop iteration
needs the load-acquire. So just:

Chunk* c = (Chunk*) OrderAccess::load_ptr_acquire((volatile
intptr_t*)&_head);
const juint size = (c == NULL ? 0 : OrderAccess::load_acquire(&c->size));
while (c != NULL) {
     size = OrderAccess::load_acquire(&c->size);
     ...
}


Cheers,
David
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: 8168914: Crash in ClassLoaderData/JNIHandleBlock::oops_do during concurrent marking

David Holmes
Bah! didn't finish my edit ...

On 16/02/2017 5:54 PM, David Holmes wrote:

> On 16/02/2017 5:46 PM, Kim Barrett wrote:
>>> On Feb 16, 2017, at 2:41 AM, David Holmes <[hidden email]>
>>> wrote:
>>>
>>> On 16/02/2017 3:53 PM, Kim Barrett wrote:
>>>>>> /src/share/vm/classfile/classLoaderData.cpp
>>>>>> 138     const juint size = OrderAccess::load_acquire(&c->size);
>>>>>>
>>>>>> I think all but the first chunk have a constant size of the
>>>>>> maximum size, so no need for acquire except for the first.
>>>>>
>>>>> The size is incremented on each add, so this acquire sync's with
>>>>> the release of the new size, so that we are sure to see the oop
>>>>> that was just added.
>>>>>
>>>>> Though given the races between add and oops_do it may not actually
>>>>> matter.
>>>>
>>>> By “first” I mean the first chunk.  only the first chunk’s size is
>>>> subject to modification.  The release_store of _head that made the
>>>> first chunk accessible will be after the setting of the next chunk’s
>>>> size to its final (max) value.  The corresponding load_acquire of
>>>> _head ensures any _next chunks are visibly complete.  It’s the same
>>>> rationale for _next not requiring any special handling. At least,
>>>> that’s how it looks to me right now; am I missing something.
>>>
>>> The _head chunk has its _size incremented on each add().
>>>
>>> David
>>
>> Agreed, and a load_acquire is needed for the _head chunk.
>>
>> I’m suggesting an ordinary load is sufficient for any _next chunks as
>> we loop down the list.
>
> Ah! Now I get what you mean - sorry. Yes only the first loop iteration
> needs the load-acquire. So just:
>
> Chunk* c = (Chunk*) OrderAccess::load_ptr_acquire((volatile
> intptr_t*)&_head);
> const juint size = (c == NULL ? 0 : OrderAccess::load_acquire(&c->size));
> while (c != NULL) {
>     size = OrderAccess::load_acquire(&c->size);

       size = c->size;

David
-----
>     ...
> }
>
>
> Cheers,
> David
>>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: 8168914: Crash in ClassLoaderData/JNIHandleBlock::oops_do during concurrent marking

Thomas Schatzl
Hi,

On Thu, 2017-02-16 at 18:09 +1000, David Holmes wrote:

> Bah! didn't finish my edit ...
>
> On 16/02/2017 5:54 PM, David Holmes wrote:
> >
> > On 16/02/2017 5:46 PM, Kim Barrett wrote:
> > >
> > > >
> > > > On Feb 16, 2017, at 2:41 AM, David Holmes <david.holmes@oracle.
> > > > com>
> > > > wrote:
> > > >
> > > > On 16/02/2017 3:53 PM, Kim Barrett wrote:
> > > > >
> > > > > >
> > > > > > >
> > > > > > > /src/share/vm/classfile/classLoaderData.cpp
> > > > > > > 138     const juint size = OrderAccess::load_acquire(&c-
> > > > > > > >size);
> > > > > > >
> > > > > > > I think all but the first chunk have a constant size of
> > > > > > > the
> > > > > > > maximum size, so no need for acquire except for the
> > > > > > > first.
> > > > > > The size is incremented on each add, so this acquire sync's
> > > > > > with
> > > > > > the release of the new size, so that we are sure to see the
> > > > > > oop
> > > > > > that was just added.
> > > > > >
> > > > > > Though given the races between add and oops_do it may not
> > > > > > actually
> > > > > > matter.
> > > > > By “first” I mean the first chunk.  only the first chunk’s
> > > > > size is
> > > > > subject to modification.  The release_store of _head that
> > > > > made the
> > > > > first chunk accessible will be after the setting of the next
> > > > > chunk’s
> > > > > size to its final (max) value.  The corresponding
> > > > > load_acquire of
> > > > > _head ensures any _next chunks are visibly complete.  It’s
> > > > > the same
> > > > > rationale for _next not requiring any special handling. At
> > > > > least,
> > > > > that’s how it looks to me right now; am I missing something.
> > > > The _head chunk has its _size incremented on each add().
> > > >
> > > > David
> > > Agreed, and a load_acquire is needed for the _head chunk.
> > >
> > > I’m suggesting an ordinary load is sufficient for any _next
> > > chunks as
> > > we loop down the list.
> > Ah! Now I get what you mean - sorry. Yes only the first loop
> > iteration
> > needs the load-acquire. So just:
> >
> > Chunk* c = (Chunk*) OrderAccess::load_ptr_acquire((volatile
> > intptr_t*)&_head);
> > const juint size = (c == NULL ? 0 : OrderAccess::load_acquire(&c-
> > >size));
> > while (c != NULL) {
> >     size = OrderAccess::load_acquire(&c->size);
>        size = c->size;

Maybe I'm wrong, but just looking at this suggestion without more
thought: what if the value of size changes just between the above
load_acquire and the re-read of the size, and the actual value in the
array has not been updated yet?

If that optimization is done, please just peel out the first iteration.

Thanks,
  Thomas

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: 8168914: Crash in ClassLoaderData/JNIHandleBlock::oops_do during concurrent marking

Thomas Schatzl
In reply to this post by Kim Barrett
Hi,

On Wed, 2017-02-15 at 21:41 -0500, Kim Barrett wrote:

> >
> > On Feb 15, 2017, at 10:07 AM, Erik Helin <[hidden email]>
> > wrote:
> >
> > On 02/15/2017 02:48 AM, David Holmes wrote:
> > >
> > > Hi Erik,
> > Hi David,
> >
> > thanks for having a look! Please see new patches at:
> > - incremental: http://cr.openjdk.java.net/~ehelin/8168914/00-01/
> > - full: http://cr.openjdk.java.net/~ehelin/8168914/01/
> -------------------------------------------------------------------
> -----------
> /src/share/vm/classfile/classLoaderData.cpp
>  138     const juint size = OrderAccess::load_acquire(&c->size);
>
> I think all but the first chunk have a constant size of the
> maximum size, so no need for acquire except for the first.
>
> -------------------------------------------------------------------
> -----------
> /src/share/vm/classfile/classLoaderData.cpp
>  173   VerifyContainsOopClosure cl(op);
>  174   oops_do(&cl);
>
> We don't care that this iterates over the entire list even if found
> early?

Not trying to defend the change: I thought the same, but it is only
verification code, and as Coleen and you suggested, there are very few
entries anyway. Also, the current interface (including the used
OopClosure) do not allow returning a value anyway.

Unless there is some AbortableOopClosure that can be used, I would
recommend moving this to an RFE this late in the release.

> -------------------------------------------------------------------
> -----------
> /src/share/vm/classfile/classLoaderData.hpp
>  162   struct ChunkedHandleList : public CHeapObj<mtClass> {
>
> As used, this doesn't seem to need to be CHeapObj.
>
> -------------------------------------------------------------------
> -----------
> /src/share/vm/classfile/classLoaderData.hpp
>  163     struct Chunk : public CHeapObj<mtClass> {
>  164       static const size_t CAPACITY = 64;
>  165       oop data[CAPACITY];
>
> Large fixed capacity seems problematic?  I think there are lots of
> "small" CLDs.  Don't anonymous classes have there own class loader?
> A variable sized chunk with a capacity field doesn't seem like it
> should introduce any new problems.

While the default value might be too high, particularly a variable
sized chunk size with very few entries will to add the overhead of the
additional capacity field.
For those ClassLoaders with very few entries, the existing Chunk
already seems to have a quite high overhead.

Maybe some specialization for anonymous vs. regular class loaders?

Do you (Kim, Coleen) know typical sizes of this list for both kinds of
class loaders?

> The whole Chunk definition could also be moved to the .cpp file, with
> only a forward declaration here.

I would personally kind of prefer to keep them together here. I am fine
with either way though.

Thanks,
  Thomas

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: 8168914: Crash in ClassLoaderData/JNIHandleBlock::oops_do during concurrent marking

David Holmes
In reply to this post by Thomas Schatzl
On 16/02/2017 6:55 PM, Thomas Schatzl wrote:

> Hi,
>
> On Thu, 2017-02-16 at 18:09 +1000, David Holmes wrote:
>> Bah! didn't finish my edit ...
>>
>> On 16/02/2017 5:54 PM, David Holmes wrote:
>>>
>>> On 16/02/2017 5:46 PM, Kim Barrett wrote:
>>>>
>>>>>
>>>>> On Feb 16, 2017, at 2:41 AM, David Holmes <david.holmes@oracle.
>>>>> com>
>>>>> wrote:
>>>>>
>>>>> On 16/02/2017 3:53 PM, Kim Barrett wrote:
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> /src/share/vm/classfile/classLoaderData.cpp
>>>>>>>> 138     const juint size = OrderAccess::load_acquire(&c-
>>>>>>>>> size);
>>>>>>>>
>>>>>>>> I think all but the first chunk have a constant size of
>>>>>>>> the
>>>>>>>> maximum size, so no need for acquire except for the
>>>>>>>> first.
>>>>>>> The size is incremented on each add, so this acquire sync's
>>>>>>> with
>>>>>>> the release of the new size, so that we are sure to see the
>>>>>>> oop
>>>>>>> that was just added.
>>>>>>>
>>>>>>> Though given the races between add and oops_do it may not
>>>>>>> actually
>>>>>>> matter.
>>>>>> By “first” I mean the first chunk.  only the first chunk’s
>>>>>> size is
>>>>>> subject to modification.  The release_store of _head that
>>>>>> made the
>>>>>> first chunk accessible will be after the setting of the next
>>>>>> chunk’s
>>>>>> size to its final (max) value.  The corresponding
>>>>>> load_acquire of
>>>>>> _head ensures any _next chunks are visibly complete.  It’s
>>>>>> the same
>>>>>> rationale for _next not requiring any special handling. At
>>>>>> least,
>>>>>> that’s how it looks to me right now; am I missing something.
>>>>> The _head chunk has its _size incremented on each add().
>>>>>
>>>>> David
>>>> Agreed, and a load_acquire is needed for the _head chunk.
>>>>
>>>> I’m suggesting an ordinary load is sufficient for any _next
>>>> chunks as
>>>> we loop down the list.
>>> Ah! Now I get what you mean - sorry. Yes only the first loop
>>> iteration
>>> needs the load-acquire. So just:
>>>
>>> Chunk* c = (Chunk*) OrderAccess::load_ptr_acquire((volatile
>>> intptr_t*)&_head);
>>> const juint size = (c == NULL ? 0 : OrderAccess::load_acquire(&c-
>>>> size));
>>> while (c != NULL) {
>>>     size = OrderAccess::load_acquire(&c->size);
>>        size = c->size;
>
> Maybe I'm wrong, but just looking at this suggestion without more
> thought: what if the value of size changes just between the above
> load_acquire and the re-read of the size, and the actual value in the
> array has not been updated yet?

Yeah I screwed up peeling out the initial read of size. That line should
come at the end of the loop after c=c->next;

Thanks,
David

> If that optimization is done, please just peel out the first iteration.
>
> Thanks,
>   Thomas
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: 8168914: Crash in ClassLoaderData/JNIHandleBlock::oops_do during concurrent marking

Erik Helin-2
In reply to this post by Thomas Schatzl
On 02/16/2017 10:06 AM, Thomas Schatzl wrote:

> Hi,
>
> On Wed, 2017-02-15 at 21:41 -0500, Kim Barrett wrote:
>>>
>>> On Feb 15, 2017, at 10:07 AM, Erik Helin <[hidden email]>
>>> wrote:
>>>
>>> On 02/15/2017 02:48 AM, David Holmes wrote:
>>>>
>>>> Hi Erik,
>>> Hi David,
>>>
>>> thanks for having a look! Please see new patches at:
>>> - incremental: http://cr.openjdk.java.net/~ehelin/8168914/00-01/
>>> - full: http://cr.openjdk.java.net/~ehelin/8168914/01/
>> -------------------------------------------------------------------
>> -----------
>> /src/share/vm/classfile/classLoaderData.cpp
>>  138     const juint size = OrderAccess::load_acquire(&c->size);
>>
>> I think all but the first chunk have a constant size of the
>> maximum size, so no need for acquire except for the first.
>>
>> -------------------------------------------------------------------
>> -----------
>> /src/share/vm/classfile/classLoaderData.cpp
>>  173   VerifyContainsOopClosure cl(op);
>>  174   oops_do(&cl);
>>
>> We don't care that this iterates over the entire list even if found
>> early?
>
> Not trying to defend the change: I thought the same, but it is only
> verification code, and as Coleen and you suggested, there are very few
> entries anyway. Also, the current interface (including the used
> OopClosure) do not allow returning a value anyway.

Yep, this was my reasoning, there is no need to optimize this.

> Unless there is some AbortableOopClosure that can be used, I would
> recommend moving this to an RFE this late in the release.

We do want AbortableOopClosure (I had a patch circling around a long
time ago), but I don't think the optimization is worth it here.

>> -------------------------------------------------------------------
>> -----------
>> /src/share/vm/classfile/classLoaderData.hpp
>>  162   struct ChunkedHandleList : public CHeapObj<mtClass> {
>>
>> As used, this doesn't seem to need to be CHeapObj.

Correct (also noticed by Coleen).

>> -------------------------------------------------------------------
>> -----------
>> /src/share/vm/classfile/classLoaderData.hpp
>>  163     struct Chunk : public CHeapObj<mtClass> {
>>  164       static const size_t CAPACITY = 64;
>>  165       oop data[CAPACITY];
>>
>> Large fixed capacity seems problematic?  I think there are lots of
>> "small" CLDs.  Don't anonymous classes have there own class loader?
>> A variable sized chunk with a capacity field doesn't seem like it
>> should introduce any new problems.
>
> While the default value might be too high, particularly a variable
> sized chunk size with very few entries will to add the overhead of the
> additional capacity field.
> For those ClassLoaders with very few entries, the existing Chunk
> already seems to have a quite high overhead.
>
> Maybe some specialization for anonymous vs. regular class loaders?
>
> Do you (Kim, Coleen) know typical sizes of this list for both kinds of
> class loaders?
>
>> The whole Chunk definition could also be moved to the .cpp file, with
>> only a forward declaration here.
>
> I would personally kind of prefer to keep them together here. I am fine
> with either way though.

I would also prefer to keep them together (but don't have a very strong
opinion).

Thanks,
Erik

> Thanks,
>   Thomas
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: RFR: 8168914: Crash in ClassLoaderData/JNIHandleBlock::oops_do during concurrent marking

Erik Helin-2
In reply to this post by David Holmes
On 02/16/2017 10:09 AM, David Holmes wrote:

> On 16/02/2017 6:55 PM, Thomas Schatzl wrote:
>> Hi,
>>
>> On Thu, 2017-02-16 at 18:09 +1000, David Holmes wrote:
>>> Bah! didn't finish my edit ...
>>>
>>> On 16/02/2017 5:54 PM, David Holmes wrote:
>>>>
>>>> On 16/02/2017 5:46 PM, Kim Barrett wrote:
>>>>>
>>>>>>
>>>>>> On Feb 16, 2017, at 2:41 AM, David Holmes <david.holmes@oracle.
>>>>>> com>
>>>>>> wrote:
>>>>>>
>>>>>> On 16/02/2017 3:53 PM, Kim Barrett wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> /src/share/vm/classfile/classLoaderData.cpp
>>>>>>>>> 138     const juint size = OrderAccess::load_acquire(&c-
>>>>>>>>>> size);
>>>>>>>>>
>>>>>>>>> I think all but the first chunk have a constant size of
>>>>>>>>> the
>>>>>>>>> maximum size, so no need for acquire except for the
>>>>>>>>> first.
>>>>>>>> The size is incremented on each add, so this acquire sync's
>>>>>>>> with
>>>>>>>> the release of the new size, so that we are sure to see the
>>>>>>>> oop
>>>>>>>> that was just added.
>>>>>>>>
>>>>>>>> Though given the races between add and oops_do it may not
>>>>>>>> actually
>>>>>>>> matter.
>>>>>>> By “first” I mean the first chunk.  only the first chunk’s
>>>>>>> size is
>>>>>>> subject to modification.  The release_store of _head that
>>>>>>> made the
>>>>>>> first chunk accessible will be after the setting of the next
>>>>>>> chunk’s
>>>>>>> size to its final (max) value.  The corresponding
>>>>>>> load_acquire of
>>>>>>> _head ensures any _next chunks are visibly complete.  It’s
>>>>>>> the same
>>>>>>> rationale for _next not requiring any special handling. At
>>>>>>> least,
>>>>>>> that’s how it looks to me right now; am I missing something.
>>>>>> The _head chunk has its _size incremented on each add().
>>>>>>
>>>>>> David
>>>>> Agreed, and a load_acquire is needed for the _head chunk.
>>>>>
>>>>> I’m suggesting an ordinary load is sufficient for any _next
>>>>> chunks as
>>>>> we loop down the list.
>>>> Ah! Now I get what you mean - sorry. Yes only the first loop
>>>> iteration
>>>> needs the load-acquire. So just:
>>>>
>>>> Chunk* c = (Chunk*) OrderAccess::load_ptr_acquire((volatile
>>>> intptr_t*)&_head);
>>>> const juint size = (c == NULL ? 0 : OrderAccess::load_acquire(&c-
>>>>> size));
>>>> while (c != NULL) {
>>>>     size = OrderAccess::load_acquire(&c->size);
>>>        size = c->size;
>>
>> Maybe I'm wrong, but just looking at this suggestion without more
>> thought: what if the value of size changes just between the above
>> load_acquire and the re-read of the size, and the actual value in the
>> array has not been updated yet?
>
> Yeah I screwed up peeling out the initial read of size. That line should
> come at the end of the loop after c=c->next;

I agree that we can implement this optimization, but I'm not sure it is
worth it? The code will become quite a bit more cumbersome vs the
straightforward code (well, as straightforward as lock free code gets)
that is in the patch right now.

Thanks,
Erik

> Thanks,
> David
>
>> If that optimization is done, please just peel out the first iteration.
>>
>> Thanks,
>>   Thomas
>>
123
Loading...