Re: RFR(XXS): 8227117: normal interpreter table is not restored after single stepping with TLH

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: RFR(XXS): 8227117: normal interpreter table is not restored after single stepping with TLH

Erik Österlund-2
Hi David,

When you run without TLH, this copying mechanism is used to synchronize
the safepoint while JavaThreads are running. The interpreter doesn't
emit any polls then. Instead it clobbers the dispatch table. JavaThreads
will be reading from the dispatch table while it is being
(non-atomically) modified. That could crash. For example with the
Solaris + studio + SPARC - TLH configuration, the compiler will almost
certainly emit a memcpy (this transformation has been observed in
practice), the memcpy will use BIS instructions (observed in practice)
for performance, with out-of-thin-air values (observed in practice), and
the JavaThreads will occasionally crash during safepoint synchronization
due to said out-of-thin-air values.

So I guess the problem might be larger back when TLH was not default.
But this seems conceptually wrong.

/Erik

On 2019-07-04 09:17, David Holmes wrote:

> Hi Erik,
>
> On 4/07/2019 5:10 pm, Erik Österlund wrote:
>> Hi Dan,
>>
>> Thanks for picking this up. The change looks good.
>>
>> However, when reviewing this, I looked at the code for actually
>> restoring the table (ignore/notice safepoints). It copies the
>> dispatch table for the interpreter. There is a comment stating it is
>> important the copying is atomic for MT-safety, and I can definitely
>> see why. However, the copying the line after that comment is in fact
>> not atomic.
>
> Is it assuming "atomicity" by virtue of executing at a safepoint?
>
> David
> -----
>
>> Here is the copying code in templateInterpreter.cpp:
>>
>> static inline void copy_table(address* from, address* to, int size) {
>>    // Copy non-overlapping tables. The copy has to occur word wise
>> for MT safety.
>>    while (size-- > 0) *to++ = *from++;
>> }
>>
>> Copying using a loop of non-volatile loads and stores can and
>> definitely will on some compilers turn into memcpy calls instead as
>> the compiler (correctly) considers that an equivalent transformation.
>> And memcpy does not guarantee atomicity. Indeed on some platforms it
>> is not atomic. On some platforms it will even enjoy out-of-thin-air
>> values. Perhaps Copy::disjoint_words_atomic() would be a better
>> choice for atomic word copying. If not, at the very least we should
>> use Atomic::load/store here.
>>
>> Having said that, the fix for that issue seems like a separate RFE,
>> because it has been sitting there for a lot longer than TLH has been
>> around.
>>
>> Thanks,
>> /Erik
>>
>> On 2019-07-04 04:04, Daniel D. Daugherty wrote:
>>> Greetings,
>>>
>>> Robbin recently discovered this issue with Thread Local Handshakes.
>>> Since
>>> he's not available at the moment, I'm handling the issue:
>>>
>>>      JDK-8227117 normal interpreter table is not restored after
>>> single stepping with TLH
>>>      https://bugs.openjdk.java.net/browse/JDK-8227117
>>>
>>> When using Thread Local Handshakes, the normal interpreter table is
>>> not restored after single stepping. This issue is caused by the
>>> VM_ChangeSingleStep VM-op relying on SafepointSynchronize::end() to
>>> restore the normal interpreter table for the "off" case.
>>>
>>> Prior to Thread Local Handshakes, this was a valid assumption to make.
>>> SafepointSynchronize::end() has been refactored into
>>> disarm_safepoint() and it only calls Interpreter::ignore_safepoints()
>>> on the global safepoint branch. That matches up with the call to
>>> Interpreter::notice_safepoints() that is also on the global safepoint
>>> branch.
>>>
>>> The solution is for the VM_ChangeSingleStep VM-op for the "off" case
>>> to call Interpreter::ignore_safepoints() directly.
>>>
>>> Here's the webrev URL:
>>>
>>> http://cr.openjdk.java.net/~dcubed/8227117-webrev/0_for_jdk14/
>>>
>>> The fix is just a small addition to VM_ChangeSingleStep::doit():
>>>
>>>     if (_on) {
>>>       Interpreter::notice_safepoints();
>>> +  } else {
>>> +    Interpreter::ignore_safepoints();
>>>     }
>>>
>>> Everything else is just new logging support for future debugging of
>>> interpreter table management and single stepping.
>>>
>>> Tested this fix with Mach5 Tier[1-3] on the standard Oracle platforms.
>>> Mach5 Tier[4-6] on standard Oracle platforms is running now.
>>>
>>> Thanks, in advance, for questions, comments or suggestions.
>>>
>>> Dan
>>>

Reply | Threaded
Open this post in threaded view
|

Re: RFR(XXS): 8227117: normal interpreter table is not restored after single stepping with TLH

Daniel D. Daugherty
On 7/4/19 5:38 AM, David Holmes wrote:

> Hi Erik,
>
> On 4/07/2019 6:08 pm, Erik Österlund wrote:
>> Hi David,
>>
>> When you run without TLH, this copying mechanism is used to
>> synchronize the safepoint while JavaThreads are running. The
>> interpreter doesn't emit any polls then. Instead it clobbers the
>> dispatch table. JavaThreads will be reading from the dispatch table
>> while it is being (non-atomically) modified. That could crash. For
>> example with the Solaris + studio + SPARC - TLH configuration, the
>> compiler will almost certainly emit a memcpy (this transformation has
>> been observed in practice), the memcpy will use BIS instructions
>> (observed in practice) for performance, with out-of-thin-air values
>> (observed in practice), and the JavaThreads will occasionally crash
>> during safepoint synchronization due to said out-of-thin-air values.
>>
>> So I guess the problem might be larger back when TLH was not default.
>> But this seems conceptually wrong.
>
> I always thought there were two dispatch tables and we simply switched
> between them - not copied anything!

Based on my search back into the TeamWare repos, it looks like we have
always copied the table...

Dan


>
> David
>
>> /Erik
>>
>> On 2019-07-04 09:17, David Holmes wrote:
>>> Hi Erik,
>>>
>>> On 4/07/2019 5:10 pm, Erik Österlund wrote:
>>>> Hi Dan,
>>>>
>>>> Thanks for picking this up. The change looks good.
>>>>
>>>> However, when reviewing this, I looked at the code for actually
>>>> restoring the table (ignore/notice safepoints). It copies the
>>>> dispatch table for the interpreter. There is a comment stating it
>>>> is important the copying is atomic for MT-safety, and I can
>>>> definitely see why. However, the copying the line after that
>>>> comment is in fact not atomic.
>>>
>>> Is it assuming "atomicity" by virtue of executing at a safepoint?
>>>
>>> David
>>> -----
>>>
>>>> Here is the copying code in templateInterpreter.cpp:
>>>>
>>>> static inline void copy_table(address* from, address* to, int size) {
>>>>    // Copy non-overlapping tables. The copy has to occur word wise
>>>> for MT safety.
>>>>    while (size-- > 0) *to++ = *from++;
>>>> }
>>>>
>>>> Copying using a loop of non-volatile loads and stores can and
>>>> definitely will on some compilers turn into memcpy calls instead as
>>>> the compiler (correctly) considers that an equivalent
>>>> transformation. And memcpy does not guarantee atomicity. Indeed on
>>>> some platforms it is not atomic. On some platforms it will even
>>>> enjoy out-of-thin-air values. Perhaps Copy::disjoint_words_atomic()
>>>> would be a better choice for atomic word copying. If not, at the
>>>> very least we should use Atomic::load/store here.
>>>>
>>>> Having said that, the fix for that issue seems like a separate RFE,
>>>> because it has been sitting there for a lot longer than TLH has
>>>> been around.
>>>>
>>>> Thanks,
>>>> /Erik
>>>>
>>>> On 2019-07-04 04:04, Daniel D. Daugherty wrote:
>>>>> Greetings,
>>>>>
>>>>> Robbin recently discovered this issue with Thread Local
>>>>> Handshakes. Since
>>>>> he's not available at the moment, I'm handling the issue:
>>>>>
>>>>>      JDK-8227117 normal interpreter table is not restored after
>>>>> single stepping with TLH
>>>>>      https://bugs.openjdk.java.net/browse/JDK-8227117
>>>>>
>>>>> When using Thread Local Handshakes, the normal interpreter table is
>>>>> not restored after single stepping. This issue is caused by the
>>>>> VM_ChangeSingleStep VM-op relying on SafepointSynchronize::end() to
>>>>> restore the normal interpreter table for the "off" case.
>>>>>
>>>>> Prior to Thread Local Handshakes, this was a valid assumption to
>>>>> make.
>>>>> SafepointSynchronize::end() has been refactored into
>>>>> disarm_safepoint() and it only calls Interpreter::ignore_safepoints()
>>>>> on the global safepoint branch. That matches up with the call to
>>>>> Interpreter::notice_safepoints() that is also on the global safepoint
>>>>> branch.
>>>>>
>>>>> The solution is for the VM_ChangeSingleStep VM-op for the "off" case
>>>>> to call Interpreter::ignore_safepoints() directly.
>>>>>
>>>>> Here's the webrev URL:
>>>>>
>>>>> http://cr.openjdk.java.net/~dcubed/8227117-webrev/0_for_jdk14/
>>>>>
>>>>> The fix is just a small addition to VM_ChangeSingleStep::doit():
>>>>>
>>>>>     if (_on) {
>>>>>       Interpreter::notice_safepoints();
>>>>> +  } else {
>>>>> +    Interpreter::ignore_safepoints();
>>>>>     }
>>>>>
>>>>> Everything else is just new logging support for future debugging of
>>>>> interpreter table management and single stepping.
>>>>>
>>>>> Tested this fix with Mach5 Tier[1-3] on the standard Oracle
>>>>> platforms.
>>>>> Mach5 Tier[4-6] on standard Oracle platforms is running now.
>>>>>
>>>>> Thanks, in advance, for questions, comments or suggestions.
>>>>>
>>>>> Dan
>>>>>
>>

Reply | Threaded
Open this post in threaded view
|

Re: RFR(XXS): 8227117: normal interpreter table is not restored after single stepping with TLH

Daniel D. Daugherty
In reply to this post by Erik Österlund-2
On 7/5/19 1:16 PM, Daniel D. Daugherty wrote:

> On 7/4/19 3:18 AM, David Holmes wrote:
>> PS. I just noticed this comment:
>>
>> // This change must always be occur when at a safepoint.
>> // Being at a safepoint causes the interpreter to use the
>> // safepoint dispatch table which we overload to find single
>> // step points.  Just to be sure that it has been set, we
>> // call notice_safepoints when turning on single stepping.
>> // When we leave our current safepoint, should_post_single_step
>> // will be checked by the interpreter, and the table kept
>> // or changed accordingly.
>> void VM_ChangeSingleStep::doit() {
>>
>> The "when we leave the safepoint" part is actually the bug that is
>> being fixed - right? So the comment is not accurate.
>
> I'll take a closer look at this part of the comment:
>
> // When we leave our current safepoint, should_post_single_step
> // will be checked by the interpreter, and the table kept
> // or changed accordingly.
>
> and figure out how to clarify it as part of this change.

I ended up rewriting the entire block comment that David quoted
above to be this:

+// When _on == true, we use the safepoint interpreter dispatch table
+// to allow us to find the single step points. Otherwise, we switch
+// back to the regular interpreter dispatch table.
+// Note: We call Interpreter::notice_safepoints() and ignore_safepoints()
+// in a VM_Operation to safely make the dispatch table switch. We
+// no longer rely on the safepoint mechanism to do any of this work
+// for us.

Dan


>
> Dan
>
>
>>
>> David
>> -----
>>
>> On 4/07/2019 5:13 pm, David Holmes wrote:
>>> Hi Dan,
>>>
>>> On 4/07/2019 12:04 pm, Daniel D. Daugherty wrote:
>>>> Greetings,
>>>>
>>>> Robbin recently discovered this issue with Thread Local Handshakes.
>>>> Since
>>>> he's not available at the moment, I'm handling the issue:
>>>>
>>>>      JDK-8227117 normal interpreter table is not restored after
>>>> single stepping with TLH
>>>>      https://bugs.openjdk.java.net/browse/JDK-8227117
>>>>
>>>> When using Thread Local Handshakes, the normal interpreter table is
>>>> not restored after single stepping. This issue is caused by the
>>>> VM_ChangeSingleStep VM-op relying on SafepointSynchronize::end() to
>>>> restore the normal interpreter table for the "off" case.
>>>
>>> So the result of this is that debugging tests may run more slowly
>>> overall?
>>>
>>>> Prior to Thread Local Handshakes, this was a valid assumption to make.
>>>> SafepointSynchronize::end() has been refactored into
>>>> disarm_safepoint() and it only calls Interpreter::ignore_safepoints()
>>>> on the global safepoint branch. That matches up with the call to
>>>> Interpreter::notice_safepoints() that is also on the global safepoint
>>>> branch.
>>>>
>>>> The solution is for the VM_ChangeSingleStep VM-op for the "off" case
>>>> to call Interpreter::ignore_safepoints() directly.
>>>>
>>>> Here's the webrev URL:
>>>>
>>>> http://cr.openjdk.java.net/~dcubed/8227117-webrev/0_for_jdk14/
>>>>
>>>> The fix is just a small addition to VM_ChangeSingleStep::doit():
>>>>
>>>>     if (_on) {
>>>>       Interpreter::notice_safepoints();
>>>> +  } else {
>>>> +    Interpreter::ignore_safepoints();
>>>>     }
>>>
>>> Looks good - thanks for the detailed analysis in the bug report.
>>>
>>> I have on additional request from looking at related code - can you
>>> fix this confused initializer:
>>>
>>> VM_ChangeSingleStep::VM_ChangeSingleStep(bool on)
>>>    : _on(on != 0)
>>> {
>>> }
>>>
>>> as _on and on are both bool the assignment can be direct and we
>>> shouldn't be comparing a bool to 0 as a matter of style. Thanks.
>>>
>>>> Everything else is just new logging support for future debugging of
>>>> interpreter table management and single stepping.
>>>
>>> Logging looks good too.
>>>
>>> Thanks,
>>> David
>>> -----
>>>
>>>> Tested this fix with Mach5 Tier[1-3] on the standard Oracle platforms.
>>>> Mach5 Tier[4-6] on standard Oracle platforms is running now.
>>>>
>>>> Thanks, in advance, for questions, comments or suggestions.
>>>>
>>>> Dan
>>>>
>
>