RFR: 8264393: JDK-8258284 introduced dangling TLH race

classic Classic list List threaded Threaded
17 messages Options
Reply | Threaded
Open this post in threaded view
|

RFR: 8264393: JDK-8258284 introduced dangling TLH race

Daniel D.Daugherty
I ported some 20 year old tests using JDK-8262881 in order to help
test [~rehn]'s fix for JDK-8257831. These tests in combination with
one piece of the fix from JDK-8257831 revealed a bug in my fix for
JDK-8258284 from back in Dec 2020.

The race revealed by the ported tests from JDK-8262881 happens
only with nested ThreadsListHandles. When TLH2 is destroyed, the
thread updates its threads_hazard_ptr from the TLH2-list to the
TLH1-list; I made this change back in 2020.12 using JDK-8258284.
The threads_hazard_ptr can be observed by a thread calling
ThreadsSMRSupport::free_list() as a stable ThreadsList at the same
time as the TLH1 destructor is decrementing the nested_handle_cnt
that permits the ThreadsList to be freed. So the thread calling
ThreadsSMRSupport::free_list() thinks it has a stable hazard ptr
(TLH1-list), but that hazard ptr can be freed and causes lots of
confusion.

-------------

Depends on: https://git.openjdk.java.net/jdk/pull/3255

Commit messages:
 - 8264393.00.patch

Changes: https://git.openjdk.java.net/jdk/pull/3272/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=3272&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8264393
  Stats: 61 lines in 2 files changed: 22 ins; 21 del; 18 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3272.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3272/head:pull/3272

PR: https://git.openjdk.java.net/jdk/pull/3272
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264393: JDK-8258284 introduced dangling TLH race

Daniel D.Daugherty
On Tue, 30 Mar 2021 16:35:23 GMT, Daniel D. Daugherty <[hidden email]> wrote:

> I ported some 20 year old tests using JDK-8262881 in order to help
> test [~rehn]'s fix for JDK-8257831. These tests in combination with
> one piece of the fix from JDK-8257831 revealed a bug in my fix for
> JDK-8258284 from back in Dec 2020.
>
> The race revealed by the ported tests from JDK-8262881 happens
> only with nested ThreadsListHandles. When TLH2 is destroyed, the
> thread updates its threads_hazard_ptr from the TLH2-list to the
> TLH1-list; I made this change back in 2020.12 using JDK-8258284.
> The threads_hazard_ptr can be observed by a thread calling
> ThreadsSMRSupport::free_list() as a stable ThreadsList at the same
> time as the TLH1 destructor is decrementing the nested_handle_cnt
> that permits the ThreadsList to be freed. So the thread calling
> ThreadsSMRSupport::free_list() thinks it has a stable hazard ptr
> (TLH1-list), but that hazard ptr can be freed and causes lots of
> confusion.

@fisk, @robehn and @dholmes-ora - This one is a Threads-SMR race fix that I used
"JDK-8264123 add ThreadsList.is_valid() support" to find. It would be good to have
you guys brush off your Thread-SMR memories and review this fix.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3272
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264393: JDK-8258284 introduced dangling TLH race [v2]

Daniel D.Daugherty
In reply to this post by Daniel D.Daugherty
> I ported some 20 year old tests using JDK-8262881 in order to help
> test [~rehn]'s fix for JDK-8257831. These tests in combination with
> one piece of the fix from JDK-8257831 revealed a bug in my fix for
> JDK-8258284 from back in Dec 2020.
>
> The race revealed by the ported tests from JDK-8262881 happens
> only with nested ThreadsListHandles. When TLH2 is destroyed, the
> thread updates its threads_hazard_ptr from the TLH2-list to the
> TLH1-list; I made this change back in 2020.12 using JDK-8258284.
> The threads_hazard_ptr can be observed by a thread calling
> ThreadsSMRSupport::free_list() as a stable ThreadsList at the same
> time as the TLH1 destructor is decrementing the nested_handle_cnt
> that permits the ThreadsList to be freed. So the thread calling
> ThreadsSMRSupport::free_list() thinks it has a stable hazard ptr
> (TLH1-list), but that hazard ptr can be freed and causes lots of
> confusion.
>
> Update: This fix along with the fix from JDK-8264123 were stress
> tested with the new tests from JDK-8262881.

Daniel D. Daugherty has updated the pull request incrementally with one additional commit since the last revision:

  Fix two more ThreadsListHandle tests.

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/3272/files
  - new: https://git.openjdk.java.net/jdk/pull/3272/files/3ef27feb..98521a36

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=3272&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=3272&range=00-01

  Stats: 4 lines in 2 files changed: 0 ins; 0 del; 4 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3272.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3272/head:pull/3272

PR: https://git.openjdk.java.net/jdk/pull/3272
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264393: JDK-8258284 introduced dangling TLH race [v2]

David Holmes-2
On Tue, 30 Mar 2021 21:13:35 GMT, Daniel D. Daugherty <[hidden email]> wrote:

>> I ported some 20 year old tests using JDK-8262881 in order to help
>> test [~rehn]'s fix for JDK-8257831. These tests in combination with
>> one piece of the fix from JDK-8257831 revealed a bug in my fix for
>> JDK-8258284 from back in Dec 2020.
>>
>> The race revealed by the ported tests from JDK-8262881 happens
>> only with nested ThreadsListHandles. When TLH2 is destroyed, the
>> thread updates its threads_hazard_ptr from the TLH2-list to the
>> TLH1-list; I made this change back in 2020.12 using JDK-8258284.
>> The threads_hazard_ptr can be observed by a thread calling
>> ThreadsSMRSupport::free_list() as a stable ThreadsList at the same
>> time as the TLH1 destructor is decrementing the nested_handle_cnt
>> that permits the ThreadsList to be freed. So the thread calling
>> ThreadsSMRSupport::free_list() thinks it has a stable hazard ptr
>> (TLH1-list), but that hazard ptr can be freed and causes lots of
>> confusion.
>>
>> Update: This fix along with the fix from JDK-8264123 were stress
>> tested with the new tests from JDK-8262881.
>
> Daniel D. Daugherty has updated the pull request incrementally with one additional commit since the last revision:
>
>   Fix two more ThreadsListHandle tests.

Hi Dan,

So in a nutshell, when clearing a nested TLH you can't simply install the previous TLH as the threads_hazard_ptr, but instead set to NULL so that it is properly set by acquire_stable_list.

This seems reasonable to me.

Thanks,
David

src/hotspot/share/runtime/threadSMR.cpp line 542:

> 540:     // And that protocol cannot be properly done with a ThreadsList that
> 541:     // might not be the current system ThreadsList.
> 542:     assert(_previous->_list->_nested_handle_cnt > 0, "must be > than zero");

Nit: "> than" reads "greater than than".

-------------

Marked as reviewed by dholmes (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/3272
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264393: JDK-8258284 introduced dangling TLH race [v3]

Daniel D.Daugherty
In reply to this post by Daniel D.Daugherty
> I ported some 20 year old tests using JDK-8262881 in order to help
> test [~rehn]'s fix for JDK-8257831. These tests in combination with
> one piece of the fix from JDK-8257831 revealed a bug in my fix for
> JDK-8258284 from back in Dec 2020.
>
> The race revealed by the ported tests from JDK-8262881 happens
> only with nested ThreadsListHandles. When TLH2 is destroyed, the
> thread updates its threads_hazard_ptr from the TLH2-list to the
> TLH1-list; I made this change back in 2020.12 using JDK-8258284.
> The threads_hazard_ptr can be observed by a thread calling
> ThreadsSMRSupport::free_list() as a stable ThreadsList at the same
> time as the TLH1 destructor is decrementing the nested_handle_cnt
> that permits the ThreadsList to be freed. So the thread calling
> ThreadsSMRSupport::free_list() thinks it has a stable hazard ptr
> (TLH1-list), but that hazard ptr can be freed and causes lots of
> confusion.
>
> Update: This fix along with the fix from JDK-8264123 were stress
> tested with the new tests from JDK-8262881.

Daniel D. Daugherty has updated the pull request incrementally with one additional commit since the last revision:

  dholmes CR - fix nit in assert mesg.

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/3272/files
  - new: https://git.openjdk.java.net/jdk/pull/3272/files/98521a36..fa400b47

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=3272&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=3272&range=01-02

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3272.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3272/head:pull/3272

PR: https://git.openjdk.java.net/jdk/pull/3272
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264393: JDK-8258284 introduced dangling TLH race [v2]

Daniel D.Daugherty
In reply to this post by David Holmes-2
On Wed, 31 Mar 2021 00:28:19 GMT, David Holmes <[hidden email]> wrote:

>> Daniel D. Daugherty has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Fix two more ThreadsListHandle tests.
>
> Hi Dan,
>
> So in a nutshell, when clearing a nested TLH you can't simply install the previous TLH as the threads_hazard_ptr, but instead set to NULL so that it is properly set by acquire_stable_list.
>
> This seems reasonable to me.
>
> Thanks,
> David

@dholmes-ora - your nutshell summary is spot on. Thanks for the review.

> src/hotspot/share/runtime/threadSMR.cpp line 542:
>
>> 540:     // And that protocol cannot be properly done with a ThreadsList that
>> 541:     // might not be the current system ThreadsList.
>> 542:     assert(_previous->_list->_nested_handle_cnt > 0, "must be > than zero");
>
> Nit: "> than" reads "greater than than".

Fixed.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3272
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264393: JDK-8258284 introduced dangling TLH race [v2]

Robbin Ehn-2
On Wed, 31 Mar 2021 01:53:52 GMT, Daniel D. Daugherty <[hidden email]> wrote:

>> Hi Dan,
>>
>> So in a nutshell, when clearing a nested TLH you can't simply install the previous TLH as the threads_hazard_ptr, but instead set to NULL so that it is properly set by acquire_stable_list.
>>
>> This seems reasonable to me.
>>
>> Thanks,
>> David
>
> @dholmes-ora - your nutshell summary is spot on. Thanks for the review.

Hi Dan,

As you describe it and how it look to me, isn't the issue just that we decrement before before reinstating the old list?
So if we do first publish the previous list as current list and then decrement the nested handle count it should be okay?
E.g.:
_thread->set_threads_hazard_ptr(_previous->_list);
_list->dec_nested_handle_cnt();

So you just need to move the "if (_has_ref_count) {" piece of code after the potential reinstating of the previous list.

Or am I missing something?

Thanks for finding it!

-------------

PR: https://git.openjdk.java.net/jdk/pull/3272
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264393: JDK-8258284 introduced dangling TLH race [v2]

Daniel D.Daugherty
On Wed, 31 Mar 2021 08:04:02 GMT, Robbin Ehn <[hidden email]> wrote:

>> @dholmes-ora - your nutshell summary is spot on. Thanks for the review.
>
> Hi Dan,
>
> As you describe it and how it look to me, isn't the issue just that we decrement before before reinstating the old list?
> So if we do first publish the previous list as current list and then decrement the nested handle count it should be okay?
> E.g.:
> _thread->set_threads_hazard_ptr(_previous->_list);
> _list->dec_nested_handle_cnt();
>
> So you just need to move the "if (_has_ref_count) {" piece of code after the potential reinstating of the previous list.
>
> Or am I missing something?
>
> Thanks for finding it!

@robehn - Thanks for reviewing the fix. Yes, I think you have missed something. :-)

I modeled the analysis of this race after one of your favorite race techniques
in my analysis.ThreadsList.for_JBS attachment to the bug report: since there
is nothing to force the two threads to interact with each in a particular order,
I posited delays at various points in the execution of each thread. This quote
from the analysis.ThreadsList.for_JBS attachment describes scenario:

Race note:
    THR-1 is the thread calling the TLH destructors.
    THR-2 is the exiting thread calling ThreadsSMRSupport::free_list.

    If THR-2's ThreadsSMRSupport::free_list() call finishes its scan of of
    the active Threads _threads_hazard_ptr values BEFORE the TLH-2
    destructor code sequence updates THR-1->_threads_hazard_ptr from TL-2
    to TL-1, then TL-2 and not TL-1 will be on the list of in-use
    ThreadsLists:

      // Gather a hash table of the current hazard ptrs:
      ThreadScanHashtable *scan_table = new ThreadScanHashtable(hash_table_size);
      ScanHazardPtrGatherThreadsListClosure scan_cl(scan_table);
      threads_do(&scan_cl);

    At this point, THR-2's ThreadsSMRSupport::free_list() call stalls and
    THR-1 not only finishes the TLH-2 destructor, it also finishes its use
    of TLH-1 as described in the next section and starts to run the TLH-1
    destructor.

After the first ThreadsListHandle is released for THR-1:

+----------------------------------+
| THR-1                            |
+----------------------------------+
| _threads_hazard_ptr=0            |
| _threads_list_ptr=0              |
| _nested_threads_hazard_ptr_cnt=0 |
+----------------------------------+

                                         +----------------------+
                                         | TL-1                 |
                                         +----------------------+
                                         | _length=XXXXXXXXXXXX |
                                         | _next_list=XXXXXXXXX |
                                         | _threads[5]=XXXXXXXX |
                                         | _nested_handle_cnt=X |
                                         +----------------------+

Race note:
    THR-1 is running the TLH-1 destructor and has decremented the TL-1
    _nested_handle_cnt, but stalls before it clears _threads_hazard_ptr.

    The THR-2's ThreadsSMRSupport::free_list() call continues executing and
    checks the _to_delete_list ThreadsLists and if they are not in the
    scan_table and have a _nested_handle_cnt == 0 then, they are freed.

    This is how TL-1 is freed, but still remains in THR-1's
    _threads_hazard_ptr field and can be observed later by THR-2 as a valid
    hazard ptr in its call to smr_delete() on itself or by another thread
    perusing the system ThreadsList. This is especially true after
    ThreadsSMRSupport::free_list() has finished its work and released the
    Threads_lock which will allow another thread to walk the set of hazard
    ptrs.

    THR-1 resumes running again and clears _threads_hazard_ptr. However,
    the other thread walking the set of hazard ptrs has the stale TL-1
    value and tries to use it. Boom!

Switching the decrement:
`_list->dec_nested_handle_cnt()`
to happen after the:
`_thread->set_threads_hazard_ptr(_previous->_list)`
doesn't help because THR-2 observed TL-2 before we
reached that code and then THR-2 stalled until after all
the updates were made. THR-2 recorded TL-2 in the
collection of current hazard ptrs and THR-2 knows nothing
about TL-1 being a valid hazard ptr so THR-2 can free it.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3272
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264393: JDK-8258284 introduced dangling TLH race [v2]

Daniel D.Daugherty
On Wed, 31 Mar 2021 16:18:24 GMT, Daniel D. Daugherty <[hidden email]> wrote:

>> Hi Dan,
>>
>> As you describe it and how it look to me, isn't the issue just that we decrement before before reinstating the old list?
>> So if we do first publish the previous list as current list and then decrement the nested handle count it should be okay?
>> E.g.:
>> _thread->set_threads_hazard_ptr(_previous->_list);
>> _list->dec_nested_handle_cnt();
>>
>> So you just need to move the "if (_has_ref_count) {" piece of code after the potential reinstating of the previous list.
>>
>> Or am I missing something?
>>
>> Thanks for finding it!
>
> @robehn - Thanks for reviewing the fix. Yes, I think you have missed something. :-)
>
> I modeled the analysis of this race after one of your favorite race techniques
> in my analysis.ThreadsList.for_JBS attachment to the bug report: since there
> is nothing to force the two threads to interact with each in a particular order,
> I posited delays at various points in the execution of each thread. This quote
> from the analysis.ThreadsList.for_JBS attachment describes scenario:
>
> Race note:
>     THR-1 is the thread calling the TLH destructors.
>     THR-2 is the exiting thread calling ThreadsSMRSupport::free_list.
>
>     If THR-2's ThreadsSMRSupport::free_list() call finishes its scan of of
>     the active Threads _threads_hazard_ptr values BEFORE the TLH-2
>     destructor code sequence updates THR-1->_threads_hazard_ptr from TL-2
>     to TL-1, then TL-2 and not TL-1 will be on the list of in-use
>     ThreadsLists:
>
>       // Gather a hash table of the current hazard ptrs:
>       ThreadScanHashtable *scan_table = new ThreadScanHashtable(hash_table_size);
>       ScanHazardPtrGatherThreadsListClosure scan_cl(scan_table);
>       threads_do(&scan_cl);
>
>     At this point, THR-2's ThreadsSMRSupport::free_list() call stalls and
>     THR-1 not only finishes the TLH-2 destructor, it also finishes its use
>     of TLH-1 as described in the next section and starts to run the TLH-1
>     destructor.
>
> After the first ThreadsListHandle is released for THR-1:
>
> +----------------------------------+
> | THR-1                            |
> +----------------------------------+
> | _threads_hazard_ptr=0            |
> | _threads_list_ptr=0              |
> | _nested_threads_hazard_ptr_cnt=0 |
> +----------------------------------+
>
>                                          +----------------------+
>                                          | TL-1                 |
>                                          +----------------------+
>                                          | _length=XXXXXXXXXXXX |
>                                          | _next_list=XXXXXXXXX |
>                                          | _threads[5]=XXXXXXXX |
>                                          | _nested_handle_cnt=X |
>                                          +----------------------+
>
> Race note:
>     THR-1 is running the TLH-1 destructor and has decremented the TL-1
>     _nested_handle_cnt, but stalls before it clears _threads_hazard_ptr.
>
>     The THR-2's ThreadsSMRSupport::free_list() call continues executing and
>     checks the _to_delete_list ThreadsLists and if they are not in the
>     scan_table and have a _nested_handle_cnt == 0 then, they are freed.
>
>     This is how TL-1 is freed, but still remains in THR-1's
>     _threads_hazard_ptr field and can be observed later by THR-2 as a valid
>     hazard ptr in its call to smr_delete() on itself or by another thread
>     perusing the system ThreadsList. This is especially true after
>     ThreadsSMRSupport::free_list() has finished its work and released the
>     Threads_lock which will allow another thread to walk the set of hazard
>     ptrs.
>
>     THR-1 resumes running again and clears _threads_hazard_ptr. However,
>     the other thread walking the set of hazard ptrs has the stale TL-1
>     value and tries to use it. Boom!
>
> Switching the decrement:
> `_list->dec_nested_handle_cnt()`
> to happen after the:
> `_thread->set_threads_hazard_ptr(_previous->_list)`
> doesn't help because THR-2 observed TL-2 before we
> reached that code and then THR-2 stalled until after all
> the updates were made. THR-2 recorded TL-2 in the
> collection of current hazard ptrs and THR-2 knows nothing
> about TL-1 being a valid hazard ptr so THR-2 can free it.

I tested JDK-8264123 together with this fix (JDK-8264393) in Mach5 Tier[1-7]
and there are no regressions.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3272
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264393: JDK-8258284 introduced dangling TLH race [v3]

Robbin Ehn-2
In reply to this post by Daniel D.Daugherty
On Wed, 31 Mar 2021 01:57:40 GMT, Daniel D. Daugherty <[hidden email]> wrote:

>> I ported some 20 year old tests using JDK-8262881 in order to help
>> test [~rehn]'s fix for JDK-8257831. These tests in combination with
>> one piece of the fix from JDK-8257831 revealed a bug in my fix for
>> JDK-8258284 from back in Dec 2020.
>>
>> The race revealed by the ported tests from JDK-8262881 happens
>> only with nested ThreadsListHandles. When TLH2 is destroyed, the
>> thread updates its threads_hazard_ptr from the TLH2-list to the
>> TLH1-list; I made this change back in 2020.12 using JDK-8258284.
>> The threads_hazard_ptr can be observed by a thread calling
>> ThreadsSMRSupport::free_list() as a stable ThreadsList at the same
>> time as the TLH1 destructor is decrementing the nested_handle_cnt
>> that permits the ThreadsList to be freed. So the thread calling
>> ThreadsSMRSupport::free_list() thinks it has a stable hazard ptr
>> (TLH1-list), but that hazard ptr can be freed and causes lots of
>> confusion.
>>
>> Update: This fix along with the fix from JDK-8264123 were stress
>> tested with the new tests from JDK-8262881.
>
> Daniel D. Daugherty has updated the pull request incrementally with one additional commit since the last revision:
>
>   dholmes CR - fix nit in assert mesg.

Marked as reviewed by rehn (Reviewer).

-------------

PR: https://git.openjdk.java.net/jdk/pull/3272
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264393: JDK-8258284 introduced dangling TLH race [v2]

Robbin Ehn-2
In reply to this post by Daniel D.Daugherty
On Wed, 31 Mar 2021 16:24:11 GMT, Daniel D. Daugherty <[hidden email]> wrote:

>> @robehn - Thanks for reviewing the fix. Yes, I think you have missed something. :-)
>>
>> I modeled the analysis of this race after one of your favorite race techniques
>> in my analysis.ThreadsList.for_JBS attachment to the bug report: since there
>> is nothing to force the two threads to interact with each in a particular order,
>> I posited delays at various points in the execution of each thread. This quote
>> from the analysis.ThreadsList.for_JBS attachment describes scenario:
>>
>> Race note:
>>     THR-1 is the thread calling the TLH destructors.
>>     THR-2 is the exiting thread calling ThreadsSMRSupport::free_list.
>>
>>     If THR-2's ThreadsSMRSupport::free_list() call finishes its scan of of
>>     the active Threads _threads_hazard_ptr values BEFORE the TLH-2
>>     destructor code sequence updates THR-1->_threads_hazard_ptr from TL-2
>>     to TL-1, then TL-2 and not TL-1 will be on the list of in-use
>>     ThreadsLists:
>>
>>       // Gather a hash table of the current hazard ptrs:
>>       ThreadScanHashtable *scan_table = new ThreadScanHashtable(hash_table_size);
>>       ScanHazardPtrGatherThreadsListClosure scan_cl(scan_table);
>>       threads_do(&scan_cl);
>>
>>     At this point, THR-2's ThreadsSMRSupport::free_list() call stalls and
>>     THR-1 not only finishes the TLH-2 destructor, it also finishes its use
>>     of TLH-1 as described in the next section and starts to run the TLH-1
>>     destructor.
>>
>> After the first ThreadsListHandle is released for THR-1:
>>
>> +----------------------------------+
>> | THR-1                            |
>> +----------------------------------+
>> | _threads_hazard_ptr=0            |
>> | _threads_list_ptr=0              |
>> | _nested_threads_hazard_ptr_cnt=0 |
>> +----------------------------------+
>>
>>                                          +----------------------+
>>                                          | TL-1                 |
>>                                          +----------------------+
>>                                          | _length=XXXXXXXXXXXX |
>>                                          | _next_list=XXXXXXXXX |
>>                                          | _threads[5]=XXXXXXXX |
>>                                          | _nested_handle_cnt=X |
>>                                          +----------------------+
>>
>> Race note:
>>     THR-1 is running the TLH-1 destructor and has decremented the TL-1
>>     _nested_handle_cnt, but stalls before it clears _threads_hazard_ptr.
>>
>>     The THR-2's ThreadsSMRSupport::free_list() call continues executing and
>>     checks the _to_delete_list ThreadsLists and if they are not in the
>>     scan_table and have a _nested_handle_cnt == 0 then, they are freed.
>>
>>     This is how TL-1 is freed, but still remains in THR-1's
>>     _threads_hazard_ptr field and can be observed later by THR-2 as a valid
>>     hazard ptr in its call to smr_delete() on itself or by another thread
>>     perusing the system ThreadsList. This is especially true after
>>     ThreadsSMRSupport::free_list() has finished its work and released the
>>     Threads_lock which will allow another thread to walk the set of hazard
>>     ptrs.
>>
>>     THR-1 resumes running again and clears _threads_hazard_ptr. However,
>>     the other thread walking the set of hazard ptrs has the stale TL-1
>>     value and tries to use it. Boom!
>>
>> Switching the decrement:
>> `_list->dec_nested_handle_cnt()`
>> to happen after the:
>> `_thread->set_threads_hazard_ptr(_previous->_list)`
>> doesn't help because THR-2 observed TL-2 before we
>> reached that code and then THR-2 stalled until after all
>> the updates were made. THR-2 recorded TL-2 in the
>> collection of current hazard ptrs and THR-2 knows nothing
>> about TL-1 being a valid hazard ptr so THR-2 can free it.
>
> I tested JDK-8264123 together with this fix (JDK-8264393) in Mach5 Tier[1-7]
> and there are no regressions.

Hi Dan, yes thanks.

So I would say, you may not install a ThreadsList into your hazard pointer if it's on the _to_delete_list.
Got it, thanks.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3272
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264393: JDK-8258284 introduced dangling TLH race [v3]

Daniel D.Daugherty
In reply to this post by Robbin Ehn-2
On Thu, 1 Apr 2021 07:17:20 GMT, Robbin Ehn <[hidden email]> wrote:

>> Daniel D. Daugherty has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   dholmes CR - fix nit in assert mesg.
>
> Marked as reviewed by rehn (Reviewer).

@robehn - Thanks for closing the loop on your review thread.

@dholmes-ora nutshell summary covers it:
when clearing a nested TLH you can't simply install the previous TLH as the threads_hazard_ptr, but instead set to NULL so that it is properly set by acquire_stable_list.

Another way to put it is that the `_threads_hazard_ptr` field must only be set to
a non-NULL value by the acquire_stable_list() protocol.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3272
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264393: JDK-8258284 introduced dangling TLH race [v3]

Daniel D.Daugherty
On Thu, 1 Apr 2021 17:15:53 GMT, Daniel D. Daugherty <[hidden email]> wrote:

>> Marked as reviewed by rehn (Reviewer).
>
> @robehn - Thanks for closing the loop on your review thread.
>
> @dholmes-ora nutshell summary covers it:
> when clearing a nested TLH you can't simply install the previous TLH as the threads_hazard_ptr, but instead set to NULL so that it is properly set by acquire_stable_list.
>
> Another way to put it is that the `_threads_hazard_ptr` field must only be set to
> a non-NULL value by the acquire_stable_list() protocol.

@fisk - since I'm tweaking your code (again), I really need you to chime in on
this review thread.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3272
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264393: JDK-8258284 introduced dangling TLH race [v3]

Erik Österlund-3
In reply to this post by Daniel D.Daugherty
On Wed, 31 Mar 2021 01:57:40 GMT, Daniel D. Daugherty <[hidden email]> wrote:

>> I ported some 20 year old tests using JDK-8262881 in order to help
>> test [~rehn]'s fix for JDK-8257831. These tests in combination with
>> one piece of the fix from JDK-8257831 revealed a bug in my fix for
>> JDK-8258284 from back in Dec 2020.
>>
>> The race revealed by the ported tests from JDK-8262881 happens
>> only with nested ThreadsListHandles. When TLH2 is destroyed, the
>> thread updates its threads_hazard_ptr from the TLH2-list to the
>> TLH1-list; I made this change back in 2020.12 using JDK-8258284.
>> The threads_hazard_ptr can be observed by a thread calling
>> ThreadsSMRSupport::free_list() as a stable ThreadsList at the same
>> time as the TLH1 destructor is decrementing the nested_handle_cnt
>> that permits the ThreadsList to be freed. So the thread calling
>> ThreadsSMRSupport::free_list() thinks it has a stable hazard ptr
>> (TLH1-list), but that hazard ptr can be freed and causes lots of
>> confusion.
>>
>> Update: This fix along with the fix from JDK-8264123 were stress
>> tested with the new tests from JDK-8262881.
>
> Daniel D. Daugherty has updated the pull request incrementally with one additional commit since the last revision:
>
>   dholmes CR - fix nit in assert mesg.

Not sure how I missed this race earlier. I never originally intended for hazard pointers to be set when exiting nested ThreadsListHandles. Anyway - the problem is understood and the fix looks good.

-------------

Marked as reviewed by eosterlund (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/3272
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264393: JDK-8258284 introduced dangling TLH race [v3]

Daniel D.Daugherty
On Sat, 3 Apr 2021 14:52:29 GMT, Erik Österlund <[hidden email]> wrote:

>> Daniel D. Daugherty has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   dholmes CR - fix nit in assert mesg.
>
> Not sure how I missed this race earlier. I never originally intended for hazard pointers to be set when exiting nested ThreadsListHandles. Anyway - the problem is understood and the fix looks good.

@fisk - Thanks for the review. The fault is mine from when I
worked on JDK-8258284. I even added the new test that
verified the improper behavior. In any case, it's fixed now and
comments are added that should prevent any of us that play
with this code from making the same mistake again.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3272
Reply | Threaded
Open this post in threaded view
|

Integrated: 8264393: JDK-8258284 introduced dangling TLH race

Daniel D.Daugherty
In reply to this post by Daniel D.Daugherty
On Tue, 30 Mar 2021 16:35:23 GMT, Daniel D. Daugherty <[hidden email]> wrote:

> I ported some 20 year old tests using JDK-8262881 in order to help
> test [~rehn]'s fix for JDK-8257831. These tests in combination with
> one piece of the fix from JDK-8257831 revealed a bug in my fix for
> JDK-8258284 from back in Dec 2020.
>
> The race revealed by the ported tests from JDK-8262881 happens
> only with nested ThreadsListHandles. When TLH2 is destroyed, the
> thread updates its threads_hazard_ptr from the TLH2-list to the
> TLH1-list; I made this change back in 2020.12 using JDK-8258284.
> The threads_hazard_ptr can be observed by a thread calling
> ThreadsSMRSupport::free_list() as a stable ThreadsList at the same
> time as the TLH1 destructor is decrementing the nested_handle_cnt
> that permits the ThreadsList to be freed. So the thread calling
> ThreadsSMRSupport::free_list() thinks it has a stable hazard ptr
> (TLH1-list), but that hazard ptr can be freed and causes lots of
> confusion.
>
> Update: This fix along with the fix from JDK-8264123 were stress
> tested with the new tests from JDK-8262881.

This pull request has now been integrated.

Changeset: f259eeaf
Author:    Daniel D. Daugherty <[hidden email]>
URL:       https://git.openjdk.java.net/jdk/commit/f259eeaf
Stats:     65 lines in 4 files changed: 22 ins; 21 del; 22 mod

8264393: JDK-8258284 introduced dangling TLH race

Reviewed-by: dholmes, rehn, eosterlund

-------------

PR: https://git.openjdk.java.net/jdk/pull/3272
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8264393: JDK-8258284 introduced dangling TLH race [v3]

Robbin Ehn-2
In reply to this post by Erik Österlund-3
On Sat, 3 Apr 2021 14:52:29 GMT, Erik Österlund <[hidden email]> wrote:

> Not sure how I missed this race earlier. I never originally intended for hazard pointers to be set when exiting nested ThreadsListHandles. Anyway - the problem is understood and the fix looks good.

I don't think you did, because originally we never intended them to be nested.
I still feel that the simple solution to nesting: pausing deletion on nesting, thus never changing the hazard pointer would be better (as in much simpler).
The only drawback is if there is a pathological case were we always have one thread with nested ThreadsLists (but hard to imagine).

-------------

PR: https://git.openjdk.java.net/jdk/pull/3272