Quantcast

JDK 9 RFR(s): 8150488: add note to Scanner.findAll() regarding possible infinite streams

classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

JDK 9 RFR(s): 8150488: add note to Scanner.findAll() regarding possible infinite streams

Stuart Marks
Hi all,

Please review these non-normative textual additions to the Scanner.findAll()
method docs. These methods were added earlier in JDK 9; there's a small pitfall
if the regex can match zero characters.

Thanks,

s'marks


# HG changeset patch
# User smarks
# Date 1490749958 25200
#      Tue Mar 28 18:12:38 2017 -0700
# Node ID 6b43c4698752779793d58813f46d3687c17dde75
# Parent  fb54b256d751ae3191e9cef42ff9f5630931f047
8150488: add note to Scanner.findAll() regarding possible infinite streams
Reviewed-by: XXX

diff -r fb54b256d751 -r 6b43c4698752
src/java.base/share/classes/java/util/Scanner.java
--- a/src/java.base/share/classes/java/util/Scanner.java Mon Mar 27 15:12:01
2017 -0700
+++ b/src/java.base/share/classes/java/util/Scanner.java Tue Mar 28 18:12:38
2017 -0700
@@ -2808,6 +2808,10 @@
       * }
       * }</pre>
       *
+     * <p>The pattern must always match at least one character. If the pattern
+     * can match zero characters, the result will be an infinite stream
+     * of empty matches.
+     *
       * @param pattern the pattern to be matched
       * @return a sequential stream of match results
       * @throws NullPointerException if pattern is null
@@ -2829,6 +2833,11 @@
       *     scanner.findAll(Pattern.compile(patString))
       * }</pre>
       *
+     * @apiNote
+     * The pattern must always match at least one character. If the pattern
+     * can match zero characters, the result will be an infinite stream
+     * of empty matches.
+     *
       * @param patString the pattern string
       * @return a sequential stream of match results
       * @throws NullPointerException if patString is null
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: JDK 9 RFR(s): 8150488: add note to Scanner.findAll() regarding possible infinite streams

Xueming Shen
On 3/29/17, 5:56 PM, Stuart Marks wrote:
> Hi all,
>
> Please review these non-normative textual additions to the
> Scanner.findAll() method docs. These methods were added earlier in JDK
> 9; there's a small pitfall if the regex can match zero characters.
>
Stuart,

This might practically put the api itself almost useless? it might be an
easy task to spot
whether or not it's a 0-width-match-possible regex when the regex is
simple, but it gets
harder and harder, if not impossible when the regex gets complicated,
especially consider
the possible use scenario that the use site is embedded deeply inside a
library implementation.

The alternative is to "fix" it, maybe as what Matcher.find() does, if
the previous match is
zero-width-match (the fist==last), we step one to the next cursor before
next try. I know
Scanner.findPatternInBuffer() is setting new region set every time it is
invoked which makes
it complicated, but I would assume it might be still worth a trying? for
example, utilize the
"hasNextResult"/matcher.end(). I'm not sure without looking into the
code, does

while (hasNext(pattern)) {
     next(pattern);
}

have the same issue, when pattern matches 0-width?

Thanks!
-Sherman




> Thanks,
>
> s'marks
>
>
> # HG changeset patch
> # User smarks
> # Date 1490749958 25200
> #      Tue Mar 28 18:12:38 2017 -0700
> # Node ID 6b43c4698752779793d58813f46d3687c17dde75
> # Parent  fb54b256d751ae3191e9cef42ff9f5630931f047
> 8150488: add note to Scanner.findAll() regarding possible infinite
> streams
> Reviewed-by: XXX
>
> diff -r fb54b256d751 -r 6b43c4698752
> src/java.base/share/classes/java/util/Scanner.java
> --- a/src/java.base/share/classes/java/util/Scanner.java    Mon Mar 27
> 15:12:01 2017 -0700
> +++ b/src/java.base/share/classes/java/util/Scanner.java    Tue Mar 28
> 18:12:38 2017 -0700
> @@ -2808,6 +2808,10 @@
>       * }
>       * }</pre>
>       *
> +     * <p>The pattern must always match at least one character. If
> the pattern
> +     * can match zero characters, the result will be an infinite stream
> +     * of empty matches.
> +     *
>       * @param pattern the pattern to be matched
>       * @return a sequential stream of match results
>       * @throws NullPointerException if pattern is null
> @@ -2829,6 +2833,11 @@
>       *     scanner.findAll(Pattern.compile(patString))
>       * }</pre>
>       *
> +     * @apiNote
> +     * The pattern must always match at least one character. If the
> pattern
> +     * can match zero characters, the result will be an infinite stream
> +     * of empty matches.
> +     *
>       * @param patString the pattern string
>       * @return a sequential stream of match results
>       * @throws NullPointerException if patString is null

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RE: JDK 9 RFR(s): 8150488: add note to Scanner.findAll() regardingpossible infinite streams

Timo Kinnunen
Hi,

I guess this somewhat contrived example also wouldn’t work?

                String s = "\\b\\w+|\\G|\\B";
                String t = "Matcher m = Pattern.compile(s).matcher(t);\n";
                Matcher m = Pattern.compile(s).matcher(t);
                while(m.find()) {
                        System.out.println("'" + m.group() + "'");
                }
                // Outputs:
                // 'Matcher'
                // ''
                // 'm'
                // ''
                // ''
                // ''
                // 'Pattern'
                // ''
                // 'compile'
                // ''
                // 's'
                // ''
                // ''
                // 'matcher'
                // ''
                // 't'
                // ''
                // ''
                // ''
                // ''



Sent from Mail for Windows 10

From: Xueming Shen
Sent: Thursday, March 30, 2017 05:41
To: [hidden email]
Subject: Re: JDK 9 RFR(s): 8150488: add note to Scanner.findAll() regardingpossible infinite streams

On 3/29/17, 5:56 PM, Stuart Marks wrote:
> Hi all,
>
> Please review these non-normative textual additions to the
> Scanner.findAll() method docs. These methods were added earlier in JDK
> 9; there's a small pitfall if the regex can match zero characters.
>
Stuart,

This might practically put the api itself almost useless? it might be an
easy task to spot
whether or not it's a 0-width-match-possible regex when the regex is
simple, but it gets
harder and harder, if not impossible when the regex gets complicated,
especially consider
the possible use scenario that the use site is embedded deeply inside a
library implementation.

The alternative is to "fix" it, maybe as what Matcher.find() does, if
the previous match is
zero-width-match (the fist==last), we step one to the next cursor before
next try. I know
Scanner.findPatternInBuffer() is setting new region set every time it is
invoked which makes
it complicated, but I would assume it might be still worth a trying? for
example, utilize the
"hasNextResult"/matcher.end(). I'm not sure without looking into the
code, does

while (hasNext(pattern)) {
     next(pattern);
}

have the same issue, when pattern matches 0-width?

Thanks!
-Sherman




> Thanks,
>
> s'marks
>
>
> # HG changeset patch
> # User smarks
> # Date 1490749958 25200
> #      Tue Mar 28 18:12:38 2017 -0700
> # Node ID 6b43c4698752779793d58813f46d3687c17dde75
> # Parent  fb54b256d751ae3191e9cef42ff9f5630931f047
> 8150488: add note to Scanner.findAll() regarding possible infinite
> streams
> Reviewed-by: XXX
>
> diff -r fb54b256d751 -r 6b43c4698752
> src/java.base/share/classes/java/util/Scanner.java
> --- a/src/java.base/share/classes/java/util/Scanner.java    Mon Mar 27
> 15:12:01 2017 -0700
> +++ b/src/java.base/share/classes/java/util/Scanner.java    Tue Mar 28
> 18:12:38 2017 -0700
> @@ -2808,6 +2808,10 @@
>       * }
>       * }</pre>
>       *
> +     * <p>The pattern must always match at least one character. If
> the pattern
> +     * can match zero characters, the result will be an infinite stream
> +     * of empty matches.
> +     *
>       * @param pattern the pattern to be matched
>       * @return a sequential stream of match results
>       * @throws NullPointerException if pattern is null
> @@ -2829,6 +2833,11 @@
>       *     scanner.findAll(Pattern.compile(patString))
>       * }</pre>
>       *
> +     * @apiNote
> +     * The pattern must always match at least one character. If the
> pattern
> +     * can match zero characters, the result will be an infinite stream
> +     * of empty matches.
> +     *
>       * @param patString the pattern string
>       * @return a sequential stream of match results
>       * @throws NullPointerException if patString is null


Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: JDK 9 RFR(s): 8150488: add note to Scanner.findAll() regardingpossible infinite streams

Stuart Marks
Hi Timo, Sherman,

Thanks for looking at this.

Sherman wrote:

> This might practically put the api itself almost useless? it might be an easy task to spot
> whether or not it's a 0-width-match-possible regex when the regex is simple, but it gets
> harder and harder, if not impossible when the regex gets complicated, especially consider
> the possible use scenario that the use site is embedded deeply inside a library implementation.

Well, not "useless", but perhaps less useful than one might like. :-)

I think this is potentially surprising behavior, which is why I at least wanted
to add the note. It's not clear to me whether we should try to fix this by
changing Scanner though.

Essentially, findAll() is defined in terms of findWithinHorizon(pattern, 0). So
if one were to write a loop like so:

     String str;
     while ((str = scanner.findWithinHorizon(pattern, 0)) != null) {
         System.out.println(str);
     }

then this loop would have the same problem if pattern were to match zero characters.

> The alternative is to "fix" it, maybe as what Matcher.find() does, if the previous match is
> zero-width-match (the fist==last), we step one to the next cursor before next try. I know

Interesting, I didn't know Matcher.find() advances the cursor like this. But
Scanner.findWithinHorizon() apparently doesn't, so that's why an infinite loop
can occur.

> Scanner.findPatternInBuffer() is setting new region set every time it is invoked which makes
> it complicated, but I would assume it might be still worth a trying? for example, utilize the
> "hasNextResult"/matcher.end(). I'm not sure without looking into the code, does
>
> while (hasNext(pattern)) {
>     next(pattern);
> }
>
> have the same issue, when pattern matches 0-width?

No, this doesn't have the problem, because hasNext(pat) and next(pat) match
delimited tokens. Each call to next() implicitly advances past the next
delimiter to reach the subsequent token, if any.


On 3/30/17 8:56 AM, Timo Kinnunen wrote:
> I guess this somewhat contrived example also wouldn’t work?
>
> String s = "\\b\\w+|\\G|\\B";
> String t = "Matcher m = Pattern.compile(s).matcher(t);\n";
> Matcher m = Pattern.compile(s).matcher(t);
> while(m.find()) {
> System.out.println("'" + m.group() + "'");
> }

Right, so if you rewrote this loop to use Scanner.findWithinHorizon() instead of
Matcher,

     Scanner sc = new Scanner(t);
     String str;
     while ((str = sc.findWithinHorizon(s, 0)) != null) {
         System.out.println("'" + str + "'");
     }

you'd get an infinite loop with str being continually assigned the empty string.
As Sherman mentioned, the Matcher.find() will advance the cursor if it gets a
zero-width match, avoiding this problem.

* * *

This didn't come up in the code review thread, which was mostly about concurrent
modification and late-binding of the spliterator:

   http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-September/035034.html

I remember noting this phenomenon a while back, which is why I had filed the bug
to add a note. I seem to remember discussing it, though, but it might have been
in a meeting or in a hallway conversation.

This bug (JDK-8150488) does note that an infinite stream might be unexpected or
surprising, but it's not a fatal problem. It can be terminated with limit(). It
can also be terminated with takeWhile(), also added in JDK 9. Maybe I could
mention these in the API note.

I guess we could also consider changing the implicit findWithinHorizon() loop
that findAll() does, perhaps by having it terminate on a zero-width match. Or we
could even change findWithinHorizon's behavior if it gets a zero-width match,
siilar to what Matcher.find() does. But I'm quite reluctant to start making such
changes at this point.

s'marks



> // Outputs:
> // 'Matcher'
> // ''
> // 'm'
> // ''
> // ''
> // ''
> // 'Pattern'
> // ''
> // 'compile'
> // ''
> // 's'
> // ''
> // ''
> // 'matcher'
> // ''
> // 't'
> // ''
> // ''
> // ''
> // ''
>
>
>
> Sent from Mail for Windows 10
>
> From: Xueming Shen
> Sent: Thursday, March 30, 2017 05:41
> To: [hidden email]
> Subject: Re: JDK 9 RFR(s): 8150488: add note to Scanner.findAll() regardingpossible infinite streams
>
> On 3/29/17, 5:56 PM, Stuart Marks wrote:
>> Hi all,
>>
>> Please review these non-normative textual additions to the
>> Scanner.findAll() method docs. These methods were added earlier in JDK
>> 9; there's a small pitfall if the regex can match zero characters.
>>
> Stuart,
>
> This might practically put the api itself almost useless? it might be an
> easy task to spot
> whether or not it's a 0-width-match-possible regex when the regex is
> simple, but it gets
> harder and harder, if not impossible when the regex gets complicated,
> especially consider
> the possible use scenario that the use site is embedded deeply inside a
> library implementation.
>
> The alternative is to "fix" it, maybe as what Matcher.find() does, if
> the previous match is
> zero-width-match (the fist==last), we step one to the next cursor before
> next try. I know
> Scanner.findPatternInBuffer() is setting new region set every time it is
> invoked which makes
> it complicated, but I would assume it might be still worth a trying? for
> example, utilize the
> "hasNextResult"/matcher.end(). I'm not sure without looking into the
> code, does
>
> while (hasNext(pattern)) {
>      next(pattern);
> }
>
> have the same issue, when pattern matches 0-width?
>
> Thanks!
> -Sherman
>
>
>
>
>> Thanks,
>>
>> s'marks
>>
>>
>> # HG changeset patch
>> # User smarks
>> # Date 1490749958 25200
>> #      Tue Mar 28 18:12:38 2017 -0700
>> # Node ID 6b43c4698752779793d58813f46d3687c17dde75
>> # Parent  fb54b256d751ae3191e9cef42ff9f5630931f047
>> 8150488: add note to Scanner.findAll() regarding possible infinite
>> streams
>> Reviewed-by: XXX
>>
>> diff -r fb54b256d751 -r 6b43c4698752
>> src/java.base/share/classes/java/util/Scanner.java
>> --- a/src/java.base/share/classes/java/util/Scanner.java    Mon Mar 27
>> 15:12:01 2017 -0700
>> +++ b/src/java.base/share/classes/java/util/Scanner.java    Tue Mar 28
>> 18:12:38 2017 -0700
>> @@ -2808,6 +2808,10 @@
>>       * }
>>       * }</pre>
>>       *
>> +     * <p>The pattern must always match at least one character. If
>> the pattern
>> +     * can match zero characters, the result will be an infinite stream
>> +     * of empty matches.
>> +     *
>>       * @param pattern the pattern to be matched
>>       * @return a sequential stream of match results
>>       * @throws NullPointerException if pattern is null
>> @@ -2829,6 +2833,11 @@
>>       *     scanner.findAll(Pattern.compile(patString))
>>       * }</pre>
>>       *
>> +     * @apiNote
>> +     * The pattern must always match at least one character. If the
>> pattern
>> +     * can match zero characters, the result will be an infinite stream
>> +     * of empty matches.
>> +     *
>>       * @param patString the pattern string
>>       * @return a sequential stream of match results
>>       * @throws NullPointerException if patString is null
>
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: JDK 9 RFR(s): 8150488: add note to Scanner.findAll() regardingpossible infinite streams

Stuart Marks
Anything further on this?

I'd at least like to add the API note I proposed in order to document this
issue. I'm reluctant to start tinkering with the behavior of this method at this
late stage in the release.

BTW I used Scanner.findAll() in a little programming exercise I worked on the
other day. It worked perfectly. :-)

s'marks

On 3/30/17 2:19 PM, Stuart Marks wrote:

> Hi Timo, Sherman,
>
> Thanks for looking at this.
>
> Sherman wrote:
>
>> This might practically put the api itself almost useless? it might be an easy
>> task to spot
>> whether or not it's a 0-width-match-possible regex when the regex is simple,
>> but it gets
>> harder and harder, if not impossible when the regex gets complicated,
>> especially consider
>> the possible use scenario that the use site is embedded deeply inside a
>> library implementation.
>
> Well, not "useless", but perhaps less useful than one might like. :-)
>
> I think this is potentially surprising behavior, which is why I at least wanted
> to add the note. It's not clear to me whether we should try to fix this by
> changing Scanner though.
>
> Essentially, findAll() is defined in terms of findWithinHorizon(pattern, 0). So
> if one were to write a loop like so:
>
>     String str;
>     while ((str = scanner.findWithinHorizon(pattern, 0)) != null) {
>         System.out.println(str);
>     }
>
> then this loop would have the same problem if pattern were to match zero
> characters.
>
>> The alternative is to "fix" it, maybe as what Matcher.find() does, if the
>> previous match is
>> zero-width-match (the fist==last), we step one to the next cursor before next
>> try. I know
>
> Interesting, I didn't know Matcher.find() advances the cursor like this. But
> Scanner.findWithinHorizon() apparently doesn't, so that's why an infinite loop
> can occur.
>
>> Scanner.findPatternInBuffer() is setting new region set every time it is
>> invoked which makes
>> it complicated, but I would assume it might be still worth a trying? for
>> example, utilize the
>> "hasNextResult"/matcher.end(). I'm not sure without looking into the code, does
>>
>> while (hasNext(pattern)) {
>>     next(pattern);
>> }
>>
>> have the same issue, when pattern matches 0-width?
>
> No, this doesn't have the problem, because hasNext(pat) and next(pat) match
> delimited tokens. Each call to next() implicitly advances past the next
> delimiter to reach the subsequent token, if any.
>
>
> On 3/30/17 8:56 AM, Timo Kinnunen wrote:
>> I guess this somewhat contrived example also wouldn’t work?
>>
>>         String s = "\\b\\w+|\\G|\\B";
>>         String t = "Matcher m = Pattern.compile(s).matcher(t);\n";
>>         Matcher m = Pattern.compile(s).matcher(t);
>>         while(m.find()) {
>>             System.out.println("'" + m.group() + "'");
>>         }
>
> Right, so if you rewrote this loop to use Scanner.findWithinHorizon() instead of
> Matcher,
>
>     Scanner sc = new Scanner(t);
>     String str;
>     while ((str = sc.findWithinHorizon(s, 0)) != null) {
>         System.out.println("'" + str + "'");
>     }
>
> you'd get an infinite loop with str being continually assigned the empty string.
> As Sherman mentioned, the Matcher.find() will advance the cursor if it gets a
> zero-width match, avoiding this problem.
>
> * * *
>
> This didn't come up in the code review thread, which was mostly about concurrent
> modification and late-binding of the spliterator:
>
>   http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-September/035034.html
>
> I remember noting this phenomenon a while back, which is why I had filed the bug
> to add a note. I seem to remember discussing it, though, but it might have been
> in a meeting or in a hallway conversation.
>
> This bug (JDK-8150488) does note that an infinite stream might be unexpected or
> surprising, but it's not a fatal problem. It can be terminated with limit(). It
> can also be terminated with takeWhile(), also added in JDK 9. Maybe I could
> mention these in the API note.
>
> I guess we could also consider changing the implicit findWithinHorizon() loop
> that findAll() does, perhaps by having it terminate on a zero-width match. Or we
> could even change findWithinHorizon's behavior if it gets a zero-width match,
> siilar to what Matcher.find() does. But I'm quite reluctant to start making such
> changes at this point.
>
> s'marks
>
>
>
>>         // Outputs:
>>         // 'Matcher'
>>         // ''
>>         // 'm'
>>         // ''
>>         // ''
>>         // ''
>>         // 'Pattern'
>>         // ''
>>         // 'compile'
>>         // ''
>>         // 's'
>>         // ''
>>         // ''
>>         // 'matcher'
>>         // ''
>>         // 't'
>>         // ''
>>         // ''
>>         // ''
>>         // ''
>>
>>
>>
>> Sent from Mail for Windows 10
>>
>> From: Xueming Shen
>> Sent: Thursday, March 30, 2017 05:41
>> To: [hidden email]
>> Subject: Re: JDK 9 RFR(s): 8150488: add note to Scanner.findAll()
>> regardingpossible infinite streams
>>
>> On 3/29/17, 5:56 PM, Stuart Marks wrote:
>>> Hi all,
>>>
>>> Please review these non-normative textual additions to the
>>> Scanner.findAll() method docs. These methods were added earlier in JDK
>>> 9; there's a small pitfall if the regex can match zero characters.
>>>
>> Stuart,
>>
>> This might practically put the api itself almost useless? it might be an
>> easy task to spot
>> whether or not it's a 0-width-match-possible regex when the regex is
>> simple, but it gets
>> harder and harder, if not impossible when the regex gets complicated,
>> especially consider
>> the possible use scenario that the use site is embedded deeply inside a
>> library implementation.
>>
>> The alternative is to "fix" it, maybe as what Matcher.find() does, if
>> the previous match is
>> zero-width-match (the fist==last), we step one to the next cursor before
>> next try. I know
>> Scanner.findPatternInBuffer() is setting new region set every time it is
>> invoked which makes
>> it complicated, but I would assume it might be still worth a trying? for
>> example, utilize the
>> "hasNextResult"/matcher.end(). I'm not sure without looking into the
>> code, does
>>
>> while (hasNext(pattern)) {
>>      next(pattern);
>> }
>>
>> have the same issue, when pattern matches 0-width?
>>
>> Thanks!
>> -Sherman
>>
>>
>>
>>
>>> Thanks,
>>>
>>> s'marks
>>>
>>>
>>> # HG changeset patch
>>> # User smarks
>>> # Date 1490749958 25200
>>> #      Tue Mar 28 18:12:38 2017 -0700
>>> # Node ID 6b43c4698752779793d58813f46d3687c17dde75
>>> # Parent  fb54b256d751ae3191e9cef42ff9f5630931f047
>>> 8150488: add note to Scanner.findAll() regarding possible infinite
>>> streams
>>> Reviewed-by: XXX
>>>
>>> diff -r fb54b256d751 -r 6b43c4698752
>>> src/java.base/share/classes/java/util/Scanner.java
>>> --- a/src/java.base/share/classes/java/util/Scanner.java    Mon Mar 27
>>> 15:12:01 2017 -0700
>>> +++ b/src/java.base/share/classes/java/util/Scanner.java    Tue Mar 28
>>> 18:12:38 2017 -0700
>>> @@ -2808,6 +2808,10 @@
>>>       * }
>>>       * }</pre>
>>>       *
>>> +     * <p>The pattern must always match at least one character. If
>>> the pattern
>>> +     * can match zero characters, the result will be an infinite stream
>>> +     * of empty matches.
>>> +     *
>>>       * @param pattern the pattern to be matched
>>>       * @return a sequential stream of match results
>>>       * @throws NullPointerException if pattern is null
>>> @@ -2829,6 +2833,11 @@
>>>       *     scanner.findAll(Pattern.compile(patString))
>>>       * }</pre>
>>>       *
>>> +     * @apiNote
>>> +     * The pattern must always match at least one character. If the
>>> pattern
>>> +     * can match zero characters, the result will be an infinite stream
>>> +     * of empty matches.
>>> +     *
>>>       * @param patString the pattern string
>>>       * @return a sequential stream of match results
>>>       * @throws NullPointerException if patString is null
>>
>>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: JDK 9 RFR(s): 8150488: add note to Scanner.findAll() regardingpossible infinite streams

Xueming Shen
On 4/4/17, 12:52 PM, Stuart Marks wrote:
> Anything further on this?
>
> I'd at least like to add the API note I proposed in order to document
> this issue. I'm reluctant to start tinkering with the behavior of this
> method at this late stage in the release.
>
> BTW I used Scanner.findAll() in a little programming exercise I worked
> on the other day. It worked perfectly. :-)

H Stuart,

The word "useless" was definitely exaggerate :-) Really meant to say the
note might make the api less
useful/popular.

Personally I think the use scenario and the expected resulting behavior
of Stream<MR>finaAll(ptn)
should be more equivalent/similar to the use case  of while
(s.hasNext(p)) {   s.next(p); }, or while
(m.find()) { }, therefor it is probably desired it can be used without
worrying the possibility of getting
into an infinite loop. That said, I agree it might be hard to argue to
fix it in jdk9. The question is if
we are going to fix it in 9u sometime in the further, is it still worth
putting down the note in the
API now. I'm fine if you believe a note will help at least make the
issue a known/documented usage
limitation, for JDK9.

Yes, I think we did chat about this one at the hallway some time when
either you or something
ran into the loop by using the method ...

Thanks,
Sherman


>
> s'marks
>
> On 3/30/17 2:19 PM, Stuart Marks wrote:
>> Hi Timo, Sherman,
>>
>> Thanks for looking at this.
>>
>> Sherman wrote:
>>
>>> This might practically put the api itself almost useless? it might
>>> be an easy
>>> task to spot
>>> whether or not it's a 0-width-match-possible regex when the regex is
>>> simple,
>>> but it gets
>>> harder and harder, if not impossible when the regex gets complicated,
>>> especially consider
>>> the possible use scenario that the use site is embedded deeply inside a
>>> library implementation.
>>
>> Well, not "useless", but perhaps less useful than one might like. :-)
>>
>> I think this is potentially surprising behavior, which is why I at
>> least wanted
>> to add the note. It's not clear to me whether we should try to fix
>> this by
>> changing Scanner though.
>>
>> Essentially, findAll() is defined in terms of
>> findWithinHorizon(pattern, 0). So
>> if one were to write a loop like so:
>>
>>     String str;
>>     while ((str = scanner.findWithinHorizon(pattern, 0)) != null) {
>>         System.out.println(str);
>>     }
>>
>> then this loop would have the same problem if pattern were to match zero
>> characters.
>>
>>> The alternative is to "fix" it, maybe as what Matcher.find() does,
>>> if the
>>> previous match is
>>> zero-width-match (the fist==last), we step one to the next cursor
>>> before next
>>> try. I know
>>
>> Interesting, I didn't know Matcher.find() advances the cursor like
>> this. But
>> Scanner.findWithinHorizon() apparently doesn't, so that's why an
>> infinite loop
>> can occur.
>>
>>> Scanner.findPatternInBuffer() is setting new region set every time
>>> it is
>>> invoked which makes
>>> it complicated, but I would assume it might be still worth a trying?
>>> for
>>> example, utilize the
>>> "hasNextResult"/matcher.end(). I'm not sure without looking into the
>>> code, does
>>>
>>> while (hasNext(pattern)) {
>>>     next(pattern);
>>> }
>>>
>>> have the same issue, when pattern matches 0-width?
>>
>> No, this doesn't have the problem, because hasNext(pat) and next(pat)
>> match
>> delimited tokens. Each call to next() implicitly advances past the next
>> delimiter to reach the subsequent token, if any.
>>
>>
>> On 3/30/17 8:56 AM, Timo Kinnunen wrote:
>>> I guess this somewhat contrived example also wouldn’t work?
>>>
>>>         String s = "\\b\\w+|\\G|\\B";
>>>         String t = "Matcher m = Pattern.compile(s).matcher(t);\n";
>>>         Matcher m = Pattern.compile(s).matcher(t);
>>>         while(m.find()) {
>>>             System.out.println("'" + m.group() + "'");
>>>         }
>>
>> Right, so if you rewrote this loop to use Scanner.findWithinHorizon()
>> instead of
>> Matcher,
>>
>>     Scanner sc = new Scanner(t);
>>     String str;
>>     while ((str = sc.findWithinHorizon(s, 0)) != null) {
>>         System.out.println("'" + str + "'");
>>     }
>>
>> you'd get an infinite loop with str being continually assigned the
>> empty string.
>> As Sherman mentioned, the Matcher.find() will advance the cursor if
>> it gets a
>> zero-width match, avoiding this problem.
>>
>> * * *
>>
>> This didn't come up in the code review thread, which was mostly about
>> concurrent
>> modification and late-binding of the spliterator:
>>
>>  
>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-September/035034.html
>>
>> I remember noting this phenomenon a while back, which is why I had
>> filed the bug
>> to add a note. I seem to remember discussing it, though, but it might
>> have been
>> in a meeting or in a hallway conversation.
>>
>> This bug (JDK-8150488) does note that an infinite stream might be
>> unexpected or
>> surprising, but it's not a fatal problem. It can be terminated with
>> limit(). It
>> can also be terminated with takeWhile(), also added in JDK 9. Maybe I
>> could
>> mention these in the API note.
>>
>> I guess we could also consider changing the implicit
>> findWithinHorizon() loop
>> that findAll() does, perhaps by having it terminate on a zero-width
>> match. Or we
>> could even change findWithinHorizon's behavior if it gets a
>> zero-width match,
>> siilar to what Matcher.find() does. But I'm quite reluctant to start
>> making such
>> changes at this point.
>>
>> s'marks
>>
>>
>>
>>>         // Outputs:
>>>         // 'Matcher'
>>>         // ''
>>>         // 'm'
>>>         // ''
>>>         // ''
>>>         // ''
>>>         // 'Pattern'
>>>         // ''
>>>         // 'compile'
>>>         // ''
>>>         // 's'
>>>         // ''
>>>         // ''
>>>         // 'matcher'
>>>         // ''
>>>         // 't'
>>>         // ''
>>>         // ''
>>>         // ''
>>>         // ''
>>>
>>>
>>>
>>> Sent from Mail for Windows 10
>>>
>>> From: Xueming Shen
>>> Sent: Thursday, March 30, 2017 05:41
>>> To: [hidden email]
>>> Subject: Re: JDK 9 RFR(s): 8150488: add note to Scanner.findAll()
>>> regardingpossible infinite streams
>>>
>>> On 3/29/17, 5:56 PM, Stuart Marks wrote:
>>>> Hi all,
>>>>
>>>> Please review these non-normative textual additions to the
>>>> Scanner.findAll() method docs. These methods were added earlier in JDK
>>>> 9; there's a small pitfall if the regex can match zero characters.
>>>>
>>> Stuart,
>>>
>>> This might practically put the api itself almost useless? it might
>>> be an
>>> easy task to spot
>>> whether or not it's a 0-width-match-possible regex when the regex is
>>> simple, but it gets
>>> harder and harder, if not impossible when the regex gets complicated,
>>> especially consider
>>> the possible use scenario that the use site is embedded deeply inside a
>>> library implementation.
>>>
>>> The alternative is to "fix" it, maybe as what Matcher.find() does, if
>>> the previous match is
>>> zero-width-match (the fist==last), we step one to the next cursor
>>> before
>>> next try. I know
>>> Scanner.findPatternInBuffer() is setting new region set every time
>>> it is
>>> invoked which makes
>>> it complicated, but I would assume it might be still worth a trying?
>>> for
>>> example, utilize the
>>> "hasNextResult"/matcher.end(). I'm not sure without looking into the
>>> code, does
>>>
>>> while (hasNext(pattern)) {
>>>      next(pattern);
>>> }
>>>
>>> have the same issue, when pattern matches 0-width?
>>>
>>> Thanks!
>>> -Sherman
>>>
>>>
>>>
>>>
>>>> Thanks,
>>>>
>>>> s'marks
>>>>
>>>>
>>>> # HG changeset patch
>>>> # User smarks
>>>> # Date 1490749958 25200
>>>> #      Tue Mar 28 18:12:38 2017 -0700
>>>> # Node ID 6b43c4698752779793d58813f46d3687c17dde75
>>>> # Parent  fb54b256d751ae3191e9cef42ff9f5630931f047
>>>> 8150488: add note to Scanner.findAll() regarding possible infinite
>>>> streams
>>>> Reviewed-by: XXX
>>>>
>>>> diff -r fb54b256d751 -r 6b43c4698752
>>>> src/java.base/share/classes/java/util/Scanner.java
>>>> --- a/src/java.base/share/classes/java/util/Scanner.java    Mon Mar 27
>>>> 15:12:01 2017 -0700
>>>> +++ b/src/java.base/share/classes/java/util/Scanner.java    Tue Mar 28
>>>> 18:12:38 2017 -0700
>>>> @@ -2808,6 +2808,10 @@
>>>>       * }
>>>>       * }</pre>
>>>>       *
>>>> +     * <p>The pattern must always match at least one character. If
>>>> the pattern
>>>> +     * can match zero characters, the result will be an infinite
>>>> stream
>>>> +     * of empty matches.
>>>> +     *
>>>>       * @param pattern the pattern to be matched
>>>>       * @return a sequential stream of match results
>>>>       * @throws NullPointerException if pattern is null
>>>> @@ -2829,6 +2833,11 @@
>>>>       *     scanner.findAll(Pattern.compile(patString))
>>>>       * }</pre>
>>>>       *
>>>> +     * @apiNote
>>>> +     * The pattern must always match at least one character. If the
>>>> pattern
>>>> +     * can match zero characters, the result will be an infinite
>>>> stream
>>>> +     * of empty matches.
>>>> +     *
>>>>       * @param patString the pattern string
>>>>       * @return a sequential stream of match results
>>>>       * @throws NullPointerException if patString is null
>>>
>>>

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RE: JDK 9 RFR(s): 8150488: add note to Scanner.findAll()regardingpossible infinite streams

Timo Kinnunen
Hi,

I agree, the behavior should change to work like Matcher.find() which is probably the better known API. E.g. http://www.regular-expressions.info/continue.html describes Java’s Matcher’s behavior on empty matches but doesn’t mention any other APIs beside that.

If we say a regex engine’s task is to find non-overlapping matches in a given input then it repeatedly returning the same empty match at the same location could even be considered a bug.



Sent from Mail for Windows 10

From: Xueming Shen
Sent: Wednesday, April 5, 2017 01:11
To: Stuart Marks
Cc: Timo Kinnunen; [hidden email]
Subject: Re: JDK 9 RFR(s): 8150488: add note to Scanner.findAll()regardingpossible infinite streams

On 4/4/17, 12:52 PM, Stuart Marks wrote:
> Anything further on this?
>
> I'd at least like to add the API note I proposed in order to document
> this issue. I'm reluctant to start tinkering with the behavior of this
> method at this late stage in the release.
>
> BTW I used Scanner.findAll() in a little programming exercise I worked
> on the other day. It worked perfectly. :-)

H Stuart,

The word "useless" was definitely exaggerate :-) Really meant to say the
note might make the api less
useful/popular.

Personally I think the use scenario and the expected resulting behavior
of Stream<MR>finaAll(ptn)
should be more equivalent/similar to the use case  of while
(s.hasNext(p)) {   s.next(p); }, or while
(m.find()) { }, therefor it is probably desired it can be used without
worrying the possibility of getting
into an infinite loop. That said, I agree it might be hard to argue to
fix it in jdk9. The question is if
we are going to fix it in 9u sometime in the further, is it still worth
putting down the note in the
API now. I'm fine if you believe a note will help at least make the
issue a known/documented usage
limitation, for JDK9.

Yes, I think we did chat about this one at the hallway some time when
either you or something
ran into the loop by using the method ...

Thanks,
Sherman


>
> s'marks
>
> On 3/30/17 2:19 PM, Stuart Marks wrote:
>> Hi Timo, Sherman,
>>
>> Thanks for looking at this.
>>
>> Sherman wrote:
>>
>>> This might practically put the api itself almost useless? it might
>>> be an easy
>>> task to spot
>>> whether or not it's a 0-width-match-possible regex when the regex is
>>> simple,
>>> but it gets
>>> harder and harder, if not impossible when the regex gets complicated,
>>> especially consider
>>> the possible use scenario that the use site is embedded deeply inside a
>>> library implementation.
>>
>> Well, not "useless", but perhaps less useful than one might like. :-)
>>
>> I think this is potentially surprising behavior, which is why I at
>> least wanted
>> to add the note. It's not clear to me whether we should try to fix
>> this by
>> changing Scanner though.
>>
>> Essentially, findAll() is defined in terms of
>> findWithinHorizon(pattern, 0). So
>> if one were to write a loop like so:
>>
>>     String str;
>>     while ((str = scanner.findWithinHorizon(pattern, 0)) != null) {
>>         System.out.println(str);
>>     }
>>
>> then this loop would have the same problem if pattern were to match zero
>> characters.
>>
>>> The alternative is to "fix" it, maybe as what Matcher.find() does,
>>> if the
>>> previous match is
>>> zero-width-match (the fist==last), we step one to the next cursor
>>> before next
>>> try. I know
>>
>> Interesting, I didn't know Matcher.find() advances the cursor like
>> this. But
>> Scanner.findWithinHorizon() apparently doesn't, so that's why an
>> infinite loop
>> can occur.
>>
>>> Scanner.findPatternInBuffer() is setting new region set every time
>>> it is
>>> invoked which makes
>>> it complicated, but I would assume it might be still worth a trying?
>>> for
>>> example, utilize the
>>> "hasNextResult"/matcher.end(). I'm not sure without looking into the
>>> code, does
>>>
>>> while (hasNext(pattern)) {
>>>     next(pattern);
>>> }
>>>
>>> have the same issue, when pattern matches 0-width?
>>
>> No, this doesn't have the problem, because hasNext(pat) and next(pat)
>> match
>> delimited tokens. Each call to next() implicitly advances past the next
>> delimiter to reach the subsequent token, if any.
>>
>>
>> On 3/30/17 8:56 AM, Timo Kinnunen wrote:
>>> I guess this somewhat contrived example also wouldn’t work?
>>>
>>>         String s = "\\b\\w+|\\G|\\B";
>>>         String t = "Matcher m = Pattern.compile(s).matcher(t);\n";
>>>         Matcher m = Pattern.compile(s).matcher(t);
>>>         while(m.find()) {
>>>             System.out.println("'" + m.group() + "'");
>>>         }
>>
>> Right, so if you rewrote this loop to use Scanner.findWithinHorizon()
>> instead of
>> Matcher,
>>
>>     Scanner sc = new Scanner(t);
>>     String str;
>>     while ((str = sc.findWithinHorizon(s, 0)) != null) {
>>         System.out.println("'" + str + "'");
>>     }
>>
>> you'd get an infinite loop with str being continually assigned the
>> empty string.
>> As Sherman mentioned, the Matcher.find() will advance the cursor if
>> it gets a
>> zero-width match, avoiding this problem.
>>
>> * * *
>>
>> This didn't come up in the code review thread, which was mostly about
>> concurrent
>> modification and late-binding of the spliterator:
>>
>>  
>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-September/035034.html
>>
>> I remember noting this phenomenon a while back, which is why I had
>> filed the bug
>> to add a note. I seem to remember discussing it, though, but it might
>> have been
>> in a meeting or in a hallway conversation.
>>
>> This bug (JDK-8150488) does note that an infinite stream might be
>> unexpected or
>> surprising, but it's not a fatal problem. It can be terminated with
>> limit(). It
>> can also be terminated with takeWhile(), also added in JDK 9. Maybe I
>> could
>> mention these in the API note.
>>
>> I guess we could also consider changing the implicit
>> findWithinHorizon() loop
>> that findAll() does, perhaps by having it terminate on a zero-width
>> match. Or we
>> could even change findWithinHorizon's behavior if it gets a
>> zero-width match,
>> siilar to what Matcher.find() does. But I'm quite reluctant to start
>> making such
>> changes at this point.
>>
>> s'marks
>>
>>
>>
>>>         // Outputs:
>>>         // 'Matcher'
>>>         // ''
>>>         // 'm'
>>>         // ''
>>>         // ''
>>>         // ''
>>>         // 'Pattern'
>>>         // ''
>>>         // 'compile'
>>>         // ''
>>>         // 's'
>>>         // ''
>>>         // ''
>>>         // 'matcher'
>>>         // ''
>>>         // 't'
>>>         // ''
>>>         // ''
>>>         // ''
>>>         // ''
>>>
>>>
>>>
>>> Sent from Mail for Windows 10
>>>
>>> From: Xueming Shen
>>> Sent: Thursday, March 30, 2017 05:41
>>> To: [hidden email]
>>> Subject: Re: JDK 9 RFR(s): 8150488: add note to Scanner.findAll()
>>> regardingpossible infinite streams
>>>
>>> On 3/29/17, 5:56 PM, Stuart Marks wrote:
>>>> Hi all,
>>>>
>>>> Please review these non-normative textual additions to the
>>>> Scanner.findAll() method docs. These methods were added earlier in JDK
>>>> 9; there's a small pitfall if the regex can match zero characters.
>>>>
>>> Stuart,
>>>
>>> This might practically put the api itself almost useless? it might
>>> be an
>>> easy task to spot
>>> whether or not it's a 0-width-match-possible regex when the regex is
>>> simple, but it gets
>>> harder and harder, if not impossible when the regex gets complicated,
>>> especially consider
>>> the possible use scenario that the use site is embedded deeply inside a
>>> library implementation.
>>>
>>> The alternative is to "fix" it, maybe as what Matcher.find() does, if
>>> the previous match is
>>> zero-width-match (the fist==last), we step one to the next cursor
>>> before
>>> next try. I know
>>> Scanner.findPatternInBuffer() is setting new region set every time
>>> it is
>>> invoked which makes
>>> it complicated, but I would assume it might be still worth a trying?
>>> for
>>> example, utilize the
>>> "hasNextResult"/matcher.end(). I'm not sure without looking into the
>>> code, does
>>>
>>> while (hasNext(pattern)) {
>>>      next(pattern);
>>> }
>>>
>>> have the same issue, when pattern matches 0-width?
>>>
>>> Thanks!
>>> -Sherman
>>>
>>>
>>>
>>>
>>>> Thanks,
>>>>
>>>> s'marks
>>>>
>>>>
>>>> # HG changeset patch
>>>> # User smarks
>>>> # Date 1490749958 25200
>>>> #      Tue Mar 28 18:12:38 2017 -0700
>>>> # Node ID 6b43c4698752779793d58813f46d3687c17dde75
>>>> # Parent  fb54b256d751ae3191e9cef42ff9f5630931f047
>>>> 8150488: add note to Scanner.findAll() regarding possible infinite
>>>> streams
>>>> Reviewed-by: XXX
>>>>
>>>> diff -r fb54b256d751 -r 6b43c4698752
>>>> src/java.base/share/classes/java/util/Scanner.java
>>>> --- a/src/java.base/share/classes/java/util/Scanner.java    Mon Mar 27
>>>> 15:12:01 2017 -0700
>>>> +++ b/src/java.base/share/classes/java/util/Scanner.java    Tue Mar 28
>>>> 18:12:38 2017 -0700
>>>> @@ -2808,6 +2808,10 @@
>>>>       * }
>>>>       * }</pre>
>>>>       *
>>>> +     * <p>The pattern must always match at least one character. If
>>>> the pattern
>>>> +     * can match zero characters, the result will be an infinite
>>>> stream
>>>> +     * of empty matches.
>>>> +     *
>>>>       * @param pattern the pattern to be matched
>>>>       * @return a sequential stream of match results
>>>>       * @throws NullPointerException if pattern is null
>>>> @@ -2829,6 +2833,11 @@
>>>>       *     scanner.findAll(Pattern.compile(patString))
>>>>       * }</pre>
>>>>       *
>>>> +     * @apiNote
>>>> +     * The pattern must always match at least one character. If the
>>>> pattern
>>>> +     * can match zero characters, the result will be an infinite
>>>> stream
>>>> +     * of empty matches.
>>>> +     *
>>>>       * @param patString the pattern string
>>>>       * @return a sequential stream of match results
>>>>       * @throws NullPointerException if patString is null
>>>
>>>


Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: JDK 9 RFR(s): 8150488: add note to Scanner.findAll() regardingpossible infinite streams

Stuart Marks
In reply to this post by Xueming Shen


On 4/4/17 4:10 PM, Xueming Shen wrote:

> On 4/4/17, 12:52 PM, Stuart Marks wrote:
>> I'd at least like to add the API note I proposed in order to document this
>> issue. I'm reluctant to start tinkering with the behavior of this method at
>> this late stage in the release.
>>
>> BTW I used Scanner.findAll() in a little programming exercise I worked on the
>> other day. It worked perfectly. :-)
>
> The word "useless" was definitely exaggerate :-) Really meant to say the
> note might make the api less useful/popular.


> Personally I think the use scenario and the expected resulting behavior of
> Stream<MR>finaAll(ptn) should be more equivalent/similar to the use case  of
> while (s.hasNext(p)) { s.next(p); }, or while (m.find()) { }, therefor it is
> probably desired it can be used without worrying the possibility of getting
> into an infinite loop.

The Scanner.tokens() API, also added in Java 9, provides a stream of delimited
tokens (strings) that's more-or-less equivalent of a hasNext()/next() loop.

Just as findWithinHorizon() is a distinct use case from hasNext()/next()
delimited tokens, findAll() provides for distinct uses from tokens().

> That said, I agree it might be hard to argue to fix it in jdk9. The question
> is if we are going to fix it in 9u sometime in the further, is it still worth
> putting down the note in the API now. I'm fine if you believe a note will
> help at least make the issue a known/documented usage limitation, for JDK9.
>
> Yes, I think we did chat about this one at the hallway some time when either
> you or something ran into the loop by using the method ...

There are already a couple ways you can get unexpected behavior with Scanner if
you're not careful. For example, see the warnings in findWithinHorizon() and
skip() about the possibility of buffering up the entire input. These methods are
useful, but if you're not careful with the regex you provide, things can go
wrong. The findAll() method seems to be in the same situation. It's not clear to
me that any of these are bugs for which there's a reasonable fix.

So yes, I think the note is necessary. I could also add something about
terminating such streams with limit() or takeWhile().

s'marks

>
> Thanks,
> Sherman
>
>
>>
>> s'marks
>>
>> On 3/30/17 2:19 PM, Stuart Marks wrote:
>>> Hi Timo, Sherman,
>>>
>>> Thanks for looking at this.
>>>
>>> Sherman wrote:
>>>
>>>> This might practically put the api itself almost useless? it might be an easy
>>>> task to spot
>>>> whether or not it's a 0-width-match-possible regex when the regex is simple,
>>>> but it gets
>>>> harder and harder, if not impossible when the regex gets complicated,
>>>> especially consider
>>>> the possible use scenario that the use site is embedded deeply inside a
>>>> library implementation.
>>>
>>> Well, not "useless", but perhaps less useful than one might like. :-)
>>>
>>> I think this is potentially surprising behavior, which is why I at least wanted
>>> to add the note. It's not clear to me whether we should try to fix this by
>>> changing Scanner though.
>>>
>>> Essentially, findAll() is defined in terms of findWithinHorizon(pattern, 0). So
>>> if one were to write a loop like so:
>>>
>>>     String str;
>>>     while ((str = scanner.findWithinHorizon(pattern, 0)) != null) {
>>>         System.out.println(str);
>>>     }
>>>
>>> then this loop would have the same problem if pattern were to match zero
>>> characters.
>>>
>>>> The alternative is to "fix" it, maybe as what Matcher.find() does, if the
>>>> previous match is
>>>> zero-width-match (the fist==last), we step one to the next cursor before next
>>>> try. I know
>>>
>>> Interesting, I didn't know Matcher.find() advances the cursor like this. But
>>> Scanner.findWithinHorizon() apparently doesn't, so that's why an infinite loop
>>> can occur.
>>>
>>>> Scanner.findPatternInBuffer() is setting new region set every time it is
>>>> invoked which makes
>>>> it complicated, but I would assume it might be still worth a trying? for
>>>> example, utilize the
>>>> "hasNextResult"/matcher.end(). I'm not sure without looking into the code, does
>>>>
>>>> while (hasNext(pattern)) {
>>>>     next(pattern);
>>>> }
>>>>
>>>> have the same issue, when pattern matches 0-width?
>>>
>>> No, this doesn't have the problem, because hasNext(pat) and next(pat) match
>>> delimited tokens. Each call to next() implicitly advances past the next
>>> delimiter to reach the subsequent token, if any.
>>>
>>>
>>> On 3/30/17 8:56 AM, Timo Kinnunen wrote:
>>>> I guess this somewhat contrived example also wouldn’t work?
>>>>
>>>>         String s = "\\b\\w+|\\G|\\B";
>>>>         String t = "Matcher m = Pattern.compile(s).matcher(t);\n";
>>>>         Matcher m = Pattern.compile(s).matcher(t);
>>>>         while(m.find()) {
>>>>             System.out.println("'" + m.group() + "'");
>>>>         }
>>>
>>> Right, so if you rewrote this loop to use Scanner.findWithinHorizon() instead of
>>> Matcher,
>>>
>>>     Scanner sc = new Scanner(t);
>>>     String str;
>>>     while ((str = sc.findWithinHorizon(s, 0)) != null) {
>>>         System.out.println("'" + str + "'");
>>>     }
>>>
>>> you'd get an infinite loop with str being continually assigned the empty string.
>>> As Sherman mentioned, the Matcher.find() will advance the cursor if it gets a
>>> zero-width match, avoiding this problem.
>>>
>>> * * *
>>>
>>> This didn't come up in the code review thread, which was mostly about concurrent
>>> modification and late-binding of the spliterator:
>>>
>>>
>>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-September/035034.html
>>>
>>> I remember noting this phenomenon a while back, which is why I had filed the bug
>>> to add a note. I seem to remember discussing it, though, but it might have been
>>> in a meeting or in a hallway conversation.
>>>
>>> This bug (JDK-8150488) does note that an infinite stream might be unexpected or
>>> surprising, but it's not a fatal problem. It can be terminated with limit(). It
>>> can also be terminated with takeWhile(), also added in JDK 9. Maybe I could
>>> mention these in the API note.
>>>
>>> I guess we could also consider changing the implicit findWithinHorizon() loop
>>> that findAll() does, perhaps by having it terminate on a zero-width match. Or we
>>> could even change findWithinHorizon's behavior if it gets a zero-width match,
>>> siilar to what Matcher.find() does. But I'm quite reluctant to start making such
>>> changes at this point.
>>>
>>> s'marks
>>>
>>>
>>>
>>>>         // Outputs:
>>>>         // 'Matcher'
>>>>         // ''
>>>>         // 'm'
>>>>         // ''
>>>>         // ''
>>>>         // ''
>>>>         // 'Pattern'
>>>>         // ''
>>>>         // 'compile'
>>>>         // ''
>>>>         // 's'
>>>>         // ''
>>>>         // ''
>>>>         // 'matcher'
>>>>         // ''
>>>>         // 't'
>>>>         // ''
>>>>         // ''
>>>>         // ''
>>>>         // ''
>>>>
>>>>
>>>>
>>>> Sent from Mail for Windows 10
>>>>
>>>> From: Xueming Shen
>>>> Sent: Thursday, March 30, 2017 05:41
>>>> To: [hidden email]
>>>> Subject: Re: JDK 9 RFR(s): 8150488: add note to Scanner.findAll()
>>>> regardingpossible infinite streams
>>>>
>>>> On 3/29/17, 5:56 PM, Stuart Marks wrote:
>>>>> Hi all,
>>>>>
>>>>> Please review these non-normative textual additions to the
>>>>> Scanner.findAll() method docs. These methods were added earlier in JDK
>>>>> 9; there's a small pitfall if the regex can match zero characters.
>>>>>
>>>> Stuart,
>>>>
>>>> This might practically put the api itself almost useless? it might be an
>>>> easy task to spot
>>>> whether or not it's a 0-width-match-possible regex when the regex is
>>>> simple, but it gets
>>>> harder and harder, if not impossible when the regex gets complicated,
>>>> especially consider
>>>> the possible use scenario that the use site is embedded deeply inside a
>>>> library implementation.
>>>>
>>>> The alternative is to "fix" it, maybe as what Matcher.find() does, if
>>>> the previous match is
>>>> zero-width-match (the fist==last), we step one to the next cursor before
>>>> next try. I know
>>>> Scanner.findPatternInBuffer() is setting new region set every time it is
>>>> invoked which makes
>>>> it complicated, but I would assume it might be still worth a trying? for
>>>> example, utilize the
>>>> "hasNextResult"/matcher.end(). I'm not sure without looking into the
>>>> code, does
>>>>
>>>> while (hasNext(pattern)) {
>>>>      next(pattern);
>>>> }
>>>>
>>>> have the same issue, when pattern matches 0-width?
>>>>
>>>> Thanks!
>>>> -Sherman
>>>>
>>>>
>>>>
>>>>
>>>>> Thanks,
>>>>>
>>>>> s'marks
>>>>>
>>>>>
>>>>> # HG changeset patch
>>>>> # User smarks
>>>>> # Date 1490749958 25200
>>>>> #      Tue Mar 28 18:12:38 2017 -0700
>>>>> # Node ID 6b43c4698752779793d58813f46d3687c17dde75
>>>>> # Parent  fb54b256d751ae3191e9cef42ff9f5630931f047
>>>>> 8150488: add note to Scanner.findAll() regarding possible infinite
>>>>> streams
>>>>> Reviewed-by: XXX
>>>>>
>>>>> diff -r fb54b256d751 -r 6b43c4698752
>>>>> src/java.base/share/classes/java/util/Scanner.java
>>>>> --- a/src/java.base/share/classes/java/util/Scanner.java    Mon Mar 27
>>>>> 15:12:01 2017 -0700
>>>>> +++ b/src/java.base/share/classes/java/util/Scanner.java    Tue Mar 28
>>>>> 18:12:38 2017 -0700
>>>>> @@ -2808,6 +2808,10 @@
>>>>>       * }
>>>>>       * }</pre>
>>>>>       *
>>>>> +     * <p>The pattern must always match at least one character. If
>>>>> the pattern
>>>>> +     * can match zero characters, the result will be an infinite stream
>>>>> +     * of empty matches.
>>>>> +     *
>>>>>       * @param pattern the pattern to be matched
>>>>>       * @return a sequential stream of match results
>>>>>       * @throws NullPointerException if pattern is null
>>>>> @@ -2829,6 +2833,11 @@
>>>>>       *     scanner.findAll(Pattern.compile(patString))
>>>>>       * }</pre>
>>>>>       *
>>>>> +     * @apiNote
>>>>> +     * The pattern must always match at least one character. If the
>>>>> pattern
>>>>> +     * can match zero characters, the result will be an infinite stream
>>>>> +     * of empty matches.
>>>>> +     *
>>>>>       * @param patString the pattern string
>>>>>       * @return a sequential stream of match results
>>>>>       * @throws NullPointerException if patString is null
>>>>
>>>>
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: JDK 9 RFR(s): 8150488: add note to Scanner.findAll()regardingpossible infinite streams

Stuart Marks
In reply to this post by Timo Kinnunen


On 4/4/17 5:33 PM, Timo Kinnunen wrote:
>
> Hi,
>
> I agree, the behavior should change to work like Matcher.find() which is
> probably the better known API. E.g.
> http://www.regular-expressions.info/continue.html describes Java’s Matcher’s
> behavior on empty matches but doesn’t mention any other APIs beside that.
>
I don't think Sherman was suggesting that.

In any case this would be a change to findWithinHorizon(), which isn't in scope.

s'marks

>
> If we say a regex engine’s task is to find non-overlapping matches in a given
> input then it repeatedly returning the same empty match at the same location
> could even be considered a bug.
>
> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
>
> *From: *Xueming Shen <mailto:[hidden email]>
> *Sent: *Wednesday, April 5, 2017 01:11
> *To: *Stuart Marks <mailto:[hidden email]>
> *Cc: *Timo Kinnunen <mailto:[hidden email]>;
> [hidden email] <mailto:[hidden email]>
> *Subject: *Re: JDK 9 RFR(s): 8150488: add note to
> Scanner.findAll()regardingpossible infinite streams
>
> On 4/4/17, 12:52 PM, Stuart Marks wrote:
>
> > Anything further on this?
>
> >
>
> > I'd at least like to add the API note I proposed in order to document
>
> > this issue. I'm reluctant to start tinkering with the behavior of this
>
> > method at this late stage in the release.
>
> >
>
> > BTW I used Scanner.findAll() in a little programming exercise I worked
>
> > on the other day. It worked perfectly. :-)
>
> H Stuart,
>
> The word "useless" was definitely exaggerate :-) Really meant to say the
>
> note might make the api less
>
> useful/popular.
>
> Personally I think the use scenario and the expected resulting behavior
>
> of Stream<MR>finaAll(ptn)
>
> should be more equivalent/similar to the use case  of while
>
> (s.hasNext(p)) {   s.next(p); }, or while
>
> (m.find()) { }, therefor it is probably desired it can be used without
>
> worrying the possibility of getting
>
> into an infinite loop. That said, I agree it might be hard to argue to
>
> fix it in jdk9. The question is if
>
> we are going to fix it in 9u sometime in the further, is it still worth
>
> putting down the note in the
>
> API now. I'm fine if you believe a note will help at least make the
>
> issue a known/documented usage
>
> limitation, for JDK9.
>
> Yes, I think we did chat about this one at the hallway some time when
>
> either you or something
>
> ran into the loop by using the method ...
>
> Thanks,
>
> Sherman
>
> >
>
> > s'marks
>
> >
>
> > On 3/30/17 2:19 PM, Stuart Marks wrote:
>
> >> Hi Timo, Sherman,
>
> >>
>
> >> Thanks for looking at this.
>
> >>
>
> >> Sherman wrote:
>
> >>
>
> >>> This might practically put the api itself almost useless? it might
>
> >>> be an easy
>
> >>> task to spot
>
> >>> whether or not it's a 0-width-match-possible regex when the regex is
>
> >>> simple,
>
> >>> but it gets
>
> >>> harder and harder, if not impossible when the regex gets complicated,
>
> >>> especially consider
>
> >>> the possible use scenario that the use site is embedded deeply inside a
>
> >>> library implementation.
>
> >>
>
> >> Well, not "useless", but perhaps less useful than one might like. :-)
>
> >>
>
> >> I think this is potentially surprising behavior, which is why I at
>
> >> least wanted
>
> >> to add the note. It's not clear to me whether we should try to fix
>
> >> this by
>
> >> changing Scanner though.
>
> >>
>
> >> Essentially, findAll() is defined in terms of
>
> >> findWithinHorizon(pattern, 0). So
>
> >> if one were to write a loop like so:
>
> >>
>
> >>     String str;
>
> >>     while ((str = scanner.findWithinHorizon(pattern, 0)) != null) {
>
> >>         System.out.println(str);
>
> >>     }
>
> >>
>
> >> then this loop would have the same problem if pattern were to match zero
>
> >> characters.
>
> >>
>
> >>> The alternative is to "fix" it, maybe as what Matcher.find() does,
>
> >>> if the
>
> >>> previous match is
>
> >>> zero-width-match (the fist==last), we step one to the next cursor
>
> >>> before next
>
> >>> try. I know
>
> >>
>
> >> Interesting, I didn't know Matcher.find() advances the cursor like
>
> >> this. But
>
> >> Scanner.findWithinHorizon() apparently doesn't, so that's why an
>
> >> infinite loop
>
> >> can occur.
>
> >>
>
> >>> Scanner.findPatternInBuffer() is setting new region set every time
>
> >>> it is
>
> >>> invoked which makes
>
> >>> it complicated, but I would assume it might be still worth a trying?
>
> >>> for
>
> >>> example, utilize the
>
> >>> "hasNextResult"/matcher.end(). I'm not sure without looking into the
>
> >>> code, does
>
> >>>
>
> >>> while (hasNext(pattern)) {
>
> >>>     next(pattern);
>
> >>> }
>
> >>>
>
> >>> have the same issue, when pattern matches 0-width?
>
> >>
>
> >> No, this doesn't have the problem, because hasNext(pat) and next(pat)
>
> >> match
>
> >> delimited tokens. Each call to next() implicitly advances past the next
>
> >> delimiter to reach the subsequent token, if any.
>
> >>
>
> >>
>
> >> On 3/30/17 8:56 AM, Timo Kinnunen wrote:
>
> >>> I guess this somewhat contrived example also wouldn’t work?
>
> >>>
>
> >>>         String s = "\\b\\w+|\\G|\\B";
>
> >>>         String t = "Matcher m = Pattern.compile(s).matcher(t);\n";
>
> >>>         Matcher m = Pattern.compile(s).matcher(t);
>
> >>>         while(m.find()) {
>
> >>> System.out.println("'" + m.group() + "'");
>
> >>>         }
>
> >>
>
> >> Right, so if you rewrote this loop to use Scanner.findWithinHorizon()
>
> >> instead of
>
> >> Matcher,
>
> >>
>
> >>     Scanner sc = new Scanner(t);
>
> >>     String str;
>
> >>     while ((str = sc.findWithinHorizon(s, 0)) != null) {
>
> >>         System.out.println("'" + str + "'");
>
> >>     }
>
> >>
>
> >> you'd get an infinite loop with str being continually assigned the
>
> >> empty string.
>
> >> As Sherman mentioned, the Matcher.find() will advance the cursor if
>
> >> it gets a
>
> >> zero-width match, avoiding this problem.
>
> >>
>
> >> * * *
>
> >>
>
> >> This didn't come up in the code review thread, which was mostly about
>
> >> concurrent
>
> >> modification and late-binding of the spliterator:
>
> >>
>
> >>
>
> >> http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-September/035034.html
>
> >>
>
> >> I remember noting this phenomenon a while back, which is why I had
>
> >> filed the bug
>
> >> to add a note. I seem to remember discussing it, though, but it might
>
> >> have been
>
> >> in a meeting or in a hallway conversation.
>
> >>
>
> >> This bug (JDK-8150488) does note that an infinite stream might be
>
> >> unexpected or
>
> >> surprising, but it's not a fatal problem. It can be terminated with
>
> >> limit(). It
>
> >> can also be terminated with takeWhile(), also added in JDK 9. Maybe I
>
> >> could
>
> >> mention these in the API note.
>
> >>
>
> >> I guess we could also consider changing the implicit
>
> >> findWithinHorizon() loop
>
> >> that findAll() does, perhaps by having it terminate on a zero-width
>
> >> match. Or we
>
> >> could even change findWithinHorizon's behavior if it gets a
>
> >> zero-width match,
>
> >> siilar to what Matcher.find() does. But I'm quite reluctant to start
>
> >> making such
>
> >> changes at this point.
>
> >>
>
> >> s'marks
>
> >>
>
> >>
>
> >>
>
> >>>         // Outputs:
>
> >>>         // 'Matcher'
>
> >>>         // ''
>
> >>>         // 'm'
>
> >>>         // ''
>
> >>>         // ''
>
> >>>         // ''
>
> >>>         // 'Pattern'
>
> >>>         // ''
>
> >>>         // 'compile'
>
> >>>         // ''
>
> >>>         // 's'
>
> >>>         // ''
>
> >>>         // ''
>
> >>>         // 'matcher'
>
> >>>         // ''
>
> >>>         // 't'
>
> >>>         // ''
>
> >>>         // ''
>
> >>>         // ''
>
> >>>         // ''
>
> >>>
>
> >>>
>
> >>>
>
> >>> Sent from Mail for Windows 10
>
> >>>
>
> >>> From: Xueming Shen
>
> >>> Sent: Thursday, March 30, 2017 05:41
>
> >>> To: [hidden email]
>
> >>> Subject: Re: JDK 9 RFR(s): 8150488: add note to Scanner.findAll()
>
> >>> regardingpossible infinite streams
>
> >>>
>
> >>> On 3/29/17, 5:56 PM, Stuart Marks wrote:
>
> >>>> Hi all,
>
> >>>>
>
> >>>> Please review these non-normative textual additions to the
>
> >>>> Scanner.findAll() method docs. These methods were added earlier in JDK
>
> >>>> 9; there's a small pitfall if the regex can match zero characters.
>
> >>>>
>
> >>> Stuart,
>
> >>>
>
> >>> This might practically put the api itself almost useless? it might
>
> >>> be an
>
> >>> easy task to spot
>
> >>> whether or not it's a 0-width-match-possible regex when the regex is
>
> >>> simple, but it gets
>
> >>> harder and harder, if not impossible when the regex gets complicated,
>
> >>> especially consider
>
> >>> the possible use scenario that the use site is embedded deeply inside a
>
> >>> library implementation.
>
> >>>
>
> >>> The alternative is to "fix" it, maybe as what Matcher.find() does, if
>
> >>> the previous match is
>
> >>> zero-width-match (the fist==last), we step one to the next cursor
>
> >>> before
>
> >>> next try. I know
>
> >>> Scanner.findPatternInBuffer() is setting new region set every time
>
> >>> it is
>
> >>> invoked which makes
>
> >>> it complicated, but I would assume it might be still worth a trying?
>
> >>> for
>
> >>> example, utilize the
>
> >>> "hasNextResult"/matcher.end(). I'm not sure without looking into the
>
> >>> code, does
>
> >>>
>
> >>> while (hasNext(pattern)) {
>
> >>>      next(pattern);
>
> >>> }
>
> >>>
>
> >>> have the same issue, when pattern matches 0-width?
>
> >>>
>
> >>> Thanks!
>
> >>> -Sherman
>
> >>>
>
> >>>
>
> >>>
>
> >>>
>
> >>>> Thanks,
>
> >>>>
>
> >>>> s'marks
>
> >>>>
>
> >>>>
>
> >>>> # HG changeset patch
>
> >>>> # User smarks
>
> >>>> # Date 1490749958 25200
>
> >>>> #      Tue Mar 28 18:12:38 2017 -0700
>
> >>>> # Node ID 6b43c4698752779793d58813f46d3687c17dde75
>
> >>>> # Parent fb54b256d751ae3191e9cef42ff9f5630931f047
>
> >>>> 8150488: add note to Scanner.findAll() regarding possible infinite
>
> >>>> streams
>
> >>>> Reviewed-by: XXX
>
> >>>>
>
> >>>> diff -r fb54b256d751 -r 6b43c4698752
>
> >>>> src/java.base/share/classes/java/util/Scanner.java
>
> >>>> --- a/src/java.base/share/classes/java/util/Scanner.java    Mon Mar 27
>
> >>>> 15:12:01 2017 -0700
>
> >>>> +++ b/src/java.base/share/classes/java/util/Scanner.java    Tue Mar 28
>
> >>>> 18:12:38 2017 -0700
>
> >>>> @@ -2808,6 +2808,10 @@
>
> >>>>       * }
>
> >>>>       * }</pre>
>
> >>>>       *
>
> >>>> +     * <p>The pattern must always match at least one character. If
>
> >>>> the pattern
>
> >>>> +     * can match zero characters, the result will be an infinite
>
> >>>> stream
>
> >>>> +     * of empty matches.
>
> >>>> +     *
>
> >>>>       * @param pattern the pattern to be matched
>
> >>>>       * @return a sequential stream of match results
>
> >>>>       * @throws NullPointerException if pattern is null
>
> >>>> @@ -2829,6 +2833,11 @@
>
> >>>>       * scanner.findAll(Pattern.compile(patString))
>
> >>>>       * }</pre>
>
> >>>>       *
>
> >>>> +     * @apiNote
>
> >>>> +     * The pattern must always match at least one character. If the
>
> >>>> pattern
>
> >>>> +     * can match zero characters, the result will be an infinite
>
> >>>> stream
>
> >>>> +     * of empty matches.
>
> >>>> +     *
>
> >>>>       * @param patString the pattern string
>
> >>>>       * @return a sequential stream of match results
>
> >>>>       * @throws NullPointerException if patString is null
>
> >>>
>
> >>>
>

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RE: JDK 9 RFR(s): 8150488: add note toScanner.findAll()regardingpossible infinite streams

Timo Kinnunen
Then Scanner.findAll() should be specified not in terms of findWithinHorizon() but something else, maybe Matcher.find(). If it’s specified using findWithinHorizon() this would require implementations of its backing regex engine to be inferior or even broken in this respect when compared to other regex engines. If a regex engine chosen for an implementation didn’t have infinite looping then that would have to be emulated!






Sent from Mail for Windows 10

From: Stuart Marks
Sent: Wednesday, April 5, 2017 07:50
To: Timo Kinnunen
Cc: Xueming Shen; [hidden email]
Subject: Re: JDK 9 RFR(s): 8150488: add note toScanner.findAll()regardingpossible infinite streams



On 4/4/17 5:33 PM, Timo Kinnunen wrote:
Hi,
 
I agree, the behavior should change to work like Matcher.find() which is probably the better known API. E.g. http://www.regular-expressions.info/continue.html describes Java’s Matcher’s behavior on empty matches but doesn’t mention any other APIs beside that.
I don't think Sherman was suggesting that.

In any case this would be a change to findWithinHorizon(), which isn't in scope.

s'marks

 
If we say a regex engine’s task is to find non-overlapping matches in a given input then it repeatedly returning the same empty match at the same location could even be considered a bug.
 
 
 
Sent from Mail for Windows 10
 
From: Xueming Shen
Sent: Wednesday, April 5, 2017 01:11
To: Stuart Marks
Cc: Timo Kinnunen; [hidden email]
Subject: Re: JDK 9 RFR(s): 8150488: add note to Scanner.findAll()regardingpossible infinite streams
 
On 4/4/17, 12:52 PM, Stuart Marks wrote:
> Anything further on this?

> I'd at least like to add the API note I proposed in order to document
> this issue. I'm reluctant to start tinkering with the behavior of this
> method at this late stage in the release.

> BTW I used Scanner.findAll() in a little programming exercise I worked
> on the other day. It worked perfectly. :-)
 
H Stuart,
 
The word "useless" was definitely exaggerate :-) Really meant to say the
note might make the api less
useful/popular.
 
Personally I think the use scenario and the expected resulting behavior
of Stream<MR>finaAll(ptn)
should be more equivalent/similar to the use case  of while
(s.hasNext(p)) {   s.next(p); }, or while
(m.find()) { }, therefor it is probably desired it can be used without
worrying the possibility of getting
into an infinite loop. That said, I agree it might be hard to argue to
fix it in jdk9. The question is if
we are going to fix it in 9u sometime in the further, is it still worth
putting down the note in the
API now. I'm fine if you believe a note will help at least make the
issue a known/documented usage
limitation, for JDK9.
 
Yes, I think we did chat about this one at the hallway some time when
either you or something
ran into the loop by using the method ...
 
Thanks,
Sherman
 
 


> s'marks

> On 3/30/17 2:19 PM, Stuart Marks wrote:
>> Hi Timo, Sherman,
>> 
>> Thanks for looking at this.
>> 
>> Sherman wrote:
>> 
>>> This might practically put the api itself almost useless? it might
>>> be an easy
>>> task to spot
>>> whether or not it's a 0-width-match-possible regex when the regex is
>>> simple,
>>> but it gets
>>> harder and harder, if not impossible when the regex gets complicated,
>>> especially consider
>>> the possible use scenario that the use site is embedded deeply inside a
>>> library implementation.
>> 
>> Well, not "useless", but perhaps less useful than one might like. :-)
>> 
>> I think this is potentially surprising behavior, which is why I at
>> least wanted
>> to add the note. It's not clear to me whether we should try to fix
>> this by
>> changing Scanner though.
>> 
>> Essentially, findAll() is defined in terms of
>> findWithinHorizon(pattern, 0). So
>> if one were to write a loop like so:
>> 
>>     String str;
>>     while ((str = scanner.findWithinHorizon(pattern, 0)) != null) {
>>         System.out.println(str);
>>     }
>> 
>> then this loop would have the same problem if pattern were to match zero
>> characters.
>> 
>>> The alternative is to "fix" it, maybe as what Matcher.find() does,
>>> if the
>>> previous match is
>>> zero-width-match (the fist==last), we step one to the next cursor
>>> before next
>>> try. I know
>> 
>> Interesting, I didn't know Matcher.find() advances the cursor like
>> this. But
>> Scanner.findWithinHorizon() apparently doesn't, so that's why an
>> infinite loop
>> can occur.
>> 
>>> Scanner.findPatternInBuffer() is setting new region set every time
>>> it is
>>> invoked which makes
>>> it complicated, but I would assume it might be still worth a trying?
>>> for
>>> example, utilize the
>>> "hasNextResult"/matcher.end(). I'm not sure without looking into the
>>> code, does
>>> 
>>> while (hasNext(pattern)) {
>>>     next(pattern);
>>> }
>>> 
>>> have the same issue, when pattern matches 0-width?
>> 
>> No, this doesn't have the problem, because hasNext(pat) and next(pat)
>> match
>> delimited tokens. Each call to next() implicitly advances past the next
>> delimiter to reach the subsequent token, if any.
>> 
>> 
>> On 3/30/17 8:56 AM, Timo Kinnunen wrote:
>>> I guess this somewhat contrived example also wouldn’t work?
>>> 
>>>         String s = "\\b\\w+|\\G|\\B";
>>>         String t = "Matcher m = Pattern.compile(s).matcher(t);\n";
>>>         Matcher m = Pattern.compile(s).matcher(t);
>>>         while(m.find()) {
>>>             System.out.println("'" + m.group() + "'");
>>>         }
>> 
>> Right, so if you rewrote this loop to use Scanner.findWithinHorizon()
>> instead of
>> Matcher,
>> 
>>     Scanner sc = new Scanner(t);
>>     String str;
>>     while ((str = sc.findWithinHorizon(s, 0)) != null) {
>>         System.out.println("'" + str + "'");
>>     }
>> 
>> you'd get an infinite loop with str being continually assigned the
>> empty string.
>> As Sherman mentioned, the Matcher.find() will advance the cursor if
>> it gets a
>> zero-width match, avoiding this problem.
>> 
>> * * *
>> 
>> This didn't come up in the code review thread, which was mostly about
>> concurrent
>> modification and late-binding of the spliterator:
>> 
>>  
>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-September/035034.html
>> 
>> I remember noting this phenomenon a while back, which is why I had
>> filed the bug
>> to add a note. I seem to remember discussing it, though, but it might
>> have been
>> in a meeting or in a hallway conversation.
>> 
>> This bug (JDK-8150488) does note that an infinite stream might be
>> unexpected or
>> surprising, but it's not a fatal problem. It can be terminated with
>> limit(). It
>> can also be terminated with takeWhile(), also added in JDK 9. Maybe I
>> could
>> mention these in the API note.
>> 
>> I guess we could also consider changing the implicit
>> findWithinHorizon() loop
>> that findAll() does, perhaps by having it terminate on a zero-width
>> match. Or we
>> could even change findWithinHorizon's behavior if it gets a
>> zero-width match,
>> siilar to what Matcher.find() does. But I'm quite reluctant to start
>> making such
>> changes at this point.
>> 
>> s'marks
>> 
>> 
>> 
>>>         // Outputs:
>>>         // 'Matcher'
>>>         // ''
>>>         // 'm'
>>>         // ''
>>>         // ''
>>>         // ''
>>>         // 'Pattern'
>>>         // ''
>>>         // 'compile'
>>>         // ''
>>>         // 's'
>>>         // ''
>>>         // ''
>>>         // 'matcher'
>>>         // ''
>>>         // 't'
>>>         // ''
>>>         // ''
>>>         // ''
>>>         // ''
>>> 
>>> 
>>> 
>>> Sent from Mail for Windows 10
>>> 
>>> From: Xueming Shen
>>> Sent: Thursday, March 30, 2017 05:41
>>> To: [hidden email]
>>> Subject: Re: JDK 9 RFR(s): 8150488: add note to Scanner.findAll()
>>> regardingpossible infinite streams
>>> 
>>> On 3/29/17, 5:56 PM, Stuart Marks wrote:
>>>> Hi all,
>>>> 
>>>> Please review these non-normative textual additions to the
>>>> Scanner.findAll() method docs. These methods were added earlier in JDK
>>>> 9; there's a small pitfall if the regex can match zero characters.
>>>> 
>>> Stuart,
>>> 
>>> This might practically put the api itself almost useless? it might
>>> be an
>>> easy task to spot
>>> whether or not it's a 0-width-match-possible regex when the regex is
>>> simple, but it gets
>>> harder and harder, if not impossible when the regex gets complicated,
>>> especially consider
>>> the possible use scenario that the use site is embedded deeply inside a
>>> library implementation.
>>> 
>>> The alternative is to "fix" it, maybe as what Matcher.find() does, if
>>> the previous match is
>>> zero-width-match (the fist==last), we step one to the next cursor
>>> before
>>> next try. I know
>>> Scanner.findPatternInBuffer() is setting new region set every time
>>> it is
>>> invoked which makes
>>> it complicated, but I would assume it might be still worth a trying?
>>> for
>>> example, utilize the
>>> "hasNextResult"/matcher.end(). I'm not sure without looking into the
>>> code, does
>>> 
>>> while (hasNext(pattern)) {
>>>      next(pattern);
>>> }
>>> 
>>> have the same issue, when pattern matches 0-width?
>>> 
>>> Thanks!
>>> -Sherman
>>> 
>>> 
>>> 
>>> 
>>>> Thanks,
>>>> 
>>>> s'marks
>>>> 
>>>> 
>>>> # HG changeset patch
>>>> # User smarks
>>>> # Date 1490749958 25200
>>>> #      Tue Mar 28 18:12:38 2017 -0700
>>>> # Node ID 6b43c4698752779793d58813f46d3687c17dde75
>>>> # Parent  fb54b256d751ae3191e9cef42ff9f5630931f047
>>>> 8150488: add note to Scanner.findAll() regarding possible infinite
>>>> streams
>>>> Reviewed-by: XXX
>>>> 
>>>> diff -r fb54b256d751 -r 6b43c4698752
>>>> src/java.base/share/classes/java/util/Scanner.java
>>>> --- a/src/java.base/share/classes/java/util/Scanner.java    Mon Mar 27
>>>> 15:12:01 2017 -0700
>>>> +++ b/src/java.base/share/classes/java/util/Scanner.java    Tue Mar 28
>>>> 18:12:38 2017 -0700
>>>> @@ -2808,6 +2808,10 @@
>>>>       * }
>>>>       * }</pre>
>>>>       *
>>>> +     * <p>The pattern must always match at least one character. If
>>>> the pattern
>>>> +     * can match zero characters, the result will be an infinite
>>>> stream
>>>> +     * of empty matches.
>>>> +     *
>>>>       * @param pattern the pattern to be matched
>>>>       * @return a sequential stream of match results
>>>>       * @throws NullPointerException if pattern is null
>>>> @@ -2829,6 +2833,11 @@
>>>>       *     scanner.findAll(Pattern.compile(patString))
>>>>       * }</pre>
>>>>       *
>>>> +     * @apiNote
>>>> +     * The pattern must always match at least one character. If the
>>>> pattern
>>>> +     * can match zero characters, the result will be an infinite
>>>> stream
>>>> +     * of empty matches.
>>>> +     *
>>>>       * @param patString the pattern string
>>>>       * @return a sequential stream of match results
>>>>       * @throws NullPointerException if patString is null
>>> 
>>> 
 
 


Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: JDK 9 RFR(s): 8150488: add note to Scanner.findAll() regardingpossible infinite streams

Stuart Marks
In reply to this post by Stuart Marks


On 4/4/17 10:48 PM, Stuart Marks wrote:

>
>
> On 4/4/17 4:10 PM, Xueming Shen wrote:
>> Personally I think the use scenario and the expected resulting behavior of
>> Stream<MR>finaAll(ptn) should be more equivalent/similar to the use case  of
>> while (s.hasNext(p)) { s.next(p); }, or while (m.find()) { }, therefor it is
>> probably desired it can be used without worrying the possibility of getting
>> into an infinite loop.
>
> The Scanner.tokens() API, also added in Java 9, provides a stream of delimited
> tokens (strings) that's more-or-less equivalent of a hasNext()/next() loop.
>
> Just as findWithinHorizon() is a distinct use case from hasNext()/next()
> delimited tokens, findAll() provides for distinct uses from tokens().
>
>> That said, I agree it might be hard to argue to fix it in jdk9. The question
>> is if we are going to fix it in 9u sometime in the further, is it still worth
>> putting down the note in the API now. I'm fine if you believe a note will
>> help at least make the issue a known/documented usage limitation, for JDK9.

I've thought about this some more... fundamentally the behavior is odd, since
Scanner.findWithinHorizon() is *based on* Matcher.find(). So why does
Matcher.find() "work" and S.findWithinHorizon() produce an infinite loop?

If you look in findPatternInBuffer, upon which findWithinHorizon and findAll are
based, it essentially resets the matcher every time, which cases the
Matcher.find() special case to be bypassed. So I think this could be considered
a bug in findPatternInBuffer, which causes misbehavior to be visible in
findWithinHorizon and findAll.

I've filed:

JDK-8178116 scanner.findWithinHorizon doesn't advance after matching zero characters

This is certainly too subtle to be trying to fix right now.

I guess the question is, if we believe this is a bug that will be fixed at some
point, do we add a note to findAll() -- and maybe even findWithinHorizon() -- to
document this? Historically we haven't documented bugs like this, so maybe not.

Note also that this behavior of findWithinHorizon has been there since JDK 6,
which is the earliest release I had conveniently at hand.

s'marks


>>
>> Yes, I think we did chat about this one at the hallway some time when either
>> you or something ran into the loop by using the method ...
>
> There are already a couple ways you can get unexpected behavior with Scanner if
> you're not careful. For example, see the warnings in findWithinHorizon() and
> skip() about the possibility of buffering up the entire input. These methods are
> useful, but if you're not careful with the regex you provide, things can go
> wrong. The findAll() method seems to be in the same situation. It's not clear to
> me that any of these are bugs for which there's a reasonable fix.
>
> So yes, I think the note is necessary. I could also add something about
> terminating such streams with limit() or takeWhile().
>
> s'marks
>
>>
>> Thanks,
>> Sherman
>>
>>
>>>
>>> s'marks
>>>
>>> On 3/30/17 2:19 PM, Stuart Marks wrote:
>>>> Hi Timo, Sherman,
>>>>
>>>> Thanks for looking at this.
>>>>
>>>> Sherman wrote:
>>>>
>>>>> This might practically put the api itself almost useless? it might be an easy
>>>>> task to spot
>>>>> whether or not it's a 0-width-match-possible regex when the regex is simple,
>>>>> but it gets
>>>>> harder and harder, if not impossible when the regex gets complicated,
>>>>> especially consider
>>>>> the possible use scenario that the use site is embedded deeply inside a
>>>>> library implementation.
>>>>
>>>> Well, not "useless", but perhaps less useful than one might like. :-)
>>>>
>>>> I think this is potentially surprising behavior, which is why I at least wanted
>>>> to add the note. It's not clear to me whether we should try to fix this by
>>>> changing Scanner though.
>>>>
>>>> Essentially, findAll() is defined in terms of findWithinHorizon(pattern, 0). So
>>>> if one were to write a loop like so:
>>>>
>>>>     String str;
>>>>     while ((str = scanner.findWithinHorizon(pattern, 0)) != null) {
>>>>         System.out.println(str);
>>>>     }
>>>>
>>>> then this loop would have the same problem if pattern were to match zero
>>>> characters.
>>>>
>>>>> The alternative is to "fix" it, maybe as what Matcher.find() does, if the
>>>>> previous match is
>>>>> zero-width-match (the fist==last), we step one to the next cursor before next
>>>>> try. I know
>>>>
>>>> Interesting, I didn't know Matcher.find() advances the cursor like this. But
>>>> Scanner.findWithinHorizon() apparently doesn't, so that's why an infinite loop
>>>> can occur.
>>>>
>>>>> Scanner.findPatternInBuffer() is setting new region set every time it is
>>>>> invoked which makes
>>>>> it complicated, but I would assume it might be still worth a trying? for
>>>>> example, utilize the
>>>>> "hasNextResult"/matcher.end(). I'm not sure without looking into the code,
>>>>> does
>>>>>
>>>>> while (hasNext(pattern)) {
>>>>>     next(pattern);
>>>>> }
>>>>>
>>>>> have the same issue, when pattern matches 0-width?
>>>>
>>>> No, this doesn't have the problem, because hasNext(pat) and next(pat) match
>>>> delimited tokens. Each call to next() implicitly advances past the next
>>>> delimiter to reach the subsequent token, if any.
>>>>
>>>>
>>>> On 3/30/17 8:56 AM, Timo Kinnunen wrote:
>>>>> I guess this somewhat contrived example also wouldn’t work?
>>>>>
>>>>>         String s = "\\b\\w+|\\G|\\B";
>>>>>         String t = "Matcher m = Pattern.compile(s).matcher(t);\n";
>>>>>         Matcher m = Pattern.compile(s).matcher(t);
>>>>>         while(m.find()) {
>>>>>             System.out.println("'" + m.group() + "'");
>>>>>         }
>>>>
>>>> Right, so if you rewrote this loop to use Scanner.findWithinHorizon()
>>>> instead of
>>>> Matcher,
>>>>
>>>>     Scanner sc = new Scanner(t);
>>>>     String str;
>>>>     while ((str = sc.findWithinHorizon(s, 0)) != null) {
>>>>         System.out.println("'" + str + "'");
>>>>     }
>>>>
>>>> you'd get an infinite loop with str being continually assigned the empty
>>>> string.
>>>> As Sherman mentioned, the Matcher.find() will advance the cursor if it gets a
>>>> zero-width match, avoiding this problem.
>>>>
>>>> * * *
>>>>
>>>> This didn't come up in the code review thread, which was mostly about
>>>> concurrent
>>>> modification and late-binding of the spliterator:
>>>>
>>>>
>>>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-September/035034.html
>>>>
>>>> I remember noting this phenomenon a while back, which is why I had filed the
>>>> bug
>>>> to add a note. I seem to remember discussing it, though, but it might have been
>>>> in a meeting or in a hallway conversation.
>>>>
>>>> This bug (JDK-8150488) does note that an infinite stream might be unexpected or
>>>> surprising, but it's not a fatal problem. It can be terminated with limit(). It
>>>> can also be terminated with takeWhile(), also added in JDK 9. Maybe I could
>>>> mention these in the API note.
>>>>
>>>> I guess we could also consider changing the implicit findWithinHorizon() loop
>>>> that findAll() does, perhaps by having it terminate on a zero-width match.
>>>> Or we
>>>> could even change findWithinHorizon's behavior if it gets a zero-width match,
>>>> siilar to what Matcher.find() does. But I'm quite reluctant to start making
>>>> such
>>>> changes at this point.
>>>>
>>>> s'marks
>>>>
>>>>
>>>>
>>>>>         // Outputs:
>>>>>         // 'Matcher'
>>>>>         // ''
>>>>>         // 'm'
>>>>>         // ''
>>>>>         // ''
>>>>>         // ''
>>>>>         // 'Pattern'
>>>>>         // ''
>>>>>         // 'compile'
>>>>>         // ''
>>>>>         // 's'
>>>>>         // ''
>>>>>         // ''
>>>>>         // 'matcher'
>>>>>         // ''
>>>>>         // 't'
>>>>>         // ''
>>>>>         // ''
>>>>>         // ''
>>>>>         // ''
>>>>>
>>>>>
>>>>>
>>>>> Sent from Mail for Windows 10
>>>>>
>>>>> From: Xueming Shen
>>>>> Sent: Thursday, March 30, 2017 05:41
>>>>> To: [hidden email]
>>>>> Subject: Re: JDK 9 RFR(s): 8150488: add note to Scanner.findAll()
>>>>> regardingpossible infinite streams
>>>>>
>>>>> On 3/29/17, 5:56 PM, Stuart Marks wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> Please review these non-normative textual additions to the
>>>>>> Scanner.findAll() method docs. These methods were added earlier in JDK
>>>>>> 9; there's a small pitfall if the regex can match zero characters.
>>>>>>
>>>>> Stuart,
>>>>>
>>>>> This might practically put the api itself almost useless? it might be an
>>>>> easy task to spot
>>>>> whether or not it's a 0-width-match-possible regex when the regex is
>>>>> simple, but it gets
>>>>> harder and harder, if not impossible when the regex gets complicated,
>>>>> especially consider
>>>>> the possible use scenario that the use site is embedded deeply inside a
>>>>> library implementation.
>>>>>
>>>>> The alternative is to "fix" it, maybe as what Matcher.find() does, if
>>>>> the previous match is
>>>>> zero-width-match (the fist==last), we step one to the next cursor before
>>>>> next try. I know
>>>>> Scanner.findPatternInBuffer() is setting new region set every time it is
>>>>> invoked which makes
>>>>> it complicated, but I would assume it might be still worth a trying? for
>>>>> example, utilize the
>>>>> "hasNextResult"/matcher.end(). I'm not sure without looking into the
>>>>> code, does
>>>>>
>>>>> while (hasNext(pattern)) {
>>>>>      next(pattern);
>>>>> }
>>>>>
>>>>> have the same issue, when pattern matches 0-width?
>>>>>
>>>>> Thanks!
>>>>> -Sherman
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> s'marks
>>>>>>
>>>>>>
>>>>>> # HG changeset patch
>>>>>> # User smarks
>>>>>> # Date 1490749958 25200
>>>>>> #      Tue Mar 28 18:12:38 2017 -0700
>>>>>> # Node ID 6b43c4698752779793d58813f46d3687c17dde75
>>>>>> # Parent  fb54b256d751ae3191e9cef42ff9f5630931f047
>>>>>> 8150488: add note to Scanner.findAll() regarding possible infinite
>>>>>> streams
>>>>>> Reviewed-by: XXX
>>>>>>
>>>>>> diff -r fb54b256d751 -r 6b43c4698752
>>>>>> src/java.base/share/classes/java/util/Scanner.java
>>>>>> --- a/src/java.base/share/classes/java/util/Scanner.java    Mon Mar 27
>>>>>> 15:12:01 2017 -0700
>>>>>> +++ b/src/java.base/share/classes/java/util/Scanner.java    Tue Mar 28
>>>>>> 18:12:38 2017 -0700
>>>>>> @@ -2808,6 +2808,10 @@
>>>>>>       * }
>>>>>>       * }</pre>
>>>>>>       *
>>>>>> +     * <p>The pattern must always match at least one character. If
>>>>>> the pattern
>>>>>> +     * can match zero characters, the result will be an infinite stream
>>>>>> +     * of empty matches.
>>>>>> +     *
>>>>>>       * @param pattern the pattern to be matched
>>>>>>       * @return a sequential stream of match results
>>>>>>       * @throws NullPointerException if pattern is null
>>>>>> @@ -2829,6 +2833,11 @@
>>>>>>       *     scanner.findAll(Pattern.compile(patString))
>>>>>>       * }</pre>
>>>>>>       *
>>>>>> +     * @apiNote
>>>>>> +     * The pattern must always match at least one character. If the
>>>>>> pattern
>>>>>> +     * can match zero characters, the result will be an infinite stream
>>>>>> +     * of empty matches.
>>>>>> +     *
>>>>>>       * @param patString the pattern string
>>>>>>       * @return a sequential stream of match results
>>>>>>       * @throws NullPointerException if patString is null
>>>>>
>>>>>
>>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RE: JDK 9 RFR(s): 8150488: add note to Scanner.findAll()regardingpossible infinite streams

Timo Kinnunen
IMHO there should be a notice added in findAll which excludes the behavior of the stream after an empty match from any compatibility requirements while the notice remains in place. This would be to ensure that findAll and the stream it returns can be changed independently from the other methods which also have this issue. Just in case.



Sent from Mail for Windows 10

From: Stuart Marks
Sent: Wednesday, April 5, 2017 18:49
To: Xueming Shen
Cc: [hidden email]
Subject: Re: JDK 9 RFR(s): 8150488: add note to Scanner.findAll()regardingpossible infinite streams



On 4/4/17 10:48 PM, Stuart Marks wrote:

>
>
> On 4/4/17 4:10 PM, Xueming Shen wrote:
>> Personally I think the use scenario and the expected resulting behavior of
>> Stream<MR>finaAll(ptn) should be more equivalent/similar to the use case  of
>> while (s.hasNext(p)) { s.next(p); }, or while (m.find()) { }, therefor it is
>> probably desired it can be used without worrying the possibility of getting
>> into an infinite loop.
>
> The Scanner.tokens() API, also added in Java 9, provides a stream of delimited
> tokens (strings) that's more-or-less equivalent of a hasNext()/next() loop.
>
> Just as findWithinHorizon() is a distinct use case from hasNext()/next()
> delimited tokens, findAll() provides for distinct uses from tokens().
>
>> That said, I agree it might be hard to argue to fix it in jdk9. The question
>> is if we are going to fix it in 9u sometime in the further, is it still worth
>> putting down the note in the API now. I'm fine if you believe a note will
>> help at least make the issue a known/documented usage limitation, for JDK9.

I've thought about this some more... fundamentally the behavior is odd, since
Scanner.findWithinHorizon() is *based on* Matcher.find(). So why does
Matcher.find() "work" and S.findWithinHorizon() produce an infinite loop?

If you look in findPatternInBuffer, upon which findWithinHorizon and findAll are
based, it essentially resets the matcher every time, which cases the
Matcher.find() special case to be bypassed. So I think this could be considered
a bug in findPatternInBuffer, which causes misbehavior to be visible in
findWithinHorizon and findAll.

I've filed:

JDK-8178116 scanner.findWithinHorizon doesn't advance after matching zero characters

This is certainly too subtle to be trying to fix right now.

I guess the question is, if we believe this is a bug that will be fixed at some
point, do we add a note to findAll() -- and maybe even findWithinHorizon() -- to
document this? Historically we haven't documented bugs like this, so maybe not.

Note also that this behavior of findWithinHorizon has been there since JDK 6,
which is the earliest release I had conveniently at hand.

s'marks


>>
>> Yes, I think we did chat about this one at the hallway some time when either
>> you or something ran into the loop by using the method ...
>
> There are already a couple ways you can get unexpected behavior with Scanner if
> you're not careful. For example, see the warnings in findWithinHorizon() and
> skip() about the possibility of buffering up the entire input. These methods are
> useful, but if you're not careful with the regex you provide, things can go
> wrong. The findAll() method seems to be in the same situation. It's not clear to
> me that any of these are bugs for which there's a reasonable fix.
>
> So yes, I think the note is necessary. I could also add something about
> terminating such streams with limit() or takeWhile().
>
> s'marks
>
>>
>> Thanks,
>> Sherman
>>
>>
>>>
>>> s'marks
>>>
>>> On 3/30/17 2:19 PM, Stuart Marks wrote:
>>>> Hi Timo, Sherman,
>>>>
>>>> Thanks for looking at this.
>>>>
>>>> Sherman wrote:
>>>>
>>>>> This might practically put the api itself almost useless? it might be an easy
>>>>> task to spot
>>>>> whether or not it's a 0-width-match-possible regex when the regex is simple,
>>>>> but it gets
>>>>> harder and harder, if not impossible when the regex gets complicated,
>>>>> especially consider
>>>>> the possible use scenario that the use site is embedded deeply inside a
>>>>> library implementation.
>>>>
>>>> Well, not "useless", but perhaps less useful than one might like. :-)
>>>>
>>>> I think this is potentially surprising behavior, which is why I at least wanted
>>>> to add the note. It's not clear to me whether we should try to fix this by
>>>> changing Scanner though.
>>>>
>>>> Essentially, findAll() is defined in terms of findWithinHorizon(pattern, 0). So
>>>> if one were to write a loop like so:
>>>>
>>>>     String str;
>>>>     while ((str = scanner.findWithinHorizon(pattern, 0)) != null) {
>>>>         System.out.println(str);
>>>>     }
>>>>
>>>> then this loop would have the same problem if pattern were to match zero
>>>> characters.
>>>>
>>>>> The alternative is to "fix" it, maybe as what Matcher.find() does, if the
>>>>> previous match is
>>>>> zero-width-match (the fist==last), we step one to the next cursor before next
>>>>> try. I know
>>>>
>>>> Interesting, I didn't know Matcher.find() advances the cursor like this. But
>>>> Scanner.findWithinHorizon() apparently doesn't, so that's why an infinite loop
>>>> can occur.
>>>>
>>>>> Scanner.findPatternInBuffer() is setting new region set every time it is
>>>>> invoked which makes
>>>>> it complicated, but I would assume it might be still worth a trying? for
>>>>> example, utilize the
>>>>> "hasNextResult"/matcher.end(). I'm not sure without looking into the code,
>>>>> does
>>>>>
>>>>> while (hasNext(pattern)) {
>>>>>     next(pattern);
>>>>> }
>>>>>
>>>>> have the same issue, when pattern matches 0-width?
>>>>
>>>> No, this doesn't have the problem, because hasNext(pat) and next(pat) match
>>>> delimited tokens. Each call to next() implicitly advances past the next
>>>> delimiter to reach the subsequent token, if any.
>>>>
>>>>
>>>> On 3/30/17 8:56 AM, Timo Kinnunen wrote:
>>>>> I guess this somewhat contrived example also wouldn’t work?
>>>>>
>>>>>         String s = "\\b\\w+|\\G|\\B";
>>>>>         String t = "Matcher m = Pattern.compile(s).matcher(t);\n";
>>>>>         Matcher m = Pattern.compile(s).matcher(t);
>>>>>         while(m.find()) {
>>>>>             System.out.println("'" + m.group() + "'");
>>>>>         }
>>>>
>>>> Right, so if you rewrote this loop to use Scanner.findWithinHorizon()
>>>> instead of
>>>> Matcher,
>>>>
>>>>     Scanner sc = new Scanner(t);
>>>>     String str;
>>>>     while ((str = sc.findWithinHorizon(s, 0)) != null) {
>>>>         System.out.println("'" + str + "'");
>>>>     }
>>>>
>>>> you'd get an infinite loop with str being continually assigned the empty
>>>> string.
>>>> As Sherman mentioned, the Matcher.find() will advance the cursor if it gets a
>>>> zero-width match, avoiding this problem.
>>>>
>>>> * * *
>>>>
>>>> This didn't come up in the code review thread, which was mostly about
>>>> concurrent
>>>> modification and late-binding of the spliterator:
>>>>
>>>>
>>>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-September/035034.html
>>>>
>>>> I remember noting this phenomenon a while back, which is why I had filed the
>>>> bug
>>>> to add a note. I seem to remember discussing it, though, but it might have been
>>>> in a meeting or in a hallway conversation.
>>>>
>>>> This bug (JDK-8150488) does note that an infinite stream might be unexpected or
>>>> surprising, but it's not a fatal problem. It can be terminated with limit(). It
>>>> can also be terminated with takeWhile(), also added in JDK 9. Maybe I could
>>>> mention these in the API note.
>>>>
>>>> I guess we could also consider changing the implicit findWithinHorizon() loop
>>>> that findAll() does, perhaps by having it terminate on a zero-width match.
>>>> Or we
>>>> could even change findWithinHorizon's behavior if it gets a zero-width match,
>>>> siilar to what Matcher.find() does. But I'm quite reluctant to start making
>>>> such
>>>> changes at this point.
>>>>
>>>> s'marks
>>>>
>>>>
>>>>
>>>>>         // Outputs:
>>>>>         // 'Matcher'
>>>>>         // ''
>>>>>         // 'm'
>>>>>         // ''
>>>>>         // ''
>>>>>         // ''
>>>>>         // 'Pattern'
>>>>>         // ''
>>>>>         // 'compile'
>>>>>         // ''
>>>>>         // 's'
>>>>>         // ''
>>>>>         // ''
>>>>>         // 'matcher'
>>>>>         // ''
>>>>>         // 't'
>>>>>         // ''
>>>>>         // ''
>>>>>         // ''
>>>>>         // ''
>>>>>
>>>>>
>>>>>
>>>>> Sent from Mail for Windows 10
>>>>>
>>>>> From: Xueming Shen
>>>>> Sent: Thursday, March 30, 2017 05:41
>>>>> To: [hidden email]
>>>>> Subject: Re: JDK 9 RFR(s): 8150488: add note to Scanner.findAll()
>>>>> regardingpossible infinite streams
>>>>>
>>>>> On 3/29/17, 5:56 PM, Stuart Marks wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> Please review these non-normative textual additions to the
>>>>>> Scanner.findAll() method docs. These methods were added earlier in JDK
>>>>>> 9; there's a small pitfall if the regex can match zero characters.
>>>>>>
>>>>> Stuart,
>>>>>
>>>>> This might practically put the api itself almost useless? it might be an
>>>>> easy task to spot
>>>>> whether or not it's a 0-width-match-possible regex when the regex is
>>>>> simple, but it gets
>>>>> harder and harder, if not impossible when the regex gets complicated,
>>>>> especially consider
>>>>> the possible use scenario that the use site is embedded deeply inside a
>>>>> library implementation.
>>>>>
>>>>> The alternative is to "fix" it, maybe as what Matcher.find() does, if
>>>>> the previous match is
>>>>> zero-width-match (the fist==last), we step one to the next cursor before
>>>>> next try. I know
>>>>> Scanner.findPatternInBuffer() is setting new region set every time it is
>>>>> invoked which makes
>>>>> it complicated, but I would assume it might be still worth a trying? for
>>>>> example, utilize the
>>>>> "hasNextResult"/matcher.end(). I'm not sure without looking into the
>>>>> code, does
>>>>>
>>>>> while (hasNext(pattern)) {
>>>>>      next(pattern);
>>>>> }
>>>>>
>>>>> have the same issue, when pattern matches 0-width?
>>>>>
>>>>> Thanks!
>>>>> -Sherman
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> s'marks
>>>>>>
>>>>>>
>>>>>> # HG changeset patch
>>>>>> # User smarks
>>>>>> # Date 1490749958 25200
>>>>>> #      Tue Mar 28 18:12:38 2017 -0700
>>>>>> # Node ID 6b43c4698752779793d58813f46d3687c17dde75
>>>>>> # Parent  fb54b256d751ae3191e9cef42ff9f5630931f047
>>>>>> 8150488: add note to Scanner.findAll() regarding possible infinite
>>>>>> streams
>>>>>> Reviewed-by: XXX
>>>>>>
>>>>>> diff -r fb54b256d751 -r 6b43c4698752
>>>>>> src/java.base/share/classes/java/util/Scanner.java
>>>>>> --- a/src/java.base/share/classes/java/util/Scanner.java    Mon Mar 27
>>>>>> 15:12:01 2017 -0700
>>>>>> +++ b/src/java.base/share/classes/java/util/Scanner.java    Tue Mar 28
>>>>>> 18:12:38 2017 -0700
>>>>>> @@ -2808,6 +2808,10 @@
>>>>>>       * }
>>>>>>       * }</pre>
>>>>>>       *
>>>>>> +     * <p>The pattern must always match at least one character. If
>>>>>> the pattern
>>>>>> +     * can match zero characters, the result will be an infinite stream
>>>>>> +     * of empty matches.
>>>>>> +     *
>>>>>>       * @param pattern the pattern to be matched
>>>>>>       * @return a sequential stream of match results
>>>>>>       * @throws NullPointerException if pattern is null
>>>>>> @@ -2829,6 +2833,11 @@
>>>>>>       *     scanner.findAll(Pattern.compile(patString))
>>>>>>       * }</pre>
>>>>>>       *
>>>>>> +     * @apiNote
>>>>>> +     * The pattern must always match at least one character. If the
>>>>>> pattern
>>>>>> +     * can match zero characters, the result will be an infinite stream
>>>>>> +     * of empty matches.
>>>>>> +     *
>>>>>>       * @param patString the pattern string
>>>>>>       * @return a sequential stream of match results
>>>>>>       * @throws NullPointerException if patString is null
>>>>>
>>>>>
>>

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: JDK 9 RFR(s): 8150488: add note to Scanner.findAll()regardingpossible infinite streams

Stuart Marks
On 4/6/17 6:38 AM, Timo Kinnunen wrote:
> IMHO there should be a notice added in findAll which excludes the behavior of
> the stream after an empty match from any compatibility requirements while the
> notice remains in place. This would be to ensure that findAll and the stream it
> returns can be changed independently from the other methods which also have this
> issue. Just in case.

In general we don't use @apiNote to document implementation limits or
restrictions that might be modified in the future, e.g. by bugfixes. And also
note that @apiNote is non-normative, so it doesn't affect compatibility.

But also, Sherman came up with a prototype for a fix that we might be able to
get into 9, so we're pursuing that at the moment.

s'marks
Loading...