[10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

classic Classic list List threaded Threaded
19 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

Ivan Gerasimov
Hello!

It is a proposal to provide a String comparator, which will pay
attention to the numbers embedded into the strings (should they present).

This proposal was initially discussed back in 2014 and seemed to bring
some interest from the community:
http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-December/030343.html

In the latest webrev two methods are added to the public API:
j.u.Comparator.comparingNumerically() and
j.u.Comparator.comparingNumericallyLeadingZerosAhead().

The regression test is extended to exercise this new comparator.

BUGURL: https://bugs.openjdk.java.net/browse/JDK-8134512
WEBREV: http://cr.openjdk.java.net/~igerasim/8134512/01/webrev/

Comments, suggestions are very welcome!

--
With kind regards,
Ivan Gerasimov

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

Ivan Gerasimov
Hello!

This is a gentle reminder.

The proposed comparator implementation would be particularly useful when
one will need to compare version strings.
Some popular file managers also use similar comparing algorithm, as the
results often look more natural to the human eyes (e.g. File Explorer on
Windows, Files on Ubuntu).

Now, as Java 10 is been worked on, to sort the list of Java names
correctly, this kind of comparator is needed:

Look: a list { ... "Java 8", "Java 9", "Java 10" } definitely looks
nicer than { "Java 1", "Java 10", "Java 2", ... }  :-)

Would you please help review the proposal?

With kind regards,
Ivan


On 7/19/17 1:41 AM, Ivan Gerasimov wrote:

> Hello!
>
> It is a proposal to provide a String comparator, which will pay
> attention to the numbers embedded into the strings (should they present).
>
> This proposal was initially discussed back in 2014 and seemed to bring
> some interest from the community:
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-December/030343.html 
>
>
> In the latest webrev two methods are added to the public API:
> j.u.Comparator.comparingNumerically() and
> j.u.Comparator.comparingNumericallyLeadingZerosAhead().
>
> The regression test is extended to exercise this new comparator.
>
> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8134512
> WEBREV: http://cr.openjdk.java.net/~igerasim/8134512/01/webrev/
>
> Comments, suggestions are very welcome!
>

--
With kind regards,
Ivan Gerasimov

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

Jonathan Bluett-Duncan
Meant to reply to core-libs-dev as well; doing so now.

Jonathan

On 23 July 2017 at 11:50, Jonathan Bluett-Duncan <[hidden email]>
wrote:

> Hi Ivan,
>
> I'm not a reviewer per se, but I thought I'd chime in.
>
> There's a few things with Comparator.java which I think could be improved:
>
>    1. `comparingNumericallyLeadingZerosAhead()` is a confusing name for
>    me, as the "ahead" has no meaning in my mind; IMO the name `
>    comparingNumericallyLeadingZerosFirst()` better implies that "0001" <
>    "001" < "01".
>    2. It wasn't clear to me from the javadoc that the comparators compare
>    strings like "abc9" and "abc10" as "abc9" < "abc10", so I think they should
>    include more examples.
>    3. There's a typo in the javadocs for both methods: "represets" -->
>    "represents".
>    4. Where the javadocs say "follows the procedure", I think it'd make
>    more grammatical sense if they said "does the following".
>    5. The lines which say "at the boundary of a subsequence, consisting
>    of decimal digits, the" would flow better if they said "at the boundary of
>    a subsequence *consisting solely of decimal digits*, then the". Note
>    the removal of the comma between "subsequence" and "consisting".
>
> Hope this helps!
>
> Jonathan
>
> On 23 July 2017 at 05:36, Ivan Gerasimov <[hidden email]>
> wrote:
>
>> Hello!
>>
>> This is a gentle reminder.
>>
>> The proposed comparator implementation would be particularly useful when
>> one will need to compare version strings.
>> Some popular file managers also use similar comparing algorithm, as the
>> results often look more natural to the human eyes (e.g. File Explorer on
>> Windows, Files on Ubuntu).
>>
>> Now, as Java 10 is been worked on, to sort the list of Java names
>> correctly, this kind of comparator is needed:
>>
>> Look: a list { ... "Java 8", "Java 9", "Java 10" } definitely looks nicer
>> than { "Java 1", "Java 10", "Java 2", ... }  :-)
>>
>> Would you please help review the proposal?
>>
>> With kind regards,
>> Ivan
>>
>>
>>
>> On 7/19/17 1:41 AM, Ivan Gerasimov wrote:
>>
>>> Hello!
>>>
>>> It is a proposal to provide a String comparator, which will pay
>>> attention to the numbers embedded into the strings (should they present).
>>>
>>> This proposal was initially discussed back in 2014 and seemed to bring
>>> some interest from the community:
>>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-De
>>> cember/030343.html
>>>
>>> In the latest webrev two methods are added to the public API:
>>> j.u.Comparator.comparingNumerically() and
>>> j.u.Comparator.comparingNumericallyLeadingZerosAhead().
>>>
>>> The regression test is extended to exercise this new comparator.
>>>
>>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8134512
>>> WEBREV: http://cr.openjdk.java.net/~igerasim/8134512/01/webrev/
>>>
>>> Comments, suggestions are very welcome!
>>>
>>>
>> --
>> With kind regards,
>> Ivan Gerasimov
>>
>>
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

Ivan Gerasimov
Thank you Jonathan for looking into that!

I agree with all your suggestions.

Here's the updated webrev with suggested modifications:
WEBREV: http://cr.openjdk.java.net/~igerasim/8134512/02/webrev/

With kind regards,
Ivan

On 7/23/17 3:56 AM, Jonathan Bluett-Duncan wrote:

> Meant to reply to core-libs-dev as well; doing so now.
>
> Jonathan
>
> On 23 July 2017 at 11:50, Jonathan Bluett-Duncan
> <[hidden email] <mailto:[hidden email]>> wrote:
>
>     Hi Ivan,
>
>     I'm not a reviewer per se, but I thought I'd chime in.
>
>     There's a few things with Comparator.java which I think could be
>     improved:
>
>      1. `comparingNumericallyLeadingZerosAhead()` is a confusing name
>         for me, as the "ahead" has no meaning in my mind; IMO the name
>         `comparingNumericallyLeadingZerosFirst()` better implies that
>         "0001" < "001" < "01".
>      2. It wasn't clear to me from the javadoc that the comparators
>         compare strings like "abc9" and "abc10" as "abc9" < "abc10",
>         so I think they should include more examples.
>      3. There's a typo in the javadocs for both methods: "represets"
>         --> "represents".
>      4. Where the javadocs say "follows the procedure", I think it'd
>         make more grammatical sense if they said "does the following".
>      5. The lines which say "at the boundary of a subsequence,
>         consisting of decimal digits, the" would flow better if they
>         said "at the boundary of a subsequence *consisting solely of
>         decimal digits*, then the". Note the removal of the comma
>         between "subsequence" and "consisting".
>
>     Hope this helps!
>
>     Jonathan
>
>     On 23 July 2017 at 05:36, Ivan Gerasimov
>     <[hidden email] <mailto:[hidden email]>> wrote:
>
>         Hello!
>
>         This is a gentle reminder.
>
>         The proposed comparator implementation would be particularly
>         useful when one will need to compare version strings.
>         Some popular file managers also use similar comparing
>         algorithm, as the results often look more natural to the human
>         eyes (e.g. File Explorer on Windows, Files on Ubuntu).
>
>         Now, as Java 10 is been worked on, to sort the list of Java
>         names correctly, this kind of comparator is needed:
>
>         Look: a list { ... "Java 8", "Java 9", "Java 10" } definitely
>         looks nicer than { "Java 1", "Java 10", "Java 2", ... }  :-)
>
>         Would you please help review the proposal?
>
>         With kind regards,
>         Ivan
>
>
>
>         On 7/19/17 1:41 AM, Ivan Gerasimov wrote:
>
>             Hello!
>
>             It is a proposal to provide a String comparator, which
>             will pay attention to the numbers embedded into the
>             strings (should they present).
>
>             This proposal was initially discussed back in 2014 and
>             seemed to bring some interest from the community:
>             http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-December/030343.html
>             <http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-December/030343.html>
>
>
>             In the latest webrev two methods are added to the public API:
>             j.u.Comparator.comparingNumerically() and
>             j.u.Comparator.comparingNumericallyLeadingZerosAhead().
>
>             The regression test is extended to exercise this new
>             comparator.
>
>             BUGURL: https://bugs.openjdk.java.net/browse/JDK-8134512
>             <https://bugs.openjdk.java.net/browse/JDK-8134512>
>             WEBREV:
>             http://cr.openjdk.java.net/~igerasim/8134512/01/webrev/
>             <http://cr.openjdk.java.net/%7Eigerasim/8134512/01/webrev/>
>
>             Comments, suggestions are very welcome!
>
>
>         --
>         With kind regards,
>         Ivan Gerasimov
>
>
>

--
With kind regards,
Ivan Gerasimov

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

Jonathan Bluett-Duncan
You're welcome Ivan!

I'm just proofreading the new webrev, and a few more things have occurred
to me:

1. What happens if the comparators are given the elements {"1abc", "2abc",
"10abc"} to compare? Would they sort the elements as {"1abc", "2abc",
"10abc"} or as {"1abc", "10abc", "2abc"}?
What about the array {"x1yz", "x2yz", "x10yz"}?
I wonder if these two cases should be added to the test too and given as
additional examples in the javadocs.

2. The example arrays which you kindly added to the comparators' javadoc
have no whitespace at the start but one space between the last element and
the }, so I think there either should be no space at the end or an extra
space at the start.

3. Very sorry, error on my part: on the javadoc lines which now say "then the
corresponding numerical values are compared; otherwise", I suggested to add
a "then" at the start; re-reading it though, I personally think it reads
better without, but I would understand and not press it further if you
disagree and think that the "then" is a useful addition.

Best regards,
Jonathan


On 24 Jul 2017 06:21, "Ivan Gerasimov" <[hidden email]> wrote:

Thank you Jonathan for looking into that!

I agree with all your suggestions.

Here's the updated webrev with suggested modifications:
WEBREV: http://cr.openjdk.java.net/~igerasim/8134512/02/webrev/

With kind regards,
Ivan


On 7/23/17 3:56 AM, Jonathan Bluett-Duncan wrote:

Meant to reply to core-libs-dev as well; doing so now.

Jonathan

On 23 July 2017 at 11:50, Jonathan Bluett-Duncan <[hidden email]>
wrote:

> Hi Ivan,
>
> I'm not a reviewer per se, but I thought I'd chime in.
>
> There's a few things with Comparator.java which I think could be improved:
>
>    1. `comparingNumericallyLeadingZerosAhead()` is a confusing name for
>    me, as the "ahead" has no meaning in my mind; IMO the name
>    `comparingNumericallyLeadingZerosFirst()` better implies that "0001" <
>    "001" < "01".
>    2. It wasn't clear to me from the javadoc that the comparators compare
>    strings like "abc9" and "abc10" as "abc9" < "abc10", so I think they should
>    include more examples.
>    3. There's a typo in the javadocs for both methods: "represets" -->
>    "represents".
>    4. Where the javadocs say "follows the procedure", I think it'd make
>    more grammatical sense if they said "does the following".
>    5. The lines which say "at the boundary of a subsequence, consisting
>    of decimal digits, the" would flow better if they said "at the boundary of
>    a subsequence *consisting solely of decimal digits*, then the". Note
>    the removal of the comma between "subsequence" and "consisting".
>
> Hope this helps!
>
> Jonathan
>
> On 23 July 2017 at 05:36, Ivan Gerasimov <[hidden email]>
> wrote:
>
>> Hello!
>>
>> This is a gentle reminder.
>>
>> The proposed comparator implementation would be particularly useful when
>> one will need to compare version strings.
>> Some popular file managers also use similar comparing algorithm, as the
>> results often look more natural to the human eyes (e.g. File Explorer on
>> Windows, Files on Ubuntu).
>>
>> Now, as Java 10 is been worked on, to sort the list of Java names
>> correctly, this kind of comparator is needed:
>>
>> Look: a list { ... "Java 8", "Java 9", "Java 10" } definitely looks nicer
>> than { "Java 1", "Java 10", "Java 2", ... }  :-)
>>
>> Would you please help review the proposal?
>>
>> With kind regards,
>> Ivan
>>
>>
>>
>> On 7/19/17 1:41 AM, Ivan Gerasimov wrote:
>>
>>> Hello!
>>>
>>> It is a proposal to provide a String comparator, which will pay
>>> attention to the numbers embedded into the strings (should they present).
>>>
>>> This proposal was initially discussed back in 2014 and seemed to bring
>>> some interest from the community:
>>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-De
>>> cember/030343.html
>>>
>>> In the latest webrev two methods are added to the public API:
>>> j.u.Comparator.comparingNumerically() and
>>> j.u.Comparator.comparingNumericallyLeadingZerosAhead().
>>>
>>> The regression test is extended to exercise this new comparator.
>>>
>>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8134512
>>> WEBREV: http://cr.openjdk.java.net/~igerasim/8134512/01/webrev/
>>>
>>> Comments, suggestions are very welcome!
>>>
>>>
>> --
>> With kind regards,
>> Ivan Gerasimov
>>
>>
>

--
With kind regards,
Ivan Gerasimov
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

roger riggs
In reply to this post by Ivan Gerasimov
Hi Ivan,

Thanks for bringing this up again.

Some initial comments, not a full review.

Though the enhancement says that no consideration is given to sign
characters they
may produce puzzling results for strings like "-2000" and "-2001" where
the strings appear
to be signed numbers but the comparison will be for the "-" prefix and
the rest unsigned.
Including the word 'unsigned' in the description might help reinforce
the semantics.

Also, I don't see test cases for strings like:  "abc200123" and
"abc20000123" which share
a prefix part of which is numeric but differ in the number of "leading"
zeros after the prefix.

What about strings that include more than 1 numeric segment:
abc2003def0001 and abc02003def001
in the leading zero case?

Though the test adds the @key randomness, it would be useful to use the
test library
mechanisms to report the seed and be able to run the test with a seed,
if case they fail.
(As might be provided by the test library jdk.test.lib.RandomFactory).


Comparator.java:
576: "the common prefix is skipped" is problematic when considering
leading zeros.
    The common prefix may contain non-zero digits.
585: it is not clear whether the "different number of leading zeros" is
applied regardless
    of the common prefix or only starting after the common prefix.

550, 586: for many readers, it is easier to read 'for example' than
"e.g." or "i.e.".

562, 598:  editorial: "is to compare" -> "compares"

Comparators.java:

192: @param for param o is missing;  (the param name "o" usually refers
to an object, not a string).
200:   Can skipLeadingZeros be coded to correctly work if cnt == 0; it
would be more robust
      SkipLeading zeros works correctly only if pos is given the first
numeric digit in the subsequence
      so the numStart1 and numStart2 must be first digit in each string.

compare():

Line 223: if strings typically have non-numeric prefixes, it might
perform slightly better
to detect the end of the common prefix and then scan back to find the
beginning of the numeric part.
(Avoiding checking every char for isDigit).

Line 224: If assigned for every digit, it will hold the last equal digit
in the common prefix, not the first digit.

239, 240:  chars at o1(index) and o2(index) are already known to be
digits, can it be avoided checking them twice
by starting at index+1?

$.02, Roger


On 7/19/2017 4:41 AM, Ivan Gerasimov wrote:

> Hello!
>
> It is a proposal to provide a String comparator, which will pay
> attention to the numbers embedded into the strings (should they present).
>
> This proposal was initially discussed back in 2014 and seemed to bring
> some interest from the community:
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-December/030343.html 
>
>
> In the latest webrev two methods are added to the public API:
> j.u.Comparator.comparingNumerically() and
> j.u.Comparator.comparingNumericallyLeadingZerosAhead().
>
> The regression test is extended to exercise this new comparator.
>
> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8134512
> WEBREV: http://cr.openjdk.java.net/~igerasim/8134512/01/webrev/
>
> Comments, suggestions are very welcome!
>

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

Stuart Marks
In reply to this post by Ivan Gerasimov
Hi Ivan,

I think this is an interesting avenue to explore adding to the platform. The
idea of sorting this way is pretty subtle and it seems to come up frequently, so
it seems valuable. There are some issues that warrant further discussion,
though. Briefly:

1. Should this be in the JDK?
2. What do other platforms do?
3. Does it have the right semantics?

Discussion follows.


--


1. Should this be in the JDK?

I think a case for it can be made. It does appear in other platforms (see below)
and there are also several third party implementations available in a variety of
environments. So people do have a need for this feature. It's also complicated
enough to have generated lots of discussions and articles on the topic. The
questions are whether this can be specified sufficiently clearly, and whether it
provides value for the use cases for which it's intended. It's not obvious
whether this is true, but I believe a case can and should be made.


2. What do other platforms do?

It was a bit difficult to find information about this, since it doesn't seem to
have a well established name. Words like "natural", "logical", "alphanum", and
"mixed" tend to be used. I eventually found these:

Windows XP StrCmpLogicalW [1]:

     Compares two Unicode strings. Digits in the strings are considered as
     numerical content rather than text. This test is not case-sensitive.

Windows 7 CompareStringEx SORT_DIGITSASNUMBERS [2]

     Treat digits as numbers during sorting, for example, sort "2" before "10".

     (Note: this API takes a locale parameter.)

Macintosh Mac NSString localizedStandardCompare [3]

     This method should be used whenever file names or other strings are
     presented in lists and tables where Finder-like sorting is appropriate.
     The exact sorting behavior of this method is different under different
     locales and may be changed in future releases. This method uses the
     current locale.

     (Note: I observe that the Mac Finder sorting is case insensitive.)

Swift String.localizedStandardCompare [4]

     Compares strings as sorted by the Finder.

There are also third party, open source implementations available for a variety
of platforms. These aren't too hard to find; this Coding Horror article [5] has
a discussion of the issues and links to several implementations. Of particular
note is the short Python implementation embedded in the article.

There is also the Node package javascript-natural-sort [6] which is one of
several (of course) similar packages on NPM. This one seems popular, with more
than 200,000 downloads in the past month.

Finally, there is mention of "numericOrdering" in this Unicode TR [7] but it
seems fairly non-specific, and I don't know how it applies. The point here is
that the Unicode community is aware of this kind of ordering, and various
libraries that implement Unicode collation, such as ICU [8], might have
implementations that can provide guidance.


3. Does it have the right semantics?

I think you can see from the above survey that there is no standard, and
different implementations are all over the map, and most if not all are
completely ill-specified. But what is useful about the survey is that it shows
what people are actually using, and that there are things that many of them have
in common. Two items jump out at me:

  - case-insensitive comparison (sometimes optional)
  - locale-specific collation

The obvious (but simplistic) thing to do is to provide variations of this API
that can use String.CASE_INSENSITIVE_ORDER. Note however that its doc
specifically states that it provides "unsatisfactory ordering for certain
locales" and directs the reader to the Collator class, which does take locale
into account.

Now, I'm sensitive about making this more complicated than necessary. But the
point of "logical" comparator is to provide something that makes sense to humans
looking at the result, which implies that locale-specific collation needs to be
applied, as well as case insensitivity (which itself is locale-specific). So I
think consideration of those is indeed necessary.

I don't know what the API should look like. The java.text.Collator class
implements Comparator. This suggests the possibility of an API that allows a
"downstream" comparator to be specified, to which ordering of certain
subsequences can be delegated.

s'marks



[1] https://msdn.microsoft.com/en-us/library/windows/desktop/bb759947(v=vs.85).aspx

[2] https://msdn.microsoft.com/en-us/library/windows/desktop/dd317761(v=vs.85).aspx

[3]
https://developer.apple.com/documentation/foundation/nsstring/1409742-localizedstandardcompare?language=objc

[4]
https://developer.apple.com/documentation/swift/string/1408384-localizedstandardcompare

[5] https://blog.codinghorror.com/sorting-for-humans-natural-sort-order/

[6] https://www.npmjs.com/package/javascript-natural-sort

[7] http://unicode.org/reports/tr35/tr35-collation.html#Setting_Options

[8] http://userguide.icu-project.org/collation



On 7/19/17 1:41 AM, Ivan Gerasimov wrote:

> Hello!
>
> It is a proposal to provide a String comparator, which will pay attention to the
> numbers embedded into the strings (should they present).
>
> This proposal was initially discussed back in 2014 and seemed to bring some
> interest from the community:
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-December/030343.html
>
> In the latest webrev two methods are added to the public API:
> j.u.Comparator.comparingNumerically() and
> j.u.Comparator.comparingNumericallyLeadingZerosAhead().
>
> The regression test is extended to exercise this new comparator.
>
> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8134512
> WEBREV: http://cr.openjdk.java.net/~igerasim/8134512/01/webrev/
>
> Comments, suggestions are very welcome!
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

Ivan Gerasimov
In reply to this post by Jonathan Bluett-Duncan
Hi Jonathan!

On 7/24/17 3:42 AM, Jonathan Bluett-Duncan wrote:

> You're welcome Ivan!
>
> I'm just proofreading the new webrev, and a few more things have
> occurred to me:
>
> 1. What happens if the comparators are given the elements {"1abc",
> "2abc", "10abc"} to compare? Would they sort the elements as {"1abc",
> "2abc", "10abc"} or as {"1abc", "10abc", "2abc"}?
> What about the array {"x1yz", "x2yz", "x10yz"}?
> I wonder if these two cases should be added to the test too and given
> as additional examples in the javadocs.
>
These arrays would be sorted in the way you expect: i.e. {"1abc",
"2abc", "10abc"} and {"x1yz", "x2yz", "x10yz"}, respectively.
I agree it is worthwhile to choose a good descriptive example for the
javadoc, so it would make it clear for a reader what to expect from the
routine.
Probably, something similar to what they have in MSDN:
https://msdn.microsoft.com/en-us/library/windows/desktop/bb759947(v=vs.85).aspx

> 2. The example arrays which you kindly added to the comparators'
> javadoc have no whitespace at the start but one space between the last
> element and the }, so I think there either should be no space at the
> end or an extra space at the start.
>
Okay, I'll make it consistent.

> 3. Very sorry, error on my part: on the javadoc lines which now say
> "then the corresponding numerical values are compared; otherwise", I
> suggested to add a "then" at the start; re-reading it though, I
> personally think it reads better without, but I would understand and
> not press it further if you disagree and think that the "then" is a
> useful addition.
>
Fixed, thanks!

I'll post the updated webrev later, once incorporate other suggestions.

With kind regards,
Ivan



> Best regards,
> Jonathan
>
>
> On 24 Jul 2017 06:21, "Ivan Gerasimov" <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     Thank you Jonathan for looking into that!
>
>     I agree with all your suggestions.
>
>     Here's the updated webrev with suggested modifications:
>     WEBREV: http://cr.openjdk.java.net/~igerasim/8134512/02/webrev/
>     <http://cr.openjdk.java.net/%7Eigerasim/8134512/02/webrev/>
>
>     With kind regards,
>     Ivan
>
>
>     On 7/23/17 3:56 AM, Jonathan Bluett-Duncan wrote:
>>     Meant to reply to core-libs-dev as well; doing so now.
>>
>>     Jonathan
>>
>>     On 23 July 2017 at 11:50, Jonathan Bluett-Duncan
>>     <[hidden email] <mailto:[hidden email]>> wrote:
>>
>>         Hi Ivan,
>>
>>         I'm not a reviewer per se, but I thought I'd chime in.
>>
>>         There's a few things with Comparator.java which I think could
>>         be improved:
>>
>>          1. `comparingNumericallyLeadingZerosAhead()` is a confusing
>>             name for me, as the "ahead" has no meaning in my mind;
>>             IMO the name `comparingNumericallyLeadingZerosFirst()`
>>             better implies that "0001" < "001" < "01".
>>          2. It wasn't clear to me from the javadoc that the
>>             comparators compare strings like "abc9" and "abc10" as
>>             "abc9" < "abc10", so I think they should include more
>>             examples.
>>          3. There's a typo in the javadocs for both methods:
>>             "represets" --> "represents".
>>          4. Where the javadocs say "follows the procedure", I think
>>             it'd make more grammatical sense if they said "does the
>>             following".
>>          5. The lines which say "at the boundary of a subsequence,
>>             consisting of decimal digits, the" would flow better if
>>             they said "at the boundary of a subsequence *consisting
>>             solely of decimal digits*, then the". Note the removal of
>>             the comma between "subsequence" and "consisting".
>>
>>         Hope this helps!
>>
>>         Jonathan
>>
>>         On 23 July 2017 at 05:36, Ivan Gerasimov
>>         <[hidden email]
>>         <mailto:[hidden email]>> wrote:
>>
>>             Hello!
>>
>>             This is a gentle reminder.
>>
>>             The proposed comparator implementation would be
>>             particularly useful when one will need to compare version
>>             strings.
>>             Some popular file managers also use similar comparing
>>             algorithm, as the results often look more natural to the
>>             human eyes (e.g. File Explorer on Windows, Files on Ubuntu).
>>
>>             Now, as Java 10 is been worked on, to sort the list of
>>             Java names correctly, this kind of comparator is needed:
>>
>>             Look: a list { ... "Java 8", "Java 9", "Java 10" }
>>             definitely looks nicer than { "Java 1", "Java 10", "Java
>>             2", ... }  :-)
>>
>>             Would you please help review the proposal?
>>
>>             With kind regards,
>>             Ivan
>>
>>
>>
>>             On 7/19/17 1:41 AM, Ivan Gerasimov wrote:
>>
>>                 Hello!
>>
>>                 It is a proposal to provide a String comparator,
>>                 which will pay attention to the numbers embedded into
>>                 the strings (should they present).
>>
>>                 This proposal was initially discussed back in 2014
>>                 and seemed to bring some interest from the community:
>>                 http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-December/030343.html
>>                 <http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-December/030343.html>
>>
>>
>>                 In the latest webrev two methods are added to the
>>                 public API:
>>                 j.u.Comparator.comparingNumerically() and
>>                 j.u.Comparator.comparingNumericallyLeadingZerosAhead().
>>
>>                 The regression test is extended to exercise this new
>>                 comparator.
>>
>>                 BUGURL:
>>                 https://bugs.openjdk.java.net/browse/JDK-8134512
>>                 <https://bugs.openjdk.java.net/browse/JDK-8134512>
>>                 WEBREV:
>>                 http://cr.openjdk.java.net/~igerasim/8134512/01/webrev/
>>                 <http://cr.openjdk.java.net/%7Eigerasim/8134512/01/webrev/>
>>
>>                 Comments, suggestions are very welcome!
>>
>>
>>             --
>>             With kind regards,
>>             Ivan Gerasimov
>>
>>
>>
>
>     --
>     With kind regards,
>     Ivan Gerasimov
>
>
>

--
With kind regards,
Ivan Gerasimov

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

Peter Levart
In reply to this post by Ivan Gerasimov
Hi Ivan,

Would I be wrong if I described the logic of these two comparators as
the following:

The comparator compares two character sequences as though each of them
would be 1st transformed into a tuple of the form:

(A0, N0, A1, N1, ..., An-1, Nn-1, An)

where:

A0 and An are (possibly empty) sub-sequences consisting of
non-decimal-digit characters
A1 ... An-1 are non-empty sub-sequences consisting of non-decimal-digit
characters
N0 ... Nn-1 are non-empty sub-sequences consisting of decimal-digit
characters

...such that all sub-sequences concatenated together in order as they
appear in the tuple yield the original character sequence.

Any characher sequence can be uniquely transformed into such tuple. For
example:

"" -> (A0); where A0 == ""
"ab10" -> (A0, N0, A1); where A0 == "ab", N0 == "10", A1 = ""
"1" -> (A0, N0, A1); where A0 == "", N0 == "1", A1 = ""
...

After transformation, the tuples are compared by their elements (from
left to right) so that corresponding Ax elements are compared
lexicographically and Nx elements are compared as decimal integers (with
two variations considering leading zeroes).

If I am right than perhaps such description would be easier to understand.

What do you think?


Regards, Peter

On 07/19/2017 10:41 AM, Ivan Gerasimov wrote:

> Hello!
>
> It is a proposal to provide a String comparator, which will pay
> attention to the numbers embedded into the strings (should they present).
>
> This proposal was initially discussed back in 2014 and seemed to bring
> some interest from the community:
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-December/030343.html 
>
>
> In the latest webrev two methods are added to the public API:
> j.u.Comparator.comparingNumerically() and
> j.u.Comparator.comparingNumericallyLeadingZerosAhead().
>
> The regression test is extended to exercise this new comparator.
>
> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8134512
> WEBREV: http://cr.openjdk.java.net/~igerasim/8134512/01/webrev/
>
> Comments, suggestions are very welcome!
>

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

Peter Levart
Hi Ivan,

In the light of what Stuart Marks wrote, then what do you think about a
parameterized comparator (would be sub-optimal, I know) where one would
supply
2 Comparator(s) which would be used to compare Ax and Nx sub-sequences
respectively as described below...

For Nx sub-sequences, one would need a comparator comparing the reversed
sequence lexicographically, but for Ax sub-sequences, one could choose
from a plethora of case-(in)sensitive comparator(s) and collators
already available on the platform.

Regards, peter

On 07/28/2017 04:12 PM, Peter Levart wrote:

> Hi Ivan,
>
> Would I be wrong if I described the logic of these two comparators as
> the following:
>
> The comparator compares two character sequences as though each of them
> would be 1st transformed into a tuple of the form:
>
> (A0, N0, A1, N1, ..., An-1, Nn-1, An)
>
> where:
>
> A0 and An are (possibly empty) sub-sequences consisting of
> non-decimal-digit characters
> A1 ... An-1 are non-empty sub-sequences consisting of
> non-decimal-digit characters
> N0 ... Nn-1 are non-empty sub-sequences consisting of decimal-digit
> characters
>
> ...such that all sub-sequences concatenated together in order as they
> appear in the tuple yield the original character sequence.
>
> Any characher sequence can be uniquely transformed into such tuple.
> For example:
>
> "" -> (A0); where A0 == ""
> "ab10" -> (A0, N0, A1); where A0 == "ab", N0 == "10", A1 = ""
> "1" -> (A0, N0, A1); where A0 == "", N0 == "1", A1 = ""
> ...
>
> After transformation, the tuples are compared by their elements (from
> left to right) so that corresponding Ax elements are compared
> lexicographically and Nx elements are compared as decimal integers
> (with two variations considering leading zeroes).
>
> If I am right than perhaps such description would be easier to
> understand.
>
> What do you think?
>
>
> Regards, Peter
>
> On 07/19/2017 10:41 AM, Ivan Gerasimov wrote:
>> Hello!
>>
>> It is a proposal to provide a String comparator, which will pay
>> attention to the numbers embedded into the strings (should they
>> present).
>>
>> This proposal was initially discussed back in 2014 and seemed to
>> bring some interest from the community:
>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-December/030343.html 
>>
>>
>> In the latest webrev two methods are added to the public API:
>> j.u.Comparator.comparingNumerically() and
>> j.u.Comparator.comparingNumericallyLeadingZerosAhead().
>>
>> The regression test is extended to exercise this new comparator.
>>
>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8134512
>> WEBREV: http://cr.openjdk.java.net/~igerasim/8134512/01/webrev/
>>
>> Comments, suggestions are very welcome!
>>
>

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

Ivan Gerasimov
In reply to this post by roger riggs
Hi Roger!


On 7/24/17 7:42 AM, Roger Riggs wrote:

> Hi Ivan,
>
> Thanks for bringing this up again.
>
> Some initial comments, not a full review.
>
> Though the enhancement says that no consideration is given to sign
> characters they
> may produce puzzling results for strings like "-2000" and "-2001"
> where the strings appear
> to be signed numbers but the comparison will be for the "-" prefix and
> the rest unsigned.
> Including the word 'unsigned' in the description might help reinforce
> the semantics.
>
Yes, it's a good point.  I'll add some words to make it certain we only
recognize unsigned numbers.
Otherwise, it would become confusing when comparing something like
"2017-07-28" and "2017-07-29".

> Also, I don't see test cases for strings like: "abc200123" and
> "abc20000123" which share
> a prefix part of which is numeric but differ in the number of
> "leading" zeros after the prefix.
>
Sure.  Good addition to the test.

> What about strings that include more than 1 numeric segment:
> abc2003def0001 and abc02003def001
> in the leading zero case?
>
With the first comparator (which treats the numbers with more leading
zeroes as greater ones), these strings would be sorted as
"abc2003def0001" < "abc02003def001".
The logic here is as following:
1) skip common prefix "abc",
2) find the numerical parts: "2003" and "02003",
3) after skipping leading zeroes, they are compared to be equal,
4) then, the string with more leading zeroes is said to be greater.

> Though the test adds the @key randomness, it would be useful to use
> the test library
> mechanisms to report the seed and be able to run the test with a seed,
> if case they fail.
> (As might be provided by the test library jdk.test.lib.RandomFactory).
>
Okay, I'll add this Random to the shuffling to make it reproducible.
>
> Comparator.java:
> 576: "the common prefix is skipped" is problematic when considering
> leading zeros.
>    The common prefix may contain non-zero digits.
Yes, of course.
While scanning the string for the common prefix, we still keep track of
the numeric part.
Probably "skip" is a wrong word here.

> 585: it is not clear whether the "different number of leading zeros"
> is applied regardless
>    of the common prefix or only starting after the common prefix.
>
When talking about leading zeroes, then the entire numerical substring
is meant.
Part of this substring can belong to the common prefix.

> 550, 586: for many readers, it is easier to read 'for example' than
> "e.g." or "i.e.".
>
Thanks!  I'll change it accordingly.

> 562, 598:  editorial: "is to compare" -> "compares"
>
Thanks!
> Comparators.java:
>
> 192: @param for param o is missing;  (the param name "o" usually
> refers to an object, not a string).
> 200:   Can skipLeadingZeros be coded to correctly work if cnt == 0; it
> would be more robust
>      SkipLeading zeros works correctly only if pos is given the first
> numeric digit in the subsequence
>      so the numStart1 and numStart2 must be first digit in each string.
I don't think that skipLeadingZeros() is very specific that it always
called for longer string, trying to reduce its size via skipping leading
zeroes.
It wouldn't make sense to call it with cnt == 0, and so we can avoid one
initial comparing and save a couple of nanoseconds :)
I added this prerequisite to the javadoc of the method, so hopefully it
shouldn't cause much confusion.

> compare():
>
> Line 223: if strings typically have non-numeric prefixes, it might
> perform slightly better
> to detect the end of the common prefix and then scan back to find the
> beginning of the numeric part.
> (Avoiding checking every char for isDigit).
>
On the other hand, we're saving on not calling String.charAt() for the
second time :)

> Line 224: If assigned for every digit, it will hold the last equal
> digit in the common prefix, not the first digit.
It will only be assigned when a non-digit is met.
And since the index was just incremented @ Line 222, numStart1 will be
set to the index of the first digit inside the common prefix.

For example, if the common prefix is abc123, then the numStart1 will be
updated to 3 when looking at the char 'c'.  Last three cycles of the
loop it won't be updated, since all the trailing chars are digits.

>
> 239, 240:  chars at o1(index) and o2(index) are already known to be
> digits, can it be avoided checking them twice
> by starting at index+1?
>
Not quite necessarily.  We can be here due to numLen1 > 0, and *only
one* or *none* of the string remainders start with digits.
In the later case we'll end up @ line 279.

Here's the updated webrev:
http://cr.openjdk.java.net/~igerasim/8134512/03/webrev/

Given the further discussion this iteration is probably not the final,
but it's better to have a checkpoint :)

With kind regards,
Ivan

> $.02, Roger
>
>
> On 7/19/2017 4:41 AM, Ivan Gerasimov wrote:
>> Hello!
>>
>> It is a proposal to provide a String comparator, which will pay
>> attention to the numbers embedded into the strings (should they
>> present).
>>
>> This proposal was initially discussed back in 2014 and seemed to
>> bring some interest from the community:
>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-December/030343.html 
>>
>>
>> In the latest webrev two methods are added to the public API:
>> j.u.Comparator.comparingNumerically() and
>> j.u.Comparator.comparingNumericallyLeadingZerosAhead().
>>
>> The regression test is extended to exercise this new comparator.
>>
>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8134512
>> WEBREV: http://cr.openjdk.java.net/~igerasim/8134512/01/webrev/
>>
>> Comments, suggestions are very welcome!
>>
>
>

--
With kind regards,
Ivan Gerasimov

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

Ivan Gerasimov
In reply to this post by Peter Levart
Hi Peter!

Thank a lot for looking into this!

On 7/28/17 7:32 AM, Peter Levart wrote:
> Hi Ivan,
>
> In the light of what Stuart Marks wrote, then what do you think about
> a parameterized comparator (would be sub-optimal, I know) where one
> would supply
> 2 Comparator(s) which would be used to compare Ax and Nx sub-sequences
> respectively as described below...
>
Yes.  Inspired by what Stuart suggested I made a draft of such a
comparator (see below).
It works, but as you've said it's not that efficient (mostly due to
expensive substrings) and a bit harder to use in a simple case.
Now I need to think about how to combine two approaches.

> For Nx sub-sequences, one would need a comparator comparing the
> reversed sequence lexicographically,
I'm not sure I understand why they need to be reversed.
> but for Ax sub-sequences, one could choose from a plethora of
> case-(in)sensitive comparator(s) and collators already available on
> the platform.
>
Yes. In the example below I used compareToIgnoreCase to compare alpha
subsequences.

-------

class MyComparator implements Comparator<String> {Comparator<String>
alphaCmp;Comparator<String> numCmp;MyComparator(Comparator<String>
alphaCmp,Comparator<String> numCmp) {this.alphaCmp =
alphaCmp;this.numCmp = numCmp;}boolean skipLeadingZeroes(String s, int
len) {for (int i = 0; i < len ; ++i) {if (Character.digit(s.charAt(i),
10) != 0)return false;}return true;}@Override public int compare(String
o1, String o2) {Supplier<String> s1 = new
NumberSlicer(o1);Supplier<String> s2 = new NumberSlicer(o2);while (true)
{// alpha part String ss1 = s1.get();String ss2 = s2.get();int cmp =
alphaCmp.compare(ss1, ss2);if (cmp != 0) return cmp;if (ss1.length() ==
0) return 0;// numeric part ss1 = s1.get();ss2 = s2.get();int len1 =
ss1.length();int len2 = ss2.length();int dlen = len1 - len2;if (dlen >
0) {if (!skipLeadingZeroes(ss1, dlen))return 1;ss1 = ss1.substring(dlen,
len1);} else if (dlen < 0) {if (!skipLeadingZeroes(ss2, -dlen))return
-1;ss2 = ss2.substring(-dlen, len2);}cmp = numCmp.compare(ss1, ss2);if
(cmp != 0) return cmp;if (dlen != 0) return dlen;}}static class
NumberSlicer implements Supplier<String> {private String
sequence;private boolean expectNumber = false;private int index =
0;NumberSlicer(String s) {sequence = s;}@Override public String get()
{int start = index, end = start, len = sequence.length() - start;for (;
len > 0; ++end, --len) {if (Character.isDigit(sequence.charAt(end)) ^
expectNumber)break;}expectNumber = !expectNumber;return
sequence.substring(start, index = end);}}}------------Here how it is
invoked with case-insensitive comparator:Arrays.sort(str,new
MyComparator(Comparator.comparing(String::toString,String::compareToIgnoreCase),Comparator.comparing(String::toString,String::compareTo)));------------

simple test results for case insensitive sorting:java 1java 1 javajava 1
JAVAJava 2JAVA 5jaVA 6.1java 10java 10 v 13Java 10 v 013Java 10 v
000013java 10 v 113Java 2017Java 2017Java 20017Java 200017Java
2000017Java 20000017Java 20000017Java 200000017With kind regards, Ivan
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

joe darcy
In reply to this post by Ivan Gerasimov
Hi Ivan,

A few comments.

I don't have a specific suggestion for a different name, but I don't
think "comparingNumerically" does an adequate job capturing the
described behavior of the method. Likewise, the summary of the methods'
behavior in the javadoc should have a more suggestive description of the
behavior.

Please add comments describing what the "// Fullwidth characters" values
are in the test.

Interesting test cases would also include substrings for
Integer.MAX_VALUE, Integer.MAX_VALUE+1, Long.MAX_VALUE, Long.MAX_VALUE +
1, etc. Test cases of purely numerical and purely alphabetical inputs
would be warranted too.

While randomizing the array and sorting is a valid test, a more through
test would do pair-wise comparisons of each array element. While such
comparisons are quadratic with array length, the arrays here are small.

Cheers,

-Joe


On 7/28/2017 9:03 AM, Ivan Gerasimov wrote:

> Hi Roger!
>
>
> On 7/24/17 7:42 AM, Roger Riggs wrote:
>> Hi Ivan,
>>
>> Thanks for bringing this up again.
>>
>> Some initial comments, not a full review.
>>
>> Though the enhancement says that no consideration is given to sign
>> characters they
>> may produce puzzling results for strings like "-2000" and "-2001"
>> where the strings appear
>> to be signed numbers but the comparison will be for the "-" prefix
>> and the rest unsigned.
>> Including the word 'unsigned' in the description might help reinforce
>> the semantics.
>>
> Yes, it's a good point.  I'll add some words to make it certain we
> only recognize unsigned numbers.
> Otherwise, it would become confusing when comparing something like
> "2017-07-28" and "2017-07-29".
>
>> Also, I don't see test cases for strings like: "abc200123" and
>> "abc20000123" which share
>> a prefix part of which is numeric but differ in the number of
>> "leading" zeros after the prefix.
>>
> Sure.  Good addition to the test.
>
>> What about strings that include more than 1 numeric segment:
>> abc2003def0001 and abc02003def001
>> in the leading zero case?
>>
> With the first comparator (which treats the numbers with more leading
> zeroes as greater ones), these strings would be sorted as
> "abc2003def0001" < "abc02003def001".
> The logic here is as following:
> 1) skip common prefix "abc",
> 2) find the numerical parts: "2003" and "02003",
> 3) after skipping leading zeroes, they are compared to be equal,
> 4) then, the string with more leading zeroes is said to be greater.
>
>> Though the test adds the @key randomness, it would be useful to use
>> the test library
>> mechanisms to report the seed and be able to run the test with a
>> seed, if case they fail.
>> (As might be provided by the test library jdk.test.lib.RandomFactory).
>>
> Okay, I'll add this Random to the shuffling to make it reproducible.
>>
>> Comparator.java:
>> 576: "the common prefix is skipped" is problematic when considering
>> leading zeros.
>>    The common prefix may contain non-zero digits.
> Yes, of course.
> While scanning the string for the common prefix, we still keep track
> of the numeric part.
> Probably "skip" is a wrong word here.
>
>> 585: it is not clear whether the "different number of leading zeros"
>> is applied regardless
>>    of the common prefix or only starting after the common prefix.
>>
> When talking about leading zeroes, then the entire numerical substring
> is meant.
> Part of this substring can belong to the common prefix.
>
>> 550, 586: for many readers, it is easier to read 'for example' than
>> "e.g." or "i.e.".
>>
> Thanks!  I'll change it accordingly.
>
>> 562, 598:  editorial: "is to compare" -> "compares"
>>
> Thanks!
>> Comparators.java:
>>
>> 192: @param for param o is missing;  (the param name "o" usually
>> refers to an object, not a string).
>> 200:   Can skipLeadingZeros be coded to correctly work if cnt == 0;
>> it would be more robust
>>      SkipLeading zeros works correctly only if pos is given the first
>> numeric digit in the subsequence
>>      so the numStart1 and numStart2 must be first digit in each string.
> I don't think that skipLeadingZeros() is very specific that it always
> called for longer string, trying to reduce its size via skipping
> leading zeroes.
> It wouldn't make sense to call it with cnt == 0, and so we can avoid
> one initial comparing and save a couple of nanoseconds :)
> I added this prerequisite to the javadoc of the method, so hopefully
> it shouldn't cause much confusion.
>
>> compare():
>>
>> Line 223: if strings typically have non-numeric prefixes, it might
>> perform slightly better
>> to detect the end of the common prefix and then scan back to find the
>> beginning of the numeric part.
>> (Avoiding checking every char for isDigit).
>>
> On the other hand, we're saving on not calling String.charAt() for the
> second time :)
>
>> Line 224: If assigned for every digit, it will hold the last equal
>> digit in the common prefix, not the first digit.
> It will only be assigned when a non-digit is met.
> And since the index was just incremented @ Line 222, numStart1 will be
> set to the index of the first digit inside the common prefix.
>
> For example, if the common prefix is abc123, then the numStart1 will
> be updated to 3 when looking at the char 'c'.  Last three cycles of
> the loop it won't be updated, since all the trailing chars are digits.
>
>>
>> 239, 240:  chars at o1(index) and o2(index) are already known to be
>> digits, can it be avoided checking them twice
>> by starting at index+1?
>>
> Not quite necessarily.  We can be here due to numLen1 > 0, and *only
> one* or *none* of the string remainders start with digits.
> In the later case we'll end up @ line 279.
>
> Here's the updated webrev:
> http://cr.openjdk.java.net/~igerasim/8134512/03/webrev/
>
> Given the further discussion this iteration is probably not the final,
> but it's better to have a checkpoint :)
>
> With kind regards,
> Ivan
>
>> $.02, Roger
>>
>>
>> On 7/19/2017 4:41 AM, Ivan Gerasimov wrote:
>>> Hello!
>>>
>>> It is a proposal to provide a String comparator, which will pay
>>> attention to the numbers embedded into the strings (should they
>>> present).
>>>
>>> This proposal was initially discussed back in 2014 and seemed to
>>> bring some interest from the community:
>>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-December/030343.html 
>>>
>>>
>>> In the latest webrev two methods are added to the public API:
>>> j.u.Comparator.comparingNumerically() and
>>> j.u.Comparator.comparingNumericallyLeadingZerosAhead().
>>>
>>> The regression test is extended to exercise this new comparator.
>>>
>>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8134512
>>> WEBREV: http://cr.openjdk.java.net/~igerasim/8134512/01/webrev/
>>>
>>> Comments, suggestions are very welcome!
>>>
>>
>>
>

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

Jonathan Bluett-Duncan
In reply to this post by Ivan Gerasimov
Hi Ivan,

It looks like the MyComparator code example which you gave in your last
email lost its formatting along the way, so I'm finding it difficult to
read. Would you mind resubmitting it?

Cheers,
Jonathan

On 28 July 2017 at 17:25, Ivan Gerasimov <[hidden email]> wrote:

> Hi Peter!
>
> Thank a lot for looking into this!
>
> On 7/28/17 7:32 AM, Peter Levart wrote:
>
>> Hi Ivan,
>>
>> In the light of what Stuart Marks wrote, then what do you think about a
>> parameterized comparator (would be sub-optimal, I know) where one would
>> supply
>> 2 Comparator(s) which would be used to compare Ax and Nx sub-sequences
>> respectively as described below...
>>
>> Yes.  Inspired by what Stuart suggested I made a draft of such a
> comparator (see below).
> It works, but as you've said it's not that efficient (mostly due to
> expensive substrings) and a bit harder to use in a simple case.
> Now I need to think about how to combine two approaches.
>
> For Nx sub-sequences, one would need a comparator comparing the reversed
>> sequence lexicographically,
>>
> I'm not sure I understand why they need to be reversed.
>
>> but for Ax sub-sequences, one could choose from a plethora of
>> case-(in)sensitive comparator(s) and collators already available on the
>> platform.
>>
>> Yes. In the example below I used compareToIgnoreCase to compare alpha
> subsequences.
>
> -------
>
> class MyComparator implements Comparator<String> {Comparator<String>
> alphaCmp;Comparator<String> numCmp;MyComparator(Comparator<String>
> alphaCmp,Comparator<String> numCmp) {this.alphaCmp = alphaCmp;this.numCmp =
> numCmp;}boolean skipLeadingZeroes(String s, int len) {for (int i = 0; i <
> len ; ++i) {if (Character.digit(s.charAt(i), 10) != 0)return false;}return
> true;}@Override public int compare(String o1, String o2)
> {Supplier<String> s1 = new NumberSlicer(o1);Supplier<String> s2 = new
> NumberSlicer(o2);while (true) {// alpha part String ss1 = s1.get();String
> ss2 = s2.get();int cmp = alphaCmp.compare(ss1, ss2);if (cmp != 0) return
> cmp;if (ss1.length() == 0) return 0;// numeric part ss1 = s1.get();ss2 =
> s2.get();int len1 = ss1.length();int len2 = ss2.length();int dlen = len1 -
> len2;if (dlen > 0) {if (!skipLeadingZeroes(ss1, dlen))return 1;ss1 =
> ss1.substring(dlen, len1);} else if (dlen < 0) {if (!skipLeadingZeroes(ss2,
> -dlen))return -1;ss2 = ss2.substring(-dlen, len2);}cmp =
> numCmp.compare(ss1, ss2);if (cmp != 0) return cmp;if (dlen != 0) return
> dlen;}}static class NumberSlicer implements Supplier<String> {private
> String sequence;private boolean expectNumber = false;private int index =
> 0;NumberSlicer(String s) {sequence = s;}@Override public String get()
> {int start = index, end = start, len = sequence.length() - start;for (; len
> > 0; ++end, --len) {if (Character.isDigit(sequence.charAt(end)) ^
> expectNumber)break;}expectNumber = !expectNumber;return
> sequence.substring(start, index = end);}}}------------Here how it is
> invoked with case-insensitive comparator:Arrays.sort(str,new
> MyComparator(Comparator.comparing(String::toString,String::
> compareToIgnoreCase),Comparator.comparing(String::toString,
> String::compareTo)));------------
>
> simple test results for case insensitive sorting:java 1java 1 javajava 1
> JAVAJava 2JAVA 5jaVA 6.1java 10java 10 v 13Java 10 v 013Java 10 v
> 000013java 10 v 113Java 2017Java 2017Java 20017Java 200017Java 2000017Java
> 20000017Java 20000017Java 200000017With kind regards, Ivan
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

Peter Levart
In reply to this post by Ivan Gerasimov
Hi Ivan,

On 07/28/2017 06:25 PM, Ivan Gerasimov wrote:

>
> Hi Peter!
>
> Thank a lot for looking into this!
>
> On 7/28/17 7:32 AM, Peter Levart wrote:
>> Hi Ivan,
>>
>> In the light of what Stuart Marks wrote, then what do you think about
>> a parameterized comparator (would be sub-optimal, I know) where one
>> would supply
>> 2 Comparator(s) which would be used to compare Ax and Nx
>> sub-sequences respectively as described below...
>>
> Yes.  Inspired by what Stuart suggested I made a draft of such a
> comparator (see below).
> It works, but as you've said it's not that efficient (mostly due to
> expensive substrings) and a bit harder to use in a simple case.
> Now I need to think about how to combine two approaches.

There is a bug in MyComparator. Comparing "1" with "2" gives 0 (equal).
You must not return "equal" when both alpha prefixes are empty.

>
>> For Nx sub-sequences, one would need a comparator comparing the
>> reversed sequence lexicographically,
> I'm not sure I understand why they need to be reversed.

Scrap that. I got confused.  :-o

>> but for Ax sub-sequences, one could choose from a plethora of
>> case-(in)sensitive comparator(s) and collators already available on
>> the platform.
>>
> Yes. In the example below I used compareToIgnoreCase to compare alpha
> subsequences.
...
> Arrays.sort(str,new
> MyComparator(Comparator.comparing(String::toString,String::compareToIgnoreCase),Comparator.comparing(String::toString,String::compareTo)));

Substrings are by copy. There is another approach. Originally you
created the NumericComparator to compare CharSequence(s) - not String(s)
as you did with MyComparator. So why not keeping that?

While talking about types, you don't need to have a generic parameter:

     NumericComparator<T extends CharSequence> implements CharSequence<T>

simple:

     NumericComparator implements Comparator<CharSequence>

is suitable for comparing any subtype of CharSequence as use-sites
should accept Comparator<? super String> or Comparator<? super
CharSequence>. For example:

     Stream<T> sorted(Comparator<? super T> comparator);

You can use Comparator<CharSequence> to sort Stream<String> as well as
Stream<StringBuilder> or Stream<CharSequence>...


I tried an approach where sub-sequences are CharSequence objects by
delegation and use sub-comparators that don't convert CharSequence(s) to
String(s). Here's what this looks like:

http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/peter/AlphadecimalComparator.java

with supporting sub-comparator classes:

http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/peter/CharSequenceComparator.java
http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/peter/DecimalComparator.java

I created a JMH  benchmark comparing your 02/webrev
(ivan.NumericComparator):

http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/ivan/NumericComparator.java

with a modified variant of your MyComparator that works correctly
(ivan.MyComparator):

http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/ivan/MyComparator.java

and my (peter.AlphadecimalComparator). The later two are tested with
case-sensitive (CS) and case-insensitive (CIS) variants for comparing
alpha sub-sequences.

Here are the results:

http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/bench_results_speed.txt

With -prof gc, it can be seen that JIT is able to optimize-away
allocation of sub-sequence CharSequence(s). When the hot code is JIT-ed,
at least with provided helper sub-comparators, no allocation is taking
place (not so with MyComparator which creates substrings):

http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/bench_results_gc.txt

The performance of AlphadecimalComparator is about 50% slower than your
specialized NumericComparator, but it allows custom parametrization.
Perhaps the decimal part of variations could be hidden behind a boolean
or an enum parameter (there is a leading-zeroes-greater,
leading-zeroes-less, but also a leading-zeroes-equal variant that makes
sense). I tried to do that by making DecimalComparator an enum and
restricting the decimalComparato parameter to that type, but maybe
specializing code around a boolean flag might yield better performance
and DecimalComparator as Comparator<CharSequence> doesn't make much
sense outside the AlphadecimalComparator.

So what do you think? Also, what do you think about the name?

Regards, Peter

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

Peter Levart
Hi Ivan,

On 07/29/2017 07:22 PM, Peter Levart wrote:
> I tried an approach where sub-sequences are CharSequence objects by
> delegation and use sub-comparators that don't convert CharSequence(s)
> to String(s). Here's what this looks like:
>
> http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/peter/AlphadecimalComparator.java
>
> Here are the results:
>
> http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/bench_results_speed.txt

A simple optimization of above code (manually unrolling the
alpha/decimal alteration - i.e. getting rid of 'alpha' boolean flag):

before:

         boolean alpha = true;
         while (true) {
             if (ss1.next(alpha)) {
                 if (ss2.next(alpha)) {
                     Comparator<? super CharSequence>
                         comparator = alpha
                                      ? alphaComparator
                                      : decimalComparator;
                     int cmp = comparator.compare(ss1, ss2);
                     if (cmp != 0) {
                         return cmp;
                     }
                 } else {
                     return 1;
                 }
             } else {
                 if (ss2.next(alpha)) {
                     return -1;
                 } else {
                     return 0;
                 }
             }
             alpha = !alpha;
         }

after:

         while (true) {
             // alpha part
             if (ss1.next(true)) {
                 if (ss2.next(true)) {
                     int cmp = alphaComparator.compare(ss1, ss2);
                     if (cmp != 0) {
                         return cmp;
                     }
                 } else {
                     return 1;
                 }
             } else {
                 if (ss2.next(true)) {
                     return -1;
                 } else {
                     return 0;
                 }
             }
             // decimal part
             if (ss1.next(false)) {
                 if (ss2.next(false)) {
                     int cmp = decimalComparator.compare(ss1, ss2);
                     if (cmp != 0) {
                         return cmp;
                     }
                 } else {
                     return 1;
                 }
             } else {
                 if (ss2.next(false)) {
                     return -1;
                 } else {
                     return 0;
                 }
             }
         }

...yields a noticeable spead-up:

// reference:

Benchmark             (_strings)            (comparator)  Mode Cnt      
Score     Error  Units
ComparatorBench.test    strings1  ivan.NumericComparator  avgt 10  
1455.789 ±  37.158  ns/op
ComparatorBench.test    strings2  ivan.NumericComparator  avgt   10
13680.131 ± 338.763  ns/op

// before explicit alpha/decimal alteration unrolling:

Benchmark             (_strings)                     (comparator) Mode  
Cnt      Score     Error  Units
ComparatorBench.test    strings1  peter.AlphadecimalComparator.CS avgt  
10   2065.324 ±  50.532  ns/op
ComparatorBench.test    strings2  peter.AlphadecimalComparator.CS avgt  
10  21118.430 ± 350.769  ns/op

// after explicit alpha/decimal alteration unrolling:

Benchmark             (_strings)                     (comparator) Mode  
Cnt      Score      Error  Units
ComparatorBench.test    strings1  peter.AlphadecimalComparator.CS avgt  
10   1692.145 ±    2.160  ns/op
ComparatorBench.test    strings2  peter.AlphadecimalComparator.CS avgt  
10  18624.677 ± 1900.779  ns/op


With this mod, AlphadecimalComparator is only about 16 - 35% slower than
NumericComparator.


Regards, Peter

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

Ivan Gerasimov
In reply to this post by Peter Levart
Thank you Peter for sharing your thoughts and experiments!


On 7/29/17 10:22 AM, Peter Levart wrote:

> Hi Ivan,
>
> On 07/28/2017 06:25 PM, Ivan Gerasimov wrote:
>>
>> Hi Peter!
>>
>> Thank a lot for looking into this!
>>
>> On 7/28/17 7:32 AM, Peter Levart wrote:
>>> Hi Ivan,
>>>
>>> In the light of what Stuart Marks wrote, then what do you think
>>> about a parameterized comparator (would be sub-optimal, I know)
>>> where one would supply
>>> 2 Comparator(s) which would be used to compare Ax and Nx
>>> sub-sequences respectively as described below...
>>>
>> Yes.  Inspired by what Stuart suggested I made a draft of such a
>> comparator (see below).
>> It works, but as you've said it's not that efficient (mostly due to
>> expensive substrings) and a bit harder to use in a simple case.
>> Now I need to think about how to combine two approaches.
>
> There is a bug in MyComparator. Comparing "1" with "2" gives 0
> (equal). You must not return "equal" when both alpha prefixes are empty.
>
Yes, you're right.  Should have checked the length of numerical
substrings, or just use a separate method for end-of-sequence.

>>
>>> For Nx sub-sequences, one would need a comparator comparing the
>>> reversed sequence lexicographically,
>> I'm not sure I understand why they need to be reversed.
>
> Scrap that. I got confused.  :-o
>
>>> but for Ax sub-sequences, one could choose from a plethora of
>>> case-(in)sensitive comparator(s) and collators already available on
>>> the platform.
>>>
>> Yes. In the example below I used compareToIgnoreCase to compare alpha
>> subsequences.
> ...
>> Arrays.sort(str,new
>> MyComparator(Comparator.comparing(String::toString,String::compareToIgnoreCase),Comparator.comparing(String::toString,String::compareTo)));
> Substrings are by copy. There is another approach. Originally you
> created the NumericComparator to compare CharSequence(s) - not
> String(s) as you did with MyComparator. So why not keeping that?
Well, it was mostly for illustrative purposes.  Of course the final code
should deal with CharSequences, not Strings for generality. However, I
assume that in most cases this comparator will need to work with
Strings, and it's important to remember that String.subSequence() is as
expensive as String.substring() [1] [1]
http://docs.oracle.com/javase/8/docs/api/java/lang/String.html#subSequence-int-int- 

> While talking about types, you don't need to have a generic parameter:
>     NumericComparator<T extends CharSequence> implements
> CharSequence<T> simple:     NumericComparator implements
> Comparator<CharSequence> is suitable for comparing any subtype of
> CharSequence as use-sites should accept Comparator<? super String> or
> Comparator<? super CharSequence>.
But it wouldn't work in a case you need to pass the comparator to a
function of form  fun(Comparator<String> cmp).

> For example:     Stream<T> sorted(Comparator<? super T> comparator);
> You can use Comparator<CharSequence> to sort Stream<String> as well as
> Stream<StringBuilder> or Stream<CharSequence>... I tried an approach
> where sub-sequences are CharSequence objects by delegation and use
> sub-comparators that don't convert CharSequence(s) to String(s).
> Here's what this looks like:
> http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/peter/AlphadecimalComparator.java 
> with supporting sub-comparator classes:
> http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/peter/CharSequenceComparator.java 
> http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/peter/DecimalComparator.java 
> I created a JMH  benchmark comparing your 02/webrev
> (ivan.NumericComparator):
> http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/ivan/NumericComparator.java 
> with a modified variant of your MyComparator that works correctly
> (ivan.MyComparator):
> http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/src/main/java/ivan/MyComparator.java 
> and my (peter.AlphadecimalComparator). The later two are tested with
> case-sensitive (CS) and case-insensitive (CIS) variants for comparing
> alpha sub-sequences. Here are the results:
> http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/bench_results_speed.txt 
> With -prof gc, it can be seen that JIT is able to optimize-away
> allocation of sub-sequence CharSequence(s). When the hot code is
> JIT-ed, at least with provided helper sub-comparators, no allocation
> is taking place (not so with MyComparator which creates substrings):
> http://cr.openjdk.java.net/~plevart/misc/AlphadecimalComparator/alphadecimal-comparator/bench_results_gc.txt 

That's a very good news then!

> The performance of AlphadecimalComparator is about 50% slower than
> your specialized NumericComparator, but it allows custom
> parametrization. Perhaps the decimal part of variations could be
> hidden behind a boolean or an enum parameter (there is a
> leading-zeroes-greater, leading-zeroes-less, but also a
> leading-zeroes-equal variant that makes sense). I tried to do that by
> making DecimalComparator an enum and restricting the decimalComparato
> parameter to that type, but maybe specializing code around a boolean
> flag might yield better performance and DecimalComparator as
> Comparator<CharSequence> doesn't make much sense outside the
> AlphadecimalComparator. So what do you think? Also, what do you think
> about the name? Regards, Peter
I've tried to go one step further and created even more abstract
comparator:  It uses a supplied predicate to decompose the input
sequences into odd/even subsequences (e.g. alpha/numeric) and then uses
two separate comparator to compare them. Additionally, a comparator for
comparing sequences, consisting only of digits is provided. For example,
to build a case-insensitive AlphaDecimal comparator one could use: 1)
Character::isDigit -- as the predicate for decomposing, 2)
String::compareToIgnoreCase -- to compare alpha (i.e. odd parts); to
work with CharSequences one would need to make it
Comparator.comparing(CharSequence::toString,
String::compareToIgnoreCase), 3) The special decimal-only comparator,
which compares the decimal representation of the sequences. Here's the
file with all the comparators and a simple test:
http://cr.openjdk.java.net/~igerasim/8134512/test/Test.java

--
With kind regards,
Ivan Gerasimov
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

Ivan Gerasimov
In reply to this post by Jonathan Bluett-Duncan
On 7/28/17 12:16 PM, Jonathan Bluett-Duncan wrote:

> Hi Ivan,
>
> It looks like the MyComparator code example which you gave in your
> last email lost its formatting along the way, so I'm finding it
> difficult to read. Would you mind resubmitting it?
>

Oh, sorry about that.
I've just uploaded another modification of the comparator here:
http://cr.openjdk.java.net/~igerasim/8134512/test/Test.java

With kind regards,
Ivan

> Cheers,
> Jonathan
>
> On 28 July 2017 at 17:25, Ivan Gerasimov <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     Hi Peter!
>
>     Thank a lot for looking into this!
>
>     On 7/28/17 7:32 AM, Peter Levart wrote:
>
>         Hi Ivan,
>
>         In the light of what Stuart Marks wrote, then what do you
>         think about a parameterized comparator (would be sub-optimal,
>         I know) where one would supply
>         2 Comparator(s) which would be used to compare Ax and Nx
>         sub-sequences respectively as described below...
>
>     Yes.  Inspired by what Stuart suggested I made a draft of such a
>     comparator (see below).
>     It works, but as you've said it's not that efficient (mostly due
>     to expensive substrings) and a bit harder to use in a simple case.
>     Now I need to think about how to combine two approaches.
>
>         For Nx sub-sequences, one would need a comparator comparing
>         the reversed sequence lexicographically,
>
>     I'm not sure I understand why they need to be reversed.
>
>         but for Ax sub-sequences, one could choose from a plethora of
>         case-(in)sensitive comparator(s) and collators already
>         available on the platform.
>
>     Yes. In the example below I used compareToIgnoreCase to compare
>     alpha subsequences.
>
>     -------
>
>     class MyComparator implements Comparator<String>
>     {Comparator<String> alphaCmp;Comparator<String>
>     numCmp;MyComparator(Comparator<String> alphaCmp,Comparator<String>
>     numCmp) {this.alphaCmp = alphaCmp;this.numCmp = numCmp;}boolean
>     skipLeadingZeroes(String s, int len) {for (int i = 0; i < len ;
>     ++i) {if (Character.digit(s.charAt(i), 10) != 0)return
>     false;}return true;}@Override public int compare(String o1, String
>     o2) {Supplier<String> s1 = new NumberSlicer(o1);Supplier<String>
>     s2 = new NumberSlicer(o2);while (true) {// alpha part String ss1 =
>     s1.get();String ss2 = s2.get();int cmp = alphaCmp.compare(ss1,
>     ss2);if (cmp != 0) return cmp;if (ss1.length() == 0) return 0;//
>     numeric part ss1 = s1.get();ss2 = s2.get();int len1 =
>     ss1.length();int len2 = ss2.length();int dlen = len1 - len2;if
>     (dlen > 0) {if (!skipLeadingZeroes(ss1, dlen))return 1;ss1 =
>     ss1.substring(dlen, len1);} else if (dlen < 0) {if
>     (!skipLeadingZeroes(ss2, -dlen))return -1;ss2 =
>     ss2.substring(-dlen, len2);}cmp = numCmp.compare(ss1, ss2);if (cmp
>     != 0) return cmp;if (dlen != 0) return dlen;}}static class
>     NumberSlicer implements Supplier<String> {private String
>     sequence;private boolean expectNumber = false;private int index =
>     0;NumberSlicer(String s) {sequence = s;}@Override public String
>     get() {int start = index, end = start, len = sequence.length() -
>     start;for (; len > 0; ++end, --len) {if
>     (Character.isDigit(sequence.ch <http://sequence.ch>arAt(end)) ^
>     expectNumber)break;}expectNumber = !expectNumber;return
>     sequence.substring(start, index = end);}}}------------Here how it
>     is invoked with case-insensitive comparator:Arrays.sort(str,new
>     MyComparator(Comparator.comparing(String::toString,String::compareToIgnoreCase),Comparator.comparing(String::toString,String::compareTo)));------------
>
>     simple test results for case insensitive sorting:java 1java 1
>     javajava 1 JAVAJava 2JAVA 5jaVA 6.1java 10java 10 v 13Java 10 v
>     013Java 10 v 000013java 10 v 113Java 2017Java 2017Java 20017Java
>     200017Java 2000017Java 20000017Java 20000017Java 200000017With
>     kind regards, Ivan
>
>

--
With kind regards,
Ivan Gerasimov

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [10] RFR 8134512 : provide Alpha-Numeric (logical) Comparator

Stuart Marks
In reply to this post by Ivan Gerasimov
On 8/1/17 11:56 PM, Ivan Gerasimov wrote:

> I've tried to go one step further and created even more abstract comparator:  It
> uses a supplied predicate to decompose the input sequences into odd/even
> subsequences (e.g. alpha/numeric) and then uses two separate comparator to
> compare them. Additionally, a comparator for comparing sequences, consisting
> only of digits is provided. For example, to build a case-insensitive
> AlphaDecimal comparator one could use: 1) Character::isDigit -- as the predicate
> for decomposing, 2) String::compareToIgnoreCase -- to compare alpha (i.e. odd
> parts); to work with CharSequences one would need to make it
> Comparator.comparing(CharSequence::toString, String::compareToIgnoreCase), 3)
> The special decimal-only comparator, which compares the decimal representation
> of the sequences. Here's the file with all the comparators and a simple test:
> http://cr.openjdk.java.net/~igerasim/8134512/test/Test.java

Hi, a couple follow-up thoughts on this.

1) Supplementary characters

The current code uses Character.isDigit(char), which works only for char values
in the BMP (basic multilingual plane, values <= U+FFFF). It won't work for
supplementary characters. There are several blocks of digits in the BMP, but
there are several more in the supplementary character range.

I don't see any reason not to handle the supplementary characters as well,
except that it spoils the nice char-by-char technique of processing the string.
Instead, it'd have to pull in code point values, which might be comprised of two
surrogate chars. There are a variety of methods on Character that help with
this. Note that there is an overload Character.isDigit(int) which takes any code
point value, including supplementary characters.

2) Too much generality?

This version includes Predicate<Character> for determining whether a character
is part of the alphabetic or decimal portion of the string. I'm thinking this
might be overkill. It might be sufficient to "hardwire" the partitioning
predicate to be Character::isDigit and the value mapping function to use
Character::digit.

The problem is that adding a predicate opens the door to a lot more complexity,
while providing dimishing value. First, the predicate would have to handle code
points (per the above) so it'd need to be an IntPredicate. Second, there would
also need to be a mapping function from the code point value to a numeric value.
This might be an IntUnaryOperator. This would allow someone to sort based on
Roman numerals, using Character::getNumericValue. (Yes, Roman numerals are in
Unicode.) Or maybe the mapping function should return any Comparable value, not
an int. ... See where I'm going here?

Since this kind of sorting is intended to be viewed by people, it's probably
worth providing full internationalization support (supplementary characters, and
delegation to sub-comparators, to allow locale-specific collating sequences).
But I start to question any complexity beyond that.

s'marks
Loading...