Java needs an immutable byte array wrapper

Keith Turner
While trying to design an API for Fluo, it's become clear to me that
Java could really benefit from an immutable byte array wrapper.
Something like java.lang.String except for byte arrays instead of char
arrays.  It would be nice if this new type interoperated well with
byte[], String, ByteBuffer, InputStream, OutputStream etc.
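
To make the request concrete, here is a minimal sketch of what such a type might look like. The class name `ImmutableBytes` and all of its methods are hypothetical; this is not the API of Fluo's Bytes or protobuf's ByteString, only an illustration of the String-like contract (defensive copies in and out, stable equals/hashCode, read-only ByteBuffer view).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of an immutable byte sequence; not an existing JDK type.
final class ImmutableBytes implements Comparable<ImmutableBytes> {
    private final byte[] data; // never exposed, never modified after construction
    private int hash;          // lazily cached, similar in spirit to String

    private ImmutableBytes(byte[] data) { this.data = data; }

    // Copy on the way in so later changes to the caller's array are not visible.
    static ImmutableBytes of(byte[] src) { return new ImmutableBytes(src.clone()); }

    static ImmutableBytes of(String s) {
        return new ImmutableBytes(s.getBytes(StandardCharsets.UTF_8));
    }

    int length() { return data.length; }
    byte byteAt(int i) { return data[i]; }

    // Copy on the way out for the same reason.
    byte[] toArray() { return data.clone(); }

    // A read-only view avoids a copy; callers cannot write through it.
    ByteBuffer toByteBuffer() { return ByteBuffer.wrap(data).asReadOnlyBuffer(); }

    @Override public boolean equals(Object o) {
        return o instanceof ImmutableBytes && Arrays.equals(data, ((ImmutableBytes) o).data);
    }

    @Override public int hashCode() {
        int h = hash;
        if (h == 0 && data.length > 0) { h = Arrays.hashCode(data); hash = h; }
        return h;
    }

    // Unsigned lexicographic order, so the type can also be a sorted-map key.
    @Override public int compareTo(ImmutableBytes other) {
        int n = Math.min(data.length, other.data.length);
        for (int i = 0; i < n; i++) {
            int c = (data[i] & 0xff) - (other.data[i] & 0xff);
            if (c != 0) return c;
        }
        return data.length - other.data.length;
    }
}
```

Because the array is copied at construction and never handed out, the hash of an instance cannot change after it has been used as a map key.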

I wrote the following blog post about my experiences with this issue
while designing an API for Fluo.

http://fluo.apache.org/blog/2016/11/10/immutable-bytes/

Is there any reason something like this should not be added to Java?

Thanks,

Keith

Re: Java needs an immutable byte array wrapper

Roman Kennke-2
Am Samstag, den 12.11.2016, 11:45 -0500 schrieb Keith Turner:

> While trying to design an API for Fluo, it's become clear to me that
> Java could really benefit from an immutable byte array wrapper.
> Something like java.lang.String except for byte arrays instead of
> char
> arrays.  It would be nice if this new type interoperated well with
> byte[], String, ByteBuffer, InputStream, OutputStream etc.
>
> I wrote the following blog post about my experiences with this issue
> while designing an API for Fluo.
>
> http://fluo.apache.org/blog/2016/11/10/immutable-bytes/
>
> Is there any reason something like this should not be added to Java?

You mean something like NIO ByteBuffers and related APIs?

Roman

Re: Java needs an immutable byte array wrapper

Keith Turner
On Sat, Nov 12, 2016 at 12:04 PM, Roman Kennke <[hidden email]> wrote:

> Am Samstag, den 12.11.2016, 11:45 -0500 schrieb Keith Turner:
>> While trying to design an API for Fluo, it's become clear to me that
>> Java could really benefit from an immutable byte array wrapper.
>> Something like java.lang.String except for byte arrays instead of
>> char
>> arrays.  It would be nice if this new type interoperated well with
>> byte[], String, ByteBuffer, InputStream, OutputStream etc.
>>
>> I wrote the following blog post about my experiences with this issue
>> while designing an API for Fluo.
>>
>> http://fluo.apache.org/blog/2016/11/10/immutable-bytes/
>>
>> Is there any reason something like this should not be added to Java?
>
> You mean something like NIO ByteBuffers and related APIs?

As I discussed in the blog post, ByteBuffer does not fit the bill for
what I need.  In the blog post I have the following little program as
an example to show that ByteBuffer is not immutable in the way String
is.

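  byte[] bytes1 = new byte[] {1,2,3,(byte)250};
  ByteBuffer bb1 = ByteBuffer.wrap(bytes1).asReadOnlyBuffer();

  System.out.println(bb1.hashCode());
  bytes1[2]=89;   // the read-only view still shares the backing array
  System.out.println(bb1.hashCode());   // hash differs from the first one
  bb1.get();      // advances the buffer position
  System.out.println(bb1.hashCode());   // differs again: only remaining bytes are hashed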

I would not want to use ByteBuffer as a map key.  It would be nice if Java
had something like ByteString[1] or Bytes[2].  Having a type like that
in Java would allow it to be used in library APIs and avoid copies
between multiple implementations of an immutable byte array wrapper.
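
The map-key concern can be shown directly. The following is a small sketch (the class and method names are made up for illustration) that puts a read-only ByteBuffer into a HashMap, changes the backing array, and then looks up the very same key object.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

final class ByteBufferKeyDemo {
    // Returns what the map yields for the same key object after the backing array changes.
    static String lookupAfterMutation() {
        byte[] bytes = {1, 2, 3, (byte) 250};
        ByteBuffer key = ByteBuffer.wrap(bytes).asReadOnlyBuffer();

        Map<ByteBuffer, String> map = new HashMap<>();
        map.put(key, "value");

        bytes[2] = 89;       // the read-only view still shares this array
        return map.get(key); // the key's hash no longer matches the one recorded at put()
    }

    public static void main(String[] args) {
        System.out.println(lookupAfterMutation()); // prints null
    }
}
```

The entry is still in the map, but it can no longer be found through its own key, which is the behavior an immutable wrapper would rule out.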

[1]: https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/ByteString
[2]: https://static.javadoc.io/org.apache.fluo/fluo-api/1.0.0-incubating/org/apache/fluo/api/data/Bytes.html

>
> Roman
>

Re: Java needs an immutable byte array wrapper

Peter Lawrey-3
Java 9 String has a byte [] at its core. I suspect it's not appropriate but
worth thinking about.

We have a BytesStore class which wraps bytes on or off heap which can be
used for keys.


Re: Java needs an immutable byte array wrapper

John Rose-3
In reply to this post by Keith Turner
On Nov 12, 2016, at 8:45 AM, Keith Turner <[hidden email]> wrote:

>
> While trying to design an API for Fluo, it's become clear to me that
> Java could really benefit from an immutable byte array wrapper.
> Something like java.lang.String except for byte arrays instead of char
> arrays.  It would be nice if this new type interoperated well with
> byte[], String, ByteBuffer, InputStream, OutputStream etc.
>
> I wrote the following blog post about my experiences with this issue
> while designing an API for Fluo.
>
> http://fluo.apache.org/blog/2016/11/10/immutable-bytes/

That's a good blog entry; thanks, especially the pointer to ByteString.

Of course Java needs a type like that, but our story for immutability
is still in flux, so folks are being cautious about adopting such features.

In a similar vein, I would like to see the ability to freeze Java arrays
(make them immutable), and (independently) add more API points
to them.  But the ideas are not fully baked yet.

See also this application for immutable bytes:
  https://bugs.openjdk.java.net/browse/JDK-8161256

— John


Re: Java needs an immutable byte array wrapper

Keith Turner
On Sat, Nov 12, 2016 at 1:04 PM, John Rose <[hidden email]> wrote:

> On Nov 12, 2016, at 8:45 AM, Keith Turner <[hidden email]> wrote:
>
>
> While trying to design an API for Fluo, it's become clear to me that
> Java could really benefit from an immutable byte array wrapper.
> Something like java.lang.String except for byte arrays instead of char
> arrays.  It would be nice if this new type interoperated well with
> byte[], String, ByteBuffer, InputStream, OutputStream etc.
>
> I wrote the following blog post about my experiences with this issue
> while designing an API for Fluo.
>
> http://fluo.apache.org/blog/2016/11/10/immutable-bytes/
>
>
> That's a good blog entry; thanks, especially the pointer to ByteString.
>
> Of course Java needs a type like that, but our story for immutability
> is still in flux, so folks are being cautious about adopting such features.
>
> In a similar vein, I would like to see the ability to freeze Java arrays
> (make them immutable), and (independently) add more API points

Is the concept of freezing byte arrays written up anywhere?

> to them.  But the ideas are not fully baked yet.
>
> See also this application for immutable bytes:
>   https://bugs.openjdk.java.net/browse/JDK-8161256
>
> — John
>

Re: Java needs an immutable byte array wrapper

Keith Turner
In reply to this post by Peter Lawrey-3
On Sat, Nov 12, 2016 at 12:53 PM, Peter Lawrey <[hidden email]> wrote:
> Java 9 String has a byte [] at its core. I suspect it's not appropriate but
> worth thinking about.

I am not sure; I would have to look into it.

Would there always be a conversion to/from char when creating Strings
from byte[] and when calling String.getBytes()?  I would also like
something that interoperates well with ByteBuffer, InputStream, and
OutputStream for byte sequence data, like protobuf's ByteString and
Fluo's Bytes do.
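
Whatever the internal storage, the public String API decodes bytes through a charset on the way in and encodes on the way out, so a String is not a general-purpose byte container. A short sketch (class and method names are illustrative only) using the same four bytes as the ByteBuffer example:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class StringRoundTrip {
    // byte[] -> String -> byte[] through UTF-8.
    static byte[] viaUtf8(byte[] in) {
        return new String(in, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    // byte[] -> String -> byte[] through ISO-8859-1 (Latin-1).
    static byte[] viaLatin1(byte[] in) {
        return new String(in, StandardCharsets.ISO_8859_1).getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] raw = {1, 2, 3, (byte) 250};
        // false: 0xFA is not valid UTF-8, so the decoder substitutes a replacement character
        System.out.println(Arrays.equals(raw, viaUtf8(raw)));
        // true: every byte value maps to exactly one Latin-1 character
        System.out.println(Arrays.equals(raw, viaLatin1(raw)));
    }
}
```

Latin-1 happens to round-trip arbitrary bytes, but each direction still allocates a new array, and the resulting object carries text semantics rather than byte-sequence semantics.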

>
> We have a BytesStore class which wraps bytes on or off heap which can be
> used for keys.

I suspect many projects roll their own thing for this.


string indexing (was: Java needs an immutable byte array wrapper)

Per Bothner
In reply to this post by Peter Lawrey-3
On 11/12/2016 09:53 AM, Peter Lawrey wrote:
> Java 9 String has a byte [] at its core. I suspect it's not appropriate but
> worth thinking about.

Interesting.  I would be even more interested if they could make codePointAt
and codePointCount be constant-time: A number of programming languages define
a string as a sequence of code-points, and the indexing operator that their standard libraries
provide is basically codePointAt.  Example languages include Python3, Scheme, and the
XQuery/XPath/XSLT family.

Implementing string indexing for such a language on the JVM gives you the unpalatable choice
of either having indexing take linear time, or not using java.lang.String and thus hurting
Java interoperability.

Note it would be easy to change the Java9 String implementation such that codePointAt
was constant-time in the case of BMP-only (no-surrogate) strings.  Just use a bit in
the 'coder' field to indicate that the string is BMP-only.  Doing so would be a
big and easy win for the common BMP-only case, though it doesn't give us
guaranteed constant-time indexing - a single non-BMP character breaks that.
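
From the outside, using only the public String API, the idea looks roughly like the sketch below. The class and method names are hypothetical, and the `bmpOnly` flag stands in for the bit that would be kept inside String; here the caller has to compute it once and pass it along.

```java
final class CodePointIndexing {
    // True when the string holds no surrogate pairs, so char index == code point index.
    // This check is itself linear, so it only pays off if done once and cached.
    static boolean isBmpOnly(String s) {
        return s.codePointCount(0, s.length()) == s.length();
    }

    // General case: offsetByCodePoints walks the string, so this is linear in n.
    static int nthCodePointLinear(String s, int n) {
        return s.codePointAt(s.offsetByCodePoints(0, n));
    }

    // With a precomputed BMP-only flag, the common case is a plain charAt.
    static int nthCodePoint(String s, boolean bmpOnly, int n) {
        return bmpOnly ? s.charAt(n) : nthCodePointLinear(s, n);
    }
}
```

For example, in `"a\uD83D\uDE00b"` the code point at index 2 is `'b'`, even though `charAt(2)` is the low half of a surrogate pair.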

As a compromise I recently implemented an IString class, which gives you O(1)
codepoint indexing while still being compact and implementing CharSequence efficiently:

http://sourceware.org/viewvc/kawa/branches/invoke/gnu/lists/IString.java?view=markup
[Warning: this has not been tested much.]

Still, it would be much nicer if we could use java.lang.String directly.  It wouldn't
be very expensive.  Note that the offsets array in my IString class only adds 0.24 bytes
per 2-byte char, so roughly 12%.  It is possible to encode the Java9 'coder' field using
the IString 'offsets' field (by using a static flag array for the LATIN1 case).
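
One generic way to get bounded-time code point indexing from a small side table is to record the code unit offset of every k-th code point. The sketch below illustrates that general technique only; it is an assumption for illustration, not a description of how IString lays out its offsets array, and the class name and stride are invented.

```java
// Hypothetical sampled index over a String; not the IString implementation.
final class SampledCodePointIndex {
    private static final int STRIDE = 16; // one table entry per 16 code points

    private final String s;
    private final int[] offsets; // offsets[i] = code unit index of code point i * STRIDE
    private final int cpCount;

    SampledCodePointIndex(String s) {
        this.s = s;
        this.cpCount = s.codePointCount(0, s.length());
        this.offsets = new int[(cpCount + STRIDE - 1) / STRIDE];
        int cu = 0;
        for (int cp = 0; cp < cpCount; cp++) {
            if (cp % STRIDE == 0) offsets[cp / STRIDE] = cu;
            cu += Character.charCount(s.codePointAt(cu));
        }
    }

    int length() { return cpCount; }

    // Jumps to the nearest sampled offset, then walks at most STRIDE - 1 code points.
    int codePointAt(int n) {
        if (n < 0 || n >= cpCount) throw new IndexOutOfBoundsException("index " + n);
        int start = offsets[n / STRIDE];
        return s.codePointAt(s.offsetByCodePoints(start, n % STRIDE));
    }
}
```

The stride trades table size against the length of the final walk; a larger stride means fewer ints stored but more code units scanned per lookup.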
--
        --Per Bothner
[hidden email]   http://per.bothner.com/

Re: string indexing (was: Java needs an immutable byte array wrapper)

Zenaan Harkness
On Sat, Nov 12, 2016 at 08:06:55PM -0800, Per Bothner wrote:
> On 11/12/2016 09:53 AM, Peter Lawrey wrote:
> >Java 9 String has a byte [] at its core. I suspect it's not
> >appropriate but worth thinking about.

Time to read up on that, thanks.

> Interesting.  I would be even more interested if they could make
> codePointAt and codePointCount be constant-time: A number of
> programming languages define a string as a sequence of code-points,
> and the indexing operator that their standard libraries provide is
> basically codePointAt.  Example languages include Python3, Scheme, and
> the XQuery/XPath/XSLT family.

Ack.

Although grapheme indexing is probably more generally useful for
multi-lingual UI. Swift basically gets "String" right as far as my
reading of Swift's docs goes - not only code-points, but graphemes, the
next layer of indexing above code-points.
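
For comparison, the JDK does already offer a way to find boundaries above the code point level: `java.text.BreakIterator.getCharacterInstance()`. The sketch below (class name is illustrative) counts those boundaries for a string containing a combining accent. How closely the result tracks Unicode's extended grapheme clusters for more complex sequences depends on the JDK version, so treat it as an approximation rather than full grapheme support.

```java
import java.text.BreakIterator;

final class GraphemeCount {
    // Counts the character boundaries reported by the JDK's BreakIterator.
    static int count(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int n = 0;
        while (it.next() != BreakIterator.DONE) n++;
        return n;
    }

    public static void main(String[] args) {
        String s = "e\u0301x"; // 'e' + COMBINING ACUTE ACCENT, then 'x'
        System.out.println(s.length());                      // 3 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // 3 code points
        System.out.println(count(s));                        // 2 user-perceived characters
    }
}
```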

I cannot speak to Swift's implementation as to storage / time
tradeoffs made.

Trying to create a simple string formatter (left, right, centered) that
was also "multi lingual" led me into the deep dark past of Java's (pre
v1.0) decision to go with UTF-16 (sensible at the time), which for 20
years has been known to be deficient (it was before Java 1.1 that
Unicode ascertained they needed more than 16 bits) and yet
java.lang.String never got updated, at least until recently with Java 9,
which now lays the foundation for a sane string class.

Took me two full working weeks to sort out the mess in my head, so I
wrote up the details of that exploration here:
https://zenaan.github.io/zen/javadoc/zen/lang/string.html
(Note, this was pre-Java 9)

Hopefully by Java 10, 11 or 12, we might see full grapheme support in
Java (as is the case in Swift), now that String is implemented with byte
array storage.


> Implementing string indexing for such a language on the JVM gives you
> the unpalatable choice of either having indexing take linear time, or
> not using java.lang.String and thus hurting Java interoperability.

Can class finality be bypassed at the JVM level?

With byte[] underlying Java 9's String class, code-point and grapheme
indexing could be in a subclass?

The trade off then is between the storage (and construction time) cost
for the extra layers of indexing (code-points, then graphemes on top of
that), vs the run time performance hit for dynamically finding these
index points every time needed. There is no universal "best" option of
course... depends always on the application.


> Note it would be easy to change the Java9 String implementation such
> that codePointAt was constant-time in the case of BMP-only
> (no-surrogate) strings.

I.e. without increasing storage cost. I don't think code-points really
solve the significant problem though (discovery of grapheme boundaries
when one truly needs to handle multiple languages).


> Just use a bit in the 'coder' field to indicate that the string is
> BMP-only.  Doing so would be a big and easy win for the common
> BMP-only case, though it doesn't give us guaranteed constant-time
> indexing - a single non-BMP character breaks that.

Again, my write up highlights the issues with code-points - we have
combining "characters", non displayed "characters" and plenty more
besides - it is graphemes (and non-graphemes) that, at the UI layer at
least, we really need to know about.


> As a compromise I recently implemented an IString class, which gives
> you O(1) codepoint indexing while still being compact and implementing
> CharSequence efficiently:
>
> http://sourceware.org/viewvc/kawa/branches/invoke/gnu/lists/IString.java?view=markup
> [Warning: this has not been tested much.]

Thanks.

"CharSequence" is deceptive. Should be called CodePointSequence or
something else again... "char" is -so- overloaded in Java in particular.


> Still, it would be much nicer if we could use java.lang.String
> directly.  It wouldn't be very expensive.  Note that the offsets array
> in my IString class only adds 0.24 bytes per 2-byte char, so roughly
> 12%.  It is possible to encode the Java9 'coder' field using the
> IString 'offsets' field (by using a static flag array for the LATIN1
> case).

I strongly believe that immutability of byte arrays would provide the
safety that java.lang.String otherwise provides, and that as long as
removing String finality did not significantly impact performance of
code in the wild, the new byte[] String would be entirely sufficient for
one or two additional, and optional indexing layers - one for
code-points, and the top layer for graphemes.

Regards,
Zenaan

Re: string indexing

Per Bothner


On 11/13/2016 04:21 AM, Zenaan Harkness wrote:
> Although grapheme indexing is probably more generally useful for
> multi-lingual UI.

Quite possibly.  However, a code-point can be represented as an unboxed
int.  A grapheme requires memory allocation. You cannot store it in a
register or even a fixed number of registers, unless you use an indirect
substring representation (base string, start offset, end offset), which
has its own problems.

You can always build a grapheme-based API on top of a codepoint API,
but not vice versa. You can of course do the same on top of a UTF16
code-unit API, but it's more error-prone and unnatural: At least
code-points have some natural semantic meaning; code-units do not.

> "CharSequence" is deceptive. Should be called CodePointSequence or
> something else again... "char" is -so- overloaded in Java in particular.

java.lang.CharSequence is *not* a sequence of code-points.
It's a sequence of UTF-16 code-units, just like java.lang.String.
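
A short illustration of the distinction, using only standard JDK calls (the class and helper names are made up for the example):

```java
final class CodeUnitsVsCodePoints {
    // CharSequence.length() counts UTF-16 code units.
    static int codeUnits(CharSequence cs) { return cs.length(); }

    // CharSequence.codePoints() pairs up surrogates into single code points.
    static long codePoints(CharSequence cs) { return cs.codePoints().count(); }

    public static void main(String[] args) {
        CharSequence cs = "a\uD83D\uDE00"; // 'a' followed by U+1F600, outside the BMP

        System.out.println(codeUnits(cs));                            // 3
        System.out.println(codePoints(cs));                           // 2
        System.out.println(Character.isHighSurrogate(cs.charAt(1)));  // true: half of a pair
        System.out.println(Character.codePointAt(cs, 1) == 0x1F600);  // true
    }
}
```

Here `charAt(1)` returns a lone surrogate, which has no meaning on its own, while the code point is available as a plain `int`.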
--
        --Per Bothner
[hidden email]   http://per.bothner.com/

Re: string indexing

Zenaan Harkness
On Sun, Nov 13, 2016 at 05:28:36PM -0800, Per Bothner wrote:

> On 11/13/2016 04:21 AM, Zenaan Harkness wrote:
> >Although grapheme indexing is probably more generally useful for
> >multi-lingual UI.
>
> Quite possibly.  However, a code-point can be represented as an unboxed
> int.  A grapheme requires memory allocation. You cannot store it in a
> register or even a fixed number of registers, unless you use an indirect
> substring representation (base string, start offset, end offset), which
> has its own problems.
>
> You can always build a grapheme-based API on top of a codepoint API,
> but not vice versa. You can of course do the same on top of a UTF16
> code-unit API, but it's more error-prone and unnatural: At least
> code-points have some natural semantic meaning; code-units do not.

Ack.

I would only refer here of course:
http://utf8everywhere.org/

Java is what it is, and String is particularly unfortunate - Java 9's
byte[] implementation is a performance improvement in some situations,
but still messy:

http://stackoverflow.com/questions/38213239/what-is-java-9s-new-string-implementaion
"
Because most usages of Strings are Latin-1 and only require one byte,
Java-9's String will be updated to be implemented under the hood as a
byte array with an encoding flag field to note if it is a byte array. If
the characters are not Latin-1 and require more than one byte it will be
stored as a UTF-16 char array (2 bytes per char) and the flag.
"


> >"CharSequence" is deceptive. Should be called CodePointSequence or
> >something else again... "char" is -so- overloaded in Java in particular.
>
> java.lang.CharSequence is *not* a sequence of code-points.
> It's a sequence of UTF-16 code-units, just like java.lang.String.

All the more reason its name is problematic.

Re: string indexing (was: Java needs an immutable byte array wrapper)

Zenaan Harkness
In reply to this post by Zenaan Harkness
On Sun, Nov 13, 2016 at 11:21:48PM +1100, Zenaan Harkness wrote:

> On Sat, Nov 12, 2016 at 08:06:55PM -0800, Per Bothner wrote:
> > On 11/12/2016 09:53 AM, Peter Lawrey wrote:
> > >Java 9 String has a byte [] at its core. I suspect it's not
> > >appropriate but worth thinking about.
>
> Time to read up on that, thanks.
>
> > Interesting.  I would be even more interested if they could make
> > codePointAt and codePointCount be constant-time: A number of
> > programming languages define a string as a sequence of code-points,
> > and the indexing operator that their standard libraries provide is
> > basically codePointAt.  Example languages include Python3, Scheme, and
> > the XQuery/XPath/XSLT family.
>
> Ack.
>
> Although grapheme indexing is probably more generally useful for
> multi-lingual UI. Swift basically gets "String" right as far as my
> reading of Swift's docs goes - not only code-points, but graphemes, the
> next layer of indexing above code-points.
>
> I cannot speak to Swift's implementation as to storage / time
> tradeoffs made.
>
> Trying to create a simple string formatter (left, right, centered) that
> was also "multi lingual" led me into the deep dark past of Java's (pre
> v1.0) decision to go with UTF-16 (sensible at the time), which for 20
> years has been known to be deficient (prior to Java 1.1 it was when
> Unicode ascertained they needed more than 16 bits) and yet
> java.lang.String never got updated, at least until recently with Java 9,
> which now lays the foundation for a sane string class.
>
> Took me two full working weeks to sort out the mess in my head, so I

That should be 'volunteering weeks' or "working weeks as in 10 days"
or something.  (I donate my time to a human rights cause, and getting
stuck into Java's String was ultimately a pleasant sidetrack from that.)

BTW pre-Java-1.0 was my first foray into the language back in my
university days, and it became my primary choice from that point.
C++ finally "caught up" (on what I consider important) with namespaces
and most recently modules.  Java String's inability to handle graphemes
with any real proficiency has been the proverbial never ending teeth
grinding story for me over the past couple decades ...


> wrote up the details of that exploration here:
> https://zenaan.github.io/zen/javadoc/zen/lang/string.html
> (Note, this was pre-Java 9)
>
> Hopefully by Java 10, 11 or 12, we might see full grapheme support in
> Java (as is the case in Swift), now that String is implemented with byte
> array storage.
...

Re: Java needs an immutable byte array wrapper

Andrew Haley
In reply to this post by Keith Turner
On 12/11/16 16:45, Keith Turner wrote:

> While trying to design an API for Fluo, it's become clear to me that
> Java could really benefit from an immutable byte array wrapper.
> Something like java.lang.String except for byte arrays instead of char
> arrays.  It would be nice if this new type interoperated well with
> byte[], String, ByteBuffer, InputStream, OutputStream etc.
>
> I wrote the following blog post about my experiences with this issue
> while designing an API for Fluo.
>
> http://fluo.apache.org/blog/2016/11/10/immutable-bytes/
>
> Is there any reason something like this should not be added to Java?

Apart from bulking up core Java and its specification and TCK even
more, no.

But aren't we looking at this in the wrong way?  I would have thought
the Right Thing would be to make Protobuf’s ByteString into a library
in its own right.

Andrew.

Re: Java needs an immutable byte array wrapper

Keith Turner
On Thu, Jan 26, 2017 at 4:19 AM, Andrew Haley <[hidden email]> wrote:

> On 12/11/16 16:45, Keith Turner wrote:
>> While trying to design an API for Fluo, it's become clear to me that
>> Java could really benefit from an immutable byte array wrapper.
>> Something like java.lang.String except for byte arrays instead of char
>> arrays.  It would be nice if this new type interoperated well with
>> byte[], String, ByteBuffer, InputStream, OutputStream etc.
>>
>> I wrote the following blog post about my experiences with this issue
>> while designing an API for Fluo.
>>
>> http://fluo.apache.org/blog/2016/11/10/immutable-bytes/
>>
>> Is there any reason something like this should not be added to Java?
>
> Apart from bulking up core Java and its specification and TCK even
> more, no.
>
> But aren't we looking at this in the wrong way?  I would have thought
> the Right Thing would be to make Protobuf’s ByteString into a library
> in its own right.

I agree that would be a good solution.  It would be really nice to
have a library with really good API practices and no dependencies.  I
discussed that at the end of the post.  There is a minor drawback: without
something in Java you will end up with multiple libraries like this,
and interoperability will not be as good as if it were in Java.
That being said, having this library would be infinitely
better than not having it.

I have been thinking about creating a library like this; I just have
not found the time.  I'll see if I can get something started and
circle back asking for review from this community and the protobuf
community.

>
> Andrew.

CodePointCursor + CodePointParser (was Re: string indexing (was: Java needs an immutable byte array wrapper))

Zenaan Harkness
In reply to this post by Zenaan Harkness
On Sun, Nov 13, 2016 at 11:21:48PM +1100, Zenaan Harkness wrote:
> https://zenaan.github.io/zen/javadoc/zen/lang/string.html
> (Note, this was pre-Java 9)
>
> Hopefully by Java 10, 11 or 12, we might see full grapheme support in
> Java (as is the case in Swift), now that String is implemented with byte
> array storage.

Further:

https://github.com/zenaan/zen
https://github.com/zenaan/zen/blob/master/src/java/zen/lang/CodePointCursor.java
https://github.com/zenaan/zen/blob/master/src/java/zen/lang/CodePointParser.java

A small step for anyone who prefers to work with Unicode code points
rather than Java String's "code units":

* CodePointCursor.java:
  - complete and straightforward to use Unicode code point cursor
    (or "iterator")
  - resettable - and optionally reversible when reset (or at
    construction)
  - bidirectional - can move forwards and backwards
  - supports hops of >1 , in either direction
  - supports (default constructor) creation of default "null" cursor
  - provides methods for both inclusive and exclusive indexes
  - provides methods for both code point indexes, and underlying code
    unit indexes
  - supports traditional Java hasNext() and next() idiom
  - supports also peek(), advance(), curr() and prev() idioms
  - outputs a useful string representation of itself

* CodePointParser.java:
  - limited parser exercising CodePointCursor
  - parses string and unsigned long - no overflow checking
  - supports optional literals escape char
  - traditional parsing model
  - pretty step-wise output messages/ parse analysis

This parser is very simple as seen, but the two functions are
well tested fwiw.
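
For readers who have not worked at the code point level, the core of a bidirectional cursor can be sketched in a few lines on top of `String.codePointAt` and `String.codePointBefore`. The class below is a hypothetical, stripped-down illustration; it is not the zen CodePointCursor, and its method semantics (in particular what `prev()` returns) are assumptions made for this example.

```java
// Hypothetical minimal cursor; not the zen CodePointCursor implementation.
final class SimpleCodePointCursor {
    private final String s;
    private int cu; // current position as a UTF-16 code unit index

    SimpleCodePointCursor(String s) { this.s = s; }

    boolean hasNext() { return cu < s.length(); }
    boolean hasPrev() { return cu > 0; }

    // Returns the code point at the cursor and steps over it (1 or 2 code units).
    int next() {
        int cp = s.codePointAt(cu);
        cu += Character.charCount(cp);
        return cp;
    }

    // Steps back over the preceding code point and returns it.
    int prev() {
        int cp = s.codePointBefore(cu);
        cu -= Character.charCount(cp);
        return cp;
    }

    // Underlying code unit index, for interoperating with String APIs.
    int codeUnitIndex() { return cu; }
}
```

Keeping the position in code units internally means every step is constant time, while callers only ever see whole code points.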

Now the next layer above code points is graphemes, "grapheme clusters"
as I think Swift calls them, or "code point clusters" or "codepoint
clusters" as they ought rightfully be called.

That looks like a big job, unlike a simple code point cursor and parser.


Now ideally we would be starting with and building upon, UTF-8 strings,
not 16-bit "code unit strings", but that's for another day... or year :/

Also have not checked out IBM's ICU4J yet...


Any feedback, positive or negative, appreciated.

Regards,
Zenaan



----------
# Method signatures (CP = code point, CU = code unit) :

public class CodePointCursor implements IDEBUG {
   public CodePointCursor () {}
   public CodePointCursor (String s) {this(s, false);}
   public CodePointCursor (String s, boolean reverse)
   public CodePointCursor reset (boolean reverse)
   public void setDebug (boolean debug) {this._debug = debug;}
   public int getCPLen ()
   public int getCPIdxIn ()
   public int getCPIdxEx ()
   public int getCUIdxEx ()
   public boolean hasNext () {return i != iend;} // | hasNext(1)
   public boolean hasNext (int n)
   public int peekIdx () throws IndexOutOfBoundsException
   public int peekIdx (int n) throws IndexOutOfBoundsException
   public int advance () throws IndexOutOfBoundsException
   public int advance (int n) throws IndexOutOfBoundsException
   public int peek ()
   public int peek (int n)
   public int next ()
   public int next (int n)
   public int curr ()
   public int prev ()
   public String toString () {


public class CodePointParser implements IDEBUG {
   public CodePointParser (CodePointCursor i) {this.i = i;}
   public class CPParserException extends RuntimeException {
   public void setEscape (int escapecp) {
   public boolean hasEscape () {return lescape != 0;}
   public String toString () {
   public int parseString (StringBuilder result, int end_char)
   public int parseString (StringBuilder result, int end_char1, int end_char2)

   public static final int CURSOR_END = -1;
   public static final int ESCAPE_AT_END = -2;
   public int parseString (StringBuilder result, int end_char1, int end_char2, Messages m)

   public static class Messages {
      public String noDigit   = "<no digit> ";
      public String endChar   = "<end_char> ";
      public String atEnd     = "<at end> ";
      public String escape    = "<escape> ";
      public String literal   = "<literal> ";
      public String pushBack  = "<PUSH BACK> ";

   public long parseULong (long defaultVal)
   public long parseULong (long defaultVal, int base, Messages m)
