RFR: 8176188: jdk/internal/misc/JavaLangAccess/NewUnsafeString.java failing since 9-b93

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

RFR: 8176188: jdk/internal/misc/JavaLangAccess/NewUnsafeString.java failing since 9-b93

Claes Redestad
Hi,

the compact strings JEP changed semantics of the package-private
String(char[], boolean)
constructor to do the same as the public String(char[]) constructor.

Previously the former was used in trusted, internal code to avoid
copying the given char[],
but since the char[] now has to be converted to a byte[] that
optimization is no longer
possible via this method[1], and tests that checked that the returned
string shared the
given char[] naturally stopped working.

To fix this bug I propose the following clean-up:
- change all uses of JavaLangAccess.newUnsafeString(char[]) to new
String(char[])
- remove the package-private String(char[], boolean) constructor
- remove the newUnsafeString from JavaLangAccess
- remove the now unnecessary NewUnsafeString test

Patch: http://cr.openjdk.java.net/~redestad/8176188/open.00/
Bug: https://bugs.openjdk.java.net/browse/JDK-8176188

Thanks!

/Claes

[1] For some of the usages here we could improve somewhat by exposingthe
String(byte[], byte)
constructor, but I think that's out of scope here and I think we'd best
avoid leaking the
coder byte implementation detail outside of java.lang.
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8176188: jdk/internal/misc/JavaLangAccess/NewUnsafeString.java failing since 9-b93

Paul Sandoz
+1

Paul.

> On 3 Dec 2017, at 11:53, Claes Redestad <[hidden email]> wrote:
>
> Hi,
>
> the compact strings JEP changed semantics of the package-private String(char[], boolean)
> constructor to do the same as the public String(char[]) constructor.
>
> Previously the former was used in trusted, internal code to avoid copying the given char[],
> but since the char[] now has to be converted to a byte[] that optimization is no longer
> possible via this method[1], and tests that checked that the returned string shared the
> given char[] naturally stopped working.
>
> To fix this bug I propose the following clean-up:
> - change all uses of JavaLangAccess.newUnsafeString(char[]) to new String(char[])
> - remove the package-private String(char[], boolean) constructor
> - remove the newUnsafeString from JavaLangAccess
> - remove the now unnecessary NewUnsafeString test
>
> Patch: http://cr.openjdk.java.net/~redestad/8176188/open.00/
> Bug: https://bugs.openjdk.java.net/browse/JDK-8176188
>
> Thanks!
>
> /Claes
>
> [1] For some of the usages here we could improve somewhat by exposingthe String(byte[], byte)
> constructor, but I think that's out of scope here and I think we'd best avoid leaking the
> coder byte implementation detail outside of java.lang.

Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8176188: jdk/internal/misc/JavaLangAccess/NewUnsafeString.java failing since 9-b93

Martin Buchholz-3
In reply to this post by Claes Redestad
I'm rather sad about what happened to our non-copying String constructions
for trusted code.  This is a performance regression with the change in
String representation that should have blocked that change IMO.  I think we
should have a plan for moving in the opposite direction.  I don't think we
can implement something as ambitious as Rust's ownership tracking, so have
to restrict ourselves to trusted code.  The use case that keeps coming up
is constructing zip entry names, which are much more likely to be pure
ASCII than their file contents.

I don't have a good design for how one could do that, and who the trusted
set of callers is (at least java.base, not java.lang), but I think we
should set a direction.

On Sun, Dec 3, 2017 at 11:53 AM, Claes Redestad <[hidden email]>
wrote:

> Hi,
>
> the compact strings JEP changed semantics of the package-private
> String(char[], boolean)
> constructor to do the same as the public String(char[]) constructor.
>
> Previously the former was used in trusted, internal code to avoid copying
> the given char[],
> but since the char[] now has to be converted to a byte[] that optimization
> is no longer
> possible via this method[1], and tests that checked that the returned
> string shared the
> given char[] naturally stopped working.
>
> To fix this bug I propose the following clean-up:
> - change all uses of JavaLangAccess.newUnsafeString(char[]) to new
> String(char[])
> - remove the package-private String(char[], boolean) constructor
> - remove the newUnsafeString from JavaLangAccess
> - remove the now unnecessary NewUnsafeString test
>
> Patch: http://cr.openjdk.java.net/~redestad/8176188/open.00/
> Bug: https://bugs.openjdk.java.net/browse/JDK-8176188
>
> Thanks!
>
> /Claes
>
> [1] For some of the usages here we could improve somewhat by exposingthe
> String(byte[], byte)
> constructor, but I think that's out of scope here and I think we'd best
> avoid leaking the
> coder byte implementation detail outside of java.lang.
>
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8176188: jdk/internal/misc/JavaLangAccess/NewUnsafeString.java failing since 9-b93

Claes Redestad
Hi Martin,

On 2017-12-04 22:06, Martin Buchholz wrote:

> I'm rather sad about what happened to our non-copying String
> constructions for trusted code.  This is a performance regression with
> the change in String representation that should have blocked that
> change IMO.  I think we should have a plan for moving in the opposite
> direction.  I don't think we can implement something as ambitious as
> Rust's ownership tracking, so have to restrict ourselves to trusted
> code.  The use case that keeps coming up is constructing zip entry
> names, which are much more likely to be pure ASCII than their file
> contents.
>
> I don't have a good design for how one could do that, and who the
> trusted set of callers is (at least java.base, not java.lang), but I
> think we should set a direction.

as I alluded to in a footnote there exists a non-copying String(byte[]
value, byte coder) constructor - the problem is that it's somewhat
cumbersome to use:

- first off, the caller needs to be aware about the value of
String.COMPACT_STRINGS: if false, all strings needs to be UTF-16 encoded
and the coder byte always set to String.UTF16
- secondly, the caller needs to know if the byte[] you're constructing
needs to be LATIN-1 or UTF-16 encoded up front and act accordingly

Some of the more performance sensitive uses outside of java.lang was
addressed by the Compact Strings update, for example the implementation
backing java.util.UUID was somewhat surprisingly moved into
java.lang.Long::fastUUID[1]. Something similar is doable for the
java.sql types, but further complicated by those classes being in a
different module, and ultimately questionable since their
implementations in JDK 9 are quite a bit more performant than in any
previous release (thus not technically a regression).

That leaves StringJoiner as the one case that stands out. And the fact
that existing uses of String(byte[], byte) are a bit of an eye-sore[1!!1!].

One idea I'm tinkering with here is to have a trusted, package-private
SharedStringBuilder added to java.lang and exposed via SharedSecrets.
It'd more or less mimic StringBuilder (including deal with inflating the
byte[] when necessary, encapsulate the awkward String.COMPACT_STRINGS
checks etc) but enable calling String(byte[], byte) in the toString()
call.  To be effective it'll only have a single constructor taking the
capacity, and should probably throw IOOBE rather than resize the
internal buffer. Some cases like Long::fastUUID could probably be much
simplified by using such a builder instead (for a very minimal
overhead). Does that sound reasonable? At any rate I think of this as a
possible follow-up RFE, and not an alternative to the cleanup/"bugfix"
at hand.

Thanks!

/Claes

[1]
http://hg.openjdk.java.net/jdk/jdk/file/532cdc178e42/src/java.base/share/classes/java/lang/Long.java#l427

Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8176188: jdk/internal/misc/JavaLangAccess/NewUnsafeString.java failing since 9-b93

Martin Buchholz-3
Thanks, Claes.

I think we're in agreement!

I did that shared String optimization for StringJoiner.  It's a lot harder
to justify in the new String world because we have to handle non-ASCII, and
the non-ASCII case is actually fairly likely.

Does it make sense to keep COMPACT_STRINGS as an option in openjdk?  If
essentially every user runs with that flag turned on, bit rot is likely to
set in, and even if not, we have to remain ever vigilant to not break
anyone turning it off.

The code in Long.fastUUID is indeed ugly.  I've never heard of UUID
creation being a bottleneck.  At Google it sometimes seems all our java
performance problems are with zip file manipulation.

I might have avoided the code duplication in Long.fastUUID by:
- build the all-ASCII byte[] unconditionally
- if COMPACT_STRINGS, then use the secret LATIN1 constructor, else use the
public constructor(byte[], ISO_8859_1)

One idea for a public performant API is to add something like
Charset.toString(byte[]) or Charset.toString(byte[], int start, int length)
via default method on Charset.  Then; override that in ISO_8859_1 to do
only a single copy (and can we cheat somehow to do no copies?)

On Mon, Dec 4, 2017 at 1:47 PM, Claes Redestad <[hidden email]>
wrote:

> Hi Martin,
>
> On 2017-12-04 22:06, Martin Buchholz wrote:
>
>> I'm rather sad about what happened to our non-copying String
>> constructions for trusted code.  This is a performance regression with the
>> change in String representation that should have blocked that change IMO.
>> I think we should have a plan for moving in the opposite direction.  I
>> don't think we can implement something as ambitious as Rust's ownership
>> tracking, so have to restrict ourselves to trusted code.  The use case that
>> keeps coming up is constructing zip entry names, which are much more likely
>> to be pure ASCII than their file contents.
>>
>> I don't have a good design for how one could do that, and who the trusted
>> set of callers is (at least java.base, not java.lang), but I think we
>> should set a direction.
>>
>
> as I alluded to in a footnote there exists a non-copying String(byte[]
> value, byte coder) constructor - the problem is that it's somewhat
> cumbersome to use:
>
> - first off, the caller needs to be aware about the value of
> String.COMPACT_STRINGS: if false, all strings needs to be UTF-16 encoded
> and the coder byte always set to String.UTF16
> - secondly, the caller needs to know if the byte[] you're constructing
> needs to be LATIN-1 or UTF-16 encoded up front and act accordingly
>
> Some of the more performance sensitive uses outside of java.lang was
> addressed by the Compact Strings update, for example the implementation
> backing java.util.UUID was somewhat surprisingly moved into
> java.lang.Long::fastUUID[1]. Something similar is doable for the java.sql
> types, but further complicated by those classes being in a different
> module, and ultimately questionable since their implementations in JDK 9
> are quite a bit more performant than in any previous release (thus not
> technically a regression).
>
> That leaves StringJoiner as the one case that stands out. And the fact
> that existing uses of String(byte[], byte) are a bit of an eye-sore[1!!1!].
>
> One idea I'm tinkering with here is to have a trusted, package-private
> SharedStringBuilder added to java.lang and exposed via SharedSecrets. It'd
> more or less mimic StringBuilder (including deal with inflating the byte[]
> when necessary, encapsulate the awkward String.COMPACT_STRINGS checks etc)
> but enable calling String(byte[], byte) in the toString() call.  To be
> effective it'll only have a single constructor taking the capacity, and
> should probably throw IOOBE rather than resize the internal buffer. Some
> cases like Long::fastUUID could probably be much simplified by using such a
> builder instead (for a very minimal overhead). Does that sound reasonable?
> At any rate I think of this as a possible follow-up RFE, and not an
> alternative to the cleanup/"bugfix" at hand.
>
> Thanks!
>
> /Claes
>
> [1] http://hg.openjdk.java.net/jdk/jdk/file/532cdc178e42/src/jav
> a.base/share/classes/java/lang/Long.java#l427
>
>
Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8176188: jdk/internal/misc/JavaLangAccess/NewUnsafeString.java failing since 9-b93

Xueming Shen
In reply to this post by Claes Redestad
+1

On 12/3/17, 11:53 AM, Claes Redestad wrote:

> Hi,
>
> the compact strings JEP changed semantics of the package-private
> String(char[], boolean)
> constructor to do the same as the public String(char[]) constructor.
>
> Previously the former was used in trusted, internal code to avoid
> copying the given char[],
> but since the char[] now has to be converted to a byte[] that
> optimization is no longer
> possible via this method[1], and tests that checked that the returned
> string shared the
> given char[] naturally stopped working.
>
> To fix this bug I propose the following clean-up:
> - change all uses of JavaLangAccess.newUnsafeString(char[]) to new
> String(char[])
> - remove the package-private String(char[], boolean) constructor
> - remove the newUnsafeString from JavaLangAccess
> - remove the now unnecessary NewUnsafeString test
>
> Patch: http://cr.openjdk.java.net/~redestad/8176188/open.00/
> Bug: https://bugs.openjdk.java.net/browse/JDK-8176188
>
> Thanks!
>
> /Claes
>
> [1] For some of the usages here we could improve somewhat by
> exposingthe String(byte[], byte)
> constructor, but I think that's out of scope here and I think we'd
> best avoid leaking the
> coder byte implementation detail outside of java.lang.

Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8176188: jdk/internal/misc/JavaLangAccess/NewUnsafeString.java failing since 9-b93

Steven Schlansker
In reply to this post by Martin Buchholz-3

> On Dec 4, 2017, at 3:46 PM, Martin Buchholz <[hidden email]> wrote:
>
> The code in Long.fastUUID is indeed ugly.  I've never heard of UUID
> creation being a bottleneck.  At Google it sometimes seems all our java
> performance problems are with zip file manipulation.

At $MY_WORK, we never use Zip files, but indeed ran into some really nasty
performance bottlenecks in UUID, and cared enough to whine about it and
get it fixed ;)

http://mail.openjdk.java.net/pipermail/core-libs-dev/2013-January/013494.html
http://bugs.java.com/view_bug.do?bug_id=JDK-8006627

While I doubt a single string allocation is make or break for us (by far the
bigger problem was the truly awful use of regex and tons of intermediate arrays)
but UUID is a core data type and people expect it to be reasonably performant.

Reply | Threaded
Open this post in threaded view
|

Re: RFR: 8176188: jdk/internal/misc/JavaLangAccess/NewUnsafeString.java failing since 9-b93

Claes Redestad
In reply to this post by Martin Buchholz-3
On 2017-12-05 00:46, Martin Buchholz wrote:
> Thanks, Claes.
>
> I think we're in agreement!

Great!

>
> I did that shared String optimization for StringJoiner. It's a lot
> harder to justify in the new String world because we have to handle
> non-ASCII, and the non-ASCII case is actually fairly likely.

Let's sleep on it, at least. :-)

> Does it make sense to keep COMPACT_STRINGS as an option in openjdk? 
> If essentially every user runs with that flag turned on, bit rot is
> likely to set in, and even if not, we have to remain ever vigilant to
> not break anyone turning it off.

I think it was only kept as a safeguard in case anyone would run into
serious trouble with the implementation (unlikely), so it might be
reasonable to deprecate the flag in an upcoming release. Maybe survey if
anyone ever had to opt-out of it first.

>
> The code in Long.fastUUID is indeed ugly.  I've never heard of UUID
> creation being a bottleneck.  At Google it sometimes seems all our
> java performance problems are with zip file manipulation.

I see Steven already said Hi. I helped him get this optimization in, and
it did look a bit nicer originally. :-)

>
> I might have avoided the code duplication in Long.fastUUID by:
> - build the all-ASCII byte[] unconditionally
> - if COMPACT_STRINGS, then use the secret LATIN1 constructor, else use
> the public constructor(byte[], ISO_8859_1)

I think the implementors felt it would be unfair to penalize those who
had to opt-out of it for some unlikely reason.

>
> One idea for a public performant API is to add something like
> Charset.toString(byte[]) or Charset.toString(byte[], int start, int
> length) via default method on Charset.  Then; override that
> in ISO_8859_1 to do only a single copy (and can we cheat somehow to do
> no copies?)

I think these would need to clone the byte[] most of the time, so I
think it'd not be that much different from new String(byte[],
ISO_8859_1). The variant with Charset.toString(byte[], int, int) could
optimize away one copy when start > 0 or length < bytes.length, but this
seems like a corner case gain that doesn't really motivate adding two
new methods to every Charset class.

I guess a SharedStringBuilder (InlineStringBuilder?) could be made
public if we used $suitable_concurrency_trick to ensure the toString()
is callable at most once and that once it's been called the byte[] can't
be written to again. Maybe a ThreadLocal-based implementation could
suffice if we disallow nested InlineStringBuilder use (shouldn't be too
harsh of a limitation).

/Claes

>
> On Mon, Dec 4, 2017 at 1:47 PM, Claes Redestad
> <[hidden email] <mailto:[hidden email]>> wrote:
>
>     Hi Martin,
>
>     On 2017-12-04 22:06, Martin Buchholz wrote:
>
>         I'm rather sad about what happened to our non-copying String
>         constructions for trusted code.  This is a performance
>         regression with the change in String representation that
>         should have blocked that change IMO.  I think we should have a
>         plan for moving in the opposite direction.  I don't think we
>         can implement something as ambitious as Rust's ownership
>         tracking, so have to restrict ourselves to trusted code.  The
>         use case that keeps coming up is constructing zip entry names,
>         which are much more likely to be pure ASCII than their file
>         contents.
>
>         I don't have a good design for how one could do that, and who
>         the trusted set of callers is (at least java.base, not
>         java.lang), but I think we should set a direction.
>
>
>     as I alluded to in a footnote there exists a non-copying
>     String(byte[] value, byte coder) constructor - the problem is that
>     it's somewhat cumbersome to use:
>
>     - first off, the caller needs to be aware about the value of
>     String.COMPACT_STRINGS: if false, all strings needs to be UTF-16
>     encoded and the coder byte always set to String.UTF16
>     - secondly, the caller needs to know if the byte[] you're
>     constructing needs to be LATIN-1 or UTF-16 encoded up front and
>     act accordingly
>
>     Some of the more performance sensitive uses outside of java.lang
>     was addressed by the Compact Strings update, for example the
>     implementation backing java.util.UUID was somewhat surprisingly
>     moved into java.lang.Long::fastUUID[1]. Something similar is
>     doable for the java.sql types, but further complicated by those
>     classes being in a different module, and ultimately questionable
>     since their implementations in JDK 9 are quite a bit more
>     performant than in any previous release (thus not technically a
>     regression).
>
>     That leaves StringJoiner as the one case that stands out. And the
>     fact that existing uses of String(byte[], byte) are a bit of an
>     eye-sore[1!!1!].
>
>     One idea I'm tinkering with here is to have a trusted,
>     package-private SharedStringBuilder added to java.lang and exposed
>     via SharedSecrets. It'd more or less mimic StringBuilder
>     (including deal with inflating the byte[] when necessary,
>     encapsulate the awkward String.COMPACT_STRINGS checks etc) but
>     enable calling String(byte[], byte) in the toString() call.  To be
>     effective it'll only have a single constructor taking the
>     capacity, and should probably throw IOOBE rather than resize the
>     internal buffer. Some cases like Long::fastUUID could probably be
>     much simplified by using such a builder instead (for a very
>     minimal overhead). Does that sound reasonable? At any rate I think
>     of this as a possible follow-up RFE, and not an alternative to the
>     cleanup/"bugfix" at hand.
>
>     Thanks!
>
>     /Claes
>
>     [1]
>     http://hg.openjdk.java.net/jdk/jdk/file/532cdc178e42/src/java.base/share/classes/java/lang/Long.java#l427
>     <http://hg.openjdk.java.net/jdk/jdk/file/532cdc178e42/src/java.base/share/classes/java/lang/Long.java#l427>
>
>