RFR: 8264288: Performance issue with MethodHandle.asCollector

RFR: 8264288: Performance issue with MethodHandle.asCollector

Jorn Vernee-2
This patch speeds up MethodHandle.asCollector handles where the array type is not Object[], as well as all collectors where the arity is greater than 10.
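
For context, the kind of typed collector this covers can be built through the public API like so (a minimal sketch, not code from the patch; names are illustrative):

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;

public class AsCollectorExample {
    public static void main(String[] args) throws Throwable {
        // Identity handle on String[]: (String[])String[]
        MethodHandle identity = MethodHandles.identity(String[].class);
        // Collect three String arguments into a String[]: (String,String,String)String[]
        MethodHandle collector = identity.asCollector(String[].class, 3);
        String[] collected = (String[]) collector.invokeExact("a", "b", "c");
        System.out.println(collected.length); // prints 3
    }
}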

The old code created a collector handle by combining a set of hard-coded methods that collect arguments into an Object[], up to a maximum of ten elements, and then copying this intermediate array into the final array.

In principle, it shouldn't matter how slow (or fast) this handle is, because it should be replaced by the existing bytecode intrinsification, which does the right thing. But investigation showed that the intrinsification is applied only in a very limited set of cases: Object[] with at most 10 elements, and only for the intermediate collector handles. Every other collector shape incurs overhead because it essentially ends up calling the inefficient fallback handle.

Rather than sticking on a band-aid, the new code replaces the existing implementation with a collector handle implemented using a LambdaForm, which removes the need for intrinsification and also greatly reduces the code complexity of the implementation. (I plan to expose this collector through a public API in the future as well, so users don't have to go through MHs::identity to make a collector.)

The old code also used a special lambda form transform for collecting arguments into an array. I believe this was done to take advantage of the existing-but-limited bytecode intrinsification, at least for Object[] with fewer than 10 elements.

The new code just uses the existing collect-arguments transform with the newly added collector handle as a filter. This works just as well for the existing case but, as a bonus, is also much simpler, since no separate transform is needed. Using the collect-arguments transform should also improve sharing.
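
In public-API terms, the collect-arguments transform corresponds roughly to MethodHandles::collectArguments with the collector handle as the filter; conceptually it looks like this (a sketch under that assumption, not the internal transform itself):

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class CollectArgumentsSketch {
    // Stand-in target that consumes an array argument.
    static int count(String[] values) {
        return values.length;
    }

    public static void main(String[] args) throws Throwable {
        MethodHandle target = MethodHandles.lookup().findStatic(
                CollectArgumentsSketch.class, "count",
                MethodType.methodType(int.class, String[].class));
        // The collector acts as the filter: (String,String,String)String[]
        MethodHandle collector = MethodHandles.identity(String[].class)
                .asCollector(String[].class, 3);
        // Replace the array parameter at position 0 with three String parameters.
        MethodHandle collected = MethodHandles.collectArguments(target, 0, collector);
        int n = (int) collected.invokeExact("a", "b", "c");
        System.out.println(n); // prints 3
    }
}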

As a result of these changes, a lot of code became unused and has been removed in this patch.

Testing: tiers 1-3; benchmarking using TypedAsCollector (part of the patch), as well as another variant of the benchmark that uses a declared static identity method instead of MHs::identity (not included). A before/after comparison of the MethodHandleAs* benchmarks showed no differences there.
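
A stripped-down sketch of the shape such a typed-collector benchmark takes (hypothetical code; the actual TypedAsCollector included in the patch may differ):

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import org.openjdk.jmh.annotations.Benchmark;

public class TypedCollectorSketch {
    static final MethodHandle STRING_COLLECTOR =
            MethodHandles.identity(String[].class).asCollector(String[].class, 3);

    @Benchmark
    public String[] stringCollect() throws Throwable {
        // A real benchmark would take care to defeat constant folding of the arguments.
        return (String[]) STRING_COLLECTOR.invokeExact("a", "b", "c");
    }
}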

Here are some numbers from the added benchmark:

Before:
Benchmark                                    Mode  Cnt    Score    Error  Units
TypedAsCollector.testIntCollect              avgt   30  189.156 ±  1.796  ns/op
TypedAsCollector.testIntCollectHighArity     avgt   30  660.549 ± 10.159  ns/op
TypedAsCollector.testObjectCollect           avgt   30    7.092 ±  0.042  ns/op
TypedAsCollector.testObjectCollectHighArity  avgt   30   65.225 ±  0.546  ns/op
TypedAsCollector.testStringCollect           avgt   30   28.511 ±  0.243  ns/op
TypedAsCollector.testStringCollectHighArity  avgt   30   57.054 ±  0.635  ns/op
(as you can see, only the Object[] case with arity less than 10 is fast here)
After:
Benchmark                                    Mode  Cnt  Score   Error  Units
TypedAsCollector.testIntCollect              avgt   30  6.569 ± 0.131  ns/op
TypedAsCollector.testIntCollectHighArity     avgt   30  8.923 ± 0.066  ns/op
TypedAsCollector.testObjectCollect           avgt   30  6.813 ± 0.035  ns/op
TypedAsCollector.testObjectCollectHighArity  avgt   30  9.718 ± 0.066  ns/op
TypedAsCollector.testStringCollect           avgt   30  6.737 ± 0.016  ns/op
TypedAsCollector.testStringCollectHighArity  avgt   30  9.618 ± 0.052  ns/op

Thanks,
Jorn

-------------

Commit messages:
 - WIP - clean up
 - Completely implement collect in lambdaform
 - WIP - fix MH collector perf, with jumbo arity bandaid

Changes: https://git.openjdk.java.net/jdk/pull/3306/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=3306&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8264288
  Stats: 513 lines in 8 files changed: 154 ins; 337 del; 22 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3306.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3306/head:pull/3306

PR: https://git.openjdk.java.net/jdk/pull/3306

Re: RFR: 8264288: Performance issue with MethodHandle.asCollector

John R Rose
On Thu, 1 Apr 2021 13:25:05 GMT, Jorn Vernee <[hidden email]> wrote:

> This patch speeds up MethodHandle.asCollector handles where the array type is not Object[], as well as all collectors where the arity is greater than 10.
>
> [...]

So many red deletion lines; I'm looking at a beautiful sunset!
It's a sunset for some very old code, some of the first code
I wrote for method handles, long before Lambda Forms
made this sort of task easier.

Thanks very much for cleaning this out.  See a few
minor comments on some diff lines.

src/java.base/share/classes/java/lang/invoke/LambdaFormEditor.java line 651:

> 649:     }
> 650:
> 651:     LambdaForm collectArgumentArrayForm(int pos, MethodHandle arrayCollector) {

It's counter-intuitive that removing a LFE tactic would be harmless.
Each LFE tactic is a point where LFs can be shared, reducing footprint.
But in this case `collectArgumentArrayForm` is always paired with
`collectArgumentsForm`, so the latter takes care of sharing.
The actual code which makes the arrays is also shared, via
`Makers.TYPED_COLLECTORS` (unchanged).

src/java.base/share/classes/java/lang/invoke/MethodHandleImpl.java line 1342:

> 1340:     }
> 1341:
> 1342:     // Array identity function (used as getConstantHandle(MH_arrayIdentity)).

If this is no longer used, remove it.

src/java.base/share/classes/java/lang/invoke/MethodHandleImpl.java line 1903:

> 1901:     }
> 1902:
> 1903:     private static LambdaForm makeCollectorForm(MethodType basicType, MethodHandle storeFunc) {

This reads very well.  It is the right way to do this job with LFs.
There is a slight risk that the JIT will fail to inline the tiny MHs
that (a) construct arrays and (b) poke elements into them.
If that happens, then perhaps the bytecode generator can
treat them as intrinsics.  Maybe that's worth a comment in
the code, to help later readers?

-------------

Marked as reviewed by jrose (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/3306

Re: RFR: 8264288: Performance issue with MethodHandle.asCollector

Paul Sandoz-2
In reply to this post by Jorn Vernee-2
On Thu, 1 Apr 2021 13:25:05 GMT, Jorn Vernee <[hidden email]> wrote:

> This patch speeds up MethodHandle.asCollector handles where the array type is not Object[], as well as all collectors where the arity is greater than 10.
>
> [...]

That's an elegant solution.

At first I thought it might unduly perturb lambda form generation and caching, but you slotted a different lambda form implementation underneath the varargs implementation.

src/java.base/share/classes/java/lang/invoke/MethodHandleImpl.java line 1277:

> 1275:             if (intrinsicName == Intrinsic.IDENTITY) {
> 1276:                 MethodType resultType = type().asCollectorType(arrayType, type().parameterCount() - 1, arrayLength);
> 1277:                 MethodHandle collector = MethodHandleImpl.makeCollector(arrayType, arrayLength);

Should `MethodHandleImpl.varargsArray` still be used here since it performs arity checking and caching?

Maybe the arity checks are performed by `asCollectorType`, but that would still leave the caching.
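
For illustration, the caching in question amounts to a per-array-type, per-arity memo so repeated requests hand back the same handle; a generic sketch, not the actual `varargsArray` implementation:

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.util.concurrent.ConcurrentHashMap;

final class CollectorCacheSketch {
    // Key: array type name + arity; value: the shared collector handle.
    private static final ConcurrentHashMap<String, MethodHandle> CACHE = new ConcurrentHashMap<>();

    static MethodHandle cachedCollector(Class<?> arrayType, int arity) {
        if (arity < 0 || arity > 255) {          // rough arity sanity check
            throw new IllegalArgumentException("bad arity: " + arity);
        }
        return CACHE.computeIfAbsent(arrayType.getName() + '/' + arity,
                key -> MethodHandles.identity(arrayType).asCollector(arrayType, arity));
    }
}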

-------------

PR: https://git.openjdk.java.net/jdk/pull/3306

Re: RFR: 8264288: Performance issue with MethodHandle.asCollector

Jorn Vernee-2
In reply to this post by John R Rose
On Thu, 1 Apr 2021 19:09:14 GMT, John R Rose <[hidden email]> wrote:

>> This patch speeds up MethodHandle.asCollector handles where the array type is not Object[], as well as all collectors where the arity is greater than 10.
>>
>> [...]
>
> src/java.base/share/classes/java/lang/invoke/MethodHandleImpl.java line 1342:
>
>> 1340:     }
>> 1341:
>> 1342:     // Array identity function (used as getConstantHandle(MH_arrayIdentity)).
>
> If this is no longer used, remove it.

Yes, it looks like this can be removed. Good catch!

-------------

PR: https://git.openjdk.java.net/jdk/pull/3306

Re: RFR: 8264288: Performance issue with MethodHandle.asCollector

Jorn Vernee-2
In reply to this post by Paul Sandoz-2
On Thu, 1 Apr 2021 19:12:42 GMT, Paul Sandoz <[hidden email]> wrote:

>> This patch speeds up MethodHandle.asCollector handles where the array type is not Object[], as well as all collectors where the arity is greater than 10.
>>
>> [...]
>
> src/java.base/share/classes/java/lang/invoke/MethodHandleImpl.java line 1277:
>
>> 1275:             if (intrinsicName == Intrinsic.IDENTITY) {
>> 1276:                 MethodType resultType = type().asCollectorType(arrayType, type().parameterCount() - 1, arrayLength);
>> 1277:                 MethodHandle collector = MethodHandleImpl.makeCollector(arrayType, arrayLength);
>
> Should `MethodHandleImpl.varargsArray` still be used here since it performs arity checking and caching?
>
> Maybe the arity checks are performed by `asCollectorType`, but that would still leave the caching.

Ah right, good catch! This is a leftover from moving things around and should indeed use `varargsArray`.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3306

Re: RFR: 8264288: Performance issue with MethodHandle.asCollector

Jorn Vernee-2
In reply to this post by John R Rose
On Thu, 1 Apr 2021 19:09:53 GMT, John R Rose <[hidden email]> wrote:

>> This patch speeds up MethodHandle.asCollector handles where the array type is not Object[], as well as all collectors where the arity is greater than 10.
>>
>> [...]
>
> src/java.base/share/classes/java/lang/invoke/MethodHandleImpl.java line 1903:
>
>> 1901:     }
>> 1902:
>> 1903:     private static LambdaForm makeCollectorForm(MethodType basicType, MethodHandle storeFunc) {
>
> This reads very well.  It is the right way to do this job with LFs.
> There is a slight risk that the JIT will fail to inline the tiny MHs
> that (a) construct arrays and (b) poke elements into them.
> If that happens, then perhaps the bytecode generator can
> treat them as intrinsics.  Maybe that's worth a comment in
> the code, to help later readers?

Good idea, I'll leave some notes.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3306

Re: RFR: 8264288: Performance issue with MethodHandle.asCollector [v2]

Jorn Vernee-2
In reply to this post by Jorn Vernee-2
> This patch speeds up MethodHandle.asCollector handles where the array type is not Object[], as well as all collectors where the arity is greater than 10.
>
> [...]

Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision:

  - Address review comments
  - Use cached version of store func getter
  - Use ARRAY_STORE intrinsic for array stores
  - Generate direct call to Array.newInstance instead of using an array constructor handle
  - Intrinsify call to Array.newInstance if the component type is constant

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/3306/files
  - new: https://git.openjdk.java.net/jdk/pull/3306/files/46bc5d19..313ffaaf

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=3306&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=3306&range=00-01

  Stats: 57 lines in 2 files changed: 27 ins; 19 del; 11 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3306.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3306/head:pull/3306

PR: https://git.openjdk.java.net/jdk/pull/3306

Re: RFR: 8264288: Performance issue with MethodHandle.asCollector [v2]

Jorn Vernee-2
In reply to this post by Paul Sandoz-2
On Thu, 1 Apr 2021 19:19:10 GMT, Paul Sandoz <[hidden email]> wrote:

>> Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   - Address review comments
>>   - Use cached version of store func getter
>>   - Use ARRAY_STORE intrinsic for array stores
>>   - Generate direct call to Array.newInstance instead of using an array constructor handle
>>   - Intrinsify call to Array.newInstance if the component type is constant
>
> That's an elegant solution.
>
> At first I thought it might unduly perturb lambda form generation and caching, but you slotted a different lambda form implementation underneath the varargs implementation.

I've addressed the review comments, plus some other things:

- I realized that I was calling the uncached version of the store function factory. Fixed that.
- I also realized that there's already an `ARRAY_STORE` intrinsic, which I'm now using to avoid generating a call.
- I also realized that, since we only have one array creation handle per lambda form, we can generate a direct call to `Array::newInstance` instead of going through the array constructor handle (which also avoids having to use a BoundMethodHandle).
- Finally, I added an intrinsic, under the old `NEW_ARRAY` name, that intrinsifies a call to `Array::newInstance` if the component type argument is constant (which it is in this case).

As a result, the lambda form is now fully intrinsified (no more calls in the generated bytecode), e.g.:

static java.lang.Object collector001_LLLL_L(java.lang.Object, java.lang.Object, java.lang.Object, java.lang.Object);
    Code:
       0: iconst_3
       1: anewarray     #12                 // class java/lang/String
       4: astore        4
       6: aload         4
       8: checkcast     #14                 // class "[Ljava/lang/String;"
      11: dup
      12: astore        4
      14: iconst_0
      15: aload_1
      16: checkcast     #12                 // class java/lang/String
      19: aastore
      20: aload         4
      22: iconst_1
      23: aload_2
      24: checkcast     #12                 // class java/lang/String
      27: aastore
      28: aload         4
      30: iconst_2
      31: aload_3
      32: checkcast     #12                 // class java/lang/String
      35: aastore
      36: aload         4
      38: areturn
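
In plain Java source, that generated collector is roughly equivalent to the following (a readability sketch of the listing above; the parameter names are made up):

static Object collector001_LLLL_L(Object handle, Object a0, Object a1, Object a2) {
    // anewarray + checkcast from the listing above
    String[] array = (String[]) new String[3];
    array[0] = (String) a0;  // checkcast + aastore
    array[1] = (String) a1;
    array[2] = (String) a2;
    return array;
}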

Thanks,
Jorn

-------------

PR: https://git.openjdk.java.net/jdk/pull/3306

Re: RFR: 8264288: Performance issue with MethodHandle.asCollector [v2]

Vladimir Ivanov-2
In reply to this post by Jorn Vernee-2
On Fri, 2 Apr 2021 13:17:55 GMT, Jorn Vernee <[hidden email]> wrote:

>> This patch speeds up MethodHandle.asCollector handles where the array type is not Object[], as well as all collectors where the arity is greater than 10.
>>
>> [...]
>
> Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision:
>
>   - Address review comments
>   - Use cached version of store func getter
>   - Use ARRAY_STORE intrinsic for array stores
>   - Generate direct call to Array.newInstance instead of using an array constructor handle
>   - Intrinsify call to Array.newInstance if the component type is constant

src/java.base/share/classes/java/lang/invoke/InvokerBytecodeGenerator.java line 874:

> 872:                 case NEW_ARRAY:
> 873:                     Class<?> rtype = name.function.methodType().returnType();
> 874:                     if (isStaticallyNameable(rtype)) {

Why did you remove the `isStaticallyNameable(rtype)` check?

It is there for a reason: only a very limited set of classes can be directly referenced from LF bytecode.

src/java.base/share/classes/java/lang/invoke/MethodHandleImpl.java line 1897:

> 1895:         // field of the receiver handle.
> 1896:         // This is left as a followup enhancement, as it needs to be investigated if this causes
> 1897:         // profile pollution.

Profile pollution shouldn't be a problem here: both types (element type + array type) will be constants attached to the "receiver" BMH instance and during inlining will be fetched from there.  

I'm fine with handling it as a separate RFE.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3306

Re: RFR: 8264288: Performance issue with MethodHandle.asCollector [v2]

Jorn Vernee-2
On Fri, 2 Apr 2021 13:56:31 GMT, Vladimir Ivanov <[hidden email]> wrote:

>> Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   - Address review comments
>>   - Use cached version of store func getter
>>   - Use ARRAY_STORE intrinsic for array stores
>>   - Generate direct call to Array.newInstance instead of using an array constructor handle
>>   - Intrinsify call to Array.newInstance if the component type is constant
>
> src/java.base/share/classes/java/lang/invoke/InvokerBytecodeGenerator.java line 874:
>
>> 872:                 case NEW_ARRAY:
>> 873:                     Class<?> rtype = name.function.methodType().returnType();
>> 874:                     if (isStaticallyNameable(rtype)) {
>
> Why did you remove the `isStaticallyNameable(rtype)` check?
>
> It is there for a reason: only a very limited set of classes can be directly referenced from LF bytecode.

I removed the `NEW_ARRAY` intrinsic altogether at first, but added it back in the latest commit. I didn't add the check since I was not aware of the restriction.

Good to know, will add a check.

> src/java.base/share/classes/java/lang/invoke/MethodHandleImpl.java line 1897:
>
>> 1895:         // field of the receiver handle.
>> 1896:         // This is left as a followup enhancement, as it needs to be investigated if this causes
>> 1897:         // profile pollution.
>
> Profile pollution shouldn't be a problem here: both types (element type + array type) will be constants attached to the "receiver" BMH instance and during inlining will be fetched from there.  
>
> I'm fine with handling it as a separate RFE.

That's what I thought as well (but not 100% sure). I can partially roll back the last commit so the code still uses an injected array constructor handle, and then it should be easy to add caching in the cases where the component type is a reference type.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3306

Re: RFR: 8264288: Performance issue with MethodHandle.asCollector [v2]

Vladimir Ivanov-2
On Fri, 2 Apr 2021 14:41:06 GMT, Jorn Vernee <[hidden email]> wrote:

>> src/java.base/share/classes/java/lang/invoke/InvokerBytecodeGenerator.java line 874:
>>
>>> 872:                 case NEW_ARRAY:
>>> 873:                     Class<?> rtype = name.function.methodType().returnType();
>>> 874:                     if (isStaticallyNameable(rtype)) {
>>
>> Why did you remove the `isStaticallyNameable(rtype)` check?
>>
>> It is there for a reason: only a very limited set of classes can be directly referenced from LF bytecode.
>
> I removed the `NEW_ARRAY` intrinsic altogether at first, but added it back in the latest commit. I didn't add the check since I was not aware of the restriction (looks like some of the tests failed as well).
>
> Good to know, will add a check.

Also, please add a test case for such a scenario. It should be trivial: just use a class defined by the test (loaded by `AppClassLoader`).
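
Something along these lines should do (a rough sketch, not the actual test; the class and names are illustrative):

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;

public class CustomClassCollectorTest {
    // Defined by the test itself, so it is loaded by the application class loader
    // and is not statically nameable from spun lambda form bytecode.
    static class MyType {}

    public static void main(String[] args) throws Throwable {
        MethodHandle collector = MethodHandles.identity(MyType[].class)
                .asCollector(MyType[].class, 3);
        MyType[] arr = (MyType[]) collector.invoke(new MyType(), new MyType(), new MyType());
        if (arr.length != 3) {
            throw new AssertionError("unexpected length: " + arr.length);
        }
    }
}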

>> src/java.base/share/classes/java/lang/invoke/MethodHandleImpl.java line 1897:
>>
>>> 1895:         // field of the receiver handle.
>>> 1896:         // This is left as a followup enhancement, as it needs to be investigated if this causes
>>> 1897:         // profile pollution.
>>
>> Profile pollution shouldn't be a problem here: both types (element type + array type) will be constants attached to the "receiver" BMH instance and during inlining will be fetched from there.  
>>
>> I'm fine with handling it as a separate RFE.
>
> That's what I thought as well (but not 100% sure). I can partially roll back the last commit so the code still uses an injected array constructor handle, and then it should be easy to add caching in the cases where the component type is a reference type.

Whatever you prefer. I'm fine with it either way.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3306

Re: RFR: 8264288: Performance issue with MethodHandle.asCollector [v3]

Jorn Vernee-2
In reply to this post by Jorn Vernee-2
> This patch speeds up MethodHandle.asCollector handles where the array type is not Object[], as well as all collectors where the arity is greater than 10.
>
> [...]

Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision:

  - Revert back to injected constructor handle
  - Add lambda form sharing
  - Add test case for collecting a custom class

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/3306/files
  - new: https://git.openjdk.java.net/jdk/pull/3306/files/313ffaaf..dbba7910

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=3306&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=3306&range=01-02

  Stats: 68 lines in 4 files changed: 32 ins; 22 del; 14 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3306.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3306/head:pull/3306

PR: https://git.openjdk.java.net/jdk/pull/3306

Re: RFR: 8264288: Performance issue with MethodHandle.asCollector [v2]

Jorn Vernee-2
In reply to this post by Vladimir Ivanov-2
On Fri, 2 Apr 2021 15:12:04 GMT, Vladimir Ivanov <[hidden email]> wrote:

>> That's what I thought as well (but not 100% sure). I can partially roll back the last commit so the code still uses an injected array constructor handle, and then it should be easy to add caching in the cases where the component type is a reference type.
>
> Whatever you prefer. I'm fine with it either way.

I found it easier to add sharing now, since it makes the intrinsic redundant (and it can be removed again).

-------------

PR: https://git.openjdk.java.net/jdk/pull/3306

Re: RFR: 8264288: Performance issue with MethodHandle.asCollector [v3]

Jorn Vernee-2
In reply to this post by Jorn Vernee-2
On Fri, 2 Apr 2021 13:23:07 GMT, Jorn Vernee <[hidden email]> wrote:

>> That's an elegant solution.
>>
>> At first I thought it might unduly perturb lambda form generation and caching, but you slotted a different lambda form implementation underneath the varargs implementation.
>
> I've addressed the review comments, plus some other things:
>
> [...]

Addressed latest review comments:
- Reverted back to using an injected constructor handle (to be able to take advantage of lambda form sharing)
- Added lambda form sharing for empty and reference arrays
- Added a test case for a custom class to VarargsArrayTest, which catches the illegal access error in the intrinsified case that Vlad pointed out (though the intrinsic itself is now removed).

I also had to switch back to the uncached version for creating the array element setter, since the cached version casts the array to a specific type, and if the lambda form is shared, this won't work (e.g. casting an Object[] to a String[], depending on the order in which the collectors are created). Adding caching there is left as a follow-up.
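
The typing difference is visible through the public `arrayElementSetter` as well (a sketch using the public API, not the internal cached store-function factory):

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;

public class StoreFuncTypes {
    public static void main(String[] args) {
        // The cached-style setter is typed to the concrete array class...
        MethodHandle stringStore = MethodHandles.arrayElementSetter(String[].class);
        System.out.println(stringStore.type()); // (String[],int,String)void
        // ...while a lambda form shared across array types needs the erased shape.
        MethodHandle objectStore = MethodHandles.arrayElementSetter(Object[].class);
        System.out.println(objectStore.type()); // (Object[],int,Object)void
    }
}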

-------------

PR: https://git.openjdk.java.net/jdk/pull/3306

Re: RFR: 8264288: Performance issue with MethodHandle.asCollector [v3]

Vladimir Ivanov-2
In reply to this post by Jorn Vernee-2
On Mon, 5 Apr 2021 11:57:08 GMT, Jorn Vernee <[hidden email]> wrote:

>> This patch speeds up MethodHandle.asCollector handles where the array type is not Object[], as well as all collectors where the arity is greater than 10.
>>
>> [...]
>
> Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision:
>
>   - Revert back to injected constructor handle
>   - Add lambda form sharing
>   - Add test case for collecting a custom class

src/java.base/share/classes/java/lang/invoke/MethodHandleImpl.java line 1909:

> 1907:         boolean isSharedLambdaForm = parameterCount == 0 || basicType.parameterType(0) == Object.class;
> 1908:         if (isSharedLambdaForm) {
> 1909:             LambdaForm lform = basicType.form().cachedLambdaForm(MethodTypeForm.LF_COLLECTOR);

How does sharing work w.r.t. the array store check? Unless you put `storeFunc` on the BMH, I don't see how sharing preserves the proper check.
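For context on what makes the sharing possible in the first place, here is a minimal illustration using the public `MethodType.erase()` as a stand-in for the internal `basicType` machinery (the class name is just for the sketch): collectors for different reference array types erase to the same basic shape, which is why a single cached LambdaForm can serve them all.

    import java.lang.invoke.MethodType;

    public class ErasedShapeSketch {
        public static void main(String[] args) {
            // (String, String) -> String[] and (Object, Object) -> Object[]
            // erase to the same basic shape, so one form can serve both.
            MethodType strings = MethodType.methodType(String[].class, String.class, String.class);
            MethodType objects = MethodType.methodType(Object[].class, Object.class, Object.class);
            System.out.println(strings.erase().equals(objects.erase())); // true
        }
    }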

-------------

PR: https://git.openjdk.java.net/jdk/pull/3306

Re: RFR: 8264288: Performance issue with MethodHandle.asCollector [v3]

Vladimir Ivanov-2
On Mon, 5 Apr 2021 12:37:06 GMT, Vladimir Ivanov <[hidden email]> wrote:

>> Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   - Revert back to injected constructor handle
>>   - Add lambda form sharing
>>   - Add test case for collecting a custom class
>
> src/java.base/share/classes/java/lang/invoke/MethodHandleImpl.java line 1909:
>
>> 1907:         boolean isSharedLambdaForm = parameterCount == 0 || basicType.parameterType(0) == Object.class;
>> 1908:         if (isSharedLambdaForm) {
>> 1909:             LambdaForm lform = basicType.form().cachedLambdaForm(MethodTypeForm.LF_COLLECTOR);
>
> How does sharing work w.r.t. the array store check? Unless you put `storeFunc` on the BMH, I don't see how sharing preserves the proper check.

Never mind, it's the type of the method handle produced by `makeCollector` that ensures all the arguments have the proper type.
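A minimal public-API sketch of that point (it builds the collector with `asCollector` rather than the internal `makeCollector`, and the class name is only for the sketch, but the type-checking idea is the same): the collector's method type is `(String,String)String[]`, so an ill-typed argument is rejected by the `asType` adaptation done by `invoke()` before it ever reaches the erased array store.

    import java.lang.invoke.MethodHandle;
    import java.lang.invoke.MethodHandles;
    import java.util.Arrays;

    public class CollectorTypeSketch {
        public static void main(String[] args) throws Throwable {
            // Type: (String, String) -> String[]
            MethodHandle collector = MethodHandles.identity(String[].class)
                    .asCollector(String[].class, 2);

            String[] ok = (String[]) collector.invokeExact("a", "b");
            System.out.println(Arrays.toString(ok));        // [a, b]

            try {
                // invoke() adapts via asType; the checkcast to String fails
                // here, before any array store happens.
                Object bad = collector.invoke("a", new Object());
            } catch (ClassCastException e) {
                System.out.println("rejected by the handle's type: " + e);
            }
        }
    }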

-------------

PR: https://git.openjdk.java.net/jdk/pull/3306

Re: RFR: 8264288: Performance issue with MethodHandle.asCollector [v3]

Vladimir Ivanov-2
In reply to this post by Jorn Vernee-2
On Mon, 5 Apr 2021 11:57:08 GMT, Jorn Vernee <[hidden email]> wrote:

> Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision:
>
>   - Revert back to injected constructor handle
>   - Add lambda form sharing
>   - Add test case for collecting a custom class

src/java.base/share/classes/java/lang/invoke/MethodHandleImpl.java line 1883:

> 1881:
> 1882:         MethodHandle newArray = MethodHandles.arrayConstructor(arrayType);
> 1883:         MethodHandle storeFunc = ArrayAccessor.getAccessor(arrayType, ArrayAccess.SET);

I'd move `storeFunc` into `makeCollectorForm()` and use the erased accessor there (`ArrayAccessor.OBJECT_ARRAY_SETTER`).

(It is interesting that `OBJECT_ARRAY_SETTER` is the `ARRAY_STORE` intrinsic.)
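A small sketch of that observation using the public equivalent, `MethodHandles.arrayElementSetter(Object[].class)` (the internal `OBJECT_ARRAY_SETTER` is assumed to behave the same way; this sketch does not touch it, and the class name is illustrative): even through the erased `(Object[],int,Object)void` setter, the JVM's `aastore` check still fires when the receiver is a more specific array, so the erased accessor loses no safety.

    import java.lang.invoke.MethodHandle;
    import java.lang.invoke.MethodHandles;

    public class ErasedSetterSketch {
        public static void main(String[] args) throws Throwable {
            // Erased setter: (Object[], int, Object) -> void
            MethodHandle setter = MethodHandles.arrayElementSetter(Object[].class);

            Object[] strings = new String[2];               // runtime type is String[]
            setter.invokeExact(strings, 0, (Object) "ok");  // fine

            try {
                setter.invokeExact(strings, 1, new Object()); // aastore check fires
            } catch (ArrayStoreException e) {
                System.out.println("caught: " + e);
            }
        }
    }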

-------------

PR: https://git.openjdk.java.net/jdk/pull/3306

Re: RFR: 8264288: Performance issue with MethodHandle.asCollector [v4]

Jorn Vernee-2
In reply to this post by Jorn Vernee-2

Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision:

  Review comment: Rearrange code to explicitly reference erased setter

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/3306/files
  - new: https://git.openjdk.java.net/jdk/pull/3306/files/dbba7910..72ccf282

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=3306&range=03
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=3306&range=02-03

  Stats: 11 lines in 1 file changed: 6 ins; 2 del; 3 mod
  Patch: https://git.openjdk.java.net/jdk/pull/3306.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/3306/head:pull/3306

PR: https://git.openjdk.java.net/jdk/pull/3306

Re: RFR: 8264288: Performance issue with MethodHandle.asCollector [v3]

Jorn Vernee-2
In reply to this post by Vladimir Ivanov-2
On Mon, 5 Apr 2021 12:48:07 GMT, Vladimir Ivanov <[hidden email]> wrote:

>> Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   - Revert back to injected constructor handle
>>   - Add lambda form sharing
>>   - Add test case for collecting a custom class
>
> src/java.base/share/classes/java/lang/invoke/MethodHandleImpl.java line 1883:
>
>> 1881:
>> 1882:         MethodHandle newArray = MethodHandles.arrayConstructor(arrayType);
>> 1883:         MethodHandle storeFunc = ArrayAccessor.getAccessor(arrayType, ArrayAccess.SET);
>
> I'd move `storeFunc` into `makeCollectorForm()` and use the erased accessor there (`ArrayAccessor.OBJECT_ARRAY_SETTER`).
>
> (It is interesting that `OBJECT_ARRAY_SETTER` is the `ARRAY_STORE` intrinsic.)

Done. I think it's also more obvious this way that the erased setter is being used. Good suggestion.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3306

Re: RFR: 8264288: Performance issue with MethodHandle.asCollector [v4]

Vladimir Ivanov-2
In reply to this post by Jorn Vernee-2
On Mon, 5 Apr 2021 14:16:37 GMT, Jorn Vernee <[hidden email]> wrote:

> Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision:
>
>   Review comment: Rearrange code to explicitly reference erased setter

Looks good.

-------------

Marked as reviewed by vlivanov (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/3306