Reduced performance in Java 9.0.1 (vs 8u152)

Reduced performance in Java 9.0.1 (vs 8u152)

Martin Traverso
Hi,

We're in the process of migrating and qualifying Presto (http://prestodb.io) to build and run on Java 9. One of the key dependencies is a library of pure-java compression and decompression algorithms (http://github.com/airlift/aircompressor). 

In the course of trying to understand the performance characteristics when running on Java 9, we discovered a significant drop in performance for the compression algorithms (up to 10%) when compared to 8u152.

Here's a summary of the results and instructions on how to run the benchmarks: https://github.com/martint/aircompressor/tree/perf

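These are the outputs of JMH's perfasm profiler:

Java 8u152: https://github.com/martint/aircompressor/blob/perf/perf-8.txt
Java 9.0.1: https://github.com/martint/aircompressor/blob/perf/perf-9.txt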

The generated assembly looks very different, but as far as I can tell, it's just different decisions of when and which registers to spill.

- Martin

Re: Reduced performance in Java 9.0.1 (vs 8u152)

Eric Caspole-2
Hi Martin,

As you may know, JEP 248 made G1 the default collector for 9 where it
was ParallelGC earlier: http://openjdk.java.net/jeps/248

I tried your JMH benchmark with -XX:+UseParallelGC set through JMH
annotations, and the performance of 9 seems about even with the 8u131
that I have handy.

Maybe you could try this for yourself and see how it goes.
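
Before comparing numbers it may help to confirm which collector each JVM actually picked. The following is a minimal sketch using only the standard java.lang.management API; the class name GcCheck is made up for illustration. With G1 the reported names typically start with "G1", and with ParallelGC they typically start with "PS". In a JMH benchmark the flag itself can be passed per fork, for example with @Fork(jvmArgsAppend = "-XX:+UseParallelGC").

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of aircompressor or JMH): lists the
// garbage collectors registered in the running JVM.
class GcCheck {
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(bean.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Run once per JDK / flag combination to see what is in effect.
        System.out.println(System.getProperty("java.version") + " " + collectorNames());
    }
}
```

Running it as "java -XX:+UseParallelGC GcCheck" and again without the flag should show whether the default changed between the two JDKs.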

Regards,
Eric

On 12/22/2017 12:59 PM, Martin Traverso wrote:

> Hi,
>
> We're in the process of migrating and qualifying Presto
> (http://prestodb.io) to build and run on Java 9. One of the key
> dependencies is a library of pure-java compression and decompression
> algorithms (http://github.com/airlift/aircompressor).
>
> In the course of trying to understand the performance characteristics
> when running on Java 9, we discovered a significant drop in
> performance for the compression algorithms (up to 10%) when compared
> to 8u152.
>
> Here's a summary of the results and instructions on how to run the
> benchmarks: https://github.com/martint/aircompressor/tree/perf
>
> These are the outputs of JMH's perfasm profiler:
>
> Java 8u152: https://github.com/martint/aircompressor/blob/perf/perf-8.txt
> Java 9.0.1: https://github.com/martint/aircompressor/blob/perf/perf-9.txt
>
> The generated assembly looks very different, but as far as I can tell,
> it's just different decisions of when and which registers to spill.
>
> - Martin


Re: Reduced performance in Java 9.0.1 (vs 8u152)

Dawid Weiss
> I tried your JMH specifying +UseParallelGC by JMH annotations and the
> performance of 9 seems quite even to 8u131 that I have handy.

Just a note: we have observed a similar effect with a (long)
computational process which does
some minor GCs and logs GC timings at the end. The results are
significantly slower with G1GC on both 8 and 9 (the times below are
quite repeatable):

JDK, GC, Time
8, g1, 3h 25m
9, g1, 3h 22m
8, par (default), 3h 0m

The actual in-GC timings reported are much smaller than the overall
absolute difference -- ~2 minutes for both.
The process is highly concurrent and heavy on computation, I/O...
pretty much every aspect you can think of.

To be fair to the G1 -- it acts *much* better on larger heaps and
low-memory conditions (the default GC on 8 falls into
repeated major collections and effectively stalls the process).

Dawid


Re: Reduced performance in Java 9.0.1 (vs 8u152)

Martin Traverso
In reply to this post by Eric Caspole-2
Yes, I'm aware of that change. This code is allocation-free, so I wasn't expecting that to be a factor. I'll rerun the benchmarks with that and report back.

- Martin


RE: Reduced performance in Java 9.0.1 (vs 8u152)

Uwe Schindler-4
Hi,

Allocation-free does not mean that the code is unaffected by G1GC. Because G1GC uses more parallelism, it also needs to add more checks/barriers to the compiled code, and these also affect code that does no allocation. So in general, with G1GC I have seen slowdowns of up to 10% for code that only does calculations. This is, by the way, one reason why the Elasticsearch people still recommend the CMS collector with generally small heap sizes (Elasticsearch, or rather Apache Lucene, does most of its work, including expensive calculations, outside the heap in mmapped files).
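
As a rough illustration of where such barriers appear, here is a minimal sketch; the class and method names are made up. GC write barriers are attached to stores of object references, while stores of primitive values into a byte[] do not need them. The barriers themselves are not visible in Java source, so the comments only mark where a JIT would typically emit them.

```java
// Hypothetical example: two loops that look alike in source but differ in
// whether the stores are subject to the collector's write barrier.
class BarrierSketch {
    // Primitive stores: no reference is written, so no GC write barrier
    // is needed for these stores.
    static void fillBytes(byte[] out, byte value) {
        for (int i = 0; i < out.length; i++) {
            out[i] = value;
        }
    }

    // Reference stores: each store writes an object reference into the
    // heap, so compiled code typically carries the active collector's
    // write barrier here (more elaborate for G1 than for ParallelGC).
    static void fillRefs(Object[] out, Object value) {
        for (int i = 0; i < out.length; i++) {
            out[i] = value;
        }
    }

    public static void main(String[] args) {
        byte[] bytes = new byte[16];
        Object[] refs = new Object[16];
        fillBytes(bytes, (byte) 7);
        fillRefs(refs, "x");
        System.out.println(bytes[15] + " " + refs[15]);
    }
}
```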

Uwe

-----
Uwe Schindler
[hidden email]
ASF Member, Apache Lucene PMC / Committer
Bremen, Germany
http://lucene.apache.org/



Re: Reduced performance in Java 9.0.1 (vs 8u152)

Igor Veresov
In reply to this post by Martin Traverso
The barriers are substantially more complicated though.

igor



Re: Reduced performance in Java 9.0.1 (vs 8u152)

Martin Traverso
I'm not sure I understand why it would need to insert barriers in this code, though. It's effectively a single method that doesn't access any state besides a couple of byte[] that are passed in (no shared state, no object/field references, etc). Other than the safepoint polling on loop backedges, there should be no "external interference". In the generated assembly (https://github.com/martint/aircompressor/blob/perf/perf-9.txt), which instructions correspond to barriers? 

In any case, I re-ran the benchmarks using G1 on both Java 8 and 9 and the results are very similar to what I observed initially. I updated the page with the outputs of the latest runs, in case you want to take a look: https://github.com/martint/aircompressor/tree/perf

Thanks,

- Martin








Re: Reduced performance in Java 9.0.1 (vs 8u152)

Kirk Pepperdine
In reply to this post by Martin Traverso
Hi Martin,

I've set up benchmarks that saturate the CPU, creating a zero-sum condition. Under these conditions you can get some idea of GC overhead by looking at workload throughput times. I've found that no matter how you tune G1, the closest you can get to CMS numbers is about 10%. I believe one of the costs that you'd experience is RSet refinement queue updating. You may be able to get a handle on this cost by making Eden large enough that your bench doesn't experience a GC cycle during the run. My guess is that this should minimize (hopefully eliminate) RSet refinement costs, which may give you an idea of that cost. It's something that I've not tried myself, so it might just be a crazy/ridiculous experiment.
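
One way to check that a run really saw no GC cycle is to compare collection counts before and after the workload. This is a minimal sketch using the standard java.lang.management API; the class name GcCycleCount and the stand-in workload are assumptions for illustration, not part of the benchmarks discussed here. The young generation can be enlarged with a flag such as -Xmn, with the size chosen to fit the benchmark.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: counts GC cycles that happen while a workload runs.
class GcCycleCount {
    // Sum of collection counts over all collectors in this JVM.
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = bean.getCollectionCount(); // -1 when undefined
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    // Runs the workload and returns how many collections occurred meanwhile.
    static long collectionsDuring(Runnable workload) {
        long before = totalCollections();
        workload.run();
        return totalCollections() - before;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1 << 20];
        long cycles = collectionsDuring(() -> {
            // Stand-in for the real benchmark body: allocation-free work.
            long sum = 0;
            for (int round = 0; round < 100; round++) {
                for (byte b : data) {
                    sum += b;
                }
            }
            if (sum != 0) {
                throw new IllegalStateException("unexpected checksum");
            }
        });
        System.out.println("GC cycles during workload: " + cycles);
    }
}
```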

Kind regards,
Kirk



Re: Reduced performance in Java 9.0.1 (vs 8u152)

Andrew Haley
In reply to this post by Martin Traverso
On 23/12/17 08:16, Martin Traverso wrote:
> I'm not sure I understand why it would need to insert barriers in this
> code, though. It's effectively a single method that doesn't access any
> state besides a couple of byte[] that are passed in (no shared state, no
> object/field references, etc). Other than the safepoint polling on loop
> backedges, there should be no "external interference". In the generated
> assembly (https://github.com/martint/aircompressor/blob/perf/perf-9.txt),
> which instructions correspond to barriers?

None of them.  I think the effect you're seeing here might be more to
do with a different unrolling strategy which perhaps leads to more
spilling.  8u152 has a tighter inner loop. It's just one of those
things which happens sometimes with JIT compilers, IMO.  G1 is a red
herring.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: Reduced performance in Java 9.0.1 (vs 8u152)

Dawid Weiss
> do with a different unrolling strategy which perhaps leads to more
> spilling.  8u152 has a tighter inner loop. It's just one of those
> things which happens sometimes with JIT compilers, IMO.  G1 is a red
> herring.

I don't have enough knowledge of JIT internals to argue, but my
experience and real-life software
runs clearly show the difference Martin mentioned within a single JVM version.

JDK, GC, Time
8, g1, 3h 25m
8, par (default), 3h 0m

Again -- this is a particular result, but they're very repeatable and
the average/ variance is definitely
not accidental.

Dawid

Re: Reduced performance in Java 9.0.1 (vs 8u152)

Andrew Haley
On 02/01/18 09:37, Dawid Weiss wrote:

>> do with a different unrolling strategy which perhaps leads to more
>> spilling.  8u152 has a tighter inner loop. It's just one of those
>> things which happens sometimes with JIT compilers, IMO.  G1 is a red
>> herring.
>
> I don't have enough knowledge of JIT internals to argue, but my
> experience and real-life software
> runs clearly show the difference Martin mentioned within a single JVM version.
>
> JDK, GC, Time
> 8, g1, 3h 25m
> 8, par (default), 3h 0m
>
> Again -- this is a particular result, but they're very repeatable and
> the average/ variance is definitely
> not accidental.

I'm only looking at the posted assembler code.  I can't see any
barriers.  However, the fact that there are barriers in the code we're
not looking at will affect register allocation, especially on a
machine as register starved as x86, so changing the GC will affect
even code which doesn't do anything GC related.  All it has to do is
use one more register, and that can cause a storm of spilling.

It would help a lot if someone posted the assembler code from a debug
build.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

Re: Reduced performance in Java 9.0.1 (vs 8u152)

Roland Westrelin-3
In reply to this post by Martin Traverso

Hi Martin,

One thing I noticed that seems to affect performance is a change in the
implementation of Unsafe.copyMemory(). JDK-8149159 added some
verification code. Commenting out that verification code recovers some
of the performance and the compiled code for
Lz4RawDecompressor::decompress() is then much closer to that of the 8u
vm.

Roland.

diff --git a/src/java.base/share/classes/jdk/internal/misc/Unsafe.java b/src/java.base/share/classes/jdk/internal/misc/Unsafe.java
--- a/src/java.base/share/classes/jdk/internal/misc/Unsafe.java
+++ b/src/java.base/share/classes/jdk/internal/misc/Unsafe.java
@@ -779,11 +779,11 @@
     public void copyMemory(Object srcBase, long srcOffset,
                            Object destBase, long destOffset,
                            long bytes) {
-        copyMemoryChecks(srcBase, srcOffset, destBase, destOffset, bytes);
-
-        if (bytes == 0) {
-            return;
-        }
+        // copyMemoryChecks(srcBase, srcOffset, destBase, destOffset, bytes);
+
+        // if (bytes == 0) {
+        //     return;
+        // }
 
         copyMemory0(srcBase, srcOffset, destBase, destOffset, bytes);
     }

Re: Reduced performance in Java 9.0.1 (vs 8u152)

Martin Traverso
> One thing I noticed that seems to affect performance is a change in the
> implementation of Unsafe.copyMemory(). JDK-8149159 added some
> verification code.

The decompressor uses Unsafe.copyMemory() for the tail of the data block, so it shouldn't be called so frequently that it would affect performance that much. Do you think the mere presence of those checks (and, thus, extra code in the optimization window) may have some effect on the JIT's ability to optimize the rest of the code?

- Martin
 

Re: Reduced performance in Java 9.0.1 (vs 8u152)

Dawid Weiss
Could be, but we don't use Unsafe (at least not explicitly) and the slowdown
does seem to be present in various bits of software (see [1], for example),
so it seems like something more systematic than a single method call.

Could be a butterfly effect or a combination of different factors, as
was already mentioned.
Interesting nonetheless.

Dawid

[1] LUCENE-7966; the discussion thread contains execution times and
discussion concerning 9 vs. 8 performance (and nobody seemed to know
how to explain the numbers).

https://issues.apache.org/jira/browse/LUCENE-7966?focusedCommentId=16174500&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16174500

Re: Reduced performance in Java 9.0.1 (vs 8u152)

Laurent Bourgès
Hi,

I wonder if the gcc 4.9 compiler used for JDK 9 builds may cause some slowdowns (native code).

I already tested a few gcc options (auto-vectorization) and they had some effect on the AWT library (C maskfill loops). I also tried gcc 6...

Do you experience slowdowns on Linux only, or is it more general (cross-platform)?

My 2 cents...

Laurent


Re: Reduced performance in Java 9.0.1 (vs 8u152)

Dawid Weiss
I'm on Windows, the Lucene folks are on Linux/ Macs...

Dawid

Re: Reduced performance in Java 9.0.1 (vs 8u152)

Roland Westrelin-3
In reply to this post by Martin Traverso

> The decompressor uses Unsafe.copyMemory() for the tail of the data block,
> so it shouldn't be called so frequently that it would affect performance
> that much. Do you think the mere presence of those checks (and, thus, extra
> code in the optimization window) may have some effect on the JIT's ability
> to optimize the rest of the code?

The fast copy loop with getLong/putLong has an extra spill when the
checks are present. So they affect register allocation at least.
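
For readers who have not seen the pattern, the shape of such a copy is roughly as follows. This is only a sketch under stated assumptions, not the aircompressor code: it uses a byte-array VarHandle view as a stand-in for Unsafe.getLong/putLong and System.arraycopy as a stand-in for Unsafe.copyMemory(), and the class name CopySketch is made up.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hypothetical sketch of "copy 8 bytes at a time, then handle the tail".
class CopySketch {
    // Views a byte[] as a sequence of longs (stand-in for Unsafe getLong/putLong).
    private static final VarHandle LONG_VIEW =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    static void copy(byte[] src, int srcPos, byte[] dst, int dstPos, int length) {
        int i = 0;
        // Fast loop: one 8-byte load and one 8-byte store per iteration.
        for (; i + Long.BYTES <= length; i += Long.BYTES) {
            long word = (long) LONG_VIEW.get(src, srcPos + i);
            LONG_VIEW.set(dst, dstPos + i, word);
        }
        // Tail: fewer than 8 bytes remain (stand-in for Unsafe.copyMemory()).
        if (i < length) {
            System.arraycopy(src, srcPos + i, dst, dstPos + i, length - i);
        }
    }

    public static void main(String[] args) {
        byte[] src = new byte[29];
        for (int k = 0; k < src.length; k++) {
            src[k] = (byte) (k + 1);
        }
        byte[] dst = new byte[40];
        copy(src, 3, dst, 5, 21); // two 8-byte words plus a 5-byte tail
        System.out.println(java.util.Arrays.toString(dst));
    }
}
```

The point of the sketch is that the tail call sits in the same compiled method as the hot loop, so anything inlined at that call site takes part in the same register allocation.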

Roland.