Linux/PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used

Linux/PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used

Gustavo Romero
Hi,

On Linux/PPC64 I'm getting a series of "mbind: Invalid argument" messages that
seem exactly the same as those reported for x64 [1]:

[root@spocfire3 ~]# java -XX:+UseNUMA -version
mbind: Invalid argument
mbind: Invalid argument
mbind: Invalid argument
mbind: Invalid argument
mbind: Invalid argument
mbind: Invalid argument
mbind: Invalid argument
openjdk version "1.8.0_121"
OpenJDK Runtime Environment (build 1.8.0_121-b13)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)

[root@spocfire3 ~]# uname -a
Linux spocfire3.aus.stglabs.ibm.com 3.10.0-327.el7.ppc64le #1 SMP Thu Oct 29 17:31:13 EDT 2015 ppc64le ppc64le ppc64le GNU/Linux

[root@spocfire3 ~]# lscpu
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                160
On-line CPU(s) list:   0-159
Thread(s) per core:    8
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Model:                 2.0 (pvr 004d 0200)
Model name:            POWER8 (raw), altivec supported
L1d cache:             64K
L1i cache:             32K
L2 cache:              512K
L3 cache:              8192K
NUMA node0 CPU(s):     0-79
NUMA node8 CPU(s):     80-159

Chasing it down, it looks like the messages come from PSYoungGen::initialize() in
src/share/vm/gc_implementation/parallelScavenge/psYoungGen.cpp, which calls
initialize_work(), which in turn calls the MutableNUMASpace() constructor if
UseNUMA is set:
http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/567e410935e5/src/share/vm/gc_implementation/parallelScavenge/psYoungGen.cpp#l77

MutableNUMASpace() then calls os::numa_make_local(), which in the end calls
numa_set_bind_policy() in libnuma.so.1 [2].

I've traced some of the values for which the mbind() syscall fails:
http://termbin.com/ztfs  (search for "Invalid argument" in the log).

Assuming it's the same bug as reported in [1], and thus that it's not fixed in 9 and 10:

- Is there any work in progress or known workaround?
- Should I append this output to the description of [1], or open a new bug and
  mark it as related to [1]?

Thank you.


Best regards,
Gustavo

[1] https://bugs.openjdk.java.net/browse/JDK-8163796
[2] https://da.gd/4vXF


Re: Linux/PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used

sangheon.kim@oracle.com
Hi Gustavo,

On 02/06/2017 01:50 PM, Gustavo Romero wrote:

> [...]
> - Is there any WIP or known workaround?
There's no progress on JDK-8163796 and no workaround has been found yet.
Unfortunately, I'm not planning to fix it soon.

> - Should I append this output in [1] description or open a new one and make it
>    related to [1]?
I think your problem is the same as JDK-8163796, so adding your output to
that CR seems good.
Please add logs as well; I recommend enabling something like
"-Xlog:gc*,gc+heap*=trace".
IIRC, in my case the problem only occurred when -Xmx was small.

Thanks,
Sangheon




Re: Linux/PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used

Gustavo Romero
Hi Sangheon,

Please find my comments inline.

On 06-02-2017 20:23, sangheon wrote:

> Hi Gustavo,
>
> On 02/06/2017 01:50 PM, Gustavo Romero wrote:
>> [...]
>> - Is there any WIP or known workaround?
> There's no progress on JDK-8163796 and no workaround found yet.
> And unfortunately, I'm not planning to fix it soon.

Hive, a critical component of the Hadoop ecosystem, comes with a shell and runs
Java (with the UseNUMA flag) in the background to execute SQL-like queries. On
PPC64 the mbind() messages in question make the shell pretty cumbersome. For
instance:

hive> show databases;
mbind: Invalid argument
mbind: Invalid argument
mbind: Invalid argument (message repeats 28 more times...)
...
OK
mbind: Invalid argument
mbind: Invalid argument
mbind: Invalid argument
default
tpcds_bin_partitioned_orc_10
tpcds_text_10
Time taken: 1.036 seconds, Fetched: 3 row(s)
hive> mbind: Invalid argument
mbind: Invalid argument
mbind: Invalid argument

Also, on PPC64 a simple "java -XX:+UseParallelGC -XX:+UseNUMA -version" will
trigger the problem, without any additional flags. So I'd like to correct that
behavior (please see my next comment on that).


>> - Should I append this output in [1] description or open a new one and make it
>>    related to [1]?
> I think your problem seems same as JDK-8163796, so adding your output on the CR seems good.
> And please add logs as well. I recommend to enabling something like "-Xlog:gc*,gc+heap*=trace".
> IIRC, the problem was only occurred when the -Xmx was small in my case.

The JVM code used to discover which NUMA nodes it can bind to assumes that
nodes are consecutive and tries to bind from 0 to numa_max_node() [1, 2, 3, 4],
i.e. from 0 to the highest node number available on the system. However, at
least on PPC64, that assumption does not always hold. For instance, consider
the following NUMA topology:

available: 4 nodes (0-1,16-17)
node 0 cpus: 0 8 16 24 32
node 0 size: 130706 MB
node 0 free: 145 MB
node 1 cpus: 40 48 56 64 72
node 1 size: 0 MB
node 1 free: 0 MB
node 16 cpus: 80 88 96 104 112
node 16 size: 130630 MB
node 16 free: 529 MB
node 17 cpus: 120 128 136 144 152
node 17 size: 0 MB
node 17 free: 0 MB
node distances:
node   0   1  16  17
  0:  10  20  40  40
  1:  20  10  40  40
 16:  40  40  10  20
 17:  40  40  20  10

In that case we have four nodes, two of them without memory (1 and 17), and the
highest node id is 17. Hence if the JVM tries to bind to every node from 0 to 17,
mbind() will fail for all of them except nodes 0 and 16, which are configured and
have memory. The mbind() failures generate the "mbind: Invalid argument" messages.

A solution would be to use numa_num_configured_nodes() instead of
numa_max_node() in os::numa_get_group_num(); it returns the total number of
nodes with memory in the system (so in the example above it returns exactly 2),
and then to inspect numa_all_node_ptr in os::numa_get_leaf_groups() to find the
correct node ids to append (in our case, ids[0] = 0 [node 0] and ids[1] = 16
[node 16]).

One consequence is that the "size" argument of os::numa_get_leaf_groups() will
no longer be required and will become unused, so I guess the interface will have
to be adapted on other OSes besides Linux [5].

It would also be necessary to adapt os::Linux::rebuild_cpu_to_node_map(),
since not all NUMA nodes are suitable to be returned by a call to
os::numa_get_group_id(): some CPUs sit on a node without memory. In that case
we can return the closest NUMA node instead. A new way to translate indices to
node ids is also needed, since nodes are not always consecutive.

Finally, although "numa_nodes_ptr" is not documented in libnuma's manual, it's
what numactl uses to find out the total number of nodes in the system [6]. I
could not find a function that readily returns that number. I've asked on the
libnuma ML whether a better solution exists [7].

The following webrev implements the proposed changes on jdk9 (the backport to 8
is simple):

webrev: http://cr.openjdk.java.net/~gromero/8175813/
bug:    https://bugs.openjdk.java.net/browse/JDK-8175813

Here are the logs with "-Xlog:gc*,gc+heap*=trace":

http://cr.openjdk.java.net/~gromero/logs/pristine.log     (current state)
http://cr.openjdk.java.net/~gromero/logs/numa_patched.log (proposed change)

I've tested on 8 against SPECjvm2008 on the aforementioned machine, and
performance improved by ~5% in comparison to the same version packaged by
the distro. I don't expect any difference on machines where nodes are
consecutive and always have memory.

After a due community review, could you sponsor that change?

Thank you.


Best regards,
Gustavo

[1] http://hg.openjdk.java.net/jdk9/hs/hotspot/file/636271c3697a/src/share/vm/gc/parallel/mutableNUMASpace.cpp#l241
[2] http://hg.openjdk.java.net/jdk9/hs/hotspot/file/636271c3697a/src/os/linux/vm/os_linux.cpp#l2745
[3] http://hg.openjdk.java.net/jdk9/hs/hotspot/file/636271c3697a/src/share/vm/gc/parallel/mutableNUMASpace.cpp#l243
[4] http://hg.openjdk.java.net/jdk9/hs/hotspot/file/636271c3697a/src/os/linux/vm/os_linux.cpp#l2761
[5] http://hg.openjdk.java.net/jdk9/hs/hotspot/file/636271c3697a/src/share/vm/runtime/os.hpp#l356
[6] https://github.com/numactl/numactl/blob/master/numactl.c#L251
[7] http://www.spinics.net/lists/linux-numa/msg01173.html



Re: Linux/PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used

David Holmes
Hi Gustavo,

I am not a NUMA expert but it seems to me that our NUMA support is both
incomplete and bit-rotting. It seems evident that UseNUMA is only
working in limited contexts that match our testing environment. There
were a couple of JEPs proposed to enhance NUMA support back in 2012:

JDK-8046147 JEP 157: G1 GC: NUMA-Aware Allocation
JDK-8046153 JEP 163: Enable NUMA Mode by Default When Appropriate

but they have not progressed. If they were to progress then it seems our
overall approach to NUMA would need serious review and update - as per
your patch.

I'm also unclear about the distinction between memory and non-memory
nodes wrt. the existing os::Linux NUMA API. It isn't at all clear to me
which functions should deal only with memory nodes and which should deal
with any kind, e.g. I expect the cpu-to-node map to map a cpu to its
node, not to the nearest node with memory configured. If the latter is
what is needed, then the APIs need to be changed and their usage checked,
as that distinction does not presently exist in the code AFAICS.

It is too late to take this patch into 9 IMHO as we don't have the
ability to test it effectively, nor is there time for NUMA users to put
it through its paces. I think this would have to be part of a bigger
NUMA project for 10 that addresses the NUMA API and how it is used.

Thanks,
David

On 24/02/2017 10:02 PM, Gustavo Romero wrote:

> [...]

Re: Linux/PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used

sangheon.kim@oracle.com
Hi Gustavo,

I am not on the PPC64 ML, so I'm replying late.

>
>> After a due community review, could you sponsor that change?
Sure, I can sponsor this patch after the review. Please start the review against the JDK 10 codebase.

Thanks,
Sangheon


> On Feb 27, 2017, at 8:10 AM, David Holmes <[hidden email]> wrote:
> [...]

Re: Linux/PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used

Gustavo Romero
Hi Sangheon, David

On 27-02-2017 20:17, [hidden email] wrote:
> Hi Gustavo,
>
> I am not in the PPC64 ML, so replying late.
>
>>
>>> After a due community review, could you sponsor that change?
> Sure, I can sponsor this patch after the review. Please initiate the review on jdk 10 base.

Thank you!

@David thanks for your comments on current state of the NUMA support.

Which bug should I refer to, since the one I opened was closed as a duplicate?

Could I reopen it, given that I don't know whether the bug Sangheon reported earlier on
x64 had exactly the same root cause as the one I found on PPC64? (there is no NUMA
topology information in that report that could help with a guess...)

Thanks and best regards,
Gustavo


> Thanks,
> Sangheon
>
>
>> On Feb 27, 2017, at 8:10 AM, David Holmes <[hidden email]> wrote:
>>
>> Hi Gustavo,
>>
>> I am not a NUMA expert but it seems to me that our NUMA support is both incomplete and bit-rotting. It seems evident that UseNUMA only works in limited contexts that match our testing environment. There were a couple of JEPs proposed to enhance NUMA support back in 2012:
>>
>> JDK-8046147    JEP 157: G1 GC: NUMA-Aware Allocation
>> JDK-8046153    JEP 163: Enable NUMA Mode by Default When Appropriate
>>
>> but they have not progressed. If they were to progress then it seems our overall approach to NUMA would need serious review and update - as per your patch.
>>
>> I'm also unclear about the distinctions between memory and non-memory nodes wrt. the existing os::Linux NUMA API. It isn't at all clear to me which functions should deal only with memory nodes and which should deal with any kind, e.g. I expect the cpu-to-node map to map a cpu to its node, not to the nearest node with memory configured. If that is what is needed then the APIs need to be changed and the usage checked, as that distinction does not presently exist in the code AFAICS.
>>
>> It is too late to take this patch into 9 IMHO as we don't have the ability to test it effectively, nor is there time for NUMA users to put it through its paces. I think this would have to be part of a bigger NUMA project for 10 that addresses the NUMA API and how it is used.
>>
>> Thanks,
>> David
>>
>>> On 24/02/2017 10:02 PM, Gustavo Romero wrote:
>>> Hi Sangheon,
>>>
>>> Please find my comments inline.
>>>
>>>> On 06-02-2017 20:23, sangheon wrote:
>>>> Hi Gustavo,
>>>>
>>>>> On 02/06/2017 01:50 PM, Gustavo Romero wrote:
>>>>> Hi,
>>>>>
>>>>> On Linux/PPC64 I'm getting a series of "mbind: Invalid argument" that seems
>>>>>  exactly the same as reported for x64 [1]:
>>>>>
>>>>> [root@spocfire3 ~]# java -XX:+UseNUMA -version
>>>>> mbind: Invalid argument
>>>>> mbind: Invalid argument
>>>>> mbind: Invalid argument
>>>>> mbind: Invalid argument
>>>>> mbind: Invalid argument
>>>>> mbind: Invalid argument
>>>>> mbind: Invalid argument
>>>>> openjdk version "1.8.0_121"
>>>>> OpenJDK Runtime Environment (build 1.8.0_121-b13)
>>>>> OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)
>>>>>
>>>>> [root@spocfire3 ~]# uname -a
>>>>> Linux spocfire3.aus.stglabs.ibm.com 3.10.0-327.el7.ppc64le #1 SMP Thu Oct 29 17:31:13 EDT 2015 ppc64le ppc64le ppc64le GNU/Linux
>>>>>
>>>>> [root@spocfire3 ~]# lscpu
>>>>> Architecture:          ppc64le
>>>>> Byte Order:            Little Endian
>>>>> CPU(s):                160
>>>>> On-line CPU(s) list:   0-159
>>>>> Thread(s) per core:    8
>>>>> Core(s) per socket:    10
>>>>> Socket(s):             2
>>>>> NUMA node(s):          2
>>>>> Model:                 2.0 (pvr 004d 0200)
>>>>> Model name:            POWER8 (raw), altivec supported
>>>>> L1d cache:             64K
>>>>> L1i cache:             32K
>>>>> L2 cache:              512K
>>>>> L3 cache:              8192K
>>>>> NUMA node0 CPU(s):     0-79
>>>>> NUMA node8 CPU(s):     80-159
>>>>>
>>>>> Chasing it down, it looks like it comes from PSYoungGen::initialize() in
>>>>> src/share/vm/gc_implementation/parallelScavenge/psYoungGen.cpp, which calls
>>>>> initialize_work(), which in turn calls the MutableNUMASpace() constructor if
>>>>> UseNUMA is set:
>>>>> http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/567e410935e5/src/share/vm/gc_implementation/parallelScavenge/psYoungGen.cpp#l77
>>>>>
>>>>> MutableNUMASpace() then calls os::numa_make_local(), that in the end calls
>>>>> numa_set_bind_policy() in libnuma.so.1 [2].
>>>>>
>>>>> I've traced some values for which mbind() syscall fails:
>>>>> http://termbin.com/ztfs  (search for "Invalid argument" in the log).
>>>>>
>>>>> Assuming it's the same bug as reported in [1] and so it's not fixed on 9 and 10:
>>>>>
>>>>> - Is there any WIP or known workaround?
>>>> There's no progress on JDK-8163796 and no workaround found yet.
>>>> And unfortunately, I'm not planning to fix it soon.
>>>
>>> Hive, a critical component of the Hadoop ecosystem, comes with a shell and uses Java
>>> (with the UseNUMA flag) in the background to run MySQL-like queries. On PPC64 the
>>> mbind() messages in question make the shell pretty cumbersome. For instance:
>>>
>>> hive> show databases;
>>> mbind: Invalid argument
>>> mbind: Invalid argument
>>> mbind: Invalid argument (message repeated 28 more times...)
>>> ...
>>> OK
>>> mbind: Invalid argument
>>> mbind: Invalid argument
>>> mbind: Invalid argument
>>> default
>>> tpcds_bin_partitioned_orc_10
>>> tpcds_text_10
>>> Time taken: 1.036 seconds, Fetched: 3 row(s)
>>> hive> mbind: Invalid argument
>>> mbind: Invalid argument
>>> mbind: Invalid argument
>>>
>>> Also, on PPC64 a simple "java -XX:+UseParallelGC -XX:+UseNUMA -version" will
>>> trigger the problem, without any additional flags. So I'd like to correct that
>>> behavior (please see my next comment on that).
>>>
>>>
>>>>> - Should I append this output in [1] description or open a new one and make it
>>>>>   related to" [1]?
>>>> I think your problem is the same as JDK-8163796, so adding your output on the CR seems good.
>>>> And please add logs as well. I recommend enabling something like "-Xlog:gc*,gc+heap*=trace".
>>>> IIRC, the problem only occurred when -Xmx was small in my case.
>>>
>>> The JVM code used to discover which NUMA nodes it can bind to assumes that nodes
>>> are consecutive and tries to bind from 0 to numa_max_node() [1, 2, 3, 4], i.e. from
>>> 0 to the highest node number available on the system. However, at least on PPC64
>>> that assumption is not always true. For instance, consider the following NUMA
>>> topology:
>>>
>>> available: 4 nodes (0-1,16-17)
>>> node 0 cpus: 0 8 16 24 32
>>> node 0 size: 130706 MB
>>> node 0 free: 145 MB
>>> node 1 cpus: 40 48 56 64 72
>>> node 1 size: 0 MB
>>> node 1 free: 0 MB
>>> node 16 cpus: 80 88 96 104 112
>>> node 16 size: 130630 MB
>>> node 16 free: 529 MB
>>> node 17 cpus: 120 128 136 144 152
>>> node 17 size: 0 MB
>>> node 17 free: 0 MB
>>> node distances:
>>> node   0   1  16  17
>>>  0:  10  20  40  40
>>>  1:  20  10  40  40
>>> 16:  40  40  10  20
>>> 17:  40  40  20  10
>>>
>>> In that case we have four nodes, 2 without memory (1 and 17), where the
>>> highest node id is 17. Hence if the JVM tries to bind from 0 to 17, mbind() will
>>> fail except for nodes 0 and 16, which are configured and have memory. mbind()
>>> failures will generate the "mbind: Invalid argument" messages.
>>>
>>> A solution would be to use numa_num_configured_nodes() instead of numa_max_node()
>>> in os::numa_get_group_num(); it returns the total number of nodes with memory in
>>> the system (so in our example above it returns exactly 2), and then to inspect
>>> numa_all_nodes_ptr in os::numa_get_leaf_groups() to find the correct node ids to
>>> append (in our case, ids[0] = 0 [node 0] and ids[1] = 16 [node 16]).
>>>
>>> One consequence is that the "size" argument of os::numa_get_leaf_groups() will no
>>> longer be required and will be unused, so the interface will have to be adapted on
>>> other OSs besides Linux, I guess [5].
>>>
>>> It would be necessary to adapt os::Linux::rebuild_cpu_to_node_map()
>>> since not all numa nodes are suitable to be returned by a call to
>>> os::numa_get_group_id() as some cpus would be in a node without memory.
>>> In that case we can return the closest numa node instead. A new way to translate
>>> indices to nodes is also useful since nodes are not always consecutive.
>>>
>>> Finally, although "numa_nodes_ptr" is not present in libnuma's manual, it's what
>>> numactl uses to find out the total number of nodes in the system [6]. I could not
>>> find a function that returns that number readily, so I asked on the libnuma ML
>>> whether a better solution exists [7].
>>>
>>> The following webrev implements the proposed changes on jdk9 (backport to 8 is
>>> simple):
>>>
>>> webrev: http://cr.openjdk.java.net/~gromero/8175813/
>>> bug:    https://bugs.openjdk.java.net/browse/JDK-8175813
>>>
>>> Here are the logs with "-Xlog:gc*,gc+heap*=trace":
>>>
>>> http://cr.openjdk.java.net/~gromero/logs/pristine.log     (current state)
>>> http://cr.openjdk.java.net/~gromero/logs/numa_patched.log (proposed change)
>>>
>>> I've tested on 8 against SPECjvm2008 on the aforementioned machine and
>>> performance improved ~5% in comparison to the same version packaged by
>>> the distro, but I don't expect any difference on machines where nodes
>>> are always consecutive and where nodes always have memory.
>>>
>>> After a due community review, could you sponsor that change?
>>>
>>> Thank you.
>>>
>>>
>>> Best regards,
>>> Gustavo
>>>
>>> [1] http://hg.openjdk.java.net/jdk9/hs/hotspot/file/636271c3697a/src/share/vm/gc/parallel/mutableNUMASpace.cpp#l241
>>> [2] http://hg.openjdk.java.net/jdk9/hs/hotspot/file/636271c3697a/src/os/linux/vm/os_linux.cpp#l2745
>>> [3] http://hg.openjdk.java.net/jdk9/hs/hotspot/file/636271c3697a/src/share/vm/gc/parallel/mutableNUMASpace.cpp#l243
>>> [4] http://hg.openjdk.java.net/jdk9/hs/hotspot/file/636271c3697a/src/os/linux/vm/os_linux.cpp#l2761
>>> [5] http://hg.openjdk.java.net/jdk9/hs/hotspot/file/636271c3697a/src/share/vm/runtime/os.hpp#l356
>>> [6] https://github.com/numactl/numactl/blob/master/numactl.c#L251
>>> [7] http://www.spinics.net/lists/linux-numa/msg01173.html
>>>
>>>>
>>>> Thanks,
>>>> Sangheon
>>>>
>>>>
>>>>>
>>>>> Thank you.
>>>>>
>>>>>
>>>>> Best regards,
>>>>> Gustavo
>>>>>
>>>>> [1] https://bugs.openjdk.java.net/browse/JDK-8163796
>>>>> [2] https://da.gd/4vXF
>>>>>
>>>>
>>>
>


Re: Linux/PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used

David Holmes
Hi Gustavo,

On 2/03/2017 6:53 AM, Gustavo Romero wrote:

> Hi Sangheon, David
>
> On 27-02-2017 20:17, [hidden email] wrote:
>> Hi Gustavo,
>>
>> I am not in the PPC64 ML, so replying late.
>>
>>>
>>>> After a due community review, could you sponsor that change?
>> Sure, I can sponsor this patch after the review. Please initiate the review on jdk 10 base.
>
> Thank you!
>
> @David thanks for your comments on current state of the NUMA support.
>
> Which bug should I refer to, since the one I opened was closed as a duplicate?
>
> Could I reopen it, given that I don't know whether the bug Sangheon reported earlier on
> x64 had exactly the same root cause as the one I found on PPC64? (there is no NUMA
> topology information in that report that could help with a guess...)

You are right and I have reopened JDK-8175813. While they both have a
similar symptom, the EINVAL from mbind can have many different causes
and the two failure modes seem likely to be different.

That said, I'm still concerned that your patch makes a distinction
between memory and non-memory nodes that the existing NUMA code seems
oblivious to. I would want the owners of that code to evaluate the
patch in that context. I also think a lot more work is needed in the
NUMA area in general - see the JEPs I mentioned earlier.

Thanks,
David

> Thanks and best regards,
> Gustavo
>
>
>> Thanks,
>> Sangheon
>>
>>
>>> On Feb 27, 2017, at 8:10 AM, David Holmes <[hidden email]> wrote:
>>>
>>> Hi Gustavo,
>>>
>>> I am not a NUMA expert but it seems to me that our NUMA support is both incomplete and bit-rotting. It seems evident that UseNUMA only works in limited contexts that match our testing environment. There were a couple of JEPs proposed to enhance NUMA support back in 2012:
>>>
>>> JDK-8046147    JEP 157: G1 GC: NUMA-Aware Allocation
>>> JDK-8046153    JEP 163: Enable NUMA Mode by Default When Appropriate
>>>
>>> but they have not progressed. If they were to progress then it seems our overall approach to NUMA would need serious review and update - as per your patch.
>>>
>>> I'm also unclear about the distinctions between memory and non-memory nodes wrt. the existing os::Linux NUMA API. It isn't at all clear to me which functions should deal only with memory nodes and which should deal with any kind, e.g. I expect the cpu-to-node map to map a cpu to its node, not to the nearest node with memory configured. If that is what is needed then the APIs need to be changed and the usage checked, as that distinction does not presently exist in the code AFAICS.
>>>
>>> It is too late to take this patch into 9 IMHO as we don't have the ability to test it effectively, nor is there time for NUMA users to put it through its paces. I think this would have to be part of a bigger NUMA project for 10 that addresses the NUMA API and how it is used.
>>>
>>> Thanks,
>>> David
>>>
>>>> On 24/02/2017 10:02 PM, Gustavo Romero wrote:
>>>> Hi Sangheon,
>>>>
>>>> Please find my comments inline.
>>>>
>>>>> On 06-02-2017 20:23, sangheon wrote:
>>>>> Hi Gustavo,
>>>>>
>>>>>> On 02/06/2017 01:50 PM, Gustavo Romero wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On Linux/PPC64 I'm getting a series of "mbind: Invalid argument" that seems
>>>>>>  exactly the same as reported for x64 [1]:
>>>>>>
>>>>>> [root@spocfire3 ~]# java -XX:+UseNUMA -version
>>>>>> mbind: Invalid argument
>>>>>> mbind: Invalid argument
>>>>>> mbind: Invalid argument
>>>>>> mbind: Invalid argument
>>>>>> mbind: Invalid argument
>>>>>> mbind: Invalid argument
>>>>>> mbind: Invalid argument
>>>>>> openjdk version "1.8.0_121"
>>>>>> OpenJDK Runtime Environment (build 1.8.0_121-b13)
>>>>>> OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)
>>>>>>
>>>>>> [root@spocfire3 ~]# uname -a
>>>>>> Linux spocfire3.aus.stglabs.ibm.com 3.10.0-327.el7.ppc64le #1 SMP Thu Oct 29 17:31:13 EDT 2015 ppc64le ppc64le ppc64le GNU/Linux
>>>>>>
>>>>>> [root@spocfire3 ~]# lscpu
>>>>>> Architecture:          ppc64le
>>>>>> Byte Order:            Little Endian
>>>>>> CPU(s):                160
>>>>>> On-line CPU(s) list:   0-159
>>>>>> Thread(s) per core:    8
>>>>>> Core(s) per socket:    10
>>>>>> Socket(s):             2
>>>>>> NUMA node(s):          2
>>>>>> Model:                 2.0 (pvr 004d 0200)
>>>>>> Model name:            POWER8 (raw), altivec supported
>>>>>> L1d cache:             64K
>>>>>> L1i cache:             32K
>>>>>> L2 cache:              512K
>>>>>> L3 cache:              8192K
>>>>>> NUMA node0 CPU(s):     0-79
>>>>>> NUMA node8 CPU(s):     80-159
>>>>>>
>>>>>> Chasing it down, it looks like it comes from PSYoungGen::initialize() in
>>>>>> src/share/vm/gc_implementation/parallelScavenge/psYoungGen.cpp, which calls
>>>>>> initialize_work(), which in turn calls the MutableNUMASpace() constructor if
>>>>>> UseNUMA is set:
>>>>>> http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/567e410935e5/src/share/vm/gc_implementation/parallelScavenge/psYoungGen.cpp#l77
>>>>>>
>>>>>> MutableNUMASpace() then calls os::numa_make_local(), that in the end calls
>>>>>> numa_set_bind_policy() in libnuma.so.1 [2].
>>>>>>
>>>>>> I've traced some values for which mbind() syscall fails:
>>>>>> http://termbin.com/ztfs  (search for "Invalid argument" in the log).
>>>>>>
>>>>>> Assuming it's the same bug as reported in [1] and so it's not fixed on 9 and 10:
>>>>>>
>>>>>> - Is there any WIP or known workaround?
>>>>> There's no progress on JDK-8163796 and no workaround found yet.
>>>>> And unfortunately, I'm not planning to fix it soon.
>>>>
>>>> Hive, a critical component of the Hadoop ecosystem, comes with a shell and uses Java
>>>> (with the UseNUMA flag) in the background to run MySQL-like queries. On PPC64 the
>>>> mbind() messages in question make the shell pretty cumbersome. For instance:
>>>>
>>>> hive> show databases;
>>>> mbind: Invalid argument
>>>> mbind: Invalid argument
>>>> mbind: Invalid argument (message repeated 28 more times...)
>>>> ...
>>>> OK
>>>> mbind: Invalid argument
>>>> mbind: Invalid argument
>>>> mbind: Invalid argument
>>>> default
>>>> tpcds_bin_partitioned_orc_10
>>>> tpcds_text_10
>>>> Time taken: 1.036 seconds, Fetched: 3 row(s)
>>>> hive> mbind: Invalid argument
>>>> mbind: Invalid argument
>>>> mbind: Invalid argument
>>>>
>>>> Also, on PPC64 a simple "java -XX:+UseParallelGC -XX:+UseNUMA -version" will
>>>> trigger the problem, without any additional flags. So I'd like to correct that
>>>> behavior (please see my next comment on that).
>>>>
>>>>
>>>>>> - Should I append this output in [1] description or open a new one and make it
>>>>>>   related to" [1]?
>>>>> I think your problem is the same as JDK-8163796, so adding your output on the CR seems good.
>>>>> And please add logs as well. I recommend enabling something like "-Xlog:gc*,gc+heap*=trace".
>>>>> IIRC, the problem only occurred when -Xmx was small in my case.
>>>>
>>>> The JVM code used to discover which NUMA nodes it can bind to assumes that nodes
>>>> are consecutive and tries to bind from 0 to numa_max_node() [1, 2, 3, 4], i.e. from
>>>> 0 to the highest node number available on the system. However, at least on PPC64
>>>> that assumption is not always true. For instance, consider the following NUMA
>>>> topology:
>>>>
>>>> available: 4 nodes (0-1,16-17)
>>>> node 0 cpus: 0 8 16 24 32
>>>> node 0 size: 130706 MB
>>>> node 0 free: 145 MB
>>>> node 1 cpus: 40 48 56 64 72
>>>> node 1 size: 0 MB
>>>> node 1 free: 0 MB
>>>> node 16 cpus: 80 88 96 104 112
>>>> node 16 size: 130630 MB
>>>> node 16 free: 529 MB
>>>> node 17 cpus: 120 128 136 144 152
>>>> node 17 size: 0 MB
>>>> node 17 free: 0 MB
>>>> node distances:
>>>> node   0   1  16  17
>>>>  0:  10  20  40  40
>>>>  1:  20  10  40  40
>>>> 16:  40  40  10  20
>>>> 17:  40  40  20  10
>>>>
>>>> In that case we have four nodes, 2 without memory (1 and 17), where the
>>>> highest node id is 17. Hence if the JVM tries to bind from 0 to 17, mbind() will
>>>> fail except for nodes 0 and 16, which are configured and have memory. mbind()
>>>> failures will generate the "mbind: Invalid argument" messages.
>>>>
>>>> A solution would be to use numa_num_configured_nodes() instead of numa_max_node()
>>>> in os::numa_get_group_num(); it returns the total number of nodes with memory in
>>>> the system (so in our example above it returns exactly 2), and then to inspect
>>>> numa_all_nodes_ptr in os::numa_get_leaf_groups() to find the correct node ids to
>>>> append (in our case, ids[0] = 0 [node 0] and ids[1] = 16 [node 16]).
>>>>
>>>> One consequence is that the "size" argument of os::numa_get_leaf_groups() will no
>>>> longer be required and will be unused, so the interface will have to be adapted on
>>>> other OSs besides Linux, I guess [5].
>>>>
>>>> It would be necessary to adapt os::Linux::rebuild_cpu_to_node_map()
>>>> since not all numa nodes are suitable to be returned by a call to
>>>> os::numa_get_group_id() as some cpus would be in a node without memory.
>>>> In that case we can return the closest numa node instead. A new way to translate
>>>> indices to nodes is also useful since nodes are not always consecutive.
>>>>
>>>> Finally, although "numa_nodes_ptr" is not present in libnuma's manual, it's what
>>>> numactl uses to find out the total number of nodes in the system [6]. I could not
>>>> find a function that returns that number readily, so I asked on the libnuma ML
>>>> whether a better solution exists [7].
>>>>
>>>> The following webrev implements the proposed changes on jdk9 (backport to 8 is
>>>> simple):
>>>>
>>>> webrev: http://cr.openjdk.java.net/~gromero/8175813/
>>>> bug:    https://bugs.openjdk.java.net/browse/JDK-8175813
>>>>
>>>> Here are the logs with "-Xlog:gc*,gc+heap*=trace":
>>>>
>>>> http://cr.openjdk.java.net/~gromero/logs/pristine.log     (current state)
>>>> http://cr.openjdk.java.net/~gromero/logs/numa_patched.log (proposed change)
>>>>
>>>> I've tested on 8 against SPECjvm2008 on the aforementioned machine and
>>>> performance improved ~5% in comparison to the same version packaged by
>>>> the distro, but I don't expect any difference on machines where nodes
>>>> are always consecutive and where nodes always have memory.
>>>>
>>>> After a due community review, could you sponsor that change?
>>>>
>>>> Thank you.
>>>>
>>>>
>>>> Best regards,
>>>> Gustavo
>>>>
>>>> [1] http://hg.openjdk.java.net/jdk9/hs/hotspot/file/636271c3697a/src/share/vm/gc/parallel/mutableNUMASpace.cpp#l241
>>>> [2] http://hg.openjdk.java.net/jdk9/hs/hotspot/file/636271c3697a/src/os/linux/vm/os_linux.cpp#l2745
>>>> [3] http://hg.openjdk.java.net/jdk9/hs/hotspot/file/636271c3697a/src/share/vm/gc/parallel/mutableNUMASpace.cpp#l243
>>>> [4] http://hg.openjdk.java.net/jdk9/hs/hotspot/file/636271c3697a/src/os/linux/vm/os_linux.cpp#l2761
>>>> [5] http://hg.openjdk.java.net/jdk9/hs/hotspot/file/636271c3697a/src/share/vm/runtime/os.hpp#l356
>>>> [6] https://github.com/numactl/numactl/blob/master/numactl.c#L251
>>>> [7] http://www.spinics.net/lists/linux-numa/msg01173.html
>>>>
>>>>>
>>>>> Thanks,
>>>>> Sangheon
>>>>>
>>>>>
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>>
>>>>>> Best regards,
>>>>>> Gustavo
>>>>>>
>>>>>> [1] https://bugs.openjdk.java.net/browse/JDK-8163796
>>>>>> [2] https://da.gd/4vXF
>>>>>>
>>>>>
>>>>
>>
>