[10] RFR (S) 8175813: PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used


Gustavo Romero
Hi,

Could the following webrev be reviewed please?

It improves the numa node detection when non-consecutive or memory-less nodes
exist in the system.

webrev: http://cr.openjdk.java.net/~gromero/8175813/v2/
bug   : https://bugs.openjdk.java.net/browse/JDK-8175813

Currently, although no problem occurs when the JVM detects NUMA nodes that are
consecutive and have memory, for example in a NUMA topology like:

available: 2 nodes (0-1)
node 0 cpus: 0 8 16 24 32
node 0 size: 65258 MB
node 0 free: 34 MB
node 1 cpus: 40 48 56 64 72
node 1 size: 65320 MB
node 1 free: 150 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10,

it fails to detect the NUMA nodes to be used by the Parallel GC in a NUMA
topology like:

available: 4 nodes (0-1,16-17)
node 0 cpus: 0 8 16 24 32
node 0 size: 130706 MB
node 0 free: 7729 MB
node 1 cpus: 40 48 56 64 72
node 1 size: 0 MB
node 1 free: 0 MB
node 16 cpus: 80 88 96 104 112
node 16 size: 130630 MB
node 16 free: 5282 MB
node 17 cpus: 120 128 136 144 152
node 17 size: 0 MB
node 17 free: 0 MB
node distances:
node   0   1  16  17
  0:  10  20  40  40
  1:  20  10  40  40
 16:  40  40  10  20
 17:  40  40  20  10,

where node 16 is not consecutive with node 1, and nodes 1 and 17 have no
memory.

On such a topology, os::numa_make_local() will receive, as a hint, a local
group id that is not available in the system for binding (it receives all node
ids from 0 to 17), causing a proliferation of "mbind: Invalid argument"
messages:

http://cr.openjdk.java.net/~gromero/logs/jdk10_pristine.log

The change improves the detection by making the JVM NUMA API aware of NUMA
nodes that are non-consecutive between 0 and the highest node number, and of
nodes that might be memory-less, i.e. that might not be, in libnuma terms,
configured nodes. Hence only the configured nodes will be made available:

http://cr.openjdk.java.net/~gromero/logs/jdk10_numa_patched.log

The change has no effect on NUMA topologies where the problem does not occur,
i.e. there is no change in the number of nodes and no change in the
CPU-to-node map. On NUMA topologies where memory-less nodes exist (like in the
last example above), CPUs from a memory-less node cannot bind locally, so they
are mapped to the closest node; otherwise they would not be associated with
any node and MutableNUMASpace::cas_allocate() would pick a node randomly,
compromising performance.

I found no regressions on x64 for the following numa topology:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 8 9 10 11
node 0 size: 24102 MB
node 0 free: 19806 MB
node 1 cpus: 4 5 6 7 12 13 14 15
node 1 size: 24190 MB
node 1 free: 21951 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

I understand that fixing the current NUMA detection is a prerequisite to
enabling UseNUMA by default [1] and to extending NUMA-aware allocation to the
G1 GC [2].

Thank you.


Best regards,
Gustavo

[1] https://bugs.openjdk.java.net/browse/JDK-8046153 (JEP 163: Enable NUMA Mode by Default When Appropriate)
[2] https://bugs.openjdk.java.net/browse/JDK-8046147 (JEP 157: G1 GC: NUMA-Aware Allocation)


Re: [10] RFR (S) 8175813: PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used

Gustavo Romero
Hi,

Any update on it?

Thank you.

Regards,
Gustavo


Re: [10] RFR (S) 8175813: PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used

Volker Simonis
Hi Gustavo,

thanks for addressing this problem and sorry for my late reply. I
think this is a good change which definitely improves the situation
for uncommon NUMA configurations without changing the handling for
common topologies.

It would be great if somebody could run this through JPRT, but as
Gustavo mentioned, I don't expect any regressions.

@Igor: I think you've been the original author of the NUMA-aware
allocator port to Linux (i.e. "6684395: Port NUMA-aware allocator to
linux"). If you could find some spare minutes to take a look at this
change, your comment would be very much appreciated :)

Following some minor comments from me:

- in os::numa_get_groups_num() you now use numa_num_configured_nodes()
to get the actual number of configured nodes. This is good and
certainly an improvement over the previous implementation. However,
the man page for numa_num_configured_nodes() mentions that the
returned count may contain currently disabled nodes. Do we currently
handle disabled nodes? What will be the consequence if we would use
such a disabled node (e.g. mbind() warnings)?

- the same question applies to the usage of
Linux::isnode_in_configured_nodes() within os::numa_get_leaf_groups().
Does isnode_in_configured_nodes() (i.e. the node set defined by
'numa_all_nodes_ptr' take into account the disabled nodes or not? Can
this be a potential problem (i.e. if we use a disabled node).

- I'd like to suggest renaming the 'index' part of the following
variables and functions to 'nindex' ('node_index' is probably too long)
in the following code, to emphasize that we have node indexes pointing
to actual, not always consecutive node numbers:

2879         // Create an index -> node mapping, since nodes are not always consecutive
2880         _index_to_node = new (ResourceObj::C_HEAP, mtInternal) GrowableArray<int>(0, true);
2881         rebuild_index_to_node_map();

- can you please wrap the following one-line else statement into curly
braces (it's more readable and we usually do it that way in HotSpot
although there are no formal style guidelines :)

2953      } else
2954        // Current node is already a configured node.
2955        closest_node = index_to_node()->at(i);

- in os::Linux::rebuild_cpu_to_node_map(), if you set
'closest_distance' to INT_MAX at the beginning of the loop, you can
later avoid the check for '|| !closest_distance'. Also, according to
the man page, numa_distance() returns 0 if it can not determine the
distance. So with the above change, the condition on line 2974 should
read:

2947           if (distance && distance < closest_distance) {


Finally, and not directly related to your change, I'd suggest the
following clean-ups:

- remove the usage of 'NCPUS = 32768' in
os::Linux::rebuild_cpu_to_node_map(). The comment on that line is
unclear to me and probably related to an older version/problem of
libnuma? I think we should simply use
numa_allocate_cpumask()/numa_free_cpumask() instead.

- we still use the NUMA version 1 function prototypes (e.g.
"numa_node_to_cpus(int node, unsigned long *buffer, int buffer_len)"
instead of "numa_node_to_cpus(int node, struct bitmask *mask)", but
also "numa_interleave_memory()" and maybe others). I think we should
switch all prototypes to the new NUMA version 2 interface which you've
already used for the new functions which you've added.

That said, I think these changes all require libnuma 2.0 (see
os::Linux::libnuma_dlsym). So before starting this, you should make
sure that libnuma 2.0 is available on all platforms to which you'd
like to down-port this change. For jdk10 we could definitely do it,
for jdk9 probably also, for jdk8 I'm not so sure.

Regards,
Volker


Re: [10] RFR (S) 8175813: PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used

Gustavo Romero
Dear Volker,

On 24-04-2017 14:08, Volker Simonis wrote:
> Hi Gustavo,
>
> thanks for addressing this problem and sorry for my late reply. I
> think this is a good change which definitely improves the situation
> for uncommon NUMA configurations without changing the handling for
> common topologies.

Thanks a lot for reviewing the change!


> It would be great if somebody could run this trough JPRT, but as
> Gustavo mentioned, I don't expect any regressions.
>
> @Igor: I think you've been the original author of the NUMA-aware
> allocator port to Linux (i.e. "6684395: Port NUMA-aware allocator to
> linux"). If you could find some spare minutes to take a look at this
> change, your comment would be very much appreciated :)
>
> Following some minor comments from me:
>
> - in os::numa_get_groups_num() you now use numa_num_configured_nodes()
> to get the actual number of configured nodes. This is good and
> certainly an improvement over the previous implementation. However,
> the man page for numa_num_configured_nodes() mentions that the
> returned count may contain currently disabled nodes. Do we currently
> handle disabled nodes? What will be the consequence if we would use
> such a disabled node (e.g. mbind() warnings)?

In [1] 'numa_memnode_ptr' is set to keep a list of just the nodes with memory
found in /sys/devices/system/node/*. Hence numa_num_configured_nodes() just
returns the number of nodes in 'numa_memnode_ptr' [2], i.e. the number of
nodes with memory in the system. To the best of my knowledge there is no
system configuration on Linux/PPC64 that could match such a notion of
"disabled nodes" as it appears in libnuma's manual. If a node is enabled, it's
in that dir, and just the ones with memory will be taken into account. If it's
disabled (somehow), it's not in the dir, so it won't be taken into account
(i.e. no mbind() is tried against it).

On Power it's possible to have a NUMA node without memory (a memory-less node,
a case covered in this change) and a NUMA node without any CPUs but with
memory (a configured node anyway, so a case already covered), but disabling a
specific NUMA node so that it does not appear in /sys/devices/system/node/* is
only possible from the inners of the control module, or some other rare
condition not visible / adjustable from the OS. Also, I'm not aware of a case
where a node is in this dir but is at the same time flagged as something like
"disabled". There are CPU/memory hotplugs, but they don't change NUMA node
status AFAIK.

[1] https://github.com/numactl/numactl/blob/master/libnuma.c#L334-L347
[2] https://github.com/numactl/numactl/blob/master/libnuma.c#L614-L618


> - the same question applies to the usage of
> Linux::isnode_in_configured_nodes() within os::numa_get_leaf_groups().
> Does isnode_in_configured_nodes() (i.e. the node set defined by
> 'numa_all_nodes_ptr' take into account the disabled nodes or not? Can
> this be a potential problem (i.e. if we use a disabled node).

On the meaning of "disabled nodes", it's the same case as above, so to the
best of my knowledge it's not a potential problem.

Anyway 'numa_all_nodes_ptr' just includes the configured nodes (with memory),
i.e. "all nodes on which the calling task may allocate memory". It's exactly
the same pointer returned by numa_get_membind() v2 [3] which:

"returns the mask of nodes from which memory can currently be allocated"

and that is used, for example, in "numactl --show" to show nodes from where
memory can be allocated [4, 5].

[3] https://github.com/numactl/numactl/blob/master/libnuma.c#L1147
[4] https://github.com/numactl/numactl/blob/master/numactl.c#L144
[5] https://github.com/numactl/numactl/blob/master/numactl.c#L177


> - I'd like to suggest renaming the 'index' part of the following
> variables and functions to 'nindex' ('node_index' is probably to long)
> in the following code, to emphasize that we have node indexes pointing
> to actual, not always consecutive node numbers:
>
> 2879         // Create an index -> node mapping, since nodes are not
> always consecutive
> 2880         _index_to_node = new (ResourceObj::C_HEAP, mtInternal)
> GrowableArray<int>(0, true);
> 2881         rebuild_index_to_node_map();

Simple change but much better to read indeed. Done.


> - can you please wrap the following one-line else statement into curly
> braces (it's more readable and we usually do it that way in HotSpot
> although there are no formal style guidelines :)
>
> 2953      } else
> 2954        // Current node is already a configured node.
> 2955        closest_node = index_to_node()->at(i);

Done.


> - in os::Linux::rebuild_cpu_to_node_map(), if you set
> 'closest_distance' to INT_MAX at the beginning of the loop, you can
> later avoid the check for '|| !closest_distance'. Also, according to
> the man page, numa_distance() returns 0 if it can not determine the
> distance. So with the above change, the condition on line 2974 should
> read:
>
> 2947           if (distance && distance < closest_distance) {
>

Sure, it's much better to start with the initial distance as far as possible
and adjust it to closer ones step by step, improving the if condition. Done.


> Finally, and not directly related to your change, I'd suggest the
> following clean-ups:
>
> - remove the usage of 'NCPUS = 32768' in
> os::Linux::rebuild_cpu_to_node_map(). The comment on that line is
> unclear to me and probably related to an older version/problem of
> libnuma? I think we should simply use
> numa_allocate_cpumask()/numa_free_cpumask() instead.
>
> - we still use the NUMA version 1 function prototypes (e.g.
> "numa_node_to_cpus(int node, unsigned long *buffer, int buffer_len)"
> instead of "numa_node_to_cpus(int node, struct bitmask *mask)", but
> also "numa_interleave_memory()" and maybe others). I think we should
> switch all prototypes to the new NUMA version 2 interface which you've
> already used for the new functions which you've added.

I agree. Could I open a new bug to address these clean-ups?


> That said, I think these changes all require libnuma 2.0 (see
> os::Linux::libnuma_dlsym). So before starting this, you should make
> sure that libnuma 2.0 is available on all platforms to which you'd
> like to down-port this change. For jdk10 we could definitely do it,
> for jdk9 probably also, for jdk8 I'm not so sure.

libnuma v1's last release dates back to 2008, but any idea how I could check
that for sure, since it's in shared code?

new webrev: http://cr.openjdk.java.net/~gromero/8175813/v3/

Thank you!

Best regards,
Gustavo



Re: [10] RFR (S) 8175813: PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used

Gustavo Romero
Hi community,

I understand that there is nothing more that can be done regarding this issue,
at this point, on the PPC64 side.

It's a change in the shared code - but one that in effect does not change
anything in the NUMA detection mechanism for other platforms - and hence it
needs a joint community effort to review the change and a sponsor to run it
through JPRT.

I know it's a stabilizing moment for OpenJDK 9, but since this issue is of
great concern on PPC64 (especially on POWER8 machines) I would be very glad if
the community could point out directions on how the change could move forward.

Thank you!

Best regards,
Gustavo

On 25-04-2017 23:49, Gustavo Romero wrote:

> Dear Volker,
>
> On 24-04-2017 14:08, Volker Simonis wrote:
>> Hi Gustavo,
>>
>> thanks for addressing this problem and sorry for my late reply. I
>> think this is a good change which definitely improves the situation
>> for uncommon NUMA configurations without changing the handling for
>> common topologies.
>
> Thanks a lot for reviewing the change!
>
>
>> It would be great if somebody could run this trough JPRT, but as
>> Gustavo mentioned, I don't expect any regressions.
>>
>> @Igor: I think you've been the original author of the NUMA-aware
>> allocator port to Linux (i.e. "6684395: Port NUMA-aware allocator to
>> linux"). If you could find some spare minutes to take a look at this
>> change, your comment would be very much appreciated :)
>>
>> Following some minor comments from me:
>>
>> - in os::numa_get_groups_num() you now use numa_num_configured_nodes()
>> to get the actual number of configured nodes. This is good and
>> certainly an improvement over the previous implementation. However,
>> the man page for numa_num_configured_nodes() mentions that the
>> returned count may contain currently disabled nodes. Do we currently
>> handle disabled nodes? What will be the consequence if we would use
>> such a disabled node (e.g. mbind() warnings)?
>
> In [1] 'numa_memnode_ptr' is set to keep a list of *just nodes with memory in
> found in /sys/devices/system/node/* Hence numa_num_configured_nodes() just
> returns the number of nodes in 'numa_memnode_ptr' [2], thus just returns the
> number of nodes with memory in the system. To the best of my knowledge there is
> no system configuration on Linux/PPC64 that could match such a notion of
> "disabled nodes" as it appears in libnuma's manual. If it is enabled, it's in
> that dir and just the ones with memory will be taken into account. If it's
> disabled (somehow), it's not in the dir, so won't be taken into account (i.e. no
> mbind() tried against it).
>
> On Power it's possible to have a numa node without memory (memory-less node, a
> case covered in this change), a numa node without cpus at all but with memory
> (a configured node anyway, so a case already covered) but to disable a specific
> numa node so it does not appear in /sys/devices/system/node/* it's only possible
> from the inners of the control module. Or other rare condition not invisible /
> adjustable from the OS. Also I'm not aware of a case where a node is in this
> dir but is at the same time flagged as something like "disabled". There are
> cpu/memory hotplugs, but that does not change numa nodes status AFAIK.
>
> [1] https://github.com/numactl/numactl/blob/master/libnuma.c#L334-L347
> [2] https://github.com/numactl/numactl/blob/master/libnuma.c#L614-L618
>
>
>> - the same question applies to the usage of
>> Linux::isnode_in_configured_nodes() within os::numa_get_leaf_groups().
>> Does isnode_in_configured_nodes() (i.e. the node set defined by
>> 'numa_all_nodes_ptr' take into account the disabled nodes or not? Can
>> this be a potential problem (i.e. if we use a disabled node).
>
> On the meaning of "disabled nodes", it's the same case as above, so to the
> best of knowledge it's not a potential problem.
>
> Anyway 'numa_all_nodes_ptr' just includes the configured nodes (with memory),
> i.e. "all nodes on which the calling task may allocate memory". It's exactly
> the same pointer returned by numa_get_membind() v2 [3] which:
>
> "returns the mask of nodes from which memory can currently be allocated"
>
> and that is used, for example, in "numactl --show" to show nodes from where
> memory can be allocated [4, 5].
>
> [3] https://github.com/numactl/numactl/blob/master/libnuma.c#L1147
> [4] https://github.com/numactl/numactl/blob/master/numactl.c#L144
> [5] https://github.com/numactl/numactl/blob/master/numactl.c#L177
>
>
>> - I'd like to suggest renaming the 'index' part of the following
>> variables and functions to 'nindex' ('node_index' is probably to long)
>> in the following code, to emphasize that we have node indexes pointing
>> to actual, not always consecutive node numbers:
>>
>> 2879         // Create an index -> node mapping, since nodes are not
>> always consecutive
>> 2880         _index_to_node = new (ResourceObj::C_HEAP, mtInternal)
>> GrowableArray<int>(0, true);
>> 2881         rebuild_index_to_node_map();
>
> Simple change but much better to read indeed. Done.
>
>
>> - can you please wrap the following one-line else statement into curly
>> braces (it's more readable and we usually do it that way in HotSpot
>> although there are no formal style guidelines :)
>>
>> 2953      } else
>> 2954        // Current node is already a configured node.
>> 2955        closest_node = index_to_node()->at(i);
>
> Done.
>
>
>> - in os::Linux::rebuild_cpu_to_node_map(), if you set
>> 'closest_distance' to INT_MAX at the beginning of the loop, you can
>> later avoid the check for '|| !closest_distance'. Also, according to
>> the man page, numa_distance() returns 0 if it can not determine the
>> distance. So with the above change, the condition on line 2974 should
>> read:
>>
>> 2947           if (distance && distance < closest_distance) {
>>
>
> Sure, much better to set the initial condition as distant as possible and
> adjust to a closer one bit by bit improving the if condition. Done.
>
>
>> Finally, and not directly related to your change, I'd suggest the
>> following clean-ups:
>>
>> - remove the usage of 'NCPUS = 32768' in
>> os::Linux::rebuild_cpu_to_node_map(). The comment on that line is
>> unclear to me and probably related to an older version/problem of
>> libnuma? I think we should simply use
>> numa_allocate_cpumask()/numa_free_cpumask() instead.
>>
>> - we still use the NUMA version 1 function prototypes (e.g.
>> "numa_node_to_cpus(int node, unsigned long *buffer, int buffer_len)"
>> instead of "numa_node_to_cpus(int node, struct bitmask *mask)", but
>> also "numa_interleave_memory()" and maybe others). I think we should
>> switch all prototypes to the new NUMA version 2 interface which you've
>> already used for the new functions which you've added.
>
> I agree. Could I open a new bug to address these clean-ups?
>
>
>> That said, I think these changes all require libnuma 2.0 (see
>> os::Linux::libnuma_dlsym). So before starting this, you should make
>> sure that libnuma 2.0 is available on all platforms to which you'd
>> like to down-port this change. For jdk10 we could definitely do it,
>> for jdk9 probably also, for jdk8 I'm not so sure.
>
> libnuma v1's last release dates back to 2008, but any idea how I could check
> that for sure, since it's in shared code?
>
> new webrev: http://cr.openjdk.java.net/~gromero/8175813/v3/
>
> Thank you!
>
> Best regards,
> Gustavo
>
>
>> Regards,
>> Volker
>>
>> On Thu, Apr 13, 2017 at 12:51 AM, Gustavo Romero
>> <[hidden email]> wrote:
>>> Hi,
>>>
>>> Any update on it?
>>>
>>> Thank you.
>>>
>>> Regards,
>>> Gustavo
>>>
>>> On 09-03-2017 16:33, Gustavo Romero wrote:
>>>> Hi,
>>>>
>>>> Could the following webrev be reviewed please?
>>>>
>>>> It improves the numa node detection when non-consecutive or memory-less nodes
>>>> exist in the system.
>>>>
>>>> webrev: http://cr.openjdk.java.net/~gromero/8175813/v2/
>>>> bug   : https://bugs.openjdk.java.net/browse/JDK-8175813
>>>>
>>>> Currently, although no problem exists when the JVM detects numa nodes that are
>>>> consecutive and have memory, for example in a numa topology like:
>>>>
>>>> available: 2 nodes (0-1)
>>>> node 0 cpus: 0 8 16 24 32
>>>> node 0 size: 65258 MB
>>>> node 0 free: 34 MB
>>>> node 1 cpus: 40 48 56 64 72
>>>> node 1 size: 65320 MB
>>>> node 1 free: 150 MB
>>>> node distances:
>>>> node   0   1
>>>>   0:  10  20
>>>>   1:  20  10,
>>>>
>>>> it fails to detect the numa nodes to be used by the Parallel GC in a numa
>>>> topology like:
>>>>
>>>> available: 4 nodes (0-1,16-17)
>>>> node 0 cpus: 0 8 16 24 32
>>>> node 0 size: 130706 MB
>>>> node 0 free: 7729 MB
>>>> node 1 cpus: 40 48 56 64 72
>>>> node 1 size: 0 MB
>>>> node 1 free: 0 MB
>>>> node 16 cpus: 80 88 96 104 112
>>>> node 16 size: 130630 MB
>>>> node 16 free: 5282 MB
>>>> node 17 cpus: 120 128 136 144 152
>>>> node 17 size: 0 MB
>>>> node 17 free: 0 MB
>>>> node distances:
>>>> node   0   1  16  17
>>>>   0:  10  20  40  40
>>>>   1:  20  10  40  40
>>>>  16:  40  40  10  20
>>>>  17:  40  40  20  10,
>>>>
>>>> where node 16 is not consecutive in relation to 1 and also nodes 1 and 17 have
>>>> no memory.
>>>>
>>>> If a topology like that exists, os::numa_make_local() will receive a local group
>>>> id as a hint that is not available in the system to be bound (it will receive
>>>> all nodes from 0 to 17), causing a proliferation of "mbind: Invalid argument"
>>>> messages:
>>>>
>>>> http://cr.openjdk.java.net/~gromero/logs/jdk10_pristine.log
>>>>
>>>> That change improves the detection by making the JVM numa API aware of the
>>>> existence of numa nodes that are non-consecutive from 0 to the highest node
>>>> number and also of nodes that might be memory-less nodes, i.e. that might not
>>>> be, in libnuma terms, a configured node. Hence just the configured nodes will
>>>> be available:
>>>>
>>>> http://cr.openjdk.java.net/~gromero/logs/jdk10_numa_patched.log
>>>>
>>>> The change has no effect on numa topologies where the problem does not occur,
>>>> i.e. no change in the number of nodes and no change in the cpu to node map. On
>>>> numa topologies where memory-less nodes exist (like in the last example above),
>>>> cpus from a memory-less node won't be able to bind locally so they are mapped
>>>> to the closest node; otherwise they would not be associated with any node and
>>>> MutableNUMASpace::cas_allocate() would pick a node randomly, compromising
>>>> performance.
>>>>
>>>> I found no regressions on x64 for the following numa topology:
>>>>
>>>> available: 2 nodes (0-1)
>>>> node 0 cpus: 0 1 2 3 8 9 10 11
>>>> node 0 size: 24102 MB
>>>> node 0 free: 19806 MB
>>>> node 1 cpus: 4 5 6 7 12 13 14 15
>>>> node 1 size: 24190 MB
>>>> node 1 free: 21951 MB
>>>> node distances:
>>>> node   0   1
>>>>   0:  10  21
>>>>   1:  21  10
>>>>
>>>> I understand that fixing the current numa detection is a prerequisite to enable
>>>> UseNUMA by default [1] and to extend the numa-aware allocation to the G1 GC [2].
>>>>
>>>> Thank you.
>>>>
>>>>
>>>> Best regards,
>>>> Gustavo
>>>>
>>>> [1] https://bugs.openjdk.java.net/browse/JDK-8046153 (JEP 163: Enable NUMA Mode by Default When Appropriate)
>>>> [2] https://bugs.openjdk.java.net/browse/JDK-8046147 (JEP 157: G1 GC: NUMA-Aware Allocation)
>>>>
>>>
>>
>


Re: [10] RFR (S) 8175813: PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used

Volker Simonis
In reply to this post by Gustavo Romero
Hi Gustavo,

thanks for the latest corrections. I think your change looks good now.

On Wed, Apr 26, 2017 at 4:49 AM, Gustavo Romero
<[hidden email]> wrote:

> Dear Volker,
>
> On 24-04-2017 14:08, Volker Simonis wrote:
>> Hi Gustavo,
>>
>> thanks for addressing this problem and sorry for my late reply. I
>> think this is a good change which definitely improves the situation
>> for uncommon NUMA configurations without changing the handling for
>> common topologies.
>
> Thanks a lot for reviewing the change!
>
>
>> It would be great if somebody could run this through JPRT, but as
>> Gustavo mentioned, I don't expect any regressions.
>>
>> @Igor: I think you've been the original author of the NUMA-aware
>> allocator port to Linux (i.e. "6684395: Port NUMA-aware allocator to
>> linux"). If you could find some spare minutes to take a look at this
>> change, your comment would be very much appreciated :)
>>
>> Following some minor comments from me:
>>
>> - in os::numa_get_groups_num() you now use numa_num_configured_nodes()
>> to get the actual number of configured nodes. This is good and
>> certainly an improvement over the previous implementation. However,
>> the man page for numa_num_configured_nodes() mentions that the
>> returned count may contain currently disabled nodes. Do we currently
>> handle disabled nodes? What will be the consequence if we would use
>> such a disabled node (e.g. mbind() warnings)?
>
> In [1] 'numa_memnode_ptr' is set to keep a list of *just nodes with memory*
> found in /sys/devices/system/node/*. Hence numa_num_configured_nodes() just
> returns the number of nodes in 'numa_memnode_ptr' [2], thus just returns the
> number of nodes with memory in the system. To the best of my knowledge there is
> no system configuration on Linux/PPC64 that could match such a notion of
> "disabled nodes" as it appears in libnuma's manual. If it is enabled, it's in
> that dir and just the ones with memory will be taken into account. If it's
> disabled (somehow), it's not in the dir, so won't be taken into account (i.e. no
> mbind() tried against it).
>
> On Power it's possible to have a numa node without memory (memory-less node, a
> case covered in this change), a numa node without cpus at all but with memory
> (a configured node anyway, so a case already covered) but to disable a specific
> numa node so it does not appear in /sys/devices/system/node/* it's only possible
> from the inners of the control module. Or some other rare condition not visible /
> adjustable from the OS. Also I'm not aware of a case where a node is in this
> dir but is at the same time flagged as something like "disabled". There are
> cpu/memory hotplugs, but that does not change numa nodes status AFAIK.
>
> [1] https://github.com/numactl/numactl/blob/master/libnuma.c#L334-L347
> [2] https://github.com/numactl/numactl/blob/master/libnuma.c#L614-L618
>
>
>> - the same question applies to the usage of
>> Linux::isnode_in_configured_nodes() within os::numa_get_leaf_groups().
>> Does isnode_in_configured_nodes() (i.e. the node set defined by
>> 'numa_all_nodes_ptr' take into account the disabled nodes or not? Can
>> this be a potential problem (i.e. if we use a disabled node).
>
> On the meaning of "disabled nodes", it's the same case as above, so to the
> best of my knowledge it's not a potential problem.
>
> Anyway 'numa_all_nodes_ptr' just includes the configured nodes (with memory),
> i.e. "all nodes on which the calling task may allocate memory". It's exactly
> the same pointer returned by numa_get_membind() v2 [3] which:
>
> "returns the mask of nodes from which memory can currently be allocated"
>
> and that is used, for example, in "numactl --show" to show nodes from where
> memory can be allocated [4, 5].
>
> [3] https://github.com/numactl/numactl/blob/master/libnuma.c#L1147
> [4] https://github.com/numactl/numactl/blob/master/numactl.c#L144
> [5] https://github.com/numactl/numactl/blob/master/numactl.c#L177
>
>
>> - I'd like to suggest renaming the 'index' part of the following
>> variables and functions to 'nindex' ('node_index' is probably too long)
>> in the following code, to emphasize that we have node indexes pointing
>> to actual, not always consecutive node numbers:
>>
>> 2879         // Create an index -> node mapping, since nodes are not
>> always consecutive
>> 2880         _index_to_node = new (ResourceObj::C_HEAP, mtInternal)
>> GrowableArray<int>(0, true);
>> 2881         rebuild_index_to_node_map();
>
> Simple change but much better to read indeed. Done.
>
>
>> - can you please wrap the following one-line else statement into curly
>> braces (it's more readable and we usually do it that way in HotSpot
>> although there are no formal style guidelines :)
>>
>> 2953      } else
>> 2954        // Current node is already a configured node.
>> 2955        closest_node = index_to_node()->at(i);
>
> Done.
>
>
>> - in os::Linux::rebuild_cpu_to_node_map(), if you set
>> 'closest_distance' to INT_MAX at the beginning of the loop, you can
>> later avoid the check for '|| !closest_distance'. Also, according to
>> the man page, numa_distance() returns 0 if it can not determine the
>> distance. So with the above change, the condition on line 2974 should
>> read:
>>
>> 2947           if (distance && distance < closest_distance) {
>>
>
> Sure, it's much better to set the initial condition as distant as possible and
> adjust to a closer one bit by bit, improving the if condition. Done.
>
>
>> Finally, and not directly related to your change, I'd suggest the
>> following clean-ups:
>>
>> - remove the usage of 'NCPUS = 32768' in
>> os::Linux::rebuild_cpu_to_node_map(). The comment on that line is
>> unclear to me and probably related to an older version/problem of
>> libnuma? I think we should simply use
>> numa_allocate_cpumask()/numa_free_cpumask() instead.
>>
>> - we still use the NUMA version 1 function prototypes (e.g.
>> "numa_node_to_cpus(int node, unsigned long *buffer, int buffer_len)"
>> instead of "numa_node_to_cpus(int node, struct bitmask *mask)", but
>> also "numa_interleave_memory()" and maybe others). I think we should
>> switch all prototypes to the new NUMA version 2 interface which you've
>> already used for the new functions which you've added.
>
> I agree. Could I open a new bug to address these clean-ups?
>

Yes, would be great to track that.

Maybe you could add another topic for implementing os::get_page_info()
on Linux. The information will be used for -XX:+NUMAStats. Not sure if
you can easily gather it on Linux. As far as I can see it is currently
only implemented on Solaris.

>
>> That said, I think these changes all require libnuma 2.0 (see
>> os::Linux::libnuma_dlsym). So before starting this, you should make
>> sure that libnuma 2.0 is available on all platforms to which you'd
>> like to down-port this change. For jdk10 we could definitely do it,
>> for jdk9 probably also, for jdk8 I'm not so sure.
>
> libnuma v1's last release dates back to 2008, but any idea how I could check
> that for sure, since it's in shared code?
>
> new webrev: http://cr.openjdk.java.net/~gromero/8175813/v3/
>
> Thank you!
>
> Best regards,
> Gustavo
>
>
>> Regards,
>> Volker
>>
>> On Thu, Apr 13, 2017 at 12:51 AM, Gustavo Romero
>> <[hidden email]> wrote:
>>> Hi,
>>>
>>> Any update on it?
>>>
>>> Thank you.
>>>
>>> Regards,
>>> Gustavo
>>>
>>> On 09-03-2017 16:33, Gustavo Romero wrote:
>>>> Hi,
>>>>
>>>> Could the following webrev be reviewed please?
>>>>
>>>> It improves the numa node detection when non-consecutive or memory-less nodes
>>>> exist in the system.
>>>>
>>>> webrev: http://cr.openjdk.java.net/~gromero/8175813/v2/
>>>> bug   : https://bugs.openjdk.java.net/browse/JDK-8175813
>>>>
>>>> Currently, although no problem exists when the JVM detects numa nodes that are
>>>> consecutive and have memory, for example in a numa topology like:
>>>>
>>>> available: 2 nodes (0-1)
>>>> node 0 cpus: 0 8 16 24 32
>>>> node 0 size: 65258 MB
>>>> node 0 free: 34 MB
>>>> node 1 cpus: 40 48 56 64 72
>>>> node 1 size: 65320 MB
>>>> node 1 free: 150 MB
>>>> node distances:
>>>> node   0   1
>>>>   0:  10  20
>>>>   1:  20  10,
>>>>
>>>> it fails to detect the numa nodes to be used by the Parallel GC in a numa
>>>> topology like:
>>>>
>>>> available: 4 nodes (0-1,16-17)
>>>> node 0 cpus: 0 8 16 24 32
>>>> node 0 size: 130706 MB
>>>> node 0 free: 7729 MB
>>>> node 1 cpus: 40 48 56 64 72
>>>> node 1 size: 0 MB
>>>> node 1 free: 0 MB
>>>> node 16 cpus: 80 88 96 104 112
>>>> node 16 size: 130630 MB
>>>> node 16 free: 5282 MB
>>>> node 17 cpus: 120 128 136 144 152
>>>> node 17 size: 0 MB
>>>> node 17 free: 0 MB
>>>> node distances:
>>>> node   0   1  16  17
>>>>   0:  10  20  40  40
>>>>   1:  20  10  40  40
>>>>  16:  40  40  10  20
>>>>  17:  40  40  20  10,
>>>>
>>>> where node 16 is not consecutive in relation to 1 and also nodes 1 and 17 have
>>>> no memory.
>>>>
>>>> If a topology like that exists, os::numa_make_local() will receive a local group
>>>> id as a hint that is not available in the system to be bound (it will receive
>>>> all nodes from 0 to 17), causing a proliferation of "mbind: Invalid argument"
>>>> messages:
>>>>
>>>> http://cr.openjdk.java.net/~gromero/logs/jdk10_pristine.log
>>>>
>>>> That change improves the detection by making the JVM numa API aware of the
>>>> existence of numa nodes that are non-consecutive from 0 to the highest node
>>>> number and also of nodes that might be memory-less nodes, i.e. that might not
>>>> be, in libnuma terms, a configured node. Hence just the configured nodes will
>>>> be available:
>>>>
>>>> http://cr.openjdk.java.net/~gromero/logs/jdk10_numa_patched.log
>>>>
>>>> The change has no effect on numa topologies where the problem does not occur,
>>>> i.e. no change in the number of nodes and no change in the cpu to node map. On
>>>> numa topologies where memory-less nodes exist (like in the last example above),
>>>> cpus from a memory-less node won't be able to bind locally so they are mapped
>>>> to the closest node; otherwise they would not be associated with any node and
>>>> MutableNUMASpace::cas_allocate() would pick a node randomly, compromising
>>>> performance.
>>>>
>>>> I found no regressions on x64 for the following numa topology:
>>>>
>>>> available: 2 nodes (0-1)
>>>> node 0 cpus: 0 1 2 3 8 9 10 11
>>>> node 0 size: 24102 MB
>>>> node 0 free: 19806 MB
>>>> node 1 cpus: 4 5 6 7 12 13 14 15
>>>> node 1 size: 24190 MB
>>>> node 1 free: 21951 MB
>>>> node distances:
>>>> node   0   1
>>>>   0:  10  21
>>>>   1:  21  10
>>>>
>>>> I understand that fixing the current numa detection is a prerequisite to enable
>>>> UseNUMA by default [1] and to extend the numa-aware allocation to the G1 GC [2].
>>>>
>>>> Thank you.
>>>>
>>>>
>>>> Best regards,
>>>> Gustavo
>>>>
>>>> [1] https://bugs.openjdk.java.net/browse/JDK-8046153 (JEP 163: Enable NUMA Mode by Default When Appropriate)
>>>> [2] https://bugs.openjdk.java.net/browse/JDK-8046147 (JEP 157: G1 GC: NUMA-Aware Allocation)
>>>>
>>>
>>
>

Re: [10] RFR (S) 8175813: PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used

Volker Simonis
In reply to this post by Gustavo Romero
Hi,

I've reviewed Gustavo's change and I'm fine with the latest version at:

http://cr.openjdk.java.net/~gromero/8175813/v3/

Can somebody please sponsor the change?

Thank you and best regards,
Volker


On Wed, May 3, 2017 at 3:27 PM, Gustavo Romero
<[hidden email]> wrote:

> Hi community,
>
> I understand that there is nothing that can be done additionally regarding this
> issue, at this point, on the PPC64 side.
>
> It's a change in the shared code - but that in effect does not change anything in
> the numa detection mechanism for other platforms - and hence a joint community
> effort is necessary to review the change, and a sponsor to run it through
> the JPRT.
>
> I know this is a stabilization period for OpenJDK 9, but since this issue is of
> great concern on PPC64 (especially on POWER8 machines) I would be very glad if
> the community could point out directions on how this change could move on.
>
> Thank you!
>
> Best regards,
> Gustavo
>
> On 25-04-2017 23:49, Gustavo Romero wrote:
>> Dear Volker,
>>
>> On 24-04-2017 14:08, Volker Simonis wrote:
>>> Hi Gustavo,
>>>
>>> thanks for addressing this problem and sorry for my late reply. I
>>> think this is a good change which definitely improves the situation
>>> for uncommon NUMA configurations without changing the handling for
>>> common topologies.
>>
>> Thanks a lot for reviewing the change!
>>
>>
>>> It would be great if somebody could run this through JPRT, but as
>>> Gustavo mentioned, I don't expect any regressions.
>>>
>>> @Igor: I think you've been the original author of the NUMA-aware
>>> allocator port to Linux (i.e. "6684395: Port NUMA-aware allocator to
>>> linux"). If you could find some spare minutes to take a look at this
>>> change, your comment would be very much appreciated :)
>>>
>>> Following some minor comments from me:
>>>
>>> - in os::numa_get_groups_num() you now use numa_num_configured_nodes()
>>> to get the actual number of configured nodes. This is good and
>>> certainly an improvement over the previous implementation. However,
>>> the man page for numa_num_configured_nodes() mentions that the
>>> returned count may contain currently disabled nodes. Do we currently
>>> handle disabled nodes? What will be the consequence if we would use
>>> such a disabled node (e.g. mbind() warnings)?
>>
>> In [1] 'numa_memnode_ptr' is set to keep a list of *just nodes with memory*
>> found in /sys/devices/system/node/*. Hence numa_num_configured_nodes() just
>> returns the number of nodes in 'numa_memnode_ptr' [2], thus just returns the
>> number of nodes with memory in the system. To the best of my knowledge there is
>> no system configuration on Linux/PPC64 that could match such a notion of
>> "disabled nodes" as it appears in libnuma's manual. If it is enabled, it's in
>> that dir and just the ones with memory will be taken into account. If it's
>> disabled (somehow), it's not in the dir, so won't be taken into account (i.e. no
>> mbind() tried against it).
>>
>> On Power it's possible to have a numa node without memory (memory-less node, a
>> case covered in this change), a numa node without cpus at all but with memory
>> (a configured node anyway, so a case already covered) but to disable a specific
>> numa node so it does not appear in /sys/devices/system/node/* it's only possible
>> from the inners of the control module. Or some other rare condition not visible /
>> adjustable from the OS. Also I'm not aware of a case where a node is in this
>> dir but is at the same time flagged as something like "disabled". There are
>> cpu/memory hotplugs, but that does not change numa nodes status AFAIK.
>>
>> [1] https://github.com/numactl/numactl/blob/master/libnuma.c#L334-L347
>> [2] https://github.com/numactl/numactl/blob/master/libnuma.c#L614-L618
>>
>>
>>> - the same question applies to the usage of
>>> Linux::isnode_in_configured_nodes() within os::numa_get_leaf_groups().
>>> Does isnode_in_configured_nodes() (i.e. the node set defined by
>>> 'numa_all_nodes_ptr' take into account the disabled nodes or not? Can
>>> this be a potential problem (i.e. if we use a disabled node).
>>
>> On the meaning of "disabled nodes", it's the same case as above, so to the
>> best of my knowledge it's not a potential problem.
>>
>> Anyway 'numa_all_nodes_ptr' just includes the configured nodes (with memory),
>> i.e. "all nodes on which the calling task may allocate memory". It's exactly
>> the same pointer returned by numa_get_membind() v2 [3] which:
>>
>> "returns the mask of nodes from which memory can currently be allocated"
>>
>> and that is used, for example, in "numactl --show" to show nodes from where
>> memory can be allocated [4, 5].
>>
>> [3] https://github.com/numactl/numactl/blob/master/libnuma.c#L1147
>> [4] https://github.com/numactl/numactl/blob/master/numactl.c#L144
>> [5] https://github.com/numactl/numactl/blob/master/numactl.c#L177
>>
>>
>>> - I'd like to suggest renaming the 'index' part of the following
>>> variables and functions to 'nindex' ('node_index' is probably too long)
>>> in the following code, to emphasize that we have node indexes pointing
>>> to actual, not always consecutive node numbers:
>>>
>>> 2879         // Create an index -> node mapping, since nodes are not
>>> always consecutive
>>> 2880         _index_to_node = new (ResourceObj::C_HEAP, mtInternal)
>>> GrowableArray<int>(0, true);
>>> 2881         rebuild_index_to_node_map();
>>
>> Simple change but much better to read indeed. Done.
>>
>>
>>> - can you please wrap the following one-line else statement into curly
>>> braces (it's more readable and we usually do it that way in HotSpot
>>> although there are no formal style guidelines :)
>>>
>>> 2953      } else
>>> 2954        // Current node is already a configured node.
>>> 2955        closest_node = index_to_node()->at(i);
>>
>> Done.
>>
>>
>>> - in os::Linux::rebuild_cpu_to_node_map(), if you set
>>> 'closest_distance' to INT_MAX at the beginning of the loop, you can
>>> later avoid the check for '|| !closest_distance'. Also, according to
>>> the man page, numa_distance() returns 0 if it can not determine the
>>> distance. So with the above change, the condition on line 2974 should
>>> read:
>>>
>>> 2947           if (distance && distance < closest_distance) {
>>>
>>
>> Sure, it's much better to set the initial condition as distant as possible and
>> adjust to a closer one bit by bit, improving the if condition. Done.
>>
>>
>>> Finally, and not directly related to your change, I'd suggest the
>>> following clean-ups:
>>>
>>> - remove the usage of 'NCPUS = 32768' in
>>> os::Linux::rebuild_cpu_to_node_map(). The comment on that line is
>>> unclear to me and probably related to an older version/problem of
>>> libnuma? I think we should simply use
>>> numa_allocate_cpumask()/numa_free_cpumask() instead.
>>>
>>> - we still use the NUMA version 1 function prototypes (e.g.
>>> "numa_node_to_cpus(int node, unsigned long *buffer, int buffer_len)"
>>> instead of "numa_node_to_cpus(int node, struct bitmask *mask)", but
>>> also "numa_interleave_memory()" and maybe others). I think we should
>>> switch all prototypes to the new NUMA version 2 interface which you've
>>> already used for the new functions which you've added.
>>
>> I agree. Could I open a new bug to address these clean-ups?
>>
>>
>>> That said, I think these changes all require libnuma 2.0 (see
>>> os::Linux::libnuma_dlsym). So before starting this, you should make
>>> sure that libnuma 2.0 is available on all platforms to which you'd
>>> like to down-port this change. For jdk10 we could definitely do it,
>>> for jdk9 probably also, for jdk8 I'm not so sure.
>>
>> libnuma v1's last release dates back to 2008, but any idea how I could check
>> that for sure, since it's in shared code?
>>
>> new webrev: http://cr.openjdk.java.net/~gromero/8175813/v3/
>>
>> Thank you!
>>
>> Best regards,
>> Gustavo
>>
>>
>>> Regards,
>>> Volker
>>>
>>> On Thu, Apr 13, 2017 at 12:51 AM, Gustavo Romero
>>> <[hidden email]> wrote:
>>>> Hi,
>>>>
>>>> Any update on it?
>>>>
>>>> Thank you.
>>>>
>>>> Regards,
>>>> Gustavo
>>>>
>>>> On 09-03-2017 16:33, Gustavo Romero wrote:
>>>>> Hi,
>>>>>
>>>>> Could the following webrev be reviewed please?
>>>>>
>>>>> It improves the numa node detection when non-consecutive or memory-less nodes
>>>>> exist in the system.
>>>>>
>>>>> webrev: http://cr.openjdk.java.net/~gromero/8175813/v2/
>>>>> bug   : https://bugs.openjdk.java.net/browse/JDK-8175813
>>>>>
>>>>> Currently, although no problem exists when the JVM detects numa nodes that are
>>>>> consecutive and have memory, for example in a numa topology like:
>>>>>
>>>>> available: 2 nodes (0-1)
>>>>> node 0 cpus: 0 8 16 24 32
>>>>> node 0 size: 65258 MB
>>>>> node 0 free: 34 MB
>>>>> node 1 cpus: 40 48 56 64 72
>>>>> node 1 size: 65320 MB
>>>>> node 1 free: 150 MB
>>>>> node distances:
>>>>> node   0   1
>>>>>   0:  10  20
>>>>>   1:  20  10,
>>>>>
>>>>> it fails to detect the numa nodes to be used by the Parallel GC in a numa
>>>>> topology like:
>>>>>
>>>>> available: 4 nodes (0-1,16-17)
>>>>> node 0 cpus: 0 8 16 24 32
>>>>> node 0 size: 130706 MB
>>>>> node 0 free: 7729 MB
>>>>> node 1 cpus: 40 48 56 64 72
>>>>> node 1 size: 0 MB
>>>>> node 1 free: 0 MB
>>>>> node 16 cpus: 80 88 96 104 112
>>>>> node 16 size: 130630 MB
>>>>> node 16 free: 5282 MB
>>>>> node 17 cpus: 120 128 136 144 152
>>>>> node 17 size: 0 MB
>>>>> node 17 free: 0 MB
>>>>> node distances:
>>>>> node   0   1  16  17
>>>>>   0:  10  20  40  40
>>>>>   1:  20  10  40  40
>>>>>  16:  40  40  10  20
>>>>>  17:  40  40  20  10,
>>>>>
>>>>> where node 16 is not consecutive in relation to 1 and also nodes 1 and 17 have
>>>>> no memory.
>>>>>
>>>>> If a topology like that exists, os::numa_make_local() will receive a local group
>>>>> id as a hint that is not available in the system to be bound (it will receive
>>>>> all nodes from 0 to 17), causing a proliferation of "mbind: Invalid argument"
>>>>> messages:
>>>>>
>>>>> http://cr.openjdk.java.net/~gromero/logs/jdk10_pristine.log
>>>>>
>>>>> That change improves the detection by making the JVM numa API aware of the
>>>>> existence of numa nodes that are non-consecutive from 0 to the highest node
>>>>> number and also of nodes that might be memory-less nodes, i.e. that might not
>>>>> be, in libnuma terms, a configured node. Hence just the configured nodes will
>>>>> be available:
>>>>>
>>>>> http://cr.openjdk.java.net/~gromero/logs/jdk10_numa_patched.log
>>>>>
>>>>> The change has no effect on numa topologies where the problem does not occur,
>>>>> i.e. no change in the number of nodes and no change in the cpu to node map. On
>>>>> numa topologies where memory-less nodes exist (like in the last example above),
>>>>> cpus from a memory-less node won't be able to bind locally so they are mapped
>>>>> to the closest node; otherwise they would not be associated with any node and
>>>>> MutableNUMASpace::cas_allocate() would pick a node randomly, compromising
>>>>> performance.
>>>>>
>>>>> I found no regressions on x64 for the following numa topology:
>>>>>
>>>>> available: 2 nodes (0-1)
>>>>> node 0 cpus: 0 1 2 3 8 9 10 11
>>>>> node 0 size: 24102 MB
>>>>> node 0 free: 19806 MB
>>>>> node 1 cpus: 4 5 6 7 12 13 14 15
>>>>> node 1 size: 24190 MB
>>>>> node 1 free: 21951 MB
>>>>> node distances:
>>>>> node   0   1
>>>>>   0:  10  21
>>>>>   1:  21  10
>>>>>
>>>>> I understand that fixing the current numa detection is a prerequisite to enable
>>>>> UseNUMA by default [1] and to extend the numa-aware allocation to the G1 GC [2].
>>>>>
>>>>> Thank you.
>>>>>
>>>>>
>>>>> Best regards,
>>>>> Gustavo
>>>>>
>>>>> [1] https://bugs.openjdk.java.net/browse/JDK-8046153 (JEP 163: Enable NUMA Mode by Default When Appropriate)
>>>>> [2] https://bugs.openjdk.java.net/browse/JDK-8046147 (JEP 157: G1 GC: NUMA-Aware Allocation)
>>>>>
>>>>
>>>
>>
>

Re: [10] RFR (S) 8175813: PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used

David Holmes
Hi Volker, Gustavo,

I will try to take a look at this again, but may be a day or two.

David

On 4/05/2017 12:34 AM, Volker Simonis wrote:

> Hi,
>
> I've reviewed Gustavo's change and I'm fine with the latest version at:
>
> http://cr.openjdk.java.net/~gromero/8175813/v3/
>
> Can somebody please sponsor the change?
>
> Thank you and best regards,
> Volker
>
>
> On Wed, May 3, 2017 at 3:27 PM, Gustavo Romero
> <[hidden email]> wrote:
>> Hi community,
>>
>> I understand that there is nothing that can be done additionally regarding this
>> issue, at this point, on the PPC64 side.
>>
>> It's a change in the shared code - but that in effect does not change anything in
>> the numa detection mechanism for other platforms - and hence a joint community
>> effort is necessary to review the change, and a sponsor to run it through
>> the JPRT.
>>
>> I know this is a stabilization period for OpenJDK 9, but since this issue is of
>> great concern on PPC64 (especially on POWER8 machines) I would be very glad if
>> the community could point out directions on how this change could move on.
>>
>> Thank you!
>>
>> Best regards,
>> Gustavo
>>
>> On 25-04-2017 23:49, Gustavo Romero wrote:
>>> Dear Volker,
>>>
>>> On 24-04-2017 14:08, Volker Simonis wrote:
>>>> Hi Gustavo,
>>>>
>>>> thanks for addressing this problem and sorry for my late reply. I
>>>> think this is a good change which definitely improves the situation
>>>> for uncommon NUMA configurations without changing the handling for
>>>> common topologies.
>>>
>>> Thanks a lot for reviewing the change!
>>>
>>>
>>>> It would be great if somebody could run this through JPRT, but as
>>>> Gustavo mentioned, I don't expect any regressions.
>>>>
>>>> @Igor: I think you've been the original author of the NUMA-aware
>>>> allocator port to Linux (i.e. "6684395: Port NUMA-aware allocator to
>>>> linux"). If you could find some spare minutes to take a look at this
>>>> change, your comment would be very much appreciated :)
>>>>
>>>> Following some minor comments from me:
>>>>
>>>> - in os::numa_get_groups_num() you now use numa_num_configured_nodes()
>>>> to get the actual number of configured nodes. This is good and
>>>> certainly an improvement over the previous implementation. However,
>>>> the man page for numa_num_configured_nodes() mentions that the
>>>> returned count may contain currently disabled nodes. Do we currently
>>>> handle disabled nodes? What will be the consequence if we would use
>>>> such a disabled node (e.g. mbind() warnings)?
>>>
>>> In [1] 'numa_memnode_ptr' is set to keep a list of *just nodes with memory*
>>> found in /sys/devices/system/node/*. Hence numa_num_configured_nodes() just
>>> returns the number of nodes in 'numa_memnode_ptr' [2], thus just returns the
>>> number of nodes with memory in the system. To the best of my knowledge there is
>>> no system configuration on Linux/PPC64 that could match such a notion of
>>> "disabled nodes" as it appears in libnuma's manual. If it is enabled, it's in
>>> that dir and just the ones with memory will be taken into account. If it's
>>> disabled (somehow), it's not in the dir, so won't be taken into account (i.e. no
>>> mbind() tried against it).
>>>
>>> On Power it's possible to have a numa node without memory (memory-less node, a
>>> case covered in this change), a numa node without cpus at all but with memory
>>> (a configured node anyway, so a case already covered) but to disable a specific
>>> numa node so it does not appear in /sys/devices/system/node/* it's only possible
>>> from the inners of the control module. Or some other rare condition not visible /
>>> adjustable from the OS. Also I'm not aware of a case where a node is in this
>>> dir but is at the same time flagged as something like "disabled". There are
>>> cpu/memory hotplugs, but that does not change numa nodes status AFAIK.
>>>
>>> [1] https://github.com/numactl/numactl/blob/master/libnuma.c#L334-L347
>>> [2] https://github.com/numactl/numactl/blob/master/libnuma.c#L614-L618
>>>
>>>
>>>> - the same question applies to the usage of
>>>> Linux::isnode_in_configured_nodes() within os::numa_get_leaf_groups().
>>>> Does isnode_in_configured_nodes() (i.e. the node set defined by
>>>> 'numa_all_nodes_ptr') take into account the disabled nodes or not? Can
>>>> this be a potential problem (i.e. if we use a disabled node)?
>>>
>>> On the meaning of "disabled nodes", it's the same case as above, so to the
>>> best of my knowledge it's not a potential problem.
>>>
>>> Anyway 'numa_all_nodes_ptr' just includes the configured nodes (with memory),
>>> i.e. "all nodes on which the calling task may allocate memory". It's exactly
>>> the same pointer returned by numa_get_membind() v2 [3] which:
>>>
>>> "returns the mask of nodes from which memory can currently be allocated"
>>>
>>> and that is used, for example, in "numactl --show" to show nodes from where
>>> memory can be allocated [4, 5].
>>>
>>> [3] https://github.com/numactl/numactl/blob/master/libnuma.c#L1147
>>> [4] https://github.com/numactl/numactl/blob/master/numactl.c#L144
>>> [5] https://github.com/numactl/numactl/blob/master/numactl.c#L177
>>>
>>>
>>>> - I'd like to suggest renaming the 'index' part of the following
>>>> variables and functions to 'nindex' ('node_index' is probably too long)
>>>> in the following code, to emphasize that we have node indexes pointing
>>>> to actual, not always consecutive node numbers:
>>>>
>>>> 2879         // Create an index -> node mapping, since nodes are not
>>>> always consecutive
>>>> 2880         _index_to_node = new (ResourceObj::C_HEAP, mtInternal)
>>>> GrowableArray<int>(0, true);
>>>> 2881         rebuild_index_to_node_map();
>>>
>>> Simple change but much better to read indeed. Done.
>>>
>>>
>>>> - can you please wrap the following one-line else statement into curly
>>>> braces (it's more readable and we usually do it that way in HotSpot
>>>> although there are no formal style guidelines :)
>>>>
>>>> 2953      } else
>>>> 2954        // Current node is already a configured node.
>>>> 2955        closest_node = index_to_node()->at(i);
>>>
>>> Done.
>>>
>>>
>>>> - in os::Linux::rebuild_cpu_to_node_map(), if you set
>>>> 'closest_distance' to INT_MAX at the beginning of the loop, you can
>>>> later avoid the check for '|| !closest_distance'. Also, according to
>>>> the man page, numa_distance() returns 0 if it can not determine the
>>>> distance. So with the above change, the condition on line 2974 should
>>>> read:
>>>>
>>>> 2947           if (distance && distance < closest_distance) {
>>>>
>>>
>>> Sure, it's much better to set the initial distance as large as possible and
>>> adjust it step by step to a closer one, improving the if condition. Done.
>>>
>>>
>>>> Finally, and not directly related to your change, I'd suggest the
>>>> following clean-ups:
>>>>
>>>> - remove the usage of 'NCPUS = 32768' in
>>>> os::Linux::rebuild_cpu_to_node_map(). The comment on that line is
>>>> unclear to me and probably related to an older version/problem of
>>>> libnuma? I think we should simply use
>>>> numa_allocate_cpumask()/numa_free_cpumask() instead.
>>>>
>>>> - we still use the NUMA version 1 function prototypes (e.g.
>>>> "numa_node_to_cpus(int node, unsigned long *buffer, int buffer_len)"
>>>> instead of "numa_node_to_cpus(int node, struct bitmask *mask)", but
>>>> also "numa_interleave_memory()" and maybe others). I think we should
>>>> switch all prototypes to the new NUMA version 2 interface which you've
>>>> already used for the new functions which you've added.
>>>
>>> I agree. Could I open a new bug to address these clean-ups?
>>>
>>>
>>>> That said, I think these changes all require libnuma 2.0 (see
>>>> os::Linux::libnuma_dlsym). So before starting this, you should make
>>>> sure that libnuma 2.0 is available on all platforms to which you'd
>>>> like to down-port this change. For jdk10 we could definitely do it,
>>>> for jdk9 probably also, for jdk8 I'm not so sure.
>>>
>>> libnuma v1's last release dates back to 2008, but do you have any idea how I
>>> could check that for sure, since it's shared code?
>>>
>>> new webrev: http://cr.openjdk.java.net/~gromero/8175813/v3/
>>>
>>> Thank you!
>>>
>>> Best regards,
>>> Gustavo
>>>
>>>
>>>> Regards,
>>>> Volker
>>>>
>>>> On Thu, Apr 13, 2017 at 12:51 AM, Gustavo Romero
>>>> <[hidden email]> wrote:
>>>>> Hi,
>>>>>
>>>>> Any update on it?
>>>>>
>>>>> Thank you.
>>>>>
>>>>> Regards,
>>>>> Gustavo
>>>>>
>>>>> On 09-03-2017 16:33, Gustavo Romero wrote:
>>>>>> [...]
>>>>
>>>
>>
Re: [10] RFR (S) 8175813: PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used

David Holmes
In reply to this post by Volker Simonis
Hi Volker, Gustavo,

On 4/05/2017 12:34 AM, Volker Simonis wrote:
> Hi,
>
> I've reviewed Gustavo's change and I'm fine with the latest version at:
>
> http://cr.openjdk.java.net/~gromero/8175813/v3/

Nothing has really changed for me since I first looked at this - I don't
know NUMA and I can't comment on any of the details. But no-one else has
commented negatively so they are implicitly okay with this, or else they
should have spoken up. So with Volker as the Reviewer and myself as a
second reviewer, I will sponsor this. I'll run the current patch through
JPRT while awaiting the final version.

One thing I was unclear on with all this numa code is the expectation
regarding all those dynamically looked up functions - is it expected
that we will have them all or else have none? It wasn't at all obvious
what would happen if we don't have those functions but still executed
this code - assuming that is even possible. I guess I would have
expected that no numa code would execute unless -XX:+UseNUMA was set, in
which case the VM would abort if any of the libnuma functions could not
be found. That way we wouldn't need the null checks for the function
pointers.

Style nits:
- we should avoid implicit booleans, so the isnode_in_* functions should
return bool not int; and check "distance != 0"
- spaces around operators eg. node=0 should be node = 0

Thanks,
David

> Can somebody please sponsor the change?
>
> Thank you and best regards,
> Volker
>
>
> On Wed, May 3, 2017 at 3:27 PM, Gustavo Romero
> <[hidden email]> wrote:
>> Hi community,
>>
>> I understand that there is nothing that can be done additionally regarding this
>> issue, at this point, on the PPC64 side.
>>
>> It's a change in the shared code - but one that in effect does not change
>> anything in the numa detection mechanism for other platforms - and hence a
>> joint community effort is needed to review the change and a sponsor to run it
>> through JPRT.
>>
>> I know it's a stabilizing moment for OpenJDK 9, but since this issue is of
>> great concern on PPC64 (especially on POWER8 machines) I would be very glad if
>> the community could point out directions on how this change could move
>> forward.
>>
>> Thank you!
>>
>> Best regards,
>> Gustavo
>>
>> On 25-04-2017 23:49, Gustavo Romero wrote:
>>> Dear Volker,
>>>
>>> On 24-04-2017 14:08, Volker Simonis wrote:
>>>> [...]
Re: [10] RFR (S) 8175813: PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used

Gustavo Romero
Hi David,

On 04-05-2017 21:32, David Holmes wrote:

> Hi Volker, Gustavo,
>
> On 4/05/2017 12:34 AM, Volker Simonis wrote:
>> Hi,
>>
>> I've reviewed Gustavo's change and I'm fine with the latest version at:
>>
>> http://cr.openjdk.java.net/~gromero/8175813/v3/
>
> Nothing has really changed for me since I first looked at this - I don't know NUMA and I can't comment on any of the details. But no-one else has commented negatively so they are implicitly okay with
> this, or else they should have spoken up. So with Volker as the Reviewer and myself as a second reviewer, I will sponsor this. I'll run the current patch through JPRT while awaiting the final version.

Thanks a lot for reviewing and sponsoring the change.


> One thing I was unclear on with all this numa code is the expectation regarding all those dynamically looked up functions - is it expected that we will have them all or else have none? It wasn't at
> all obvious what would happen if we don't have those functions but still executed this code - assuming that is even possible. I guess I would have expected that no numa code would execute unless
> -XX:+UseNUMA was set, in which case the VM would abort if any of the libnuma functions could not be found. That way we wouldn't need the null checks for the function pointers.

If libnuma is not available in the system, os::Linux::libnuma_init() will return
false and the JVM will refuse to enable the UseNUMA features instead of aborting:

4904   if (UseNUMA) {
4905     if (!Linux::libnuma_init()) {
4906       UseNUMA = false;
4907     } else {

I understand those null checks as part of the initial design of the JVM numa API:
they protect against the usage of its methods in other parts of the code in case
the API failed to initialize properly, even though it's expected that
UseNUMA = false should suffice to protect against such usages.

That said, I could not find any recent Linux distribution that does not support
the libnuma v2 API (and hence also the v1 API). On Ubuntu it's installed as a
dependency of the ubuntu-standard metapackage, which requires "irqbalance",
which in turn requires libnuma. libnuma was updated from v1 to v2 around
mid-2008:

numactl (2.0.1-1) unstable; urgency=low

  * New upstream
  * patches/static-lib.patch: update
  * debian/watch: update to new SGI location

 -- Ian Wienand <[hidden email]>  Sat, 07 Jun 2008 14:18:22 -0700

numactl (1.0.2-1) unstable; urgency=low

  * New upstream
  * Closes: #442690 -- Add to rules a hack to remove libnuma.a after
    unpatching
  * Update README.debian


 -- Ian Wienand <[hidden email]>  Wed, 03 Oct 2007 21:49:27 +1000


It's similar on RHEL, where "irqbalance" is in the core group. The libnuma
version was likewise updated to v2 in 2008, so Fedora 11 onward contains v2,
and hence RHEL 6 and RHEL 7 contain it:

* Wed Feb 25 2009 Fedora Release Engineering <[hidden email]> - 2.0.2-3
- Rebuilt for https://fedoraproject.org/wiki/Fedora_11_Mass_Rebuild

* Mon Sep 29 2008 Neil Horman <[hidden email]> - 2.0.2-2
- Fix build break due to register selection in asm

* Mon Sep 29 2008 Neil Horman <[hidden email]> - 2.0.2-1
- Update rawhide to version 2.0.2 of numactl

* Fri Apr 25 2008 Neil Horman <[hidden email]> - 1.0.2-6
- Fix buffer size passing and arg sanity check for physcpubind (bz 442521)


Also, the last release of libnuma v1 dates back to 2008:
https://github.com/numactl/numactl/releases/tag/v1.0.2

So it looks like the absence of libnuma v2 on Linux is uncommon by now.


> Style nits:
> - we should avoid implicit booleans, so the isnode_in_* functions should return bool not int; and check "distance != 0"
> - spaces around operators eg. node=0 should be node = 0

new webrev: http://cr.openjdk.java.net/~gromero/8175813/v4/


Thank you and best regards,
Gustavo

> Thanks,
> David
>
>> Can somebody please sponsor the change?
>>
>> Thank you and best regards,
>> Volker
>>
>>
>> On Wed, May 3, 2017 at 3:27 PM, Gustavo Romero
>> <[hidden email]> wrote:
>>> Hi community,
>>>
>>> I understand that there is nothing that can be done additionally regarding this
>>> issue, at this point, on the PPC64 side.
>>>
>>> It's a change in the shared code - but that in effect does not change anything in
>>> the numa detection mechanism for other platforms - and hence it's necessary a
>>> conjoint community effort to review the change and a sponsor to run it against
>>> the JPRT.
>>>
>>> I know it's a stabilizing moment of OpenJDK 9, but since that issue is of
>>> great concern on PPC64 (specially on POWER8 machines) I would be very glad if
>>> the community could point out directions on how that change could move on.
>>>
>>> Thank you!
>>>
>>> Best regards,
>>> Gustavo
>>>
>>> On 25-04-2017 23:49, Gustavo Romero wrote:
>>>> Dear Volker,
>>>>
>>>> On 24-04-2017 14:08, Volker Simonis wrote:
>>>>> Hi Gustavo,
>>>>>
>>>>> thanks for addressing this problem and sorry for my late reply. I
>>>>> think this is a good change which definitely improves the situation
>>>>> for uncommon NUMA configurations without changing the handling for
>>>>> common topologies.
>>>>
>>>> Thanks a lot for reviewing the change!
>>>>
>>>>
>>>>> It would be great if somebody could run this trough JPRT, but as
>>>>> Gustavo mentioned, I don't expect any regressions.
>>>>>
>>>>> @Igor: I think you've been the original author of the NUMA-aware
>>>>> allocator port to Linux (i.e. "6684395: Port NUMA-aware allocator to
>>>>> linux"). If you could find some spare minutes to take a look at this
>>>>> change, your comment would be very much appreciated :)
>>>>>
>>>>> Following some minor comments from me:
>>>>>
>>>>> - in os::numa_get_groups_num() you now use numa_num_configured_nodes()
>>>>> to get the actual number of configured nodes. This is good and
>>>>> certainly an improvement over the previous implementation. However,
>>>>> the man page for numa_num_configured_nodes() mentions that the
>>>>> returned count may contain currently disabled nodes. Do we currently
>>>>> handle disabled nodes? What will be the consequence if we would use
>>>>> such a disabled node (e.g. mbind() warnings)?
>>>>
>>>> In [1] 'numa_memnode_ptr' is set to keep a list of *just nodes with memory in
>>>> found in /sys/devices/system/node/* Hence numa_num_configured_nodes() just
>>>> returns the number of nodes in 'numa_memnode_ptr' [2], thus just returns the
>>>> number of nodes with memory in the system. To the best of my knowledge there is
>>>> no system configuration on Linux/PPC64 that could match such a notion of
>>>> "disabled nodes" as it appears in libnuma's manual. If it is enabled, it's in
>>>> that dir and just the ones with memory will be taken into account. If it's
>>>> disabled (somehow), it's not in the dir, so won't be taken into account (i.e. no
>>>> mbind() tried against it).
>>>>
>>>> On Power it's possible to have a numa node without memory (memory-less node, a
>>>> case covered in this change), a numa node without cpus at all but with memory
>>>> (a configured node anyway, so a case already covered) but to disable a specific
>>>> numa node so it does not appear in /sys/devices/system/node/* it's only possible
>>>> from the inners of the control module. Or other rare condition not invisible /
>>>> adjustable from the OS. Also I'm not aware of a case where a node is in this
>>>> dir but is at the same time flagged as something like "disabled". There are
>>>> cpu/memory hotplugs, but that does not change numa nodes status AFAIK.
>>>>
>>>> [1] https://github.com/numactl/numactl/blob/master/libnuma.c#L334-L347
>>>> [2] https://github.com/numactl/numactl/blob/master/libnuma.c#L614-L618
>>>>
>>>>
>>>>> - the same question applies to the usage of
>>>>> Linux::isnode_in_configured_nodes() within os::numa_get_leaf_groups().
>>>>> Does isnode_in_configured_nodes() (i.e. the node set defined by
>>>>> 'numa_all_nodes_ptr' take into account the disabled nodes or not? Can
>>>>> this be a potential problem (i.e. if we use a disabled node).
>>>>
>>>> On the meaning of "disabled nodes", it's the same case as above, so to the
>>>> best of knowledge it's not a potential problem.
>>>>
>>>> Anyway 'numa_all_nodes_ptr' just includes the configured nodes (with memory),
>>>> i.e. "all nodes on which the calling task may allocate memory". It's exactly
>>>> the same pointer returned by numa_get_membind() v2 [3] which:
>>>>
>>>> "returns the mask of nodes from which memory can currently be allocated"
>>>>
>>>> and that is used, for example, in "numactl --show" to show nodes from where
>>>> memory can be allocated [4, 5].
>>>>
>>>> [3] https://github.com/numactl/numactl/blob/master/libnuma.c#L1147
>>>> [4] https://github.com/numactl/numactl/blob/master/numactl.c#L144
>>>> [5] https://github.com/numactl/numactl/blob/master/numactl.c#L177
>>>>
>>>>
>>>>> - I'd like to suggest renaming the 'index' part of the following
>>>>> variables and functions to 'nindex' ('node_index' is probably to long)
>>>>> in the following code, to emphasize that we have node indexes pointing
>>>>> to actual, not always consecutive node numbers:
>>>>>
>>>>> 2879         // Create an index -> node mapping, since nodes are not
>>>>> always consecutive
>>>>> 2880         _index_to_node = new (ResourceObj::C_HEAP, mtInternal)
>>>>> GrowableArray<int>(0, true);
>>>>> 2881         rebuild_index_to_node_map();
>>>>
>>>> Simple change but much better to read indeed. Done.
>>>>
>>>>
>>>>> - can you please wrap the following one-line else statement into curly
>>>>> braces (it's more readable and we usually do it that way in HotSpot
>>>>> although there are no formal style guidelines :)
>>>>>
>>>>> 2953      } else
>>>>> 2954        // Current node is already a configured node.
>>>>> 2955        closest_node = index_to_node()->at(i);
>>>>
>>>> Done.
>>>>
>>>>
>>>>> - in os::Linux::rebuild_cpu_to_node_map(), if you set
>>>>> 'closest_distance' to INT_MAX at the beginning of the loop, you can
>>>>> later avoid the check for '|| !closest_distance'. Also, according to
>>>>> the man page, numa_distance() returns 0 if it can not determine the
>>>>> distance. So with the above change, the condition on line 2974 should
>>>>> read:
>>>>>
>>>>> 2947           if (distance && distance < closest_distance) {
>>>>>
>>>>
>>>> Sure, much better to set the initial condition as distant as possible and
>>>> adjust to a closer one bit by bit, improving the if condition. Done.
>>>>
>>>>
>>>>> Finally, and not directly related to your change, I'd suggest the
>>>>> following clean-ups:
>>>>>
>>>>> - remove the usage of 'NCPUS = 32768' in
>>>>> os::Linux::rebuild_cpu_to_node_map(). The comment on that line is
>>>>> unclear to me and probably related to an older version/problem of
>>>>> libnuma? I think we should simply use
>>>>> numa_allocate_cpumask()/numa_free_cpumask() instead.
>>>>>
>>>>> - we still use the NUMA version 1 function prototypes (e.g.
>>>>> "numa_node_to_cpus(int node, unsigned long *buffer, int buffer_len)"
>>>>> instead of "numa_node_to_cpus(int node, struct bitmask *mask)", but
>>>>> also "numa_interleave_memory()" and maybe others). I think we should
>>>>> switch all prototypes to the new NUMA version 2 interface which you've
>>>>> already used for the new functions which you've added.
>>>>
>>>> I agree. Could I open a new bug to address these clean-ups?
>>>>
>>>>
>>>>> That said, I think these changes all require libnuma 2.0 (see
>>>>> os::Linux::libnuma_dlsym). So before starting this, you should make
>>>>> sure that libnuma 2.0 is available on all platforms to which you'd
>>>>> like to down-port this change. For jdk10 we could definitely do it,
>>>>> for jdk9 probably also, for jdk8 I'm not so sure.
>>>>
>>>> libnuma v1's last release dates back to 2008, but any idea how I could check
>>>> that for sure, since it's in shared code?
>>>>
>>>> new webrev: http://cr.openjdk.java.net/~gromero/8175813/v3/
>>>>
>>>> Thank you!
>>>>
>>>> Best regards,
>>>> Gustavo
>>>>
>>>>
>>>>> Regards,
>>>>> Volker
>>>>>

Re: [10] RFR (S) 8175813: PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used

Volker Simonis
On Fri, May 5, 2017 at 9:43 PM, Gustavo Romero
<[hidden email]> wrote:

> Hi David,
>
> On 04-05-2017 21:32, David Holmes wrote:
>> Hi Volker, Gustavo,
>>
>> On 4/05/2017 12:34 AM, Volker Simonis wrote:
>>> Hi,
>>>
>>> I've reviewed Gustavo's change and I'm fine with the latest version at:
>>>
>>> http://cr.openjdk.java.net/~gromero/8175813/v3/
>>
>> Nothing has really changed for me since I first looked at this - I don't know NUMA and I can't comment on any of the details. But no-one else has commented negatively so they are implicitly okay with
>> this, or else they should have spoken up. So with Volker as the Reviewer and myself as a second reviewer, I will sponsor this. I'll run the current patch through JPRT while awaiting the final version.
>
> Thanks a lot for reviewing and sponsoring the change.
>
>
>> One thing I was unclear on with all this numa code is the expectation regarding all those dynamically looked up functions - is it expected that we will have them all or else have none? It wasn't at
>> all obvious what would happen if we don't have those functions but still executed this code - assuming that is even possible. I guess I would have expected that no numa code would execute unless
>> -XX:+UseNUMA was set, in which case the VM would abort if any of the libnuma functions could not be found. That way we wouldn't need the null checks for the function pointers.
>
> If libnuma is not available on the system, os::Linux::libnuma_init() will return
> false and the JVM will refuse to enable the UseNUMA features instead of aborting:
>
> 4904   if (UseNUMA) {
> 4905     if (!Linux::libnuma_init()) {
> 4906       UseNUMA = false;
> 4907     } else {
>
> I understand those null checks as part of the initial design of JVM numa api to
> enforce protection against the usage of its methods in other parts of the code
> when the JVM api failed to initialize properly, even though it's expected that
> UseNUMA = false should suffice to protect against such usage.
>
> That said, I could not find any recent Linux distribution that does not support
> libnuma v2 api (and so also v1 api). On Ubuntu it will be installed as a
> dependency of the metapackage ubuntu-standard: that requires "irqbalance",
> which in turn requires libnuma. Libnuma was updated from v1 to v2
> around mid 2008:
>
> numactl (2.0.1-1) unstable; urgency=low
>
>   * New upstream
>   * patches/static-lib.patch: update
>   * debian/watch: update to new SGI location
>
>  -- Ian Wienand <[hidden email]>  Sat, 07 Jun 2008 14:18:22 -0700
>
> numactl (1.0.2-1) unstable; urgency=low
>
>   * New upstream
>   * Closes: #442690 -- Add to rules a hack to remove libnuma.a after
>     unpatching
>   * Update README.debian
>
>
>  -- Ian Wienand <[hidden email]>  Wed, 03 Oct 2007 21:49:27 +1000
>
>
> It's similar on RHEL, where "irqbalance" is in the core group. Regarding
> the libnuma version, it was also updated to v2 in 2008; Fedora 11 already
> contains v2, hence RHEL 6 and RHEL 7 contain it:
>
> * Wed Feb 25 2009 Fedora Release Engineering <[hidden email]> - 2.0.2-3
> - Rebuilt for https://fedoraproject.org/wiki/Fedora_11_Mass_Rebuild
>
> * Mon Sep 29 2008 Neil Horman <[hidden email]> - 2.0.2-2
> - Fix build break due to register selection in asm
>
> * Mon Sep 29 2008 Neil Horman <[hidden email]> - 2.0.2-1
> - Update rawhide to version 2.0.2 of numactl
>
> * Fri Apr 25 2008 Neil Horman <[hidden email]> - 1.0.2-6
> - Fix buffer size passing and arg sanity check for physcpubind (bz 442521)
>
>
> Also, the last release of libnuma v1 dates back to 2008:
> https://github.com/numactl/numactl/releases/tag/v1.0.2
>
> So it looks like libnuma v2 absence on Linux is by now uncommon.
>
>
>> Style nits:
>> - we should avoid implicit booleans, so the isnode_in_* functions should return bool not int; and check "distance != 0"
>> - spaces around operators eg. node=0 should be node = 0
>
> new webrev: http://cr.openjdk.java.net/~gromero/8175813/v4/
>

Still good :) Thumbs up!

And thanks a lot for digging into the history of libnuma and its
incarnation in various Linux distros. That's really useful
information!

Regards,
Volker

>
> Thank you and best regards,
> Gustavo
>
>> Thanks,
>> David
>>
>>> Can somebody please sponsor the change?
>>>
>>> Thank you and best regards,
>>> Volker
>>>
>>>
>>> On Wed, May 3, 2017 at 3:27 PM, Gustavo Romero
>>> <[hidden email]> wrote:
>>>> Hi community,
>>>>
>>>> I understand that there is nothing that can be done additionally regarding this
>>>> issue, at this point, on the PPC64 side.
>>>>
>>>> It's a change in the shared code - but one that in effect does not change
>>>> anything in the numa detection mechanism for other platforms - and hence a
>>>> joint community effort is needed to review the change, plus a sponsor to
>>>> run it against the JPRT.
>>>>
>>>> I know it's a stabilizing moment of OpenJDK 9, but since that issue is of
>>>> great concern on PPC64 (especially on POWER8 machines) I would be very glad if
>>>> the community could point out directions on how that change could move on.
>>>>
>>>> Thank you!
>>>>
>>>> Best regards,
>>>> Gustavo
>>>>
>>>> On 25-04-2017 23:49, Gustavo Romero wrote:
>>>>> Dear Volker,
>>>>>
>>>>> On 24-04-2017 14:08, Volker Simonis wrote:
>>>>>> Hi Gustavo,
>>>>>>
>>>>>> thanks for addressing this problem and sorry for my late reply. I
>>>>>> think this is a good change which definitely improves the situation
>>>>>> for uncommon NUMA configurations without changing the handling for
>>>>>> common topologies.
>>>>>
>>>>> Thanks a lot for reviewing the change!
>>>>>
>>>>>
>>>>>> It would be great if somebody could run this through JPRT, but as
>>>>>> Gustavo mentioned, I don't expect any regressions.
>>>>>>
>>>>>> @Igor: I think you've been the original author of the NUMA-aware
>>>>>> allocator port to Linux (i.e. "6684395: Port NUMA-aware allocator to
>>>>>> linux"). If you could find some spare minutes to take a look at this
>>>>>> change, your comment would be very much appreciated :)
>>>>>>
>>>>>> Following some minor comments from me:
>>>>>>
>>>>>> - in os::numa_get_groups_num() you now use numa_num_configured_nodes()
>>>>>> to get the actual number of configured nodes. This is good and
>>>>>> certainly an improvement over the previous implementation. However,
>>>>>> the man page for numa_num_configured_nodes() mentions that the
>>>>>> returned count may contain currently disabled nodes. Do we currently
>>>>>> handle disabled nodes? What will be the consequence if we would use
>>>>>> such a disabled node (e.g. mbind() warnings)?
>>>>>
>>>>> In [1] 'numa_memnode_ptr' is set to keep a list of just the nodes with memory
>>>>> found in /sys/devices/system/node/*. Hence numa_num_configured_nodes() just
>>>>> returns the number of nodes in 'numa_memnode_ptr' [2], thus just returns the
>>>>> number of nodes with memory in the system. To the best of my knowledge there is
>>>>> no system configuration on Linux/PPC64 that could match such a notion of
>>>>> "disabled nodes" as it appears in libnuma's manual. If it is enabled, it's in
>>>>> that dir and just the ones with memory will be taken into account. If it's
>>>>> disabled (somehow), it's not in the dir, so won't be taken into account (i.e. no
>>>>> mbind() tried against it).
>>>>>
>>>>> On Power it's possible to have a numa node without memory (memory-less node, a
>>>>> case covered in this change), a numa node without cpus at all but with memory
>>>>> (a configured node anyway, so a case already covered) but to disable a specific
>>>>> numa node so it does not appear in /sys/devices/system/node/* it's only possible
>>>>> from the inners of the control module, or some other rare condition not visible /
>>>>> adjustable from the OS. Also I'm not aware of a case where a node is in this
>>>>> dir but is at the same time flagged as something like "disabled". There are
>>>>> cpu/memory hotplugs, but that does not change numa nodes status AFAIK.
>>>>>
>>>>> [1] https://github.com/numactl/numactl/blob/master/libnuma.c#L334-L347
>>>>> [2] https://github.com/numactl/numactl/blob/master/libnuma.c#L614-L618
>>>>>
>>>>>
>>>>>> - the same question applies to the usage of
>>>>>> Linux::isnode_in_configured_nodes() within os::numa_get_leaf_groups().
>>>>>> Does isnode_in_configured_nodes() (i.e. the node set defined by
>>>>>> 'numa_all_nodes_ptr') take into account the disabled nodes or not? Can
>>>>>> this be a potential problem (i.e. if we use a disabled node).
>>>>>
>>>>> On the meaning of "disabled nodes", it's the same case as above, so to the
>>>>> best of my knowledge it's not a potential problem.
>>>>>
>>>>> Anyway 'numa_all_nodes_ptr' just includes the configured nodes (with memory),
>>>>> i.e. "all nodes on which the calling task may allocate memory". It's exactly
>>>>> the same pointer returned by numa_get_membind() v2 [3] which:
>>>>>
>>>>> "returns the mask of nodes from which memory can currently be allocated"
>>>>>
>>>>> and that is used, for example, in "numactl --show" to show nodes from where
>>>>> memory can be allocated [4, 5].
>>>>>
>>>>> [3] https://github.com/numactl/numactl/blob/master/libnuma.c#L1147
>>>>> [4] https://github.com/numactl/numactl/blob/master/numactl.c#L144
>>>>> [5] https://github.com/numactl/numactl/blob/master/numactl.c#L177
>>>>>
>>>>>
>>>>>> - I'd like to suggest renaming the 'index' part of the following
>>>>>> variables and functions to 'nindex' ('node_index' is probably too long)
>>>>>> in the following code, to emphasize that we have node indexes pointing
>>>>>> to actual, not always consecutive node numbers:
>>>>>>
>>>>>> 2879         // Create an index -> node mapping, since nodes are not
>>>>>> always consecutive
>>>>>> 2880         _index_to_node = new (ResourceObj::C_HEAP, mtInternal)
>>>>>> GrowableArray<int>(0, true);
>>>>>> 2881         rebuild_index_to_node_map();
>>>>>
>>>>> Simple change but much better to read indeed. Done.
>>>>>
>>>>>
>>>>>> - can you please wrap the following one-line else statement into curly
>>>>>> braces (it's more readable and we usually do it that way in HotSpot
>>>>>> although there are no formal style guidelines :)
>>>>>>
>>>>>> 2953      } else
>>>>>> 2954        // Current node is already a configured node.
>>>>>> 2955        closest_node = index_to_node()->at(i);
>>>>>
>>>>> Done.
>>>>>
>>>>>
>>>>>> - in os::Linux::rebuild_cpu_to_node_map(), if you set
>>>>>> 'closest_distance' to INT_MAX at the beginning of the loop, you can
>>>>>> later avoid the check for '|| !closest_distance'. Also, according to
>>>>>> the man page, numa_distance() returns 0 if it can not determine the
>>>>>> distance. So with the above change, the condition on line 2974 should
>>>>>> read:
>>>>>>
>>>>>> 2947           if (distance && distance < closest_distance) {
>>>>>>
>>>>>
>>>>> Sure, much better to set the initial condition as distant as possible and
>>>>> adjust to a closer one bit by bit, improving the if condition. Done.
>>>>>
>>>>>
>>>>>> Finally, and not directly related to your change, I'd suggest the
>>>>>> following clean-ups:
>>>>>>
>>>>>> - remove the usage of 'NCPUS = 32768' in
>>>>>> os::Linux::rebuild_cpu_to_node_map(). The comment on that line is
>>>>>> unclear to me and probably related to an older version/problem of
>>>>>> libnuma? I think we should simply use
>>>>>> numa_allocate_cpumask()/numa_free_cpumask() instead.
>>>>>>
>>>>>> - we still use the NUMA version 1 function prototypes (e.g.
>>>>>> "numa_node_to_cpus(int node, unsigned long *buffer, int buffer_len)"
>>>>>> instead of "numa_node_to_cpus(int node, struct bitmask *mask)", but
>>>>>> also "numa_interleave_memory()" and maybe others). I think we should
>>>>>> switch all prototypes to the new NUMA version 2 interface which you've
>>>>>> already used for the new functions which you've added.
>>>>>
>>>>> I agree. Could I open a new bug to address these clean-ups?
>>>>>
>>>>>
>>>>>> That said, I think these changes all require libnuma 2.0 (see
>>>>>> os::Linux::libnuma_dlsym). So before starting this, you should make
>>>>>> sure that libnuma 2.0 is available on all platforms to which you'd
>>>>>> like to down-port this change. For jdk10 we could definitely do it,
>>>>>> for jdk9 probably also, for jdk8 I'm not so sure.
>>>>>
>>>>> libnuma v1's last release dates back to 2008, but any idea how I could check
>>>>> that for sure, since it's in shared code?
>>>>>
>>>>> new webrev: http://cr.openjdk.java.net/~gromero/8175813/v3/
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Best regards,
>>>>> Gustavo
>>>>>
>>>>>
>>>>>> Regards,
>>>>>> Volker
>>>>>>
>>
>
Re: [10] RFR (S) 8175813: PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used

David Holmes
In reply to this post by Gustavo Romero
Hi Gustavo,

On 6/05/2017 5:43 AM, Gustavo Romero wrote:

> Hi David,
>
> On 04-05-2017 21:32, David Holmes wrote:
>> Hi Volker, Gustavo,
>>
>> On 4/05/2017 12:34 AM, Volker Simonis wrote:
>>> Hi,
>>>
>>> I've reviewed Gustavo's change and I'm fine with the latest version at:
>>>
>>> http://cr.openjdk.java.net/~gromero/8175813/v3/
>>
>> Nothing has really changed for me since I first looked at this - I don't know NUMA and I can't comment on any of the details. But no-one else has commented negatively so they are implicitly okay with
>> this, or else they should have spoken up. So with Volker as the Reviewer and myself as a second reviewer, I will sponsor this. I'll run the current patch through JPRT while awaiting the final version.
>
> Thanks a lot for reviewing and sponsoring the change.
>
>
>> One thing I was unclear on with all this numa code is the expectation regarding all those dynamically looked up functions - is it expected that we will have them all or else have none? It wasn't at
>> all obvious what would happen if we don't have those functions but still executed this code - assuming that is even possible. I guess I would have expected that no numa code would execute unless
>> -XX:+UseNUMA was set, in which case the VM would abort if any of the libnuma functions could not be found. That way we wouldn't need the null checks for the function pointers.
>
> If libnuma is not available on the system, os::Linux::libnuma_init() will return
> false and the JVM will refuse to enable the UseNUMA features instead of aborting:
>
> 4904   if (UseNUMA) {
> 4905     if (!Linux::libnuma_init()) {
> 4906       UseNUMA = false;
> 4907     } else {
>
> I understand those null checks as part of the initial design of JVM numa api to
> enforce protection against the usage of its methods in other parts of the code
> when the JVM api failed to initialize properly, even though it's expected that
> UseNUMA = false should suffice to protect against such usage.

Ok. Seems like they should be asserts rather than runtime checks if all
the paths are properly guarded by UseNUMA - but that isn't your problem.

> That said, I could not find any recent Linux distribution that does not support
> libnuma v2 api (and so also v1 api). On Ubuntu it will be installed as a
> dependency of the metapackage ubuntu-standard: that requires "irqbalance",
> which in turn requires libnuma. Libnuma was updated from v1 to v2
> around mid 2008:

Thanks for the additional info.

> numactl (2.0.1-1) unstable; urgency=low
>
>   * New upstream
>   * patches/static-lib.patch: update
>   * debian/watch: update to new SGI location
>
>  -- Ian Wienand <[hidden email]>  Sat, 07 Jun 2008 14:18:22 -0700
>
> numactl (1.0.2-1) unstable; urgency=low
>
>   * New upstream
>   * Closes: #442690 -- Add to rules a hack to remove libnuma.a after
>     unpatching
>   * Update README.debian
>
>
>  -- Ian Wienand <[hidden email]>  Wed, 03 Oct 2007 21:49:27 +1000
>
>
> It's similar on RHEL, where "irqbalance" is in the core group. Regarding
> the libnuma version, it was also updated to v2 in 2008; Fedora 11 already
> contains v2, hence RHEL 6 and RHEL 7 contain it:
>
> * Wed Feb 25 2009 Fedora Release Engineering <[hidden email]> - 2.0.2-3
> - Rebuilt for https://fedoraproject.org/wiki/Fedora_11_Mass_Rebuild
>
> * Mon Sep 29 2008 Neil Horman <[hidden email]> - 2.0.2-2
> - Fix build break due to register selection in asm
>
> * Mon Sep 29 2008 Neil Horman <[hidden email]> - 2.0.2-1
> - Update rawhide to version 2.0.2 of numactl
>
> * Fri Apr 25 2008 Neil Horman <[hidden email]> - 1.0.2-6
> - Fix buffer size passing and arg sanity check for physcpubind (bz 442521)
>
>
> Also, the last release of libnuma v1 dates back to 2008:
> https://github.com/numactl/numactl/releases/tag/v1.0.2
>
> So it looks like libnuma v2 absence on Linux is by now uncommon.
>
>
>> Style nits:
>> - we should avoid implicit booleans, so the isnode_in_* functions should return bool not int; and check "distance != 0"
>> - spaces around operators eg. node=0 should be node = 0
>
> new webrev: http://cr.openjdk.java.net/~gromero/8175813/v4/

Looks good. Changes being pushed now.

David
-----

>>>>>> 2947           if (distance && distance < closest_distance) {
>>>>>>
>>>>>
>>>>> Sure, it's much better to set the initial distance as far as possible and
>>>>> adjust it to a closer one step by step, improving the if condition as well. Done.
>>>>>
>>>>>
>>>>>> Finally, and not directly related to your change, I'd suggest the
>>>>>> following clean-ups:
>>>>>>
>>>>>> - remove the usage of 'NCPUS = 32768' in
>>>>>> os::Linux::rebuild_cpu_to_node_map(). The comment on that line is
>>>>>> unclear to me and probably related to an older version/problem of
>>>>>> libnuma? I think we should simply use
>>>>>> numa_allocate_cpumask()/numa_free_cpumask() instead.
>>>>>>
>>>>>> - we still use the NUMA version 1 function prototypes (e.g.
>>>>>> "numa_node_to_cpus(int node, unsigned long *buffer, int buffer_len)"
>>>>>> instead of "numa_node_to_cpus(int node, struct bitmask *mask)", but
>>>>>> also "numa_interleave_memory()" and maybe others). I think we should
>>>>>> switch all prototypes to the new NUMA version 2 interface which you've
>>>>>> already used for the new functions which you've added.
>>>>>
>>>>> I agree. Could I open a new bug to address these clean-ups?
>>>>>
>>>>>
>>>>>> That said, I think these changes all require libnuma 2.0 (see
>>>>>> os::Linux::libnuma_dlsym). So before starting this, you should make
>>>>>> sure that libnuma 2.0 is available on all platforms to which you'd
>>>>>> like to down-port this change. For jdk10 we could definitely do it,
>>>>>> for jdk9 probably also, for jdk8 I'm not so sure.
>>>>>
>>>>> libnuma v1's last release dates back to 2008, but any idea how I could verify
>>>>> that for sure, since it's in shared code?
>>>>>
>>>>> new webrev: http://cr.openjdk.java.net/~gromero/8175813/v3/
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Best regards,
>>>>> Gustavo
>>>>>
>>>>>
>>>>>> Regards,
>>>>>> Volker
>>>>>>
>>>>>> On Thu, Apr 13, 2017 at 12:51 AM, Gustavo Romero
>>>>>> <[hidden email]> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Any update on it?
>>>>>>>
>>>>>>> Thank you.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Gustavo
>>>>>>>
>>>>>>> On 09-03-2017 16:33, Gustavo Romero wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Could the following webrev be reviewed please?
>>>>>>>>
>>>>>>>> It improves the numa node detection when non-consecutive or memory-less nodes
>>>>>>>> exist in the system.
>>>>>>>>
>>>>>>>> webrev: http://cr.openjdk.java.net/~gromero/8175813/v2/
>>>>>>>> bug   : https://bugs.openjdk.java.net/browse/JDK-8175813
>>>>>>>>
>>>>>>>> Currently, although no problem exists when the JVM detects numa nodes that are
>>>>>>>> consecutive and have memory, for example in a numa topology like:
>>>>>>>>
>>>>>>>> available: 2 nodes (0-1)
>>>>>>>> node 0 cpus: 0 8 16 24 32
>>>>>>>> node 0 size: 65258 MB
>>>>>>>> node 0 free: 34 MB
>>>>>>>> node 1 cpus: 40 48 56 64 72
>>>>>>>> node 1 size: 65320 MB
>>>>>>>> node 1 free: 150 MB
>>>>>>>> node distances:
>>>>>>>> node   0   1
>>>>>>>>   0:  10  20
>>>>>>>>   1:  20  10,
>>>>>>>>
>>>>>>>> it fails on detecting numa nodes to be used in the Parallel GC in a numa
>>>>>>>> topology like:
>>>>>>>>
>>>>>>>> available: 4 nodes (0-1,16-17)
>>>>>>>> node 0 cpus: 0 8 16 24 32
>>>>>>>> node 0 size: 130706 MB
>>>>>>>> node 0 free: 7729 MB
>>>>>>>> node 1 cpus: 40 48 56 64 72
>>>>>>>> node 1 size: 0 MB
>>>>>>>> node 1 free: 0 MB
>>>>>>>> node 16 cpus: 80 88 96 104 112
>>>>>>>> node 16 size: 130630 MB
>>>>>>>> node 16 free: 5282 MB
>>>>>>>> node 17 cpus: 120 128 136 144 152
>>>>>>>> node 17 size: 0 MB
>>>>>>>> node 17 free: 0 MB
>>>>>>>> node distances:
>>>>>>>> node   0   1  16  17
>>>>>>>>   0:  10  20  40  40
>>>>>>>>   1:  20  10  40  40
>>>>>>>>  16:  40  40  10  20
>>>>>>>>  17:  40  40  20  10,
>>>>>>>>
>>>>>>>> where node 16 is not consecutive in relation to 1 and also nodes 1 and 17 have
>>>>>>>> no memory.
>>>>>>>>
>>>>>>>> If a topology like that exists, os::numa_make_local() will receive a local group
>>>>>>>> id as a hint that is not available in the system to be bound (it will receive
>>>>>>>> all nodes from 0 to 17), causing a proliferation of "mbind: Invalid argument"
>>>>>>>> messages:
>>>>>>>>
>>>>>>>> http://cr.openjdk.java.net/~gromero/logs/jdk10_pristine.log
>>>>>>>>
>>>>>>>> That change improves the detection by making the JVM numa API aware of the
>>>>>>>> existence of numa nodes that are non-consecutive from 0 to the highest node
>>>>>>>> number and also of nodes that might be memory-less nodes, i.e. that might not
>>>>>>>> be, in libnuma terms, a configured node. Hence just the configured nodes will
>>>>>>>> be available:
>>>>>>>>
>>>>>>>> http://cr.openjdk.java.net/~gromero/logs/jdk10_numa_patched.log
>>>>>>>>
>>>>>>>> The change has no effect on numa topologies where the problem does not occur,
>>>>>>>> i.e. no change in the number of nodes and no change in the cpu to node map. On
>>>>>>>> numa topologies where memory-less nodes exist (like in the last example above),
>>>>>>>> cpus from a memory-less node won't be able to bind locally, so they are mapped
>>>>>>>> to the closest node; otherwise they would not be associated with any node and
>>>>>>>> MutableNUMASpace::cas_allocate() would pick a node randomly, compromising
>>>>>>>> performance.
>>>>>>>>
>>>>>>>> I found no regressions on x64 for the following numa topology:
>>>>>>>>
>>>>>>>> available: 2 nodes (0-1)
>>>>>>>> node 0 cpus: 0 1 2 3 8 9 10 11
>>>>>>>> node 0 size: 24102 MB
>>>>>>>> node 0 free: 19806 MB
>>>>>>>> node 1 cpus: 4 5 6 7 12 13 14 15
>>>>>>>> node 1 size: 24190 MB
>>>>>>>> node 1 free: 21951 MB
>>>>>>>> node distances:
>>>>>>>> node   0   1
>>>>>>>>   0:  10  21
>>>>>>>>   1:  21  10
>>>>>>>>
>>>>>>>> I understand that fixing the current numa detection is a prerequisite to enabling
>>>>>>>> UseNUMA by default [1] and to extending the numa-aware allocation to the G1 GC [2].
>>>>>>>>
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Gustavo
>>>>>>>>
>>>>>>>> [1] https://bugs.openjdk.java.net/browse/JDK-8046153 (JEP 163: Enable NUMA Mode by Default When Appropriate)
>>>>>>>> [2] https://bugs.openjdk.java.net/browse/JDK-8046147 (JEP 157: G1 GC: NUMA-Aware Allocation)
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>
>

Re: [10] RFR (S) 8175813: PPC64: "mbind: Invalid argument" when -XX:+UseNUMA is used

Gustavo Romero
Hi David, Volker

Thanks a lot for reviewing and pushing the change!


Regards,
Gustavo

On 07-05-2017 17:45, David Holmes wrote:

> Hi Gustavo,
>
> On 6/05/2017 5:43 AM, Gustavo Romero wrote:
>> Hi David,
>>
>> On 04-05-2017 21:32, David Holmes wrote:
>>> Hi Volker, Gustavo,
>>>
>>> On 4/05/2017 12:34 AM, Volker Simonis wrote:
>>>> Hi,
>>>>
>>>> I've reviewed Gustavo's change and I'm fine with the latest version at:
>>>>
>>>> http://cr.openjdk.java.net/~gromero/8175813/v3/
>>>
>>> Nothing has really changed for me since I first looked at this - I don't know NUMA and I can't comment on any of the details. But no-one else has commented negatively so they are implicitly okay with
>>> this, or else they should have spoken up. So with Volker as the Reviewer and myself as a second reviewer, I will sponsor this. I'll run the current patch through JPRT while awaiting the final version.
>>
>> Thanks a lot for reviewing and sponsoring the change.
>>
>>
>>> One thing I was unclear on with all this numa code is the expectation regarding all those dynamically looked up functions - is it expected that we will have them all or else have none? It wasn't at
>>> all obvious what would happen if we don't have those functions but still executed this code - assuming that is even possible. I guess I would have expected that no numa code would execute unless
>>> -XX:+UseNUMA was set, in which case the VM would abort if any of the libnuma functions could not be found. That way we wouldn't need the null checks for the function pointers.
>>
>> If libnuma is not available in the system, os::Linux::libnuma_init() will return
>> false and the JVM will refuse to enable the UseNUMA features instead of aborting:
>>
>> 4904   if (UseNUMA) {
>> 4905     if (!Linux::libnuma_init()) {
>> 4906       UseNUMA = false;
>> 4907     } else {
>>
>> I understand those null checks as part of the initial design of the JVM numa API,
>> to protect against the usage of its methods in other parts of the code when the
>> API failed to initialize properly, even though it's expected that
>> UseNUMA = false should suffice to protect against such usages.
>
> Ok. Seems like they should be asserts rather than runtime checks if all the paths are properly guarded by UseNUMA - but that isn't your problem.
>
>> That said, I could not find any recent Linux distribution that does not support
>> the libnuma v2 API (and thus also the v1 API). On Ubuntu it will be installed as
>> a dependency of the ubuntu-standard metapackage: because that requires
>> "irqbalance", it also pulls in libnuma. Libnuma was updated from v1 to v2
>> around mid-2008:
>
> Thanks for the additional info.
>
>> numactl (2.0.1-1) unstable; urgency=low
>>
>>   * New upstream
>>   * patches/static-lib.patch: update
>>   * debian/watch: update to new SGI location
>>
>>  -- Ian Wienand <[hidden email]>  Sat, 07 Jun 2008 14:18:22 -0700
>>
>> numactl (1.0.2-1) unstable; urgency=low
>>
>>   * New upstream
>>   * Closes: #442690 -- Add to rules a hack to remove libnuma.a after
>>     unpatching
>>   * Update README.debian
>>
>>
>>  -- Ian Wienand <[hidden email]>  Wed, 03 Oct 2007 21:49:27 +1000
>>
>>
>> It's similar on RHEL, where "irqbalance" is in the core group. Regarding the
>> libnuma version, it was also updated to v2 in 2008, so Fedora 11 already
>> contains v2, and hence RHEL 6 and RHEL 7 contain it as well:
>>
>> * Wed Feb 25 2009 Fedora Release Engineering <[hidden email]> - 2.0.2-3
>> - Rebuilt for https://fedoraproject.org/wiki/Fedora_11_Mass_Rebuild
>>
>> * Mon Sep 29 2008 Neil Horman <[hidden email]> - 2.0.2-2
>> - Fix build break due to register selection in asm
>>
>> * Mon Sep 29 2008 Neil Horman <[hidden email]> - 2.0.2-1
>> - Update rawhide to version 2.0.2 of numactl
>>
>> * Fri Apr 25 2008 Neil Horman <[hidden email]> - 1.0.2-6
>> - Fix buffer size passing and arg sanity check for physcpubind (bz 442521)
>>
>>
>> Also, the last release of libnuma v1 dates back to 2008:
>> https://github.com/numactl/numactl/releases/tag/v1.0.2
>>
>> So it looks like libnuma v2 absence on Linux is by now uncommon.
>>
>>
>>> Style nits:
>>> - we should avoid implicit booleans, so the isnode_in_* functions should return bool not int; and check "distance != 0"
>>> - spaces around operators eg. node=0 should be node = 0
>>
>> new webrev: http://cr.openjdk.java.net/~gromero/8175813/v4/
>
> Looks good. Changes being pushed now.
>
> David
> -----
>
>>
>> Thank you and best regards,
>> Gustavo
>>
>>> Thanks,
>>> David
>>>
>>>> Can somebody please sponsor the change?
>>>>
>>>> Thank you and best regards,
>>>> Volker
>>>>
>>>>
>>>> On Wed, May 3, 2017 at 3:27 PM, Gustavo Romero
>>>> <[hidden email]> wrote:
>>>>> Hi community,
>>>>>
>>>>> I understand that there is nothing that can be done additionally regarding this
>>>>> issue, at this point, on the PPC64 side.
>>>>>
>>>>> It's a change in the shared code - although in effect it does not change anything
>>>>> in the numa detection mechanism for other platforms - and hence a joint community
>>>>> effort is necessary to review the change, plus a sponsor to run it through
>>>>> JPRT.
>>>>>
>>>>> I know it's a stabilizing moment for OpenJDK 9, but since that issue is of
>>>>> great concern on PPC64 (especially on POWER8 machines) I would be very glad if
>>>>> the community could point out directions on how this change could move forward.
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Best regards,
>>>>> Gustavo
>>>>>
>>>>> On 25-04-2017 23:49, Gustavo Romero wrote:
>>>>>> [...]
>>>>>
>>>
>>
>
