mem: fix allocation failure on non-NUMA kernel

Message ID 20200805122640.13884-1-nick.connolly@mayadata.io
State New
Delegated to: David Marchand

Checks

Context Check Description
ci/travis-robot success Travis build: passed
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-testing success Testing PASS
ci/Intel-compilation success Compilation OK
ci/iol-intel-Performance success Performance Testing PASS
ci/checkpatch success coding style OK

Commit Message

Nick Connolly Aug. 5, 2020, 12:26 p.m. UTC
Running dpdk-helloworld on Linux with libnuma present,
but no kernel support for NUMA (CONFIG_NUMA=n) causes
rte_service_init() to fail with EAL: error allocating
rte services array.

alloc_seg() calls get_mempolicy to verify that the allocation
has happened on the correct socket, but receives ENOSYS from
the kernel and fails the allocation.

The allocated socket should only be verified if check_numa() is true.

Fixes: 2a96c88be83e ("mem: ease init in a docker container")
Cc: nicolas.dichtel@6wind.com
Cc: stable@dpdk.org
Signed-off-by: Nick Connolly <nick.connolly@mayadata.io>
---
 lib/librte_eal/linux/eal_memalloc.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)
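In essence, the fix reduces to one gate. The sketch below uses a hypothetical helper name (`socket_check_passes`) and stand-in parameters for `check_numa()` and the `get_mempolicy()` result; the actual patch keeps this logic inline in `alloc_seg()` and logs the failure paths at DEBUG level:

```c
#include <stdbool.h>

/* Sketch of the patched check in alloc_seg(). kernel_numa stands in for
 * check_numa(), mempolicy_ret for get_mempolicy()'s return value, and
 * cur/wanted for the detected and requested socket ids. */
static bool socket_check_passes(bool kernel_numa, int mempolicy_ret,
                                int cur, int wanted)
{
    if (!kernel_numa)
        return true;        /* CONFIG_NUMA=n: get_mempolicy() would fail
                             * with ENOSYS, so skip verification entirely
                             * instead of failing the allocation */
    if (mempolicy_ret < 0)
        return false;       /* get_mempolicy() failed on a NUMA kernel */
    return cur == wanted;   /* allocation must land on the wanted node */
}
```

On the failing WSL2-style setup the first branch is taken and the allocation proceeds, which is the behaviour the patch restores.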

Comments

Nicolas Dichtel Aug. 5, 2020, 1:42 p.m. UTC | #1
Le 05/08/2020 à 14:26, Nick Connolly a écrit :
> Running dpdk-helloworld on Linux with libnuma present,
> but no kernel support for NUMA (CONFIG_NUMA=n) causes
> rte_service_init() to fail with EAL: error allocating
> rte services array.
> 
> alloc_seg() calls get_mempolicy to verify that the allocation
> has happened on the correct socket, but receives ENOSYS from
> the kernel and fails the allocation.
> 
> The allocated socket should only be verified if check_numa() is true.
> 
> Fixes: 2a96c88be83e ("mem: ease init in a docker container")
I'm wondering if the bug existed before this commit.

Before this commit, it was:
       move_pages(getpid(), 1, &addr, NULL, &cur_socket_id, 0);
       if (cur_socket_id != socket_id) {
               /* error */

Isn't it possible to hit this error case if CONFIG_NUMA is unset in the kernel?

[snip]
> +	if (check_numa()) {
> +		ret = get_mempolicy(&cur_socket_id, NULL, 0, addr,
> +					MPOL_F_NODE | MPOL_F_ADDR);
> +		if (ret < 0) {
> +			RTE_LOG(DEBUG, EAL, "%s(): get_mempolicy: %s\n",
> +				__func__, strerror(errno));
> +			goto mapped;
> +		} else if (cur_socket_id != socket_id) {
> +			RTE_LOG(DEBUG, EAL,
> +					"%s(): allocation happened on wrong socket (wanted %d, got %d)\n",
> +				__func__, socket_id, cur_socket_id);
> +			goto mapped;
> +		}
> +	} else {
> +		if (rte_socket_count() > 1)
> +			RTE_LOG(DEBUG, EAL, "%s(): not checking socket for allocation (wanted %d)\n",
> +					__func__, socket_id);
nit: maybe a higher log level like WARNING?


Regards,
Nicolas
Nick Connolly Aug. 5, 2020, 2:20 p.m. UTC | #2
Hi Nicolas,

Thanks for the quick response.

On 05/08/2020 14:42, Nicolas Dichtel wrote:
> Le 05/08/2020 à 14:26, Nick Connolly a écrit :
>> Running dpdk-helloworld on Linux with libnuma present,
>> but no kernel support for NUMA (CONFIG_NUMA=n) causes
>> rte_service_init() to fail with EAL: error allocating
>> rte services array.
>>
>> alloc_seg() calls get_mempolicy to verify that the allocation
>> has happened on the correct socket, but receives ENOSYS from
>> the kernel and fails the allocation.
>>
>> The allocated socket should only be verified if check_numa() is true.
>>
>> Fixes: 2a96c88be83e ("mem: ease init in a docker container")
> I'm wondering if the bug existed before this commit.
>
> Before this commit, it was:
>         move_pages(getpid(), 1, &addr, NULL, &cur_socket_id, 0);
>         if (cur_socket_id != socket_id) {
>                 /* error */
>
> Isn't it possible to hit this error case if CONFIG_NUMA is unset in the kernel?
I've just run the previous code to test this out and you are right that 
move_pages does indeed return -1 with errno set to ENOSYS, but nothing 
checks this so execution carries on and compares cur_socket_id (which 
will be unchanged from the zero initialization) with socket_id (which is 
presumably also zero), thus allowing the allocation to succeed!
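That silent pass is easy to reproduce in isolation. The stub below is hypothetical, standing in for move_pages() on a CONFIG_NUMA=n kernel: it fails with ENOSYS without ever writing to the status array, so the old unchecked comparison sees 0 == 0:

```c
#include <errno.h>

/* Stub mimicking move_pages() on a CONFIG_NUMA=n kernel: it fails with
 * ENOSYS and, crucially, never writes to the status array. */
static long fake_move_pages_enosys(int *status)
{
    (void)status;               /* deliberately left untouched */
    errno = ENOSYS;
    return -1;
}

/* The pre-2a96c88be83e check, reproduced: the return value of
 * move_pages() is never inspected. */
static int old_check_passes(int socket_id)
{
    int cur_socket_id = 0;      /* zero-initialised, as in the old code */

    fake_move_pages_enosys(&cur_socket_id);
    return cur_socket_id == socket_id;
}
```

With socket_id 0 (the only socket such systems report) the check "passes" by accident, matching the observation above.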

> [snip]
>> +	if (check_numa()) {
>> +		ret = get_mempolicy(&cur_socket_id, NULL, 0, addr,
>> +					MPOL_F_NODE | MPOL_F_ADDR);
>> +		if (ret < 0) {
>> +			RTE_LOG(DEBUG, EAL, "%s(): get_mempolicy: %s\n",
>> +				__func__, strerror(errno));
>> +			goto mapped;
>> +		} else if (cur_socket_id != socket_id) {
>> +			RTE_LOG(DEBUG, EAL,
>> +					"%s(): allocation happened on wrong socket (wanted %d, got %d)\n",
>> +				__func__, socket_id, cur_socket_id);
>> +			goto mapped;
>> +		}
>> +	} else {
>> +		if (rte_socket_count() > 1)
>> +			RTE_LOG(DEBUG, EAL, "%s(): not checking socket for allocation (wanted %d)\n",
>> +					__func__, socket_id);
> nit: maybe a higher log level like WARNING?
Open to guidance here - my concern was that this is going to be 
generated for every call to alloc_seg() and I'm not sure what the 
frequency will be - I'm cautious about flooding the log with warnings 
under 'normal running'.  Are the implications of running on a multi 
socket system with NUMA support disabled in the kernel purely 
performance related for the DPDK or is there a functional correctness 
issue as well?
>
> Regards,
> Nicolas

Regards,
Nick
Nicolas Dichtel Aug. 5, 2020, 2:36 p.m. UTC | #3
Le 05/08/2020 à 16:20, Nick Connolly a écrit :
[snip]
>>> Fixes: 2a96c88be83e ("mem: ease init in a docker container")
>> I'm wondering if the bug existed before this commit.
>>
>> Before this commit, it was:
>>         move_pages(getpid(), 1, &addr, NULL, &cur_socket_id, 0);
>>         if (cur_socket_id != socket_id) {
>>                 /* error */
>>
>> Isn't it possible to hit this error case if CONFIG_NUMA is unset in the kernel?
> I've just run the previous code to test this out and you are right that
> move_pages does indeed return -1 with errno set to ENOSYS, but nothing checks
> this so execution carries on and compares cur_socket_id (which will be unchanged
> from the zero initialization) with socket_id (which is presumably also zero),
> thus allowing the allocation to succeed!
I came to this conclusion, but I didn't check if socket_id could be != 0.

>> [snip]
>>> +    if (check_numa()) {
>>> +        ret = get_mempolicy(&cur_socket_id, NULL, 0, addr,
>>> +                    MPOL_F_NODE | MPOL_F_ADDR);
>>> +        if (ret < 0) {
>>> +            RTE_LOG(DEBUG, EAL, "%s(): get_mempolicy: %s\n",
>>> +                __func__, strerror(errno));
>>> +            goto mapped;
>>> +        } else if (cur_socket_id != socket_id) {
>>> +            RTE_LOG(DEBUG, EAL,
>>> +                    "%s(): allocation happened on wrong socket (wanted %d,
>>> got %d)\n",
>>> +                __func__, socket_id, cur_socket_id);
>>> +            goto mapped;
>>> +        }
>>> +    } else {
>>> +        if (rte_socket_count() > 1)
>>> +            RTE_LOG(DEBUG, EAL, "%s(): not checking socket for allocation
>>> (wanted %d)\n",
>>> +                    __func__, socket_id);
>> nit: maybe a higher log level like WARNING?
> Open to guidance here - my concern was that this is going to be generated for
> every call to alloc_seg() and I'm not sure what the frequency will be - I'm
> cautious about flooding the log with warnings under 'normal running'.  Are the
> implications of running on a multi socket system with NUMA support disabled in
> the kernel purely performance related for the DPDK or is there a functional
> correctness issue as well?
Is it really a 'normal running' to have CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES in
dpdk and not CONFIG_NUMA in the kernel?


Regards,
Nicolas
Nick Connolly Aug. 5, 2020, 2:53 p.m. UTC | #4
On 05/08/2020 15:36, Nicolas Dichtel wrote:
> Le 05/08/2020 à 16:20, Nick Connolly a écrit :
> [snip]
>>>> Fixes: 2a96c88be83e ("mem: ease init in a docker container")
>>> I'm wondering if the bug existed before this commit.
>>>
>>> Before this commit, it was:
>>>          move_pages(getpid(), 1, &addr, NULL, &cur_socket_id, 0);
>>>          if (cur_socket_id != socket_id) {
>>>                  /* error */
>>>
>>> Isn't it possible to hit this error case if CONFIG_NUMA is unset in the kernel?
>> I've just run the previous code to test this out and you are right that
>> move_pages does indeed return -1 with errno set to ENOSYS, but nothing checks
>> this so execution carries on and compares cur_socket_id (which will be unchanged
>> from the zero initialization) with socket_id (which is presumably also zero),
>> thus allowing the allocation to succeed!
> I came to this conclusion, but I didn't check if socket_id could be != 0.
>
>>> [snip]
>>>> +    if (check_numa()) {
>>>> +        ret = get_mempolicy(&cur_socket_id, NULL, 0, addr,
>>>> +                    MPOL_F_NODE | MPOL_F_ADDR);
>>>> +        if (ret < 0) {
>>>> +            RTE_LOG(DEBUG, EAL, "%s(): get_mempolicy: %s\n",
>>>> +                __func__, strerror(errno));
>>>> +            goto mapped;
>>>> +        } else if (cur_socket_id != socket_id) {
>>>> +            RTE_LOG(DEBUG, EAL,
>>>> +                    "%s(): allocation happened on wrong socket (wanted %d,
>>>> got %d)\n",
>>>> +                __func__, socket_id, cur_socket_id);
>>>> +            goto mapped;
>>>> +        }
>>>> +    } else {
>>>> +        if (rte_socket_count() > 1)
>>>> +            RTE_LOG(DEBUG, EAL, "%s(): not checking socket for allocation
>>>> (wanted %d)\n",
>>>> +                    __func__, socket_id);
>>> nit: maybe a higher log level like WARNING?
>> Open to guidance here - my concern was that this is going to be generated for
>> every call to alloc_seg() and I'm not sure what the frequency will be - I'm
>> cautious about flooding the log with warnings under 'normal running'.  Are the
>> implications of running on a multi socket system with NUMA support disabled in
>> the kernel purely performance related for the DPDK or is there a functional
>> correctness issue as well?
> Is it really a 'normal running' to have CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES in
> dpdk and not CONFIG_NUMA in the kernel?

I'm not an expert of DPDK, but I think it needs to be treated as 'normal 
running', for the following reasons:

 1. The existing code in eal_memalloc_alloc_seg_bulk() is designed to
    work even if check_numa() indicates that NUMA support is not enabled:

    #ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
    if (check_numa()) {
             oldmask = numa_allocate_nodemask();
             prepare_numa(&oldpolicy, oldmask, socket);
             have_numa = true;
         }
    #endif
 2. The DPDK application could be built with
    CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES and then the binary run on
    different systems with and without NUMA support.
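Point 2 is exactly what a run-time probe settles. As far as I can tell, check_numa() rests on libnuma's numa_available(), which itself boils down to a get_mempolicy() call; a minimal stand-alone sketch of that probe (an assumption, using the raw syscall to avoid a libnuma dependency) looks like:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Run-time probe in the spirit of numa_available(): on a CONFIG_NUMA=n
 * kernel the get_mempolicy() syscall is not wired up and fails with
 * ENOSYS. Returns 1 if the running kernel supports NUMA, 0 otherwise. */
static int kernel_has_numa(void)
{
#ifdef SYS_get_mempolicy
    long ret = syscall(SYS_get_mempolicy, NULL, NULL, 0UL, NULL, 0UL);

    return !(ret < 0 && errno == ENOSYS);
#else
    return 0;   /* syscall number unknown at build time: assume no NUMA */
#endif
}
```

The same binary then adapts correctly on both kinds of system, which is the portability argued for in point 2.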

Regards,
Nick
Nicolas Dichtel Aug. 5, 2020, 3:13 p.m. UTC | #5
Le 05/08/2020 à 16:53, Nick Connolly a écrit :
[snip]
>>>>> +    if (check_numa()) {
>>>>> +        ret = get_mempolicy(&cur_socket_id, NULL, 0, addr,
>>>>> +                    MPOL_F_NODE | MPOL_F_ADDR);
>>>>> +        if (ret < 0) {
>>>>> +            RTE_LOG(DEBUG, EAL, "%s(): get_mempolicy: %s\n",
>>>>> +                __func__, strerror(errno));
>>>>> +            goto mapped;
>>>>> +        } else if (cur_socket_id != socket_id) {
>>>>> +            RTE_LOG(DEBUG, EAL,
>>>>> +                    "%s(): allocation happened on wrong socket (wanted %d,
>>>>> got %d)\n",
>>>>> +                __func__, socket_id, cur_socket_id);
>>>>> +            goto mapped;
>>>>> +        }
>>>>> +    } else {
>>>>> +        if (rte_socket_count() > 1)
>>>>> +            RTE_LOG(DEBUG, EAL, "%s(): not checking socket for allocation
>>>>> (wanted %d)\n",
>>>>> +                    __func__, socket_id);
>>>> nit: maybe a higher log level like WARNING?
>>> Open to guidance here - my concern was that this is going to be generated for
>>> every call to alloc_seg() and I'm not sure what the frequency will be - I'm
>>> cautious about flooding the log with warnings under 'normal running'.  Are the
>>> implications of running on a multi socket system with NUMA support disabled in
>>> the kernel purely performance related for the DPDK or is there a functional
>>> correctness issue as well?
>> Is it really a 'normal running' to have CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES in
>> dpdk and not CONFIG_NUMA in the kernel?
> 
> I'm not an expert of DPDK, but I think it needs to be treated as 'normal
> running', for the following reasons:
> 
> 1. The existing code in eal_memalloc_alloc_seg_bulk() is designed to
>    work even if check_numa() indicates that NUMA support is not enabled:
> 
>    #ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
>    if (check_numa()) {
>             oldmask = numa_allocate_nodemask();
>             prepare_numa(&oldpolicy, oldmask, socket);
>             have_numa = true;
>         }
>    #endif
The question was not to return an error, but to display a warning. So the code
will work (after your patch), no problem.

> 2. The DPDK application could be built with
>    CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES and then the binary run on
>    different systems with and without NUMA support.
In a production environment, it seems odd to have a custom kernel and a generic
dpdk app, it's why I propose the log level WARNING (or NOTICE maybe?).
I let other comment about this, I don't have a strong opinion.


Regards,
Nicolas
Nick Connolly Aug. 5, 2020, 3:21 p.m. UTC | #6
On 05/08/2020 16:13, Nicolas Dichtel wrote:
> Le 05/08/2020 à 16:53, Nick Connolly a écrit :
> [snip]
>>>>>> +    if (check_numa()) {
>>>>>> +        ret = get_mempolicy(&cur_socket_id, NULL, 0, addr,
>>>>>> +                    MPOL_F_NODE | MPOL_F_ADDR);
>>>>>> +        if (ret < 0) {
>>>>>> +            RTE_LOG(DEBUG, EAL, "%s(): get_mempolicy: %s\n",
>>>>>> +                __func__, strerror(errno));
>>>>>> +            goto mapped;
>>>>>> +        } else if (cur_socket_id != socket_id) {
>>>>>> +            RTE_LOG(DEBUG, EAL,
>>>>>> +                    "%s(): allocation happened on wrong socket (wanted %d,
>>>>>> got %d)\n",
>>>>>> +                __func__, socket_id, cur_socket_id);
>>>>>> +            goto mapped;
>>>>>> +        }
>>>>>> +    } else {
>>>>>> +        if (rte_socket_count() > 1)
>>>>>> +            RTE_LOG(DEBUG, EAL, "%s(): not checking socket for allocation
>>>>>> (wanted %d)\n",
>>>>>> +                    __func__, socket_id);
>>>>> nit: maybe a higher log level like WARNING?
>>>> Open to guidance here - my concern was that this is going to be generated for
>>>> every call to alloc_seg() and I'm not sure what the frequency will be - I'm
>>>> cautious about flooding the log with warnings under 'normal running'.  Are the
>>>> implications of running on a multi socket system with NUMA support disabled in
>>>> the kernel purely performance related for the DPDK or is there a functional
>>>> correctness issue as well?
>>> Is it really a 'normal running' to have CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES in
>>> dpdk and not CONFIG_NUMA in the kernel?
>> I'm not an expert of DPDK, but I think it needs to be treated as 'normal
>> running', for the following reasons:
>>
>> 1. The existing code in eal_memalloc_alloc_seg_bulk() is designed to
>>     work even if check_numa() indicates that NUMA support is not enabled:
>>
>>     #ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
>>     if (check_numa()) {
>>              oldmask = numa_allocate_nodemask();
>>              prepare_numa(&oldpolicy, oldmask, socket);
>>              have_numa = true;
>>          }
>>     #endif
> The question was not to return an error, but to display a warning. So the code
> will work (after your patch), no problem.
>
>> 2. The DPDK application could be built with
>>     CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES and then the binary run on
>>     different systems with and without NUMA support.
> In a production environment, it seems odd to have a custom kernel and a generic
> dpdk app, it's why I propose the log level WARNING (or NOTICE maybe?).
> I let other comment about this, I don't have a strong opinion.
Thanks - appreciate the input - I also have no strong opinions here and 
am happy to be guided.

Regards,
Nick
Burakov, Anatoly Sept. 17, 2020, 11:28 a.m. UTC | #7
On 05-Aug-20 4:21 PM, Nick Connolly wrote:
> 
> 
> On 05/08/2020 16:13, Nicolas Dichtel wrote:
>> Le 05/08/2020 à 16:53, Nick Connolly a écrit :
>> [snip]
>>>>>>> +    if (check_numa()) {
>>>>>>> +        ret = get_mempolicy(&cur_socket_id, NULL, 0, addr,
>>>>>>> +                    MPOL_F_NODE | MPOL_F_ADDR);
>>>>>>> +        if (ret < 0) {
>>>>>>> +            RTE_LOG(DEBUG, EAL, "%s(): get_mempolicy: %s\n",
>>>>>>> +                __func__, strerror(errno));
>>>>>>> +            goto mapped;
>>>>>>> +        } else if (cur_socket_id != socket_id) {
>>>>>>> +            RTE_LOG(DEBUG, EAL,
>>>>>>> +                    "%s(): allocation happened on wrong socket 
>>>>>>> (wanted %d,
>>>>>>> got %d)\n",
>>>>>>> +                __func__, socket_id, cur_socket_id);
>>>>>>> +            goto mapped;
>>>>>>> +        }
>>>>>>> +    } else {
>>>>>>> +        if (rte_socket_count() > 1)
>>>>>>> +            RTE_LOG(DEBUG, EAL, "%s(): not checking socket for 
>>>>>>> allocation
>>>>>>> (wanted %d)\n",
>>>>>>> +                    __func__, socket_id);
>>>>>> nit: maybe a higher log level like WARNING?
>>>>> Open to guidance here - my concern was that this is going to be 
>>>>> generated for
>>>>> every call to alloc_seg() and I'm not sure what the frequency will 
>>>>> be - I'm
>>>>> cautious about flooding the log with warnings under 'normal 
>>>>> running'.  Are the
>>>>> implications of running on a multi socket system with NUMA support 
>>>>> disabled in
>>>>> the kernel purely performance related for the DPDK or is there a 
>>>>> functional
>>>>> correctness issue as well?
>>>> Is it really a 'normal running' to have 
>>>> CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES in
>>>> dpdk and not CONFIG_NUMA in the kernel?
>>> I'm not an expert of DPDK, but I think it needs to be treated as 'normal
>>> running', for the following reasons:
>>>
>>> 1. The existing code in eal_memalloc_alloc_seg_bulk() is designed to
>>>     work even if check_numa() indicates that NUMA support is not 
>>> enabled:
>>>
>>>     #ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
>>>     if (check_numa()) {
>>>              oldmask = numa_allocate_nodemask();
>>>              prepare_numa(&oldpolicy, oldmask, socket);
>>>              have_numa = true;
>>>          }
>>>     #endif
>> The question was not to return an error, but to display a warning. So 
>> the code
>> will work (after your patch), no problem.
>>
>>> 2. The DPDK application could be built with
>>>     CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES and then the binary run on
>>>     different systems with and without NUMA support.
>> In a production environment, it seems odd to have a custom kernel and 
>> a generic
>> dpdk app, it's why I propose the log level WARNING (or NOTICE maybe?).
>> I let other comment about this, I don't have a strong opinion.
> Thanks - appreciate the input - I also have no strong opinions here and 
> am happy to be guided.

If there is a socket mismatch, wouldn't allocation fail anyway, which 
would result in an error message? Usually, when things fail, the first 
obvious thing to do is turn up the debug log. I'm inclined to leave it 
as DEBUG.

> 
> Regards,
> Nick
Burakov, Anatoly Sept. 17, 2020, 11:31 a.m. UTC | #8
On 05-Aug-20 1:26 PM, Nick Connolly wrote:
> Running dpdk-helloworld on Linux with libnuma present,
> but no kernel support for NUMA (CONFIG_NUMA=n) causes
> rte_service_init() to fail with EAL: error allocating
> rte services array.
> 
> alloc_seg() calls get_mempolicy to verify that the allocation
> has happened on the correct socket, but receives ENOSYS from
> the kernel and fails the allocation.
> 
> The allocated socket should only be verified if check_numa() is true.
> 
> Fixes: 2a96c88be83e ("mem: ease init in a docker container")
> Cc: nicolas.dichtel@6wind.com
> Cc: stable@dpdk.org
> Signed-off-by: Nick Connolly <nick.connolly@mayadata.io>
> ---
>   lib/librte_eal/linux/eal_memalloc.c | 28 +++++++++++++++++-----------
>   1 file changed, 17 insertions(+), 11 deletions(-)
> 
> diff --git a/lib/librte_eal/linux/eal_memalloc.c b/lib/librte_eal/linux/eal_memalloc.c
> index db60e7997..179757809 100644
> --- a/lib/librte_eal/linux/eal_memalloc.c
> +++ b/lib/librte_eal/linux/eal_memalloc.c
> @@ -610,17 +610,23 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
>   	}
>   
>   #ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
> -	ret = get_mempolicy(&cur_socket_id, NULL, 0, addr,
> -			    MPOL_F_NODE | MPOL_F_ADDR);
> -	if (ret < 0) {
> -		RTE_LOG(DEBUG, EAL, "%s(): get_mempolicy: %s\n",
> -			__func__, strerror(errno));
> -		goto mapped;
> -	} else if (cur_socket_id != socket_id) {
> -		RTE_LOG(DEBUG, EAL,
> -				"%s(): allocation happened on wrong socket (wanted %d, got %d)\n",
> -			__func__, socket_id, cur_socket_id);
> -		goto mapped;
> +	if (check_numa()) {
> +		ret = get_mempolicy(&cur_socket_id, NULL, 0, addr,
> +					MPOL_F_NODE | MPOL_F_ADDR);
> +		if (ret < 0) {
> +			RTE_LOG(DEBUG, EAL, "%s(): get_mempolicy: %s\n",
> +				__func__, strerror(errno));
> +			goto mapped;
> +		} else if (cur_socket_id != socket_id) {
> +			RTE_LOG(DEBUG, EAL,
> +					"%s(): allocation happened on wrong socket (wanted %d, got %d)\n",
> +				__func__, socket_id, cur_socket_id);
> +			goto mapped;
> +		}
> +	} else {
> +		if (rte_socket_count() > 1)
> +			RTE_LOG(DEBUG, EAL, "%s(): not checking socket for allocation (wanted %d)\n",
> +					__func__, socket_id);
>   	}

If there is no kernel support for NUMA, how would we end up with >1 
socket count?

>   #else
>   	if (rte_socket_count() > 1)
>
Nick Connolly Sept. 17, 2020, 12:29 p.m. UTC | #9
Hi Anatoly,

Thanks for the response.  You are asking a good question - here's what I 
know:

The issue arose on a single socket system, running WSL2 (full Linux 
kernel running as a lightweight VM under Windows).
The default kernel in this environment is built with CONFIG_NUMA=n which 
means get_mempolicy() returns an error.
This causes the check to ensure that the allocated memory is associated 
with the correct socket to fail.

The change is to skip the allocation check if check_numa() indicates 
that NUMA-aware memory is not supported.

Researching the meaning of CONFIG_NUMA, I found 
https://cateee.net/lkddb/web-lkddb/NUMA.html which says:
> Enable NUMA (Non-Uniform Memory Access) support.
> The kernel will try to allocate memory used by a CPU on the local 
> memory controller of the CPU and add some more NUMA awareness to the 
> kernel.

Clearly CONFIG_NUMA enables memory awareness, but there's no indication 
in the description whether information about the NUMA physical 
architecture is 'hidden', or whether it is still exposed through 
/sys/devices/system/node* (which is used by the rte initialisation code 
to determine how many sockets there are). Unfortunately, I don't have 
ready access to a multi-socket Linux system that I can test this out on, 
so I took the conservative approach that it may be possible to have 
CONFIG_NUMA disabled, but the kernel still report more than one node, 
and coded the change to generate a debug message if this occurs.

Do you know whether CONFIG_NUMA turns off all knowledge about the 
hardware architecture?  If it does, then I agree that the test for 
rte_socket_count() serves no purpose and should be removed.

Many thanks,
Nick


On 17/09/2020 12:31, Burakov, Anatoly wrote:
> On 05-Aug-20 1:26 PM, Nick Connolly wrote:
>> Running dpdk-helloworld on Linux with libnuma present,
>> but no kernel support for NUMA (CONFIG_NUMA=n) causes
>> rte_service_init() to fail with EAL: error allocating
>> rte services array.
>>
>> alloc_seg() calls get_mempolicy to verify that the allocation
>> has happened on the correct socket, but receives ENOSYS from
>> the kernel and fails the allocation.
>>
>> The allocated socket should only be verified if check_numa() is true.
>>
>> Fixes: 2a96c88be83e ("mem: ease init in a docker container")
>> Cc: nicolas.dichtel@6wind.com
>> Cc: stable@dpdk.org
>> Signed-off-by: Nick Connolly <nick.connolly@mayadata.io>
>> ---
>>   lib/librte_eal/linux/eal_memalloc.c | 28 +++++++++++++++++-----------
>>   1 file changed, 17 insertions(+), 11 deletions(-)
>>
>> diff --git a/lib/librte_eal/linux/eal_memalloc.c 
>> b/lib/librte_eal/linux/eal_memalloc.c
>> index db60e7997..179757809 100644
>> --- a/lib/librte_eal/linux/eal_memalloc.c
>> +++ b/lib/librte_eal/linux/eal_memalloc.c
>> @@ -610,17 +610,23 @@ alloc_seg(struct rte_memseg *ms, void *addr, 
>> int socket_id,
>>       }
>>     #ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
>> -    ret = get_mempolicy(&cur_socket_id, NULL, 0, addr,
>> -                MPOL_F_NODE | MPOL_F_ADDR);
>> -    if (ret < 0) {
>> -        RTE_LOG(DEBUG, EAL, "%s(): get_mempolicy: %s\n",
>> -            __func__, strerror(errno));
>> -        goto mapped;
>> -    } else if (cur_socket_id != socket_id) {
>> -        RTE_LOG(DEBUG, EAL,
>> -                "%s(): allocation happened on wrong socket (wanted 
>> %d, got %d)\n",
>> -            __func__, socket_id, cur_socket_id);
>> -        goto mapped;
>> +    if (check_numa()) {
>> +        ret = get_mempolicy(&cur_socket_id, NULL, 0, addr,
>> +                    MPOL_F_NODE | MPOL_F_ADDR);
>> +        if (ret < 0) {
>> +            RTE_LOG(DEBUG, EAL, "%s(): get_mempolicy: %s\n",
>> +                __func__, strerror(errno));
>> +            goto mapped;
>> +        } else if (cur_socket_id != socket_id) {
>> +            RTE_LOG(DEBUG, EAL,
>> +                    "%s(): allocation happened on wrong socket 
>> (wanted %d, got %d)\n",
>> +                __func__, socket_id, cur_socket_id);
>> +            goto mapped;
>> +        }
>> +    } else {
>> +        if (rte_socket_count() > 1)
>> +            RTE_LOG(DEBUG, EAL, "%s(): not checking socket for 
>> allocation (wanted %d)\n",
>> +                    __func__, socket_id);
>>       }
>
> If there is no kernel support for NUMA, how would we end up with >1 
> socket count?
>
>>   #else
>>       if (rte_socket_count() > 1)
>>
>
>
Burakov, Anatoly Sept. 17, 2020, 12:57 p.m. UTC | #10
On 17-Sep-20 1:29 PM, Nick Connolly wrote:
> Hi Anatoly,
> 
> Thanks for the response.  You are asking a good question - here's what I 
> know:
> 
> The issue arose on a single socket system, running WSL2 (full Linux 
> kernel running as a lightweight VM under Windows).
> The default kernel in this environment is built with CONFIG_NUMA=n which 
> means get_mempolicy() returns an error.
> This causes the check to ensure that the allocated memory is associated 
> with the correct socket to fail.
> 
> The change is to skip the allocation check if check_numa() indicates 
> that NUMA-aware memory is not supported.
> 
> Researching the meaning of CONFIG_NUMA, I found 
> https://cateee.net/lkddb/web-lkddb/NUMA.html which says:
>> Enable NUMA (Non-Uniform Memory Access) support.
>> The kernel will try to allocate memory used by a CPU on the local 
>> memory controller of the CPU and add some more NUMA awareness to the 
>> kernel.
> 
> Clearly CONFIG_NUMA enables memory awareness, but there's no indication 
> in the description whether information about the NUMA physical 
> architecture is 'hidden', or whether it is still exposed through 
> /sys/devices/system/node* (which is used by the rte initialisation code 
> to determine how many sockets there are). Unfortunately, I don't have 
> ready access to a multi-socket Linux system that I can test this out on, 
> so I took the conservative approach that it may be possible to have 
> CONFIG_NUMA disabled, but the kernel still report more than one node, 
> and coded the change to generate a debug message if this occurs.
> 
> Do you know whether CONFIG_NUMA turns off all knowledge about the 
> hardware architecture?  If it does, then I agree that the test for 
> rte_socket_count() serves no purpose and should be removed.
> 

I have a system with a custom-compiled kernel; I can recompile it 
without this flag and test this. I'll report back with results :)
Nick Connolly Sept. 17, 2020, 1:05 p.m. UTC | #11
Hi Anatoly,

Thanks.  My recollection is that all of the NUMA configuration flags 
were set to 'n'.

Regards,
Nick

On 17/09/2020 13:57, Burakov, Anatoly wrote:
> On 17-Sep-20 1:29 PM, Nick Connolly wrote:
>> Hi Anatoly,
>>
>> Thanks for the response.  You are asking a good question - here's 
>> what I know:
>>
>> The issue arose on a single socket system, running WSL2 (full Linux 
>> kernel running as a lightweight VM under Windows).
>> The default kernel in this environment is built with CONFIG_NUMA=n 
>> which means get_mempolicy() returns an error.
>> This causes the check to ensure that the allocated memory is 
>> associated with the correct socket to fail.
>>
>> The change is to skip the allocation check if check_numa() indicates 
>> that NUMA-aware memory is not supported.
>>
>> Researching the meaning of CONFIG_NUMA, I found 
>> https://cateee.net/lkddb/web-lkddb/NUMA.html which says:
>>> Enable NUMA (Non-Uniform Memory Access) support.
>>> The kernel will try to allocate memory used by a CPU on the local 
>>> memory controller of the CPU and add some more NUMA awareness to the 
>>> kernel.
>>
>> Clearly CONFIG_NUMA enables memory awareness, but there's no 
>> indication in the description whether information about the NUMA 
>> physical architecture is 'hidden', or whether it is still exposed 
>> through /sys/devices/system/node* (which is used by the rte 
>> initialisation code to determine how many sockets there are). 
>> Unfortunately, I don't have ready access to a multi-socket Linux 
>> system that I can test this out on, so I took the conservative 
>> approach that it may be possible to have CONFIG_NUMA disabled, but 
>> the kernel still report more than one node, and coded the change to 
>> generate a debug message if this occurs.
>>
>> Do you know whether CONFIG_NUMA turns off all knowledge about the 
>> hardware architecture?  If it does, then I agree that the test for 
>> rte_socket_count() serves no purpose and should be removed.
>>
>
> I have a system with a custom-compiled kernel; I can recompile it 
> without this flag and test this. I'll report back with results :)
>
Burakov, Anatoly Sept. 17, 2020, 2:07 p.m. UTC | #12
On 17-Sep-20 2:05 PM, Nick Connolly wrote:
> Hi Anatoly,
> 
> Thanks.  My recollection is that all of the NUMA configuration flags 
> were set to 'n'.
> 
> Regards,
> Nick
> 
> On 17/09/2020 13:57, Burakov, Anatoly wrote:
>> On 17-Sep-20 1:29 PM, Nick Connolly wrote:
>>> Hi Anatoly,
>>>
>>> Thanks for the response.  You are asking a good question - here's 
>>> what I know:
>>>
>>> The issue arose on a single socket system, running WSL2 (full Linux 
>>> kernel running as a lightweight VM under Windows).
>>> The default kernel in this environment is built with CONFIG_NUMA=n 
>>> which means get_mempolicy() returns an error.
>>> This causes the check to ensure that the allocated memory is 
>>> associated with the correct socket to fail.
>>>
>>> The change is to skip the allocation check if check_numa() indicates 
>>> that NUMA-aware memory is not supported.
>>>
>>> Researching the meaning of CONFIG_NUMA, I found 
>>> https://cateee.net/lkddb/web-lkddb/NUMA.html which says:
>>>> Enable NUMA (Non-Uniform Memory Access) support.
>>>> The kernel will try to allocate memory used by a CPU on the local 
>>>> memory controller of the CPU and add some more NUMA awareness to the 
>>>> kernel.
>>>
>>> Clearly CONFIG_NUMA enables memory awareness, but there's no 
>>> indication in the description whether information about the NUMA 
>>> physical architecture is 'hidden', or whether it is still exposed 
>>> through /sys/devices/system/node* (which is used by the rte 
>>> initialisation code to determine how many sockets there are). 
>>> Unfortunately, I don't have ready access to a multi-socket Linux 
>>> system that I can test this out on, so I took the conservative 
>>> approach that it may be possible to have CONFIG_NUMA disabled, but 
>>> the kernel still report more than one node, and coded the change to 
>>> generate a debug message if this occurs.
>>>
>>> Do you know whether CONFIG_NUMA turns off all knowledge about the 
>>> hardware architecture?  If it does, then I agree that the test for 
>>> rte_socket_count() serves no purpose and should be removed.
>>>
>>
>> I have a system with a custom compiled kernel, i can recompile it 
>> without this flag and test this. I'll report back with results :)
>>
> 

With CONFIG_NUMA set to 'n':

[root@xxx ~]# find /sys -name "node*"
/sys/kernel/software_nodes/node0
[root@xxx ~]#

This is confirmed by running DPDK on that machine - i can see all cores 
from all sockets, but they're all appearing on socket 0. So, yes, that 
check isn't necessary :)
Nick Connolly Sept. 17, 2020, 2:08 p.m. UTC | #13
Excellent - thanks - I'll amend the patch.

On 17/09/2020 15:07, Burakov, Anatoly wrote:
> [...]
> With CONFIG_NUMA set to 'n':
>
> [root@xxx ~]# find /sys -name "node*"
> /sys/kernel/software_nodes/node0
> [root@xxx ~]#
>
> This is confirmed by running DPDK on that machine - i can see all 
> cores from all sockets, but they're all appearing on socket 0. So, 
> yes, that check isn't necessary :)
>
Burakov, Anatoly Sept. 17, 2020, 2:18 p.m. UTC | #14
On 17-Sep-20 3:08 PM, Nick Connolly wrote:
> Excellent - thanks - I'll amend the patch.
> 
> [...]

I would also add a comment explaining why we're checking for NUMA 
support at runtime when NUMA support is defined at compile time.
Nick Connolly Sept. 17, 2020, 2:19 p.m. UTC | #15
Sure.

On 17/09/2020 15:18, Burakov, Anatoly wrote:
> [...]
> I would also add a comment explaining why we're checking for NUMA 
> support when NUMA support is defined at compile time.
>

Patch
diff mbox series

diff --git a/lib/librte_eal/linux/eal_memalloc.c b/lib/librte_eal/linux/eal_memalloc.c
index db60e7997..179757809 100644
--- a/lib/librte_eal/linux/eal_memalloc.c
+++ b/lib/librte_eal/linux/eal_memalloc.c
@@ -610,17 +610,23 @@  alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	}
 
 #ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
-	ret = get_mempolicy(&cur_socket_id, NULL, 0, addr,
-			    MPOL_F_NODE | MPOL_F_ADDR);
-	if (ret < 0) {
-		RTE_LOG(DEBUG, EAL, "%s(): get_mempolicy: %s\n",
-			__func__, strerror(errno));
-		goto mapped;
-	} else if (cur_socket_id != socket_id) {
-		RTE_LOG(DEBUG, EAL,
-				"%s(): allocation happened on wrong socket (wanted %d, got %d)\n",
-			__func__, socket_id, cur_socket_id);
-		goto mapped;
+	if (check_numa()) {
+		ret = get_mempolicy(&cur_socket_id, NULL, 0, addr,
+					MPOL_F_NODE | MPOL_F_ADDR);
+		if (ret < 0) {
+			RTE_LOG(DEBUG, EAL, "%s(): get_mempolicy: %s\n",
+				__func__, strerror(errno));
+			goto mapped;
+		} else if (cur_socket_id != socket_id) {
+			RTE_LOG(DEBUG, EAL,
+					"%s(): allocation happened on wrong socket (wanted %d, got %d)\n",
+				__func__, socket_id, cur_socket_id);
+			goto mapped;
+		}
+	} else {
+		if (rte_socket_count() > 1)
+			RTE_LOG(DEBUG, EAL, "%s(): not checking socket for allocation (wanted %d)\n",
+					__func__, socket_id);
 	}
 #else
 	if (rte_socket_count() > 1)