ethdev: fix device init without socket-local memory

Message ID 20240711123500.483119-1-bruce.richardson@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Ferruh Yigit
Series ethdev: fix device init without socket-local memory

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/loongarch-compilation success Compilation OK
ci/loongarch-unit-testing success Unit Testing PASS
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS
ci/github-robot: build success github build: passed
ci/intel-Functional success Functional PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-abi-testing success Testing PASS
ci/iol-unit-arm64-testing warning Testing issues
ci/iol-compile-amd64-testing success Testing PASS
ci/iol-unit-amd64-testing success Testing PASS
ci/iol-compile-arm64-testing fail Testing issues
ci/iol-sample-apps-testing success Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS

Commit Message

Bruce Richardson July 11, 2024, 12:35 p.m. UTC
When allocating memory for an ethdev, the rte_malloc_socket call used
allocates memory only on the NUMA node/socket local to the device. This
means that even if the user wanted to, they could never use a remote NIC
without also having memory on that NIC's socket.

For example, if we change examples/skeleton/basicfwd.c to have
SOCKET_ID_ANY as the socket_id parameter for Rx and Tx rings, we should
be able to run the app cross-numa e.g. as below, where the two PCI
devices are on socket 1, and core 1 is on socket 0:

 ./build/examples/dpdk-skeleton -l 1 --legacy-mem --socket-mem=1024,0 \
		-a a8:00.0 -a b8:00.0

This fails, however, with the error:

  ETHDEV: failed to allocate private data
  PCI_BUS: Requested device 0000:a8:00.0 cannot be used

We can remove this restriction by making a fallback call to the general
rte_malloc after a call to rte_malloc_socket fails. This should be safe
to do because the later ethdev calls to set up Rx/Tx queues all take a
socket_id parameter, which applications can use to enforce the
requirement for local-only memory for a device, if so desired. [If
device-local memory is present it will be used as before, while if not
present the rte_eth_dev_configure call will now pass, but the subsequent
queue setup calls requesting local memory will fail.]
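
For illustration only (not part of this patch), a minimal sketch of how an
application can still insist on device-local memory at queue-setup time;
port_id, nb_rxd and mbuf_pool are hypothetical names assumed to be set up
already:

  #include <rte_ethdev.h>

  /* ask for memory on the device's own NUMA node; the setup call
   * fails if no memory is available on that socket */
  int sock = rte_eth_dev_socket_id(port_id);
  int ret = rte_eth_rx_queue_setup(port_id, 0 /* queue */, nb_rxd,
  		sock, NULL /* default rxconf */, mbuf_pool);
  if (ret < 0)
  	rte_exit(EXIT_FAILURE, "no device-local memory for Rx queue\n");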

Fixes: e489007a411c ("ethdev: add generic create/destroy ethdev APIs")
Fixes: dcd5c8112bc3 ("ethdev: add PCI driver helpers")
Cc: stable@dpdk.org

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Signed-off-by: Padraig Connolly <padraig.j.connolly@intel.com>
---
 lib/ethdev/ethdev_driver.c |  7 ++++++-
 lib/ethdev/ethdev_pci.h    | 11 ++++++++++-
 2 files changed, 16 insertions(+), 2 deletions(-)
  

Comments

Ferruh Yigit July 19, 2024, 8:59 a.m. UTC | #1
On 7/11/2024 1:35 PM, Bruce Richardson wrote:
> [...]

Hi Bruce,

If device-local memory is present, the behavior will be the same, so I
agree this is low impact.

But when device-local memory is NOT present, the question is whether we
should enforce the HW setup. This can be beneficial for users new to DPDK.

Probably 'dev_private' on its own has a small impact on performance
(although that may depend on whether its fields are used in the datapath),
but it may be a vehicle to enforce local memory.

What is enabled by allowing the app to run with cross-NUMA memory, since
in production I expect users would want to use device-local memory for
performance reasons anyway?


Also, I am not sure if this is a fix or a change of intentional behavior.
  
Bruce Richardson July 19, 2024, 9:57 a.m. UTC | #2
On Fri, Jul 19, 2024 at 09:59:50AM +0100, Ferruh Yigit wrote:
> On 7/11/2024 1:35 PM, Bruce Richardson wrote:
> > [...]
> 
> Hi Bruce,
> 
> If device-local memory is present, the behavior will be the same, so I
> agree this is low impact.
> 
> But when device-local memory is NOT present, the question is whether we
> should enforce the HW setup. This can be beneficial for users new to DPDK.
> 

No, we should not, because if we do, there is no way for the user to
allow using remote memory - the probe/init and even the configure calls
have NO socket_id parameter, so the enforcement of local memory is an
internal assumption on the part of the API which is not documented
anywhere, and is not possible for the user to override.

> Probably 'dev_private' on its own has a small impact on performance
> (although that may depend on whether its fields are used in the datapath),
> but it may be a vehicle to enforce local memory.
> 

As I explain above in the cover letter, there are already other ways to
enforce local memory - we don't need another one. If the user only wants to
use local memory for a port, they can do so by setting the socket_id
parameter of the rx and tx queues. Enforcing local memory in probe
doesn't add anything to that, and just prevents other use-cases.

> What is enabled by allowing the app to run with cross-NUMA memory, since
> in production I expect users would want to use device-local memory for
> performance reasons anyway?
> 

Mostly users want socket-local memory, but with increasing speeds of NICs
we are already seeing cases where users want to run cross-NUMA. For
example, a multi-port NIC may have some ports in use on each socket.

> 
> Also, I am not sure if this is a fix or a change of intentional behavior.
> 

I suppose it can be viewed either way. However, for me this is a fix,
because right now it is impossible for many users to run their applications
with memory on a different socket to the ports. Nowhere is it documented in
DPDK that there is a hard restriction that ports must have local memory, so
any enforcement of such a policy is wrong.

Turning things the other way around - I can't see how anything will break
or even slow down with this patch applied. As I point out above, the user
can already enforce local memory by passing the required socket id when
allocating Rx and Tx rings - this patch only pushes the failure for
non-local memory a bit later in the initialization sequence, where the user
can actually specify the desired NUMA behaviour. Is there some
case I'm missing where you foresee this causing problems?

/Bruce
  
Bruce Richardson July 19, 2024, 10:41 a.m. UTC | #3
On Fri, Jul 19, 2024 at 09:59:50AM +0100, Ferruh Yigit wrote:
> On 7/11/2024 1:35 PM, Bruce Richardson wrote:
> > [...]
> 
> Hi Bruce,
> 
> If device-local memory is present, the behavior will be the same, so I
> agree this is low impact.
> 
> But when device-local memory is NOT present, the question is whether we
> should enforce the HW setup. This can be beneficial for users new to DPDK.
> 
> Probably 'dev_private' on its own has a small impact on performance
> (although that may depend on whether its fields are used in the datapath),
> but it may be a vehicle to enforce local memory.
> 
Actually, thinking about it further, I think that using rte_malloc
itself may be an even better solution in general, because the dev_private
structure is never accessed by the NIC itself, but only by the cores.
Therefore, logically, it makes more sense to allocate the memory for it
according to where the process is running rather than where the NIC is
actually located. rte_malloc does this, by first attempting the
allocation on the calling core's socket, and then falling back to
allocating on any other socket if the initial socket-local attempt fails.
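
As a rough illustration of the difference described above (the names and
size here are hypothetical):

  /* pinned: memory on socket 1 only - returns NULL if socket 1
   * has no free memory */
  void *pinned = rte_zmalloc_socket("priv", size, RTE_CACHE_LINE_SIZE, 1);

  /* flexible: tries the calling core's socket first, then falls
   * back to any other socket with free memory */
  void *flexible = rte_zmalloc("priv", size, RTE_CACHE_LINE_SIZE);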

/Bruce
  
Ferruh Yigit July 19, 2024, 11:10 a.m. UTC | #4
On 7/19/2024 10:57 AM, Bruce Richardson wrote:
> On Fri, Jul 19, 2024 at 09:59:50AM +0100, Ferruh Yigit wrote:
>> [...]
>> What is enabled by allowing the app to run with cross-NUMA memory, since
>> in production I expect users would want to use device-local memory for
>> performance reasons anyway?
>>
> 
> Mostly users want socket-local memory, but with increasing speeds of NICs
> we are already seeing cases where users want to run cross-NUMA. For
> example, a multi-port NIC may have some ports in use on each socket.
> 

Ack.

>>
>> Also, I am not sure if this is a fix or a change of intentional behavior.
>>
> 
> I suppose it can be viewed either way. However, for me this is a fix,
> because right now it is impossible for many users to run their applications
> with memory on a different socket to the ports. Nowhere is it documented in
> DPDK that there is a hard restriction that ports must have local memory, so
> any enforcement of such a policy is wrong.
> 

Although it seems this was done intentionally in the API, I agree that
it is not documented, and this restriction is not part of
the API definition.

> Turning things the other way around - I can't see how anything will break
> or even slow down with this patch applied. As I point out above, the user
> can already enforce local memory by passing the required socket id when
> allocating Rx and Tx rings - this patch only pushes the failure for
> non-local memory a bit later in the initialization sequence, where the user
> can actually specify the desired NUMA behaviour. Is there some
> case I'm missing where you foresee this causing problems?
> 

A new user may not know that allocating memory cross-NUMA impacts
performance negatively and may end up with this configuration
unintentionally; the current failure enforces the ideal configuration.

One option could be adding a warning log to the fallback case, saying that
memory was allocated from a non-local socket and performance will be lower.
Although this message may not mean much to a new user, it may still help
via a support engineer or an internet search...
  
Bruce Richardson July 19, 2024, 1:22 p.m. UTC | #5
On Fri, Jul 19, 2024 at 12:10:24PM +0100, Ferruh Yigit wrote:
> [...]
> A new user may not know that allocating memory cross-NUMA impacts
> performance negatively and may end up with this configuration
> unintentionally; the current failure enforces the ideal configuration.
> 
> One option could be adding a warning log to the fallback case, saying that
> memory was allocated from a non-local socket and performance will be lower.
> Although this message may not mean much to a new user, it may still help
> via a support engineer or an internet search...
> 

Yes, we could add a warning, but that is probably better in the app itself.
Thinking about where we get issues, they primarily stem from running the
cores on a different NUMA node. Since the private_data struct is accessed
by cores, not devices, any perf degradation will come from having it remote
to the cores. Because of that, I actually think the original implementation
should really have used "rte_socket_id()" rather than the device socket id
for the allocation.

For v2, will I therefore include a warning in the case that rte_socket_id()
!= device socket_id? WDYT?

/Bruce
  
Ferruh Yigit July 19, 2024, 3:31 p.m. UTC | #6
On 7/19/2024 2:22 PM, Bruce Richardson wrote:
> [...]
> Yes, we could add a warning, but that is probably better in the app itself.
> Thinking about where we get issues, they primarily stem from running the
> cores on a different NUMA node. Since the private_data struct is accessed
> by cores, not devices, any perf degradation will come from having it remote
> to the cores. Because of that, I actually think the original implementation
> should really have used "rte_socket_id()" rather than the device socket id
> for the allocation.
>

Yes, I guess private_data is not accessed by the device, but it may be
accessed by the cores running the datapath.

This API may be called by a core that is not involved in the datapath, so
it may not be correct to allocate memory from its NUMA node.

Is it wrong to assume that the cores used for the datapath will be on the
same NUMA node as the device? If so, allocating memory from that node (the
one the device is in) makes more sense. Am I missing something?


> 
> For v2, will I therefore include a warning in the case that rte_socket_id()
> != device socket_id? WDYT?
> 
> /Bruce
  
Bruce Richardson July 19, 2024, 4:10 p.m. UTC | #7
On Fri, Jul 19, 2024 at 04:31:11PM +0100, Ferruh Yigit wrote:
> [...]
> Yes, I guess private_data is not accessed by the device, but it may be
> accessed by the cores running the datapath.
>
> This API may be called by a core that is not involved in the datapath, so
> it may not be correct to allocate memory from its NUMA node.
>
> Is it wrong to assume that the cores used for the datapath will be on the
> same NUMA node as the device? If so, allocating memory from that node (the
> one the device is in) makes more sense. Am I missing something?
> 

It depends on which you think is more likely for the polling cores:
- they will be on the same socket as the device, but different socket to
  the main lcore.
- they will be on the same socket as the main lcore, but different socket
  to the device.

Personally, I'd suspect both to be unlikely, but also both to be possible.
For the first scenario, I don't see anything being broken or affected by
the proposed fix here, since priority is still being given to memory on the
same socket as the device. It just opens up the possibility of scenario
two.

/Bruce
  
Ferruh Yigit July 21, 2024, 10:56 p.m. UTC | #8
On 7/19/2024 5:10 PM, Bruce Richardson wrote:
> [...]
> It depends on which you think is more likely for the polling cores:
> - they will be on the same socket as the device, but different socket to
>   the main lcore.
> - they will be on the same socket as the main lcore, but different socket
>   to the device.
> 
> Personally, I'd suspect both to be unlikely, but also both to be possible.
> For the first scenario, I don't see anything being broken or affected by
> the proposed fix here, since priority is still being given to memory on the
> same socket as the device. It just opens up the possibility of scenario
> two.
> 

My comment was on the suggestion to use "rte_socket_id()" rather than the
device socket id.
If both nodes have memory, memory should be allocated from the node the
device is in, because although the device doesn't use private_data, the
polling cores will, and the polling cores will most likely be on the same
node as the device and memory, while we don't know which node the main
core is in.
So I think the first try for memory allocation should be the node the
device is in, which is what the existing code does.

If the node that has the device doesn't have any memory attached, your
change enables this case; as there is already memory in only one node, it
doesn't matter whether we check the device node or the main core node.


Briefly, I am OK with the current patch plus a warning log in the fallback,
but not with the "rte_socket_id()" change.
  
Bruce Richardson July 22, 2024, 10:06 a.m. UTC | #9
On Sun, Jul 21, 2024 at 11:56:08PM +0100, Ferruh Yigit wrote:
> [...]
> My comment was on the suggestion to use "rte_socket_id()" rather than the
> device socket id.
> If both nodes have memory, memory should be allocated from the node the
> device is in, because although the device doesn't use private_data, the
> polling cores will, and the polling cores will most likely be on the same
> node as the device and memory, while we don't know which node the main
> core is in.
> So I think the first try for memory allocation should be the node the
> device is in, which is what the existing code does.
>
> If the node that has the device doesn't have any memory attached, your
> change enables this case; as there is already memory in only one node, it
> doesn't matter whether we check the device node or the main core node.
>
> Briefly, I am OK with the current patch plus a warning log in the fallback,
> but not with the "rte_socket_id()" change.
>
Ack, makes sense. Done in V2 patch.

Thanks for review.

/Bruce
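
For illustration only - a hypothetical sketch, not the actual v2 change - a
warning log in the fallback path of the ethdev_driver.c hunk below might
look something like:

  if (ethdev->data->dev_private == NULL) {
  	/* hypothetical warning: allocation will not be device-local */
  	RTE_ETHDEV_LOG_LINE(WARNING,
  		"private data for %s not allocated on device-local NUMA node", name);
  	ethdev->data->dev_private = rte_zmalloc(name,
  			priv_data_size, RTE_CACHE_LINE_SIZE);
  }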
  

Patch

diff --git a/lib/ethdev/ethdev_driver.c b/lib/ethdev/ethdev_driver.c
index f48c0eb8bc..f9ce7ec348 100644
--- a/lib/ethdev/ethdev_driver.c
+++ b/lib/ethdev/ethdev_driver.c
@@ -303,11 +303,16 @@  rte_eth_dev_create(struct rte_device *device, const char *name,
 			return -ENODEV;
 
 		if (priv_data_size) {
+			/* try alloc private data on device-local node. */
 			ethdev->data->dev_private = rte_zmalloc_socket(
 				name, priv_data_size, RTE_CACHE_LINE_SIZE,
 				device->numa_node);
+			/* fall back to alloc on any socket on failure */
+			if (ethdev->data->dev_private == NULL)
+				ethdev->data->dev_private = rte_zmalloc(name,
+						priv_data_size, RTE_CACHE_LINE_SIZE);
 
-			if (!ethdev->data->dev_private) {
+			if (ethdev->data->dev_private == NULL) {
 				RTE_ETHDEV_LOG_LINE(ERR,
 					"failed to allocate private data");
 				retval = -ENOMEM;
diff --git a/lib/ethdev/ethdev_pci.h b/lib/ethdev/ethdev_pci.h
index 737fff1833..d600d9acbb 100644
--- a/lib/ethdev/ethdev_pci.h
+++ b/lib/ethdev/ethdev_pci.h
@@ -93,10 +93,19 @@  rte_eth_dev_pci_allocate(struct rte_pci_device *dev, size_t private_data_size)
 			return NULL;
 
 		if (private_data_size) {
+			/* Try and alloc the private-data structure on socket local to the device */
 			eth_dev->data->dev_private = rte_zmalloc_socket(name,
 				private_data_size, RTE_CACHE_LINE_SIZE,
 				dev->device.numa_node);
-			if (!eth_dev->data->dev_private) {
+
+			/* if cannot allocate memory on the socket local to the device
+			 * use rte_malloc to allocate memory on some other socket, if available.
+			 */
+			if (eth_dev->data->dev_private == NULL)
+				eth_dev->data->dev_private = rte_zmalloc(name,
+					private_data_size, RTE_CACHE_LINE_SIZE);
+
+			if (eth_dev->data->dev_private == NULL) {
 				rte_eth_dev_release_port(eth_dev);
 				return NULL;
 			}