
[v4,0/6] ethdev: introduce shared Rx queue

Message ID 20210930145602.763969-1-xuemingl@nvidia.com (mailing list archive)

Message

Xueming Li Sept. 30, 2021, 2:55 p.m. UTC
  In the current DPDK framework, all Rx queues are pre-loaded with mbufs for
incoming packets. When the number of representors scales out in a switch
domain, the memory consumption becomes significant. Furthermore,
polling all ports leads to high cache miss rates, high latency and low
throughput.

This patch set introduces the shared Rx queue. A PF and representors with the
same configuration in the same switch domain can share Rx queues, configured
by specifying the shared Rx queue offload flag and a sharing group.

All member ports of a shared Rx queue actually share one Rx queue, and mbufs
are pre-loaded into that single queue only, so memory is saved.

Polling any queue that uses the same shared Rx queue receives packets from
all member ports. The source port is identified by mbuf->port.

Multiple groups are supported via a group ID. The number of queues per port
in a shared group should be identical, and queue indexes are mapped 1:1
within the group. An example of polling two share groups:
  core	group	queue
  0	0	0
  1	0	1
  2	0	2
  3	0	3
  4	1	0
  5	1	1
  6	1	2
  7	1	3

A shared Rx queue must be polled on a single thread or core. If both PF0 and
representor0 have joined the same share group, pf0rxq0 cannot be polled on
core1 while rep0rxq0 is polled on core2. In fact, polling one port within a
share group is sufficient, since polling any port in the group returns
packets for every port in the group.
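
For illustration, a rough sketch (not part of this series) of a per-core loop
that polls one member port's queue of a share group; rte_eth_rx_burst() and
mbuf->port are existing DPDK API, handle_packet() is a hypothetical
application handler:

  #include <rte_ethdev.h>
  #include <rte_mbuf.h>

  /* Hypothetical application handler, shown only for illustration. */
  extern void handle_packet(uint16_t src_port, struct rte_mbuf *m);

  /* Poll one representative queue of the share group on this core.
   * Packets from every member port of the group arrive on this queue. */
  static void
  poll_shared_rxq(uint16_t member_port, uint16_t queue_id)
  {
          struct rte_mbuf *pkts[32];
          uint16_t nb, i;

          for (;;) {
                  nb = rte_eth_rx_burst(member_port, queue_id, pkts, 32);
                  for (i = 0; i < nb; i++)
                          /* The actual source port is mbuf->port, not member_port. */
                          handle_packet(pkts[i]->port, pkts[i]);
          }
  }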

v1:
  - initial version
v2:
  - add testpmd patches
v3:
  - change common forwarding API to a macro for performance, thanks Jerin.
  - save global variables accessed in forwarding to flowstream to minimize
    cache misses
  - combine patches for each forwarding engine
  - support multiple groups in testpmd "--share-rxq" parameter
  - new API to aggregate shared Rx queue groups
v4:
  - spelling fixes
  - remove shared-rxq support from all forwarding engines
  - add dedicated shared-rxq forwarding engine

Xueming Li (6):
  ethdev: introduce shared Rx queue
  ethdev: new API to aggregate shared Rx queue group
  app/testpmd: new parameter to enable shared Rx queue
  app/testpmd: dump port info for shared Rx queue
  app/testpmd: force shared Rx queue polled on same core
  app/testpmd: add forwarding engine for shared Rx queue

 app/test-pmd/config.c                         | 102 +++++++++++-
 app/test-pmd/meson.build                      |   1 +
 app/test-pmd/parameters.c                     |  13 ++
 app/test-pmd/shared_rxq_fwd.c                 | 148 ++++++++++++++++++
 app/test-pmd/testpmd.c                        |  23 ++-
 app/test-pmd/testpmd.h                        |   5 +
 app/test-pmd/util.c                           |   3 +
 doc/guides/nics/features.rst                  |  11 ++
 doc/guides/nics/features/default.ini          |   1 +
 .../prog_guide/switch_representation.rst      |  10 ++
 doc/guides/testpmd_app_ug/run_app.rst         |   8 +
 doc/guides/testpmd_app_ug/testpmd_funcs.rst   |   5 +-
 lib/ethdev/ethdev_driver.h                    |  23 ++-
 lib/ethdev/rte_ethdev.c                       |  23 +++
 lib/ethdev/rte_ethdev.h                       |  23 +++
 lib/ethdev/version.map                        |   3 +
 16 files changed, 398 insertions(+), 4 deletions(-)
 create mode 100644 app/test-pmd/shared_rxq_fwd.c
  

Comments

Andrew Rybchenko Oct. 11, 2021, 11:49 a.m. UTC | #1
Hi Xueming,

On 9/30/21 5:55 PM, Xueming Li wrote:
> In current DPDK framework, all RX queues is pre-loaded with mbufs for
> incoming packets. When number of representors scale out in a switch
> domain, the memory consumption became significant. Further more,
> polling all ports leads to high cache miss, high latency and low
> throughputs.
> 
> This patch introduces shared RX queue. PF and representors with same
> configuration in same switch domain could share RX queue set by
> specifying shared Rx queue offloading flag and sharing group.
> 
> All ports that Shared Rx queue actually shares One Rx queue and only
> pre-load mbufs to one Rx queue, memory is saved.
> 
> Polling any queue using same shared RX queue receives packets from all
> member ports. Source port is identified by mbuf->port.
> 
> Multiple groups is supported by group ID. Port queue number in a shared
> group should be identical. Queue index is 1:1 mapped in shared group.
> An example of polling two share groups:
>   core	group	queue
>   0	0	0
>   1	0	1
>   2	0	2
>   3	0	3
>   4	1	0
>   5	1	1
>   6	1	2
>   7	1	3
> 
> Shared RX queue must be polled on single thread or core. If both PF0 and
> representor0 joined same share group, can't poll pf0rxq0 on core1 and
> rep0rxq0 on core2. Actually, polling one port within share group is
> sufficient since polling any port in group will return packets for any
> port in group.

I apologize for jumping into the review process this late.

Frankly speaking, I doubt that it is the best design to solve
the problem. Yes, I confirm that the problem exists, but I
think there is a better and simpler way to solve it.

The problem of the suggested solution is that it puts all
the headache about consistency on the application and PMDs
without any help from the ethdev layer to guarantee the
consistency. As a result, I believe there will be either
missing/lost consistency checks or huge duplication in
each PMD which supports the feature. Shared RxQs must be
equally configured, including the number of queues, offloads
(taking device-level Rx offloads into account), RSS
settings, etc. So, applications must care about it and
PMDs (or the ethdev layer) must check it.

The advantage of the solution is that any device may
create a group and subsequent devices join it. The absence
of a primary device is nice. But do we really need it?
Will the design work if some representors are configured
to use a shared RxQ, but some are not? Theoretically it
is possible, but it could require extra non-trivial code
on the fast path.

Also, looking at the first two patches I don't understand
how an application will find out which devices may share
RxQs. E.g. if we have two different NICs which support
sharing, we can try to set up only one group 0, but
will finally have two devices (not one) which must be
polled.

1. We need an extra flag in dev_info->dev_capa,
   RTE_ETH_DEV_CAPA_RX_SHARE, to advertise that
   the device supports Rx sharing.

2. I think we need "rx_domain" in the device info
   (which should be treated within the boundaries of
   the switch_domain) if and only if
   RTE_ETH_DEV_CAPA_RX_SHARE is advertised.
   Otherwise the rx_domain value does not make sense.

(1) and (2) will allow the application to find out which
devices can share Rx (a rough sketch follows after the list below).

3. The primary device (the representors' backing device) should
   advertise the shared RxQ offload. Enabling the offload
   tells the device to provide packets to all devices in
   the Rx domain with mbuf->port filled in appropriately.
   It also allows the app to identify the primary device in
   the Rx domain. When the application enables the offload,
   it must ensure that it does not treat the used port_id as
   the input port_id, but always checks mbuf->port for each
   packet.

4. A new Rx mode should be introduced for secondary
   devices. It should not allow configuring RSS, specifying
   any Rx offloads, etc.; ethdev must ensure it.
   It is an open question right now whether it should require
   providing the primary port_id. In theory representors
   have it. However, maybe it is nice for consistency
   to ensure that the application knows that it does.
   If the shared Rx mode is specified for a device, the
   application does not need to set up RxQs, and attempts to
   do it should be rejected in ethdev.
   For consistency it is better to ensure that the number of
   queues matches.
   It is an interesting question what should happen if
   the primary device is reconfigured and shared Rx is
   disabled on reconfiguration.

5. If so, in theory the Rx burst implementation in the
   secondary device could simply call Rx burst on the
   primary device.
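
To illustrate (1) and (2), a rough sketch of the application-side discovery;
RTE_ETH_DEV_CAPA_RX_SHARE and the rx_domain field are just the names proposed
above, not existing ethdev API, while rte_eth_dev_info_get() and switch_info
already exist:

  #include <rte_ethdev.h>

  /* Sketch: decide whether two ports may share Rx queues, using the proposed
   * RTE_ETH_DEV_CAPA_RX_SHARE capability and rx_domain device-info field
   * (both hypothetical here). switch_info.domain_id already exists. */
  static int
  ports_may_share_rx(uint16_t port_a, uint16_t port_b)
  {
          struct rte_eth_dev_info info_a, info_b;

          if (rte_eth_dev_info_get(port_a, &info_a) != 0 ||
              rte_eth_dev_info_get(port_b, &info_b) != 0)
                  return 0;
          if (!(info_a.dev_capa & RTE_ETH_DEV_CAPA_RX_SHARE) ||
              !(info_b.dev_capa & RTE_ETH_DEV_CAPA_RX_SHARE))
                  return 0;
          /* Rx domain is only meaningful within one switch domain. */
          return info_a.switch_info.domain_id == info_b.switch_info.domain_id &&
                 info_a.rx_domain == info_b.rx_domain;
  }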

Andrew.
  
Xueming Li Oct. 11, 2021, 3:11 p.m. UTC | #2
On Mon, 2021-10-11 at 14:49 +0300, Andrew Rybchenko wrote:
> Hi Xueming,
> 
> On 9/30/21 5:55 PM, Xueming Li wrote:
> > In current DPDK framework, all RX queues is pre-loaded with mbufs for
> > incoming packets. When number of representors scale out in a switch
> > domain, the memory consumption became significant. Further more,
> > polling all ports leads to high cache miss, high latency and low
> > throughputs.
> > 
> > This patch introduces shared RX queue. PF and representors with same
> > configuration in same switch domain could share RX queue set by
> > specifying shared Rx queue offloading flag and sharing group.
> > 
> > All ports that Shared Rx queue actually shares One Rx queue and only
> > pre-load mbufs to one Rx queue, memory is saved.
> > 
> > Polling any queue using same shared RX queue receives packets from all
> > member ports. Source port is identified by mbuf->port.
> > 
> > Multiple groups is supported by group ID. Port queue number in a shared
> > group should be identical. Queue index is 1:1 mapped in shared group.
> > An example of polling two share groups:
> >   core	group	queue
> >   0	0	0
> >   1	0	1
> >   2	0	2
> >   3	0	3
> >   4	1	0
> >   5	1	1
> >   6	1	2
> >   7	1	3
> > 
> > Shared RX queue must be polled on single thread or core. If both PF0 and
> > representor0 joined same share group, can't poll pf0rxq0 on core1 and
> > rep0rxq0 on core2. Actually, polling one port within share group is
> > sufficient since polling any port in group will return packets for any
> > port in group.
> 
> I apologize that I jump in into the review process that late.

Appreciate the bold suggestion, never too late :)

> 
> Frankly speaking I doubt that it is the best design to solve
> the problem. Yes, I confirm that the problem exists, but I
> think there is better and simpler way to solve it.
> 
> The problem of the suggested solution is that it puts all
> the headache about consistency to application and PMDs
> without any help from ethdev layer to guarantee the
> consistency. As the result I believe it will be either
> missing/lost consistency checks or huge duplication in
> each PMD which supports the feature. Shared RxQs must be
> equally configured including number of queues, offloads
> (taking device level Rx offloads into account), RSS
> settings etc. So, applications must care about it and
> PMDs (or ethdev layer) must check it.

The name might be confusing; here is my understanding:
1. The NIC shares the buffer supply HW queue between shared RxQs - for memory
2. The PMD polls one shared RxQ - for latency and performance
3. Most per-queue features like offloads and RSS are not impacted; that's
why they are not mentioned. Some offloads might not be supported due
to PMD or HW limitations; checks need to be added in the PMD case by case.
4. Multiple groups are defined for service-level flexibility. For
example, PF and VIP customers' load is distributed via queues and dedicated
cores, while low-priority customers share one core with one shared queue.
Multiple groups enable more combinations.
5. One port could assign queues to different groups for polling
flexibility. For example, with the first 4 queues in group 0 and the next 4
queues in group 1, and each group having other member ports with 4 queues,
the port with 8 queues could be polled with 8 cores w/o the non-shared rxq
penalty; in other words, each core polls only one shared RxQ (see the setup
sketch below).
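
A minimal setup sketch for point 5, assuming the per-queue share group field
and shared Rx queue offload flag proposed in this series (the exact field and
flag names may differ in the final API):

  #include <rte_ethdev.h>
  #include <rte_mempool.h>

  /* Sketch: first 4 queues of the port in share group 0, next 4 in group 1.
   * rxconf.share_group and RTE_ETH_RX_OFFLOAD_SHARED_RXQ are the names used
   * by this series and may change; the setup calls themselves are standard. */
  static int
  setup_two_share_groups(uint16_t port_id, struct rte_mempool *mp,
                         const struct rte_eth_dev_info *info)
  {
          struct rte_eth_rxconf rxconf = info->default_rxconf;
          uint16_t q;
          int ret;

          rxconf.offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ; /* proposed flag */
          for (q = 0; q < 8; q++) {
                  rxconf.share_group = q < 4 ? 0 : 1; /* proposed field */
                  ret = rte_eth_rx_queue_setup(port_id, q, 512,
                                               rte_eth_dev_socket_id(port_id),
                                               &rxconf, mp);
                  if (ret != 0)
                          return ret;
          }
          return 0;
  }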

> 
> The advantage of the solution is that any device may
> create group and subsequent devices join. Absence of
> primary device is nice. But do we really need it?
> Will the design work if some representors are configured
> to use shared RxQ, but some do not? Theoretically it
> is possible, but could require extra non-trivial code
> on fast path.

With multiple groups, any device could be hot-unplugged.

Mixed configuration is supported; the only difference is how to set
mbuf->port. Since the group is per queue, mixed configuration is better to
support, and I didn't see any difficulty here.

A PMD could choose to support only group 0 with the same settings for each
rxq; that fits most scenarios.

> 
> Also looking at the first two patch I don't understand
> how application will find out which devices may share
> RxQs. E.g. if we have two difference NICs which support
> sharing, we can try to setup only one group 0, but
> finally will have two devices (not one) which must be
> polled.
> 
> 1. We need extra flag in dev_info->dev_capa
>    RTE_ETH_DEV_CAPA_RX_SHARE to advertise that
>    the device supports Rx sharing.

dev_info->rx_queue_offload_capa could be used here, no?

> 
> 2. I think we need "rx_domain" in device info
>    (which should be treated in boundaries of the
>    switch_domain) if and only if
>    RTE_ETH_DEV_CAPA_RX_SHARE is advertised.
>    Otherwise rx_domain value does not make sense.

I see, this will give flexibility for different HW; I will add it.

> 
> (1) and (2) will allow application to find out which
> devices can share Rx.
> 
> 3. Primary device (representors backing device) should
>    advertise shared RxQ offload. Enabling of the offload
>    tells the device to provide packets to all device in
>    the Rx domain with mbuf->port filled in appropriately.
>    Also it allows app to identify primary device in the
>    Rx domain. When application enables the offload, it
>    must ensure that it does not treat used port_id as an
>    input port_id, but always check mbuf->port for each
>    packet.
> 
> 4. A new Rx mode should be introduced for secondary
>    devices. It should not allow to configure RSS, specify
>    any Rx offloads etc. ethdev must ensure it.
>    It is an open question right now if it should require
>    to provide primary port_id. In theory representors
>    have it. However, may be it is nice for consistency
>    to ensure that application knows that it does.
>    If shared Rx mode is specified for device, application
>    does not need to setup RxQs and attempts to do it
>    should be discarded in ethdev.
>    For consistency it is better to ensure that number of
>    queues match.

RSS and Rx offloads should be supported individually; the PMD needs to
check and report what is not supported.

>    It is an interesting question what should happen if
>    primary device is reconfigured and shared Rx is
>    disabled on reconfiguration.

I feel it is better to have no primary port/queue assumption in the
configuration: all members are treated equally and each queue can join or
quit a share group. That's important to support multiple groups.

> 
> 5. If so, in theory implementation of the Rx burst
>    in the secondary could simply call Rx burst on
>    primary device.
> 
> Andrew.
  
Xueming Li Oct. 12, 2021, 6:37 a.m. UTC | #3
On Mon, 2021-10-11 at 23:11 +0800, Xueming Li wrote:
> On Mon, 2021-10-11 at 14:49 +0300, Andrew Rybchenko wrote:
> > Hi Xueming,
> > 
> > On 9/30/21 5:55 PM, Xueming Li wrote:
> > > In current DPDK framework, all RX queues is pre-loaded with mbufs for
> > > incoming packets. When number of representors scale out in a switch
> > > domain, the memory consumption became significant. Further more,
> > > polling all ports leads to high cache miss, high latency and low
> > > throughputs.
> > >  
> > > This patch introduces shared RX queue. PF and representors with same
> > > configuration in same switch domain could share RX queue set by
> > > specifying shared Rx queue offloading flag and sharing group.
> > > 
> > > All ports that Shared Rx queue actually shares One Rx queue and only
> > > pre-load mbufs to one Rx queue, memory is saved.
> > > 
> > > Polling any queue using same shared RX queue receives packets from all
> > > member ports. Source port is identified by mbuf->port.
> > > 
> > > Multiple groups is supported by group ID. Port queue number in a shared
> > > group should be identical. Queue index is 1:1 mapped in shared group.
> > > An example of polling two share groups:
> > >   core	group	queue
> > >   0	0	0
> > >   1	0	1
> > >   2	0	2
> > >   3	0	3
> > >   4	1	0
> > >   5	1	1
> > >   6	1	2
> > >   7	1	3
> > > 
> > > Shared RX queue must be polled on single thread or core. If both PF0 and
> > > representor0 joined same share group, can't poll pf0rxq0 on core1 and
> > > rep0rxq0 on core2. Actually, polling one port within share group is
> > > sufficient since polling any port in group will return packets for any
> > > port in group.
> > 
> > I apologize that I jump in into the review process that late.
> 
> Appreciate the bold suggestion, never too late :)
> 
> > 
> > Frankly speaking I doubt that it is the best design to solve
> > the problem. Yes, I confirm that the problem exists, but I
> > think there is better and simpler way to solve it.
> > 
> > The problem of the suggested solution is that it puts all
> > the headache about consistency to application and PMDs
> > without any help from ethdev layer to guarantee the
> > consistency. As the result I believe it will be either
> > missing/lost consistency checks or huge duplication in
> > each PMD which supports the feature. Shared RxQs must be
> > equally configured including number of queues, offloads
> > (taking device level Rx offloads into account), RSS
> > settings etc. So, applications must care about it and
> > PMDs (or ethdev layer) must check it.
> 
> The name might be confusing, here is my understanding:
> 1. NIC  shares the buffer supply HW Q between shared RxQs - for memory
> 2. PMD polls one shared RxQ - for latency and performance
> 3. Most per queue features like offloads and RSS not impacted. That's
> why this not mentioned. Some offloading might not being supported due
> to PMD or hw limitation, need to add check in PMD case by case.
> 4. Multiple group is defined for service level flexibility. For
> example, PF and VIP customer's load distributed via queues and dedicate
> cores. Low priority customers share one core with one shared queue.
> multiple groups enables more combination.
> 5. One port could assign queues to different group for polling
> flexibility. For example first 4 queues in group 0 and next 4 queues in
> group1, each group have other member ports with 4 queues, so the port
> with 8 queues could be polled with 8 cores w/o non-shared rxq penalty,
> in other words, each core only poll one shared RxQ.
> 
> > 
> > The advantage of the solution is that any device may
> > create group and subsequent devices join. Absence of
> > primary device is nice. But do we really need it?
> > Will the design work if some representors are configured
> > to use shared RxQ, but some do not? Theoretically it
> > is possible, but could require extra non-trivial code
> > on fast path.
> 
> If multiple groups, any device could be hot-unplugged.
> 
> Mixed configuration is supported, the only difference is how to set
> mbuf->port. Since group is per queue, mixed is better to be supported,
> didn't see any difficulty here.
> 
> PDM could select to support only group 0, same settings for each rxq,
> that fits most scenario.
> 
> > 
> > Also looking at the first two patch I don't understand
> > how application will find out which devices may share
> > RxQs. E.g. if we have two difference NICs which support
> > sharing, we can try to setup only one group 0, but
> > finally will have two devices (not one) which must be
> > polled.
> > 
> > 1. We need extra flag in dev_info->dev_capa
> >    RTE_ETH_DEV_CAPA_RX_SHARE to advertise that
> >    the device supports Rx sharing.
> 
> dev_info->rx_queue_offload_capa could be used here, no?
> 
> > 
> > 2. I think we need "rx_domain" in device info
> >    (which should be treated in boundaries of the
> >    switch_domain) if and only if
> >    RTE_ETH_DEV_CAPA_RX_SHARE is advertised.
> >    Otherwise rx_domain value does not make sense.
> 
> I see, this will give flexibility of different hw, will add it.
> 
> > 
> > (1) and (2) will allow application to find out which
> > devices can share Rx.
> > 
> > 3. Primary device (representors backing device) should
> >    advertise shared RxQ offload. Enabling of the offload
> >    tells the device to provide packets to all device in
> >    the Rx domain with mbuf->port filled in appropriately.
> >    Also it allows app to identify primary device in the
> >    Rx domain. When application enables the offload, it
> >    must ensure that it does not treat used port_id as an
> >    input port_id, but always check mbuf->port for each
> >    packet.
> > 
> > 4. A new Rx mode should be introduced for secondary
> >    devices. It should not allow to configure RSS, specify
> >    any Rx offloads etc. ethdev must ensure it.
> >    It is an open question right now if it should require
> >    to provide primary port_id. In theory representors
> >    have it. However, may be it is nice for consistency
> >    to ensure that application knows that it does.
> >    If shared Rx mode is specified for device, application
> >    does not need to setup RxQs and attempts to do it
> >    should be discarded in ethdev.
> >    For consistency it is better to ensure that number of
> >    queues match.
> 
> RSS and Rx offloads should be supported as individual, PMD needs to
> check if not supported.
> 
> >    It is an interesting question what should happen if
> >    primary device is reconfigured and shared Rx is
> >    disabled on reconfiguration.
> 
> I feel better no primary port/queue assumption in configuration, all
> members are equally treated, each queue can join or quit share group,
> that's important to support multiple groups.
> 
> > 
> > 5. If so, in theory implementation of the Rx burst
> >    in the secondary could simply call Rx burst on
> >    primary device.
> > 
> > Andrew.
> 

Hi Andrew,

I realized that we are talking about different things; this feature
introduces two kinds of RxQ sharing:
1. Share the mempool to save memory
2. Share the polling to save latency

What you suggested is to reuse the whole RxQ configuration, IIUC. Maybe we
should break the flag into three, so the application could learn the PMD
capability and configure accordingly. What do you think?
RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POOL
RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POLL
RTE_ETH_RX_OFFLOAD_RXQ_SHARE_CFG //implies POOL and POLL

Regards,
Xueming
  
Andrew Rybchenko Oct. 12, 2021, 8:48 a.m. UTC | #4
On 10/12/21 9:37 AM, Xueming(Steven) Li wrote:
> On Mon, 2021-10-11 at 23:11 +0800, Xueming Li wrote:
>> On Mon, 2021-10-11 at 14:49 +0300, Andrew Rybchenko wrote:
>>> Hi Xueming,
>>>
>>> On 9/30/21 5:55 PM, Xueming Li wrote:
>>>> In current DPDK framework, all RX queues is pre-loaded with mbufs for
>>>> incoming packets. When number of representors scale out in a switch
>>>> domain, the memory consumption became significant. Further more,
>>>> polling all ports leads to high cache miss, high latency and low
>>>> throughputs.
>>>>  
>>>> This patch introduces shared RX queue. PF and representors with same
>>>> configuration in same switch domain could share RX queue set by
>>>> specifying shared Rx queue offloading flag and sharing group.
>>>>
>>>> All ports that Shared Rx queue actually shares One Rx queue and only
>>>> pre-load mbufs to one Rx queue, memory is saved.
>>>>
>>>> Polling any queue using same shared RX queue receives packets from all
>>>> member ports. Source port is identified by mbuf->port.
>>>>
>>>> Multiple groups is supported by group ID. Port queue number in a shared
>>>> group should be identical. Queue index is 1:1 mapped in shared group.
>>>> An example of polling two share groups:
>>>>   core	group	queue
>>>>   0	0	0
>>>>   1	0	1
>>>>   2	0	2
>>>>   3	0	3
>>>>   4	1	0
>>>>   5	1	1
>>>>   6	1	2
>>>>   7	1	3
>>>>
>>>> Shared RX queue must be polled on single thread or core. If both PF0 and
>>>> representor0 joined same share group, can't poll pf0rxq0 on core1 and
>>>> rep0rxq0 on core2. Actually, polling one port within share group is
>>>> sufficient since polling any port in group will return packets for any
>>>> port in group.
>>>
>>> I apologize that I jump in into the review process that late.
>>
>> Appreciate the bold suggestion, never too late :)
>>
>>>
>>> Frankly speaking I doubt that it is the best design to solve
>>> the problem. Yes, I confirm that the problem exists, but I
>>> think there is better and simpler way to solve it.
>>>
>>> The problem of the suggested solution is that it puts all
>>> the headache about consistency to application and PMDs
>>> without any help from ethdev layer to guarantee the
>>> consistency. As the result I believe it will be either
>>> missing/lost consistency checks or huge duplication in
>>> each PMD which supports the feature. Shared RxQs must be
>>> equally configured including number of queues, offloads
>>> (taking device level Rx offloads into account), RSS
>>> settings etc. So, applications must care about it and
>>> PMDs (or ethdev layer) must check it.
>>
>> The name might be confusing, here is my understanding:
>> 1. NIC  shares the buffer supply HW Q between shared RxQs - for memory
>> 2. PMD polls one shared RxQ - for latency and performance
>> 3. Most per queue features like offloads and RSS not impacted. That's
>> why this not mentioned. Some offloading might not being supported due
>> to PMD or hw limitation, need to add check in PMD case by case.
>> 4. Multiple group is defined for service level flexibility. For
>> example, PF and VIP customer's load distributed via queues and dedicate
>> cores. Low priority customers share one core with one shared queue.
>> multiple groups enables more combination.
>> 5. One port could assign queues to different group for polling
>> flexibility. For example first 4 queues in group 0 and next 4 queues in
>> group1, each group have other member ports with 4 queues, so the port
>> with 8 queues could be polled with 8 cores w/o non-shared rxq penalty,
>> in other words, each core only poll one shared RxQ.
>>
>>>
>>> The advantage of the solution is that any device may
>>> create group and subsequent devices join. Absence of
>>> primary device is nice. But do we really need it?
>>> Will the design work if some representors are configured
>>> to use shared RxQ, but some do not? Theoretically it
>>> is possible, but could require extra non-trivial code
>>> on fast path.
>>
>> If multiple groups, any device could be hot-unplugged.
>>
>> Mixed configuration is supported, the only difference is how to set
>> mbuf->port. Since group is per queue, mixed is better to be supported,
>> didn't see any difficulty here.
>>
>> PDM could select to support only group 0, same settings for each rxq,
>> that fits most scenario.
>>
>>>
>>> Also looking at the first two patch I don't understand
>>> how application will find out which devices may share
>>> RxQs. E.g. if we have two difference NICs which support
>>> sharing, we can try to setup only one group 0, but
>>> finally will have two devices (not one) which must be
>>> polled.
>>>
>>> 1. We need extra flag in dev_info->dev_capa
>>>    RTE_ETH_DEV_CAPA_RX_SHARE to advertise that
>>>    the device supports Rx sharing.
>>
>> dev_info->rx_queue_offload_capa could be used here, no?

It depends. But we definitely need a flag which
says that the rx_domain below makes sense. It could be
either RTE_ETH_DEV_CAPA_RX_SHARE or an Rx offload
capability.

The question is whether it is really an offload. An offload is
when something can be done by HW/FW and the result is provided
to SW. Maybe it is just nit picking...

Maybe we don't need an offload at all. Just have
RTE_ETH_DEV_CAPA_RXQ_SHARE and use a non-zero group ID
as a flag that an RxQ should be shared (zero - default,
no sharing). The ethdev layer may check consistency on
its layer to ensure that the device capability is
reported if a non-zero group is specified on queue setup;
a sketch of such a check follows below.
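
A minimal sketch of that ethdev-layer check, written as a helper the layer
could call from rte_eth_rx_queue_setup(); RTE_ETH_DEV_CAPA_RXQ_SHARE and the
rxconf share_group field are the names proposed in this thread, not existing
code:

  #include <errno.h>
  #include <rte_ethdev.h>

  /* Sketch: reject a non-zero share group if the device does not report the
   * proposed RTE_ETH_DEV_CAPA_RXQ_SHARE capability. */
  static int
  validate_rxq_share(uint16_t port_id, const struct rte_eth_rxconf *rx_conf)
  {
          struct rte_eth_dev_info dev_info;
          int ret;

          if (rx_conf == NULL || rx_conf->share_group == 0)
                  return 0; /* no sharing requested, nothing to check */

          ret = rte_eth_dev_info_get(port_id, &dev_info);
          if (ret != 0)
                  return ret;

          if ((dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0)
                  return -EINVAL; /* non-zero group but no capability */

          return 0;
  }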

>>
>>>
>>> 2. I think we need "rx_domain" in device info
>>>    (which should be treated in boundaries of the
>>>    switch_domain) if and only if
>>>    RTE_ETH_DEV_CAPA_RX_SHARE is advertised.
>>>    Otherwise rx_domain value does not make sense.
>>
>> I see, this will give flexibility of different hw, will add it.
>>
>>>
>>> (1) and (2) will allow application to find out which
>>> devices can share Rx.
>>>
>>> 3. Primary device (representors backing device) should
>>>    advertise shared RxQ offload. Enabling of the offload
>>>    tells the device to provide packets to all device in
>>>    the Rx domain with mbuf->port filled in appropriately.
>>>    Also it allows app to identify primary device in the
>>>    Rx domain. When application enables the offload, it
>>>    must ensure that it does not treat used port_id as an
>>>    input port_id, but always check mbuf->port for each
>>>    packet.
>>>
>>> 4. A new Rx mode should be introduced for secondary
>>>    devices. It should not allow to configure RSS, specify
>>>    any Rx offloads etc. ethdev must ensure it.
>>>    It is an open question right now if it should require
>>>    to provide primary port_id. In theory representors
>>>    have it. However, may be it is nice for consistency
>>>    to ensure that application knows that it does.
>>>    If shared Rx mode is specified for device, application
>>>    does not need to setup RxQs and attempts to do it
>>>    should be discarded in ethdev.
>>>    For consistency it is better to ensure that number of
>>>    queues match.
>>
>> RSS and Rx offloads should be supported as individual, PMD needs to
>> check if not supported.

Thinking a bit more about it, I agree that RSS settings could
be individual. Offloads could be individual as well, but I'm
not sure about all offloads, e.g. Rx scatter, which is related
to the Rx buffer size (shared since the Rx mempool is shared)
vs MTU. Maybe it is acceptable. We just must define rules for
what should happen if offloads contradict each other.
It should be highlighted in the description, including the
driver callback, to ensure that PMD maintainers are responsible
for consistency checks.

>>
>>>    It is an interesting question what should happen if
>>>    primary device is reconfigured and shared Rx is
>>>    disabled on reconfiguration.
>>
>> I feel better no primary port/queue assumption in configuration, all
>> members are equally treated, each queue can join or quit share group,
>> that's important to support multiple groups.

I agree. The problem with very flexible solutions is the
complexity of supporting them. We'll see how it goes.

>>
>>>
>>> 5. If so, in theory implementation of the Rx burst
>>>    in the secondary could simply call Rx burst on
>>>    primary device.
>>>
>>> Andrew.
>>
> 
> Hi Andrew,
> 
> I realized that we are talking different things, this feature
> introduced 2 RxQ share:
> 1. Share mempool to save memory
> 2. Share polling to save latency
> 
> What you suggested is reuse all RxQ configuration IIUC, maybe we should
> break the flag into 3, so application could learn PMD capability and
> configure accordingly, how do you think?
> RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POOL

Not sure that I understand. Just specify the same mempool
on Rx queue setup. Isn't it sufficient?

> RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POLL

It implies pool sharing if I'm not mistaken. Of course,
we can poll many different HW queues in one poll, but it
hardly makes sense to care specially about that.
IMHO RxQ sharing is sharing of the underlying HW Rx queue.

> RTE_ETH_RX_OFFLOAD_RXQ_SHARE_CFG //implies POOL and POLL

It is hardly a feature. Rather a possible limitation.

Andrew.
  
Xueming Li Oct. 12, 2021, 10:55 a.m. UTC | #5
On Tue, 2021-10-12 at 11:48 +0300, Andrew Rybchenko wrote:
> On 10/12/21 9:37 AM, Xueming(Steven) Li wrote:
> > On Mon, 2021-10-11 at 23:11 +0800, Xueming Li wrote:
> > > On Mon, 2021-10-11 at 14:49 +0300, Andrew Rybchenko wrote:
> > > > Hi Xueming,
> > > > 
> > > > On 9/30/21 5:55 PM, Xueming Li wrote:
> > > > > In current DPDK framework, all RX queues is pre-loaded with mbufs for
> > > > > incoming packets. When number of representors scale out in a switch
> > > > > domain, the memory consumption became significant. Further more,
> > > > > polling all ports leads to high cache miss, high latency and low
> > > > > throughputs.
> > > > >  
> > > > > This patch introduces shared RX queue. PF and representors with same
> > > > > configuration in same switch domain could share RX queue set by
> > > > > specifying shared Rx queue offloading flag and sharing group.
> > > > > 
> > > > > All ports that Shared Rx queue actually shares One Rx queue and only
> > > > > pre-load mbufs to one Rx queue, memory is saved.
> > > > > 
> > > > > Polling any queue using same shared RX queue receives packets from all
> > > > > member ports. Source port is identified by mbuf->port.
> > > > > 
> > > > > Multiple groups is supported by group ID. Port queue number in a shared
> > > > > group should be identical. Queue index is 1:1 mapped in shared group.
> > > > > An example of polling two share groups:
> > > > >   core	group	queue
> > > > >   0	0	0
> > > > >   1	0	1
> > > > >   2	0	2
> > > > >   3	0	3
> > > > >   4	1	0
> > > > >   5	1	1
> > > > >   6	1	2
> > > > >   7	1	3
> > > > > 
> > > > > Shared RX queue must be polled on single thread or core. If both PF0 and
> > > > > representor0 joined same share group, can't poll pf0rxq0 on core1 and
> > > > > rep0rxq0 on core2. Actually, polling one port within share group is
> > > > > sufficient since polling any port in group will return packets for any
> > > > > port in group.
> > > > 
> > > > I apologize that I jump in into the review process that late.
> > > 
> > > Appreciate the bold suggestion, never too late :)
> > > 
> > > > 
> > > > Frankly speaking I doubt that it is the best design to solve
> > > > the problem. Yes, I confirm that the problem exists, but I
> > > > think there is better and simpler way to solve it.
> > > > 
> > > > The problem of the suggested solution is that it puts all
> > > > the headache about consistency to application and PMDs
> > > > without any help from ethdev layer to guarantee the
> > > > consistency. As the result I believe it will be either
> > > > missing/lost consistency checks or huge duplication in
> > > > each PMD which supports the feature. Shared RxQs must be
> > > > equally configured including number of queues, offloads
> > > > (taking device level Rx offloads into account), RSS
> > > > settings etc. So, applications must care about it and
> > > > PMDs (or ethdev layer) must check it.
> > > 
> > > The name might be confusing, here is my understanding:
> > > 1. NIC  shares the buffer supply HW Q between shared RxQs - for memory
> > > 2. PMD polls one shared RxQ - for latency and performance
> > > 3. Most per queue features like offloads and RSS not impacted. That's
> > > why this not mentioned. Some offloading might not being supported due
> > > to PMD or hw limitation, need to add check in PMD case by case.
> > > 4. Multiple group is defined for service level flexibility. For
> > > example, PF and VIP customer's load distributed via queues and dedicate
> > > cores. Low priority customers share one core with one shared queue.
> > > multiple groups enables more combination.
> > > 5. One port could assign queues to different group for polling
> > > flexibility. For example first 4 queues in group 0 and next 4 queues in
> > > group1, each group have other member ports with 4 queues, so the port
> > > with 8 queues could be polled with 8 cores w/o non-shared rxq penalty,
> > > in other words, each core only poll one shared RxQ.
> > > 
> > > > 
> > > > The advantage of the solution is that any device may
> > > > create group and subsequent devices join. Absence of
> > > > primary device is nice. But do we really need it?
> > > > Will the design work if some representors are configured
> > > > to use shared RxQ, but some do not? Theoretically it
> > > > is possible, but could require extra non-trivial code
> > > > on fast path.
> > > 
> > > If multiple groups, any device could be hot-unplugged.
> > > 
> > > Mixed configuration is supported, the only difference is how to set
> > > mbuf->port. Since group is per queue, mixed is better to be supported,
> > > didn't see any difficulty here.
> > > 
> > > PDM could select to support only group 0, same settings for each rxq,
> > > that fits most scenario.
> > > 
> > > > 
> > > > Also looking at the first two patch I don't understand
> > > > how application will find out which devices may share
> > > > RxQs. E.g. if we have two difference NICs which support
> > > > sharing, we can try to setup only one group 0, but
> > > > finally will have two devices (not one) which must be
> > > > polled.
> > > > 
> > > > 1. We need extra flag in dev_info->dev_capa
> > > >    RTE_ETH_DEV_CAPA_RX_SHARE to advertise that
> > > >    the device supports Rx sharing.
> > > 
> > > dev_info->rx_queue_offload_capa could be used here, no?
> 
> It depends. But we definitely need a flag which
> says that below rx_domain makes sense. It could be
> either RTE_ETH_DEV_CAPA_RX_SHARE or an Rx offload
> capability.
> 
> The question is if it is really an offload. The offload is
> when something could be done by HW/FW and result is provided
> to SW. May be it is just a nit picking...
> 
> May be we don't need an offload at all. Just have
> RTE_ETH_DEV_CAPA_RXQ_SHARE and use non-zero group ID
> as a flag that an RxQ should be shared (zero - default,
> no sharing). ethdev layer may check consistency on
> its layer to ensure that the device capability is
> reported if non-zero group is specified on queue setup.
> 
> > > 
> > > > 
> > > > 2. I think we need "rx_domain" in device info
> > > >    (which should be treated in boundaries of the
> > > >    switch_domain) if and only if
> > > >    RTE_ETH_DEV_CAPA_RX_SHARE is advertised.
> > > >    Otherwise rx_domain value does not make sense.
> > > 
> > > I see, this will give flexibility of different hw, will add it.
> > > 
> > > > 
> > > > (1) and (2) will allow application to find out which
> > > > devices can share Rx.
> > > > 
> > > > 3. Primary device (representors backing device) should
> > > >    advertise shared RxQ offload. Enabling of the offload
> > > >    tells the device to provide packets to all device in
> > > >    the Rx domain with mbuf->port filled in appropriately.
> > > >    Also it allows app to identify primary device in the
> > > >    Rx domain. When application enables the offload, it
> > > >    must ensure that it does not treat used port_id as an
> > > >    input port_id, but always check mbuf->port for each
> > > >    packet.
> > > > 
> > > > 4. A new Rx mode should be introduced for secondary
> > > >    devices. It should not allow to configure RSS, specify
> > > >    any Rx offloads etc. ethdev must ensure it.
> > > >    It is an open question right now if it should require
> > > >    to provide primary port_id. In theory representors
> > > >    have it. However, may be it is nice for consistency
> > > >    to ensure that application knows that it does.
> > > >    If shared Rx mode is specified for device, application
> > > >    does not need to setup RxQs and attempts to do it
> > > >    should be discarded in ethdev.
> > > >    For consistency it is better to ensure that number of
> > > >    queues match.
> > > 
> > > RSS and Rx offloads should be supported as individual, PMD needs to
> > > check if not supported.
> 
> Thinking a bit more about it I agree that RSS settings could
> be individual. Offload could be individual as well, but I'm
> not sure about all offloads. E.g. Rx scatter which is related
> to Rx buffer size (which is shared since Rx mempool is shared)
> vs MTU. May be it is acceptable. We just must define rules
> what should happen if offloads contradict to each other.
> It should be highlighted in the description including
> driver callback to ensure that PMD maintainers are responsible
> for consistency checks.
> 
> > > 
> > > >    It is an interesting question what should happen if
> > > >    primary device is reconfigured and shared Rx is
> > > >    disabled on reconfiguration.
> > > 
> > > I feel better no primary port/queue assumption in configuration, all
> > > members are equally treated, each queue can join or quit share group,
> > > that's important to support multiple groups.
> 
> I agree. The problem of many flexible solutions is
> complexity to support. We'll see how it goes.
> 
> > > 
> > > > 
> > > > 5. If so, in theory implementation of the Rx burst
> > > >    in the secondary could simply call Rx burst on
> > > >    primary device.
> > > > 
> > > > Andrew.
> > > 
> > 
> > Hi Andrew,
> > 
> > I realized that we are talking different things, this feature
> > introduced 2 RxQ share:
> > 1. Share mempool to save memory
> > 2. Share polling to save latency
> > 
> > What you suggested is reuse all RxQ configuration IIUC, maybe we should
> > break the flag into 3, so application could learn PMD capability and
> > configure accordingly, how do you think?
> > RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POOL
> 
> Not sure that I understand. Just specify the same mempool
> on Rx queue setup. Isn't it sufficient?
> 
> > RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POLL
> 
> It implies pool sharing if I'm not mistaken. Of course,
> we can pool many different HW queues in one poll, but it
> hardly makes sense to care specially about it.
> IMHO RxQ sharing is a sharing of the underlying HW Rx queue.
> 
> > RTE_ETH_RX_OFFLOAD_RXQ_SHARE_CFG //implies POOL and POLL
> 
> It is hardly a feature. Rather a possible limitation.

Thanks, I'll drop this suggestion then.

Here is the TODO list, let me know if anything is missing:
1. change offload flag to RTE_ETH_DEV_CAPA_RX_SHARE
2. RxQ share group check in ethdev
3. add rx_domain into device info

> 
> Andrew.
  
Andrew Rybchenko Oct. 12, 2021, 11:28 a.m. UTC | #6
On 10/12/21 1:55 PM, Xueming(Steven) Li wrote:
> On Tue, 2021-10-12 at 11:48 +0300, Andrew Rybchenko wrote:
>> On 10/12/21 9:37 AM, Xueming(Steven) Li wrote:
>>> On Mon, 2021-10-11 at 23:11 +0800, Xueming Li wrote:
>>>> On Mon, 2021-10-11 at 14:49 +0300, Andrew Rybchenko wrote:
>>>>> Hi Xueming,
>>>>>
>>>>> On 9/30/21 5:55 PM, Xueming Li wrote:
>>>>>> In current DPDK framework, all RX queues is pre-loaded with mbufs for
>>>>>> incoming packets. When number of representors scale out in a switch
>>>>>> domain, the memory consumption became significant. Further more,
>>>>>> polling all ports leads to high cache miss, high latency and low
>>>>>> throughputs.
>>>>>>  
>>>>>> This patch introduces shared RX queue. PF and representors with same
>>>>>> configuration in same switch domain could share RX queue set by
>>>>>> specifying shared Rx queue offloading flag and sharing group.
>>>>>>
>>>>>> All ports that Shared Rx queue actually shares One Rx queue and only
>>>>>> pre-load mbufs to one Rx queue, memory is saved.
>>>>>>
>>>>>> Polling any queue using same shared RX queue receives packets from all
>>>>>> member ports. Source port is identified by mbuf->port.
>>>>>>
>>>>>> Multiple groups is supported by group ID. Port queue number in a shared
>>>>>> group should be identical. Queue index is 1:1 mapped in shared group.
>>>>>> An example of polling two share groups:
>>>>>>   core	group	queue
>>>>>>   0	0	0
>>>>>>   1	0	1
>>>>>>   2	0	2
>>>>>>   3	0	3
>>>>>>   4	1	0
>>>>>>   5	1	1
>>>>>>   6	1	2
>>>>>>   7	1	3
>>>>>>
>>>>>> Shared RX queue must be polled on single thread or core. If both PF0 and
>>>>>> representor0 joined same share group, can't poll pf0rxq0 on core1 and
>>>>>> rep0rxq0 on core2. Actually, polling one port within share group is
>>>>>> sufficient since polling any port in group will return packets for any
>>>>>> port in group.
>>>>>
>>>>> I apologize that I jump in into the review process that late.
>>>>
>>>> Appreciate the bold suggestion, never too late :)
>>>>
>>>>>
>>>>> Frankly speaking I doubt that it is the best design to solve
>>>>> the problem. Yes, I confirm that the problem exists, but I
>>>>> think there is better and simpler way to solve it.
>>>>>
>>>>> The problem of the suggested solution is that it puts all
>>>>> the headache about consistency to application and PMDs
>>>>> without any help from ethdev layer to guarantee the
>>>>> consistency. As the result I believe it will be either
>>>>> missing/lost consistency checks or huge duplication in
>>>>> each PMD which supports the feature. Shared RxQs must be
>>>>> equally configured including number of queues, offloads
>>>>> (taking device level Rx offloads into account), RSS
>>>>> settings etc. So, applications must care about it and
>>>>> PMDs (or ethdev layer) must check it.
>>>>
>>>> The name might be confusing, here is my understanding:
>>>> 1. NIC  shares the buffer supply HW Q between shared RxQs - for memory
>>>> 2. PMD polls one shared RxQ - for latency and performance
>>>> 3. Most per queue features like offloads and RSS not impacted. That's
>>>> why this not mentioned. Some offloading might not being supported due
>>>> to PMD or hw limitation, need to add check in PMD case by case.
>>>> 4. Multiple group is defined for service level flexibility. For
>>>> example, PF and VIP customer's load distributed via queues and dedicate
>>>> cores. Low priority customers share one core with one shared queue.
>>>> multiple groups enables more combination.
>>>> 5. One port could assign queues to different group for polling
>>>> flexibility. For example first 4 queues in group 0 and next 4 queues in
>>>> group1, each group have other member ports with 4 queues, so the port
>>>> with 8 queues could be polled with 8 cores w/o non-shared rxq penalty,
>>>> in other words, each core only poll one shared RxQ.
>>>>
>>>>>
>>>>> The advantage of the solution is that any device may
>>>>> create group and subsequent devices join. Absence of
>>>>> primary device is nice. But do we really need it?
>>>>> Will the design work if some representors are configured
>>>>> to use shared RxQ, but some do not? Theoretically it
>>>>> is possible, but could require extra non-trivial code
>>>>> on fast path.
>>>>
>>>> If multiple groups, any device could be hot-unplugged.
>>>>
>>>> Mixed configuration is supported, the only difference is how to set
>>>> mbuf->port. Since group is per queue, mixed is better to be supported,
>>>> didn't see any difficulty here.
>>>>
>>>> PDM could select to support only group 0, same settings for each rxq,
>>>> that fits most scenario.
>>>>
>>>>>
>>>>> Also looking at the first two patch I don't understand
>>>>> how application will find out which devices may share
>>>>> RxQs. E.g. if we have two difference NICs which support
>>>>> sharing, we can try to setup only one group 0, but
>>>>> finally will have two devices (not one) which must be
>>>>> polled.
>>>>>
>>>>> 1. We need extra flag in dev_info->dev_capa
>>>>>    RTE_ETH_DEV_CAPA_RX_SHARE to advertise that
>>>>>    the device supports Rx sharing.
>>>>
>>>> dev_info->rx_queue_offload_capa could be used here, no?
>>
>> It depends. But we definitely need a flag which
>> says that below rx_domain makes sense. It could be
>> either RTE_ETH_DEV_CAPA_RX_SHARE or an Rx offload
>> capability.
>>
>> The question is if it is really an offload. The offload is
>> when something could be done by HW/FW and result is provided
>> to SW. May be it is just a nit picking...
>>
>> May be we don't need an offload at all. Just have
>> RTE_ETH_DEV_CAPA_RXQ_SHARE and use non-zero group ID
>> as a flag that an RxQ should be shared (zero - default,
>> no sharing). ethdev layer may check consistency on
>> its layer to ensure that the device capability is
>> reported if non-zero group is specified on queue setup.
>>
>>>>
>>>>>
>>>>> 2. I think we need "rx_domain" in device info
>>>>>    (which should be treated in boundaries of the
>>>>>    switch_domain) if and only if
>>>>>    RTE_ETH_DEV_CAPA_RX_SHARE is advertised.
>>>>>    Otherwise rx_domain value does not make sense.
>>>>
>>>> I see, this will give flexibility of different hw, will add it.
>>>>
>>>>>
>>>>> (1) and (2) will allow application to find out which
>>>>> devices can share Rx.
>>>>>
>>>>> 3. Primary device (representors backing device) should
>>>>>    advertise shared RxQ offload. Enabling of the offload
>>>>>    tells the device to provide packets to all device in
>>>>>    the Rx domain with mbuf->port filled in appropriately.
>>>>>    Also it allows app to identify primary device in the
>>>>>    Rx domain. When application enables the offload, it
>>>>>    must ensure that it does not treat used port_id as an
>>>>>    input port_id, but always check mbuf->port for each
>>>>>    packet.
>>>>>
>>>>> 4. A new Rx mode should be introduced for secondary
>>>>>    devices. It should not allow to configure RSS, specify
>>>>>    any Rx offloads etc. ethdev must ensure it.
>>>>>    It is an open question right now if it should require
>>>>>    to provide primary port_id. In theory representors
>>>>>    have it. However, may be it is nice for consistency
>>>>>    to ensure that application knows that it does.
>>>>>    If shared Rx mode is specified for device, application
>>>>>    does not need to setup RxQs and attempts to do it
>>>>>    should be discarded in ethdev.
>>>>>    For consistency it is better to ensure that number of
>>>>>    queues match.
>>>>
>>>> RSS and Rx offloads should be supported as individual, PMD needs to
>>>> check if not supported.
>>
>> Thinking a bit more about it I agree that RSS settings could
>> be individual. Offload could be individual as well, but I'm
>> not sure about all offloads. E.g. Rx scatter which is related
>> to Rx buffer size (which is shared since Rx mempool is shared)
>> vs MTU. May be it is acceptable. We just must define rules
>> what should happen if offloads contradict to each other.
>> It should be highlighted in the description including
>> driver callback to ensure that PMD maintainers are responsible
>> for consistency checks.
>>
>>>>
>>>>>    It is an interesting question what should happen if
>>>>>    primary device is reconfigured and shared Rx is
>>>>>    disabled on reconfiguration.
>>>>
>>>> I feel better no primary port/queue assumption in configuration, all
>>>> members are equally treated, each queue can join or quit share group,
>>>> that's important to support multiple groups.
>>
>> I agree. The problem of many flexible solutions is
>> complexity to support. We'll see how it goes.
>>
>>>>
>>>>>
>>>>> 5. If so, in theory implementation of the Rx burst
>>>>>    in the secondary could simply call Rx burst on
>>>>>    primary device.
>>>>>
>>>>> Andrew.
>>>>
>>>
>>> Hi Andrew,
>>>
>>> I realized that we are talking different things, this feature
>>> introduced 2 RxQ share:
>>> 1. Share mempool to save memory
>>> 2. Share polling to save latency
>>>
>>> What you suggested is reuse all RxQ configuration IIUC, maybe we should
>>> break the flag into 3, so application could learn PMD capability and
>>> configure accordingly, how do you think?
>>> RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POOL
>>
>> Not sure that I understand. Just specify the same mempool
>> on Rx queue setup. Isn't it sufficient?
>>
>>> RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POLL
>>
>> It implies pool sharing if I'm not mistaken. Of course,
>> we can pool many different HW queues in one poll, but it
>> hardly makes sense to care specially about it.
>> IMHO RxQ sharing is a sharing of the underlying HW Rx queue.
>>
>>> RTE_ETH_RX_OFFLOAD_RXQ_SHARE_CFG //implies POOL and POLL
>>
>> It is hardly a feature. Rather a possible limitation.
> 
> Thanks, then I'd drop this suggestion then.
> 
> Here is the TODO list, let me know if anything missing:
> 1. change offload flag to RTE_ETH_DEV_CAPA_RX_SHARE

RTE_ETH_DEV_CAPA_RXQ_SHARE, since it is not sharing of the
entire Rx side, but just of some queues.

> 2. RxQ share group check in ethdev
> 3. add rx_domain into device info
> 
>>
>> Andrew.
>
  
Xueming Li Oct. 12, 2021, 11:33 a.m. UTC | #7
On Tue, 2021-10-12 at 14:28 +0300, Andrew Rybchenko wrote:
> On 10/12/21 1:55 PM, Xueming(Steven) Li wrote:
> > On Tue, 2021-10-12 at 11:48 +0300, Andrew Rybchenko wrote:
> > > On 10/12/21 9:37 AM, Xueming(Steven) Li wrote:
> > > > On Mon, 2021-10-11 at 23:11 +0800, Xueming Li wrote:
> > > > > On Mon, 2021-10-11 at 14:49 +0300, Andrew Rybchenko wrote:
> > > > > > Hi Xueming,
> > > > > > 
> > > > > > On 9/30/21 5:55 PM, Xueming Li wrote:
> > > > > > > In current DPDK framework, all RX queues is pre-loaded with mbufs for
> > > > > > > incoming packets. When number of representors scale out in a switch
> > > > > > > domain, the memory consumption became significant. Further more,
> > > > > > > polling all ports leads to high cache miss, high latency and low
> > > > > > > throughputs.
> > > > > > >  
> > > > > > > This patch introduces shared RX queue. PF and representors with same
> > > > > > > configuration in same switch domain could share RX queue set by
> > > > > > > specifying shared Rx queue offloading flag and sharing group.
> > > > > > > 
> > > > > > > All ports that Shared Rx queue actually shares One Rx queue and only
> > > > > > > pre-load mbufs to one Rx queue, memory is saved.
> > > > > > > 
> > > > > > > Polling any queue using same shared RX queue receives packets from all
> > > > > > > member ports. Source port is identified by mbuf->port.
> > > > > > > 
> > > > > > > Multiple groups is supported by group ID. Port queue number in a shared
> > > > > > > group should be identical. Queue index is 1:1 mapped in shared group.
> > > > > > > An example of polling two share groups:
> > > > > > >   core	group	queue
> > > > > > >   0	0	0
> > > > > > >   1	0	1
> > > > > > >   2	0	2
> > > > > > >   3	0	3
> > > > > > >   4	1	0
> > > > > > >   5	1	1
> > > > > > >   6	1	2
> > > > > > >   7	1	3
> > > > > > > 
> > > > > > > Shared RX queue must be polled on single thread or core. If both PF0 and
> > > > > > > representor0 joined same share group, can't poll pf0rxq0 on core1 and
> > > > > > > rep0rxq0 on core2. Actually, polling one port within share group is
> > > > > > > sufficient since polling any port in group will return packets for any
> > > > > > > port in group.
> > > > > > 
> > > > > > I apologize that I jump in into the review process that late.
> > > > > 
> > > > > Appreciate the bold suggestion, never too late :)
> > > > > 
> > > > > > 
> > > > > > Frankly speaking I doubt that it is the best design to solve
> > > > > > the problem. Yes, I confirm that the problem exists, but I
> > > > > > think there is better and simpler way to solve it.
> > > > > > 
> > > > > > The problem of the suggested solution is that it puts all
> > > > > > the headache about consistency to application and PMDs
> > > > > > without any help from ethdev layer to guarantee the
> > > > > > consistency. As the result I believe it will be either
> > > > > > missing/lost consistency checks or huge duplication in
> > > > > > each PMD which supports the feature. Shared RxQs must be
> > > > > > equally configured including number of queues, offloads
> > > > > > (taking device level Rx offloads into account), RSS
> > > > > > settings etc. So, applications must care about it and
> > > > > > PMDs (or ethdev layer) must check it.
> > > > > 
> > > > > The name might be confusing, here is my understanding:
> > > > > 1. NIC  shares the buffer supply HW Q between shared RxQs - for memory
> > > > > 2. PMD polls one shared RxQ - for latency and performance
> > > > > 3. Most per queue features like offloads and RSS not impacted. That's
> > > > > why this not mentioned. Some offloading might not being supported due
> > > > > to PMD or hw limitation, need to add check in PMD case by case.
> > > > > 4. Multiple group is defined for service level flexibility. For
> > > > > example, PF and VIP customer's load distributed via queues and dedicate
> > > > > cores. Low priority customers share one core with one shared queue.
> > > > > multiple groups enables more combination.
> > > > > 5. One port could assign queues to different group for polling
> > > > > flexibility. For example first 4 queues in group 0 and next 4 queues in
> > > > > group1, each group have other member ports with 4 queues, so the port
> > > > > with 8 queues could be polled with 8 cores w/o non-shared rxq penalty,
> > > > > in other words, each core only poll one shared RxQ.
> > > > > 
> > > > > > 
> > > > > > The advantage of the solution is that any device may
> > > > > > create group and subsequent devices join. Absence of
> > > > > > primary device is nice. But do we really need it?
> > > > > > Will the design work if some representors are configured
> > > > > > to use shared RxQ, but some do not? Theoretically it
> > > > > > is possible, but could require extra non-trivial code
> > > > > > on fast path.
> > > > > 
> > > > > If multiple groups, any device could be hot-unplugged.
> > > > > 
> > > > > Mixed configuration is supported, the only difference is how to set
> > > > > mbuf->port. Since group is per queue, mixed is better to be supported,
> > > > > didn't see any difficulty here.
> > > > > 
> > > > > PDM could select to support only group 0, same settings for each rxq,
> > > > > that fits most scenario.
> > > > > 
> > > > > > 
> > > > > > Also, looking at the first two patches I don't understand
> > > > > > how the application will find out which devices may share
> > > > > > RxQs. E.g. if we have two different NICs which support
> > > > > > sharing, we can try to set up only one group 0, but
> > > > > > will finally have two devices (not one) which must be
> > > > > > polled.
> > > > > > 
> > > > > > 1. We need an extra flag in dev_info->dev_capa,
> > > > > >    RTE_ETH_DEV_CAPA_RX_SHARE, to advertise that
> > > > > >    the device supports Rx sharing.
> > > > > 
> > > > > dev_info->rx_queue_offload_capa could be used here, no?
> > > 
> > > It depends. But we definitely need a flag which
> > > says that the rx_domain below makes sense. It could be
> > > either RTE_ETH_DEV_CAPA_RX_SHARE or an Rx offload
> > > capability.
> > > 
> > > The question is whether it is really an offload. An offload is
> > > when something can be done by HW/FW and the result is provided
> > > to SW. Maybe it is just nit-picking...
> > > 
> > > Maybe we don't need an offload at all. Just have
> > > RTE_ETH_DEV_CAPA_RXQ_SHARE and use a non-zero group ID
> > > as a flag that an RxQ should be shared (zero - default,
> > > no sharing). The ethdev layer may check consistency on
> > > its layer to ensure that the device capability is
> > > reported if a non-zero group is specified on queue setup.
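
One way such an ethdev-layer check could look (a sketch only, not the actual
patch; RTE_ETH_DEV_CAPA_RXQ_SHARE and share_group are the names discussed in
this thread):

#include <errno.h>
#include <rte_ethdev.h>

/* Hypothetical helper rte_eth_rx_queue_setup() could apply: a non-zero
 * share group is only valid when the device reports the capability.
 */
static int
validate_share_group(const struct rte_eth_dev_info *dev_info,
                     const struct rte_eth_rxconf *rx_conf)
{
        if (rx_conf == NULL || rx_conf->share_group == 0)
                return 0; /* sharing not requested */
        if ((dev_info->dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0)
                return -EINVAL; /* capability not reported by the PMD */
        return 0;
}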
> > > 
> > > > > 
> > > > > > 
> > > > > > 2. I think we need "rx_domain" in device info
> > > > > >    (which should be treated within the boundaries of
> > > > > >    the switch_domain) if and only if
> > > > > >    RTE_ETH_DEV_CAPA_RX_SHARE is advertised.
> > > > > >    Otherwise the rx_domain value does not make sense.
> > > > > 
> > > > > I see, this will give flexibility for different HW; will add it.
> > > > > 
> > > > > > 
> > > > > > (1) and (2) will allow the application to find out which
> > > > > > devices can share Rx.
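
As an illustration of how an application could use (1) and (2) to group
ports (a sketch; where exactly rx_domain lives was still under discussion at
this point, and the switch_info placement follows the later suggestion in
this thread):

#include <stdbool.h>
#include <rte_ethdev.h>

/* Two ports may share Rx queues if both report the capability and
 * belong to the same switch domain and the same Rx domain.
 */
static bool
can_share_rxq(uint16_t port_a, uint16_t port_b)
{
        struct rte_eth_dev_info ia, ib;

        if (rte_eth_dev_info_get(port_a, &ia) != 0 ||
            rte_eth_dev_info_get(port_b, &ib) != 0)
                return false;
        if ((ia.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0 ||
            (ib.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0)
                return false;
        return ia.switch_info.domain_id == ib.switch_info.domain_id &&
               ia.switch_info.rx_domain == ib.switch_info.rx_domain;
}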
> > > > > > 
> > > > > > 3. Primary device (the representors' backing device) should
> > > > > >    advertise the shared RxQ offload. Enabling the offload
> > > > > >    tells the device to provide packets to all devices in
> > > > > >    the Rx domain with mbuf->port filled in appropriately.
> > > > > >    It also allows the app to identify the primary device in
> > > > > >    the Rx domain. When the application enables the offload,
> > > > > >    it must ensure that it does not treat the polled port_id
> > > > > >    as the input port_id, but always checks mbuf->port for
> > > > > >    each packet (a sketch follows after point 4 below).
> > > > > > 
> > > > > > 4. A new Rx mode should be introduced for secondary
> > > > > >    devices. It should not allow configuring RSS, specifying
> > > > > >    any Rx offloads etc.; ethdev must ensure it.
> > > > > >    It is an open question right now whether it should
> > > > > >    require providing the primary port_id. In theory
> > > > > >    representors have it. However, maybe it is nice for
> > > > > >    consistency to ensure that the application knows that it
> > > > > >    does. If shared Rx mode is specified for a device, the
> > > > > >    application does not need to set up RxQs, and attempts to
> > > > > >    do so should be rejected in ethdev.
> > > > > >    For consistency it is better to ensure that the number of
> > > > > >    queues matches.
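
A sketch of the mbuf->port rule from point 3 (illustrative only; the
per-port counter stands in for real application logic):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static uint64_t rx_pkts_per_port[RTE_MAX_ETHPORTS];

/* Poll one shared RxQ through any member port and attribute each
 * packet to its real source port via mbuf->port, never to the
 * port_id used for polling.
 */
static void
poll_shared_rxq(uint16_t poll_port, uint16_t queue_id)
{
        struct rte_mbuf *pkts[BURST_SIZE];
        uint16_t nb, i;

        nb = rte_eth_rx_burst(poll_port, queue_id, pkts, BURST_SIZE);
        for (i = 0; i < nb; i++) {
                rx_pkts_per_port[pkts[i]->port]++;
                rte_pktmbuf_free(pkts[i]);
        }
}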
> > > > > 
> > > > > RSS and Rx offloads should be supported individually; the PMD needs
> > > > > to check and reject what is not supported.
> > > 
> > > Thinking a bit more about it, I agree that RSS settings could
> > > be individual. Offloads could be individual as well, but I'm
> > > not sure about all of them. E.g. Rx scatter, which is related
> > > to the Rx buffer size (shared since the Rx mempool is shared)
> > > vs MTU. Maybe it is acceptable. We just must define rules for
> > > what should happen if offloads contradict each other.
> > > It should be highlighted in the description, including the
> > > driver callback, to ensure that PMD maintainers are responsible
> > > for consistency checks.
> > > 
> > > > > 
> > > > > >    It is an interesting question what should happen if
> > > > > >    primary device is reconfigured and shared Rx is
> > > > > >    disabled on reconfiguration.
> > > > > 
> > > > > I feel better with no primary port/queue assumption in the
> > > > > configuration: all members are treated equally, and each queue can
> > > > > join or leave a share group. That's important to support multiple
> > > > > groups.
> > > 
> > > I agree. The problem with many flexible solutions is the
> > > complexity of supporting them. We'll see how it goes.
> > > 
> > > > > 
> > > > > > 
> > > > > > 5. If so, in theory the implementation of the Rx burst
> > > > > >    in the secondary could simply call the Rx burst on the
> > > > > >    primary device.
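
For point 5, a very rough sketch of how a PMD's secondary Rx burst could
delegate to the primary queue (the shared_rxq_priv structure and its fields
are hypothetical):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Hypothetical per-queue private data of a "secondary" shared RxQ. */
struct shared_rxq_priv {
        uint16_t primary_port;  /* port owning the real HW queue */
        uint16_t primary_queue; /* queue index on the primary port */
};

/* Rx burst callback of the secondary device: forward the call to the
 * primary queue; consistency was already checked at setup time.
 */
static uint16_t
secondary_rx_burst(void *rxq, struct rte_mbuf **pkts, uint16_t nb_pkts)
{
        struct shared_rxq_priv *priv = rxq;

        return rte_eth_rx_burst(priv->primary_port, priv->primary_queue,
                                pkts, nb_pkts);
}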
> > > > > > 
> > > > > > Andrew.
> > > > > 
> > > > 
> > > > Hi Andrew,
> > > > 
> > > > I realized that we are talking about different things; this feature
> > > > introduces two kinds of RxQ sharing:
> > > > 1. Sharing the mempool to save memory
> > > > 2. Sharing the polling to save latency
> > > > 
> > > > What you suggested is reusing all RxQ configuration, IIUC. Maybe we
> > > > should break the flag into 3, so the application could learn the PMD
> > > > capability and configure accordingly. What do you think?
> > > > RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POOL
> > > 
> > > Not sure that I understand. Just specify the same mempool
> > > on Rx queue setup. Isn't it sufficient?
> > > 
> > > > RTE_ETH_RX_OFFLOAD_RXQ_SHARE_POLL
> > > 
> > > It implies pool sharing if I'm not mistaken. Of course,
> > > we can poll many different HW queues in one polling loop, but it
> > > hardly makes sense to care specially about that.
> > > IMHO RxQ sharing is sharing of the underlying HW Rx queue.
> > > 
> > > > RTE_ETH_RX_OFFLOAD_RXQ_SHARE_CFG //implies POOL and POLL
> > > 
> > > It is hardly a feature. Rather a possible limitation.
> > 
> > Thanks, then I'd drop this suggestion.
> > 
> > Here is the TODO list, let me know if anything is missing:
> > 1. change the offload flag to RTE_ETH_DEV_CAPA_RX_SHARE
> 
> RTE_ETH_DEV_CAPA_RXQ_SHARE, since it is not sharing of the
> entire Rx, but just some queues.

OK

> 
> > 2. RxQ share group check in ethdev
> > 3. add rx_domain into device info

Seems rte_eth_switch_info is a better place for rx_domain.
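
A sketch of what that could look like (existing fields abbreviated;
rx_domain is the proposed addition, not the final API):

/* lib/ethdev/rte_ethdev.h - proposed addition to the switch info */
struct rte_eth_switch_info {
        const char *name;   /* switch name */
        uint16_t domain_id; /* switch domain ID */
        uint16_t port_id;   /* switch port ID */
        uint16_t rx_domain; /* RxQ aggregation domain within domain_id */
};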

> > 
> > > 
> > > Andrew.
> > 
>
  
Xueming Li Oct. 13, 2021, 7:53 a.m. UTC | #8
On Tue, 2021-10-12 at 14:28 +0300, Andrew Rybchenko wrote:
> On 10/12/21 1:55 PM, Xueming(Steven) Li wrote:
> > On Tue, 2021-10-12 at 11:48 +0300, Andrew Rybchenko wrote:
> > > On 10/12/21 9:37 AM, Xueming(Steven) Li wrote:
> > > > On Mon, 2021-10-11 at 23:11 +0800, Xueming Li wrote:
> > > > > On Mon, 2021-10-11 at 14:49 +0300, Andrew Rybchenko wrote:
> > > > > > Hi Xueming,
> > > > > > 
> > > > > > On 9/30/21 5:55 PM, Xueming Li wrote:
> > [snip]
> > Here is the TODO list, let me know if anything is missing:
> > 1. change the offload flag to RTE_ETH_DEV_CAPA_RX_SHARE
> 
> RTE_ETH_DEV_CAPA_RXQ_SHARE, since it is not sharing of the
> entire Rx, but just some queues.

V6 posted, thanks!

> 
> > 2. RxQ share group check in ethdev
> > 3. add rx_domain into device info
> > 
> > > 
> > > Andrew.
> > 
>