[v2,01/15] ethdev: introduce shared Rx queue

Message ID: 20210811140418.393264-1-xuemingl@nvidia.com
State: Superseded, archived
Delegated to: Ferruh Yigit
Series: [v2,01/15] ethdev: introduce shared Rx queue

Checks

Context        Check    Description
ci/checkpatch  success  coding style OK

Commit Message

Xueming Li Aug. 11, 2021, 2:04 p.m. UTC
In the current DPDK framework, each RX queue is pre-loaded with mbufs for
incoming packets. When the number of representors scales out in a switch
domain, the memory consumption becomes significant. More importantly,
polling all ports leads to high cache miss rates, high latency and low
throughput.

This patch introduces the shared RX queue. Ports with the same configuration
in a switch domain can share an RX queue set by specifying a sharing group.
Polling any queue that uses the same shared RX queue receives packets from
all member ports. The source port is identified by mbuf->port.

The number of queues per port in a shared group should be identical. Queue
indexes are mapped 1:1 within a shared group.

A shared RX queue must be polled from a single thread or core.

Multiple groups are supported through the group ID.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Cc: Jerin Jacob <jerinjacobk@gmail.com>
---
An Rx queue object could be used as a shared Rx queue object, so it is important
to clear up all queue control callback APIs that use the queue object:
  https://mails.dpdk.org/archives/dev/2021-July/215574.html
---
 doc/guides/nics/features.rst                    | 11 +++++++++++
 doc/guides/nics/features/default.ini            |  1 +
 doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
 lib/ethdev/rte_ethdev.c                         |  1 +
 lib/ethdev/rte_ethdev.h                         |  7 +++++++
 5 files changed, 30 insertions(+)
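
The receive model described in the commit message can be sketched in plain C. This is only an illustration under stated assumptions: `struct pkt` stands in for `struct rte_mbuf` and keeps just the `port` field, and `shared_rxq_deliver`, `shared_rxq_burst` and `count_per_port` are hypothetical names, not ethdev or PMD functions. It shows one poll returning packets from several member ports, with the source recovered from the per-packet port field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PORTS 4 /* illustrative limit, not a DPDK constant */
#define RING_SIZE 8 /* illustrative ring depth (power of two) */

/* Stand-in for struct rte_mbuf: only the field the commit message relies on. */
struct pkt {
	uint16_t port; /* source port, set by the modelled hardware */
	uint32_t len;
};

/* One shared queue: a single ring filled by every member port. */
struct shared_rxq {
	struct pkt ring[RING_SIZE];
	unsigned int head; /* next slot to read */
	unsigned int tail; /* next slot to write */
};

/* Model of hardware delivering a packet from any member port. */
static int shared_rxq_deliver(struct shared_rxq *q, uint16_t port, uint32_t len)
{
	if (q->tail - q->head == RING_SIZE)
		return -1; /* ring full, packet dropped */
	q->ring[q->tail % RING_SIZE] = (struct pkt){ .port = port, .len = len };
	q->tail++;
	return 0;
}

/* One poll drains packets from all member ports, like a single Rx burst. */
static unsigned int shared_rxq_burst(struct shared_rxq *q, struct pkt *out,
				     unsigned int n)
{
	unsigned int i = 0;

	while (i < n && q->head != q->tail)
		out[i++] = q->ring[q->head++ % RING_SIZE];
	return i;
}

/* Account received bytes per source port using the port field. */
static void count_per_port(const struct pkt *pkts, unsigned int n,
			   uint64_t bytes[MAX_PORTS])
{
	for (unsigned int i = 0; i < n; i++) {
		assert(pkts[i].port < MAX_PORTS);
		bytes[pkts[i].port] += pkts[i].len;
	}
}
```

The application-side consequence is that per-port processing has to branch on the packet's port field rather than on the port that was polled.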
  

Comments

Jerin Jacob Aug. 17, 2021, 9:33 a.m. UTC | #1
On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
>
> In current DPDK framework, each RX queue is pre-loaded with mbufs for
> incoming packets. When number of representors scale out in a switch
> domain, the memory consumption became significant. Most important,
> polling all ports leads to high cache miss, high latency and low
> throughput.
>
> This patch introduces shared RX queue. Ports with same configuration in
> a switch domain could share RX queue set by specifying sharing group.
> Polling any queue using same shared RX queue receives packets from all
> member ports. Source port is identified by mbuf->port.
>
> Port queue number in a shared group should be identical. Queue index is
> 1:1 mapped in shared group.
>
> Share RX queue must be polled on single thread or core.
>
> Multiple groups is supported by group ID.
>
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> Cc: Jerin Jacob <jerinjacobk@gmail.com>
> ---
> Rx queue object could be used as shared Rx queue object, it's important
> to clear all queue control callback api that using queue object:
>   https://mails.dpdk.org/archives/dev/2021-July/215574.html

>  #undef RTE_RX_OFFLOAD_BIT2STR
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index d2b27c351f..a578c9db9d 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
>         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
>         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> +       uint32_t shared_group; /**< Shared port group index in switch domain. */

I am not able to see anyone setting or creating this group ID in the test application.
How is this group created?


>         /**
>          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
>          * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
>  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
>  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
>  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> +/**
> + * Rx queue is shared among ports in same switch domain to save memory,
> + * avoid polling each port. Any port in group can be used to receive packets.
> + * Real source port number saved in mbuf->port field.
> + */
> +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
>
>  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
>                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> --
> 2.25.1
>
  
Xueming Li Aug. 17, 2021, 11:31 a.m. UTC | #2
> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Tuesday, August 17, 2021 5:33 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> 
> On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> >  #undef RTE_RX_OFFLOAD_BIT2STR
> > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > index d2b27c351f..a578c9db9d 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > +       uint32_t shared_group; /**< Shared port group index in switch domain. */
> 
> Not to able to see anyone setting/creating this group ID test application.
> How this group is created?

Nice catch. The initial testpmd version only supports one default group (0).
All ports that support shared-rxq are assigned to the same group.

We should be able to change "--rxq-shared" to "--rxq-shared-group" to support
groups other than the default.

To support more groups simultaneously, we need to consider the testpmd forwarding
stream core assignment: all streams in the same group need to stay on the same core.
It's possible to specify the number of ports after which the group number is increased,
but the user must schedule stream affinity carefully, which is error prone.

On the other hand, one group should be sufficient for most customers; the doubt is
whether it is valuable to support testing multiple groups.
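
The affinity constraint above can be sketched as follows. This is a minimal sketch, not testpmd code: the helper names and the `ports_per_group` parameter are hypothetical, and it takes the conservative reading that every stream of a group is pinned to one forwarding core.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the group number grows every ports_per_group ports. */
static uint32_t share_group_of(uint16_t port_idx, uint16_t ports_per_group)
{
	assert(ports_per_group > 0);
	return port_idx / ports_per_group;
}

/* Hypothetical helper: every stream of a group maps to one forwarding core. */
static unsigned int lcore_of_stream(uint16_t port_idx, uint16_t ports_per_group,
				    unsigned int nb_fwd_lcores)
{
	assert(nb_fwd_lcores > 0);
	return share_group_of(port_idx, ports_per_group) % nb_fwd_lcores;
}

/*
 * Return 1 if no group is split across cores in a per-port core table.
 * Groups are contiguous port ranges, so comparing neighbours is enough.
 */
static int affinity_is_valid(const unsigned int *lcore_by_port,
			     uint16_t nb_ports, uint16_t ports_per_group)
{
	for (uint16_t p = 1; p < nb_ports; p++) {
		uint16_t prev = (uint16_t)(p - 1);

		if (share_group_of(p, ports_per_group) ==
		    share_group_of(prev, ports_per_group) &&
		    lcore_by_port[p] != lcore_by_port[prev])
			return 0;
	}
	return 1;
}
```

A check like `affinity_is_valid` is one way a test tool could reject a manual core assignment that splits a group, which is the error-prone case mentioned in the reply.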

  
Jerin Jacob Aug. 17, 2021, 3:11 p.m. UTC | #3
On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > Not to able to see anyone setting/creating this group ID test application.
> > How this group is created?
>
> Nice catch, the initial testpmd version only support one default group(0).
> All ports that supports shared-rxq assigned in same group.
>
> We should be able to change "--rxq-shared" to "--rxq-shared-group" to support
> group other than default.
>
> To support more groups simultaneously, need to consider testpmd forwarding stream
> core assignment, all streams in same group need to stay on same core.
> It's possible to specify how many ports to increase group number, but user must
> schedule stream affinity carefully - error prone.
>
> On the other hand, one group should be sufficient for most customer, the doubt is
> whether it valuable to support multiple groups test.

Ack. One group is enough in testpmd.

My question was more about who creates this group and how. Shouldn't we have an
API to create the shared_group? If we do the following, I can at least
see how it
can be implemented in SW or other HW.

- Create an aggregation queue group
- Attach multiple Rx queues to the aggregation queue group
- Pull the packets from the queue group (which internally fetches from
the Rx queues _attached_)

Does the above kind of sequence break your representor use case?
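
The three-step sequence proposed above could look like the following pure-software sketch. All names here (`agg_group_create`, `agg_group_attach`, `agg_group_pull`, and the stand-in structs) are hypothetical and exist only for illustration; no such API is part of the patch or of ethdev as quoted in this thread. Round-robin over the attached queues is an assumed policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define AGG_MAX_QUEUES 8 /* illustrative limit */
#define SRC_RING_SIZE 4  /* illustrative depth of each source queue */

/* Stand-in for one ordinary Rx queue holding packet lengths in FIFO order. */
struct src_rxq {
	uint16_t port;
	uint32_t pkts[SRC_RING_SIZE];
	unsigned int head;  /* next packet to hand out */
	unsigned int count; /* packets still pending */
};

/* Stand-in for a received packet tagged with its source port. */
struct agg_pkt {
	uint16_t port;
	uint32_t len;
};

struct agg_group {
	struct src_rxq *queues[AGG_MAX_QUEUES];
	unsigned int nb_queues;
	unsigned int next; /* round-robin cursor */
};

/* Step 1: create an empty aggregation queue group. */
static void agg_group_create(struct agg_group *g)
{
	g->nb_queues = 0;
	g->next = 0;
}

/* Step 2: attach an Rx queue; returns -1 when the group is full. */
static int agg_group_attach(struct agg_group *g, struct src_rxq *q)
{
	assert(q != NULL);
	if (g->nb_queues == AGG_MAX_QUEUES)
		return -1;
	g->queues[g->nb_queues++] = q;
	return 0;
}

/* Step 3: pull up to n packets, visiting attached queues round-robin. */
static unsigned int agg_group_pull(struct agg_group *g, struct agg_pkt *out,
				   unsigned int n)
{
	unsigned int got = 0, idle = 0;

	/* Stop once every attached queue was found empty in a row. */
	while (got < n && g->nb_queues > 0 && idle < g->nb_queues) {
		struct src_rxq *q = g->queues[g->next];

		g->next = (g->next + 1) % g->nb_queues;
		if (q->count == 0) {
			idle++;
			continue;
		}
		idle = 0;
		out[got].port = q->port;
		out[got].len = q->pkts[q->head++];
		q->count--;
		got++;
	}
	return got;
}
```

In this software form the pull step still touches every attached queue, so it keeps the per-queue buffers and polling cost; it models the calling sequence only.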


  
Xueming Li Aug. 18, 2021, 11:14 a.m. UTC | #4
> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Tuesday, August 17, 2021 11:12 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> 
> On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > Not to able to see anyone setting/creating this group ID test application.
> > > How this group is created?
> >
> > Nice catch, the initial testpmd version only support one default group(0).
> > All ports that supports shared-rxq assigned in same group.
> >
> > We should be able to change "--rxq-shared" to "--rxq-shared-group" to
> > support group other than default.
> >
> > To support more groups simultaneously, need to consider testpmd
> > forwarding stream core assignment, all streams in same group need to stay on same core.
> > It's possible to specify how many ports to increase group number, but
> > user must schedule stream affinity carefully - error prone.
> >
> > On the other hand, one group should be sufficient for most customer,
> > the doubt is whether it valuable to support multiple groups test.
> 
> Ack. One group is enough in testpmd.
> 
> My question was more about who and how this group is created, Should n't we need API to create shared_group? If we do the
> following, at least, I can think, how it can be implemented in SW or other HW.
> 
> - Create aggregation queue group
> - Attach multiple  Rx queues to the aggregation queue group
> - Pull the packets from the queue group(which internally fetch from the Rx queues _attached_)
> 
> Does the above kind of sequence, break your representor use case?

This seems more like a set of EAL wrappers. The current API tries to minimize the application effort needed to adopt shared-rxq.
- step 1: not sure how important it is to create the group with an API; in rte_flow, a group is created on demand.
- step 2: currently, the attaching is done in rte_eth_rx_queue_setup by specifying the offload and group in the rx_conf struct.
- step 3: define a dedicated API to receive packets from the shared rxq? That looks like a clear way to receive packets from a shared rxq.
  Currently, the rxq objects in a share group are the same object, the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
  be used to receive packets from any port in the group, normally the first port (PF) in the group.
  An alternative is to define a vdev with the same queue number; copying the rxq objects makes the vdev a proxy of
  the shared rxq group. This could be a helper API.

Anyway, the wrapper doesn't break the use case. The step 3 API is clearer, but we need to understand how to implement it efficiently.
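
Step 2 above, joining a group through the Rx queue configuration, can be sketched like this. To stay buildable without DPDK, the sketch uses local stand-in structs that mirror only the fields named in this thread (`offloads`, `shared_group`, `rx_offload_capa`), and it redefines the offload bit with the value from the quoted v2 diff; `rxconf_join_group` is a hypothetical helper. In a real application the filled config would be passed to rte_eth_rx_queue_setup, and since this series is marked superseded the final field and flag names may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value taken from the quoted v2 diff; defined locally so this builds without DPDK. */
#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ 0x00200000

/* Local stand-in holding only the two rx_conf fields the reply mentions. */
struct rxconf_sketch {
	uint64_t offloads;
	uint32_t shared_group;
};

/* Local stand-in for the capability mask a driver reports. */
struct dev_info_sketch {
	uint64_t rx_offload_capa;
};

/*
 * Fill the per-queue config for a port joining a share group.
 * Returns 0 on success, or -1 if the device does not advertise the
 * offload, in which case the config is left untouched.
 */
static int rxconf_join_group(struct rxconf_sketch *conf,
			     const struct dev_info_sketch *info,
			     uint32_t group)
{
	assert(conf != NULL && info != NULL);
	if ((info->rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ) == 0)
		return -1;
	conf->offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
	conf->shared_group = group;
	return 0;
}
```

Checking the capability first lets an application fall back to a private queue on ports whose driver does not support sharing.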

  
Jerin Jacob Aug. 19, 2021, 5:26 a.m. UTC | #5
On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > Ack. One group is enough in testpmd.
> >
> > My question was more about who and how this group is created, Should n't we need API to create shared_group? If we do the
> > following, at least, I can think, how it can be implemented in SW or other HW.
> >
> > - Create aggregation queue group
> > - Attach multiple  Rx queues to the aggregation queue group
> > - Pull the packets from the queue group(which internally fetch from the Rx queues _attached_)
> >
> > Does the above kind of sequence, break your representor use case?
>
> Seems more like a set of EAL wrapper. Current API tries to minimize the application efforts to adapt shared-rxq.
> - step 1, not sure how important it is to create group with API, in rte_flow, group is created on demand.

Which rte_flow pattern/action is used for this?

> - step 2, currently, the attaching is done in rte_eth_rx_queue_setup, specify offload and group in rx_conf struct.
> - step 3, define a dedicate api to receive packets from shared rxq? Looks clear to receive packets from shared rxq.
>   currently, rxq objects in share group is same - the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
>   be used to receive packets from any ports in group, normally the first port(PF) in group.
>   An alternative way is defining a vdev with same queue number and copy rxq objects will make the vdev a proxy of
>   the shared rxq group - this could be an helper API.
>
> Anyway the wrapper doesn't break use case, step 3 api is more clear, need to understand how to implement efficiently.

Are you doing this feature based on any HW support, or is it a pure SW
thing? If it is SW, it is better to have
just a new vdev, like drivers/net/bonding/. That way we can help
aggregate multiple Rxqs across multiple ports
of the same driver.


  
Xueming Li Aug. 19, 2021, 12:09 p.m. UTC | #6
> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Thursday, August 19, 2021 1:27 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> 
> On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > Seems more like a set of EAL wrapper. Current API tries to minimize the application efforts to adapt shared-rxq.
> > - step 1, not sure how important it is to create group with API, in rte_flow, group is created on demand.
> 
> Which rte_flow pattern/action for this?

No rte_flow for this; I just recalled that a group in rte_flow is created along with the flow, not via an API.
I don't see anything else to create along with the group, so I doubt whether it is valuable to introduce a new API set to manage groups.

> 
> > - step 2, currently, the attaching is done in rte_eth_rx_queue_setup, specify offload and group in rx_conf struct.
> > - step 3, define a dedicate api to receive packets from shared rxq? Looks clear to receive packets from shared rxq.
> >   currently, rxq objects in share group is same - the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
> >   be used to receive packets from any ports in group, normally the first port(PF) in group.
> >   An alternative way is defining a vdev with same queue number and copy rxq objects will make the vdev a proxy of
> >   the shared rxq group - this could be an helper API.
> >
> > Anyway the wrapper doesn't break use case, step 3 api is more clear, need to understand how to implement efficiently.
> 
> Are you doing this feature based on any HW support or it just pure SW thing, If it is SW, It is better to have just new vdev for like
> drivers/net/bonding/. This we can help aggregate multiple Rxq across the multiple ports of same the driver.

Based on HW support. 

Most users might use the PF in the group as the anchor port for rx burst, so the current definition should be easy for them to migrate to.
But some users might prefer grouping hot-plugged/unplugged representors; EAL could provide wrappers, or users could do
that themselves, since the strategy is not very complex. Anyway, any suggestion is welcome.
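As an illustration of the anchor-port model, the sketch below shows what an application might do with one burst received on the anchor port: split it by source port using the port field of each packet. `struct mbuf_sketch`, `MAX_PORTS` and `demux_by_port()` are hypothetical stand-ins, not DPDK definitions; only the idea that the source port is carried in mbuf->port comes from the patch.

```c
#include <stdint.h>

#define MAX_PORTS 64 /* hypothetical upper bound on port IDs */

/* Hypothetical stand-in for struct rte_mbuf: only the port field. */
struct mbuf_sketch {
	uint16_t port; /* real source port of the packet */
};

/*
 * Count packets per source port for one burst polled from the anchor
 * port. Packets with an out-of-range port ID are not counted; the
 * number of such packets is returned.
 */
static uint16_t
demux_by_port(const struct mbuf_sketch *pkts, uint16_t nb_rx,
	      uint32_t per_port[MAX_PORTS])
{
	uint16_t i;
	uint16_t skipped = 0;

	for (i = 0; i < nb_rx; i++) {
		if (pkts[i].port >= MAX_PORTS) {
			skipped++;
			continue;
		}
		per_port[pkts[i].port]++;
	}
	return skipped;
}
```

A forwarding application would replace the counters with per-port handling, but the lookup key stays the same.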

> 
> 
> >
> > >
> > >
> > > >
> > > > >
> > > > >
> > > > > >         /**
> > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > +/**
> > > > > > + * Rx queue is shared among ports in same switch domain to
> > > > > > +save memory,
> > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > + */
> > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > >
> > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > --
> > > > > > 2.25.1
> > > > > >
  
Jerin Jacob Aug. 26, 2021, 11:58 a.m. UTC | #7
On Thu, Aug 19, 2021 at 5:39 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Thursday, August 19, 2021 1:27 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> >
> > On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > Sent: Tuesday, August 17, 2021 11:12 PM
> > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > <andrew.rybchenko@oktetlabs.ru>
> > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > >
> > > > On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > >
> > > > > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > >
> > > > > > > In current DPDK framework, each RX queue is pre-loaded with
> > > > > > > mbufs for incoming packets. When number of representors scale
> > > > > > > out in a switch domain, the memory consumption became
> > > > > > > significant. Most important, polling all ports leads to high
> > > > > > > cache miss, high latency and low throughput.
> > > > > > >
> > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > Polling any queue using same shared RX queue receives packets
> > > > > > > from all member ports. Source port is identified by mbuf->port.
> > > > > > >
> > > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > > index is
> > > > > > > 1:1 mapped in shared group.
> > > > > > >
> > > > > > > Share RX queue must be polled on single thread or core.
> > > > > > >
> > > > > > > Multiple groups is supported by group ID.
> > > > > > >
> > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > ---
> > > > > > > Rx queue object could be used as shared Rx queue object, it's
> > > > > > > important to clear all queue control callback api that using queue object:
> > > > > > >   https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > > > > >
> > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR diff --git
> > > > > > > a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > > > > > > d2b27c351f..a578c9db9d 100644
> > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > */
> > > > > > > +       uint32_t shared_group; /**< Shared port group index in
> > > > > > > + switch domain. */
> > > > > >
> > > > > > Not to able to see anyone setting/creating this group ID test application.
> > > > > > How this group is created?
> > > > >
> > > > > Nice catch, the initial testpmd version only support one default group(0).
> > > > > All ports that supports shared-rxq assigned in same group.
> > > > >
> > > > > We should be able to change "--rxq-shared" to "--rxq-shared-group"
> > > > > to support group other than default.
> > > > >
> > > > > To support more groups simultaneously, need to consider testpmd
> > > > > forwarding stream core assignment, all streams in same group need to stay on same core.
> > > > > It's possible to specify how many ports to increase group number,
> > > > > but user must schedule stream affinity carefully - error prone.
> > > > >
> > > > > On the other hand, one group should be sufficient for most
> > > > > customer, the doubt is whether it valuable to support multiple groups test.
> > > >
> > > > Ack. One group is enough in testpmd.
> > > >
> > > > My question was more about who and how this group is created, Should
> > > > n't we need API to create shared_group? If we do the following, at least, I can think, how it can be implemented in SW or other HW.
> > > >
> > > > - Create aggregation queue group
> > > > - Attach multiple  Rx queues to the aggregation queue group
> > > > - Pull the packets from the queue group(which internally fetch from
> > > > the Rx queues _attached_)
> > > >
> > > > Does the above kind of sequence, break your representor use case?
> > >
> > > Seems more like a set of EAL wrapper. Current API tries to minimize the application efforts to adapt shared-rxq.
> > > - step 1, not sure how important it is to create group with API, in rte_flow, group is created on demand.
> >
> > Which rte_flow pattern/action for this?
>
> No rte_flow for this, just recalled that the group in rte_flow is not created along with flow, not via api.
> I don’t see anything else to create along with group, just double whether it valuable to introduce a new api set to manage group.

See below.

>
> >
> > > - step 2, currently, the attaching is done in rte_eth_rx_queue_setup, specify offload and group in rx_conf struct.
> > > - step 3, define a dedicate api to receive packets from shared rxq? Looks clear to receive packets from shared rxq.
> > >   currently, rxq objects in share group is same - the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
> > >   be used to receive packets from any ports in group, normally the first port(PF) in group.
> > >   An alternative way is defining a vdev with same queue number and copy rxq objects will make the vdev a proxy of
> > >   the shared rxq group - this could be an helper API.
> > >
> > > Anyway the wrapper doesn't break use case, step 3 api is more clear, need to understand how to implement efficiently.
> >
> > Are you doing this feature based on any HW support or it just pure SW thing, If it is SW, It is better to have just new vdev for like
> > drivers/net/bonding/. This we can help aggregate multiple Rxq across the multiple ports of same the driver.
>
> Based on HW support.

Marvell HW has some support for this; I will outline it here, along with some queries.

# We need to create some new HW structure for aggregation
# Connect each Rxq to the new HW structure for aggregation
# Use rx_burst from the new HW structure.

Could you outline your HW support?

Also, I am not able to understand how this will reduce the memory;
at least in our HW, we would need to allocate more memory to deal with this,
as we need to handle the new HW structure.

How does it reduce the memory in your HW? Also, if memory is the
constraint, why NOT reduce the number of queues?

Also, I was thinking, one way to avoid the fast path or ABI change would be like this:

# Driver initializes one more eth_dev_ops in the driver as an aggregator ethdev
# devargs of the new ethdev, or a specific API like
drivers/net/bonding/rte_eth_bond.h, can take the (port, queue)
tuples which need to be aggregated by the new ethdev port
# No change in fastpath or ABI is required in this model.
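The aggregator model outlined above can be sketched in software as follows. This is only a toy model under stated assumptions: every type and function here (`struct src_queue`, `struct aggregator`, `agg_attach()`, `agg_rx_burst()`) is hypothetical, the round-robin policy is an arbitrary choice, and a HW-backed aggregator would not poll member queues like this.

```c
#include <stdint.h>

#define AGG_MAX_MEMBERS 8
#define SRC_RING_SIZE 16

/* Hypothetical packet: only the source port ID is modelled. */
struct pkt {
	uint16_t port;
};

/* Hypothetical stand-in for one (port, queue) tuple. */
struct src_queue {
	uint16_t port_id;
	uint16_t queue_id;
	struct pkt ring[SRC_RING_SIZE];
	uint16_t head; /* next packet to read */
	uint16_t tail; /* next free slot */
};

/* Hypothetical aggregator port holding the attached tuples. */
struct aggregator {
	struct src_queue *members[AGG_MAX_MEMBERS];
	uint16_t nb_members;
	uint16_t next; /* round-robin cursor */
};

/* Simulate one packet arriving on a member queue. */
static int
src_enqueue(struct src_queue *q)
{
	if ((uint16_t)(q->tail - q->head) >= SRC_RING_SIZE)
		return -1; /* ring full */
	q->ring[q->tail % SRC_RING_SIZE].port = q->port_id;
	q->tail++;
	return 0;
}

/* Attach one (port, queue) tuple to the aggregator. */
static int
agg_attach(struct aggregator *agg, struct src_queue *q)
{
	if (agg->nb_members >= AGG_MAX_MEMBERS)
		return -1;
	agg->members[agg->nb_members++] = q;
	return 0;
}

/* Pull up to nb_pkts packets, one per member in round-robin order. */
static uint16_t
agg_rx_burst(struct aggregator *agg, struct pkt *pkts, uint16_t nb_pkts)
{
	uint16_t nb_rx = 0;
	uint16_t idle = 0; /* consecutive empty members seen */

	while (nb_rx < nb_pkts && idle < agg->nb_members) {
		struct src_queue *q = agg->members[agg->next];

		agg->next = (uint16_t)((agg->next + 1) % agg->nb_members);
		if (q->head == q->tail) {
			idle++;
			continue;
		}
		pkts[nb_rx++] = q->ring[q->head % SRC_RING_SIZE];
		q->head++;
		idle = 0;
	}
	return nb_rx;
}
```

The point of the sketch is the call shape: the application only ever calls a burst function on the aggregator, so the existing per-port fast path is untouched.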



> Most user might uses PF in group as the anchor port to rx burst, current definition should be easy for them to migrate.
> but some user might prefer grouping some hot plug/unpluggedrepresentors, EAL could provide wrappers, users could do
> that either due to the strategy not complex enough. Anyway, welcome any suggestion.
>
> >
> >
> > >
> > > >
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > >         /**
> > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > +/**
> > > > > > > + * Rx queue is shared among ports in same switch domain to
> > > > > > > +save memory,
> > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > + */
> > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > >
> > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > --
> > > > > > > 2.25.1
> > > > > > >
  
Xueming Li Aug. 28, 2021, 2:16 p.m. UTC | #8
> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Thursday, August 26, 2021 7:58 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> 
> On Thu, Aug 19, 2021 at 5:39 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Thursday, August 19, 2021 1:27 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > <andrew.rybchenko@oktetlabs.ru>
> > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > >
> > > On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Tuesday, August 17, 2021 11:12 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > >
> > > > > On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx
> > > > > > > queue
> > > > > > >
> > > > > > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > >
> > > > > > > > In current DPDK framework, each RX queue is pre-loaded
> > > > > > > > with mbufs for incoming packets. When number of
> > > > > > > > representors scale out in a switch domain, the memory
> > > > > > > > consumption became significant. Most important, polling
> > > > > > > > all ports leads to high cache miss, high latency and low throughput.
> > > > > > > >
> > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > Polling any queue using same shared RX queue receives
> > > > > > > > packets from all member ports. Source port is identified by mbuf->port.
> > > > > > > >
> > > > > > > > Port queue number in a shared group should be identical.
> > > > > > > > Queue index is
> > > > > > > > 1:1 mapped in shared group.
> > > > > > > >
> > > > > > > > Share RX queue must be polled on single thread or core.
> > > > > > > >
> > > > > > > > Multiple groups is supported by group ID.
> > > > > > > >
> > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > ---
> > > > > > > > Rx queue object could be used as shared Rx queue object,
> > > > > > > > it's important to clear all queue control callback api that using queue object:
> > > > > > > >
> > > > > > > > https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > > > > > >
> > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR diff --git
> > > > > > > > a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > > > > > > > d2b27c351f..a578c9db9d 100644
> > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > */
> > > > > > > > +       uint32_t shared_group; /**< Shared port group
> > > > > > > > + index in switch domain. */
> > > > > > >
> > > > > > > Not to able to see anyone setting/creating this group ID test application.
> > > > > > > How this group is created?
> > > > > >
> > > > > > Nice catch, the initial testpmd version only support one default group(0).
> > > > > > All ports that supports shared-rxq assigned in same group.
> > > > > >
> > > > > > We should be able to change "--rxq-shared" to "--rxq-shared-group"
> > > > > > to support group other than default.
> > > > > >
> > > > > > To support more groups simultaneously, need to consider
> > > > > > testpmd forwarding stream core assignment, all streams in same group need to stay on same core.
> > > > > > It's possible to specify how many ports to increase group
> > > > > > number, but user must schedule stream affinity carefully - error prone.
> > > > > >
> > > > > > On the other hand, one group should be sufficient for most
> > > > > > customer, the doubt is whether it valuable to support multiple groups test.
> > > > >
> > > > > Ack. One group is enough in testpmd.
> > > > >
> > > > > My question was more about who and how this group is created,
> > > > > Should n't we need API to create shared_group? If we do the following, at least, I can think, how it can be implemented in SW
> or other HW.
> > > > >
> > > > > - Create aggregation queue group
> > > > > - Attach multiple  Rx queues to the aggregation queue group
> > > > > - Pull the packets from the queue group(which internally fetch
> > > > > from the Rx queues _attached_)
> > > > >
> > > > > Does the above kind of sequence, break your representor use case?
> > > >
> > > > Seems more like a set of EAL wrapper. Current API tries to minimize the application efforts to adapt shared-rxq.
> > > > - step 1, not sure how important it is to create group with API, in rte_flow, group is created on demand.
> > >
> > > Which rte_flow pattern/action for this?
> >
> > No rte_flow for this, just recalled that the group in rte_flow is not created along with flow, not via api.
> > I don’t see anything else to create along with group, just double whether it valuable to introduce a new api set to manage group.
> 
> See below.
> 
> >
> > >
> > > > - step 2, currently, the attaching is done in rte_eth_rx_queue_setup, specify offload and group in rx_conf struct.
> > > > - step 3, define a dedicate api to receive packets from shared rxq? Looks clear to receive packets from shared rxq.
> > > >   currently, rxq objects in share group is same - the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
> > > >   be used to receive packets from any ports in group, normally the first port(PF) in group.
> > > >   An alternative way is defining a vdev with same queue number and copy rxq objects will make the vdev a proxy of
> > > >   the shared rxq group - this could be an helper API.
> > > >
> > > > Anyway the wrapper doesn't break use case, step 3 api is more clear, need to understand how to implement efficiently.
> > >
> > > Are you doing this feature based on any HW support or it just pure
> > > SW thing, If it is SW, It is better to have just new vdev for like drivers/net/bonding/. This we can help aggregate multiple Rxq across
> the multiple ports of same the driver.
> >
> > Based on HW support.
> 
> In Marvel HW, we do some support, I will outline here and some queries on this.
> 
> # We need to create some new HW structure for aggregation # Connect each Rxq to the new HW structure for aggregation # Use
> rx_burst from the new HW structure.
> 
> Could you outline your HW support?
> 
> Also, I am not able to understand how this will reduce the memory, atleast in our HW need creating more memory now to deal this as
> we need to deal new HW structure.
> 
> How is in your HW it reduces the memory? Also, if memory is the constraint, why NOT reduce the number of queues.
> 

Glad to know that Marvell is working on this. What's the status of the driver implementation?

My PMD implementation is very similar: a new HW object, a shared memory pool, is created to replace the per-rxq memory pool.
A legacy rxq is fed with as many allocated mbufs as it has descriptors; now shared rxqs share the same pool, so there is no need to supply
mbufs for each rxq, just feed the shared rxq.

So the memory saving is in the mbufs per rxq: even with 1000 representors in a shared rxq group, the mbufs consumed are those of one rxq.
In other words, new members of a shared rxq don't allocate new mbufs to feed their rxq; they just share the existing shared rxq (HW mempool).
I agree that the memory required to set up each rxq doesn't change much.
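A back-of-the-envelope sketch of that saving, assuming one pre-loaded mbuf per descriptor. The helper names and the 512-descriptor figure used in the note below are illustrative assumptions, not values from the patch.

```c
#include <stdint.h>

/*
 * Hypothetical helper: mbufs pre-loaded when every port feeds its own
 * Rx queues, one mbuf per descriptor.
 */
static uint64_t
mbufs_per_port_rxq(uint32_t nb_ports, uint32_t nb_queues, uint32_t nb_desc)
{
	return (uint64_t)nb_ports * nb_queues * nb_desc;
}

/*
 * Hypothetical helper: mbufs pre-loaded when all ports of one group
 * share the same Rx queue set, so only one set of queues is fed.
 */
static uint64_t
mbufs_shared_rxq(uint32_t nb_queues, uint32_t nb_desc)
{
	return (uint64_t)nb_queues * nb_desc;
}
```

With 1000 representors, one queue each and 512 descriptors per queue, the first helper gives 512000 pre-loaded mbufs and the second gives 512, independent of the number of member ports.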

> # Also, I was thinking, one way to avoid the fast path or ABI change would like.
> 
> # Driver Initializes one more eth_dev_ops in driver as aggregator ethdev # devargs of new ethdev or specific API like
> drivers/net/bonding/rte_eth_bond.h can take the argument (port, queue) tuples which needs to aggregate by new ethdev port # No
> change in fastpath or ABI is required in this model.
> 

This could be an option for accessing the shared rxq. What's the difference with the new PMD?
What's different for the PMD driver in creating the new device?

Is it important in your implementation? Does it work with the existing rx_burst API?

> 
> 
> > Most user might uses PF in group as the anchor port to rx burst, current definition should be easy for them to migrate.
> > but some user might prefer grouping some hot
> > plug/unpluggedrepresentors, EAL could provide wrappers, users could do that either due to the strategy not complex enough.
> Anyway, welcome any suggestion.
> >
> > >
> > >
> > > >
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >         /**
> > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf
> > > > > > > > { #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > +/**
> > > > > > > > + * Rx queue is shared among ports in same switch domain
> > > > > > > > +to save memory,
> > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > + */
> > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > >
> > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM
> > > > > > > > | \
> > > > > > > > --
> > > > > > > > 2.25.1
> > > > > > > >
  
Jerin Jacob Aug. 30, 2021, 9:31 a.m. UTC | #9
On Sat, Aug 28, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Thursday, August 26, 2021 7:58 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> >
> > On Thu, Aug 19, 2021 at 5:39 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > Sent: Thursday, August 19, 2021 1:27 PM
> > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > <andrew.rybchenko@oktetlabs.ru>
> > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > >
> > > > On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > Sent: Tuesday, August 17, 2021 11:12 PM
> > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > >
> > > > > > On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx
> > > > > > > > queue
> > > > > > > >
> > > > > > > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > >
> > > > > > > > > In current DPDK framework, each RX queue is pre-loaded
> > > > > > > > > with mbufs for incoming packets. When number of
> > > > > > > > > representors scale out in a switch domain, the memory
> > > > > > > > > consumption became significant. Most important, polling
> > > > > > > > > all ports leads to high cache miss, high latency and low throughput.
> > > > > > > > >
> > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > Polling any queue using same shared RX queue receives
> > > > > > > > > packets from all member ports. Source port is identified by mbuf->port.
> > > > > > > > >
> > > > > > > > > Port queue number in a shared group should be identical.
> > > > > > > > > Queue index is
> > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > >
> > > > > > > > > Share RX queue must be polled on single thread or core.
> > > > > > > > >
> > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > ---
> > > > > > > > > Rx queue object could be used as shared Rx queue object,
> > > > > > > > > it's important to clear all queue control callback api that using queue object:
> > > > > > > > >
> > > > > > > > > https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > > > > > > >
> > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR diff --git
> > > > > > > > > a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > > > > > > > > d2b27c351f..a578c9db9d 100644
> > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > > */
> > > > > > > > > +       uint32_t shared_group; /**< Shared port group
> > > > > > > > > + index in switch domain. */
> > > > > > > >
> > > > > > > > Not to able to see anyone setting/creating this group ID test application.
> > > > > > > > How this group is created?
> > > > > > >
> > > > > > > Nice catch, the initial testpmd version only support one default group(0).
> > > > > > > All ports that supports shared-rxq assigned in same group.
> > > > > > >
> > > > > > > We should be able to change "--rxq-shared" to "--rxq-shared-group"
> > > > > > > to support group other than default.
> > > > > > >
> > > > > > > To support more groups simultaneously, need to consider
> > > > > > > testpmd forwarding stream core assignment, all streams in same group need to stay on same core.
> > > > > > > It's possible to specify how many ports to increase group
> > > > > > > number, but user must schedule stream affinity carefully - error prone.
> > > > > > >
> > > > > > > On the other hand, one group should be sufficient for most
> > > > > > > customer, the doubt is whether it valuable to support multiple groups test.
> > > > > >
> > > > > > Ack. One group is enough in testpmd.
> > > > > >
> > > > > > My question was more about who and how this group is created,
> > > > > > Should n't we need API to create shared_group? If we do the following, at least, I can think, how it can be implemented in SW
> > or other HW.
> > > > > >
> > > > > > - Create aggregation queue group
> > > > > > - Attach multiple  Rx queues to the aggregation queue group
> > > > > > - Pull the packets from the queue group(which internally fetch
> > > > > > from the Rx queues _attached_)
> > > > > >
> > > > > > Does the above kind of sequence, break your representor use case?
> > > > >
> > > > > Seems more like a set of EAL wrapper. Current API tries to minimize the application efforts to adapt shared-rxq.
> > > > > - step 1, not sure how important it is to create group with API, in rte_flow, group is created on demand.
> > > >
> > > > Which rte_flow pattern/action for this?
> > >
> > > No rte_flow for this, just recalled that the group in rte_flow is not created along with flow, not via api.
> > > I don’t see anything else to create along with group, just double whether it valuable to introduce a new api set to manage group.
> >
> > See below.
> >
> > >
> > > >
> > > > > - step 2, currently, the attaching is done in rte_eth_rx_queue_setup, specify offload and group in rx_conf struct.
> > > > > - step 3, define a dedicate api to receive packets from shared rxq? Looks clear to receive packets from shared rxq.
> > > > >   currently, rxq objects in share group is same - the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
> > > > >   be used to receive packets from any ports in group, normally the first port(PF) in group.
> > > > >   An alternative way is defining a vdev with same queue number and copy rxq objects will make the vdev a proxy of
> > > > >   the shared rxq group - this could be an helper API.
> > > > >
> > > > > Anyway the wrapper doesn't break use case, step 3 api is more clear, need to understand how to implement efficiently.
> > > >
> > > > Are you doing this feature based on any HW support or it just pure
> > > > SW thing, If it is SW, It is better to have just new vdev for like drivers/net/bonding/. This we can help aggregate multiple Rxq across
> > the multiple ports of same the driver.
> > >
> > > Based on HW support.
> >
> > In Marvel HW, we do some support, I will outline here and some queries on this.
> >
> > # We need to create some new HW structure for aggregation # Connect each Rxq to the new HW structure for aggregation # Use
> > rx_burst from the new HW structure.
> >
> > Could you outline your HW support?
> >
> > Also, I am not able to understand how this will reduce the memory, atleast in our HW need creating more memory now to deal this as
> > we need to deal new HW structure.
> >
> > How is in your HW it reduces the memory? Also, if memory is the constraint, why NOT reduce the number of queues.
> >
>
> Glad to know that Marvel is working on this, what's the status of driver implementation?
>
> In my PMD implementation, it's very similar, a new HW object shared memory pool is created to replace per rxq memory pool.
> Legacy rxq feed queue with allocated mbufs as number of descriptors, now shared rxqs share the same pool, no need to supply
> mbufs for each rxq, just feed the shared rxq.
>
> So the memory saving reflects to mbuf per rxq, even 1000 representors in shared rxq group, the mbufs consumed is one rxq.
> In other words, new members in shared rxq doesn’t allocate new mbufs to feed rxq, just share with existing shared rxq(HW mempool).
> The memory required to setup each rxq doesn't change too much, agree.

We can ask the application to configure the same mempool for multiple
RQs too, right? That is, if the saving comes from sharing the mempool
across multiple RQs.

>
> > # Also, I was thinking, one way to avoid the fast path or ABI change would like.
> >
> > # Driver Initializes one more eth_dev_ops in driver as aggregator ethdev # devargs of new ethdev or specific API like
> > drivers/net/bonding/rte_eth_bond.h can take the argument (port, queue) tuples which needs to aggregate by new ethdev port # No
> > change in fastpath or ABI is required in this model.
> >
>
> This could be an option to access shared rxq. What's the difference of the new PMD?

No ABI or fast path change is required.

> What's the difference of PMD driver to create the new device?
>
> Is it important in your implementation? Does it work with existing rx_burst api?

Yes. It will work with the existing rx_burst API.

>
> >
> >
> > > Most user might uses PF in group as the anchor port to rx burst, current definition should be easy for them to migrate.
> > > but some user might prefer grouping some hot
> > > plug/unpluggedrepresentors, EAL could provide wrappers, users could do that either due to the strategy not complex enough.
> > Anyway, welcome any suggestion.
> > >
> > > >
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >         /**
> > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf
> > > > > > > > > { #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > +/**
> > > > > > > > > + * Rx queue is shared among ports in same switch domain
> > > > > > > > > +to save memory,
> > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > + */
> > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > >
> > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM
> > > > > > > > > | \
> > > > > > > > > --
> > > > > > > > > 2.25.1
> > > > > > > > >
  
Xueming Li Aug. 30, 2021, 10:13 a.m. UTC | #10
> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Monday, August 30, 2021 5:31 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> 
> On Sat, Aug 28, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Thursday, August 26, 2021 7:58 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > <andrew.rybchenko@oktetlabs.ru>
> > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > >
> > > On Thu, Aug 19, 2021 at 5:39 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Thursday, August 19, 2021 1:27 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > >
> > > > > On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > Sent: Tuesday, August 17, 2021 11:12 PM
> > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx
> > > > > > > queue
> > > > > > >
> > > > > > > On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared
> > > > > > > > > Rx queue
> > > > > > > > >
> > > > > > > > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > >
> > > > > > > > > > In current DPDK framework, each RX queue is pre-loaded
> > > > > > > > > > with mbufs for incoming packets. When number of
> > > > > > > > > > representors scale out in a switch domain, the memory
> > > > > > > > > > consumption became significant. Most important,
> > > > > > > > > > polling all ports leads to high cache miss, high latency and low throughput.
> > > > > > > > > >
> > > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > > Polling any queue using same shared RX queue receives
> > > > > > > > > > packets from all member ports. Source port is identified by mbuf->port.
> > > > > > > > > >
> > > > > > > > > > Port queue number in a shared group should be identical.
> > > > > > > > > > Queue index is
> > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > >
> > > > > > > > > > Share RX queue must be polled on single thread or core.
> > > > > > > > > >
> > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > ---
> > > > > > > > > > Rx queue object could be used as shared Rx queue
> > > > > > > > > > object, it's important to clear all queue control callback api that using queue object:
> > > > > > > > > >
> > > > > > > > > > https://mails.dpdk.org/archives/dev/2021-July/215574.h
> > > > > > > > > > tml
> > > > > > > > >
> > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR diff --git
> > > > > > > > > > a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > > > */
> > > > > > > > > > +       uint32_t shared_group; /**< Shared port group
> > > > > > > > > > + index in switch domain. */
> > > > > > > > >
> > > > > > > > > Not to able to see anyone setting/creating this group ID test application.
> > > > > > > > > How this group is created?
> > > > > > > >
> > > > > > > > Nice catch, the initial testpmd version only support one default group(0).
> > > > > > > > All ports that supports shared-rxq assigned in same group.
> > > > > > > >
> > > > > > > > We should be able to change "--rxq-shared" to "--rxq-shared-group"
> > > > > > > > to support group other than default.
> > > > > > > >
> > > > > > > > To support more groups simultaneously, need to consider
> > > > > > > > testpmd forwarding stream core assignment, all streams in same group need to stay on same core.
> > > > > > > > It's possible to specify how many ports to increase group
> > > > > > > > number, but user must schedule stream affinity carefully - error prone.
> > > > > > > >
> > > > > > > > On the other hand, one group should be sufficient for most
> > > > > > > > customer, the doubt is whether it valuable to support multiple groups test.
> > > > > > >
> > > > > > > Ack. One group is enough in testpmd.
> > > > > > >
> > > > > > > My question was more about who and how this group is
> > > > > > > created, Should n't we need API to create shared_group? If
> > > > > > > we do the following, at least, I can think, how it can be
> > > > > > > implemented in SW
> > > or other HW.
> > > > > > >
> > > > > > > - Create aggregation queue group
> > > > > > > - Attach multiple  Rx queues to the aggregation queue group
> > > > > > > - Pull the packets from the queue group(which internally
> > > > > > > fetch from the Rx queues _attached_)
> > > > > > >
> > > > > > > Does the above kind of sequence, break your representor use case?
> > > > > >
> > > > > > Seems more like a set of EAL wrapper. Current API tries to minimize the application efforts to adapt shared-rxq.
> > > > > > - step 1, not sure how important it is to create group with API, in rte_flow, group is created on demand.
> > > > >
> > > > > Which rte_flow pattern/action for this?
> > > >
> > > > No rte_flow for this, just recalled that the group in rte_flow is not created along with flow, not via api.
> > > > I don’t see anything else to create along with group, just double whether it valuable to introduce a new api set to manage group.
> > >
> > > See below.
> > >
> > > >
> > > > >
> > > > > > - step 2, currently, the attaching is done in rte_eth_rx_queue_setup, specify offload and group in rx_conf struct.
> > > > > > - step 3, define a dedicate api to receive packets from shared rxq? Looks clear to receive packets from shared rxq.
> > > > > >   currently, rxq objects in share group is same - the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
> > > > > >   be used to receive packets from any ports in group, normally the first port(PF) in group.
> > > > > >   An alternative way is defining a vdev with same queue number and copy rxq objects will make the vdev a proxy of
> > > > > >   the shared rxq group - this could be an helper API.
> > > > > >
> > > > > > Anyway the wrapper doesn't break use case, step 3 api is more clear, need to understand how to implement efficiently.
> > > > >
> > > > > Are you doing this feature based on any HW support or it just
> > > > > pure SW thing, If it is SW, It is better to have just new vdev
> > > > > for like drivers/net/bonding/. This we can help aggregate
> > > > > multiple Rxq across
> > > the multiple ports of same the driver.
> > > >
> > > > Based on HW support.
> > >
> > > In Marvel HW, we do some support, I will outline here and some queries on this.
> > >
> > > # We need to create some new HW structure for aggregation # Connect
> > > each Rxq to the new HW structure for aggregation # Use rx_burst from the new HW structure.
> > >
> > > Could you outline your HW support?
> > >
> > > Also, I am not able to understand how this will reduce the memory,
> > > atleast in our HW need creating more memory now to deal this as we need to deal new HW structure.
> > >
> > > How is in your HW it reduces the memory? Also, if memory is the constraint, why NOT reduce the number of queues.
> > >
> >
> > Glad to know that Marvel is working on this, what's the status of driver implementation?
> >
> > In my PMD implementation, it's very similar, a new HW object shared memory pool is created to replace per rxq memory pool.
> > Legacy rxq feed queue with allocated mbufs as number of descriptors,
> > now shared rxqs share the same pool, no need to supply mbufs for each rxq, just feed the shared rxq.
> >
> > So the memory saving reflects to mbuf per rxq, even 1000 representors in shared rxq group, the mbufs consumed is one rxq.
> > In other words, new members in shared rxq doesn’t allocate new mbufs to feed rxq, just share with existing shared rxq(HW
> mempool).
> > The memory required to setup each rxq doesn't change too much, agree.
> 
> We can ask the application to configure the same mempool for multiple RQ too. RIght? If the saving is based on sharing the mempool
> with multiple RQs.

Yes, using the same mempool is fundamental. The difference is how many mbufs are allocated from the pool.
Assuming 512 descriptors per rxq and 4 rxqs per device, that is 2.3K (mbuf) * 512 * 4 = 4.6M per device.
Supporting 1000 representors needs a 4.6G mempool :)
With shared rxq, only 4.6M (one device's worth) of mbufs is allocated from the mempool, and they are shared by all rxqs in the group.
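
A rough sketch of the arithmetic above, for anyone who wants to plug in their own numbers. The helper names are made up for illustration and are not part of any DPDK API; the 2355-byte mbuf size is only an assumed stand-in for "2.3K".

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of mbuf memory pre-loaded into Rx descriptors when every port
 * feeds its own queues (the legacy model described above).
 * All names here are hypothetical, not DPDK APIs.
 */
static uint64_t
rx_mbuf_bytes(uint64_t mbuf_size, uint64_t nb_desc,
	      uint64_t nb_rxq, uint64_t nb_ports)
{
	return mbuf_size * nb_desc * nb_rxq * nb_ports;
}

/*
 * With a shared Rx queue group, only one device's worth of descriptors
 * is fed with mbufs, no matter how many ports join the group.
 */
static uint64_t
shared_rx_mbuf_bytes(uint64_t mbuf_size, uint64_t nb_desc, uint64_t nb_rxq)
{
	return rx_mbuf_bytes(mbuf_size, nb_desc, nb_rxq, 1);
}
```

Calling rx_mbuf_bytes(2355, 512, 4, 1000) gives the per-port model for 1000 representors, while shared_rx_mbuf_bytes(2355, 512, 4) stays at the single-device figure regardless of group size.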

> 
> >
> > > # Also, I was thinking, one way to avoid the fast path or ABI change would like.
> > >
> > > # Driver Initializes one more eth_dev_ops in driver as aggregator
> > > ethdev # devargs of new ethdev or specific API like
> > > drivers/net/bonding/rte_eth_bond.h can take the argument (port, queue) tuples which needs to aggregate by new ethdev port #
> No change in fastpath or ABI is required in this model.
> > >
> >
> > This could be an option to access shared rxq. What's the difference of the new PMD?
> 
> No ABI and fast change are required.
> 
> > What's the difference of PMD driver to create the new device?
> >
> > Is it important in your implementation? Does it work with existing rx_burst api?
> 
> Yes . It will work with the existing rx_burst API.
> 
> >
> > >
> > >
> > > > Most user might uses PF in group as the anchor port to rx burst, current definition should be easy for them to migrate.
> > > > but some user might prefer grouping some hot
> > > > plug/unpluggedrepresentors, EAL could provide wrappers, users could do that either due to the strategy not complex enough.
> > > Anyway, welcome any suggestion.
> > > >
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >         /**
> > > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > >          * Only offloads set on rx_queue_offload_capa
> > > > > > > > > > or rx_offload_capa @@ -1373,6 +1374,12 @@ struct
> > > > > > > > > > rte_eth_conf { #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > +/**
> > > > > > > > > > + * Rx queue is shared among ports in same switch
> > > > > > > > > > +domain to save memory,
> > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > + */
> > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > >
> > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > >
> > > > > > > > > > DEV_RX_OFFLOAD_UDP_CKSUM
> > > > > > > > > > | \
> > > > > > > > > > --
> > > > > > > > > > 2.25.1
> > > > > > > > > >
  
Xueming Li Sept. 15, 2021, 2:45 p.m. UTC | #11
Hi Jerin,

On Mon, 2021-08-30 at 15:01 +0530, Jerin Jacob wrote:
> On Sat, Aug 28, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > 
> > 
> > 
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Thursday, August 26, 2021 7:58 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > 
> > > On Thu, Aug 19, 2021 at 5:39 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > 
> > > > 
> > > > 
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Thursday, August 19, 2021 1:27 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > 
> > > > > On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > Sent: Tuesday, August 17, 2021 11:12 PM
> > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > > > 
> > > > > > > On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx
> > > > > > > > > queue
> > > > > > > > > 
> > > > > > > > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > > 
> > > > > > > > > > In current DPDK framework, each RX queue is pre-loaded
> > > > > > > > > > with mbufs for incoming packets. When number of
> > > > > > > > > > representors scale out in a switch domain, the memory
> > > > > > > > > > consumption became significant. Most important, polling
> > > > > > > > > > all ports leads to high cache miss, high latency and low throughput.
> > > > > > > > > > 
> > > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > > Polling any queue using same shared RX queue receives
> > > > > > > > > > packets from all member ports. Source port is identified by mbuf->port.
> > > > > > > > > > 
> > > > > > > > > > Port queue number in a shared group should be identical.
> > > > > > > > > > Queue index is
> > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > 
> > > > > > > > > > Share RX queue must be polled on single thread or core.
> > > > > > > > > > 
> > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > ---
> > > > > > > > > > Rx queue object could be used as shared Rx queue object,
> > > > > > > > > > it's important to clear all queue control callback api that using queue object:
> > > > > > > > > > 
> > > > > > > > > > https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > > > > > > > > 
> > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR diff --git
> > > > > > > > > > a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > > > > > > > > > d2b27c351f..a578c9db9d 100644
> > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > > > */
> > > > > > > > > > +       uint32_t shared_group; /**< Shared port group
> > > > > > > > > > + index in switch domain. */
> > > > > > > > > 
> > > > > > > > > Not to able to see anyone setting/creating this group ID test application.
> > > > > > > > > How this group is created?
> > > > > > > > 
> > > > > > > > Nice catch, the initial testpmd version only support one default group(0).
> > > > > > > > All ports that supports shared-rxq assigned in same group.
> > > > > > > > 
> > > > > > > > We should be able to change "--rxq-shared" to "--rxq-shared-group"
> > > > > > > > to support group other than default.
> > > > > > > > 
> > > > > > > > To support more groups simultaneously, need to consider
> > > > > > > > testpmd forwarding stream core assignment, all streams in same group need to stay on same core.
> > > > > > > > It's possible to specify how many ports to increase group
> > > > > > > > number, but user must schedule stream affinity carefully - error prone.
> > > > > > > > 
> > > > > > > > On the other hand, one group should be sufficient for most
> > > > > > > > customer, the doubt is whether it valuable to support multiple groups test.
> > > > > > > 
> > > > > > > Ack. One group is enough in testpmd.
> > > > > > > 
> > > > > > > My question was more about who and how this group is created,
> > > > > > > Should n't we need API to create shared_group? If we do the following, at least, I can think, how it can be implemented in SW
> > > or other HW.
> > > > > > > 
> > > > > > > - Create aggregation queue group
> > > > > > > - Attach multiple  Rx queues to the aggregation queue group
> > > > > > > - Pull the packets from the queue group(which internally fetch
> > > > > > > from the Rx queues _attached_)
> > > > > > > 
> > > > > > > Does the above kind of sequence, break your representor use case?
> > > > > > 
> > > > > > Seems more like a set of EAL wrapper. Current API tries to minimize the application efforts to adapt shared-rxq.
> > > > > > - step 1, not sure how important it is to create group with API, in rte_flow, group is created on demand.
> > > > > 
> > > > > Which rte_flow pattern/action for this?
> > > > 
> > > > No rte_flow for this, just recalled that the group in rte_flow is not created along with flow, not via api.
> > > > I don’t see anything else to create along with group, just double whether it valuable to introduce a new api set to manage group.
> > > 
> > > See below.
> > > 
> > > > 
> > > > > 
> > > > > > - step 2, currently, the attaching is done in rte_eth_rx_queue_setup, specify offload and group in rx_conf struct.
> > > > > > - step 3, define a dedicate api to receive packets from shared rxq? Looks clear to receive packets from shared rxq.
> > > > > >   currently, rxq objects in share group is same - the shared rxq, so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
> > > > > >   be used to receive packets from any ports in group, normally the first port(PF) in group.
> > > > > >   An alternative way is defining a vdev with same queue number and copy rxq objects will make the vdev a proxy of
> > > > > >   the shared rxq group - this could be an helper API.
> > > > > > 
> > > > > > Anyway the wrapper doesn't break use case, step 3 api is more clear, need to understand how to implement efficiently.
> > > > > 
> > > > > Are you doing this feature based on any HW support or it just pure
> > > > > SW thing, If it is SW, It is better to have just new vdev for like drivers/net/bonding/. This we can help aggregate multiple Rxq across
> > > the multiple ports of same the driver.
> > > > 
> > > > Based on HW support.
> > > 
> > > In Marvel HW, we do some support, I will outline here and some queries on this.
> > > 
> > > # We need to create some new HW structure for aggregation # Connect each Rxq to the new HW structure for aggregation # Use
> > > rx_burst from the new HW structure.
> > > 
> > > Could you outline your HW support?
> > > 
> > > Also, I am not able to understand how this will reduce the memory, atleast in our HW need creating more memory now to deal this as
> > > we need to deal new HW structure.
> > > 
> > > How is in your HW it reduces the memory? Also, if memory is the constraint, why NOT reduce the number of queues.
> > > 
> > 
> > Glad to know that Marvel is working on this, what's the status of driver implementation?
> > 
> > In my PMD implementation, it's very similar, a new HW object shared memory pool is created to replace per rxq memory pool.
> > Legacy rxq feed queue with allocated mbufs as number of descriptors, now shared rxqs share the same pool, no need to supply
> > mbufs for each rxq, just feed the shared rxq.
> > 
> > So the memory saving reflects to mbuf per rxq, even 1000 representors in shared rxq group, the mbufs consumed is one rxq.
> > In other words, new members in shared rxq doesn’t allocate new mbufs to feed rxq, just share with existing shared rxq(HW mempool).
> > The memory required to setup each rxq doesn't change too much, agree.
> 
> We can ask the application to configure the same mempool for multiple
> RQ too. RIght? If the saving is based on sharing the mempool
> with multiple RQs.
> 
> > 
> > > # Also, I was thinking, one way to avoid the fast path or ABI change would like.
> > > 
> > > # Driver Initializes one more eth_dev_ops in driver as aggregator ethdev # devargs of new ethdev or specific API like
> > > drivers/net/bonding/rte_eth_bond.h can take the argument (port, queue) tuples which needs to aggregate by new ethdev port # No
> > > change in fastpath or ABI is required in this model.
> > > 
> > 
> > This could be an option to access shared rxq. What's the difference of the new PMD?
> 
> No ABI and fast change are required.
> 
> > What's the difference of PMD driver to create the new device?
> > 
> > Is it important in your implementation? Does it work with existing rx_burst api?
> 
> Yes . It will work with the existing rx_burst API.
> 

The aggregator ethdev the user needs is a port, so it may be good to add
a callback for the PMD to prepare a complete ethdev, just like creating
a representor ethdev, where the PMD registers the new port internally. If
the PMD doesn't provide the callback, the ethdev API falls back to
initializing an empty ethdev by copying the (shared) rxq data and rx_burst
API from the source port and share group. Actually, users can do this
fallback themselves or with a utility API.

IIUC, an aggregator ethdev is not a must. Do you think we can continue and
leave that design to a later stage?
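
To make the fallback idea concrete, here is a toy model in plain C. It is only a sketch: none of the toy_* types or functions exist in ethdev, and a real implementation would go through the PMD and rte_eth_dev data structures. It assumes the shared rxq object is the same for every member port, so a proxy port only needs to copy the queue pointers and the burst function, and the real source port is read from the packet itself (the role of mbuf->port).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_RING_SIZE 8
#define TOY_MAX_RXQ   4

/* Stand-in for an mbuf; only the source port matters here. */
struct toy_pkt {
	uint16_t port;
};

/* Stand-in for the shared Rx queue object owned by the group. */
struct toy_shared_rxq {
	struct toy_pkt *ring[TOY_RING_SIZE];
	uint16_t head;
	uint16_t count;
};

typedef uint16_t (*toy_rx_burst_t)(void *rxq, struct toy_pkt **pkts,
				   uint16_t n);

/* Stand-in for an ethdev: queue objects plus the burst callback. */
struct toy_port {
	uint16_t nb_rxq;
	void *rxq[TOY_MAX_RXQ];
	toy_rx_burst_t rx_burst;
};

/* Models hardware delivering a packet from any member port. */
static int
toy_shared_enqueue(struct toy_shared_rxq *q, struct toy_pkt *pkt)
{
	if (q->count == TOY_RING_SIZE)
		return -1;
	q->ring[(q->head + q->count) % TOY_RING_SIZE] = pkt;
	q->count++;
	return 0;
}

/* Burst receive on the shared queue; returns packets of all members. */
static uint16_t
toy_shared_rx_burst(void *rxq, struct toy_pkt **pkts, uint16_t n)
{
	struct toy_shared_rxq *q = rxq;
	uint16_t i = 0;

	while (i < n && q->count > 0) {
		pkts[i++] = q->ring[q->head];
		q->head = (uint16_t)((q->head + 1) % TOY_RING_SIZE);
		q->count--;
	}
	return i;
}

/* Fallback: build a proxy port by copying rxq pointers and rx_burst. */
static void
toy_proxy_init(struct toy_port *proxy, const struct toy_port *src)
{
	uint16_t i;

	assert(src->nb_rxq <= TOY_MAX_RXQ);
	proxy->nb_rxq = src->nb_rxq;
	for (i = 0; i < src->nb_rxq; i++)
		proxy->rxq[i] = src->rxq[i];
	proxy->rx_burst = src->rx_burst;
}
```

In this model, polling the proxy and polling the source port drain the same queue, which is why the thread's single-thread polling rule would still apply to whichever port is used.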

> > 
> > > 
> > > 
> > > > Most user might uses PF in group as the anchor port to rx burst, current definition should be easy for them to migrate.
> > > > but some user might prefer grouping some hot
> > > > plug/unpluggedrepresentors, EAL could provide wrappers, users could do that either due to the strategy not complex enough.
> > > Anyway, welcome any suggestion.
> > > > 
> > > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > >         /**
> > > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > >          * Only offloads set on rx_queue_offload_capa or
> > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf
> > > > > > > > > > { #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > +/**
> > > > > > > > > > + * Rx queue is shared among ports in same switch domain
> > > > > > > > > > +to save memory,
> > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > + */
> > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > > 
> > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM
> > > > > > > > > > > \
> > > > > > > > > > --
> > > > > > > > > > 2.25.1
> > > > > > > > > >
  
Jerin Jacob Sept. 16, 2021, 4:16 a.m. UTC | #12
On Wed, Sep 15, 2021 at 8:15 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
> Hi Jerin,
>
> On Mon, 2021-08-30 at 15:01 +0530, Jerin Jacob wrote:
> > On Sat, Aug 28, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > Sent: Thursday, August 26, 2021 7:58 PM
> > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > >
> > > > On Thu, Aug 19, 2021 at 5:39 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > Sent: Thursday, August 19, 2021 1:27 PM
> > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > >
> > > > > > On Wed, Aug 18, 2021 at 4:44 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > Sent: Tuesday, August 17, 2021 11:12 PM
> > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx queue
> > > > > > > >
> > > > > > > > On Tue, Aug 17, 2021 at 5:01 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > Sent: Tuesday, August 17, 2021 5:33 PM
> > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon
> > > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > Subject: Re: [PATCH v2 01/15] ethdev: introduce shared Rx
> > > > > > > > > > queue
> > > > > > > > > >
> > > > > > > > > > On Wed, Aug 11, 2021 at 7:34 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > In current DPDK framework, each RX queue is pre-loaded
> > > > > > > > > > > with mbufs for incoming packets. When number of
> > > > > > > > > > > representors scale out in a switch domain, the memory
> > > > > > > > > > > consumption became significant. Most important, polling
> > > > > > > > > > > all ports leads to high cache miss, high latency and low throughput.
> > > > > > > > > > >
> > > > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > > > Polling any queue using same shared RX queue receives
> > > > > > > > > > > packets from all member ports. Source port is identified by mbuf->port.
> > > > > > > > > > >
> > > > > > > > > > > Port queue number in a shared group should be identical.
> > > > > > > > > > > Queue index is
> > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > >
> > > > > > > > > > > Share RX queue must be polled on single thread or core.
> > > > > > > > > > >
> > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > > Cc: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > ---
> > > > > > > > > > > Rx queue object could be used as shared Rx queue object,
> > > > > > > > > > > it's important to clear all queue control callback api that using queue object:
> > > > > > > > > > >
> > > > > > > > > > > https://mails.dpdk.org/archives/dev/2021-July/215574.html
> > > > > > > > > >
> > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > >         uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > > >         uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > > >         uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > > > > > > > > > > +       uint32_t shared_group; /**< Shared port group index in switch domain. */
> > > > > > > > > >
> > > > > > > > > > I am not able to see anyone setting/creating this group ID in the test application.
> > > > > > > > > > How is this group created?
> > > > > > > > >
> > > > > > > > > Nice catch, the initial testpmd version only supports one default group (0).
> > > > > > > > > All ports that support shared-rxq are assigned to the same group.
> > > > > > > > >
> > > > > > > > > We should be able to change "--rxq-shared" to "--rxq-shared-group"
> > > > > > > > > to support groups other than the default.
> > > > > > > > >
> > > > > > > > > To support more groups simultaneously, we need to consider
> > > > > > > > > testpmd forwarding stream core assignment: all streams in the same group need to stay on the same core.
> > > > > > > > > It's possible to specify how many ports share a group before the group
> > > > > > > > > number is incremented, but the user must then schedule stream affinity carefully, which is error prone.
> > > > > > > > >
> > > > > > > > > On the other hand, one group should be sufficient for most
> > > > > > > > > customers; the doubt is whether it is valuable to support testing multiple groups.
> > > > > > > >
> > > > > > > > Ack. One group is enough in testpmd.
> > > > > > > >
> > > > > > > > My question was more about who creates this group and how.
> > > > > > > > Shouldn't we have an API to create shared_group? If we do the following, at least I can see how it could be implemented in SW
> > > > > > > > or other HW.
> > > > > > > >
> > > > > > > > - Create aggregation queue group
> > > > > > > > - Attach multiple Rx queues to the aggregation queue group
> > > > > > > > - Pull the packets from the queue group (which internally fetches
> > > > > > > > from the Rx queues _attached_)
> > > > > > > >
> > > > > > > > Does the above kind of sequence break your representor use case?
> > > > > > >
> > > > > > > This seems more like a set of EAL wrappers. The current API tries to minimize the application effort needed to adopt shared-rxq.
> > > > > > > - step 1: not sure how important it is to create the group with an API; in rte_flow, groups are created on demand.
> > > > > >
> > > > > > Which rte_flow pattern/action for this?
> > > > >
> > > > > No rte_flow for this; I just recalled that the group in rte_flow is created along with the flow, not via an API.
> > > > > I don't see anything else to create along with the group, so I just doubt whether it is valuable to introduce a new API set to manage groups.
> > > >
> > > > See below.
> > > >
> > > > >
> > > > > >
> > > > > > > - step 2: currently, the attaching is done in rte_eth_rx_queue_setup, by specifying the offload and group in the rx_conf struct.
> > > > > > > - step 3: define a dedicated API to receive packets from the shared rxq? That would make receiving from the shared rxq clearer.
> > > > > > >   Currently, the rxq objects in a share group are the same (the shared rxq), so the eth callback eth_rx_burst_t(rxq_obj, mbufs, n) could
> > > > > > >   be used to receive packets from any port in the group, normally the first port (PF) in the group.
> > > > > > >   An alternative is to define a vdev with the same queue number and copy the rxq objects, which makes the vdev a proxy of
> > > > > > >   the shared rxq group - this could be a helper API.
> > > > > > >
> > > > > > > Anyway, the wrapper doesn't break the use case, and the step 3 API is clearer; we need to understand how to implement it efficiently.
> > > > > >
> > > > > > Is this feature based on any HW support, or is it a pure
> > > > > > SW thing? If it is SW, it is better to have just a new vdev like drivers/net/bonding/. That way we can help aggregate multiple Rxqs across
> > > > > > the multiple ports of the same driver.
> > > > >
> > > > > Based on HW support.
> > > >
> > > > In Marvell HW, we have some support for this. I will outline it here, along with some queries.
> > > >
> > > > # We need to create some new HW structure for aggregation
> > > > # Connect each Rxq to the new HW structure for aggregation
> > > > # Use rx_burst from the new HW structure.
> > > >
> > > > Could you outline your HW support?
> > > >
> > > > Also, I am not able to understand how this will reduce the memory; at least in our HW, we need to create more memory to deal with this, as
> > > > we need to handle the new HW structure.
> > > >
> > > > How does it reduce the memory in your HW? Also, if memory is the constraint, why NOT reduce the number of queues?
> > > >
> > >
> > > Glad to know that Marvell is working on this. What's the status of the driver implementation?
> > >
> > > My PMD implementation is very similar: a new HW object, a shared memory pool, is created to replace the per-rxq memory pool.
> > > A legacy rxq feeds its queue with as many allocated mbufs as it has descriptors; now the shared rxqs share the same pool, so there is no need to supply
> > > mbufs for each rxq, just feed the shared rxq.
> > >
> > > So the memory saving is in mbufs per rxq: even with 1000 representors in a shared rxq group, the mbufs consumed are those of one rxq.
> > > In other words, new members of a shared rxq don't allocate new mbufs to feed the rxq; they just share the existing shared rxq (HW mempool).
> > > The memory required to set up each rxq doesn't change much, agreed.
> >
> > We can ask the application to configure the same mempool for multiple
> > RQs too, right? That is, if the saving is based on sharing the mempool
> > with multiple RQs.
> >
> > >
> > > > # Also, I was thinking of one way to avoid the fast path or ABI change:
> > > >
> > > > # Driver initializes one more eth_dev_ops in the driver as an aggregator ethdev
> > > > # devargs of the new ethdev, or a specific API like
> > > > drivers/net/bonding/rte_eth_bond.h, can take the (port, queue) tuples which need to be aggregated by the new ethdev port
> > > > # No change in fastpath or ABI is required in this model.
> > > >
> > >
> > > This could be an option to access the shared rxq. What's different about the new PMD?
> >
> > No ABI or fast path change is required.
> >
> > > What's the difference for the PMD driver in creating the new device?
> > >
> > > Is it important in your implementation? Does it work with the existing rx_burst API?
> >
> > Yes. It will work with the existing rx_burst API.
> >
>
> The aggregator ethdev required by the user is a port, so maybe it is good to add
> a callback for the PMD to prepare a complete ethdev, just like creating a
> representor ethdev - the PMD registers the new port internally. If the PMD
> doesn't provide the callback, the ethdev API falls back to initializing an
> empty ethdev by copying the rxq data (shared) and rx_burst API from the source port
> and share group. Actually, users can do this fallback themselves or with
> a util API.
>
> IIUC, an aggregator ethdev is not a must. Do you think we can continue and
> leave that design to a later stage?


IMO an aggregator ethdev reduces the complexity for the application and hence
avoids any change in the test application etc. I prefer to take that
approach. I will leave the decision to the ethdev maintainers.
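
For illustration only (this is not code from the patch or from any driver): the create/attach/pull sequence discussed above can be modeled in plain C without DPDK. All names below (struct pkt, struct rxq, struct aggregator, rxq_enqueue, agg_attach, agg_rx_burst) and the two size limits are hypothetical stand-ins. The only property taken from the thread is that the application polls a single object and learns the source port from a per-packet port field, as mbuf->port would report it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MEMBERS 8   /* hypothetical limit for this sketch */
#define RING_SIZE   16  /* hypothetical per-queue depth, power of two */

/* Stand-in for rte_mbuf: only the fields this sketch needs. */
struct pkt {
	uint16_t port;  /* source port, in the role of mbuf->port */
	uint32_t seq;   /* payload placeholder */
};

/* Stand-in for one Rx queue of one member port. */
struct rxq {
	uint16_t port_id;
	struct pkt ring[RING_SIZE];
	unsigned int head;  /* next slot to read */
	unsigned int tail;  /* next slot to write */
};

/* Step 1: the aggregation group is just a list of attached queues. */
struct aggregator {
	struct rxq *members[MAX_MEMBERS];
	unsigned int nb_members;
};

/* Simulates hardware placing a packet on a member queue. */
static int rxq_enqueue(struct rxq *q, uint32_t seq)
{
	assert(q != NULL);
	if (q->tail - q->head == RING_SIZE)
		return -1;  /* queue full */
	q->ring[q->tail % RING_SIZE] = (struct pkt){ .port = q->port_id, .seq = seq };
	q->tail++;
	return 0;
}

/* Step 2: attach an Rx queue to the aggregation group. */
static int agg_attach(struct aggregator *a, struct rxq *q)
{
	assert(a != NULL && q != NULL);
	if (a->nb_members == MAX_MEMBERS)
		return -1;
	a->members[a->nb_members++] = q;
	return 0;
}

/* Step 3: pull up to n packets, draining members in attach order. */
static unsigned int agg_rx_burst(struct aggregator *a, struct pkt *out, unsigned int n)
{
	unsigned int got = 0;

	for (unsigned int i = 0; i < a->nb_members && got < n; i++) {
		struct rxq *q = a->members[i];

		while (got < n && q->head != q->tail) {
			out[got++] = q->ring[q->head % RING_SIZE];
			q->head++;
		}
	}
	return got;
}
```

In this model an application would attach the PF and representor queues once and then call agg_rx_burst() from a single core, which matches the single-thread polling constraint stated in the commit message. Draining in attach order is a simplification; a real implementation would need some fairness policy.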


>
> > >
> > > >
> > > >
> > > > > Most users might use the PF in the group as the anchor port to rx burst; the current definition should be easy for them to migrate to.
> > > > > But some users might prefer grouping some hot
> > > > > plugged/unplugged representors. EAL could provide wrappers, or users could do that themselves since the strategy is not complex.
> > > > > Anyway, any suggestion is welcome.
> > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >         /**
> > > > > > > > > > >          * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > >          * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > > > > > > > > > > @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > > > >  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH                0x00080000
> > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > > +/**
> > > > > > > > > > > + * Rx queue is shared among ports in same switch domain to save memory,
> > > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > > + */
> > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > > >
> > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > >                                  DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > > > > --
> > > > > > > > > > > 2.25.1
> > > > > > > > > > >
>
  
Xueming Li Sept. 28, 2021, 5:50 a.m. UTC | #13
On Thu, 2021-09-16 at 09:46 +0530, Jerin Jacob wrote:
> On Wed, Sep 15, 2021 at 8:15 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > 
> > Hi Jerin,
> > 
> > The aggregator ethdev required by the user is a port, so maybe it is good to add
> > a callback for the PMD to prepare a complete ethdev, just like creating a
> > representor ethdev - the PMD registers the new port internally. If the PMD
> > doesn't provide the callback, the ethdev API falls back to initializing an
> > empty ethdev by copying the rxq data (shared) and rx_burst API from the source port
> > and share group. Actually, users can do this fallback themselves or with
> > a util API.
> >
> > IIUC, an aggregator ethdev is not a must. Do you think we can continue and
> > leave that design to a later stage?
> 
> 
> IMO an aggregator ethdev reduces the complexity for the application and hence
> avoids any change in the test application etc. I prefer to take that
> approach. I will leave the decision to the ethdev maintainers.

Hi Jerin, a new API has been added for the aggregator; it is the last one in v3. Thanks!

  

Patch

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index a96e12d155..2e2a9b1554 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -624,6 +624,17 @@  Supports inner packet L4 checksum.
   ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
 
 
+.. _nic_features_shared_rx_queue:
+
+Shared Rx queue
+---------------
+
+Supports shared Rx queue for ports in same switch domain.
+
+* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
+* **[provides] mbuf**: ``mbuf.port``.
+
+
 .. _nic_features_packet_type_parsing:
 
 Packet type parsing
diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
index 754184ddd4..ebeb4c1851 100644
--- a/doc/guides/nics/features/default.ini
+++ b/doc/guides/nics/features/default.ini
@@ -19,6 +19,7 @@  Free Tx mbuf on demand =
 Queue start/stop     =
 Runtime Rx queue setup =
 Runtime Tx queue setup =
+Shared Rx queue      =
 Burst mode info      =
 Power mgmt address monitor =
 MTU update           =
diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
index ff6aa91c80..45bf5a3a10 100644
--- a/doc/guides/prog_guide/switch_representation.rst
+++ b/doc/guides/prog_guide/switch_representation.rst
@@ -123,6 +123,16 @@  thought as a software "patch panel" front-end for applications.
 .. [1] `Ethernet switch device driver model (switchdev)
        <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
 
+- Memory usage of representors becomes huge as their number grows,
+  because the PMD always allocates an mbuf for each Rx queue descriptor.
+  Polling a large number of ports brings more CPU load, cache misses and
+  latency. A shared Rx queue can be shared between the PF and
+  representors in the same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
+  is present in the Rx offload capabilities of the device info. Set the
+  offload flag in the device Rx mode or Rx queue configuration to enable
+  shared Rx queue. Polling any member port of a shared Rx queue can return
+  packets of all ports in the group; the port ID is saved in ``mbuf.port``.
+
 Basic SR-IOV
 ------------
 
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 193f0d8295..058f5c88d9 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -127,6 +127,7 @@  static const struct {
 	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
 	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
 	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
+	RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
 };
 
 #undef RTE_RX_OFFLOAD_BIT2STR
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index d2b27c351f..a578c9db9d 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1047,6 +1047,7 @@  struct rte_eth_rxconf {
 	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
 	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
 	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
+	uint32_t shared_group; /**< Shared port group index in switch domain. */
 	/**
 	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
 	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
@@ -1373,6 +1374,12 @@  struct rte_eth_conf {
 #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
 #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
 #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
+/**
+ * Rx queue is shared among ports in same switch domain to save memory,
+ * avoid polling each port. Any port in group can be used to receive packets.
+ * Real source port number saved in mbuf->port field.
+ */
+#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
 
 #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
 				 DEV_RX_OFFLOAD_UDP_CKSUM | \