[v6,1/5] ethdev: introduce shared Rx queue

Message ID 20211012143942.1133718-2-xuemingl@nvidia.com (mailing list archive)
State Superseded, archived
Delegated to: Ferruh Yigit
Series: ethdev: introduce shared Rx queue

Checks

Context Check Description
ci/checkpatch success coding style OK

Commit Message

Xueming Li Oct. 12, 2021, 2:39 p.m. UTC
  In the current DPDK framework, each Rx queue is pre-loaded with mbufs to
save incoming packets. For some PMDs, when the number of representors
scales out in a switch domain, the memory consumption becomes significant.
Polling all ports also leads to high cache miss rates, high latency and
low throughput.

This patch introduces shared Rx queue support. Ports in the same Rx
domain and switch domain can share an Rx queue, which is enabled by
specifying a non-zero share group in the Rx queue configuration.

No special API is defined to receive packets from a shared Rx queue.
Polling any member port of a shared Rx queue receives packets of that
queue for all member ports; the source port is identified by mbuf->port.

A shared Rx queue must be polled in the same thread or on the same core;
polling the queue ID of any member port is essentially the same.
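
As an illustration (not part of this patch), polling one member port is
enough to drain the shared queue for every member; a minimal sketch, in
which the burst size and the per-port counter are assumptions:

  #include <rte_ethdev.h>
  #include <rte_mbuf.h>

  #define BURST_SIZE 32

  static uint64_t per_port_rx[RTE_MAX_ETHPORTS];

  /* Poll a shared Rx queue through one member port; packets of all
   * member ports arrive here and are demultiplexed by mbuf->port. */
  static void
  poll_shared_rxq(uint16_t member_port_id, uint16_t queue_id)
  {
          struct rte_mbuf *pkts[BURST_SIZE];
          uint16_t i, nb;

          nb = rte_eth_rx_burst(member_port_id, queue_id, pkts, BURST_SIZE);
          for (i = 0; i < nb; i++) {
                  /* mbuf->port is the source port of the packet, not
                   * necessarily the port the burst was issued on. */
                  per_port_rx[pkts[i]->port]++;
                  rte_pktmbuf_free(pkts[i]);
          }
  }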

Multiple share groups are supported and selected by a non-zero share
group ID. A device should support mixed configurations by allowing
multiple share groups together with non-shared Rx queues.

Even when an Rx queue is shared, queue configuration such as offloads
and RSS should not be impacted.

Example grouping and polling model to reflect service priority:
 Group1, 2 shared Rx queues per port: PF, rep0, rep1
 Group2, 1 shared Rx queue per port: rep2, rep3, ... rep127
 Core0: poll PF queue0
 Core1: poll PF queue1
 Core2: poll rep2 queue0

The PMD advertises the shared Rx queue capability via
RTE_ETH_DEV_CAPA_RXQ_SHARE.

The PMD is responsible for shared Rx queue consistency checks so that
member ports' configurations do not contradict each other.
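
A hedged setup sketch tying the capability check and share group
together; share_group and RTE_ETH_DEV_CAPA_RXQ_SHARE are the additions
of this patch, while the helper name and the 512-descriptor queue size
are assumptions:

  #include <errno.h>
  #include <string.h>
  #include <rte_ethdev.h>
  #include <rte_mempool.h>

  static int
  setup_shared_rxq(uint16_t port_id, uint16_t queue_id,
                   uint32_t share_group, struct rte_mempool *mp)
  {
          struct rte_eth_dev_info dev_info;
          struct rte_eth_rxconf rxconf;
          int ret;

          ret = rte_eth_dev_info_get(port_id, &dev_info);
          if (ret != 0)
                  return ret;
          /* Refuse a non-zero share group on devices without the
           * shared Rx queue capability. */
          if ((dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0)
                  return -ENOTSUP;

          memset(&rxconf, 0, sizeof(rxconf));
          rxconf.share_group = share_group; /* non-zero enables sharing */
          return rte_eth_rx_queue_setup(port_id, queue_id, 512,
                                        rte_eth_dev_socket_id(port_id),
                                        &rxconf, mp);
  }

In the grouping example above, the PF, rep0 and rep1 would each call
this helper with share_group=1 on queues 0 and 1, and rep2..rep127
with share_group=2 on queue 0.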

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
 doc/guides/nics/features.rst                  | 13 ++++++++++++
 doc/guides/nics/features/default.ini          |  1 +
 .../prog_guide/switch_representation.rst      | 10 +++++++++
 doc/guides/rel_notes/release_21_11.rst        |  5 +++++
 lib/ethdev/rte_ethdev.c                       |  9 ++++++++
 lib/ethdev/rte_ethdev.h                       | 21 +++++++++++++++++++
 6 files changed, 59 insertions(+)
  

Comments

Andrew Rybchenko Oct. 15, 2021, 9:28 a.m. UTC | #1
On 10/12/21 5:39 PM, Xueming Li wrote:
> In current DPDK framework, each Rx queue is pre-loaded with mbufs to
> save incoming packets. For some PMDs, when number of representors scale
> out in a switch domain, the memory consumption became significant.
> Polling all ports also leads to high cache miss, high latency and low
> throughput.
> 
> This patch introduce shared Rx queue. Ports in same Rx domain and
> switch domain could share Rx queue set by specifying non-zero sharing
> group in Rx queue configuration.
> 
> No special API is defined to receive packets from shared Rx queue.
> Polling any member port of a shared Rx queue receives packets of that
> queue for all member ports, source port is identified by mbuf->port.
> 
> Shared Rx queue must be polled in same thread or core, polling a queue
> ID of any member port is essentially same.
> 
> Multiple share groups are supported by non-zero share group ID. Device

"by non-zero share group ID" is not required. Since it must be
always non-zero to enable sharing.

> should support mixed configuration by allowing multiple share
> groups and non-shared Rx queue.
> 
> Even Rx queue shared, queue configuration like offloads and RSS should
> not be impacted.

I don't understand the above sentence.
Even when Rx queues are shared, queue configuration like
offloads and RSS may differ. If a PMD has some limitation,
it should care about consistency itself. These limitations
should be documented in the PMD documentation.

> 
> Example grouping and polling model to reflect service priority:
>  Group1, 2 shared Rx queues per port: PF, rep0, rep1
>  Group2, 1 shared Rx queue per port: rep2, rep3, ... rep127
>  Core0: poll PF queue0
>  Core1: poll PF queue1
>  Core2: poll rep2 queue0


Can I have:
PF RxQ#0, RxQ#1
Rep0 RxQ#0 shared with PF RxQ#0
Rep1 RxQ#0 shared with PF RxQ#1

I guess no, since it looks like the RxQ ID must be equal.
Or am I missing something? Otherwise the grouping rules
are not obvious to me. Maybe we need a dedicated
shared_qid within the boundaries of the share_group?

> 
> PMD driver advertise shared Rx queue capability via
> RTE_ETH_DEV_CAPA_RXQ_SHARE.
> 
> PMD driver is responsible for shared Rx queue consistency checks to
> avoid member port's configuration contradict to each other.
> 
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> ---
>  doc/guides/nics/features.rst                  | 13 ++++++++++++
>  doc/guides/nics/features/default.ini          |  1 +
>  .../prog_guide/switch_representation.rst      | 10 +++++++++
>  doc/guides/rel_notes/release_21_11.rst        |  5 +++++
>  lib/ethdev/rte_ethdev.c                       |  9 ++++++++
>  lib/ethdev/rte_ethdev.h                       | 21 +++++++++++++++++++
>  6 files changed, 59 insertions(+)
> 
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index e346018e4b8..b64433b8ea5 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -615,6 +615,19 @@ Supports inner packet L4 checksum.
>    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
>  
>  
> +.. _nic_features_shared_rx_queue:
> +
> +Shared Rx queue
> +---------------
> +
> +Supports shared Rx queue for ports in same Rx domain of a switch domain.
> +
> +* **[uses]     rte_eth_dev_info**: ``dev_capa:RTE_ETH_DEV_CAPA_RXQ_SHARE``.
> +* **[uses]     rte_eth_dev_info,rte_eth_switch_info**: ``rx_domain``, ``domain_id``.
> +* **[uses]     rte_eth_rxconf**: ``share_group``.
> +* **[provides] mbuf**: ``mbuf.port``.
> +
> +
>  .. _nic_features_packet_type_parsing:
>  
>  Packet type parsing
> diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> index d473b94091a..93f5d1b46f4 100644
> --- a/doc/guides/nics/features/default.ini
> +++ b/doc/guides/nics/features/default.ini
> @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
>  Queue start/stop     =
>  Runtime Rx queue setup =
>  Runtime Tx queue setup =
> +Shared Rx queue      =
>  Burst mode info      =
>  Power mgmt address monitor =
>  MTU update           =
> diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> index ff6aa91c806..de41db8385d 100644
> --- a/doc/guides/prog_guide/switch_representation.rst
> +++ b/doc/guides/prog_guide/switch_representation.rst
> @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
>  .. [1] `Ethernet switch device driver model (switchdev)
>         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
>  
> +- For some PMDs, memory usage of representors is huge when number of
> +  representor grows, mbufs are allocated for each descriptor of Rx queue.
> +  Polling large number of ports brings more CPU load, cache miss and
> +  latency. Shared Rx queue can be used to share Rx queue between PF and
> +  representors among same Rx domain. ``RTE_ETH_DEV_CAPA_RXQ_SHARE`` is
> +  present in device capability of device info. Setting non-zero share group
> +  in Rx queue configuration to enable share. Polling any member port can
> +  receive packets of all member ports in the group, port ID is saved in
> +  ``mbuf.port``.
> +
>  Basic SR-IOV
>  ------------
>  
> diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
> index 5036641842c..d72fc97f4fb 100644
> --- a/doc/guides/rel_notes/release_21_11.rst
> +++ b/doc/guides/rel_notes/release_21_11.rst
> @@ -141,6 +141,11 @@ New Features
>    * Added tests to validate packets hard expiry.
>    * Added tests to verify tunnel header verification in IPsec inbound.
>  
> +* **Added ethdev shared Rx queue support.**
> +
> +  * Added new device capability flag and rx domain field to switch info.
> +  * Added share group to Rx queue configuration.
> +  * Added testpmd support and dedicate forwarding engine.

Please add one more empty line, since there must be two
before the next section. Also, it should be put after
the last ethdev item above, since the list of features
has a defined order.

>  
>  Removed Items
>  -------------
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index 028907bc4b9..9b1b66370a7 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -2159,6 +2159,15 @@ rte_eth_rx_queue_setup(uint16_t port_id, uint16_t rx_queue_id,
>  		return -EINVAL;
>  	}
>  
> +	if (local_conf.share_group > 0 &&
> +	    (dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0) {
> +		RTE_ETHDEV_LOG(ERR,
> +			"Ethdev port_id=%d rx_queue_id=%d, enabled share_group=%u while device doesn't support Rx queue share in %s()\n",
> +			port_id, rx_queue_id, local_conf.share_group,
> +			__func__);

I'd remove the function name logging here. The log is unique enough.

> +		return -EINVAL;
> +	}
> +
>  	/*
>  	 * If LRO is enabled, check that the maximum aggregated packet
>  	 * size is supported by the configured device.
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index 6d80514ba7a..041da6ee52f 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1044,6 +1044,13 @@ struct rte_eth_rxconf {
>  	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>  	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
>  	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> +	/**
> +	 * Share group index in Rx domain and switch domain.
> +	 * Non-zero value to enable Rx queue share, zero value disable share.
> +	 * PMD driver is responsible for Rx queue consistency checks to avoid
> +	 * member port's configuration contradict to each other.
> +	 */
> +	uint32_t share_group;

I think that we don't need 32 bits for share groups.
16 bits sound more than enough.

>  	/**
>  	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
>  	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> @@ -1445,6 +1452,14 @@ struct rte_eth_conf {
>  #define RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP 0x00000001
>  /** Device supports Tx queue setup after device started. */
>  #define RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP 0x00000002
> +/**
> + * Device supports shared Rx queue among ports within Rx domain and
> + * switch domain. Mbufs are consumed by shared Rx queue instead of
> + * every port. Multiple groups is supported by share_group of Rx
> + * queue configuration. Polling any port in the group receive packets
> + * of all member ports, source port identified by mbuf->port field.
> + */
> +#define RTE_ETH_DEV_CAPA_RXQ_SHARE              0x00000004

Let's use RTE_BIT64(2)

I think the above two should be fixed in a separate
cleanup patch.

>  /**@}*/
>  
>  /*
> @@ -1488,6 +1503,12 @@ struct rte_eth_switch_info {
>  	 * but each driver should explicitly define the mapping of switch
>  	 * port identifier to that physical interconnect/switch
>  	 */
> +	uint16_t rx_domain;
> +	/**<
> +	 * Shared Rx queue sub-domain boundary. Only ports in same Rx domain
> +	 * and switch domain can share Rx queue. Valid only if device advertised
> +	 * RTE_ETH_DEV_CAPA_RXQ_SHARE capability.
> +	 */

Please, put the documentation before the documented
field.

[snip]
  
Xueming Li Oct. 15, 2021, 10:54 a.m. UTC | #2
On Fri, 2021-10-15 at 12:28 +0300, Andrew Rybchenko wrote:
> On 10/12/21 5:39 PM, Xueming Li wrote:
> > In current DPDK framework, each Rx queue is pre-loaded with mbufs to
> > save incoming packets. For some PMDs, when number of representors scale
> > out in a switch domain, the memory consumption became significant.
> > Polling all ports also leads to high cache miss, high latency and low
> > throughput.
> > 
> > This patch introduce shared Rx queue. Ports in same Rx domain and
> > switch domain could share Rx queue set by specifying non-zero sharing
> > group in Rx queue configuration.
> > 
> > No special API is defined to receive packets from shared Rx queue.
> > Polling any member port of a shared Rx queue receives packets of that
> > queue for all member ports, source port is identified by mbuf->port.
> > 
> > Shared Rx queue must be polled in same thread or core, polling a queue
> > ID of any member port is essentially same.
> > 
> > Multiple share groups are supported by non-zero share group ID. Device
> 
> "by non-zero share group ID" is not required. Since it must be
> always non-zero to enable sharing.
> 
> > should support mixed configuration by allowing multiple share
> > groups and non-shared Rx queue.
> > 
> > Even Rx queue shared, queue configuration like offloads and RSS should
> > not be impacted.
> 
> I don't understand above sentence.
> Even when Rx queues are shared, queue configuration like
> offloads and RSS may differ. If a PMD has some limitation,
> it should care about consistency itself. These limitations
> should be documented in the PMD documentation.
> 

OK, I'll remove this line.

> > 
> > Example grouping and polling model to reflect service priority:
> >  Group1, 2 shared Rx queues per port: PF, rep0, rep1
> >  Group2, 1 shared Rx queue per port: rep2, rep3, ... rep127
> >  Core0: poll PF queue0
> >  Core1: poll PF queue1
> >  Core2: poll rep2 queue0
> 
> 
> Can I have:
> PF RxQ#0, RxQ#1
> Rep0 RxQ#0 shared with PF RxQ#0
> Rep1 RxQ#0 shared with PF RxQ#1
> 
> I guess no, since it looks like RxQ ID must be equal.
> Or am I missing something? Otherwise grouping rules
> are not obvious to me. May be we need dedicated
> shared_qid in boundaries of the share_group?

Yes, the RxQ ID must be equal; the following configuration should work:
  Rep1 RxQ#1 shared with PF RxQ#1
Equal mapping should work by default instead of a new field that must
be set. I'll add some description to emphasize this; what do you think?

> 
> > 
> > PMD driver advertise shared Rx queue capability via
> > RTE_ETH_DEV_CAPA_RXQ_SHARE.
> > 
> > PMD driver is responsible for shared Rx queue consistency checks to
> > avoid member port's configuration contradict to each other.
> > 
> > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > ---
> >  doc/guides/nics/features.rst                  | 13 ++++++++++++
> >  doc/guides/nics/features/default.ini          |  1 +
> >  .../prog_guide/switch_representation.rst      | 10 +++++++++
> >  doc/guides/rel_notes/release_21_11.rst        |  5 +++++
> >  lib/ethdev/rte_ethdev.c                       |  9 ++++++++
> >  lib/ethdev/rte_ethdev.h                       | 21 +++++++++++++++++++
> >  6 files changed, 59 insertions(+)
> > 
> > diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> > index e346018e4b8..b64433b8ea5 100644
> > --- a/doc/guides/nics/features.rst
> > +++ b/doc/guides/nics/features.rst
> > @@ -615,6 +615,19 @@ Supports inner packet L4 checksum.
> >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> >  
> >  
> > +.. _nic_features_shared_rx_queue:
> > +
> > +Shared Rx queue
> > +---------------
> > +
> > +Supports shared Rx queue for ports in same Rx domain of a switch domain.
> > +
> > +* **[uses]     rte_eth_dev_info**: ``dev_capa:RTE_ETH_DEV_CAPA_RXQ_SHARE``.
> > +* **[uses]     rte_eth_dev_info,rte_eth_switch_info**: ``rx_domain``, ``domain_id``.
> > +* **[uses]     rte_eth_rxconf**: ``share_group``.
> > +* **[provides] mbuf**: ``mbuf.port``.
> > +
> > +
> >  .. _nic_features_packet_type_parsing:
> >  
> >  Packet type parsing
> > diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> > index d473b94091a..93f5d1b46f4 100644
> > --- a/doc/guides/nics/features/default.ini
> > +++ b/doc/guides/nics/features/default.ini
> > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> >  Queue start/stop     =
> >  Runtime Rx queue setup =
> >  Runtime Tx queue setup =
> > +Shared Rx queue      =
> >  Burst mode info      =
> >  Power mgmt address monitor =
> >  MTU update           =
> > diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> > index ff6aa91c806..de41db8385d 100644
> > --- a/doc/guides/prog_guide/switch_representation.rst
> > +++ b/doc/guides/prog_guide/switch_representation.rst
> > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> >  .. [1] `Ethernet switch device driver model (switchdev)
> >         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> >  
> > +- For some PMDs, memory usage of representors is huge when number of
> > +  representor grows, mbufs are allocated for each descriptor of Rx queue.
> > +  Polling large number of ports brings more CPU load, cache miss and
> > +  latency. Shared Rx queue can be used to share Rx queue between PF and
> > +  representors among same Rx domain. ``RTE_ETH_DEV_CAPA_RXQ_SHARE`` is
> > +  present in device capability of device info. Setting non-zero share group
> > +  in Rx queue configuration to enable share. Polling any member port can
> > +  receive packets of all member ports in the group, port ID is saved in
> > +  ``mbuf.port``.
> > +
> >  Basic SR-IOV
> >  ------------
> >  
> > diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
> > index 5036641842c..d72fc97f4fb 100644
> > --- a/doc/guides/rel_notes/release_21_11.rst
> > +++ b/doc/guides/rel_notes/release_21_11.rst
> > @@ -141,6 +141,11 @@ New Features
> >    * Added tests to validate packets hard expiry.
> >    * Added tests to verify tunnel header verification in IPsec inbound.
> >  
> > +* **Added ethdev shared Rx queue support.**
> > +
> > +  * Added new device capability flag and rx domain field to switch info.
> > +  * Added share group to Rx queue configuration.
> > +  * Added testpmd support and dedicate forwarding engine.
> 
> Please, add one more empty line since it must be two
> before the next section. Also it should be put after
> the last ethdev item above since list of features has
> defined order.
> 
> >  
> >  Removed Items
> >  -------------
> > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > index 028907bc4b9..9b1b66370a7 100644
> > --- a/lib/ethdev/rte_ethdev.c
> > +++ b/lib/ethdev/rte_ethdev.c
> > @@ -2159,6 +2159,15 @@ rte_eth_rx_queue_setup(uint16_t port_id, uint16_t rx_queue_id,
> >  		return -EINVAL;
> >  	}
> >  
> > +	if (local_conf.share_group > 0 &&
> > +	    (dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0) {
> > +		RTE_ETHDEV_LOG(ERR,
> > +			"Ethdev port_id=%d rx_queue_id=%d, enabled share_group=%u while device doesn't support Rx queue share in %s()\n",
> > +			port_id, rx_queue_id, local_conf.share_group,
> > +			__func__);
> 
> I'd remove function name logging here. Log is unique enough.
> 
> > +		return -EINVAL;
> > +	}
> > +
> >  	/*
> >  	 * If LRO is enabled, check that the maximum aggregated packet
> >  	 * size is supported by the configured device.
> > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > index 6d80514ba7a..041da6ee52f 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -1044,6 +1044,13 @@ struct rte_eth_rxconf {
> >  	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> >  	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> >  	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > +	/**
> > +	 * Share group index in Rx domain and switch domain.
> > +	 * Non-zero value to enable Rx queue share, zero value disable share.
> > +	 * PMD driver is responsible for Rx queue consistency checks to avoid
> > +	 * member port's configuration contradict to each other.
> > +	 */
> > +	uint32_t share_group;
> 
> I think that we don't need 32-bit for shared groups.
> 16-bits sounds more than enough.
> 
> >  	/**
> >  	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> >  	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > @@ -1445,6 +1452,14 @@ struct rte_eth_conf {
> >  #define RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP 0x00000001
> >  /** Device supports Tx queue setup after device started. */
> >  #define RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP 0x00000002
> > +/**
> > + * Device supports shared Rx queue among ports within Rx domain and
> > + * switch domain. Mbufs are consumed by shared Rx queue instead of
> > + * every port. Multiple groups is supported by share_group of Rx
> > + * queue configuration. Polling any port in the group receive packets
> > + * of all member ports, source port identified by mbuf->port field.
> > + */
> > +#define RTE_ETH_DEV_CAPA_RXQ_SHARE              0x00000004
> 
> Let's use RTE_BIT64(2)
> 
> I think above two should be fixed in a separate
> cleanup patch.

Not only the above two; there are more Rx/Tx offload bits to clean up.
Let's do it later.

> 
> >  /**@}*/
> >  
> >  /*
> > @@ -1488,6 +1503,12 @@ struct rte_eth_switch_info {
> >  	 * but each driver should explicitly define the mapping of switch
> >  	 * port identifier to that physical interconnect/switch
> >  	 */
> > +	uint16_t rx_domain;
> > +	/**<
> > +	 * Shared Rx queue sub-domain boundary. Only ports in same Rx domain
> > +	 * and switch domain can share Rx queue. Valid only if device advertised
> > +	 * RTE_ETH_DEV_CAPA_RXQ_SHARE capability.
> > +	 */
> 
> Please, put the documentation before the documented
> field.
> 
> [snip]
  
Ferruh Yigit Oct. 15, 2021, 5:20 p.m. UTC | #3
On 10/12/2021 3:39 PM, Xueming Li wrote:
> index 6d80514ba7a..041da6ee52f 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1044,6 +1044,13 @@ struct rte_eth_rxconf {
>   	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>   	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
>   	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> +	/**
> +	 * Share group index in Rx domain and switch domain.
> +	 * Non-zero value to enable Rx queue share, zero value disable share.
> +	 * PMD driver is responsible for Rx queue consistency checks to avoid

When you update the set, can you please update the 'PMD driver' usage too?

PMD = Poll Mode Driver, so the second 'driver' is redundant; there are a
few more instances of this usage in this set.
  
Xueming Li Oct. 16, 2021, 9:14 a.m. UTC | #4
On Fri, 2021-10-15 at 18:20 +0100, Ferruh Yigit wrote:
> On 10/12/2021 3:39 PM, Xueming Li wrote:
> > index 6d80514ba7a..041da6ee52f 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -1044,6 +1044,13 @@ struct rte_eth_rxconf {
> >   	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> >   	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> >   	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > +	/**
> > +	 * Share group index in Rx domain and switch domain.
> > +	 * Non-zero value to enable Rx queue share, zero value disable share.
> > +	 * PMD driver is responsible for Rx queue consistency checks to avoid
> 
> When you update the set, can you please update 'PMD driver' usage too?
> 
> PMD = Poll Mode Driver, so second 'driver' is duplicate, there are a
> few more instance of this usage in this set.

Got it, thanks!

BTW, PMD patches updated:
https://patches.dpdk.org/project/dpdk/list/?series=19709
  
Andrew Rybchenko Oct. 18, 2021, 6:46 a.m. UTC | #5
On 10/15/21 1:54 PM, Xueming(Steven) Li wrote:
> On Fri, 2021-10-15 at 12:28 +0300, Andrew Rybchenko wrote:
>> On 10/12/21 5:39 PM, Xueming Li wrote:
>>> In current DPDK framework, each Rx queue is pre-loaded with mbufs to
>>> save incoming packets. For some PMDs, when number of representors scale
>>> out in a switch domain, the memory consumption became significant.
>>> Polling all ports also leads to high cache miss, high latency and low
>>> throughput.
>>>
>>> This patch introduce shared Rx queue. Ports in same Rx domain and
>>> switch domain could share Rx queue set by specifying non-zero sharing
>>> group in Rx queue configuration.
>>>
>>> No special API is defined to receive packets from shared Rx queue.
>>> Polling any member port of a shared Rx queue receives packets of that
>>> queue for all member ports, source port is identified by mbuf->port.
>>>
>>> Shared Rx queue must be polled in same thread or core, polling a queue
>>> ID of any member port is essentially same.
>>>
>>> Multiple share groups are supported by non-zero share group ID. Device
>>
>> "by non-zero share group ID" is not required. Since it must be
>> always non-zero to enable sharing.
>>
>>> should support mixed configuration by allowing multiple share
>>> groups and non-shared Rx queue.
>>>
>>> Even Rx queue shared, queue configuration like offloads and RSS should
>>> not be impacted.
>>
>> I don't understand above sentence.
>> Even when Rx queues are shared, queue configuration like
>> offloads and RSS may differ. If a PMD has some limitation,
>> it should care about consistency itself. These limitations
>> should be documented in the PMD documentation.
>>
> 
> OK, I'll remove this line.
> 
>>>
>>> Example grouping and polling model to reflect service priority:
>>>  Group1, 2 shared Rx queues per port: PF, rep0, rep1
>>>  Group2, 1 shared Rx queue per port: rep2, rep3, ... rep127
>>>  Core0: poll PF queue0
>>>  Core1: poll PF queue1
>>>  Core2: poll rep2 queue0
>>
>>
>> Can I have:
>> PF RxQ#0, RxQ#1
>> Rep0 RxQ#0 shared with PF RxQ#0
>> Rep1 RxQ#0 shared with PF RxQ#1
>>
>> I guess no, since it looks like RxQ ID must be equal.
>> Or am I missing something? Otherwise grouping rules
>> are not obvious to me. May be we need dedicated
>> shared_qid in boundaries of the share_group?
> 
> Yes, RxQ ID must be equal, following configuration should work:
>   Rep1 RxQ#1 shared with PF RxQ#1

But I want just one RxQ on Rep1. I don't need two.

> Equal mapping should work by default instead of a new field that must
> be set. I'll add some description to emphasis, how do you think?

Sorry for the delay in replying. I think that the above limitation is
not nice. It is better to avoid it.

[snip]
  
Xueming Li Oct. 18, 2021, 6:57 a.m. UTC | #6
On Mon, 2021-10-18 at 09:46 +0300, Andrew Rybchenko wrote:
> On 10/15/21 1:54 PM, Xueming(Steven) Li wrote:
> > On Fri, 2021-10-15 at 12:28 +0300, Andrew Rybchenko wrote:
> > > On 10/12/21 5:39 PM, Xueming Li wrote:
> > > > In current DPDK framework, each Rx queue is pre-loaded with mbufs to
> > > > save incoming packets. For some PMDs, when number of representors scale
> > > > out in a switch domain, the memory consumption became significant.
> > > > Polling all ports also leads to high cache miss, high latency and low
> > > > throughput.
> > > > 
> > > > This patch introduce shared Rx queue. Ports in same Rx domain and
> > > > switch domain could share Rx queue set by specifying non-zero sharing
> > > > group in Rx queue configuration.
> > > > 
> > > > No special API is defined to receive packets from shared Rx queue.
> > > > Polling any member port of a shared Rx queue receives packets of that
> > > > queue for all member ports, source port is identified by mbuf->port.
> > > > 
> > > > Shared Rx queue must be polled in same thread or core, polling a queue
> > > > ID of any member port is essentially same.
> > > > 
> > > > Multiple share groups are supported by non-zero share group ID. Device
> > > 
> > > "by non-zero share group ID" is not required. Since it must be
> > > always non-zero to enable sharing.
> > > 
> > > > should support mixed configuration by allowing multiple share
> > > > groups and non-shared Rx queue.
> > > > 
> > > > Even Rx queue shared, queue configuration like offloads and RSS should
> > > > not be impacted.
> > > 
> > > I don't understand above sentence.
> > > Even when Rx queues are shared, queue configuration like
> > > offloads and RSS may differ. If a PMD has some limitation,
> > > it should care about consistency itself. These limitations
> > > should be documented in the PMD documentation.
> > > 
> > 
> > OK, I'll remove this line.
> > 
> > > > 
> > > > Example grouping and polling model to reflect service priority:
> > > >  Group1, 2 shared Rx queues per port: PF, rep0, rep1
> > > >  Group2, 1 shared Rx queue per port: rep2, rep3, ... rep127
> > > >  Core0: poll PF queue0
> > > >  Core1: poll PF queue1
> > > >  Core2: poll rep2 queue0
> > > 
> > > 
> > > Can I have:
> > > PF RxQ#0, RxQ#1
> > > Rep0 RxQ#0 shared with PF RxQ#0
> > > Rep1 RxQ#0 shared with PF RxQ#1
> > > 
> > > I guess no, since it looks like RxQ ID must be equal.
> > > Or am I missing something? Otherwise grouping rules
> > > are not obvious to me. May be we need dedicated
> > > shared_qid in boundaries of the share_group?
> > 
> > Yes, RxQ ID must be equal, following configuration should work:
> >   Rep1 RxQ#1 shared with PF RxQ#1
> 
> But I want just one RxQ on Rep1. I don't need two.
> 
> > Equal mapping should work by default instead of a new field that must
> > be set. I'll add some description to emphasis, how do you think?
> 
> Sorry for delay with reply. I think that above limitation is
> not nice. It is better to avoid it.

Okay, it will offer more flexibility. I will add it in the next version.
The user has to be aware of the indirect mapping relation; the following
polling in the above example Rx-bursts the same shared RxQ (see the
sketch below):
  PF RxQ#1
  Rep1 RxQ#0 shared with PF RxQ#1
> 
> [snip]
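
A small sketch of that indirect mapping, with the port IDs (0 for the
PF, 1 for Rep1) being assumptions:

  #include <rte_ethdev.h>
  #include <rte_mbuf.h>

  /* With Rep1 RxQ#0 sharing the group of PF RxQ#1, either burst
   * below drains the same underlying shared queue. */
  static uint16_t
  drain_shared_rxq(struct rte_mbuf **pkts, uint16_t n)
  {
          uint16_t nb;

          nb = rte_eth_rx_burst(0 /* PF */, 1 /* RxQ#1 */, pkts, n);
          if (nb == 0)
                  nb = rte_eth_rx_burst(1 /* Rep1 */, 0 /* RxQ#0 */,
                                        pkts, n);
          return nb;
  }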
  

Patch

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index e346018e4b8..b64433b8ea5 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -615,6 +615,19 @@  Supports inner packet L4 checksum.
   ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
 
 
+.. _nic_features_shared_rx_queue:
+
+Shared Rx queue
+---------------
+
+Supports shared Rx queue for ports in same Rx domain of a switch domain.
+
+* **[uses]     rte_eth_dev_info**: ``dev_capa:RTE_ETH_DEV_CAPA_RXQ_SHARE``.
+* **[uses]     rte_eth_dev_info,rte_eth_switch_info**: ``rx_domain``, ``domain_id``.
+* **[uses]     rte_eth_rxconf**: ``share_group``.
+* **[provides] mbuf**: ``mbuf.port``.
+
+
 .. _nic_features_packet_type_parsing:
 
 Packet type parsing
diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
index d473b94091a..93f5d1b46f4 100644
--- a/doc/guides/nics/features/default.ini
+++ b/doc/guides/nics/features/default.ini
@@ -19,6 +19,7 @@  Free Tx mbuf on demand =
 Queue start/stop     =
 Runtime Rx queue setup =
 Runtime Tx queue setup =
+Shared Rx queue      =
 Burst mode info      =
 Power mgmt address monitor =
 MTU update           =
diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
index ff6aa91c806..de41db8385d 100644
--- a/doc/guides/prog_guide/switch_representation.rst
+++ b/doc/guides/prog_guide/switch_representation.rst
@@ -123,6 +123,16 @@  thought as a software "patch panel" front-end for applications.
 .. [1] `Ethernet switch device driver model (switchdev)
        <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
 
+- For some PMDs, memory usage of representors is huge when number of
+  representor grows, mbufs are allocated for each descriptor of Rx queue.
+  Polling large number of ports brings more CPU load, cache miss and
+  latency. Shared Rx queue can be used to share Rx queue between PF and
+  representors among same Rx domain. ``RTE_ETH_DEV_CAPA_RXQ_SHARE`` is
+  present in device capability of device info. Setting non-zero share group
+  in Rx queue configuration to enable share. Polling any member port can
+  receive packets of all member ports in the group, port ID is saved in
+  ``mbuf.port``.
+
 Basic SR-IOV
 ------------
 
diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst
index 5036641842c..d72fc97f4fb 100644
--- a/doc/guides/rel_notes/release_21_11.rst
+++ b/doc/guides/rel_notes/release_21_11.rst
@@ -141,6 +141,11 @@  New Features
   * Added tests to validate packets hard expiry.
   * Added tests to verify tunnel header verification in IPsec inbound.
 
+* **Added ethdev shared Rx queue support.**
+
+  * Added new device capability flag and rx domain field to switch info.
+  * Added share group to Rx queue configuration.
+  * Added testpmd support and dedicate forwarding engine.
 
 Removed Items
 -------------
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 028907bc4b9..9b1b66370a7 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -2159,6 +2159,15 @@  rte_eth_rx_queue_setup(uint16_t port_id, uint16_t rx_queue_id,
 		return -EINVAL;
 	}
 
+	if (local_conf.share_group > 0 &&
+	    (dev_info.dev_capa & RTE_ETH_DEV_CAPA_RXQ_SHARE) == 0) {
+		RTE_ETHDEV_LOG(ERR,
+			"Ethdev port_id=%d rx_queue_id=%d, enabled share_group=%u while device doesn't support Rx queue share in %s()\n",
+			port_id, rx_queue_id, local_conf.share_group,
+			__func__);
+		return -EINVAL;
+	}
+
 	/*
 	 * If LRO is enabled, check that the maximum aggregated packet
 	 * size is supported by the configured device.
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 6d80514ba7a..041da6ee52f 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1044,6 +1044,13 @@  struct rte_eth_rxconf {
 	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
 	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
 	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
+	/**
+	 * Share group index in Rx domain and switch domain.
+	 * Non-zero value to enable Rx queue share, zero value disable share.
+	 * PMD driver is responsible for Rx queue consistency checks to avoid
+	 * member port's configuration contradict to each other.
+	 */
+	uint32_t share_group;
 	/**
 	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
 	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
@@ -1445,6 +1452,14 @@  struct rte_eth_conf {
 #define RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP 0x00000001
 /** Device supports Tx queue setup after device started. */
 #define RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP 0x00000002
+/**
+ * Device supports shared Rx queue among ports within Rx domain and
+ * switch domain. Mbufs are consumed by shared Rx queue instead of
+ * every port. Multiple groups is supported by share_group of Rx
+ * queue configuration. Polling any port in the group receive packets
+ * of all member ports, source port identified by mbuf->port field.
+ */
+#define RTE_ETH_DEV_CAPA_RXQ_SHARE              0x00000004
 /**@}*/
 
 /*
@@ -1488,6 +1503,12 @@  struct rte_eth_switch_info {
 	 * but each driver should explicitly define the mapping of switch
 	 * port identifier to that physical interconnect/switch
 	 */
+	uint16_t rx_domain;
+	/**<
+	 * Shared Rx queue sub-domain boundary. Only ports in same Rx domain
+	 * and switch domain can share Rx queue. Valid only if device advertised
+	 * RTE_ETH_DEV_CAPA_RXQ_SHARE capability.
+	 */
 };
 
 /**