[v4,1/6] ethdev: introduce shared Rx queue

Message ID: 20210930145602.763969-2-xuemingl@nvidia.com
State: Superseded, archived
Delegated to: Ferruh Yigit
Series: ethdev: introduce shared Rx queue

Checks

Context         Check     Description
ci/checkpatch   success   coding style OK

Commit Message

Xueming Li Sept. 30, 2021, 2:55 p.m. UTC
In the current DPDK framework, each Rx queue is pre-loaded with mbufs for
incoming packets. When the number of representors scales out in a switch
domain, the memory consumption becomes significant. More importantly,
polling all ports leads to a high cache miss rate, high latency and low
throughput.

This patch introduces the shared Rx queue. Ports with the same
configuration in a switch domain can share an Rx queue set by specifying
a sharing group. Polling any queue that uses the same shared Rx queue
receives packets from all member ports. The source port is identified by
mbuf->port.

All ports in a shared group should have the same number of queues. Queue
indexes are mapped 1:1 within a shared group.

A shared Rx queue must be polled from a single thread or core.

Multiple groups are supported by group ID.

Signed-off-by: Xueming Li <xuemingl@nvidia.com>
Cc: Jerin Jacob <jerinjacobk@gmail.com>
---
An Rx queue object could be used as a shared Rx queue object, so it's
important to clear all queue control callback APIs that use the queue object:
  https://mails.dpdk.org/archives/dev/2021-July/215574.html
---
 doc/guides/nics/features.rst                    | 11 +++++++++++
 doc/guides/nics/features/default.ini            |  1 +
 doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
 lib/ethdev/rte_ethdev.c                         |  1 +
 lib/ethdev/rte_ethdev.h                         |  7 +++++++
 5 files changed, 30 insertions(+)
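
The polling model described in the commit message can be sketched as follows. This is a minimal illustration and not code from the series: `struct mock_mbuf`, `MOCK_MAX_PORTS` and `demux_burst()` are hypothetical stand-ins, and only the `port` field mirrors the `mbuf->port` field named above. The sketch assumes a burst has already been received from one shared Rx queue and shows how an application might attribute each packet to its source port.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf; only the field that the
 * commit message relies on is modeled here. */
struct mock_mbuf {
	uint16_t port; /* source port of the packet, as in mbuf->port */
};

#define MOCK_MAX_PORTS 8 /* illustrative limit, not a DPDK constant */

/*
 * Walk one burst taken from a shared Rx queue and count packets per
 * source port. Packets whose port is outside the table are skipped.
 * Returns the number of packets that were attributed to a port.
 */
static size_t
demux_burst(const struct mock_mbuf *pkts, size_t nb_pkts,
	    uint32_t per_port[MOCK_MAX_PORTS])
{
	size_t i, accepted = 0;

	assert(per_port != NULL);
	assert(pkts != NULL || nb_pkts == 0);

	for (i = 0; i < nb_pkts; i++) {
		if (pkts[i].port >= MOCK_MAX_PORTS)
			continue; /* unknown port: ignored in this sketch */
		per_port[pkts[i].port]++;
		accepted++;
	}
	return accepted;
}
```

Because one burst may mix packets from every member port, per-port handling has to key off the packet itself rather than the port that was polled.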
  

Comments

Andrew Rybchenko Oct. 11, 2021, 10:47 a.m. UTC | #1
On 9/30/21 5:55 PM, Xueming Li wrote:
> In current DPDK framework, each RX queue is pre-loaded with mbufs for

RX -> Rx

> incoming packets. When number of representors scale out in a switch
> domain, the memory consumption became significant. Most important,
> polling all ports leads to high cache miss, high latency and low
> throughput.

It should be highlighted that it is a problem of some PMDs.
Not all.

> 
> This patch introduces shared RX queue. Ports with same configuration in

"This patch introduces" -> "Introduce"

RX -> Rx

> a switch domain could share RX queue set by specifying sharing group.

RX -> Rx

> Polling any queue using same shared RX queue receives packets from all

RX -> Rx

> member ports. Source port is identified by mbuf->port.
> 
> Port queue number in a shared group should be identical. Queue index is
> 1:1 mapped in shared group.
> 
> Share RX queue must be polled on single thread or core.

RX -> Rx

> 
> Multiple groups is supported by group ID.

is -> are

> 
> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> Cc: Jerin Jacob <jerinjacobk@gmail.com>

The patch should update release notes.

> ---
> Rx queue object could be used as shared Rx queue object, it's important
> to clear all queue control callback api that using queue object:
>   https://mails.dpdk.org/archives/dev/2021-July/215574.html
> ---
>  doc/guides/nics/features.rst                    | 11 +++++++++++
>  doc/guides/nics/features/default.ini            |  1 +
>  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
>  lib/ethdev/rte_ethdev.c                         |  1 +
>  lib/ethdev/rte_ethdev.h                         |  7 +++++++
>  5 files changed, 30 insertions(+)
> 
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index 4fce8cd1c97..69bc1d5719c 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -626,6 +626,17 @@ Supports inner packet L4 checksum.
>    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
>  
>  
> +.. _nic_features_shared_rx_queue:
> +
> +Shared Rx queue
> +---------------
> +
> +Supports shared Rx queue for ports in same switch domain.
> +
> +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> +* **[provides] mbuf**: ``mbuf.port``.
> +
> +
>  .. _nic_features_packet_type_parsing:
>  
>  Packet type parsing
> diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> index 754184ddd4d..ebeb4c18512 100644
> --- a/doc/guides/nics/features/default.ini
> +++ b/doc/guides/nics/features/default.ini
> @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
>  Queue start/stop     =
>  Runtime Rx queue setup =
>  Runtime Tx queue setup =
> +Shared Rx queue      =
>  Burst mode info      =
>  Power mgmt address monitor =
>  MTU update           =
> diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> index ff6aa91c806..bc7ce65fa3d 100644
> --- a/doc/guides/prog_guide/switch_representation.rst
> +++ b/doc/guides/prog_guide/switch_representation.rst
> @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
>  .. [1] `Ethernet switch device driver model (switchdev)
>         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
>  
> +- Memory usage of representors is huge when number of representor grows,
> +  because PMD always allocate mbuf for each descriptor of Rx queue.

It is a problem of some PMDs only. So, it must be rewritten to
highlight it.

> +  Polling the large number of ports brings more CPU load, cache miss and
> +  latency. Shared Rx queue can be used to share Rx queue between PF and
> +  representors in same switch. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ`` is
> +  present in Rx offloading capability of device info. Setting the
> +  offloading flag in device Rx mode or Rx queue configuration to enable
> +  shared Rx queue. Polling any member port of the shared Rx queue can return
> +  packets of all ports in the group, port ID is saved in ``mbuf.port``.
> +
>  Basic SR-IOV
>  ------------
>  
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index 61aa49efec6..73270c10492 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -127,6 +127,7 @@ static const struct {
>  	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
>  	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
>  	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> +	RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
>  };
>  
>  #undef RTE_RX_OFFLOAD_BIT2STR
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index afdc53b674c..d7ac625ee74 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1077,6 +1077,7 @@ struct rte_eth_rxconf {
>  	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>  	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
>  	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> +	uint32_t shared_group; /**< Shared port group index in switch domain. */
>  	/**
>  	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
>  	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> @@ -1403,6 +1404,12 @@ struct rte_eth_conf {
>  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
>  #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
>  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> +/**
> + * Rx queue is shared among ports in same switch domain to save memory,
> + * avoid polling each port. Any port in the group can be used to receive
> + * packets. Real source port number saved in mbuf->port field.
> + */
> +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
>  
>  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
>  				 DEV_RX_OFFLOAD_UDP_CKSUM | \
> 

IMHO it should be squashed with the second patch to make it
easier to review. Otherwise it is hard to understand what
shared_group and the offload are, since both are dead in this patch.
  
Xueming Li Oct. 11, 2021, 1:12 p.m. UTC | #2
On Mon, 2021-10-11 at 13:47 +0300, Andrew Rybchenko wrote:
> IMHO it should be squashed with the second patch to make it
> easier to review. Otherwise it is hard to understand what
> shared_group and the offload are, since both are dead in this patch.

Hi Andrew,

Thanks for the review! After discussing with Jerin, we want to drop the
second patch and decide how to aggregate ports later, after collecting
more feedback and ideas. To make the offload and group clear, I'll add an
example to the commit message. v5 was sent without seeing your other
review on 0/6, so please ignore it for now.
  

Patch

diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index 4fce8cd1c97..69bc1d5719c 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -626,6 +626,17 @@  Supports inner packet L4 checksum.
   ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
 
 
+.. _nic_features_shared_rx_queue:
+
+Shared Rx queue
+---------------
+
+Supports shared Rx queue for ports in same switch domain.
+
+* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
+* **[provides] mbuf**: ``mbuf.port``.
+
+
 .. _nic_features_packet_type_parsing:
 
 Packet type parsing
diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
index 754184ddd4d..ebeb4c18512 100644
--- a/doc/guides/nics/features/default.ini
+++ b/doc/guides/nics/features/default.ini
@@ -19,6 +19,7 @@  Free Tx mbuf on demand =
 Queue start/stop     =
 Runtime Rx queue setup =
 Runtime Tx queue setup =
+Shared Rx queue      =
 Burst mode info      =
 Power mgmt address monitor =
 MTU update           =
diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
index ff6aa91c806..bc7ce65fa3d 100644
--- a/doc/guides/prog_guide/switch_representation.rst
+++ b/doc/guides/prog_guide/switch_representation.rst
@@ -123,6 +123,16 @@  thought as a software "patch panel" front-end for applications.
 .. [1] `Ethernet switch device driver model (switchdev)
        <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
 
+- Memory usage of representors is huge when the number of representors grows,
+  because the PMD always allocates an mbuf for each descriptor of an Rx queue.
+  Polling a large number of ports brings more CPU load, cache misses and
+  latency. A shared Rx queue can be used to share an Rx queue between the PF
+  and representors in the same switch. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ`` is
+  present in the Rx offload capability of device info. Set the
+  offload flag in the device Rx mode or Rx queue configuration to enable
+  the shared Rx queue. Polling any member port of the shared Rx queue can return
+  packets of all ports in the group; the port ID is saved in ``mbuf.port``.
+
 Basic SR-IOV
 ------------
 
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 61aa49efec6..73270c10492 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -127,6 +127,7 @@  static const struct {
 	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
 	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
 	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
+	RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
 };
 
 #undef RTE_RX_OFFLOAD_BIT2STR
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index afdc53b674c..d7ac625ee74 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1077,6 +1077,7 @@  struct rte_eth_rxconf {
 	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
 	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
 	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
+	uint32_t shared_group; /**< Shared port group index in switch domain. */
 	/**
 	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
 	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
@@ -1403,6 +1404,12 @@  struct rte_eth_conf {
 #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
 #define DEV_RX_OFFLOAD_RSS_HASH		0x00080000
 #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
+/**
+ * Rx queue is shared among ports in the same switch domain to save memory
+ * and avoid polling each port. Any port in the group can be used to receive
+ * packets. The real source port number is saved in the mbuf->port field.
+ */
+#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
 
 #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
 				 DEV_RX_OFFLOAD_UDP_CKSUM | \