[v1] ethdev: introduce shared Rx queue
Commit Message
In the current DPDK framework, each Rx queue is pre-loaded with mbufs
for incoming packets. When the number of representors scales out in a
switch domain, the memory consumption becomes significant. Most
importantly, polling all ports leads to high cache miss rates, high
latency and low throughput.
This patch introduces a shared Rx queue. Ports with the same
configuration in a switch domain can share an Rx queue set by
specifying a sharing group. Polling any queue that uses the same shared
Rx queue receives packets from all member ports. The source port is
identified by mbuf->port.
The queue number of each port in a shared group should be identical.
Queue indexes are mapped 1:1 within a shared group.
A shared Rx queue is supposed to be polled on the same thread.
Multiple groups are supported via a group ID.
Signed-off-by: Xueming Li <xuemingl@nvidia.com>
---
doc/guides/nics/features.rst | 11 +++++++++++
doc/guides/nics/features/default.ini | 1 +
doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
lib/ethdev/rte_ethdev.c | 1 +
lib/ethdev/rte_ethdev.h | 7 +++++++
5 files changed, 30 insertions(+)
Comments
On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
>
> In current DPDK framework, each RX queue is pre-loaded with mbufs for
> incoming packets. When number of representors scale out in a switch
> domain, the memory consumption became significant. Most important,
> polling all ports leads to high cache miss, high latency and low
> throughput.
>
> This patch introduces shared RX queue. Ports with same configuration in
> a switch domain could share RX queue set by specifying sharing group.
> Polling any queue using same shared RX queue receives packets from all
> member ports. Source port is identified by mbuf->port.
>
> Port queue number in a shared group should be identical. Queue index is
> 1:1 mapped in shared group.
>
> Share RX queue is supposed to be polled on same thread.
>
> Multiple groups is supported by group ID.
Is this offload specific to representors? If so, can the name be
changed to refer specifically to representors?
If it is for the generic case, how will the flow ordering be maintained?
> [snip]
Hi,
> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Monday, August 9, 2021 9:51 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
>
> On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> >
> > In current DPDK framework, each RX queue is pre-loaded with mbufs for
> > incoming packets. When number of representors scale out in a switch
> > domain, the memory consumption became significant. Most important,
> > polling all ports leads to high cache miss, high latency and low
> > throughput.
> >
> > This patch introduces shared RX queue. Ports with same configuration
> > in a switch domain could share RX queue set by specifying sharing group.
> > Polling any queue using same shared RX queue receives packets from all
> > member ports. Source port is identified by mbuf->port.
> >
> > Port queue number in a shared group should be identical. Queue index
> > is
> > 1:1 mapped in shared group.
> >
> > Share RX queue is supposed to be polled on same thread.
> >
> > Multiple groups is supported by group ID.
>
> Is this offload specific to the representor? If so can this name be changed specifically to representor?
Yes, both the PF and representors in a switch domain can take advantage of it.
> If it is for a generic case, how the flow ordering will be maintained?
Not quite sure that I understood your question. The control path is almost the same as before:
PF and representor ports are still needed, and rte_flow rules are not impacted.
Queues are still needed for each member port; in my PMD implementation, descriptors (mbufs)
are supplied from the shared Rx queue.
> > [snip]
On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
> Hi,
>
> > -----Original Message-----
> > From: Jerin Jacob <jerinjacobk@gmail.com>
> > Sent: Monday, August 9, 2021 9:51 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> >
> > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > >
> > > In current DPDK framework, each RX queue is pre-loaded with mbufs for
> > > incoming packets. When number of representors scale out in a switch
> > > domain, the memory consumption became significant. Most important,
> > > polling all ports leads to high cache miss, high latency and low
> > > throughput.
> > >
> > > This patch introduces shared RX queue. Ports with same configuration
> > > in a switch domain could share RX queue set by specifying sharing group.
> > > Polling any queue using same shared RX queue receives packets from all
> > > member ports. Source port is identified by mbuf->port.
> > >
> > > Port queue number in a shared group should be identical. Queue index
> > > is
> > > 1:1 mapped in shared group.
> > >
> > > Share RX queue is supposed to be polled on same thread.
> > >
> > > Multiple groups is supported by group ID.
> >
> > Is this offload specific to the representor? If so can this name be changed specifically to representor?
>
> Yes, PF and representor in switch domain could take advantage.
>
> > If it is for a generic case, how the flow ordering will be maintained?
>
> Not quite sure that I understood your question. The control path of is almost same as before,
> PF and representor port still needed, rte flows not impacted.
> Queues still needed for each member port, descriptors(mbuf) will be supplied from shared Rx queue
> in my PMD implementation.
My question was: if we create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload,
multiple ethdev receive queues land into the same receive queue. In that case,
how is the flow order maintained for the respective receive queues?
If this offload is only useful for the representor case, can we make it
specific to representors by changing its name and scope?
> > > [snip]
> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Wednesday, August 11, 2021 4:03 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
>
> On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> > Hi,
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Monday, August 9, 2021 9:51 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > <andrew.rybchenko@oktetlabs.ru>
> > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > >
> > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > >
> > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > for incoming packets. When number of representors scale out in a
> > > > switch domain, the memory consumption became significant. Most
> > > > important, polling all ports leads to high cache miss, high
> > > > latency and low throughput.
> > > >
> > > > This patch introduces shared RX queue. Ports with same
> > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > Polling any queue using same shared RX queue receives packets from
> > > > all member ports. Source port is identified by mbuf->port.
> > > >
> > > > Port queue number in a shared group should be identical. Queue
> > > > index is
> > > > 1:1 mapped in shared group.
> > > >
> > > > Share RX queue is supposed to be polled on same thread.
> > > >
> > > > Multiple groups is supported by group ID.
> > >
> > > Is this offload specific to the representor? If so can this name be changed specifically to representor?
> >
> > Yes, PF and representor in switch domain could take advantage.
> >
> > > If it is for a generic case, how the flow ordering will be maintained?
> >
> > Not quite sure that I understood your question. The control path of is
> > almost same as before, PF and representor port still needed, rte flows not impacted.
> > Queues still needed for each member port, descriptors(mbuf) will be
> > supplied from shared Rx queue in my PMD implementation.
>
> My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same
> receive queue, In that case, how the flow order is maintained for respective receive queues.
I guess the question is about the testpmd forwarding stream? The forwarding logic has to be changed slightly in the case of a shared Rx queue:
basically, for each packet in the rx_burst result, look up the source stream according to mbuf->port and forward it to the target fs.
Packets from the same source port can be grouped into a small burst for processing; this accelerates performance when traffic comes from
a limited number of ports. I'll introduce a common API to do shared Rx queue forwarding, called with a packet-handling callback, so it suits
all forwarding engines. Will send patches soon.
> If this offload is only useful for representor case, Can we make this offload specific to representor the case by changing its name and
> scope.
It works for both the PF and representors in the same switch domain; for an application like OVS, few changes are needed.
> > > > [snip]
On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
>
>
>> -----Original Message-----
>> From: Jerin Jacob <jerinjacobk@gmail.com>
>> Sent: Wednesday, August 11, 2021 4:03 PM
>> To: Xueming(Steven) Li <xuemingl@nvidia.com>
>> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
>> Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
>> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
>>
>> On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>>>
>>> Hi,
>>>
>>>> -----Original Message-----
>>>> From: Jerin Jacob <jerinjacobk@gmail.com>
>>>> Sent: Monday, August 9, 2021 9:51 PM
>>>> To: Xueming(Steven) Li <xuemingl@nvidia.com>
>>>> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
>>>> NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
>>>> <andrew.rybchenko@oktetlabs.ru>
>>>> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
>>>>
>>>> On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
>>>>>
>>>>> In current DPDK framework, each RX queue is pre-loaded with mbufs
>>>>> for incoming packets. When number of representors scale out in a
>>>>> switch domain, the memory consumption became significant. Most
>>>>> important, polling all ports leads to high cache miss, high
>>>>> latency and low throughput.
>>>>>
>>>>> This patch introduces shared RX queue. Ports with same
>>>>> configuration in a switch domain could share RX queue set by specifying sharing group.
>>>>> Polling any queue using same shared RX queue receives packets from
>>>>> all member ports. Source port is identified by mbuf->port.
>>>>>
>>>>> Port queue number in a shared group should be identical. Queue
>>>>> index is
>>>>> 1:1 mapped in shared group.
>>>>>
>>>>> Share RX queue is supposed to be polled on same thread.
>>>>>
>>>>> Multiple groups is supported by group ID.
>>>>
>>>> Is this offload specific to the representor? If so can this name be changed specifically to representor?
>>>
>>> Yes, PF and representor in switch domain could take advantage.
>>>
>>>> If it is for a generic case, how the flow ordering will be maintained?
>>>
>>> Not quite sure that I understood your question. The control path of is
>>> almost same as before, PF and representor port still needed, rte flows not impacted.
>>> Queues still needed for each member port, descriptors(mbuf) will be
>>> supplied from shared Rx queue in my PMD implementation.
>>
>> My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same
>> receive queue, In that case, how the flow order is maintained for respective receive queues.
>
> I guess the question is testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq.
> basically for each packet in rx_burst result, lookup source stream according to mbuf->port, forwarding to target fs.
> Packets from same source port could be grouped as a small burst to process, this will accelerates the performance if traffic come from
> limited ports. I'll introduce some common api to do shard rxq forwarding, call it with packets handling callback, so it suites for
> all forwarding engine. Will sent patches soon.
>
All ports will put their packets into the same queue (the shared queue), right? Does
this mean only a single core will poll it? What will happen if there are
multiple cores polling, won't that cause problems?
And if this requires specific changes in the application, I am not sure about
the solution; can't this work in a way that is transparent to the application?
Overall, is this for optimizing memory for the port representors? If so, can't we
have a representor-specific solution? Reducing the scope can reduce the
complexity it brings.
>> If this offload is only useful for representor case, Can we make this offload specific to representor the case by changing its name and
>> scope.
>
> It works for both PF and representors in the same switch domain; for applications like OVS, few changes are needed.
>
>>
>>
>>>
>>>>
>>>>>
>>>>> Signed-off-by: Xueming Li <xuemingl@nvidia.com>
>>>>> ---
>>>>> doc/guides/nics/features.rst | 11 +++++++++++
>>>>> doc/guides/nics/features/default.ini | 1 +
>>>>> doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
>>>>> lib/ethdev/rte_ethdev.c | 1 +
>>>>> lib/ethdev/rte_ethdev.h | 7 +++++++
>>>>> 5 files changed, 30 insertions(+)
>>>>>
>>>>> diff --git a/doc/guides/nics/features.rst
>>>>> b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
>>>>> --- a/doc/guides/nics/features.rst
>>>>> +++ b/doc/guides/nics/features.rst
>>>>> @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
>>>>> ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
>>>>>
>>>>>
>>>>> +.. _nic_features_shared_rx_queue:
>>>>> +
>>>>> +Shared Rx queue
>>>>> +---------------
>>>>> +
>>>>> +Supports shared Rx queue for ports in same switch domain.
>>>>> +
>>>>> +* **[uses] rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
>>>>> +* **[provides] mbuf**: ``mbuf.port``.
>>>>> +
>>>>> +
>>>>> .. _nic_features_packet_type_parsing:
>>>>>
>>>>> Packet type parsing
>>>>> diff --git a/doc/guides/nics/features/default.ini
>>>>> b/doc/guides/nics/features/default.ini
>>>>> index 754184ddd4..ebeb4c1851 100644
>>>>> --- a/doc/guides/nics/features/default.ini
>>>>> +++ b/doc/guides/nics/features/default.ini
>>>>> @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
>>>>> Queue start/stop =
>>>>> Runtime Rx queue setup =
>>>>> Runtime Tx queue setup =
>>>>> +Shared Rx queue =
>>>>> Burst mode info =
>>>>> Power mgmt address monitor =
>>>>> MTU update =
>>>>> diff --git a/doc/guides/prog_guide/switch_representation.rst
>>>>> b/doc/guides/prog_guide/switch_representation.rst
>>>>> index ff6aa91c80..45bf5a3a10 100644
>>>>> --- a/doc/guides/prog_guide/switch_representation.rst
>>>>> +++ b/doc/guides/prog_guide/switch_representation.rst
>>>>> @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
>>>>> .. [1] `Ethernet switch device driver model (switchdev)
>>>>>
>>>>> <https://www.kernel.org/doc/Documentation/networking/switchdev.txt
>>>>>> `_
>>>>>
>>>>> +- Memory usage of representors is huge when number of representor
>>>>> +grows,
>>>>> + because PMD always allocate mbuf for each descriptor of Rx queue.
>>>>> + Polling the large number of ports brings more CPU load, cache
>>>>> +miss and
>>>>> + latency. Shared Rx queue can be used to share Rx queue between
>>>>> +PF and
>>>>> + representors in same switch domain.
>>>>> +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
>>>>> + is present in Rx offloading capability of device info. Setting
>>>>> +the
>>>>> + offloading flag in device Rx mode or Rx queue configuration to
>>>>> +enable
>>>>> + shared Rx queue. Polling any member port of shared Rx queue can
>>>>> +return
>>>>> + packets of all ports in group, port ID is saved in ``mbuf.port``.
>>>>> +
>>>>> Basic SR-IOV
>>>>> ------------
>>>>>
>>>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
>>>>> index 9d95cd11e1..1361ff759a 100644
>>>>> --- a/lib/ethdev/rte_ethdev.c
>>>>> +++ b/lib/ethdev/rte_ethdev.c
>>>>> @@ -127,6 +127,7 @@ static const struct {
>>>>> RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
>>>>> RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
>>>>> RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
>>>>> + RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
>>>>> };
>>>>>
>>>>> #undef RTE_RX_OFFLOAD_BIT2STR
>>>>> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
>>>>> index d2b27c351f..a578c9db9d 100644
>>>>> --- a/lib/ethdev/rte_ethdev.h
>>>>> +++ b/lib/ethdev/rte_ethdev.h
>>>>> @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
>>>>> uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
>>>>> uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
>>>>> uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
>>>>> */
>>>>> + uint32_t shared_group; /**< Shared port group index in
>>>>> + switch domain. */
>>>>> /**
>>>>> * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
>>>>> * Only offloads set on rx_queue_offload_capa or
>>>>> rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
>>>>> #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM 0x00040000
>>>>> #define DEV_RX_OFFLOAD_RSS_HASH 0x00080000
>>>>> #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
>>>>> +/**
>>>>> + * Rx queue is shared among ports in same switch domain to save
>>>>> +memory,
>>>>> + * avoid polling each port. Any port in group can be used to receive packets.
>>>>> + * Real source port number saved in mbuf->port field.
>>>>> + */
>>>>> +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ 0x00200000
>>>>>
>>>>> #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
>>>>> DEV_RX_OFFLOAD_UDP_CKSUM | \
>>>>> --
>>>>> 2.25.1
>>>>>
> -----Original Message-----
> From: Ferruh Yigit <ferruh.yigit@intel.com>
> Sent: Wednesday, August 11, 2021 8:04 PM
> To: Xueming(Steven) Li <xuemingl@nvidia.com>; Jerin Jacob <jerinjacobk@gmail.com>
> Cc: dpdk-dev <dev@dpdk.org>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
>
> On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> >
> >
> >> -----Original Message-----
> >> From: Jerin Jacob <jerinjacobk@gmail.com>
> >> Sent: Wednesday, August 11, 2021 4:03 PM
> >> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> >> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> >> NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> >> <andrew.rybchenko@oktetlabs.ru>
> >> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> >>
> >> On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>>> -----Original Message-----
> >>>> From: Jerin Jacob <jerinjacobk@gmail.com>
> >>>> Sent: Monday, August 9, 2021 9:51 PM
> >>>> To: Xueming(Steven) Li <xuemingl@nvidia.com>
> >>>> Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> >>>> NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> >>>> <andrew.rybchenko@oktetlabs.ru>
> >>>> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx
> >>>> queue
> >>>>
> >>>> On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> >>>>>
> >>>>> In current DPDK framework, each RX queue is pre-loaded with mbufs
> >>>>> for incoming packets. When number of representors scale out in a
> >>>>> switch domain, the memory consumption became significant. Most
> >>>>> important, polling all ports leads to high cache miss, high
> >>>>> latency and low throughput.
> >>>>>
> >>>>> This patch introduces shared RX queue. Ports with same
> >>>>> configuration in a switch domain could share RX queue set by specifying sharing group.
> >>>>> Polling any queue using same shared RX queue receives packets from
> >>>>> all member ports. Source port is identified by mbuf->port.
> >>>>>
> >>>>> Port queue number in a shared group should be identical. Queue
> >>>>> index is
> >>>>> 1:1 mapped in shared group.
> >>>>>
> >>>>> Share RX queue is supposed to be polled on same thread.
> >>>>>
> >>>>> Multiple groups is supported by group ID.
> >>>>
> >>>> Is this offload specific to the representor? If so can this name be changed specifically to representor?
> >>>
> >>> Yes, PF and representor in switch domain could take advantage.
> >>>
> >>>> If it is for a generic case, how the flow ordering will be maintained?
> >>>
> >>> Not quite sure that I understood your question. The control path of
> >>> is almost same as before, PF and representor port still needed, rte flows not impacted.
> >>> Queues still needed for each member port, descriptors(mbuf) will be
> >>> supplied from shared Rx queue in my PMD implementation.
> >>
> >> My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> >> offload, multiple ethdev receive queues land into the same receive queue, In that case, how the flow order is maintained for
> respective receive queues.
> >
> > I guess the question is testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq.
> > basically for each packet in rx_burst result, lookup source stream according to mbuf->port, forwarding to target fs.
> > Packets from same source port could be grouped as a small burst to
> > process, this will accelerates the performance if traffic come from
> > limited ports. I'll introduce some common api to do shard rxq forwarding, call it with packets handling callback, so it suites for all
> forwarding engine. Will sent patches soon.
> >
>
> All ports will put the packets in to the same queue (share queue), right? Does this means only single core will poll only, what will
> happen if there are multiple cores polling, won't it cause problem?
This has been mentioned in the commit log: the shared rxq is supposed to be polled from a single thread (core) - I think it should be a "MUST".
Results are unexpected if there are multiple cores polling; that's why I added a polling schedule check in testpmd.
It is similar for the rx/tx burst functions: a queue can't be polled from multiple threads (cores), and for performance reasons there is no such check in the eal api.
If users want to utilize multiple cores to distribute workloads, it's possible to define more groups; queues in different groups
could be polled on multiple cores.
It's possible to poll every member port in a group, but it is not necessary; any port in the group could be polled to get packets for all ports in the group.
If a member port is subject to hot plug/remove, it's possible to create a vdev with the same queue number, copy the rxq object, and poll the vdev
as a dedicated proxy for the group.
>
> And if this requires specific changes in the application, I am not sure about the solution, can't this work in a transparent way to the
> application?
Yes, we considered different options in the design stage. One possible solution is to cache received packets in rings; this can be done on
the eth layer, but I'm afraid it brings fewer benefits - the user still has to be aware of multiple-core polling.
This could be done as a wrapper PMD later, with more effort.
>
> Overall, is this for optimizing memory for the port representors? If so, can't we have a port-representor-specific solution? Reducing
> scope can reduce the complexity it brings.
This feature supports both PF and representors, and yes, the major issue is the memory of representors. Polling all representors also
introduces more core cache-miss latency. This feature essentially aggregates all ports in a group as one port.
On the other hand, it's useful for rte flow to create offloading flows using a representor as a regular port ID.
It would be great to hear any new solution/suggestion; my head is buried in PMD code :)
>
> >> If this offload is only useful for representor case, Can we make this
> >> offload specific to representor the case by changing its name and scope.
> >
> > It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.
> >
> >>
> >>
> >>>
> >>>>
> >>>>>
> >>>>> [patch diff snipped; quoted in full earlier in the thread]
> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Xueming(Steven) Li
> Sent: Wednesday, August 11, 2021 8:59 PM
> To: Ferruh Yigit <ferruh.yigit@intel.com>; Jerin Jacob <jerinjacobk@gmail.com>
> Cc: dpdk-dev <dev@dpdk.org>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru>
> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
>
>
>
> > [earlier discussion snipped; quoted in full above]
> > All ports will put the packets in to the same queue (share queue),
> > right? Does this means only single core will poll only, what will happen if there are multiple cores polling, won't it cause problem?
>
> This has been mentioned in commit log, the shared rxq is supposed to be polling in single thread(core) - I think it should be "MUST".
> Result is unexpected if there are multiple cores pooling, that's why I added a polling schedule check in testpmd.
V2 with testpmd code uploaded, please check.
> [remainder of quoted message snipped; quoted in full above]
On Wed, 2021-08-11 at 12:59 +0000, Xueming(Steven) Li wrote:
>
> > -----Original Message-----
> > From: Ferruh Yigit <ferruh.yigit@intel.com>
> > Sent: Wednesday, August 11, 2021 8:04 PM
> > To: Xueming(Steven) Li <xuemingl@nvidia.com>; Jerin Jacob <jerinjacobk@gmail.com>
> > Cc: dpdk-dev <dev@dpdk.org>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > <andrew.rybchenko@oktetlabs.ru>
> > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> >
> > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > <andrew.rybchenko@oktetlabs.ru>
> > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > >
> > > [earlier discussion snipped; quoted in full above]
> >
> > All ports will put the packets in to the same queue (share queue), right? Does this means only single core will poll only, what will
> > happen if there are multiple cores polling, won't it cause problem?
>
> This has been mentioned in commit log, the shared rxq is supposed to be polling in single thread(core) - I think it should be "MUST".
> Result is unexpected if there are multiple cores pooling, that's why I added a polling schedule check in testpmd.
> Similar for rx/tx burst function, a queue can't be polled on multiple thread(core), and for performance concern, no such check in eal api.
>
> If users want to utilize multiple cores to distribute workloads, it's possible to define more groups, queues in different group could be
> could be polled on multiple cores.
>
> It's possible to poll every member port in group, but not necessary, any port in group could be polled to get packets for all ports in group.
>
> If the member port subject to hot plug/remove, it's possible to create a vdev with same queue number, copy rxq object and poll vdev
> as a dedicate proxy for the group.
>
> >
> > And if this requires specific changes in the application, I am not sure about the solution, can't this work in a transparent way to the
> > application?
>
> Yes, we considered different options in design stage. One possible solution is to cache received packets in rings, this can be done on
> eth layer, but I'm afraid less benefits, user still has to be a ware of multiple core polling.
> This can be done as a wrapper PMD later, more efforts.
For people who want to use shared rxq to save memory, they need to be
conscious of the core polling rule: dedicate a core to each shared rxq,
just like the existing rule for rxqs and txqs.
I'm afraid specific changes in the application are a must, but not too
many: polling one port per group is sufficient. Protections in the data
plane would definitely hurt performance :(
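To make the application-side change concrete, here is a minimal sketch of the demultiplexing step a forwarding loop would need: after one burst on a shared Rx queue, packets from all member ports arrive mixed, and are regrouped by source port. A stand-in struct models rte_mbuf's port field — this is not the real DPDK API, just the grouping logic.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for struct rte_mbuf: only the port field matters here. */
struct mbuf { uint16_t port; };

#define MAX_PORTS 8

/*
 * Split one mixed burst from a shared Rx queue into per-source
 * sub-bursts, so each sub-burst can be handed to the forwarding
 * stream of its source port. out[p] receives cnt[p] packets whose
 * mbuf->port == p.
 */
static void
demux_by_port(struct mbuf **pkts, int n,
	      struct mbuf **out[MAX_PORTS], int cnt[MAX_PORTS])
{
	for (int p = 0; p < MAX_PORTS; p++)
		cnt[p] = 0;
	for (int i = 0; i < n; i++) {
		uint16_t p = pkts[i]->port;
		out[p][cnt[p]++] = pkts[i];
	}
}
```

Keeping packets from the same source port batched this way is what makes the forwarding change cheap when traffic comes from a limited number of ports, as described above.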
>
> >
> > Overall, is this for optimizing memory for the port represontors? If so can't we have a port representor specific solution, reducing
> > scope can reduce the complexity it brings?
>
> This feature supports both PF and representor, and yes, major issue is memory of representors. Poll all representors also
> introduces more core cache miss latency. This feature essentially aggregates all ports in group as one port.
> On the other hand, it's useful for rte flow to create offloading flows using representor as a regular port ID.
>
As discussed with Jerin below, the major memory consumed by a PF or
representor is the mbufs pre-filled into its rxqs. The PMD can't assume
all representors share the same memory pool, or share rxqs internally
in the PMD - users might schedule representors to different cores.
Defining a shared rxq flag and group looks like a good direction.
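The memory argument can be made concrete with back-of-the-envelope arithmetic. The numbers below (representor count, queue count, descriptor ring size, mbuf size) are illustrative assumptions, not figures from the patch:

```c
#include <stdint.h>

/*
 * Without sharing, every port pre-fills its own Rx descriptor rings
 * with mbufs. With a shared Rx queue, one pre-fill per (group, queue
 * index) serves all member ports in the group.
 */
static uint64_t
rxq_mbuf_bytes(uint64_t ports, uint64_t queues_per_port,
	       uint64_t desc_per_queue, uint64_t mbuf_size,
	       int shared)
{
	uint64_t fills = shared ? queues_per_port
				: ports * queues_per_port;
	return fills * desc_per_queue * mbuf_size;
}
```

With, say, 256 representors, 4 queues each, 1024 descriptors per queue and 2 KB mbufs, the dedicated-queue layout pre-allocates 2 GiB of mbufs while the shared layout needs only 8 MiB — a factor equal to the port count.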
> It's great if any new solution/suggestion, my head buried in PMD code :)
>
> >
> > > > If this offload is only useful for representor case, Can we make this
> > > > offload specific to representor the case by changing its name and scope.
> > >
> > > It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.
> > >
> > > >
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > ---
> > > > > > > doc/guides/nics/features.rst | 11 +++++++++++
> > > > > > > doc/guides/nics/features/default.ini | 1 +
> > > > > > > doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > lib/ethdev/rte_ethdev.c | 1 +
> > > > > > > lib/ethdev/rte_ethdev.h | 7 +++++++
> > > > > > > 5 files changed, 30 insertions(+)
> > > > > > >
> > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > >
> > > > > > >
> > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > +
> > > > > > > +Shared Rx queue
> > > > > > > +---------------
> > > > > > > +
> > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > +
> > > > > > > +* **[uses] rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > +
> > > > > > > +
> > > > > > > .. _nic_features_packet_type_parsing:
> > > > > > >
> > > > > > > Packet type parsing
> > > > > > > diff --git a/doc/guides/nics/features/default.ini
> > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > Queue start/stop =
> > > > > > > Runtime Rx queue setup =
> > > > > > > Runtime Tx queue setup =
> > > > > > > +Shared Rx queue =
> > > > > > > Burst mode info =
> > > > > > > Power mgmt address monitor =
> > > > > > > MTU update =
> > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > >
> > > > > > > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt
> > > > > > > > `_
> > > > > > >
> > > > > > > +- Memory usage of representors is huge when number of representor
> > > > > > > +grows,
> > > > > > > + because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > + Polling the large number of ports brings more CPU load, cache
> > > > > > > +miss and
> > > > > > > + latency. Shared Rx queue can be used to share Rx queue between
> > > > > > > +PF and
> > > > > > > + representors in same switch domain.
> > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > + is present in Rx offloading capability of device info. Setting
> > > > > > > +the
> > > > > > > + offloading flag in device Rx mode or Rx queue configuration to
> > > > > > > +enable
> > > > > > > + shared Rx queue. Polling any member port of shared Rx queue can
> > > > > > > +return
> > > > > > > + packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > +
> > > > > > > Basic SR-IOV
> > > > > > > ------------
> > > > > > >
> > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > + RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > };
> > > > > > >
> > > > > > > #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > */
> > > > > > > + uint32_t shared_group; /**< Shared port group index in
> > > > > > > + switch domain. */
> > > > > > > /**
> > > > > > > * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > * Only offloads set on rx_queue_offload_capa or
> > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM 0x00040000
> > > > > > > #define DEV_RX_OFFLOAD_RSS_HASH 0x00080000
> > > > > > > #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > +/**
> > > > > > > + * Rx queue is shared among ports in same switch domain to save
> > > > > > > +memory,
> > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > + */
> > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ 0x00200000
> > > > > > >
> > > > > > > #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > --
> > > > > > > 2.25.1
> > > > > > >
>
On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> >
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > >
> > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > >
> > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > >
> > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > for incoming packets. When number of representors scale out in a
> > > > > > switch domain, the memory consumption became significant. Most
> > > > > > important, polling all ports leads to high cache miss, high
> > > > > > latency and low throughput.
> > > > > >
> > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > Polling any queue using same shared RX queue receives packets from
> > > > > > all member ports. Source port is identified by mbuf->port.
> > > > > >
> > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > index is
> > > > > > 1:1 mapped in shared group.
> > > > > >
> > > > > > Share RX queue is supposed to be polled on same thread.
> > > > > >
> > > > > > Multiple groups is supported by group ID.
> > > > >
> > > > > Is this offload specific to the representor? If so can this name be changed specifically to representor?
> > > >
> > > > Yes, PF and representor in switch domain could take advantage.
> > > >
> > > > > If it is for a generic case, how the flow ordering will be maintained?
> > > >
> > > > Not quite sure that I understood your question. The control path of is
> > > > almost same as before, PF and representor port still needed, rte flows not impacted.
> > > > Queues still needed for each member port, descriptors(mbuf) will be
> > > > supplied from shared Rx queue in my PMD implementation.
> > >
> > > My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same
> > > receive queue, In that case, how the flow order is maintained for respective receive queues.
> >
> > I guess the question is testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq.
> > basically for each packet in rx_burst result, lookup source stream according to mbuf->port, forwarding to target fs.
> > Packets from same source port could be grouped as a small burst to process, this will accelerates the performance if traffic come from
> > limited ports. I'll introduce some common api to do shard rxq forwarding, call it with packets handling callback, so it suites for
> > all forwarding engine. Will sent patches soon.
> >
>
> All ports will put the packets in to the same queue (share queue), right? Does
> this means only single core will poll only, what will happen if there are
> multiple cores polling, won't it cause problem?
>
> And if this requires specific changes in the application, I am not sure about
> the solution, can't this work in a transparent way to the application?
As discussed with Jerin, v3 patch 2/8 introduces a new API that
aggregates ports in the same group into one new port. Users can
schedule polling on the aggregated port instead of on all member
ports.
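A toy model of the aggregation idea follows — it is not the actual v3 API; member queues are plain counters here and the draining order is a simplification. The point is only that one poll on the aggregated port replaces iterating over every member port:

```c
#include <stdint.h>

#define MAX_MEMBERS 4
#define BURST 32

/* One member port's pending packets, identified by source port ID. */
struct member_q {
	uint16_t port_id;
	int pending;           /* packets waiting on this member */
};

/* The aggregated port simply fronts all member queues of a group. */
struct agg_port {
	struct member_q members[MAX_MEMBERS];
	int nb_members;
};

/*
 * One poll on the aggregated port drains up to BURST packets across
 * all members; src[] records the source port of each packet, as
 * mbuf->port would in the real implementation.
 */
static int
agg_rx_burst(struct agg_port *ap, uint16_t src[BURST])
{
	int n = 0;
	for (int m = 0; m < ap->nb_members && n < BURST; m++) {
		while (ap->members[m].pending > 0 && n < BURST) {
			src[n++] = ap->members[m].port_id;
			ap->members[m].pending--;
		}
	}
	return n;
}
```

The application then dedicates one core to the aggregated port, which matches the single-polling-thread rule discussed earlier in the thread.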
>
> Overall, is this for optimizing memory for the port represontors? If so can't we
> have a port representor specific solution, reducing scope can reduce the
> complexity it brings?
>
> > > If this offload is only useful for representor case, Can we make this offload specific to representor the case by changing its name and
> > > scope.
> >
> > It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.
> >
> > >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > [snip - quoted patch, same as the diff quoted above]
On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
> On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > >
> > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > >
> > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > >
> > > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > for incoming packets. When number of representors scale out in a
> > > > > > > switch domain, the memory consumption became significant. Most
> > > > > > > important, polling all ports leads to high cache miss, high
> > > > > > > latency and low throughput.
> > > > > > >
> > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > Polling any queue using same shared RX queue receives packets from
> > > > > > > all member ports. Source port is identified by mbuf->port.
> > > > > > >
> > > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > > index is
> > > > > > > 1:1 mapped in shared group.
> > > > > > >
> > > > > > > Share RX queue is supposed to be polled on same thread.
> > > > > > >
> > > > > > > Multiple groups is supported by group ID.
> > > > > >
> > > > > > Is this offload specific to the representor? If so can this name be changed specifically to representor?
> > > > >
> > > > > Yes, PF and representor in switch domain could take advantage.
> > > > >
> > > > > > If it is for a generic case, how the flow ordering will be maintained?
> > > > >
> > > > > Not quite sure that I understood your question. The control path of is
> > > > > almost same as before, PF and representor port still needed, rte flows not impacted.
> > > > > Queues still needed for each member port, descriptors(mbuf) will be
> > > > > supplied from shared Rx queue in my PMD implementation.
> > > >
> > > > My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same
> > > > receive queue, In that case, how the flow order is maintained for respective receive queues.
> > >
> > > I guess the question is testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq.
> > > basically for each packet in rx_burst result, lookup source stream according to mbuf->port, forwarding to target fs.
> > > Packets from same source port could be grouped as a small burst to process, this will accelerates the performance if traffic come from
> > > limited ports. I'll introduce some common api to do shard rxq forwarding, call it with packets handling callback, so it suites for
> > > all forwarding engine. Will sent patches soon.
> > >
> >
> > All ports will put the packets in to the same queue (share queue), right? Does
> > this means only single core will poll only, what will happen if there are
> > multiple cores polling, won't it cause problem?
> >
> > And if this requires specific changes in the application, I am not sure about
> > the solution, can't this work in a transparent way to the application?
>
> Discussed with Jerin, new API introduced in v3 2/8 that aggregate ports
> in same group into one new port. Users could schedule polling on the
> aggregated port instead of all member ports.
The v3 still has testpmd changes in the fastpath, right? IMO, for this
feature we should not change the fastpath of the testpmd
application. Instead, testpmd could probably use aggregated ports as a
separate fwd_engine to show how to use this feature.
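A separate fwd_engine would plug into testpmd's existing engine table. The sketch below loosely models that registration pattern; the field names resemble testpmd's struct fwd_engine but this is a stand-in, not the actual testpmd code, and the forwarding body is a placeholder.

```c
#include <stddef.h>
#include <string.h>

/* Minimal model of a testpmd-style forwarding engine entry. */
struct fwd_engine {
	const char *fwd_mode_name;
	int (*packet_fwd)(int nb_pkts);   /* returns packets handled */
};

static int
shared_rxq_fwd(int nb_pkts)
{
	/* A real engine would rx_burst on one member port of the
	 * group, demux by mbuf->port, and forward each sub-burst.
	 * Modeled as a pass-through here. */
	return nb_pkts;
}

static struct fwd_engine shared_rxq_engine = {
	.fwd_mode_name = "shared_rxq",
	.packet_fwd = shared_rxq_fwd,
};

/* Look up an engine by name, as "set fwd <mode>" would. */
static struct fwd_engine *
find_engine(struct fwd_engine *tbl[], size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(tbl[i]->fwd_mode_name, name) == 0)
			return tbl[i];
	return NULL;
}
```

Keeping the shared-rxq logic inside its own engine leaves the other forwarding modes' fastpath untouched, which is the concern raised above.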
>
> >
> > Overall, is this for optimizing memory for the port represontors? If so can't we
> > have a port representor specific solution, reducing scope can reduce the
> > complexity it brings?
> >
> > > > If this offload is only useful for representor case, Can we make this offload specific to representor the case by changing its name and
> > > > scope.
> > >
> > > It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.
> > >
> > > >
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > [snip - quoted patch, same as the diff quoted above]
On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > >
> > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > >
> > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > >
> > > > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > > for incoming packets. When number of representors scale out in a
> > > > > > > > switch domain, the memory consumption became significant. Most
> > > > > > > > important, polling all ports leads to high cache miss, high
> > > > > > > > latency and low throughput.
> > > > > > > >
> > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > Polling any queue using same shared RX queue receives packets from
> > > > > > > > all member ports. Source port is identified by mbuf->port.
> > > > > > > >
> > > > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > > > index is
> > > > > > > > 1:1 mapped in shared group.
> > > > > > > >
> > > > > > > > Share RX queue is supposed to be polled on same thread.
> > > > > > > >
> > > > > > > > Multiple groups is supported by group ID.
> > > > > > >
> > > > > > > Is this offload specific to the representor? If so can this name be changed specifically to representor?
> > > > > >
> > > > > > Yes, PF and representor in switch domain could take advantage.
> > > > > >
> > > > > > > If it is for a generic case, how the flow ordering will be maintained?
> > > > > >
> > > > > > Not quite sure that I understood your question. The control path of is
> > > > > > almost same as before, PF and representor port still needed, rte flows not impacted.
> > > > > > Queues still needed for each member port, descriptors(mbuf) will be
> > > > > > supplied from shared Rx queue in my PMD implementation.
> > > > >
> > > > > My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same
> > > > > receive queue, In that case, how the flow order is maintained for respective receive queues.
> > > >
> > > > I guess the question is testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq.
> > > > basically for each packet in rx_burst result, lookup source stream according to mbuf->port, forwarding to target fs.
> > > > Packets from same source port could be grouped as a small burst to process, this will accelerates the performance if traffic come from
> > > > limited ports. I'll introduce some common api to do shard rxq forwarding, call it with packets handling callback, so it suites for
> > > > all forwarding engine. Will sent patches soon.
> > > >
> > >
> > > All ports will put the packets in to the same queue (share queue), right? Does
> > > this means only single core will poll only, what will happen if there are
> > > multiple cores polling, won't it cause problem?
> > >
> > > And if this requires specific changes in the application, I am not sure about
> > > the solution, can't this work in a transparent way to the application?
> >
> > Discussed with Jerin, new API introduced in v3 2/8 that aggregate ports
> > in same group into one new port. Users could schedule polling on the
> > aggregated port instead of all member ports.
>
> The v3 still has testpmd changes in fastpath. Right? IMO, For this
> feature, we should not change fastpath of testpmd
> application. Instead, testpmd can use aggregated ports probably as
> separate fwd_engine to show how to use this feature.
Good point to discuss :) There are two strategies for polling a shared
Rxq:
1. polling each member port
All forwarding engines can be reused to work as before.
My testpmd patches are efforts towards this direction.
Does your PMD support this?
2. polling aggregated port
Besides the forwarding engine, more work is needed to demo it.
This is an optional API, not supported by my PMD yet.
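The per-port demultiplexing mentioned earlier in the thread (looking up the source stream from mbuf->port and grouping packets from the same port into sub-bursts) can be sketched in a small standalone model. Note this is only an illustration, not the testpmd patch itself: struct mbuf and demux_by_port() here are simplified stand-ins for rte_mbuf and the forwarding-stream lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct rte_mbuf: only the source port matters
 * here. With a shared Rx queue, the real source port of each packet is
 * stored in mbuf->port. */
struct mbuf {
	uint16_t port;
};

/* Split one rx_burst result from a shared Rx queue into a sub-burst for
 * a single source port, as discussed for the testpmd forwarding change.
 * Returns the number of packets copied into out[]. */
size_t
demux_by_port(struct mbuf **burst, size_t n, uint16_t port,
	      struct mbuf **out)
{
	size_t m = 0;

	for (size_t i = 0; i < n; i++)
		if (burst[i]->port == port)
			out[m++] = burst[i];
	return m;
}
```

In a real forwarding engine each sub-burst would then be handed to the forward stream that owns that source port, so packets of one port are still processed together as a burst.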
>
> >
> > >
> > > Overall, is this for optimizing memory for the port represontors? If so can't we
> > > have a port representor specific solution, reducing scope can reduce the
> > > complexity it brings?
> > >
> > > > > If this offload is only useful for representor case, Can we make this offload specific to representor the case by changing its name and
> > > > > scope.
> > > >
> > > > It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.
> > > >
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > ---
> > > > > > > > doc/guides/nics/features.rst | 11 +++++++++++
> > > > > > > > doc/guides/nics/features/default.ini | 1 +
> > > > > > > > doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > > lib/ethdev/rte_ethdev.c | 1 +
> > > > > > > > lib/ethdev/rte_ethdev.h | 7 +++++++
> > > > > > > > 5 files changed, 30 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > > ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > >
> > > > > > > >
> > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > +
> > > > > > > > +Shared Rx queue
> > > > > > > > +---------------
> > > > > > > > +
> > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > +
> > > > > > > > +* **[uses] rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > +
> > > > > > > > +
> > > > > > > > .. _nic_features_packet_type_parsing:
> > > > > > > >
> > > > > > > > Packet type parsing
> > > > > > > > diff --git a/doc/guides/nics/features/default.ini
> > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > Queue start/stop =
> > > > > > > > Runtime Rx queue setup =
> > > > > > > > Runtime Tx queue setup =
> > > > > > > > +Shared Rx queue =
> > > > > > > > Burst mode info =
> > > > > > > > Power mgmt address monitor =
> > > > > > > > MTU update =
> > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > > .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > >
> > > > > > > > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt
> > > > > > > > > `_
> > > > > > > >
> > > > > > > > +- Memory usage of representors is huge when number of representor
> > > > > > > > +grows,
> > > > > > > > + because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > + Polling the large number of ports brings more CPU load, cache
> > > > > > > > +miss and
> > > > > > > > + latency. Shared Rx queue can be used to share Rx queue between
> > > > > > > > +PF and
> > > > > > > > + representors in same switch domain.
> > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > + is present in Rx offloading capability of device info. Setting
> > > > > > > > +the
> > > > > > > > + offloading flag in device Rx mode or Rx queue configuration to
> > > > > > > > +enable
> > > > > > > > + shared Rx queue. Polling any member port of shared Rx queue can
> > > > > > > > +return
> > > > > > > > + packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > +
> > > > > > > > Basic SR-IOV
> > > > > > > > ------------
> > > > > > > >
> > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > > RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > + RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > };
> > > > > > > >
> > > > > > > > #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > */
> > > > > > > > + uint32_t shared_group; /**< Shared port group index in
> > > > > > > > + switch domain. */
> > > > > > > > /**
> > > > > > > > * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > * Only offloads set on rx_queue_offload_capa or
> > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM 0x00040000
> > > > > > > > #define DEV_RX_OFFLOAD_RSS_HASH 0x00080000
> > > > > > > > #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > +/**
> > > > > > > > + * Rx queue is shared among ports in same switch domain to save
> > > > > > > > +memory,
> > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > + */
> > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ 0x00200000
> > > > > > > >
> > > > > > > > #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > --
> > > > > > > > 2.25.1
> > > > > > > >
> > >
> >
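On the configuration side, the enable sequence the documentation hunk describes (the offload must be present in the device's Rx offload capabilities before being set in the port or queue Rx offload mask) can be modeled standalone with the bit value from this patch. This is an illustration only: in a real application the capability would come from rte_eth_dev_info_get() and the resulting mask would go into rte_eth_rxconf.offloads, together with the shared_group index added by this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Offload bit introduced by this patch in lib/ethdev/rte_ethdev.h. */
#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ 0x00200000

/* Set the shared-Rx-queue offload only if the device advertises it in
 * its Rx offload capabilities (dev_info.rx_offload_capa in a real
 * application). Returns 1 on success, 0 if the device lacks support. */
int
enable_shared_rxq(uint64_t rx_offload_capa, uint64_t *offloads)
{
	if (!(rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ))
		return 0;
	*offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
	return 1;
}
```

Member ports of one group would run this same check and use an identical queue count, matching the 1:1 queue mapping described in the commit message.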
On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
> On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > >
> > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > >
> > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > >
> > > > > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > > > for incoming packets. When number of representors scale out in a
> > > > > > > > > switch domain, the memory consumption became significant. Most
> > > > > > > > > important, polling all ports leads to high cache miss, high
> > > > > > > > > latency and low throughput.
> > > > > > > > >
> > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > Polling any queue using same shared RX queue receives packets from
> > > > > > > > > all member ports. Source port is identified by mbuf->port.
> > > > > > > > >
> > > > > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > > > > index is
> > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > >
> > > > > > > > > Share RX queue is supposed to be polled on same thread.
> > > > > > > > >
> > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > >
> > > > > > > > Is this offload specific to the representor? If so, can this name be changed to be representor-specific?
> > > > > > >
> > > > > > > Yes, both PF and representors in a switch domain could take advantage of it.
> > > > > > >
> > > > > > > > If it is for a generic case, how will the flow ordering be maintained?
> > > > > > >
> > > > > > > Not quite sure that I understood your question. The control path is
> > > > > > > almost the same as before: PF and representor ports are still needed, and rte flows are not impacted.
> > > > > > > Queues are still needed for each member port; descriptors (mbufs) will be
> > > > > > > supplied from the shared Rx queue in my PMD implementation.
> > > > > >
> > > > > > My question was: if we create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same
> > > > > > receive queue. In that case, how is the flow order maintained for the respective receive queues?
> > > > >
> > > > > I guess the question is about the testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq:
> > > > > basically, for each packet in the rx_burst result, look up the source stream according to mbuf->port and forward to the target fs.
> > > > > Packets from the same source port could be grouped as a small burst to process; this will accelerate performance if traffic comes from
> > > > > a limited number of ports. I'll introduce some common API to do shared rxq forwarding, called with a packet-handling callback, so it suits
> > > > > all forwarding engines. Will send patches soon.
> > > > >
> > > >
> > > > All ports will put the packets into the same queue (shared queue), right? Does
> > > > this mean only a single core will poll it? What will happen if there are
> > > > multiple cores polling, won't that cause a problem?
> > > >
> > > > And if this requires specific changes in the application, I am not sure about
> > > > the solution; can't this work in a way transparent to the application?
> > >
> > > As discussed with Jerin, a new API is introduced in v3 2/8 that aggregates
> > > ports in the same group into one new port. Users could schedule polling on the
> > > aggregated port instead of all member ports.
> >
> > The v3 still has testpmd changes in the fastpath, right? IMO, for this
> > feature, we should not change the fastpath of the testpmd
> > application. Instead, testpmd can probably use aggregated ports as a
> > separate fwd_engine to show how to use this feature.
>
> Good point to discuss :) There are two strategies for polling a shared
> Rxq:
> 1. polling each member port
> All forwarding engines can be reused to work as before.
> My testpmd patches are efforts towards this direction.
> Does your PMD support this?
Unfortunately not. More than that, every application needs to change
to support this model.
> 2. polling aggregated port
> Besides the forwarding engine, more work is needed to demo it.
> This is an optional API, not supported by my PMD yet.
We are thinking of implementing this in the PMD when it comes to it,
i.e. without application changes in fastpath logic.
>
>
> >
> > >
> > > >
> > > > Overall, is this for optimizing memory for the port representors? If so, can't we
> > > > have a port-representor-specific solution? Reducing the scope can reduce the
> > > > complexity it brings.
> > > >
> > > > > > If this offload is only useful for the representor case, can we make this offload specific to the representor case by changing its name and
> > > > > > scope?
> > > > >
> > > > > It works for both PF and representors in the same switch domain; for applications like OVS, few changes are needed.
> > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > [full patch quoted here; snipped as a verbatim repeat]
> > > >
> > >
>
>
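The forwarding change discussed above (look up the source stream according to mbuf->port, and group consecutive packets from the same source into small bursts before dispatch) can be sketched as below. This is only an illustration of the idea, not the actual testpmd patch: `struct pkt` is a simplified stand-in for `rte_mbuf`, and the callback type and function names are hypothetical.

```c
#include <stdint.h>

/* Simplified stand-in for rte_mbuf: only the source port matters here. */
struct pkt {
	uint16_t port; /* filled by the PMD for shared Rx queues */
};

typedef void (*burst_cb)(uint16_t src_port, struct pkt **burst,
			 uint16_t n, void *ctx);

/*
 * Walk one rx_burst result and invoke the callback once per run of
 * consecutive packets sharing the same source port.  With traffic from
 * a limited number of ports this turns per-packet dispatch into
 * per-burst dispatch.
 */
static void
forward_shared_rxq(struct pkt **pkts, uint16_t nb, burst_cb cb, void *ctx)
{
	uint16_t i = 0;

	while (i < nb) {
		uint16_t start = i;
		uint16_t src = pkts[i]->port;

		while (i < nb && pkts[i]->port == src)
			i++;
		cb(src, &pkts[start], (uint16_t)(i - start), ctx);
	}
}

/* Example callback: just count how many sub-bursts were dispatched. */
static void
count_cb(uint16_t src_port, struct pkt **burst, uint16_t n, void *ctx)
{
	(void)src_port;
	(void)burst;
	(void)n;
	(*(unsigned int *)ctx)++;
}

static unsigned int
demo_count_bursts(struct pkt **pkts, uint16_t nb)
{
	unsigned int bursts = 0;

	forward_shared_rxq(pkts, nb, count_cb, &bursts);
	return bursts;
}
```

A burst with source ports {1, 1, 2, 1} would be dispatched as three sub-bursts under this scheme.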
On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > >
> > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > >
> > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > >
> > > > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > > for incoming packets. When number of representors scale out in a
> > > > > > > > switch domain, the memory consumption became significant. Most
> > > > > > > > important, polling all ports leads to high cache miss, high
> > > > > > > > latency and low throughput.
> > > > > > > >
> > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > Polling any queue using same shared RX queue receives packets from
> > > > > > > > all member ports. Source port is identified by mbuf->port.
> > > > > > > >
> > > > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > > > index is
> > > > > > > > 1:1 mapped in shared group.
> > > > > > > >
> > > > > > > > Share RX queue is supposed to be polled on same thread.
> > > > > > > >
> > > > > > > > Multiple groups is supported by group ID.
> > > > > > >
> > > > > > > Is this offload specific to the representor? If so, can this name be changed to be representor-specific?
> > > > > >
> > > > > > Yes, both PF and representors in a switch domain could take advantage of it.
> > > > > >
> > > > > > > If it is for a generic case, how will the flow ordering be maintained?
> > > > > >
> > > > > > Not quite sure that I understood your question. The control path is
> > > > > > almost the same as before: PF and representor ports are still needed, and rte flows are not impacted.
> > > > > > Queues are still needed for each member port; descriptors (mbufs) will be
> > > > > > supplied from the shared Rx queue in my PMD implementation.
> > > > >
> > > > > My question was: if we create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same
> > > > > receive queue. In that case, how is the flow order maintained for the respective receive queues?
> > > >
> > > > I guess the question is about the testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq:
> > > > basically, for each packet in the rx_burst result, look up the source stream according to mbuf->port and forward to the target fs.
> > > > Packets from the same source port could be grouped as a small burst to process; this will accelerate performance if traffic comes from
> > > > a limited number of ports. I'll introduce some common API to do shared rxq forwarding, called with a packet-handling callback, so it suits
> > > > all forwarding engines. Will send patches soon.
> > > >
> > >
> > > All ports will put the packets into the same queue (shared queue), right? Does
> > > this mean only a single core will poll it? What will happen if there are
> > > multiple cores polling, won't that cause a problem?
> > >
> > > And if this requires specific changes in the application, I am not sure about
> > > the solution; can't this work in a way transparent to the application?
> >
> > As discussed with Jerin, a new API is introduced in v3 2/8 that aggregates
> > ports in the same group into one new port. Users could schedule polling on the
> > aggregated port instead of all member ports.
>
> The v3 still has testpmd changes in the fastpath, right? IMO, for this
> feature, we should not change the fastpath of the testpmd
> application. Instead, testpmd can probably use aggregated ports as a
> separate fwd_engine to show how to use this feature.
Good point to discuss :) There are two strategies for polling a shared
Rxq:
1. polling each member port
All forwarding engines can be reused to work as before.
My testpmd patches are efforts towards this direction.
Does your PMD support this?
2. polling aggregated port
Besides forwarding engine, need more work to to demo it.
This is an optional API, not supported by my PMD yet.
>
> >
> > >
> > > Overall, is this for optimizing memory for the port representors? If so, can't we
> > > have a port-representor-specific solution? Reducing the scope can reduce the
> > > complexity it brings.
> > >
> > > > > If this offload is only useful for the representor case, can we make this offload specific to the representor case by changing its name and
> > > > > scope?
> > > >
> > > > It works for both PF and representors in the same switch domain; for applications like OVS, few changes are needed.
> > > >
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > [full patch quoted here; snipped as a verbatim repeat]
> > >
> >
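To make the shared-queue semantics in this exchange concrete: polling any member port drains the same underlying queue, and the application distinguishes members only via mbuf->port. The toy model below illustrates that behavior; it is not the PMD implementation (real shared Rx queues live in the driver), and its single unsynchronized ring is also why a shared Rx queue is supposed to be polled on one thread.

```c
#include <stdint.h>

#define SHARED_RING_SZ 64

/* One ring shared by every member port of a group (toy model). */
struct shared_rxq {
	uint16_t ports[SHARED_RING_SZ]; /* source port of each pending packet */
	unsigned int head, tail;
};

/* NIC side: a packet arrives for a given member port; 0 on success. */
static int
shared_rxq_enqueue(struct shared_rxq *q, uint16_t src_port)
{
	if (q->tail - q->head == SHARED_RING_SZ)
		return -1; /* no descriptors left */
	q->ports[q->tail % SHARED_RING_SZ] = src_port;
	q->tail++;
	return 0;
}

/*
 * Application side: polling *any* member port resolves to the same
 * ring, so packets from all members come back; out[] carries the
 * source port the way mbuf->port would.
 */
static uint16_t
shared_rxq_poll(struct shared_rxq *q, uint16_t member_port,
		uint16_t *out, uint16_t max)
{
	uint16_t n = 0;

	(void)member_port; /* every member maps to the same shared ring */
	while (n < max && q->head != q->tail) {
		out[n++] = q->ports[q->head % SHARED_RING_SZ];
		q->head++;
	}
	return n;
}
```

Enqueuing packets for the PF and two representors and then polling through any one member returns all three packets in arrival order, each tagged with its real source port.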
On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > >
> > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > >
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > >
> > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > > >
> > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > >
> > > > > > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > > > > for incoming packets. When number of representors scale out in a
> > > > > > > > > > switch domain, the memory consumption became significant. Most
> > > > > > > > > > important, polling all ports leads to high cache miss, high
> > > > > > > > > > latency and low throughput.
> > > > > > > > > >
> > > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > > Polling any queue using same shared RX queue receives packets from
> > > > > > > > > > all member ports. Source port is identified by mbuf->port.
> > > > > > > > > >
> > > > > > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > > > > > index is
> > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > >
> > > > > > > > > > Share RX queue is supposed to be polled on same thread.
> > > > > > > > > >
> > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > >
> > > > > > > > > Is this offload specific to the representor? If so, can this name be changed to be representor-specific?
> > > > > > > >
> > > > > > > > Yes, both PF and representors in a switch domain could take advantage of it.
> > > > > > > >
> > > > > > > > > If it is for a generic case, how will the flow ordering be maintained?
> > > > > > > >
> > > > > > > > Not quite sure that I understood your question. The control path is
> > > > > > > > almost the same as before: PF and representor ports are still needed, and rte flows are not impacted.
> > > > > > > > Queues are still needed for each member port; descriptors (mbufs) will be
> > > > > > > > supplied from the shared Rx queue in my PMD implementation.
> > > > > > >
> > > > > > > My question was: if we create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same
> > > > > > > receive queue. In that case, how is the flow order maintained for the respective receive queues?
> > > > > >
> > > > > > I guess the question is about the testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq:
> > > > > > basically, for each packet in the rx_burst result, look up the source stream according to mbuf->port and forward to the target fs.
> > > > > > Packets from the same source port could be grouped as a small burst to process; this will accelerate performance if traffic comes from
> > > > > > a limited number of ports. I'll introduce some common API to do shared rxq forwarding, called with a packet-handling callback, so it suits
> > > > > > all forwarding engines. Will send patches soon.
> > > > > >
> > > > >
> > > > > All ports will put the packets into the same queue (shared queue), right? Does
> > > > > this mean only a single core will poll it? What will happen if there are
> > > > > multiple cores polling, won't that cause a problem?
> > > > >
> > > > > And if this requires specific changes in the application, I am not sure about
> > > > > the solution; can't this work in a way transparent to the application?
> > > >
> > > > As discussed with Jerin, a new API is introduced in v3 2/8 that aggregates
> > > > ports in the same group into one new port. Users could schedule polling on the
> > > > aggregated port instead of all member ports.
> > >
> > > The v3 still has testpmd changes in the fastpath, right? IMO, for this
> > > feature, we should not change the fastpath of the testpmd
> > > application. Instead, testpmd can probably use aggregated ports as a
> > > separate fwd_engine to show how to use this feature.
> >
> > Good point to discuss :) There are two strategies for polling a shared
> > Rxq:
> > 1. polling each member port
> > All forwarding engines can be reused to work as before.
> > My testpmd patches are efforts towards this direction.
> > Does your PMD support this?
>
> Unfortunately not. More than that, every application needs to change
> to support this model.
Both strategies need the user application to resolve the port ID from
the mbuf and process accordingly.
This one doesn't demand an aggregated port, and there is no polling
schedule change.
>
> > 2. polling aggregated port
> > Besides the forwarding engine, more work is needed to demo it.
> > This is an optional API, not supported by my PMD yet.
>
> We are thinking of implementing this in the PMD when it comes to it,
> i.e. without application changes in fastpath logic.
The fastpath has to resolve the port ID anyway and forward according to
its logic. Forwarding engines need to adapt to support shared Rxq.
Fortunately, in testpmd, this can be done with an abstract API.
Let's defer part 2 until some PMD really supports it and it is tested.
What do you think?
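As noted in this reply, under either polling strategy the fastpath has to resolve a packet's source port ID to per-port state before forwarding. A minimal sketch of that resolution step follows; all names here are hypothetical stand-ins, not testpmd's real fwd_stream API.

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_PORTS 32

/* Per-source-port forwarding state (stand-in for a testpmd fwd stream). */
struct fwd_stream {
	uint64_t rx_packets;
};

static struct fwd_stream streams[MAX_PORTS];

/* Resolve a packet's source port to its stream; NULL for unknown ports. */
static struct fwd_stream *
stream_for_port(uint16_t port)
{
	if (port >= MAX_PORTS)
		return NULL;
	return &streams[port];
}

/* Account one burst from a shared Rx queue against per-port streams. */
static void
account_burst(const uint16_t *src_ports, uint16_t nb)
{
	uint16_t i;

	for (i = 0; i < nb; i++) {
		struct fwd_stream *fs = stream_for_port(src_ports[i]);

		if (fs != NULL)
			fs->rx_packets++;
	}
}
```

With a burst whose source ports are {0, 3, 3, 1}, the stream for port 3 accumulates two packets and the streams for ports 0 and 1 one packet each.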
>
> >
> >
> > >
> > > >
> > > > >
> > > > > Overall, is this for optimizing memory for the port represontors? If so can't we
> > > > > have a port representor specific solution, reducing scope can reduce the
> > > > > complexity it brings?
> > > > >
> > > > > > > If this offload is only useful for representor case, Can we make this offload specific to representor the case by changing its name and
> > > > > > > scope.
> > > > > >
> > > > > > It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > ---
> > > > > > > > > > doc/guides/nics/features.rst | 11 +++++++++++
> > > > > > > > > > doc/guides/nics/features/default.ini | 1 +
> > > > > > > > > > doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > > > > lib/ethdev/rte_ethdev.c | 1 +
> > > > > > > > > > lib/ethdev/rte_ethdev.h | 7 +++++++
> > > > > > > > > > 5 files changed, 30 insertions(+)
> > > > > > > > > >
> > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > > > > ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > +
> > > > > > > > > > +Shared Rx queue
> > > > > > > > > > +---------------
> > > > > > > > > > +
> > > > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > > > +
> > > > > > > > > > +* **[uses] rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > +
> > > > > > > > > > +
> > > > > > > > > > .. _nic_features_packet_type_parsing:
> > > > > > > > > >
> > > > > > > > > > Packet type parsing
> > > > > > > > > > diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > Queue start/stop =
> > > > > > > > > > Runtime Rx queue setup =
> > > > > > > > > > Runtime Tx queue setup =
> > > > > > > > > > +Shared Rx queue =
> > > > > > > > > > Burst mode info =
> > > > > > > > > > Power mgmt address monitor =
> > > > > > > > > > MTU update =
> > > > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > > > > .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > > >
> > > > > > > > > > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt
> > > > > > > > > > > `_
> > > > > > > > > >
> > > > > > > > > > +- Memory usage of representors is huge when number of representor grows,
> > > > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > > > +  Polling the large number of ports brings more CPU load, cache miss and
> > > > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between PF and
> > > > > > > > > > +  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > +  is present in Rx offloading capability of device info. Setting the
> > > > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to enable
> > > > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can return
> > > > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > > > +
> > > > > > > > > > Basic SR-IOV
> > > > > > > > > > ------------
> > > > > > > > > >
> > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > > > > RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > > > + RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > };
> > > > > > > > > >
> > > > > > > > > > #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > > uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > > uint16_t rx_nseg; /**< Number of descriptions in rx_seg array.
> > > > > > > > > > */
> > > > > > > > > > + uint32_t shared_group; /**< Shared port group index in switch domain. */
> > > > > > > > > > /**
> > > > > > > > > > * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > * Only offloads set on rx_queue_offload_capa or
> > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM 0x00040000
> > > > > > > > > > #define DEV_RX_OFFLOAD_RSS_HASH 0x00080000
> > > > > > > > > > #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > +/**
> > > > > > > > > > + * Rx queue is shared among ports in same switch domain to save memory,
> > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > + */
> > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ 0x00200000
> > > > > > > > > >
> > > > > > > > > > #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > > > --
> > > > > > > > > > 2.25.1
> > > > > > > > > >
> > > > >
> > > >
> >
> >
On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
> On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > >
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > >
> > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > > > >
> > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > > > > > for incoming packets. When number of representors scale out in a
> > > > > > > > > > > switch domain, the memory consumption became significant. Most
> > > > > > > > > > > important, polling all ports leads to high cache miss, high
> > > > > > > > > > > latency and low throughput.
> > > > > > > > > > >
> > > > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > > > Polling any queue using same shared RX queue receives packets from
> > > > > > > > > > > all member ports. Source port is identified by mbuf->port.
> > > > > > > > > > >
> > > > > > > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > > > > > > index is
> > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > >
> > > > > > > > > > > Share RX queue is supposed to be polled on same thread.
> > > > > > > > > > >
> > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > >
> > > > > > > > > > Is this offload specific to the representor? If so can this name be changed specifically to representor?
> > > > > > > > >
> > > > > > > > > Yes, PF and representor in switch domain could take advantage.
> > > > > > > > >
> > > > > > > > > > If it is for a generic case, how the flow ordering will be maintained?
> > > > > > > > >
> > > > > > > > > Not quite sure that I understood your question. The control path is
> > > > > > > > > almost same as before, PF and representor port still needed, rte flows not impacted.
> > > > > > > > > Queues still needed for each member port, descriptors(mbuf) will be
> > > > > > > > > supplied from shared Rx queue in my PMD implementation.
> > > > > > > >
> > > > > > > > My question was: if we create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload and multiple ethdev receive queues land into the same
> > > > > > > > receive queue, how is the flow order maintained for the respective receive queues?
> > > > > > >
> > > > > > > I guess the question is about the testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq:
> > > > > > > basically, for each packet in the rx_burst result, look up the source stream according to mbuf->port and forward to the target fs.
> > > > > > > Packets from the same source port can be grouped into a small burst to process; this accelerates performance if traffic comes from
> > > > > > > limited ports. I'll introduce a common API to do shared rxq forwarding, called with a packet-handling callback, so it suits
> > > > > > > all forwarding engines. Will send patches soon.
> > > > > > >
> > > > > >
> > > > > > All ports will put the packets in to the same queue (share queue), right? Does
> > > > > > this means only single core will poll only, what will happen if there are
> > > > > > multiple cores polling, won't it cause problem?
> > > > > >
> > > > > > And if this requires specific changes in the application, I am not sure about
> > > > > > the solution, can't this work in a transparent way to the application?
> > > > >
> > > > > Discussed with Jerin, new API introduced in v3 2/8 that aggregate ports
> > > > > in same group into one new port. Users could schedule polling on the
> > > > > aggregated port instead of all member ports.
> > > >
> > > > The v3 still has testpmd changes in fastpath. Right? IMO, For this
> > > > feature, we should not change fastpath of testpmd
> > > > application. Instead, testpmd can use aggregated ports probably as
> > > > separate fwd_engine to show how to use this feature.
> > >
> > > Good point to discuss :) There are two strategies to polling a shared
> > > Rxq:
> > > 1. polling each member port
> > > All forwarding engines can be reused to work as before.
> > > My testpmd patches are efforts towards this direction.
> > > Does your PMD support this?
> >
> > Unfortunately not. More than that, every application would need to change
> > to support this model.
>
> Both strategies need user application to resolve port ID from mbuf and
> process accordingly.
> This one doesn't demand aggregated port, no polling schedule change.
I was thinking the mbuf would be updated by the driver/aggregator port by the time it
comes to the application.
>
> >
> > > 2. polling aggregated port
> > > Besides forwarding engine, more work is needed to demo it.
> > > This is an optional API, not supported by my PMD yet.
> >
> > We are thinking of implementing this PMD when it comes to it, ie.
> > without application change in fastpath
> > logic.
>
> Fastpath has to resolve the port ID anyway and forward according to its
> logic. The forwarding engine needs to adapt to support shared Rxq.
> Fortunately, in testpmd, this can be done with an abstract API.
>
> Let's defer part 2 until some PMD really supports it and has been tested;
> what do you think?
We are not planning to use this feature, so either way is OK to me.
I leave it to the ethdev maintainers to decide between 1 and 2.
I do have a strong opinion against changing the testpmd basic forward engines
for this feature. I would like to keep them simple and fastpath-optimized, and would
like to add a separate forwarding engine as a means to verify this feature.
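For whatever forwarding engine ends up verifying this, the application-side gate is a plain bit test of the device's Rx offload capabilities against the new flag. A self-contained sketch, using the flag value 0x00200000 defined by this patch (`shared_rxq_supported()` is a hypothetical helper name, not a DPDK API):

```c
#include <stdint.h>

/* Offload flag value as defined by this patch in rte_ethdev.h. */
#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ 0x00200000

/* Return non-zero when the flag is present in the device's reported
 * Rx offload capabilities (dev_info.rx_offload_capa). */
static int
shared_rxq_supported(uint64_t rx_offload_capa)
{
	return (rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ) != 0;
}
```

If supported, the application sets the same flag in the device Rx mode or per-queue Rx configuration, together with `rte_eth_rxconf.shared_group`, to enable sharing.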
>
> >
> > >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > Overall, is this for optimizing memory for the port representors? If so can't we
> > > > > > have a port representor specific solution, reducing scope can reduce the
> > > > > > complexity it brings?
> > > > > >
> > > > > > > > If this offload is only useful for the representor case, can we make this offload specific to the representor case by changing its name and
> > > > > > > > scope.
> > > > > > >
> > > > > > > It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > >
> > > > >
> > >
> > >
>
>
> On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > >
> > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > >
> > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > > >
> > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > > > > > > for incoming packets. When number of representors scale out in a
> > > > > > > > > > > > switch domain, the memory consumption became significant. Most
> > > > > > > > > > > > important, polling all ports leads to high cache miss, high
> > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > >
> > > > > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > > > > Polling any queue using same shared RX queue receives packets from
> > > > > > > > > > > > all member ports. Source port is identified by mbuf->port.
> > > > > > > > > > > >
> > > > > > > > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > > > > > > > index is
> > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > >
> > > > > > > > > > > > Share RX queue is supposed to be polled on same thread.
> > > > > > > > > > > >
> > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > >
> > > > > > > > > > > Is this offload specific to the representor? If so can this name be changed specifically to representor?
> > > > > > > > > >
> > > > > > > > > > Yes, PF and representor in switch domain could take advantage.
> > > > > > > > > >
> > > > > > > > > > > If it is for a generic case, how the flow ordering will be maintained?
> > > > > > > > > >
> > > > > > > > > > Not quite sure that I understood your question. The control path is
> > > > > > > > > > almost same as before, PF and representor port still needed, rte flows not impacted.
> > > > > > > > > > Queues still needed for each member port, descriptors(mbuf) will be
> > > > > > > > > > supplied from shared Rx queue in my PMD implementation.
> > > > > > > > >
> > > > > > > > > My question was: if we create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload and multiple ethdev receive queues land into the same
> > > > > > > > > receive queue, how is the flow order maintained for the respective receive queues?
> > > > > > > >
> > > > > > > > I guess the question is about the testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq:
> > > > > > > > basically, for each packet in the rx_burst result, look up the source stream according to mbuf->port and forward to the target fs.
> > > > > > > > Packets from the same source port can be grouped into a small burst to process; this accelerates performance if traffic comes from
> > > > > > > > limited ports. I'll introduce a common API to do shared rxq forwarding, called with a packet-handling callback, so it suits
> > > > > > > > all forwarding engines. Will send patches soon.
> > > > > > > >
> > > > > > >
> > > > > > > All ports will put the packets in to the same queue (share queue), right? Does
> > > > > > > this means only single core will poll only, what will happen if there are
> > > > > > > multiple cores polling, won't it cause problem?
> > > > > > >
> > > > > > > And if this requires specific changes in the application, I am not sure about
> > > > > > > the solution, can't this work in a transparent way to the application?
> > > > > >
> > > > > > Discussed with Jerin, new API introduced in v3 2/8 that aggregate ports
> > > > > > in same group into one new port. Users could schedule polling on the
> > > > > > aggregated port instead of all member ports.
> > > > >
> > > > > The v3 still has testpmd changes in fastpath. Right? IMO, For this
> > > > > feature, we should not change fastpath of testpmd
> > > > > application. Instead, testpmd can use aggregated ports probably as
> > > > > separate fwd_engine to show how to use this feature.
> > > >
> > > > Good point to discuss :) There are two strategies to polling a shared
> > > > Rxq:
> > > > 1. polling each member port
> > > > All forwarding engines can be reused to work as before.
> > > > My testpmd patches are efforts towards this direction.
> > > > Does your PMD support this?
> > >
> > > Unfortunately not. More than that, every application would need to change
> > > to support this model.
> >
> > Both strategies need user application to resolve port ID from mbuf and
> > process accordingly.
> > This one doesn't demand aggregated port, no polling schedule change.
>
> I was thinking the mbuf would be updated by the driver/aggregator port by the time it
> comes to the application.
>
> >
> > >
> > > > 2. polling aggregated port
> > > > Besides forwarding engine, more work is needed to demo it.
> > > > This is an optional API, not supported by my PMD yet.
> > >
> > > We are thinking of implementing this PMD when it comes to it, ie.
> > > without application change in fastpath
> > > logic.
> >
> > Fastpath has to resolve the port ID anyway and forward according to its
> > logic. The forwarding engine needs to adapt to support shared Rxq.
> > Fortunately, in testpmd, this can be done with an abstract API.
> >
> > Let's defer part 2 until some PMD really supports it and has been tested;
> > what do you think?
>
> We are not planning to use this feature, so either way is OK to me.
> I leave it to the ethdev maintainers to decide between 1 and 2.
>
> I do have a strong opinion against changing the testpmd basic forward engines
> for this feature. I would like to keep them simple and fastpath-optimized, and would
> like to add a separate forwarding engine as a means to verify this feature.
+1 to that.
I don't think it is a 'common' feature.
So a separate FWD mode seems like the best choice to me.
>
>
>
> >
> > >
> > > >
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > Overall, is this for optimizing memory for the port representors? If so can't we
> > > > > > > have a port representor specific solution, reducing scope can reduce the
> > > > > > > complexity it brings?
> > > > > > >
> > > > > > > > > If this offload is only useful for the representor case, can we make this offload specific to the representor case by changing its name and
> > > > > > > > > scope.
> > > > > > > >
> > > > > > > > It works for both PF and representors in the same switch domain; for an application like OVS, only a few changes are needed.
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > > doc/guides/nics/features.rst | 11 +++++++++++
> > > > > > > > > > > > doc/guides/nics/features/default.ini | 1 +
> > > > > > > > > > > > doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > > > > > > lib/ethdev/rte_ethdev.c | 1 +
> > > > > > > > > > > > lib/ethdev/rte_ethdev.h | 7 +++++++
> > > > > > > > > > > > 5 files changed, 30 insertions(+)
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > > > > > > ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > +
> > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > +---------------
> > > > > > > > > > > > +
> > > > > > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > > > > > +
> > > > > > > > > > > > +* **[uses] rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > +
> > > > > > > > > > > > +
> > > > > > > > > > > > .. _nic_features_packet_type_parsing:
> > > > > > > > > > > >
> > > > > > > > > > > > Packet type parsing
> > > > > > > > > > > > diff --git a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > Queue start/stop =
> > > > > > > > > > > > Runtime Rx queue setup =
> > > > > > > > > > > > Runtime Tx queue setup =
> > > > > > > > > > > > +Shared Rx queue =
> > > > > > > > > > > > Burst mode info =
> > > > > > > > > > > > Power mgmt address monitor =
> > > > > > > > > > > > MTU update =
> > > > > > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > > > > > > .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > > > > >
> > > > > > > > > > > > <https://www.kernel.org/doc/Documentation/networking/switchdev.txt
> > > > > > > > > > > > > `_
> > > > > > > > > > > >
> > > > > > > > > > > > +- Memory usage of representors is huge when number of representor grows,
> > > > > > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > > > > > +  Polling the large number of ports brings more CPU load, cache miss and
> > > > > > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between PF and
> > > > > > > > > > > > +  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > +  is present in Rx offloading capability of device info. Setting the
> > > > > > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to enable
> > > > > > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can return
> > > > > > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > > > > > +
> > > > > > > > > > > > Basic SR-IOV
> > > > > > > > > > > > ------------
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > > > > > > RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > > > > > + RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > };
> > > > > > > > > > > >
> > > > > > > > > > > > #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > > > > uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > > > > > > > > > > > +	uint32_t shared_group; /**< Shared port group index in switch domain. */
> > > > > > > > > > > > /**
> > > > > > > > > > > > * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > > > > > > > > > > > @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM 0x00040000
> > > > > > > > > > > > #define DEV_RX_OFFLOAD_RSS_HASH 0x00080000
> > > > > > > > > > > > #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > > > +/**
> > > > > > > > > > > > + * Rx queue is shared among ports in same switch domain to save memory,
> > > > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > > > + */
> > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ 0x00200000
> > > > > > > > > > > >
> > > > > > > > > > > > #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > > > > > --
> > > > > > > > > > > > 2.25.1
> > > > > > > > > > > >
On Tue, 2021-09-28 at 13:59 +0000, Ananyev, Konstantin wrote:
> >
> > On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li
> > <xuemingl@nvidia.com> wrote:
> > >
> > > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li
> > > > <xuemingl@nvidia.com> wrote:
> > > > >
> > > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li
> > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > >
> > > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas
> > > > > > > > > > Monjalon
> > <thomas@monjalon.net>;
> > > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > introduce shared Rx queue
> > > > > > > > > >
> > > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li
> > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > <ferruh.yigit@intel.com>;
> > > > > > > > > > > > NBU-Contact-Thomas Monjalon
> > > > > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li
> > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > In current DPDK framework, each RX queue is
> > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > switch domain, the memory consumption became
> > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > important, polling all ports leads to high
> > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > >
> > > > > > > > > > > > > This patch introduces shared RX queue. Ports
> > > > > > > > > > > > > with same
> > > > > > > > > > > > > configuration in a switch domain could share
> > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > Polling any queue using same shared RX queue
> > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > all member ports. Source port is identified
> > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Port queue number in a shared group should be
> > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > index is
> > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Share RX queue is supposed to be polled on
> > > > > > > > > > > > > same thread.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > >
> > > > > > > > > > > > Is this offload specific to the representor? If
> > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > representor?
> > > > > > > > > > >
> > > > > > > > > > > Yes, PF and representor in switch domain could
> > > > > > > > > > > take advantage.
> > > > > > > > > > >
> > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > >
> > > > > > > > > > > Not quite sure that I understood your question.
> > > > > > > > > > > The control path of is
> > > > > > > > > > > almost same as before, PF and representor port
> > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > implementation.
> > > > > > > > > >
> > > > > > > > > > My question was if create a generic
> > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > ethdev receive queues land into
> > the same
> > > > > > > > > > receive queue, In that case, how the flow order is
> > > > > > > > > > maintained for respective receive queues.
> > > > > > > > >
> > > > > > > > > I guess the question is testpmd forward stream? The
> > > > > > > > > forwarding logic has to be changed slightly in case
> > > > > > > > > of shared rxq.
> > > > > > > > > basically for each packet in rx_burst result, lookup
> > > > > > > > > source stream according to mbuf->port, forwarding to
> > > > > > > > > target fs.
> > > > > > > > > Packets from same source port could be grouped as a
> > > > > > > > > small burst to process, this will accelerates the
> > > > > > > > > performance if traffic
> > come from
> > > > > > > > > limited ports. I'll introduce some common api to do
> > > > > > > > > shard rxq forwarding, call it with packets handling
> > > > > > > > > callback, so it suites for
> > > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > > >
> > > > > > > >
> > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > (share queue), right? Does
> > > > > > > > this means only single core will poll only, what will
> > > > > > > > happen if there are
> > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > >
> > > > > > > > And if this requires specific changes in the
> > > > > > > > application, I am not sure about
> > > > > > > > the solution, can't this work in a transparent way to
> > > > > > > > the application?
> > > > > > >
> > > > > > > Discussed with Jerin, new API introduced in v3 2/8 that
> > > > > > > aggregate ports
> > > > > > > in same group into one new port. Users could schedule
> > > > > > > polling on the
> > > > > > > aggregated port instead of all member ports.
> > > > > >
> > > > > > The v3 still has testpmd changes in fastpath. Right? IMO,
> > > > > > For this
> > > > > > feature, we should not change fastpath of testpmd
> > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > probably as
> > > > > > separate fwd_engine to show how to use this feature.
> > > > >
> > > > > Good point to discuss :) There are two strategies to polling
> > > > > a shared
> > > > > Rxq:
> > > > > 1. polling each member port
> > > > > All forwarding engines can be reused to work as before.
> > > > > My testpmd patches are efforts towards this direction.
> > > > > Does your PMD support this?
> > > >
> > > > Not unfortunately. More than that, every application needs to
> > > > change
> > > > to support this model.
> > >
> > > Both strategies need user application to resolve port ID from
> > > mbuf and
> > > process accordingly.
> > > This one doesn't demand aggregated port, no polling schedule
> > > change.
> >
> > I was thinking, mbuf will be updated from driver/aggregator port as
> > when it
> > comes to application.
> >
> > >
> > > >
> > > > > 2. polling aggregated port
> > > > > Besides forwarding engine, need more work to to demo it.
> > > > > This is an optional API, not supported by my PMD yet.
> > > >
> > > > We are thinking of implementing this PMD when it comes to it,
> > > > ie.
> > > > without application change in fastpath
> > > > logic.
> > >
> > > Fastpath have to resolve port ID anyway and forwarding according
> > > to
> > > logic. Forwarding engine need to adapt to support shard Rxq.
> > > Fortunately, in testpmd, this can be done with an abstract API.
> > >
> > > Let's defer part 2 until some PMD really support it and tested,
> > > how do
> > > you think?
> >
> > We are not planning to use this feature so either way it is OK to
> > me.
> > I leave to ethdev maintainers decide between 1 vs 2.
> >
> > I do have a strong opinion not changing the testpmd basic forward
> > engines
> > for this feature.I would like to keep it simple as fastpath
> > optimized and would
> > like to add a separate Forwarding engine as means to verify this
> > feature.
>
> +1 to that.
> I don't think it a 'common' feature.
> So separate FWD mode seems like a best choice to me.
-1 :)
There was an internal requirement from our test team: they need to
verify that all features (packet content, RSS, VLAN, checksum,
rte_flow...) keep working on top of a shared Rx queue. Based on the
patch, I believe the impact on the forwarding engines has been
minimized.
> > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > represontors? If so can't we
> > > > > > > > have a port representor specific solution, reducing
> > > > > > > > scope can reduce the
> > > > > > > > complexity it brings?
> > > > > > > >
> > > > > > > > > > If this offload is only useful for representor
> > > > > > > > > > case, Can we make this offload specific to
> > > > > > > > > > representor the case by changing its
> > name and
> > > > > > > > > > scope.
> > > > > > > > >
> > > > > > > > > It works for both PF and representors in same switch
> > > > > > > > > domain, for application like OVS, few changes to
> > > > > > > > > apply.
> > > > > > > > >
On Tue, 2021-09-28 at 19:08 +0530, Jerin Jacob wrote:
> On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > >
> > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > >
> > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > > >
> > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit <ferruh.yigit@intel.com>;
> > > > > > > > > > > NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs
> > > > > > > > > > > > for incoming packets. When number of representors scale out in a
> > > > > > > > > > > > switch domain, the memory consumption became significant. Most
> > > > > > > > > > > > important, polling all ports leads to high cache miss, high
> > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > >
> > > > > > > > > > > > This patch introduces shared RX queue. Ports with same
> > > > > > > > > > > > configuration in a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > > > > Polling any queue using same shared RX queue receives packets from
> > > > > > > > > > > > all member ports. Source port is identified by mbuf->port.
> > > > > > > > > > > >
> > > > > > > > > > > > Port queue number in a shared group should be identical. Queue
> > > > > > > > > > > > index is
> > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > >
> > > > > > > > > > > > Share RX queue is supposed to be polled on same thread.
> > > > > > > > > > > >
> > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > >
> > > > > > > > > > > Is this offload specific to the representor? If so can this name be changed specifically to representor?
> > > > > > > > > >
> > > > > > > > > > Yes, PF and representor in switch domain could take advantage.
> > > > > > > > > >
> > > > > > > > > > > If it is for a generic case, how the flow ordering will be maintained?
> > > > > > > > > >
> > > > > > > > > > Not quite sure that I understood your question. The control path of is
> > > > > > > > > > almost same as before, PF and representor port still needed, rte flows not impacted.
> > > > > > > > > > Queues still needed for each member port, descriptors(mbuf) will be
> > > > > > > > > > supplied from shared Rx queue in my PMD implementation.
> > > > > > > > >
> > > > > > > > > My question was if create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple ethdev receive queues land into the same
> > > > > > > > > receive queue, In that case, how the flow order is maintained for respective receive queues.
> > > > > > > >
> > > > > > > > I guess the question is testpmd forward stream? The forwarding logic has to be changed slightly in case of shared rxq.
> > > > > > > > basically for each packet in rx_burst result, lookup source stream according to mbuf->port, forwarding to target fs.
> > > > > > > > Packets from same source port could be grouped as a small burst to process, this will accelerates the performance if traffic come from
> > > > > > > > limited ports. I'll introduce some common api to do shard rxq forwarding, call it with packets handling callback, so it suites for
> > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > >
> > > > > > >
> > > > > > > All ports will put the packets in to the same queue (share queue), right? Does
> > > > > > > this means only single core will poll only, what will happen if there are
> > > > > > > multiple cores polling, won't it cause problem?
> > > > > > >
> > > > > > > And if this requires specific changes in the application, I am not sure about
> > > > > > > the solution, can't this work in a transparent way to the application?
> > > > > >
> > > > > > Discussed with Jerin, new API introduced in v3 2/8 that aggregate ports
> > > > > > in same group into one new port. Users could schedule polling on the
> > > > > > aggregated port instead of all member ports.
> > > > >
> > > > > The v3 still has testpmd changes in fastpath. Right? IMO, For this
> > > > > feature, we should not change fastpath of testpmd
> > > > > application. Instead, testpmd can use aggregated ports probably as
> > > > > separate fwd_engine to show how to use this feature.
> > > >
> > > > Good point to discuss :) There are two strategies to polling a shared
> > > > Rxq:
> > > > 1. polling each member port
> > > > All forwarding engines can be reused to work as before.
> > > > My testpmd patches are efforts towards this direction.
> > > > Does your PMD support this?
> > >
> > > Not unfortunately. More than that, every application needs to change
> > > to support this model.
> >
> > Both strategies need user application to resolve port ID from mbuf and
> > process accordingly.
> > This one doesn't demand aggregated port, no polling schedule change.
>
> I was thinking, mbuf will be updated from driver/aggregator port as when it
> comes to application.
>
> >
> > >
> > > > 2. polling aggregated port
> > > > Besides forwarding engine, need more work to to demo it.
> > > > This is an optional API, not supported by my PMD yet.
> > >
> > > We are thinking of implementing this PMD when it comes to it, ie.
> > > without application change in fastpath
> > > logic.
> >
> > Fastpath have to resolve port ID anyway and forwarding according to
> > logic. Forwarding engine need to adapt to support shard Rxq.
> > Fortunately, in testpmd, this can be done with an abstract API.
> >
> > Let's defer part 2 until some PMD really support it and tested, how do
> > you think?
>
> We are not planning to use this feature so either way it is OK to me.
> I leave to ethdev maintainers decide between 1 vs 2.
Ideally a driver should support both, but a specific driver could
select either one. Option 1 brings fewer changes to the application;
option 2 brings better performance at the cost of additional steps.
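For option 1, a configuration sketch against the API proposed in this patch (the ``shared_group`` field in ``rte_eth_rxconf`` plus the ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ`` flag) could look as below. This is an illustrative, uncompiled fragment: ``port_a``, ``port_b``, ``nb_desc``, ``socket_id``, ``mp``, ``pkts`` and ``MAX_BURST`` are placeholder names, and error handling is omitted.

```c
struct rte_eth_rxconf rxconf = dev_info.default_rxconf;

/* Both member ports opt in to the shared Rx queue and use the same
 * group ID; queue index is 1:1 mapped inside the group, so each member
 * sets up the same queue ID with the same configuration. */
rxconf.offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
rxconf.shared_group = 0;

rte_eth_rx_queue_setup(port_a, 0, nb_desc, socket_id, &rxconf, mp);
rte_eth_rx_queue_setup(port_b, 0, nb_desc, socket_id, &rxconf, mp);

/* Polling any member now returns packets from all ports in the group;
 * the real source port is saved in mbuf->port. */
nb = rte_eth_rx_burst(port_a, 0, pkts, MAX_BURST);
```

Since polling either member drains the shared queue, the application is expected to poll the group from a single thread, as stated in the commit message.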
>
> I do have a strong opinion not changing the testpmd basic forward engines
> for this feature.I would like to keep it simple as fastpath optimized and would
> like to add a separate Forwarding engine as means to verify this feature.
> > > > > > > Overall, is this for optimizing memory for the port represontors? If so can't we
> > > > > > > have a port representor specific solution, reducing scope can reduce the
> > > > > > > complexity it brings?
> > > > > > >
> > > > > > > > > If this offload is only useful for representor case, Can we make this offload specific to representor the case by changing its name and
> > > > > > > > > scope.
> > > > > > > >
> > > > > > > > It works for both PF and representors in same switch domain, for application like OVS, few changes to apply.
> > > > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > > doc/guides/nics/features.rst | 11 +++++++++++
> > > > > > > > > > > > doc/guides/nics/features/default.ini | 1 +
> > > > > > > > > > > > doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > > > > > > lib/ethdev/rte_ethdev.c | 1 +
> > > > > > > > > > > > lib/ethdev/rte_ethdev.h | 7 +++++++
> > > > > > > > > > > > 5 files changed, 30 insertions(+)
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > b/doc/guides/nics/features.rst index a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > > > > > > ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > +
> > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > +---------------
> > > > > > > > > > > > +
> > > > > > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > > > > > +
> > > > > > > > > > > > +* **[uses] rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > +
> > > > > > > > > > > > +
> > > > > > > > > > > > .. _nic_features_packet_type_parsing:
> > > > > > > > > > > >
> > > > > > > > > > > > Packet type parsing
> > > > > > > > > > > > diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > Queue start/stop =
> > > > > > > > > > > > Runtime Rx queue setup =
> > > > > > > > > > > > Runtime Tx queue setup =
> > > > > > > > > > > > +Shared Rx queue =
> > > > > > > > > > > > Burst mode info =
> > > > > > > > > > > > Power mgmt address monitor =
> > > > > > > > > > > > MTU update =
> > > > > > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > > > > > > .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > > > > >
> > > > > > > > > > > >        <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> > > > > > > > > > > >
> > > > > > > > > > > > +- Memory usage of representors is huge when number of representor grows,
> > > > > > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > > > > > +  Polling the large number of ports brings more CPU load, cache miss and
> > > > > > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between PF and
> > > > > > > > > > > > +  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > +  is present in Rx offloading capability of device info. Setting the
> > > > > > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to enable
> > > > > > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can return
> > > > > > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > > > > > +
> > > > > > > > > > > > Basic SR-IOV
> > > > > > > > > > > > ------------
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > > > > > > RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > > > > > + RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > };
> > > > > > > > > > > >
> > > > > > > > > > > > #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > > > > uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > > > > > > > > > > > + uint32_t shared_group; /**< Shared port group index in switch domain. */
> > > > > > > > > > > > /**
> > > > > > > > > > > > * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > > > > > > > > > > > @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM 0x00040000
> > > > > > > > > > > > #define DEV_RX_OFFLOAD_RSS_HASH 0x00080000
> > > > > > > > > > > > #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > > > +/**
> > > > > > > > > > > > + * Rx queue is shared among ports in same switch domain to save memory,
> > > > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > > > + */
> > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ 0x00200000
> > > > > > > > > > > >
> > > > > > > > > > > > #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > > > > > --
> > > > > > > > > > > > 2.25.1
> > > > > > > > > > > >
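To make the intended usage concrete, here is a hedged sketch of how an application could opt in to the proposed offload. It assumes the patch above is applied; the port/queue ids, the group id and the mempool argument are illustrative placeholders, and every member port of a group is expected to use the same queue index and group id:

```c
#include <errno.h>
#include <string.h>
#include <rte_ethdev.h>

/* Sketch only: request the shared-Rxq offload on one member port.
 * All ids below are illustrative, not taken from the patch. */
static int
setup_shared_rxq(uint16_t port_id, uint16_t queue_id, uint16_t nb_desc,
		 struct rte_mempool *mp)
{
	struct rte_eth_dev_info info;
	struct rte_eth_rxconf rxconf;
	int ret;

	ret = rte_eth_dev_info_get(port_id, &info);
	if (ret != 0)
		return ret;
	if ((info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ) == 0)
		return -ENOTSUP; /* device cannot share this Rx queue */

	memset(&rxconf, 0, sizeof(rxconf));
	rxconf.offloads = RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
	rxconf.shared_group = 0; /* all member ports join group 0 */

	/* Queue index must be identical across member ports of the group. */
	return rte_eth_rx_queue_setup(port_id, queue_id, nb_desc,
				      rte_eth_dev_socket_id(port_id),
				      &rxconf, mp);
}
```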
> > > > > > >
> > > > > >
> > > >
> > > >
> >
On Tue, Sep 28, 2021 at 8:10 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
> On Tue, 2021-09-28 at 13:59 +0000, Ananyev, Konstantin wrote:
> > >
> > > On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li
> > > <xuemingl@nvidia.com> wrote:
> > > >
> > > > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li
> > > > > <xuemingl@nvidia.com> wrote:
> > > > > >
> > > > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li
> > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas
> > > > > > > > > > > Monjalon
> > > <thomas@monjalon.net>;
> > > > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li
> > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hi,
> > > > > > > > > > > >
> > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > <ferruh.yigit@intel.com>;
> > > > > > > > > > > > > NBU-Contact-Thomas Monjalon
> > > > > > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li
> > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > In current DPDK framework, each RX queue is
> > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > switch domain, the memory consumption became
> > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > important, polling all ports leads to high
> > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This patch introduces shared RX queue. Ports
> > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > configuration in a switch domain could share
> > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > Polling any queue using same shared RX queue
> > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > all member ports. Source port is identified
> > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Port queue number in a shared group should be
> > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Share RX queue is supposed to be polled on
> > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Is this offload specific to the representor? If
> > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > representor?
> > > > > > > > > > > >
> > > > > > > > > > > > Yes, PF and representor in switch domain could
> > > > > > > > > > > > take advantage.
> > > > > > > > > > > >
> > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > >
> > > > > > > > > > > > Not quite sure that I understood your question.
> > > > > > > > > > > > The control path of is
> > > > > > > > > > > > almost same as before, PF and representor port
> > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > implementation.
> > > > > > > > > > >
> > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > ethdev receive queues land into
> > > the same
> > > > > > > > > > > receive queue, In that case, how the flow order is
> > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > >
> > > > > > > > > > I guess the question is testpmd forward stream? The
> > > > > > > > > > forwarding logic has to be changed slightly in case
> > > > > > > > > > of shared rxq.
> > > > > > > > > > basically for each packet in rx_burst result, lookup
> > > > > > > > > > source stream according to mbuf->port, forwarding to
> > > > > > > > > > target fs.
> > > > > > > > > > Packets from same source port could be grouped as a
> > > > > > > > > > small burst to process, this will accelerates the
> > > > > > > > > > performance if traffic
> > > come from
> > > > > > > > > > limited ports. I'll introduce some common api to do
> > > > > > > > > > shard rxq forwarding, call it with packets handling
> > > > > > > > > > callback, so it suites for
> > > > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > > > >
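The per-port lookup described above can be sketched as a small demultiplexer. This is a self-contained illustration with hypothetical stand-in types (`struct mbuf`, `fwd_cb`); a real testpmd engine would operate on `struct rte_mbuf` and look up the matching `fwd_stream` instead of calling a plain callback:

```c
#include <stdint.h>

/* Minimal stand-in for rte_mbuf: only the field this sketch needs. */
struct mbuf {
	uint16_t port; /* source port, filled in by the shared-Rxq PMD */
};

#define MAX_PORTS 4
#define MAX_BURST 32

/* Per-port staging area, emulating one fwd_stream per member port. */
struct port_burst {
	struct mbuf *pkts[MAX_BURST];
	uint16_t nb;
};

typedef void (*fwd_cb)(uint16_t port, struct mbuf **pkts, uint16_t nb);

/* Demultiplex one rx_burst result by mbuf->port, then hand each per-port
 * sub-burst to the forwarding callback in a single call, so packets from
 * the same source port are processed as one small burst. */
static void
forward_shared_rxq(struct mbuf **pkts, uint16_t nb_rx, fwd_cb fwd)
{
	struct port_burst bursts[MAX_PORTS] = {{{0}, 0}};
	uint16_t i, p;

	for (i = 0; i < nb_rx; i++) {
		p = pkts[i]->port;
		bursts[p].pkts[bursts[p].nb++] = pkts[i];
	}
	for (p = 0; p < MAX_PORTS; p++)
		if (bursts[p].nb != 0)
			fwd(p, bursts[p].pkts, bursts[p].nb);
}

/* Tiny demo callback: tallies how it was invoked. */
static uint16_t g_calls, g_pkts;
static void
count_cb(uint16_t port, struct mbuf **pkts, uint16_t nb)
{
	(void)port;
	(void)pkts;
	g_calls++;
	g_pkts = (uint16_t)(g_pkts + nb);
}
```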
> > > > > > > > >
> > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > (share queue), right? Does
> > > > > > > > > this means only single core will poll only, what will
> > > > > > > > > happen if there are
> > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > >
> > > > > > > > > And if this requires specific changes in the
> > > > > > > > > application, I am not sure about
> > > > > > > > > the solution, can't this work in a transparent way to
> > > > > > > > > the application?
> > > > > > > >
> > > > > > > > Discussed with Jerin, new API introduced in v3 2/8 that
> > > > > > > > aggregate ports
> > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > polling on the
> > > > > > > > aggregated port instead of all member ports.
> > > > > > >
> > > > > > > The v3 still has testpmd changes in fastpath. Right? IMO,
> > > > > > > For this
> > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > probably as
> > > > > > > separate fwd_engine to show how to use this feature.
> > > > > >
> > > > > > Good point to discuss :) There are two strategies to polling
> > > > > > a shared
> > > > > > Rxq:
> > > > > > 1. polling each member port
> > > > > > All forwarding engines can be reused to work as before.
> > > > > > My testpmd patches are efforts towards this direction.
> > > > > > Does your PMD support this?
> > > > >
> > > > > Not unfortunately. More than that, every application needs to
> > > > > change
> > > > > to support this model.
> > > >
> > > > Both strategies need user application to resolve port ID from
> > > > mbuf and
> > > > process accordingly.
> > > > This one doesn't demand aggregated port, no polling schedule
> > > > change.
> > >
> > > I was thinking, mbuf will be updated from driver/aggregator port as
> > > when it
> > > comes to application.
> > >
> > > >
> > > > >
> > > > > > 2. polling aggregated port
> > > > > > Besides forwarding engine, need more work to to demo it.
> > > > > > This is an optional API, not supported by my PMD yet.
> > > > >
> > > > > We are thinking of implementing this PMD when it comes to it,
> > > > > ie.
> > > > > without application change in fastpath
> > > > > logic.
> > > >
> > > > Fastpath have to resolve port ID anyway and forwarding according
> > > > to
> > > > logic. Forwarding engine need to adapt to support shard Rxq.
> > > > Fortunately, in testpmd, this can be done with an abstract API.
> > > >
> > > > Let's defer part 2 until some PMD really support it and tested,
> > > > how do
> > > > you think?
> > >
> > > We are not planning to use this feature so either way it is OK to
> > > me.
> > > I leave to ethdev maintainers decide between 1 vs 2.
> > >
> > > I do have a strong opinion not changing the testpmd basic forward
> > > engines
> > > for this feature.I would like to keep it simple as fastpath
> > > optimized and would
> > > like to add a separate Forwarding engine as means to verify this
> > > feature.
> >
> > +1 to that.
> > I don't think it a 'common' feature.
> > So separate FWD mode seems like a best choice to me.
>
> -1 :)
> There was some internal requirement from test team, they need to verify
Internal QA requirements may not be the driving factor :-)
> all features like packet content, rss, vlan, checksum, rte_flow... to
> be working based on shared rx queue. Based on the patch, I believe the
> impact has been minimized.
>
> >
> > >
> > >
> > >
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > represontors? If so can't we
> > > > > > > > > have a port representor specific solution, reducing
> > > > > > > > > scope can reduce the
> > > > > > > > > complexity it brings?
> > > > > > > > >
> > > > > > > > > > > If this offload is only useful for representor
> > > > > > > > > > > case, Can we make this offload specific to
> > > > > > > > > > > representor the case by changing its
> > > name and
> > > > > > > > > > > scope.
> > > > > > > > > >
> > > > > > > > > > It works for both PF and representors in same switch
> > > > > > > > > > domain, for application like OVS, few changes to
> > > > > > > > > > apply.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > doc/guides/nics/features.rst
> > > > > > > > > > > > > > | 11 +++++++++++
> > > > > > > > > > > > > > doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > | 1 +
> > > > > > > > > > > > > > doc/guides/prog_guide/switch_representation.
> > > > > > > > > > > > > > rst | 10 ++++++++++
> > > > > > > > > > > > > > lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > | 1 +
> > > > > > > > > > > > > > lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > | 7 +++++++
> > > > > > > > > > > > > > 5 files changed, 30 insertions(+)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > b/doc/guides/nics/features.rst index
> > > > > > > > > > > > > > a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4
> > > > > > > > > > > > > > checksum.
> > > > > > > > > > > > > > ``tx_offload_capa,tx_queue_offload_capa:DE
> > > > > > > > > > > > > > V_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +Supports shared Rx queue for ports in same
> > > > > > > > > > > > > > switch domain.
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +* **[uses]
> > > > > > > > > > > > > > rte_eth_rxconf,rte_eth_rxmode**:
> > > > > > > > > > > > > > ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Packet type parsing
> > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > > > Queue start/stop =
> > > > > > > > > > > > > > Runtime Rx queue setup =
> > > > > > > > > > > > > > Runtime Tx queue setup =
> > > > > > > > > > > > > > +Shared Rx queue =
> > > > > > > > > > > > > > Burst mode info =
> > > > > > > > > > > > > > Power mgmt address monitor =
> > > > > > > > > > > > > > MTU update =
> > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software
> > > > > > > > > > > > > > "patch panel" front-end for applications.
> > > > > > > > > > > > > > .. [1] `Ethernet switch device driver model
> > > > > > > > > > > > > > (switchdev)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > <https://www.kernel.org/doc/Documentation/net
> > > > > > > > > > > > > > working/switchdev.txt
> > > > > > > > > > > > > > > `_
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > +- Memory usage of representors is huge when
> > > > > > > > > > > > > > number of representor
> > > > > > > > > > > > > > +grows,
> > > > > > > > > > > > > > + because PMD always allocate mbuf for each
> > > > > > > > > > > > > > descriptor of Rx queue.
> > > > > > > > > > > > > > + Polling the large number of ports brings
> > > > > > > > > > > > > > more CPU load, cache
> > > > > > > > > > > > > > +miss and
> > > > > > > > > > > > > > + latency. Shared Rx queue can be used to
> > > > > > > > > > > > > > share Rx queue between
> > > > > > > > > > > > > > +PF and
> > > > > > > > > > > > > > + representors in same switch domain.
> > > > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > + is present in Rx offloading capability of
> > > > > > > > > > > > > > device info. Setting
> > > > > > > > > > > > > > +the
> > > > > > > > > > > > > > + offloading flag in device Rx mode or Rx
> > > > > > > > > > > > > > queue configuration to
> > > > > > > > > > > > > > +enable
> > > > > > > > > > > > > > + shared Rx queue. Polling any member port
> > > > > > > > > > > > > > of shared Rx queue can
> > > > > > > > > > > > > > +return
> > > > > > > > > > > > > > + packets of all ports in group, port ID is
> > > > > > > > > > > > > > saved in ``mbuf.port``.
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > Basic SR-IOV
> > > > > > > > > > > > > > ------------
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSU
> > > > > > > > > > > > > > M),
> > > > > > > > > > > > > > RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPL
> > > > > > > > > > > > > > IT),
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > };
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > > > uint8_t rx_drop_en; /**< Drop packets
> > > > > > > > > > > > > > if no descriptors are available. */
> > > > > > > > > > > > > > uint8_t rx_deferred_start; /**< Do
> > > > > > > > > > > > > > not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > > > uint16_t rx_nseg; /**< Number of
> > > > > > > > > > > > > > descriptions in rx_seg array.
> > > > > > > > > > > > > > */
> > > > > > > > > > > > > > + uint32_t shared_group; /**< Shared
> > > > > > > > > > > > > > port group index in
> > > > > > > > > > > > > > + switch domain. */
> > > > > > > > > > > > > > /**
> > > > > > > > > > > > > > * Per-queue Rx offloads to be set
> > > > > > > > > > > > > > using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > * Only offloads set on
> > > > > > > > > > > > > > rx_queue_offload_capa or
> > > > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct
> > > > > > > > > > > > > > rte_eth_conf {
> > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM
> > > > > > > > > > > > > > 0x00040000
> > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_RSS_HASH
> > > > > > > > > > > > > > 0x00080000
> > > > > > > > > > > > > > #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT
> > > > > > > > > > > > > > 0x00100000
> > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > + * Rx queue is shared among ports in same
> > > > > > > > > > > > > > switch domain to save
> > > > > > > > > > > > > > +memory,
> > > > > > > > > > > > > > + * avoid polling each port. Any port in
> > > > > > > > > > > > > > group can be used to receive packets.
> > > > > > > > > > > > > > + * Real source port number saved in mbuf-
> > > > > > > > > > > > > > >port field.
> > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > > > > > > > > > > > > > 0x00200000
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_CHECKSUM
> > > > > > > > > > > > > > (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > DEV_RX_OFFLO
> > > > > > > > > > > > > > AD_UDP_CKSUM | \
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > > >
> > > >
>
> >
> > +1 to that.
> > I don't think it is a 'common' feature.
> > So a separate FWD mode seems like the best choice to me.
>
> -1 :)
> There was an internal requirement from the test team: they need to verify
> that all features like packet content, RSS, VLAN, checksum, rte_flow...
> work based on shared rx queue.
Then I suppose you'll need to write a really comprehensive fwd-engine
to satisfy your test team :)
Speaking seriously, I still don't understand why you need all
available fwd-engines to verify this feature.
From what I understand, the main purpose of your changes to testpmd is to
allow forwarding packets through a different fwd_stream (TX through a different HW queue).
In theory, if implemented in a generic and extendable way, that
might be a useful add-on to testpmd fwd functionality.
But the current implementation looks very case-specific.
And as I don't think it is a common case, I don't see much point in polluting
basic fwd cases with it.
BTW, as a side note, the code below looks bogus to me:
+void
+forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
+ struct rte_mbuf **pkts_burst, packet_fwd_cb fwd)
+{
+ uint16_t i, nb_fs_rx = 1, port;
+
+ /* Locate real source fs according to mbuf->port. */
+ for (i = 0; i < nb_rx; ++i) {
+ rte_prefetch0(pkts_burst[i + 1]);
you access pkts_burst[] beyond the array boundary (at i == nb_rx - 1),
and you ask the CPU to prefetch an unknown and possibly invalid address.
> Based on the patch, I believe the
> impact has been minimized.
>
> > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > represontors? If so can't we
> > > > > > > > > have a port representor specific solution, reducing
> > > > > > > > > scope can reduce the
> > > > > > > > > complexity it brings?
> > > > > > > > >
> > > > > > > > > > > If this offload is only useful for representor
> > > > > > > > > > > case, Can we make this offload specific to
> > > > > > > > > > > representor the case by changing its
> > > name and
> > > > > > > > > > > scope.
> > > > > > > > > >
> > > > > > > > > > It works for both PF and representors in same switch
> > > > > > > > > > domain, for application like OVS, few changes to
> > > > > > > > > > apply.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > doc/guides/nics/features.rst
> > > > > > > > > > > > > > | 11 +++++++++++
> > > > > > > > > > > > > > doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > | 1 +
> > > > > > > > > > > > > > doc/guides/prog_guide/switch_representation.
> > > > > > > > > > > > > > rst | 10 ++++++++++
> > > > > > > > > > > > > > lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > | 1 +
> > > > > > > > > > > > > > lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > | 7 +++++++
> > > > > > > > > > > > > > 5 files changed, 30 insertions(+)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > b/doc/guides/nics/features.rst index
> > > > > > > > > > > > > > a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4
> > > > > > > > > > > > > > checksum.
> > > > > > > > > > > > > > ``tx_offload_capa,tx_queue_offload_capa:DE
> > > > > > > > > > > > > > V_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +Supports shared Rx queue for ports in same
> > > > > > > > > > > > > > switch domain.
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +* **[uses]
> > > > > > > > > > > > > > rte_eth_rxconf,rte_eth_rxmode**:
> > > > > > > > > > > > > > ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Packet type parsing
> > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > > > Queue start/stop =
> > > > > > > > > > > > > > Runtime Rx queue setup =
> > > > > > > > > > > > > > Runtime Tx queue setup =
> > > > > > > > > > > > > > +Shared Rx queue =
> > > > > > > > > > > > > > Burst mode info =
> > > > > > > > > > > > > > Power mgmt address monitor =
> > > > > > > > > > > > > > MTU update =
> > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software
> > > > > > > > > > > > > > "patch panel" front-end for applications.
> > > > > > > > > > > > > > .. [1] `Ethernet switch device driver model
> > > > > > > > > > > > > > (switchdev)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > <https://www.kernel.org/doc/Documentation/net
> > > > > > > > > > > > > > working/switchdev.txt
> > > > > > > > > > > > > > > `_
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > +- Memory usage of representors is huge when
> > > > > > > > > > > > > > number of representor
> > > > > > > > > > > > > > +grows,
> > > > > > > > > > > > > > + because PMD always allocate mbuf for each
> > > > > > > > > > > > > > descriptor of Rx queue.
> > > > > > > > > > > > > > + Polling the large number of ports brings
> > > > > > > > > > > > > > more CPU load, cache
> > > > > > > > > > > > > > +miss and
> > > > > > > > > > > > > > + latency. Shared Rx queue can be used to
> > > > > > > > > > > > > > share Rx queue between
> > > > > > > > > > > > > > +PF and
> > > > > > > > > > > > > > + representors in same switch domain.
> > > > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > + is present in Rx offloading capability of
> > > > > > > > > > > > > > device info. Setting
> > > > > > > > > > > > > > +the
> > > > > > > > > > > > > > + offloading flag in device Rx mode or Rx
> > > > > > > > > > > > > > queue configuration to
> > > > > > > > > > > > > > +enable
> > > > > > > > > > > > > > + shared Rx queue. Polling any member port
> > > > > > > > > > > > > > of shared Rx queue can
> > > > > > > > > > > > > > +return
> > > > > > > > > > > > > > + packets of all ports in group, port ID is
> > > > > > > > > > > > > > saved in ``mbuf.port``.
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > Basic SR-IOV
> > > > > > > > > > > > > > ------------
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSU
> > > > > > > > > > > > > > M),
> > > > > > > > > > > > > > RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPL
> > > > > > > > > > > > > > IT),
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > };
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > > > uint8_t rx_drop_en; /**< Drop packets
> > > > > > > > > > > > > > if no descriptors are available. */
> > > > > > > > > > > > > > uint8_t rx_deferred_start; /**< Do
> > > > > > > > > > > > > > not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > > > uint16_t rx_nseg; /**< Number of
> > > > > > > > > > > > > > descriptions in rx_seg array.
> > > > > > > > > > > > > > */
> > > > > > > > > > > > > > + uint32_t shared_group; /**< Shared
> > > > > > > > > > > > > > port group index in
> > > > > > > > > > > > > > + switch domain. */
> > > > > > > > > > > > > > /**
> > > > > > > > > > > > > > * Per-queue Rx offloads to be set
> > > > > > > > > > > > > > using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > * Only offloads set on
> > > > > > > > > > > > > > rx_queue_offload_capa or
> > > > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct
> > > > > > > > > > > > > > rte_eth_conf {
> > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM
> > > > > > > > > > > > > > 0x00040000
> > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_RSS_HASH
> > > > > > > > > > > > > > 0x00080000
> > > > > > > > > > > > > > #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT
> > > > > > > > > > > > > > 0x00100000
> > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > + * Rx queue is shared among ports in same
> > > > > > > > > > > > > > switch domain to save
> > > > > > > > > > > > > > +memory,
> > > > > > > > > > > > > > + * avoid polling each port. Any port in
> > > > > > > > > > > > > > group can be used to receive packets.
> > > > > > > > > > > > > > + * Real source port number saved in mbuf-
> > > > > > > > > > > > > > >port field.
> > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > > > > > > > > > > > > > 0x00200000
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_CHECKSUM
> > > > > > > > > > > > > > (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > DEV_RX_OFFLO
> > > > > > > > > > > > > > AD_UDP_CKSUM | \
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > 2.25.1
On Tue, 2021-09-28 at 20:29 +0530, Jerin Jacob wrote:
> On Tue, Sep 28, 2021 at 8:10 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> > On Tue, 2021-09-28 at 13:59 +0000, Ananyev, Konstantin wrote:
> > > >
> > > > On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li
> > > > <xuemingl@nvidia.com> wrote:
> > > > >
> > > > > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > > > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li
> > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > >
> > > > > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li
> > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas
> > > > > > > > > > > > Monjalon
> > > > <thomas@monjalon.net>;
> > > > > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li
> > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > >
> > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > > <ferruh.yigit@intel.com>;
> > > > > > > > > > > > > > NBU-Contact-Thomas Monjalon
> > > > > > > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li
> > > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > In current DPDK framework, each RX queue is
> > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > switch domain, the memory consumption became
> > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > important, polling all ports leads to high
> > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This patch introduces shared RX queue. Ports
> > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > configuration in a switch domain could share
> > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > Polling any queue using same shared RX queue
> > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > all member ports. Source port is identified
> > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Port queue number in a shared group should be
> > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Share RX queue is supposed to be polled on
> > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Is this offload specific to the representor? If
> > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > representor?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yes, PF and representor in switch domain could
> > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Not quite sure that I understood your question.
> > > > > > > > > > > > > The control path of is
> > > > > > > > > > > > > almost same as before, PF and representor port
> > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > implementation.
> > > > > > > > > > > >
> > > > > > > > > > > > My question was: if we create a generic
> > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload and
> > > > > > > > > > > > multiple ethdev receive queues land into the
> > > > > > > > > > > > same receive queue, how is the flow order
> > > > > > > > > > > > maintained for the respective receive queues?
> > > > > > > > > > >
> > > > > > > > > > > I guess the question is testpmd forward stream? The
> > > > > > > > > > > forwarding logic has to be changed slightly in case
> > > > > > > > > > > of shared rxq.
> > > > > > > > > > > basically for each packet in rx_burst result, lookup
> > > > > > > > > > > source stream according to mbuf->port, forwarding to
> > > > > > > > > > > target fs.
> > > > > > > > > > > Packets from the same source port could be grouped
> > > > > > > > > > > as a small burst to process; this accelerates
> > > > > > > > > > > performance if traffic comes from limited ports.
> > > > > > > > > > > I'll introduce a common API to do shared rxq
> > > > > > > > > > > forwarding, called with a packet handling callback,
> > > > > > > > > > > so it suits all forwarding engines. Will send
> > > > > > > > > > > patches soon.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > > (share queue), right? Does
> > > > > > > > > > this means only single core will poll only, what will
> > > > > > > > > > happen if there are
> > > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > > >
> > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > application, I am not sure about
> > > > > > > > > > the solution, can't this work in a transparent way to
> > > > > > > > > > the application?
> > > > > > > > >
> > > > > > > > > Discussed with Jerin: a new API introduced in v3 2/8
> > > > > > > > > aggregates ports in the same group into one new port.
> > > > > > > > > Users could schedule polling on the aggregated port
> > > > > > > > > instead of all member ports.
> > > > > > > >
> > > > > > > > The v3 still has testpmd changes in fastpath. Right? IMO,
> > > > > > > > For this
> > > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > > probably as
> > > > > > > > separate fwd_engine to show how to use this feature.
> > > > > > >
> > > > > > > Good point to discuss :) There are two strategies to polling
> > > > > > > a shared
> > > > > > > Rxq:
> > > > > > > 1. polling each member port
> > > > > > > All forwarding engines can be reused to work as before.
> > > > > > > My testpmd patches are efforts towards this direction.
> > > > > > > Does your PMD support this?
> > > > > >
> > > > > > Unfortunately not. More than that, every application needs to
> > > > > > change
> > > > > > to support this model.
> > > > >
> > > > > Both strategies need user application to resolve port ID from
> > > > > mbuf and
> > > > > process accordingly.
> > > > > This one doesn't demand aggregated port, no polling schedule
> > > > > change.
> > > >
> > > > I was thinking, mbuf will be updated from driver/aggregator port as
> > > > when it
> > > > comes to application.
> > > >
> > > > >
> > > > > >
> > > > > > > 2. polling aggregated port
> > > > > > > Besides forwarding engine, need more work to demo it.
> > > > > > > This is an optional API, not supported by my PMD yet.
> > > > > >
> > > > > > We are thinking of implementing this PMD when it comes to it,
> > > > > > ie.
> > > > > > without application change in fastpath
> > > > > > logic.
> > > > >
> > > > > Fastpath has to resolve the port ID anyway and forward
> > > > > according to logic. Forwarding engines need to adapt to
> > > > > support shared Rxq. Fortunately, in testpmd, this can be done
> > > > > with an abstract API.
> > > > >
> > > > > Let's defer part 2 until some PMD really supports it and it
> > > > > is tested, what do you think?
> > > >
> > > > We are not planning to use this feature so either way it is OK to
> > > > me.
> > > > I leave to ethdev maintainers decide between 1 vs 2.
> > > >
> > > > I do have a strong opinion against changing the testpmd basic
> > > > forward engines for this feature. I would like to keep them
> > > > simple and fastpath-optimized, and would like to add a separate
> > > > forwarding engine as a means to verify this feature.
> > >
> > > +1 to that.
> > > I don't think it is a 'common' feature.
> > > So a separate FWD mode seems like the best choice to me.
> >
> > -1 :)
> > There was some internal requirement from test team, they need to verify
>
> Internal QA requirements may not be the driving factor :-)
It will be a test requirement for any driver to face, not just internal. The
performance difference is almost zero in v3: only an "unlikely if" test on
each burst. Shared Rxq is a low-level feature; reusing all current FWD
engines to verify high-level driver features is important IMHO.
>
> > all features like packet content, rss, vlan, checksum, rte_flow... to
> > be working based on shared rx queue. Based on the patch, I believe the
> > impact has been minimized.
>
>
> > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > represontors? If so can't we
> > > > > > > > > > have a port representor specific solution, reducing
> > > > > > > > > > scope can reduce the
> > > > > > > > > > complexity it brings?
> > > > > > > > > >
> > > > > > > > > > > > If this offload is only useful for representor
> > > > > > > > > > > > case, Can we make this offload specific to
> > > > > > > > > > > > representor the case by changing its
> > > > name and
> > > > > > > > > > > > scope.
> > > > > > > > > > >
> > > > > > > > > > > It works for both PF and representors in same switch
> > > > > > > > > > > domain, for application like OVS, few changes to
> > > > > > > > > > > apply.
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > 11 +++++++++++
> > > > > > > > > > > > > > > doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > 1 +
> > > > > > > > > > > > > > > doc/guides/prog_guide/switch_representation.
> > > > > > > > > > > > > > > rst | 10 ++++++++++
> > > > > > > > > > > > > > > lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > 1 +
> > > > > > > > > > > > > > > lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > 7 +++++++
> > > > > > > > > > > > > > > 5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > b/doc/guides/nics/features.rst index
> > > > > > > > > > > > > > > a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4
> > > > > > > > > > > > > > > checksum.
> > > > > > > > > > > > > > > ``tx_offload_capa,tx_queue_offload_capa:DE
> > > > > > > > > > > > > > > V_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +Supports shared Rx queue for ports in same
> > > > > > > > > > > > > > > switch domain.
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +* **[uses]
> > > > > > > > > > > > > > > rte_eth_rxconf,rte_eth_rxmode**:
> > > > > > > > > > > > > > > ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Packet type parsing
> > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > > > > Queue start/stop =
> > > > > > > > > > > > > > > Runtime Rx queue setup =
> > > > > > > > > > > > > > > Runtime Tx queue setup =
> > > > > > > > > > > > > > > +Shared Rx queue =
> > > > > > > > > > > > > > > Burst mode info =
> > > > > > > > > > > > > > > Power mgmt address monitor =
> > > > > > > > > > > > > > > MTU update =
> > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software
> > > > > > > > > > > > > > > "patch panel" front-end for applications.
> > > > > > > > > > > > > > > .. [1] `Ethernet switch device driver model
> > > > > > > > > > > > > > > (switchdev)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > <https://www.kernel.org/doc/Documentation/net
> > > > > > > > > > > > > > > working/switchdev.txt
> > > > > > > > > > > > > > > > `_
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > +- Memory usage of representors is huge when
> > > > > > > > > > > > > > > number of representor
> > > > > > > > > > > > > > > +grows,
> > > > > > > > > > > > > > > + because PMD always allocate mbuf for each
> > > > > > > > > > > > > > > descriptor of Rx queue.
> > > > > > > > > > > > > > > + Polling the large number of ports brings
> > > > > > > > > > > > > > > more CPU load, cache
> > > > > > > > > > > > > > > +miss and
> > > > > > > > > > > > > > > + latency. Shared Rx queue can be used to
> > > > > > > > > > > > > > > share Rx queue between
> > > > > > > > > > > > > > > +PF and
> > > > > > > > > > > > > > > + representors in same switch domain.
> > > > > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > > + is present in Rx offloading capability of
> > > > > > > > > > > > > > > device info. Setting
> > > > > > > > > > > > > > > +the
> > > > > > > > > > > > > > > + offloading flag in device Rx mode or Rx
> > > > > > > > > > > > > > > queue configuration to
> > > > > > > > > > > > > > > +enable
> > > > > > > > > > > > > > > + shared Rx queue. Polling any member port
> > > > > > > > > > > > > > > of shared Rx queue can
> > > > > > > > > > > > > > > +return
> > > > > > > > > > > > > > > + packets of all ports in group, port ID is
> > > > > > > > > > > > > > > saved in ``mbuf.port``.
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > Basic SR-IOV
> > > > > > > > > > > > > > > ------------
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > > RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSU
> > > > > > > > > > > > > > > M),
> > > > > > > > > > > > > > > RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPL
> > > > > > > > > > > > > > > IT),
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > > };
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > > > > uint8_t rx_drop_en; /**< Drop packets
> > > > > > > > > > > > > > > if no descriptors are available. */
> > > > > > > > > > > > > > > uint8_t rx_deferred_start; /**< Do
> > > > > > > > > > > > > > > not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > > > > uint16_t rx_nseg; /**< Number of
> > > > > > > > > > > > > > > descriptions in rx_seg array.
> > > > > > > > > > > > > > > */
> > > > > > > > > > > > > > > + uint32_t shared_group; /**< Shared
> > > > > > > > > > > > > > > port group index in
> > > > > > > > > > > > > > > + switch domain. */
> > > > > > > > > > > > > > > /**
> > > > > > > > > > > > > > > * Per-queue Rx offloads to be set
> > > > > > > > > > > > > > > using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > > * Only offloads set on
> > > > > > > > > > > > > > > rx_queue_offload_capa or
> > > > > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct
> > > > > > > > > > > > > > > rte_eth_conf {
> > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM
> > > > > > > > > > > > > > > 0x00040000
> > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_RSS_HASH
> > > > > > > > > > > > > > > 0x00080000
> > > > > > > > > > > > > > > #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT
> > > > > > > > > > > > > > > 0x00100000
> > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > + * Rx queue is shared among ports in same
> > > > > > > > > > > > > > > switch domain to save
> > > > > > > > > > > > > > > +memory,
> > > > > > > > > > > > > > > + * avoid polling each port. Any port in
> > > > > > > > > > > > > > > group can be used to receive packets.
> > > > > > > > > > > > > > > + * Real source port number saved in mbuf-
> > > > > > > > > > > > > > > > port field.
> > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > > > > > > > > > > > > > > 0x00200000
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_CHECKSUM
> > > > > > > > > > > > > > > (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > > DEV_RX_OFFLO
> > > > > > > > > > > > > > > AD_UDP_CKSUM | \
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > 2.25.1
On Wed, Sep 29, 2021 at 1:11 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
> On Tue, 2021-09-28 at 20:29 +0530, Jerin Jacob wrote:
> > On Tue, Sep 28, 2021 at 8:10 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > > On Tue, 2021-09-28 at 13:59 +0000, Ananyev, Konstantin wrote:
> > > > >
> > > > > On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li
> > > > > <xuemingl@nvidia.com> wrote:
> > > > > >
> > > > > > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > > > > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li
> > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li
> > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas
> > > > > > > > > > > > > Monjalon
> > > > > <thomas@monjalon.net>;
> > > > > > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li
> > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > > > <ferruh.yigit@intel.com>;
> > > > > > > > > > > > > > > NBU-Contact-Thomas Monjalon
> > > > > > > > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li
> > > > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > In current DPDK framework, each RX queue is
> > > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > > switch domain, the memory consumption became
> > > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > > important, polling all ports leads to high
> > > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This patch introduces shared RX queue. Ports
> > > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > > configuration in a switch domain could share
> > > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > > Polling any queue using same shared RX queue
> > > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > > all member ports. Source port is identified
> > > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Port queue number in a shared group should be
> > > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Share RX queue is supposed to be polled on
> > > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Is this offload specific to the representor? If
> > > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yes, PF and representor in switch domain could
> > > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Not quite sure that I understood your question.
> > > > > > > > > > > > > > The control path is
> > > > > > > > > > > > > > almost same as before, PF and representor port
> > > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > > implementation.
> > > > > > > > > > > > >
> > > > > > > > > > > > > My question was: if we create a generic
> > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > > > ethdev receive queues land into
> > > > > the same
> > > > > > > > > > > > > receive queue, In that case, how the flow order is
> > > > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > > > >
> > > > > > > > > > > > I guess the question is testpmd forward stream? The
> > > > > > > > > > > > forwarding logic has to be changed slightly in case
> > > > > > > > > > > > of shared rxq.
> > > > > > > > > > > > basically for each packet in rx_burst result, lookup
> > > > > > > > > > > > source stream according to mbuf->port, forwarding to
> > > > > > > > > > > > target fs.
> > > > > > > > > > > > Packets from same source port could be grouped as a
> > > > > > > > > > > > small burst to process, this will accelerates the
> > > > > > > > > > > > performance if traffic
> > > > > come from
> > > > > > > > > > > > limited ports. I'll introduce some common api to do
> > > > > > > > > > > > shared rxq forwarding, call it with a packet-handling
> > > > > > > > > > > > callback, so it suits
> > > > > > > > > > > > all forwarding engines. Will send patches soon.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > > > (share queue), right? Does
> > > > > > > > > > > this means only single core will poll only, what will
> > > > > > > > > > > happen if there are
> > > > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > > > >
> > > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > > application, I am not sure about
> > > > > > > > > > > the solution, can't this work in a transparent way to
> > > > > > > > > > > the application?
> > > > > > > > > >
> > > > > > > > > > Discussed with Jerin, new API introduced in v3 2/8 that
> > > > > > > > > > aggregate ports
> > > > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > > > polling on the
> > > > > > > > > > aggregated port instead of all member ports.
> > > > > > > > >
> > > > > > > > > The v3 still has testpmd changes in fastpath. Right? IMO,
> > > > > > > > > For this
> > > > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > > > probably as
> > > > > > > > > separate fwd_engine to show how to use this feature.
> > > > > > > >
> > > > > > > > Good point to discuss :) There are two strategies to polling
> > > > > > > > a shared
> > > > > > > > Rxq:
> > > > > > > > 1. polling each member port
> > > > > > > > All forwarding engines can be reused to work as before.
> > > > > > > > My testpmd patches are efforts towards this direction.
> > > > > > > > Does your PMD support this?
> > > > > > >
> > > > > > > Unfortunately not. More than that, every application needs to
> > > > > > > change
> > > > > > > to support this model.
> > > > > >
> > > > > > Both strategies need the user application to resolve the port ID
> > > > > > from the mbuf and
> > > > > > process accordingly.
> > > > > > This one doesn't demand an aggregated port, and no polling schedule
> > > > > > change is needed.
> > > > >
> > > > > I was thinking, mbuf will be updated from driver/aggregator port as
> > > > > when it
> > > > > comes to application.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > 2. polling aggregated port
> > > > > > > > Besides the forwarding engine, more work is needed to demo it.
> > > > > > > > This is an optional API, not supported by my PMD yet.
> > > > > > >
> > > > > > > We are thinking of implementing this PMD when it comes to it,
> > > > > > > ie.
> > > > > > > without application change in fastpath
> > > > > > > logic.
> > > > > >
> > > > > > Fastpath has to resolve the port ID anyway and forward according
> > > > > > to
> > > > > > its logic. Forwarding engines need to adapt to support shared Rxq.
> > > > > > Fortunately, in testpmd, this can be done with an abstract API.
> > > > > >
> > > > > > Let's defer part 2 until some PMD really support it and tested,
> > > > > > how do
> > > > > > you think?
> > > > >
> > > > > We are not planning to use this feature so either way it is OK to
> > > > > me.
> > > > > I leave to ethdev maintainers decide between 1 vs 2.
> > > > >
> > > > > I do have a strong opinion on not changing the testpmd basic forward
> > > > > engines
> > > > > for this feature. I would like to keep them simple and fastpath
> > > > > optimized, and would
> > > > > like to add a separate forwarding engine as a means to verify this
> > > > > feature.
> > > >
> > > > +1 to that.
> > > > I don't think it is a 'common' feature.
> > > > So a separate FWD mode seems like the best choice to me.
> > >
> > > -1 :)
> > > There was some internal requirement from the test team: they need to verify
> >
> > Internal QA requirements may not be the driving factor :-)
>
> It will be a test requirement for any driver to face, not just an internal
> one. The performance difference is almost zero in v3; the only cost is an
> "unlikely if" test on each burst. Shared Rxq is a low-level feature, and
> reusing all current FWD engines to verify high-level driver features is
> important IMHO.
Beyond the additional if check, the real concern is polluting the
common forward engines for a feature that is not common.
If you really want to reuse the existing applications without any
application change,
I think you need to hook this into eventdev:
http://code.dpdk.org/dpdk/latest/source/lib/eventdev/rte_eventdev.h#L34
Eventdev drivers already do this in addition to other features, i.e.
an eventdev has ports (which act as a kind of aggregator) and
can receive packets from any queue, with mbuf->port set to the
port on which the packet was actually received.
In terms of mapping:
- an event queue will be a dummy, the same as an Rx queue
- the Rx adapter will also be a dummy
- event ports aggregate multiple queues and connect to a core via an event port
- on Rx, mbuf->port will be the actual port on which the packet was received
app/test-eventdev is written to use this model.
>
> >
> > > all features like packet content, rss, vlan, checksum, rte_flow...
> > > are working based on shared rx queue. Based on the patch, I believe the
> > > impact has been minimized.
> >
> >
> > >
> > > >
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > > representors? If so can't we
> > > > > > > > > > > have a port representor specific solution, reducing
> > > > > > > > > > > scope can reduce the
> > > > > > > > > > > complexity it brings?
> > > > > > > > > > >
> > > > > > > > > > > > > If this offload is only useful for representor
> > > > > > > > > > > > > case, Can we make this offload specific to
> > > > > > > > > > > > > representor the case by changing its
> > > > > name and
> > > > > > > > > > > > > scope.
> > > > > > > > > > > >
> > > > > > > > > > > > It works for both PF and representors in same switch
> > > > > > > > > > > > domain, for application like OVS, few changes to
> > > > > > > > > > > > apply.
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > 11 +++++++++++
> > > > > > > > > > > > > > > > doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > 1 +
> > > > > > > > > > > > > > > > doc/guides/prog_guide/switch_representation.
> > > > > > > > > > > > > > > > rst | 10 ++++++++++
> > > > > > > > > > > > > > > > lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > 1 +
> > > > > > > > > > > > > > > > lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > 7 +++++++
> > > > > > > > > > > > > > > > 5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > b/doc/guides/nics/features.rst index
> > > > > > > > > > > > > > > > a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4
> > > > > > > > > > > > > > > > checksum.
> > > > > > > > > > > > > > > > ``tx_offload_capa,tx_queue_offload_capa:DE
> > > > > > > > > > > > > > > > V_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > +Supports shared Rx queue for ports in same
> > > > > > > > > > > > > > > > switch domain.
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > +* **[uses]
> > > > > > > > > > > > > > > > rte_eth_rxconf,rte_eth_rxmode**:
> > > > > > > > > > > > > > > > ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Packet type parsing
> > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > > > > > Queue start/stop =
> > > > > > > > > > > > > > > > Runtime Rx queue setup =
> > > > > > > > > > > > > > > > Runtime Tx queue setup =
> > > > > > > > > > > > > > > > +Shared Rx queue =
> > > > > > > > > > > > > > > > Burst mode info =
> > > > > > > > > > > > > > > > Power mgmt address monitor =
> > > > > > > > > > > > > > > > MTU update =
> > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software
> > > > > > > > > > > > > > > > "patch panel" front-end for applications.
> > > > > > > > > > > > > > > > .. [1] `Ethernet switch device driver model
> > > > > > > > > > > > > > > > (switchdev)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > <https://www.kernel.org/doc/Documentation/net
> > > > > > > > > > > > > > > > working/switchdev.txt
> > > > > > > > > > > > > > > > > `_
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > +- Memory usage of representors is huge when
> > > > > > > > > > > > > > > > number of representor
> > > > > > > > > > > > > > > > +grows,
> > > > > > > > > > > > > > > > + because PMD always allocate mbuf for each
> > > > > > > > > > > > > > > > descriptor of Rx queue.
> > > > > > > > > > > > > > > > + Polling the large number of ports brings
> > > > > > > > > > > > > > > > more CPU load, cache
> > > > > > > > > > > > > > > > +miss and
> > > > > > > > > > > > > > > > + latency. Shared Rx queue can be used to
> > > > > > > > > > > > > > > > share Rx queue between
> > > > > > > > > > > > > > > > +PF and
> > > > > > > > > > > > > > > > + representors in same switch domain.
> > > > > > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > > > + is present in Rx offloading capability of
> > > > > > > > > > > > > > > > device info. Setting
> > > > > > > > > > > > > > > > +the
> > > > > > > > > > > > > > > > + offloading flag in device Rx mode or Rx
> > > > > > > > > > > > > > > > queue configuration to
> > > > > > > > > > > > > > > > +enable
> > > > > > > > > > > > > > > > + shared Rx queue. Polling any member port
> > > > > > > > > > > > > > > > of shared Rx queue can
> > > > > > > > > > > > > > > > +return
> > > > > > > > > > > > > > > > + packets of all ports in group, port ID is
> > > > > > > > > > > > > > > > saved in ``mbuf.port``.
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > Basic SR-IOV
> > > > > > > > > > > > > > > > ------------
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > > > RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSU
> > > > > > > > > > > > > > > > M),
> > > > > > > > > > > > > > > > RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPL
> > > > > > > > > > > > > > > > IT),
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > > > };
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > > > > > uint8_t rx_drop_en; /**< Drop packets
> > > > > > > > > > > > > > > > if no descriptors are available. */
> > > > > > > > > > > > > > > > uint8_t rx_deferred_start; /**< Do
> > > > > > > > > > > > > > > > not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > > > > > uint16_t rx_nseg; /**< Number of
> > > > > > > > > > > > > > > > descriptions in rx_seg array.
> > > > > > > > > > > > > > > > */
> > > > > > > > > > > > > > > > + uint32_t shared_group; /**< Shared
> > > > > > > > > > > > > > > > port group index in
> > > > > > > > > > > > > > > > + switch domain. */
> > > > > > > > > > > > > > > > /**
> > > > > > > > > > > > > > > > * Per-queue Rx offloads to be set
> > > > > > > > > > > > > > > > using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > > > * Only offloads set on
> > > > > > > > > > > > > > > > rx_queue_offload_capa or
> > > > > > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct
> > > > > > > > > > > > > > > > rte_eth_conf {
> > > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM
> > > > > > > > > > > > > > > > 0x00040000
> > > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_RSS_HASH
> > > > > > > > > > > > > > > > 0x00080000
> > > > > > > > > > > > > > > > #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT
> > > > > > > > > > > > > > > > 0x00100000
> > > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > > + * Rx queue is shared among ports in same
> > > > > > > > > > > > > > > > switch domain to save
> > > > > > > > > > > > > > > > +memory,
> > > > > > > > > > > > > > > > + * avoid polling each port. Any port in
> > > > > > > > > > > > > > > > group can be used to receive packets.
> > > > > > > > > > > > > > > > + * Real source port number saved in mbuf-
> > > > > > > > > > > > > > > > > port field.
> > > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > > > > > > > > > > > > > > > 0x00200000
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_CHECKSUM
> > > > > > > > > > > > > > > > (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > > > DEV_RX_OFFLO
> > > > > > > > > > > > > > > > AD_UDP_CKSUM | \
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > >
> > >
>
On Wed, 2021-09-29 at 00:26 +0000, Ananyev, Konstantin wrote:
> > > > > > > > > > > > > > > In current DPDK framework, each RX queue
> > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > switch domain, the memory consumption
> > > > > > > > > > > > > > > became
> > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > important, polling all ports leads to
> > > > > > > > > > > > > > > high
> > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This patch introduces shared RX queue.
> > > > > > > > > > > > > > > Ports
> > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > configuration in a switch domain could
> > > > > > > > > > > > > > > share
> > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > Polling any queue using same shared RX
> > > > > > > > > > > > > > > queue
> > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > all member ports. Source port is
> > > > > > > > > > > > > > > identified
> > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Port queue number in a shared group
> > > > > > > > > > > > > > > should be
> > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Share RX queue is supposed to be polled
> > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Is this offload specific to the
> > > > > > > > > > > > > > representor? If
> > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > representor?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yes, PF and representor in switch domain
> > > > > > > > > > > > > could
> > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Not quite sure that I understood your
> > > > > > > > > > > > > question.
> > > > > > > > > > > > > The control path is
> > > > > > > > > > > > > almost same as before, PF and representor
> > > > > > > > > > > > > port
> > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > implementation.
> > > > > > > > > > > >
> > > > > > > > > > > > My question was: if we create a generic
> > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > > ethdev receive queues land into
> > > > the same
> > > > > > > > > > > > receive queue, In that case, how the flow order
> > > > > > > > > > > > is
> > > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > > >
> > > > > > > > > > > I guess the question is testpmd forward stream?
> > > > > > > > > > > The
> > > > > > > > > > > forwarding logic has to be changed slightly in
> > > > > > > > > > > case
> > > > > > > > > > > of shared rxq.
> > > > > > > > > > > basically for each packet in rx_burst result,
> > > > > > > > > > > lookup
> > > > > > > > > > > source stream according to mbuf->port, forwarding
> > > > > > > > > > > to
> > > > > > > > > > > target fs.
> > > > > > > > > > > Packets from same source port could be grouped as
> > > > > > > > > > > a
> > > > > > > > > > > small burst to process, this will accelerates the
> > > > > > > > > > > performance if traffic
> > > > come from
> > > > > > > > > > > limited ports. I'll introduce some common api to
> > > > > > > > > > > do
> > > > > > > > > > > shared rxq forwarding, call it with a packet-
> > > > > > > > > > > handling
> > > > > > > > > > > callback, so it suits
> > > > > > > > > > > all forwarding engines. Will send patches soon.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > > (share queue), right? Does
> > > > > > > > > > this means only single core will poll only, what
> > > > > > > > > > will
> > > > > > > > > > happen if there are
> > > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > > >
> > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > application, I am not sure about
> > > > > > > > > > the solution, can't this work in a transparent way
> > > > > > > > > > to
> > > > > > > > > > the application?
> > > > > > > > >
> > > > > > > > > Discussed with Jerin, new API introduced in v3 2/8
> > > > > > > > > that
> > > > > > > > > aggregate ports
> > > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > > polling on the
> > > > > > > > > aggregated port instead of all member ports.
> > > > > > > >
> > > > > > > > The v3 still has testpmd changes in fastpath. Right?
> > > > > > > > IMO,
> > > > > > > > For this
> > > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > > probably as
> > > > > > > > separate fwd_engine to show how to use this feature.
> > > > > > >
> > > > > > > Good point to discuss :) There are two strategies to
> > > > > > > polling
> > > > > > > a shared
> > > > > > > Rxq:
> > > > > > > 1. polling each member port
> > > > > > > All forwarding engines can be reused to work as
> > > > > > > before.
> > > > > > > My testpmd patches are efforts towards this direction.
> > > > > > > Does your PMD support this?
> > > > > >
> > > > > > Unfortunately not. More than that, every application needs
> > > > > > to
> > > > > > change
> > > > > > to support this model.
> > > > >
> > > > > Both strategies need the user application to resolve the port ID
> > > > > from the mbuf and
> > > > > process accordingly.
> > > > > This one doesn't demand an aggregated port, and no polling schedule
> > > > > change is needed.
> > > >
> > > > I was thinking, mbuf will be updated from driver/aggregator
> > > > port as
> > > > when it
> > > > comes to application.
> > > >
> > > > >
> > > > > >
> > > > > > > 2. polling aggregated port
> > > > > > > Besides the forwarding engine, more work is needed to demo
> > > > > > > it.
> > > > > > > This is an optional API, not supported by my PMD yet.
> > > > > >
> > > > > > We are thinking of implementing this PMD when it comes to
> > > > > > it,
> > > > > > ie.
> > > > > > without application change in fastpath
> > > > > > logic.
> > > > >
> > > > > Fastpath has to resolve the port ID anyway and forward
> > > > > according
> > > > > to
> > > > > its logic. Forwarding engines need to adapt to support shared Rxq.
> > > > > Fortunately, in testpmd, this can be done with an abstract
> > > > > API.
> > > > >
> > > > > Let's defer part 2 until some PMD really support it and
> > > > > tested,
> > > > > how do
> > > > > you think?
> > > >
> > > > We are not planning to use this feature so either way it is OK
> > > > to
> > > > me.
> > > > I leave to ethdev maintainers decide between 1 vs 2.
> > > >
> > > > I do have a strong opinion on not changing the testpmd basic
> > > > forward
> > > > engines
> > > > for this feature. I would like to keep them simple and fastpath
> > > > optimized, and would
> > > > like to add a separate forwarding engine as a means to verify
> > > > this
> > > > feature.
> > >
> > > +1 to that.
> > > I don't think it is a 'common' feature.
> > > So a separate FWD mode seems like the best choice to me.
> >
> > -1 :)
> > There was some internal requirement from the test team: they need to
> > verify
> > that all features like packet content, rss, vlan, checksum, rte_flow...
> > work based on shared rx queue.
> > be working based on shared rx queue.
>
> Then I suppose you'll need to write a really comprehensive fwd-engine
> to satisfy your test team :)
> Speaking seriously, I still don't understand why you need all
> available fwd-engines to verify this feature.
Shared Rxq is a low-level feature; we need to make sure the driver's
higher-level features keep working properly. Fwd-engines like csum check
the input packet and enable L3/L4 checksum and tunnel offloads
accordingly; other engines do their own feature verification. With these
engines supported seamlessly, all existing test automation can be reused.
> From what I understand, the main purpose of your changes to test-pmd is
> to allow forwarding a packet through a different fwd_stream (TX through a
> different HW queue).
Yes, each mbuf in a burst may come from a different port. Testpmd's
current fwd-engines rely heavily on the source forwarding stream; that's
why the patch divides the burst result mbufs into sub-bursts and uses the
original fwd-engine callback to handle them. How they are handled is not
changed.
> In theory, if implemented in a generic and extendable way, that
> might be a useful add-on to testpmd fwd functionality.
> But the current implementation looks very case specific.
> And as I don't think it is a common case, I don't see much point in
> polluting
> the basic fwd cases with it.
Shared Rxq is an ethdev feature that impacts how packets get handled. It's
natural to update the forwarding engines so they don't break.
The new macro is introduced to minimize the performance impact; I'm also
wondering whether there is a more elegant solution :) The current performance
penalty is one "if unlikely" per burst.
Thinking in the reverse direction: if we don't update the fwd-engines here,
they all malfunction when shared rxq is enabled and users can't verify driver
features. Is that what you expect?
>
> BTW, as a side note, the code below looks bogus to me:
> +void
> +forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
> + struct rte_mbuf **pkts_burst, packet_fwd_cb
> fwd)
> +{
> + uint16_t i, nb_fs_rx = 1, port;
> +
> + /* Locate real source fs according to mbuf->port. */
> + for (i = 0; i < nb_rx; ++i) {
> + rte_prefetch0(pkts_burst[i + 1]);
>
> you access pkt_burst[] beyond array boundaries,
> also you ask cpu to prefetch some unknown and possibly invalid
> address.
>
> > Based on the patch, I believe the
> > impact has been minimized.
> >
> > >
> > > >
> > > >
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > represontors? If so can't we
> > > > > > > > > > have a port representor specific solution, reducing
> > > > > > > > > > scope can reduce the
> > > > > > > > > > complexity it brings?
> > > > > > > > > >
> > > > > > > > > > > > If this offload is only useful for representor
> > > > > > > > > > > > case, Can we make this offload specific to
> > > > > > > > > > > > representor the case by changing its
> > > > name and
> > > > > > > > > > > > scope.
> > > > > > > > > > >
> > > > > > > > > > > It works for both PF and representors in same
> > > > > > > > > > > switch
> > > > > > > > > > > domain, for application like OVS, few changes to
> > > > > > > > > > > apply.
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > 11 +++++++++++
> > > > > > > > > > > > > > > doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > 1 +
> > > > > > > > > > > > > > > doc/guides/prog_guide/switch_representat
> > > > > > > > > > > > > > > ion.
> > > > > > > > > > > > > > > rst | 10 ++++++++++
> > > > > > > > > > > > > > > lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > 1 +
> > > > > > > > > > > > > > > lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > 7 +++++++
> > > > > > > > > > > > > > > 5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > b/doc/guides/nics/features.rst index
> > > > > > > > > > > > > > > a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner
> > > > > > > > > > > > > > > packet L4
> > > > > > > > > > > > > > > checksum.
> > > > > > > > > > > > > > > ``tx_offload_capa,tx_queue_offload_cap
> > > > > > > > > > > > > > > a:DE
> > > > > > > > > > > > > > > V_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +Supports shared Rx queue for ports in
> > > > > > > > > > > > > > > same
> > > > > > > > > > > > > > > switch domain.
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +* **[uses]
> > > > > > > > > > > > > > > rte_eth_rxconf,rte_eth_rxmode**:
> > > > > > > > > > > > > > > ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ`
> > > > > > > > > > > > > > > `.
> > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Packet type parsing
> > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand
> > > > > > > > > > > > > > > =
> > > > > > > > > > > > > > > Queue start/stop =
> > > > > > > > > > > > > > > Runtime Rx queue setup =
> > > > > > > > > > > > > > > Runtime Tx queue setup =
> > > > > > > > > > > > > > > +Shared Rx queue =
> > > > > > > > > > > > > > > Burst mode info =
> > > > > > > > > > > > > > > Power mgmt address monitor =
> > > > > > > > > > > > > > > MTU update =
> > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representa
> > > > > > > > > > > > > > > tion
> > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representa
> > > > > > > > > > > > > > > tion
> > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representa
> > > > > > > > > > > > > > > tion
> > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representa
> > > > > > > > > > > > > > > tion
> > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a
> > > > > > > > > > > > > > > software
> > > > > > > > > > > > > > > "patch panel" front-end for applications.
> > > > > > > > > > > > > > > .. [1] `Ethernet switch device driver
> > > > > > > > > > > > > > > model
> > > > > > > > > > > > > > > (switchdev)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > <https://www.kernel.org/doc/Documentation
> > > > > > > > > > > > > > > /net
> > > > > > > > > > > > > > > working/switchdev.txt
> > > > > > > > > > > > > > > > `_
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > +- Memory usage of representors is huge
> > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > number of representor
> > > > > > > > > > > > > > > +grows,
> > > > > > > > > > > > > > > + because PMD always allocate mbuf for
> > > > > > > > > > > > > > > each
> > > > > > > > > > > > > > > descriptor of Rx queue.
> > > > > > > > > > > > > > > + Polling the large number of ports
> > > > > > > > > > > > > > > brings
> > > > > > > > > > > > > > > more CPU load, cache
> > > > > > > > > > > > > > > +miss and
> > > > > > > > > > > > > > > + latency. Shared Rx queue can be used
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > share Rx queue between
> > > > > > > > > > > > > > > +PF and
> > > > > > > > > > > > > > > + representors in same switch domain.
> > > > > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > > + is present in Rx offloading capability
> > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > device info. Setting
> > > > > > > > > > > > > > > +the
> > > > > > > > > > > > > > > + offloading flag in device Rx mode or
> > > > > > > > > > > > > > > Rx
> > > > > > > > > > > > > > > queue configuration to
> > > > > > > > > > > > > > > +enable
> > > > > > > > > > > > > > > + shared Rx queue. Polling any member
> > > > > > > > > > > > > > > port
> > > > > > > > > > > > > > > of shared Rx queue can
> > > > > > > > > > > > > > > +return
> > > > > > > > > > > > > > > + packets of all ports in group, port ID
> > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > saved in ``mbuf.port``.
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > Basic SR-IOV
> > > > > > > > > > > > > > > ------------
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > > RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_
> > > > > > > > > > > > > > > CKSU
> > > > > > > > > > > > > > > M),
> > > > > > > > > > > > > > > RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER
> > > > > > > > > > > > > > > _SPL
> > > > > > > > > > > > > > > IT),
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > > };
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct
> > > > > > > > > > > > > > > rte_eth_rxconf {
> > > > > > > > > > > > > > > uint8_t rx_drop_en; /**< Drop
> > > > > > > > > > > > > > > packets
> > > > > > > > > > > > > > > if no descriptors are available. */
> > > > > > > > > > > > > > > uint8_t rx_deferred_start; /**<
> > > > > > > > > > > > > > > Do
> > > > > > > > > > > > > > > not start queue with rte_eth_dev_start().
> > > > > > > > > > > > > > > */
> > > > > > > > > > > > > > > uint16_t rx_nseg; /**< Number of
> > > > > > > > > > > > > > > descriptions in rx_seg array.
> > > > > > > > > > > > > > > */
> > > > > > > > > > > > > > > + uint32_t shared_group; /**<
> > > > > > > > > > > > > > > Shared
> > > > > > > > > > > > > > > port group index in
> > > > > > > > > > > > > > > + switch domain. */
> > > > > > > > > > > > > > > /**
> > > > > > > > > > > > > > > * Per-queue Rx offloads to be
> > > > > > > > > > > > > > > set
> > > > > > > > > > > > > > > using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > > * Only offloads set on
> > > > > > > > > > > > > > > rx_queue_offload_capa or
> > > > > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@
> > > > > > > > > > > > > > > struct
> > > > > > > > > > > > > > > rte_eth_conf {
> > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM
> > > > > > > > > > > > > > > 0x00040000
> > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_RSS_HASH
> > > > > > > > > > > > > > > 0x00080000
> > > > > > > > > > > > > > > #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT
> > > > > > > > > > > > > > > 0x00100000
> > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > + * Rx queue is shared among ports in
> > > > > > > > > > > > > > > same
> > > > > > > > > > > > > > > switch domain to save
> > > > > > > > > > > > > > > +memory,
> > > > > > > > > > > > > > > + * avoid polling each port. Any port in
> > > > > > > > > > > > > > > group can be used to receive packets.
> > > > > > > > > > > > > > > + * Real source port number saved in
> > > > > > > > > > > > > > > mbuf-
> > > > > > > > > > > > > > > > port field.
> > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > > > > > > > > > > > > > > 0x00200000
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_CHECKSUM
> > > > > > > > > > > > > > > (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > > DEV_RX_O
> > > > > > > > > > > > > > > FFLO
> > > > > > > > > > > > > > > AD_UDP_CKSUM | \
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > >
>
On Wed, 2021-09-29 at 00:26 +0000, Ananyev, Konstantin wrote:
> > > > > > > > > > > > > > > In current DPDK framework, each RX queue
> > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > switch domain, the memory consumption
> > > > > > > > > > > > > > > became
> > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > important, polling all ports leads to
> > > > > > > > > > > > > > > high
> > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This patch introduces shared RX queue.
> > > > > > > > > > > > > > > Ports
> > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > configuration in a switch domain could
> > > > > > > > > > > > > > > share
> > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > Polling any queue using same shared RX
> > > > > > > > > > > > > > > queue
> > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > all member ports. Source port is
> > > > > > > > > > > > > > > identified
> > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Port queue number in a shared group
> > > > > > > > > > > > > > > should be
> > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Share RX queue is supposed to be polled
> > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Is this offload specific to the
> > > > > > > > > > > > > > representor? If
> > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > representor?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yes, PF and representor in switch domain
> > > > > > > > > > > > > could
> > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Not quite sure that I understood your
> > > > > > > > > > > > > question.
> > > > > > > > > > > > > The control path of is
> > > > > > > > > > > > > almost same as before, PF and representor
> > > > > > > > > > > > > port
> > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > implementation.
> > > > > > > > > > > >
> > > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > > ethdev receive queues land into
> > > > the same
> > > > > > > > > > > > receive queue, In that case, how the flow order
> > > > > > > > > > > > is
> > > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > > >
> > > > > > > > > > > I guess the question is testpmd forward stream?
> > > > > > > > > > > The
> > > > > > > > > > > forwarding logic has to be changed slightly in
> > > > > > > > > > > case
> > > > > > > > > > > of shared rxq.
> > > > > > > > > > > basically for each packet in rx_burst result,
> > > > > > > > > > > lookup
> > > > > > > > > > > source stream according to mbuf->port, forwarding
> > > > > > > > > > > to
> > > > > > > > > > > target fs.
> > > > > > > > > > > Packets from same source port could be grouped as
> > > > > > > > > > > a
> > > > > > > > > > > small burst to process, this will accelerates the
> > > > > > > > > > > performance if traffic
> > > > come from
> > > > > > > > > > > limited ports. I'll introduce some common api to
> > > > > > > > > > > do
> > > > > > > > > > > shard rxq forwarding, call it with packets
> > > > > > > > > > > handling
> > > > > > > > > > > callback, so it suites for
> > > > > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > > (share queue), right? Does
> > > > > > > > > > this means only single core will poll only, what
> > > > > > > > > > will
> > > > > > > > > > happen if there are
> > > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > > >
> > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > application, I am not sure about
> > > > > > > > > > the solution, can't this work in a transparent way
> > > > > > > > > > to
> > > > > > > > > > the application?
> > > > > > > > >
> > > > > > > > > Discussed with Jerin, new API introduced in v3 2/8
> > > > > > > > > that
> > > > > > > > > aggregate ports
> > > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > > polling on the
> > > > > > > > > aggregated port instead of all member ports.
> > > > > > > >
> > > > > > > > The v3 still has testpmd changes in fastpath. Right?
> > > > > > > > IMO,
> > > > > > > > For this
> > > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > > probably as
> > > > > > > > separate fwd_engine to show how to use this feature.
> > > > > > >
> > > > > > > Good point to discuss :) There are two strategies to
> > > > > > > polling
> > > > > > > a shared
> > > > > > > Rxq:
> > > > > > > 1. polling each member port
> > > > > > > All forwarding engines can be reused to work as
> > > > > > > before.
> > > > > > > My testpmd patches are efforts towards this direction.
> > > > > > > Does your PMD support this?
> > > > > >
> > > > > > Not unfortunately. More than that, every application needs
> > > > > > to
> > > > > > change
> > > > > > to support this model.
> > > > >
> > > > > Both strategies need user application to resolve port ID from
> > > > > mbuf and
> > > > > process accordingly.
> > > > > This one doesn't demand aggregated port, no polling schedule
> > > > > change.
> > > >
> > > > I was thinking, mbuf will be updated from driver/aggregator
> > > > port as
> > > > when it
> > > > comes to application.
> > > >
> > > > >
> > > > > >
> > > > > > > 2. polling aggregated port
> > > > > > > Besides forwarding engine, need more work to to demo
> > > > > > > it.
> > > > > > > This is an optional API, not supported by my PMD yet.
> > > > > >
> > > > > > We are thinking of implementing this PMD when it comes to
> > > > > > it,
> > > > > > ie.
> > > > > > without application change in fastpath
> > > > > > logic.
> > > > >
> > > > > Fastpath have to resolve port ID anyway and forwarding
> > > > > according
> > > > > to
> > > > > logic. Forwarding engine need to adapt to support shard Rxq.
> > > > > Fortunately, in testpmd, this can be done with an abstract
> > > > > API.
> > > > >
> BTW, as a side note, the code below looks bogus to me:
> +void
> +forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
> + struct rte_mbuf **pkts_burst, packet_fwd_cb
> fwd)
> +{
> + uint16_t i, nb_fs_rx = 1, port;
> +
> + /* Locate real source fs according to mbuf->port. */
> + for (i = 0; i < nb_rx; ++i) {
> + rte_prefetch0(pkts_burst[i + 1]);
>
> you access pkt_burst[] beyond array boundaries,
> also you ask cpu to prefetch some unknown and possibly invalid
> address.
Sorry, I forgot this topic. It's too late to prefetch the current packet, so
prefetching the next one is better. Prefetching an invalid address at the end
of a loop doesn't hurt; it's common in DPDK.
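That said, a variant of the loop that keeps the next-packet prefetch but never
indexes pkts_burst[] past nb_rx - 1 could look like this (an illustrative,
self-contained sketch; prefetch0() stands in for rte_prefetch0() and the
function name is made up):

```c
#include <stdint.h>

/* Portable stand-in for rte_prefetch0(): a cache hint on GCC/Clang,
 * a no-op elsewhere; semantically it never changes program behavior. */
#if defined(__GNUC__)
#define prefetch0(p) __builtin_prefetch((p), 0, 3)
#else
#define prefetch0(p) ((void)(p))
#endif

/* Process a burst, prefetching the next packet only while one exists,
 * so the pointer array is never read out of bounds. */
static uint16_t
process_burst(void **pkts_burst, uint16_t nb_rx)
{
	uint16_t i, done = 0;

	for (i = 0; i < nb_rx; i++) {
		if (i + 1 < nb_rx)
			prefetch0(pkts_burst[i + 1]);
		/* ... handle pkts_burst[i] here ... */
		done++;
	}
	return done;
}
```

The extra compare is cheap and well predicted, and it avoids the out-of-bounds
read of the pointer slot itself, which is what the review comment objected to.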
>
> > Based on the patch, I believe the
> > impact has been minimized.
> >
> > >
> > > >
> > > >
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > represontors? If so can't we
> > > > > > > > > > have a port representor specific solution, reducing
> > > > > > > > > > scope can reduce the
> > > > > > > > > > complexity it brings?
> > > > > > > > > >
> > > > > > > > > > > > If this offload is only useful for representor
> > > > > > > > > > > > case, Can we make this offload specific to
> > > > > > > > > > > > representor the case by changing its
> > > > name and
> > > > > > > > > > > > scope.
> > > > > > > > > > >
> > > > > > > > > > > It works for both PF and representors in same
> > > > > > > > > > > switch
> > > > > > > > > > > domain, for application like OVS, few changes to
> > > > > > > > > > > apply.
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > 11 +++++++++++
> > > > > > > > > > > > > > > doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > 1 +
> > > > > > > > > > > > > > > doc/guides/prog_guide/switch_representat
> > > > > > > > > > > > > > > ion.
> > > > > > > > > > > > > > > rst | 10 ++++++++++
> > > > > > > > > > > > > > > lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > 1 +
> > > > > > > > > > > > > > > lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > 7 +++++++
> > > > > > > > > > > > > > > 5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > b/doc/guides/nics/features.rst index
> > > > > > > > > > > > > > > a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner
> > > > > > > > > > > > > > > packet L4
> > > > > > > > > > > > > > > checksum.
> > > > > > > > > > > > > > > ``tx_offload_capa,tx_queue_offload_cap
> > > > > > > > > > > > > > > a:DE
> > > > > > > > > > > > > > > V_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +Supports shared Rx queue for ports in
> > > > > > > > > > > > > > > same
> > > > > > > > > > > > > > > switch domain.
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +* **[uses]
> > > > > > > > > > > > > > > rte_eth_rxconf,rte_eth_rxmode**:
> > > > > > > > > > > > > > > ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ`
> > > > > > > > > > > > > > > `.
> > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Packet type parsing
> > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand
> > > > > > > > > > > > > > > =
> > > > > > > > > > > > > > > Queue start/stop =
> > > > > > > > > > > > > > > Runtime Rx queue setup =
> > > > > > > > > > > > > > > Runtime Tx queue setup =
> > > > > > > > > > > > > > > +Shared Rx queue =
> > > > > > > > > > > > > > > Burst mode info =
> > > > > > > > > > > > > > > Power mgmt address monitor =
> > > > > > > > > > > > > > > MTU update =
> > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representa
> > > > > > > > > > > > > > > tion
> > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representa
> > > > > > > > > > > > > > > tion
> > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representa
> > > > > > > > > > > > > > > tion
> > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representa
> > > > > > > > > > > > > > > tion
> > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a
> > > > > > > > > > > > > > > software
> > > > > > > > > > > > > > > "patch panel" front-end for applications.
> > > > > > > > > > > > > > > .. [1] `Ethernet switch device driver
> > > > > > > > > > > > > > > model
> > > > > > > > > > > > > > > (switchdev)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > <https://www.kernel.org/doc/Documentation
> > > > > > > > > > > > > > > /net
> > > > > > > > > > > > > > > working/switchdev.txt
> > > > > > > > > > > > > > > > `_
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > +- Memory usage of representors is huge
> > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > number of representor
> > > > > > > > > > > > > > > +grows,
> > > > > > > > > > > > > > > + because PMD always allocate mbuf for
> > > > > > > > > > > > > > > each
> > > > > > > > > > > > > > > descriptor of Rx queue.
> > > > > > > > > > > > > > > + Polling the large number of ports
> > > > > > > > > > > > > > > brings
> > > > > > > > > > > > > > > more CPU load, cache
> > > > > > > > > > > > > > > +miss and
> > > > > > > > > > > > > > > + latency. Shared Rx queue can be used
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > share Rx queue between
> > > > > > > > > > > > > > > +PF and
> > > > > > > > > > > > > > > + representors in same switch domain.
> > > > > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > > + is present in Rx offloading capability
> > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > device info. Setting
> > > > > > > > > > > > > > > +the
> > > > > > > > > > > > > > > + offloading flag in device Rx mode or
> > > > > > > > > > > > > > > Rx
> > > > > > > > > > > > > > > queue configuration to
> > > > > > > > > > > > > > > +enable
> > > > > > > > > > > > > > > + shared Rx queue. Polling any member
> > > > > > > > > > > > > > > port
> > > > > > > > > > > > > > > of shared Rx queue can
> > > > > > > > > > > > > > > +return
> > > > > > > > > > > > > > > + packets of all ports in group, port ID
> > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > saved in ``mbuf.port``.
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > Basic SR-IOV
> > > > > > > > > > > > > > > ------------
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > > 	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > > > > > > > > > 	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > > 	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > > > > > > > > +	RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > > };
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > > > > 	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > > > > > > > 	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > > > > 	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > > > > > > > > > > > > > > +	uint32_t shared_group; /**< Shared port group index in
> > > > > > > > > > > > > > > +	switch domain. */
> > > > > > > > > > > > > > > 	/**
> > > > > > > > > > > > > > > 	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > > 	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > > > > > > > > > > > > > > @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_RSS_HASH         0x00080000
> > > > > > > > > > > > > > > #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > + * Rx queue is shared among ports in same switch domain to save memory,
> > > > > > > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > > 				 DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > >
>
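As an aside for readers, the two pieces added by the patch hunks quoted above, the `RTE_ETH_RX_OFFLOAD_SHARED_RXQ` flag and the `shared_group` field of `struct rte_eth_rxconf`, would be used together at queue setup time. Below is a minimal sketch of that, using a trimmed stand-in struct (not the real `rte_eth_rxconf`) so it compiles outside DPDK; in a real application the config would be passed to `rte_eth_rx_queue_setup()` for every member port.

```c
#include <stdint.h>

#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ 0x00200000 /* flag value from the patch */

/* Trimmed stand-in for struct rte_eth_rxconf: only the two fields
 * this sketch touches. */
struct rxconf {
	uint32_t shared_group; /* ports with the same group id share the Rxq */
	uint64_t offloads;     /* per-queue Rx offload flags */
};

/* Build an Rx queue config that joins shared-Rxq group 'group'.
 * Per the commit message, every member port in the group must have an
 * identical queue count, with queue indexes mapped 1:1 across ports. */
static struct rxconf
make_shared_rxconf(uint32_t group)
{
	struct rxconf conf = {0};

	conf.shared_group = group;
	conf.offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
	return conf;
}
```

The same config would be applied to the PF and each representor queue in the switch domain, which is what lets the PMD back them all with one descriptor pool.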
> -----Original Message-----
> From: Xueming(Steven) Li <xuemingl@nvidia.com>
> Sent: Wednesday, September 29, 2021 10:13 AM
> To: jerinjacobk@gmail.com; Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; andrew.rybchenko@oktetlabs.ru; dev@dpdk.org; Yigit, Ferruh
> <ferruh.yigit@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
>
> On Wed, 2021-09-29 at 00:26 +0000, Ananyev, Konstantin wrote:
> > > > > > > > > > > > > > > > In current DPDK framework, each RX queue is pre-loaded with mbufs for
> > > > > > > > > > > > > > > > incoming packets. When number of representors scale out in a switch
> > > > > > > > > > > > > > > > domain, the memory consumption became significant. Most important,
> > > > > > > > > > > > > > > > polling all ports leads to high cache miss, high latency and low
> > > > > > > > > > > > > > > > throughput.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This patch introduces shared RX queue. Ports with same configuration in
> > > > > > > > > > > > > > > > a switch domain could share RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > > Polling any queue using same shared RX queue receives packets from all
> > > > > > > > > > > > > > > > member ports. Source port is identified by mbuf->port.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Port queue number in a shared group should be identical. Queue index is
> > > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Share RX queue is supposed to be polled on same thread.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Is this offload specific to the representor? If so can this name be
> > > > > > > > > > > > > > > changed specifically to representor?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yes, PF and representors in the same switch domain could take
> > > > > > > > > > > > > > advantage of it.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > If it is for a generic case, how will the flow ordering be
> > > > > > > > > > > > > > > maintained?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Not quite sure that I understood your question. The control path is
> > > > > > > > > > > > > > almost the same as before: PF and representor ports are still needed,
> > > > > > > > > > > > > > and rte flows are not impacted. Queues are still needed for each
> > > > > > > > > > > > > > member port; descriptors (mbufs) will be supplied from the shared Rx
> > > > > > > > > > > > > > queue in my PMD implementation.
> > > > > > > > > > > > >
> > > > > > > > > > > > > My question was: if we create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > > > > > > > > > > > > offload and multiple ethdev receive queues land into the same receive
> > > > > > > > > > > > > queue, how is the flow order maintained for the respective receive
> > > > > > > > > > > > > queues?
> > > > > > > > > > > >
> > > > > > > > > > > > I guess the question is about the testpmd forward stream? The
> > > > > > > > > > > > forwarding logic has to be changed slightly in case of shared rxq:
> > > > > > > > > > > > basically, for each packet in the rx_burst result, look up the source
> > > > > > > > > > > > stream according to mbuf->port and forward to the target fs. Packets
> > > > > > > > > > > > from the same source port could be grouped into a small burst to
> > > > > > > > > > > > process; this accelerates performance if traffic comes from a limited
> > > > > > > > > > > > number of ports. I'll introduce a common API to do shared rxq
> > > > > > > > > > > > forwarding, called with a packet-handling callback, so it suits all
> > > > > > > > > > > > forwarding engines. Will send patches soon.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > All ports will put the packets into the same queue (shared queue),
> > > > > > > > > > > right? Does this mean only a single core will poll? What will happen
> > > > > > > > > > > if there are multiple cores polling; won't it cause a problem?
> > > > > > > > > > >
> > > > > > > > > > > And if this requires specific changes in the application, I am not
> > > > > > > > > > > sure about the solution; can't this work in a way transparent to the
> > > > > > > > > > > application?
> > > > > > > > > >
> > > > > > > > > > Discussed with Jerin; a new API is introduced in v3 2/8 that
> > > > > > > > > > aggregates ports in the same group into one new port. Users could
> > > > > > > > > > schedule polling on the aggregated port instead of all member ports.
> > > > > > > > >
> > > > > > > > > The v3 still has testpmd changes in the fastpath, right? IMO, for
> > > > > > > > > this feature, we should not change the fastpath of the testpmd
> > > > > > > > > application. Instead, testpmd can probably use aggregated ports as a
> > > > > > > > > separate fwd_engine to show how to use this feature.
> > > > > > > >
> > > > > > > > Good point to discuss :) There are two strategies for polling a
> > > > > > > > shared Rxq:
> > > > > > > > 1. Polling each member port.
> > > > > > > >    All forwarding engines can be reused to work as before.
> > > > > > > >    My testpmd patches are efforts in this direction.
> > > > > > > >    Does your PMD support this?
> > > > > > >
> > > > > > > Unfortunately not. More than that, every application needs to change
> > > > > > > to support this model.
> > > > > >
> > > > > > Both strategies need the user application to resolve the port ID from
> > > > > > the mbuf and process accordingly.
> > > > > > This one doesn't require an aggregated port and involves no polling
> > > > > > schedule change.
> > > > >
> > > > > I was thinking the mbuf would be updated by the driver/aggregator port
> > > > > by the time it reaches the application.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > 2. Polling the aggregated port.
> > > > > > > >    Besides the forwarding engine, more work is needed to demo it.
> > > > > > > >    This is an optional API, not supported by my PMD yet.
> > > > > > >
> > > > > > > We are thinking of implementing this in the PMD when it comes to it,
> > > > > > > i.e. without application changes in the fastpath logic.
> > > > > >
> > > > > > The fastpath has to resolve the port ID anyway and forward according
> > > > > > to its logic. Forwarding engines need to adapt to support shared Rxq.
> > > > > > Fortunately, in testpmd, this can be done with an abstract API.
> > > > > >
> > > > > > Let's defer part 2 until some PMD really supports it and it has been
> > > > > > tested; what do you think?
> > > > >
> > > > > We are not planning to use this feature, so either way is OK with me.
> > > > > I leave it to the ethdev maintainers to decide between 1 and 2.
> > > > >
> > > > > I do have a strong opinion on not changing the testpmd basic forwarding
> > > > > engines for this feature. I would like to keep them simple and
> > > > > fastpath-optimized, and would like to add a separate forwarding engine
> > > > > as a means to verify this feature.
> > > >
> > > > +1 to that.
> > > > I don't think it is a 'common' feature.
> > > > So a separate FWD mode seems like the best choice to me.
> > >
> > > -1 :)
> > > There was an internal requirement from the test team: they need to
> > > verify that all features, like packet content, rss, vlan, checksum,
> > > rte_flow... are working based on the shared rx queue.
> >
> > Then I suppose you'll need to write a really comprehensive fwd-engine
> > to satisfy your test team :)
> > Speaking seriously, I still don't understand why you need all
> > available fwd-engines to verify this feature.
> > From what I understand, the main purpose of your changes to test-pmd is
> > to allow forwarding packets through a different fwd_stream (TX through
> > a different HW queue).
> > In theory, if implemented in a generic and extendable way, that
> > might be a useful add-on to testpmd fwd functionality.
> > But the current implementation looks very case-specific.
> > And as I don't think it is a common case, I don't see much point in
> > polluting the basic fwd cases with it.
> >
> > BTW, as a side note, the code below looks bogus to me:
> > +void
> > +forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
> > + struct rte_mbuf **pkts_burst, packet_fwd_cb
> > fwd)
> > +{
> > + uint16_t i, nb_fs_rx = 1, port;
> > +
> > + /* Locate real source fs according to mbuf->port. */
> > + for (i = 0; i < nb_rx; ++i) {
> > + rte_prefetch0(pkts_burst[i + 1]);
> >
> > you access pkt_burst[] beyond array boundaries,
> > also you ask cpu to prefetch some unknown and possibly invalid
> > address.
>
> Sorry, I forgot this topic. It's too late to prefetch the current
> packet, so prefetching the next one is better. Prefetching an invalid
> address at the end of a loop doesn't hurt; it's common in DPDK.
First of all, it is usually never 'OK' to access an array beyond its bounds.
Second, prefetching an invalid address *does* hurt performance badly on many CPUs
(TLB misses, consumed memory bandwidth, etc.).
As a reference: https://lwn.net/Articles/444346/
If some existing DPDK code really does that, then I believe it is an issue and has to be addressed.
More importantly, it is a really bad attitude to submit bogus code to the DPDK community
and pretend that it is 'OK'.
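For illustration, a bounds-safe variant of that loop might look like the sketch below. This is not the patch's code: `prefetch0` is a no-op stand-in for `rte_prefetch0()` and `struct mbuf` is a trimmed stand-in for `rte_mbuf`, so the example compiles outside DPDK. The point is only the `i + 1 < nb_rx` guard.

```c
#include <stdint.h>

/* No-op stand-in for rte_prefetch0(); prefetching is only a hint, so a
 * stub keeps this sketch self-contained and portable. */
static inline void prefetch0(const void *p) { (void)p; }

/* Trimmed stand-in for struct rte_mbuf. */
struct mbuf { uint16_t port; };

/* Walk a burst and count packets from 'match_port', prefetching the
 * next element only while one actually exists, so pkts[] is never read
 * past index nb_rx - 1. */
static uint32_t
count_ports_safe(struct mbuf **pkts, uint16_t nb_rx, uint16_t match_port)
{
	uint32_t hits = 0;

	for (uint16_t i = 0; i < nb_rx; i++) {
		if (i + 1 < nb_rx)          /* guard: stay in bounds */
			prefetch0(pkts[i + 1]);
		if (pkts[i]->port == match_port)
			hits++;
	}
	return hits;
}
```

With the guard, the last iteration simply skips the prefetch hint instead of dereferencing memory past the end of the burst.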
>
> >
> > > Based on the patch, I believe the
> > > impact has been minimized.
> > >
> > > >
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Overall, is this for optimizing memory for the port representors? If
> > > > > > > > > > > so, can't we have a port-representor-specific solution? Reducing the
> > > > > > > > > > > scope can reduce the complexity it brings.
> > > > > > > > > > >
> > > > > > > > > > > > > If this offload is only useful for the representor case, can we
> > > > > > > > > > > > > make this offload specific to the representor case by changing its
> > > > > > > > > > > > > name and scope?
> > > > > > > > > > > >
> > > > > > > > > > > > It works for both PF and representors in the same switch domain; for
> > > > > > > > > > > > an application like OVS, few changes are needed to apply it.
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > > > > > > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > > > > > > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > index a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > > > > > > > > > >   ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > +* **[uses] rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > > > > > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > > > > > > > > >         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > +- Memory usage of representors is huge when number of representor grows,
> > > > > > > > > > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > > > > > > > > > +  Polling the large number of ports brings more CPU load, cache miss and
> > > > > > > > > > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between PF and
> > > > > > > > > > > > > > > > +  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > > > +  is present in Rx offloading capability of device info. Setting the
> > > > > > > > > > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to enable
> > > > > > > > > > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can return
> > > > > > > > > > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > > > > > >  ------------
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > > >  	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > > > > > > > > > >  	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > > >  	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > > > > > > > > > +	RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > > >  };
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > > > > >  	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > > > > > > > >  	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > > > > >  	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > > > > > > > > > > > > > > > +	uint32_t shared_group; /**< Shared port group index in
> > > > > > > > > > > > > > > > +	switch domain. */
> > > > > > > > > > > > > > > >  	/**
> > > > > > > > > > > > > > > >  	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > > >  	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > > > > > > > > > > > > > > > @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH         0x00080000
> > > > > > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > > + * Rx queue is shared among ports in same switch domain to save memory,
> > > > > > > > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > > >  				 DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > >
> >
> > > > > > > > > > > > > > > > In current DPDK framework, each RX queue
> > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > > switch domain, the memory consumption
> > > > > > > > > > > > > > > > became
> > > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > > important, polling all ports leads to
> > > > > > > > > > > > > > > > high
> > > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This patch introduces shared RX queue.
> > > > > > > > > > > > > > > > Ports
> > > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > > configuration in a switch domain could
> > > > > > > > > > > > > > > > share
> > > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > > Polling any queue using same shared RX
> > > > > > > > > > > > > > > > queue
> > > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > > all member ports. Source port is
> > > > > > > > > > > > > > > > identified
> > > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Port queue number in a shared group
> > > > > > > > > > > > > > > > should be
> > > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Share RX queue is supposed to be polled
> > > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Is this offload specific to the
> > > > > > > > > > > > > > > representor? If
> > > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yes, PF and representor in switch domain
> > > > > > > > > > > > > > could
> > > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Not quite sure that I understood your
> > > > > > > > > > > > > > question.
> > > > > > > > > > > > > > The control path of is
> > > > > > > > > > > > > > almost same as before, PF and representor
> > > > > > > > > > > > > > port
> > > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > > implementation.
> > > > > > > > > > > > >
> > > > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > > > ethdev receive queues land into
> > > > > the same
> > > > > > > > > > > > > receive queue, In that case, how the flow order
> > > > > > > > > > > > > is
> > > > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > > > >
> > > > > > > > > > > > I guess the question is testpmd forward stream?
> > > > > > > > > > > > The
> > > > > > > > > > > > forwarding logic has to be changed slightly in
> > > > > > > > > > > > case
> > > > > > > > > > > > of shared rxq.
> > > > > > > > > > > > basically for each packet in rx_burst result,
> > > > > > > > > > > > lookup
> > > > > > > > > > > > source stream according to mbuf->port, forwarding
> > > > > > > > > > > > to
> > > > > > > > > > > > target fs.
> > > > > > > > > > > > Packets from same source port could be grouped as
> > > > > > > > > > > > a
> > > > > > > > > > > > small burst to process, this will accelerates the
> > > > > > > > > > > > performance if traffic
> > > > > come from
> > > > > > > > > > > > limited ports. I'll introduce some common api to
> > > > > > > > > > > > do
> > > > > > > > > > > > shard rxq forwarding, call it with packets
> > > > > > > > > > > > handling
> > > > > > > > > > > > callback, so it suites for
> > > > > > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > > > (share queue), right? Does
> > > > > > > > > > > this means only single core will poll only, what
> > > > > > > > > > > will
> > > > > > > > > > > happen if there are
> > > > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > > > >
> > > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > > application, I am not sure about
> > > > > > > > > > > the solution, can't this work in a transparent way
> > > > > > > > > > > to
> > > > > > > > > > > the application?
> > > > > > > > > >
> > > > > > > > > > Discussed with Jerin, new API introduced in v3 2/8
> > > > > > > > > > that
> > > > > > > > > > aggregate ports
> > > > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > > > polling on the
> > > > > > > > > > aggregated port instead of all member ports.
> > > > > > > > >
> > > > > > > > > The v3 still has testpmd changes in fastpath. Right?
> > > > > > > > > IMO,
> > > > > > > > > For this
> > > > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > > > probably as
> > > > > > > > > separate fwd_engine to show how to use this feature.
> > > > > > > >
> > > > > > > > Good point to discuss :) There are two strategies to
> > > > > > > > polling
> > > > > > > > a shared
> > > > > > > > Rxq:
> > > > > > > > 1. polling each member port
> > > > > > > > All forwarding engines can be reused to work as
> > > > > > > > before.
> > > > > > > > My testpmd patches are efforts towards this direction.
> > > > > > > > Does your PMD support this?
> > > > > > >
> > > > > > > Not unfortunately. More than that, every application needs
> > > > > > > to
> > > > > > > change
> > > > > > > to support this model.
> > > > > >
> > > > > > Both strategies need user application to resolve port ID from
> > > > > > mbuf and
> > > > > > process accordingly.
> > > > > > This one doesn't demand aggregated port, no polling schedule
> > > > > > change.
> > > > >
> > > > > I was thinking, mbuf will be updated from driver/aggregator
> > > > > port as
> > > > > when it
> > > > > comes to application.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > 2. polling aggregated port
> > > > > > > > Besides forwarding engine, need more work to to demo
> > > > > > > > it.
> > > > > > > > This is an optional API, not supported by my PMD yet.
> > > > > > >
> > > > > > > We are thinking of implementing this PMD when it comes to
> > > > > > > it,
> > > > > > > ie.
> > > > > > > without application change in fastpath
> > > > > > > logic.
> > > > > >
> > > > > > Fastpath have to resolve port ID anyway and forwarding
> > > > > > according
> > > > > > to
> > > > > > logic. Forwarding engine need to adapt to support shard Rxq.
> > > > > > Fortunately, in testpmd, this can be done with an abstract
> > > > > > API.
> > > > > >
> > > > > > Let's defer part 2 until some PMD really support it and
> > > > > > tested,
> > > > > > how do
> > > > > > you think?
> > > > >
> > > > > We are not planning to use this feature so either way it is OK
> > > > > to
> > > > > me.
> > > > > I leave to ethdev maintainers decide between 1 vs 2.
> > > > >
> > > > > I do have a strong opinion not changing the testpmd basic
> > > > > forward
> > > > > engines
> > > > > for this feature. I would like to keep it simple as fastpath
> > > > > optimized and would
> > > > > like to add a separate Forwarding engine as means to verify
> > > > > this
> > > > > feature.
> > > >
> > > > +1 to that.
> > > > I don't think it's a 'common' feature.
> > > > So separate FWD mode seems like a best choice to me.
> > >
> > > -1 :)
> > > There was some internal requirement from test team, they need to
> > > verify
> > > all features like packet content, rss, vlan, checksum, rte_flow...
> > > to
> > > be working based on shared rx queue.
> >
> > Then I suppose you'll need to write really comprehensive fwd-engine
> > to satisfy your test team :)
> > Speaking seriously, I still don't understand why you need all
> > available fwd-engines to verify this feature.
>
> The shared Rxq is a low-level feature; we need to make sure the driver's
> higher-level features keep working properly. fwd-engines like csum check
> the input packet and enable L3/L4 checksum and tunnel offloads accordingly;
> other engines do their own feature verification. All test automation
> could be reused with these engines supported seamlessly.
>
> > From what I understand, the main purpose of your changes to test-pmd is
> > to allow forwarding packets through different fwd_streams (TX through
> > different HW queues).
>
> Yes, each mbuf in a burst may come from a different port. testpmd's
> current fwd-engines rely heavily on the source forwarding stream; that's
> why the patch divides the burst result mbufs into sub-bursts and uses the
> original fwd-engine callback to handle them. How packets are handled is
> not changed.
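For illustration only, a standalone sketch of the sub-burst idea described above; the struct, callback type, and function names here are invented stand-ins, not the actual testpmd patch:

```c
#include <stdint.h>

/* Minimal stand-in for rte_mbuf: only the field this sketch needs. */
struct mbuf { uint16_t port; };

/* Stand-in for the original per-stream forwarding callback. */
typedef void (*fwd_cb)(uint16_t port, struct mbuf **pkts, uint16_t n);

/*
 * Divide a shared-rxq burst into sub-bursts of consecutive packets
 * from the same source port and hand each run to the original
 * forwarding callback. How packets are handled is unchanged; only
 * the grouping by mbuf->port is new.
 */
static void
forward_by_source_port(struct mbuf **burst, uint16_t nb_rx, fwd_cb fwd)
{
	uint16_t start = 0, i;

	for (i = 1; i <= nb_rx; i++) {
		if (i == nb_rx || burst[i]->port != burst[start]->port) {
			fwd(burst[start]->port, &burst[start],
			    (uint16_t)(i - start));
			start = i;
		}
	}
}
```

If traffic mostly arrives from a few ports, the runs are long and the per-packet port lookup amortizes well over each sub-burst.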
>
> > In theory, if implemented in generic and extendable way - that
> > might be a useful add-on to testpmd fwd functionality.
> > But current implementation looks very case specific.
> > And as I don't think it is a common case, I don't see much point to
> > pollute
> > basic fwd cases with it.
>
> Shared Rxq is an ethdev feature that impacts how packets get handled.
> It's natural to update forwarding engines to avoid breakage.
Why is that?
All it affects is the way you RX the packets.
So why do *all* FWD engines have to be updated?
Say, what specifically are you going to test with macswap vs macfwd mode for this feature?
I still think one specific FWD engine is enough to cover the majority of test cases.
> The new macro is introduced to minimize performance impact, I'm also
> wondering whether there is a more elegant solution :)
I think Jerin suggested a good alternative with eventdev.
As another approach, one might consider adding an RX callback that
will return packets only for one particular port (while keeping packets
for other ports cached internally).
As a 'wild' thought - change testpmd fwd logic to allow multiple TX queues
per fwd_stream and add a function to do TX switching logic.
But that's probably quite a big change that needs a lot of work.
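A rough standalone sketch of that callback idea; everything here (names, sizes, the stash layout) is an invented assumption, and a real implementation would additionally have to manage mbuf ownership/freeing and per-lcore state, which is glossed over:

```c
#include <stdint.h>
#include <string.h>

#define MAX_PORTS 8
#define BURST_MAX 32

struct mbuf { uint16_t port; };   /* stand-in for rte_mbuf */

/* Per-port stash for packets that belong to other member ports. */
static struct mbuf *stash[MAX_PORTS][BURST_MAX];
static uint16_t stash_n[MAX_PORTS];

/*
 * Body of a hypothetical Rx callback on a shared queue: return only
 * packets whose mbuf->port matches 'want'; cache the rest so a later
 * poll on their port can pick them up. Assumes nb_rx <= BURST_MAX.
 */
static uint16_t
rx_filter_port(uint16_t want, struct mbuf **pkts, uint16_t nb_rx)
{
	struct mbuf *out[2 * BURST_MAX];
	uint16_t kept = 0, i;

	/* First drain packets stashed for this port by earlier polls. */
	for (i = 0; i < stash_n[want]; i++)
		out[kept++] = stash[want][i];
	stash_n[want] = 0;

	/* Keep this port's packets; stash the others. */
	for (i = 0; i < nb_rx; i++) {
		uint16_t p = pkts[i]->port;

		if (p == want)
			out[kept++] = pkts[i];
		else if (p < MAX_PORTS && stash_n[p] < BURST_MAX)
			stash[p][stash_n[p]++] = pkts[i];
		/* else: a real callback must free or account the mbuf */
	}
	memcpy(pkts, out, kept * sizeof(*pkts));
	return kept;
}
```

The obvious costs are the stash bookkeeping per burst and the question of what to do when a port's stash fills before that port is polled.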
> Current performance penalty
> is one "if unlikely" per burst.
It is not only about performance impact.
It is about keeping test-pmd code simple and maintainable.
>
> Think in the reverse direction: if we don't update fwd-engines here, they
> all malfunction when shared rxq is enabled and users can't verify driver
> features. Is that what you expect?
I expect developers not to rewrite the whole test-pmd fwd code for each new ethdev feature.
Especially for a feature that is not widely used.
>
> >
> > BTW, as a side note, the code below looks bogus to me:
> > +void
> > +forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
> > + struct rte_mbuf **pkts_burst, packet_fwd_cb
> > fwd)
> > +{
> > + uint16_t i, nb_fs_rx = 1, port;
> > +
> > + /* Locate real source fs according to mbuf->port. */
> > + for (i = 0; i < nb_rx; ++i) {
> > + rte_prefetch0(pkts_burst[i + 1]);
> >
> > you access pkt_burst[] beyond array boundaries,
> > also you ask cpu to prefetch some unknown and possibly invalid
> > address.
> >
> > > Based on the patch, I believe the
> > > impact has been minimized.
> > >
> > > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > > representors? If so, can't we
> > > > > > > > > > > have a port representor specific solution, reducing
> > > > > > > > > > > scope can reduce the
> > > > > > > > > > > complexity it brings?
> > > > > > > > > > >
> > > > > > > > > > > > > If this offload is only useful for representor
> > > > > > > > > > > > > case, Can we make this offload specific to
> > > > > > > > > > > > > representor the case by changing its
> > > > > name and
> > > > > > > > > > > > > scope.
> > > > > > > > > > > >
> > > > > > > > > > > > It works for both PF and representors in same
> > > > > > > > > > > > switch
> > > > > > > > > > > > domain, for application like OVS, few changes to
> > > > > > > > > > > > apply.
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > 11 +++++++++++
> > > > > > > > > > > > > > > > doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > 1 +
> > > > > > > > > > > > > > > > doc/guides/prog_guide/switch_representat
> > > > > > > > > > > > > > > > ion.
> > > > > > > > > > > > > > > > rst | 10 ++++++++++
> > > > > > > > > > > > > > > > lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > 1 +
> > > > > > > > > > > > > > > > lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > 7 +++++++
> > > > > > > > > > > > > > > > 5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > b/doc/guides/nics/features.rst index
> > > > > > > > > > > > > > > > a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner
> > > > > > > > > > > > > > > > packet L4
> > > > > > > > > > > > > > > > checksum.
> > > > > > > > > > > > > > > > ``tx_offload_capa,tx_queue_offload_cap
> > > > > > > > > > > > > > > > a:DE
> > > > > > > > > > > > > > > > V_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > +Supports shared Rx queue for ports in
> > > > > > > > > > > > > > > > same
> > > > > > > > > > > > > > > > switch domain.
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > +* **[uses]
> > > > > > > > > > > > > > > > rte_eth_rxconf,rte_eth_rxmode**:
> > > > > > > > > > > > > > > > ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ`
> > > > > > > > > > > > > > > > `.
> > > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Packet type parsing
> > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand
> > > > > > > > > > > > > > > > =
> > > > > > > > > > > > > > > > Queue start/stop =
> > > > > > > > > > > > > > > > Runtime Rx queue setup =
> > > > > > > > > > > > > > > > Runtime Tx queue setup =
> > > > > > > > > > > > > > > > +Shared Rx queue =
> > > > > > > > > > > > > > > > Burst mode info =
> > > > > > > > > > > > > > > > Power mgmt address monitor =
> > > > > > > > > > > > > > > > MTU update =
> > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representa
> > > > > > > > > > > > > > > > tion
> > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representa
> > > > > > > > > > > > > > > > tion
> > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representa
> > > > > > > > > > > > > > > > tion
> > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representa
> > > > > > > > > > > > > > > > tion
> > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a
> > > > > > > > > > > > > > > > software
> > > > > > > > > > > > > > > > "patch panel" front-end for applications.
> > > > > > > > > > > > > > > > .. [1] `Ethernet switch device driver
> > > > > > > > > > > > > > > > model
> > > > > > > > > > > > > > > > (switchdev)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > <https://www.kernel.org/doc/Documentation
> > > > > > > > > > > > > > > > /net
> > > > > > > > > > > > > > > > working/switchdev.txt
> > > > > > > > > > > > > > > > > `_
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > +- Memory usage of representors is huge
> > > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > number of representor
> > > > > > > > > > > > > > > > +grows,
> > > > > > > > > > > > > > > > + because PMD always allocate mbuf for
> > > > > > > > > > > > > > > > each
> > > > > > > > > > > > > > > > descriptor of Rx queue.
> > > > > > > > > > > > > > > > + Polling the large number of ports
> > > > > > > > > > > > > > > > brings
> > > > > > > > > > > > > > > > more CPU load, cache
> > > > > > > > > > > > > > > > +miss and
> > > > > > > > > > > > > > > > + latency. Shared Rx queue can be used
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > share Rx queue between
> > > > > > > > > > > > > > > > +PF and
> > > > > > > > > > > > > > > > + representors in same switch domain.
> > > > > > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > > > + is present in Rx offloading capability
> > > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > device info. Setting
> > > > > > > > > > > > > > > > +the
> > > > > > > > > > > > > > > > + offloading flag in device Rx mode or
> > > > > > > > > > > > > > > > Rx
> > > > > > > > > > > > > > > > queue configuration to
> > > > > > > > > > > > > > > > +enable
> > > > > > > > > > > > > > > > + shared Rx queue. Polling any member
> > > > > > > > > > > > > > > > port
> > > > > > > > > > > > > > > > of shared Rx queue can
> > > > > > > > > > > > > > > > +return
> > > > > > > > > > > > > > > > + packets of all ports in group, port ID
> > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > saved in ``mbuf.port``.
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > Basic SR-IOV
> > > > > > > > > > > > > > > > ------------
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > > > RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_
> > > > > > > > > > > > > > > > CKSU
> > > > > > > > > > > > > > > > M),
> > > > > > > > > > > > > > > > RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER
> > > > > > > > > > > > > > > > _SPL
> > > > > > > > > > > > > > > > IT),
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > > > };
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct
> > > > > > > > > > > > > > > > rte_eth_rxconf {
> > > > > > > > > > > > > > > > uint8_t rx_drop_en; /**< Drop
> > > > > > > > > > > > > > > > packets
> > > > > > > > > > > > > > > > if no descriptors are available. */
> > > > > > > > > > > > > > > > uint8_t rx_deferred_start; /**<
> > > > > > > > > > > > > > > > Do
> > > > > > > > > > > > > > > > not start queue with rte_eth_dev_start().
> > > > > > > > > > > > > > > > */
> > > > > > > > > > > > > > > > uint16_t rx_nseg; /**< Number of
> > > > > > > > > > > > > > > > descriptions in rx_seg array.
> > > > > > > > > > > > > > > > */
> > > > > > > > > > > > > > > > + uint32_t shared_group; /**<
> > > > > > > > > > > > > > > > Shared
> > > > > > > > > > > > > > > > port group index in
> > > > > > > > > > > > > > > > + switch domain. */
> > > > > > > > > > > > > > > > /**
> > > > > > > > > > > > > > > > * Per-queue Rx offloads to be
> > > > > > > > > > > > > > > > set
> > > > > > > > > > > > > > > > using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > > > * Only offloads set on
> > > > > > > > > > > > > > > > rx_queue_offload_capa or
> > > > > > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@
> > > > > > > > > > > > > > > > struct
> > > > > > > > > > > > > > > > rte_eth_conf {
> > > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM
> > > > > > > > > > > > > > > > 0x00040000
> > > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_RSS_HASH
> > > > > > > > > > > > > > > > 0x00080000
> > > > > > > > > > > > > > > > #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT
> > > > > > > > > > > > > > > > 0x00100000
> > > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > > + * Rx queue is shared among ports in
> > > > > > > > > > > > > > > > same
> > > > > > > > > > > > > > > > switch domain to save
> > > > > > > > > > > > > > > > +memory,
> > > > > > > > > > > > > > > > + * avoid polling each port. Any port in
> > > > > > > > > > > > > > > > group can be used to receive packets.
> > > > > > > > > > > > > > > > + * Real source port number saved in
> > > > > > > > > > > > > > > > mbuf-
> > > > > > > > > > > > > > > > > port field.
> > > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > > > > > > > > > > > > > > > 0x00200000
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_CHECKSUM
> > > > > > > > > > > > > > > > (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > > > DEV_RX_O
> > > > > > > > > > > > > > > > FFLO
> > > > > > > > > > > > > > > > AD_UDP_CKSUM | \
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > > > > >
On Wed, Sep 29, 2021 at 09:52:20AM +0000, Ananyev, Konstantin wrote:
>
>
> > -----Original Message-----
> > From: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Sent: Wednesday, September 29, 2021 10:13 AM
<snip>
> > > + /* Locate real source fs according to mbuf->port. */
> > > + for (i = 0; i < nb_rx; ++i) {
> > > + rte_prefetch0(pkts_burst[i + 1]);
> > >
> > > you access pkt_burst[] beyond array boundaries,
> > > also you ask cpu to prefetch some unknown and possibly invalid
> > > address.
> >
> > Sorry I forgot this topic. It's too late to prefetch the current packet,
> > so prefetching the next one is better. Prefetching an invalid address at
> > the end of a loop doesn't hurt; it's common in DPDK.
>
> First of all it is usually never 'OK' to access array beyond its bounds.
> Second prefetching invalid address *does* hurt performance badly on many CPUs
> (TLB misses, consumed memory bandwidth etc.).
> As a reference: https://lwn.net/Articles/444346/
> If some existing DPDK code really does that - then I believe it is an issue and has to be addressed.
> More important - it is really bad attitude to submit bogus code to DPDK community
> and pretend that it is 'OK'.
>
The main point we need to take from all this is that when
prefetching you need to measure perf impact of it.
In terms of the specific case of prefetching one past the end of the array,
I would take the view that this is harmless in almost all cases. Unlike any
prefetch of "NULL" as in the referenced mail, reading one past the end (or
other small number of elements past the end) is far less likely to cause a
TLB miss - and it's basically just reproducing behaviour we would expect
off a HW prefetcher (though those may explicitly never cross page
boundaries). However, if you feel it's just cleaner to put in an
additional condition to remove the prefetch for the end case, that's ok
also - again so long as it doesn't affect performance. [Since prefetch is a
hint, I'm not sure if compilers or CPUs may be legally allowed to skip the
branch and blindly prefetch in all cases?]
/Bruce
> -----Original Message-----
> From: Richardson, Bruce <bruce.richardson@intel.com>
> Sent: Wednesday, September 29, 2021 12:08 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: Xueming(Steven) Li <xuemingl@nvidia.com>; jerinjacobk@gmail.com; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> andrew.rybchenko@oktetlabs.ru; dev@dpdk.org; Yigit, Ferruh <ferruh.yigit@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
>
> On Wed, Sep 29, 2021 at 09:52:20AM +0000, Ananyev, Konstantin wrote:
> >
> >
> > > -----Original Message-----
> > > From: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Sent: Wednesday, September 29, 2021 10:13 AM
> <snip>
> > > > + /* Locate real source fs according to mbuf->port. */
> > > > + for (i = 0; i < nb_rx; ++i) {
> > > > + rte_prefetch0(pkts_burst[i + 1]);
> > > >
> > > > you access pkt_burst[] beyond array boundaries,
> > > > also you ask cpu to prefetch some unknown and possibly invalid
> > > > address.
> > >
> > > Sorry I forgot this topic. It's too late to prefetch the current packet,
> > > so prefetching the next one is better. Prefetching an invalid address at
> > > the end of a loop doesn't hurt; it's common in DPDK.
> >
> > First of all it is usually never 'OK' to access array beyond its bounds.
> > Second prefetching invalid address *does* hurt performance badly on many CPUs
> > (TLB misses, consumed memory bandwidth etc.).
> > As a reference: https://lwn.net/Articles/444346/
> > If some existing DPDK code really does that - then I believe it is an issue and has to be addressed.
> > More important - it is really bad attitude to submit bogus code to DPDK community
> > and pretend that it is 'OK'.
> >
>
> The main point we need to take from all this is that when
> prefetching you need to measure perf impact of it.
>
> In terms of the specific case of prefetching one past the end of the array,
> I would take the view that this is harmless in almost all cases. Unlike any
> prefetch of "NULL" as in the referenced mail, reading one past the end (or
> other small number of elements past the end) is far less likely to cause a
> TLB miss - and it's basically just reproducing behaviour we would expect
> off a HW prefetcher (though those may explicitly never cross page
> boundaries). However, if you feel it's just cleaner to put in an
> additional condition to remove the prefetch for the end case, that's ok
> also - again so long as it doesn't affect performance. [Since prefetch is a
> hint, I'm not sure if compilers or CPUs may be legally allowed to skip the
> branch and blindly prefetch in all cases?]
Please look at the code.
It doesn't prefetch the next element beyond array boundaries.
It first reads an address from the element that is beyond array boundaries (which is a bug by itself).
Then it prefetches that bogus address.
We simply don't know whether this address is valid or where it points to.
In other words, it doesn't do:
rte_prefetch0(&pkts_burst[i + 1]);
It does:
rte_prefetch0(pkts_burst[i + 1]);
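To make the difference concrete, a standalone sketch (rte_prefetch0 stubbed with the GCC builtin; struct and function names invented) contrasting the out-of-bounds form with a guarded one:

```c
#include <stdint.h>

struct mbuf { uint16_t port; };   /* stand-in for rte_mbuf */

/* Stand-in for rte_prefetch0(): a hint only, no functional effect. */
#define prefetch0(p) __builtin_prefetch((p), 0, 3)

static uint32_t
process_burst(struct mbuf **pkts_burst, uint16_t nb_rx)
{
	uint32_t port_sum = 0;
	uint16_t i;

	for (i = 0; i < nb_rx; i++) {
		/*
		 * Buggy form from the patch:
		 *     prefetch0(pkts_burst[i + 1]);
		 * On the last iteration this *reads* pkts_burst[nb_rx],
		 * outside the array, then prefetches whatever bogus
		 * pointer that read returned.
		 *
		 * Guarded form: only dereference the element when it
		 * actually exists.
		 */
		if (i + 1 < nb_rx)
			prefetch0(pkts_burst[i + 1]);

		port_sum += pkts_burst[i]->port;  /* "handle" the packet */
	}
	return port_sum;
}
```

Whether the extra branch costs anything measurable is a perf question to benchmark; the out-of-bounds read, though, is undefined behavior regardless of prefetch semantics.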
On Wed, 2021-09-29 at 09:52 +0000, Ananyev, Konstantin wrote:
>
> > -----Original Message-----
> > From: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Sent: Wednesday, September 29, 2021 10:13 AM
> > To: jerinjacobk@gmail.com; Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > Cc: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; andrew.rybchenko@oktetlabs.ru; dev@dpdk.org; Yigit, Ferruh
> > <ferruh.yigit@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> >
> > On Wed, 2021-09-29 at 00:26 +0000, Ananyev, Konstantin wrote:
> > > > > > > > > > > > > > > > > In current DPDK framework, each RX queue
> > > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > > > switch domain, the memory consumption
> > > > > > > > > > > > > > > > > became
> > > > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > > > important, polling all ports leads to
> > > > > > > > > > > > > > > > > high
> > > > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > This patch introduces shared RX queue.
> > > > > > > > > > > > > > > > > Ports
> > > > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > > > configuration in a switch domain could
> > > > > > > > > > > > > > > > > share
> > > > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > > > Polling any queue using same shared RX
> > > > > > > > > > > > > > > > > queue
> > > > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > > > all member ports. Source port is
> > > > > > > > > > > > > > > > > identified
> > > > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Port queue number in a shared group
> > > > > > > > > > > > > > > > > should be
> > > > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Share RX queue is supposed to be polled
> > > > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Is this offload specific to the
> > > > > > > > > > > > > > > > representor? If
> > > > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Yes, PF and representor in switch domain
> > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Not quite sure that I understood your
> > > > > > > > > > > > > > > question.
> > > > > > > > > > > > > > > The control path of is
> > > > > > > > > > > > > > > almost same as before, PF and representor
> > > > > > > > > > > > > > > port
> > > > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > > > implementation.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > > > > ethdev receive queues land into
> > > > > > the same
> > > > > > > > > > > > > > receive queue, In that case, how the flow order
> > > > > > > > > > > > > > is
> > > > > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I guess the question is testpmd forward stream?
> > > > > > > > > > > > > The
> > > > > > > > > > > > > forwarding logic has to be changed slightly in
> > > > > > > > > > > > > case
> > > > > > > > > > > > > of shared rxq.
> > > > > > > > > > > > > basically for each packet in rx_burst result,
> > > > > > > > > > > > > lookup
> > > > > > > > > > > > > source stream according to mbuf->port, forwarding
> > > > > > > > > > > > > to
> > > > > > > > > > > > > target fs.
> > > > > > > > > > > > > Packets from same source port could be grouped as
> > > > > > > > > > > > > a
> > > > > > > > > > > > > small burst to process, this will accelerate the
> > > > > > > > > > > > > performance if traffic
> > > > > > come from
> > > > > > > > > > > > > limited ports. I'll introduce some common api to
> > > > > > > > > > > > > do
> > > > > > > > > > > > > shared rxq forwarding, call it with a packet-
> > > > > > > > > > > > > handling
> > > > > > > > > > > > > callback, so it suits
> > > > > > > > > > > > > all forwarding engines. Will send patches soon.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > > > > (share queue), right? Does
> > > > > > > > > > > > this means only single core will poll only, what
> > > > > > > > > > > > will
> > > > > > > > > > > > happen if there are
> > > > > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > > > > >
> > > > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > > > application, I am not sure about
> > > > > > > > > > > > the solution, can't this work in a transparent way
> > > > > > > > > > > > to
> > > > > > > > > > > > the application?
> > > > > > > > > > >
> > > > > > > > > > > Discussed with Jerin, new API introduced in v3 2/8
> > > > > > > > > > > that
> > > > > > > > > > > aggregate ports
> > > > > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > > > > polling on the
> > > > > > > > > > > aggregated port instead of all member ports.
> > > > > > > > > >
> > > > > > > > > > The v3 still has testpmd changes in fastpath. Right?
> > > > > > > > > > IMO,
> > > > > > > > > > For this
> > > > > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > > > > probably as
> > > > > > > > > > separate fwd_engine to show how to use this feature.
> > > > > > > > >
> > > > > > > > > Good point to discuss :) There are two strategies for
> > > > > > > > > polling
> > > > > > > > > a shared
> > > > > > > > > Rxq:
> > > > > > > > > 1. polling each member port
> > > > > > > > > All forwarding engines can be reused to work as
> > > > > > > > > before.
> > > > > > > > > My testpmd patches are efforts towards this direction.
> > > > > > > > > Does your PMD support this?
> > > > > > > >
> > > > > > > > Not unfortunately. More than that, every application needs
> > > > > > > > to
> > > > > > > > change
> > > > > > > > to support this model.
> > > > > > >
> > > > > > > Both strategies need user application to resolve port ID from
> > > > > > > mbuf and
> > > > > > > process accordingly.
> > > > > > > This one doesn't demand aggregated port, no polling schedule
> > > > > > > change.
> > > > > >
> > > > > > I was thinking, mbuf will be updated from driver/aggregator
> > > > > > port as
> > > > > > when it
> > > > > > comes to application.
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > 2. polling aggregated port
> > > > > > > > > Besides forwarding engine, need more work to demo
> > > > > > > > > it.
> > > > > > > > > This is an optional API, not supported by my PMD yet.
> > > > > > > >
> > > > > > > > We are thinking of implementing this PMD when it comes to
> > > > > > > > it,
> > > > > > > > ie.
> > > > > > > > without application change in fastpath
> > > > > > > > logic.
> > > > > > >
> > > > > > > Fastpath has to resolve the port ID anyway and forward
> > > > > > > according to logic. Forwarding engines need to adapt to support
> > > > > > > shared Rxq.
> > > > > > > Fortunately, in testpmd, this can be done with an abstract
> > > > > > > API.
> > > > > > >
> > > > > > > Let's defer part 2 until some PMD really support it and
> > > > > > > tested,
> > > > > > > how do
> > > > > > > you think?
> > > > > >
> > > > > > We are not planning to use this feature so either way it is OK
> > > > > > to
> > > > > > me.
> > > > > > I leave to ethdev maintainers decide between 1 vs 2.
> > > > > >
> > > > > > I do have a strong opinion not changing the testpmd basic
> > > > > > forward
> > > > > > engines
> > > > > > for this feature. I would like to keep it simple as fastpath
> > > > > > optimized and would
> > > > > > like to add a separate Forwarding engine as means to verify
> > > > > > this
> > > > > > feature.
> > > > >
> > > > > +1 to that.
> > > > > I don't think it's a 'common' feature.
> > > > > So separate FWD mode seems like a best choice to me.
> > > >
> > > > -1 :)
> > > > There was some internal requirement from test team, they need to
> > > > verify
> > > > all features like packet content, rss, vlan, checksum, rte_flow...
> > > > to
> > > > be working based on shared rx queue.
> > >
> > > Then I suppose you'll need to write really comprehensive fwd-engine
> > > to satisfy your test team :)
> > > Speaking seriously, I still don't understand why you need all
> > > available fwd-engines to verify this feature.
> > > From what I understand, the main purpose of your changes to test-pmd is
> > > to allow forwarding packets through different fwd_streams (TX through
> > > different HW queues).
> > > In theory, if implemented in generic and extendable way - that
> > > might be a useful add-on to testpmd fwd functionality.
> > > But current implementation looks very case specific.
> > > And as I don't think it is a common case, I don't see much point to
> > > pollute
> > > basic fwd cases with it.
> > >
> > > BTW, as a side note, the code below looks bogus to me:
> > > +void
> > > +forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
> > > + struct rte_mbuf **pkts_burst, packet_fwd_cb
> > > fwd)
> > > +{
> > > + uint16_t i, nb_fs_rx = 1, port;
> > > +
> > > + /* Locate real source fs according to mbuf->port. */
> > > + for (i = 0; i < nb_rx; ++i) {
> > > + rte_prefetch0(pkts_burst[i + 1]);
> > >
> > > you access pkt_burst[] beyond array boundaries,
> > > also you ask cpu to prefetch some unknown and possibly invalid
> > > address.
> >
> > Sorry I forgot this topic. It's too late to prefetch current packet, so
> > prefetch next is better. Prefetch an invalid address at end of a loop
> > doesn't hurt, it's common in DPDK.
>
> First of all it is usually never 'OK' to access array beyond its bounds.
> Second prefetching invalid address *does* hurt performance badly on many CPUs
> (TLB misses, consumed memory bandwidth etc.).
> As a reference: https://lwn.net/Articles/444346/
> If some existing DPDK code really does that - then I believe it is an issue and has to be addressed.
> More important - it is really bad attitude to submit bogus code to DPDK community
> and pretend that it is 'OK'.
Thanks for the link!
From the instruction spec, "The PREFETCHh instruction is merely a hint and
does not affect program behavior."
There are 3 choices here:
1: no prefetch. A D-cache miss will happen on each packet; the time cost
depends on where the data sits (close or far) and the burst size.
2: prefetch with a loop-end check to avoid a random address. Pro: free of
TLB misses per burst. Con: an "if" instruction per packet. Cost depends
on burst size.
3: brute-force prefetch. The cost is a possible TLB miss, but no additional
instructions per packet. Not sure how random the last address could be
in testpmd and how many TLB misses could happen.
Based on my experience of performance optimization, IIRC, option 3 has
the best performance. But for this case, the result depends on how many
sub-bursts are inside and how each sub-burst gets processed; the callback
may or may not flush the prefetched data completely. So it's hard to draw
a conclusion; my point is that the code in a PMD driver should have a
reason.
On the other hand, the latency and throughput savings of this feature on
multiple ports are huge, so I prefer to downplay this prefetch discussion
if you agree.
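For illustration, option 2 above (prefetch with a loop-end check) can be sketched in plain C. The `pkt` struct, `process_burst()` and `__builtin_prefetch()` are stand-ins for testpmd's mbufs and `rte_prefetch0()`; this is a sketch of the idea, not the actual testpmd code.

```c
#include <stdint.h>

struct pkt {
	uint16_t port;
	uint32_t len;
};

/* Option 2: prefetch the next packet only while it is still inside
 * the burst, so no slot beyond burst[nb_rx - 1] is ever read. */
static uint64_t
process_burst(struct pkt *burst[], uint16_t nb_rx)
{
	uint64_t total_len = 0;

	for (uint16_t i = 0; i < nb_rx; i++) {
		if (i + 1 < nb_rx)                 /* loop-end check */
			__builtin_prefetch(burst[i + 1]);
		total_len += burst[i]->len;        /* work on current packet */
	}
	return total_len;
}
```

The extra branch is the per-packet cost option 2 trades for never handing the CPU an address read from outside the array.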
>
> >
> > >
> > > > Based on the patch, I believe the
> > > > impact has been minimized.
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > > > representors? If so can't we have a port representor
> > > > > > > > > > > > specific solution, reducing scope can reduce the
> > > > > > > > > > > > complexity it brings?
> > > > > > > > > > > >
> > > > > > > > > > > > > > If this offload is only useful for representor
> > > > > > > > > > > > > > case, can we make this offload specific to the
> > > > > > > > > > > > > > representor case by changing its name and scope.
> > > > > > > > > > > > >
> > > > > > > > > > > > > It works for both PF and representors in same
> > > > > > > > > > > > > switch domain; for application like OVS, few
> > > > > > > > > > > > > changes to apply.
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > > > > > > > > > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > > > > > > > > > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > > > > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > > index a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > > > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > +* **[uses] rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > > > > > > > > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > > > > > > > > > > > >     <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > +- Memory usage of representors is huge when number of representor grows,
> > > > > > > > > > > > > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > > > > > > > > > > > > +  Polling the large number of ports brings more CPU load, cache miss and
> > > > > > > > > > > > > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between PF and
> > > > > > > > > > > > > > > > > > > +  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > > > > > > +  is present in Rx offloading capability of device info. Setting the
> > > > > > > > > > > > > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to enable
> > > > > > > > > > > > > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can return
> > > > > > > > > > > > > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > > > > > > > > >  ------------
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > > > > > >  	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > > > > > > > > > > > > >  	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > > > > > >  	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > > > > > > > > > > > > +	RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > > > > > >  };
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > > > > > > > >  	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > > > > > > > > > > >  	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > > > > > > > >  	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > > > > > > > > > > > > > > > > > > +	uint32_t shared_group; /**< Shared port group index in switch domain. */
> > > > > > > > > > > > > > > > > > >  	/**
> > > > > > > > > > > > > > > > > > >  	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > > > > > >  	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > > > > > > > > > > > > > > > > > > @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH         0x00080000
> > > > > > > > > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > > > > > + * Rx queue is shared among ports in same switch domain to save memory,
> > > > > > > > > > > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > > > > > >  				 DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > >
>
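Reading the quoted diff together with the commit message, an application targeting this RFC might enable the offload roughly as below. This is a hypothetical sketch against the API proposed in this very patch (subject to change before merge); `port_id`, `queue_id`, `nb_desc`, `ret` and `mbuf_pool` are assumed to exist in the caller.

```c
struct rte_eth_dev_info dev_info;
struct rte_eth_rxconf rxconf;

rte_eth_dev_info_get(port_id, &dev_info);
rxconf = dev_info.default_rxconf;

/* Enable shared Rx queue only if the device advertises it. */
if (dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ) {
	rxconf.offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
	rxconf.shared_group = 1;  /* ports with the same group share queues */
}

/* Queue index must be 1:1 mapped across all member ports, and the
 * shared queue is expected to be polled from a single thread. */
ret = rte_eth_rx_queue_setup(port_id, queue_id, nb_desc,
			     rte_eth_dev_socket_id(port_id),
			     &rxconf, mbuf_pool);
```

After setup, polling any member port's queue can return packets from every port in the group, with the true source port in `mbuf->port`.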
On Wed, Sep 29, 2021 at 12:46:51PM +0100, Ananyev, Konstantin wrote:
>
>
> > -----Original Message-----
> > From: Richardson, Bruce <bruce.richardson@intel.com>
> > Sent: Wednesday, September 29, 2021 12:08 PM
> > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > Cc: Xueming(Steven) Li <xuemingl@nvidia.com>; jerinjacobk@gmail.com; NBU-Contact-Thomas Monjalon <thomas@monjalon.net>;
> > andrew.rybchenko@oktetlabs.ru; dev@dpdk.org; Yigit, Ferruh <ferruh.yigit@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> >
> > On Wed, Sep 29, 2021 at 09:52:20AM +0000, Ananyev, Konstantin wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Sent: Wednesday, September 29, 2021 10:13 AM
> > <snip>
> > > > > + /* Locate real source fs according to mbuf->port. */
> > > > > + for (i = 0; i < nb_rx; ++i) {
> > > > > + rte_prefetch0(pkts_burst[i + 1]);
> > > > >
> > > > > you access pkt_burst[] beyond array boundaries,
> > > > > also you ask cpu to prefetch some unknown and possibly invalid
> > > > > address.
> > > >
> > > > Sorry I forgot this topic. It's too late to prefetch current packet, so
> > > > perfetch next is better. Prefetch an invalid address at end of a loop
> > > > doesn't hurt, it's common in DPDK.
> > >
> > > First of all it is usually never 'OK' to access array beyond its bounds.
> > > Second prefetching invalid address *does* hurt performance badly on many CPUs
> > > (TLB misses, consumed memory bandwidth etc.).
> > > As a reference: https://lwn.net/Articles/444346/
> > > If some existing DPDK code really does that - then I believe it is an issue and has to be addressed.
> > > More important - it is really bad attitude to submit bogus code to DPDK community
> > > and pretend that it is 'OK'.
> > >
> >
> > The main point we need to take from all this is that when
> > prefetching you need to measure perf impact of it.
> >
> > In terms of the specific case of prefetching one past the end of the array,
> > I would take the view that this is harmless in almost all cases. Unlike any
> > prefetch of "NULL" as in the referenced mail, reading one past the end (or
> > other small number of elements past the end) is far less likely to cause a
> > TLB miss - and it's basically just reproducing behaviour we would expect
> > off a HW prefetcher (though those my explicitly never cross page
> > boundaries). However, if you feel it's just cleaner to put in an
> > additional condition to remove the prefetch for the end case, that's ok
> > also - again so long as it doesn't affect performance. [Since prefetch is a
> > hint, I'm not sure if compilers or CPUs may be legally allowed to skip the
> > branch and blindly prefetch in all cases?]
>
> Please look at the code.
> It doesn't prefetch next element beyond array boundaries.
> It first reads address from the element that is beyond array boundaries (which is a bug by itself).
> Then it prefetches that bogus address.
> We simply don't know is this address is valid and where it points to.
>
> In other words, it doesn't do:
> rte_prefetch0(&pkts_burst[i + 1]);
>
> It does:
> rte_prefetch0(pkts_burst[i + 1]);
>
Apologies, yes, you are right, and that is a bug.
/Bruce
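The distinction Konstantin draws can be shown with a self-contained C snippet (names are illustrative, not testpmd code): `&pkts[i + 1]` is the address of the next array slot, which even at the last iteration still lies within (or one past) the `pkts[]` object, while `pkts[i + 1]` reads the pointer value stored in that slot, which the Rx burst never wrote for the last iteration.

```c
#include <stdint.h>
#include <stddef.h>

#define BURST_SZ 4

/* Returns 1 if every "&pkts[i + 1]" formed over nb_rx valid entries
 * stays inside the pkts[] object, i.e. is a well-defined address. */
static int
slot_addresses_in_bounds(uint8_t *pkts[BURST_SZ], uint16_t nb_rx)
{
	for (uint16_t i = 0; i < nb_rx; i++) {
		uint8_t **slot = &pkts[i + 1];  /* address of the slot itself */
		if (slot < pkts || slot > pkts + BURST_SZ)
			return 0;
		/* By contrast, pkts[i + 1] at i == nb_rx - 1 *reads* a slot
		 * the Rx burst never filled, handing the prefetcher an
		 * indeterminate value. */
	}
	return 1;
}
```

This is why `rte_prefetch0(&pkts_burst[i + 1])` would merely be useless at the boundary, while `rte_prefetch0(pkts_burst[i + 1])` is a genuine out-of-bounds read.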
> -----Original Message-----
> From: Xueming(Steven) Li <xuemingl@nvidia.com>
> Sent: Wednesday, September 29, 2021 1:09 PM
> To: jerinjacobk@gmail.com; Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; andrew.rybchenko@oktetlabs.ru; dev@dpdk.org; Yigit, Ferruh
> <ferruh.yigit@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> On Wed, 2021-09-29 at 09:52 +0000, Ananyev, Konstantin wrote:
> >
> > > -----Original Message-----
> > > From: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > Sent: Wednesday, September 29, 2021 10:13 AM
> > > To: jerinjacobk@gmail.com; Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > > Cc: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; andrew.rybchenko@oktetlabs.ru; dev@dpdk.org; Yigit, Ferruh
> > > <ferruh.yigit@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > >
> > > On Wed, 2021-09-29 at 00:26 +0000, Ananyev, Konstantin wrote:
> > > > > > > > > > > > > > > > > > In current DPDK framework, each RX queue is
> > > > > > > > > > > > > > > > > > pre-loaded with mbufs for incoming packets.
> > > > > > > > > > > > > > > > > > When number of representors scale out in a
> > > > > > > > > > > > > > > > > > switch domain, the memory consumption became
> > > > > > > > > > > > > > > > > > significant. Most important, polling all ports
> > > > > > > > > > > > > > > > > > leads to high cache miss, high latency and low
> > > > > > > > > > > > > > > > > > throughput.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > This patch introduces shared RX queue. Ports
> > > > > > > > > > > > > > > > > > with same configuration in a switch domain
> > > > > > > > > > > > > > > > > > could share RX queue set by specifying sharing
> > > > > > > > > > > > > > > > > > group. Polling any queue using same shared RX
> > > > > > > > > > > > > > > > > > queue receives packets from all member ports.
> > > > > > > > > > > > > > > > > > Source port is identified by mbuf->port.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Port queue number in a shared group should be
> > > > > > > > > > > > > > > > > > identical. Queue index is 1:1 mapped in shared
> > > > > > > > > > > > > > > > > > group.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Share RX queue is supposed to be polled on
> > > > > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Is this offload specific to the
> > > > > > > > > > > > > > > > > representor? If
> > > > > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Yes, PF and representor in switch domain
> > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Not quite sure that I understood your
> > > > > > > > > > > > > > > > question.
> > > > > > > > > > > > > > > > The control path of is
> > > > > > > > > > > > > > > > almost same as before, PF and representor
> > > > > > > > > > > > > > > > port
> > > > > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > > > > implementation.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > > > > > ethdev receive queues land into the same receive
> > > > > > > > > > > > > > > queue, in that case, how the flow order is
> > > > > > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I guess the question is testpmd forward stream?
> > > > > > > > > > > > > > The
> > > > > > > > > > > > > > forwarding logic has to be changed slightly in
> > > > > > > > > > > > > > case
> > > > > > > > > > > > > > of shared rxq.
> > > > > > > > > > > > > > basically for each packet in rx_burst result,
> > > > > > > > > > > > > > lookup
> > > > > > > > > > > > > > source stream according to mbuf->port, forwarding
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > target fs.
> > > > > > > > > > > > > > Packets from same source port could be grouped as
> > > > > > > > > > > > > > a small burst to process; this will accelerate the
> > > > > > > > > > > > > > performance if traffic comes from limited ports.
> > > > > > > > > > > > > > I'll introduce some common API to do shared rxq
> > > > > > > > > > > > > > forwarding, call it with a packet-handling callback,
> > > > > > > > > > > > > > so it suits all forwarding engines. Will send
> > > > > > > > > > > > > > patches soon.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > All ports will put the packets into the same queue
> > > > > > > > > > > > > (share queue), right? Does this mean only a single
> > > > > > > > > > > > > core will poll, and what will happen if there are
> > > > > > > > > > > > > multiple cores polling, won't it cause a problem?
> > > > > > > > > > > > >
> > > > > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > > > > application, I am not sure about the solution; can't
> > > > > > > > > > > > > this work in a transparent way to the application?
> > > > > > > > > > > > >
> > > > > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > > > > application, I am not sure about
> > > > > > > > > > > > > the solution, can't this work in a transparent way
> > > > > > > > > > > > > to
> > > > > > > > > > > > > the application?
> > > > > > > > > > > >
> > > > > > > > > > > > Discussed with Jerin, new API introduced in v3 2/8
> > > > > > > > > > > > that
> > > > > > > > > > > > aggregate ports
> > > > > > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > > > > > polling on the
> > > > > > > > > > > > aggregated port instead of all member ports.
> > > > > > > > > > >
> > > > > > > > > > > The v3 still has testpmd changes in fastpath. Right?
> > > > > > > > > > > IMO,
> > > > > > > > > > > For this
> > > > > > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > > > > > probably as
> > > > > > > > > > > separate fwd_engine to show how to use this feature.
> > > > > > > > > >
> > > > > > > > > > Good point to discuss :) There are two strategies for
> > > > > > > > > > polling a shared Rxq:
> > > > > > > > > > 1. polling each member port
> > > > > > > > > > All forwarding engines can be reused to work as
> > > > > > > > > > before.
> > > > > > > > > > My testpmd patches are efforts towards this direction.
> > > > > > > > > > Does your PMD support this?
> > > > > > > > >
> > > > > > > > > Not unfortunately. More than that, every application needs
> > > > > > > > > to
> > > > > > > > > change
> > > > > > > > > to support this model.
> > > > > > > >
> > > > > > > > Both strategies need user application to resolve port ID from
> > > > > > > > mbuf and
> > > > > > > > process accordingly.
> > > > > > > > This one doesn't demand aggregated port, no polling schedule
> > > > > > > > change.
> > > > > > >
> > > > > > > I was thinking, mbuf will be updated from driver/aggregator
> > > > > > > port as
> > > > > > > when it
> > > > > > > comes to application.
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > 2. polling aggregated port
> > > > > > > > > > Besides forwarding engine, need more work to demo it.
> > > > > > > > > > This is an optional API, not supported by my PMD yet.
> > > > > > > > >
> > > > > > > > > We are thinking of implementing this PMD when it comes to
> > > > > > > > > it,
> > > > > > > > > ie.
> > > > > > > > > without application change in fastpath
> > > > > > > > > logic.
> > > > > > > >
> > > > > > > > Fastpath has to resolve port ID anyway and forward according
> > > > > > > > to logic. Forwarding engine needs to adapt to support shared
> > > > > > > > Rxq.
> > > > > > > > Fortunately, in testpmd, this can be done with an abstract
> > > > > > > > API.
> > > > > > > >
> > > > > > > > Let's defer part 2 until some PMD really supports it and is
> > > > > > > > tested, what do you think?
> > > > > > >
> > > > > > > We are not planning to use this feature so either way it is OK
> > > > > > > to
> > > > > > > me.
> > > > > > > I leave to ethdev maintainers decide between 1 vs 2.
> > > > > > >
> > > > > > > I do have a strong opinion not changing the testpmd basic
> > > > > > > forward
> > > > > > > engines
> > > > > > > for this feature. I would like to keep it simple as fastpath
> > > > > > > optimized and would
> > > > > > > like to add a separate Forwarding engine as means to verify
> > > > > > > this
> > > > > > > feature.
> > > > > >
> > > > > > +1 to that.
> > > > > > I don't think it a 'common' feature.
> > > > > > So separate FWD mode seems like a best choice to me.
> > > > >
> > > > > -1 :)
> > > > > There was some internal requirement from test team, they need to
> > > > > verify
> > > > > all features like packet content, rss, vlan, checksum, rte_flow...
> > > > > to
> > > > > be working based on shared rx queue.
> > > >
> > > > Then I suppose you'll need to write really comprehensive fwd-engine
> > > > to satisfy your test team :)
> > > > Speaking seriously, I still don't understand why you need all
> > > > available fwd-engines to verify this feature.
> > > > From what I understand, main purpose of your changes to test-pmd:
> > > > allow to fwd packet though different fwd_stream (TX through different
> > > > HW queue).
> > > > In theory, if implemented in generic and extendable way - that
> > > > might be a useful add-on to testpmd fwd functionality.
> > > > But current implementation looks very case specific.
> > > > And as I don't think it is a common case, I don't see much point to
> > > > pollute
> > > > basic fwd cases with it.
> > > >
> > > > BTW, as a side note, the code below looks bogus to me:
> > > > +void
> > > > +forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
> > > > + struct rte_mbuf **pkts_burst, packet_fwd_cb
> > > > fwd)
> > > > +{
> > > > + uint16_t i, nb_fs_rx = 1, port;
> > > > +
> > > > + /* Locate real source fs according to mbuf->port. */
> > > > + for (i = 0; i < nb_rx; ++i) {
> > > > + rte_prefetch0(pkts_burst[i + 1]);
> > > >
> > > > you access pkt_burst[] beyond array boundaries,
> > > > also you ask cpu to prefetch some unknown and possibly invalid
> > > > address.
> > >
> > > Sorry I forgot this topic. It's too late to prefetch current packet, so
> > > prefetch next is better. Prefetch an invalid address at end of a loop
> > > doesn't hurt, it's common in DPDK.
> >
> > First of all it is usually never 'OK' to access array beyond its bounds.
> > Second prefetching invalid address *does* hurt performance badly on many CPUs
> > (TLB misses, consumed memory bandwidth etc.).
> > As a reference: https://lwn.net/Articles/444346/
> > If some existing DPDK code really does that - then I believe it is an issue and has to be addressed.
> > More important - it is really bad attitude to submit bogus code to DPDK community
> > and pretend that it is 'OK'.
>
> Thanks for the link!
> From instruction spec, "The PREFETCHh instruction is merely a hint and
> does not affect program behavior."
> There are 3 choices here:
> 1: no prefetch. D$ miss will happen on each packet, time cost depends
> on where data sits(close or far) and burst size.
> 2: prefetch with loop end check to avoid random address. Pro is free of
> TLB miss per burst, Cons is "if" instruction per packet. Cost depends
> on burst size.
> 3: brute force prefetch, cost is TLB miss, but no additional
> instructions per packet. Not sure how random the last address could be
> in testpmd and how many TLB miss could happen.
There are plenty of standard techniques to avoid that issue while keeping
prefetch() in place.
Probably the easiest one:
for (i = 0; i < nb_rx - 1; i++) {
	prefetch(pkt[i + 1]);
	/* do your stuff with pkt[i] here */
}
/* do your stuff with pkt[nb_rx - 1] here */
> Based on my experience of performance optimization, IIRC, option 3 has
> the best performance. But for this case, result depends on how many
> sub-bursts inside and how each sub-burst gets processed, maybe callback will
> flush prefetch data completely or not. So it's hard to get a
> conclusion, what I said is that the code in PMD driver should have a
> reason.
>
> On the other hand, the latency and throughput saving of this feature on
> multiple ports is huge, I prefer to downplay this prefetch discussion
> if you agree.
>
Honestly, I don't know how else to explain to you that there is a bug in that piece of code.
From my perspective it is a trivial bug, with a trivial fix.
But you simply keep ignoring the arguments.
Till it gets fixed and other comments addressed - my vote is NACK for this series.
I don't think we need bogus code in testpmd.
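As background for the testpmd discussion above, the idea of cutting a mixed burst from a shared Rx queue into per-source-port sub-bursts and handing each to a forwarding callback can be sketched standalone. The `pkt` struct, `forward_by_port()` and `count_cb()` are illustrative stand-ins, not the actual testpmd patch:

```c
#include <stdint.h>
#include <stddef.h>

struct pkt {
	uint16_t port;   /* stand-in for mbuf->port */
};

typedef void (*burst_cb)(uint16_t port, struct pkt **pkts, uint16_t n,
			 void *ctx);

/* Walk a mixed burst returned by a shared Rx queue, cut it into
 * sub-bursts of consecutive packets from the same source port, and
 * hand each sub-burst to the forwarding callback. */
static void
forward_by_port(struct pkt **burst, uint16_t nb_rx, burst_cb cb, void *ctx)
{
	uint16_t start = 0;

	while (start < nb_rx) {
		uint16_t port = burst[start]->port;
		uint16_t end = start + 1;

		while (end < nb_rx && burst[end]->port == port)
			end++;                 /* extend the sub-burst */
		cb(port, &burst[start], (uint16_t)(end - start), ctx);
		start = end;
	}
}

struct fwd_stats {
	int calls;       /* number of sub-bursts dispatched */
	int pkts;        /* total packets forwarded */
};

/* Example callback: just count sub-bursts and packets. */
static void
count_cb(uint16_t port, struct pkt **pkts, uint16_t n, void *ctx)
{
	struct fwd_stats *st = ctx;
	(void)port; (void)pkts;
	st->calls++;
	st->pkts += n;
}
```

In a real engine the callback would be the per-stream forwarding logic, so traffic concentrated on a few ports naturally forms larger sub-bursts.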
On Wed, 2021-09-29 at 10:20 +0000, Ananyev, Konstantin wrote:
> > > > > > > > > > > > > > > > > In current DPDK framework, each RX queue is
> > > > > > > > > > > > > > > > > pre-loaded with mbufs for incoming packets.
> > > > > > > > > > > > > > > > > When number of representors scale out in a
> > > > > > > > > > > > > > > > > switch domain, the memory consumption became
> > > > > > > > > > > > > > > > > significant. Most important, polling all ports
> > > > > > > > > > > > > > > > > leads to high cache miss, high latency and low
> > > > > > > > > > > > > > > > > throughput.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > This patch introduces shared RX queue. Ports
> > > > > > > > > > > > > > > > > with same configuration in a switch domain
> > > > > > > > > > > > > > > > > could share RX queue set by specifying sharing
> > > > > > > > > > > > > > > > > group. Polling any queue using same shared RX
> > > > > > > > > > > > > > > > > queue receives packets from all member ports.
> > > > > > > > > > > > > > > > > Source port is identified by mbuf->port.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Port queue number in a shared group should be
> > > > > > > > > > > > > > > > > identical. Queue index is 1:1 mapped in shared
> > > > > > > > > > > > > > > > > group.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Share RX queue is supposed to be polled on
> > > > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Is this offload specific to the
> > > > > > > > > > > > > > > > representor? If
> > > > > > > > > > > > > > > > so can this name be changed
> > > > > > > > > > > > > > > > specifically to
> > > > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Yes, PF and representor in switch domain
> > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > If it is for a generic case, how the
> > > > > > > > > > > > > > > > flow
> > > > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Not quite sure that I understood your
> > > > > > > > > > > > > > > question.
> > > > > > > > > > > > > > > The control path of is
> > > > > > > > > > > > > > > almost same as before, PF and representor
> > > > > > > > > > > > > > > port
> > > > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > > > implementation.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload,
> > > > > > > > > > > > > > multiple
> > > > > > > > > > > > > > ethdev receive queues land into
> > > > > > the same
> > > > > > > > > > > > > > receive queue, In that case, how the flow
> > > > > > > > > > > > > > order
> > > > > > > > > > > > > > is
> > > > > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I guess the question is testpmd forward
> > > > > > > > > > > > > stream?
> > > > > > > > > > > > > The
> > > > > > > > > > > > > forwarding logic has to be changed slightly
> > > > > > > > > > > > > in
> > > > > > > > > > > > > case
> > > > > > > > > > > > > of shared rxq.
> > > > > > > > > > > > > basically for each packet in rx_burst result,
> > > > > > > > > > > > > lookup
> > > > > > > > > > > > > source stream according to mbuf->port,
> > > > > > > > > > > > > forwarding
> > > > > > > > > > > > > to
> > > > > > > > > > > > > target fs.
> > > > > > > > > > > > > Packets from same source port could be
> > > > > > > > > > > > > grouped as
> > > > > > > > > > > > > a
> > > > > > > > > > > > > small burst to process, this will accelerates
> > > > > > > > > > > > > the
> > > > > > > > > > > > > performance if traffic
> > > > > > come from
> > > > > > > > > > > > > limited ports. I'll introduce some common api
> > > > > > > > > > > > > to
> > > > > > > > > > > > > do
> > > > > > > > > > > > > shard rxq forwarding, call it with packets
> > > > > > > > > > > > > handling
> > > > > > > > > > > > > callback, so it suites for
> > > > > > > > > > > > > all forwarding engine. Will sent patches
> > > > > > > > > > > > > soon.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > All ports will put the packets in to the same
> > > > > > > > > > > > queue
> > > > > > > > > > > > (share queue), right? Does
> > > > > > > > > > > > this means only single core will poll only,
> > > > > > > > > > > > what
> > > > > > > > > > > > will
> > > > > > > > > > > > happen if there are
> > > > > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > > > > >
> > > > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > > > application, I am not sure about
> > > > > > > > > > > > the solution, can't this work in a transparent
> > > > > > > > > > > > way
> > > > > > > > > > > > to
> > > > > > > > > > > > the application?
> > > > > > > > > > >
> > > > > > > > > > > Discussed with Jerin, new API introduced in v3
> > > > > > > > > > > 2/8
> > > > > > > > > > > that
> > > > > > > > > > > aggregate ports
> > > > > > > > > > > in same group into one new port. Users could
> > > > > > > > > > > schedule
> > > > > > > > > > > polling on the
> > > > > > > > > > > aggregated port instead of all member ports.
> > > > > > > > > >
> > > > > > > > > > The v3 still has testpmd changes in fastpath.
> > > > > > > > > > Right?
> > > > > > > > > > IMO,
> > > > > > > > > > For this
> > > > > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > > > > application. Instead, testpmd can use aggregated
> > > > > > > > > > ports
> > > > > > > > > > probably as
> > > > > > > > > > separate fwd_engine to show how to use this
> > > > > > > > > > feature.
> > > > > > > > >
> > > > > > > > > Good point to discuss :) There are two strategies to
> > > > > > > > > polling
> > > > > > > > > a shared
> > > > > > > > > Rxq:
> > > > > > > > > 1. polling each member port
> > > > > > > > > All forwarding engines can be reused to work as
> > > > > > > > > before.
> > > > > > > > > My testpmd patches are efforts towards this
> > > > > > > > > direction.
> > > > > > > > > Does your PMD support this?
> > > > > > > >
> > > > > > > > Not unfortunately. More than that, every application
> > > > > > > > needs
> > > > > > > > to
> > > > > > > > change
> > > > > > > > to support this model.
> > > > > > >
> > > > > > > Both strategies need user application to resolve port ID
> > > > > > > from
> > > > > > > mbuf and
> > > > > > > process accordingly.
> > > > > > > This one doesn't demand aggregated port, no polling
> > > > > > > schedule
> > > > > > > change.
> > > > > >
> > > > > > I was thinking, mbuf will be updated from driver/aggregator
> > > > > > port as
> > > > > > when it
> > > > > > comes to application.
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > 2. polling aggregated port
> > > > > > > > > Besides forwarding engine, need more work to to
> > > > > > > > > demo
> > > > > > > > > it.
> > > > > > > > > This is an optional API, not supported by my PMD
> > > > > > > > > yet.
> > > > > > > >
> > > > > > > > We are thinking of implementing this PMD when it comes
> > > > > > > > to
> > > > > > > > it,
> > > > > > > > ie.
> > > > > > > > without application change in fastpath
> > > > > > > > logic.
> > > > > > >
> > > > > > > Fastpath have to resolve port ID anyway and forwarding
> > > > > > > according
> > > > > > > to
> > > > > > > logic. Forwarding engine need to adapt to support shard
> > > > > > > Rxq.
> > > > > > > Fortunately, in testpmd, this can be done with an
> > > > > > > abstract
> > > > > > > API.
> > > > > > >
> > > > > > > Let's defer part 2 until some PMD really support it and
> > > > > > > tested,
> > > > > > > how do
> > > > > > > you think?
> > > > > >
> > > > > > We are not planning to use this feature so either way it is
> > > > > > OK
> > > > > > to
> > > > > > me.
> > > > > > I leave to ethdev maintainers decide between 1 vs 2.
> > > > > >
> > > > > > I do have a strong opinion not changing the testpmd basic
> > > > > > forward
> > > > > > engines
> > > > > > for this feature.I would like to keep it simple as fastpath
> > > > > > optimized and would
> > > > > > like to add a separate Forwarding engine as means to verify
> > > > > > this
> > > > > > feature.
> > > > >
> > > > > +1 to that.
> > > > > I don't think it a 'common' feature.
> > > > > So separate FWD mode seems like a best choice to me.
> > > >
> > > > -1 :)
> > > > There was some internal requirement from test team, they need
> > > > to
> > > > verify
> > > > all features like packet content, rss, vlan, checksum,
> > > > rte_flow...
> > > > to
> > > > be working based on shared rx queue.
> > >
> > > Then I suppose you'll need to write really comprehensive fwd-
> > > engine
> > > to satisfy your test team :)
> > > Speaking seriously, I still don't understand why do you need all
> > > available fwd-engines to verify this feature.
> >
> > The shared Rxq is a low-level feature; we need to make sure the driver's
> > higher-level features keep working properly. fwd-engines like csum check the
> > input packet and enable L3/L4 checksum and tunnel offloads accordingly,
> > while other engines do their own feature verification. All test automation
> > could be reused with these engines supported seamlessly.
> >
> > > From what I understand, main purpose of your changes to test-pmd:
> > > allow to fwd packet though different fwd_stream (TX through
> > > different
> > > HW queue).
> >
> > Yes, each mbuf in a burst can come from a different port; testpmd's current
> > fwd-engines rely heavily on the source forwarding stream, which is why the
> > patch divides the burst result mbufs into sub-bursts and uses the original
> > fwd-engine callback to handle them. How they are handled is not changed.
> >
> > > In theory, if implemented in generic and extendable way - that
> > > might be a useful add-on to tespmd fwd functionality.
> > > But current implementation looks very case specific.
> > > And as I don't think it is a common case, I don't see much point
> > > to
> > > pollute
> > > basic fwd cases with it.
> >
> > Shared Rxq is an ethdev feature that impacts how packets get handled.
> > It's natural to update the forwarding engines to avoid breaking them.
>
> Why is that?
> All it affects the way you RX the packets.
> So why *all* FWD engines have to be updated?
People will ask why some FWD engines can't work.
> Let say what specific you are going to test with macswap vs macfwd
> mode for that feature?
If people want to test the NIC with a real switch, or make sure the L2 layer
does not get corrupted.
> I still think one specific FWD engine is enough to cover majority of
> test cases.
Yes, rxonly should be sufficient to verify the fundamentals, but to
verify csum, timing, etc., others are needed. Some back-to-back test systems
need io forwarding, and real switch deployments need macswap...
>
> > The new macro is introduced to minimize performance impact, I'm
> > also
> > wondering is there an elegant solution :)
>
> I think Jerin suggested a good alternative with eventdev.
> As another approach - might be consider to add an RX callback that
> will return packets only for one particular port (while keeping
> packets
> for other ports cached internally).
This and the aggregate port API could be options in the ethdev layer later.
Neither can be the fundamental mechanism, due to performance loss and
potential cache misses.
> As a 'wild' thought - change testpmd fwd logic to allow multiple TX
> queues
> per fwd_stream and add a function to do TX switching logic.
> But that's probably quite a big change that needs a lot of work.
>
> > Current performance penalty
> > is one "if unlikely" per burst.
>
> It is not only about performance impact.
> It is about keeping test-pmd code simple and maintainable.
> >
> > Think in the reverse direction: if we don't update the fwd-engines here,
> > all of them malfunction when shared rxq is enabled and users can't verify
> > driver features. Is that what you expect?
>
> I expect developers not to rewrite whole test-pmd fwd code for each
> new ethdev feature.
This just abstracts duplicated code out of the fwd-engines, an improvement
that keeps the test-pmd code simple and maintainable.
> Specially for the feature that is not widely used.
Based on the huge memory savings and the performance and latency gains, it
will be popular with users.
But test-pmd is not critical to this feature; I'm OK to drop the
fwd-engine support if you agree.
>
> >
> > >
> > > BTW, as a side note, the code below looks bogus to me:
> > > +void
> > > +forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
> > > + struct rte_mbuf **pkts_burst, packet_fwd_cb
> > > fwd)
> > > +{
> > > + uint16_t i, nb_fs_rx = 1, port;
> > > +
> > > + /* Locate real source fs according to mbuf->port. */
> > > + for (i = 0; i < nb_rx; ++i) {
> > > + rte_prefetch0(pkts_burst[i + 1]);
> > >
> > > you access pkt_burst[] beyond array boundaries,
> > > also you ask cpu to prefetch some unknown and possibly invalid
> > > address.
> > >
> > > > Based on the patch, I believe the
> > > > impact has been minimized.
> > > >
> > > > > > > > > > > > Overall, is this for optimizing memory for the port representors? If so,
> > > > > > > > > > > > can't we have a port-representor-specific solution? Reducing scope can
> > > > > > > > > > > > reduce the complexity it brings.
> > > > > > > > > > > >
> > > > > > > > > > > > > > If this offload is only useful for the representor case, can we make
> > > > > > > > > > > > > > this offload specific to the representor case by changing its name
> > > > > > > > > > > > > > and scope.
> > > > > > > > > > > > >
> > > > > > > > > > > > > It works for both PF and representors in the same switch domain; for an
> > > > > > > > > > > > > application like OVS, few changes to apply.
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > > > > > > > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > > > > > > > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > index a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > +* **[uses] rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > > > > > > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > > > > > > > > > >     <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > +- Memory usage of representors is huge when number of representor grows,
> > > > > > > > > > > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > > > > > > > > > > +  Polling the large number of ports brings more CPU load, cache miss and
> > > > > > > > > > > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between PF and
> > > > > > > > > > > > > > > > > +  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > > > > +  is present in Rx offloading capability of device info. Setting the
> > > > > > > > > > > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to enable
> > > > > > > > > > > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can return
> > > > > > > > > > > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > > > > > > >  ------------
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > > > >  	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > > > > > > > > > > >  	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > > > >  	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > > > > > > > > > > +	RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > > > >  };
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > > > > > >  	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > > > > > > > > >  	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > > > > > >  	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > > > > > > > > > > > > > > > > +	uint32_t shared_group; /**< Shared port group index in switch domain. */
> > > > > > > > > > > > > > > > >  	/**
> > > > > > > > > > > > > > > > >  	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > > > >  	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH         0x00080000
> > > > > > > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > > > + * Rx queue is shared among ports in same switch domain to save memory,
> > > > > > > > > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > > > >  				 DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > 2.25.1
On Wed, 2021-09-29 at 12:35 +0000, Ananyev, Konstantin wrote:
>
> > -----Original Message-----
> > From: Xueming(Steven) Li <xuemingl@nvidia.com>
> > Sent: Wednesday, September 29, 2021 1:09 PM
> > To: jerinjacobk@gmail.com; Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > Cc: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; andrew.rybchenko@oktetlabs.ru; dev@dpdk.org; Yigit, Ferruh
> > <ferruh.yigit@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> >
> > On Wed, 2021-09-29 at 09:52 +0000, Ananyev, Konstantin wrote:
> > >
> > > > -----Original Message-----
> > > > From: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > Sent: Wednesday, September 29, 2021 10:13 AM
> > > > To: jerinjacobk@gmail.com; Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > > > Cc: NBU-Contact-Thomas Monjalon <thomas@monjalon.net>; andrew.rybchenko@oktetlabs.ru; dev@dpdk.org; Yigit, Ferruh
> > > > <ferruh.yigit@intel.com>
> > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > > >
> > > > On Wed, 2021-09-29 at 00:26 +0000, Ananyev, Konstantin wrote:
> > > > > > > > > > > > > > > > > > > In current DPDK framework, each RX queue
> > > > > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > > > > > switch domain, the memory consumption
> > > > > > > > > > > > > > > > > > > became
> > > > > > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > > > > > important, polling all ports leads to
> > > > > > > > > > > > > > > > > > > high
> > > > > > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > This patch introduces shared RX queue.
> > > > > > > > > > > > > > > > > > > Ports
> > > > > > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > > > > > configuration in a switch domain could
> > > > > > > > > > > > > > > > > > > share
> > > > > > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > > > > > Polling any queue using same shared RX
> > > > > > > > > > > > > > > > > > > queue
> > > > > > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > > > > > all member ports. Source port is
> > > > > > > > > > > > > > > > > > > identified
> > > > > > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Port queue number in a shared group
> > > > > > > > > > > > > > > > > > > should be
> > > > > > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Share RX queue is supposed to be polled
> > > > > > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Is this offload specific to the
> > > > > > > > > > > > > > > > > > representor? If
> > > > > > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Yes, PF and representor in a switch domain could take advantage.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > If it is for a generic case, how will the flow ordering be maintained?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Not quite sure that I understood your question. The control path is almost the same as before: PF and representor ports are still needed, rte flows are not impacted. Queues are still needed for each member port; descriptors (mbufs) will be supplied from the shared Rx queue in my PMD implementation.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > My question was: if we create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload and multiple ethdev receive queues land into the same receive queue, how is the flow order maintained for the respective receive queues?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I guess the question is about the testpmd forward stream? The forwarding logic has to be changed slightly in the case of shared rxq: basically, for each packet in the rx_burst result, look up the source stream according to mbuf->port and forward to the target fs.
> > > > > > > > > > > > > > > Packets from the same source port could be grouped as a small burst to process; this accelerates performance if traffic comes from a limited number of ports. I'll introduce some common API to do shared rxq forwarding, called with a packet-handling callback, so that it suits all forwarding engines. Will send patches soon.
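The sub-burst grouping described above can be sketched as follows. `struct mbuf` is a minimal stand-in for `rte_mbuf` (only the `port` field), and `split_by_port` is a hypothetical helper written for this sketch, not the API from the patches:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal stand-in for struct rte_mbuf: only the source-port field. */
struct mbuf { uint16_t port; };

/* One contiguous run of packets from the same source port. */
struct subburst { uint16_t port; uint16_t start; uint16_t len; };

/* Split an rx_burst result into per-source-port sub-bursts so each run
 * can be handed to the forwarding stream of its real source port. */
static uint16_t
split_by_port(struct mbuf **pkts, uint16_t nb_rx, struct subburst *out)
{
	uint16_t i, n = 0;

	for (i = 0; i < nb_rx; i++) {
		if (n == 0 || pkts[i]->port != out[n - 1].port) {
			/* New run starts here. */
			out[n].port = pkts[i]->port;
			out[n].start = i;
			out[n].len = 0;
			n++;
		}
		out[n - 1].len++;
	}
	return n;
}
```

Each sub-burst would then be passed to the original fwd-engine callback of the stream that owns that source port, so the per-packet handling itself stays unchanged.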
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > All ports will put the packets into the same queue (shared queue), right? Does this mean only a single core will poll it? What will happen if multiple cores are polling, won't it cause a problem?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > And if this requires specific changes in the application, I am not sure about the solution; can't this work in a way that is transparent to the application?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Discussed with Jerin: a new API is introduced in v3 2/8 that aggregates ports in the same group into one new port. Users could schedule polling on the aggregated port instead of on all member ports.
> > > > > > > > > > > >
> > > > > > > > > > > > The v3 still has testpmd changes in the fastpath, right? IMO, for this feature we should not change the fastpath of the testpmd application. Instead, testpmd can use aggregated ports, probably as a separate fwd_engine, to show how to use this feature.
> > > > > > > > > > >
> > > > > > > > > > > Good point to discuss :) There are two strategies for polling a shared Rxq:
> > > > > > > > > > > 1. polling each member port
> > > > > > > > > > >    All forwarding engines can be reused to work as before. My testpmd patches are efforts towards this direction. Does your PMD support this?
> > > > > > > > > >
> > > > > > > > > > Unfortunately not. More than that, every application needs to change to support this model.
> > > > > > > > >
> > > > > > > > > Both strategies need the user application to resolve the port ID from the mbuf and process accordingly. This one doesn't demand an aggregated port and requires no polling schedule change.
> > > > > > > >
> > > > > > > > I was thinking the mbuf will be updated by the driver/aggregator port as it comes to the application.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 2. polling aggregated port
> > > > > > > > > > >    Besides the forwarding engine, more work is needed to demo it. This is an optional API, not supported by my PMD yet.
> > > > > > > > > >
> > > > > > > > > > We are thinking of implementing this in the PMD when it comes to it, i.e. without application changes in the fastpath logic.
> > > > > > > > >
> > > > > > > > > The fastpath has to resolve the port ID anyway and forward according to its logic. Forwarding engines need to adapt to support shared Rxq. Fortunately, in testpmd, this can be done with an abstract API.
> > > > > > > > >
> > > > > > > > > Let's defer part 2 until some PMD really supports it and it is tested, what do you think?
> > > > > > > >
> > > > > > > > We are not planning to use this feature, so either way it is OK to me. I leave it to the ethdev maintainers to decide between 1 vs 2.
> > > > > > > >
> > > > > > > > I do have a strong opinion on not changing the testpmd basic forward engines for this feature. I would like to keep them simple and fastpath-optimized, and would like to add a separate forwarding engine as a means to verify this feature.
> > > > > > >
> > > > > > > +1 to that.
> > > > > > > I don't think it is a 'common' feature. So a separate FWD mode seems like the best choice to me.
> > > > > >
> > > > > > -1 :)
> > > > > > There was some internal requirement from the test team: they need to verify that all features like packet content, rss, vlan, checksum, rte_flow... work based on the shared rx queue.
> > > > >
> > > > > Then I suppose you'll need to write a really comprehensive fwd-engine to satisfy your test team :)
> > > > > Speaking seriously, I still don't understand why you need all available fwd-engines to verify this feature.
> > > > > From what I understand, the main purpose of your changes to test-pmd is to allow forwarding packets through a different fwd_stream (TX through a different HW queue).
> > > > > In theory, if implemented in a generic and extendable way, that might be a useful add-on to testpmd fwd functionality. But the current implementation looks very case-specific. And as I don't think it is a common case, I don't see much point in polluting basic fwd cases with it.
> > > > >
> > > > > BTW, as a side note, the code below looks bogus to me:
> > > > > +void
> > > > > +forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
> > > > > +		struct rte_mbuf **pkts_burst, packet_fwd_cb fwd)
> > > > > +{
> > > > > +	uint16_t i, nb_fs_rx = 1, port;
> > > > > +
> > > > > +	/* Locate real source fs according to mbuf->port. */
> > > > > +	for (i = 0; i < nb_rx; ++i) {
> > > > > +		rte_prefetch0(pkts_burst[i + 1]);
> > > > >
> > > > > you access pkts_burst[] beyond array boundaries; you also ask the CPU to prefetch an unknown and possibly invalid address.
> > > >
> > > > Sorry, I forgot this topic. It's too late to prefetch the current packet, so prefetching the next is better. Prefetching an invalid address at the end of a loop doesn't hurt; it's common in DPDK.
> > >
> > > First of all, it is usually never 'OK' to access an array beyond its bounds.
> > > Second, prefetching an invalid address *does* hurt performance badly on many CPUs (TLB misses, consumed memory bandwidth etc.).
> > > As a reference: https://lwn.net/Articles/444346/
> > > If some existing DPDK code really does that, then I believe it is an issue that has to be addressed.
> > > More importantly, it is a really bad attitude to submit bogus code to the DPDK community and pretend that it is 'OK'.
> >
> > Thanks for the link!
> > From the instruction spec, "The PREFETCHh instruction is merely a hint and does not affect program behavior."
> > There are 3 choices here:
> > 1: no prefetch. A D$ miss will happen on each packet; the time cost depends on where the data sits (close or far) and on burst size.
> > 2: prefetch with a loop-end check to avoid a random address. The pro is no TLB miss per burst; the con is an "if" instruction per packet. The cost depends on burst size.
> > 3: brute-force prefetch; the cost is a TLB miss, but no additional instructions per packet. Not sure how random the last address could be in testpmd and how many TLB misses could happen.
>
> There are plenty of standard techniques to avoid that issue while keeping prefetch() in place.
> Probably the easiest one:
>
> for (i = 0; i < nb_rx - 1; i++) {
> 	prefetch(pkt[i + 1]);
> 	/* do your stuff with pkt[i] here */
> }
>
> /* do your stuff with pkt[nb_rx - 1] here */
Thanks, will update in next version.
>
> > Based on my experience of performance optimization, IIRC, option 3 has the best performance. But for this case, the result depends on how many sub-bursts are inside and how each sub-burst gets processed; the callback may or may not flush the prefetched data completely. So it's hard to reach a conclusion; what I said is that the code in a PMD driver should have a reason.
> >
> > On the other hand, the latency and throughput savings of this feature on multiple ports are huge; I prefer to downplay this prefetch discussion, if you agree.
>
> Honestly, I don't know how else to explain to you that there is a bug in that piece of code.
> From my perspective it is a trivial bug, with a trivial fix.
> But you simply keep ignoring the arguments.
> Until it gets fixed and the other comments are addressed, my vote is NACK for this series.
> I don't think we need bogus code in testpmd.
>
>
> On Wed, 2021-09-29 at 10:20 +0000, Ananyev, Konstantin wrote:
> > > > [snip]
> > >
> > > The shared Rxq is a low-level feature; we need to make sure the driver's higher-level features work properly with it. Fwd-engines like csum check the input packet and enable L3/L4 checksum and tunnel offloads accordingly; other engines do their own feature verification. All test automation could be reused with these engines supported seamlessly.
> > >
> > > > From what I understand, the main purpose of your changes to test-pmd is to allow forwarding packets through a different fwd_stream (TX through a different HW queue).
> > >
> > > Yes, each mbuf in the burst comes from a different port, and testpmd's current fwd-engines rely heavily on the source forwarding stream. That's why the patch divides the burst result mbufs into sub-bursts and uses the original fwd-engine callback to handle them; how they are handled is not changed.
> > >
> > > > In theory, if implemented in a generic and extendable way, that might be a useful add-on to testpmd fwd functionality. But the current implementation looks very case-specific. And as I don't think it is a common case, I don't see much point in polluting basic fwd cases with it.
> > >
> > > Shared Rxq is an ethdev feature that impacts how packets get handled. It's natural to update forwarding engines to avoid breaking them.
> >
> > Why is that? All it affects is the way you RX the packets. So why do *all* FWD engines have to be updated?
>
> People will ask why some FWD engines can't work.
It can be documented: which fwd engines are supposed to work properly with this feature and which are not.
BTW, as I understand it, as long as RX queues are properly assigned to lcores, any fwd engine will continue to work. It is just that for engines that are not aware of this feature, packets can be sent out via the wrong TX queue.
> > Let's say, what specifically are you going to test with macswap vs macfwd mode for this feature?
>
> If people want to test the NIC with a real switch, or make sure the L2 layer doesn't get corrupted.
I understand that; what I am saying is that you probably don't need both to test this feature.
> > I still think one specific FWD engine is enough to cover the majority of test cases.
>
> Yes, rxonly should be sufficient to verify the fundamentals, but to verify csum, timing, we need others. Some back-to-back test systems need io forwarding; real switch deployments need macswap...
Ok, but nothing stops you from picking, for your purposes, the FWD engine with the most comprehensive functionality: macswap, 5tswap, even 5tswap + csum update...
>
>
> >
> > > The new macro is introduced to minimize the performance impact; I'm also wondering whether there is a more elegant solution :)
> >
> > I think Jerin suggested a good alternative with eventdev.
> > As another approach, one might consider adding an RX callback that returns packets only for one particular port (while keeping packets for other ports cached internally).
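The RX-callback idea could be sketched roughly like this. The types and the caching scheme below are stand-ins written for the sketch (a real version would hook into the ethdev Rx callback mechanism and need proper overflow handling rather than dropping packets):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PORTS 4
#define CACHE_SZ  32

/* Minimal stand-in for struct rte_mbuf: only the source-port field. */
struct mbuf { uint16_t port; };

/* Per-port stash for packets that belonged to another port. */
static struct mbuf *cache[MAX_PORTS][CACHE_SZ];
static uint16_t cache_n[MAX_PORTS];

/* Return up to 'cap' packets belonging to 'port': cached ones first,
 * then matches from the fresh shared-queue burst. Packets for other
 * ports are cached for later calls on their own port. */
static uint16_t
rx_for_port(uint16_t port, struct mbuf **burst, uint16_t nb,
	    struct mbuf **out, uint16_t cap)
{
	uint16_t i, n = 0;

	while (n < cap && cache_n[port] > 0)
		out[n++] = cache[port][--cache_n[port]];

	for (i = 0; i < nb; i++) {
		uint16_t p = burst[i]->port;

		if (p == port && n < cap)
			out[n++] = burst[i];
		else if (p != port && cache_n[p] < CACHE_SZ)
			cache[p][cache_n[p]++] = burst[i];
		/* else: dropped; a real implementation must not drop */
	}
	return n;
}
```

This keeps the application's per-port polling loop unchanged, at the cost of an extra pass over each burst and of the internal cache.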
>
> This and the aggregate port API could be options in the ethdev layer later. They can't be the fundamental mechanism due to performance loss and potential cache misses.
>
> > As a 'wild' thought: change the testpmd fwd logic to allow multiple TX queues per fwd_stream and add a function to do the TX switching logic. But that's probably quite a big change that needs a lot of work.
> >
> > > The current performance penalty is one "if unlikely" per burst.
> >
> > It is not only about the performance impact. It is about keeping test-pmd code simple and maintainable.
> > >
> > > Think in the reverse direction: if we don't update the fwd-engines here, they all malfunction when shared rxq is enabled and users can't verify driver features. Is that what you expect?
> >
> > I expect developers not to rewrite the whole test-pmd fwd code for each new ethdev feature.
>
> Here we just abstract duplicated code out of the fwd-engines, an improvement that keeps test-pmd code simple and maintainable.
>
> > Especially for a feature that is not widely used.
>
> Based on the huge memory savings and the performance and latency gains, it will be popular with users.
>
> But test-pmd is not critical to this feature; I'm OK with dropping the fwd-engine support if you agree.
Not sure I fully understand you here...
If you are saying that you decided to demonstrate this feature via some other app/example and prefer to abandon these changes in testpmd, then yes, I don't see any problems with that.
On Thu, 2021-09-30 at 09:59 +0000, Ananyev, Konstantin wrote:
On Wed, 2021-09-29 at 10:20 +0000, Ananyev, Konstantin wrote:
In current DPDK framework, each RX
queue
is
pre-loaded with mbufs
for incoming packets. When number of
representors scale out in a
switch domain, the memory consumption
became
significant. Most
important, polling all ports leads to
high
cache miss, high
latency and low throughput.
This patch introduces shared RX
queue.
Ports
with same
configuration in a switch domain
could
share
RX queue set by specifying sharing
group.
Polling any queue using same shared
RX
queue
receives packets from
all member ports. Source port is
identified
by mbuf->port.
Port queue number in a shared group
should be
identical. Queue
index is
1:1 mapped in shared group.
Share RX queue is supposed to be
polled
on
same thread.
Multiple groups is supported by group
ID.
Is this offload specific to the
representor? If
so can this name be changed
specifically to
representor?
Yes, PF and representor in switch domain
could
take advantage.
If it is for a generic case, how the
flow
ordering will be maintained?
Not quite sure that I understood your
question.
The control path of is
almost same as before, PF and representor
port
still needed, rte flows not impacted.
Queues still needed for each member port,
descriptors(mbuf) will be
supplied from shared Rx queue in my PMD
implementation.
My question was if create a generic
RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload,
multiple
ethdev receive queues land into
the same
receive queue, In that case, how the flow
order
is
maintained for respective receive queues.
I guess the question is testpmd forward
stream?
The
forwarding logic has to be changed slightly
in
case
of shared rxq.
basically for each packet in rx_burst result,
lookup
source stream according to mbuf->port,
forwarding
to
target fs.
Packets from same source port could be
grouped as
a
small burst to process, this will accelerates
the
performance if traffic
come from
limited ports. I'll introduce some common api
to
do
shard rxq forwarding, call it with packets
handling
callback, so it suites for
all forwarding engine. Will sent patches
soon.
All ports will put the packets in to the same
queue
(share queue), right? Does
this means only single core will poll only,
what
will
happen if there are
multiple cores polling, won't it cause problem?
And if this requires specific changes in the
application, I am not sure about
the solution, can't this work in a transparent
way
to
the application?
Discussed with Jerin, new API introduced in v3
2/8
that
aggregate ports
in same group into one new port. Users could
schedule
polling on the
aggregated port instead of all member ports.
The v3 still has testpmd changes in fastpath.
Right?
IMO,
For this
feature, we should not change fastpath of testpmd
application. Instead, testpmd can use aggregated
ports
probably as
separate fwd_engine to show how to use this
feature.
Good point to discuss :) There are two strategies to
polling
a shared
Rxq:
1. polling each member port
All forwarding engines can be reused to work as
before.
My testpmd patches are efforts towards this
direction.
Does your PMD support this?
Not unfortunately. More than that, every application
needs
to
change
to support this model.
Both strategies need user application to resolve port ID
from
mbuf and
process accordingly.
This one doesn't demand aggregated port, no polling
schedule
change.
I was thinking, mbuf will be updated from driver/aggregator
port as
when it
comes to application.
2. polling aggregated port
Besides forwarding engine, need more work to to
demo
it.
This is an optional API, not supported by my PMD
yet.
We are thinking of implementing this PMD when it comes
to
it,
ie.
without application change in fastpath
logic.
Fastpath have to resolve port ID anyway and forwarding
according
to
logic. Forwarding engine need to adapt to support shard
Rxq.
Fortunately, in testpmd, this can be done with an
abstract
API.
Let's defer part 2 until some PMD really support it and
tested,
how do
you think?
We are not planning to use this feature so either way it is
OK
to
me.
I leave to ethdev maintainers decide between 1 vs 2.
I do have a strong opinion not changing the testpmd basic
forward
engines
for this feature.I would like to keep it simple as fastpath
optimized and would
like to add a separate Forwarding engine as means to verify
this
feature.
+1 to that.
I don't think it a 'common' feature.
So separate FWD mode seems like a best choice to me.
-1 :)
There was some internal requirement from test team, they need
to
verify
all features like packet content, rss, vlan, checksum,
rte_flow...
to
be working based on shared rx queue.
Then I suppose you'll need to write really comprehensive fwd-
engine
to satisfy your test team :)
Speaking seriously, I still don't understand why do you need all
available fwd-engines to verify this feature.
The shared Rxq is low level feature, need to make sure driver
higher
level features working properly. fwd-engines like csum checks input
packet and enable L3/L4 checksum and tunnel offloads accordingly,
other engines do their own feature verification. All test
automation
could be reused with these engines supported seamlessly.
From what I understand, main purpose of your changes to test-pmd:
allow to fwd packet though different fwd_stream (TX through
different
HW queue).
Yes, each mbuf in burst come from differnt port, testpmd current
fwd-
engines relies heavily on source forwarding stream, that's why the
patch devide burst result mbufs into sub-burst and use orginal fwd-
engine callback to handle. How to handle is not changed.
In theory, if implemented in generic and extendable way - that
might be a useful add-on to tespmd fwd functionality.
But current implementation looks very case specific.
And as I don't think it is a common case, I don't see much point
to
pollute
basic fwd cases with it.
Shared Rxq is a ethdev feature that impacts how packets get
handled.
It's natural to update forwarding engines to avoid broken.
Why is that?
All it affects the way you RX the packets.
So why *all* FWD engines have to be updated?
People will ask why some FWD engine can't work?
It can be documented: which fwd engine supposed to work properly
with this feature, which not.
BTW, as I understand, as long as RX queues are properly assigned to lcores,
any fwd engine will continue to work.
Just for engines that are not aware of this feature, packets can be sent
out via the wrong TX queue.
Let's say: what specifically are you going to test with macswap vs macfwd
mode for that feature?
If people want to test the NIC with a real switch, or make sure the L2
layer does not get corrupted.
I understand that; what I am saying is that you probably don't need both to test this feature.
I still think one specific FWD engine is enough to cover the majority of
test cases.
Yes, rxonly should be sufficient to verify the fundamentals, but to
verify csum, timing, etc., others are needed. Some back-to-back test
systems need io forwarding, and real switch deployments need macswap...
Ok, but nothing stops you from picking, for your purposes, the FWD engine with the most comprehensive functionality:
macswap, 5tswap, even 5tswap + csum update...
The new macro is introduced to minimize performance impact; I'm also
wondering whether there is a more elegant solution :)
I think Jerin suggested a good alternative with eventdev.
As another approach, one might consider adding an RX callback that
will return packets only for one particular port (while keeping
packets for other ports cached internally).
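The callback idea suggested here could be sketched roughly as below. This is a hypothetical illustration of the approach, not an existing DPDK API: `mbuf` again stands in for `struct rte_mbuf`, and `demux_rx` is an invented helper name. A real implementation would also have to handle cache overflow (freeing or backpressuring) rather than silently parking packets.

```c
#include <stdint.h>

#define MAX_PORTS 8
#define CACHE_SZ  64

struct mbuf { uint16_t port; };

/* Per-port software cache filled while demuxing a mixed-port burst. */
static struct mbuf *cache[MAX_PORTS][CACHE_SZ];
static uint16_t cache_len[MAX_PORTS];

/* Return up to nb_want packets belonging to 'port' in out[]; stash
 * packets of other ports in their owner's cache for a later poll.
 * Returns how many packets were produced for 'port', including ones
 * drained from its cache. */
static uint16_t
demux_rx(uint16_t port, struct mbuf **mixed, uint16_t nb_rx,
	 struct mbuf **out, uint16_t nb_want)
{
	uint16_t i, n = 0;

	/* Drain whatever was cached for this port first. */
	while (n < nb_want && cache_len[port] > 0)
		out[n++] = cache[port][--cache_len[port]];

	for (i = 0; i < nb_rx; i++) {
		uint16_t p = mixed[i]->port;

		if (p == port && n < nb_want)
			out[n++] = mixed[i];
		else if (cache_len[p] < CACHE_SZ)
			cache[p][cache_len[p]++] = mixed[i];
		/* else: a real implementation must free the overflow mbuf */
	}
	return n;
}
```

This also shows the cost concern raised below: every packet takes an extra branch and a potential extra store/load through the cache, which is why it is proposed as an optional layer rather than the fundamental path.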
This and the aggregate port API could be options in the ethdev layer
later. It can't be fundamental due to performance loss and potential
cache misses.
As a 'wild' thought: change the testpmd fwd logic to allow multiple TX
queues per fwd_stream and add a function to do the TX switching logic.
But that's probably quite a big change that needs a lot of work.
The current performance penalty
is one "if unlikely" per burst.
It is not only about performance impact.
It is about keeping test-pmd code simple and maintainable.
Think in the reverse direction: if we don't update the fwd-engines
here, they all malfunction when shared rxq is enabled and users can't
verify driver features. Is that what you expect?
I expect developers not to rewrite the whole test-pmd fwd code for each
new ethdev feature.
This just abstracts duplicated code from the fwd-engines; it's an
improvement that keeps test-pmd code simple and maintainable.
Especially for a feature that is not widely used.
Given the huge memory savings and the performance and latency gains, it
will be popular with users.
But test-pmd is not critical to this feature; I'm ok to drop the
fwd-engine support if you agree.
Not sure I fully understand you here...
If you are saying that you decided to demonstrate this feature via some other app/example
and prefer to abandon these changes in testpmd,
then yes, I don't see any problems with that.
Hi Ananyev & Jerin,
New v4 posted with a dedicated fwd engine to demonstrate this feature; do you have time to check?
Thanks,
Xueming
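For readers following the thread: based on the additions quoted further down (the `shared_group` member of `struct rte_eth_rxconf` and the `RTE_ETH_RX_OFFLOAD_SHARED_RXQ` flag), enabling a shared Rx queue would look roughly like the fragment below. This is an illustrative sketch against the proposed API, not code from the patch; `port_id`, `queue_id`, `nb_rxd`, `socket_id` and `mbuf_pool` are assumed to be set up by the application as usual.

```c
/* Illustrative fragment only: assumes DPDK with this patch applied. */
struct rte_eth_rxconf rxconf;

memset(&rxconf, 0, sizeof(rxconf));
rxconf.offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
rxconf.shared_group = 0;	/* group ID within the switch domain */

/* Every member port (PF and representors) sets up the same queue index
 * with the same configuration; queue index is 1:1 mapped in the group. */
ret = rte_eth_rx_queue_setup(port_id, queue_id, nb_rxd, socket_id,
			     &rxconf, mbuf_pool);
```

After start, polling this queue on any member port returns packets of all ports in the group, with the real source in `mbuf->port`.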
On Wed, 2021-09-29 at 13:35 +0530, Jerin Jacob wrote:
> On Wed, Sep 29, 2021 at 1:11 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> > On Tue, 2021-09-28 at 20:29 +0530, Jerin Jacob wrote:
> > > On Tue, Sep 28, 2021 at 8:10 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > >
> > > > On Tue, 2021-09-28 at 13:59 +0000, Ananyev, Konstantin wrote:
> > > > > >
> > > > > > On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li
> > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > >
> > > > > > > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > > > > > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li
> > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > > > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li
> > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas
> > > > > > > > > > > > > > Monjalon
> > > > > > <thomas@monjalon.net>;
> > > > > > > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li
> > > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > > > > <ferruh.yigit@intel.com>;
> > > > > > > > > > > > > > > > NBU-Contact-Thomas Monjalon
> > > > > > > > > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li
> > > > > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > In current DPDK framework, each RX queue is
> > > > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > > > switch domain, the memory consumption became
> > > > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > > > important, polling all ports leads to high
> > > > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > This patch introduces shared RX queue. Ports
> > > > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > > > configuration in a switch domain could share
> > > > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > > > Polling any queue using same shared RX queue
> > > > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > > > all member ports. Source port is identified
> > > > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Port queue number in a shared group should be
> > > > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Share RX queue is supposed to be polled on
> > > > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Is this offload specific to the representor? If
> > > > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Yes, PF and representor in switch domain could
> > > > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Not quite sure that I understood your question.
> > > > > > > > > > > > > > > The control path of is
> > > > > > > > > > > > > > > almost same as before, PF and representor port
> > > > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > > > implementation.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > > > > ethdev receive queues land into
> > > > > > the same
> > > > > > > > > > > > > > receive queue, In that case, how the flow order is
> > > > > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I guess the question is testpmd forward stream? The
> > > > > > > > > > > > > forwarding logic has to be changed slightly in case
> > > > > > > > > > > > > of shared rxq.
> > > > > > > > > > > > > basically for each packet in rx_burst result, lookup
> > > > > > > > > > > > > source stream according to mbuf->port, forwarding to
> > > > > > > > > > > > > target fs.
> > > > > > > > > > > > > Packets from same source port could be grouped as a
> > > > > > > > > > > > > small burst to process, this will accelerates the
> > > > > > > > > > > > > performance if traffic
> > > > > > come from
> > > > > > > > > > > > > limited ports. I'll introduce some common api to do
> > > > > > > > > > > > > shard rxq forwarding, call it with packets handling
> > > > > > > > > > > > > callback, so it suites for
> > > > > > > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > > > > (share queue), right? Does
> > > > > > > > > > > > this means only single core will poll only, what will
> > > > > > > > > > > > happen if there are
> > > > > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > > > > >
> > > > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > > > application, I am not sure about
> > > > > > > > > > > > the solution, can't this work in a transparent way to
> > > > > > > > > > > > the application?
> > > > > > > > > > >
> > > > > > > > > > > Discussed with Jerin, new API introduced in v3 2/8 that
> > > > > > > > > > > aggregate ports
> > > > > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > > > > polling on the
> > > > > > > > > > > aggregated port instead of all member ports.
> > > > > > > > > >
> > > > > > > > > > The v3 still has testpmd changes in fastpath. Right? IMO,
> > > > > > > > > > For this
> > > > > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > > > > probably as
> > > > > > > > > > separate fwd_engine to show how to use this feature.
> > > > > > > > >
> > > > > > > > > Good point to discuss :) There are two strategies to polling
> > > > > > > > > a shared
> > > > > > > > > Rxq:
> > > > > > > > > 1. polling each member port
> > > > > > > > > All forwarding engines can be reused to work as before.
> > > > > > > > > My testpmd patches are efforts towards this direction.
> > > > > > > > > Does your PMD support this?
> > > > > > > >
> > > > > > > > Not unfortunately. More than that, every application needs to
> > > > > > > > change
> > > > > > > > to support this model.
> > > > > > >
> > > > > > > Both strategies need user application to resolve port ID from
> > > > > > > mbuf and
> > > > > > > process accordingly.
> > > > > > > This one doesn't demand aggregated port, no polling schedule
> > > > > > > change.
> > > > > >
> > > > > > I was thinking, mbuf will be updated from driver/aggregator port as
> > > > > > when it
> > > > > > comes to application.
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > 2. polling aggregated port
> > > > > > > > > Besides forwarding engine, need more work to to demo it.
> > > > > > > > > This is an optional API, not supported by my PMD yet.
> > > > > > > >
> > > > > > > > We are thinking of implementing this PMD when it comes to it,
> > > > > > > > ie.
> > > > > > > > without application change in fastpath
> > > > > > > > logic.
> > > > > > >
> > > > > > > Fastpath have to resolve port ID anyway and forwarding according
> > > > > > > to
> > > > > > > logic. Forwarding engine need to adapt to support shard Rxq.
> > > > > > > Fortunately, in testpmd, this can be done with an abstract API.
> > > > > > >
> > > > > > > Let's defer part 2 until some PMD really support it and tested,
> > > > > > > how do
> > > > > > > you think?
> > > > > >
> > > > > > We are not planning to use this feature so either way it is OK to
> > > > > > me.
> > > > > > I leave to ethdev maintainers decide between 1 vs 2.
> > > > > >
> > > > > > I do have a strong opinion not changing the testpmd basic forward
> > > > > > engines
> > > > > > for this feature.I would like to keep it simple as fastpath
> > > > > > optimized and would
> > > > > > like to add a separate Forwarding engine as means to verify this
> > > > > > feature.
> > > > >
> > > > > +1 to that.
> > > > > I don't think it a 'common' feature.
> > > > > So separate FWD mode seems like a best choice to me.
> > > >
> > > > -1 :)
> > > > There was some internal requirement from test team, they need to verify
> > >
>
>
> > > Internal QA requirements may not be the driving factor :-)
> >
> > It will be a test requirement for any driver to face, not internal. The
> > performance difference almost zero in v3, only an "unlikely if" test on
> > each burst. Shared Rxq is a low level feature, reusing all current FWD
> > engines to verify driver high level features is important IMHO.
>
> In addition to additional if check, The real concern is polluting the
> common forward engine for the not common feature.
Okay, removed changes to common forward engines in v4, please check.
>
> If you really want to reuse the existing application without any
> application change,
> I think, you need to hook this to eventdev
> http://code.dpdk.org/dpdk/latest/source/lib/eventdev/rte_eventdev.h#L34
>
> Where eventdev drivers does this thing in addition to other features, Ie.
> t has ports (which is kind of aggregator),
> it can receive the packets from any queue with mbuf->port as actually
> received port.
> That is in terms of mapping:
> - event queue will be dummy it will be as same as Rx queue
> - Rx adapter will be also a dummy
> - event ports aggregate multiple queues and connect to core via event port
> - On Rxing the packet, mbuf->port will be the actual Port which is received.
> app/test-eventdev written to use this model.
Is this the optional aggregator API we discussed? It's already there,
in patch 2/6.
I was trying to make the common forwarding engines flexible enough to
support any case, but since you all have concerns, that was removed in v4.
>
>
>
> >
> > >
> > > > all features like packet content, rss, vlan, checksum, rte_flow... to
> > > > be working based on shared rx queue. Based on the patch, I believe the
> > > > impact has been minimized.
> > >
> > >
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > > > represontors? If so can't we
> > > > > > > > > > > > have a port representor specific solution, reducing
> > > > > > > > > > > > scope can reduce the
> > > > > > > > > > > > complexity it brings?
> > > > > > > > > > > >
> > > > > > > > > > > > > > If this offload is only useful for representor
> > > > > > > > > > > > > > case, Can we make this offload specific to
> > > > > > > > > > > > > > representor the case by changing its
> > > > > > name and
> > > > > > > > > > > > > > scope.
> > > > > > > > > > > > >
> > > > > > > > > > > > > It works for both PF and representors in same switch
> > > > > > > > > > > > > domain, for application like OVS, few changes to
> > > > > > > > > > > > > apply.
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > 11 +++++++++++
> > > > > > > > > > > > > > > > > doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > 1 +
> > > > > > > > > > > > > > > > > doc/guides/prog_guide/switch_representation.
> > > > > > > > > > > > > > > > > rst | 10 ++++++++++
> > > > > > > > > > > > > > > > > lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > 1 +
> > > > > > > > > > > > > > > > > lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > 7 +++++++
> > > > > > > > > > > > > > > > > 5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > b/doc/guides/nics/features.rst index
> > > > > > > > > > > > > > > > > a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4
> > > > > > > > > > > > > > > > > checksum.
> > > > > > > > > > > > > > > > > ``tx_offload_capa,tx_queue_offload_capa:DE
> > > > > > > > > > > > > > > > > V_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > +Supports shared Rx queue for ports in same
> > > > > > > > > > > > > > > > > switch domain.
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > +* **[uses]
> > > > > > > > > > > > > > > > > rte_eth_rxconf,rte_eth_rxmode**:
> > > > > > > > > > > > > > > > > ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Packet type parsing
> > > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > > > > > > Queue start/stop =
> > > > > > > > > > > > > > > > > Runtime Rx queue setup =
> > > > > > > > > > > > > > > > > Runtime Tx queue setup =
> > > > > > > > > > > > > > > > > +Shared Rx queue =
> > > > > > > > > > > > > > > > > Burst mode info =
> > > > > > > > > > > > > > > > > Power mgmt address monitor =
> > > > > > > > > > > > > > > > > MTU update =
> > > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software
> > > > > > > > > > > > > > > > > "patch panel" front-end for applications.
> > > > > > > > > > > > > > > > > .. [1] `Ethernet switch device driver model
> > > > > > > > > > > > > > > > > (switchdev)
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > <https://www.kernel.org/doc/Documentation/net
> > > > > > > > > > > > > > > > > working/switchdev.txt
> > > > > > > > > > > > > > > > > > `_
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > +- Memory usage of representors is huge when
> > > > > > > > > > > > > > > > > number of representor
> > > > > > > > > > > > > > > > > +grows,
> > > > > > > > > > > > > > > > > + because PMD always allocate mbuf for each
> > > > > > > > > > > > > > > > > descriptor of Rx queue.
> > > > > > > > > > > > > > > > > + Polling the large number of ports brings
> > > > > > > > > > > > > > > > > more CPU load, cache
> > > > > > > > > > > > > > > > > +miss and
> > > > > > > > > > > > > > > > > + latency. Shared Rx queue can be used to
> > > > > > > > > > > > > > > > > share Rx queue between
> > > > > > > > > > > > > > > > > +PF and
> > > > > > > > > > > > > > > > > + representors in same switch domain.
> > > > > > > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > > > > + is present in Rx offloading capability of
> > > > > > > > > > > > > > > > > device info. Setting
> > > > > > > > > > > > > > > > > +the
> > > > > > > > > > > > > > > > > + offloading flag in device Rx mode or Rx
> > > > > > > > > > > > > > > > > queue configuration to
> > > > > > > > > > > > > > > > > +enable
> > > > > > > > > > > > > > > > > + shared Rx queue. Polling any member port
> > > > > > > > > > > > > > > > > of shared Rx queue can
> > > > > > > > > > > > > > > > > +return
> > > > > > > > > > > > > > > > > + packets of all ports in group, port ID is
> > > > > > > > > > > > > > > > > saved in ``mbuf.port``.
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > Basic SR-IOV
> > > > > > > > > > > > > > > > > ------------
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > > > > RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSU
> > > > > > > > > > > > > > > > > M),
> > > > > > > > > > > > > > > > > RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPL
> > > > > > > > > > > > > > > > > IT),
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > > > > };
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > > > > > > uint8_t rx_drop_en; /**< Drop packets
> > > > > > > > > > > > > > > > > if no descriptors are available. */
> > > > > > > > > > > > > > > > > uint8_t rx_deferred_start; /**< Do
> > > > > > > > > > > > > > > > > not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > > > > > > uint16_t rx_nseg; /**< Number of
> > > > > > > > > > > > > > > > > descriptions in rx_seg array.
> > > > > > > > > > > > > > > > > */
> > > > > > > > > > > > > > > > > + uint32_t shared_group; /**< Shared
> > > > > > > > > > > > > > > > > port group index in
> > > > > > > > > > > > > > > > > + switch domain. */
> > > > > > > > > > > > > > > > > /**
> > > > > > > > > > > > > > > > > * Per-queue Rx offloads to be set
> > > > > > > > > > > > > > > > > using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > > > > * Only offloads set on
> > > > > > > > > > > > > > > > > rx_queue_offload_capa or
> > > > > > > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct
> > > > > > > > > > > > > > > > > rte_eth_conf {
> > > > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM
> > > > > > > > > > > > > > > > > 0x00040000
> > > > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_RSS_HASH
> > > > > > > > > > > > > > > > > 0x00080000
> > > > > > > > > > > > > > > > > #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT
> > > > > > > > > > > > > > > > > 0x00100000
> > > > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > > > + * Rx queue is shared among ports in same
> > > > > > > > > > > > > > > > > switch domain to save
> > > > > > > > > > > > > > > > > +memory,
> > > > > > > > > > > > > > > > > + * avoid polling each port. Any port in
> > > > > > > > > > > > > > > > > group can be used to receive packets.
> > > > > > > > > > > > > > > > > + * Real source port number saved in mbuf-
> > > > > > > > > > > > > > > > > > port field.
> > > > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > > > > > > > > > > > > > > > > 0x00200000
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_CHECKSUM
> > > > > > > > > > > > > > > > > (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > > > > DEV_RX_OFFLO
> > > > > > > > > > > > > > > > > AD_UDP_CKSUM | \
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > >
> >
On Fri, Oct 8, 2021 at 1:56 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
> On Wed, 2021-09-29 at 13:35 +0530, Jerin Jacob wrote:
> > On Wed, Sep 29, 2021 at 1:11 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > > On Tue, 2021-09-28 at 20:29 +0530, Jerin Jacob wrote:
> > > > On Tue, Sep 28, 2021 at 8:10 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > > On Tue, 2021-09-28 at 13:59 +0000, Ananyev, Konstantin wrote:
> > > > > > >
> > > > > > > On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li
> > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > > > > > > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li
> > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > > > > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li
> > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas
> > > > > > > > > > > > > > > Monjalon
> > > > > > > <thomas@monjalon.net>;
> > > > > > > > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li
> > > > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > > > > > <ferruh.yigit@intel.com>;
> > > > > > > > > > > > > > > > > NBU-Contact-Thomas Monjalon
> > > > > > > > > > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li
> > > > > > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > In current DPDK framework, each RX queue is
> > > > > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > > > > switch domain, the memory consumption became
> > > > > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > > > > important, polling all ports leads to high
> > > > > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > This patch introduces shared RX queue. Ports
> > > > > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > > > > configuration in a switch domain could share
> > > > > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > > > > Polling any queue using same shared RX queue
> > > > > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > > > > all member ports. Source port is identified
> > > > > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Port queue number in a shared group should be
> > > > > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Share RX queue is supposed to be polled on
> > > > > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Is this offload specific to the representor? If
> > > > > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Yes, PF and representor in switch domain could
> > > > > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Not quite sure that I understood your question.
> > > > > > > > > > > > > > > > The control path is almost the same as before:
> > > > > > > > > > > > > > > > PF and representor ports are still needed, and
> > > > > > > > > > > > > > > > rte flows are not impacted.
> > > > > > > > > > > > > > > > Queues are still needed for each member port;
> > > > > > > > > > > > > > > > descriptors (mbufs) will be supplied from the
> > > > > > > > > > > > > > > > shared Rx queue in my PMD implementation.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > My question was: if we create a generic
> > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload and multiple
> > > > > > > > > > > > > > > ethdev receive queues land into the same receive
> > > > > > > > > > > > > > > queue, how is the flow order maintained for the
> > > > > > > > > > > > > > > respective receive queues?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I guess the question is about the testpmd forward
> > > > > > > > > > > > > > stream? The forwarding logic has to be changed
> > > > > > > > > > > > > > slightly in case of shared rxq: basically, for each
> > > > > > > > > > > > > > packet in the rx_burst result, look up the source
> > > > > > > > > > > > > > stream according to mbuf->port and forward to the
> > > > > > > > > > > > > > target fs. Packets from the same source port could
> > > > > > > > > > > > > > be grouped into a small burst to process; this
> > > > > > > > > > > > > > accelerates performance if traffic comes from a
> > > > > > > > > > > > > > limited number of ports. I'll introduce a common
> > > > > > > > > > > > > > API to do shared-rxq forwarding and call it with a
> > > > > > > > > > > > > > packet-handling callback, so it suits all
> > > > > > > > > > > > > > forwarding engines. Will send patches soon.
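The per-port regrouping step described above can be sketched in plain, self-contained C. A mock mbuf struct with only a `port` field stands in for `struct rte_mbuf`, and the fixed bounds are demo values, not DPDK constants; a real implementation would hand each sub-burst to the forward stream owning that source port.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PORTS 8   /* demo bound, not a DPDK constant */
#define BURST_SZ  32  /* demo bound, not a DPDK constant */

/* Mock mbuf: only the field this demo needs. In DPDK this would be
 * struct rte_mbuf and its 'port' member. */
struct mbuf {
	uint16_t port;
};

/*
 * Split one burst received from a shared Rx queue into per-source-port
 * sub-bursts, so each sub-burst can be processed by the forward stream
 * that owns that source port. Returns the number of distinct ports seen.
 */
static size_t
demux_by_port(struct mbuf **burst, size_t n,
	      struct mbuf *out[MAX_PORTS][BURST_SZ], size_t out_n[MAX_PORTS])
{
	size_t ports_seen = 0;

	for (size_t i = 0; i < n; i++) {
		uint16_t p = burst[i]->port;

		if (out_n[p] == 0)
			ports_seen++;
		out[p][out_n[p]++] = burst[i];
	}
	return ports_seen;
}
```

This keeps the original arrival order within each source port, which is the ordering a per-port forward stream cares about.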
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > All ports will put the packets into the same queue
> > > > > > > > > > > > > (shared queue), right? Does this mean only a single
> > > > > > > > > > > > > core will poll? What will happen if there are
> > > > > > > > > > > > > multiple cores polling, won't it cause a problem?
> > > > > > > > > > > > >
> > > > > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > > > > application, I am not sure about
> > > > > > > > > > > > > the solution, can't this work in a transparent way to
> > > > > > > > > > > > > the application?
> > > > > > > > > > > >
> > > > > > > > > > > > Discussed with Jerin; a new API introduced in v3 2/8
> > > > > > > > > > > > aggregates ports in the same group into one new port.
> > > > > > > > > > > > Users could schedule polling on the aggregated port
> > > > > > > > > > > > instead of on all member ports.
> > > > > > > > > > >
> > > > > > > > > > > The v3 still has testpmd changes in the fastpath,
> > > > > > > > > > > right? IMO, for this feature, we should not change the
> > > > > > > > > > > fastpath of the testpmd application. Instead, testpmd
> > > > > > > > > > > can use aggregated ports, probably as a separate
> > > > > > > > > > > fwd_engine, to show how to use this feature.
> > > > > > > > > >
> > > > > > > > > > Good point to discuss :) There are two strategies for
> > > > > > > > > > polling a shared Rxq:
> > > > > > > > > > 1. polling each member port
> > > > > > > > > > All forwarding engines can be reused to work as before.
> > > > > > > > > > My testpmd patches are efforts towards this direction.
> > > > > > > > > > Does your PMD support this?
> > > > > > > > >
> > > > > > > > > Unfortunately not. More than that, every application
> > > > > > > > > needs to change to support this model.
> > > > > > > >
> > > > > > > > Both strategies need the user application to resolve the
> > > > > > > > port ID from the mbuf and process accordingly.
> > > > > > > > This one doesn't demand an aggregated port and requires no
> > > > > > > > polling-schedule change.
> > > > > > >
> > > > > > > I was thinking the mbuf will be updated by the
> > > > > > > driver/aggregator port when it comes to the application.
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > 2. polling aggregated port
> > > > > > > > > > > Besides the forwarding engine, more work is needed to demo it.
> > > > > > > > > > This is an optional API, not supported by my PMD yet.
> > > > > > > > >
> > > > > > > > > We are thinking of implementing this PMD when it comes to it,
> > > > > > > > > ie.
> > > > > > > > > without application change in fastpath
> > > > > > > > > logic.
> > > > > > > >
> > > > > > > > The fastpath has to resolve the port ID anyway and forward
> > > > > > > > according to logic. Forwarding engines need to adapt to
> > > > > > > > support shared Rxq.
> > > > > > > > Fortunately, in testpmd, this can be done with an abstract API.
> > > > > > > >
> > > > > > > > Let's defer part 2 until some PMD really supports it and it
> > > > > > > > has been tested. What do you think?
> > > > > > >
> > > > > > > We are not planning to use this feature, so either way it is
> > > > > > > OK with me.
> > > > > > > I leave it to the ethdev maintainers to decide between 1 and 2.
> > > > > > >
> > > > > > > I do have a strong opinion on not changing the testpmd basic
> > > > > > > forward engines for this feature. I would like to keep them
> > > > > > > simple and fastpath-optimized, and would like to add a
> > > > > > > separate forwarding engine as a means to verify this feature.
> > > > > >
> > > > > > +1 to that.
> > > > > > I don't think it's a 'common' feature.
> > > > > > So a separate FWD mode seems like the best choice to me.
> > > > >
> > > > > -1 :)
> > > > > There was some internal requirement from test team, they need to verify
> > > >
> >
> >
> > > > Internal QA requirements may not be the driving factor :-)
> > >
> > > It will be a test requirement for any driver to face, not internal. The
> > > performance difference is almost zero in v3, only an "unlikely if" test
> > > on each burst. Shared Rxq is a low-level feature; reusing all current
> > > FWD engines to verify high-level driver features is important IMHO.
> >
> > In addition to the extra if check, the real concern is polluting the
> > common forward engine with a non-common feature.
>
> Okay, removed changes to common forward engines in v4, please check.
Thanks.
>
> >
> > If you really want to reuse the existing application without any
> > application change,
> > I think, you need to hook this to eventdev
> > http://code.dpdk.org/dpdk/latest/source/lib/eventdev/rte_eventdev.h#L34
> >
> > Where eventdev drivers do this in addition to other features, i.e.
> > it has ports (which are a kind of aggregator),
> > and it can receive packets from any queue with mbuf->port set to the
> > port on which they were actually received.
> > That is in terms of mapping:
> > - the event queue will be a dummy, the same as the Rx queue
> > - the Rx adapter will also be a dummy
> > - event ports aggregate multiple queues and connect to a core via an event port
> > - on Rxing a packet, mbuf->port will be the actual port on which it was received.
> > app/test-eventdev written to use this model.
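For reference, the receive loop under that eventdev mapping would look roughly as follows. This is a sketch against the existing eventdev API, not tested code; `evdev_id`, `event_port_id`, `BURST_SIZE`, and `handle_pkt()` are illustrative placeholders.

```c
/* Sketch: poll one event port that aggregates all member ports'
 * Rx queues; handle_pkt() is a hypothetical application callback. */
struct rte_event ev[BURST_SIZE];
uint16_t i, nb;

nb = rte_event_dequeue_burst(evdev_id, event_port_id, ev, BURST_SIZE, 0);
for (i = 0; i < nb; i++) {
	struct rte_mbuf *m = ev[i].mbuf;

	/* mbuf->port is the ethdev port the packet actually arrived on. */
	handle_pkt(m->port, m);
}
```

The application polls a single event port, which is what makes the aggregation transparent to its fastpath.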
>
> Is this the optional aggregator API we discussed? It is already there,
> patch 2/6.
> I was trying to make the common forwarding engines general enough to
> support any case, but since you all have concerns, removed in v4.
The point was, if we take the eventdev Rx adapter path, this whole thing
can be implemented without adding any new APIs in ethdev, as similar
functionality is supported by the ethdev-eventdev Rx adapter. Now two
things:
1) The aggregator API is not required; we would take the eventdev Rx
adapter route to implement it.
2) It is also possible to implement the other mode with the eventdev Rx
adapter. So I leave it to the ethdev maintainers to decide whether this
path is required or not. No strong opinion on this.
>
>
> >
> >
> >
> > >
> > > >
> > > > > all features like packet content, rss, vlan, checksum, rte_flow... to
> > > > > be working based on shared rx queue. Based on the patch, I believe the
> > > > > impact has been minimized.
> > > >
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > > > > representors? If so, can't we have a port-representor
> > > > > > > > > > > > > specific solution? Reducing the scope can reduce the
> > > > > > > > > > > > > complexity it brings.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > If this offload is only useful for the
> > > > > > > > > > > > > > > representor case, can we make it specific to
> > > > > > > > > > > > > > > that case by changing its name and scope?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > It works for both PF and representors in the same
> > > > > > > > > > > > > > switch domain; for applications like OVS, only a
> > > > > > > > > > > > > > few changes apply.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > > > > > > > > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > > > > > > > > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > > > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > index a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > +* **[uses]     rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > > > > > > > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > > > > > > > > > > >         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > +- Memory usage of representors is huge when number of representor grows,
> > > > > > > > > > > > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > > > > > > > > > > > +  Polling the large number of ports brings more CPU load, cache miss and
> > > > > > > > > > > > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between PF and
> > > > > > > > > > > > > > > > > > +  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > > > > > +  is present in Rx offloading capability of device info. Setting the
> > > > > > > > > > > > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to enable
> > > > > > > > > > > > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can return
> > > > > > > > > > > > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > > > > > > > >  ------------
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > > > > >  	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > > > > > > > > > > > >  	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > > > > >  	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > > > > > > > > > > > +	RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > > > > >  };
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > > > > > > >  	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > > > > > > > > > >  	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > > > > > > >  	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > > > > > > > > > > > > > > > > > +	uint32_t shared_group; /**< Shared port group index in switch domain. */
> > > > > > > > > > > > > > > > > >  	/**
> > > > > > > > > > > > > > > > > >  	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > > > > >  	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > > > > > > > > > > > > > > > > > @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH         0x00080000
> > > > > > > > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > > > > + * Rx queue is shared among ports in same switch domain to save memory,
> > > > > > > > > > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > > > > >  					 DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > >
> > >
>
On Sun, 2021-10-10 at 15:16 +0530, Jerin Jacob wrote:
> On Fri, Oct 8, 2021 at 1:56 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> >
> > On Wed, 2021-09-29 at 13:35 +0530, Jerin Jacob wrote:
> > > On Wed, Sep 29, 2021 at 1:11 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > >
> > > > On Tue, 2021-09-28 at 20:29 +0530, Jerin Jacob wrote:
> > > > > On Tue, Sep 28, 2021 at 8:10 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > >
> > > > > > On Tue, 2021-09-28 at 13:59 +0000, Ananyev, Konstantin wrote:
> > > > > > > >
> > > > > > > > On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li
> > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > > > > > > > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li
> > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > > > > > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li
> > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > > > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas
> > > > > > > > > > > > > > > > Monjalon
> > > > > > > > <thomas@monjalon.net>;
> > > > > > > > > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li
> > > > > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > > > > > > <ferruh.yigit@intel.com>;
> > > > > > > > > > > > > > > > > > NBU-Contact-Thomas Monjalon
> > > > > > > > > > > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li
> > > > > > > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > In current DPDK framework, each RX queue is
> > > > > > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > > > > > switch domain, the memory consumption became
> > > > > > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > > > > > important, polling all ports leads to high
> > > > > > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > This patch introduces shared RX queue. Ports
> > > > > > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > > > > > configuration in a switch domain could share
> > > > > > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > > > > > Polling any queue using same shared RX queue
> > > > > > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > > > > > all member ports. Source port is identified
> > > > > > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Port queue number in a shared group should be
> > > > > > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Share RX queue is supposed to be polled on
> > > > > > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Is this offload specific to the representor? If
> > > > > > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Yes, PF and representor in switch domain could
> > > > > > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Not quite sure that I understood your question.
> > > > > > > > > > > > > > > > > The control path of is
> > > > > > > > > > > > > > > > > almost same as before, PF and representor port
> > > > > > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > > > > > implementation.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > My question was if create a generic
> > > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > > > > > > ethdev receive queues land into
> > > > > > > > the same
> > > > > > > > > > > > > > > > receive queue, In that case, how the flow order is
> > > > > > > > > > > > > > > > maintained for respective receive queues.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I guess the question is testpmd forward stream? The
> > > > > > > > > > > > > > > forwarding logic has to be changed slightly in case
> > > > > > > > > > > > > > > of shared rxq.
> > > > > > > > > > > > > > > basically for each packet in rx_burst result, lookup
> > > > > > > > > > > > > > > source stream according to mbuf->port, forwarding to
> > > > > > > > > > > > > > > target fs.
> > > > > > > > > > > > > > > Packets from same source port could be grouped as a
> > > > > > > > > > > > > > > small burst to process, this will accelerates the
> > > > > > > > > > > > > > > performance if traffic
> > > > > > > > come from
> > > > > > > > > > > > > > > limited ports. I'll introduce some common api to do
> > > > > > > > > > > > > > > shard rxq forwarding, call it with packets handling
> > > > > > > > > > > > > > > callback, so it suites for
> > > > > > > > > > > > > > > all forwarding engine. Will sent patches soon.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > All ports will put the packets in to the same queue
> > > > > > > > > > > > > > (share queue), right? Does
> > > > > > > > > > > > > > this means only single core will poll only, what will
> > > > > > > > > > > > > > happen if there are
> > > > > > > > > > > > > > multiple cores polling, won't it cause problem?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > > > > > application, I am not sure about
> > > > > > > > > > > > > > the solution, can't this work in a transparent way to
> > > > > > > > > > > > > > the application?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Discussed with Jerin, new API introduced in v3 2/8 that
> > > > > > > > > > > > > aggregate ports
> > > > > > > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > > > > > > polling on the
> > > > > > > > > > > > > aggregated port instead of all member ports.
> > > > > > > > > > > >
> > > > > > > > > > > > The v3 still has testpmd changes in fastpath. Right? IMO,
> > > > > > > > > > > > For this
> > > > > > > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > > > > > > probably as
> > > > > > > > > > > > separate fwd_engine to show how to use this feature.
> > > > > > > > > > >
> > > > > > > > > > > Good point to discuss :) There are two strategies to polling
> > > > > > > > > > > a shared
> > > > > > > > > > > Rxq:
> > > > > > > > > > > 1. polling each member port
> > > > > > > > > > > All forwarding engines can be reused to work as before.
> > > > > > > > > > > My testpmd patches are efforts towards this direction.
> > > > > > > > > > > Does your PMD support this?
> > > > > > > > > >
> > > > > > > > > > Not unfortunately. More than that, every application needs to
> > > > > > > > > > change
> > > > > > > > > > to support this model.
> > > > > > > > >
> > > > > > > > > Both strategies need user application to resolve port ID from
> > > > > > > > > mbuf and
> > > > > > > > > process accordingly.
> > > > > > > > > This one doesn't demand aggregated port, no polling schedule
> > > > > > > > > change.
> > > > > > > >
> > > > > > > > I was thinking, mbuf will be updated from driver/aggregator port as
> > > > > > > > when it
> > > > > > > > comes to application.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 2. polling aggregated port
> > > > > > > > > > > Besides forwarding engine, need more work to to demo it.
> > > > > > > > > > > This is an optional API, not supported by my PMD yet.
> > > > > > > > > >
> > > > > > > > > > We are thinking of implementing this PMD when it comes to it,
> > > > > > > > > > ie.
> > > > > > > > > > without application change in fastpath
> > > > > > > > > > logic.
> > > > > > > > >
> > > > > > > > > Fastpath have to resolve port ID anyway and forwarding according
> > > > > > > > > to
> > > > > > > > > logic. Forwarding engine need to adapt to support shard Rxq.
> > > > > > > > > Fortunately, in testpmd, this can be done with an abstract API.
> > > > > > > > >
> > > > > > > > > Let's defer part 2 until some PMD really support it and tested,
> > > > > > > > > how do
> > > > > > > > > you think?
> > > > > > > >
> > > > > > > > We are not planning to use this feature so either way it is OK to
> > > > > > > > me.
> > > > > > > > I leave to ethdev maintainers decide between 1 vs 2.
> > > > > > > >
> > > > > > > > I do have a strong opinion not changing the testpmd basic forward
> > > > > > > > engines
> > > > > > > > for this feature.I would like to keep it simple as fastpath
> > > > > > > > optimized and would
> > > > > > > > like to add a separate Forwarding engine as means to verify this
> > > > > > > > feature.
> > > > > > >
> > > > > > > +1 to that.
> > > > > > > I don't think it a 'common' feature.
> > > > > > > So separate FWD mode seems like a best choice to me.
> > > > > >
> > > > > > -1 :)
> > > > > > There was some internal requirement from test team, they need to verify
> > > > >
> > >
> > >
> > > > > Internal QA requirements may not be the driving factor :-)
> > > >
> > > > It will be a test requirement for any driver to face, not internal. The
> > > > performance difference almost zero in v3, only an "unlikely if" test on
> > > > each burst. Shared Rxq is a low level feature, reusing all current FWD
> > > > engines to verify driver high level features is important IMHO.
> > >
> > > In addition to the additional if check, the real concern is polluting the
> > > common forward engine with a feature that is not common.
> >
> > Okay, removed changes to common forward engines in v4, please check.
>
> Thanks.
>
> >
> > >
> > > If you really want to reuse the existing application without any
> > > application change,
> > > I think, you need to hook this to eventdev
> > > http://code.dpdk.org/dpdk/latest/source/lib/eventdev/rte_eventdev.h#L34
> > >
> > > Where eventdev drivers do this in addition to other features, i.e.
> > > it has ports (which are a kind of aggregator),
> > > it can receive the packets from any queue with mbuf->port as actually
> > > received port.
> > > That is in terms of mapping:
> > > - event queue will be dummy it will be as same as Rx queue
> > > - Rx adapter will be also a dummy
> > > - event ports aggregate multiple queues and connect to core via event port
> > > - On Rxing a packet, mbuf->port will be the actual port on which it was received.
> > > app/test-eventdev written to use this model.
> >
> > Is this the optional aggregator api we discussed? It is already there, patch
> > 2/6.
> > I was trying to make common forwarding engines perfect to support any
> > case, but since you all have concerns, removed in v4.
>
> The point was, if we take the eventdev Rx adapter path, this whole thing can
> be implemented
> without adding any new APIs in ethdev, as similar functionality is
> supported by the ethdev-eventdev
> Rx adapter. Now two things:
>
> 1) The aggregator API is not required; we will be taking the eventdev Rx
> adapter route to implement this.
> 2) In another mode it is possible to implement it with the eventdev Rx
> adapter. So I leave it to the ethdev
> maintainers to decide if this path is required or not. No strong
> opinion on this.
Seems you are an expert on eventdev; is this the Rx burst api?
rte_event_dequeue_burst(dev_id, port_id, ev[], nb_events, timeout)
Two concerns from a user perspective:
1. By using the ethdev-eventdev wrapper, it impacts performance.
2. For a user application like OVS, using the event api just when shared rxq
is enabled looks strange.
Maybe I missed something?
There should be more feedback and ideas on how to aggregate ports after
the fundamentals (offload bit and group) start to work; agree to remove
the aggregator api for now.
>
>
>
> >
> >
> > >
> > >
> > >
> > > >
> > > > >
> > > > > > all features like packet content, rss, vlan, checksum, rte_flow... to
> > > > > > be working based on shared rx queue. Based on the patch, I believe the
> > > > > > impact has been minimized.
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > > > > > representors? If so, can't we
> > > > > > > > > > > > > > have a port representor specific solution, reducing
> > > > > > > > > > > > > > scope can reduce the
> > > > > > > > > > > > > > complexity it brings?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > If this offload is only useful for the representor
> > > > > > > > > > > > > > > > case, can we make this offload specific to
> > > > > > > > > > > > > > > > the representor case by changing its
> > > > > > > > name and
> > > > > > > > > > > > > > > > scope.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > It works for both PF and representors in the same switch
> > > > > > > > > > > > > > > domain; for an application like OVS, few changes to
> > > > > > > > > > > > > > > apply.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Signed-off-by: Xueming Li
> > > > > > > > > > > > > > > > > > > <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > > > doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > > > 11 +++++++++++
> > > > > > > > > > > > > > > > > > > doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > > > 1 +
> > > > > > > > > > > > > > > > > > > doc/guides/prog_guide/switch_representation.
> > > > > > > > > > > > > > > > > > > rst | 10 ++++++++++
> > > > > > > > > > > > > > > > > > > lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > > > 1 +
> > > > > > > > > > > > > > > > > > > lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > > > 7 +++++++
> > > > > > > > > > > > > > > > > > > 5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > > b/doc/guides/nics/features.rst index
> > > > > > > > > > > > > > > > > > > a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4
> > > > > > > > > > > > > > > > > > > checksum.
> > > > > > > > > > > > > > > > > > > ``tx_offload_capa,tx_queue_offload_capa:DE
> > > > > > > > > > > > > > > > > > > V_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > +Supports shared Rx queue for ports in same
> > > > > > > > > > > > > > > > > > > switch domain.
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > +* **[uses]
> > > > > > > > > > > > > > > > > > > rte_eth_rxconf,rte_eth_rxmode**:
> > > > > > > > > > > > > > > > > > > ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Packet type parsing
> > > > > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > > > > a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > > b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > > > > > > > > Queue start/stop =
> > > > > > > > > > > > > > > > > > > Runtime Rx queue setup =
> > > > > > > > > > > > > > > > > > > Runtime Tx queue setup =
> > > > > > > > > > > > > > > > > > > +Shared Rx queue =
> > > > > > > > > > > > > > > > > > > Burst mode info =
> > > > > > > > > > > > > > > > > > > Power mgmt address monitor =
> > > > > > > > > > > > > > > > > > > MTU update =
> > > > > > > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > > > a/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > > > +++
> > > > > > > > > > > > > > > > > > > b/doc/guides/prog_guide/switch_representation
> > > > > > > > > > > > > > > > > > > .rst
> > > > > > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software
> > > > > > > > > > > > > > > > > > > "patch panel" front-end for applications.
> > > > > > > > > > > > > > > > > > > .. [1] `Ethernet switch device driver model
> > > > > > > > > > > > > > > > > > > (switchdev)
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > <https://www.kernel.org/doc/Documentation/net
> > > > > > > > > > > > > > > > > > > working/switchdev.txt
> > > > > > > > > > > > > > > > > > > > `_
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > +- Memory usage of representors is huge when
> > > > > > > > > > > > > > > > > > > number of representor
> > > > > > > > > > > > > > > > > > > +grows,
> > > > > > > > > > > > > > > > > > > + because PMD always allocate mbuf for each
> > > > > > > > > > > > > > > > > > > descriptor of Rx queue.
> > > > > > > > > > > > > > > > > > > + Polling the large number of ports brings
> > > > > > > > > > > > > > > > > > > more CPU load, cache
> > > > > > > > > > > > > > > > > > > +miss and
> > > > > > > > > > > > > > > > > > > + latency. Shared Rx queue can be used to
> > > > > > > > > > > > > > > > > > > share Rx queue between
> > > > > > > > > > > > > > > > > > > +PF and
> > > > > > > > > > > > > > > > > > > + representors in same switch domain.
> > > > > > > > > > > > > > > > > > > +``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > > > > > > + is present in Rx offloading capability of
> > > > > > > > > > > > > > > > > > > device info. Setting
> > > > > > > > > > > > > > > > > > > +the
> > > > > > > > > > > > > > > > > > > + offloading flag in device Rx mode or Rx
> > > > > > > > > > > > > > > > > > > queue configuration to
> > > > > > > > > > > > > > > > > > > +enable
> > > > > > > > > > > > > > > > > > > + shared Rx queue. Polling any member port
> > > > > > > > > > > > > > > > > > > of shared Rx queue can
> > > > > > > > > > > > > > > > > > > +return
> > > > > > > > > > > > > > > > > > > + packets of all ports in group, port ID is
> > > > > > > > > > > > > > > > > > > saved in ``mbuf.port``.
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > Basic SR-IOV
> > > > > > > > > > > > > > > > > > > ------------
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > > > > > > RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSU
> > > > > > > > > > > > > > > > > > > M),
> > > > > > > > > > > > > > > > > > > RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPL
> > > > > > > > > > > > > > > > > > > IT),
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > > > > > > };
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > > b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > > > > > > > > uint8_t rx_drop_en; /**< Drop packets
> > > > > > > > > > > > > > > > > > > if no descriptors are available. */
> > > > > > > > > > > > > > > > > > > uint8_t rx_deferred_start; /**< Do
> > > > > > > > > > > > > > > > > > > not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > > > > > > > > uint16_t rx_nseg; /**< Number of
> > > > > > > > > > > > > > > > > > > descriptions in rx_seg array.
> > > > > > > > > > > > > > > > > > > */
> > > > > > > > > > > > > > > > > > > + uint32_t shared_group; /**< Shared
> > > > > > > > > > > > > > > > > > > port group index in
> > > > > > > > > > > > > > > > > > > + switch domain. */
> > > > > > > > > > > > > > > > > > > /**
> > > > > > > > > > > > > > > > > > > * Per-queue Rx offloads to be set
> > > > > > > > > > > > > > > > > > > using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > > > > > > * Only offloads set on
> > > > > > > > > > > > > > > > > > > rx_queue_offload_capa or
> > > > > > > > > > > > > > > > > > > rx_offload_capa @@ -1373,6 +1374,12 @@ struct
> > > > > > > > > > > > > > > > > > > rte_eth_conf {
> > > > > > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM
> > > > > > > > > > > > > > > > > > > 0x00040000
> > > > > > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_RSS_HASH
> > > > > > > > > > > > > > > > > > > 0x00080000
> > > > > > > > > > > > > > > > > > > #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT
> > > > > > > > > > > > > > > > > > > 0x00100000
> > > > > > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > > > > > + * Rx queue is shared among ports in same
> > > > > > > > > > > > > > > > > > > switch domain to save
> > > > > > > > > > > > > > > > > > > +memory,
> > > > > > > > > > > > > > > > > > > + * avoid polling each port. Any port in
> > > > > > > > > > > > > > > > > > > group can be used to receive packets.
> > > > > > > > > > > > > > > > > > > + * Real source port number saved in mbuf-
> > > > > > > > > > > > > > > > > > > > port field.
> > > > > > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > > > > > > > > > > > > > > > > > > 0x00200000
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > #define DEV_RX_OFFLOAD_CHECKSUM
> > > > > > > > > > > > > > > > > > > (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > > > > > > DEV_RX_OFFLO
> > > > > > > > > > > > > > > > > > > AD_UDP_CKSUM | \
> > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > > > > > > > >
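To make the knobs in the patch above concrete, here is a hedged configuration sketch of how an application might opt in. The `RTE_ETH_RX_OFFLOAD_SHARED_RXQ` flag and `shared_group` field come straight from the diff; the helper name, descriptor count, and group number are illustrative assumptions, and the fragment assumes a DPDK build environment (it is not compiled here):

```c
/* Illustrative fragment; requires a DPDK build environment. */
#include <errno.h>
#include <rte_ethdev.h>

static int
setup_shared_rxq(uint16_t port_id, uint16_t queue_id,
                 struct rte_mempool *mp)
{
    struct rte_eth_dev_info info;
    struct rte_eth_rxconf rxconf;
    int ret;

    ret = rte_eth_dev_info_get(port_id, &info);
    if (ret != 0)
        return ret;

    /* Only request the offload when the PMD reports the capability. */
    if (!(info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_SHARED_RXQ))
        return -ENOTSUP;

    rxconf = info.default_rxconf;
    rxconf.offloads |= RTE_ETH_RX_OFFLOAD_SHARED_RXQ;
    rxconf.shared_group = 0; /* members of group 0 share this queue */

    return rte_eth_rx_queue_setup(port_id, queue_id, 512,
                                  rte_eth_dev_socket_id(port_id),
                                  &rxconf, mp);
}
```

The same call would be repeated for every member port in the switch domain with an identical queue count, since queue indexes are 1:1 mapped within a shared group.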
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > >
> > > >
> >
On Sun, Oct 10, 2021 at 7:10 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
>
> On Sun, 2021-10-10 at 15:16 +0530, Jerin Jacob wrote:
> > On Fri, Oct 8, 2021 at 1:56 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > >
> > > On Wed, 2021-09-29 at 13:35 +0530, Jerin Jacob wrote:
> > > > On Wed, Sep 29, 2021 at 1:11 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > >
> > > > > On Tue, 2021-09-28 at 20:29 +0530, Jerin Jacob wrote:
> > > > > > On Tue, Sep 28, 2021 at 8:10 PM Xueming(Steven) Li <xuemingl@nvidia.com> wrote:
> > > > > > >
> > > > > > > On Tue, 2021-09-28 at 13:59 +0000, Ananyev, Konstantin wrote:
> > > > > > > > >
> > > > > > > > > On Tue, Sep 28, 2021 at 6:55 PM Xueming(Steven) Li
> > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, 2021-09-28 at 18:28 +0530, Jerin Jacob wrote:
> > > > > > > > > > > On Tue, Sep 28, 2021 at 5:07 PM Xueming(Steven) Li
> > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, 2021-09-28 at 15:05 +0530, Jerin Jacob wrote:
> > > > > > > > > > > > > On Sun, Sep 26, 2021 at 11:06 AM Xueming(Steven) Li
> > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, 2021-08-11 at 13:04 +0100, Ferruh Yigit wrote:
> > > > > > > > > > > > > > > On 8/11/2021 9:28 AM, Xueming(Steven) Li wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > > > > > Sent: Wednesday, August 11, 2021 4:03 PM
> > > > > > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > > > > > <ferruh.yigit@intel.com>; NBU-Contact-Thomas
> > > > > > > > > > > > > > > > > Monjalon
> > > > > > > > > <thomas@monjalon.net>;
> > > > > > > > > > > > > > > > > Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Mon, Aug 9, 2021 at 7:46 PM Xueming(Steven) Li
> > > > > > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > > > > > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > > > > > > > > > > > > > > > > > Sent: Monday, August 9, 2021 9:51 PM
> > > > > > > > > > > > > > > > > > > To: Xueming(Steven) Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > > > > Cc: dpdk-dev <dev@dpdk.org>; Ferruh Yigit
> > > > > > > > > > > > > > > > > > > <ferruh.yigit@intel.com>;
> > > > > > > > > > > > > > > > > > > NBU-Contact-Thomas Monjalon
> > > > > > > > > > > > > > > > > > > <thomas@monjalon.net>; Andrew Rybchenko
> > > > > > > > > > > > > > > > > > > <andrew.rybchenko@oktetlabs.ru>
> > > > > > > > > > > > > > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev:
> > > > > > > > > > > > > > > > > > > introduce shared Rx queue
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Mon, Aug 9, 2021 at 5:18 PM Xueming Li
> > > > > > > > > > > > > > > > > > > <xuemingl@nvidia.com> wrote:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > In current DPDK framework, each RX queue is
> > > > > > > > > > > > > > > > > > > > pre-loaded with mbufs
> > > > > > > > > > > > > > > > > > > > for incoming packets. When number of
> > > > > > > > > > > > > > > > > > > > representors scale out in a
> > > > > > > > > > > > > > > > > > > > switch domain, the memory consumption became
> > > > > > > > > > > > > > > > > > > > significant. Most
> > > > > > > > > > > > > > > > > > > > important, polling all ports leads to high
> > > > > > > > > > > > > > > > > > > > cache miss, high
> > > > > > > > > > > > > > > > > > > > latency and low throughput.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > This patch introduces shared RX queue. Ports
> > > > > > > > > > > > > > > > > > > > with same
> > > > > > > > > > > > > > > > > > > > configuration in a switch domain could share
> > > > > > > > > > > > > > > > > > > > RX queue set by specifying sharing group.
> > > > > > > > > > > > > > > > > > > > Polling any queue using same shared RX queue
> > > > > > > > > > > > > > > > > > > > receives packets from
> > > > > > > > > > > > > > > > > > > > all member ports. Source port is identified
> > > > > > > > > > > > > > > > > > > > by mbuf->port.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Port queue number in a shared group should be
> > > > > > > > > > > > > > > > > > > > identical. Queue
> > > > > > > > > > > > > > > > > > > > index is
> > > > > > > > > > > > > > > > > > > > 1:1 mapped in shared group.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Share RX queue is supposed to be polled on
> > > > > > > > > > > > > > > > > > > > same thread.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Multiple groups is supported by group ID.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Is this offload specific to the representor? If
> > > > > > > > > > > > > > > > > > > so can this name be changed specifically to
> > > > > > > > > > > > > > > > > > > representor?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Yes, PF and representor in switch domain could
> > > > > > > > > > > > > > > > > > take advantage.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > If it is for a generic case, how the flow
> > > > > > > > > > > > > > > > > > > ordering will be maintained?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Not quite sure that I understood your question.
> > > > > > > > > > > > > > > > > > The control path is
> > > > > > > > > > > > > > > > > > almost the same as before, PF and representor port
> > > > > > > > > > > > > > > > > > still needed, rte flows not impacted.
> > > > > > > > > > > > > > > > > > Queues still needed for each member port,
> > > > > > > > > > > > > > > > > > descriptors(mbuf) will be
> > > > > > > > > > > > > > > > > > supplied from shared Rx queue in my PMD
> > > > > > > > > > > > > > > > > > implementation.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > My question was, if we create a generic
> > > > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple
> > > > > > > > > > > > > > > > > ethdev receive queues land into
> > > > > > > > > the same
> > > > > > > > > > > > > > > > > receive queue; in that case, how is the flow order
> > > > > > > > > > > > > > > > > maintained for the respective receive queues?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I guess the question is testpmd forward stream? The
> > > > > > > > > > > > > > > > forwarding logic has to be changed slightly in case
> > > > > > > > > > > > > > > > of shared rxq.
> > > > > > > > > > > > > > > > basically for each packet in rx_burst result, lookup
> > > > > > > > > > > > > > > > source stream according to mbuf->port, forwarding to
> > > > > > > > > > > > > > > > target fs.
> > > > > > > > > > > > > > > > Packets from the same source port could be grouped as a
> > > > > > > > > > > > > > > > small burst to process; this will accelerate
> > > > > > > > > > > > > > > > performance if traffic
> > > > > > > > > comes from
> > > > > > > > > > > > > > > > limited ports. I'll introduce some common api to do
> > > > > > > > > > > > > > > > shared rxq forwarding, call it with a packet handling
> > > > > > > > > > > > > > > > callback, so it suits
> > > > > > > > > > > > > > > > all forwarding engines. Will send patches soon.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > All ports will put the packets into the same queue
> > > > > > > > > > > > > > > (shared queue), right? Does
> > > > > > > > > > > > > > > this mean only a single core will poll; what will
> > > > > > > > > > > > > > > happen if there are
> > > > > > > > > > > > > > > multiple cores polling, won't it cause a problem?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > And if this requires specific changes in the
> > > > > > > > > > > > > > > application, I am not sure about
> > > > > > > > > > > > > > > the solution, can't this work in a transparent way to
> > > > > > > > > > > > > > > the application?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Discussed with Jerin, new API introduced in v3 2/8 that
> > > > > > > > > > > > > > aggregate ports
> > > > > > > > > > > > > > in same group into one new port. Users could schedule
> > > > > > > > > > > > > > polling on the
> > > > > > > > > > > > > > aggregated port instead of all member ports.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The v3 still has testpmd changes in fastpath. Right? IMO,
> > > > > > > > > > > > > For this
> > > > > > > > > > > > > feature, we should not change fastpath of testpmd
> > > > > > > > > > > > > application. Instead, testpmd can use aggregated ports
> > > > > > > > > > > > > probably as
> > > > > > > > > > > > > separate fwd_engine to show how to use this feature.
> > > > > > > > > > > >
> > > > > > > > > > > > Good point to discuss :) There are two strategies to polling
> > > > > > > > > > > > a shared
> > > > > > > > > > > > Rxq:
> > > > > > > > > > > > 1. polling each member port
> > > > > > > > > > > > All forwarding engines can be reused to work as before.
> > > > > > > > > > > > My testpmd patches are efforts towards this direction.
> > > > > > > > > > > > Does your PMD support this?
> > > > > > > > > > >
> > > > > > > > > > > Unfortunately not. More than that, every application needs to
> > > > > > > > > > > change
> > > > > > > > > > > to support this model.
> > > > > > > > > >
> > > > > > > > > > Both strategies need user application to resolve port ID from
> > > > > > > > > > mbuf and
> > > > > > > > > > process accordingly.
> > > > > > > > > > This one doesn't demand aggregated port, no polling schedule
> > > > > > > > > > change.
> > > > > > > > >
> > > > > > > > > I was thinking, mbuf will be updated from driver/aggregator port as
> > > > > > > > > when it
> > > > > > > > > comes to application.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > 2. polling aggregated port
> > > > > > > > > > > > Besides forwarding engine, need more work to demo it.
> > > > > > > > > > > > This is an optional API, not supported by my PMD yet.
> > > > > > > > > > >
> > > > > > > > > > > We are thinking of implementing this in the PMD when it comes to it,
> > > > > > > > > > > i.e.
> > > > > > > > > > > without application change in fastpath
> > > > > > > > > > > logic.
> > > > > > > > > >
> > > > > > > > > > Fastpath has to resolve the port ID anyway and forward according
> > > > > > > > > > to
> > > > > > > > > > logic. Forwarding engines need to adapt to support shared Rxq.
> > > > > > > > > > Fortunately, in testpmd, this can be done with an abstract API.
> > > > > > > > > >
> > > > > > > > > > Let's defer part 2 until some PMD really supports it and it is tested;
> > > > > > > > > > what do
> > > > > > > > > > you think?
> > > > > > > > >
> > > > > > > > > We are not planning to use this feature, so either way it is OK with
> > > > > > > > > me.
> > > > > > > > > I leave it to the ethdev maintainers to decide between 1 vs 2.
> > > > > > > > >
> > > > > > > > > I do have a strong opinion not changing the testpmd basic forward
> > > > > > > > > engines
> > > > > > > > > for this feature. I would like to keep it simple as fastpath
> > > > > > > > > optimized and would
> > > > > > > > > like to add a separate forwarding engine as a means to verify this
> > > > > > > > > feature.
> > > > > > > >
> > > > > > > > +1 to that.
> > > > > > > > I don't think it is a 'common' feature.
> > > > > > > > So a separate FWD mode seems like the best choice to me.
> > > > > > >
> > > > > > > -1 :)
> > > > > > > There was some internal requirement from the test team; they need to verify
> > > > > >
> > > >
> > > >
> > > > > > Internal QA requirements may not be the driving factor :-)
> > > > >
> > > > > It will be a test requirement for any driver to face, not internal. The
> > > > > performance difference is almost zero in v3, only an "unlikely if" test on
> > > > > each burst. Shared Rxq is a low level feature, reusing all current FWD
> > > > > engines to verify driver high level features is important IMHO.
> > > >
> > > > In addition to the additional if check, the real concern is polluting the
> > > > common forward engine with a feature that is not common.
> > >
> > > Okay, removed changes to common forward engines in v4, please check.
> >
> > Thanks.
> >
> > >
> > > >
> > > > If you really want to reuse the existing application without any
> > > > application change,
> > > > I think, you need to hook this to eventdev
> > > > http://code.dpdk.org/dpdk/latest/source/lib/eventdev/rte_eventdev.h#L34
> > > >
> > > > Where eventdev drivers do this in addition to other features, i.e.
> > > > it has ports (which are a kind of aggregator),
> > > > it can receive the packets from any queue with mbuf->port as actually
> > > > received port.
> > > > That is in terms of mapping:
> > > > - event queue will be dummy it will be as same as Rx queue
> > > > - Rx adapter will be also a dummy
> > > > - event ports aggregate multiple queues and connect to core via event port
> > > > - On Rxing a packet, mbuf->port will be the actual port on which it was received.
> > > > app/test-eventdev written to use this model.
> > >
> > > Is this the optional aggregator api we discussed? It is already there, patch
> > > 2/6.
> > > I was trying to make common forwarding engines perfect to support any
> > > case, but since you all have concerns, removed in v4.
> >
> > The point was, if we take the eventdev Rx adapter path, this whole thing can
> > be implemented
> > without adding any new APIs in ethdev, as similar functionality is
> > supported by the ethdev-eventdev
> > Rx adapter. Now two things:
> >
> > 1) The aggregator API is not required; we will be taking the eventdev Rx
> > adapter route to implement this.
> > 2) In another mode it is possible to implement it with the eventdev Rx
> > adapter. So I leave it to the ethdev
> > maintainers to decide if this path is required or not. No strong
> > opinion on this.
>
> Seems you are an expert on eventdev; is this the Rx burst api?
> rte_event_dequeue_burst(dev_id, port_id, ev[], nb_events, timeout)
Yes.
>
> Two concerns from a user perspective:
> 1. By using the ethdev-eventdev wrapper, it impacts performance.
It is not a wrapper. If HW is doing the work then there will not be any regression
with the Rx adapter.
Like tx_burst, the packet/event comes through
rte_event_dequeue_burst(), i.e. a single callback function pointer overhead.
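As a rough sketch of the model described here: the poll loop below uses the real `rte_event_dequeue_burst()` signature, but `process_packet()` is a hypothetical application callback, the burst size and timeout are illustrative, and the fragment assumes a DPDK build environment (it is not compiled here):

```c
/* Illustrative fragment; requires a DPDK build environment. */
#include <rte_eventdev.h>
#include <rte_mbuf.h>

/* Hypothetical application callback, not a DPDK API. */
extern void process_packet(uint16_t port, struct rte_mbuf *m);

/* Packets from many ethdev queues arrive through one event port;
 * the true ingress port is read back from mbuf->port. */
static void
event_rx_loop(uint8_t dev_id, uint8_t event_port_id)
{
    struct rte_event ev[32];
    uint16_t i, nb;

    for (;;) {
        nb = rte_event_dequeue_burst(dev_id, event_port_id, ev,
                                     32, /* timeout_ticks */ 0);
        for (i = 0; i < nb; i++) {
            struct rte_mbuf *m = ev[i].mbuf;
            /* m->port is the actual port the packet was received on */
            process_packet(m->port, m);
        }
    }
}
```

Under this mapping the application polls a single event port instead of every member ethdev queue, which is the aggregation behavior being discussed.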
> 2. For a user application like OVS, using the event api just when shared rxq
> is enabled looks strange.
>
> Maybe I missed something?
>
> There should be more feedback and ideas on how to aggregate ports after
> the fundamentals (offload bit and group) start to work; agree to remove
> the aggregator api for now.
OK.
>
> >
> >
> >
> > >
> > >
> > > >
> > > >
> > > >
> > > > >
> > > > > >
> > > > > > > all features like packet content, rss, vlan, checksum, rte_flow... to
> > > > > > > be working based on shared rx queue. Based on the patch, I believe the
> > > > > > > impact has been minimized.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Overall, is this for optimizing memory for the port
> > > > > > > > > > > > > > > representors? If so, can't we
> > > > > > > > > > > > > > > have a port representor specific solution, reducing
> > > > > > > > > > > > > > > scope can reduce the
> > > > > > > > > > > > > > > complexity it brings?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > If this offload is only useful for the representor
> > > > > > > > > > > > > > > > > case, can we make this offload specific to
> > > > > > > > > > > > > > > > > the representor case by changing its
> > > > > > > > > name and
> > > > > > > > > > > > > > > > > scope.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > It works for both PF and representors in the same switch
> > > > > > > > > > > > > > > > domain; for an application like OVS, few changes to
> > > > > > > > > > > > > > > > apply.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Signed-off-by: Xueming Li <xuemingl@nvidia.com>
> > > > > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > > > >  doc/guides/nics/features.rst                    | 11 +++++++++++
> > > > > > > > > > > > > > > > > > > >  doc/guides/nics/features/default.ini            |  1 +
> > > > > > > > > > > > > > > > > > > >  doc/guides/prog_guide/switch_representation.rst | 10 ++++++++++
> > > > > > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.c                         |  1 +
> > > > > > > > > > > > > > > > > > > >  lib/ethdev/rte_ethdev.h                         |  7 +++++++
> > > > > > > > > > > > > > > > > > > >  5 files changed, 30 insertions(+)
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > > > index a96e12d155..2e2a9b1554 100644
> > > > > > > > > > > > > > > > > > > > --- a/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features.rst
> > > > > > > > > > > > > > > > > > > > @@ -624,6 +624,17 @@ Supports inner packet L4 checksum.
> > > > > > > > > > > > > > > > > > > >    ``tx_offload_capa,tx_queue_offload_capa:DEV_TX_OFFLOAD_OUTER_UDP_CKSUM``.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > +.. _nic_features_shared_rx_queue:
> > > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > > +Shared Rx queue
> > > > > > > > > > > > > > > > > > > > +---------------
> > > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > > +Supports shared Rx queue for ports in same switch domain.
> > > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > > +* **[uses] rte_eth_rxconf,rte_eth_rxmode**: ``offloads:RTE_ETH_RX_OFFLOAD_SHARED_RXQ``.
> > > > > > > > > > > > > > > > > > > > +* **[provides] mbuf**: ``mbuf.port``.
> > > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > >  .. _nic_features_packet_type_parsing:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >  Packet type parsing
> > > > > > > > > > > > > > > > > > > > diff --git a/doc/guides/nics/features/default.ini b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > > > index 754184ddd4..ebeb4c1851 100644
> > > > > > > > > > > > > > > > > > > > --- a/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > > > +++ b/doc/guides/nics/features/default.ini
> > > > > > > > > > > > > > > > > > > > @@ -19,6 +19,7 @@ Free Tx mbuf on demand =
> > > > > > > > > > > > > > > > > > > >  Queue start/stop     =
> > > > > > > > > > > > > > > > > > > >  Runtime Rx queue setup =
> > > > > > > > > > > > > > > > > > > >  Runtime Tx queue setup =
> > > > > > > > > > > > > > > > > > > > +Shared Rx queue      =
> > > > > > > > > > > > > > > > > > > >  Burst mode info      =
> > > > > > > > > > > > > > > > > > > >  Power mgmt address monitor =
> > > > > > > > > > > > > > > > > > > >  MTU update           =
> > > > > > > > > > > > > > > > > > > > diff --git a/doc/guides/prog_guide/switch_representation.rst b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > > > > > > > > index ff6aa91c80..45bf5a3a10 100644
> > > > > > > > > > > > > > > > > > > > --- a/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > > > > > > > > +++ b/doc/guides/prog_guide/switch_representation.rst
> > > > > > > > > > > > > > > > > > > > @@ -123,6 +123,16 @@ thought as a software "patch panel" front-end for applications.
> > > > > > > > > > > > > > > > > > > >  .. [1] `Ethernet switch device driver model (switchdev)
> > > > > > > > > > > > > > > > > > > >         <https://www.kernel.org/doc/Documentation/networking/switchdev.txt>`_
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > +- Memory usage of representors is huge when number of representor grows,
> > > > > > > > > > > > > > > > > > > > +  because PMD always allocate mbuf for each descriptor of Rx queue.
> > > > > > > > > > > > > > > > > > > > +  Polling the large number of ports brings more CPU load, cache miss and
> > > > > > > > > > > > > > > > > > > > +  latency. Shared Rx queue can be used to share Rx queue between PF and
> > > > > > > > > > > > > > > > > > > > +  representors in same switch domain. ``RTE_ETH_RX_OFFLOAD_SHARED_RXQ``
> > > > > > > > > > > > > > > > > > > > +  is present in Rx offloading capability of device info. Setting the
> > > > > > > > > > > > > > > > > > > > +  offloading flag in device Rx mode or Rx queue configuration to enable
> > > > > > > > > > > > > > > > > > > > +  shared Rx queue. Polling any member port of shared Rx queue can return
> > > > > > > > > > > > > > > > > > > > +  packets of all ports in group, port ID is saved in ``mbuf.port``.
> > > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > >  Basic SR-IOV
> > > > > > > > > > > > > > > > > > > >  ------------
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > > > index 9d95cd11e1..1361ff759a 100644
> > > > > > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.c
> > > > > > > > > > > > > > > > > > > > @@ -127,6 +127,7 @@ static const struct {
> > > > > > > > > > > > > > > > > > > >  	RTE_RX_OFFLOAD_BIT2STR(OUTER_UDP_CKSUM),
> > > > > > > > > > > > > > > > > > > >  	RTE_RX_OFFLOAD_BIT2STR(RSS_HASH),
> > > > > > > > > > > > > > > > > > > >  	RTE_ETH_RX_OFFLOAD_BIT2STR(BUFFER_SPLIT),
> > > > > > > > > > > > > > > > > > > > +	RTE_ETH_RX_OFFLOAD_BIT2STR(SHARED_RXQ),
> > > > > > > > > > > > > > > > > > > >  };
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >  #undef RTE_RX_OFFLOAD_BIT2STR
> > > > > > > > > > > > > > > > > > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > > > index d2b27c351f..a578c9db9d 100644
> > > > > > > > > > > > > > > > > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > > > > > > > > > > > > > > > > @@ -1047,6 +1047,7 @@ struct rte_eth_rxconf {
> > > > > > > > > > > > > > > > > > > >  	uint8_t rx_drop_en; /**< Drop packets if no descriptors are available. */
> > > > > > > > > > > > > > > > > > > >  	uint8_t rx_deferred_start; /**< Do not start queue with rte_eth_dev_start(). */
> > > > > > > > > > > > > > > > > > > >  	uint16_t rx_nseg; /**< Number of descriptions in rx_seg array. */
> > > > > > > > > > > > > > > > > > > > +	uint32_t shared_group; /**< Shared port group index in switch domain. */
> > > > > > > > > > > > > > > > > > > >  	/**
> > > > > > > > > > > > > > > > > > > >  	 * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
> > > > > > > > > > > > > > > > > > > >  	 * Only offloads set on rx_queue_offload_capa or rx_offload_capa
> > > > > > > > > > > > > > > > > > > > @@ -1373,6 +1374,12 @@ struct rte_eth_conf {
> > > > > > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_OUTER_UDP_CKSUM  0x00040000
> > > > > > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_RSS_HASH         0x00080000
> > > > > > > > > > > > > > > > > > > >  #define RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT 0x00100000
> > > > > > > > > > > > > > > > > > > > +/**
> > > > > > > > > > > > > > > > > > > > + * Rx queue is shared among ports in same switch domain to save memory,
> > > > > > > > > > > > > > > > > > > > + * avoid polling each port. Any port in group can be used to receive packets.
> > > > > > > > > > > > > > > > > > > > + * Real source port number saved in mbuf->port field.
> > > > > > > > > > > > > > > > > > > > + */
> > > > > > > > > > > > > > > > > > > > +#define RTE_ETH_RX_OFFLOAD_SHARED_RXQ   0x00200000
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >  #define DEV_RX_OFFLOAD_CHECKSUM (DEV_RX_OFFLOAD_IPV4_CKSUM | \
> > > > > > > > > > > > > > > > > > > >  				 DEV_RX_OFFLOAD_UDP_CKSUM | \
> > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > 2.25.1
> > > > > > > > > > > > > > > > > > > >