[v2,1/2] ethdev: support queue-based priority flow control

Message ID 20220113102718.3167282-1-jerinj@marvell.com (mailing list archive)
State Superseded, archived
Delegated to: Ferruh Yigit
Headers
Series [v2,1/2] ethdev: support queue-based priority flow control |

Checks

Context Check Description
ci/checkpatch success coding style OK

Commit Message

Jerin Jacob Jan. 13, 2022, 10:27 a.m. UTC
From: Jerin Jacob <jerinj@marvell.com>

Based on device support and use-case need, there are two different ways
to enable PFC. The first case is the port level PFC configuration, in
this case, rte_eth_dev_priority_flow_ctrl_set() API shall be used to
configure the PFC, and PFC frames will be generated using based on VLAN
TC value.

The second case is the queue level PFC configuration, in this
case, Any packet field content can be used to steer the packet to the
specific queue using rte_flow or RSS and then use
rte_eth_dev_priority_flow_ctrl_queue_set() to set the TC mapping on each
queue. Based on congestion selected on the specific queue, configured TC
shall be used to generate PFC frames.

Operation of these modes are mutually exclusive, when driver sets
non zero value for rte_eth_dev_info::pfc_queue_tc_max,
application must use queue level PFC configuration via
rte_eth_dev_priority_flow_ctrl_queue_set() API instead of port level
PFC configuration via rte_eth_dev_priority_flow_ctrl_set() API to
realize PFC configuration.

This patch enables the configuration for second case a.k.a queue
based PFC also updates rte_eth_dev_priority_flow_ctrl_set()
implmentaion to adheher to rte_eth_dev_info::pfc_queue_tc_max
handling.

Also updated libabigail.abignore to ignore the update
to reserved fields in rte_eth_dev_info.

Signed-off-by: Jerin Jacob <jerinj@marvell.com>
Signed-off-by: Sunil Kumar Kori <skori@marvell.com>
---

A driver implemtion based on this API is at
https://patches.dpdk.org/project/dpdk/patch/20220111081831.881374-1-skori@marvell.com/

RFC..v1
 - Added queue based PFC config API instead port based

v1..v2
- Updated libabigail.ignore as
  has_data_member_inserted_between = {offset_of(reserved_64s), end}
- Updated doxygen comments of rte_eth_pfc_queue_conf


 devtools/libabigail.abignore           |   5 ++
 doc/guides/nics/features.rst           |   5 +-
 doc/guides/rel_notes/release_22_03.rst |   3 +
 lib/ethdev/ethdev_driver.h             |   6 +-
 lib/ethdev/rte_ethdev.c                | 109 +++++++++++++++++++++++++
 lib/ethdev/rte_ethdev.h                |  83 ++++++++++++++++++-
 lib/ethdev/version.map                 |   3 +
 7 files changed, 208 insertions(+), 6 deletions(-)
  

Comments

Ferruh Yigit Jan. 25, 2022, 5:34 p.m. UTC | #1
On 1/13/2022 10:27 AM, jerinj@marvell.com wrote:
> From: Jerin Jacob <jerinj@marvell.com>
> 
> Based on device support and use-case need, there are two different ways
> to enable PFC. The first case is the port level PFC configuration, in
> this case, rte_eth_dev_priority_flow_ctrl_set() API shall be used to
> configure the PFC, and PFC frames will be generated using based on VLAN
> TC value.
> 
> The second case is the queue level PFC configuration, in this
> case, Any packet field content can be used to steer the packet to the
> specific queue using rte_flow or RSS and then use
> rte_eth_dev_priority_flow_ctrl_queue_set() to set the TC mapping on each
> queue. Based on congestion selected on the specific queue, configured TC
> shall be used to generate PFC frames.
> > Operation of these modes are mutually exclusive, when driver sets
> non zero value for rte_eth_dev_info::pfc_queue_tc_max,
> application must use queue level PFC configuration via
> rte_eth_dev_priority_flow_ctrl_queue_set() API instead of port level
> PFC configuration via rte_eth_dev_priority_flow_ctrl_set() API to
> realize PFC configuration.
> 
> This patch enables the configuration for second case a.k.a queue
> based PFC also updates rte_eth_dev_priority_flow_ctrl_set()
> implmentaion to adheher to rte_eth_dev_info::pfc_queue_tc_max
> handling.
> 
> Also updated libabigail.abignore to ignore the update
> to reserved fields in rte_eth_dev_info.
> 
> Signed-off-by: Jerin Jacob <jerinj@marvell.com>
> Signed-off-by: Sunil Kumar Kori <skori@marvell.com>
> ---
> 
> A driver implemtion based on this API is at
> https://patches.dpdk.org/project/dpdk/patch/20220111081831.881374-1-skori@marvell.com/
> 
> RFC..v1
>   - Added queue based PFC config API instead port based
> 
> v1..v2
> - Updated libabigail.ignore as
>    has_data_member_inserted_between = {offset_of(reserved_64s), end}
> - Updated doxygen comments of rte_eth_pfc_queue_conf
> 
> 
>   devtools/libabigail.abignore           |   5 ++
>   doc/guides/nics/features.rst           |   5 +-
>   doc/guides/rel_notes/release_22_03.rst |   3 +
>   lib/ethdev/ethdev_driver.h             |   6 +-
>   lib/ethdev/rte_ethdev.c                | 109 +++++++++++++++++++++++++
>   lib/ethdev/rte_ethdev.h                |  83 ++++++++++++++++++-
>   lib/ethdev/version.map                 |   3 +
>   7 files changed, 208 insertions(+), 6 deletions(-)
> 
> diff --git a/devtools/libabigail.abignore b/devtools/libabigail.abignore
> index 4b676f317d..3bdecaaef0 100644
> --- a/devtools/libabigail.abignore
> +++ b/devtools/libabigail.abignore
> @@ -11,3 +11,8 @@
>   ; Ignore generated PMD information strings
>   [suppress_variable]
>           name_regexp = _pmd_info$
> +
> +;Ignore fields inserted in place of reserved fields of rte_eth_dev_info
> +[suppress_type]
> +       name = rte_eth_dev_info
> +       has_data_member_inserted_between = {offset_of(reserved_64s), end}
> diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
> index 27be2d2576..277a784f4e 100644
> --- a/doc/guides/nics/features.rst
> +++ b/doc/guides/nics/features.rst
> @@ -379,9 +379,10 @@ Flow control
>   Supports configuring link flow control.
>   
>   * **[implements] eth_dev_ops**: ``flow_ctrl_get``, ``flow_ctrl_set``,
> -  ``priority_flow_ctrl_set``.
> +  ``priority_flow_ctrl_set``, ``priority_flow_ctrl_queue_set``.
>   * **[related]    API**: ``rte_eth_dev_flow_ctrl_get()``, ``rte_eth_dev_flow_ctrl_set()``,
> -  ``rte_eth_dev_priority_flow_ctrl_set()``.
> +  ``rte_eth_dev_priority_flow_ctrl_set()``, ``rte_eth_dev_priority_flow_ctrl_queue_set()``.
> +* **[provides]   rte_eth_dev_info**: ``pfc_queue_tc_max``.
>   
>   
>   .. _nic_features_rate_limitation:
> diff --git a/doc/guides/rel_notes/release_22_03.rst b/doc/guides/rel_notes/release_22_03.rst
> index 6d99d1eaa9..b75c0356e6 100644
> --- a/doc/guides/rel_notes/release_22_03.rst
> +++ b/doc/guides/rel_notes/release_22_03.rst
> @@ -55,6 +55,9 @@ New Features
>        Also, make sure to start the actual text at the margin.
>        =======================================================
>   
> +* **Added an API to enable queue based priority flow ctrl(PFC).**
> +
> +  A new API, ``rte_eth_dev_priority_flow_ctrl_queue_set()``, was added.
>   
>   Removed Items
>   -------------
> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> index d95605a355..e0bbfe89d7 100644
> --- a/lib/ethdev/ethdev_driver.h
> +++ b/lib/ethdev/ethdev_driver.h
> @@ -532,6 +532,9 @@ typedef int (*flow_ctrl_set_t)(struct rte_eth_dev *dev,
>   /** @internal Setup priority flow control parameter on an Ethernet device. */
>   typedef int (*priority_flow_ctrl_set_t)(struct rte_eth_dev *dev,
>   				struct rte_eth_pfc_conf *pfc_conf);
> +/** @internal Queue setup for priority flow control parameter on an Ethernet device. */
> +typedef int (*priority_flow_ctrl_queue_set_t)(struct rte_eth_dev *dev,
> +				struct rte_eth_pfc_queue_conf *pfc_queue_conf);
>   
>   /** @internal Update RSS redirection table on an Ethernet device. */
>   typedef int (*reta_update_t)(struct rte_eth_dev *dev,
> @@ -1080,7 +1083,8 @@ struct eth_dev_ops {
>   	flow_ctrl_set_t            flow_ctrl_set; /**< Setup flow control */
>   	/** Setup priority flow control */
>   	priority_flow_ctrl_set_t   priority_flow_ctrl_set;
> -
> +	/** Priority flow control queue setup */
> +	priority_flow_ctrl_queue_set_t   priority_flow_ctrl_queue_set;
>   	/** Set Unicast Table Array */
>   	eth_uc_hash_table_set_t    uc_hash_table_set;
>   	/** Set Unicast hash bitmap */
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index a1d475a292..6def057720 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -3998,7 +3998,9 @@ int
>   rte_eth_dev_priority_flow_ctrl_set(uint16_t port_id,
>   				   struct rte_eth_pfc_conf *pfc_conf)
>   {
> +	struct rte_eth_dev_info dev_info;
>   	struct rte_eth_dev *dev;
> +	int ret;
>   
>   	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
>   	dev = &rte_eth_devices[port_id];
> @@ -4010,6 +4012,17 @@ rte_eth_dev_priority_flow_ctrl_set(uint16_t port_id,
>   		return -EINVAL;
>   	}
>   
> +	ret = rte_eth_dev_info_get(port_id, &dev_info);
> +	if (ret != 0)
> +		return ret;
> +
> +	if (dev_info.pfc_queue_tc_max != 0) {
> +		RTE_ETHDEV_LOG(ERR,
> +			"Ethdev port %u driver does not support port level PFC config\n",
> +			port_id);
> +		return -ENOTSUP;
> +	}
> +
>   	if (pfc_conf->priority > (RTE_ETH_DCB_NUM_USER_PRIORITIES - 1)) {
>   		RTE_ETHDEV_LOG(ERR, "Invalid priority, only 0-7 allowed\n");
>   		return -EINVAL;
> @@ -4022,6 +4035,102 @@ rte_eth_dev_priority_flow_ctrl_set(uint16_t port_id,
>   	return -ENOTSUP;
>   }
>   
> +static inline int
> +validate_rx_pause_config(struct rte_eth_dev_info *dev_info,
> +			 struct rte_eth_pfc_queue_conf *pfc_queue_conf)
> +{
> +	if ((pfc_queue_conf->mode == RTE_ETH_FC_RX_PAUSE) ||
> +	    (pfc_queue_conf->mode == RTE_ETH_FC_FULL)) {
> +		if (pfc_queue_conf->rx_pause.tx_qid >= dev_info->nb_tx_queues) {
> +			RTE_ETHDEV_LOG(ERR, "Tx queue not in range for Rx pause"
> +				       " (requested: %d configured: %d)\n",
> +				       pfc_queue_conf->rx_pause.tx_qid,
> +				       dev_info->nb_tx_queues);
> +			return -EINVAL;
> +		}
> +
> +		if (pfc_queue_conf->rx_pause.tc >= dev_info->pfc_queue_tc_max) {
> +			RTE_ETHDEV_LOG(ERR, "TC not in range for Rx pause"
> +				       " (requested: %d max: %d)\n",
> +				       pfc_queue_conf->rx_pause.tc,
> +				       dev_info->pfc_queue_tc_max);
> +			return -EINVAL;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static inline int
> +validate_tx_pause_config(struct rte_eth_dev_info *dev_info,
> +			 struct rte_eth_pfc_queue_conf *pfc_queue_conf)
> +{
> +	if ((pfc_queue_conf->mode == RTE_ETH_FC_TX_PAUSE) ||
> +	     (pfc_queue_conf->mode == RTE_ETH_FC_FULL)) {
> +		if (pfc_queue_conf->tx_pause.rx_qid >= dev_info->nb_rx_queues) {
> +			RTE_ETHDEV_LOG(ERR, "Rx queue not in range for Tx pause"
> +				       "(requested: %d configured: %d)\n",
> +				       pfc_queue_conf->tx_pause.rx_qid,
> +				       dev_info->nb_rx_queues);
> +			return -EINVAL;
> +		}
> +
> +		if (pfc_queue_conf->tx_pause.tc >= dev_info->pfc_queue_tc_max) {
> +			RTE_ETHDEV_LOG(ERR, "TC not in range for Tx pause"
> +				       "(requested: %d max: %d)\n",
> +				       pfc_queue_conf->tx_pause.tc,
> +				       dev_info->pfc_queue_tc_max);
> +			return -EINVAL;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +int
> +rte_eth_dev_priority_flow_ctrl_queue_set(
> +	uint16_t port_id, struct rte_eth_pfc_queue_conf *pfc_queue_conf)
> +{
> +	struct rte_eth_dev_info dev_info;
> +	struct rte_eth_dev *dev;
> +	int ret;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
> +	dev = &rte_eth_devices[port_id];
> +
> +	if (pfc_queue_conf == NULL) {
> +		RTE_ETHDEV_LOG(ERR, "PFC parameters are NULL for port (%u)\n",
> +			       port_id);
> +		return -EINVAL;
> +	}
> +
> +	ret = rte_eth_dev_info_get(port_id, &dev_info);
> +	if (ret != 0)
> +		return ret;
> +
> +	if (dev_info.pfc_queue_tc_max == 0) {
> +		RTE_ETHDEV_LOG(ERR,
> +			"Ethdev port %u does not support PFC TC values\n",
> +			port_id);
> +		return -ENOTSUP;
> +	}
> +
> +	ret = validate_rx_pause_config(&dev_info, pfc_queue_conf);
> +	if (ret != 0)
> +		return ret;
> +
> +	ret = validate_tx_pause_config(&dev_info, pfc_queue_conf);
> +	if (ret != 0)
> +		return ret;
> +
> +
> +	if (*dev->dev_ops->priority_flow_ctrl_queue_set)
> +		return eth_err(port_id,
> +			       (*dev->dev_ops->priority_flow_ctrl_queue_set)(
> +				       dev, pfc_queue_conf));
> +	return -ENOTSUP;
> +}
> +
>   static int
>   eth_check_reta_mask(struct rte_eth_rss_reta_entry64 *reta_conf,
>   			uint16_t reta_size)
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index fa299c8ad7..33e43f6e1d 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1395,6 +1395,28 @@ struct rte_eth_pfc_conf {
>   	uint8_t priority;          /**< VLAN User Priority. */
>   };
>   
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * A structure used to configure Ethernet priority flow control parameter for
> + * ethdev queues.
> + */
> +struct rte_eth_pfc_queue_conf {
> +	enum rte_eth_fc_mode mode;  /**< Link flow control mode */
> +
> +	struct {
> +		uint16_t tx_qid; /**< Tx queue ID */
> +		uint8_t tc; /**< Traffic class as per PFC (802.1Qbb) spec */
> +	} rx_pause; /* Valid when (mode == FC_RX_PAUSE || mode == FC_FULL) */
> +
> +	struct {
> +		uint16_t pause_time; /**< Pause quota in the Pause frame */
> +		uint16_t rx_qid; /**< Rx queue ID */
> +		uint8_t tc; /**< Traffic class as per PFC (802.1Qbb) spec */
> +	} tx_pause; /* Valid when (mode == FC_TX_PAUSE || mode == FC_FULL) */
> +};
> +
>   /**
>    * Tunnel type for device-specific classifier configuration.
>    * @see rte_eth_udp_tunnel
> @@ -1841,8 +1863,30 @@ struct rte_eth_dev_info {
>   	 * embedded managed interconnect/switch.
>   	 */
>   	struct rte_eth_switch_info switch_info;
> -
> -	uint64_t reserved_64s[2]; /**< Reserved for future fields */
> +	/**
> +	 * Maximum supported traffic class as per PFC (802.1Qbb) specification.
> +	 *
> +	 * Based on device support and use-case need, there are two different
> +	 * ways to enable PFC. The first case is the port level PFC
> +	 * configuration, in this case, rte_eth_dev_priority_flow_ctrl_set()
> +	 * API shall be used to configure the PFC, and PFC frames will be
> +	 * generated using based on VLAN TC value.
> +	 * The second case is the queue level PFC configuration, in this case,
> +	 * Any packet field content can be used to steer the packet to the
> +	 * specific queue using rte_flow or RSS and then use
> +	 * rte_eth_dev_priority_flow_ctrl_queue_set() to set the TC mapping
> +	 * on each queue. Based on congestion selected on the specific queue,
> +	 * configured TC shall be used to generate PFC frames.
> +	 *
> +	 * When set to non zero value, application must use queue level
> +	 * PFC configuration via rte_eth_dev_priority_flow_ctrl_queue_set() API
> +	 * instead of port level PFC configuration via
> +	 * rte_eth_dev_priority_flow_ctrl_set() API to realize
> +	 * PFC configuration.
> +	 */
> +	uint8_t pfc_queue_tc_max;


'rte_eth_dev_info_get()' is one of the APIs that anyone using ethdev needs to use.

Instead of expanding it with less used features, what do you think to have a
specific API to get the 'pfc_queue_tc_max'?

It also can be used by application to detect queue based PFC is supported by
driver or not.
Assume API is 'rte_eth_dev_priority_flow_ctrl_get()', if it returns '-ENOTSUP'
application can know that PMD doesn't support queue based PFC.


> +	uint8_t reserved_8s[7];
> +	uint64_t reserved_64s[1]; /**< Reserved for future fields */
>   	void *reserved_ptrs[2];   /**< Reserved for future fields */
>   };
>   
> @@ -4109,6 +4153,9 @@ int rte_eth_dev_flow_ctrl_set(uint16_t port_id,
>    * Configure the Ethernet priority flow control under DCB environment
>    * for Ethernet device.
>    *
> + * @see struct rte_eth_dev_info::pfc_queue_tc_max priority
> + * flow control usage models.
> + *
>    * @param port_id
>    * The port identifier of the Ethernet device.
>    * @param pfc_conf
> @@ -4119,10 +4166,40 @@ int rte_eth_dev_flow_ctrl_set(uint16_t port_id,
>    *   - (-ENODEV)  if *port_id* invalid.
>    *   - (-EINVAL)  if bad parameter
>    *   - (-EIO)     if flow control setup failure or device is removed.
> + *
>    */
>   int rte_eth_dev_priority_flow_ctrl_set(uint16_t port_id,
> -				struct rte_eth_pfc_conf *pfc_conf);
> +				       struct rte_eth_pfc_conf *pfc_conf);

Above syntax changes are not needed.

DPDK coding convention is using tabs (mostly two) for multi line function
decleration/definition, please be consistant with usage, this patch has
multiple variations.

>   
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Configure the Ethernet priority flow control for a given queue
> + * for Ethernet device.
> + *
> + * @see struct rte_eth_dev_info::pfc_queue_tc_max priority flow control
> + * usage models.
> + *
> + * @note When an ethdev port switches to PFC mode, the unconfigured

Doesit mean queue based PFC mode?

> + * queues shall be configured by the driver with default values such as
> + * lower priority value for TC etc.
> + *

I assume there is no way for application to know what the defaults values are,
also not sure if application interested in this.

> + * @param port_id
> + *   The port identifier of the Ethernet device.
> + * @param pfc_queue_conf
> + *   The pointer to the structure of the priority flow control parameters
> + *   for the queue.
> + * @return
> + *   - (0) if successful.
> + *   - (-ENOTSUP) if hardware doesn't support priority flow control mode.
> + *   - (-ENODEV)  if *port_id* invalid.
> + *   - (-EINVAL)  if bad parameter
> + *   - (-EIO)     if flow control setup queue failure
> + */
> +__rte_experimental
> +int rte_eth_dev_priority_flow_ctrl_queue_set(uint16_t port_id,
> +					     struct rte_eth_pfc_queue_conf *pfc_queue_conf);

I wonder if Rx/Tx queue id should be API arguments, to be consistent
with some other APIs, and will it help application that configures queues
in a loop.
But I can see 'rx_pause' or 'tx_pause' (in config) can be valid or not
based on the 'mode', so I understand to have queue ids in the struct.
No strong opinion.
  
Jerin Jacob Jan. 25, 2022, 6:52 p.m. UTC | #2
On Tue, Jan 25, 2022 at 11:05 PM Ferruh Yigit <ferruh.yigit@intel.com> wrote:
>
> On 1/13/2022 10:27 AM, jerinj@marvell.com wrote:
> > From: Jerin Jacob <jerinj@marvell.com>
> >
> > Based on device support and use-case need, there are two different ways
> > to enable PFC. The first case is the port level PFC configuration, in
> > this case, rte_eth_dev_priority_flow_ctrl_set() API shall be used to
> > configure the PFC, and PFC frames will be generated using based on VLAN
> > TC value.
> >
> > The second case is the queue level PFC configuration, in this
> > case, Any packet field content can be used to steer the packet to the
> > specific queue using rte_flow or RSS and then use
> > rte_eth_dev_priority_flow_ctrl_queue_set() to set the TC mapping on each
> > queue. Based on congestion selected on the specific queue, configured TC
> > shall be used to generate PFC frames.
> > > Operation of these modes are mutually exclusive, when driver sets
> > non zero value for rte_eth_dev_info::pfc_queue_tc_max,
> > application must use queue level PFC configuration via
> > rte_eth_dev_priority_flow_ctrl_queue_set() API instead of port level
> > PFC configuration via rte_eth_dev_priority_flow_ctrl_set() API to
> > realize PFC configuration.
> >
> > This patch enables the configuration for second case a.k.a queue
> > based PFC also updates rte_eth_dev_priority_flow_ctrl_set()
> > implmentaion to adheher to rte_eth_dev_info::pfc_queue_tc_max
> > handling.
> >
> > Also updated libabigail.abignore to ignore the update
> > to reserved fields in rte_eth_dev_info.
> >
> > Signed-off-by: Jerin Jacob <jerinj@marvell.com>
> > Signed-off-by: Sunil Kumar Kori <skori@marvell.com>
> > ---
> >
> > A driver implemtion based on this API is at
> > https://patches.dpdk.org/project/dpdk/patch/20220111081831.881374-1-skori@marvell.com/
> >
> > RFC..v1
> >   - Added queue based PFC config API instead port based
> >
> > v1..v2

> > +
> >   /**
> >    * Tunnel type for device-specific classifier configuration.
> >    * @see rte_eth_udp_tunnel
> > @@ -1841,8 +1863,30 @@ struct rte_eth_dev_info {
> >        * embedded managed interconnect/switch.
> >        */
> >       struct rte_eth_switch_info switch_info;
> > -
> > -     uint64_t reserved_64s[2]; /**< Reserved for future fields */
> > +     /**
> > +      * Maximum supported traffic class as per PFC (802.1Qbb) specification.
> > +      *
> > +      * Based on device support and use-case need, there are two different
> > +      * ways to enable PFC. The first case is the port level PFC
> > +      * configuration, in this case, rte_eth_dev_priority_flow_ctrl_set()
> > +      * API shall be used to configure the PFC, and PFC frames will be
> > +      * generated using based on VLAN TC value.
> > +      * The second case is the queue level PFC configuration, in this case,
> > +      * Any packet field content can be used to steer the packet to the
> > +      * specific queue using rte_flow or RSS and then use
> > +      * rte_eth_dev_priority_flow_ctrl_queue_set() to set the TC mapping
> > +      * on each queue. Based on congestion selected on the specific queue,
> > +      * configured TC shall be used to generate PFC frames.
> > +      *
> > +      * When set to non zero value, application must use queue level
> > +      * PFC configuration via rte_eth_dev_priority_flow_ctrl_queue_set() API
> > +      * instead of port level PFC configuration via
> > +      * rte_eth_dev_priority_flow_ctrl_set() API to realize
> > +      * PFC configuration.
> > +      */
> > +     uint8_t pfc_queue_tc_max;
>
>
> 'rte_eth_dev_info_get()' is one of the APIs that anyone using ethdev needs to use.
>
> Instead of expanding it with less used features, what do you think to have a
> specific API to get the 'pfc_queue_tc_max'?


OK. rte_eth_dev_info_get() was general theme used in ethdev API. Fine
for new API
specifically for this.

>
> It also can be used by application to detect queue based PFC is supported by
> driver or not.

OK.

> Assume API is 'rte_eth_dev_priority_flow_ctrl_get()', if it returns '-ENOTSUP'
> application can know that PMD doesn't support queue based PFC.

OK.

>
>
> > +     uint8_t reserved_8s[7];
> > +     uint64_t reserved_64s[1]; /**< Reserved for future fields */
> >       void *reserved_ptrs[2];   /**< Reserved for future fields */
> >   };
> >
> > @@ -4109,6 +4153,9 @@ int rte_eth_dev_flow_ctrl_set(uint16_t port_id,
> >    * Configure the Ethernet priority flow control under DCB environment
> >    * for Ethernet device.
> >    *
> > + * @see struct rte_eth_dev_info::pfc_queue_tc_max priority
> > + * flow control usage models.
> > + *
> >    * @param port_id
> >    * The port identifier of the Ethernet device.
> >    * @param pfc_conf
> > @@ -4119,10 +4166,40 @@ int rte_eth_dev_flow_ctrl_set(uint16_t port_id,
> >    *   - (-ENODEV)  if *port_id* invalid.
> >    *   - (-EINVAL)  if bad parameter
> >    *   - (-EIO)     if flow control setup failure or device is removed.
> > + *
> >    */
> >   int rte_eth_dev_priority_flow_ctrl_set(uint16_t port_id,
> > -                             struct rte_eth_pfc_conf *pfc_conf);
> > +                                    struct rte_eth_pfc_conf *pfc_conf);
>
> Above syntax changes are not needed.
>
> DPDK coding convention is using tabs (mostly two) for multi line function
> decleration/definition, please be consistant with usage, this patch has
> multiple variations.

Ack.

>
> >
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change without prior notice.
> > + *
> > + * Configure the Ethernet priority flow control for a given queue
> > + * for Ethernet device.
> > + *
> > + * @see struct rte_eth_dev_info::pfc_queue_tc_max priority flow control
> > + * usage models.
> > + *
> > + * @note When an ethdev port switches to PFC mode, the unconfigured
>
> Doesit mean queue based PFC mode?

Yes.  I will change to PFC mode -> queue-based PFC mode.

>
> > + * queues shall be configured by the driver with default values such as
> > + * lower priority value for TC etc.
> > + *
>
> I assume there is no way for application to know what the defaults values are,
> also not sure if application interested in this.

Yes. I don't think the application cares. But added in the doc to
clarify the state.

>
> > + * @param port_id
> > + *   The port identifier of the Ethernet device.
> > + * @param pfc_queue_conf
> > + *   The pointer to the structure of the priority flow control parameters
> > + *   for the queue.
> > + * @return
> > + *   - (0) if successful.
> > + *   - (-ENOTSUP) if hardware doesn't support priority flow control mode.
> > + *   - (-ENODEV)  if *port_id* invalid.
> > + *   - (-EINVAL)  if bad parameter
> > + *   - (-EIO)     if flow control setup queue failure
> > + */
> > +__rte_experimental
> > +int rte_eth_dev_priority_flow_ctrl_queue_set(uint16_t port_id,
> > +                                          struct rte_eth_pfc_queue_conf *pfc_queue_conf);
>
> I wonder if Rx/Tx queue id should be API arguments, to be consistent
> with some other APIs, and will it help application that configures queues
> in a loop.
> But I can see 'rx_pause' or 'tx_pause' (in config) can be valid or not
> based on the 'mode', so I understand to have queue ids in the struct.
> No strong opinion.

Yes. I keep it as is.
  

Patch

diff --git a/devtools/libabigail.abignore b/devtools/libabigail.abignore
index 4b676f317d..3bdecaaef0 100644
--- a/devtools/libabigail.abignore
+++ b/devtools/libabigail.abignore
@@ -11,3 +11,8 @@ 
 ; Ignore generated PMD information strings
 [suppress_variable]
         name_regexp = _pmd_info$
+
+;Ignore fields inserted in place of reserved fields of rte_eth_dev_info
+[suppress_type]
+       name = rte_eth_dev_info
+       has_data_member_inserted_between = {offset_of(reserved_64s), end}
diff --git a/doc/guides/nics/features.rst b/doc/guides/nics/features.rst
index 27be2d2576..277a784f4e 100644
--- a/doc/guides/nics/features.rst
+++ b/doc/guides/nics/features.rst
@@ -379,9 +379,10 @@  Flow control
 Supports configuring link flow control.
 
 * **[implements] eth_dev_ops**: ``flow_ctrl_get``, ``flow_ctrl_set``,
-  ``priority_flow_ctrl_set``.
+  ``priority_flow_ctrl_set``, ``priority_flow_ctrl_queue_set``.
 * **[related]    API**: ``rte_eth_dev_flow_ctrl_get()``, ``rte_eth_dev_flow_ctrl_set()``,
-  ``rte_eth_dev_priority_flow_ctrl_set()``.
+  ``rte_eth_dev_priority_flow_ctrl_set()``, ``rte_eth_dev_priority_flow_ctrl_queue_set()``.
+* **[provides]   rte_eth_dev_info**: ``pfc_queue_tc_max``.
 
 
 .. _nic_features_rate_limitation:
diff --git a/doc/guides/rel_notes/release_22_03.rst b/doc/guides/rel_notes/release_22_03.rst
index 6d99d1eaa9..b75c0356e6 100644
--- a/doc/guides/rel_notes/release_22_03.rst
+++ b/doc/guides/rel_notes/release_22_03.rst
@@ -55,6 +55,9 @@  New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added an API to enable queue based priority flow ctrl(PFC).**
+
+  A new API, ``rte_eth_dev_priority_flow_ctrl_queue_set()``, was added.
 
 Removed Items
 -------------
diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index d95605a355..e0bbfe89d7 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -532,6 +532,9 @@  typedef int (*flow_ctrl_set_t)(struct rte_eth_dev *dev,
 /** @internal Setup priority flow control parameter on an Ethernet device. */
 typedef int (*priority_flow_ctrl_set_t)(struct rte_eth_dev *dev,
 				struct rte_eth_pfc_conf *pfc_conf);
+/** @internal Queue setup for priority flow control parameter on an Ethernet device. */
+typedef int (*priority_flow_ctrl_queue_set_t)(struct rte_eth_dev *dev,
+				struct rte_eth_pfc_queue_conf *pfc_queue_conf);
 
 /** @internal Update RSS redirection table on an Ethernet device. */
 typedef int (*reta_update_t)(struct rte_eth_dev *dev,
@@ -1080,7 +1083,8 @@  struct eth_dev_ops {
 	flow_ctrl_set_t            flow_ctrl_set; /**< Setup flow control */
 	/** Setup priority flow control */
 	priority_flow_ctrl_set_t   priority_flow_ctrl_set;
-
+	/** Priority flow control queue setup */
+	priority_flow_ctrl_queue_set_t   priority_flow_ctrl_queue_set;
 	/** Set Unicast Table Array */
 	eth_uc_hash_table_set_t    uc_hash_table_set;
 	/** Set Unicast hash bitmap */
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index a1d475a292..6def057720 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -3998,7 +3998,9 @@  int
 rte_eth_dev_priority_flow_ctrl_set(uint16_t port_id,
 				   struct rte_eth_pfc_conf *pfc_conf)
 {
+	struct rte_eth_dev_info dev_info;
 	struct rte_eth_dev *dev;
+	int ret;
 
 	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
 	dev = &rte_eth_devices[port_id];
@@ -4010,6 +4012,17 @@  rte_eth_dev_priority_flow_ctrl_set(uint16_t port_id,
 		return -EINVAL;
 	}
 
+	ret = rte_eth_dev_info_get(port_id, &dev_info);
+	if (ret != 0)
+		return ret;
+
+	if (dev_info.pfc_queue_tc_max != 0) {
+		RTE_ETHDEV_LOG(ERR,
+			"Ethdev port %u driver does not support port level PFC config\n",
+			port_id);
+		return -ENOTSUP;
+	}
+
 	if (pfc_conf->priority > (RTE_ETH_DCB_NUM_USER_PRIORITIES - 1)) {
 		RTE_ETHDEV_LOG(ERR, "Invalid priority, only 0-7 allowed\n");
 		return -EINVAL;
@@ -4022,6 +4035,102 @@  rte_eth_dev_priority_flow_ctrl_set(uint16_t port_id,
 	return -ENOTSUP;
 }
 
+static inline int
+validate_rx_pause_config(struct rte_eth_dev_info *dev_info,
+			 struct rte_eth_pfc_queue_conf *pfc_queue_conf)
+{
+	if ((pfc_queue_conf->mode == RTE_ETH_FC_RX_PAUSE) ||
+	    (pfc_queue_conf->mode == RTE_ETH_FC_FULL)) {
+		if (pfc_queue_conf->rx_pause.tx_qid >= dev_info->nb_tx_queues) {
+			RTE_ETHDEV_LOG(ERR, "Tx queue not in range for Rx pause"
+				       " (requested: %d configured: %d)\n",
+				       pfc_queue_conf->rx_pause.tx_qid,
+				       dev_info->nb_tx_queues);
+			return -EINVAL;
+		}
+
+		if (pfc_queue_conf->rx_pause.tc >= dev_info->pfc_queue_tc_max) {
+			RTE_ETHDEV_LOG(ERR, "TC not in range for Rx pause"
+				       " (requested: %d max: %d)\n",
+				       pfc_queue_conf->rx_pause.tc,
+				       dev_info->pfc_queue_tc_max);
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static inline int
+validate_tx_pause_config(struct rte_eth_dev_info *dev_info,
+			 struct rte_eth_pfc_queue_conf *pfc_queue_conf)
+{
+	if ((pfc_queue_conf->mode == RTE_ETH_FC_TX_PAUSE) ||
+	     (pfc_queue_conf->mode == RTE_ETH_FC_FULL)) {
+		if (pfc_queue_conf->tx_pause.rx_qid >= dev_info->nb_rx_queues) {
+			RTE_ETHDEV_LOG(ERR, "Rx queue not in range for Tx pause"
+				       "(requested: %d configured: %d)\n",
+				       pfc_queue_conf->tx_pause.rx_qid,
+				       dev_info->nb_rx_queues);
+			return -EINVAL;
+		}
+
+		if (pfc_queue_conf->tx_pause.tc >= dev_info->pfc_queue_tc_max) {
+			RTE_ETHDEV_LOG(ERR, "TC not in range for Tx pause"
+				       "(requested: %d max: %d)\n",
+				       pfc_queue_conf->tx_pause.tc,
+				       dev_info->pfc_queue_tc_max);
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+int
+rte_eth_dev_priority_flow_ctrl_queue_set(
+	uint16_t port_id, struct rte_eth_pfc_queue_conf *pfc_queue_conf)
+{
+	struct rte_eth_dev_info dev_info;
+	struct rte_eth_dev *dev;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+	dev = &rte_eth_devices[port_id];
+
+	if (pfc_queue_conf == NULL) {
+		RTE_ETHDEV_LOG(ERR, "PFC parameters are NULL for port (%u)\n",
+			       port_id);
+		return -EINVAL;
+	}
+
+	ret = rte_eth_dev_info_get(port_id, &dev_info);
+	if (ret != 0)
+		return ret;
+
+	if (dev_info.pfc_queue_tc_max == 0) {
+		RTE_ETHDEV_LOG(ERR,
+			"Ethdev port %u does not support PFC TC values\n",
+			port_id);
+		return -ENOTSUP;
+	}
+
+	ret = validate_rx_pause_config(&dev_info, pfc_queue_conf);
+	if (ret != 0)
+		return ret;
+
+	ret = validate_tx_pause_config(&dev_info, pfc_queue_conf);
+	if (ret != 0)
+		return ret;
+
+
+	if (*dev->dev_ops->priority_flow_ctrl_queue_set)
+		return eth_err(port_id,
+			       (*dev->dev_ops->priority_flow_ctrl_queue_set)(
+				       dev, pfc_queue_conf));
+	return -ENOTSUP;
+}
+
 static int
 eth_check_reta_mask(struct rte_eth_rss_reta_entry64 *reta_conf,
 			uint16_t reta_size)
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index fa299c8ad7..33e43f6e1d 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1395,6 +1395,28 @@  struct rte_eth_pfc_conf {
 	uint8_t priority;          /**< VLAN User Priority. */
 };
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
+ *
+ * A structure used to configure Ethernet priority flow control parameter for
+ * ethdev queues.
+ */
+struct rte_eth_pfc_queue_conf {
+	enum rte_eth_fc_mode mode;  /**< Link flow control mode */
+
+	struct {
+		uint16_t tx_qid; /**< Tx queue ID */
+		uint8_t tc; /**< Traffic class as per PFC (802.1Qbb) spec */
+	} rx_pause; /* Valid when (mode == FC_RX_PAUSE || mode == FC_FULL) */
+
+	struct {
+		uint16_t pause_time; /**< Pause quota in the Pause frame */
+		uint16_t rx_qid; /**< Rx queue ID */
+		uint8_t tc; /**< Traffic class as per PFC (802.1Qbb) spec */
+	} tx_pause; /* Valid when (mode == FC_TX_PAUSE || mode == FC_FULL) */
+};
+
 /**
  * Tunnel type for device-specific classifier configuration.
  * @see rte_eth_udp_tunnel
@@ -1841,8 +1863,30 @@  struct rte_eth_dev_info {
 	 * embedded managed interconnect/switch.
 	 */
 	struct rte_eth_switch_info switch_info;
-
-	uint64_t reserved_64s[2]; /**< Reserved for future fields */
+	/**
+	 * Maximum supported traffic class as per PFC (802.1Qbb) specification.
+	 *
+	 * Based on device support and use-case need, there are two different
+	 * ways to enable PFC. The first case is the port level PFC
+	 * configuration, in this case, rte_eth_dev_priority_flow_ctrl_set()
+	 * API shall be used to configure the PFC, and PFC frames will be
+	 * generated using based on VLAN TC value.
+	 * The second case is the queue level PFC configuration, in this case,
+	 * Any packet field content can be used to steer the packet to the
+	 * specific queue using rte_flow or RSS and then use
+	 * rte_eth_dev_priority_flow_ctrl_queue_set() to set the TC mapping
+	 * on each queue. Based on congestion selected on the specific queue,
+	 * configured TC shall be used to generate PFC frames.
+	 *
+	 * When set to non zero value, application must use queue level
+	 * PFC configuration via rte_eth_dev_priority_flow_ctrl_queue_set() API
+	 * instead of port level PFC configuration via
+	 * rte_eth_dev_priority_flow_ctrl_set() API to realize
+	 * PFC configuration.
+	 */
+	uint8_t pfc_queue_tc_max;
+	uint8_t reserved_8s[7];
+	uint64_t reserved_64s[1]; /**< Reserved for future fields */
 	void *reserved_ptrs[2];   /**< Reserved for future fields */
 };
 
@@ -4109,6 +4153,9 @@  int rte_eth_dev_flow_ctrl_set(uint16_t port_id,
  * Configure the Ethernet priority flow control under DCB environment
  * for Ethernet device.
  *
+ * @see struct rte_eth_dev_info::pfc_queue_tc_max priority
+ * flow control usage models.
+ *
  * @param port_id
  * The port identifier of the Ethernet device.
  * @param pfc_conf
@@ -4119,10 +4166,40 @@  int rte_eth_dev_flow_ctrl_set(uint16_t port_id,
  *   - (-ENODEV)  if *port_id* invalid.
  *   - (-EINVAL)  if bad parameter
  *   - (-EIO)     if flow control setup failure or device is removed.
+ *
  */
 int rte_eth_dev_priority_flow_ctrl_set(uint16_t port_id,
-				struct rte_eth_pfc_conf *pfc_conf);
+				       struct rte_eth_pfc_conf *pfc_conf);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Configure the Ethernet priority flow control for a given queue
+ * for Ethernet device.
+ *
+ * @see struct rte_eth_dev_info::pfc_queue_tc_max priority flow control
+ * usage models.
+ *
+ * @note When an ethdev port switches to PFC mode, the unconfigured
+ * queues shall be configured by the driver with default values such as
+ * lower priority value for TC etc.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ * @param pfc_queue_conf
+ *   The pointer to the structure of the priority flow control parameters
+ *   for the queue.
+ * @return
+ *   - (0) if successful.
+ *   - (-ENOTSUP) if hardware doesn't support priority flow control mode.
+ *   - (-ENODEV)  if *port_id* invalid.
+ *   - (-EINVAL)  if bad parameter
+ *   - (-EIO)     if flow control setup queue failure
+ */
+__rte_experimental
+int rte_eth_dev_priority_flow_ctrl_queue_set(uint16_t port_id,
+					     struct rte_eth_pfc_queue_conf *pfc_queue_conf);
 /**
  * Add a MAC address to the set used for filtering incoming packets.
  *
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index c2fb0669a4..8f361ec15a 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -256,6 +256,9 @@  EXPERIMENTAL {
 	rte_flow_flex_item_create;
 	rte_flow_flex_item_release;
 	rte_flow_pick_transfer_proxy;
+
+	# added in 22.03
+	rte_eth_dev_priority_flow_ctrl_queue_set;
 };
 
 INTERNAL {