[dpdk-dev,v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598

Message ID 1439489195-31553-1-git-send-email-vladz@cloudius-systems.com (mailing list archive)
State Superseded, archived

Commit Message

Vladislav Zolotarov Aug. 13, 2015, 6:06 p.m. UTC
  According to the 82599 and x540 HW specifications, the RS bit *must* be
set in the last descriptor of *every* packet.

This patch fixes the Tx hang we were constantly hitting with a
seastar-based application on x540 NIC.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
---
 drivers/net/ixgbe/ixgbe_ethdev.c |  9 +++++++++
 drivers/net/ixgbe/ixgbe_rxtx.c   | 23 ++++++++++++++++++++++-
 2 files changed, 31 insertions(+), 1 deletion(-)
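
For illustration only (not part of the patch): a minimal sketch of what the new default means at the application level, i.e. requesting per-packet write-back explicitly at queue-setup time. port_id, queue_id and nb_txd are placeholders.

#include <rte_ethdev.h>
#include <rte_lcore.h>

/*
 * Illustrative sketch: request RS/write-back on every packet by passing
 * tx_rs_thresh = 1; with this patch the default_txconf reported for
 * 82599/x540 already carries that value.
 */
static int
setup_txq_rs_per_packet(uint8_t port_id, uint16_t queue_id, uint16_t nb_txd)
{
        struct rte_eth_dev_info dev_info;
        struct rte_eth_txconf txconf;

        rte_eth_dev_info_get(port_id, &dev_info);

        txconf = dev_info.default_txconf;
        txconf.tx_rs_thresh = 1;  /* RS set on the last TXD of each packet */

        return rte_eth_tx_queue_setup(port_id, queue_id, nb_txd,
                                      rte_socket_id(), &txconf);
}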
  

Comments

Zhang, Helin Aug. 13, 2015, 8:28 p.m. UTC | #1
Hi Vlad

I don't think the changes are needed. The datasheet says that the RS bit should be
set on the last descriptor of every packet ONLY WHEN TXDCTL.WTHRESH equals ZERO.

Regards,
Helin

> -----Original Message-----
> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
> Sent: Thursday, August 13, 2015 11:07 AM
> To: dev@dpdk.org
> Cc: Zhang, Helin; Ananyev, Konstantin; avi@cloudius-systems.com; Vlad
> Zolotarov
> Subject: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all
> NICs but 82598
> 
> According to 82599 and x540 HW specifications RS bit *must* be set in the last
> descriptor of *every* packet.
There is a condition attached: only if TXDCTL.WTHRESH equals zero.

> 
> This patch fixes the Tx hang we were constantly hitting with a seastar-based
> application on x540 NIC.
Could you help share with us how to reproduce the Tx hang issue using
typical DPDK examples?

> 
> Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
> ---
>  drivers/net/ixgbe/ixgbe_ethdev.c |  9 +++++++++
>  drivers/net/ixgbe/ixgbe_rxtx.c   | 23 ++++++++++++++++++++++-
>  2 files changed, 31 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c
> b/drivers/net/ixgbe/ixgbe_ethdev.c
> index b8ee1e9..6714fd9 100644
> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev, struct
> rte_eth_dev_info *dev_info)
>  		.txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |
>  				ETH_TXQ_FLAGS_NOOFFLOADS,
>  	};
> +
> +	/*
> +	 * According to 82599 and x540 specifications RS bit *must* be set on the
> +	 * last descriptor of *every* packet. Therefore we will not allow the
> +	 * tx_rs_thresh above 1 for all NICs newer than 82598.
> +	 */
> +	if (hw->mac.type > ixgbe_mac_82598EB)
> +		dev_info->default_txconf.tx_rs_thresh = 1;
> +
>  	dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX * sizeof(uint32_t);
>  	dev_info->reta_size = ETH_RSS_RETA_SIZE_128;
>  	dev_info->flow_type_rss_offloads = IXGBE_RSS_OFFLOAD_ALL; diff --git
> a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c index
> 91023b9..8dbdffc 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev
> *dev,
>  	struct ixgbe_tx_queue *txq;
>  	struct ixgbe_hw     *hw;
>  	uint16_t tx_rs_thresh, tx_free_thresh;
> +	bool rs_deferring_allowed;
> 
>  	PMD_INIT_FUNC_TRACE();
>  	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> 
>  	/*
> +	 * According to 82599 and x540 specifications RS bit *must* be set on the
> +	 * last descriptor of *every* packet. Therefore we will not allow the
> +	 * tx_rs_thresh above 1 for all NICs newer than 82598.
> +	 */
> +	rs_deferring_allowed = (hw->mac.type <= ixgbe_mac_82598EB);
> +
> +	/*
>  	 * Validate number of transmit descriptors.
>  	 * It must not exceed hardware maximum, and must be multiple
>  	 * of IXGBE_ALIGN.
> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev *dev,
>  	 * to transmit a packet is greater than the number of free TX
>  	 * descriptors.
>  	 * The following constraints must be satisfied:
> +	 *  tx_rs_thresh must be less than 2 for NICs for which RS deferring is
> +	 *  forbidden (all but 82598).
>  	 *  tx_rs_thresh must be greater than 0.
>  	 *  tx_rs_thresh must be less than the size of the ring minus 2.
>  	 *  tx_rs_thresh must be less than or equal to tx_free_thresh.
> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev *dev,
>  	 * When set to zero use default values.
>  	 */
>  	tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?
> -			tx_conf->tx_rs_thresh : DEFAULT_TX_RS_THRESH);
> +			tx_conf->tx_rs_thresh :
> +			(rs_deferring_allowed ? DEFAULT_TX_RS_THRESH : 1));
>  	tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?
>  			tx_conf->tx_free_thresh : DEFAULT_TX_FREE_THRESH);
> +
> +	if (!rs_deferring_allowed && tx_rs_thresh > 1) {
> +		PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than 2 since RS "
> +				  "must be set for every packet for this HW. "
> +				  "(tx_rs_thresh=%u port=%d queue=%d)",
> +			     (unsigned int)tx_rs_thresh,
> +			     (int)dev->data->port_id, (int)queue_idx);
> +		return -(EINVAL);
> +	}
> +
>  	if (tx_rs_thresh >= (nb_desc - 2)) {
>  		PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than the number "
>  			     "of TX descriptors minus 2. (tx_rs_thresh=%u "
> --
> 2.1.0
  
Vladislav Zolotarov Aug. 14, 2015, 5:37 a.m. UTC | #2
On 08/13/15 23:28, Zhang, Helin wrote:
> Hi Vlad
>
> I don't think the changes are needed. It says in datasheet that the RS bit should be
> set on the last descriptor of every packet, ONLY WHEN TXDCTL.WTHRESH equals to ZERO.

Of course it's needed! See below.
Exactly the same spec a few lines above the place u've just quoted states:

"Software should not set the RS bit when TXDCTL.WTHRESH is greater than zero."

And since all three (3) ixgbe xmit callbacks rely on the RS bit, the 
ixgbe PMD actually does not support any value of WTHRESH 
other than zero.
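
For illustration, a minimal sketch of where WTHRESH ends up (register macros from the ixgbe base code, simplified from the queue-start path) -- with RS-bit based completion tracking the field has to stay 0:

#include "ixgbe_ethdev.h"  /* pulls in the base-code register macros */

/* Sketch only: WTHRESH lives in TXDCTL bits 22:16; keeping it at zero
 * is what keeps RS usage within the datasheet rule quoted above. */
static void
txdctl_force_wthresh_zero(struct ixgbe_hw *hw, uint8_t reg_idx)
{
        uint32_t txdctl = IXGBE_READ_REG(hw, IXGBE_TXDCTL(reg_idx));

        txdctl &= ~(0x7F << 16);        /* WTHRESH = 0 */
        IXGBE_WRITE_REG(hw, IXGBE_TXDCTL(reg_idx), txdctl);
}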

>
> Regards,
> Helin
>
>> -----Original Message-----
>> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
>> Sent: Thursday, August 13, 2015 11:07 AM
>> To: dev@dpdk.org
>> Cc: Zhang, Helin; Ananyev, Konstantin; avi@cloudius-systems.com; Vlad
>> Zolotarov
>> Subject: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all
>> NICs but 82598
>>
>> According to 82599 and x540 HW specifications RS bit *must* be set in the last
>> descriptor of *every* packet.
> There is a condition that if TXDCTL.WTHRESH equal to zero.

Right, and the ixgbe PMD requires this condition to be fulfilled in order to 
function. See above.

>
>> This patch fixes the Tx hang we were constantly hitting with a seastar-based
>> application on x540 NIC.
> Could you help to share with us how to reproduce the tx hang issue, with using
> typical DPDK examples?

Sorry. I'm not very familiar with the typical DPDK examples to help u 
here. However this is quite irrelevant, since without this patch the 
ixgbe PMD obviously abuses the HW spec, as has been explained above.

We saw the issue when we stressed the xmit path with a lot of highly 
fragmented TCP frames (packets with up to 33 fragments, with non-header 
fragments as small as 4 bytes) with all offload features enabled.
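
For illustration only, a hypothetical DPDK-level sketch of traffic with that shape (chained mbufs with many tiny segments); mp/port/queue are placeholders, error handling and offload flags are omitted -- our actual reproducer was the seastar application, not this code:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

static void
xmit_highly_fragmented(struct rte_mempool *mp, uint8_t port, uint16_t queue)
{
        struct rte_mbuf *head = rte_pktmbuf_alloc(mp);
        struct rte_mbuf *prev = head;
        int i;

        rte_pktmbuf_append(head, 66);           /* header segment */

        for (i = 0; i < 32; i++) {              /* 32 tiny payload segments */
                struct rte_mbuf *seg = rte_pktmbuf_alloc(mp);

                rte_pktmbuf_append(seg, 4);     /* 4-byte fragment */
                prev->next = seg;
                prev = seg;
                head->nb_segs++;
                head->pkt_len += seg->data_len;
        }

        rte_eth_tx_burst(port, queue, &head, 1);
}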

Thanks,
vlad
>
>> Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
>> ---
>>   drivers/net/ixgbe/ixgbe_ethdev.c |  9 +++++++++
>>   drivers/net/ixgbe/ixgbe_rxtx.c   | 23 ++++++++++++++++++++++-
>>   2 files changed, 31 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c
>> b/drivers/net/ixgbe/ixgbe_ethdev.c
>> index b8ee1e9..6714fd9 100644
>> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
>> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
>> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev, struct
>> rte_eth_dev_info *dev_info)
>>   		.txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |
>>   				ETH_TXQ_FLAGS_NOOFFLOADS,
>>   	};
>> +
>> +	/*
>> +	 * According to 82599 and x540 specifications RS bit *must* be set on the
>> +	 * last descriptor of *every* packet. Therefore we will not allow the
>> +	 * tx_rs_thresh above 1 for all NICs newer than 82598.
>> +	 */
>> +	if (hw->mac.type > ixgbe_mac_82598EB)
>> +		dev_info->default_txconf.tx_rs_thresh = 1;
>> +
>>   	dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX * sizeof(uint32_t);
>>   	dev_info->reta_size = ETH_RSS_RETA_SIZE_128;
>>   	dev_info->flow_type_rss_offloads = IXGBE_RSS_OFFLOAD_ALL; diff --git
>> a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c index
>> 91023b9..8dbdffc 100644
>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev
>> *dev,
>>   	struct ixgbe_tx_queue *txq;
>>   	struct ixgbe_hw     *hw;
>>   	uint16_t tx_rs_thresh, tx_free_thresh;
>> +	bool rs_deferring_allowed;
>>
>>   	PMD_INIT_FUNC_TRACE();
>>   	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>>
>>   	/*
>> +	 * According to 82599 and x540 specifications RS bit *must* be set on the
>> +	 * last descriptor of *every* packet. Therefore we will not allow the
>> +	 * tx_rs_thresh above 1 for all NICs newer than 82598.
>> +	 */
>> +	rs_deferring_allowed = (hw->mac.type <= ixgbe_mac_82598EB);
>> +
>> +	/*
>>   	 * Validate number of transmit descriptors.
>>   	 * It must not exceed hardware maximum, and must be multiple
>>   	 * of IXGBE_ALIGN.
>> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev *dev,
>>   	 * to transmit a packet is greater than the number of free TX
>>   	 * descriptors.
>>   	 * The following constraints must be satisfied:
>> +	 *  tx_rs_thresh must be less than 2 for NICs for which RS deferring is
>> +	 *  forbidden (all but 82598).
>>   	 *  tx_rs_thresh must be greater than 0.
>>   	 *  tx_rs_thresh must be less than the size of the ring minus 2.
>>   	 *  tx_rs_thresh must be less than or equal to tx_free_thresh.
>> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev *dev,
>>   	 * When set to zero use default values.
>>   	 */
>>   	tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?
>> -			tx_conf->tx_rs_thresh : DEFAULT_TX_RS_THRESH);
>> +			tx_conf->tx_rs_thresh :
>> +			(rs_deferring_allowed ? DEFAULT_TX_RS_THRESH : 1));
>>   	tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?
>>   			tx_conf->tx_free_thresh : DEFAULT_TX_FREE_THRESH);
>> +
>> +	if (!rs_deferring_allowed && tx_rs_thresh > 1) {
>> +		PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than 2 since RS "
>> +				  "must be set for every packet for this HW. "
>> +				  "(tx_rs_thresh=%u port=%d queue=%d)",
>> +			     (unsigned int)tx_rs_thresh,
>> +			     (int)dev->data->port_id, (int)queue_idx);
>> +		return -(EINVAL);
>> +	}
>> +
>>   	if (tx_rs_thresh >= (nb_desc - 2)) {
>>   		PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than the number "
>>   			     "of TX descriptors minus 2. (tx_rs_thresh=%u "
>> --
>> 2.1.0
  
Wenzhuo Lu Aug. 19, 2015, 12:42 a.m. UTC | #3
Hi Helin,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Vlad Zolotarov
> Sent: Friday, August 14, 2015 1:38 PM
> To: Zhang, Helin; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for
> all NICs but 82598
> 
> 
> 
> On 08/13/15 23:28, Zhang, Helin wrote:
> > Hi Vlad
> >
> > I don't think the changes are needed. It says in datasheet that the RS
> > bit should be set on the last descriptor of every packet, ONLY WHEN
> TXDCTL.WTHRESH equals to ZERO.
> 
> Of course it's needed! See below.
> Exactly the same spec a few lines above the place u've just quoted states:
> 
> "Software should not set the RS bit when TXDCTL.WTHRESH is greater than
> zero."
> 
> And since all three (3) ixgbe xmit callbacks are utilizing RS bit notation ixgbe PMD
> is actually not supporting any value of WTHRESH different from zero.
I think Vlad is right. We need to fix this issue. Any suggestion? If not, I'd like to ack this patch.

> 
> >
> > Regards,
> > Helin
> >
> >> -----Original Message-----
> >> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
> >> Sent: Thursday, August 13, 2015 11:07 AM
> >> To: dev@dpdk.org
> >> Cc: Zhang, Helin; Ananyev, Konstantin; avi@cloudius-systems.com; Vlad
> >> Zolotarov
> >> Subject: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for
> all
> >> NICs but 82598
> >>
> >> According to 82599 and x540 HW specifications RS bit *must* be set in the
> last
> >> descriptor of *every* packet.
> > There is a condition that if TXDCTL.WTHRESH equal to zero.
> 
> Right and ixgbe PMD requires this condition to be fulfilled in order to
> function. See above.
> 
> >
> >> This patch fixes the Tx hang we were constantly hitting with a seastar-based
> >> application on x540 NIC.
> > Could you help to share with us how to reproduce the tx hang issue, with using
> > typical DPDK examples?
> 
> Sorry. I'm not very familiar with the typical DPDK examples to help u
> here. However this is quite irrelevant since without this this patch
> ixgbe PMD obviously abuses the HW spec as has been explained above.
> 
> We saw the issue when u stressed the xmit path with a lot of highly
> fragmented TCP frames (packets with up to 33 fragments with non-headers
> fragments as small as 4 bytes) with all offload features enabled.
> 
> Thanks,
> vlad
> >
> >> Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
> >> ---
> >>   drivers/net/ixgbe/ixgbe_ethdev.c |  9 +++++++++
> >>   drivers/net/ixgbe/ixgbe_rxtx.c   | 23 ++++++++++++++++++++++-
> >>   2 files changed, 31 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c
> >> b/drivers/net/ixgbe/ixgbe_ethdev.c
> >> index b8ee1e9..6714fd9 100644
> >> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
> >> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
> >> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev,
> struct
> >> rte_eth_dev_info *dev_info)
> >>   		.txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |
> >>   				ETH_TXQ_FLAGS_NOOFFLOADS,
> >>   	};
> >> +
> >> +	/*
> >> +	 * According to 82599 and x540 specifications RS bit *must* be set on
> the
> >> +	 * last descriptor of *every* packet. Therefore we will not allow the
> >> +	 * tx_rs_thresh above 1 for all NICs newer than 82598.
> >> +	 */
> >> +	if (hw->mac.type > ixgbe_mac_82598EB)
> >> +		dev_info->default_txconf.tx_rs_thresh = 1;
> >> +
> >>   	dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX * sizeof(uint32_t);
> >>   	dev_info->reta_size = ETH_RSS_RETA_SIZE_128;
> >>   	dev_info->flow_type_rss_offloads = IXGBE_RSS_OFFLOAD_ALL; diff --
> git
> >> a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c index
> >> 91023b9..8dbdffc 100644
> >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> >> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev
> >> *dev,
> >>   	struct ixgbe_tx_queue *txq;
> >>   	struct ixgbe_hw     *hw;
> >>   	uint16_t tx_rs_thresh, tx_free_thresh;
> >> +	bool rs_deferring_allowed;
> >>
> >>   	PMD_INIT_FUNC_TRACE();
> >>   	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> >>
> >>   	/*
> >> +	 * According to 82599 and x540 specifications RS bit *must* be set on
> the
> >> +	 * last descriptor of *every* packet. Therefore we will not allow the
> >> +	 * tx_rs_thresh above 1 for all NICs newer than 82598.
> >> +	 */
> >> +	rs_deferring_allowed = (hw->mac.type <= ixgbe_mac_82598EB);
> >> +
> >> +	/*
> >>   	 * Validate number of transmit descriptors.
> >>   	 * It must not exceed hardware maximum, and must be multiple
> >>   	 * of IXGBE_ALIGN.
> >> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev
> *dev,
> >>   	 * to transmit a packet is greater than the number of free TX
> >>   	 * descriptors.
> >>   	 * The following constraints must be satisfied:
> >> +	 *  tx_rs_thresh must be less than 2 for NICs for which RS deferring is
> >> +	 *  forbidden (all but 82598).
> >>   	 *  tx_rs_thresh must be greater than 0.
> >>   	 *  tx_rs_thresh must be less than the size of the ring minus 2.
> >>   	 *  tx_rs_thresh must be less than or equal to tx_free_thresh.
> >> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev
> *dev,
> >>   	 * When set to zero use default values.
> >>   	 */
> >>   	tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?
> >> -			tx_conf->tx_rs_thresh : DEFAULT_TX_RS_THRESH);
> >> +			tx_conf->tx_rs_thresh :
> >> +			(rs_deferring_allowed ? DEFAULT_TX_RS_THRESH : 1));
> >>   	tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?
> >>   			tx_conf->tx_free_thresh : DEFAULT_TX_FREE_THRESH);
> >> +
> >> +	if (!rs_deferring_allowed && tx_rs_thresh > 1) {
> >> +		PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than 2 since RS
> "
> >> +				  "must be set for every packet for this HW. "
> >> +				  "(tx_rs_thresh=%u port=%d queue=%d)",
> >> +			     (unsigned int)tx_rs_thresh,
> >> +			     (int)dev->data->port_id, (int)queue_idx);
> >> +		return -(EINVAL);
> >> +	}
> >> +
> >>   	if (tx_rs_thresh >= (nb_desc - 2)) {
> >>   		PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than the
> number "
> >>   			     "of TX descriptors minus 2. (tx_rs_thresh=%u "
> >> --
> >> 2.1.0
  
Vladislav Zolotarov Aug. 19, 2015, 4:55 a.m. UTC | #4
On Aug 19, 2015 03:42, "Lu, Wenzhuo" <wenzhuo.lu@intel.com> wrote:
>
> Hi Helin,
>
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Vlad Zolotarov
> > Sent: Friday, August 14, 2015 1:38 PM
> > To: Zhang, Helin; dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above
1 for
> > all NICs but 82598
> >
> >
> >
> > On 08/13/15 23:28, Zhang, Helin wrote:
> > > Hi Vlad
> > >
> > > I don't think the changes are needed. It says in datasheet that the RS
> > > bit should be set on the last descriptor of every packet, ONLY WHEN
> > TXDCTL.WTHRESH equals to ZERO.
> >
> > Of course it's needed! See below.
> > Exactly the same spec a few lines above the place u've just quoted
states:
> >
> > "Software should not set the RS bit when TXDCTL.WTHRESH is greater than
> > zero."
> >
> > And since all three (3) ixgbe xmit callbacks are utilizing RS bit
notation ixgbe PMD
> > is actually not supporting any value of WTHRESH different from zero.
> I think Vlad is right. We need to fix this issue. Any suggestion? If not,
I'd like to ack this patch.

Pls. note that there is a v2 of this patch on the list. I forgot to patch
ixgbevf_dev_info_get() in v1.

>
> >
> > >
> > > Regards,
> > > Helin
> > >
> > >> -----Original Message-----
> > >> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
> > >> Sent: Thursday, August 13, 2015 11:07 AM
> > >> To: dev@dpdk.org
> > >> Cc: Zhang, Helin; Ananyev, Konstantin; avi@cloudius-systems.com; Vlad
> > >> Zolotarov
> > >> Subject: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above
1 for
> > all
> > >> NICs but 82598
> > >>
> > >> According to 82599 and x540 HW specifications RS bit *must* be set
in the
> > last
> > >> descriptor of *every* packet.
> > > There is a condition that if TXDCTL.WTHRESH equal to zero.
> >
> > Right and ixgbe PMD requires this condition to be fulfilled in order to
> > function. See above.
> >
> > >
> > >> This patch fixes the Tx hang we were constantly hitting with a
seastar-based
> > >> application on x540 NIC.
> > > Could you help to share with us how to reproduce the tx hang issue,
with using
> > > typical DPDK examples?
> >
> > Sorry. I'm not very familiar with the typical DPDK examples to help u
> > here. However this is quite irrelevant since without this this patch
> > ixgbe PMD obviously abuses the HW spec as has been explained above.
> >
> > We saw the issue when u stressed the xmit path with a lot of highly
> > fragmented TCP frames (packets with up to 33 fragments with non-headers
> > fragments as small as 4 bytes) with all offload features enabled.
> >
> > Thanks,
> > vlad
> > >
> > >> Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
> > >> ---
> > >>   drivers/net/ixgbe/ixgbe_ethdev.c |  9 +++++++++
> > >>   drivers/net/ixgbe/ixgbe_rxtx.c   | 23 ++++++++++++++++++++++-
> > >>   2 files changed, 31 insertions(+), 1 deletion(-)
> > >>
> > >> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c
> > >> b/drivers/net/ixgbe/ixgbe_ethdev.c
> > >> index b8ee1e9..6714fd9 100644
> > >> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
> > >> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
> > >> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev,
> > struct
> > >> rte_eth_dev_info *dev_info)
> > >>            .txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |
> > >>                            ETH_TXQ_FLAGS_NOOFFLOADS,
> > >>    };
> > >> +
> > >> +  /*
> > >> +   * According to 82599 and x540 specifications RS bit *must* be
set on
> > the
> > >> +   * last descriptor of *every* packet. Therefore we will not allow
the
> > >> +   * tx_rs_thresh above 1 for all NICs newer than 82598.
> > >> +   */
> > >> +  if (hw->mac.type > ixgbe_mac_82598EB)
> > >> +          dev_info->default_txconf.tx_rs_thresh = 1;
> > >> +
> > >>    dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX * sizeof(uint32_t);
> > >>    dev_info->reta_size = ETH_RSS_RETA_SIZE_128;
> > >>    dev_info->flow_type_rss_offloads = IXGBE_RSS_OFFLOAD_ALL; diff --
> > git
> > >> a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index
> > >> 91023b9..8dbdffc 100644
> > >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> > >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> > >> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev
> > >> *dev,
> > >>    struct ixgbe_tx_queue *txq;
> > >>    struct ixgbe_hw     *hw;
> > >>    uint16_t tx_rs_thresh, tx_free_thresh;
> > >> +  bool rs_deferring_allowed;
> > >>
> > >>    PMD_INIT_FUNC_TRACE();
> > >>    hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> > >>
> > >>    /*
> > >> +   * According to 82599 and x540 specifications RS bit *must* be
set on
> > the
> > >> +   * last descriptor of *every* packet. Therefore we will not allow
the
> > >> +   * tx_rs_thresh above 1 for all NICs newer than 82598.
> > >> +   */
> > >> +  rs_deferring_allowed = (hw->mac.type <= ixgbe_mac_82598EB);
> > >> +
> > >> +  /*
> > >>     * Validate number of transmit descriptors.
> > >>     * It must not exceed hardware maximum, and must be multiple
> > >>     * of IXGBE_ALIGN.
> > >> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev
> > *dev,
> > >>     * to transmit a packet is greater than the number of free TX
> > >>     * descriptors.
> > >>     * The following constraints must be satisfied:
> > >> +   *  tx_rs_thresh must be less than 2 for NICs for which RS
deferring is
> > >> +   *  forbidden (all but 82598).
> > >>     *  tx_rs_thresh must be greater than 0.
> > >>     *  tx_rs_thresh must be less than the size of the ring minus 2.
> > >>     *  tx_rs_thresh must be less than or equal to tx_free_thresh.
> > >> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev
> > *dev,
> > >>     * When set to zero use default values.
> > >>     */
> > >>    tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?
> > >> -                  tx_conf->tx_rs_thresh : DEFAULT_TX_RS_THRESH);
> > >> +                  tx_conf->tx_rs_thresh :
> > >> +                  (rs_deferring_allowed ? DEFAULT_TX_RS_THRESH :
1));
> > >>    tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?
> > >>                    tx_conf->tx_free_thresh : DEFAULT_TX_FREE_THRESH);
> > >> +
> > >> +  if (!rs_deferring_allowed && tx_rs_thresh > 1) {
> > >> +          PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than 2 since
RS
> > "
> > >> +                            "must be set for every packet for this
HW. "
> > >> +                            "(tx_rs_thresh=%u port=%d queue=%d)",
> > >> +                       (unsigned int)tx_rs_thresh,
> > >> +                       (int)dev->data->port_id, (int)queue_idx);
> > >> +          return -(EINVAL);
> > >> +  }
> > >> +
> > >>    if (tx_rs_thresh >= (nb_desc - 2)) {
> > >>            PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than the
> > number "
> > >>                         "of TX descriptors minus 2. (tx_rs_thresh=%u
"
> > >> --
> > >> 2.1.0
>
  
Ananyev, Konstantin Aug. 19, 2015, 7:43 a.m. UTC | #5
Hi Vlad,
Sorry for the delay with the review, I am OOO till next week.
Meanwhile, a few questions/comments from me. 

> > > >

> > > >> This patch fixes the Tx hang we were constantly hitting with a

> seastar-based

> > > >> application on x540 NIC.

> > > > Could you help to share with us how to reproduce the tx hang issue,

> with using

> > > > typical DPDK examples?

> > >

> > > Sorry. I'm not very familiar with the typical DPDK examples to help u

> > > here. However this is quite irrelevant since without this this patch

> > > ixgbe PMD obviously abuses the HW spec as has been explained above.

> > >

> > > We saw the issue when u stressed the xmit path with a lot of highly

> > > fragmented TCP frames (packets with up to 33 fragments with non-headers

> > > fragments as small as 4 bytes) with all offload features enabled.


Could you provide us with the pcap file to reproduce the issue?
My concern with your approach is that it would affect TX performance.
Right now, for simple TX the PMD usually reads only (nb_tx_desc/tx_rs_thresh) TXDs,
while with your patch (if I understand it correctly) it has to read all TXDs in the HW TX ring.
Even if we really need to set the RS bit in each TXD (I still doubt we really do),
I think inside the PMD it should still be possible to check TX completion in chunks.
Konstantin   
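
As an illustration of the chunked check referred to above (a rough sketch only; field names follow struct ixgbe_tx_queue in ixgbe_rxtx.h, mbuf freeing elided):

static inline int
tx_free_one_chunk(struct ixgbe_tx_queue *txq)
{
        /* Poll DD only on the last descriptor of the current chunk. */
        uint32_t status = txq->tx_ring[txq->tx_next_dd].wb.status;

        if (!(status & rte_cpu_to_le_32(IXGBE_ADVTXD_STAT_DD)))
                return 0;               /* chunk not completed yet */

        /* ... free the tx_rs_thresh mbufs belonging to this chunk ... */

        txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
        if (txq->tx_next_dd >= txq->nb_tx_desc)
                txq->tx_next_dd = (uint16_t)(txq->tx_next_dd - txq->nb_tx_desc);
        txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);

        return txq->tx_rs_thresh;
}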


> > >

> > > Thanks,

> > > vlad

> > > >

> > > >> Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>

> > > >> ---

> > > >>   drivers/net/ixgbe/ixgbe_ethdev.c |  9 +++++++++

> > > >>   drivers/net/ixgbe/ixgbe_rxtx.c   | 23 ++++++++++++++++++++++-

> > > >>   2 files changed, 31 insertions(+), 1 deletion(-)

> > > >>

> > > >> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c

> > > >> b/drivers/net/ixgbe/ixgbe_ethdev.c

> > > >> index b8ee1e9..6714fd9 100644

> > > >> --- a/drivers/net/ixgbe/ixgbe_ethdev.c

> > > >> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c

> > > >> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev,

> > > struct

> > > >> rte_eth_dev_info *dev_info)

> > > >>            .txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |

> > > >>                            ETH_TXQ_FLAGS_NOOFFLOADS,

> > > >>    };

> > > >> +

> > > >> +  /*

> > > >> +   * According to 82599 and x540 specifications RS bit *must* be

> set on

> > > the

> > > >> +   * last descriptor of *every* packet. Therefore we will not allow

> the

> > > >> +   * tx_rs_thresh above 1 for all NICs newer than 82598.

> > > >> +   */

> > > >> +  if (hw->mac.type > ixgbe_mac_82598EB)

> > > >> +          dev_info->default_txconf.tx_rs_thresh = 1;

> > > >> +

> > > >>    dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX * sizeof(uint32_t);

> > > >>    dev_info->reta_size = ETH_RSS_RETA_SIZE_128;

> > > >>    dev_info->flow_type_rss_offloads = IXGBE_RSS_OFFLOAD_ALL; diff --

> > > git

> > > >> a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c

> index

> > > >> 91023b9..8dbdffc 100644

> > > >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c

> > > >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c

> > > >> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev

> > > >> *dev,

> > > >>    struct ixgbe_tx_queue *txq;

> > > >>    struct ixgbe_hw     *hw;

> > > >>    uint16_t tx_rs_thresh, tx_free_thresh;

> > > >> +  bool rs_deferring_allowed;

> > > >>

> > > >>    PMD_INIT_FUNC_TRACE();

> > > >>    hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);

> > > >>

> > > >>    /*

> > > >> +   * According to 82599 and x540 specifications RS bit *must* be

> set on

> > > the

> > > >> +   * last descriptor of *every* packet. Therefore we will not allow

> the

> > > >> +   * tx_rs_thresh above 1 for all NICs newer than 82598.

> > > >> +   */

> > > >> +  rs_deferring_allowed = (hw->mac.type <= ixgbe_mac_82598EB);

> > > >> +

> > > >> +  /*

> > > >>     * Validate number of transmit descriptors.

> > > >>     * It must not exceed hardware maximum, and must be multiple

> > > >>     * of IXGBE_ALIGN.

> > > >> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev

> > > *dev,

> > > >>     * to transmit a packet is greater than the number of free TX

> > > >>     * descriptors.

> > > >>     * The following constraints must be satisfied:

> > > >> +   *  tx_rs_thresh must be less than 2 for NICs for which RS

> deferring is

> > > >> +   *  forbidden (all but 82598).

> > > >>     *  tx_rs_thresh must be greater than 0.

> > > >>     *  tx_rs_thresh must be less than the size of the ring minus 2.

> > > >>     *  tx_rs_thresh must be less than or equal to tx_free_thresh.

> > > >> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev

> > > *dev,

> > > >>     * When set to zero use default values.

> > > >>     */

> > > >>    tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?

> > > >> -                  tx_conf->tx_rs_thresh : DEFAULT_TX_RS_THRESH);

> > > >> +                  tx_conf->tx_rs_thresh :

> > > >> +                  (rs_deferring_allowed ? DEFAULT_TX_RS_THRESH :

> 1));

> > > >>    tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?

> > > >>                    tx_conf->tx_free_thresh : DEFAULT_TX_FREE_THRESH);

> > > >> +

> > > >> +  if (!rs_deferring_allowed && tx_rs_thresh > 1) {

> > > >> +          PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than 2 since

> RS

> > > "

> > > >> +                            "must be set for every packet for this

> HW. "

> > > >> +                            "(tx_rs_thresh=%u port=%d queue=%d)",

> > > >> +                       (unsigned int)tx_rs_thresh,

> > > >> +                       (int)dev->data->port_id, (int)queue_idx);

> > > >> +          return -(EINVAL);

> > > >> +  }

> > > >> +

> > > >>    if (tx_rs_thresh >= (nb_desc - 2)) {

> > > >>            PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than the

> > > number "

> > > >>                         "of TX descriptors minus 2. (tx_rs_thresh=%u

> "

> > > >> --

> > > >> 2.1.0

> >
  
Vladislav Zolotarov Aug. 19, 2015, 10:02 a.m. UTC | #6
On 08/19/15 10:43, Ananyev, Konstantin wrote:
> Hi Vlad,
> Sorry for delay with review, I am OOO till next week.
> Meanwhile, few questions/comments from me.

Hi, Konstantin, long time no see... ;)

>
>>>>>> This patch fixes the Tx hang we were constantly hitting with a
>> seastar-based
>>>>>> application on x540 NIC.
>>>>> Could you help to share with us how to reproduce the tx hang issue,
>> with using
>>>>> typical DPDK examples?
>>>> Sorry. I'm not very familiar with the typical DPDK examples to help u
>>>> here. However this is quite irrelevant since without this this patch
>>>> ixgbe PMD obviously abuses the HW spec as has been explained above.
>>>>
>>>> We saw the issue when u stressed the xmit path with a lot of highly
>>>> fragmented TCP frames (packets with up to 33 fragments with non-headers
>>>> fragments as small as 4 bytes) with all offload features enabled.
> Could you provide us with the pcap file to reproduce the issue?

Well, the thing is it takes some time to reproduce it (a few minutes of 
heavy load), so a pcap would be quite large.

> My concern with you approach is that it would affect TX performance.

It certainly will ;) But it seems inevitable. See below.

> Right now, for simple TX PMD usually reads only (nb_tx_desc/tx_rs_thresh) TXDs,
> While with your patch (if I understand it correctly) it has to read all TXDs in the HW TX ring.

If by "simple" u refer an always single fragment per Tx packet - then u 
are absolutely correct.

My initial patch only set RS on every EOP descriptor without 
changing the rs_thresh value, and that patch worked.
However the HW spec doesn't ensure, in the general case, that packets are 
always handled / that completion write-back completes in the same order the 
packets are placed on the ring (see the "Tx arbitration schemes" chapter in 
the 82599 spec for instance). Therefore AFAIU one should not assume that if 
the packet[x+1] DD bit is set then packet[x] is completed too.

That's why I changed the patch to be as u see it now. However if I miss 
something here and your HW people ensure in-order completion, this of 
course may be changed back.
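
For reference, a sketch of that first variant (a fragment from inside the xmit loop; txd and m_seg stand for the usual descriptor/segment cursors, macros from the ixgbe base code):

uint32_t cmd_type_len = IXGBE_ADVTXD_DTYP_DATA | IXGBE_ADVTXD_DCMD_DEXT |
                        IXGBE_ADVTXD_DCMD_IFCS;

if (m_seg->next == NULL)        /* last descriptor of the packet (EOP) */
        cmd_type_len |= IXGBE_ADVTXD_DCMD_EOP | IXGBE_ADVTXD_DCMD_RS;

txd->read.cmd_type_len =
        rte_cpu_to_le_32(cmd_type_len | (uint32_t)m_seg->data_len);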

> Even if we really need to setup RS bit in each TXD (I still doubt we really do) - ,

Well, if u doubt u may ask the guys from the Intel networking division 
that wrote the 82599 and x540 HW specs where they clearly state that. ;)

> I think inside PMD it still should be possible to check TX completion in chunks.
> Konstantin
>
>
>>>> Thanks,
>>>> vlad
>>>>>> Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
>>>>>> ---
>>>>>>    drivers/net/ixgbe/ixgbe_ethdev.c |  9 +++++++++
>>>>>>    drivers/net/ixgbe/ixgbe_rxtx.c   | 23 ++++++++++++++++++++++-
>>>>>>    2 files changed, 31 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>> b/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>> index b8ee1e9..6714fd9 100644
>>>>>> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev,
>>>> struct
>>>>>> rte_eth_dev_info *dev_info)
>>>>>>             .txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |
>>>>>>                             ETH_TXQ_FLAGS_NOOFFLOADS,
>>>>>>     };
>>>>>> +
>>>>>> +  /*
>>>>>> +   * According to 82599 and x540 specifications RS bit *must* be
>> set on
>>>> the
>>>>>> +   * last descriptor of *every* packet. Therefore we will not allow
>> the
>>>>>> +   * tx_rs_thresh above 1 for all NICs newer than 82598.
>>>>>> +   */
>>>>>> +  if (hw->mac.type > ixgbe_mac_82598EB)
>>>>>> +          dev_info->default_txconf.tx_rs_thresh = 1;
>>>>>> +
>>>>>>     dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX * sizeof(uint32_t);
>>>>>>     dev_info->reta_size = ETH_RSS_RETA_SIZE_128;
>>>>>>     dev_info->flow_type_rss_offloads = IXGBE_RSS_OFFLOAD_ALL; diff --
>>>> git
>>>>>> a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>> index
>>>>>> 91023b9..8dbdffc 100644
>>>>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>>> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev
>>>>>> *dev,
>>>>>>     struct ixgbe_tx_queue *txq;
>>>>>>     struct ixgbe_hw     *hw;
>>>>>>     uint16_t tx_rs_thresh, tx_free_thresh;
>>>>>> +  bool rs_deferring_allowed;
>>>>>>
>>>>>>     PMD_INIT_FUNC_TRACE();
>>>>>>     hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>>>>>>
>>>>>>     /*
>>>>>> +   * According to 82599 and x540 specifications RS bit *must* be
>> set on
>>>> the
>>>>>> +   * last descriptor of *every* packet. Therefore we will not allow
>> the
>>>>>> +   * tx_rs_thresh above 1 for all NICs newer than 82598.
>>>>>> +   */
>>>>>> +  rs_deferring_allowed = (hw->mac.type <= ixgbe_mac_82598EB);
>>>>>> +
>>>>>> +  /*
>>>>>>      * Validate number of transmit descriptors.
>>>>>>      * It must not exceed hardware maximum, and must be multiple
>>>>>>      * of IXGBE_ALIGN.
>>>>>> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev
>>>> *dev,
>>>>>>      * to transmit a packet is greater than the number of free TX
>>>>>>      * descriptors.
>>>>>>      * The following constraints must be satisfied:
>>>>>> +   *  tx_rs_thresh must be less than 2 for NICs for which RS
>> deferring is
>>>>>> +   *  forbidden (all but 82598).
>>>>>>      *  tx_rs_thresh must be greater than 0.
>>>>>>      *  tx_rs_thresh must be less than the size of the ring minus 2.
>>>>>>      *  tx_rs_thresh must be less than or equal to tx_free_thresh.
>>>>>> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev
>>>> *dev,
>>>>>>      * When set to zero use default values.
>>>>>>      */
>>>>>>     tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?
>>>>>> -                  tx_conf->tx_rs_thresh : DEFAULT_TX_RS_THRESH);
>>>>>> +                  tx_conf->tx_rs_thresh :
>>>>>> +                  (rs_deferring_allowed ? DEFAULT_TX_RS_THRESH :
>> 1));
>>>>>>     tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?
>>>>>>                     tx_conf->tx_free_thresh : DEFAULT_TX_FREE_THRESH);
>>>>>> +
>>>>>> +  if (!rs_deferring_allowed && tx_rs_thresh > 1) {
>>>>>> +          PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than 2 since
>> RS
>>>> "
>>>>>> +                            "must be set for every packet for this
>> HW. "
>>>>>> +                            "(tx_rs_thresh=%u port=%d queue=%d)",
>>>>>> +                       (unsigned int)tx_rs_thresh,
>>>>>> +                       (int)dev->data->port_id, (int)queue_idx);
>>>>>> +          return -(EINVAL);
>>>>>> +  }
>>>>>> +
>>>>>>     if (tx_rs_thresh >= (nb_desc - 2)) {
>>>>>>             PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than the
>>>> number "
>>>>>>                          "of TX descriptors minus 2. (tx_rs_thresh=%u
>> "
>>>>>> --
>>>>>> 2.1.0
  
Zhang, Helin Aug. 19, 2015, 5:29 p.m. UTC | #7
Hi Vlad

Thank you very much for the patches! Give me some more time to double-check with more guys, and possibly hardware experts.

Regards,
Helin

From: Vladislav Zolotarov [mailto:vladz@cloudius-systems.com]

Sent: Tuesday, August 18, 2015 9:56 PM
To: Lu, Wenzhuo
Cc: dev@dpdk.org; Zhang, Helin
Subject: RE: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598


On Aug 19, 2015 03:42, "Lu, Wenzhuo" <wenzhuo.lu@intel.com<mailto:wenzhuo.lu@intel.com>> wrote:
>

> Hi Helin,

>

> > -----Original Message-----

> > From: dev [mailto:dev-bounces@dpdk.org<mailto:dev-bounces@dpdk.org>] On Behalf Of Vlad Zolotarov

> > Sent: Friday, August 14, 2015 1:38 PM

> > To: Zhang, Helin; dev@dpdk.org<mailto:dev@dpdk.org>

> > Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for

> > all NICs but 82598

> >

> >

> >

> > On 08/13/15 23:28, Zhang, Helin wrote:

> > > Hi Vlad

> > >

> > > I don't think the changes are needed. It says in datasheet that the RS

> > > bit should be set on the last descriptor of every packet, ONLY WHEN

> > TXDCTL.WTHRESH equals to ZERO.

> >

> > Of course it's needed! See below.

> > Exactly the same spec a few lines above the place u've just quoted states:

> >

> > "Software should not set the RS bit when TXDCTL.WTHRESH is greater than

> > zero."

> >

> > And since all three (3) ixgbe xmit callbacks are utilizing RS bit notation ixgbe PMD

> > is actually not supporting any value of WTHRESH different from zero.

> I think Vlad is right. We need to fix this issue. Any suggestion? If not, I'd like to ack this patch.


Pls., note that there is a v2 of this patch on the list. I forgot to patch ixgbevf_dev_info_get() in v1.

>

> >

> > >

> > > Regards,

> > > Helin

> > >

> > >> -----Original Message-----

> > >> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com<mailto:vladz@cloudius-systems.com>]

> > >> Sent: Thursday, August 13, 2015 11:07 AM

> > >> To: dev@dpdk.org<mailto:dev@dpdk.org>

> > >> Cc: Zhang, Helin; Ananyev, Konstantin; avi@cloudius-systems.com<mailto:avi@cloudius-systems.com>; Vlad

> > >> Zolotarov

> > >> Subject: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for

> > all

> > >> NICs but 82598

> > >>

> > >> According to 82599 and x540 HW specifications RS bit *must* be set in the

> > last

> > >> descriptor of *every* packet.

> > > There is a condition that if TXDCTL.WTHRESH equal to zero.

> >

> > Right and ixgbe PMD requires this condition to be fulfilled in order to

> > function. See above.

> >

> > >

> > >> This patch fixes the Tx hang we were constantly hitting with a seastar-based

> > >> application on x540 NIC.

> > > Could you help to share with us how to reproduce the tx hang issue, with using

> > > typical DPDK examples?

> >

> > Sorry. I'm not very familiar with the typical DPDK examples to help u

> > here. However this is quite irrelevant since without this this patch

> > ixgbe PMD obviously abuses the HW spec as has been explained above.

> >

> > We saw the issue when u stressed the xmit path with a lot of highly

> > fragmented TCP frames (packets with up to 33 fragments with non-headers

> > fragments as small as 4 bytes) with all offload features enabled.

> >

> > Thanks,

> > vlad

> > >

> > >> Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com<mailto:vladz@cloudius-systems.com>>

> > >> ---

> > >>   drivers/net/ixgbe/ixgbe_ethdev.c |  9 +++++++++

> > >>   drivers/net/ixgbe/ixgbe_rxtx.c   | 23 ++++++++++++++++++++++-

> > >>   2 files changed, 31 insertions(+), 1 deletion(-)

> > >>

> > >> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c

> > >> b/drivers/net/ixgbe/ixgbe_ethdev.c

> > >> index b8ee1e9..6714fd9 100644

> > >> --- a/drivers/net/ixgbe/ixgbe_ethdev.c

> > >> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c

> > >> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev,

> > struct

> > >> rte_eth_dev_info *dev_info)

> > >>            .txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |

> > >>                            ETH_TXQ_FLAGS_NOOFFLOADS,

> > >>    };

> > >> +

> > >> +  /*

> > >> +   * According to 82599 and x540 specifications RS bit *must* be set on

> > the

> > >> +   * last descriptor of *every* packet. Therefore we will not allow the

> > >> +   * tx_rs_thresh above 1 for all NICs newer than 82598.

> > >> +   */

> > >> +  if (hw->mac.type > ixgbe_mac_82598EB)

> > >> +          dev_info->default_txconf.tx_rs_thresh = 1;

> > >> +

> > >>    dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX * sizeof(uint32_t);

> > >>    dev_info->reta_size = ETH_RSS_RETA_SIZE_128;

> > >>    dev_info->flow_type_rss_offloads = IXGBE_RSS_OFFLOAD_ALL; diff --

> > git

> > >> a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c index

> > >> 91023b9..8dbdffc 100644

> > >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c

> > >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c

> > >> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev

> > >> *dev,

> > >>    struct ixgbe_tx_queue *txq;

> > >>    struct ixgbe_hw     *hw;

> > >>    uint16_t tx_rs_thresh, tx_free_thresh;

> > >> +  bool rs_deferring_allowed;

> > >>

> > >>    PMD_INIT_FUNC_TRACE();

> > >>    hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);

> > >>

> > >>    /*

> > >> +   * According to 82599 and x540 specifications RS bit *must* be set on

> > the

> > >> +   * last descriptor of *every* packet. Therefore we will not allow the

> > >> +   * tx_rs_thresh above 1 for all NICs newer than 82598.

> > >> +   */

> > >> +  rs_deferring_allowed = (hw->mac.type <= ixgbe_mac_82598EB);

> > >> +

> > >> +  /*

> > >>     * Validate number of transmit descriptors.

> > >>     * It must not exceed hardware maximum, and must be multiple

> > >>     * of IXGBE_ALIGN.

> > >> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev

> > *dev,

> > >>     * to transmit a packet is greater than the number of free TX

> > >>     * descriptors.

> > >>     * The following constraints must be satisfied:

> > >> +   *  tx_rs_thresh must be less than 2 for NICs for which RS deferring is

> > >> +   *  forbidden (all but 82598).

> > >>     *  tx_rs_thresh must be greater than 0.

> > >>     *  tx_rs_thresh must be less than the size of the ring minus 2.

> > >>     *  tx_rs_thresh must be less than or equal to tx_free_thresh.

> > >> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev

> > *dev,

> > >>     * When set to zero use default values.

> > >>     */

> > >>    tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?

> > >> -                  tx_conf->tx_rs_thresh : DEFAULT_TX_RS_THRESH);

> > >> +                  tx_conf->tx_rs_thresh :

> > >> +                  (rs_deferring_allowed ? DEFAULT_TX_RS_THRESH : 1));

> > >>    tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?

> > >>                    tx_conf->tx_free_thresh : DEFAULT_TX_FREE_THRESH);

> > >> +

> > >> +  if (!rs_deferring_allowed && tx_rs_thresh > 1) {

> > >> +          PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than 2 since RS

> > "

> > >> +                            "must be set for every packet for this HW. "

> > >> +                            "(tx_rs_thresh=%u port=%d queue=%d)",

> > >> +                       (unsigned int)tx_rs_thresh,

> > >> +                       (int)dev->data->port_id, (int)queue_idx);

> > >> +          return -(EINVAL);

> > >> +  }

> > >> +

> > >>    if (tx_rs_thresh >= (nb_desc - 2)) {

> > >>            PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than the

> > number "

> > >>                         "of TX descriptors minus 2. (tx_rs_thresh=%u "

> > >> --

> > >> 2.1.0

>
  
Ananyev, Konstantin Aug. 20, 2015, 8:41 a.m. UTC | #8
Hi Vlad,

> -----Original Message-----

> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]

> Sent: Wednesday, August 19, 2015 11:03 AM

> To: Ananyev, Konstantin; Lu, Wenzhuo

> Cc: dev@dpdk.org

> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598

> 

> 

> 

> On 08/19/15 10:43, Ananyev, Konstantin wrote:

> > Hi Vlad,

> > Sorry for delay with review, I am OOO till next week.

> > Meanwhile, few questions/comments from me.

> 

> Hi, Konstantin, long time no see... ;)

> 

> >

> >>>>>> This patch fixes the Tx hang we were constantly hitting with a

> >> seastar-based

> >>>>>> application on x540 NIC.

> >>>>> Could you help to share with us how to reproduce the tx hang issue,

> >> with using

> >>>>> typical DPDK examples?

> >>>> Sorry. I'm not very familiar with the typical DPDK examples to help u

> >>>> here. However this is quite irrelevant since without this this patch

> >>>> ixgbe PMD obviously abuses the HW spec as has been explained above.

> >>>>

> >>>> We saw the issue when u stressed the xmit path with a lot of highly

> >>>> fragmented TCP frames (packets with up to 33 fragments with non-headers

> >>>> fragments as small as 4 bytes) with all offload features enabled.

> > Could you provide us with the pcap file to reproduce the issue?

> 

> Well, the thing is it takes some time to reproduce it (a few minutes of

> heavy load) therefore a pcap would be quite large.


Probably you can upload it to some place from which we will be able to download it?
Or maybe you have some sort of scapy script to generate it?
I suppose we'll need something to reproduce the issue and verify the fix.   

> 

> > My concern with you approach is that it would affect TX performance.

> 

> It certainly will ;) But it seem inevitable. See below.

> 

> > Right now, for simple TX PMD usually reads only (nb_tx_desc/tx_rs_thresh) TXDs,

> > While with your patch (if I understand it correctly) it has to read all TXDs in the HW TX ring.

> 

> If by "simple" u refer an always single fragment per Tx packet - then u

> are absolutely correct.

> 

> My initial patch was to only set RS on every EOP descriptor without

> changing the rs_thresh value and this patch worked.

> However HW spec doesn't ensure in a general case that packets are always

> handled/completion write-back completes in the same order the packets

> are placed on the ring (see "Tx arbitration schemes" chapter in 82599

> spec for instance). Therefore AFAIU one should not assume that if

> packet[x+1] DD bit is set then packet[x] is completed too.


From my understanding, TX arbitration controls the order in which TXDs from
different queues are fetched/processed.
But descriptors from the same TX queue are processed in FIFO order.
So I think that, yes, if the TXD[x+1] DD bit is set, then TXD[x] is completed too,
and setting RS on every EOP TXD should be enough.
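
As an illustration of what that assumption allows (a simplified sketch in the spirit of the existing ixgbe_xmit_cleanup(); nb_tx_free accounting omitted):

static inline int
cleanup_batch(struct ixgbe_tx_queue *txq)
{
        uint16_t to = (uint16_t)(txq->last_desc_cleaned + txq->tx_rs_thresh);

        if (to >= txq->nb_tx_desc)
                to = (uint16_t)(to - txq->nb_tx_desc);

        /* EOP descriptor of the packet covering that slot. */
        to = txq->sw_ring[to].last_id;

        if (!(txq->tx_ring[to].wb.status &
              rte_cpu_to_le_32(IXGBE_ADVTXD_STAT_DD)))
                return -1;              /* batch not completed yet */

        /* FIFO assumption: everything up to 'to' is done as well. */
        txq->last_desc_cleaned = to;
        return 0;
}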

> 

> That's why I changed the patch to be as u see it now. However if I miss

> something here and your HW people ensure the in-order completion this of

> course may be changed back.

> 

> > Even if we really need to setup RS bit in each TXD (I still doubt we really do) - ,

> 

> Well, if u doubt u may ask the guys from the Intel networking division

> that wrote the 82599 and x540 HW specs where they clearly state that. ;)


Good point, we'll see what we can do here :)
Konstantin

> 

> > I think inside PMD it still should be possible to check TX completion in chunks.

> > Konstantin

> >

> >

> >>>> Thanks,

> >>>> vlad

> >>>>>> Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>

> >>>>>> ---

> >>>>>>    drivers/net/ixgbe/ixgbe_ethdev.c |  9 +++++++++

> >>>>>>    drivers/net/ixgbe/ixgbe_rxtx.c   | 23 ++++++++++++++++++++++-

> >>>>>>    2 files changed, 31 insertions(+), 1 deletion(-)

> >>>>>>

> >>>>>> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c

> >>>>>> b/drivers/net/ixgbe/ixgbe_ethdev.c

> >>>>>> index b8ee1e9..6714fd9 100644

> >>>>>> --- a/drivers/net/ixgbe/ixgbe_ethdev.c

> >>>>>> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c

> >>>>>> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev,

> >>>> struct

> >>>>>> rte_eth_dev_info *dev_info)

> >>>>>>             .txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |

> >>>>>>                             ETH_TXQ_FLAGS_NOOFFLOADS,

> >>>>>>     };

> >>>>>> +

> >>>>>> +  /*

> >>>>>> +   * According to 82599 and x540 specifications RS bit *must* be

> >> set on

> >>>> the

> >>>>>> +   * last descriptor of *every* packet. Therefore we will not allow

> >> the

> >>>>>> +   * tx_rs_thresh above 1 for all NICs newer than 82598.

> >>>>>> +   */

> >>>>>> +  if (hw->mac.type > ixgbe_mac_82598EB)

> >>>>>> +          dev_info->default_txconf.tx_rs_thresh = 1;

> >>>>>> +

> >>>>>>     dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX * sizeof(uint32_t);

> >>>>>>     dev_info->reta_size = ETH_RSS_RETA_SIZE_128;

> >>>>>>     dev_info->flow_type_rss_offloads = IXGBE_RSS_OFFLOAD_ALL; diff --

> >>>> git

> >>>>>> a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c

> >> index

> >>>>>> 91023b9..8dbdffc 100644

> >>>>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c

> >>>>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c

> >>>>>> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev

> >>>>>> *dev,

> >>>>>>     struct ixgbe_tx_queue *txq;

> >>>>>>     struct ixgbe_hw     *hw;

> >>>>>>     uint16_t tx_rs_thresh, tx_free_thresh;

> >>>>>> +  bool rs_deferring_allowed;

> >>>>>>

> >>>>>>     PMD_INIT_FUNC_TRACE();

> >>>>>>     hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);

> >>>>>>

> >>>>>>     /*

> >>>>>> +   * According to 82599 and x540 specifications RS bit *must* be

> >> set on

> >>>> the

> >>>>>> +   * last descriptor of *every* packet. Therefore we will not allow

> >> the

> >>>>>> +   * tx_rs_thresh above 1 for all NICs newer than 82598.

> >>>>>> +   */

> >>>>>> +  rs_deferring_allowed = (hw->mac.type <= ixgbe_mac_82598EB);

> >>>>>> +

> >>>>>> +  /*

> >>>>>>      * Validate number of transmit descriptors.

> >>>>>>      * It must not exceed hardware maximum, and must be multiple

> >>>>>>      * of IXGBE_ALIGN.

> >>>>>> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev

> >>>> *dev,

> >>>>>>      * to transmit a packet is greater than the number of free TX

> >>>>>>      * descriptors.

> >>>>>>      * The following constraints must be satisfied:

> >>>>>> +   *  tx_rs_thresh must be less than 2 for NICs for which RS

> >> deferring is

> >>>>>> +   *  forbidden (all but 82598).

> >>>>>>      *  tx_rs_thresh must be greater than 0.

> >>>>>>      *  tx_rs_thresh must be less than the size of the ring minus 2.

> >>>>>>      *  tx_rs_thresh must be less than or equal to tx_free_thresh.

> >>>>>> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev

> >>>> *dev,

> >>>>>>      * When set to zero use default values.

> >>>>>>      */

> >>>>>>     tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?

> >>>>>> -                  tx_conf->tx_rs_thresh : DEFAULT_TX_RS_THRESH);

> >>>>>> +                  tx_conf->tx_rs_thresh :

> >>>>>> +                  (rs_deferring_allowed ? DEFAULT_TX_RS_THRESH :

> >> 1));

> >>>>>>     tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?

> >>>>>>                     tx_conf->tx_free_thresh : DEFAULT_TX_FREE_THRESH);

> >>>>>> +

> >>>>>> +  if (!rs_deferring_allowed && tx_rs_thresh > 1) {

> >>>>>> +          PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than 2 since

> >> RS

> >>>> "

> >>>>>> +                            "must be set for every packet for this

> >> HW. "

> >>>>>> +                            "(tx_rs_thresh=%u port=%d queue=%d)",

> >>>>>> +                       (unsigned int)tx_rs_thresh,

> >>>>>> +                       (int)dev->data->port_id, (int)queue_idx);

> >>>>>> +          return -(EINVAL);

> >>>>>> +  }

> >>>>>> +

> >>>>>>     if (tx_rs_thresh >= (nb_desc - 2)) {

> >>>>>>             PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than the

> >>>> number "

> >>>>>>                          "of TX descriptors minus 2. (tx_rs_thresh=%u

> >> "

> >>>>>> --

> >>>>>> 2.1.0
  
Vladislav Zolotarov Aug. 20, 2015, 8:56 a.m. UTC | #9
On 08/20/15 11:41, Ananyev, Konstantin wrote:
> Hi Vlad,
>
>> -----Original Message-----
>> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
>> Sent: Wednesday, August 19, 2015 11:03 AM
>> To: Ananyev, Konstantin; Lu, Wenzhuo
>> Cc: dev@dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598
>>
>>
>>
>> On 08/19/15 10:43, Ananyev, Konstantin wrote:
>>> Hi Vlad,
>>> Sorry for delay with review, I am OOO till next week.
>>> Meanwhile, few questions/comments from me.
>> Hi, Konstantin, long time no see... ;)
>>
>>>>>>>> This patch fixes the Tx hang we were constantly hitting with a
>>>> seastar-based
>>>>>>>> application on x540 NIC.
>>>>>>> Could you help to share with us how to reproduce the tx hang issue,
>>>> with using
>>>>>>> typical DPDK examples?
>>>>>> Sorry. I'm not very familiar with the typical DPDK examples to help u
>>>>>> here. However this is quite irrelevant since without this this patch
>>>>>> ixgbe PMD obviously abuses the HW spec as has been explained above.
>>>>>>
>>>>>> We saw the issue when u stressed the xmit path with a lot of highly
>>>>>> fragmented TCP frames (packets with up to 33 fragments with non-headers
>>>>>> fragments as small as 4 bytes) with all offload features enabled.
>>> Could you provide us with the pcap file to reproduce the issue?
>> Well, the thing is it takes some time to reproduce it (a few minutes of
>> heavy load) therefore a pcap would be quite large.
> Probably you can upload it to some place, from which we will be able to download it?

I'll see what I can do but no promises...

> Or might be you have some sort of scapy script to generate it?
> I suppose we'll need something to reproduce the issue and verify the fix.

Since the original code abuses the HW spec u don't have to... ;)

>
>>> My concern with you approach is that it would affect TX performance.
>> It certainly will ;) But it seem inevitable. See below.
>>
>>> Right now, for simple TX PMD usually reads only (nb_tx_desc/tx_rs_thresh) TXDs,
>>> While with your patch (if I understand it correctly) it has to read all TXDs in the HW TX ring.
>> If by "simple" u refer an always single fragment per Tx packet - then u
>> are absolutely correct.
>>
>> My initial patch was to only set RS on every EOP descriptor without
>> changing the rs_thresh value and this patch worked.
>> However HW spec doesn't ensure in a general case that packets are always
>> handled/completion write-back completes in the same order the packets
>> are placed on the ring (see "Tx arbitration schemes" chapter in 82599
>> spec for instance). Therefore AFAIU one should not assume that if
>> packet[x+1] DD bit is set then packet[x] is completed too.
>  From my understanding, TX arbitration controls the order in which TXDs from
> different queues are fetched/processed.
> But descriptors from the same TX queue are processed in FIFO order.
> So, I think that  - yes, if TXD[x+1] DD bit is set, then TXD[x] is completed too,
> and setting RS on every EOP TXD should be enough.

Ok. I'll rework the patch under this assumption then.

>
>> That's why I changed the patch to be as u see it now. However if I miss
>> something here and your HW people ensure the in-order completion this of
>> course may be changed back.
>>
>>> Even if we really need to setup RS bit in each TXD (I still doubt we really do) - ,
>> Well, if u doubt u may ask the guys from the Intel networking division
>> that wrote the 82599 and x540 HW specs where they clearly state that. ;)
> Good point, we'll see what we can do here :)
> Konstantin
>
>>> I think inside PMD it still should be possible to check TX completion in chunks.
>>> Konstantin
>>>
>>>
>>>>>> Thanks,
>>>>>> vlad
>>>>>>>> Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
>>>>>>>> ---
>>>>>>>>     drivers/net/ixgbe/ixgbe_ethdev.c |  9 +++++++++
>>>>>>>>     drivers/net/ixgbe/ixgbe_rxtx.c   | 23 ++++++++++++++++++++++-
>>>>>>>>     2 files changed, 31 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>>>> b/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>>>> index b8ee1e9..6714fd9 100644
>>>>>>>> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>>>> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>>>> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev,
>>>>>> struct
>>>>>>>> rte_eth_dev_info *dev_info)
>>>>>>>>              .txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |
>>>>>>>>                              ETH_TXQ_FLAGS_NOOFFLOADS,
>>>>>>>>      };
>>>>>>>> +
>>>>>>>> +  /*
>>>>>>>> +   * According to 82599 and x540 specifications RS bit *must* be
>>>> set on
>>>>>> the
>>>>>>>> +   * last descriptor of *every* packet. Therefore we will not allow
>>>> the
>>>>>>>> +   * tx_rs_thresh above 1 for all NICs newer than 82598.
>>>>>>>> +   */
>>>>>>>> +  if (hw->mac.type > ixgbe_mac_82598EB)
>>>>>>>> +          dev_info->default_txconf.tx_rs_thresh = 1;
>>>>>>>> +
>>>>>>>>      dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX * sizeof(uint32_t);
>>>>>>>>      dev_info->reta_size = ETH_RSS_RETA_SIZE_128;
>>>>>>>>      dev_info->flow_type_rss_offloads = IXGBE_RSS_OFFLOAD_ALL; diff --
>>>>>> git
>>>>>>>> a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>>>> index
>>>>>>>> 91023b9..8dbdffc 100644
>>>>>>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>>>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>>>>> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev
>>>>>>>> *dev,
>>>>>>>>      struct ixgbe_tx_queue *txq;
>>>>>>>>      struct ixgbe_hw     *hw;
>>>>>>>>      uint16_t tx_rs_thresh, tx_free_thresh;
>>>>>>>> +  bool rs_deferring_allowed;
>>>>>>>>
>>>>>>>>      PMD_INIT_FUNC_TRACE();
>>>>>>>>      hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>>>>>>>>
>>>>>>>>      /*
>>>>>>>> +   * According to 82599 and x540 specifications RS bit *must* be
>>>> set on
>>>>>> the
>>>>>>>> +   * last descriptor of *every* packet. Therefore we will not allow
>>>> the
>>>>>>>> +   * tx_rs_thresh above 1 for all NICs newer than 82598.
>>>>>>>> +   */
>>>>>>>> +  rs_deferring_allowed = (hw->mac.type <= ixgbe_mac_82598EB);
>>>>>>>> +
>>>>>>>> +  /*
>>>>>>>>       * Validate number of transmit descriptors.
>>>>>>>>       * It must not exceed hardware maximum, and must be multiple
>>>>>>>>       * of IXGBE_ALIGN.
>>>>>>>> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev
>>>>>> *dev,
>>>>>>>>       * to transmit a packet is greater than the number of free TX
>>>>>>>>       * descriptors.
>>>>>>>>       * The following constraints must be satisfied:
>>>>>>>> +   *  tx_rs_thresh must be less than 2 for NICs for which RS
>>>> deferring is
>>>>>>>> +   *  forbidden (all but 82598).
>>>>>>>>       *  tx_rs_thresh must be greater than 0.
>>>>>>>>       *  tx_rs_thresh must be less than the size of the ring minus 2.
>>>>>>>>       *  tx_rs_thresh must be less than or equal to tx_free_thresh.
>>>>>>>> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev
>>>>>> *dev,
>>>>>>>>       * When set to zero use default values.
>>>>>>>>       */
>>>>>>>>      tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?
>>>>>>>> -                  tx_conf->tx_rs_thresh : DEFAULT_TX_RS_THRESH);
>>>>>>>> +                  tx_conf->tx_rs_thresh :
>>>>>>>> +                  (rs_deferring_allowed ? DEFAULT_TX_RS_THRESH :
>>>> 1));
>>>>>>>>      tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?
>>>>>>>>                      tx_conf->tx_free_thresh : DEFAULT_TX_FREE_THRESH);
>>>>>>>> +
>>>>>>>> +  if (!rs_deferring_allowed && tx_rs_thresh > 1) {
>>>>>>>> +          PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than 2 since
>>>> RS
>>>>>> "
>>>>>>>> +                            "must be set for every packet for this
>>>> HW. "
>>>>>>>> +                            "(tx_rs_thresh=%u port=%d queue=%d)",
>>>>>>>> +                       (unsigned int)tx_rs_thresh,
>>>>>>>> +                       (int)dev->data->port_id, (int)queue_idx);
>>>>>>>> +          return -(EINVAL);
>>>>>>>> +  }
>>>>>>>> +
>>>>>>>>      if (tx_rs_thresh >= (nb_desc - 2)) {
>>>>>>>>              PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than the
>>>>>> number "
>>>>>>>>                           "of TX descriptors minus 2. (tx_rs_thresh=%u
>>>> "
>>>>>>>> --
>>>>>>>> 2.1.0
  
Vladislav Zolotarov Aug. 20, 2015, 9:05 a.m. UTC | #10
On 08/20/15 11:56, Vlad Zolotarov wrote:
>
>
> On 08/20/15 11:41, Ananyev, Konstantin wrote:
>> Hi Vlad,
>>
>>> -----Original Message-----
>>> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
>>> Sent: Wednesday, August 19, 2015 11:03 AM
>>> To: Ananyev, Konstantin; Lu, Wenzhuo
>>> Cc: dev@dpdk.org
>>> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh 
>>> above 1 for all NICs but 82598
>>>
>>>
>>>
>>> On 08/19/15 10:43, Ananyev, Konstantin wrote:
>>>> Hi Vlad,
>>>> Sorry for delay with review, I am OOO till next week.
>>>> Meanwhile, few questions/comments from me.
>>> Hi, Konstantin, long time no see... ;)
>>>
>>>>>>>>> This patch fixes the Tx hang we were constantly hitting with a
>>>>> seastar-based
>>>>>>>>> application on x540 NIC.
>>>>>>>> Could you help to share with us how to reproduce the tx hang 
>>>>>>>> issue,
>>>>> with using
>>>>>>>> typical DPDK examples?
>>>>>>> Sorry. I'm not very familiar with the typical DPDK examples to 
>>>>>>> help u
>>>>>>> here. However this is quite irrelevant since without this this 
>>>>>>> patch
>>>>>>> ixgbe PMD obviously abuses the HW spec as has been explained above.
>>>>>>>
>>>>>>> We saw the issue when u stressed the xmit path with a lot of highly
>>>>>>> fragmented TCP frames (packets with up to 33 fragments with 
>>>>>>> non-headers
>>>>>>> fragments as small as 4 bytes) with all offload features enabled.
>>>> Could you provide us with the pcap file to reproduce the issue?
>>> Well, the thing is it takes some time to reproduce it (a few minutes of
>>> heavy load) therefore a pcap would be quite large.
>> Probably you can upload it to some place, from which we will be able 
>> to download it?
>
> I'll see what I can do but no promises...

On second thought, a pcap file won't help u much since in order to 
reproduce the issue u have to reproduce exactly the same structure of 
clusters I give to the HW, and that's not what u see on the wire in a TSO case.
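
If it helps, one thing that could be captured instead is the cluster layout
itself rather than the payload - e.g. a tiny helper along these lines
(hypothetical, not our actual code) that logs nb_segs and the per-segment
data_len of every mbuf handed to rte_eth_tx_burst():

#include <stdio.h>
#include <rte_mbuf.h>

/* Log the cluster structure (not the payload) of one xmit packet. */
static void
log_tx_layout(FILE *f, const struct rte_mbuf *pkt)
{
        const struct rte_mbuf *seg;

        fprintf(f, "pkt_len=%u nb_segs=%u ol_flags=0x%llx:",
                (unsigned)pkt->pkt_len, (unsigned)pkt->nb_segs,
                (unsigned long long)pkt->ol_flags);
        for (seg = pkt; seg != NULL; seg = seg->next)
                fprintf(f, " %u", (unsigned)seg->data_len);
        fprintf(f, "\n");
}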

>
>> Or might be you have some sort of scapy script to generate it?
>> I suppose we'll need something to reproduce the issue and verify the 
>> fix.
>
> Since the original code abuses the HW spec u don't have to... ;)
>
>>
>>>> My concern with you approach is that it would affect TX performance.
>>> It certainly will ;) But it seem inevitable. See below.
>>>
>>>> Right now, for simple TX PMD usually reads only 
>>>> (nb_tx_desc/tx_rs_thresh) TXDs,
>>>> While with your patch (if I understand it correctly) it has to read 
>>>> all TXDs in the HW TX ring.
>>> If by "simple" u refer an always single fragment per Tx packet - then u
>>> are absolutely correct.
>>>
>>> My initial patch was to only set RS on every EOP descriptor without
>>> changing the rs_thresh value and this patch worked.
>>> However HW spec doesn't ensure in a general case that packets are 
>>> always
>>> handled/completion write-back completes in the same order the packets
>>> are placed on the ring (see "Tx arbitration schemes" chapter in 82599
>>> spec for instance). Therefore AFAIU one should not assume that if
>>> packet[x+1] DD bit is set then packet[x] is completed too.
>>  From my understanding, TX arbitration controls the order in which 
>> TXDs from
>> different queues are fetched/processed.
>> But descriptors from the same TX queue are processed in FIFO order.
>> So, I think that  - yes, if TXD[x+1] DD bit is set, then TXD[x] is 
>> completed too,
>> and setting RS on every EOP TXD should be enough.
>
> Ok. I'll rework the patch under this assumption then.
>
>>
>>> That's why I changed the patch to be as u see it now. However if I miss
>>> something here and your HW people ensure the in-order completion 
>>> this of
>>> course may be changed back.
>>>
>>>> Even if we really need to setup RS bit in each TXD (I still doubt 
>>>> we really do) - ,
>>> Well, if u doubt u may ask the guys from the Intel networking division
>>> that wrote the 82599 and x540 HW specs where they clearly state 
>>> that. ;)
>> Good point, we'll see what we can do here :)
>> Konstantin
>>
>>>> I think inside PMD it still should be possible to check TX 
>>>> completion in chunks.
>>>> Konstantin
>>>>
>>>>
>>>>>>> Thanks,
>>>>>>> vlad
>>>>>>>>> Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
>>>>>>>>> ---
>>>>>>>>>     drivers/net/ixgbe/ixgbe_ethdev.c |  9 +++++++++
>>>>>>>>>     drivers/net/ixgbe/ixgbe_rxtx.c   | 23 ++++++++++++++++++++++-
>>>>>>>>>     2 files changed, 31 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>>>>> b/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>>>>> index b8ee1e9..6714fd9 100644
>>>>>>>>> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>>>>> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>>>>> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct rte_eth_dev 
>>>>>>>>> *dev,
>>>>>>> struct
>>>>>>>>> rte_eth_dev_info *dev_info)
>>>>>>>>>              .txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |
>>>>>>>>> ETH_TXQ_FLAGS_NOOFFLOADS,
>>>>>>>>>      };
>>>>>>>>> +
>>>>>>>>> +  /*
>>>>>>>>> +   * According to 82599 and x540 specifications RS bit *must* be
>>>>> set on
>>>>>>> the
>>>>>>>>> +   * last descriptor of *every* packet. Therefore we will not 
>>>>>>>>> allow
>>>>> the
>>>>>>>>> +   * tx_rs_thresh above 1 for all NICs newer than 82598.
>>>>>>>>> +   */
>>>>>>>>> +  if (hw->mac.type > ixgbe_mac_82598EB)
>>>>>>>>> + dev_info->default_txconf.tx_rs_thresh = 1;
>>>>>>>>> +
>>>>>>>>>      dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX * 
>>>>>>>>> sizeof(uint32_t);
>>>>>>>>>      dev_info->reta_size = ETH_RSS_RETA_SIZE_128;
>>>>>>>>>      dev_info->flow_type_rss_offloads = IXGBE_RSS_OFFLOAD_ALL; 
>>>>>>>>> diff --
>>>>>>> git
>>>>>>>>> a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>> index
>>>>>>>>> 91023b9..8dbdffc 100644
>>>>>>>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>>>>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>>>>>> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct 
>>>>>>>>> rte_eth_dev
>>>>>>>>> *dev,
>>>>>>>>>      struct ixgbe_tx_queue *txq;
>>>>>>>>>      struct ixgbe_hw     *hw;
>>>>>>>>>      uint16_t tx_rs_thresh, tx_free_thresh;
>>>>>>>>> +  bool rs_deferring_allowed;
>>>>>>>>>
>>>>>>>>>      PMD_INIT_FUNC_TRACE();
>>>>>>>>>      hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>>>>>>>>>
>>>>>>>>>      /*
>>>>>>>>> +   * According to 82599 and x540 specifications RS bit *must* be
>>>>> set on
>>>>>>> the
>>>>>>>>> +   * last descriptor of *every* packet. Therefore we will not 
>>>>>>>>> allow
>>>>> the
>>>>>>>>> +   * tx_rs_thresh above 1 for all NICs newer than 82598.
>>>>>>>>> +   */
>>>>>>>>> +  rs_deferring_allowed = (hw->mac.type <= ixgbe_mac_82598EB);
>>>>>>>>> +
>>>>>>>>> +  /*
>>>>>>>>>       * Validate number of transmit descriptors.
>>>>>>>>>       * It must not exceed hardware maximum, and must be multiple
>>>>>>>>>       * of IXGBE_ALIGN.
>>>>>>>>> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev
>>>>>>> *dev,
>>>>>>>>>       * to transmit a packet is greater than the number of 
>>>>>>>>> free TX
>>>>>>>>>       * descriptors.
>>>>>>>>>       * The following constraints must be satisfied:
>>>>>>>>> +   *  tx_rs_thresh must be less than 2 for NICs for which RS
>>>>> deferring is
>>>>>>>>> +   *  forbidden (all but 82598).
>>>>>>>>>       *  tx_rs_thresh must be greater than 0.
>>>>>>>>>       *  tx_rs_thresh must be less than the size of the ring 
>>>>>>>>> minus 2.
>>>>>>>>>       *  tx_rs_thresh must be less than or equal to 
>>>>>>>>> tx_free_thresh.
>>>>>>>>> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct 
>>>>>>>>> rte_eth_dev
>>>>>>> *dev,
>>>>>>>>>       * When set to zero use default values.
>>>>>>>>>       */
>>>>>>>>>      tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?
>>>>>>>>> -                  tx_conf->tx_rs_thresh : DEFAULT_TX_RS_THRESH);
>>>>>>>>> +                  tx_conf->tx_rs_thresh :
>>>>>>>>> +                  (rs_deferring_allowed ? DEFAULT_TX_RS_THRESH :
>>>>> 1));
>>>>>>>>>      tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?
>>>>>>>>>                      tx_conf->tx_free_thresh : 
>>>>>>>>> DEFAULT_TX_FREE_THRESH);
>>>>>>>>> +
>>>>>>>>> +  if (!rs_deferring_allowed && tx_rs_thresh > 1) {
>>>>>>>>> +          PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than 2 
>>>>>>>>> since
>>>>> RS
>>>>>>> "
>>>>>>>>> + "must be set for every packet for this
>>>>> HW. "
>>>>>>>>> + "(tx_rs_thresh=%u port=%d queue=%d)",
>>>>>>>>> +                       (unsigned int)tx_rs_thresh,
>>>>>>>>> + (int)dev->data->port_id, (int)queue_idx);
>>>>>>>>> +          return -(EINVAL);
>>>>>>>>> +  }
>>>>>>>>> +
>>>>>>>>>      if (tx_rs_thresh >= (nb_desc - 2)) {
>>>>>>>>>              PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than 
>>>>>>>>> the
>>>>>>> number "
>>>>>>>>> "of TX descriptors minus 2. (tx_rs_thresh=%u
>>>>> "
>>>>>>>>> -- 
>>>>>>>>> 2.1.0
>
  
Vladislav Zolotarov Aug. 20, 2015, 9:06 a.m. UTC | #11
On 08/20/15 12:05, Vlad Zolotarov wrote:
>
>
> On 08/20/15 11:56, Vlad Zolotarov wrote:
>>
>>
>> On 08/20/15 11:41, Ananyev, Konstantin wrote:
>>> Hi Vlad,
>>>
>>>> -----Original Message-----
>>>> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
>>>> Sent: Wednesday, August 19, 2015 11:03 AM
>>>> To: Ananyev, Konstantin; Lu, Wenzhuo
>>>> Cc: dev@dpdk.org
>>>> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh 
>>>> above 1 for all NICs but 82598
>>>>
>>>>
>>>>
>>>> On 08/19/15 10:43, Ananyev, Konstantin wrote:
>>>>> Hi Vlad,
>>>>> Sorry for delay with review, I am OOO till next week.
>>>>> Meanwhile, few questions/comments from me.
>>>> Hi, Konstantin, long time no see... ;)
>>>>
>>>>>>>>>> This patch fixes the Tx hang we were constantly hitting with a
>>>>>> seastar-based
>>>>>>>>>> application on x540 NIC.
>>>>>>>>> Could you help to share with us how to reproduce the tx hang 
>>>>>>>>> issue,
>>>>>> with using
>>>>>>>>> typical DPDK examples?
>>>>>>>> Sorry. I'm not very familiar with the typical DPDK examples to 
>>>>>>>> help u
>>>>>>>> here. However this is quite irrelevant since without this this 
>>>>>>>> patch
>>>>>>>> ixgbe PMD obviously abuses the HW spec as has been explained 
>>>>>>>> above.
>>>>>>>>
>>>>>>>> We saw the issue when u stressed the xmit path with a lot of 
>>>>>>>> highly
>>>>>>>> fragmented TCP frames (packets with up to 33 fragments with 
>>>>>>>> non-headers
>>>>>>>> fragments as small as 4 bytes) with all offload features enabled.
>>>>> Could you provide us with the pcap file to reproduce the issue?
>>>> Well, the thing is it takes some time to reproduce it (a few 
>>>> minutes of
>>>> heavy load) therefore a pcap would be quite large.
>>> Probably you can upload it to some place, from which we will be able 
>>> to download it?
>>
>> I'll see what I can do but no promises...
>
> On a second thought pcap file won't help u much since in order to 
> reproduce the issue u have to reproduce exactly the same structure of 
> clusters i give to HW and it's not what u see on wire in a TSO case.

And not only in a TSO case... ;)

>
>>
>>> Or might be you have some sort of scapy script to generate it?
>>> I suppose we'll need something to reproduce the issue and verify the 
>>> fix.
>>
>> Since the original code abuses the HW spec u don't have to... ;)
>>
>>>
>>>>> My concern with you approach is that it would affect TX performance.
>>>> It certainly will ;) But it seem inevitable. See below.
>>>>
>>>>> Right now, for simple TX PMD usually reads only 
>>>>> (nb_tx_desc/tx_rs_thresh) TXDs,
>>>>> While with your patch (if I understand it correctly) it has to 
>>>>> read all TXDs in the HW TX ring.
>>>> If by "simple" u refer an always single fragment per Tx packet - 
>>>> then u
>>>> are absolutely correct.
>>>>
>>>> My initial patch was to only set RS on every EOP descriptor without
>>>> changing the rs_thresh value and this patch worked.
>>>> However HW spec doesn't ensure in a general case that packets are 
>>>> always
>>>> handled/completion write-back completes in the same order the packets
>>>> are placed on the ring (see "Tx arbitration schemes" chapter in 82599
>>>> spec for instance). Therefore AFAIU one should not assume that if
>>>> packet[x+1] DD bit is set then packet[x] is completed too.
>>>  From my understanding, TX arbitration controls the order in which 
>>> TXDs from
>>> different queues are fetched/processed.
>>> But descriptors from the same TX queue are processed in FIFO order.
>>> So, I think that  - yes, if TXD[x+1] DD bit is set, then TXD[x] is 
>>> completed too,
>>> and setting RS on every EOP TXD should be enough.
>>
>> Ok. I'll rework the patch under this assumption then.
>>
>>>
>>>> That's why I changed the patch to be as u see it now. However if I 
>>>> miss
>>>> something here and your HW people ensure the in-order completion 
>>>> this of
>>>> course may be changed back.
>>>>
>>>>> Even if we really need to setup RS bit in each TXD (I still doubt 
>>>>> we really do) - ,
>>>> Well, if u doubt u may ask the guys from the Intel networking division
>>>> that wrote the 82599 and x540 HW specs where they clearly state 
>>>> that. ;)
>>> Good point, we'll see what we can do here :)
>>> Konstantin
>>>
>>>>> I think inside PMD it still should be possible to check TX 
>>>>> completion in chunks.
>>>>> Konstantin
>>>>>
>>>>>
>>>>>>>> Thanks,
>>>>>>>> vlad
>>>>>>>>>> Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
>>>>>>>>>> ---
>>>>>>>>>>     drivers/net/ixgbe/ixgbe_ethdev.c |  9 +++++++++
>>>>>>>>>>     drivers/net/ixgbe/ixgbe_rxtx.c   | 23 
>>>>>>>>>> ++++++++++++++++++++++-
>>>>>>>>>>     2 files changed, 31 insertions(+), 1 deletion(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>>>>>> b/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>>>>>> index b8ee1e9..6714fd9 100644
>>>>>>>>>> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>>>>>> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>>>>>> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct rte_eth_dev 
>>>>>>>>>> *dev,
>>>>>>>> struct
>>>>>>>>>> rte_eth_dev_info *dev_info)
>>>>>>>>>>              .txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |
>>>>>>>>>> ETH_TXQ_FLAGS_NOOFFLOADS,
>>>>>>>>>>      };
>>>>>>>>>> +
>>>>>>>>>> +  /*
>>>>>>>>>> +   * According to 82599 and x540 specifications RS bit 
>>>>>>>>>> *must* be
>>>>>> set on
>>>>>>>> the
>>>>>>>>>> +   * last descriptor of *every* packet. Therefore we will 
>>>>>>>>>> not allow
>>>>>> the
>>>>>>>>>> +   * tx_rs_thresh above 1 for all NICs newer than 82598.
>>>>>>>>>> +   */
>>>>>>>>>> +  if (hw->mac.type > ixgbe_mac_82598EB)
>>>>>>>>>> + dev_info->default_txconf.tx_rs_thresh = 1;
>>>>>>>>>> +
>>>>>>>>>>      dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX * 
>>>>>>>>>> sizeof(uint32_t);
>>>>>>>>>>      dev_info->reta_size = ETH_RSS_RETA_SIZE_128;
>>>>>>>>>>      dev_info->flow_type_rss_offloads = 
>>>>>>>>>> IXGBE_RSS_OFFLOAD_ALL; diff --
>>>>>>>> git
>>>>>>>>>> a/drivers/net/ixgbe/ixgbe_rxtx.c 
>>>>>>>>>> b/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>>> index
>>>>>>>>>> 91023b9..8dbdffc 100644
>>>>>>>>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>>>>>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>>>>>>> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct 
>>>>>>>>>> rte_eth_dev
>>>>>>>>>> *dev,
>>>>>>>>>>      struct ixgbe_tx_queue *txq;
>>>>>>>>>>      struct ixgbe_hw     *hw;
>>>>>>>>>>      uint16_t tx_rs_thresh, tx_free_thresh;
>>>>>>>>>> +  bool rs_deferring_allowed;
>>>>>>>>>>
>>>>>>>>>>      PMD_INIT_FUNC_TRACE();
>>>>>>>>>>      hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>>>>>>>>>>
>>>>>>>>>>      /*
>>>>>>>>>> +   * According to 82599 and x540 specifications RS bit 
>>>>>>>>>> *must* be
>>>>>> set on
>>>>>>>> the
>>>>>>>>>> +   * last descriptor of *every* packet. Therefore we will 
>>>>>>>>>> not allow
>>>>>> the
>>>>>>>>>> +   * tx_rs_thresh above 1 for all NICs newer than 82598.
>>>>>>>>>> +   */
>>>>>>>>>> +  rs_deferring_allowed = (hw->mac.type <= ixgbe_mac_82598EB);
>>>>>>>>>> +
>>>>>>>>>> +  /*
>>>>>>>>>>       * Validate number of transmit descriptors.
>>>>>>>>>>       * It must not exceed hardware maximum, and must be 
>>>>>>>>>> multiple
>>>>>>>>>>       * of IXGBE_ALIGN.
>>>>>>>>>> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct 
>>>>>>>>>> rte_eth_dev
>>>>>>>> *dev,
>>>>>>>>>>       * to transmit a packet is greater than the number of 
>>>>>>>>>> free TX
>>>>>>>>>>       * descriptors.
>>>>>>>>>>       * The following constraints must be satisfied:
>>>>>>>>>> +   *  tx_rs_thresh must be less than 2 for NICs for which RS
>>>>>> deferring is
>>>>>>>>>> +   *  forbidden (all but 82598).
>>>>>>>>>>       *  tx_rs_thresh must be greater than 0.
>>>>>>>>>>       *  tx_rs_thresh must be less than the size of the ring 
>>>>>>>>>> minus 2.
>>>>>>>>>>       *  tx_rs_thresh must be less than or equal to 
>>>>>>>>>> tx_free_thresh.
>>>>>>>>>> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct 
>>>>>>>>>> rte_eth_dev
>>>>>>>> *dev,
>>>>>>>>>>       * When set to zero use default values.
>>>>>>>>>>       */
>>>>>>>>>>      tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?
>>>>>>>>>> -                  tx_conf->tx_rs_thresh : 
>>>>>>>>>> DEFAULT_TX_RS_THRESH);
>>>>>>>>>> +                  tx_conf->tx_rs_thresh :
>>>>>>>>>> +                  (rs_deferring_allowed ? 
>>>>>>>>>> DEFAULT_TX_RS_THRESH :
>>>>>> 1));
>>>>>>>>>>      tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?
>>>>>>>>>>                      tx_conf->tx_free_thresh : 
>>>>>>>>>> DEFAULT_TX_FREE_THRESH);
>>>>>>>>>> +
>>>>>>>>>> +  if (!rs_deferring_allowed && tx_rs_thresh > 1) {
>>>>>>>>>> +          PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than 
>>>>>>>>>> 2 since
>>>>>> RS
>>>>>>>> "
>>>>>>>>>> + "must be set for every packet for this
>>>>>> HW. "
>>>>>>>>>> + "(tx_rs_thresh=%u port=%d queue=%d)",
>>>>>>>>>> +                       (unsigned int)tx_rs_thresh,
>>>>>>>>>> + (int)dev->data->port_id, (int)queue_idx);
>>>>>>>>>> +          return -(EINVAL);
>>>>>>>>>> +  }
>>>>>>>>>> +
>>>>>>>>>>      if (tx_rs_thresh >= (nb_desc - 2)) {
>>>>>>>>>>              PMD_INIT_LOG(ERR, "tx_rs_thresh must be less 
>>>>>>>>>> than the
>>>>>>>> number "
>>>>>>>>>> "of TX descriptors minus 2. (tx_rs_thresh=%u
>>>>>> "
>>>>>>>>>> -- 
>>>>>>>>>> 2.1.0
>>
>
  
Ananyev, Konstantin Aug. 25, 2015, 5:33 p.m. UTC | #12
Hi Vlad,

> -----Original Message-----

> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]

> Sent: Thursday, August 20, 2015 10:07 AM

> To: Ananyev, Konstantin; Lu, Wenzhuo

> Cc: dev@dpdk.org

> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598

> 

> 

> 

> On 08/20/15 12:05, Vlad Zolotarov wrote:

> >

> >

> > On 08/20/15 11:56, Vlad Zolotarov wrote:

> >>

> >>

> >> On 08/20/15 11:41, Ananyev, Konstantin wrote:

> >>> Hi Vlad,

> >>>

> >>>> -----Original Message-----

> >>>> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]

> >>>> Sent: Wednesday, August 19, 2015 11:03 AM

> >>>> To: Ananyev, Konstantin; Lu, Wenzhuo

> >>>> Cc: dev@dpdk.org

> >>>> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh

> >>>> above 1 for all NICs but 82598

> >>>>

> >>>>

> >>>>

> >>>> On 08/19/15 10:43, Ananyev, Konstantin wrote:

> >>>>> Hi Vlad,

> >>>>> Sorry for delay with review, I am OOO till next week.

> >>>>> Meanwhile, few questions/comments from me.

> >>>> Hi, Konstantin, long time no see... ;)

> >>>>

> >>>>>>>>>> This patch fixes the Tx hang we were constantly hitting with a

> >>>>>> seastar-based

> >>>>>>>>>> application on x540 NIC.

> >>>>>>>>> Could you help to share with us how to reproduce the tx hang

> >>>>>>>>> issue,

> >>>>>> with using

> >>>>>>>>> typical DPDK examples?

> >>>>>>>> Sorry. I'm not very familiar with the typical DPDK examples to

> >>>>>>>> help u

> >>>>>>>> here. However this is quite irrelevant since without this this

> >>>>>>>> patch

> >>>>>>>> ixgbe PMD obviously abuses the HW spec as has been explained

> >>>>>>>> above.

> >>>>>>>>

> >>>>>>>> We saw the issue when u stressed the xmit path with a lot of

> >>>>>>>> highly

> >>>>>>>> fragmented TCP frames (packets with up to 33 fragments with

> >>>>>>>> non-headers

> >>>>>>>> fragments as small as 4 bytes) with all offload features enabled.

> >>>>> Could you provide us with the pcap file to reproduce the issue?

> >>>> Well, the thing is it takes some time to reproduce it (a few

> >>>> minutes of

> >>>> heavy load) therefore a pcap would be quite large.

> >>> Probably you can upload it to some place, from which we will be able

> >>> to download it?

> >>

> >> I'll see what I can do but no promises...

> >

> > On a second thought pcap file won't help u much since in order to

> > reproduce the issue u have to reproduce exactly the same structure of

> > clusters i give to HW and it's not what u see on wire in a TSO case.

> 

> And not only in a TSO case... ;)


I understand that, but my thought was that you could add some sort of TX callback for rte_eth_tx_burst()
into your code that would write the packets into a pcap file, and then re-run your hang scenario
(a rough sketch of such a callback is below).
I know that it means extra work for you - but I think it would be very helpful if we were able to reproduce your hang scenario:
- if the HW guys confirm that setting the RS bit for every EOP packet is not really required,
  then we probably have to look at what else can cause it;
- it might be added to our validation cycle, to prevent hitting a similar problem in the future.
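
For example, something along these lines could be a starting point: a
callback registered with rte_eth_add_tx_callback() that linearizes each mbuf
chain and appends it to a pcap file via libpcap. This is only a sketch,
assuming your DPDK version provides rte_eth_add_tx_callback() with the 2.x
prototype (port id as uint8_t); error handling and locking are left out and
the helper names are made up:

#include <pcap/pcap.h>
#include <string.h>
#include <sys/time.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

static pcap_dumper_t *tx_dumper; /* opened once at init time */

/* Copy each mbuf chain into a flat buffer and append it to the pcap file. */
static uint16_t
dump_tx_cb(uint8_t port __rte_unused, uint16_t queue __rte_unused,
           struct rte_mbuf **pkts, uint16_t nb_pkts, void *arg __rte_unused)
{
        uint16_t i;

        for (i = 0; i < nb_pkts; i++) {
                unsigned char buf[65535];
                struct pcap_pkthdr hdr;
                const struct rte_mbuf *seg;
                uint32_t off = 0;

                for (seg = pkts[i]; seg != NULL && off < sizeof(buf);
                     seg = seg->next) {
                        uint32_t room = sizeof(buf) - off;
                        uint32_t len = seg->data_len < room ?
                                       seg->data_len : room;

                        memcpy(buf + off, rte_pktmbuf_mtod(seg, void *), len);
                        off += len;
                }
                gettimeofday(&hdr.ts, NULL);
                hdr.caplen = off;
                hdr.len = pkts[i]->pkt_len;
                pcap_dump((u_char *)tx_dumper, &hdr, buf);
        }
        return nb_pkts; /* observe only, never drop */
}

/* e.g. right after rte_eth_tx_queue_setup(): */
static void
dump_tx_init(uint8_t port, uint16_t queue, const char *path)
{
        pcap_t *p = pcap_open_dead(DLT_EN10MB, 65535);

        tx_dumper = pcap_dump_open(p, path);
        rte_eth_add_tx_callback(port, queue, dump_tx_cb, NULL);
}
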
Thanks
Konstantin

> 

> >

> >>

> >>> Or might be you have some sort of scapy script to generate it?

> >>> I suppose we'll need something to reproduce the issue and verify the

> >>> fix.

> >>

> >> Since the original code abuses the HW spec u don't have to... ;)

> >>

> >>>

> >>>>> My concern with you approach is that it would affect TX performance.

> >>>> It certainly will ;) But it seem inevitable. See below.

> >>>>

> >>>>> Right now, for simple TX PMD usually reads only

> >>>>> (nb_tx_desc/tx_rs_thresh) TXDs,

> >>>>> While with your patch (if I understand it correctly) it has to

> >>>>> read all TXDs in the HW TX ring.

> >>>> If by "simple" u refer an always single fragment per Tx packet -

> >>>> then u

> >>>> are absolutely correct.

> >>>>

> >>>> My initial patch was to only set RS on every EOP descriptor without

> >>>> changing the rs_thresh value and this patch worked.

> >>>> However HW spec doesn't ensure in a general case that packets are

> >>>> always

> >>>> handled/completion write-back completes in the same order the packets

> >>>> are placed on the ring (see "Tx arbitration schemes" chapter in 82599

> >>>> spec for instance). Therefore AFAIU one should not assume that if

> >>>> packet[x+1] DD bit is set then packet[x] is completed too.

> >>>  From my understanding, TX arbitration controls the order in which

> >>> TXDs from

> >>> different queues are fetched/processed.

> >>> But descriptors from the same TX queue are processed in FIFO order.

> >>> So, I think that  - yes, if TXD[x+1] DD bit is set, then TXD[x] is

> >>> completed too,

> >>> and setting RS on every EOP TXD should be enough.

> >>

> >> Ok. I'll rework the patch under this assumption then.

> >>

> >>>

> >>>> That's why I changed the patch to be as u see it now. However if I

> >>>> miss

> >>>> something here and your HW people ensure the in-order completion

> >>>> this of

> >>>> course may be changed back.

> >>>>

> >>>>> Even if we really need to setup RS bit in each TXD (I still doubt

> >>>>> we really do) - ,

> >>>> Well, if u doubt u may ask the guys from the Intel networking division

> >>>> that wrote the 82599 and x540 HW specs where they clearly state

> >>>> that. ;)

> >>> Good point, we'll see what we can do here :)

> >>> Konstantin

> >>>

> >>>>> I think inside PMD it still should be possible to check TX

> >>>>> completion in chunks.

> >>>>> Konstantin

> >>>>>

> >>>>>

> >>>>>>>> Thanks,

> >>>>>>>> vlad

> >>>>>>>>>> Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>

> >>>>>>>>>> ---

> >>>>>>>>>>     drivers/net/ixgbe/ixgbe_ethdev.c |  9 +++++++++

> >>>>>>>>>>     drivers/net/ixgbe/ixgbe_rxtx.c   | 23

> >>>>>>>>>> ++++++++++++++++++++++-

> >>>>>>>>>>     2 files changed, 31 insertions(+), 1 deletion(-)

> >>>>>>>>>>

> >>>>>>>>>> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c

> >>>>>>>>>> b/drivers/net/ixgbe/ixgbe_ethdev.c

> >>>>>>>>>> index b8ee1e9..6714fd9 100644

> >>>>>>>>>> --- a/drivers/net/ixgbe/ixgbe_ethdev.c

> >>>>>>>>>> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c

> >>>>>>>>>> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct rte_eth_dev

> >>>>>>>>>> *dev,

> >>>>>>>> struct

> >>>>>>>>>> rte_eth_dev_info *dev_info)

> >>>>>>>>>>              .txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |

> >>>>>>>>>> ETH_TXQ_FLAGS_NOOFFLOADS,

> >>>>>>>>>>      };

> >>>>>>>>>> +

> >>>>>>>>>> +  /*

> >>>>>>>>>> +   * According to 82599 and x540 specifications RS bit

> >>>>>>>>>> *must* be

> >>>>>> set on

> >>>>>>>> the

> >>>>>>>>>> +   * last descriptor of *every* packet. Therefore we will

> >>>>>>>>>> not allow

> >>>>>> the

> >>>>>>>>>> +   * tx_rs_thresh above 1 for all NICs newer than 82598.

> >>>>>>>>>> +   */

> >>>>>>>>>> +  if (hw->mac.type > ixgbe_mac_82598EB)

> >>>>>>>>>> + dev_info->default_txconf.tx_rs_thresh = 1;

> >>>>>>>>>> +

> >>>>>>>>>>      dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX *

> >>>>>>>>>> sizeof(uint32_t);

> >>>>>>>>>>      dev_info->reta_size = ETH_RSS_RETA_SIZE_128;

> >>>>>>>>>>      dev_info->flow_type_rss_offloads =

> >>>>>>>>>> IXGBE_RSS_OFFLOAD_ALL; diff --

> >>>>>>>> git

> >>>>>>>>>> a/drivers/net/ixgbe/ixgbe_rxtx.c

> >>>>>>>>>> b/drivers/net/ixgbe/ixgbe_rxtx.c

> >>>>>> index

> >>>>>>>>>> 91023b9..8dbdffc 100644

> >>>>>>>>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c

> >>>>>>>>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c

> >>>>>>>>>> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct

> >>>>>>>>>> rte_eth_dev

> >>>>>>>>>> *dev,

> >>>>>>>>>>      struct ixgbe_tx_queue *txq;

> >>>>>>>>>>      struct ixgbe_hw     *hw;

> >>>>>>>>>>      uint16_t tx_rs_thresh, tx_free_thresh;

> >>>>>>>>>> +  bool rs_deferring_allowed;

> >>>>>>>>>>

> >>>>>>>>>>      PMD_INIT_FUNC_TRACE();

> >>>>>>>>>>      hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);

> >>>>>>>>>>

> >>>>>>>>>>      /*

> >>>>>>>>>> +   * According to 82599 and x540 specifications RS bit

> >>>>>>>>>> *must* be

> >>>>>> set on

> >>>>>>>> the

> >>>>>>>>>> +   * last descriptor of *every* packet. Therefore we will

> >>>>>>>>>> not allow

> >>>>>> the

> >>>>>>>>>> +   * tx_rs_thresh above 1 for all NICs newer than 82598.

> >>>>>>>>>> +   */

> >>>>>>>>>> +  rs_deferring_allowed = (hw->mac.type <= ixgbe_mac_82598EB);

> >>>>>>>>>> +

> >>>>>>>>>> +  /*

> >>>>>>>>>>       * Validate number of transmit descriptors.

> >>>>>>>>>>       * It must not exceed hardware maximum, and must be

> >>>>>>>>>> multiple

> >>>>>>>>>>       * of IXGBE_ALIGN.

> >>>>>>>>>> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct

> >>>>>>>>>> rte_eth_dev

> >>>>>>>> *dev,

> >>>>>>>>>>       * to transmit a packet is greater than the number of

> >>>>>>>>>> free TX

> >>>>>>>>>>       * descriptors.

> >>>>>>>>>>       * The following constraints must be satisfied:

> >>>>>>>>>> +   *  tx_rs_thresh must be less than 2 for NICs for which RS

> >>>>>> deferring is

> >>>>>>>>>> +   *  forbidden (all but 82598).

> >>>>>>>>>>       *  tx_rs_thresh must be greater than 0.

> >>>>>>>>>>       *  tx_rs_thresh must be less than the size of the ring

> >>>>>>>>>> minus 2.

> >>>>>>>>>>       *  tx_rs_thresh must be less than or equal to

> >>>>>>>>>> tx_free_thresh.

> >>>>>>>>>> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct

> >>>>>>>>>> rte_eth_dev

> >>>>>>>> *dev,

> >>>>>>>>>>       * When set to zero use default values.

> >>>>>>>>>>       */

> >>>>>>>>>>      tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?

> >>>>>>>>>> -                  tx_conf->tx_rs_thresh :

> >>>>>>>>>> DEFAULT_TX_RS_THRESH);

> >>>>>>>>>> +                  tx_conf->tx_rs_thresh :

> >>>>>>>>>> +                  (rs_deferring_allowed ?

> >>>>>>>>>> DEFAULT_TX_RS_THRESH :

> >>>>>> 1));

> >>>>>>>>>>      tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?

> >>>>>>>>>>                      tx_conf->tx_free_thresh :

> >>>>>>>>>> DEFAULT_TX_FREE_THRESH);

> >>>>>>>>>> +

> >>>>>>>>>> +  if (!rs_deferring_allowed && tx_rs_thresh > 1) {

> >>>>>>>>>> +          PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than

> >>>>>>>>>> 2 since

> >>>>>> RS

> >>>>>>>> "

> >>>>>>>>>> + "must be set for every packet for this

> >>>>>> HW. "

> >>>>>>>>>> + "(tx_rs_thresh=%u port=%d queue=%d)",

> >>>>>>>>>> +                       (unsigned int)tx_rs_thresh,

> >>>>>>>>>> + (int)dev->data->port_id, (int)queue_idx);

> >>>>>>>>>> +          return -(EINVAL);

> >>>>>>>>>> +  }

> >>>>>>>>>> +

> >>>>>>>>>>      if (tx_rs_thresh >= (nb_desc - 2)) {

> >>>>>>>>>>              PMD_INIT_LOG(ERR, "tx_rs_thresh must be less

> >>>>>>>>>> than the

> >>>>>>>> number "

> >>>>>>>>>> "of TX descriptors minus 2. (tx_rs_thresh=%u

> >>>>>> "

> >>>>>>>>>> --

> >>>>>>>>>> 2.1.0

> >>

> >
  
Avi Kivity Aug. 25, 2015, 5:39 p.m. UTC | #13
On 08/25/2015 08:33 PM, Ananyev, Konstantin wrote:
> Hi Vlad,
>
>> -----Original Message-----
>> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
>> Sent: Thursday, August 20, 2015 10:07 AM
>> To: Ananyev, Konstantin; Lu, Wenzhuo
>> Cc: dev@dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598
>>
>>
>>
>> On 08/20/15 12:05, Vlad Zolotarov wrote:
>>>
>>> On 08/20/15 11:56, Vlad Zolotarov wrote:
>>>>
>>>> On 08/20/15 11:41, Ananyev, Konstantin wrote:
>>>>> Hi Vlad,
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
>>>>>> Sent: Wednesday, August 19, 2015 11:03 AM
>>>>>> To: Ananyev, Konstantin; Lu, Wenzhuo
>>>>>> Cc: dev@dpdk.org
>>>>>> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh
>>>>>> above 1 for all NICs but 82598
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 08/19/15 10:43, Ananyev, Konstantin wrote:
>>>>>>> Hi Vlad,
>>>>>>> Sorry for delay with review, I am OOO till next week.
>>>>>>> Meanwhile, few questions/comments from me.
>>>>>> Hi, Konstantin, long time no see... ;)
>>>>>>
>>>>>>>>>>>> This patch fixes the Tx hang we were constantly hitting with a
>>>>>>>> seastar-based
>>>>>>>>>>>> application on x540 NIC.
>>>>>>>>>>> Could you help to share with us how to reproduce the tx hang
>>>>>>>>>>> issue,
>>>>>>>> with using
>>>>>>>>>>> typical DPDK examples?
>>>>>>>>>> Sorry. I'm not very familiar with the typical DPDK examples to
>>>>>>>>>> help u
>>>>>>>>>> here. However this is quite irrelevant since without this this
>>>>>>>>>> patch
>>>>>>>>>> ixgbe PMD obviously abuses the HW spec as has been explained
>>>>>>>>>> above.
>>>>>>>>>>
>>>>>>>>>> We saw the issue when u stressed the xmit path with a lot of
>>>>>>>>>> highly
>>>>>>>>>> fragmented TCP frames (packets with up to 33 fragments with
>>>>>>>>>> non-headers
>>>>>>>>>> fragments as small as 4 bytes) with all offload features enabled.
>>>>>>> Could you provide us with the pcap file to reproduce the issue?
>>>>>> Well, the thing is it takes some time to reproduce it (a few
>>>>>> minutes of
>>>>>> heavy load) therefore a pcap would be quite large.
>>>>> Probably you can upload it to some place, from which we will be able
>>>>> to download it?
>>>> I'll see what I can do but no promises...
>>> On a second thought pcap file won't help u much since in order to
>>> reproduce the issue u have to reproduce exactly the same structure of
>>> clusters i give to HW and it's not what u see on wire in a TSO case.
>> And not only in a TSO case... ;)
> I understand that, but my thought was you can add some sort of TX callback for the rte_eth_tx_burst()
> into your code that would write the packet into pcap file and then re-run your hang scenario.
> I know that it means extra work for you - but I think it would be very helpful if we would be able to reproduce your hang scenario:
> - if HW guys would confirm that setting RS bit for every EOP packet is not really required,
>    then we probably have to look at what else can cause it.
> - it might be added to our validation cycle, to prevent hitting similar problem in future.
> Thanks
> Konstantin
>


I think if you send packets with random fragment chains of up to 32 mbufs
you might see this. TSO was not required to trigger this problem.
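
For what it's worth, a rough sketch of such a generator (hypothetical helper,
not our actual code): split a payload across up to 32 mbufs with random,
sometimes tiny, segment sizes and hand the resulting chain to
rte_eth_tx_burst():

#include <string.h>
#include <rte_mbuf.h>
#include <rte_random.h>

/*
 * Build one multi-segment packet of 'len' bytes spread over up to 'max_segs'
 * mbufs with random segment sizes (possibly just a few bytes each).
 * Returns NULL on failure.  Sketch only.
 */
static struct rte_mbuf *
make_fragged_pkt(struct rte_mempool *mp, const uint8_t *payload,
                 uint16_t len, uint16_t max_segs)
{
        struct rte_mbuf *head = NULL, *prev = NULL;
        uint16_t off = 0, nseg = 0;

        if (len == 0 || max_segs == 0)
                return NULL;

        while (off < len && nseg < max_segs) {
                struct rte_mbuf *m = rte_pktmbuf_alloc(mp);
                uint16_t left = len - off;
                uint16_t chunk;
                char *dst;

                if (m == NULL)
                        goto fail;
                /* the last allowed segment takes the rest, others 1..left */
                chunk = (nseg == max_segs - 1) ? left :
                        (uint16_t)(rte_rand() % left + 1);
                dst = rte_pktmbuf_append(m, chunk);
                if (dst == NULL) {
                        rte_pktmbuf_free(m);
                        goto fail;
                }
                memcpy(dst, payload + off, chunk);

                if (head == NULL)
                        head = m;
                else
                        prev->next = m;
                prev = m;
                off += chunk;
                nseg++;
        }
        head->nb_segs = nseg;
        head->pkt_len = off;
        return head;

fail:
        rte_pktmbuf_free(head);
        return NULL;
}

Bursting a few minutes of such chains with the checksum/TSO offload flags set
is roughly the load pattern that triggered the hang for us.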
  
Zhang, Helin Aug. 25, 2015, 6:13 p.m. UTC | #14
Hi Vlad

In addition, I’d like to double-check with you: what is the maximum number of descriptors used for transmitting a single packet?
The datasheet says that it supports up to 8. I am wondering if more than 8 were used in your case.
Thank you very much!

Regards,
Helin

From: Zhang, Helin

Sent: Wednesday, August 19, 2015 10:29 AM
To: Vladislav Zolotarov
Cc: Lu, Wenzhuo; dev@dpdk.org
Subject: RE: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598

Hi Vlad

Thank you very much for the patches! Give me a few more time to double check with more guys, and possibly hardware experts.

Regards,
Helin

From: Vladislav Zolotarov [mailto:vladz@cloudius-systems.com]

Sent: Tuesday, August 18, 2015 9:56 PM
To: Lu, Wenzhuo
Cc: dev@dpdk.org<mailto:dev@dpdk.org>; Zhang, Helin
Subject: RE: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598


On Aug 19, 2015 03:42, "Lu, Wenzhuo" <wenzhuo.lu@intel.com<mailto:wenzhuo.lu@intel.com>> wrote:
>

> Hi Helin,

>

> > -----Original Message-----

> > From: dev [mailto:dev-bounces@dpdk.org<mailto:dev-bounces@dpdk.org>] On Behalf Of Vlad Zolotarov

> > Sent: Friday, August 14, 2015 1:38 PM

> > To: Zhang, Helin; dev@dpdk.org<mailto:dev@dpdk.org>

> > Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for

> > all NICs but 82598

> >

> >

> >

> > On 08/13/15 23:28, Zhang, Helin wrote:

> > > Hi Vlad

> > >

> > > I don't think the changes are needed. It says in datasheet that the RS

> > > bit should be set on the last descriptor of every packet, ONLY WHEN

> > TXDCTL.WTHRESH equals to ZERO.

> >

> > Of course it's needed! See below.

> > Exactly the same spec a few lines above the place u've just quoted states:

> >

> > "Software should not set the RS bit when TXDCTL.WTHRESH is greater than

> > zero."

> >

> > And since all three (3) ixgbe xmit callbacks are utilizing RS bit notation ixgbe PMD

> > is actually not supporting any value of WTHRESH different from zero.

> I think Vlad is right. We need to fix this issue. Any suggestion? If not, I'd like to ack this patch.


Pls., note that there is a v2 of this patch on the list. I forgot to patch ixgbevf_dev_info_get() in v1.

>

> >

> > >

> > > Regards,

> > > Helin

> > >

> > >> -----Original Message-----

> > >> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com<mailto:vladz@cloudius-systems.com>]

> > >> Sent: Thursday, August 13, 2015 11:07 AM

> > >> To: dev@dpdk.org<mailto:dev@dpdk.org>

> > >> Cc: Zhang, Helin; Ananyev, Konstantin; avi@cloudius-systems.com<mailto:avi@cloudius-systems.com>; Vlad

> > >> Zolotarov

> > >> Subject: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for

> > all

> > >> NICs but 82598

> > >>

> > >> According to 82599 and x540 HW specifications RS bit *must* be set in the

> > last

> > >> descriptor of *every* packet.

> > > There is a condition that if TXDCTL.WTHRESH equal to zero.

> >

> > Right and ixgbe PMD requires this condition to be fulfilled in order to

> > function. See above.

> >

> > >

> > >> This patch fixes the Tx hang we were constantly hitting with a seastar-based

> > >> application on x540 NIC.

> > > Could you help to share with us how to reproduce the tx hang issue, with using

> > > typical DPDK examples?

> >

> > Sorry. I'm not very familiar with the typical DPDK examples to help u

> > here. However this is quite irrelevant since without this this patch

> > ixgbe PMD obviously abuses the HW spec as has been explained above.

> >

> > We saw the issue when u stressed the xmit path with a lot of highly

> > fragmented TCP frames (packets with up to 33 fragments with non-headers

> > fragments as small as 4 bytes) with all offload features enabled.

> >

> > Thanks,

> > vlad

> > >

> > >> Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com<mailto:vladz@cloudius-systems.com>>

> > >> ---

> > >>   drivers/net/ixgbe/ixgbe_ethdev.c |  9 +++++++++

> > >>   drivers/net/ixgbe/ixgbe_rxtx.c   | 23 ++++++++++++++++++++++-

> > >>   2 files changed, 31 insertions(+), 1 deletion(-)

> > >>

> > >> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c

> > >> b/drivers/net/ixgbe/ixgbe_ethdev.c

> > >> index b8ee1e9..6714fd9 100644

> > >> --- a/drivers/net/ixgbe/ixgbe_ethdev.c

> > >> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c

> > >> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev,

> > struct

> > >> rte_eth_dev_info *dev_info)

> > >>            .txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |

> > >>                            ETH_TXQ_FLAGS_NOOFFLOADS,

> > >>    };

> > >> +

> > >> +  /*

> > >> +   * According to 82599 and x540 specifications RS bit *must* be set on

> > the

> > >> +   * last descriptor of *every* packet. Therefore we will not allow the

> > >> +   * tx_rs_thresh above 1 for all NICs newer than 82598.

> > >> +   */

> > >> +  if (hw->mac.type > ixgbe_mac_82598EB)

> > >> +          dev_info->default_txconf.tx_rs_thresh = 1;

> > >> +

> > >>    dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX * sizeof(uint32_t);

> > >>    dev_info->reta_size = ETH_RSS_RETA_SIZE_128;

> > >>    dev_info->flow_type_rss_offloads = IXGBE_RSS_OFFLOAD_ALL; diff --

> > git

> > >> a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c index

> > >> 91023b9..8dbdffc 100644

> > >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c

> > >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c

> > >> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev

> > >> *dev,

> > >>    struct ixgbe_tx_queue *txq;

> > >>    struct ixgbe_hw     *hw;

> > >>    uint16_t tx_rs_thresh, tx_free_thresh;

> > >> +  bool rs_deferring_allowed;

> > >>

> > >>    PMD_INIT_FUNC_TRACE();

> > >>    hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);

> > >>

> > >>    /*

> > >> +   * According to 82599 and x540 specifications RS bit *must* be set on

> > the

> > >> +   * last descriptor of *every* packet. Therefore we will not allow the

> > >> +   * tx_rs_thresh above 1 for all NICs newer than 82598.

> > >> +   */

> > >> +  rs_deferring_allowed = (hw->mac.type <= ixgbe_mac_82598EB);

> > >> +

> > >> +  /*

> > >>     * Validate number of transmit descriptors.

> > >>     * It must not exceed hardware maximum, and must be multiple

> > >>     * of IXGBE_ALIGN.

> > >> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev

> > *dev,

> > >>     * to transmit a packet is greater than the number of free TX

> > >>     * descriptors.

> > >>     * The following constraints must be satisfied:

> > >> +   *  tx_rs_thresh must be less than 2 for NICs for which RS deferring is

> > >> +   *  forbidden (all but 82598).

> > >>     *  tx_rs_thresh must be greater than 0.

> > >>     *  tx_rs_thresh must be less than the size of the ring minus 2.

> > >>     *  tx_rs_thresh must be less than or equal to tx_free_thresh.

> > >> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev

> > *dev,

> > >>     * When set to zero use default values.

> > >>     */

> > >>    tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?

> > >> -                  tx_conf->tx_rs_thresh : DEFAULT_TX_RS_THRESH);

> > >> +                  tx_conf->tx_rs_thresh :

> > >> +                  (rs_deferring_allowed ? DEFAULT_TX_RS_THRESH : 1));

> > >>    tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?

> > >>                    tx_conf->tx_free_thresh : DEFAULT_TX_FREE_THRESH);

> > >> +

> > >> +  if (!rs_deferring_allowed && tx_rs_thresh > 1) {

> > >> +          PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than 2 since RS

> > "

> > >> +                            "must be set for every packet for this HW. "

> > >> +                            "(tx_rs_thresh=%u port=%d queue=%d)",

> > >> +                       (unsigned int)tx_rs_thresh,

> > >> +                       (int)dev->data->port_id, (int)queue_idx);

> > >> +          return -(EINVAL);

> > >> +  }

> > >> +

> > >>    if (tx_rs_thresh >= (nb_desc - 2)) {

> > >>            PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than the

> > number "

> > >>                         "of TX descriptors minus 2. (tx_rs_thresh=%u "

> > >> --

> > >> 2.1.0

>
  
Vladislav Zolotarov Aug. 25, 2015, 6:33 p.m. UTC | #15
On Aug 25, 2015 21:14, "Zhang, Helin" <helin.zhang@intel.com> wrote:
>
> Hi Vlad
>
>
>
> In addition, I’d double check with you what’s the maximum number of
> descriptors would be used for a single packet transmitting?
>
> Datasheet said that it supports up to 8. I am wondering if more than 8
> were used in your case?

If memory serves me well, the maximum number of data descriptors per single
xmit packet is 40 minus 2 minus WTHRESH. Since WTHRESH in DPDK is always
zero, that gives us 38 segments. We limit them to 33.

>
> Thank you very much!
>
>
>
> Regards,
>
> Helin
>
>
>
> From: Zhang, Helin
> Sent: Wednesday, August 19, 2015 10:29 AM
> To: Vladislav Zolotarov
> Cc: Lu, Wenzhuo; dev@dpdk.org
>
> Subject: RE: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1
for all NICs but 82598
>
>
>
> Hi Vlad
>
>
>
> Thank you very much for the patches! Give me a few more time to double
check with more guys, and possibly hardware experts.
>
>
>
> Regards,
>
> Helin
>
>
>
> From: Vladislav Zolotarov [mailto:vladz@cloudius-systems.com]
> Sent: Tuesday, August 18, 2015 9:56 PM
> To: Lu, Wenzhuo
> Cc: dev@dpdk.org; Zhang, Helin
> Subject: RE: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1
for all NICs but 82598
>
>
>
>
> On Aug 19, 2015 03:42, "Lu, Wenzhuo" <wenzhuo.lu@intel.com> wrote:
> >
> > Hi Helin,
> >
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Vlad Zolotarov
> > > Sent: Friday, August 14, 2015 1:38 PM
> > > To: Zhang, Helin; dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh
above 1 for
> > > all NICs but 82598
> > >
> > >
> > >
> > > On 08/13/15 23:28, Zhang, Helin wrote:
> > > > Hi Vlad
> > > >
> > > > I don't think the changes are needed. It says in datasheet that the
RS
> > > > bit should be set on the last descriptor of every packet, ONLY WHEN
> > > TXDCTL.WTHRESH equals to ZERO.
> > >
> > > Of course it's needed! See below.
> > > Exactly the same spec a few lines above the place u've just quoted
states:
> > >
> > > "Software should not set the RS bit when TXDCTL.WTHRESH is greater
than
> > > zero."
> > >
> > > And since all three (3) ixgbe xmit callbacks are utilizing RS bit
notation ixgbe PMD
> > > is actually not supporting any value of WTHRESH different from zero.
> > I think Vlad is right. We need to fix this issue. Any suggestion? If
not, I'd like to ack this patch.
>
> Pls., note that there is a v2 of this patch on the list. I forgot to
patch ixgbevf_dev_info_get() in v1.
>
> >
> > >
> > > >
> > > > Regards,
> > > > Helin
> > > >
> > > >> -----Original Message-----
> > > >> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
> > > >> Sent: Thursday, August 13, 2015 11:07 AM
> > > >> To: dev@dpdk.org
> > > >> Cc: Zhang, Helin; Ananyev, Konstantin; avi@cloudius-systems.com;
Vlad
> > > >> Zolotarov
> > > >> Subject: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh
above 1 for
> > > all
> > > >> NICs but 82598
> > > >>
> > > >> According to 82599 and x540 HW specifications RS bit *must* be set
in the
> > > last
> > > >> descriptor of *every* packet.
> > > > There is a condition that if TXDCTL.WTHRESH equal to zero.
> > >
> > > Right and ixgbe PMD requires this condition to be fulfilled in order
to
> > > function. See above.
> > >
> > > >
> > > >> This patch fixes the Tx hang we were constantly hitting with a
seastar-based
> > > >> application on x540 NIC.
> > > > Could you help to share with us how to reproduce the tx hang issue,
with using
> > > > typical DPDK examples?
> > >
> > > Sorry. I'm not very familiar with the typical DPDK examples to help u
> > > here. However this is quite irrelevant since without this this patch
> > > ixgbe PMD obviously abuses the HW spec as has been explained above.
> > >
> > > We saw the issue when u stressed the xmit path with a lot of highly
> > > fragmented TCP frames (packets with up to 33 fragments with
non-headers
> > > fragments as small as 4 bytes) with all offload features enabled.
> > >
> > > Thanks,
> > > vlad
> > > >
> > > >> Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
> > > >> ---
> > > >>   drivers/net/ixgbe/ixgbe_ethdev.c |  9 +++++++++
> > > >>   drivers/net/ixgbe/ixgbe_rxtx.c   | 23 ++++++++++++++++++++++-
> > > >>   2 files changed, 31 insertions(+), 1 deletion(-)
> > > >>
> > > >> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c
> > > >> b/drivers/net/ixgbe/ixgbe_ethdev.c
> > > >> index b8ee1e9..6714fd9 100644
> > > >> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
> > > >> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
> > > >> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev,
> > > struct
> > > >> rte_eth_dev_info *dev_info)
> > > >>            .txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |
> > > >>                            ETH_TXQ_FLAGS_NOOFFLOADS,
> > > >>    };
> > > >> +
> > > >> +  /*
> > > >> +   * According to 82599 and x540 specifications RS bit *must* be
set on
> > > the
> > > >> +   * last descriptor of *every* packet. Therefore we will not
allow the
> > > >> +   * tx_rs_thresh above 1 for all NICs newer than 82598.
> > > >> +   */
> > > >> +  if (hw->mac.type > ixgbe_mac_82598EB)
> > > >> +          dev_info->default_txconf.tx_rs_thresh = 1;
> > > >> +
> > > >>    dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX *
sizeof(uint32_t);
> > > >>    dev_info->reta_size = ETH_RSS_RETA_SIZE_128;
> > > >>    dev_info->flow_type_rss_offloads = IXGBE_RSS_OFFLOAD_ALL; diff
--
> > > git
> > > >> a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index
> > > >> 91023b9..8dbdffc 100644
> > > >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> > > >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> > > >> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev
> > > >> *dev,
> > > >>    struct ixgbe_tx_queue *txq;
> > > >>    struct ixgbe_hw     *hw;
> > > >>    uint16_t tx_rs_thresh, tx_free_thresh;
> > > >> +  bool rs_deferring_allowed;
> > > >>
> > > >>    PMD_INIT_FUNC_TRACE();
> > > >>    hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> > > >>
> > > >>    /*
> > > >> +   * According to 82599 and x540 specifications RS bit *must* be
set on
> > > the
> > > >> +   * last descriptor of *every* packet. Therefore we will not
allow the
> > > >> +   * tx_rs_thresh above 1 for all NICs newer than 82598.
> > > >> +   */
> > > >> +  rs_deferring_allowed = (hw->mac.type <= ixgbe_mac_82598EB);
> > > >> +
> > > >> +  /*
> > > >>     * Validate number of transmit descriptors.
> > > >>     * It must not exceed hardware maximum, and must be multiple
> > > >>     * of IXGBE_ALIGN.
> > > >> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev
> > > *dev,
> > > >>     * to transmit a packet is greater than the number of free TX
> > > >>     * descriptors.
> > > >>     * The following constraints must be satisfied:
> > > >> +   *  tx_rs_thresh must be less than 2 for NICs for which RS
deferring is
> > > >> +   *  forbidden (all but 82598).
> > > >>     *  tx_rs_thresh must be greater than 0.
> > > >>     *  tx_rs_thresh must be less than the size of the ring minus 2.
> > > >>     *  tx_rs_thresh must be less than or equal to tx_free_thresh.
> > > >> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev
> > > *dev,
> > > >>     * When set to zero use default values.
> > > >>     */
> > > >>    tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?
> > > >> -                  tx_conf->tx_rs_thresh : DEFAULT_TX_RS_THRESH);
> > > >> +                  tx_conf->tx_rs_thresh :
> > > >> +                  (rs_deferring_allowed ? DEFAULT_TX_RS_THRESH :
1));
> > > >>    tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?
> > > >>                    tx_conf->tx_free_thresh :
DEFAULT_TX_FREE_THRESH);
> > > >> +
> > > >> +  if (!rs_deferring_allowed && tx_rs_thresh > 1) {
> > > >> +          PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than 2
since RS
> > > "
> > > >> +                            "must be set for every packet for
this HW. "
> > > >> +                            "(tx_rs_thresh=%u port=%d queue=%d)",
> > > >> +                       (unsigned int)tx_rs_thresh,
> > > >> +                       (int)dev->data->port_id, (int)queue_idx);
> > > >> +          return -(EINVAL);
> > > >> +  }
> > > >> +
> > > >>    if (tx_rs_thresh >= (nb_desc - 2)) {
> > > >>            PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than the
> > > number "
> > > >>                         "of TX descriptors minus 2.
(tx_rs_thresh=%u "
> > > >> --
> > > >> 2.1.0
> >
  
Zhang, Helin Aug. 25, 2015, 6:43 p.m. UTC | #16
Hi Vlad

I think this could possibly be the root cause of your TX hang issue. Please try to limit the number to 8 or less, and then see whether the issue is still there.
The PMD does not have any check for the number of descriptors to be used for a single packet; it relies on the users to give correct mbuf chains.

We may need a check for this somewhere. Of course, the point you indicated also needs to be carefully investigated or fixed.
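
If it helps, a rough sketch of the kind of guard the PMD could apply in its
xmit path (the constant name and value are placeholders - whether the right
per-packet budget is 8 or 40 - 2 - WTHRESH is exactly what needs to be
confirmed):

#include <rte_mbuf.h>

/* Placeholder for the per-packet data-descriptor budget to be confirmed. */
#define TX_MAX_SEG_SKETCH 38

/*
 * Refuse packets whose mbuf chain needs more data descriptors than the
 * hardware supports, instead of posting them and risking a queue hang.
 * An offload context descriptor would add one more on top of nb_segs.
 */
static inline int
tx_pkt_within_seg_limit(const struct rte_mbuf *pkt)
{
        return pkt->nb_segs <= TX_MAX_SEG_SKETCH;
}

In ixgbe_xmit_pkts() this could sit right before the descriptors for the
packet are built; a failing packet would be dropped and counted as an error
rather than handed to the ring.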

Regards,
Helin

From: Vladislav Zolotarov [mailto:vladz@cloudius-systems.com]

Sent: Tuesday, August 25, 2015 11:34 AM
To: Zhang, Helin
Cc: Lu, Wenzhuo; dev@dpdk.org
Subject: RE: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598


On Aug 25, 2015 21:14, "Zhang, Helin" <helin.zhang@intel.com<mailto:helin.zhang@intel.com>> wrote:
>

> Hi Vlad

>

>

>

> In addition, I’d double check with you what’s the maximum number of descriptors would be used for a single packet transmitting?

>

> Datasheet said that it supports up to 8. I am wondering if more than 8 were used in your case?


If memory serves me well the maximum number of data descriptors per single xmit packet is 40 minus 2 minus WTHRESH. Since WTHRESH in DPDK is always zero it gives us 38 segments. We limit them by 33.

>

> Thank you very much!

>

>

>

> Regards,

>

> Helin

>

>

>

> From: Zhang, Helin

> Sent: Wednesday, August 19, 2015 10:29 AM

> To: Vladislav Zolotarov

> Cc: Lu, Wenzhuo; dev@dpdk.org<mailto:dev@dpdk.org>

>

> Subject: RE: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598

>

>

>

> Hi Vlad

>

>

>

> Thank you very much for the patches! Give me a few more time to double check with more guys, and possibly hardware experts.

>

>

>

> Regards,

>

> Helin

>

>

>

> From: Vladislav Zolotarov [mailto:vladz@cloudius-systems.com<mailto:vladz@cloudius-systems.com>]

> Sent: Tuesday, August 18, 2015 9:56 PM

> To: Lu, Wenzhuo

> Cc: dev@dpdk.org<mailto:dev@dpdk.org>; Zhang, Helin

> Subject: RE: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598

>

>

>

>

> On Aug 19, 2015 03:42, "Lu, Wenzhuo" <wenzhuo.lu@intel.com<mailto:wenzhuo.lu@intel.com>> wrote:

> >

> > Hi Helin,

> >

> > > -----Original Message-----

> > > From: dev [mailto:dev-bounces@dpdk.org<mailto:dev-bounces@dpdk.org>] On Behalf Of Vlad Zolotarov

> > > Sent: Friday, August 14, 2015 1:38 PM

> > > To: Zhang, Helin; dev@dpdk.org<mailto:dev@dpdk.org>

> > > Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for

> > > all NICs but 82598

> > >

> > >

> > >

> > > On 08/13/15 23:28, Zhang, Helin wrote:

> > > > Hi Vlad

> > > >

> > > > I don't think the changes are needed. It says in datasheet that the RS

> > > > bit should be set on the last descriptor of every packet, ONLY WHEN

> > > TXDCTL.WTHRESH equals to ZERO.

> > >

> > > Of course it's needed! See below.

> > > Exactly the same spec a few lines above the place u've just quoted states:

> > >

> > > "Software should not set the RS bit when TXDCTL.WTHRESH is greater than

> > > zero."

> > >

> > > And since all three (3) ixgbe xmit callbacks are utilizing RS bit notation ixgbe PMD

> > > is actually not supporting any value of WTHRESH different from zero.

> > I think Vlad is right. We need to fix this issue. Any suggestion? If not, I'd like to ack this patch.

>

> Pls., note that there is a v2 of this patch on the list. I forgot to patch ixgbevf_dev_info_get() in v1.

>

> >

> > >

> > > >

> > > > Regards,

> > > > Helin

> > > >

> > > >> -----Original Message-----

> > > >> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com<mailto:vladz@cloudius-systems.com>]

> > > >> Sent: Thursday, August 13, 2015 11:07 AM

> > > >> To: dev@dpdk.org<mailto:dev@dpdk.org>

> > > >> Cc: Zhang, Helin; Ananyev, Konstantin; avi@cloudius-systems.com<mailto:avi@cloudius-systems.com>; Vlad

> > > >> Zolotarov

> > > >> Subject: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for

> > > all

> > > >> NICs but 82598

> > > >>

> > > >> According to 82599 and x540 HW specifications RS bit *must* be set in the

> > > last

> > > >> descriptor of *every* packet.

> > > > There is a condition that if TXDCTL.WTHRESH equal to zero.

> > >

> > > Right and ixgbe PMD requires this condition to be fulfilled in order to

> > > function. See above.

> > >

> > > >

> > > >> This patch fixes the Tx hang we were constantly hitting with a seastar-based

> > > >> application on x540 NIC.

> > > > Could you help to share with us how to reproduce the tx hang issue, with using

> > > > typical DPDK examples?

> > >

> > > Sorry. I'm not very familiar with the typical DPDK examples to help u

> > > here. However this is quite irrelevant since without this this patch

> > > ixgbe PMD obviously abuses the HW spec as has been explained above.

> > >

> > > We saw the issue when u stressed the xmit path with a lot of highly

> > > fragmented TCP frames (packets with up to 33 fragments with non-headers

> > > fragments as small as 4 bytes) with all offload features enabled.

> > >

> > > Thanks,

> > > vlad

> > > >

> > > >> Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com<mailto:vladz@cloudius-systems.com>>

> > > >> ---

> > > >>   drivers/net/ixgbe/ixgbe_ethdev.c |  9 +++++++++

> > > >>   drivers/net/ixgbe/ixgbe_rxtx.c   | 23 ++++++++++++++++++++++-

> > > >>   2 files changed, 31 insertions(+), 1 deletion(-)

> > > >>

> > > >> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c

> > > >> b/drivers/net/ixgbe/ixgbe_ethdev.c

> > > >> index b8ee1e9..6714fd9 100644

> > > >> --- a/drivers/net/ixgbe/ixgbe_ethdev.c

> > > >> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c

> > > >> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev,

> > > struct

> > > >> rte_eth_dev_info *dev_info)

> > > >>            .txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |

> > > >>                            ETH_TXQ_FLAGS_NOOFFLOADS,

> > > >>    };

> > > >> +

> > > >> +  /*

> > > >> +   * According to 82599 and x540 specifications RS bit *must* be set on

> > > the

> > > >> +   * last descriptor of *every* packet. Therefore we will not allow the

> > > >> +   * tx_rs_thresh above 1 for all NICs newer than 82598.

> > > >> +   */

> > > >> +  if (hw->mac.type > ixgbe_mac_82598EB)

> > > >> +          dev_info->default_txconf.tx_rs_thresh = 1;

> > > >> +

> > > >>    dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX * sizeof(uint32_t);

> > > >>    dev_info->reta_size = ETH_RSS_RETA_SIZE_128;

> > > >>    dev_info->flow_type_rss_offloads = IXGBE_RSS_OFFLOAD_ALL; diff --

> > > git

> > > >> a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c index

> > > >> 91023b9..8dbdffc 100644

> > > >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c

> > > >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c

> > > >> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev

> > > >> *dev,

> > > >>    struct ixgbe_tx_queue *txq;

> > > >>    struct ixgbe_hw     *hw;

> > > >>    uint16_t tx_rs_thresh, tx_free_thresh;

> > > >> +  bool rs_deferring_allowed;

> > > >>

> > > >>    PMD_INIT_FUNC_TRACE();

> > > >>    hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);

> > > >>

> > > >>    /*

> > > >> +   * According to 82599 and x540 specifications RS bit *must* be set on

> > > the

> > > >> +   * last descriptor of *every* packet. Therefore we will not allow the

> > > >> +   * tx_rs_thresh above 1 for all NICs newer than 82598.

> > > >> +   */

> > > >> +  rs_deferring_allowed = (hw->mac.type <= ixgbe_mac_82598EB);

> > > >> +

> > > >> +  /*

> > > >>     * Validate number of transmit descriptors.

> > > >>     * It must not exceed hardware maximum, and must be multiple

> > > >>     * of IXGBE_ALIGN.

> > > >> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev

> > > *dev,

> > > >>     * to transmit a packet is greater than the number of free TX

> > > >>     * descriptors.

> > > >>     * The following constraints must be satisfied:

> > > >> +   *  tx_rs_thresh must be less than 2 for NICs for which RS deferring is

> > > >> +   *  forbidden (all but 82598).

> > > >>     *  tx_rs_thresh must be greater than 0.

> > > >>     *  tx_rs_thresh must be less than the size of the ring minus 2.

> > > >>     *  tx_rs_thresh must be less than or equal to tx_free_thresh.

> > > >> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev

> > > *dev,

> > > >>     * When set to zero use default values.

> > > >>     */

> > > >>    tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?

> > > >> -                  tx_conf->tx_rs_thresh : DEFAULT_TX_RS_THRESH);

> > > >> +                  tx_conf->tx_rs_thresh :

> > > >> +                  (rs_deferring_allowed ? DEFAULT_TX_RS_THRESH : 1));

> > > >>    tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?

> > > >>                    tx_conf->tx_free_thresh : DEFAULT_TX_FREE_THRESH);

> > > >> +

> > > >> +  if (!rs_deferring_allowed && tx_rs_thresh > 1) {

> > > >> +          PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than 2 since RS

> > > "

> > > >> +                            "must be set for every packet for this HW. "

> > > >> +                            "(tx_rs_thresh=%u port=%d queue=%d)",

> > > >> +                       (unsigned int)tx_rs_thresh,

> > > >> +                       (int)dev->data->port_id, (int)queue_idx);

> > > >> +          return -(EINVAL);

> > > >> +  }

> > > >> +

> > > >>    if (tx_rs_thresh >= (nb_desc - 2)) {

> > > >>            PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than the

> > > number "

> > > >>                         "of TX descriptors minus 2. (tx_rs_thresh=%u "

> > > >> --

> > > >> 2.1.0

> >
  
Vladislav Zolotarov Aug. 25, 2015, 6:52 p.m. UTC | #17
On 08/25/15 21:43, Zhang, Helin wrote:
>
> Hi Vlad
>
> I think this could possibly be the root cause of your TX hang issue. 
> Please try to limit the number to 8 or less, and then see if the issue 
> will still be there or not?
>

Helin, the issue has been seen on x540 devices. Pls., see chapter 
7.2.1.1 of the x540 spec:

A packet (or multiple packets in transmit segmentation) can span any number of
buffers (and their descriptors) up to a limit of 40 minus WTHRESH minus 2 (see
Section 7.2.3.3 for Tx Ring details and section Section 7.2.3.5.1 for WTHRESH
details). For best performance it is recommended to minimize the number of buffers
as possible.

Could u, pls., clarify why u think that the maximum number of data 
buffers is limited to 8?

thanks,
vlad
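
[To make the arithmetic above concrete - a hedged one-liner; the macro name is invented for illustration, and the 40/2 figures are taken from the spec passage quoted above:]

/* x540: max data buffers per packet = 40 - WTHRESH - 2.
 * With WTHRESH == 0 (the only value the ixgbe PMD supports) this is 38;
 * the application discussed in this thread caps its chains at 33 segments. */
#define X540_TX_BUFS_PER_PKT(wthresh)	(40 - (wthresh) - 2)
/* X540_TX_BUFS_PER_PKT(0) == 38 */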

> It does not have any check for the number of descriptors to be used 
> for a single packet, and it relies on the users to give correct mbuf 
> chains.
>
> We may need a check of this somewhere. Of cause the point you 
> indicated we also need to carefully investigate or fix.
>
> Regards,
>
> Helin
>
> *From:*Vladislav Zolotarov [mailto:vladz@cloudius-systems.com]
> *Sent:* Tuesday, August 25, 2015 11:34 AM
> *To:* Zhang, Helin
> *Cc:* Lu, Wenzhuo; dev@dpdk.org
> *Subject:* RE: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh 
> above 1 for all NICs but 82598
>
>
> On Aug 25, 2015 21:14, "Zhang, Helin" <helin.zhang@intel.com 
> <mailto:helin.zhang@intel.com>> wrote:
> >
> > Hi Vlad
> >
> >
> >
> > In addition, I’d double check with you what’s the maximum number of 
> descriptors would be used for a single packet transmitting?
> >
> > Datasheet said that it supports up to 8. I am wondering if more than 
> 8 were used in your case?
>
> If memory serves me well the maximum number of data descriptors per 
> single xmit packet is 40 minus 2 minus WTHRESH. Since WTHRESH in DPDK 
> is always zero it gives us 38 segments. We limit them by 33.
>
> >
> > Thank you very much!
> >
> >
> >
> > Regards,
> >
> > Helin
> >
> >
> >
> > From: Zhang, Helin
> > Sent: Wednesday, August 19, 2015 10:29 AM
> > To: Vladislav Zolotarov
> > Cc: Lu, Wenzhuo; dev@dpdk.org <mailto:dev@dpdk.org>
> >
> > Subject: RE: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh 
> above 1 for all NICs but 82598
> >
> >
> >
> > Hi Vlad
> >
> >
> >
> > Thank you very much for the patches! Give me a few more time to 
> double check with more guys, and possibly hardware experts.
> >
> >
> >
> > Regards,
> >
> > Helin
> >
> >
> >
> > From: Vladislav Zolotarov [mailto:vladz@cloudius-systems.com 
> <mailto:vladz@cloudius-systems.com>]
> > Sent: Tuesday, August 18, 2015 9:56 PM
> > To: Lu, Wenzhuo
> > Cc: dev@dpdk.org <mailto:dev@dpdk.org>; Zhang, Helin
> > Subject: RE: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh 
> above 1 for all NICs but 82598
> >
> >
> >
> >
> > On Aug 19, 2015 03:42, "Lu, Wenzhuo" <wenzhuo.lu@intel.com 
> <mailto:wenzhuo.lu@intel.com>> wrote:
> > >
> > > Hi Helin,
> > >
> > > > -----Original Message-----
> > > > From: dev [mailto:dev-bounces@dpdk.org 
> <mailto:dev-bounces@dpdk.org>] On Behalf Of Vlad Zolotarov
> > > > Sent: Friday, August 14, 2015 1:38 PM
> > > > To: Zhang, Helin; dev@dpdk.org <mailto:dev@dpdk.org>
> > > > Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid 
> tx_rs_thresh above 1 for
> > > > all NICs but 82598
> > > >
> > > >
> > > >
> > > > On 08/13/15 23:28, Zhang, Helin wrote:
> > > > > Hi Vlad
> > > > >
> > > > > I don't think the changes are needed. It says in datasheet 
> that the RS
> > > > > bit should be set on the last descriptor of every packet, ONLY 
> WHEN
> > > > TXDCTL.WTHRESH equals to ZERO.
> > > >
> > > > Of course it's needed! See below.
> > > > Exactly the same spec a few lines above the place u've just 
> quoted states:
> > > >
> > > > "Software should not set the RS bit when TXDCTL.WTHRESH is 
> greater than
> > > > zero."
> > > >
> > > > And since all three (3) ixgbe xmit callbacks are utilizing RS 
> bit notation ixgbe PMD
> > > > is actually not supporting any value of WTHRESH different from zero.
> > > I think Vlad is right. We need to fix this issue. Any suggestion? 
> If not, I'd like to ack this patch.
> >
> > Pls., note that there is a v2 of this patch on the list. I forgot to 
> patch ixgbevf_dev_info_get() in v1.
> >
> > >
> > > >
> > > > >
> > > > > Regards,
> > > > > Helin
> > > > >
> > > > >> -----Original Message-----
> > > > >> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com 
> <mailto:vladz@cloudius-systems.com>]
> > > > >> Sent: Thursday, August 13, 2015 11:07 AM
> > > > >> To: dev@dpdk.org <mailto:dev@dpdk.org>
> > > > >> Cc: Zhang, Helin; Ananyev, Konstantin; 
> avi@cloudius-systems.com <mailto:avi@cloudius-systems.com>; Vlad
> > > > >> Zolotarov
> > > > >> Subject: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh 
> above 1 for
> > > > all
> > > > >> NICs but 82598
> > > > >>
> > > > >> According to 82599 and x540 HW specifications RS bit *must* 
> be set in the
> > > > last
> > > > >> descriptor of *every* packet.
> > > > > There is a condition that if TXDCTL.WTHRESH equal to zero.
> > > >
> > > > Right and ixgbe PMD requires this condition to be fulfilled in 
> order to
> > > > function. See above.
> > > >
> > > > >
> > > > >> This patch fixes the Tx hang we were constantly hitting with 
> a seastar-based
> > > > >> application on x540 NIC.
> > > > > Could you help to share with us how to reproduce the tx hang 
> issue, with using
> > > > > typical DPDK examples?
> > > >
> > > > Sorry. I'm not very familiar with the typical DPDK examples to 
> help u
> > > > here. However this is quite irrelevant since without this this patch
> > > > ixgbe PMD obviously abuses the HW spec as has been explained above.
> > > >
> > > > We saw the issue when u stressed the xmit path with a lot of highly
> > > > fragmented TCP frames (packets with up to 33 fragments with 
> non-headers
> > > > fragments as small as 4 bytes) with all offload features enabled.
> > > >
> > > > Thanks,
> > > > vlad
> > > > >
> > > > >> Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com 
> <mailto:vladz@cloudius-systems.com>>
> > > > >> ---
> > > > >>   drivers/net/ixgbe/ixgbe_ethdev.c |  9 +++++++++
> > > > >>   drivers/net/ixgbe/ixgbe_rxtx.c  | 23 ++++++++++++++++++++++-
> > > > >>   2 files changed, 31 insertions(+), 1 deletion(-)
> > > > >>
> > > > >> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c
> > > > >> b/drivers/net/ixgbe/ixgbe_ethdev.c
> > > > >> index b8ee1e9..6714fd9 100644
> > > > >> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
> > > > >> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
> > > > >> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct rte_eth_dev 
> *dev,
> > > > struct
> > > > >> rte_eth_dev_info *dev_info)
> > > > >>            .txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |
> > > > >> ETH_TXQ_FLAGS_NOOFFLOADS,
> > > > >>    };
> > > > >> +
> > > > >> +  /*
> > > > >> +   * According to 82599 and x540 specifications RS bit 
> *must* be set on
> > > > the
> > > > >> +   * last descriptor of *every* packet. Therefore we will 
> not allow the
> > > > >> +   * tx_rs_thresh above 1 for all NICs newer than 82598.
> > > > >> +   */
> > > > >> +  if (hw->mac.type > ixgbe_mac_82598EB)
> > > > >> + dev_info->default_txconf.tx_rs_thresh = 1;
> > > > >> +
> > > > >>    dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX * 
> sizeof(uint32_t);
> > > > >>    dev_info->reta_size = ETH_RSS_RETA_SIZE_128;
> > > > >> dev_info->flow_type_rss_offloads = IXGBE_RSS_OFFLOAD_ALL; diff --
> > > > git
> > > > >> a/drivers/net/ixgbe/ixgbe_rxtx.c 
> b/drivers/net/ixgbe/ixgbe_rxtx.c index
> > > > >> 91023b9..8dbdffc 100644
> > > > >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> > > > >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> > > > >> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct 
> rte_eth_dev
> > > > >> *dev,
> > > > >>    struct ixgbe_tx_queue *txq;
> > > > >>    struct ixgbe_hw     *hw;
> > > > >>    uint16_t tx_rs_thresh, tx_free_thresh;
> > > > >> +  bool rs_deferring_allowed;
> > > > >>
> > > > >>    PMD_INIT_FUNC_TRACE();
> > > > >>    hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> > > > >>
> > > > >>    /*
> > > > >> +   * According to 82599 and x540 specifications RS bit 
> *must* be set on
> > > > the
> > > > >> +   * last descriptor of *every* packet. Therefore we will 
> not allow the
> > > > >> +   * tx_rs_thresh above 1 for all NICs newer than 82598.
> > > > >> +   */
> > > > >> +  rs_deferring_allowed = (hw->mac.type <= ixgbe_mac_82598EB);
> > > > >> +
> > > > >> +  /*
> > > > >>     * Validate number of transmit descriptors.
> > > > >>     * It must not exceed hardware maximum, and must be multiple
> > > > >>     * of IXGBE_ALIGN.
> > > > >> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev
> > > > *dev,
> > > > >>     * to transmit a packet is greater than the number of free TX
> > > > >>     * descriptors.
> > > > >>     * The following constraints must be satisfied:
> > > > >> +   *  tx_rs_thresh must be less than 2 for NICs for which RS 
> deferring is
> > > > >> +   *  forbidden (all but 82598).
> > > > >>     *  tx_rs_thresh must be greater than 0.
> > > > >>     *  tx_rs_thresh must be less than the size of the ring 
> minus 2.
> > > > >>     *  tx_rs_thresh must be less than or equal to tx_free_thresh.
> > > > >> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct 
> rte_eth_dev
> > > > *dev,
> > > > >>     * When set to zero use default values.
> > > > >>     */
> > > > >>    tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?
> > > > >> - tx_conf->tx_rs_thresh : DEFAULT_TX_RS_THRESH);
> > > > >> + tx_conf->tx_rs_thresh :
> > > > >> + (rs_deferring_allowed ? DEFAULT_TX_RS_THRESH : 1));
> > > > >>    tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?
> > > > >> tx_conf->tx_free_thresh : DEFAULT_TX_FREE_THRESH);
> > > > >> +
> > > > >> +  if (!rs_deferring_allowed && tx_rs_thresh > 1) {
> > > > >> +          PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than 
> 2 since RS
> > > > "
> > > > >> +                            "must be set for every packet 
> for this HW. "
> > > > >> + "(tx_rs_thresh=%u port=%d queue=%d)",
> > > > >> +                       (unsigned int)tx_rs_thresh,
> > > > >> +  (int)dev->data->port_id, (int)queue_idx);
> > > > >> +          return -(EINVAL);
> > > > >> +  }
> > > > >> +
> > > > >>    if (tx_rs_thresh >= (nb_desc - 2)) {
> > > > >>            PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than the
> > > > number "
> > > > >>                         "of TX descriptors minus 2. 
> (tx_rs_thresh=%u "
> > > > >> --
> > > > >> 2.1.0
> > >
>
  
Zhang, Helin Aug. 25, 2015, 7:16 p.m. UTC | #18
> -----Original Message-----
> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
> Sent: Tuesday, August 25, 2015 11:53 AM
> To: Zhang, Helin
> Cc: Lu, Wenzhuo; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for
> all NICs but 82598
> 
> On 08/25/15 21:43, Zhang, Helin wrote:
> >
> > Hi Vlad
> >
> > I think this could possibly be the root cause of your TX hang issue.
> > Please try to limit the number to 8 or less, and then see if the issue
> > will still be there or not?
> >
> 
> Helin, the issue has been seen on x540 devices. Pls., see a chapter
> 7.2.1.1 of x540 devices spec:
> 
> A packet (or multiple packets in transmit segmentation) can span any number of
> buffers (and their descriptors) up to a limit of 40 minus WTHRESH minus 2 (see
> Section 7.2.3.3 for Tx Ring details and section Section 7.2.3.5.1 for WTHRESH
> details). For best performance it is recommended to minimize the number of
> buffers as possible.
> 
> Could u, pls., clarify why do u think that the maximum number of data buffers is
> limited by 8?

OK, i40e hardware is 8, so I'd assume x540 could have a similar one. Yes, in your case,
the limit could be around 38, right?
Could you help to make sure that no packet to be transmitted uses more than
38 descriptors?
I heard that there is a similar hang issue on X710 when using more than 8 descriptors for
a single packet. I am wondering if the issue is similar on x540.

Regards,
Helin
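
[A hedged sketch of the kind of verification being asked for here - counting the buffers in an mbuf chain before it is handed to tx_burst. The helper name and the use of 38 as the bound are assumptions based on the spec quote above, not an existing DPDK API:]

#include <rte_mbuf.h>

/* Count the data buffers (segments) in an mbuf chain so the caller can
 * check it stays within the 40 - WTHRESH - 2 = 38 bound discussed here. */
static inline uint16_t
tx_chain_seg_count(const struct rte_mbuf *m)
{
	uint16_t n = 0;

	for (; m != NULL; m = m->next)
		n++;
	return n;
}

/* Example use before transmitting:
 *	if (tx_chain_seg_count(pkt) > 38)
 *		handle_oversized_chain(pkt);   // hypothetical handler
 */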

> 
> thanks,
> vlad
> 

> > It does not have any check for the number of descriptors to be used

> > for a single packet, and it relies on the users to give correct mbuf

> > chains.

> >

> > We may need a check of this somewhere. Of cause the point you

> > indicated we also need to carefully investigate or fix.

> >

> > Regards,

> >

> > Helin

> >

> > *From:*Vladislav Zolotarov [mailto:vladz@cloudius-systems.com]

> > *Sent:* Tuesday, August 25, 2015 11:34 AM

> > *To:* Zhang, Helin

> > *Cc:* Lu, Wenzhuo; dev@dpdk.org

> > *Subject:* RE: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh

> > above 1 for all NICs but 82598

> >

> >

> > On Aug 25, 2015 21:14, "Zhang, Helin" <helin.zhang@intel.com

> > <mailto:helin.zhang@intel.com>> wrote:

> > >

> > > Hi Vlad

> > >

> > >

> > >

> > > In addition, I’d double check with you what’s the maximum number of

> > descriptors would be used for a single packet transmitting?

> > >

> > > Datasheet said that it supports up to 8. I am wondering if more than

> > 8 were used in your case?

> >

> > If memory serves me well the maximum number of data descriptors per

> > single xmit packet is 40 minus 2 minus WTHRESH. Since WTHRESH in DPDK

> > is always zero it gives us 38 segments. We limit them by 33.

> >

> > >

> > > Thank you very much!

> > >

> > >

> > >

> > > Regards,

> > >

> > > Helin

> > >

> > >

> > >

> > > From: Zhang, Helin

> > > Sent: Wednesday, August 19, 2015 10:29 AM

> > > To: Vladislav Zolotarov

> > > Cc: Lu, Wenzhuo; dev@dpdk.org <mailto:dev@dpdk.org>

> > >

> > > Subject: RE: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh

> > above 1 for all NICs but 82598

> > >

> > >

> > >

> > > Hi Vlad

> > >

> > >

> > >

> > > Thank you very much for the patches! Give me a few more time to

> > double check with more guys, and possibly hardware experts.

> > >

> > >

> > >

> > > Regards,

> > >

> > > Helin

> > >

> > >

> > >

> > > From: Vladislav Zolotarov [mailto:vladz@cloudius-systems.com

> > <mailto:vladz@cloudius-systems.com>]

> > > Sent: Tuesday, August 18, 2015 9:56 PM

> > > To: Lu, Wenzhuo

> > > Cc: dev@dpdk.org <mailto:dev@dpdk.org>; Zhang, Helin

> > > Subject: RE: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh

> > above 1 for all NICs but 82598

> > >

> > >

> > >

> > >

> > > On Aug 19, 2015 03:42, "Lu, Wenzhuo" <wenzhuo.lu@intel.com

> > <mailto:wenzhuo.lu@intel.com>> wrote:

> > > >

> > > > Hi Helin,

> > > >

> > > > > -----Original Message-----

> > > > > From: dev [mailto:dev-bounces@dpdk.org

> > <mailto:dev-bounces@dpdk.org>] On Behalf Of Vlad Zolotarov

> > > > > Sent: Friday, August 14, 2015 1:38 PM

> > > > > To: Zhang, Helin; dev@dpdk.org <mailto:dev@dpdk.org>

> > > > > Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid

> > tx_rs_thresh above 1 for

> > > > > all NICs but 82598

> > > > >

> > > > >

> > > > >

> > > > > On 08/13/15 23:28, Zhang, Helin wrote:

> > > > > > Hi Vlad

> > > > > >

> > > > > > I don't think the changes are needed. It says in datasheet

> > that the RS

> > > > > > bit should be set on the last descriptor of every packet, ONLY

> > WHEN

> > > > > TXDCTL.WTHRESH equals to ZERO.

> > > > >

> > > > > Of course it's needed! See below.

> > > > > Exactly the same spec a few lines above the place u've just

> > quoted states:

> > > > >

> > > > > "Software should not set the RS bit when TXDCTL.WTHRESH is

> > greater than

> > > > > zero."

> > > > >

> > > > > And since all three (3) ixgbe xmit callbacks are utilizing RS

> > bit notation ixgbe PMD

> > > > > is actually not supporting any value of WTHRESH different from zero.

> > > > I think Vlad is right. We need to fix this issue. Any suggestion?

> > If not, I'd like to ack this patch.

> > >

> > > Pls., note that there is a v2 of this patch on the list. I forgot to

> > patch ixgbevf_dev_info_get() in v1.

> > >

> > > >

> > > > >

> > > > > >

> > > > > > Regards,

> > > > > > Helin

> > > > > >

> > > > > >> -----Original Message-----

> > > > > >> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com

> > <mailto:vladz@cloudius-systems.com>]

> > > > > >> Sent: Thursday, August 13, 2015 11:07 AM

> > > > > >> To: dev@dpdk.org <mailto:dev@dpdk.org>

> > > > > >> Cc: Zhang, Helin; Ananyev, Konstantin;

> > avi@cloudius-systems.com <mailto:avi@cloudius-systems.com>; Vlad

> > > > > >> Zolotarov

> > > > > >> Subject: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh

> > above 1 for

> > > > > all

> > > > > >> NICs but 82598

> > > > > >>

> > > > > >> According to 82599 and x540 HW specifications RS bit *must*

> > be set in the

> > > > > last

> > > > > >> descriptor of *every* packet.

> > > > > > There is a condition that if TXDCTL.WTHRESH equal to zero.

> > > > >

> > > > > Right and ixgbe PMD requires this condition to be fulfilled in

> > order to

> > > > > function. See above.

> > > > >

> > > > > >

> > > > > >> This patch fixes the Tx hang we were constantly hitting with

> > a seastar-based

> > > > > >> application on x540 NIC.

> > > > > > Could you help to share with us how to reproduce the tx hang

> > issue, with using

> > > > > > typical DPDK examples?

> > > > >

> > > > > Sorry. I'm not very familiar with the typical DPDK examples to

> > help u

> > > > > here. However this is quite irrelevant since without this this

> > > > > patch ixgbe PMD obviously abuses the HW spec as has been explained

> above.

> > > > >

> > > > > We saw the issue when u stressed the xmit path with a lot of

> > > > > highly fragmented TCP frames (packets with up to 33 fragments

> > > > > with

> > non-headers

> > > > > fragments as small as 4 bytes) with all offload features enabled.

> > > > >

> > > > > Thanks,

> > > > > vlad

> > > > > >

> > > > > >> Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com

> > <mailto:vladz@cloudius-systems.com>>

> > > > > >> ---

> > > > > >>   drivers/net/ixgbe/ixgbe_ethdev.c |  9 +++++++++

> > > > > >>   drivers/net/ixgbe/ixgbe_rxtx.c  | 23 ++++++++++++++++++++++-

> > > > > >>   2 files changed, 31 insertions(+), 1 deletion(-)

> > > > > >>

> > > > > >> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c

> > > > > >> b/drivers/net/ixgbe/ixgbe_ethdev.c

> > > > > >> index b8ee1e9..6714fd9 100644

> > > > > >> --- a/drivers/net/ixgbe/ixgbe_ethdev.c

> > > > > >> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c

> > > > > >> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct rte_eth_dev

> > *dev,

> > > > > struct

> > > > > >> rte_eth_dev_info *dev_info)

> > > > > >>            .txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |

> > > > > >> ETH_TXQ_FLAGS_NOOFFLOADS,

> > > > > >>    };

> > > > > >> +

> > > > > >> +  /*

> > > > > >> +   * According to 82599 and x540 specifications RS bit

> > *must* be set on

> > > > > the

> > > > > >> +   * last descriptor of *every* packet. Therefore we will

> > not allow the

> > > > > >> +   * tx_rs_thresh above 1 for all NICs newer than 82598.

> > > > > >> +   */

> > > > > >> +  if (hw->mac.type > ixgbe_mac_82598EB)

> > > > > >> + dev_info->default_txconf.tx_rs_thresh = 1;

> > > > > >> +

> > > > > >>    dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX *

> > sizeof(uint32_t);

> > > > > >>    dev_info->reta_size = ETH_RSS_RETA_SIZE_128;

> > > > > >> dev_info->flow_type_rss_offloads = IXGBE_RSS_OFFLOAD_ALL;

> > > > > >> diff --

> > > > > git

> > > > > >> a/drivers/net/ixgbe/ixgbe_rxtx.c

> > b/drivers/net/ixgbe/ixgbe_rxtx.c index

> > > > > >> 91023b9..8dbdffc 100644

> > > > > >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c

> > > > > >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c

> > > > > >> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct

> > rte_eth_dev

> > > > > >> *dev,

> > > > > >>    struct ixgbe_tx_queue *txq;

> > > > > >>    struct ixgbe_hw     *hw;

> > > > > >>    uint16_t tx_rs_thresh, tx_free_thresh;

> > > > > >> +  bool rs_deferring_allowed;

> > > > > >>

> > > > > >>    PMD_INIT_FUNC_TRACE();

> > > > > >>    hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);

> > > > > >>

> > > > > >>    /*

> > > > > >> +   * According to 82599 and x540 specifications RS bit

> > *must* be set on

> > > > > the

> > > > > >> +   * last descriptor of *every* packet. Therefore we will

> > not allow the

> > > > > >> +   * tx_rs_thresh above 1 for all NICs newer than 82598.

> > > > > >> +   */

> > > > > >> +  rs_deferring_allowed = (hw->mac.type <=

> > > > > >> + ixgbe_mac_82598EB);

> > > > > >> +

> > > > > >> +  /*

> > > > > >>     * Validate number of transmit descriptors.

> > > > > >>     * It must not exceed hardware maximum, and must be multiple

> > > > > >>     * of IXGBE_ALIGN.

> > > > > >> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct

> > > > > >> rte_eth_dev

> > > > > *dev,

> > > > > >>     * to transmit a packet is greater than the number of free TX

> > > > > >>     * descriptors.

> > > > > >>     * The following constraints must be satisfied:

> > > > > >> +   *  tx_rs_thresh must be less than 2 for NICs for which RS

> > deferring is

> > > > > >> +   *  forbidden (all but 82598).

> > > > > >>     *  tx_rs_thresh must be greater than 0.

> > > > > >>     *  tx_rs_thresh must be less than the size of the ring

> > minus 2.

> > > > > >>     *  tx_rs_thresh must be less than or equal to tx_free_thresh.

> > > > > >> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct

> > rte_eth_dev

> > > > > *dev,

> > > > > >>     * When set to zero use default values.

> > > > > >>     */

> > > > > >>    tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?

> > > > > >> - tx_conf->tx_rs_thresh : DEFAULT_TX_RS_THRESH);

> > > > > >> + tx_conf->tx_rs_thresh :

> > > > > >> + (rs_deferring_allowed ? DEFAULT_TX_RS_THRESH : 1));

> > > > > >>    tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?

> > > > > >> tx_conf->tx_free_thresh : DEFAULT_TX_FREE_THRESH);

> > > > > >> +

> > > > > >> +  if (!rs_deferring_allowed && tx_rs_thresh > 1) {

> > > > > >> +          PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than

> > 2 since RS

> > > > > "

> > > > > >> +                            "must be set for every packet

> > for this HW. "

> > > > > >> + "(tx_rs_thresh=%u port=%d queue=%d)",

> > > > > >> +                       (unsigned int)tx_rs_thresh,

> > > > > >> + (int)dev->data->port_id, (int)queue_idx);

> > > > > >> +          return -(EINVAL);

> > > > > >> +  }

> > > > > >> +

> > > > > >>    if (tx_rs_thresh >= (nb_desc - 2)) {

> > > > > >>            PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than

> > > > > >> the

> > > > > number "

> > > > > >>                         "of TX descriptors minus 2.

> > (tx_rs_thresh=%u "

> > > > > >> --

> > > > > >> 2.1.0

> > > >

> >
  
Avi Kivity Aug. 25, 2015, 7:23 p.m. UTC | #19
On 08/25/2015 10:16 PM, Zhang, Helin wrote:
>
>> -----Original Message-----
>> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
>> Sent: Tuesday, August 25, 2015 11:53 AM
>> To: Zhang, Helin
>> Cc: Lu, Wenzhuo; dev@dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for
>> all NICs but 82598
>>
>>
>>
>> On 08/25/15 21:43, Zhang, Helin wrote:
>>> Hi Vlad
>>>
>>> I think this could possibly be the root cause of your TX hang issue.
>>> Please try to limit the number to 8 or less, and then see if the issue
>>> will still be there or not?
>>>
>> Helin, the issue has been seen on x540 devices. Pls., see a chapter
>> 7.2.1.1 of x540 devices spec:
>>
>> A packet (or multiple packets in transmit segmentation) can span any number of
>> buffers (and their descriptors) up to a limit of 40 minus WTHRESH minus 2 (see
>> Section 7.2.3.3 for Tx Ring details and section Section 7.2.3.5.1 for WTHRESH
>> details). For best performance it is recommended to minimize the number of
>> buffers as possible.
>>
>> Could u, pls., clarify why do u think that the maximum number of data buffers is
>> limited by 8?
> OK, i40e hardware is 8, so I'd assume x540 could have a similar one. Yes, in your case,
> the limit could be around 38, right?
> Could you help to make sure there is no packet to be transmitted uses more than
> 38 descriptors?
> I heard that there is a similar hang issue on X710 if using more than 8 descriptors for
> a single packet. I am wondering if the issue is similar on x540.
>
>

I believe that the ixgbe Linux driver does not limit packets to 8 
fragments, so apparently the hardware is capable.
  
Vladislav Zolotarov Aug. 25, 2015, 7:30 p.m. UTC | #20
On Aug 25, 2015 22:16, "Zhang, Helin" <helin.zhang@intel.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
> > Sent: Tuesday, August 25, 2015 11:53 AM
> > To: Zhang, Helin
> > Cc: Lu, Wenzhuo; dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above
1 for
> > all NICs but 82598
> >
> >
> >
> > On 08/25/15 21:43, Zhang, Helin wrote:
> > >
> > > Hi Vlad
> > >
> > > I think this could possibly be the root cause of your TX hang issue.
> > > Please try to limit the number to 8 or less, and then see if the issue
> > > will still be there or not?
> > >
> >
> > Helin, the issue has been seen on x540 devices. Pls., see a chapter
> > 7.2.1.1 of x540 devices spec:
> >
> > A packet (or multiple packets in transmit segmentation) can span any number of
> > buffers (and their descriptors) up to a limit of 40 minus WTHRESH minus 2 (see
> > Section 7.2.3.3 for Tx Ring details and section Section 7.2.3.5.1 for WTHRESH
> > details). For best performance it is recommended to minimize the number of
> > buffers as possible.
> >
> > Could u, pls., clarify why do u think that the maximum number of data buffers is
> > limited by 8?
> OK, i40e hardware is 8

For i40e it's a bit more complicated than just "not more than 8" - it's not
more than 8 for a non-TSO packet and not more than 8 for each MSS, including
header buffers, for TSO. But this thread is not about i40e so this doesn't
seem to be relevant anyway.

> , so I'd assume x540 could have a similar one.

x540 spec assumes otherwise... 😉

> Yes, in your case,
> the limit could be around 38, right?

If by "around 38" u mean "exactly 38" then u are absolutely right... 😉

> Could you help to make sure there is no packet to be transmitted uses more than
> 38 descriptors?

Just like I've already mentioned, we limit the cluster to at most 33 data
segments. Therefore we are good here...

> I heard that there is a similar hang issue on X710 if using more than 8 descriptors for
> a single packet. I am wondering if the issue is similar on x540.

What's x710? If that's the xl710 40G NICs (i40e driver), then it has its own
specs with its own HW limitations I've mentioned above. It has nothing to
do with this thread, which is all about 10G NICs managed by the ixgbe driver.

There is a different thread where I've raised the 40G NICs' xmit issues.
See the "i40e xmit path HW limitation" thread.

>
> Regards,
> Helin
>
> >
> > thanks,
> > vlad
> >
> > > It does not have any check for the number of descriptors to be used
> > > for a single packet, and it relies on the users to give correct mbuf
> > > chains.
> > >
> > > We may need a check of this somewhere. Of cause the point you
> > > indicated we also need to carefully investigate or fix.
> > >
> > > Regards,
> > >
> > > Helin
> > >
> > > *From:*Vladislav Zolotarov [mailto:vladz@cloudius-systems.com]
> > > *Sent:* Tuesday, August 25, 2015 11:34 AM
> > > *To:* Zhang, Helin
> > > *Cc:* Lu, Wenzhuo; dev@dpdk.org
> > > *Subject:* RE: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh
> > > above 1 for all NICs but 82598
> > >
> > >
> > > On Aug 25, 2015 21:14, "Zhang, Helin" <helin.zhang@intel.com
> > > <mailto:helin.zhang@intel.com>> wrote:
> > > >
> > > > Hi Vlad
> > > >
> > > >
> > > >
> > > > In addition, I’d double check with you what’s the maximum number of
> > > descriptors would be used for a single packet transmitting?
> > > >
> > > > Datasheet said that it supports up to 8. I am wondering if more than
> > > 8 were used in your case?
> > >
> > > If memory serves me well the maximum number of data descriptors per
> > > single xmit packet is 40 minus 2 minus WTHRESH. Since WTHRESH in DPDK
> > > is always zero it gives us 38 segments. We limit them by 33.
> > >
> > > >
> > > > Thank you very much!
> > > >
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Helin
> > > >
> > > >
> > > >
> > > > From: Zhang, Helin
> > > > Sent: Wednesday, August 19, 2015 10:29 AM
> > > > To: Vladislav Zolotarov
> > > > Cc: Lu, Wenzhuo; dev@dpdk.org <mailto:dev@dpdk.org>
> > > >
> > > > Subject: RE: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh
> > > above 1 for all NICs but 82598
> > > >
> > > >
> > > >
> > > > Hi Vlad
> > > >
> > > >
> > > >
> > > > Thank you very much for the patches! Give me a few more time to
> > > double check with more guys, and possibly hardware experts.
> > > >
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Helin
> > > >
> > > >
> > > >
> > > > From: Vladislav Zolotarov [mailto:vladz@cloudius-systems.com
> > > <mailto:vladz@cloudius-systems.com>]
> > > > Sent: Tuesday, August 18, 2015 9:56 PM
> > > > To: Lu, Wenzhuo
> > > > Cc: dev@dpdk.org <mailto:dev@dpdk.org>; Zhang, Helin
> > > > Subject: RE: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh
> > > above 1 for all NICs but 82598
> > > >
> > > >
> > > >
> > > >
> > > > On Aug 19, 2015 03:42, "Lu, Wenzhuo" <wenzhuo.lu@intel.com
> > > <mailto:wenzhuo.lu@intel.com>> wrote:
> > > > >
> > > > > Hi Helin,
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: dev [mailto:dev-bounces@dpdk.org
> > > <mailto:dev-bounces@dpdk.org>] On Behalf Of Vlad Zolotarov
> > > > > > Sent: Friday, August 14, 2015 1:38 PM
> > > > > > To: Zhang, Helin; dev@dpdk.org <mailto:dev@dpdk.org>
> > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid
> > > tx_rs_thresh above 1 for
> > > > > > all NICs but 82598
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 08/13/15 23:28, Zhang, Helin wrote:
> > > > > > > Hi Vlad
> > > > > > >
> > > > > > > I don't think the changes are needed. It says in datasheet
> > > that the RS
> > > > > > > bit should be set on the last descriptor of every packet, ONLY
> > > WHEN
> > > > > > TXDCTL.WTHRESH equals to ZERO.
> > > > > >
> > > > > > Of course it's needed! See below.
> > > > > > Exactly the same spec a few lines above the place u've just
> > > quoted states:
> > > > > >
> > > > > > "Software should not set the RS bit when TXDCTL.WTHRESH is
> > > greater than
> > > > > > zero."
> > > > > >
> > > > > > And since all three (3) ixgbe xmit callbacks are utilizing RS
> > > bit notation ixgbe PMD
> > > > > > is actually not supporting any value of WTHRESH different from
zero.
> > > > > I think Vlad is right. We need to fix this issue. Any suggestion?
> > > If not, I'd like to ack this patch.
> > > >
> > > > Pls., note that there is a v2 of this patch on the list. I forgot to
> > > patch ixgbevf_dev_info_get() in v1.
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > Regards,
> > > > > > > Helin
> > > > > > >
> > > > > > >> -----Original Message-----
> > > > > > >> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com
> > > <mailto:vladz@cloudius-systems.com>]
> > > > > > >> Sent: Thursday, August 13, 2015 11:07 AM
> > > > > > >> To: dev@dpdk.org <mailto:dev@dpdk.org>
> > > > > > >> Cc: Zhang, Helin; Ananyev, Konstantin;
> > > avi@cloudius-systems.com <mailto:avi@cloudius-systems.com>; Vlad
> > > > > > >> Zolotarov
> > > > > > >> Subject: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh
> > > above 1 for
> > > > > > all
> > > > > > >> NICs but 82598
> > > > > > >>
> > > > > > >> According to 82599 and x540 HW specifications RS bit *must*
> > > be set in the
> > > > > > last
> > > > > > >> descriptor of *every* packet.
> > > > > > > There is a condition that if TXDCTL.WTHRESH equal to zero.
> > > > > >
> > > > > > Right and ixgbe PMD requires this condition to be fulfilled in
> > > order to
> > > > > > function. See above.
> > > > > >
> > > > > > >
> > > > > > >> This patch fixes the Tx hang we were constantly hitting with
> > > a seastar-based
> > > > > > >> application on x540 NIC.
> > > > > > > Could you help to share with us how to reproduce the tx hang
> > > issue, with using
> > > > > > > typical DPDK examples?
> > > > > >
> > > > > > Sorry. I'm not very familiar with the typical DPDK examples to
> > > help u
> > > > > > here. However this is quite irrelevant since without this this
> > > > > > patch ixgbe PMD obviously abuses the HW spec as has been
explained
> > above.
> > > > > >
> > > > > > We saw the issue when u stressed the xmit path with a lot of
> > > > > > highly fragmented TCP frames (packets with up to 33 fragments
> > > > > > with
> > > non-headers
> > > > > > fragments as small as 4 bytes) with all offload features
enabled.
> > > > > >
> > > > > > Thanks,
> > > > > > vlad
> > > > > > >
> > > > > > >> Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com
> > > <mailto:vladz@cloudius-systems.com>>
> > > > > > >> ---
> > > > > > >>   drivers/net/ixgbe/ixgbe_ethdev.c |  9 +++++++++
> > > > > > >>   drivers/net/ixgbe/ixgbe_rxtx.c  | 23
++++++++++++++++++++++-
> > > > > > >>   2 files changed, 31 insertions(+), 1 deletion(-)
> > > > > > >>
> > > > > > >> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c
> > > > > > >> b/drivers/net/ixgbe/ixgbe_ethdev.c
> > > > > > >> index b8ee1e9..6714fd9 100644
> > > > > > >> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
> > > > > > >> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
> > > > > > >> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct rte_eth_dev
> > > *dev,
> > > > > > struct
> > > > > > >> rte_eth_dev_info *dev_info)
> > > > > > >>            .txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |
> > > > > > >> ETH_TXQ_FLAGS_NOOFFLOADS,
> > > > > > >>    };
> > > > > > >> +
> > > > > > >> +  /*
> > > > > > >> +   * According to 82599 and x540 specifications RS bit
> > > *must* be set on
> > > > > > the
> > > > > > >> +   * last descriptor of *every* packet. Therefore we will
> > > not allow the
> > > > > > >> +   * tx_rs_thresh above 1 for all NICs newer than 82598.
> > > > > > >> +   */
> > > > > > >> +  if (hw->mac.type > ixgbe_mac_82598EB)
> > > > > > >> + dev_info->default_txconf.tx_rs_thresh = 1;
> > > > > > >> +
> > > > > > >>    dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX *
> > > sizeof(uint32_t);
> > > > > > >>    dev_info->reta_size = ETH_RSS_RETA_SIZE_128;
> > > > > > >> dev_info->flow_type_rss_offloads = IXGBE_RSS_OFFLOAD_ALL;
> > > > > > >> diff --
> > > > > > git
> > > > > > >> a/drivers/net/ixgbe/ixgbe_rxtx.c
> > > b/drivers/net/ixgbe/ixgbe_rxtx.c index
> > > > > > >> 91023b9..8dbdffc 100644
> > > > > > >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> > > > > > >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> > > > > > >> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct
> > > rte_eth_dev
> > > > > > >> *dev,
> > > > > > >>    struct ixgbe_tx_queue *txq;
> > > > > > >>    struct ixgbe_hw     *hw;
> > > > > > >>    uint16_t tx_rs_thresh, tx_free_thresh;
> > > > > > >> +  bool rs_deferring_allowed;
> > > > > > >>
> > > > > > >>    PMD_INIT_FUNC_TRACE();
> > > > > > >>    hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> > > > > > >>
> > > > > > >>    /*
> > > > > > >> +   * According to 82599 and x540 specifications RS bit
> > > *must* be set on
> > > > > > the
> > > > > > >> +   * last descriptor of *every* packet. Therefore we will
> > > not allow the
> > > > > > >> +   * tx_rs_thresh above 1 for all NICs newer than 82598.
> > > > > > >> +   */
> > > > > > >> +  rs_deferring_allowed = (hw->mac.type <=
> > > > > > >> + ixgbe_mac_82598EB);
> > > > > > >> +
> > > > > > >> +  /*
> > > > > > >>     * Validate number of transmit descriptors.
> > > > > > >>     * It must not exceed hardware maximum, and must be
multiple
> > > > > > >>     * of IXGBE_ALIGN.
> > > > > > >> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct
> > > > > > >> rte_eth_dev
> > > > > > *dev,
> > > > > > >>     * to transmit a packet is greater than the number of
free TX
> > > > > > >>     * descriptors.
> > > > > > >>     * The following constraints must be satisfied:
> > > > > > >> +   *  tx_rs_thresh must be less than 2 for NICs for which RS
> > > deferring is
> > > > > > >> +   *  forbidden (all but 82598).
> > > > > > >>     *  tx_rs_thresh must be greater than 0.
> > > > > > >>     *  tx_rs_thresh must be less than the size of the ring
> > > minus 2.
> > > > > > >>     *  tx_rs_thresh must be less than or equal to
tx_free_thresh.
> > > > > > >> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct
> > > rte_eth_dev
> > > > > > *dev,
> > > > > > >>     * When set to zero use default values.
> > > > > > >>     */
> > > > > > >>    tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?
> > > > > > >> - tx_conf->tx_rs_thresh : DEFAULT_TX_RS_THRESH);
> > > > > > >> + tx_conf->tx_rs_thresh :
> > > > > > >> + (rs_deferring_allowed ? DEFAULT_TX_RS_THRESH : 1));
> > > > > > >>    tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?
> > > > > > >> tx_conf->tx_free_thresh : DEFAULT_TX_FREE_THRESH);
> > > > > > >> +
> > > > > > >> +  if (!rs_deferring_allowed && tx_rs_thresh > 1) {
> > > > > > >> +          PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than
> > > 2 since RS
> > > > > > "
> > > > > > >> +                            "must be set for every packet
> > > for this HW. "
> > > > > > >> + "(tx_rs_thresh=%u port=%d queue=%d)",
> > > > > > >> +                       (unsigned int)tx_rs_thresh,
> > > > > > >> + (int)dev->data->port_id, (int)queue_idx);
> > > > > > >> +          return -(EINVAL);
> > > > > > >> +  }
> > > > > > >> +
> > > > > > >>    if (tx_rs_thresh >= (nb_desc - 2)) {
> > > > > > >>            PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than
> > > > > > >> the
> > > > > > number "
> > > > > > >>                         "of TX descriptors minus 2.
> > > (tx_rs_thresh=%u "
> > > > > > >> --
> > > > > > >> 2.1.0
> > > > >
> > >
>
  
Vladislav Zolotarov Aug. 25, 2015, 8:07 p.m. UTC | #21
On 08/25/15 22:30, Vladislav Zolotarov wrote:
>
>
> On Aug 25, 2015 22:16, "Zhang, Helin" <helin.zhang@intel.com 
> <mailto:helin.zhang@intel.com>> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com 
> <mailto:vladz@cloudius-systems.com>]
> > > Sent: Tuesday, August 25, 2015 11:53 AM
> > > To: Zhang, Helin
> > > Cc: Lu, Wenzhuo; dev@dpdk.org <mailto:dev@dpdk.org>
> > > Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh 
> above 1 for
> > > all NICs but 82598
> > >
> > >
> > >
> > > On 08/25/15 21:43, Zhang, Helin wrote:
> > > >
> > > > Hi Vlad
> > > >
> > > > I think this could possibly be the root cause of your TX hang issue.
> > > > Please try to limit the number to 8 or less, and then see if the 
> issue
> > > > will still be there or not?
> > > >
> > >
> > > Helin, the issue has been seen on x540 devices. Pls., see a chapter
> > > 7.2.1.1 of x540 devices spec:
> > >
> > > A packet (or multiple packets in transmit segmentation) can span 
> any number of
> > > buffers (and their descriptors) up to a limit of 40 minus WTHRESH 
> minus 2 (see
> > > Section 7.2.3.3 for Tx Ring details and section Section 7.2.3.5.1 
> for WTHRESH
> > > details). For best performance it is recommended to minimize the 
> number of
> > > buffers as possible.
> > >
> > > Could u, pls., clarify why do u think that the maximum number of 
> data buffers is
> > > limited by 8?
> > OK, i40e hardware is 8
>
> For i40 it's a bit more complicated than just "not more than 8" - it's 
> not more than 8 for a non-TSO packet and not more than 8 for each MSS 
> including headers buffers for TSO. But this thread is not about i40e 
> so this doesn't seem to be relevant anyway.
>
> , so I'd assume x540 could have a similar one.
>
> x540 spec assumes otherwise... 😉
>
> Yes, in your case,
> > the limit could be around 38, right?
>
> If by "around 38" u mean "exactly 38" then u are absolutely right... 😉
>
> > Could you help to make sure there is no packet to be transmitted 
> uses more than
> > 38 descriptors?
>
> Just like i've already mentioned, we limit the cluster by at most 33 
> data segments. Therefore we are good here...
>
> > I heard that there is a similar hang issue on X710 if using more 
> than 8 descriptors for
> > a single packet. I am wondering if the issue is similar on x540.
>
> What's x710? If that's xl710 40G nics (i40e driver),
>

I've found what x710 NICs are - they are another NIC family managed by the
i40e PMD. Therefore the rest of what I said still stands... ;)

> then it has its own specs with its own HW limitations i've mentioned 
> above. It has nothing to do with this thread that is all about 10G 
> nics managed by ixgbe driver.
>
> There is a different thread, where i've raised the 40G NICs xmit 
> issues. See "i40e xmit path HW limitation" thread.
>
> >
> > Regards,
> > Helin
> >
> > >
> > > thanks,
> > > vlad
> > >
> > > > It does not have any check for the number of descriptors to be used
> > > > for a single packet, and it relies on the users to give correct mbuf
> > > > chains.
> > > >
> > > > We may need a check of this somewhere. Of cause the point you
> > > > indicated we also need to carefully investigate or fix.
> > > >
> > > > Regards,
> > > >
> > > > Helin
> > > >
> > > > *From:*Vladislav Zolotarov [mailto:vladz@cloudius-systems.com 
> <mailto:vladz@cloudius-systems.com>]
> > > > *Sent:* Tuesday, August 25, 2015 11:34 AM
> > > > *To:* Zhang, Helin
> > > > *Cc:* Lu, Wenzhuo; dev@dpdk.org <mailto:dev@dpdk.org>
> > > > *Subject:* RE: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh
> > > > above 1 for all NICs but 82598
> > > >
> > > >
> > > > On Aug 25, 2015 21:14, "Zhang, Helin" <helin.zhang@intel.com 
> <mailto:helin.zhang@intel.com>
> > > > <mailto:helin.zhang@intel.com <mailto:helin.zhang@intel.com>>> 
> wrote:
> > > > >
> > > > > Hi Vlad
> > > > >
> > > > >
> > > > >
> > > > > In addition, I’d double check with you what’s the maximum 
> number of
> > > > descriptors would be used for a single packet transmitting?
> > > > >
> > > > > Datasheet said that it supports up to 8. I am wondering if 
> more than
> > > > 8 were used in your case?
> > > >
> > > > If memory serves me well the maximum number of data descriptors per
> > > > single xmit packet is 40 minus 2 minus WTHRESH. Since WTHRESH in 
> DPDK
> > > > is always zero it gives us 38 segments. We limit them by 33.
> > > >
> > > > >
> > > > > Thank you very much!
> > > > >
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Helin
> > > > >
> > > > >
> > > > >
> > > > > From: Zhang, Helin
> > > > > Sent: Wednesday, August 19, 2015 10:29 AM
> > > > > To: Vladislav Zolotarov
> > > > > Cc: Lu, Wenzhuo; dev@dpdk.org <mailto:dev@dpdk.org> 
> <mailto:dev@dpdk.org <mailto:dev@dpdk.org>>
> > > > >
> > > > > Subject: RE: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh
> > > > above 1 for all NICs but 82598
> > > > >
> > > > >
> > > > >
> > > > > Hi Vlad
> > > > >
> > > > >
> > > > >
> > > > > Thank you very much for the patches! Give me a few more time to
> > > > double check with more guys, and possibly hardware experts.
> > > > >
> > > > >
> > > > >
> > > > > Regards,
> > > > >
> > > > > Helin
> > > > >
> > > > >
> > > > >
> > > > > From: Vladislav Zolotarov [mailto:vladz@cloudius-systems.com 
> <mailto:vladz@cloudius-systems.com>
> > > > <mailto:vladz@cloudius-systems.com 
> <mailto:vladz@cloudius-systems.com>>]
> > > > > Sent: Tuesday, August 18, 2015 9:56 PM
> > > > > To: Lu, Wenzhuo
> > > > > Cc: dev@dpdk.org <mailto:dev@dpdk.org> <mailto:dev@dpdk.org 
> <mailto:dev@dpdk.org>>; Zhang, Helin
> > > > > Subject: RE: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh
> > > > above 1 for all NICs but 82598
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Aug 19, 2015 03:42, "Lu, Wenzhuo" <wenzhuo.lu@intel.com 
> <mailto:wenzhuo.lu@intel.com>
> > > > <mailto:wenzhuo.lu@intel.com <mailto:wenzhuo.lu@intel.com>>> wrote:
> > > > > >
> > > > > > Hi Helin,
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: dev [mailto:dev-bounces@dpdk.org 
> <mailto:dev-bounces@dpdk.org>
> > > > <mailto:dev-bounces@dpdk.org <mailto:dev-bounces@dpdk.org>>] On 
> Behalf Of Vlad Zolotarov
> > > > > > > Sent: Friday, August 14, 2015 1:38 PM
> > > > > > > To: Zhang, Helin; dev@dpdk.org <mailto:dev@dpdk.org> 
> <mailto:dev@dpdk.org <mailto:dev@dpdk.org>>
> > > > > > > Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid
> > > > tx_rs_thresh above 1 for
> > > > > > > all NICs but 82598
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On 08/13/15 23:28, Zhang, Helin wrote:
> > > > > > > > Hi Vlad
> > > > > > > >
> > > > > > > > I don't think the changes are needed. It says in datasheet
> > > > that the RS
> > > > > > > > bit should be set on the last descriptor of every 
> packet, ONLY
> > > > WHEN
> > > > > > > TXDCTL.WTHRESH equals to ZERO.
> > > > > > >
> > > > > > > Of course it's needed! See below.
> > > > > > > Exactly the same spec a few lines above the place u've just
> > > > quoted states:
> > > > > > >
> > > > > > > "Software should not set the RS bit when TXDCTL.WTHRESH is
> > > > greater than
> > > > > > > zero."
> > > > > > >
> > > > > > > And since all three (3) ixgbe xmit callbacks are utilizing RS
> > > > bit notation ixgbe PMD
> > > > > > > is actually not supporting any value of WTHRESH different 
> from zero.
> > > > > > I think Vlad is right. We need to fix this issue. Any 
> suggestion?
> > > > If not, I'd like to ack this patch.
> > > > >
> > > > > Pls., note that there is a v2 of this patch on the list. I 
> forgot to
> > > > patch ixgbevf_dev_info_get() in v1.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Helin
> > > > > > > >
> > > > > > > >> -----Original Message-----
> > > > > > > >> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com 
> <mailto:vladz@cloudius-systems.com>
> > > > <mailto:vladz@cloudius-systems.com 
> <mailto:vladz@cloudius-systems.com>>]
> > > > > > > >> Sent: Thursday, August 13, 2015 11:07 AM
> > > > > > > >> To: dev@dpdk.org <mailto:dev@dpdk.org> 
> <mailto:dev@dpdk.org <mailto:dev@dpdk.org>>
> > > > > > > >> Cc: Zhang, Helin; Ananyev, Konstantin;
> > > > avi@cloudius-systems.com <mailto:avi@cloudius-systems.com> 
> <mailto:avi@cloudius-systems.com <mailto:avi@cloudius-systems.com>>; Vlad
> > > > > > > >> Zolotarov
> > > > > > > >> Subject: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid 
> tx_rs_thresh
> > > > above 1 for
> > > > > > > all
> > > > > > > >> NICs but 82598
> > > > > > > >>
> > > > > > > >> According to 82599 and x540 HW specifications RS bit *must*
> > > > be set in the
> > > > > > > last
> > > > > > > >> descriptor of *every* packet.
> > > > > > > > There is a condition that if TXDCTL.WTHRESH equal to zero.
> > > > > > >
> > > > > > > Right and ixgbe PMD requires this condition to be fulfilled in
> > > > order to
> > > > > > > function. See above.
> > > > > > >
> > > > > > > >
> > > > > > > >> This patch fixes the Tx hang we were constantly hitting 
> with
> > > > a seastar-based
> > > > > > > >> application on x540 NIC.
> > > > > > > > Could you help to share with us how to reproduce the tx hang
> > > > issue, with using
> > > > > > > > typical DPDK examples?
> > > > > > >
> > > > > > > Sorry. I'm not very familiar with the typical DPDK examples to
> > > > help u
> > > > > > > here. However this is quite irrelevant since without this this
> > > > > > > patch ixgbe PMD obviously abuses the HW spec as has been 
> explained
> > > above.
> > > > > > >
> > > > > > > We saw the issue when u stressed the xmit path with a lot of
> > > > > > > highly fragmented TCP frames (packets with up to 33 fragments
> > > > > > > with
> > > > non-headers
> > > > > > > fragments as small as 4 bytes) with all offload features 
> enabled.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > vlad
> > > > > > > >
> > > > > > > >> Signed-off-by: Vlad Zolotarov 
> <vladz@cloudius-systems.com <mailto:vladz@cloudius-systems.com>
> > > > <mailto:vladz@cloudius-systems.com 
> <mailto:vladz@cloudius-systems.com>>>
> > > > > > > >> ---
> > > > > > > >>  drivers/net/ixgbe/ixgbe_ethdev.c |  9 +++++++++
> > > > > > > >>  drivers/net/ixgbe/ixgbe_rxtx.c  | 23 
> ++++++++++++++++++++++-
> > > > > > > >>   2 files changed, 31 insertions(+), 1 deletion(-)
> > > > > > > >>
> > > > > > > >> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c
> > > > > > > >> b/drivers/net/ixgbe/ixgbe_ethdev.c
> > > > > > > >> index b8ee1e9..6714fd9 100644
> > > > > > > >> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
> > > > > > > >> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
> > > > > > > >> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct 
> rte_eth_dev
> > > > *dev,
> > > > > > > struct
> > > > > > > >> rte_eth_dev_info *dev_info)
> > > > > > > >>            .txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |
> > > > > > > >> ETH_TXQ_FLAGS_NOOFFLOADS,
> > > > > > > >>    };
> > > > > > > >> +
> > > > > > > >> +  /*
> > > > > > > >> +   * According to 82599 and x540 specifications RS bit
> > > > *must* be set on
> > > > > > > the
> > > > > > > >> +   * last descriptor of *every* packet. Therefore we will
> > > > not allow the
> > > > > > > >> +   * tx_rs_thresh above 1 for all NICs newer than 82598.
> > > > > > > >> +   */
> > > > > > > >> +  if (hw->mac.type > ixgbe_mac_82598EB)
> > > > > > > >> + dev_info->default_txconf.tx_rs_thresh = 1;
> > > > > > > >> +
> > > > > > > >> dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX *
> > > > sizeof(uint32_t);
> > > > > > > >>    dev_info->reta_size = ETH_RSS_RETA_SIZE_128;
> > > > > > > >> dev_info->flow_type_rss_offloads = IXGBE_RSS_OFFLOAD_ALL;
> > > > > > > >> diff --
> > > > > > > git
> > > > > > > >> a/drivers/net/ixgbe/ixgbe_rxtx.c
> > > > b/drivers/net/ixgbe/ixgbe_rxtx.c index
> > > > > > > >> 91023b9..8dbdffc 100644
> > > > > > > >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> > > > > > > >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> > > > > > > >> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct
> > > > rte_eth_dev
> > > > > > > >> *dev,
> > > > > > > >>    struct ixgbe_tx_queue *txq;
> > > > > > > >>    struct ixgbe_hw  *hw;
> > > > > > > >>    uint16_t tx_rs_thresh, tx_free_thresh;
> > > > > > > >> +  bool rs_deferring_allowed;
> > > > > > > >>
> > > > > > > >>    PMD_INIT_FUNC_TRACE();
> > > > > > > >>    hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> > > > > > > >>
> > > > > > > >>    /*
> > > > > > > >> +   * According to 82599 and x540 specifications RS bit
> > > > *must* be set on
> > > > > > > the
> > > > > > > >> +   * last descriptor of *every* packet. Therefore we will
> > > > not allow the
> > > > > > > >> +   * tx_rs_thresh above 1 for all NICs newer than 82598.
> > > > > > > >> +   */
> > > > > > > >> +  rs_deferring_allowed = (hw->mac.type <=
> > > > > > > >> + ixgbe_mac_82598EB);
> > > > > > > >> +
> > > > > > > >> +  /*
> > > > > > > >>     * Validate number of transmit descriptors.
> > > > > > > >>     * It must not exceed hardware maximum, and must be 
> multiple
> > > > > > > >>     * of IXGBE_ALIGN.
> > > > > > > >> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct
> > > > > > > >> rte_eth_dev
> > > > > > > *dev,
> > > > > > > >>     * to transmit a packet is greater than the number 
> of free TX
> > > > > > > >>     * descriptors.
> > > > > > > >>     * The following constraints must be satisfied:
> > > > > > > >> +   *  tx_rs_thresh must be less than 2 for NICs for 
> which RS
> > > > deferring is
> > > > > > > >> +   *  forbidden (all but 82598).
> > > > > > > >>     *  tx_rs_thresh must be greater than 0.
> > > > > > > >>     *  tx_rs_thresh must be less than the size of the ring
> > > > minus 2.
> > > > > > > >>     *  tx_rs_thresh must be less than or equal to 
> tx_free_thresh.
> > > > > > > >> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct
> > > > rte_eth_dev
> > > > > > > *dev,
> > > > > > > >>     * When set to zero use default values.
> > > > > > > >>     */
> > > > > > > >>    tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?
> > > > > > > >> - tx_conf->tx_rs_thresh : DEFAULT_TX_RS_THRESH);
> > > > > > > >> + tx_conf->tx_rs_thresh :
> > > > > > > >> + (rs_deferring_allowed ? DEFAULT_TX_RS_THRESH : 1));
> > > > > > > >>    tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?
> > > > > > > >> tx_conf->tx_free_thresh : DEFAULT_TX_FREE_THRESH);
> > > > > > > >> +
> > > > > > > >> +  if (!rs_deferring_allowed && tx_rs_thresh > 1) {
> > > > > > > >> + PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than
> > > > 2 since RS
> > > > > > > "
> > > > > > > >> +     "must be set for every packet
> > > > for this HW. "
> > > > > > > >> + "(tx_rs_thresh=%u port=%d queue=%d)",
> > > > > > > >> +  (unsigned int)tx_rs_thresh,
> > > > > > > >> + (int)dev->data->port_id, (int)queue_idx);
> > > > > > > >> +          return -(EINVAL);
> > > > > > > >> +  }
> > > > > > > >> +
> > > > > > > >>    if (tx_rs_thresh >= (nb_desc - 2)) {
> > > > > > > >> PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than
> > > > > > > >> the
> > > > > > > number "
> > > > > > > >>  "of TX descriptors minus 2.
> > > > (tx_rs_thresh=%u "
> > > > > > > >> --
> > > > > > > >> 2.1.0
> > > > > >
> > > >
> >
>
  
Zhang, Helin Aug. 25, 2015, 8:13 p.m. UTC | #22
Yes, I got the perfect answers. Thank you very much!
I just wanted to make sure the test case was OK with the limit of maximum number of descriptors, as I heard there is a hang issue on other NICs of using more descriptors than hardware allowed.
OK. I am still waiting for the answers/confirmation from x540 hardware designers. We need all agree on your patches to avoid risks.

Regards,
Helin

From: Vladislav Zolotarov [mailto:vladz@cloudius-systems.com]

Sent: Tuesday, August 25, 2015 12:30 PM
To: Zhang, Helin
Cc: Lu, Wenzhuo; dev@dpdk.org
Subject: RE: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598


On Aug 25, 2015 22:16, "Zhang, Helin" <helin.zhang@intel.com> wrote:
>
>
> > -----Original Message-----
> > From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
> > Sent: Tuesday, August 25, 2015 11:53 AM
> > To: Zhang, Helin
> > Cc: Lu, Wenzhuo; dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for
> > all NICs but 82598
> >
> >
> > On 08/25/15 21:43, Zhang, Helin wrote:
> > >
> > > Hi Vlad
> > >
> > > I think this could possibly be the root cause of your TX hang issue.
> > > Please try to limit the number to 8 or less, and then see if the issue
> > > will still be there or not?
> > >
> >
> > Helin, the issue has been seen on x540 devices. Pls., see a chapter
> > 7.2.1.1 of x540 devices spec:
> >
> > A packet (or multiple packets in transmit segmentation) can span any number of
> > buffers (and their descriptors) up to a limit of 40 minus WTHRESH minus 2 (see
> > Section 7.2.3.3 for Tx Ring details and section Section 7.2.3.5.1 for WTHRESH
> > details). For best performance it is recommended to minimize the number of
> > buffers as possible.
> >
> > Could u, pls., clarify why do u think that the maximum number of data buffers is
> > limited by 8?
> OK, i40e hardware is 8

For i40 it's a bit more complicated than just "not more than 8" - it's not more than 8 for a non-TSO packet and not more than 8 for each MSS including headers buffers for TSO. But this thread is not about i40e so this doesn't seem to be relevant anyway.

> , so I'd assume x540 could have a similar one.

x540 spec assumes otherwise... 😉

> Yes, in your case,
> the limit could be around 38, right?

If by "around 38" u mean "exactly 38" then u are absolutely right... 😉

> Could you help to make sure there is no packet to be transmitted uses more than
> 38 descriptors?

Just like i've already mentioned, we limit the cluster by at most 33 data segments. Therefore we are good here...

> I heard that there is a similar hang issue on X710 if using more than 8 descriptors for
> a single packet. I am wondering if the issue is similar on x540.

What's x710? If that's xl710 40G nics (i40e driver), then it has its own specs with its own HW limitations i've mentioned above. It has nothing to do with this thread that is all about 10G nics managed by ixgbe driver.

There is a different thread, where i've raised the 40G NICs xmit issues. See "i40e xmit path HW limitation" thread.

>
> Regards,
> Helin
>
> >
> > thanks,
> > vlad
> >
> > > It does not have any check for the number of descriptors to be used
> > > for a single packet, and it relies on the users to give correct mbuf
> > > chains.
> > >
> > > We may need a check of this somewhere. Of cause the point you
> > > indicated we also need to carefully investigate or fix.
> > >
> > > Regards,
> > >
> > > Helin
> > >
> > > *From:* Vladislav Zolotarov [mailto:vladz@cloudius-systems.com]
> > > *Sent:* Tuesday, August 25, 2015 11:34 AM
> > > *To:* Zhang, Helin
> > > *Cc:* Lu, Wenzhuo; dev@dpdk.org
> > > *Subject:* RE: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh
> > > above 1 for all NICs but 82598
> > >
> > > On Aug 25, 2015 21:14, "Zhang, Helin" <helin.zhang@intel.com> wrote:
> > > >
> > > > Hi Vlad
> > > >
> > > > In addition, I'd double check with you what's the maximum number of
> > > > descriptors would be used for a single packet transmitting?
> > > >
> > > > Datasheet said that it supports up to 8. I am wondering if more than
> > > > 8 were used in your case?
> > >
> > > If memory serves me well the maximum number of data descriptors per
> > > single xmit packet is 40 minus 2 minus WTHRESH. Since WTHRESH in DPDK
> > > is always zero it gives us 38 segments. We limit them by 33.
> > >
> > > >
> > > > Thank you very much!
> > > >
> > > > Regards,
> > > >
> > > > Helin
> > > >

  
Thomas Monjalon Sept. 9, 2015, 12:18 p.m. UTC | #23
2015-08-25 20:13, Zhang, Helin:
> Yes, I got the perfect answers. Thank you very much!
> I just wanted to make sure the test case was OK with the limit of maximum number of descriptors, as I heard there is a hang issue on other NICs of using more descriptors than hardware allowed.
> OK. I am still waiting for the answers/confirmation from x540 hardware designers. We need all agree on your patches to avoid risks.

Helin, any news?
Can we apply v4 of this patch?
  
Ananyev, Konstantin Sept. 9, 2015, 1:19 p.m. UTC | #24
Hi Thomas,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
> Sent: Wednesday, September 09, 2015 1:19 PM
> To: Zhang, Helin
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598
> 
> 2015-08-25 20:13, Zhang, Helin:
> > Yes, I got the perfect answers. Thank you very much!
> > I just wanted to make sure the test case was OK with the limit of maximum number of descriptors, as I heard there is a hang issue on
> other NICs of using more descriptors than hardware allowed.
> > OK. I am still waiting for the answers/confirmation from x540 hardware designers. We need all agree on your patches to avoid risks.
> 
> Helin, any news?
> Can we apply v4 of this patch?

Unfortunately we are seeing a huge performance drop with that patch:
On my box bi-directional traffic (64B packet) over one port can't reach even 11 Mpps.
We are still doing some experiments and consultations with ND guys how we can overcome
this problem and keep performance intact.

Konstantin
  
Didier Pallard Sept. 11, 2015, 2:25 p.m. UTC | #25
On 08/25/2015 08:52 PM, Vlad Zolotarov wrote:
>
> Helin, the issue has been seen on x540 devices. Pls., see a chapter 
> 7.2.1.1 of x540 devices spec:
>
> A packet (or multiple packets in transmit segmentation) can span any 
> number of
> buffers (and their descriptors) up to a limit of 40 minus WTHRESH 
> minus 2 (see
> Section 7.2.3.3 for Tx Ring details and section Section 7.2.3.5.1 for 
> WTHRESH
> details). For best performance it is recommended to minimize the 
> number of buffers
> as possible.
>
> Could u, pls., clarify why do u think that the maximum number of data 
> buffers is limited by 8?
>
> thanks,
> vlad

Hi vlad,

Documentation states that a packet (or multiple packets in transmit 
segmentation) can span any number of
buffers (and their descriptors) up to a limit of 40 minus WTHRESH minus 2.

Shouldn't there be a test in transmit function that drops properly the 
mbufs with a too large number of
segments, while incrementing a statistic; otherwise transmit function 
may be locked by the faulty packet without
notification.

thanks
didier
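For illustration, a minimal sketch of the kind of guard being suggested here, assuming the x540-style ceiling of 40 - WTHRESH - 2 = 38 descriptors and a made-up statistics pointer (nothing below is existing ixgbe PMD code):

    #include <rte_mbuf.h>

    /* Hypothetical pre-filter for a tx burst: free and count any mbuf chain
     * that would need more descriptors than the HW accepts for one packet.
     * Returns the number of packets kept at the front of tx_pkts[]. */
    static uint16_t
    tx_drop_oversized(struct rte_mbuf **tx_pkts, uint16_t nb_pkts,
                      uint64_t *too_many_segs_stat)
    {
            uint16_t i, kept = 0;

            for (i = 0; i < nb_pkts; i++) {
                    if (tx_pkts[i]->nb_segs > 38) {   /* 40 - WTHRESH(0) - 2 */
                            (*too_many_segs_stat)++;
                            rte_pktmbuf_free(tx_pkts[i]);
                            continue;
                    }
                    tx_pkts[kept++] = tx_pkts[i];
            }
            return kept;
    }

Whether even a single branch like this is acceptable in the fast paths is exactly the performance question raised later in the thread.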
  
Avi Kivity Sept. 11, 2015, 2:47 p.m. UTC | #26
On 09/11/2015 05:25 PM, didier.pallard wrote:
> On 08/25/2015 08:52 PM, Vlad Zolotarov wrote:
>>
>> Helin, the issue has been seen on x540 devices. Pls., see a chapter 
>> 7.2.1.1 of x540 devices spec:
>>
>> A packet (or multiple packets in transmit segmentation) can span any 
>> number of
>> buffers (and their descriptors) up to a limit of 40 minus WTHRESH 
>> minus 2 (see
>> Section 7.2.3.3 for Tx Ring details and section Section 7.2.3.5.1 for 
>> WTHRESH
>> details). For best performance it is recommended to minimize the 
>> number of buffers
>> as possible.
>>
>> Could u, pls., clarify why do u think that the maximum number of data 
>> buffers is limited by 8?
>>
>> thanks,
>> vlad
>
> Hi vlad,
>
> Documentation states that a packet (or multiple packets in transmit 
> segmentation) can span any number of
> buffers (and their descriptors) up to a limit of 40 minus WTHRESH 
> minus 2.
>
> Shouldn't there be a test in transmit function that drops properly the 
> mbufs with a too large number of
> segments, while incrementing a statistic; otherwise transmit function 
> may be locked by the faulty packet without
> notification.
>

What we proposed is that the pmd expose to dpdk, and dpdk expose to the 
application, an mbuf check function.  This way applications that can 
generate complex packets can verify that the device will be able to 
process them, and applications that only generate simple mbufs can avoid 
the overhead by not calling the function.
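As a rough illustration of that idea only - the typedef, the tx_pkt_check field and the wrapper below are invented for this sketch and do not exist in DPDK - the hook could be a per-device callback that an application calls only when it builds complicated chains:

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    /* Hypothetical per-device validation hook. */
    typedef int (*eth_tx_pkt_check_t)(void *tx_queue, struct rte_mbuf *pkt);

    /* Returns 0 if the device can transmit 'pkt' as-is, a negative value
     * otherwise. A device without special limits would leave the hook NULL. */
    static inline int
    rte_eth_tx_pkt_check(uint8_t port_id, uint16_t queue_id,
                         struct rte_mbuf *pkt)
    {
            struct rte_eth_dev *dev = &rte_eth_devices[port_id];

            if (dev->tx_pkt_check == NULL)
                    return 0;
            return dev->tx_pkt_check(dev->data->tx_queues[queue_id], pkt);
    }

An application that only ever sends short, single-segment packets would simply never call it, which keeps the fast path untouched.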
  
Thomas Monjalon Sept. 11, 2015, 2:55 p.m. UTC | #27
2015-09-11 17:47, Avi Kivity:
> On 09/11/2015 05:25 PM, didier.pallard wrote:
> > On 08/25/2015 08:52 PM, Vlad Zolotarov wrote:
> >>
> >> Helin, the issue has been seen on x540 devices. Pls., see a chapter 
> >> 7.2.1.1 of x540 devices spec:
> >>
> >> A packet (or multiple packets in transmit segmentation) can span any 
> >> number of
> >> buffers (and their descriptors) up to a limit of 40 minus WTHRESH 
> >> minus 2 (see
> >> Section 7.2.3.3 for Tx Ring details and section Section 7.2.3.5.1 for 
> >> WTHRESH
> >> details). For best performance it is recommended to minimize the 
> >> number of buffers
> >> as possible.
> >>
> >> Could u, pls., clarify why do u think that the maximum number of data 
> >> buffers is limited by 8?
> >>
> >> thanks,
> >> vlad
> >
> > Hi vlad,
> >
> > Documentation states that a packet (or multiple packets in transmit 
> > segmentation) can span any number of
> > buffers (and their descriptors) up to a limit of 40 minus WTHRESH 
> > minus 2.
> >
> > Shouldn't there be a test in transmit function that drops properly the 
> > mbufs with a too large number of
> > segments, while incrementing a statistic; otherwise transmit function 
> > may be locked by the faulty packet without
> > notification.
> >
> 
> What we proposed is that the pmd expose to dpdk, and dpdk expose to the 
> application, an mbuf check function.  This way applications that can 
> generate complex packets can verify that the device will be able to 
> process them, and applications that only generate simple mbufs can avoid 
> the overhead by not calling the function.

More than a check, it should be exposed as a capability of the port.
Anyway, if the application sends too much segments, the driver must
drop it to avoid hang, and maintain a dedicated statistic counter to allow
easy debugging.
  
Vladislav Zolotarov Sept. 11, 2015, 3:12 p.m. UTC | #28
On Sep 11, 2015 5:55 PM, "Thomas Monjalon" <thomas.monjalon@6wind.com>
wrote:
>
> 2015-09-11 17:47, Avi Kivity:
> > On 09/11/2015 05:25 PM, didier.pallard wrote:
> > > On 08/25/2015 08:52 PM, Vlad Zolotarov wrote:
> > >>
> > >> Helin, the issue has been seen on x540 devices. Pls., see a chapter
> > >> 7.2.1.1 of x540 devices spec:
> > >>
> > >> A packet (or multiple packets in transmit segmentation) can span any
> > >> number of
> > >> buffers (and their descriptors) up to a limit of 40 minus WTHRESH
> > >> minus 2 (see
> > >> Section 7.2.3.3 for Tx Ring details and section Section 7.2.3.5.1 for
> > >> WTHRESH
> > >> details). For best performance it is recommended to minimize the
> > >> number of buffers
> > >> as possible.
> > >>
> > >> Could u, pls., clarify why do u think that the maximum number of data
> > >> buffers is limited by 8?
> > >>
> > >> thanks,
> > >> vlad
> > >
> > > Hi vlad,
> > >
> > > Documentation states that a packet (or multiple packets in transmit
> > > segmentation) can span any number of
> > > buffers (and their descriptors) up to a limit of 40 minus WTHRESH
> > > minus 2.
> > >
> > > Shouldn't there be a test in transmit function that drops properly the
> > > mbufs with a too large number of
> > > segments, while incrementing a statistic; otherwise transmit function
> > > may be locked by the faulty packet without
> > > notification.
> > >
> >
> > What we proposed is that the pmd expose to dpdk, and dpdk expose to the
> > application, an mbuf check function.  This way applications that can
> > generate complex packets can verify that the device will be able to
> > process them, and applications that only generate simple mbufs can avoid
> > the overhead by not calling the function.
>
> More than a check, it should be exposed as a capability of the port.
> Anyway, if the application sends too much segments, the driver must
> drop it to avoid hang, and maintain a dedicated statistic counter to allow
> easy debugging.

I agree with Thomas - this should not be optional. Malformed packets should
be dropped. In the ixgbe case it's a very simple test - it's a single
branch per packet so i doubt that it could impose any measurable
performance degradation.

>
  
Vladislav Zolotarov Sept. 11, 2015, 3:17 p.m. UTC | #29
On Sep 9, 2015 4:19 PM, "Ananyev, Konstantin" <konstantin.ananyev@intel.com>
wrote:
>
> Hi Thomas,
>
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
> > Sent: Wednesday, September 09, 2015 1:19 PM
> > To: Zhang, Helin
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above
> > 1 for all NICs but 82598
> >
> > 2015-08-25 20:13, Zhang, Helin:
> > > Yes, I got the perfect answers. Thank you very much!
> > > I just wanted to make sure the test case was OK with the limit of
> > > maximum number of descriptors, as I heard there is a hang issue on
> > > other NICs of using more descriptors than hardware allowed.
> > > OK. I am still waiting for the answers/confirmation from x540
> > > hardware designers. We need all agree on your patches to avoid risks.
> >
> > Helin, any news?
> > Can we apply v4 of this patch?
>
> Unfortunately we are seeing a huge performance drop with that patch:
> On my box bi-directional traffic (64B packet) over one port can't reach
> even 11 Mpps.

Konstantin, could u clarify - u saw "only" 11 Mpps with v3 of this patch
which doesn't change the rs_thresh and only sets RS on every packet? What
is the performance in the same test without this patch?

> We are still doing some experiments and consultations with ND guys how we
> can overcome
> this problem and keep performance intact.
>
> Konstantin
  
Avi Kivity Sept. 11, 2015, 3:43 p.m. UTC | #30
On 09/11/2015 06:12 PM, Vladislav Zolotarov wrote:
>
>
> On Sep 11, 2015 5:55 PM, "Thomas Monjalon" <thomas.monjalon@6wind.com 
> <mailto:thomas.monjalon@6wind.com>> wrote:
> >
> > 2015-09-11 17:47, Avi Kivity:
> > > On 09/11/2015 05:25 PM, didier.pallard wrote:
> > > > On 08/25/2015 08:52 PM, Vlad Zolotarov wrote:
> > > >>
> > > >> Helin, the issue has been seen on x540 devices. Pls., see a chapter
> > > >> 7.2.1.1 of x540 devices spec:
> > > >>
> > > >> A packet (or multiple packets in transmit segmentation) can 
> span any
> > > >> number of
> > > >> buffers (and their descriptors) up to a limit of 40 minus WTHRESH
> > > >> minus 2 (see
> > > >> Section 7.2.3.3 for Tx Ring details and section Section 
> 7.2.3.5.1 for
> > > >> WTHRESH
> > > >> details). For best performance it is recommended to minimize the
> > > >> number of buffers
> > > >> as possible.
> > > >>
> > > >> Could u, pls., clarify why do u think that the maximum number 
> of data
> > > >> buffers is limited by 8?
> > > >>
> > > >> thanks,
> > > >> vlad
> > > >
> > > > Hi vlad,
> > > >
> > > > Documentation states that a packet (or multiple packets in transmit
> > > > segmentation) can span any number of
> > > > buffers (and their descriptors) up to a limit of 40 minus WTHRESH
> > > > minus 2.
> > > >
> > > > Shouldn't there be a test in transmit function that drops 
> properly the
> > > > mbufs with a too large number of
> > > > segments, while incrementing a statistic; otherwise transmit 
> function
> > > > may be locked by the faulty packet without
> > > > notification.
> > > >
> > >
> > > What we proposed is that the pmd expose to dpdk, and dpdk expose 
> to the
> > > application, an mbuf check function.  This way applications that can
> > > generate complex packets can verify that the device will be able to
> > > process them, and applications that only generate simple mbufs can 
> avoid
> > > the overhead by not calling the function.
> >
> > More than a check, it should be exposed as a capability of the port.
> > Anyway, if the application sends too much segments, the driver must
> > drop it to avoid hang, and maintain a dedicated statistic counter to 
> allow
> > easy debugging.
>
> I agree with Thomas - this should not be optional. Malformed packets 
> should be dropped. In the icgbe case it's a very simple test - it's a 
> single branch per packet so i doubt that it could impose any 
> measurable performance degradation.
>
>

A drop allows the application no chance to recover.  The driver must 
either provide the ability for the application to know that it cannot 
accept the packet, or it must fix it up itself.
  
Bruce Richardson Sept. 11, 2015, 4 p.m. UTC | #31
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Vladislav Zolotarov
> Sent: Friday, September 11, 2015 4:13 PM
> To: Thomas Monjalon
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1
> for all NICs but 82598
> 

> I agree with Thomas - this should not be optional. Malformed packets
> should be dropped. In the ixgbe case it's a very simple test - it's a
> single branch per packet so i doubt that it could impose any measurable
> performance degradation.

Actually, it could very well do - we'd have to test it. For the vector IO
paths, every additional cycle in the RX or TX paths causes a noticeable perf
drop.

/Bruce
  
Vladislav Zolotarov Sept. 11, 2015, 4:04 p.m. UTC | #32
On Sep 11, 2015 6:43 PM, "Avi Kivity" <avi@cloudius-systems.com> wrote:
>
> On 09/11/2015 06:12 PM, Vladislav Zolotarov wrote:
>>
>>
>> On Sep 11, 2015 5:55 PM, "Thomas Monjalon" <thomas.monjalon@6wind.com>
wrote:
>> >
>> > 2015-09-11 17:47, Avi Kivity:
>> > > On 09/11/2015 05:25 PM, didier.pallard wrote:
>> > > > On 08/25/2015 08:52 PM, Vlad Zolotarov wrote:
>> > > >>
>> > > >> Helin, the issue has been seen on x540 devices. Pls., see a
chapter
>> > > >> 7.2.1.1 of x540 devices spec:
>> > > >>
>> > > >> A packet (or multiple packets in transmit segmentation) can span
any
>> > > >> number of
>> > > >> buffers (and their descriptors) up to a limit of 40 minus WTHRESH
>> > > >> minus 2 (see
>> > > >> Section 7.2.3.3 for Tx Ring details and section Section 7.2.3.5.1
for
>> > > >> WTHRESH
>> > > >> details). For best performance it is recommended to minimize the
>> > > >> number of buffers
>> > > >> as possible.
>> > > >>
>> > > >> Could u, pls., clarify why do u think that the maximum number of
data
>> > > >> buffers is limited by 8?
>> > > >>
>> > > >> thanks,
>> > > >> vlad
>> > > >
>> > > > Hi vlad,
>> > > >
>> > > > Documentation states that a packet (or multiple packets in transmit
>> > > > segmentation) can span any number of
>> > > > buffers (and their descriptors) up to a limit of 40 minus WTHRESH
>> > > > minus 2.
>> > > >
>> > > > Shouldn't there be a test in transmit function that drops properly
the
>> > > > mbufs with a too large number of
>> > > > segments, while incrementing a statistic; otherwise transmit
function
>> > > > may be locked by the faulty packet without
>> > > > notification.
>> > > >
>> > >
>> > > What we proposed is that the pmd expose to dpdk, and dpdk expose to
the
>> > > application, an mbuf check function.  This way applications that can
>> > > generate complex packets can verify that the device will be able to
>> > > process them, and applications that only generate simple mbufs can
avoid
>> > > the overhead by not calling the function.
>> >
>> > More than a check, it should be exposed as a capability of the port.
>> > Anyway, if the application sends too much segments, the driver must
>> > drop it to avoid hang, and maintain a dedicated statistic counter to
allow
>> > easy debugging.
>>
>> I agree with Thomas - this should not be optional. Malformed packets
should be dropped. In the icgbe case it's a very simple test - it's a
single branch per packet so i doubt that it could impose any measurable
performance degradation.
>>
>>
>
> A drop allows the application no chance to recover.  The driver must
either provide the ability for the application to know that it cannot
accept the packet, or it must fix it up itself.

An appropriate statistics counter would be a perfect tool to detect such
issues. Knowingly sending a packet that will cause a HW to hang is not
acceptable.
  
Bruce Richardson Sept. 11, 2015, 4:07 p.m. UTC | #33
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Vladislav Zolotarov
> Sent: Friday, September 11, 2015 5:04 PM
> To: Avi Kivity
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1
> for all NICs but 82598
> 

> > A drop allows the application no chance to recover.  The driver must
> > either provide the ability for the application to know that it cannot
> > accept the packet, or it must fix it up itself.
>
> An appropriate statistics counter would be a perfect tool to detect such
> issues. Knowingly sending a packet that will cause a HW to hang is not
> acceptable.

I would agree. Drivers should provide a function to query the max number of
segments they can accept and the driver should be able to discard any packets
exceeding that number, and just track it via a stat.

/Bruce
  
Thomas Monjalon Sept. 11, 2015, 4:08 p.m. UTC | #34
2015-09-11 18:43, Avi Kivity:
> On 09/11/2015 06:12 PM, Vladislav Zolotarov wrote:
> > On Sep 11, 2015 5:55 PM, "Thomas Monjalon" <thomas.monjalon@6wind.com 
> > <mailto:thomas.monjalon@6wind.com>> wrote:
> > > 2015-09-11 17:47, Avi Kivity:
> > > > On 09/11/2015 05:25 PM, didier.pallard wrote:
> > > > > Hi vlad,
> > > > >
> > > > > Documentation states that a packet (or multiple packets in transmit
> > > > > segmentation) can span any number of
> > > > > buffers (and their descriptors) up to a limit of 40 minus WTHRESH
> > > > > minus 2.
> > > > >
> > > > > Shouldn't there be a test in transmit function that drops 
> > properly the
> > > > > mbufs with a too large number of
> > > > > segments, while incrementing a statistic; otherwise transmit 
> > function
> > > > > may be locked by the faulty packet without
> > > > > notification.
> > > > >
> > > >
> > > > What we proposed is that the pmd expose to dpdk, and dpdk expose 
> > to the
> > > > application, an mbuf check function.  This way applications that can
> > > > generate complex packets can verify that the device will be able to
> > > > process them, and applications that only generate simple mbufs can 
> > avoid
> > > > the overhead by not calling the function.
> > >
> > > More than a check, it should be exposed as a capability of the port.
> > > Anyway, if the application sends too much segments, the driver must
> > > drop it to avoid hang, and maintain a dedicated statistic counter to 
> > > allow easy debugging.
> >
> > I agree with Thomas - this should not be optional. Malformed packets 
> > should be dropped. In the icgbe case it's a very simple test - it's a 
> > single branch per packet so i doubt that it could impose any 
> > measurable performance degradation.
> 
> A drop allows the application no chance to recover.  The driver must 
> either provide the ability for the application to know that it cannot 
> accept the packet, or it must fix it up itself.

I have the feeling that everybody agrees on the same thing:
the application must be able to make a well formed packet by checking
limitations of the port. What about a field rte_eth_dev_info.max_tx_segs?
In case the application fails in its checks, the driver must drop it and
notify the user via a stat counter.
The driver can also remove the hardware limitation by gathering the segments
but it may be hard to implement and would be a slow operation.
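If a field along these lines were added - max_tx_segs below is only the proposal from this mail, not an existing rte_eth_dev_info member - application-side use might look roughly like this, with 'm' standing for the outgoing mbuf chain:

    struct rte_eth_dev_info dev_info;
    uint16_t max_segs;

    rte_eth_dev_info_get(port_id, &dev_info);
    /* in this hypothetical scheme, 0 would mean "no device limit" */
    max_segs = dev_info.max_tx_segs ? dev_info.max_tx_segs : UINT16_MAX;

    if (m->nb_segs > max_segs) {
            /* linearize or re-segment before rte_eth_tx_burst(), or drop it;
             * the driver would still drop and count it as a last resort */
    }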
  
Vladislav Zolotarov Sept. 11, 2015, 4:13 p.m. UTC | #35
On Sep 11, 2015 7:00 PM, "Richardson, Bruce" <bruce.richardson@intel.com>
wrote:
>
>
>
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Vladislav Zolotarov
> > Sent: Friday, September 11, 2015 4:13 PM
> > To: Thomas Monjalon
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above
1
> > for all NICs but 82598
> >
> > On Sep 11, 2015 5:55 PM, "Thomas Monjalon" <thomas.monjalon@6wind.com>
> > wrote:
> > >
> > > 2015-09-11 17:47, Avi Kivity:
> > > > On 09/11/2015 05:25 PM, didier.pallard wrote:
> > > > > On 08/25/2015 08:52 PM, Vlad Zolotarov wrote:
> > > > >>
> > > > >> Helin, the issue has been seen on x540 devices. Pls., see a
> > > > >> chapter
> > > > >> 7.2.1.1 of x540 devices spec:
> > > > >>
> > > > >> A packet (or multiple packets in transmit segmentation) can span
> > > > >> any number of buffers (and their descriptors) up to a limit of 40
> > > > >> minus WTHRESH minus 2 (see Section 7.2.3.3 for Tx Ring details
> > > > >> and section Section 7.2.3.5.1 for WTHRESH details). For best
> > > > >> performance it is recommended to minimize the number of buffers
> > > > >> as possible.
> > > > >>
> > > > >> Could u, pls., clarify why do u think that the maximum number of
> > > > >> data buffers is limited by 8?
> > > > >>
> > > > >> thanks,
> > > > >> vlad
> > > > >
> > > > > Hi vlad,
> > > > >
> > > > > Documentation states that a packet (or multiple packets in
> > > > > transmit
> > > > > segmentation) can span any number of buffers (and their
> > > > > descriptors) up to a limit of 40 minus WTHRESH minus 2.
> > > > >
> > > > > Shouldn't there be a test in transmit function that drops properly
> > > > > the mbufs with a too large number of segments, while incrementing
> > > > > a statistic; otherwise transmit function may be locked by the
> > > > > faulty packet without notification.
> > > > >
> > > >
> > > > What we proposed is that the pmd expose to dpdk, and dpdk expose to
> > > > the application, an mbuf check function.  This way applications that
> > > > can generate complex packets can verify that the device will be able
> > > > to process them, and applications that only generate simple mbufs
> > > > can avoid the overhead by not calling the function.
> > >
> > > More than a check, it should be exposed as a capability of the port.
> > > Anyway, if the application sends too much segments, the driver must
> > > drop it to avoid hang, and maintain a dedicated statistic counter to
> > > allow easy debugging.
> >
> > I agree with Thomas - this should not be optional. Malformed packets
> > should be dropped. In the icgbe case it's a very simple test - it's a
> > single branch per packet so i doubt that it could impose any measurable
> > performance degradation.
> >
> Actually, it could very well do - we'd have to test it. For the vector IO
> paths, every additional cycle in the RX or TX paths causes a noticeable
perf
> drop.

Well if your application is willing to know all different HW limitations
then u may not need it. However usually application doesn't want to know
the HW technical details. And it this case ignoring them may cause HW to
hang.

Of course, if your app always sends single fragment packets of less than
1500 bytes then u r right and u will most likely not hit any HW limitation,
however what i have in mind is a full featured case where packets are bit
more big and complicated and where a single branch per packet will change
nothing. This is regarding 40 segments case.

In regard to the RS bit - this is related to any packet and according to
spec it should be set in every packet.

>
> /Bruce
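For reference, the rule being discussed comes down to one bit when the last descriptor of a packet is written; a simplified fragment (not the actual ixgbe_rxtx.c code, with 'txd' and 'data_len' assumed from the surrounding transmit loop) would look like:

    /* Last data descriptor of the packet: with TXDCTL.WTHRESH == 0 the spec
     * wants RS reported here for every packet, not once per tx_rs_thresh
     * packets. */
    uint32_t cmd_type_len = IXGBE_ADVTXD_DTYP_DATA | IXGBE_ADVTXD_DCMD_DEXT |
                            IXGBE_ADVTXD_DCMD_IFCS |
                            IXGBE_TXD_CMD_EOP | IXGBE_TXD_CMD_RS;

    txd->read.cmd_type_len = rte_cpu_to_le_32(cmd_type_len | data_len);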
  
Vladislav Zolotarov Sept. 11, 2015, 4:14 p.m. UTC | #36
On Sep 11, 2015 7:07 PM, "Richardson, Bruce" <bruce.richardson@intel.com>
wrote:
>
>
>
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Vladislav Zolotarov
> > Sent: Friday, September 11, 2015 5:04 PM
> > To: Avi Kivity
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above
1
> > for all NICs but 82598
> >
> > On Sep 11, 2015 6:43 PM, "Avi Kivity" <avi@cloudius-systems.com> wrote:
> > >
> > > On 09/11/2015 06:12 PM, Vladislav Zolotarov wrote:
> > >>
> > >>
> > >> On Sep 11, 2015 5:55 PM, "Thomas Monjalon"
> > >> <thomas.monjalon@6wind.com>
> > wrote:
> > >> >
> > >> > 2015-09-11 17:47, Avi Kivity:
> > >> > > On 09/11/2015 05:25 PM, didier.pallard wrote:
> > >> > > > On 08/25/2015 08:52 PM, Vlad Zolotarov wrote:
> > >> > > >>
> > >> > > >> Helin, the issue has been seen on x540 devices. Pls., see a
> > chapter
> > >> > > >> 7.2.1.1 of x540 devices spec:
> > >> > > >>
> > >> > > >> A packet (or multiple packets in transmit segmentation) can
> > >> > > >> span
> > any
> > >> > > >> number of
> > >> > > >> buffers (and their descriptors) up to a limit of 40 minus
> > >> > > >> WTHRESH minus 2 (see Section 7.2.3.3 for Tx Ring details and
> > >> > > >> section Section 7.2.3.5.1
> > for
> > >> > > >> WTHRESH
> > >> > > >> details). For best performance it is recommended to minimize
> > >> > > >> the number of buffers as possible.
> > >> > > >>
> > >> > > >> Could u, pls., clarify why do u think that the maximum number
> > >> > > >> of
> > data
> > >> > > >> buffers is limited by 8?
> > >> > > >>
> > >> > > >> thanks,
> > >> > > >> vlad
> > >> > > >
> > >> > > > Hi vlad,
> > >> > > >
> > >> > > > Documentation states that a packet (or multiple packets in
> > >> > > > transmit
> > >> > > > segmentation) can span any number of buffers (and their
> > >> > > > descriptors) up to a limit of 40 minus WTHRESH minus 2.
> > >> > > >
> > >> > > > Shouldn't there be a test in transmit function that drops
> > >> > > > properly
> > the
> > >> > > > mbufs with a too large number of segments, while incrementing a
> > >> > > > statistic; otherwise transmit
> > function
> > >> > > > may be locked by the faulty packet without notification.
> > >> > > >
> > >> > >
> > >> > > What we proposed is that the pmd expose to dpdk, and dpdk expose
> > >> > > to
> > the
> > >> > > application, an mbuf check function.  This way applications that
> > >> > > can generate complex packets can verify that the device will be
> > >> > > able to process them, and applications that only generate simple
> > >> > > mbufs can
> > avoid
> > >> > > the overhead by not calling the function.
> > >> >
> > >> > More than a check, it should be exposed as a capability of the
port.
> > >> > Anyway, if the application sends too much segments, the driver must
> > >> > drop it to avoid hang, and maintain a dedicated statistic counter
> > >> > to
> > allow
> > >> > easy debugging.
> > >>
> > >> I agree with Thomas - this should not be optional. Malformed packets
> > should be dropped. In the icgbe case it's a very simple test - it's a
> > single branch per packet so i doubt that it could impose any measurable
> > performance degradation.
> > >>
> > >>
> > >
> > > A drop allows the application no chance to recover.  The driver must
> > either provide the ability for the application to know that it cannot
> > accept the packet, or it must fix it up itself.
> >
> > An appropriate statistics counter would be a perfect tool to detect such
> > issues. Knowingly sending a packet that will cause a HW to hang is not
> > acceptable.
>
> I would agree. Drivers should provide a function to query the max number
of
> segments they can accept and the driver should be able to discard any
packets
> exceeding that number, and just track it via a stat.

+1

>
> /Bruce
  
Vladislav Zolotarov Sept. 11, 2015, 4:18 p.m. UTC | #37
On Sep 11, 2015 7:09 PM, "Thomas Monjalon" <thomas.monjalon@6wind.com>
wrote:
>
> 2015-09-11 18:43, Avi Kivity:
> > On 09/11/2015 06:12 PM, Vladislav Zolotarov wrote:
> > > On Sep 11, 2015 5:55 PM, "Thomas Monjalon" <thomas.monjalon@6wind.com
> > > <mailto:thomas.monjalon@6wind.com>> wrote:
> > > > 2015-09-11 17:47, Avi Kivity:
> > > > > On 09/11/2015 05:25 PM, didier.pallard wrote:
> > > > > > Hi vlad,
> > > > > >
> > > > > > Documentation states that a packet (or multiple packets in
transmit
> > > > > > segmentation) can span any number of
> > > > > > buffers (and their descriptors) up to a limit of 40 minus
WTHRESH
> > > > > > minus 2.
> > > > > >
> > > > > > Shouldn't there be a test in transmit function that drops
> > > properly the
> > > > > > mbufs with a too large number of
> > > > > > segments, while incrementing a statistic; otherwise transmit
> > > function
> > > > > > may be locked by the faulty packet without
> > > > > > notification.
> > > > > >
> > > > >
> > > > > What we proposed is that the pmd expose to dpdk, and dpdk expose
> > > to the
> > > > > application, an mbuf check function.  This way applications that
can
> > > > > generate complex packets can verify that the device will be able
to
> > > > > process them, and applications that only generate simple mbufs can
> > > avoid
> > > > > the overhead by not calling the function.
> > > >
> > > > More than a check, it should be exposed as a capability of the port.
> > > > Anyway, if the application sends too much segments, the driver must
> > > > drop it to avoid hang, and maintain a dedicated statistic counter to
> > > > allow easy debugging.
> > >
> > > I agree with Thomas - this should not be optional. Malformed packets
> > > should be dropped. In the icgbe case it's a very simple test - it's a
> > > single branch per packet so i doubt that it could impose any
> > > measurable performance degradation.
> >
> > A drop allows the application no chance to recover.  The driver must
> > either provide the ability for the application to know that it cannot
> > accept the packet, or it must fix it up itself.
>
> I have the feeling that everybody agrees on the same thing:
> the application must be able to make a well formed packet by checking
> limitations of the port. What about a field rte_eth_dev_info.max_tx_segs?
> In case the application fails in its checks, the driver must drop it and
> notify the user via a stat counter.
> The driver can also remove the hardware limitation by gathering the
segments
> but it may be hard to implement and would be a slow operation.

We thought about linearization too. It's doable with extra mempool and it
may be optional so that those that don't need could compile it out and/or
disable it in a runtime...
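A minimal sketch of that kind of linearization, assuming a dedicated pool ('big_pool' below) whose buffers are large enough for a whole frame and ignoring the offload metadata a real implementation would also have to carry over:

    #include <rte_mbuf.h>
    #include <rte_memcpy.h>

    /* Copy a multi-segment chain into a single mbuf taken from big_pool.
     * Returns the flat mbuf, or NULL if allocation or space fails. */
    static struct rte_mbuf *
    tx_linearize(struct rte_mbuf *m, struct rte_mempool *big_pool)
    {
            struct rte_mbuf *flat = rte_pktmbuf_alloc(big_pool);
            struct rte_mbuf *seg;
            char *dst;

            if (flat == NULL)
                    return NULL;
            for (seg = m; seg != NULL; seg = seg->next) {
                    dst = rte_pktmbuf_append(flat, rte_pktmbuf_data_len(seg));
                    if (dst == NULL) {
                            rte_pktmbuf_free(flat);
                            return NULL;
                    }
                    rte_memcpy(dst, rte_pktmbuf_mtod(seg, char *),
                               rte_pktmbuf_data_len(seg));
            }
            rte_pktmbuf_free(m);
            return flat;
    }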
  
Matthew Hall Sept. 11, 2015, 5:17 p.m. UTC | #38
On Fri, Sep 11, 2015 at 07:18:20PM +0300, Vladislav Zolotarov wrote:
> We thought about linearization too. It's doable with extra mempool and it
> may be optional so that those that don't need could compile it out and/or
> disable it in a runtime...

High-level question. How realistic is sending a 40-segment frame in the first 
place? This whole thing seems kind of academic to me unless I missed 
something.

I usually use 2K pktmbufs and I don't think this is an uncommon size. Most 
jumbo frame hardware only does 9.5KB max frame size or so.

So I am having a hard time imagining how I'd end up with more than 10 segments 
as a worst-case scenario.

Matthew.
  
Ananyev, Konstantin Sept. 11, 2015, 5:42 p.m. UTC | #39
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Matthew Hall
> Sent: Friday, September 11, 2015 6:18 PM
> To: Vladislav Zolotarov
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598
> 
> On Fri, Sep 11, 2015 at 07:18:20PM +0300, Vladislav Zolotarov wrote:
> > We thought about linearization too. It's doable with extra mempool and it
> > may be optional so that those that don't need could compile it out and/or
> > disable it in a runtime...
> 
> High-level question. How realistic is sending a 40-segment frame in the first
> place? This whole thing seems kind of academic to me unless I missed
> something.
> 
> I usually use 2K pktmbufs and I don't think this is an uncommon size. Most
> jumbo frame hardware only does 9.5KB max frame size or so.
> 
> So I am having a hard time imagining how I'd end up with more than 10 segments
> as a worst-case scenario.

As I remember, with freebsd stack when TSO is on it was not unusual to see chains of ~30 segments. 
That's over port with 'normal' mtu (1.5K).
Konstantin

> 
> Matthew.
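As a rough back-of-the-envelope check (not a figure from the thread itself): a 64 KB TSO send buffered in 2 KB mbuf clusters already spans 64 * 1024 / 2048 = 32 buffers before any header mbufs, which is in line with the ~30-segment chains mentioned above, far past an 8-buffer limit and not far below the 38-descriptor ceiling discussed earlier for 82599/x540.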
  
Avi Kivity Sept. 11, 2015, 5:44 p.m. UTC | #40
On 09/11/2015 07:07 PM, Richardson, Bruce wrote:
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Vladislav Zolotarov
>> Sent: Friday, September 11, 2015 5:04 PM
>> To: Avi Kivity
>> Cc: dev@dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1
>> for all NICs but 82598
>>
>> On Sep 11, 2015 6:43 PM, "Avi Kivity" <avi@cloudius-systems.com> wrote:
>>> On 09/11/2015 06:12 PM, Vladislav Zolotarov wrote:
>>>>
>>>> On Sep 11, 2015 5:55 PM, "Thomas Monjalon"
>>>> <thomas.monjalon@6wind.com>
>> wrote:
>>>>> 2015-09-11 17:47, Avi Kivity:
>>>>>> On 09/11/2015 05:25 PM, didier.pallard wrote:
>>>>>>> On 08/25/2015 08:52 PM, Vlad Zolotarov wrote:
>>>>>>>> Helin, the issue has been seen on x540 devices. Pls., see a
>> chapter
>>>>>>>> 7.2.1.1 of x540 devices spec:
>>>>>>>>
>>>>>>>> A packet (or multiple packets in transmit segmentation) can
>>>>>>>> span
>> any
>>>>>>>> number of
>>>>>>>> buffers (and their descriptors) up to a limit of 40 minus
>>>>>>>> WTHRESH minus 2 (see Section 7.2.3.3 for Tx Ring details and
>>>>>>>> section Section 7.2.3.5.1
>> for
>>>>>>>> WTHRESH
>>>>>>>> details). For best performance it is recommended to minimize
>>>>>>>> the number of buffers as possible.
>>>>>>>>
>>>>>>>> Could u, pls., clarify why do u think that the maximum number
>>>>>>>> of
>> data
>>>>>>>> buffers is limited by 8?
>>>>>>>>
>>>>>>>> thanks,
>>>>>>>> vlad
>>>>>>> Hi vlad,
>>>>>>>
>>>>>>> Documentation states that a packet (or multiple packets in
>>>>>>> transmit
>>>>>>> segmentation) can span any number of buffers (and their
>>>>>>> descriptors) up to a limit of 40 minus WTHRESH minus 2.
>>>>>>>
>>>>>>> Shouldn't there be a test in transmit function that drops
>>>>>>> properly
>> the
>>>>>>> mbufs with a too large number of segments, while incrementing a
>>>>>>> statistic; otherwise transmit
>> function
>>>>>>> may be locked by the faulty packet without notification.
>>>>>>>
>>>>>> What we proposed is that the pmd expose to dpdk, and dpdk expose
>>>>>> to
>> the
>>>>>> application, an mbuf check function.  This way applications that
>>>>>> can generate complex packets can verify that the device will be
>>>>>> able to process them, and applications that only generate simple
>>>>>> mbufs can
>> avoid
>>>>>> the overhead by not calling the function.
>>>>> More than a check, it should be exposed as a capability of the port.
>>>>> Anyway, if the application sends too much segments, the driver must
>>>>> drop it to avoid hang, and maintain a dedicated statistic counter
>>>>> to
>> allow
>>>>> easy debugging.
>>>> I agree with Thomas - this should not be optional. Malformed packets
>> should be dropped. In the icgbe case it's a very simple test - it's a
>> single branch per packet so i doubt that it could impose any measurable
>> performance degradation.
>>>>
>>> A drop allows the application no chance to recover.  The driver must
>> either provide the ability for the application to know that it cannot
>> accept the packet, or it must fix it up itself.
>>
>> An appropriate statistics counter would be a perfect tool to detect such
>> issues. Knowingly sending a packet that will cause a HW to hang is not
>> acceptable.
> I would agree. Drivers should provide a function to query the max number of
> segments they can accept and the driver should be able to discard any packets
> exceeding that number, and just track it via a stat.
>

There is no such max number of segments.  The i40e card, as an extreme 
example, allows 8 fragments per packet, but that is after TSO 
segmentation.  So if the header is in three fragments, that leaves 5 
data fragments per packet.  Another card (ixgbe) has a 38-fragment 
pre-TSO limit.  With such a variety of limitations, the only generic way 
to expose them is via a function.
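
(For reference, the 38 above follows from the 82599/x540 rule quoted earlier in
this thread - a packet may span at most 40 minus WTHRESH minus 2 buffers -
evaluated with WTHRESH = 0.)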
  
Avi Kivity Sept. 11, 2015, 5:48 p.m. UTC | #41
On 09/11/2015 07:08 PM, Thomas Monjalon wrote:
> 2015-09-11 18:43, Avi Kivity:
>> On 09/11/2015 06:12 PM, Vladislav Zolotarov wrote:
>>> On Sep 11, 2015 5:55 PM, "Thomas Monjalon" <thomas.monjalon@6wind.com
>>> <mailto:thomas.monjalon@6wind.com>> wrote:
>>>> 2015-09-11 17:47, Avi Kivity:
>>>>> On 09/11/2015 05:25 PM, didier.pallard wrote:
>>>>>> Hi vlad,
>>>>>>
>>>>>> Documentation states that a packet (or multiple packets in transmit
>>>>>> segmentation) can span any number of
>>>>>> buffers (and their descriptors) up to a limit of 40 minus WTHRESH
>>>>>> minus 2.
>>>>>>
>>>>>> Shouldn't there be a test in transmit function that drops
>>> properly the
>>>>>> mbufs with a too large number of
>>>>>> segments, while incrementing a statistic; otherwise transmit
>>> function
>>>>>> may be locked by the faulty packet without
>>>>>> notification.
>>>>>>
>>>>> What we proposed is that the pmd expose to dpdk, and dpdk expose
>>> to the
>>>>> application, an mbuf check function.  This way applications that can
>>>>> generate complex packets can verify that the device will be able to
>>>>> process them, and applications that only generate simple mbufs can
>>> avoid
>>>>> the overhead by not calling the function.
>>>> More than a check, it should be exposed as a capability of the port.
>>>> Anyway, if the application sends too much segments, the driver must
>>>> drop it to avoid hang, and maintain a dedicated statistic counter to
>>>> allow easy debugging.
>>> I agree with Thomas - this should not be optional. Malformed packets
>>> should be dropped. In the icgbe case it's a very simple test - it's a
>>> single branch per packet so i doubt that it could impose any
>>> measurable performance degradation.
>> A drop allows the application no chance to recover.  The driver must
>> either provide the ability for the application to know that it cannot
>> accept the packet, or it must fix it up itself.
> I have the feeling that everybody agrees on the same thing:
> the application must be able to make a well formed packet by checking
> limitations of the port. What about a field rte_eth_dev_info.max_tx_segs?

It is not generic enough.  i40e has a limit that it imposes post-TSO.


> In case the application fails in its checks, the driver must drop it and
> notify the user via a stat counter.
> The driver can also remove the hardware limitation by gathering the segments
> but it may be hard to implement and would be a slow operation.

I think that to satisfy both the 64-byte full-line-rate applications and the 
more complicated full-stack applications, this must be made optional.  
In particular, an application that only forwards packets will never hit 
a NIC's limits, so it need not take any action. That's why I think a 
verification function is ideal; a forwarding application can ignore it, 
and a complex application can call it, and if it fails the packet, it 
can linearize the packet itself, keeping that complexity out of dpdk.
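
To make the "linearize the packet itself" option concrete, here is a minimal
copy-based sketch using only standard mbuf calls; the helper name and the
dedicated mempool are assumptions of the sketch, and the verification hook
that would trigger it is exactly the function being proposed in this thread,
not an existing API:

#include <rte_mbuf.h>
#include <rte_memcpy.h>

/* Copy a multi-segment chain into a single mbuf allocated from 'mp'.
 * Returns the new mbuf (and frees the old chain) or NULL if the packet
 * does not fit into one buffer. */
static struct rte_mbuf *
app_linearize(struct rte_mempool *mp, struct rte_mbuf *m)
{
	struct rte_mbuf *n = rte_pktmbuf_alloc(mp);
	const struct rte_mbuf *seg;
	char *dst;

	if (n == NULL)
		return NULL;

	for (seg = m; seg != NULL; seg = seg->next) {
		dst = rte_pktmbuf_append(n, seg->data_len);
		if (dst == NULL) {	/* chain larger than one buffer */
			rte_pktmbuf_free(n);
			return NULL;
		}
		rte_memcpy(dst, rte_pktmbuf_mtod(seg, const void *),
			   seg->data_len);
	}

	/* The offload metadata has to follow the data. */
	n->ol_flags = m->ol_flags;
	n->tx_offload = m->tx_offload;

	rte_pktmbuf_free(m);
	return n;
}

A fast-path forwarding application would never call this; a complex stack
would call it only for the (rare) chains that the verification function
rejects.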
  
Matthew Hall Sept. 11, 2015, 5:58 p.m. UTC | #42
On Fri, Sep 11, 2015 at 05:42:48PM +0000, Ananyev, Konstantin wrote:
> As I remember, with freebsd stack when TSO is on it was not unusual to see chains of ~30 segments. 
> That's over port with 'normal' mtu (1.5K).
> Konstantin

This makes things quite tricky, because the TSO logic itself would need to 
handle the segment generation versus the limit, as the app would not really be 
able to force TSO to change its behavior, right?

Matthew.
  
Ananyev, Konstantin Sept. 13, 2015, 11:47 a.m. UTC | #43
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Avi Kivity
> Sent: Friday, September 11, 2015 6:48 PM
> To: Thomas Monjalon; Vladislav Zolotarov; didier.pallard
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598
> 
> On 09/11/2015 07:08 PM, Thomas Monjalon wrote:
> > 2015-09-11 18:43, Avi Kivity:
> >> On 09/11/2015 06:12 PM, Vladislav Zolotarov wrote:
> >>> On Sep 11, 2015 5:55 PM, "Thomas Monjalon" <thomas.monjalon@6wind.com
> >>> <mailto:thomas.monjalon@6wind.com>> wrote:
> >>>> 2015-09-11 17:47, Avi Kivity:
> >>>>> On 09/11/2015 05:25 PM, didier.pallard wrote:
> >>>>>> Hi vlad,
> >>>>>>
> >>>>>> Documentation states that a packet (or multiple packets in transmit
> >>>>>> segmentation) can span any number of
> >>>>>> buffers (and their descriptors) up to a limit of 40 minus WTHRESH
> >>>>>> minus 2.
> >>>>>>
> >>>>>> Shouldn't there be a test in transmit function that drops
> >>> properly the
> >>>>>> mbufs with a too large number of
> >>>>>> segments, while incrementing a statistic; otherwise transmit
> >>> function
> >>>>>> may be locked by the faulty packet without
> >>>>>> notification.
> >>>>>>
> >>>>> What we proposed is that the pmd expose to dpdk, and dpdk expose
> >>> to the
> >>>>> application, an mbuf check function.  This way applications that can
> >>>>> generate complex packets can verify that the device will be able to
> >>>>> process them, and applications that only generate simple mbufs can
> >>> avoid
> >>>>> the overhead by not calling the function.
> >>>> More than a check, it should be exposed as a capability of the port.
> >>>> Anyway, if the application sends too much segments, the driver must
> >>>> drop it to avoid hang, and maintain a dedicated statistic counter to
> >>>> allow easy debugging.
> >>> I agree with Thomas - this should not be optional. Malformed packets
> >>> should be dropped. In the icgbe case it's a very simple test - it's a
> >>> single branch per packet so i doubt that it could impose any
> >>> measurable performance degradation.
> >> A drop allows the application no chance to recover.  The driver must
> >> either provide the ability for the application to know that it cannot
> >> accept the packet, or it must fix it up itself.
> > I have the feeling that everybody agrees on the same thing:
> > the application must be able to make a well formed packet by checking
> > limitations of the port. What about a field rte_eth_dev_info.max_tx_segs?
> 
> It is not generic enough.  i40e has a limit that it imposes post-TSO.
> 
> 
> > In case the application fails in its checks, the driver must drop it and
> > notify the user via a stat counter.
> > The driver can also remove the hardware limitation by gathering the segments
> > but it may be hard to implement and would be a slow operation.
> 
> I think that to satisfy both the 64b full line rate applications and the
> more complicated full stack applications, this must be made optional.
> In particular, and application that only forwards packets will never hit
> a NIC's limits, so it need not take any action. That's why I think a
> verification function is ideal; a forwarding application can ignore it,
> and a complex application can call it, and if it fails the packet, it
> can linearize it itself, removing complexity from dpdk itself.

I think that's a good approach to that problem.
As I remember, we discussed something similar a while ago -
a function (tx_prep() or something) that would check nb_segs and probably some other HW-specific restrictions,
calculate the pseudo-header checksum, reset the IP header length, etc.

On the other hand, we can also add two more fields to rte_eth_dev_info:
1) Max number of segs per TSO packet (tx_max_seg ?).
2) Max number of segs per single packet/TSO segment (tx_max_mtu_seg ?).
So for ixgbe both would have the value 40 - WTHRESH,
while for i40e 1) would be UINT8_MAX and 2) would be 8.
Then the upper layer can use that information to select an optimal size for its TX buffers.
 
Konstantin
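
To illustrate that last point, a small sketch of how an application could turn
the proposed per-TSO-segment limit into a minimum TX buffer size; the function,
its arguments and the worked numbers are illustrative only and are not part of
any existing DPDK structure:

#include <stdint.h>

/* 'max_descs_per_seg' is the proposed tx_max_mtu_seg value and 'hdr_bufs'
 * is how many of those buffers the headers may occupy.  In the worst case
 * an MSS-sized payload window starts on the last byte of a buffer, so it
 * fits in k data buffers only if (k - 1) * bufsize >= mss - 1. */
static uint32_t
min_tx_buf_size(uint32_t mss, uint32_t hdr_bufs, uint32_t max_descs_per_seg)
{
	uint32_t k = max_descs_per_seg - hdr_bufs;	/* assumes k >= 2 */

	return (mss - 1 + k - 2) / (k - 1);		/* ceil((mss-1)/(k-1)) */
}

/* With the i40e-style numbers from this thread - 8 buffers per transmitted
 * segment, headers in up to 3 buffers, MSS = 1460 - this gives 365 bytes,
 * so common 2KB mbufs are fine; the risky case is a stack that chains many
 * tiny buffers. */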
  
Vladislav Zolotarov Sept. 13, 2015, 12:24 p.m. UTC | #44
On 09/13/15 14:47, Ananyev, Konstantin wrote:
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Avi Kivity
>> Sent: Friday, September 11, 2015 6:48 PM
>> To: Thomas Monjalon; Vladislav Zolotarov; didier.pallard
>> Cc: dev@dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598
>>
>> On 09/11/2015 07:08 PM, Thomas Monjalon wrote:
>>> 2015-09-11 18:43, Avi Kivity:
>>>> On 09/11/2015 06:12 PM, Vladislav Zolotarov wrote:
>>>>> On Sep 11, 2015 5:55 PM, "Thomas Monjalon" <thomas.monjalon@6wind.com
>>>>> <mailto:thomas.monjalon@6wind.com>> wrote:
>>>>>> 2015-09-11 17:47, Avi Kivity:
>>>>>>> On 09/11/2015 05:25 PM, didier.pallard wrote:
>>>>>>>> Hi vlad,
>>>>>>>>
>>>>>>>> Documentation states that a packet (or multiple packets in transmit
>>>>>>>> segmentation) can span any number of
>>>>>>>> buffers (and their descriptors) up to a limit of 40 minus WTHRESH
>>>>>>>> minus 2.
>>>>>>>>
>>>>>>>> Shouldn't there be a test in transmit function that drops
>>>>> properly the
>>>>>>>> mbufs with a too large number of
>>>>>>>> segments, while incrementing a statistic; otherwise transmit
>>>>> function
>>>>>>>> may be locked by the faulty packet without
>>>>>>>> notification.
>>>>>>>>
>>>>>>> What we proposed is that the pmd expose to dpdk, and dpdk expose
>>>>> to the
>>>>>>> application, an mbuf check function.  This way applications that can
>>>>>>> generate complex packets can verify that the device will be able to
>>>>>>> process them, and applications that only generate simple mbufs can
>>>>> avoid
>>>>>>> the overhead by not calling the function.
>>>>>> More than a check, it should be exposed as a capability of the port.
>>>>>> Anyway, if the application sends too much segments, the driver must
>>>>>> drop it to avoid hang, and maintain a dedicated statistic counter to
>>>>>> allow easy debugging.
>>>>> I agree with Thomas - this should not be optional. Malformed packets
>>>>> should be dropped. In the icgbe case it's a very simple test - it's a
>>>>> single branch per packet so i doubt that it could impose any
>>>>> measurable performance degradation.
>>>> A drop allows the application no chance to recover.  The driver must
>>>> either provide the ability for the application to know that it cannot
>>>> accept the packet, or it must fix it up itself.
>>> I have the feeling that everybody agrees on the same thing:
>>> the application must be able to make a well formed packet by checking
>>> limitations of the port. What about a field rte_eth_dev_info.max_tx_segs?
>> It is not generic enough.  i40e has a limit that it imposes post-TSO.
>>
>>
>>> In case the application fails in its checks, the driver must drop it and
>>> notify the user via a stat counter.
>>> The driver can also remove the hardware limitation by gathering the segments
>>> but it may be hard to implement and would be a slow operation.
>> I think that to satisfy both the 64b full line rate applications and the
>> more complicated full stack applications, this must be made optional.
>> In particular, and application that only forwards packets will never hit
>> a NIC's limits, so it need not take any action. That's why I think a
>> verification function is ideal; a forwarding application can ignore it,
>> and a complex application can call it, and if it fails the packet, it
>> can linearize it itself, removing complexity from dpdk itself.
> I think that's a good approach to that problem.
> As I remember we discussed something similar a while ago -
> A function (tx_prep() or something) that would check nb_segs and probably some other HW specific restrictions,
> calculate pseudo-header checksum, reset ip header len, etc.
>
>  From other hand we also can add two more fields into rte_eth_dev_info:
> 1) Max num of segs per TSO packet (tx_max_seg ?).
> 2) Max num of segs per single packet/TSO segment (tx_max_mtu_seg ?).
> So for ixgbe both will have value 40 - wthresh,
> while for i40e 1) would be UINT8_MAX and 2) will be 8.
> Then upper layer can use that information to select an optimal size for its TX buffers.

HW limitations differ from HW to HW not only in their values but also in their 
nature - for instance, for QLogic bnx2x NICs the limitations may not be 
expressible in terms of the values above, so this must be a callback.

>   
> Konstantin
>
  
Avi Kivity Sept. 13, 2015, 12:32 p.m. UTC | #45
On 09/13/2015 02:47 PM, Ananyev, Konstantin wrote:
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Avi Kivity
>> Sent: Friday, September 11, 2015 6:48 PM
>> To: Thomas Monjalon; Vladislav Zolotarov; didier.pallard
>> Cc: dev@dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598
>>
>> On 09/11/2015 07:08 PM, Thomas Monjalon wrote:
>>> 2015-09-11 18:43, Avi Kivity:
>>>> On 09/11/2015 06:12 PM, Vladislav Zolotarov wrote:
>>>>> On Sep 11, 2015 5:55 PM, "Thomas Monjalon" <thomas.monjalon@6wind.com
>>>>> <mailto:thomas.monjalon@6wind.com>> wrote:
>>>>>> 2015-09-11 17:47, Avi Kivity:
>>>>>>> On 09/11/2015 05:25 PM, didier.pallard wrote:
>>>>>>>> Hi vlad,
>>>>>>>>
>>>>>>>> Documentation states that a packet (or multiple packets in transmit
>>>>>>>> segmentation) can span any number of
>>>>>>>> buffers (and their descriptors) up to a limit of 40 minus WTHRESH
>>>>>>>> minus 2.
>>>>>>>>
>>>>>>>> Shouldn't there be a test in transmit function that drops
>>>>> properly the
>>>>>>>> mbufs with a too large number of
>>>>>>>> segments, while incrementing a statistic; otherwise transmit
>>>>> function
>>>>>>>> may be locked by the faulty packet without
>>>>>>>> notification.
>>>>>>>>
>>>>>>> What we proposed is that the pmd expose to dpdk, and dpdk expose
>>>>> to the
>>>>>>> application, an mbuf check function.  This way applications that can
>>>>>>> generate complex packets can verify that the device will be able to
>>>>>>> process them, and applications that only generate simple mbufs can
>>>>> avoid
>>>>>>> the overhead by not calling the function.
>>>>>> More than a check, it should be exposed as a capability of the port.
>>>>>> Anyway, if the application sends too much segments, the driver must
>>>>>> drop it to avoid hang, and maintain a dedicated statistic counter to
>>>>>> allow easy debugging.
>>>>> I agree with Thomas - this should not be optional. Malformed packets
>>>>> should be dropped. In the icgbe case it's a very simple test - it's a
>>>>> single branch per packet so i doubt that it could impose any
>>>>> measurable performance degradation.
>>>> A drop allows the application no chance to recover.  The driver must
>>>> either provide the ability for the application to know that it cannot
>>>> accept the packet, or it must fix it up itself.
>>> I have the feeling that everybody agrees on the same thing:
>>> the application must be able to make a well formed packet by checking
>>> limitations of the port. What about a field rte_eth_dev_info.max_tx_segs?
>> It is not generic enough.  i40e has a limit that it imposes post-TSO.
>>
>>
>>> In case the application fails in its checks, the driver must drop it and
>>> notify the user via a stat counter.
>>> The driver can also remove the hardware limitation by gathering the segments
>>> but it may be hard to implement and would be a slow operation.
>> I think that to satisfy both the 64b full line rate applications and the
>> more complicated full stack applications, this must be made optional.
>> In particular, and application that only forwards packets will never hit
>> a NIC's limits, so it need not take any action. That's why I think a
>> verification function is ideal; a forwarding application can ignore it,
>> and a complex application can call it, and if it fails the packet, it
>> can linearize it itself, removing complexity from dpdk itself.
> I think that's a good approach to that problem.
> As I remember we discussed something similar a while ago -
> A function (tx_prep() or something) that would check nb_segs and probably some other HW specific restrictions,
> calculate pseudo-header checksum, reset ip header len, etc.
>
>  From other hand we also can add two more fields into rte_eth_dev_info:
> 1) Max num of segs per TSO packet (tx_max_seg ?).
> 2) Max num of segs per single packet/TSO segment (tx_max_mtu_seg ?).
> So for ixgbe both will have value 40 - wthresh,
> while for i40e 1) would be UINT8_MAX and 2) will be 8.
> Then upper layer can use that information to select an optimal size for its TX buffers.
>   
>

This will break whenever the fevered imagination of hardware designers 
comes up with a new limit.

We can have an internal function that accepts these two parameters, and 
then the driver-specific function can call this internal function:

static bool i40e_validate_packet(struct rte_mbuf *m) {
     /* i40e: no whole-chain limit, 8 buffers per transmitted segment. */
     return rte_generic_validate_packet(m, 0, 8);
}

static bool ixgbe_validate_packet(struct rte_mbuf *m) {
     /* ixgbe: 40-descriptor budget minus a WTHRESH-related reserve. */
     return rte_generic_validate_packet(m, 40, 2);
}

this way, the application is isolated from changes in how invalid 
packets are detected.
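
A slightly fuller (and still hypothetical) sketch of the generic helper
referenced above; since the thread never spells out what the two numeric
arguments mean, they are given explicit names here, and the deeper
per-TSO-segment walk the i40e case would need is only indicated, not
implemented:

#include <stdbool.h>
#include <rte_mbuf.h>

/* chain_limit: max mbufs per packet before TSO (0 = no whole-chain limit).
 * seg_limit:   max buffers the NIC may fetch per transmitted TSO segment. */
static bool
rte_generic_validate_packet(const struct rte_mbuf *m,
			    uint16_t chain_limit, uint16_t seg_limit)
{
	if (chain_limit != 0 && m->nb_segs > chain_limit)
		return false;

	/* A complete implementation would also, for TSO packets, walk the
	 * chain and check that the headers plus any MSS worth of payload
	 * never span more than seg_limit buffers. */
	(void)seg_limit;
	return true;
}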
  
Ananyev, Konstantin Sept. 13, 2015, 3:54 p.m. UTC | #46
> -----Original Message-----
> From: Avi Kivity [mailto:avi@cloudius-systems.com]
> Sent: Sunday, September 13, 2015 1:33 PM
> To: Ananyev, Konstantin; Thomas Monjalon; Vladislav Zolotarov; didier.pallard
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598
> 
> On 09/13/2015 02:47 PM, Ananyev, Konstantin wrote:
> >
> >> -----Original Message-----
> >> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Avi Kivity
> >> Sent: Friday, September 11, 2015 6:48 PM
> >> To: Thomas Monjalon; Vladislav Zolotarov; didier.pallard
> >> Cc: dev@dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for all NICs but 82598
> >>
> >> On 09/11/2015 07:08 PM, Thomas Monjalon wrote:
> >>> 2015-09-11 18:43, Avi Kivity:
> >>>> On 09/11/2015 06:12 PM, Vladislav Zolotarov wrote:
> >>>>> On Sep 11, 2015 5:55 PM, "Thomas Monjalon" <thomas.monjalon@6wind.com
> >>>>> <mailto:thomas.monjalon@6wind.com>> wrote:
> >>>>>> 2015-09-11 17:47, Avi Kivity:
> >>>>>>> On 09/11/2015 05:25 PM, didier.pallard wrote:
> >>>>>>>> Hi vlad,
> >>>>>>>>
> >>>>>>>> Documentation states that a packet (or multiple packets in transmit
> >>>>>>>> segmentation) can span any number of
> >>>>>>>> buffers (and their descriptors) up to a limit of 40 minus WTHRESH
> >>>>>>>> minus 2.
> >>>>>>>>
> >>>>>>>> Shouldn't there be a test in transmit function that drops
> >>>>> properly the
> >>>>>>>> mbufs with a too large number of
> >>>>>>>> segments, while incrementing a statistic; otherwise transmit
> >>>>> function
> >>>>>>>> may be locked by the faulty packet without
> >>>>>>>> notification.
> >>>>>>>>
> >>>>>>> What we proposed is that the pmd expose to dpdk, and dpdk expose
> >>>>> to the
> >>>>>>> application, an mbuf check function.  This way applications that can
> >>>>>>> generate complex packets can verify that the device will be able to
> >>>>>>> process them, and applications that only generate simple mbufs can
> >>>>> avoid
> >>>>>>> the overhead by not calling the function.
> >>>>>> More than a check, it should be exposed as a capability of the port.
> >>>>>> Anyway, if the application sends too much segments, the driver must
> >>>>>> drop it to avoid hang, and maintain a dedicated statistic counter to
> >>>>>> allow easy debugging.
> >>>>> I agree with Thomas - this should not be optional. Malformed packets
> >>>>> should be dropped. In the icgbe case it's a very simple test - it's a
> >>>>> single branch per packet so i doubt that it could impose any
> >>>>> measurable performance degradation.
> >>>> A drop allows the application no chance to recover.  The driver must
> >>>> either provide the ability for the application to know that it cannot
> >>>> accept the packet, or it must fix it up itself.
> >>> I have the feeling that everybody agrees on the same thing:
> >>> the application must be able to make a well formed packet by checking
> >>> limitations of the port. What about a field rte_eth_dev_info.max_tx_segs?
> >> It is not generic enough.  i40e has a limit that it imposes post-TSO.
> >>
> >>
> >>> In case the application fails in its checks, the driver must drop it and
> >>> notify the user via a stat counter.
> >>> The driver can also remove the hardware limitation by gathering the segments
> >>> but it may be hard to implement and would be a slow operation.
> >> I think that to satisfy both the 64b full line rate applications and the
> >> more complicated full stack applications, this must be made optional.
> >> In particular, and application that only forwards packets will never hit
> >> a NIC's limits, so it need not take any action. That's why I think a
> >> verification function is ideal; a forwarding application can ignore it,
> >> and a complex application can call it, and if it fails the packet, it
> >> can linearize it itself, removing complexity from dpdk itself.
> > I think that's a good approach to that problem.
> > As I remember we discussed something similar a while ago -
> > A function (tx_prep() or something) that would check nb_segs and probably some other HW specific restrictions,
> > calculate pseudo-header checksum, reset ip header len, etc.
> >
> >  From other hand we also can add two more fields into rte_eth_dev_info:
> > 1) Max num of segs per TSO packet (tx_max_seg ?).
> > 2) Max num of segs per single packet/TSO segment (tx_max_mtu_seg ?).
> > So for ixgbe both will have value 40 - wthresh,
> > while for i40e 1) would be UINT8_MAX and 2) will be 8.
> > Then upper layer can use that information to select an optimal size for its TX buffers.
> >
> >
> 
> This will break whenever the fevered imagination of hardware designers
> comes up with a new limit.
> 
> We can have an internal function that accepts these two parameters, and
> then the driver-specific function can call this internal function:
> 
> static bool i40e_validate_packet(mbuf* m) {
>      return rte_generic_validate_packet(m, 0, 8);
> }
> 
> static bool ixgbe_validate_packet(mbuf* m) {
>      return rte_generic_validate_packet(m, 40, 2);
> }
> 
> this way, the application is isolated from changes in how invalid
> packets are detected.
> 
> 

I am not saying we shouldn't have a tx_prep() (tx_validate()?) function per PMD.
As I said before, I like that approach.
I think we should have tx_prep() (as you suggested), which most people using the full-featured TX path would call,
*plus* these extra fields in rte_eth_dev_info, so if someone needs that information it would be there.
Konstantin
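
For concreteness, a minimal sketch of the kind of per-packet fix-up such a
tx_prep() hook could do for a TSO'd, untagged IPv4/TCP mbuf; the function
name, the max_segs argument and the calling convention are assumptions of the
sketch - only the pseudo-header checksum seeding itself is the established
DPDK TSO convention:

#include <errno.h>
#include <rte_mbuf.h>
#include <rte_ip.h>
#include <rte_tcp.h>

/* Assumes the L2/L3/L4 headers all sit in the first mbuf of the chain. */
static int
tx_prep_one(struct rte_mbuf *m, uint16_t max_segs)
{
	char *hdrs = rte_pktmbuf_mtod(m, char *);
	struct ipv4_hdr *ip;
	struct tcp_hdr *tcp;

	if (m->nb_segs > max_segs)
		return -EINVAL;		/* caller drops, counts or linearizes */

	if ((m->ol_flags & PKT_TX_TCP_SEG) == 0)
		return 0;		/* no TSO, nothing to fix up here */

	/* For TSO the HW expects the TCP checksum field to be seeded with
	 * the pseudo-header checksum, computed without the payload length. */
	ip = (struct ipv4_hdr *)(hdrs + m->l2_len);
	tcp = (struct tcp_hdr *)(hdrs + m->l2_len + m->l3_len);
	tcp->cksum = rte_ipv4_phdr_cksum(ip, m->ol_flags);

	return 0;
}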
  
Avi Kivity Sept. 13, 2015, 4:01 p.m. UTC | #47
On Sep 13, 2015 6:54 PM, "Ananyev, Konstantin" <konstantin.ananyev@intel.com>
wrote:
>
>
>
> > -----Original Message-----
> > From: Avi Kivity [mailto:avi@cloudius-systems.com]
> > Sent: Sunday, September 13, 2015 1:33 PM
> > To: Ananyev, Konstantin; Thomas Monjalon; Vladislav Zolotarov;
didier.pallard
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above
1 for all NICs but 82598
> >
> > On 09/13/2015 02:47 PM, Ananyev, Konstantin wrote:
> > >
> > >> -----Original Message-----
> > >> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Avi Kivity
> > >> Sent: Friday, September 11, 2015 6:48 PM
> > >> To: Thomas Monjalon; Vladislav Zolotarov; didier.pallard
> > >> Cc: dev@dpdk.org
> > >> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh
above 1 for all NICs but 82598
> > >>
> > >> On 09/11/2015 07:08 PM, Thomas Monjalon wrote:
> > >>> 2015-09-11 18:43, Avi Kivity:
> > >>>> On 09/11/2015 06:12 PM, Vladislav Zolotarov wrote:
> > >>>>> On Sep 11, 2015 5:55 PM, "Thomas Monjalon" <
thomas.monjalon@6wind.com
> > >>>>> <mailto:thomas.monjalon@6wind.com>> wrote:
> > >>>>>> 2015-09-11 17:47, Avi Kivity:
> > >>>>>>> On 09/11/2015 05:25 PM, didier.pallard wrote:
> > >>>>>>>> Hi vlad,
> > >>>>>>>>
> > >>>>>>>> Documentation states that a packet (or multiple packets in
transmit
> > >>>>>>>> segmentation) can span any number of
> > >>>>>>>> buffers (and their descriptors) up to a limit of 40 minus
WTHRESH
> > >>>>>>>> minus 2.
> > >>>>>>>>
> > >>>>>>>> Shouldn't there be a test in transmit function that drops
> > >>>>> properly the
> > >>>>>>>> mbufs with a too large number of
> > >>>>>>>> segments, while incrementing a statistic; otherwise transmit
> > >>>>> function
> > >>>>>>>> may be locked by the faulty packet without
> > >>>>>>>> notification.
> > >>>>>>>>
> > >>>>>>> What we proposed is that the pmd expose to dpdk, and dpdk expose
> > >>>>> to the
> > >>>>>>> application, an mbuf check function.  This way applications
that can
> > >>>>>>> generate complex packets can verify that the device will be
able to
> > >>>>>>> process them, and applications that only generate simple mbufs
can
> > >>>>> avoid
> > >>>>>>> the overhead by not calling the function.
> > >>>>>> More than a check, it should be exposed as a capability of the
port.
> > >>>>>> Anyway, if the application sends too much segments, the driver
must
> > >>>>>> drop it to avoid hang, and maintain a dedicated statistic
counter to
> > >>>>>> allow easy debugging.
> > >>>>> I agree with Thomas - this should not be optional. Malformed
packets
> > >>>>> should be dropped. In the icgbe case it's a very simple test -
it's a
> > >>>>> single branch per packet so i doubt that it could impose any
> > >>>>> measurable performance degradation.
> > >>>> A drop allows the application no chance to recover.  The driver
must
> > >>>> either provide the ability for the application to know that it
cannot
> > >>>> accept the packet, or it must fix it up itself.
> > >>> I have the feeling that everybody agrees on the same thing:
> > >>> the application must be able to make a well formed packet by
checking
> > >>> limitations of the port. What about a field
rte_eth_dev_info.max_tx_segs?
> > >> It is not generic enough.  i40e has a limit that it imposes post-TSO.
> > >>
> > >>
> > >>> In case the application fails in its checks, the driver must drop
it and
> > >>> notify the user via a stat counter.
> > >>> The driver can also remove the hardware limitation by gathering the
segments
> > >>> but it may be hard to implement and would be a slow operation.
> > >> I think that to satisfy both the 64b full line rate applications and
the
> > >> more complicated full stack applications, this must be made optional.
> > >> In particular, and application that only forwards packets will never
hit
> > >> a NIC's limits, so it need not take any action. That's why I think a
> > >> verification function is ideal; a forwarding application can ignore
it,
> > >> and a complex application can call it, and if it fails the packet, it
> > >> can linearize it itself, removing complexity from dpdk itself.
> > > I think that's a good approach to that problem.
> > > As I remember we discussed something similar a while ago -
> > > A function (tx_prep() or something) that would check nb_segs and
probably some other HW specific restrictions,
> > > calculate pseudo-header checksum, reset ip header len, etc.
> > >
> > >  From other hand we also can add two more fields into
rte_eth_dev_info:
> > > 1) Max num of segs per TSO packet (tx_max_seg ?).
> > > 2) Max num of segs per single packet/TSO segment (tx_max_mtu_seg ?).
> > > So for ixgbe both will have value 40 - wthresh,
> > > while for i40e 1) would be UINT8_MAX and 2) will be 8.
> > > Then upper layer can use that information to select an optimal size
for its TX buffers.
> > >
> > >
> >
> > This will break whenever the fevered imagination of hardware designers
> > comes up with a new limit.
> >
> > We can have an internal function that accepts these two parameters, and
> > then the driver-specific function can call this internal function:
> >
> > static bool i40e_validate_packet(mbuf* m) {
> >      return rte_generic_validate_packet(m, 0, 8);
> > }
> >
> > static bool ixgbe_validate_packet(mbuf* m) {
> >      return rte_generic_validate_packet(m, 40, 2);
> > }
> >
> > this way, the application is isolated from changes in how invalid
> > packets are detected.
> >
> >
>
> I am not saying we shouldn't have tx_prep (tx_validate?) function per PMD.
> As I said before I like that approach.
> I think we should have tx_prep (as you suggested) that most people using
full-path TX would call,
> *plus* these extra fields in re_eth_dev_conf, so if someone needs that
information - it would be there.

I think this is reasonable. Having those values can allow the application
to avoid generating bad packets in the first place.

> Konstantin
>
  

Patch

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index b8ee1e9..6714fd9 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -2414,6 +2414,15 @@  ixgbe_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
 		.txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |
 				ETH_TXQ_FLAGS_NOOFFLOADS,
 	};
+
+	/*
+	 * According to 82599 and x540 specifications RS bit *must* be set on the
+	 * last descriptor of *every* packet. Therefore we will not allow the
+	 * tx_rs_thresh above 1 for all NICs newer than 82598.
+	 */
+	if (hw->mac.type > ixgbe_mac_82598EB)
+		dev_info->default_txconf.tx_rs_thresh = 1;
+
 	dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX * sizeof(uint32_t);
 	dev_info->reta_size = ETH_RSS_RETA_SIZE_128;
 	dev_info->flow_type_rss_offloads = IXGBE_RSS_OFFLOAD_ALL;
diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index 91023b9..8dbdffc 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -2085,11 +2085,19 @@  ixgbe_dev_tx_queue_setup(struct rte_eth_dev *dev,
 	struct ixgbe_tx_queue *txq;
 	struct ixgbe_hw     *hw;
 	uint16_t tx_rs_thresh, tx_free_thresh;
+	bool rs_deferring_allowed;
 
 	PMD_INIT_FUNC_TRACE();
 	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
 
 	/*
+	 * According to 82599 and x540 specifications RS bit *must* be set on the
+	 * last descriptor of *every* packet. Therefore we will not allow the
+	 * tx_rs_thresh above 1 for all NICs newer than 82598.
+	 */
+	rs_deferring_allowed = (hw->mac.type <= ixgbe_mac_82598EB);
+
+	/*
 	 * Validate number of transmit descriptors.
 	 * It must not exceed hardware maximum, and must be multiple
 	 * of IXGBE_ALIGN.
@@ -2110,6 +2118,8 @@  ixgbe_dev_tx_queue_setup(struct rte_eth_dev *dev,
 	 * to transmit a packet is greater than the number of free TX
 	 * descriptors.
 	 * The following constraints must be satisfied:
+	 *  tx_rs_thresh must be less than 2 for NICs for which RS deferring is
+	 *  forbidden (all but 82598).
 	 *  tx_rs_thresh must be greater than 0.
 	 *  tx_rs_thresh must be less than the size of the ring minus 2.
 	 *  tx_rs_thresh must be less than or equal to tx_free_thresh.
@@ -2121,9 +2131,20 @@  ixgbe_dev_tx_queue_setup(struct rte_eth_dev *dev,
 	 * When set to zero use default values.
 	 */
 	tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?
-			tx_conf->tx_rs_thresh : DEFAULT_TX_RS_THRESH);
+			tx_conf->tx_rs_thresh :
+			(rs_deferring_allowed ? DEFAULT_TX_RS_THRESH : 1));
 	tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?
 			tx_conf->tx_free_thresh : DEFAULT_TX_FREE_THRESH);
+
+	if (!rs_deferring_allowed && tx_rs_thresh > 1) {
+		PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than 2 since RS "
+				  "must be set for every packet for this HW. "
+				  "(tx_rs_thresh=%u port=%d queue=%d)",
+			     (unsigned int)tx_rs_thresh,
+			     (int)dev->data->port_id, (int)queue_idx);
+		return -(EINVAL);
+	}
+
 	if (tx_rs_thresh >= (nb_desc - 2)) {
 		PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than the number "
 			     "of TX descriptors minus 2. (tx_rs_thresh=%u "