[v2,2/2] net/i40e: fix risk in Rx descriptor read in scalar path

Message ID: 20210915083339.2424369-3-ruifeng.wang@arm.com (mailing list archive)
State: Accepted, archived
Delegated to: Qi Zhang
Series: i40e Rx descriptor loads ordering

Checks

Context                          Check    Description
ci/checkpatch                    success  coding style OK
ci/github-robot: build           success  github build: passed
ci/iol-x86_64-unit-testing       success  Testing PASS
ci/Intel-compilation             success  Compilation OK
ci/intel-Testing                 fail     Testing issues
ci/iol-x86_64-compile-testing    success  Testing PASS
ci/iol-mellanox-Performance      success  Performance Testing PASS

Commit Message

Ruifeng Wang Sept. 15, 2021, 8:33 a.m. UTC
Rx descriptor is 16B/32B in size. If the DD bit is set, it indicates
that the rest of the descriptor words have valid values. Hence, the
word containing DD bit must be read first before reading the rest of
the descriptor words.

Since the entire descriptor is not read atomically, on relaxed memory
ordered systems like Aarch64, read of the word containing DD field
could be reordered after read of other words.

Read barrier is inserted between read of the word with DD field
and read of other words. The barrier ensures that the fetched data
is correct.
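
As a rough, standalone illustration of the ordering described above (not the
driver code), the sketch below uses C11 atomics in place of DPDK's
rte_atomic_thread_fence(). The struct layout, the DD bit position, and all
demo_* names are assumptions made up for this example; the real descriptor is
defined by the i40e driver.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_DD_BIT UINT64_C(1)   /* assumed position of the DD flag */

/* Hypothetical 16-byte descriptor; field names are illustrative only. */
struct demo_rx_desc {
	uint64_t qword0;           /* payload word written by the device */
	_Atomic uint64_t qword1;   /* status word holding the DD flag */
};

/*
 * Poll one descriptor. Returns true and copies both words out only
 * when the DD flag is set; otherwise leaves *q0 and *q1 untouched.
 */
static inline bool
demo_poll_desc(struct demo_rx_desc *desc, uint64_t *q0, uint64_t *q1)
{
	uint64_t status;

	assert(desc != NULL && q0 != NULL && q1 != NULL);

	/* Step 1: load the word that carries DD. */
	status = atomic_load_explicit(&desc->qword1, memory_order_relaxed);
	if (!(status & DEMO_DD_BIT))
		return false;

	/*
	 * Step 2: acquire fence. Loads that follow cannot be reordered
	 * before the status load above.
	 */
	atomic_thread_fence(memory_order_acquire);

	/* Step 3: only now read the remaining descriptor words. */
	*q0 = desc->qword0;
	*q1 = status;
	return true;
}
```

Without the fence in step 2, a weakly ordered CPU may satisfy the qword0 load
before the status load, so the caller could see DD set alongside a stale
qword0.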

Testpmd single core test showed no performance drop on x86 or N1SDP.
On ThunderX2, 22% performance regression was observed.

Fixes: 7b0cf70135d1 ("net/i40e: support ARM platform")
Cc: stable@dpdk.org

Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 drivers/net/i40e/i40e_rxtx.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)
  

Comments

Ferruh Yigit Sept. 29, 2021, 3:05 p.m. UTC | #1
On 9/15/2021 9:33 AM, Ruifeng Wang wrote:
> Rx descriptor is 16B/32B in size. If the DD bit is set, it indicates
> that the rest of the descriptor words have valid values. Hence, the
> word containing DD bit must be read first before reading the rest of
> the descriptor words.
> 
> Since the entire descriptor is not read atomically, on relaxed memory
> ordered systems like Aarch64, read of the word containing DD field
> could be reordered after read of other words.
> 
> Read barrier is inserted between read of the word with DD field
> and read of other words. The barrier ensures that the fetched data
> is correct.
> 
> Testpmd single core test showed no performance drop on x86 or N1SDP.
> On ThunderX2, 22% performance regression was observed.
> 

Is the 22% performance drop figure correct? That is a big drop; is it acceptable?

Does this performance drop apply to all Arm scalar datapaths, or is it specific to
ThunderX2?

> Fixes: 7b0cf70135d1 ("net/i40e: support ARM platform")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
  
Honnappa Nagarahalli Sept. 29, 2021, 3:29 p.m. UTC | #2
<snip>
> 
> On 9/15/2021 9:33 AM, Ruifeng Wang wrote:
> > Rx descriptor is 16B/32B in size. If the DD bit is set, it indicates
> > that the rest of the descriptor words have valid values. Hence, the
> > word containing DD bit must be read first before reading the rest of
> > the descriptor words.
> >
> > Since the entire descriptor is not read atomically, on relaxed memory
> > ordered systems like Aarch64, read of the word containing DD field
> > could be reordered after read of other words.
> >
> > Read barrier is inserted between read of the word with DD field and
> > read of other words. The barrier ensures that the fetched data is
> > correct.
> >
> > Testpmd single core test showed no performance drop on x86 or N1SDP.
> > On ThunderX2, 22% performance regression was observed.
> >
> 
> Is 22% performance drop value correct? That is a big drop, is it acceptable?
Agreed, it is a big drop. Fixing it would require using the barrier less frequently, for example by reading 4 descriptors (the 4 words containing the DD bits) before issuing the barrier.
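
A rough sketch of what such batching could look like is given here. It is not
code from this patch or from the driver: the descriptor layout, the batch size
constant, the DD bit position, and all demo_* names are assumptions for
illustration, and C11 atomics stand in for the DPDK fence API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_BATCH  4
#define DEMO_DD_BIT UINT64_C(1)   /* assumed position of the DD flag */

/* Hypothetical descriptor; the real layout belongs to the driver. */
struct demo_rx_desc {
	uint64_t qword0;           /* payload word */
	_Atomic uint64_t qword1;   /* status word holding the DD flag */
};

/*
 * Scan DEMO_BATCH descriptors with a single acquire fence.
 * Copies qword0 of each completed descriptor into payload[] and
 * returns how many leading descriptors were complete.
 */
static inline unsigned int
demo_recv_batch(struct demo_rx_desc *ring, uint64_t payload[DEMO_BATCH])
{
	uint64_t status[DEMO_BATCH];
	unsigned int i, done = 0;

	assert(ring != NULL && payload != NULL);

	/* Load all status words first; these loads are not ordered
	 * against each other. */
	for (i = 0; i < DEMO_BATCH; i++)
		status[i] = atomic_load_explicit(&ring[i].qword1,
						 memory_order_relaxed);

	/* One fence for the whole batch instead of one per descriptor. */
	atomic_thread_fence(memory_order_acquire);

	/* Stop at the first descriptor without DD so that packets are
	 * still consumed in ring order. */
	for (i = 0; i < DEMO_BATCH; i++) {
		if (!(status[i] & DEMO_DD_BIT))
			break;
		payload[done++] = ring[i].qword0;
	}
	return done;
}
```

The trade-off is that up to four status words are read before any payload is
touched, so the cost of the fence is spread over the batch.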

> 
> Is this performance drop valid for all Arm scalar datapath, or is it specific to
> ThunderX2?
This is specific to ThunderX2. The N1 CPU does not see any impact. A72 has not been tested. Considering that the ThunderXx line of CPUs is no longer in development, and that this is the scalar path, I would not suggest making further changes to the code.

It would be good to test this on Kunpeng servers and get some feedback.

> 
> > Fixes: 7b0cf70135d1 ("net/i40e: support ARM platform")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
  
Ferruh Yigit Oct. 11, 2021, 4:26 p.m. UTC | #3
On 9/29/2021 4:29 PM, Honnappa Nagarahalli wrote:
> <snip>
>>
>> On 9/15/2021 9:33 AM, Ruifeng Wang wrote:
>>> Rx descriptor is 16B/32B in size. If the DD bit is set, it indicates
>>> that the rest of the descriptor words have valid values. Hence, the
>>> word containing DD bit must be read first before reading the rest of
>>> the descriptor words.
>>>
>>> Since the entire descriptor is not read atomically, on relaxed memory
>>> ordered systems like Aarch64, read of the word containing DD field
>>> could be reordered after read of other words.
>>>
>>> Read barrier is inserted between read of the word with DD field and
>>> read of other words. The barrier ensures that the fetched data is
>>> correct.
>>>
>>> Testpmd single core test showed no performance drop on x86 or N1SDP.
>>> On ThunderX2, 22% performance regression was observed.
>>>
>>
>> Is 22% performance drop value correct? That is a big drop, is it acceptable?
> Agree, it is a big drop. Fixing it will require using the barrier less frequently. For ex: read 4 descriptors (4 words containing the DD bits) before using the barrier.
> 
>>
>> Is this performance drop valid for all Arm scalar datapath, or is it specific to
>> ThunderX2?
> This is specific to ThunderX2. N1 CPU does not see any impact. A72 is not tested. Considering that the ThunderXx line of CPUs are not in further development, and it is scalar path, I would not suggest to make further changes to the code.
> 
> It would be good to test this on Kunpeng servers and get some feedback.

Hi Connor, Yisen, Lijun,

Can you please check this patch? I don't know if you are using an i40e NIC
on your platform, but if you are, can you please test it?

Overall, this patch causes a big performance drop on Arm for i40e; I just
want to be sure it is not impacting any users negatively.

> 
>>
>>> Fixes: 7b0cf70135d1 ("net/i40e: support ARM platform")
>>> Cc: stable@dpdk.org
>>>
>>> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
>>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
>
  
Qi Zhang Oct. 19, 2021, 11:14 a.m. UTC | #4
> -----Original Message-----
> From: Yigit, Ferruh <ferruh.yigit@intel.com>
> Sent: Tuesday, October 12, 2021 12:27 AM
> To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; dev@dpdk.org; Min Hu (Connor)
> <humin29@huawei.com>; Yisen Zhuang <yisen.zhuang@huawei.com>; Lijun
> Ou <oulijun@huawei.com>
> Cc: Xing, Beilei <beilei.xing@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>;
> Richardson, Bruce <bruce.richardson@intel.com>; jerinj@marvell.com;
> hemant.agrawal@nxp.com; drc@linux.vnet.ibm.com; stable@dpdk.org; nd
> <nd@arm.com>; humin29@huawei.com
> Subject: Re: [dpdk-stable] [PATCH v2 2/2] net/i40e: fix risk in Rx descriptor
> read in scalar path
> 
> On 9/29/2021 4:29 PM, Honnappa Nagarahalli wrote:
> > <snip>
> >>
> >> On 9/15/2021 9:33 AM, Ruifeng Wang wrote:
> >>> Rx descriptor is 16B/32B in size. If the DD bit is set, it indicates
> >>> that the rest of the descriptor words have valid values. Hence, the
> >>> word containing DD bit must be read first before reading the rest of
> >>> the descriptor words.
> >>>
> >>> Since the entire descriptor is not read atomically, on relaxed
> >>> memory ordered systems like Aarch64, read of the word containing DD
> >>> field could be reordered after read of other words.
> >>>
> >>> Read barrier is inserted between read of the word with DD field and
> >>> read of other words. The barrier ensures that the fetched data is
> >>> correct.
> >>>
> >>> Testpmd single core test showed no performance drop on x86 or N1SDP.
> >>> On ThunderX2, 22% performance regression was observed.
> >>>
> >>
> >> Is 22% performance drop value correct? That is a big drop, is it acceptable?
> > Agree, it is a big drop. Fixing it will require using the barrier less frequently.
> For ex: read 4 descriptors (4 words containing the DD bits) before using the
> barrier.
> >
> >>
> >> Is this performance drop valid for all Arm scalar datapath, or is it
> >> specific to ThunderX2?
> > This is specific to ThunderX2. N1 CPU does not see any impact. A72 is not
> tested. Considering that the ThunderXx line of CPUs are not in further
> development, and it is scalar path, I would not suggest to make further
> changes to the code.
> >
> > It would be good to test this on Kunpeng servers and get some feedback.
> 
> Hi Connor, Yisen, Lijun,
> 
> Can you please check this patch? I don't know if you are using i40e nic on your
> platform but if you do can you please test it?
> 
> Overall this patch cause a big performance drop on Arm for i40e, I just want to
> be sure this is not impacting any user negatively.

Folks:
	This patch has been dropped from dpdk-next-net-intel, as we are still waiting for your confirmation.
	Btw, patch 1/2 is still in dpdk-next-net-intel.
Thanks
Qi

> 
> >
> >>
> >>> Fixes: 7b0cf70135d1 ("net/i40e: support ARM platform")
> >>> Cc: stable@dpdk.org
> >>>
> >>> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >
  
Ruifeng Wang Nov. 5, 2021, 6:57 a.m. UTC | #5
> -----Original Message-----
> From: Zhang, Qi Z <qi.z.zhang@intel.com>
> Sent: Tuesday, October 19, 2021 7:15 PM
> To: Yigit, Ferruh <ferruh.yigit@intel.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; dev@dpdk.org; Min Hu (Connor)
> <humin29@huawei.com>; Yisen Zhuang <yisen.zhuang@huawei.com>; Lijun
> Ou <oulijun@huawei.com>
> Cc: Xing, Beilei <beilei.xing@intel.com>; Richardson, Bruce
> <bruce.richardson@intel.com>; jerinj@marvell.com;
> hemant.agrawal@nxp.com; drc@linux.vnet.ibm.com; stable@dpdk.org; nd
> <nd@arm.com>; humin29@huawei.com
> Subject: RE: [dpdk-stable] [PATCH v2 2/2] net/i40e: fix risk in Rx descriptor
> read in scalar path
> 
> 
> 
> > -----Original Message-----
> > From: Yigit, Ferruh <ferruh.yigit@intel.com>
> > Sent: Tuesday, October 12, 2021 12:27 AM
> > To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng
> Wang
> > <Ruifeng.Wang@arm.com>; dev@dpdk.org; Min Hu (Connor)
> > <humin29@huawei.com>; Yisen Zhuang <yisen.zhuang@huawei.com>;
> Lijun Ou
> > <oulijun@huawei.com>
> > Cc: Xing, Beilei <beilei.xing@intel.com>; Zhang, Qi Z
> > <qi.z.zhang@intel.com>; Richardson, Bruce
> > <bruce.richardson@intel.com>; jerinj@marvell.com;
> > hemant.agrawal@nxp.com; drc@linux.vnet.ibm.com; stable@dpdk.org; nd
> > <nd@arm.com>; humin29@huawei.com
> > Subject: Re: [dpdk-stable] [PATCH v2 2/2] net/i40e: fix risk in Rx
> > descriptor read in scalar path
> >
> > On 9/29/2021 4:29 PM, Honnappa Nagarahalli wrote:
> > > <snip>
> > >>
> > >> On 9/15/2021 9:33 AM, Ruifeng Wang wrote:
> > >>> Rx descriptor is 16B/32B in size. If the DD bit is set, it
> > >>> indicates that the rest of the descriptor words have valid values.
> > >>> Hence, the word containing DD bit must be read first before
> > >>> reading the rest of the descriptor words.
> > >>>
> > >>> Since the entire descriptor is not read atomically, on relaxed
> > >>> memory ordered systems like Aarch64, read of the word containing
> > >>> DD field could be reordered after read of other words.
> > >>>
> > >>> Read barrier is inserted between read of the word with DD field
> > >>> and read of other words. The barrier ensures that the fetched data
> > >>> is correct.
> > >>>
> > >>> Testpmd single core test showed no performance drop on x86 or
> N1SDP.
> > >>> On ThunderX2, 22% performance regression was observed.
> > >>>
> > >>
> > >> Is 22% performance drop value correct? That is a big drop, is it
> acceptable?
> > > Agree, it is a big drop. Fixing it will require using the barrier less frequently.
> > For ex: read 4 descriptors (4 words containing the DD bits) before
> > using the barrier.
> > >
> > >>
> > >> Is this performance drop valid for all Arm scalar datapath, or is
> > >> it specific to ThunderX2?
> > > This is specific to ThunderX2. N1 CPU does not see any impact. A72
> > > is not
> > tested. Considering that the ThunderXx line of CPUs are not in further
> > development, and it is scalar path, I would not suggest to make
> > further changes to the code.
> > >
> > > It would be good to test this on Kunpeng servers and get some feedback.
> >
> > Hi Connor, Yisen, Lijun,
> >
> > Can you please check this patch? I don't know if you are using i40e
> > nic on your platform but if you do can you please test it?
> >
> > Overall this patch cause a big performance drop on Arm for i40e, I
> > just want to be sure this is not impacting any user negatively.
> 
> Folks:
> 	This patch has been dropped from dpdk-next-net-intel, as still
> waiting for your confirm.
> 	Btw Patch 1/2 was still in dpdk-next-net-intel.
> Thanks
> Qi
> 
Hi Qi, Ferruh,

Do you have any suggestions on how to progress this patch?
It fixes a possible hardware access violation from an architecture point of view.
A negative performance impact may occur because barriers are added.
I don't think we have received any objections so far.

Thanks,
Ruifeng
> >
> > >
> > >>
> > >>> Fixes: 7b0cf70135d1 ("net/i40e: support ARM platform")
> > >>> Cc: stable@dpdk.org
> > >>>
> > >>> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > >>> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > >
  
Ruifeng Wang Nov. 11, 2021, 10:27 a.m. UTC | #6
Hi Ferruh,

> > > -----Original Message-----
> > > From: Yigit, Ferruh <ferruh.yigit@intel.com>
> > > Sent: Tuesday, October 12, 2021 12:27 AM
> > > To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng
> > Wang
> > > <Ruifeng.Wang@arm.com>; dev@dpdk.org; Min Hu (Connor)
> > > <humin29@huawei.com>; Yisen Zhuang <yisen.zhuang@huawei.com>;
> > Lijun Ou
> > > <oulijun@huawei.com>
> > > Cc: Xing, Beilei <beilei.xing@intel.com>; Zhang, Qi Z
> > > <qi.z.zhang@intel.com>; Richardson, Bruce
> > > <bruce.richardson@intel.com>; jerinj@marvell.com;
> > > hemant.agrawal@nxp.com; drc@linux.vnet.ibm.com; stable@dpdk.org;
> nd
> > > <nd@arm.com>; humin29@huawei.com
> > > Subject: Re: [dpdk-stable] [PATCH v2 2/2] net/i40e: fix risk in Rx
> > > descriptor read in scalar path
> > >
> > > On 9/29/2021 4:29 PM, Honnappa Nagarahalli wrote:
> > > > <snip>
> > > >>
> > > >> On 9/15/2021 9:33 AM, Ruifeng Wang wrote:
> > > >>> Rx descriptor is 16B/32B in size. If the DD bit is set, it
> > > >>> indicates that the rest of the descriptor words have valid values.
> > > >>> Hence, the word containing DD bit must be read first before
> > > >>> reading the rest of the descriptor words.
> > > >>>
> > > >>> Since the entire descriptor is not read atomically, on relaxed
> > > >>> memory ordered systems like Aarch64, read of the word containing
> > > >>> DD field could be reordered after read of other words.
> > > >>>
> > > >>> Read barrier is inserted between read of the word with DD field
> > > >>> and read of other words. The barrier ensures that the fetched
> > > >>> data is correct.
> > > >>>
> > > >>> Testpmd single core test showed no performance drop on x86 or
> > N1SDP.
> > > >>> On ThunderX2, 22% performance regression was observed.
> > > >>>
> > > >>
> > > >> Is 22% performance drop value correct? That is a big drop, is it
> > acceptable?
> > > > Agree, it is a big drop. Fixing it will require using the barrier less
> frequently.
> > > For ex: read 4 descriptors (4 words containing the DD bits) before
> > > using the barrier.
> > > >
> > > >>
> > > >> Is this performance drop valid for all Arm scalar datapath, or is
> > > >> it specific to ThunderX2?
> > > > This is specific to ThunderX2. N1 CPU does not see any impact. A72
> > > > is not
> > > tested. Considering that the ThunderXx line of CPUs are not in
> > > further development, and it is scalar path, I would not suggest to
> > > make further changes to the code.
> > > >
> > > > It would be good to test this on Kunpeng servers and get some
> feedback.
> > >
> > > Hi Connor, Yisen, Lijun,
> > >
> > > Can you please check this patch? I don't know if you are using i40e
> > > nic on your platform but if you do can you please test it?
> > >
> > > Overall this patch cause a big performance drop on Arm for i40e, I
> > > just want to be sure this is not impacting any user negatively.
I cannot speak for vendors, but my test on a Huawei aarch64 server showed no performance drop.
The NIC in use was an XXV710.
Just FYI.

Thanks,
Ruifeng
> >
> > Folks:
> > 	This patch has been dropped from dpdk-next-net-intel, as still
> > waiting for your confirm.
> > 	Btw Patch 1/2 was still in dpdk-next-net-intel.
> > Thanks
> > Qi
> >
> Hi Qi, Ferruh,
> 
> Do you have any suggestion on how to progress this patch?
> It is fixing possible violation of hardware access from architecture point of
> view.
> Negative performance impact may happen because barriers are added.
> I don't think we received objections until now.
> 
> Thanks,
> Ruifeng
> > >
> > > >
> > > >>
> > > >>> Fixes: 7b0cf70135d1 ("net/i40e: support ARM platform")
> > > >>> Cc: stable@dpdk.org
> > > >>>
> > > >>> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > >>> Reviewed-by: Honnappa Nagarahalli
> <honnappa.nagarahalli@arm.com>
> > > >
  
Qi Zhang Nov. 11, 2021, 12:27 p.m. UTC | #7
> -----Original Message-----
> From: Ruifeng Wang <ruifeng.wang@arm.com>
> Sent: Wednesday, September 15, 2021 4:34 PM
> To: dev@dpdk.org
> Cc: Xing, Beilei <beilei.xing@intel.com>; Zhang, Qi Z <qi.z.zhang@intel.com>;
> Richardson, Bruce <bruce.richardson@intel.com>; jerinj@marvell.com;
> hemant.agrawal@nxp.com; drc@linux.vnet.ibm.com;
> honnappa.nagarahalli@arm.com; stable@dpdk.org; nd@arm.com; Ruifeng
> Wang <ruifeng.wang@arm.com>
> Subject: [PATCH v2 2/2] net/i40e: fix risk in Rx descriptor read in scalar path
> 
> Rx descriptor is 16B/32B in size. If the DD bit is set, it indicates that the rest of
> the descriptor words have valid values. Hence, the word containing DD bit
> must be read first before reading the rest of the descriptor words.
> 
> Since the entire descriptor is not read atomically, on relaxed memory ordered
> systems like Aarch64, read of the word containing DD field could be reordered
> after read of other words.
> 
> Read barrier is inserted between read of the word with DD field and read of
> other words. The barrier ensures that the fetched data is correct.
> 
> Testpmd single core test showed no performance drop on x86 or N1SDP.
> On ThunderX2, 22% performance regression was observed.
> 
> Fixes: 7b0cf70135d1 ("net/i40e: support ARM platform")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
>  drivers/net/i40e/i40e_rxtx.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
> index 8329cbdd4e..c4cd6b6b60 100644
> --- a/drivers/net/i40e/i40e_rxtx.c
> +++ b/drivers/net/i40e/i40e_rxtx.c
> @@ -746,6 +746,12 @@ i40e_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
>  			break;
>  		}
> 
> +		/**
> +		 * Use acquire fence to ensure that qword1 which includes DD
> +		 * bit is loaded before loading of other descriptor words.
> +		 */
> +		rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
> +
>  		rxd = *rxdp;
>  		nb_hold++;
>  		rxe = &sw_ring[rx_id];
> @@ -862,6 +868,12 @@ i40e_recv_scattered_pkts(void *rx_queue,
>  			break;
>  		}
> 
> +		/**
> +		 * Use acquire fence to ensure that qword1 which includes DD
> +		 * bit is loaded before loading of other descriptor words.
> +		 */
> +		rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
> +
>  		rxd = *rxdp;
>  		nb_hold++;
>  		rxe = &sw_ring[rx_id];
> --
> 2.25.1

Applied to dpdk-next-net-intel.

Thanks
Qi
  

Patch

diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 8329cbdd4e..c4cd6b6b60 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -746,6 +746,12 @@  i40e_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 			break;
 		}
 
+		/**
+		 * Use acquire fence to ensure that qword1 which includes DD
+		 * bit is loaded before loading of other descriptor words.
+		 */
+		rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
+
 		rxd = *rxdp;
 		nb_hold++;
 		rxe = &sw_ring[rx_id];
@@ -862,6 +868,12 @@  i40e_recv_scattered_pkts(void *rx_queue,
 			break;
 		}
 
+		/**
+		 * Use acquire fence to ensure that qword1 which includes DD
+		 * bit is loaded before loading of other descriptor words.
+		 */
+		rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
+
 		rxd = *rxdp;
 		nb_hold++;
 		rxe = &sw_ring[rx_id];