diff mbox series

net/mlx5: fix risk in Rx descriptor read in NEON vector path

Message ID 20220104030056.268974-1-ruifeng.wang@arm.com (mailing list archive)
State New
Delegated to: Raslan Darawsheh
Series net/mlx5: fix risk in Rx descriptor read in NEON vector path

Checks

Context Check Description
ci/intel-Testing success Testing PASS
ci/Intel-compilation success Compilation OK
ci/iol-aarch64-compile-testing success Testing PASS
ci/iol-x86_64-unit-testing success Testing PASS
ci/iol-x86_64-compile-testing success Testing PASS
ci/iol-aarch64-unit-testing success Testing PASS
ci/iol-abi-testing success Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS
ci/github-robot: build success github build: passed
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS
ci/checkpatch success coding style OK

Commit Message

Ruifeng Wang Jan. 4, 2022, 3 a.m. UTC
In the NEON vector PMD, a vector load loads two contiguous 8B words of
descriptor data into a vector register. Since the vector load does not
guarantee 16B atomicity, the read of the word that includes the op_own
field could be reordered after the reads of the other words. In that
case, some words could contain invalid data.

Reload qword0 after the read barrier to update the vector register. This
ensures that the fetched data is correct.

Testpmd single core test on N1SDP/ThunderX2 showed no performance drop.

Fixes: 1742c2d9fab0 ("net/mlx5: fix synchronization on polling Rx completions")
Cc: stable@dpdk.org

Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

Ruifeng Wang Feb. 10, 2022, 6:24 a.m. UTC | #1
Ping.
Could you please help to review this patch?

Thanks.
Ruifeng

> -----Original Message-----
> From: Ruifeng Wang <ruifeng.wang@arm.com>
> Sent: Tuesday, January 4, 2022 11:01 AM
> To: matan@nvidia.com; viacheslavo@nvidia.com
> Cc: dev@dpdk.org; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org; nd <nd@arm.com>;
> Ruifeng Wang <Ruifeng.Wang@arm.com>
> Subject: [PATCH] net/mlx5: fix risk in Rx descriptor read in NEON vector path
> 
> In NEON vector PMD, vector load loads two contiguous 8B of descriptor data
> into vector register. Given vector load ensures no 16B atomicity, read of the
> word that includes op_own field could be reordered after read of other
> words. In this case, some words could contain invalid data.
> 
> Reloaded qword0 after read barrier to update vector register. This ensures
> that the fetched data is correct.
> 
> Testpmd single core test on N1SDP/ThunderX2 showed no performance
> drop.
> 
> Fixes: 1742c2d9fab0 ("net/mlx5: fix synchronization on polling Rx
> completions")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---
>  drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> index b1d16baa61..b1ec615b51 100644
> --- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> +++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> @@ -647,6 +647,14 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq,
> volatile struct mlx5_cqe *cq,
>  		c0 = vld1q_u64((uint64_t *)(p0 + 48));
>  		/* Synchronize for loading the rest of blocks. */
>  		rte_io_rmb();
> +		/* B.0 (CQE 3) reload lower half of the block. */
> +		c3 = vld1q_lane_u64((uint64_t *)(p3 + 48), c3, 0);
> +		/* B.0 (CQE 2) reload lower half of the block. */
> +		c2 = vld1q_lane_u64((uint64_t *)(p2 + 48), c2, 0);
> +		/* B.0 (CQE 1) reload lower half of the block. */
> +		c1 = vld1q_lane_u64((uint64_t *)(p1 + 48), c1, 0);
> +		/* B.0 (CQE 0) reload lower half of the block. */
> +		c0 = vld1q_lane_u64((uint64_t *)(p0 + 48), c0, 0);
>  		/* Prefetch next 4 CQEs. */
>  		if (pkts_n - pos >= 2 * MLX5_VPMD_DESCS_PER_LOOP) {
>  			unsigned int next = pos +
> MLX5_VPMD_DESCS_PER_LOOP;
> --
> 2.25.1
Slava Ovsiienko Feb. 10, 2022, 8:16 a.m. UTC | #2
Hi Ruifeng,

The patch looks reasonable, thank you.
Just curious: did you see a real issue with reordering in this code fragment?
And please let us do a performance check.

With best regards,
Slava

> -----Original Message-----
> From: Ruifeng Wang <Ruifeng.Wang@arm.com>
> Sent: Thursday, February 10, 2022 8:25
> To: Matan Azrad <matan@nvidia.com>; Slava Ovsiienko
> <viacheslavo@nvidia.com>
> Cc: dev@dpdk.org; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org; nd <nd@arm.com>;
> Ruifeng Wang <Ruifeng.Wang@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH] net/mlx5: fix risk in Rx descriptor read in NEON vector
> path
> 
> Ping.
> Please could you help to review this patch?
> 
> Thanks.
> Ruifeng
> 
> > -----Original Message-----
> > From: Ruifeng Wang <ruifeng.wang@arm.com>
> > Sent: Tuesday, January 4, 2022 11:01 AM
> > To: matan@nvidia.com; viacheslavo@nvidia.com
> > Cc: dev@dpdk.org; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>;
> > stable@dpdk.org; nd <nd@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>
> > Subject: [PATCH] net/mlx5: fix risk in Rx descriptor read in NEON
> > vector path
> >
> > In NEON vector PMD, vector load loads two contiguous 8B of descriptor
> > data into vector register. Given vector load ensures no 16B atomicity,
> > read of the word that includes op_own field could be reordered after
> > read of other words. In this case, some words could contain invalid data.
> >
> > Reloaded qword0 after read barrier to update vector register. This
> > ensures that the fetched data is correct.
> >
> > Testpmd single core test on N1SDP/ThunderX2 showed no performance
> > drop.
> >
> > Fixes: 1742c2d9fab0 ("net/mlx5: fix synchronization on polling Rx
> > completions")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > ---
> >  drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > index b1d16baa61..b1ec615b51 100644
> > --- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > +++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > @@ -647,6 +647,14 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq,
> > volatile struct mlx5_cqe *cq,
> >  		c0 = vld1q_u64((uint64_t *)(p0 + 48));
> >  		/* Synchronize for loading the rest of blocks. */
> >  		rte_io_rmb();
> > +		/* B.0 (CQE 3) reload lower half of the block. */
> > +		c3 = vld1q_lane_u64((uint64_t *)(p3 + 48), c3, 0);
> > +		/* B.0 (CQE 2) reload lower half of the block. */
> > +		c2 = vld1q_lane_u64((uint64_t *)(p2 + 48), c2, 0);
> > +		/* B.0 (CQE 1) reload lower half of the block. */
> > +		c1 = vld1q_lane_u64((uint64_t *)(p1 + 48), c1, 0);
> > +		/* B.0 (CQE 0) reload lower half of the block. */
> > +		c0 = vld1q_lane_u64((uint64_t *)(p0 + 48), c0, 0);
> >  		/* Prefetch next 4 CQEs. */
> >  		if (pkts_n - pos >= 2 * MLX5_VPMD_DESCS_PER_LOOP) {
> >  			unsigned int next = pos +
> > MLX5_VPMD_DESCS_PER_LOOP;
> > --
> > 2.25.1
Ruifeng Wang Feb. 10, 2022, 8:29 a.m. UTC | #3
> -----Original Message-----
> From: Slava Ovsiienko <viacheslavo@nvidia.com>
> Sent: Thursday, February 10, 2022 4:17 PM
> To: Ruifeng Wang <Ruifeng.Wang@arm.com>; Matan Azrad
> <matan@nvidia.com>
> Cc: dev@dpdk.org; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org; nd <nd@arm.com>;
> nd <nd@arm.com>
> Subject: RE: [PATCH] net/mlx5: fix risk in Rx descriptor read in NEON vector
> path
> 
> Hi Ruifeng,

Hi Slava,
> 
> Patch looks reasonable, thank you.
> Just curious - did you see the real issue with re-ordering in this code
> fragment?

No real issue was seen. It is an analysis from the architecture perspective.
> And, please, let us do performance check.

Sure. Thank you.
> 
> With best regards,
> Slava
> 
> > -----Original Message-----
> > From: Ruifeng Wang <Ruifeng.Wang@arm.com>
> > Sent: Thursday, February 10, 2022 8:25
> > To: Matan Azrad <matan@nvidia.com>; Slava Ovsiienko
> > <viacheslavo@nvidia.com>
> > Cc: dev@dpdk.org; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>;
> > stable@dpdk.org; nd <nd@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>;
> > nd <nd@arm.com>
> > Subject: RE: [PATCH] net/mlx5: fix risk in Rx descriptor read in NEON
> > vector path
> >
> > Ping.
> > Please could you help to review this patch?
> >
> > Thanks.
> > Ruifeng
> >
> > > -----Original Message-----
> > > From: Ruifeng Wang <ruifeng.wang@arm.com>
> > > Sent: Tuesday, January 4, 2022 11:01 AM
> > > To: matan@nvidia.com; viacheslavo@nvidia.com
> > > Cc: dev@dpdk.org; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>;
> > > stable@dpdk.org; nd <nd@arm.com>; Ruifeng Wang
> > <Ruifeng.Wang@arm.com>
> > > Subject: [PATCH] net/mlx5: fix risk in Rx descriptor read in NEON
> > > vector path
> > >
> > > In NEON vector PMD, vector load loads two contiguous 8B of
> > > descriptor data into vector register. Given vector load ensures no
> > > 16B atomicity, read of the word that includes op_own field could be
> > > reordered after read of other words. In this case, some words could
> contain invalid data.
> > >
> > > Reloaded qword0 after read barrier to update vector register. This
> > > ensures that the fetched data is correct.
> > >
> > > Testpmd single core test on N1SDP/ThunderX2 showed no performance
> > > drop.
> > >
> > > Fixes: 1742c2d9fab0 ("net/mlx5: fix synchronization on polling Rx
> > > completions")
> > > Cc: stable@dpdk.org
> > >
> > > Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > ---
> > >  drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 8 ++++++++
> > >  1 file changed, 8 insertions(+)
> > >
> > > diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > > b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > > index b1d16baa61..b1ec615b51 100644
> > > --- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > > +++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
> > > @@ -647,6 +647,14 @@ rxq_cq_process_v(struct mlx5_rxq_data *rxq,
> > > volatile struct mlx5_cqe *cq,
> > >  		c0 = vld1q_u64((uint64_t *)(p0 + 48));
> > >  		/* Synchronize for loading the rest of blocks. */
> > >  		rte_io_rmb();
> > > +		/* B.0 (CQE 3) reload lower half of the block. */
> > > +		c3 = vld1q_lane_u64((uint64_t *)(p3 + 48), c3, 0);
> > > +		/* B.0 (CQE 2) reload lower half of the block. */
> > > +		c2 = vld1q_lane_u64((uint64_t *)(p2 + 48), c2, 0);
> > > +		/* B.0 (CQE 1) reload lower half of the block. */
> > > +		c1 = vld1q_lane_u64((uint64_t *)(p1 + 48), c1, 0);
> > > +		/* B.0 (CQE 0) reload lower half of the block. */
> > > +		c0 = vld1q_lane_u64((uint64_t *)(p0 + 48), c0, 0);
> > >  		/* Prefetch next 4 CQEs. */
> > >  		if (pkts_n - pos >= 2 * MLX5_VPMD_DESCS_PER_LOOP) {
> > >  			unsigned int next = pos +
> > > MLX5_VPMD_DESCS_PER_LOOP;
> > > --
> > > 2.25.1
Ali Alnubani May 19, 2022, 2:56 p.m. UTC | #4
> -----Original Message-----
> From: Ruifeng Wang <ruifeng.wang@arm.com>
> Sent: Tuesday, January 4, 2022 5:01 AM
> To: Matan Azrad <matan@nvidia.com>; Slava Ovsiienko
> <viacheslavo@nvidia.com>
> Cc: dev@dpdk.org; honnappa.nagarahalli@arm.com; stable@dpdk.org;
> nd@arm.com; Ruifeng Wang <ruifeng.wang@arm.com>
> Subject: [PATCH] net/mlx5: fix risk in Rx descriptor read in NEON vector path
> 
> In NEON vector PMD, vector load loads two contiguous 8B of
> descriptor data into vector register. Given vector load ensures no
> 16B atomicity, read of the word that includes op_own field could be
> reordered after read of other words. In this case, some words could
> contain invalid data.
> 
> Reloaded qword0 after read barrier to update vector register. This
> ensures that the fetched data is correct.
> 
> Testpmd single core test on N1SDP/ThunderX2 showed no performance
> drop.
> 
> Fixes: 1742c2d9fab0 ("net/mlx5: fix synchronization on polling Rx
> completions")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---

Tested with BlueField-2 and didn't see a performance impact.

Tested-by: Ali Alnubani <alialnu@nvidia.com>

Thanks,
Ali
Slava Ovsiienko June 20, 2022, 5:37 a.m. UTC | #5
Hi, Ruifeng

My apologies for the review delay.
As far as I understand, the hypothetical problem scenario is:
- CPU core reorders reading of qwords of the 16B vector
- core reads the second 8B of the CQE (old CQE values)
- CQE update
- core reads the first 8B of the CQE (new CQE values)

How can the re-reading of CQEs resolve the issue?
The same wrong scenario might happen on the second read,
and we would run into the same issue.

In my opinion, the right solution to cover the potential reordering would be:
- read CQE
- check CQE status (first 8B)
- read memory barrier
- read the rest of CQE

With best regards,
Slava

> -----Original Message-----
> From: Ali Alnubani <alialnu@nvidia.com>
> Sent: Thursday, May 19, 2022 17:56
> To: Ruifeng Wang <ruifeng.wang@arm.com>; Matan Azrad
> <matan@nvidia.com>; Slava Ovsiienko <viacheslavo@nvidia.com>
> Cc: dev@dpdk.org; honnappa.nagarahalli@arm.com; stable@dpdk.org;
> nd@arm.com
> Subject: RE: [PATCH] net/mlx5: fix risk in Rx descriptor read in NEON vector
> path
> 
> > -----Original Message-----
> > From: Ruifeng Wang <ruifeng.wang@arm.com>
> > Sent: Tuesday, January 4, 2022 5:01 AM
> > To: Matan Azrad <matan@nvidia.com>; Slava Ovsiienko
> > <viacheslavo@nvidia.com>
> > Cc: dev@dpdk.org; honnappa.nagarahalli@arm.com; stable@dpdk.org;
> > nd@arm.com; Ruifeng Wang <ruifeng.wang@arm.com>
> > Subject: [PATCH] net/mlx5: fix risk in Rx descriptor read in NEON
> > vector path
> >
> > In NEON vector PMD, vector load loads two contiguous 8B of descriptor
> > data into vector register. Given vector load ensures no 16B atomicity,
> > read of the word that includes op_own field could be reordered after
> > read of other words. In this case, some words could contain invalid
> > data.
> >
> > Reloaded qword0 after read barrier to update vector register. This
> > ensures that the fetched data is correct.
> >
> > Testpmd single core test on N1SDP/ThunderX2 showed no performance
> > drop.
> >
> > Fixes: 1742c2d9fab0 ("net/mlx5: fix synchronization on polling Rx
> > completions")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > ---
> 
> Tested with BlueField-2 and didn't see a performance impact.
> 
> Tested-by: Ali Alnubani <alialnu@nvidia.com>
> 
> Thanks,
> Ali
Ruifeng Wang June 27, 2022, 11:08 a.m. UTC | #6
> -----Original Message-----
> From: Slava Ovsiienko <viacheslavo@nvidia.com>
> Sent: Monday, June 20, 2022 1:38 PM
> To: Ali Alnubani <alialnu@nvidia.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Matan Azrad <matan@nvidia.com>
> Cc: dev@dpdk.org; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org; nd <nd@arm.com>
> Subject: RE: [PATCH] net/mlx5: fix risk in Rx descriptor read in NEON vector
> path
> 
> Hi, Ruifeng

Hi Slava,

Thanks for your review.
> 
> My apologies for review delay.

Apologies too. I was busy with something else.

> As far I understand the hypothetical problem scenario is:
> - CPU core reorders reading of qwords of 16B vector
> - core reads the second 8B of CQE (old CQE values)
> - CQE update
> - core reads the first 8B of CQE (new CQE values)

Yes, this is the problem.
> 
> How the re-reading of CQEs can resolve the issue?
> This wrong scenario might happen on the second read and we would run into
> the same issue.

Here we are trying to order the reading of a 16B vector (the high 8B with op_own, and the low 8B without op_own).
The first read loads the full 16B. The second read loads and updates the low 8B (no op_own).
There are 2 possible statuses indicated by op_own: valid and invalid.
If the CQE status is invalid, there is no problem; the CQE will be ignored this time.
If the CQE status is valid, the second read ensures the rest of the CQE is no older than the high 8B (with op_own).
Assuming the NIC updates op_own no earlier than the rest of the CQE, I think the second read ensures the retrieved CQE content is correct.

> 
> In my opinion, the right solution to cover potential reordering should be:
> - read CQE
> - check CQE status (first 8B)

We don't need to check CQE status at the moment. See explanation above.
> - read memory barrier
> - read the rest of CQE
> 
> With best regards,
> Slava
> 
> > -----Original Message-----
> > From: Ali Alnubani <alialnu@nvidia.com>
> > Sent: Thursday, May 19, 2022 17:56
> > To: Ruifeng Wang <ruifeng.wang@arm.com>; Matan Azrad
> > <matan@nvidia.com>; Slava Ovsiienko <viacheslavo@nvidia.com>
> > Cc: dev@dpdk.org; honnappa.nagarahalli@arm.com; stable@dpdk.org;
> > nd@arm.com
> > Subject: RE: [PATCH] net/mlx5: fix risk in Rx descriptor read in NEON
> > vector path
> >
> > > -----Original Message-----
> > > From: Ruifeng Wang <ruifeng.wang@arm.com>
> > > Sent: Tuesday, January 4, 2022 5:01 AM
> > > To: Matan Azrad <matan@nvidia.com>; Slava Ovsiienko
> > > <viacheslavo@nvidia.com>
> > > Cc: dev@dpdk.org; honnappa.nagarahalli@arm.com; stable@dpdk.org;
> > > nd@arm.com; Ruifeng Wang <ruifeng.wang@arm.com>
> > > Subject: [PATCH] net/mlx5: fix risk in Rx descriptor read in NEON
> > > vector path
> > >
> > > In NEON vector PMD, vector load loads two contiguous 8B of
> > > descriptor data into vector register. Given vector load ensures no
> > > 16B atomicity, read of the word that includes op_own field could be
> > > reordered after read of other words. In this case, some words could
> > > contain invalid data.
> > >
> > > Reloaded qword0 after read barrier to update vector register. This
> > > ensures that the fetched data is correct.
> > >
> > > Testpmd single core test on N1SDP/ThunderX2 showed no performance
> > > drop.
> > >
> > > Fixes: 1742c2d9fab0 ("net/mlx5: fix synchronization on polling Rx
> > > completions")
> > > Cc: stable@dpdk.org
> > >
> > > Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > ---
> >
> > Tested with BlueField-2 and didn't see a performance impact.
> >
> > Tested-by: Ali Alnubani <alialnu@nvidia.com>
> >
> > Thanks,
> > Ali
Slava Ovsiienko June 29, 2022, 7:55 a.m. UTC | #7
Hi, Ruifeng

> -----Original Message-----
> From: Ruifeng Wang <Ruifeng.Wang@arm.com>
> Sent: Monday, June 27, 2022 14:08
> To: Slava Ovsiienko <viacheslavo@nvidia.com>; Ali Alnubani
> <alialnu@nvidia.com>; Matan Azrad <matan@nvidia.com>
> Cc: dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> stable@dpdk.org; nd <nd@arm.com>; nd <nd@arm.com>
> Subject: RE: [PATCH] net/mlx5: fix risk in Rx descriptor read in NEON
> vector path
> 
> > -----Original Message-----
> > From: Slava Ovsiienko <viacheslavo@nvidia.com>
> > Sent: Monday, June 20, 2022 1:38 PM
> > To: Ali Alnubani <alialnu@nvidia.com>; Ruifeng Wang
> > <Ruifeng.Wang@arm.com>; Matan Azrad <matan@nvidia.com>
> > Cc: dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> > stable@dpdk.org; nd <nd@arm.com>
> > Subject: RE: [PATCH] net/mlx5: fix risk in Rx descriptor read in NEON
> > vector path
> >
> > Hi, Ruifeng
> 
> Hi Slava,
> 
> Thanks for your review.
> >
> > My apologies for review delay.
> 
> Apologies too. I was on something else.
> 
> > As far I understand the hypothetical problem scenario is:
> > - CPU core reorders reading of qwords of 16B vector
> > - core reads the second 8B of CQE (old CQE values)
> > - CQE update
> > - core reads the first 8B of CQE (new CQE values)
> 
> Yes, This is the problem.
> >
> > How the re-reading of CQEs can resolve the issue?
> > This wrong scenario might happen on the second read and we would run
> > into the same issue.
> 
> Here we are trying to ordering reading of a 16B vector (8B with op_own -
> high, and 8B without op_own - low).
> The first read will load 16B. The second read will load and update low
> 8B (no op_own).
OK, I got the point, thank you for the explanations.
Can we avoid the first reading of the low 8B (not containing the CQE owning field)?

I mean to update this part to read only the upper 8Bs:
                /* B.0 (CQE 3) load a block having op_own. */
                c3 = vld1q_u64((uint64_t *)(p3 + 48));
                /* B.0 (CQE 2) load a block having op_own. */
                c2 = vld1q_u64((uint64_t *)(p2 + 48));
                /* B.0 (CQE 1) load a block having op_own. */
                c1 = vld1q_u64((uint64_t *)(p1 + 48));
                /* B.0 (CQE 0) load a block having op_own. */
                c0 = vld1q_u64((uint64_t *)(p0 + 48));
                /* Synchronize for loading the rest of blocks. */
                rte_io_rmb();

Because the lower 8Bs will be overwritten by the second read (in your patch),
and the barrier ensures the correct order.


With best regards,
Slava
Ruifeng Wang June 29, 2022, 11:41 a.m. UTC | #8
> -----Original Message-----
> From: Slava Ovsiienko <viacheslavo@nvidia.com>
> Sent: Wednesday, June 29, 2022 3:55 PM
> To: Ruifeng Wang <Ruifeng.Wang@arm.com>; Ali Alnubani
> <alialnu@nvidia.com>; Matan Azrad <matan@nvidia.com>
> Cc: dev@dpdk.org; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org; nd <nd@arm.com>;
> nd <nd@arm.com>
> Subject: RE: [PATCH] net/mlx5: fix risk in Rx descriptor read in NEON vector
> path
> 
> Hi, Ruifeng
> 
> > -----Original Message-----
> > From: Ruifeng Wang <Ruifeng.Wang@arm.com>
> > Sent: Monday, June 27, 2022 14:08
> > To: Slava Ovsiienko <viacheslavo@nvidia.com>; Ali Alnubani
> > <alialnu@nvidia.com>; Matan Azrad <matan@nvidia.com>
> > Cc: dev@dpdk.org; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>;
> > stable@dpdk.org; nd <nd@arm.com>; nd <nd@arm.com>
> > Subject: RE: [PATCH] net/mlx5: fix risk in Rx descriptor read in NEON
> > vector path
> >
> > > -----Original Message-----
> > > From: Slava Ovsiienko <viacheslavo@nvidia.com>
> > > Sent: Monday, June 20, 2022 1:38 PM
> > > To: Ali Alnubani <alialnu@nvidia.com>; Ruifeng Wang
> > > <Ruifeng.Wang@arm.com>; Matan Azrad <matan@nvidia.com>
> > > Cc: dev@dpdk.org; Honnappa Nagarahalli
> > > <Honnappa.Nagarahalli@arm.com>; stable@dpdk.org; nd <nd@arm.com>
> > > Subject: RE: [PATCH] net/mlx5: fix risk in Rx descriptor read in
> > > NEON vector path
> > >
> > > Hi, Ruifeng
> >
> > Hi Slava,
> >
> > Thanks for your review.
> > >
> > > My apologies for review delay.
> >
> > Apologies too. I was on something else.
> >
> > > As far I understand the hypothetical problem scenario is:
> > > - CPU core reorders reading of qwords of 16B vector
> > > - core reads the second 8B of CQE (old CQE values)
> > > - CQE update
> > > - core reads the first 8B of CQE (new CQE values)
> >
> > Yes, This is the problem.
> > >
> > > How the re-reading of CQEs can resolve the issue?
> > > This wrong scenario might happen on the second read and we would run
> > > into the same issue.
> >
> > Here we are trying to ordering reading of a 16B vector (8B with op_own
> > - high, and 8B without op_own - low).
> > The first read will load 16B. The second read will load and update low
> > 8B (no op_own).
> OK, I got the point, thank you for the explanations.
> Can we avoid the first reading of low 8B (no containing CQE owning field)?
> 
> I mean to update this part to read only upper 8Bs:
>                 /* B.0 (CQE 3) load a block having op_own. */
>                 c3 = vld1q_u64((uint64_t *)(p3 + 48));
>                 /* B.0 (CQE 2) load a block having op_own. */
>                 c2 = vld1q_u64((uint64_t *)(p2 + 48));
>                 /* B.0 (CQE 1) load a block having op_own. */
>                 c1 = vld1q_u64((uint64_t *)(p1 + 48));
>                 /* B.0 (CQE 0) load a block having op_own. */
>                 c0 = vld1q_u64((uint64_t *)(p0 + 48));
>                 /* Synchronize for loading the rest of blocks. */
>                 rte_io_rmb();
> 
> Because lower 8Bs will be overlapped with the second read (in your patch)
> and barrier ensures the correct order.

Hi Slava,

Yes, your suggestion is valid.
Actually, I tried that approach: load the higher 8B + barrier + load the lower 8B + combine the two 8Bs into a vector.
It also has no observable performance impact, but it generates more instructions than the current patch (the 'combine' operation).
So I followed the current approach.

Thanks.
> 
> 
> With best regards,
> Slava

Patch

diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
index b1d16baa61..b1ec615b51 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
@@ -647,6 +647,14 @@  rxq_cq_process_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
 		c0 = vld1q_u64((uint64_t *)(p0 + 48));
 		/* Synchronize for loading the rest of blocks. */
 		rte_io_rmb();
+		/* B.0 (CQE 3) reload lower half of the block. */
+		c3 = vld1q_lane_u64((uint64_t *)(p3 + 48), c3, 0);
+		/* B.0 (CQE 2) reload lower half of the block. */
+		c2 = vld1q_lane_u64((uint64_t *)(p2 + 48), c2, 0);
+		/* B.0 (CQE 1) reload lower half of the block. */
+		c1 = vld1q_lane_u64((uint64_t *)(p1 + 48), c1, 0);
+		/* B.0 (CQE 0) reload lower half of the block. */
+		c0 = vld1q_lane_u64((uint64_t *)(p0 + 48), c0, 0);
 		/* Prefetch next 4 CQEs. */
 		if (pkts_n - pos >= 2 * MLX5_VPMD_DESCS_PER_LOOP) {
 			unsigned int next = pos + MLX5_VPMD_DESCS_PER_LOOP;