[2/2] net/mlx5: reduce unnecessary memory access

Message ID 20210601083055.97261-3-ruifeng.wang@arm.com (mailing list archive)
State Superseded, archived
Delegated to: Raslan Darawsheh
Series MLX5 PMD tuning

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-intel-Functional fail Functional Testing issues
ci/github-robot success github build: passed
ci/iol-abi-testing success Testing PASS
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS
ci/iol-testing success Testing PASS

Commit Message

Ruifeng Wang June 1, 2021, 8:30 a.m. UTC
  MR btree len is a constant during Rx replenish.
Moved the retrieval of the value out of the loop to reduce data loads.
A slight performance uplift was measured on N1SDP.

Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 drivers/net/mlx5/mlx5_rxtx_vec.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
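
For reference, a minimal stand-alone C sketch of the pattern the patch applies: the helper's return value is loop-invariant during replenish, so it can be read once before the loop instead of on every iteration. The structure and function names below are illustrative stand-ins, not the actual mlx5 definitions. In the real driver the loop stores through volatile pointers and calls other helpers, which generally keeps the compiler from proving the value unchanged, hence the explicit hoist.

#include <stdint.h>
#include <stddef.h>

/* Illustrative stand-ins, not the mlx5 definitions. */
struct mr_btree {
	uint16_t len;
};

static inline uint16_t
btree_len(const struct mr_btree *bt)
{
	return bt->len;
}

/* Before: the btree length is read from memory on every iteration. */
void
fill_lkeys_per_iteration(const struct mr_btree *bt, uint32_t *lkey, size_t n)
{
	size_t i;

	for (i = 0; i < n; ++i)
		if (btree_len(bt) > 1)
			lkey[i] = 0; /* placeholder for the LKey lookup */
}

/* After: the loop-invariant length is read once, outside the loop. */
void
fill_lkeys_hoisted(const struct mr_btree *bt, uint32_t *lkey, size_t n)
{
	uint16_t len = btree_len(bt);
	size_t i;

	for (i = 0; i < n; ++i)
		if (len > 1)
			lkey[i] = 0; /* placeholder for the LKey lookup */
}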
  

Comments

Slava Ovsiienko July 2, 2021, 7:05 a.m. UTC | #1
Hi, Ruifeng

Could we go further and implement the loop inside the conditional?
Like this:
if (mlx5_mr_btree_len(&rxq->mr_ctrl.cache_bh) > 1) {
	for (i = 0; i < n; ++i) {
		void *buf_addr = elts[i]->buf_addr;

		wq[i].addr = rte_cpu_to_be_64((uintptr_t)buf_addr +
					      RTE_PKTMBUF_HEADROOM);
		wq[i].lkey = mlx5_rx_mb2mr(rxq, elts[i]);
	}
} else {
	for (i = 0; i < n; ++i) {
		void *buf_addr = elts[i]->buf_addr;

		wq[i].addr = rte_cpu_to_be_64((uintptr_t)buf_addr +
					      RTE_PKTMBUF_HEADROOM);
	}
}
What do you think?
Also, we should check that the performance on other archs is not affected.

With best regards,
Slava

> -----Original Message-----
> From: Ruifeng Wang <ruifeng.wang@arm.com>
> Sent: Tuesday, June 1, 2021 11:31
> To: Raslan Darawsheh <rasland@nvidia.com>; Matan Azrad
> <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>; Slava Ovsiienko
> <viacheslavo@nvidia.com>
> Cc: dev@dpdk.org; jerinj@marvell.com; nd@arm.com;
> honnappa.nagarahalli@arm.com; Ruifeng Wang <ruifeng.wang@arm.com>
> Subject: [PATCH 2/2] net/mlx5: reduce unnecessary memory access
> 
> MR btree len is a constant during Rx replenish.
> Moved retrieve of the value out of loop to reduce data loads.
> Slight performance uplift was measured on N1SDP.
> 
> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---
>  drivers/net/mlx5/mlx5_rxtx_vec.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.c
> b/drivers/net/mlx5/mlx5_rxtx_vec.c
> index d5af2d91ff..fc7e2a7f41 100644
> --- a/drivers/net/mlx5/mlx5_rxtx_vec.c
> +++ b/drivers/net/mlx5/mlx5_rxtx_vec.c
> @@ -95,6 +95,7 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data
> *rxq)
>  	volatile struct mlx5_wqe_data_seg *wq =
>  		&((volatile struct mlx5_wqe_data_seg *)rxq->wqes)[elts_idx];
>  	unsigned int i;
> +	uint16_t btree_len;
> 
>  	if (n >= rxq->rq_repl_thresh) {
>  		MLX5_ASSERT(n >=
> MLX5_VPMD_RXQ_RPLNSH_THRESH(q_n));
> @@ -106,6 +107,8 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data
> *rxq)
>  			rxq->stats.rx_nombuf += n;
>  			return;
>  		}
> +
> +		btree_len = mlx5_mr_btree_len(&rxq->mr_ctrl.cache_bh);
>  		for (i = 0; i < n; ++i) {
>  			void *buf_addr;
> 
> @@ -119,8 +122,7 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data
> *rxq)
>  			wq[i].addr = rte_cpu_to_be_64((uintptr_t)buf_addr +
> 
> RTE_PKTMBUF_HEADROOM);
>  			/* If there's a single MR, no need to replace LKey. */
> -			if (unlikely(mlx5_mr_btree_len(&rxq-
> >mr_ctrl.cache_bh)
> -				     > 1))
> +			if (unlikely(btree_len > 1))
>  				wq[i].lkey = mlx5_rx_mb2mr(rxq, elts[i]);
>  		}
>  		rxq->rq_ci += n;
> --
> 2.25.1
  
Ruifeng Wang July 2, 2021, 7:28 a.m. UTC | #2
> -----Original Message-----
> From: Slava Ovsiienko <viacheslavo@nvidia.com>
> Sent: Friday, July 2, 2021 3:06 PM
> To: Ruifeng Wang <Ruifeng.Wang@arm.com>; Raslan Darawsheh
> <rasland@nvidia.com>; Matan Azrad <matan@nvidia.com>; Shahaf Shuler
> <shahafs@nvidia.com>
> Cc: dev@dpdk.org; jerinj@marvell.com; nd <nd@arm.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Subject: RE: [PATCH 2/2] net/mlx5: reduce unnecessary memory access
> 
> Hi, Ruifeng
> 
> Could we go further and implement loop inside the conditional?
> Like this:
> if (mlx5_mr_btree_len(&rxq->mr_ctrl.cache_bh) > 1) {
> 	for (i = 0; i < n; ++i) {
> 		void *buf_addr = elts[i]->buf_addr;
> 
> 		wq[i].addr = rte_cpu_to_be_64((uintptr_t)buf_addr +
> 					      RTE_PKTMBUF_HEADROOM);
> 		wq[i].lkey = mlx5_rx_mb2mr(rxq, elts[i]);
> 	}
> } else {
> 	for (i = 0; i < n; ++i) {
> 		void *buf_addr = elts[i]->buf_addr;
> 
> 		wq[i].addr = rte_cpu_to_be_64((uintptr_t)buf_addr +
> 					      RTE_PKTMBUF_HEADROOM);
> 	}
> }
> What do you think?
Agree. Putting the loop inside the conditional should be more efficient.

> Also,  we should check the performance on other archs is not affected.
I will also test on the x86 platform that I have.

> 
> With best regards,
> Slava
> 
> > -----Original Message-----
> > From: Ruifeng Wang <ruifeng.wang@arm.com>
> > Sent: Tuesday, June 1, 2021 11:31
> > To: Raslan Darawsheh <rasland@nvidia.com>; Matan Azrad
> > <matan@nvidia.com>; Shahaf Shuler <shahafs@nvidia.com>; Slava
> > Ovsiienko <viacheslavo@nvidia.com>
> > Cc: dev@dpdk.org; jerinj@marvell.com; nd@arm.com;
> > honnappa.nagarahalli@arm.com; Ruifeng Wang <ruifeng.wang@arm.com>
> > Subject: [PATCH 2/2] net/mlx5: reduce unnecessary memory access
> >
> > MR btree len is a constant during Rx replenish.
> > Moved retrieve of the value out of loop to reduce data loads.
> > Slight performance uplift was measured on N1SDP.
> >
> > Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > ---
> >  drivers/net/mlx5/mlx5_rxtx_vec.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.c
> > b/drivers/net/mlx5/mlx5_rxtx_vec.c
> > index d5af2d91ff..fc7e2a7f41 100644
> > --- a/drivers/net/mlx5/mlx5_rxtx_vec.c
> > +++ b/drivers/net/mlx5/mlx5_rxtx_vec.c
> > @@ -95,6 +95,7 @@ mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data
> > *rxq)
> >  	volatile struct mlx5_wqe_data_seg *wq =
> >  		&((volatile struct mlx5_wqe_data_seg *)rxq-
> >wqes)[elts_idx];
> >  	unsigned int i;
> > +	uint16_t btree_len;
> >
> >  	if (n >= rxq->rq_repl_thresh) {
> >  		MLX5_ASSERT(n >=
> > MLX5_VPMD_RXQ_RPLNSH_THRESH(q_n));
> > @@ -106,6 +107,8 @@ mlx5_rx_replenish_bulk_mbuf(struct
> mlx5_rxq_data
> > *rxq)
> >  			rxq->stats.rx_nombuf += n;
> >  			return;
> >  		}
> > +
> > +		btree_len = mlx5_mr_btree_len(&rxq->mr_ctrl.cache_bh);
> >  		for (i = 0; i < n; ++i) {
> >  			void *buf_addr;
> >
> > @@ -119,8 +122,7 @@ mlx5_rx_replenish_bulk_mbuf(struct
> mlx5_rxq_data
> > *rxq)
> >  			wq[i].addr = rte_cpu_to_be_64((uintptr_t)buf_addr
> +
> >
> > RTE_PKTMBUF_HEADROOM);
> >  			/* If there's a single MR, no need to replace LKey. */
> > -			if (unlikely(mlx5_mr_btree_len(&rxq-
> > >mr_ctrl.cache_bh)
> > -				     > 1))
> > +			if (unlikely(btree_len > 1))
> >  				wq[i].lkey = mlx5_rx_mb2mr(rxq, elts[i]);
> >  		}
> >  		rxq->rq_ci += n;
> > --
> > 2.25.1
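
For illustration only, a generic stand-alone C sketch of the loop-unswitching idea discussed above: the loop-invariant condition is tested once and one of two specialized loops runs, so the per-iteration branch disappears. This is not the actual follow-up mlx5 patch; the names below are hypothetical stand-ins.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical descriptor, standing in for the WQE data segment. */
struct seg {
	uint64_t addr;
	uint32_t lkey;
};

/* Branch inside the loop: the condition is evaluated n times. */
void
fill_branch_in_loop(struct seg *wq, const uint64_t *buf, size_t n, int multi_mr)
{
	size_t i;

	for (i = 0; i < n; ++i) {
		wq[i].addr = buf[i];
		if (multi_mr)
			wq[i].lkey = 1; /* placeholder for the LKey lookup */
	}
}

/* Unswitched: the condition is evaluated once, each loop body is branch-free. */
void
fill_unswitched(struct seg *wq, const uint64_t *buf, size_t n, int multi_mr)
{
	size_t i;

	if (multi_mr) {
		for (i = 0; i < n; ++i) {
			wq[i].addr = buf[i];
			wq[i].lkey = 1; /* placeholder for the LKey lookup */
		}
	} else {
		for (i = 0; i < n; ++i)
			wq[i].addr = buf[i];
	}
}

Applied to the driver, this corresponds to testing mlx5_mr_btree_len(&rxq->mr_ctrl.cache_bh) > 1 once and selecting between a loop that fills both addr and lkey and one that fills only addr, as in the snippet Slava posted above.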
  

Patch

diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.c b/drivers/net/mlx5/mlx5_rxtx_vec.c
index d5af2d91ff..fc7e2a7f41 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec.c
+++ b/drivers/net/mlx5/mlx5_rxtx_vec.c
@@ -95,6 +95,7 @@  mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data *rxq)
 	volatile struct mlx5_wqe_data_seg *wq =
 		&((volatile struct mlx5_wqe_data_seg *)rxq->wqes)[elts_idx];
 	unsigned int i;
+	uint16_t btree_len;
 
 	if (n >= rxq->rq_repl_thresh) {
 		MLX5_ASSERT(n >= MLX5_VPMD_RXQ_RPLNSH_THRESH(q_n));
@@ -106,6 +107,8 @@  mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data *rxq)
 			rxq->stats.rx_nombuf += n;
 			return;
 		}
+
+		btree_len = mlx5_mr_btree_len(&rxq->mr_ctrl.cache_bh);
 		for (i = 0; i < n; ++i) {
 			void *buf_addr;
 
@@ -119,8 +122,7 @@  mlx5_rx_replenish_bulk_mbuf(struct mlx5_rxq_data *rxq)
 			wq[i].addr = rte_cpu_to_be_64((uintptr_t)buf_addr +
 						      RTE_PKTMBUF_HEADROOM);
 			/* If there's a single MR, no need to replace LKey. */
-			if (unlikely(mlx5_mr_btree_len(&rxq->mr_ctrl.cache_bh)
-				     > 1))
+			if (unlikely(btree_len > 1))
 				wq[i].lkey = mlx5_rx_mb2mr(rxq, elts[i]);
 		}
 		rxq->rq_ci += n;