net/mlx5: improve vMPRQ descriptors allocation locality

Message ID 20201106000442.26059-1-akozyrev@nvidia.com (mailing list archive)
State Superseded, archived
Delegated to: Raslan Darawsheh
Series net/mlx5: improve vMPRQ descriptors allocation locality

Checks

Context                       Check    Description
ci/checkpatch                 success  coding style OK
ci/Intel-compilation          success  Compilation OK
ci/travis-robot               success  Travis build: passed
ci/iol-broadcom-Functional    success  Functional Testing PASS
ci/iol-broadcom-Performance   success  Performance Testing PASS
ci/iol-testing                success  Testing PASS
ci/iol-intel-Functional       success  Functional Testing PASS
ci/iol-intel-Performance      success  Performance Testing PASS
ci/iol-mellanox-Performance   success  Performance Testing PASS

Commit Message

Alexander Kozyrev Nov. 6, 2020, 12:04 a.m. UTC
  There is a performance penalty for the replenish scheme
used in the vectorized Rx burst for both MPRQ and SPRQ.
Mbuf elements are filled at the end of the mbufs array
while new ones are replenished at the beginning. That leads
to an increase in cache misses and a performance drop.
The more Rx descriptors are used, the worse the situation gets.

Change the allocation scheme for the vectorized MPRQ Rx burst:
allocate new mbufs only when the consumed mbufs are almost
depleted (always keep a gap of one burst between the allocated
and consumed indices). Keeping only a small number of mbufs
allocated improves cache locality and significantly improves
performance.
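
The new replenish condition can be illustrated with a simplified,
standalone sketch. The field and macro names below mirror the mlx5 Rx
queue fields touched by this patch, but struct rxq_sketch and its
comments are illustrative stand-ins, not the actual driver definitions:

#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for the mlx5 Rx queue fields used below. */
struct rxq_sketch {
	uint32_t elts_ci;        /* mbuf allocation index */
	uint32_t rq_pi;          /* mbuf consumption index */
	uint32_t rq_repl_thresh; /* replenish threshold */
};

static bool
mprq_need_replenish(const struct rxq_sketch *rxq)
{
	/* Mbufs already allocated but not yet consumed by the Rx burst. */
	uint32_t allocated = rxq->elts_ci - rxq->rq_pi;

	/*
	 * Old scheme: replenish whenever the free space in the elts array
	 * was large enough, keeping the array almost full at all times.
	 * New scheme: replenish only when the allocated mbufs are almost
	 * depleted, so the working set of mbufs stays small and cache-hot.
	 */
	return allocated <= rxq->rq_repl_thresh;
}

Because the gap between the allocated and consumed indices stays around
one Rx burst, the part of the elts array being touched remains small,
which is where the cache locality gain comes from.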

Unfortunately, this approach cannot be applied to the SPRQ Rx
burst routine. In the MPRQ Rx burst we simply copy packets from
external MPRQ buffers or attach these buffers to mbufs.
In the SPRQ Rx burst we let the NIC fill the mbufs for us.
Hence keeping only a small number of allocated mbufs would limit
the NIC's ability to fill as many buffers as possible. This
offsets the advantage of better cache locality.

Fixes: f2fa5327ff ("net/mlx5: implement vectorized MPRQ burst")

Signed-off-by: Alexander Kozyrev <akozyrev@nvidia.com>
---
 drivers/net/mlx5/mlx5_rxtx_vec.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
  

Patch

diff --git a/drivers/net/mlx5/mlx5_rxtx_vec.c b/drivers/net/mlx5/mlx5_rxtx_vec.c
index 469ea8401d..8001ab6eb3 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec.c
+++ b/drivers/net/mlx5/mlx5_rxtx_vec.c
@@ -145,16 +145,16 @@  mlx5_rx_mprq_replenish_bulk_mbuf(struct mlx5_rxq_data *rxq)
 	const uint32_t strd_n = 1 << rxq->strd_num_n;
 	const uint32_t elts_n = wqe_n * strd_n;
 	const uint32_t wqe_mask = elts_n - 1;
-	uint32_t n = elts_n - (rxq->elts_ci - rxq->rq_pi);
+	uint32_t n = rxq->elts_ci - rxq->rq_pi;
 	uint32_t elts_idx = rxq->elts_ci & wqe_mask;
 	struct rte_mbuf **elts = &(*rxq->elts)[elts_idx];
 
-	/* Not to cross queue end. */
-	if (n >= rxq->rq_repl_thresh) {
+	if (n <= rxq->rq_repl_thresh) {
 		MLX5_ASSERT(n >= MLX5_VPMD_RXQ_RPLNSH_THRESH(elts_n));
 		MLX5_ASSERT(MLX5_VPMD_RXQ_RPLNSH_THRESH(elts_n) >
 			     MLX5_VPMD_DESCS_PER_LOOP);
-		n = RTE_MIN(n, elts_n - elts_idx);
+		/* Not to cross queue end. */
+		n = RTE_MIN(n + MLX5_VPMD_RX_MAX_BURST, elts_n - elts_idx);
 		if (rte_mempool_get_bulk(rxq->mp, (void *)elts, n) < 0) {
 			rxq->stats.rx_nombuf += n;
 			return;