From patchwork Wed Jul 22 20:32:38 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Alexander Kozyrev
X-Patchwork-Id: 74625
X-Patchwork-Delegate: rasland@nvidia.com
Return-Path:
X-Original-To: patchwork@inbox.dpdk.org
Delivered-To: patchwork@inbox.dpdk.org
Received: from dpdk.org (dpdk.org [92.243.14.124])
 by inbox.dpdk.org (Postfix) with ESMTP id BF8BCA0526;
 Wed, 22 Jul 2020 22:32:42 +0200 (CEST)
Received: from [92.243.14.124] (localhost [127.0.0.1])
 by dpdk.org (Postfix) with ESMTP id 13B2E1BFBA;
 Wed, 22 Jul 2020 22:32:42 +0200 (CEST)
Received: from mellanox.co.il (mail-il-dmz.mellanox.com [193.47.165.129])
 by dpdk.org (Postfix) with ESMTP id DEEA22C6E
 for ; Wed, 22 Jul 2020 22:32:40 +0200 (CEST)
Received: from Internal Mail-Server by MTLPINE1 (envelope-from
 akozyrev@mellanox.com) with SMTP; 22 Jul 2020 23:32:39 +0300
Received: from pegasus02.mtr.labs.mlnx. (pegasus02.mtr.labs.mlnx
 [10.210.16.122])
 by labmailer.mlnx (8.13.8/8.13.8) with ESMTP id 06MKWdG8028254;
 Wed, 22 Jul 2020 23:32:39 +0300
From: Alexander Kozyrev
To: dev@dpdk.org
Cc: stable@dpdk.org, rasland@mellanox.com, viacheslavo@mellanox.com
Date: Wed, 22 Jul 2020 20:32:38 +0000
Message-Id: <20200722203238.14250-1-akozyrev@mellanox.com>
X-Mailer: git-send-email 2.24.1
MIME-Version: 1.0
Subject: [dpdk-dev] [PATCH] net/mlx5: fix vectorized mini-CQE prefetching
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: dev-bounces@dpdk.org
Sender: "dev"

An earlier optimization prefetched all the CQEs before their
invalidation. This sped up the mini-CQE decompression process by
preheating the cache in the vectorized Rx routine. Prefetching the
next mini-CQE, on the other hand, showed no performance difference
on the x86 platform, so it was removed. Unfortunately, this caused
a performance drop on ARM.

Prefetch the mini-CQE as well as all the soon-to-be-invalidated CQEs
to get both the CQE and the mini-CQE on the hot path.

Fixes: 28a4b9632 ("net/mlx5: prefetch CQEs for a faster decompression")
Cc: stable@dpdk.org

Signed-off-by: Alexander Kozyrev
Acked-by: Viacheslav Ovsiienko
---
 drivers/net/mlx5/mlx5_rxtx_vec_altivec.h | 3 ++-
 drivers/net/mlx5/mlx5_rxtx_vec_neon.h    | 3 +++
 drivers/net/mlx5/mlx5_rxtx_vec_sse.h     | 3 ++-
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h b/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
index f5414eebad..cb4ce1a099 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_altivec.h
@@ -158,7 +158,6 @@ rxq_cq_decompress_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
 		for (i = 0; i < MLX5_VPMD_DESCS_PER_LOOP; ++i)
 			if (likely(pos + i < mcqe_n))
 				rte_prefetch0((void *)(cq + pos + i));
-
 		/* A.1 load mCQEs into a 128bit register. */
 		mcqe1 = (vector unsigned char)vec_vsx_ld(0,
 			(signed int const *)&mcq[pos % 8]);
@@ -287,6 +286,8 @@ rxq_cq_decompress_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
 		pos += MLX5_VPMD_DESCS_PER_LOOP;
 		/* Move to next CQE and invalidate consumed CQEs. */
 		if (!(pos & 0x7) && pos < mcqe_n) {
+			if (pos + 8 < mcqe_n)
+				rte_prefetch0((void *)(cq + pos + 8));
 			mcq = (void *)&(cq + pos)->pkt_info;
 			for (i = 0; i < 8; ++i)
 				cq[inv++].op_own = MLX5_CQE_INVALIDATE;
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
index 555c342626..6c3149523e 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
@@ -145,6 +145,7 @@ rxq_cq_decompress_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
 				-1UL << ((mcqe_n - pos) * sizeof(uint16_t) * 8) : 0);
 #endif
+
 		for (i = 0; i < MLX5_VPMD_DESCS_PER_LOOP; ++i)
 			if (likely(pos + i < mcqe_n))
 				rte_prefetch0((void *)(cq + pos + i));
@@ -227,6 +228,8 @@ rxq_cq_decompress_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
 		pos += MLX5_VPMD_DESCS_PER_LOOP;
 		/* Move to next CQE and invalidate consumed CQEs. */
 		if (!(pos & 0x7) && pos < mcqe_n) {
+			if (pos + 8 < mcqe_n)
+				rte_prefetch0((void *)(cq + pos + 8));
 			mcq = (void *)&(cq + pos)->pkt_info;
 			for (i = 0; i < 8; ++i)
 				cq[inv++].op_own = MLX5_CQE_INVALIDATE;
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_sse.h b/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
index 34e3397115..554924d7fc 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_sse.h
@@ -135,7 +135,6 @@ rxq_cq_decompress_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
 		for (i = 0; i < MLX5_VPMD_DESCS_PER_LOOP; ++i)
 			if (likely(pos + i < mcqe_n))
 				rte_prefetch0((void *)(cq + pos + i));
-
 		/* A.1 load mCQEs into a 128bit register. */
 		mcqe1 = _mm_loadu_si128((__m128i *)&mcq[pos % 8]);
 		mcqe2 = _mm_loadu_si128((__m128i *)&mcq[pos % 8 + 2]);
@@ -214,6 +213,8 @@ rxq_cq_decompress_v(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cq,
 		pos += MLX5_VPMD_DESCS_PER_LOOP;
 		/* Move to next CQE and invalidate consumed CQEs. */
 		if (!(pos & 0x7) && pos < mcqe_n) {
+			if (pos + 8 < mcqe_n)
+				rte_prefetch0((void *)(cq + pos + 8));
 			mcq = (void *)(cq + pos);
 			for (i = 0; i < 8; ++i)
 				cq[inv++].op_own = MLX5_CQE_INVALIDATE;
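
For readers without the driver sources at hand, the control flow that the
hunks above restore can be sketched outside of DPDK roughly as follows.
This is only an illustration under stated assumptions: struct demo_cqe,
DEMO_INVALIDATE, demo_prefetch() and demo_decompress() are hypothetical
stand-ins rather than mlx5 definitions, the GCC/Clang builtin
__builtin_prefetch() takes the place of rte_prefetch0(), and the actual
mini-CQE processing is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 64-byte completion entry; NOT the real mlx5_cqe layout. */
struct demo_cqe {
	uint8_t payload[63];
	uint8_t op_own;
};

#define DEMO_INVALIDATE     0xf0u /* hypothetical "consumed" marker */
#define DEMO_DESCS_PER_LOOP 4u    /* entries handled per loop iteration */
#define DEMO_MCQE_PER_CQE   8u    /* mini-entries packed into one entry */

/* Assumed stand-in for rte_prefetch0(): a read prefetch with high locality. */
static inline void demo_prefetch(const void *p)
{
#if defined(__GNUC__) || defined(__clang__)
	__builtin_prefetch(p, 0, 3);
#else
	(void)p; /* no-op on compilers without the builtin */
#endif
}

/*
 * Walk mcqe_n entries in steps of DEMO_DESCS_PER_LOOP. Each iteration warms
 * the entries it covers; every DEMO_MCQE_PER_CQE entries it also warms the
 * entry eight slots ahead before invalidating the eight consumed ones.
 * Returns the number of entries invalidated inside the loop.
 */
static unsigned int demo_decompress(struct demo_cqe *cq, unsigned int mcqe_n)
{
	unsigned int pos = 0, inv = 0, i;

	assert(cq != NULL);
	while (pos < mcqe_n) {
		/* Prefetch the entries that are about to be invalidated. */
		for (i = 0; i < DEMO_DESCS_PER_LOOP; ++i)
			if (pos + i < mcqe_n)
				demo_prefetch(&cq[pos + i]);
		/* ... mini-entry processing would happen here ... */
		pos += DEMO_DESCS_PER_LOOP;
		if (!(pos % DEMO_MCQE_PER_CQE) && pos < mcqe_n) {
			/* The step the patch re-adds: prefetch one array ahead. */
			if (pos + DEMO_MCQE_PER_CQE < mcqe_n)
				demo_prefetch(&cq[pos + DEMO_MCQE_PER_CQE]);
			for (i = 0; i < DEMO_MCQE_PER_CQE; ++i)
				cq[inv++].op_own = DEMO_INVALIDATE;
		}
	}
	return inv;
}
```

In this sketch the final group of eight entries is left untouched by the
loop, because the invalidation branch requires pos < mcqe_n. Whether the
extra prefetch helps is platform dependent, as the commit message notes
for x86 versus ARM.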