From patchwork Fri Aug 25 18:40:23 2017
X-Patchwork-Submitter: Yongseok Koh
X-Patchwork-Id: 28015
X-Patchwork-Delegate: ferruh.yigit@amd.com
From: Yongseok Koh
To: adrien.mazarguil@6wind.com, nelio.laranjeiro@6wind.com
Cc: dev@dpdk.org, Yongseok Koh
Date: Fri, 25 Aug 2017 11:40:23 -0700
Message-Id: <20170825184023.31692-2-yskoh@mellanox.com>
In-Reply-To: <20170825184023.31692-1-yskoh@mellanox.com>
References: <20170825184023.31692-1-yskoh@mellanox.com>
List-Id: DPDK patches and discussions
Subject: [dpdk-dev] [RFC PATCH 1/1] net/mlx5: add vectorized Rx/Tx burst for ARM

New Rx/Tx burst functions are added using NEON vector instructions for ARM
CPU.
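
For readers unfamiliar with the byte-shuffle idiom the NEON data path relies
on, the following is a scalar sketch in plain C and not code from this patch.
tbl16() and build_dseg() are hypothetical names: tbl16() only models the
per-byte lookup that the AArch64 TBL instruction (vqtbl1q_u8) performs on one
16-byte table, and the index table is the dseg_shuf_m mask used by
txq_wr_dseg_v() below. The field values in the note after the code are made up.

```c
#include <stdint.h>

/*
 * Scalar stand-in for a one-register TBL lookup (vqtbl1q_u8):
 * out[i] = src[idx[i]] when idx[i] < 16, otherwise 0.
 */
static void
tbl16(uint8_t out[16], const uint8_t src[16], const uint8_t idx[16])
{
	int i;

	for (i = 0; i < 16; ++i)
		out[i] = idx[i] < 16 ? src[idx[i]] : 0;
}

/*
 * Hypothetical helper: build one 16-byte data segment from four 32-bit
 * lanes { len, lkey, addr low, addr high }, as txq_wr_dseg_v() does.
 */
static void
build_dseg(uint8_t dseg[16], uint32_t len, uint32_t lkey, uint64_t addr)
{
	/* Same indices as dseg_shuf_m in the patch. */
	static const uint8_t shuf[16] = {
		3, 2, 1, 0,	/* length, bswap32 */
		4, 5, 6, 7,	/* lkey, left in place */
		15, 14, 13, 12,	/* addr, bswap64 */
		11, 10, 9, 8
	};
	const uint32_t words[4] = {
		len, lkey, (uint32_t)addr, (uint32_t)(addr >> 32)
	};
	uint8_t lanes[16];
	int w, b;

	/* Lay the lanes out little-endian, as a NEON register on ARM64. */
	for (w = 0; w < 4; ++w)
		for (b = 0; b < 4; ++b)
			lanes[4 * w + b] = (uint8_t)(words[w] >> (8 * b));
	tbl16(dseg, lanes, shuf);
}
```

With len = 0x40, lkey = 0x11223344 and addr = 0x0102030405060708, the output
begins with 00 00 00 40 and ends with 01 02 03 04 05 06 07 08: the length and
the address come out big-endian, while the four lkey bytes keep their order.
A single table lookup therefore replaces separate 32-bit and 64-bit byte swaps.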
Signed-off-by: Yongseok Koh
---
 drivers/net/mlx5/Makefile             |    2 +
 drivers/net/mlx5/mlx5_ethdev.c        |    4 +-
 drivers/net/mlx5/mlx5_prm.h           |   15 +
 drivers/net/mlx5/mlx5_rxq.c           |   61 ++
 drivers/net/mlx5/mlx5_rxtx.h          |    3 +-
 drivers/net/mlx5/mlx5_rxtx_vec_neon.c | 1464 +++++++++++++++++++++++++++++++++
 6 files changed, 1546 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/mlx5/mlx5_rxtx_vec_neon.c

diff --git a/drivers/net/mlx5/Makefile b/drivers/net/mlx5/Makefile
index 8736de5d3..616d769f8 100644
--- a/drivers/net/mlx5/Makefile
+++ b/drivers/net/mlx5/Makefile
@@ -41,6 +41,8 @@ SRCS-$(CONFIG_RTE_LIBRTE_MLX5_PMD) += mlx5_txq.c
 SRCS-$(CONFIG_RTE_LIBRTE_MLX5_PMD) += mlx5_rxtx.c
 ifeq ($(CONFIG_RTE_ARCH_X86_64),y)
 SRCS-$(CONFIG_RTE_LIBRTE_MLX5_PMD) += mlx5_rxtx_vec_sse.c
+else ifeq ($(CONFIG_RTE_ARCH_ARM64),y)
+SRCS-$(CONFIG_RTE_LIBRTE_MLX5_PMD) += mlx5_rxtx_vec_neon.c
 endif
 SRCS-$(CONFIG_RTE_LIBRTE_MLX5_PMD) += mlx5_trigger.c
 SRCS-$(CONFIG_RTE_LIBRTE_MLX5_PMD) += mlx5_ethdev.c
diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
index b0eb3cdfc..b387c6fb3 100644
--- a/drivers/net/mlx5/mlx5_ethdev.c
+++ b/drivers/net/mlx5/mlx5_ethdev.c
@@ -1516,7 +1516,7 @@ priv_select_tx_function(struct priv *priv)
 				priv->dev->tx_pkt_burst = mlx5_tx_burst_raw_vec;
 			else
 				priv->dev->tx_pkt_burst = mlx5_tx_burst_vec;
-			DEBUG("selected Enhanced MPW TX vectorized function");
+			WARN("selected Enhanced MPW TX vectorized function");
 		} else {
 			priv->dev->tx_pkt_burst = mlx5_tx_burst_empw;
 			DEBUG("selected Enhanced MPW TX function");
@@ -1542,7 +1542,7 @@ priv_select_rx_function(struct priv *priv)
 	if (priv_check_vec_rx_support(priv) > 0) {
 		priv_prep_vec_rx_function(priv);
 		priv->dev->rx_pkt_burst = mlx5_rx_burst_vec;
-		DEBUG("selected RX vectorized function");
+		WARN("selected RX vectorized function");
 	} else {
 		priv->dev->rx_pkt_burst = mlx5_rx_burst;
 	}
diff --git a/drivers/net/mlx5/mlx5_prm.h b/drivers/net/mlx5/mlx5_prm.h
index 608072f7e..01e95b466 100644
--- a/drivers/net/mlx5/mlx5_prm.h
+++ b/drivers/net/mlx5/mlx5_prm.h
@@ -224,6 +224,20 @@ struct mlx5_mpw {
 };
 
 /* CQ element structure - should be equal to the cache line size */
+#if 0
+struct mlx5_cqe { // 16B
+	uint16_t hdr_type_etc;
+	uint8_t pkt_info;
+	uint8_t sop_drop_qpn; /* flow_tag */
+	uint16_t byte_cnt;
+	uint16_t vlan_info;
+	uint32_t rx_hash_res;
+	uint8_t timestamp;
+	uint8_t wqe_counter;
+	uint8_t rsvd4;
+	uint8_t op_own;
+};
+#else
 struct mlx5_cqe {
 #if (RTE_CACHE_LINE_SIZE == 128)
 	uint8_t padding[64];
@@ -243,6 +257,7 @@ struct mlx5_cqe {
 	uint8_t rsvd4;
 	uint8_t op_own;
 };
+#endif
 
 /**
  * Convert a user mark to flow mark.
diff --git a/drivers/net/mlx5/mlx5_rxq.c b/drivers/net/mlx5/mlx5_rxq.c
index 74387a797..30654abd3 100644
--- a/drivers/net/mlx5/mlx5_rxq.c
+++ b/drivers/net/mlx5/mlx5_rxq.c
@@ -786,6 +786,61 @@ rxq_cleanup(struct rxq_ctrl *rxq_ctrl)
 	memset(rxq_ctrl, 0, sizeof(*rxq_ctrl));
 }
 
+#ifdef SW_EMULATION
+#define MLX5_EMUL_MCQE_PER_COMP 64
+int rxq_cqe_comp_en = 1;
+/**
+ *
+ * Filling in CQEs to emulate packet arrival
+ *
+ * @param tmpl
+ *   Pointer to RX queue control template.
+ */
+ static inline void
+emulate_rxq_cqe_setup(struct rxq_ctrl *tmpl)
+{
+	struct rxq *rxq = &tmpl->rxq;
+	const unsigned int cqe_n = 1 << rxq->cqe_n;
+	volatile struct mlx5_cqe *cqe = &(*rxq->cqes)[0];
+	unsigned int i;
+
+	for (i = 0; i < cqe_n; i++)
+		cqe[i].op_own = MLX5_CQE_INVALIDATE;
+	if (rxq_cqe_comp_en) {
+		volatile struct mlx5_mini_cqe8 *mcqe;
+		unsigned int j;
+
+		for (i = 0; i < cqe_n;) {
+			if (!(i % MLX5_EMUL_MCQE_PER_COMP)) {
+				/* Fill in title CQE */
+				cqe[i].op_own = 0xc; /* Compressed */
+				/* number of mini CQEs */
+				cqe[i].byte_cnt = htonl(MLX5_EMUL_MCQE_PER_COMP);
+				/* Fill in mini CQEs */
+				mcqe = (volatile void *)&cqe[i + 1].pkt_info;
+				for (j = 0; j < MLX5_EMUL_MCQE_PER_COMP; j++) {
+					mcqe[j % 8].byte_cnt =
+						rxq->crc_present ?
+						htonl(64) : htonl(60);
+					if ((j % 8) == 7) {
+						i += 8;
+						mcqe = (volatile void *)
+							&cqe[i].pkt_info;
+					}
+				}
+			} else {
+				ERROR("Error on building compressed CQEs");
+			}
+		}
+	} else {
+		for (i = 0; i < cqe_n; i++) {
+			cqe[i].op_own = 0;
+			cqe[i].byte_cnt = rxq->crc_present ? htonl(64) : htonl(60);
+		}
+	}
+}
+#endif /* SW_EMULATION */
+
 /**
  * Initialize RX queue.
  *
@@ -1064,6 +1119,10 @@ rxq_ctrl_setup(struct rte_eth_dev *dev, struct rxq_ctrl *rxq_ctrl,
 		      (void *)dev, strerror(ret));
 		goto error;
 	}
+#ifdef SW_EMULATION
+	rxq_cqe_comp_en = priv->cqe_comp;
+	emulate_rxq_cqe_setup(&tmpl);
+#endif
 	/* Reuse buffers from original queue if possible. */
 	if (rxq_ctrl->rxq.elts_n) {
 		assert(1 << rxq_ctrl->rxq.elts_n == desc);
@@ -1092,7 +1151,9 @@ rxq_ctrl_setup(struct rte_eth_dev *dev, struct rxq_ctrl *rxq_ctrl,
 	/* Update doorbell counter. */
 	rxq_ctrl->rxq.rq_ci = desc >> rxq_ctrl->rxq.sges_n;
 	rte_wmb();
+#ifndef SW_EMULATION
 	*rxq_ctrl->rxq.rq_db = htonl(rxq_ctrl->rxq.rq_ci);
+#endif
 	DEBUG("%p: rxq updated with %p", (void *)rxq_ctrl, (void *)&tmpl);
 	assert(ret == 0);
 	return 0;
diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index 7de1d1086..6aae00b77 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -602,11 +602,12 @@ mlx5_tx_dbrec(struct txq *txq, volatile struct mlx5_wqe *wqe)
 	uint64_t *dst = (uint64_t *)((uintptr_t)txq->bf_reg);
 	volatile uint64_t *src = ((volatile uint64_t *)wqe);
 
-	rte_wmb();
+	rte_compiler_barrier();
 	*txq->qp_db = htonl(txq->wqe_ci);
 	/* Ensure ordering between DB record and BF copy. */
 	rte_wmb();
 	*dst = *src;
+	rte_wmb();
 }
 
 #endif /* RTE_PMD_MLX5_RXTX_H_ */
diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.c b/drivers/net/mlx5/mlx5_rxtx_vec_neon.c
new file mode 100644
index 000000000..e5bce23c2
--- /dev/null
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.c
@@ -0,0 +1,1464 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright 2017 6WIND S.A.
+ *   Copyright 2017 Mellanox.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of 6WIND S.A. nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include
+#include
+#include
+#include
+#include
+
+/* Verbs header. */
+/* ISO C doesn't support unnamed structs/unions, disabling -pedantic. */
+#ifdef PEDANTIC
+#pragma GCC diagnostic ignored "-Wpedantic"
+#endif
+#include
+#include
+#include
+#ifdef PEDANTIC
+#pragma GCC diagnostic error "-Wpedantic"
+#endif
+
+#include
+#include
+#include
+
+#include "mlx5.h"
+#include "mlx5_utils.h"
+#include "mlx5_rxtx.h"
+#include "mlx5_autoconf.h"
+#include "mlx5_defs.h"
+#include "mlx5_prm.h"
+
+#pragma GCC diagnostic ignored "-Wcast-qual"
+
+/**
+ * Fill in buffer descriptors in a multi-packet send descriptor.
+ *
+ * @param txq
+ *   Pointer to TX queue structure.
+ * @param dseg
+ *   Pointer to buffer descriptor to be written.
+ * @param pkts
+ *   Pointer to array of packets to be sent.
+ * @param n
+ *   Number of packets to be filled.
+ */
+static inline void
+txq_wr_dseg_v(struct txq *txq, uint8_t *dseg,
+	      struct rte_mbuf **pkts, unsigned int n)
+{
+	unsigned int pos;
+	uintptr_t addr;
+	const uint8x16_t dseg_shuf_m = {
+		3, 2, 1, 0, /* length, bswap32 */
+		4, 5, 6, 7, /* lkey */
+		15, 14, 13, 12, /* addr, bswap64 */
+		11, 10, 9, 8
+	};
+#ifdef MLX5_PMD_SOFT_COUNTERS
+	uint32_t tx_byte = 0;
+#endif
+
+	for (pos = 0; pos < n; ++pos, dseg += MLX5_WQE_DWORD_SIZE) {
+		uint8x16_t desc;
+		struct rte_mbuf *pkt = pkts[pos];
+
+		addr = rte_pktmbuf_mtod(pkt, uintptr_t);
+		desc = vreinterpretq_u8_u32((uint32x4_t) {
+				DATA_LEN(pkt),
+				mlx5_tx_mb2mr(txq, pkt),
+				addr,
+				addr >> 32 });
+		desc = vqtbl1q_u8(desc, dseg_shuf_m);
+		vst1q_u8(dseg, desc);
+#ifdef MLX5_PMD_SOFT_COUNTERS
+		tx_byte += DATA_LEN(pkt);
+#endif
+	}
+#ifdef MLX5_PMD_SOFT_COUNTERS
+	txq->stats.obytes += tx_byte;
+#endif
+}
+
+#if 0
+/**
+ * Count the number of continuous single segment packets. The first packet must
+ * be a single segment packet.
+ *
+ * @param pkts
+ *   Pointer to array of packets.
+ * @param pkts_n
+ *   Number of packets.
+ *
+ * @return
+ *   Number of continuous single segment packets.
+ */
+static inline unsigned int
+txq_check_multiseg(struct rte_mbuf **pkts, uint16_t pkts_n)
+{
+	unsigned int pos;
+
+	if (!pkts_n)
+		return 0;
+	assert(NB_SEGS(pkts[0]) == 1);
+	/* Count the number of continuous single segment packets. */
+	for (pos = 1; pos < pkts_n; ++pos)
+		if (NB_SEGS(pkts[pos]) > 1)
+			break;
+	return pos;
+}
+
+/**
+ * Count the number of packets having same ol_flags and calculate cs_flags.
+ *
+ * @param txq
+ *   Pointer to TX queue structure.
+ * @param pkts
+ *   Pointer to array of packets.
+ * @param pkts_n
+ *   Number of packets.
+ * @param cs_flags
+ *   Pointer of flags to be returned.
+ *
+ * @return
+ *   Number of packets having same ol_flags.
+ */
+static inline unsigned int
+txq_calc_offload(struct txq *txq, struct rte_mbuf **pkts, uint16_t pkts_n,
+		 uint8_t *cs_flags)
+{
+	unsigned int pos;
+	const uint64_t ol_mask =
+		PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM |
+		PKT_TX_UDP_CKSUM | PKT_TX_TUNNEL_GRE |
+		PKT_TX_TUNNEL_VXLAN | PKT_TX_OUTER_IP_CKSUM;
+
+	if (!pkts_n)
+		return 0;
+	/* Count the number of packets having same ol_flags. */
+	for (pos = 1; pos < pkts_n; ++pos)
+		if ((pkts[pos]->ol_flags ^ pkts[0]->ol_flags) & ol_mask)
+			break;
+	/* Should open another MPW session for the rest. */
+	if (pkts[0]->ol_flags &
+	    (PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM | PKT_TX_UDP_CKSUM)) {
+		const uint64_t is_tunneled =
+			pkts[0]->ol_flags &
+			(PKT_TX_TUNNEL_GRE |
+			 PKT_TX_TUNNEL_VXLAN);
+
+		if (is_tunneled && txq->tunnel_en) {
+			*cs_flags = MLX5_ETH_WQE_L3_INNER_CSUM |
+				    MLX5_ETH_WQE_L4_INNER_CSUM;
+			if (pkts[0]->ol_flags & PKT_TX_OUTER_IP_CKSUM)
+				*cs_flags |= MLX5_ETH_WQE_L3_CSUM;
+		} else {
+			*cs_flags = MLX5_ETH_WQE_L3_CSUM |
+				    MLX5_ETH_WQE_L4_CSUM;
+		}
+	}
+	return pos;
+}
+
+/**
+ * Send multi-segmented packets until it encounters a single segment packet in
+ * the pkts list.
+ *
+ * @param txq
+ *   Pointer to TX queue structure.
+ * @param pkts
+ *   Pointer to array of packets to be sent.
+ * @param pkts_n
+ *   Number of packets to be sent.
+ *
+ * @return
+ *   Number of packets successfully transmitted (<= pkts_n).
+ */
+static uint16_t
+txq_scatter_v(struct txq *txq, struct rte_mbuf **pkts, uint16_t pkts_n)
+{
+	uint16_t elts_head = txq->elts_head;
+	const uint16_t elts_n = 1 << txq->elts_n;
+	const uint16_t elts_m = elts_n - 1;
+	const uint16_t wq_n = 1 << txq->wqe_n;
+	const uint16_t wq_mask = wq_n - 1;
+	const unsigned int nb_dword_per_wqebb =
+		MLX5_WQE_SIZE / MLX5_WQE_DWORD_SIZE;
+	const unsigned int nb_dword_in_hdr =
+		sizeof(struct mlx5_wqe) / MLX5_WQE_DWORD_SIZE;
+	unsigned int n;
+	volatile struct mlx5_wqe *wqe = NULL;
+
+	assert(elts_n > pkts_n);
+	mlx5_tx_complete(txq);
+	if (unlikely(!pkts_n))
+		return 0;
+	for (n = 0; n < pkts_n; ++n) {
+		struct rte_mbuf *buf = pkts[n];
+		unsigned int segs_n = buf->nb_segs;
+		unsigned int ds = nb_dword_in_hdr;
+		unsigned int len = PKT_LEN(buf);
+		uint16_t wqe_ci = txq->wqe_ci;
+		const __m128i shuf_mask_ctrl =
+			_mm_set_epi8(15, 14, 13, 12,
+				     8, 9, 10, 11, /* bswap32 */
+				     4, 5, 6, 7, /* bswap32 */
+				     0, 1, 2, 3 /* bswap32 */);
+		uint8_t cs_flags = 0;
+		uint16_t max_elts;
+		uint16_t max_wqe;
+		__m128i *t_wqe, *dseg;
+		__m128i ctrl;
+
+		assert(segs_n);
+		max_elts = elts_n - (elts_head - txq->elts_tail);
+		max_wqe = wq_n - (txq->wqe_ci - txq->wqe_pi);
+		/*
+		 * A MPW session consumes 2 WQEs at most to
+		 * include MLX5_MPW_DSEG_MAX pointers.
+		 */
+		if (segs_n == 1 ||
+		    max_elts < segs_n || max_wqe < 2)
+			break;
+		wqe = &((volatile struct mlx5_wqe64 *)
+			 txq->wqes)[wqe_ci & wq_mask].hdr;
+		if (buf->ol_flags &
+		    (PKT_TX_IP_CKSUM | PKT_TX_TCP_CKSUM | PKT_TX_UDP_CKSUM)) {
+			const uint64_t is_tunneled = buf->ol_flags &
+						     (PKT_TX_TUNNEL_GRE |
+						      PKT_TX_TUNNEL_VXLAN);
+
+			if (is_tunneled && txq->tunnel_en) {
+				cs_flags = MLX5_ETH_WQE_L3_INNER_CSUM |
+					   MLX5_ETH_WQE_L4_INNER_CSUM;
+				if (buf->ol_flags & PKT_TX_OUTER_IP_CKSUM)
+					cs_flags |= MLX5_ETH_WQE_L3_CSUM;
+			} else {
+				cs_flags = MLX5_ETH_WQE_L3_CSUM |
+					   MLX5_ETH_WQE_L4_CSUM;
+			}
+		}
+		/* Title WQEBB pointer. */
+		t_wqe = (__m128i *)wqe;
+		dseg = (__m128i *)(wqe + 1);
+		do {
+			if (!(ds++ % nb_dword_per_wqebb)) {
+				dseg = (__m128i *)
+					&((volatile struct mlx5_wqe64 *)
+					  txq->wqes)[++wqe_ci & wq_mask];
+			}
+			txq_wr_dseg_v(txq, dseg++, &buf, 1);
+			(*txq->elts)[elts_head++ & elts_m] = buf;
+			buf = buf->next;
+		} while (--segs_n);
+		++wqe_ci;
+		/* Fill CTRL in the header. */
+		ctrl = _mm_set_epi32(0, 0, txq->qp_num_8s | ds,
+				     MLX5_OPC_MOD_MPW << 24 |
+				     txq->wqe_ci << 8 | MLX5_OPCODE_TSO);
+		ctrl = _mm_shuffle_epi8(ctrl, shuf_mask_ctrl);
+		_mm_store_si128(t_wqe, ctrl);
+		/* Fill ESEG in the header. */
+		_mm_store_si128(t_wqe + 1,
+				_mm_set_epi16(0, 0, 0, 0,
+					      htons(len), cs_flags,
+					      0, 0));
+		txq->wqe_ci = wqe_ci;
+	}
+	if (!n)
+		return 0;
+	txq->elts_comp += (uint16_t)(elts_head - txq->elts_head);
+	txq->elts_head = elts_head;
+	if (txq->elts_comp >= MLX5_TX_COMP_THRESH) {
+		wqe->ctrl[2] = htonl(8);
+		wqe->ctrl[3] = txq->elts_head;
+		txq->elts_comp = 0;
+		++txq->cq_pi;
+	}
+#ifdef MLX5_PMD_SOFT_COUNTERS
+	txq->stats.opackets += n;
+#endif
+	mlx5_tx_dbrec(txq, wqe);
+	return n;
+}
+#endif
+
+/**
+ * Send burst of packets with Enhanced MPW. If it encounters a multi-seg packet,
+ * it returns to make it processed by txq_scatter_v(). All the packets in
+ * the pkts list should be single segment packets having same offload flags.
+ * This must be checked by txq_check_multiseg() and txq_calc_offload().
+ *
+ * @param txq
+ *   Pointer to TX queue structure.
+ * @param pkts
+ *   Pointer to array of packets to be sent.
+ * @param pkts_n
+ *   Number of packets to be sent (<= MLX5_VPMD_TX_MAX_BURST).
+ * @param cs_flags
+ *   Checksum offload flags to be written in the descriptor.
+ *
+ * @return
+ *   Number of packets successfully transmitted (<= pkts_n).
+ */
+static inline uint16_t
+txq_burst_v(struct txq *txq, struct rte_mbuf **pkts, uint16_t pkts_n,
+	    uint8_t cs_flags)
+{
+	struct rte_mbuf **elts;
+	uint16_t elts_head = txq->elts_head;
+	const uint16_t elts_n = 1 << txq->elts_n;
+	const uint16_t elts_m = elts_n - 1;
+	const unsigned int nb_dword_per_wqebb =
+		MLX5_WQE_SIZE / MLX5_WQE_DWORD_SIZE;
+	const unsigned int nb_dword_in_hdr =
+		sizeof(struct mlx5_wqe) / MLX5_WQE_DWORD_SIZE;
+	unsigned int n = 0;
+	unsigned int pos;
+	uint16_t max_elts;
+	uint16_t max_wqe;
+	uint32_t comp_req = 0;
+	const uint16_t wq_n = 1 << txq->wqe_n;
+	const uint16_t wq_mask = wq_n - 1;
+	uint16_t wq_idx = txq->wqe_ci & wq_mask;
+	volatile struct mlx5_wqe64 *wq =
+		&((volatile struct mlx5_wqe64 *)txq->wqes)[wq_idx];
+	volatile struct mlx5_wqe *wqe = (volatile struct mlx5_wqe *)wq;
+	const uint8x16_t ctrl_shuf_m = {
+		3, 2, 1, 0, /* bswap32 */
+		7, 6, 5, 4, /* bswap32 */
+		11, 10, 9, 8, /* bswap32 */
+		12, 13, 14, 15
+	};
+	uint8x16_t *t_wqe;
+	uint8_t *dseg;
+	uint8x16_t ctrl;
+
+	/* Make sure all packets can fit into a single WQE. */
+	assert(elts_n > pkts_n);
+	mlx5_tx_complete(txq);
+	max_elts = (elts_n - (elts_head - txq->elts_tail));
+	max_wqe = (1u << txq->wqe_n) - (txq->wqe_ci - txq->wqe_pi);
+	pkts_n = RTE_MIN((unsigned int)RTE_MIN(pkts_n, max_wqe), max_elts);
+	if (unlikely(!pkts_n))
+		return 0;
+	elts = &(*txq->elts)[elts_head & elts_m];
+	/* Loop for available tailroom first. */
+	n = RTE_MIN(elts_n - (elts_head & elts_m), pkts_n);
+	for (pos = 0; pos < (n & -2); pos += 2)
+		vst1q_u64((void *)&elts[pos], vld1q_u64((void *)&pkts[pos]));
+	if (n & 1)
+		elts[pos] = pkts[pos];
+	/* Check if it crosses the end of the queue. */
+	if (unlikely(n < pkts_n)) {
+		elts = &(*txq->elts)[0];
+		for (pos = 0; pos < pkts_n - n; ++pos)
+			elts[pos] = pkts[n + pos];
+	}
+	txq->elts_head += pkts_n;
+	/* Save title WQEBB pointer. */
+	t_wqe = (uint8x16_t *)wqe;
+	dseg = (uint8_t *)(wqe + 1);
+	/* Calculate the number of entries to the end. */
+	n = RTE_MIN(
+		(wq_n - wq_idx) * nb_dword_per_wqebb - nb_dword_in_hdr,
+		pkts_n);
+	/* Fill DSEGs. */
+	txq_wr_dseg_v(txq, dseg, pkts, n);
+	/* Check if it crosses the end of the queue. */
+	if (n < pkts_n) {
+		dseg = (uint8_t *)txq->wqes;
+		txq_wr_dseg_v(txq, dseg, &pkts[n], pkts_n - n);
+	}
+	if (txq->elts_comp + pkts_n < MLX5_TX_COMP_THRESH) {
+		txq->elts_comp += pkts_n;
+	} else {
+		/* Request a completion. */
+		txq->elts_comp = 0;
+		++txq->cq_pi;
+		comp_req = 8;
+	}
+	/* Fill CTRL in the header. */
+	ctrl = vreinterpretq_u8_u32((uint32x4_t) {
+			MLX5_OPC_MOD_ENHANCED_MPSW << 24 |
+			txq->wqe_ci << 8 | MLX5_OPCODE_ENHANCED_MPSW,
+			txq->qp_num_8s | (pkts_n + 2),
+			comp_req,
+			txq->elts_head });
+	ctrl = vqtbl1q_u8(ctrl, ctrl_shuf_m);
+	vst1q_u8((void *)t_wqe, ctrl);
+	/* Fill ESEG in the header. */
+	vst1q_u8((void *)(t_wqe + 1),
+		 (uint8x16_t) { 0, 0, 0, 0,
+				cs_flags, 0, 0, 0,
+				0, 0, 0, 0,
+				0, 0, 0, 0 });
+#ifdef MLX5_PMD_SOFT_COUNTERS
+	txq->stats.opackets += pkts_n;
+#endif
+	txq->wqe_ci += (nb_dword_in_hdr + pkts_n + (nb_dword_per_wqebb - 1)) /
+		       nb_dword_per_wqebb;
+	/* Ring QP doorbell. */
+	mlx5_tx_dbrec(txq, wqe);
+	return pkts_n;
+}
+
+/**
+ * DPDK callback for vectorized TX.
+ *
+ * @param dpdk_txq
+ *   Generic pointer to TX queue structure.
+ * @param[in] pkts
+ *   Packets to transmit.
+ * @param pkts_n
+ *   Number of packets in array.
+ *
+ * @return
+ *   Number of packets successfully transmitted (<= pkts_n).
+ */
+uint16_t
+mlx5_tx_burst_raw_vec(void *dpdk_txq, struct rte_mbuf **pkts,
+		      uint16_t pkts_n)
+{
+	struct txq *txq = (struct txq *)dpdk_txq;
+	uint16_t nb_tx = 0;
+
+	while (pkts_n > nb_tx) {
+		uint16_t n;
+		uint16_t ret;
+
+		n = RTE_MIN((uint16_t)(pkts_n - nb_tx), MLX5_VPMD_TX_MAX_BURST);
+		ret = txq_burst_v(txq, &pkts[nb_tx], n, 0);
+		nb_tx += ret;
+		if (!ret)
+			break;
+	}
+	return nb_tx;
+}
+
+#if 0
+/**
+ * DPDK callback for vectorized TX with multi-seg packets and offload.
+ *
+ * @param dpdk_txq
+ *   Generic pointer to TX queue structure.
+ * @param[in] pkts
+ *   Packets to transmit.
+ * @param pkts_n
+ *   Number of packets in array.
+ *
+ * @return
+ *   Number of packets successfully transmitted (<= pkts_n).
+ */
+uint16_t
+mlx5_tx_burst_vec(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
+{
+	struct txq *txq = (struct txq *)dpdk_txq;
+	uint16_t nb_tx = 0;
+
+	while (pkts_n > nb_tx) {
+		uint8_t cs_flags = 0;
+		uint16_t n;
+		uint16_t ret;
+
+		/* Transmit multi-seg packets in the head of pkts list. */
+		if (!(txq->flags & ETH_TXQ_FLAGS_NOMULTSEGS) &&
+		    NB_SEGS(pkts[nb_tx]) > 1)
+			nb_tx += txq_scatter_v(txq,
+					       &pkts[nb_tx],
+					       pkts_n - nb_tx);
+		n = RTE_MIN((uint16_t)(pkts_n - nb_tx), MLX5_VPMD_TX_MAX_BURST);
+		if (!(txq->flags & ETH_TXQ_FLAGS_NOMULTSEGS))
+			n = txq_check_multiseg(&pkts[nb_tx], n);
+		if (!(txq->flags & ETH_TXQ_FLAGS_NOOFFLOADS))
+			n = txq_calc_offload(txq, &pkts[nb_tx], n, &cs_flags);
+		ret = txq_burst_v(txq, &pkts[nb_tx], n, cs_flags);
+		nb_tx += ret;
+		if (!ret)
+			break;
+	}
+	return nb_tx;
+}
+#endif
+
+/**
+ * Store free buffers to RX SW ring.
+ *
+ * @param rxq
+ *   Pointer to RX queue structure.
+ * @param pkts
+ *   Pointer to array of packets to be stored.
+ * @param pkts_n
+ *   Number of packets to be stored.
+ */
+static inline void
+rxq_copy_mbuf_v(struct rxq *rxq, struct rte_mbuf **pkts, uint16_t n)
+{
+	const uint16_t q_mask = (1 << rxq->elts_n) - 1;
+	struct rte_mbuf **elts = &(*rxq->elts)[rxq->rq_pi & q_mask];
+	unsigned int pos;
+	uint16_t p = n & -2;
+
+	for (pos = 0; pos < p; pos += 2) {
+		uint64x2_t mbp;
+
+		mbp = vld1q_u64((void *)&elts[pos]);
+		vst1q_u64((void *)&pkts[pos], mbp);
+	}
+	if (n & 1)
+		pkts[pos] = elts[pos];
+}
+
+/**
+ * Replenish buffers for RX in bulk.
+ *
+ * @param rxq
+ *   Pointer to RX queue structure.
+ * @param n
+ *   Number of buffers to be replenished.
+ */
+static inline void
+rxq_replenish_bulk_mbuf(struct rxq *rxq, uint16_t n)
+{
+	const uint16_t q_n = 1 << rxq->elts_n;
+	const uint16_t q_mask = q_n - 1;
+	const uint16_t elts_idx = rxq->rq_ci & q_mask;
+	struct rte_mbuf **elts = &(*rxq->elts)[elts_idx];
+	volatile struct mlx5_wqe_data_seg *wq = &(*rxq->wqes)[elts_idx];
+	unsigned int i;
+
+	assert(n >= MLX5_VPMD_RXQ_RPLNSH_THRESH);
+	assert(n <= (uint16_t)(q_n - (rxq->rq_ci - rxq->rq_pi)));
+	assert(MLX5_VPMD_RXQ_RPLNSH_THRESH > MLX5_VPMD_DESCS_PER_LOOP);
+	/* Not to cross queue end. */
+	n = RTE_MIN(n - MLX5_VPMD_DESCS_PER_LOOP, q_n - elts_idx);
+	if (rte_mempool_get_bulk(rxq->mp, (void *)elts, n) < 0) {
+		rxq->stats.rx_nombuf += n;
+		return;
+	}
+	for (i = 0; i < n; ++i)
+		wq[i].addr = htonll((uintptr_t)elts[i]->buf_addr +
+				    RTE_PKTMBUF_HEADROOM);
+	rxq->rq_ci += n;
+#ifdef SW_EMULATION
+	*rxq->rq_db = 0;
+#else
+	*rxq->rq_db = htonl(rxq->rq_ci);
+#endif
+}
+
+/**
+ * Decompress a compressed completion and fill in mbufs in RX SW ring with data
+ * extracted from the title completion descriptor.
+ *
+ * @param rxq
+ *   Pointer to RX queue structure.
+ * @param cq
+ *   Pointer to completion array having a compressed completion at first.
+ * @param elts
+ *   Pointer to SW ring to be filled. The first mbuf has to be pre-built from
+ *   the title completion descriptor to be copied to the rest of mbufs.
+ */
+static inline void
+rxq_cq_decompress_v(struct rxq *rxq,
+		    volatile struct mlx5_cqe *cq,
+		    struct rte_mbuf **elts)
+{
+	volatile struct mlx5_mini_cqe8 *mcq = (void *)&(cq + 1)->pkt_info;
+	struct rte_mbuf *t_pkt = elts[0]; /* Title packet is pre-built. */
+	unsigned int pos;
+#ifdef SW_EMULATION
+	unsigned int inv = 2;
+#else
+	unsigned int i;
+	unsigned int inv = 0;
+#endif
+	/* Mask to shuffle from extracted mini CQE to mbuf. */
+	const uint8x16_t mcqe_shuf_m1 = {
+		-1, -1, -1, -1, /* skip packet_type */
+		7, 6, -1, -1, /* pkt_len, bswap16 */
+		7, 6, /* data_len, bswap16 */
+		-1, -1, /* skip vlan_tci */
+		3, 2, 1, 0 /* hash.rss, bswap32 */
+	};
+	const uint8x16_t mcqe_shuf_m2 = {
+		-1, -1, -1, -1, /* skip packet_type */
+		15, 14, -1, -1, /* pkt_len, bswap16 */
+		15, 14, /* data_len, bswap16 */
+		-1, -1, /* skip vlan_tci */
+		11, 10, 9, 8 /* hash.rss, bswap32 */
+	};
+	/* Restore the compressed count. Must be 16 bits. */
+	const uint16_t mcqe_n = t_pkt->data_len +
+				(rxq->crc_present * ETHER_CRC_LEN);
+	const uint64x2_t rearm =
+		vld1q_u64((void *)&t_pkt->rearm_data);
+	const uint32x4_t rxdf_mask = {
+		0xffffffff, /* packet_type */
+		0, /* skip pkt_len */
+		0xffff0000, /* vlan_tci, skip data_len */
+		0, /* skip hash.rss */
+	};
+	const uint8x16_t rxdf =
+		vandq_u8(vld1q_u8((void *)&t_pkt->rx_descriptor_fields1),
+			 vreinterpretq_u8_u32(rxdf_mask));
+	const uint16x8_t crc_adj = {
+		0, 0,
+		rxq->crc_present * ETHER_CRC_LEN, 0,
+		rxq->crc_present * ETHER_CRC_LEN, 0,
+		0, 0
+	};
+	const uint32_t flow_tag = t_pkt->hash.fdir.hi;
+#ifdef MLX5_PMD_SOFT_COUNTERS
+	uint32_t rcvd_byte = 0;
+#endif
+	/* Mask to shuffle byte_cnt to add up stats. Do bswap16 for all. */
+	const uint8x8_t len_shuf_m = {
+		7, 6, /* 1st mCQE */
+		15, 14, /* 2nd mCQE */
+		23, 22, /* 3rd mCQE */
+		31, 30 /* 4th mCQE */
+	};
+
+	/* Compile time sanity check for this function. */
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8);
+	RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, hash) !=
+			 offsetof(struct rte_mbuf, rx_descriptor_fields1) + 12);
+	/*
+	 * A. load mCQEs into a 128bit register.
+	 * B. store rearm data to mbuf.
+	 * C. combine data from mCQEs with rx_descriptor_fields1.
+	 * D. store rx_descriptor_fields1.
+	 * E. store flow tag (rte_flow mark).
+ */ + for (pos = 0; pos < mcqe_n; ) { + uint8_t *p = (void *)&mcq[pos % 8]; + uint8_t *e0 = (void *)&elts[pos]->rearm_data; + uint8_t *e1 = (void *)&elts[pos + 1]->rearm_data; + uint8_t *e2 = (void *)&elts[pos + 2]->rearm_data; + uint8_t *e3 = (void *)&elts[pos + 3]->rearm_data; + uint16x4_t byte_cnt; +#ifdef MLX5_PMD_SOFT_COUNTERS + uint16x4_t invalid_mask = + vcreate_u16(mcqe_n - pos < MLX5_VPMD_DESCS_PER_LOOP ? + -1UL << ((mcqe_n - pos) * + sizeof(uint16_t) * 8) : 0); +#endif + + if (!(pos & 0x7) && pos + 8 < mcqe_n) + rte_prefetch0((void *)(cq + pos + 8)); + __asm__ volatile ( + /* A.1 load mCQEs into a 128bit register. */ + "ld1 {v16.16b - v17.16b}, [%[mcq]]\n\t" + /* B.1 store rearm data to mbuf. */ + "st1 {%[rearm].2d}, [%[e0]]\n\t" + "add %[e0], %[e0], #16\n\t" + "st1 {%[rearm].2d}, [%[e1]]\n\t" + "add %[e1], %[e1], #16\n\t" + /* C.1 combine data from mCQEs with rx_descriptor_fields1. */ + "tbl v18.16b, {v16.16b}, %[mcqe_shuf_m1].16b\n\t" + "tbl v19.16b, {v16.16b}, %[mcqe_shuf_m2].16b\n\t" + "sub v18.8h, v18.8h, %[crc_adj].8h\n\t" + "sub v19.8h, v19.8h, %[crc_adj].8h\n\t" + "orr v18.16b, v18.16b, %[rxdf].16b\n\t" + "orr v19.16b, v19.16b, %[rxdf].16b\n\t" + /* D.1 store rx_descriptor_fields1. */ + "st1 {v18.2d}, [%[e0]]\n\t" + "st1 {v19.2d}, [%[e1]]\n\t" + /* B.1 store rearm data to mbuf. */ + "st1 {%[rearm].2d}, [%[e2]]\n\t" + "add %[e2], %[e2], #16\n\t" + "st1 {%[rearm].2d}, [%[e3]]\n\t" + "add %[e3], %[e3], #16\n\t" + /* C.1 combine data from mCQEs with rx_descriptor_fields1. */ + "tbl v18.16b, {v17.16b}, %[mcqe_shuf_m1].16b\n\t" + "tbl v19.16b, {v17.16b}, %[mcqe_shuf_m2].16b\n\t" + "sub v18.8h, v18.8h, %[crc_adj].8h\n\t" + "sub v19.8h, v19.8h, %[crc_adj].8h\n\t" + "orr v18.16b, v18.16b, %[rxdf].16b\n\t" + "orr v19.16b, v19.16b, %[rxdf].16b\n\t" + /* D.1 store rx_descriptor_fields1. 
*/ + "st1 {v18.2d}, [%[e2]]\n\t" + "st1 {v19.2d}, [%[e3]]\n\t" +#ifdef MLX5_PMD_SOFT_COUNTERS + "tbl %[byte_cnt].8b, {v16.16b - v17.16b}, %[len_shuf_m].8b\n\t" +#endif + :[byte_cnt]"=&w"(byte_cnt) + :[mcq]"r"(p), [rxdf]"w"(rxdf), [rearm]"w"(rearm), + [e3]"r"(e3), [e2]"r"(e2), [e1]"r"(e1), [e0]"r"(e0), + [mcqe_shuf_m1]"w"(mcqe_shuf_m1), + [mcqe_shuf_m2]"w"(mcqe_shuf_m2), + [crc_adj]"w"(crc_adj), [len_shuf_m]"w"(len_shuf_m) + :"memory", "v16", "v17", "v18", "v19"); +#ifdef MLX5_PMD_SOFT_COUNTERS + byte_cnt = vbic_u16(byte_cnt, invalid_mask); + rcvd_byte += vget_lane_u64(vpaddl_u32(vpaddl_u16(byte_cnt)), 0); +#endif + if (rxq->mark) { + /* E.1 store flow tag (rte_flow mark). */ + elts[pos]->hash.fdir.hi = flow_tag; + elts[pos + 1]->hash.fdir.hi = flow_tag; + elts[pos + 2]->hash.fdir.hi = flow_tag; + elts[pos + 3]->hash.fdir.hi = flow_tag; + } + pos += MLX5_VPMD_DESCS_PER_LOOP; + /* Move to next CQE and invalidate consumed CQEs. */ + if (!(pos & 0x7) && pos < mcqe_n) { + mcq = (void *)&(cq + pos)->pkt_info; +#ifdef SW_EMULATION + for (; (inv & 7) != 0; ++inv) + cq[inv].op_own = MLX5_CQE_INVALIDATE; + ++inv; +#else + for (i = 0; i < 8; ++i) + cq[inv++].op_own = MLX5_CQE_INVALIDATE; +#endif + } + } + /* Invalidate the rest of CQEs. */ + for (; inv < mcqe_n; ++inv) + cq[inv].op_own = MLX5_CQE_INVALIDATE; +#ifdef MLX5_PMD_SOFT_COUNTERS + rxq->stats.ipackets += mcqe_n; + rxq->stats.ibytes += rcvd_byte; +#endif + rxq->cq_ci += mcqe_n; +} + +#if 0 +/** + * Calculate packet type and offload flag for mbuf and store it. + * + * @param rxq + * Pointer to RX queue structure. + * @param cqes[4] + * Array of four 16bytes completions extracted from the original completion + * descriptor. + * @param op_err + * Opcode vector having responder error status. Each field is 4B. + * @param pkts + * Pointer to array of packets to be filled. 
+ */ +static inline void +rxq_cq_to_ptype_oflags_v(struct rxq *rxq, __m128i cqes[4], __m128i op_err, + struct rte_mbuf **pkts) +{ + __m128i pinfo0, pinfo1; + __m128i pinfo, ptype; + __m128i ol_flags = _mm_set1_epi32(rxq->rss_hash * PKT_RX_RSS_HASH); + __m128i cv_flags; + const __m128i zero = _mm_setzero_si128(); + const __m128i ptype_mask = + _mm_set_epi32(0xfd06, 0xfd06, 0xfd06, 0xfd06); + const __m128i ptype_ol_mask = + _mm_set_epi32(0x106, 0x106, 0x106, 0x106); + const __m128i pinfo_mask = + _mm_set_epi32(0x3, 0x3, 0x3, 0x3); + const __m128i cv_flag_sel = + _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0, 0, + (uint8_t)((PKT_RX_IP_CKSUM_GOOD | + PKT_RX_L4_CKSUM_GOOD) >> 1), + 0, + (uint8_t)(PKT_RX_L4_CKSUM_GOOD >> 1), + 0, + (uint8_t)(PKT_RX_IP_CKSUM_GOOD >> 1), + (uint8_t)(PKT_RX_VLAN_PKT | PKT_RX_VLAN_STRIPPED), + 0); + const __m128i cv_mask = + _mm_set_epi32(PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD | + PKT_RX_VLAN_PKT | PKT_RX_VLAN_STRIPPED, + PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD | + PKT_RX_VLAN_PKT | PKT_RX_VLAN_STRIPPED, + PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD | + PKT_RX_VLAN_PKT | PKT_RX_VLAN_STRIPPED, + PKT_RX_IP_CKSUM_GOOD | PKT_RX_L4_CKSUM_GOOD | + PKT_RX_VLAN_PKT | PKT_RX_VLAN_STRIPPED); + const __m128i mbuf_init = + _mm_loadl_epi64((__m128i *)&rxq->mbuf_initializer); + __m128i rearm0, rearm1, rearm2, rearm3; + + /* Extract pkt_info field. */ + pinfo0 = _mm_unpacklo_epi32(cqes[0], cqes[1]); + pinfo1 = _mm_unpacklo_epi32(cqes[2], cqes[3]); + pinfo = _mm_unpacklo_epi64(pinfo0, pinfo1); + /* Extract hdr_type_etc field. 
*/ + pinfo0 = _mm_unpackhi_epi32(cqes[0], cqes[1]); + pinfo1 = _mm_unpackhi_epi32(cqes[2], cqes[3]); + ptype = _mm_unpacklo_epi64(pinfo0, pinfo1); + if (rxq->mark) { + const __m128i pinfo_ft_mask = + _mm_set_epi32(0xffffff00, 0xffffff00, + 0xffffff00, 0xffffff00); + const __m128i fdir_flags = _mm_set1_epi32(PKT_RX_FDIR); + const __m128i fdir_id_flags = _mm_set1_epi32(PKT_RX_FDIR_ID); + __m128i flow_tag, invalid_mask; + + flow_tag = _mm_and_si128(pinfo, pinfo_ft_mask); + /* Check if flow tag is non-zero then set PKT_RX_FDIR. */ + invalid_mask = _mm_cmpeq_epi32(flow_tag, zero); + ol_flags = _mm_or_si128(ol_flags, + _mm_andnot_si128(invalid_mask, + fdir_flags)); + /* Mask out invalid entries. */ + flow_tag = _mm_andnot_si128(invalid_mask, flow_tag); + /* Check if flow tag MLX5_FLOW_MARK_DEFAULT. */ + ol_flags = _mm_or_si128(ol_flags, + _mm_andnot_si128( + _mm_cmpeq_epi32(flow_tag, + pinfo_ft_mask), + fdir_id_flags)); + } + /* + * Merge the two fields to generate the following: + * bit[1] = l3_ok + * bit[2] = l4_ok + * bit[8] = cv + * bit[11:10] = l3_hdr_type + * bit[14:12] = l4_hdr_type + * bit[15] = ip_frag + * bit[16] = tunneled + * bit[17] = outer_l3_type + */ + ptype = _mm_and_si128(ptype, ptype_mask); + pinfo = _mm_and_si128(pinfo, pinfo_mask); + pinfo = _mm_slli_epi32(pinfo, 16); + /* Make pinfo has merged fields for ol_flags calculation. */ + pinfo = _mm_or_si128(ptype, pinfo); + ptype = _mm_srli_epi32(pinfo, 10); + ptype = _mm_packs_epi32(ptype, zero); + /* Errored packets will have RTE_PTYPE_ALL_MASK. */ + op_err = _mm_srli_epi16(op_err, 8); + ptype = _mm_or_si128(ptype, op_err); + pkts[0]->packet_type = mlx5_ptype_table[_mm_extract_epi8(ptype, 0)]; + pkts[1]->packet_type = mlx5_ptype_table[_mm_extract_epi8(ptype, 2)]; + pkts[2]->packet_type = mlx5_ptype_table[_mm_extract_epi8(ptype, 4)]; + pkts[3]->packet_type = mlx5_ptype_table[_mm_extract_epi8(ptype, 6)]; + /* Fill flags for checksum and VLAN. 
*/ + pinfo = _mm_and_si128(pinfo, ptype_ol_mask); + pinfo = _mm_shuffle_epi8(cv_flag_sel, pinfo); + /* Locate checksum flags at byte[2:1] and merge with VLAN flags. */ + cv_flags = _mm_slli_epi32(pinfo, 9); + cv_flags = _mm_or_si128(pinfo, cv_flags); + /* Move back flags to start from byte[0]. */ + cv_flags = _mm_srli_epi32(cv_flags, 8); + /* Mask out garbage bits. */ + cv_flags = _mm_and_si128(cv_flags, cv_mask); + /* Merge to ol_flags. */ + ol_flags = _mm_or_si128(ol_flags, cv_flags); + /* Merge mbuf_init and ol_flags. */ + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, ol_flags) != + offsetof(struct rte_mbuf, rearm_data) + 8); + rearm0 = _mm_blend_epi16(mbuf_init, _mm_slli_si128(ol_flags, 8), 0x30); + rearm1 = _mm_blend_epi16(mbuf_init, _mm_slli_si128(ol_flags, 4), 0x30); + rearm2 = _mm_blend_epi16(mbuf_init, ol_flags, 0x30); + rearm3 = _mm_blend_epi16(mbuf_init, _mm_srli_si128(ol_flags, 4), 0x30); + /* Write 8B rearm_data and 8B ol_flags. */ + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, rearm_data) != + RTE_ALIGN(offsetof(struct rte_mbuf, rearm_data), 16)); + _mm_store_si128((__m128i *)&pkts[0]->rearm_data, rearm0); + _mm_store_si128((__m128i *)&pkts[1]->rearm_data, rearm1); + _mm_store_si128((__m128i *)&pkts[2]->rearm_data, rearm2); + _mm_store_si128((__m128i *)&pkts[3]->rearm_data, rearm3); +} +#endif + +/** + * Skip error packets. + * + * @param rxq + * Pointer to RX queue structure. + * @param[out] pkts + * Array to store received packets. + * @param pkts_n + * Maximum number of packets in array. + * + * @return + * Number of packets successfully received (<= pkts_n). 
+ */ +static uint16_t +rxq_handle_pending_error(struct rxq *rxq, struct rte_mbuf **pkts, + uint16_t pkts_n) +{ + uint16_t n = 0; + unsigned int i; +#ifdef MLX5_PMD_SOFT_COUNTERS + uint32_t err_bytes = 0; +#endif + + for (i = 0; i < pkts_n; ++i) { + struct rte_mbuf *pkt = pkts[i]; + + if (pkt->packet_type == RTE_PTYPE_ALL_MASK) { +#ifdef MLX5_PMD_SOFT_COUNTERS + err_bytes += PKT_LEN(pkt); +#endif + rte_pktmbuf_free_seg(pkt); + } else { + pkts[n++] = pkt; + } + } + rxq->stats.idropped += (pkts_n - n); +#ifdef MLX5_PMD_SOFT_COUNTERS + /* Correct counters of errored completions. */ + rxq->stats.ipackets -= (pkts_n - n); + rxq->stats.ibytes -= err_bytes; +#endif + rxq->pending_err = 0; + return n; +} + +/** + * Receive burst of packets. An errored completion also consumes a mbuf, but the + * packet_type is set to be RTE_PTYPE_ALL_MASK. Marked mbufs should be freed + * before returning to application. + * + * @param rxq + * Pointer to RX queue structure. + * @param[out] pkts + * Array to store received packets. + * @param pkts_n + * Maximum number of packets in array. + * + * @return + * Number of packets received including errors (<= pkts_n). 
+ */ +static inline uint16_t +rxq_burst_v(struct rxq *rxq, struct rte_mbuf **pkts, uint16_t pkts_n) +{ + const uint16_t q_n = 1 << rxq->cqe_n; + const uint16_t q_mask = q_n - 1; + volatile struct mlx5_cqe *cq; + struct rte_mbuf **elts; + unsigned int pos; + uint64_t n; + uint16_t repl_n; + uint64_t comp_idx = MLX5_VPMD_DESCS_PER_LOOP; + uint16_t nocmp_n = 0; + uint16_t rcvd_pkt = 0; + unsigned int cq_idx = rxq->cq_ci & q_mask; + unsigned int elts_idx; +#ifdef SW_EMULATION + const uint16x4_t ownership = vdup_n_u16(1); +#else + const uint16x4_t ownership = vdup_n_u16(!(rxq->cq_ci & (q_mask + 1))); +#endif + const uint16x4_t owner_check = vcreate_u16(0x0001000100010001); + const uint16x4_t opcode_check = vcreate_u16(0x00f000f000f000f0); + const uint16x4_t format_check = vcreate_u16(0x000c000c000c000c); + const uint16x4_t resp_err_check = vcreate_u16(0x00e000e000e000e0); +#ifdef MLX5_PMD_SOFT_COUNTERS + uint32_t rcvd_byte = 0; +#endif + /* Mask to generate 16B length vector. */ + const uint8x8_t len_shuf_m = { + 52, 53, /* 4th CQE */ + 36, 37, /* 3rd CQE */ + 20, 21, /* 2nd CQE */ + 4, 5 /* 1st CQE */ + }; + /* Mask to extract 16B data from a 64B CQE. */ + const uint8x16_t cqe_shuf_m = { + 29, 28, /* hdr_type_etc, bswap16 */ + 0, /* pkt_info */ + -1, /* null */ + 47, 46, /* byte_cnt, bswap16 */ + 31, 30, /* vlan_info, bswap16 */ + 15, 14, 13, 12, /* rx_hash_res, bswap32 */ + 57, 58, 59, /* flow_tag */ + 63 /* op_own */ + }; + /* Mask to generate 16B data for mbuf. */ + const uint8x16_t mb_shuf_m = { + 4, 5, -1, -1, /* pkt_len */ + 4, 5, /* data_len */ + 6, 7, /* vlan_tci */ + 8, 9, 10, 11, /* hash.rss */ + 12, 13, 14, -1 /* hash.fdir.hi */ + }; + /* Mask to generate 16B owner vector. 
*/ + const uint8x8_t owner_shuf_m = { + 63, -1, /* 4th CQE */ + 47, -1, /* 3rd CQE */ + 31, -1, /* 2nd CQE */ + 15, -1 /* 1st CQE */ + }; + const uint16x8_t crc_adj = { + rxq->crc_present * ETHER_CRC_LEN, + 0, + rxq->crc_present * ETHER_CRC_LEN, + 0, 0, 0, 0, 0 + }; + const uint32x4_t flow_mark_adj = { 0, 0, 0, rxq->mark * (-1) }; + + /* Compile time sanity check for this function. */ + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, pkt_len) != + offsetof(struct rte_mbuf, rx_descriptor_fields1) + 4); + RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, data_len) != + offsetof(struct rte_mbuf, rx_descriptor_fields1) + 8); +#if (RTE_CACHE_LINE_SIZE == 128) + RTE_BUILD_BUG_ON(offsetof(struct mlx5_cqe, pkt_info) != 64); +#else + RTE_BUILD_BUG_ON(offsetof(struct mlx5_cqe, pkt_info) != 0); +#endif + RTE_BUILD_BUG_ON(offsetof(struct mlx5_cqe, rx_hash_res) != + offsetof(struct mlx5_cqe, pkt_info) + 12); + RTE_BUILD_BUG_ON(offsetof(struct mlx5_cqe, rsvd1) + + sizeof(((struct mlx5_cqe *)0)->rsvd1) != + offsetof(struct mlx5_cqe, hdr_type_etc)); + RTE_BUILD_BUG_ON(offsetof(struct mlx5_cqe, vlan_info) != + offsetof(struct mlx5_cqe, hdr_type_etc) + 2); + RTE_BUILD_BUG_ON(offsetof(struct mlx5_cqe, rsvd2) + + sizeof(((struct mlx5_cqe *)0)->rsvd2) != + offsetof(struct mlx5_cqe, byte_cnt)); + RTE_BUILD_BUG_ON(offsetof(struct mlx5_cqe, sop_drop_qpn) != + RTE_ALIGN(offsetof(struct mlx5_cqe, sop_drop_qpn), 8)); + RTE_BUILD_BUG_ON(offsetof(struct mlx5_cqe, op_own) != + offsetof(struct mlx5_cqe, sop_drop_qpn) + 7); + assert(rxq->sges_n == 0); + assert(rxq->cqe_n == rxq->elts_n); + cq = &(*rxq->cqes)[cq_idx]; + rte_prefetch0(cq); + rte_prefetch0(cq + 1); + rte_prefetch0(cq + 2); + rte_prefetch0(cq + 3); + pkts_n = RTE_MIN(pkts_n, MLX5_VPMD_RX_MAX_BURST); + /* + * Order of indexes: + * rq_ci >= cq_ci >= rq_pi + * Definition of indexes: + * rq_ci - cq_ci := # of buffers owned by HW (posted). + * cq_ci - rq_pi := # of buffers not returned to app (decompressed). 
+ * N - (rq_ci - rq_pi) := # of buffers consumed (to be replenished). + */ + repl_n = q_n - (rxq->rq_ci - rxq->rq_pi); + if (repl_n >= MLX5_VPMD_RXQ_RPLNSH_THRESH) + rxq_replenish_bulk_mbuf(rxq, repl_n); + /* See if there are unreturned mbufs from compressed CQE. */ + rcvd_pkt = rxq->cq_ci - rxq->rq_pi; + if (rcvd_pkt > 0) { + rcvd_pkt = RTE_MIN(rcvd_pkt, pkts_n); + rxq_copy_mbuf_v(rxq, pkts, rcvd_pkt); + rxq->rq_pi += rcvd_pkt; + pkts += rcvd_pkt; + } + elts_idx = rxq->rq_pi & q_mask; + elts = &(*rxq->elts)[elts_idx]; + /* Not to overflow pkts array. */ + pkts_n = RTE_ALIGN_FLOOR(pkts_n - rcvd_pkt, MLX5_VPMD_DESCS_PER_LOOP); + /* Not to cross queue end. */ + pkts_n = RTE_MIN(pkts_n, q_n - elts_idx); + if (!pkts_n) + return rcvd_pkt; + /* At this point, there shouldn't be any remaining packets. */ + assert(rxq->rq_pi == rxq->cq_ci); + /* + * A. copy 4 mbuf pointers from elts ring to returning pkts. + * B. load 64B CQE and extract necessary fields. + * The final 16-byte cqes[] extracted from the original 64-byte CQE has the + * following structure: + * struct { + * uint16_t hdr_type_etc; + * uint8_t pkt_info; + * uint8_t rsvd; + * uint16_t byte_cnt; + * uint16_t vlan_info; + * uint32_t rx_hash_res; + * uint8_t flow_tag[3]; + * uint8_t op_own; + * } c; + * C. fill in mbuf. + * D. get valid CQEs. + * E. find compressed CQE. + */ + for (pos = 0; + pos < pkts_n; + pos += MLX5_VPMD_DESCS_PER_LOOP) { + uint16x4_t op_own; + uint16x4_t opcode, owner_mask, invalid_mask; + uint16x4_t comp_mask; + uint16x4_t mask; + uint16x4_t byte_cnt; + uint8_t *p0, *p1, *p2, *p3; + uint8_t *e0 = (void *)&elts[pos]->pkt_len; + uint8_t *e1 = (void *)&elts[pos + 1]->pkt_len; + uint8_t *e2 = (void *)&elts[pos + 2]->pkt_len; + uint8_t *e3 = (void *)&elts[pos + 3]->pkt_len; + void *elts_p = (void *)&elts[pos]; + void *pkts_p = (void *)&pkts[pos]; + + /* A.0 do not cross the end of CQ. */ + mask = vcreate_u16(pkts_n - pos < MLX5_VPMD_DESCS_PER_LOOP ?
+ -1UL >> ((pkts_n - pos) * + sizeof(uint16_t) * 8) : 0); + p0 = (void *)&cq[pos].pkt_info; + p1 = p0 + (pkts_n - pos > 1) * sizeof(struct mlx5_cqe); + p2 = p1 + (pkts_n - pos > 2) * sizeof(struct mlx5_cqe); + p3 = p2 + (pkts_n - pos > 3) * sizeof(struct mlx5_cqe); + /* Prefetch next 4 CQEs. */ + if (pkts_n - pos >= 2 * MLX5_VPMD_DESCS_PER_LOOP) { + rte_prefetch0(&cq[pos + MLX5_VPMD_DESCS_PER_LOOP]); + rte_prefetch0(&cq[pos + MLX5_VPMD_DESCS_PER_LOOP + 1]); + rte_prefetch0(&cq[pos + MLX5_VPMD_DESCS_PER_LOOP + 2]); + rte_prefetch0(&cq[pos + MLX5_VPMD_DESCS_PER_LOOP + 3]); + } + __asm__ volatile ( + /* B.1 (CQE 3) load a block having op_own. */ + "ld1 {v19.16b}, [%[p3]]\n\t" + "sub %[p3], %[p3], #48\n\t" + /* B.2 (CQE 3) load the rest blocks. */ + "ld1 {v16.16b - v18.16b}, [%[p3]]\n\t" + /* B.3 (CQE 3) extract 16B fields. */ + "tbl v23.16b, {v16.16b - v19.16b}, %[cqe_shuf_m].16b\n\t" + /* B.1 (CQE 2) load a block having op_own. */ + "ld1 {v19.16b}, [%[p2]]\n\t" + "sub %[p2], %[p2], #48\n\t" + /* C.1 (CQE 3) generate final structure for mbuf. */ + "tbl v15.16b, {v23.16b}, %[mb_shuf_m].16b\n\t" + /* B.2 (CQE 2) load the rest blocks. */ + "ld1 {v16.16b - v18.16b}, [%[p2]]\n\t" + /* B.3 (CQE 2) extract 16B fields. */ + "tbl v22.16b, {v16.16b - v19.16b}, %[cqe_shuf_m].16b\n\t" + /* B.1 (CQE 1) load a block having op_own. */ + "ld1 {v19.16b}, [%[p1]]\n\t" + "sub %[p1], %[p1], #48\n\t" + /* C.1 (CQE 2) generate final structure for mbuf. */ + "tbl v14.16b, {v22.16b}, %[mb_shuf_m].16b\n\t" + /* B.2 (CQE 1) load the rest blocks. */ + "ld1 {v16.16b - v18.16b}, [%[p1]]\n\t" + /* B.3 (CQE 1) extract 16B fields. */ + "tbl v21.16b, {v16.16b - v19.16b}, %[cqe_shuf_m].16b\n\t" + /* B.1 (CQE 0) load a block having op_own. */ + "ld1 {v19.16b}, [%[p0]]\n\t" + "sub %[p0], %[p0], #48\n\t" + /* C.1 (CQE 1) generate final structure for mbuf. */ + "tbl v13.16b, {v21.16b}, %[mb_shuf_m].16b\n\t" + /* B.2 (CQE 0) load the rest blocks. 
*/ + "ld1 {v16.16b - v18.16b}, [%[p0]]\n\t" + /* B.3 (CQE 0) extract 16B fields. */ + "tbl v20.16b, {v16.16b - v19.16b}, %[cqe_shuf_m].16b\n\t" + /* A.1 load mbuf pointers. */ + "ld1 {v24.2d - v25.2d}, [%[elts_p]]\n\t" + /* D.1 extract op_own byte. */ + "tbl %[op_own].8b, {v20.16b - v23.16b}, %[owner_shuf_m].8b\n\t" + /* C.2 (CQE 3) adjust CRC length. */ + "sub v15.8h, v15.8h, %[crc_adj].8h\n\t" + /* C.3 (CQE 3) adjust flow mark. */ + "add v15.4s, v15.4s, %[flow_mark_adj].4s\n\t" + /* C.4 (CQE 3) fill in mbuf - rx_descriptor_fields1. */ + "st1 {v15.2d}, [%[e3]]\n\t" + /* C.2 (CQE 2) adjust CRC length. */ + "sub v14.8h, v14.8h, %[crc_adj].8h\n\t" + /* C.3 (CQE 2) adjust flow mark. */ + "add v14.4s, v14.4s, %[flow_mark_adj].4s\n\t" + /* C.4 (CQE 2) fill in mbuf - rx_descriptor_fields1. */ + "st1 {v14.2d}, [%[e2]]\n\t" + /* C.1 (CQE 0) generate final structure for mbuf. */ + "tbl v12.16b, {v20.16b}, %[mb_shuf_m].16b\n\t" + /* C.2 (CQE 1) adjust CRC length. */ + "sub v13.8h, v13.8h, %[crc_adj].8h\n\t" + /* C.3 (CQE 1) adjust flow mark. */ + "add v13.4s, v13.4s, %[flow_mark_adj].4s\n\t" + /* C.4 (CQE 1) fill in mbuf - rx_descriptor_fields1. */ + "st1 {v13.2d}, [%[e1]]\n\t" +#ifdef MLX5_PMD_SOFT_COUNTERS + /* Extract byte_cnt. */ + "tbl %[byte_cnt].8b, {v20.16b - v23.16b}, %[len_shuf_m].8b\n\t" +#endif + /* A.2 copy mbuf pointers. */ + "st1 {v24.2d - v25.2d}, [%[pkts_p]]\n\t" + /* C.2 (CQE 0) adjust CRC length. */ + "sub v12.8h, v12.8h, %[crc_adj].8h\n\t" + /* C.3 (CQE 0) adjust flow mark. */ + "add v12.4s, v12.4s, %[flow_mark_adj].4s\n\t" + /* C.4 (CQE 0) fill in mbuf - rx_descriptor_fields1.
*/ + "st1 {v12.2d}, [%[e0]]\n\t" + :[op_own]"=&w"(op_own), [byte_cnt]"=&w"(byte_cnt) + :[p3]"r"(p3 + 48), [p2]"r"(p2 + 48), + [p1]"r"(p1 + 48), [p0]"r"(p0 + 48), + [e3]"r"(e3), [e2]"r"(e2), [e1]"r"(e1), [e0]"r"(e0), + [elts_p]"r"(elts_p), [pkts_p]"r"(pkts_p), + [cqe_shuf_m]"w"(cqe_shuf_m), [mb_shuf_m]"w"(mb_shuf_m), + [owner_shuf_m]"w"(owner_shuf_m), [len_shuf_m]"w"(len_shuf_m), + [crc_adj]"w"(crc_adj), [flow_mark_adj]"w"(flow_mark_adj) + :"memory", + "v12", "v13", "v14", "v15", + "v16", "v17", "v18", "v19", + "v20", "v21", "v22", "v23", + "v24", "v25"); + /* D.2 flip owner bit to mark CQEs from last round. */ + owner_mask = vand_u16(op_own, owner_check); + owner_mask = vceq_u16(owner_mask, ownership); + /* D.3 get mask for invalidated CQEs. */ + opcode = vand_u16(op_own, opcode_check); + invalid_mask = vceq_u16(opcode_check, opcode); + /* E.1 find compressed CQE format. */ + comp_mask = vand_u16(op_own, format_check); + comp_mask = vceq_u16(comp_mask, format_check); + /* D.4 mask out beyond boundary. */ + invalid_mask = vorr_u16(invalid_mask, mask); + /* D.5 merge invalid_mask with invalid owner. */ + invalid_mask = vorr_u16(invalid_mask, owner_mask); + /* E.2 mask out invalid entries. */ + comp_mask = vbic_u16(comp_mask, invalid_mask); + /* E.3 get the first compressed CQE. */ + comp_idx = __builtin_clzl(vget_lane_u64(vreinterpret_u64_u16( + comp_mask), 0)) / (sizeof(uint16_t) * 8); + /* D.6 mask out entries after the compressed CQE. */ + mask = vcreate_u16(comp_idx < MLX5_VPMD_DESCS_PER_LOOP ? + -1UL >> (comp_idx * sizeof(uint16_t) * 8) : 0); + invalid_mask = vorr_u16(invalid_mask, mask); + /* D.7 count non-compressed valid CQEs. */ + n = __builtin_clzl(vget_lane_u64(vreinterpret_u64_u16( + invalid_mask), 0)) / (sizeof(uint16_t) * 8); + nocmp_n += n; + /* D.2 get the final invalid mask. */ + mask = vcreate_u16(n < MLX5_VPMD_DESCS_PER_LOOP ? + -1UL >> (n * sizeof(uint16_t) * 8) : 0); + /* TODO: the following isn't needed. 
*/ + invalid_mask = vorr_u16(invalid_mask, mask); + /* D.3 check error in opcode. */ + opcode = vceq_u16(resp_err_check, opcode); + opcode = vbic_u16(opcode, invalid_mask); + /* D.4 mark if any error is set */ + rxq->pending_err |= + !!vget_lane_u64(vreinterpret_u64_u16(opcode), 0); + /* C.5 fill in mbuf - rearm_data and packet_type. */ + /* TODO: + rxq_cq_to_ptype_oflags_v(rxq, cqes, opcode, &pkts[pos]); */ +#ifdef MLX5_PMD_SOFT_COUNTERS + /* Add up received bytes count. */ + byte_cnt = vbic_u16(byte_cnt, invalid_mask); + rcvd_byte += vget_lane_u64(vpaddl_u32(vpaddl_u16(byte_cnt)), 0); +#endif + /* + * Break the loop unless more valid CQE is expected, or if + * there's a compressed CQE. + */ + if (n != MLX5_VPMD_DESCS_PER_LOOP) + break; + } + /* If no new CQE seen, return without updating cq_db. */ + if (unlikely(!nocmp_n && comp_idx == MLX5_VPMD_DESCS_PER_LOOP)) + return rcvd_pkt; + /* Update the consumer indexes for non-compressed CQEs. */ + assert(nocmp_n <= pkts_n); + rxq->cq_ci += nocmp_n; + rxq->rq_pi += nocmp_n; + rcvd_pkt += nocmp_n; +#ifdef MLX5_PMD_SOFT_COUNTERS + rxq->stats.ipackets += nocmp_n; + rxq->stats.ibytes += rcvd_byte; +#endif + /* Decompress the last CQE if compressed. */ + if (comp_idx < MLX5_VPMD_DESCS_PER_LOOP && comp_idx == n) { + assert(comp_idx == (nocmp_n % MLX5_VPMD_DESCS_PER_LOOP)); + rxq_cq_decompress_v(rxq, &cq[nocmp_n], &elts[nocmp_n]); + /* Return more packets if needed. */ + if (nocmp_n < pkts_n) { + uint16_t n = rxq->cq_ci - rxq->rq_pi; + + n = RTE_MIN(n, pkts_n - nocmp_n); + rxq_copy_mbuf_v(rxq, &pkts[nocmp_n], n); + rxq->rq_pi += n; + rcvd_pkt += n; + } + } + rte_compiler_barrier(); +#ifdef SW_EMULATION + *rxq->cq_db = 0; +#else + *rxq->cq_db = htonl(rxq->cq_ci); +#endif + return rcvd_pkt; +} + +/** + * DPDK callback for vectorized RX. + * + * @param dpdk_rxq + * Generic pointer to RX queue structure. + * @param[out] pkts + * Array to store received packets. + * @param pkts_n + * Maximum number of packets in array. 
+ * + * @return + * Number of packets successfully received (<= pkts_n). + */ +uint16_t +mlx5_rx_burst_vec(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n) +{ + struct rxq *rxq = dpdk_rxq; + uint16_t nb_rx; + + nb_rx = rxq_burst_v(rxq, pkts, pkts_n); + if (unlikely(rxq->pending_err)) + nb_rx = rxq_handle_pending_error(rxq, pkts, nb_rx); + return nb_rx; +} + +/** + * Check Tx queue flags are set for raw vectorized Tx. + * + * @param priv + * Pointer to private structure. + * + * @return + * 1 if supported, negative errno value if not. + */ +int __attribute__((cold)) +priv_check_raw_vec_tx_support(struct priv *priv) +{ + uint16_t i; + + /* All the configured queues should support. */ + for (i = 0; i < priv->txqs_n; ++i) { + struct txq *txq = (*priv->txqs)[i]; + + if (!(txq->flags & ETH_TXQ_FLAGS_NOMULTSEGS) || + !(txq->flags & ETH_TXQ_FLAGS_NOOFFLOADS)) + break; + } + if (i != priv->txqs_n) + return -ENOTSUP; + return 1; +} + +/** + * Check a device can support vectorized TX. + * + * @param priv + * Pointer to private structure. + * + * @return + * 1 if supported, negative errno value if not. + */ +int __attribute__((cold)) +priv_check_vec_tx_support(struct priv *priv) +{ + if (!priv->tx_vec_en || + priv->txqs_n > MLX5_VPMD_MIN_TXQS || + priv->mps != MLX5_MPW_ENHANCED || + priv->tso) + return -ENOTSUP; + return 1; +} + +/** + * Check a RX queue can support vectorized RX. + * + * @param rxq + * Pointer to RX queue. + * + * @return + * 1 if supported, negative errno value if not. + */ +int __attribute__((cold)) +rxq_check_vec_support(struct rxq *rxq) +{ + struct rxq_ctrl *ctrl = container_of(rxq, struct rxq_ctrl, rxq); + + if (!ctrl->priv->rx_vec_en || rxq->sges_n != 0) + return -ENOTSUP; + return 1; +} + +/** + * Check a device can support vectorized RX. + * + * @param priv + * Pointer to private structure. + * + * @return + * 1 if supported, negative errno value if not. 
+ */ +int __attribute__((cold)) +priv_check_vec_rx_support(struct priv *priv) +{ + uint16_t i; + + if (!priv->rx_vec_en) + return -ENOTSUP; + /* All the configured queues should support. */ + for (i = 0; i < priv->rxqs_n; ++i) { + struct rxq *rxq = (*priv->rxqs)[i]; + + if (rxq_check_vec_support(rxq) < 0) + break; + } + if (i != priv->rxqs_n) + return -ENOTSUP; + return 1; +} + +/** + * Prepare for vectorized RX. + * + * @param priv + * Pointer to private structure. + */ +void +priv_prep_vec_rx_function(struct priv *priv) +{ + uint16_t i; + + for (i = 0; i < priv->rxqs_n; ++i) { + struct rxq *rxq = (*priv->rxqs)[i]; + struct rte_mbuf *mbuf_init = &rxq->fake_mbuf; + const uint16_t desc = 1 << rxq->elts_n; + int j; + + assert(rxq->elts_n == rxq->cqe_n); + /* Initialize default rearm_data for vPMD. */ + mbuf_init->data_off = RTE_PKTMBUF_HEADROOM; + rte_mbuf_refcnt_set(mbuf_init, 1); + mbuf_init->nb_segs = 1; + mbuf_init->port = rxq->port_id; + /* + * prevent compiler reordering: + * rearm_data covers previous fields. + */ + rte_compiler_barrier(); + rxq->mbuf_initializer = + *(uint64_t *)&mbuf_init->rearm_data; + /* Padding with a fake mbuf for vectorized Rx. */ + for (j = 0; j < MLX5_VPMD_DESCS_PER_LOOP; ++j) + (*rxq->elts)[desc + j] = &rxq->fake_mbuf; + /* Mark that it needs to be cleaned up for rxq_alloc_elts(). */ + rxq->trim_elts = 1; + } +}
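
Note on the lane counting in steps E.3 and D.7 of rxq_burst_v(): both steps turn a four-lane 16-bit mask into an index by counting leading zero bits and dividing by 16, then rebuild a mask covering every lane from that index onwards. The sketch below is not part of the patch. It restates the idea in portable C, assuming one 16-bit lane per CQE with the first CQE in the most significant lane, which is what owner_shuf_m arranges. The names leading_clear_lanes(), tail_mask() and DESCS_PER_LOOP are hypothetical stand-ins, and __builtin_clzll is a GCC/Clang builtin. Unlike the patch, the sketch tests for a zero mask explicitly, because the builtin's result is undefined for a zero argument.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for MLX5_VPMD_DESCS_PER_LOOP. */
#define DESCS_PER_LOOP 4

/*
 * Count how many 16-bit lanes are zero, starting from the most
 * significant lane (assumed here to hold the first CQE). An all-zero
 * mask yields DESCS_PER_LOOP; the explicit check is an addition of
 * this sketch, since __builtin_clzll(0) is undefined.
 */
static inline unsigned int
leading_clear_lanes(uint64_t mask)
{
	if (mask == 0)
		return DESCS_PER_LOOP;
	return (unsigned int)__builtin_clzll(mask) / (sizeof(uint16_t) * 8);
}

/*
 * Build a mask that sets every lane from index n onwards, mirroring
 * the "-1UL >> (n * 16)" expressions in the patch. n == DESCS_PER_LOOP
 * yields an empty mask and avoids shifting by the full word width.
 */
static inline uint64_t
tail_mask(unsigned int n)
{
	assert(n <= DESCS_PER_LOOP);
	return n < DESCS_PER_LOOP ?
		UINT64_MAX >> (n * sizeof(uint16_t) * 8) : 0;
}
```

With these helpers, a mask of 0x0000ffff00000000 (only the second lane set) gives an index of 1, and tail_mask(1) then marks the second through fourth lanes, which corresponds to how step D.6 discards entries after the first compressed CQE.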