net/mlx5: fix Tx doorbell write memory barrier

Message ID 1573817706-19322-1-git-send-email-viacheslavo@mellanox.com (mailing list archive)
State Accepted, archived
Delegated to: Raslan Darawsheh
Headers
Series net/mlx5: fix Tx doorbell write memory barrier |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-compilation success Compile Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/travis-robot success Travis build: passed

Commit Message

Slava Ovsiienko Nov. 15, 2019, 11:35 a.m. UTC
  As the result of testing it was found that some hosts have
the performance penalty imposed by required write memory barrier
after doorbell writing. Before 19.08 release there was some
heuristics to decide whether write memory barrier should be
performed. For the bursts of recommended size (or multiple)
it was supposed there were some extra ongoing packets in the
next burst and write memory barrier may be skipped (supposed
to be performed in the next burst, at least after descriptor
writing).

This patch restores that behaviour, the devargs tx_db_nc=2
must be specified to engage this performance tuning feature.

Fixes: 8409a28573d3 ("net/mlx5: control transmit doorbell register mapping")

Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
 doc/guides/nics/mlx5.rst       | 24 +++++++++++++++++-------
 drivers/net/mlx5/mlx5.c        | 17 +++++++++++++----
 drivers/net/mlx5/mlx5_defs.h   | 10 ++++++++++
 drivers/net/mlx5/mlx5_ethdev.c |  4 ++--
 drivers/net/mlx5/mlx5_rxtx.c   | 12 +++++++++++-
 drivers/net/mlx5/mlx5_rxtx.h   |  1 +
 drivers/net/mlx5/mlx5_txq.c    |  2 ++
 7 files changed, 56 insertions(+), 14 deletions(-)
  

Comments

Raslan Darawsheh Nov. 18, 2019, 4:25 p.m. UTC | #1
Hi,

> -----Original Message-----
> From: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> Sent: Friday, November 15, 2019 1:35 PM
> To: dev@dpdk.org
> Cc: Matan Azrad <matan@mellanox.com>; Raslan Darawsheh
> <rasland@mellanox.com>; Ori Kam <orika@mellanox.com>
> Subject: [PATCH] net/mlx5: fix Tx doorbell write memory barrier
> 
> As the result of testing it was found that some hosts have
> the performance penalty imposed by required write memory barrier
> after doorbell writing. Before 19.08 release there was some
> heuristics to decide whether write memory barrier should be
> performed. For the bursts of recommended size (or multiple)
> it was supposed there were some extra ongoing packets in the
> next burst and write memory barrier may be skipped (supposed
> to be performed in the next burst, at least after descriptor
> writing).
> 
> This patch restores that behaviour, the devargs tx_db_nc=2
> must be specified to engage this performance tuning feature.
> 
> Fixes: 8409a28573d3 ("net/mlx5: control transmit doorbell register mapping")
> 
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> ---


Patch applied to next-net-mlx,

Kindest regards,
Raslan Darawsheh
  

Patch

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index fd5a326..9351de6 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -565,17 +565,27 @@  Run-time configuration
   The type of mapping may slightly affect the Tx performance, the optimal choice
   is strongly relied on the host architecture and should be deduced practically.
 
-  If ``tx_db_nc`` is either omitted or set to zero, the doorbell is forced to be
-  mapped to regular memory, the PMD will perform the extra write memory barrier
-  after writing to doorbell, it might increase the needed CPU clocks per packet
-  to send, but latency might be improved.
+  If ``tx_db_nc`` is set to zero, the doorbell is forced to be mapped to regular
+  memory, the PMD will perform the extra write memory barrier after writing to
+  doorbell, it might increase the needed CPU clocks per packet to send, but
+  latency might be improved.
 
-  If ``tx_db_nc`` is set to not zero, the doorbell is forced to be mapped to
-  non cached memory, the PMD will not perform the extra write memory barrier
+  If ``tx_db_nc`` is set to one, the doorbell is forced to be mapped to non
+  cached memory, the PMD will not perform the extra write memory barrier
   after writing to doorbell, on some architectures it might improve the
   performance.
 
-  The default ``tx_db_nc`` value is zero ARM64 hosts and one for others.
+  If ``tx_db_nc`` is set to two, the doorbell is forced to be mapped to regular
+  memory, the PMD will use heuristics to decide whether write memory barrier
+  should be performed. For bursts with size multiple of recommended one (64 pkts)
+  it is supposed the next burst is coming and no need to issue the extra memory
+  barrier (it is supposed to be issued in the next coming burst, at least after
+  descriptor writing). It might increase latency (on some hosts till next
+  packets transmit) and should be used with care.
+
+  If ``tx_db_nc`` is omitted or set to zero, the preset (if any) environment
+  variable "MLX5_SHUT_UP_BF" value is used. If there is no "MLX5_SHUT_UP_BF",
+  the default ``tx_db_nc`` value is zero for ARM64 hosts and one for others.
 
 - ``tx_vec_en`` parameter [int]
 
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 35baaf7..ad22f89 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -443,7 +443,8 @@  struct mlx5_flow_id_pool *
 	if (config->dbnc == MLX5_ARG_UNSET)
 		setenv(MLX5_SHUT_UP_BF, MLX5_SHUT_UP_BF_DEFAULT, 1);
 	else
-		setenv(MLX5_SHUT_UP_BF, config->dbnc ? "1" : "0", 1);
+		setenv(MLX5_SHUT_UP_BF,
+		       config->dbnc == MLX5_TXDB_NCACHED ? "1" : "0", 1);
 	return value;
 }
 
@@ -1387,7 +1388,15 @@  struct mlx5_flow_id_pool *
 	} else if (strcmp(MLX5_TXQ_MPW_EN, key) == 0) {
 		config->mps = !!tmp;
 	} else if (strcmp(MLX5_TX_DB_NC, key) == 0) {
-		config->dbnc = !!tmp;
+		if (tmp != MLX5_TXDB_CACHED &&
+		    tmp != MLX5_TXDB_NCACHED &&
+		    tmp != MLX5_TXDB_HEURISTIC) {
+			DRV_LOG(ERR, "invalid Tx doorbell "
+				     "mapping parameter");
+			rte_errno = EINVAL;
+			return -rte_errno;
+		}
+		config->dbnc = tmp;
 	} else if (strcmp(MLX5_TXQ_MPW_HDR_DSEG_EN, key) == 0) {
 		DRV_LOG(WARNING, "%s: deprecated parameter, ignored", key);
 	} else if (strcmp(MLX5_TXQ_MAX_INLINE_LEN, key) == 0) {
@@ -1410,8 +1419,8 @@  struct mlx5_flow_id_pool *
 		if (tmp != MLX5_XMETA_MODE_LEGACY &&
 		    tmp != MLX5_XMETA_MODE_META16 &&
 		    tmp != MLX5_XMETA_MODE_META32) {
-			DRV_LOG(WARNING, "invalid extensive "
-					 "metadata parameter");
+			DRV_LOG(ERR, "invalid extensive "
+				     "metadata parameter");
 			rte_errno = EINVAL;
 			return -rte_errno;
 		}
diff --git a/drivers/net/mlx5/mlx5_defs.h b/drivers/net/mlx5/mlx5_defs.h
index 03060aa..9e113c5 100644
--- a/drivers/net/mlx5/mlx5_defs.h
+++ b/drivers/net/mlx5/mlx5_defs.h
@@ -97,6 +97,10 @@ 
 /* Maximum size of burst for vectorized Rx. */
 #define MLX5_VPMD_RX_MAX_BURST 64U
 
+/* Recommended optimal burst size. */
+#define MLX5_RX_DEFAULT_BURST 64U
+#define MLX5_TX_DEFAULT_BURST 64U
+
 /* Number of packets vectorized Rx can simultaneously process in a loop. */
 #define MLX5_VPMD_DESCS_PER_LOOP      4
 
@@ -157,10 +161,16 @@ 
 /* Cache size of mempool for Multi-Packet RQ. */
 #define MLX5_MPRQ_MP_CACHE_SZ 32U
 
+/* MLX5_DV_XMETA_EN supported values. */
 #define MLX5_XMETA_MODE_LEGACY 0
 #define MLX5_XMETA_MODE_META16 1
 #define MLX5_XMETA_MODE_META32 2
 
+/* MLX5_TX_DB_NC supported values. */
+#define MLX5_TXDB_CACHED 0
+#define MLX5_TXDB_NCACHED 1
+#define MLX5_TXDB_HEURISTIC 2
+
 /* Size of the simple hash table for metadata register table. */
 #define MLX5_FLOW_MREG_HTABLE_SZ 4096
 #define MLX5_FLOW_MREG_HNAME "MARK_COPY_TABLE"
diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
index 11efb2d..98eddc7 100644
--- a/drivers/net/mlx5/mlx5_ethdev.c
+++ b/drivers/net/mlx5/mlx5_ethdev.c
@@ -532,8 +532,8 @@  struct ethtool_link_settings {
 	/* Minimum CPU utilization. */
 	info->default_rxportconf.ring_size = 256;
 	info->default_txportconf.ring_size = 256;
-	info->default_rxportconf.burst_size = 64;
-	info->default_txportconf.burst_size = 64;
+	info->default_rxportconf.burst_size = MLX5_RX_DEFAULT_BURST;
+	info->default_txportconf.burst_size = MLX5_TX_DEFAULT_BURST;
 	if (priv->link_speed_capa & ETH_LINK_SPEED_100G) {
 		info->default_rxportconf.nb_queues = 16;
 		info->default_txportconf.nb_queues = 16;
diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 1ea5960..3762002 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -4769,8 +4769,18 @@  enum mlx5_txcmp_code {
 	 *   impact under heavy loading conditions but the explicit write
 	 *   memory barrier is not required and it may improve core
 	 *   performance.
+	 *
+	 * - the legacy behaviour (prior 19.08 release) was to use some
+	 *   heuristics to decide whether write memory barrier should
+	 *   be performed. This behavior is supported with specifying
+	 *   tx_db_nc=2, write barrier is skipped if application
+	 *   provides the full recommended burst of packets, it
+	 *   supposes the next packets are coming and the write barrier
+	 *   will be issued on the next burst (after descriptor writing,
+	 *   at least).
 	 */
-	mlx5_tx_dbrec_cond_wmb(txq, loc.wqe_last, !txq->db_nc);
+	mlx5_tx_dbrec_cond_wmb(txq, loc.wqe_last, !txq->db_nc &&
+			(!txq->db_heu || pkts_n % MLX5_TX_DEFAULT_BURST));
 	/* Not all of the mbufs may be stored into elts yet. */
 	part = MLX5_TXOFF_CONFIG(INLINE) ? 0 : loc.pkts_sent - loc.pkts_copy;
 	if (!MLX5_TXOFF_CONFIG(INLINE) && part) {
diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index 88c50aa..e927343 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -288,6 +288,7 @@  struct mlx5_txq_data {
 	uint16_t swp_en:1; /* Whether SW parser is enabled. */
 	uint16_t vlan_en:1; /* VLAN insertion in WQE is supported. */
 	uint16_t db_nc:1; /* Doorbell mapped to non-cached region. */
+	uint16_t db_heu:1; /* Doorbell heuristic write barrier. */
 	uint16_t inlen_send; /* Ordinary send data inline size. */
 	uint16_t inlen_empw; /* eMPW max packet size to inline. */
 	uint16_t inlen_mode; /* Minimal data length to inline. */
diff --git a/drivers/net/mlx5/mlx5_txq.c b/drivers/net/mlx5/mlx5_txq.c
index a0d6164..1bf5ebe 100644
--- a/drivers/net/mlx5/mlx5_txq.c
+++ b/drivers/net/mlx5/mlx5_txq.c
@@ -313,8 +313,10 @@ 
 static void
 txq_uar_ncattr_init(struct mlx5_txq_ctrl *txq_ctrl, size_t page_size)
 {
+	struct mlx5_priv *priv = txq_ctrl->priv;
 	unsigned int cmd;
 
+	txq_ctrl->txq.db_heu = priv->config.dbnc == MLX5_TXDB_HEURISTIC;
 	txq_ctrl->txq.db_nc = 0;
 	/* Check the doorbell register mapping type. */
 	cmd = txq_ctrl->uar_mmap_offset / page_size;