[v01] net/af_packet: add rollover and defrag options

Message ID eadcf23b55de0fcc462024dd56269b86f856484a.1730209551.git.gur.stavi@huawei.com (mailing list archive)
State Superseded, archived
Delegated to: Ferruh Yigit
Headers
Series [v01] net/af_packet: add rollover and defrag options |

Checks

Context Check Description
ci/checkpatch warning coding style issues
ci/loongarch-compilation success Compilation OK
ci/loongarch-unit-testing success Unit Testing PASS
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS
ci/intel-Functional success Functional PASS
ci/github-robot: build success github build: passed
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-marvell-Functional success Functional Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-compile-arm64-testing success Testing PASS
ci/iol-unit-arm64-testing warning Testing issues
ci/iol-unit-amd64-testing success Testing PASS
ci/iol-compile-amd64-testing success Testing PASS
ci/iol-sample-apps-testing success Testing PASS

Commit Message

Gur Stavi Oct. 29, 2024, 1:48 p.m. UTC
net_af_packet PMD multi "queue" support relies on Linux FANOUT capability.
Linux FANOUT is a SW based load balancer that is similar to HW RSS which is
more common for DPDK PMDs. Instead of multiple HW descriptor queues, AF PACKET
uses multiple sockets.
HW RSS will typically drop a packet if its selected RX queue is empty. However,
Linux FANOUT, as a SW load balancer, can be configured to avoid this packet
drop by rolling over to the next socket.
This rollover functionality was ALWAYS enabled in net_af_packet. It is
surrounded by ifdef, but only to allow compilation on ancient Linux versions
that did not have it.

Since DPDK applications are usually designed for HW based PMDs, this rollover
functionality, which the developers are likely unaware of, could be confusing.

Another option that is part of Linux FANOUT is DEFRAG that instructs Linux to
compose complete IP packet out of fragments before delivering it to the PACKET
socket. Again, this behavior typically does not exist for HW based PMDs and may
confuse users.

This patch adds 2 options to control these features:
rollover=[0|1],defrag=[0|1]
For backward compatibility both features are enabled by default even though most
users will probably want both of them disabled.

Signed-off-by: Gur Stavi <gur.stavi@huawei.com>
---
 doc/guides/nics/af_packet.rst             |  4 ++
 drivers/net/af_packet/rte_eth_af_packet.c | 63 +++++++++++++++++------
 2 files changed, 51 insertions(+), 16 deletions(-)


base-commit: 98613d32e3dac58d685f4f236cf8cc9733abaaf3
  

Comments

Stephen Hemminger Oct. 29, 2024, 4:09 p.m. UTC | #1
On Tue, 29 Oct 2024 15:48:05 +0200
Gur Stavi <gur.stavi@huawei.com> wrote:

> net_af_packet PMD multi "queue" support relies on Linux FANOUT capability.
> Linux FANOUT is a SW based load balancer that is similar to HW RSS which is
> more common for DPDK PMDs. Instead of multiple HW descriptor queues, AF PACKET
> uses multiple sockets.
> HW RSS will typically drop a packet if its selected RX queue is empty. However,
> Linux FANOUT, as a SW load balancer, can be configured to avoid this packet
> drop by rolling over to the next socket.
> This rollover functionality was ALWAYS enabled in net_af_packet. It is
> surrounded by ifdef, but only to allow compilation on ancient Linux versions
> that did not have it.
> 
> Since DPDK applications are usually designed for HW based PMDs, this rollover
> functionality, which the developers are likely unaware of, could be confusing.
> 
> Another option that is part of Linux FANOUT is DEFRAG that instructs Linux to
> compose complete IP packet out of fragments before delivering it to the PACKET
> socket. Again, this behavior typically does not exist for HW based PMDs and may
> confuse users.
> 
> This patch adds 2 options to control these features:
> rollover=[0|1],defrag=[0|1]
> For backward compatibility both features are enabled by default even though most
> users will probably want both of them disabled.
> 
> Signed-off-by: Gur Stavi <gur.stavi@huawei.com>
> ---

Makes sense to expose kernel options. But have all combinations been tested?
  
Gur Stavi Oct. 29, 2024, 6:27 p.m. UTC | #2
> On Tue, 29 Oct 2024 15:48:05 +0200
> Gur Stavi <gur.stavi@huawei.com> wrote:
> 
> > net_af_packet PMD multi "queue" support relies on Linux FANOUT
> capability.
> > Linux FANOUT is a SW based load balancer that is similar to HW RSS
> which is
> > more common for DPDK PMDs. Instead of multiple HW descriptor queues, AF
> PACKET
> > uses multiple sockets.
> > HW RSS will typically drop a packet if its selected RX queue is empty.
> However,
> > Linux FANOUT, as a SW load balancer, can be configured to avoid this
> packet
> > drop by rolling over to the next socket.
> > This rollover functionality was ALWAYS enabled in net_af_packet. It is
> > surrounded by ifdef, but only to allow compilation on ancient Linux
> versions
> > that did not have it.
> >
> > Since DPDK applications are usually designed for HW based PMDs, this
> rollover
> > functionality, which the developers are likely unaware of, could be
> confusing.
> >
> > Another option that is part of Linux FANOUT is DEFRAG that instructs
> Linux to
> > compose complete IP packet out of fragments before delivering it to the
> PACKET
> > socket. Again, this behavior typically does not exist for HW based PMDs
> and may
> > confuse users.
> >
> > This patch adds 2 options to control these features:
> > rollover=[0|1],defrag=[0|1]
> > For backward compatibility both features are enabled by default even
> though most
> > users will probably want both of them disabled.
> >
> > Signed-off-by: Gur Stavi <gur.stavi@huawei.com>
> > ---
> 
> Makes sense to expose kernel options. But have all combinations been
> tested?

I tested the command line parsing of the new options.
I tested that the PMD is initialized successfully with all combinations.
I did not test if Linux actually behaves according to the man page. I did
not test traffic at all.
  
Stephen Hemminger Oct. 30, 2024, 2:41 a.m. UTC | #3
On Tue, 29 Oct 2024 15:48:05 +0200
Gur Stavi <gur.stavi@huawei.com> wrote:

> +		if (strstr(pair->key, ETH_AF_PACKET_ROLLOVER) != NULL) {
> +			rollover = atoi(pair->value);
> +			if (rollover != 0 && rollover != 1) {
> +				PMD_LOG(ERR,
> +					"%s: invalid rollover value",
> +				        name);
> +				return -1;
> +			}
> +			continue;
> +		}

The problem is that atoi() provides little to no error handling.
Prefer using strtoul() and/or having a common routine for parsing flag values.
  
Gur Stavi Oct. 30, 2024, 6:22 a.m. UTC | #4
> > +		if (strstr(pair->key, ETH_AF_PACKET_ROLLOVER) != NULL) {
> > +			rollover = atoi(pair->value);
> > +			if (rollover != 0 && rollover != 1) {
> > +				PMD_LOG(ERR,
> > +					"%s: invalid rollover value",
> > +				        name);
> > +				return -1;
> > +			}
> > +			continue;
> > +		}
> 
> The problem is that atoi() provides little to no error handling.
> Prefer using strtoul() and/or having a common routine for parsing flag
> values.

This block was copy-pasted from the handling of the other options.
I even copied by mistake the indentation error that checkpatch
complained about.

Do you want the atoi to be removed from the old code as well or just
from the new code?
  
Stephen Hemminger Oct. 30, 2024, 2:59 p.m. UTC | #5
On Wed, 30 Oct 2024 08:22:16 +0200
Gur Stavi <gur.stavi@huawei.com> wrote:

> > > +		if (strstr(pair->key, ETH_AF_PACKET_ROLLOVER) != NULL) {
> > > +			rollover = atoi(pair->value);
> > > +			if (rollover != 0 && rollover != 1) {
> > > +				PMD_LOG(ERR,
> > > +					"%s: invalid rollover value",
> > > +				        name);
> > > +				return -1;
> > > +			}
> > > +			continue;
> > > +		}  
> > 
> > The problem is that atoi() provides little to no error handling.
> > Prefer using strtoul() and/or having a common routine for parsing flag
> > values.  
> 
> This block was copy-pasted from the handling of the other options.
> I even copied by mistake the indentation error that checkpatch
> complained about.
> 
> Do you want the atoi to be removed from the old code as well or just
> from the new code?
> 
> 

Lets fix that as another patch later.
  

Patch

diff --git a/doc/guides/nics/af_packet.rst b/doc/guides/nics/af_packet.rst
index 66b977e1a2..4bc748b50b 100644
--- a/doc/guides/nics/af_packet.rst
+++ b/doc/guides/nics/af_packet.rst
@@ -29,6 +29,8 @@  Some of these, in turn, will be used to configure the PACKET_MMAP settings.
 *   ``framesz`` - PACKET_MMAP frame size (optional, default 2048B; Note: multiple
     of 16B);
 *   ``framecnt`` - PACKET_MMAP frame count (optional, default 512).
+*   ``rollover`` - disable(0)/enable(1) PACKET_FANOUT_ROLLOVER (optional, default 1).
+*   ``defrag`` - disable(0)/enable(1) PACKET_FANOUT_FLAG_DEFRAG (optional, default 1).
 
 Because this implementation is based on PACKET_MMAP, and PACKET_MMAP has its
 own pre-requisites, it should be noted that the inner workings of PACKET_MMAP
@@ -47,6 +49,8 @@  For the full details behind PACKET_MMAP's structures and settings, consider
 reading the `PACKET_MMAP documentation in the Kernel
 <https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt>`_.
 
+For details of ``rollover`` anb ``defrag`` refer to the *packet(7)* man page.
+
 Prerequisites
 -------------
 
diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
index 68870dd3e2..5ae76d2b2a 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -36,6 +36,8 @@ 
 #define ETH_AF_PACKET_FRAMESIZE_ARG	"framesz"
 #define ETH_AF_PACKET_FRAMECOUNT_ARG	"framecnt"
 #define ETH_AF_PACKET_QDISC_BYPASS_ARG	"qdisc_bypass"
+#define ETH_AF_PACKET_ROLLOVER		"rollover"
+#define ETH_AF_PACKET_DEFRAG		"defrag"
 
 #define DFLT_FRAME_SIZE		(1 << 11)
 #define DFLT_FRAME_COUNT	(1 << 9)
@@ -91,6 +93,8 @@  static const char *valid_arguments[] = {
 	ETH_AF_PACKET_FRAMESIZE_ARG,
 	ETH_AF_PACKET_FRAMECOUNT_ARG,
 	ETH_AF_PACKET_QDISC_BYPASS_ARG,
+	ETH_AF_PACKET_ROLLOVER,
+	ETH_AF_PACKET_DEFRAG,
 	NULL
 };
 
@@ -676,6 +680,7 @@  rte_pmd_init_internals(struct rte_vdev_device *dev,
 	const unsigned int numa_node = dev->device.numa_node;
 	struct rte_eth_dev_data *data = NULL;
 	struct rte_kvargs_pair *pair = NULL;
+	const char *ifname = NULL;
 	struct ifreq ifr;
 	size_t ifnamelen;
 	unsigned k_idx;
@@ -686,6 +691,8 @@  rte_pmd_init_internals(struct rte_vdev_device *dev,
 	int rc, tpver, discard;
 	int qsockfd = -1;
 	unsigned int i, q, rdsize;
+	unsigned int rollover = 1;
+	unsigned int defrag = 1;
 #if defined(PACKET_FANOUT)
 	int fanout_arg;
 #endif
@@ -693,9 +700,30 @@  rte_pmd_init_internals(struct rte_vdev_device *dev,
 	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
 		pair = &kvlist->pairs[k_idx];
 		if (strstr(pair->key, ETH_AF_PACKET_IFACE_ARG) != NULL)
-			break;
+			ifname = pair->value;
+		if (strstr(pair->key, ETH_AF_PACKET_ROLLOVER) != NULL) {
+			rollover = atoi(pair->value);
+			if (rollover != 0 && rollover != 1) {
+				PMD_LOG(ERR,
+					"%s: invalid rollover value",
+				        name);
+				return -1;
+			}
+			continue;
+		}
+		if (strstr(pair->key, ETH_AF_PACKET_DEFRAG) != NULL) {
+			defrag = atoi(pair->value);
+			if (defrag != 0 && defrag != 1) {
+				PMD_LOG(ERR,
+					"%s: invalid defrag value",
+				        name);
+				return -1;
+			}
+			continue;
+		}
 	}
-	if (pair == NULL) {
+
+	if (ifname == NULL) {
 		PMD_LOG(ERR,
 			"%s: no interface specified for AF_PACKET ethdev",
 		        name);
@@ -738,21 +766,21 @@  rte_pmd_init_internals(struct rte_vdev_device *dev,
 	req->tp_frame_size = framesize;
 	req->tp_frame_nr = framecnt;
 
-	ifnamelen = strlen(pair->value);
+	ifnamelen = strlen(ifname);
 	if (ifnamelen < sizeof(ifr.ifr_name)) {
-		memcpy(ifr.ifr_name, pair->value, ifnamelen);
+		memcpy(ifr.ifr_name, ifname, ifnamelen);
 		ifr.ifr_name[ifnamelen] = '\0';
 	} else {
 		PMD_LOG(ERR,
 			"%s: I/F name too long (%s)",
-			name, pair->value);
+			name, ifname);
 		goto free_internals;
 	}
 	if (ioctl(sockfd, SIOCGIFINDEX, &ifr) == -1) {
 		PMD_LOG_ERRNO(ERR, "%s: ioctl failed (SIOCGIFINDEX)", name);
 		goto free_internals;
 	}
-	(*internals)->if_name = strdup(pair->value);
+	(*internals)->if_name = strdup(ifname);
 	if ((*internals)->if_name == NULL)
 		goto free_internals;
 	(*internals)->if_index = ifr.ifr_ifindex;
@@ -770,9 +798,12 @@  rte_pmd_init_internals(struct rte_vdev_device *dev,
 
 #if defined(PACKET_FANOUT)
 	fanout_arg = (getpid() ^ (*internals)->if_index) & 0xffff;
-	fanout_arg |= (PACKET_FANOUT_HASH | PACKET_FANOUT_FLAG_DEFRAG) << 16;
+	fanout_arg |= PACKET_FANOUT_HASH << 16;
+	if (defrag)
+		fanout_arg |= PACKET_FANOUT_FLAG_DEFRAG << 16;
 #if defined(PACKET_FANOUT_FLAG_ROLLOVER)
-	fanout_arg |= PACKET_FANOUT_FLAG_ROLLOVER << 16;
+	if (rollover)
+		fanout_arg |= PACKET_FANOUT_FLAG_ROLLOVER << 16;
 #endif
 #endif
 
@@ -792,7 +823,7 @@  rte_pmd_init_internals(struct rte_vdev_device *dev,
 		if (rc == -1) {
 			PMD_LOG_ERRNO(ERR,
 				"%s: could not set PACKET_VERSION on AF_PACKET socket for %s",
-				name, pair->value);
+				name, ifname);
 			goto error;
 		}
 
@@ -802,7 +833,7 @@  rte_pmd_init_internals(struct rte_vdev_device *dev,
 		if (rc == -1) {
 			PMD_LOG_ERRNO(ERR,
 				"%s: could not set PACKET_LOSS on AF_PACKET socket for %s",
-				name, pair->value);
+				name, ifname);
 			goto error;
 		}
 
@@ -813,7 +844,7 @@  rte_pmd_init_internals(struct rte_vdev_device *dev,
 			if (rc == -1) {
 				PMD_LOG_ERRNO(ERR,
 					"%s: could not set PACKET_QDISC_BYPASS on AF_PACKET socket for %s",
-					name, pair->value);
+					name, ifname);
 				goto error;
 			}
 #endif
@@ -823,7 +854,7 @@  rte_pmd_init_internals(struct rte_vdev_device *dev,
 		if (rc == -1) {
 			PMD_LOG_ERRNO(ERR,
 				"%s: could not set PACKET_RX_RING on AF_PACKET socket for %s",
-				name, pair->value);
+				name, ifname);
 			goto error;
 		}
 
@@ -831,7 +862,7 @@  rte_pmd_init_internals(struct rte_vdev_device *dev,
 		if (rc == -1) {
 			PMD_LOG_ERRNO(ERR,
 				"%s: could not set PACKET_TX_RING on AF_PACKET "
-				"socket for %s", name, pair->value);
+				"socket for %s", name, ifname);
 			goto error;
 		}
 
@@ -844,7 +875,7 @@  rte_pmd_init_internals(struct rte_vdev_device *dev,
 		if (rx_queue->map == MAP_FAILED) {
 			PMD_LOG_ERRNO(ERR,
 				"%s: call to mmap failed on AF_PACKET socket for %s",
-				name, pair->value);
+				name, ifname);
 			goto error;
 		}
 
@@ -881,7 +912,7 @@  rte_pmd_init_internals(struct rte_vdev_device *dev,
 		if (rc == -1) {
 			PMD_LOG_ERRNO(ERR,
 				"%s: could not bind AF_PACKET socket to %s",
-				name, pair->value);
+				name, ifname);
 			goto error;
 		}
 
@@ -891,7 +922,7 @@  rte_pmd_init_internals(struct rte_vdev_device *dev,
 		if (rc == -1) {
 			PMD_LOG_ERRNO(ERR,
 				"%s: could not set PACKET_FANOUT on AF_PACKET socket for %s",
-				name, pair->value);
+				name, ifname);
 			goto error;
 		}
 #endif