examples/l3fwd: optimize packet prefetch

Message ID: 20241225075302.353013-1-huangdengdui@huawei.com
State: New
Delegated to: Thomas Monjalon
Series: examples/l3fwd: optimize packet prefetch

Checks

Context                       Check    Description
ci/checkpatch                 warning  coding style issues
ci/loongarch-compilation      success  Compilation OK
ci/loongarch-unit-testing     success  Unit Testing PASS
ci/iol-broadcom-Performance   success  Performance Testing PASS
ci/github-robot: build        success  github build: passed
ci/iol-mellanox-Performance   success  Performance Testing PASS
ci/iol-unit-amd64-testing     success  Testing PASS
ci/iol-intel-Functional       success  Functional Testing PASS
ci/Intel-compilation          success  Compilation OK
ci/intel-Testing              success  Testing PASS
ci/iol-unit-arm64-testing     success  Testing PASS
ci/intel-Functional           success  Functional PASS
ci/iol-sample-apps-testing    success  Testing PASS
ci/iol-compile-arm64-testing  success  Testing PASS
ci/iol-intel-Performance      success  Performance Testing PASS
ci/iol-marvell-Functional     success  Functional Testing PASS
ci/iol-compile-amd64-testing  success  Testing PASS

Commit Message

huangdengdui Dec. 25, 2024, 7:53 a.m. UTC
The prefetch window depends on the hardware platform, so the current prefetch
policy may not suit all platforms. In most cases, the number of packets
received per Rx burst is small (64 is used in most performance reports), and
in l3fwd the burst size cannot exceed 512. Therefore, prefetching all packets
before processing can achieve better performance.

Signed-off-by: Dengdui Huang <huangdengdui@huawei.com>
---
 examples/l3fwd/l3fwd_lpm_neon.h | 42 ++++-----------------------------
 1 file changed, 5 insertions(+), 37 deletions(-)
  

Comments

Stephen Hemminger Dec. 25, 2024, 9:21 p.m. UTC | #1
On Wed, 25 Dec 2024 15:53:02 +0800
Dengdui Huang <huangdengdui@huawei.com> wrote:

> From: Dengdui Huang <huangdengdui@huawei.com>
> To: <dev@dpdk.org>
> CC: <wathsala.vithanage@arm.com>, <stephen@networkplumber.org>,  <liuyonglong@huawei.com>, <fengchengwen@huawei.com>, <haijie1@huawei.com>,  <lihuisong@huawei.com>
> Subject: [PATCH] examples/l3fwd: optimize packet prefetch
> Date: Wed, 25 Dec 2024 15:53:02 +0800
> X-Mailer: git-send-email 2.33.0
> 
> The prefetch window depends on the hardware platform, so the current prefetch
> policy may not suit all platforms. In most cases, the number of packets
> received per Rx burst is small (64 is used in most performance reports), and
> in l3fwd the burst size cannot exceed 512. Therefore, prefetching all packets
> before processing can achieve better performance.
> 
> Signed-off-by: Dengdui Huang <huangdengdui@huawei.com>
> ---

I think VPP had a good description of how to unroll loops and deal with prefetch.

With larger burst sizes you don't want to prefetch the whole burst.
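
To put rough numbers on the concern about larger bursts, here is a minimal
sketch. It assumes 64-byte cache lines and one prefetched line per packet
(the patch prefetches only the start of each packet's data). CACHE_LINE_SIZE
and prefetch_footprint_bytes() are illustrative names, not DPDK APIs.

```c
#include <stdint.h>

/* Assumption: 64-byte cache lines; some platforms use 128 bytes. */
#define CACHE_LINE_SIZE 64u

/* Bytes of cache touched when every packet of a burst is prefetched up front. */
static inline uint32_t
prefetch_footprint_bytes(uint32_t nb_pkts, uint32_t lines_per_pkt)
{
	return nb_pkts * lines_per_pkt * CACHE_LINE_SIZE;
}
```

With these assumptions a 64-packet burst touches 4 KiB, while the 512-packet
maximum touches 32 KiB. That is in the same range as the L1 data cache on many
cores, so lines prefetched early may be evicted before they are processed.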
  
Konstantin Ananyev Jan. 8, 2025, 1:42 p.m. UTC | #2
> 
> The prefetch window depends on the hardware platform, so the current prefetch
> policy may not suit all platforms. In most cases, the number of packets
> received per Rx burst is small (64 is used in most performance reports), and
> in l3fwd the burst size cannot exceed 512. Therefore, prefetching all packets
> before processing can achieve better performance.

As you mentioned, 'prefetch' behavior differs a lot from one HW platform to another.
So it could easily be that the changes you are suggesting will cause a performance
boost on one platform and a degradation on another.
In fact, right now l3fwd 'prefetch' usage is a bit of a mess:
- l3fwd_lpm_neon.h uses FWDSTEP as a prefetch window.
- l3fwd_fib.c uses FIB_PREFETCH_OFFSET for that purpose.
- the rest of the code either uses PREFETCH_OFFSET or doesn't use 'prefetch' at all.

Probably what we need here is a unified approach:
a prefetch_window_size, configurable at run-time, that all code paths obey.
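
A unified approach along these lines could look roughly like the sketch below.
It is not taken from l3fwd: process_burst(), pkt_handler_t, count_handler() and
the prefetch_window_size variable are hypothetical, and __builtin_prefetch()
(GCC/Clang) stands in for rte_prefetch0() so the sketch builds without DPDK.

```c
#include <stdint.h>

/* Hypothetical run-time knob; not an existing l3fwd option. */
static uint32_t prefetch_window_size = 4;

/* Stand-in for rte_prefetch0(); assumes GCC or Clang. */
static inline void prefetch0(const void *p)
{
	__builtin_prefetch(p, 0, 3);
}

typedef void (*pkt_handler_t)(void *pkt, void *arg);

/* Example handler: counts the packets it is given. */
static void count_handler(void *pkt, void *arg)
{
	(void)pkt;
	++*(uint32_t *)arg;
}

static void process_burst(void **pkts, uint32_t nb_rx,
			  pkt_handler_t handle, void *arg)
{
	uint32_t i;
	/* Never look ahead further than the burst itself. */
	uint32_t win = prefetch_window_size < nb_rx ?
			prefetch_window_size : nb_rx;

	/* Warm-up: prefetch the first 'win' packets. */
	for (i = 0; i < win; i++)
		prefetch0(pkts[i]);

	/* Steady state: prefetch packet i + win, then process packet i. */
	for (i = 0; i + win < nb_rx; i++) {
		prefetch0(pkts[i + win]);
		handle(pkts[i], arg);
	}

	/* Drain: the remaining packets have already been prefetched. */
	for (; i < nb_rx; i++)
		handle(pkts[i], arg);
}
```

With a window at least as large as the burst, this degenerates into
"prefetch everything, then process", which is what the patch does; a small
window behaves more like the existing FWDSTEP-based code.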

  
huangdengdui Jan. 9, 2025, 11:31 a.m. UTC | #3
On 2025/1/8 21:42, Konstantin Ananyev wrote:
> 
> 
>>
>> The prefetch window depends on the hardware platform, so the current prefetch
>> policy may not suit all platforms. In most cases, the number of packets
>> received per Rx burst is small (64 is used in most performance reports), and
>> in l3fwd the burst size cannot exceed 512. Therefore, prefetching all packets
>> before processing can achieve better performance.
> 
> As you mentioned, 'prefetch' behavior differs a lot from one HW platform to another.
> So it could easily be that the changes you are suggesting will cause a performance
> boost on one platform and a degradation on another.
> In fact, right now l3fwd 'prefetch' usage is a bit of a mess:
> - l3fwd_lpm_neon.h uses FWDSTEP as a prefetch window.
> - l3fwd_fib.c uses FIB_PREFETCH_OFFSET for that purpose.
> - the rest of the code either uses PREFETCH_OFFSET or doesn't use 'prefetch' at all.
>
> Probably what we need here is a unified approach:
> a prefetch_window_size, configurable at run-time, that all code paths obey.

Agreed, I'll add a parameter to configure the prefetch window.
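
As a rough idea of how such a parameter might be validated (a sketch only:
parse_prefetch_window() and MAX_BURST are hypothetical names, and no option
name is implied), the value could be parsed with strtoul() and rejected when
it is malformed or larger than the 512-packet burst limit mentioned in the
commit message:

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: the 512-packet burst limit quoted in the commit message. */
#define MAX_BURST 512u

/*
 * Parse a decimal prefetch window size.
 * Returns 0 and stores the value in *out on success,
 * -1 if the string is empty, malformed, or out of range.
 */
static int parse_prefetch_window(const char *arg, uint32_t *out)
{
	char *end = NULL;
	unsigned long v;

	if (arg == NULL || *arg == '\0')
		return -1;

	errno = 0;
	v = strtoul(arg, &end, 10);
	if (errno != 0 || *end != '\0' || v > MAX_BURST)
		return -1;

	*out = (uint32_t)v;
	return 0;
}
```

Allowing 0 here would mean "no look-ahead", and a value equal to the burst
size would mean "prefetch the whole burst"; whether both extremes should be
accepted is a design choice for the actual patch.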

  

Patch

diff --git a/examples/l3fwd/l3fwd_lpm_neon.h b/examples/l3fwd/l3fwd_lpm_neon.h
index 3c1f827424..0b51782b8c 100644
--- a/examples/l3fwd/l3fwd_lpm_neon.h
+++ b/examples/l3fwd/l3fwd_lpm_neon.h
@@ -91,53 +91,21 @@  l3fwd_lpm_process_packets(int nb_rx, struct rte_mbuf **pkts_burst,
 	const int32_t k = RTE_ALIGN_FLOOR(nb_rx, FWDSTEP);
 	const int32_t m = nb_rx % FWDSTEP;
 
-	if (k) {
-		for (i = 0; i < FWDSTEP; i++) {
-			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i],
-							void *));
-		}
-		for (j = 0; j != k - FWDSTEP; j += FWDSTEP) {
-			for (i = 0; i < FWDSTEP; i++) {
-				rte_prefetch0(rte_pktmbuf_mtod(
-						pkts_burst[j + i + FWDSTEP],
-						void *));
-			}
+	/* The number of packets is small. Prefetch all packets. */
+	for (i = 0; i < nb_rx; i++)
+		rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[i], void *));
 
+	if (k) {
+		for (j = 0; j != k; j += FWDSTEP) {
 			processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
 			processx4_step2(qconf, dip, ipv4_flag, portid,
 					&pkts_burst[j], &dst_port[j]);
 			if (do_step3)
 				processx4_step3(&pkts_burst[j], &dst_port[j]);
 		}
-
-		processx4_step1(&pkts_burst[j], &dip, &ipv4_flag);
-		processx4_step2(qconf, dip, ipv4_flag, portid, &pkts_burst[j],
-				&dst_port[j]);
-		if (do_step3)
-			processx4_step3(&pkts_burst[j], &dst_port[j]);
-
-		j += FWDSTEP;
 	}
 
 	if (m) {
-		/* Prefetch last up to 3 packets one by one */
-		switch (m) {
-		case 3:
-			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
-							void *));
-			j++;
-			/* fallthrough */
-		case 2:
-			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
-							void *));
-			j++;
-			/* fallthrough */
-		case 1:
-			rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[j],
-							void *));
-			j++;
-		}
-		j -= m;
 		/* Classify last up to 3 packets one by one */
 		switch (m) {
 		case 3: