[v2,1/3] examples/l3fwd: reorganize code for better performance

Message ID 20210601075653.84927-2-ruifeng.wang@arm.com (mailing list archive)
State Superseded, archived
Delegated to: David Marchand
Series l3fwd improvements

Checks

Context        Check    Description
ci/checkpatch  success  coding style OK

Commit Message

Ruifeng Wang June 1, 2021, 7:56 a.m. UTC
Move the rfc1812 processing ahead of the NEON register stores.
On N1SDP, this reordering mitigates CPU frontend and backend stalls
when forwarding.

On N1SDP with an MLX5 40G NIC, this change showed a 10.2% performance gain
in a single-port, single-core MRR test.
On ThunderX2, this change showed no performance degradation.

Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 examples/l3fwd/l3fwd_neon.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)
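
As a rough illustration of why the stores can move after rfc1812_process()
without changing the resulting packet bytes, here is a portable C sketch. It
assumes that each vst1q_u32() writes only the first 16 bytes of the frame
(with bytes 12..15 carried over from the original packet, as the lane-3 copy
in the diff suggests) and that rfc1812_process() only modifies the IPv4 TTL
and header checksum. The toy_* helpers, the offsets, and orders_agree() are
hypothetical stand-ins, not code from l3fwd.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_HDR_LEN 14                  /* dst MAC + src MAC + ethertype */
#define STORE_LEN   16                  /* bytes written by one 128-bit store */
#define TTL_OFF     (ETH_HDR_LEN + 8)   /* IPv4 time_to_live */
#define CSUM_OFF    (ETH_HDR_LEN + 10)  /* IPv4 hdr_checksum (2 bytes) */
#define PKT_LEN     34                  /* Ethernet + minimal IPv4 header */

/* Assumed stand-in for rfc1812_process(): TTL down, checksum up. */
static void toy_rfc1812(uint8_t *pkt)
{
	uint16_t csum;

	pkt[TTL_OFF]--;
	memcpy(&csum, pkt + CSUM_OFF, sizeof(csum));
	csum++;
	memcpy(pkt + CSUM_OFF, &csum, sizeof(csum));
}

/* Assumed stand-in for the ve[] setup: new MACs, bytes 12..15 kept. */
static void toy_build_hdr(const uint8_t *pkt, const uint8_t macs[12],
			  uint8_t hdr[STORE_LEN])
{
	memcpy(hdr, macs, 12);
	memcpy(hdr + 12, pkt + 12, 4);
}

/* Assumed stand-in for vst1q_u32(p, ve): one 16-byte store. */
static void toy_store(uint8_t *pkt, const uint8_t hdr[STORE_LEN])
{
	memcpy(pkt, hdr, STORE_LEN);
}

/* Returns 1 if store-then-process and process-then-store agree. */
int orders_agree(void)
{
	const uint8_t macs[12] = { 2, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 2 };
	uint8_t a[PKT_LEN], b[PKT_LEN], hdr[STORE_LEN];
	int i;

	/* The argument relies on the two byte ranges not overlapping. */
	assert(TTL_OFF >= STORE_LEN);

	for (i = 0; i < PKT_LEN; i++)
		a[i] = b[i] = (uint8_t)(i * 7 + 1);

	/* Order before the patch: store first, then process. */
	toy_build_hdr(a, macs, hdr);
	toy_store(a, hdr);
	toy_rfc1812(a);

	/* Order after the patch: process first, then store. */
	toy_build_hdr(b, macs, hdr);
	toy_rfc1812(b);
	toy_store(b, hdr);

	return memcmp(a, b, PKT_LEN) == 0;
}
```

Under these assumptions the two orderings produce identical buffers, so any
difference between them would be in timing, not in the forwarded packet.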
  

Comments

Jerin Jacob June 6, 2021, 6:34 p.m. UTC | #1
On Tue, Jun 1, 2021 at 1:27 PM Ruifeng Wang <ruifeng.wang@arm.com> wrote:
>
> Move the rfc1812 processing ahead of the NEON register stores.
> On N1SDP, this reordering mitigates CPU frontend and backend stalls
> when forwarding.
>
> On N1SDP with an MLX5 40G NIC, this change showed a 10.2% performance gain
> in a single-port, single-core MRR test.

I think this may not have anything to do with N1SDP. It could just be the
timing of the prefetch window with the MLX5 driver, when the Tx mbuf is
touched in tx_burst(), together with the timing of L1 cache pressure.
I think tuning the driver parameters could shift that window into some
driver code.

On Octeontx2, this change causes a 3.1% regression in the flow lookup miss
case, so NACK.


> On ThunderX2, this change showed no performance degradation.
>
> Signed-off-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---
>  examples/l3fwd/l3fwd_neon.h | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/examples/l3fwd/l3fwd_neon.h b/examples/l3fwd/l3fwd_neon.h
> index 86ac5971d7..ea7fe22d00 100644
> --- a/examples/l3fwd/l3fwd_neon.h
> +++ b/examples/l3fwd/l3fwd_neon.h
> @@ -43,11 +43,6 @@ processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
>         ve[2] = vsetq_lane_u32(vgetq_lane_u32(te[2], 3), ve[2], 3);
>         ve[3] = vsetq_lane_u32(vgetq_lane_u32(te[3], 3), ve[3], 3);
>
> -       vst1q_u32(p[0], ve[0]);
> -       vst1q_u32(p[1], ve[1]);
> -       vst1q_u32(p[2], ve[2]);
> -       vst1q_u32(p[3], ve[3]);
> -
>         rfc1812_process((struct rte_ipv4_hdr *)
>                         ((struct rte_ether_hdr *)p[0] + 1),
>                         &dst_port[0], pkt[0]->packet_type);
> @@ -60,6 +55,11 @@ processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
>         rfc1812_process((struct rte_ipv4_hdr *)
>                         ((struct rte_ether_hdr *)p[3] + 1),
>                         &dst_port[3], pkt[3]->packet_type);
> +
> +       vst1q_u32(p[0], ve[0]);
> +       vst1q_u32(p[1], ve[1]);
> +       vst1q_u32(p[2], ve[2]);
> +       vst1q_u32(p[3], ve[3]);
>  }
>
>  /*
> --
> 2.25.1
>
  

Patch

diff --git a/examples/l3fwd/l3fwd_neon.h b/examples/l3fwd/l3fwd_neon.h
index 86ac5971d7..ea7fe22d00 100644
--- a/examples/l3fwd/l3fwd_neon.h
+++ b/examples/l3fwd/l3fwd_neon.h
@@ -43,11 +43,6 @@  processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
 	ve[2] = vsetq_lane_u32(vgetq_lane_u32(te[2], 3), ve[2], 3);
 	ve[3] = vsetq_lane_u32(vgetq_lane_u32(te[3], 3), ve[3], 3);
 
-	vst1q_u32(p[0], ve[0]);
-	vst1q_u32(p[1], ve[1]);
-	vst1q_u32(p[2], ve[2]);
-	vst1q_u32(p[3], ve[3]);
-
 	rfc1812_process((struct rte_ipv4_hdr *)
 			((struct rte_ether_hdr *)p[0] + 1),
 			&dst_port[0], pkt[0]->packet_type);
@@ -60,6 +55,11 @@  processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t dst_port[FWDSTEP])
 	rfc1812_process((struct rte_ipv4_hdr *)
 			((struct rte_ether_hdr *)p[3] + 1),
 			&dst_port[3], pkt[3]->packet_type);
+
+	vst1q_u32(p[0], ve[0]);
+	vst1q_u32(p[1], ve[1]);
+	vst1q_u32(p[2], ve[2]);
+	vst1q_u32(p[3], ve[3]);
 }
 
 /*