[v3,1/2] ip_frag: optimize key compare and hash generation

Message ID 20230529145502.11805-1-pbhagavatula@marvell.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers
Series [v3,1/2] ip_frag: optimize key compare and hash generation |

Checks

Context Check Description
ci/checkpatch success coding style OK

Commit Message

Pavan Nikhilesh Bhagavatula May 29, 2023, 2:55 p.m. UTC
  From: Pavan Nikhilesh <pbhagavatula@marvell.com>

Use optimized rte_hash_k32_cmp_eq routine for key comparison for
x86 and ARM64.
Use CRC instructions for hash generation on ARM64.

Signed-off-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
---
On Neoverse-N2, performance improved by 10% when measured with
examples/ip_reassembly.

 v3 Changes:
 - Drop NEON patch.
 v2 Changes:
 - Fix compilation failure with non ARM64/x86 targets

 lib/hash/rte_cmp_arm64.h       | 16 ++++++++--------
 lib/hash/rte_cmp_x86.h         | 16 ++++++++--------
 lib/ip_frag/ip_frag_common.h   | 14 +++++++++++++-
 lib/ip_frag/ip_frag_internal.c |  4 ++--
 4 files changed, 31 insertions(+), 19 deletions(-)

--
2.25.1
  

Comments

Stephen Hemminger May 30, 2023, 3:09 a.m. UTC | #1
On Mon, 29 May 2023 20:25:01 +0530
<pbhagavatula@marvell.com> wrote:

> +	return (k1->id_key_len != k2->id_key_len) ||
> +	       (k1->key_len == IPV4_KEYLEN ? k1->src_dst[0] != k2->src_dst[0] :
> +					     rte_hash_k32_cmp_eq(k1, k2, 32));

If you make another version, one small comment.
Breaking this into a couple of if statements would make reading easier
for human readers. Compiler doesn't care.
  
Ruifeng Wang May 30, 2023, 7:44 a.m. UTC | #2
> -----Original Message-----
> From: pbhagavatula@marvell.com <pbhagavatula@marvell.com>
> Sent: Monday, May 29, 2023 10:55 PM
> To: jerinj@marvell.com; Ruifeng Wang <Ruifeng.Wang@arm.com>; Yipeng Wang
> <yipeng1.wang@intel.com>; Sameh Gobriel <sameh.gobriel@intel.com>; Bruce Richardson
> <bruce.richardson@intel.com>; Vladimir Medvedkin <vladimir.medvedkin@intel.com>;
> Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
> Cc: dev@dpdk.org; Pavan Nikhilesh <pbhagavatula@marvell.com>
> Subject: [PATCH v3 1/2] ip_frag: optimize key compare and hash generation
> 
> From: Pavan Nikhilesh <pbhagavatula@marvell.com>
> 
> Use optimized rte_hash_k32_cmp_eq routine for key comparison for
> x86 and ARM64.
> Use CRC instructions for hash generation on ARM64.
> 
> Signed-off-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
> ---
> On Neoverse-N2, performance improved by 10% when measured with examples/ip_reassembly.
> 
>  v3 Changes:
>  - Drop NEON patch.
>  v2 Changes:
>  - Fix compilation failure with non ARM64/x86 targets
> 
>  lib/hash/rte_cmp_arm64.h       | 16 ++++++++--------
>  lib/hash/rte_cmp_x86.h         | 16 ++++++++--------
>  lib/ip_frag/ip_frag_common.h   | 14 +++++++++++++-
>  lib/ip_frag/ip_frag_internal.c |  4 ++--
>  4 files changed, 31 insertions(+), 19 deletions(-)
> 
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
  
Pavan Nikhilesh Bhagavatula May 30, 2023, 5:50 p.m. UTC | #3
> On Mon, 29 May 2023 20:25:01 +0530
> <pbhagavatula@marvell.com> wrote:
> 
> > +	return (k1->id_key_len != k2->id_key_len) ||
> > +	       (k1->key_len == IPV4_KEYLEN ? k1->src_dst[0] != k2->src_dst[0] :
> > +					     rte_hash_k32_cmp_eq(k1, k2,
> 32));
> 
> If you make another version, one small comment.
> Breaking this into a couple of if statements would make reading easier
> for human readers. Compiler doesn't care.

I have modified the above code to 

       if (k1->id_key_len != k2->id_key_len)
               return 1;
       if (k1->key_len == IPV4_KEYLEN)
               return k1->src_dst[0] != k2->src_dst[0];
       else
               return rte_hash_k32_cmp_eq(k1, k2, 32);

But upon remeasuring performance I see a performance loss of 1.2%
Compiler(GCC 10) generates additional branches with the above code.

I have also profiled the ip_reassembly application with and without the changes and see lot of 
additional branch misses


Current implementation:

==============
Branch Metrics
==============
Branch MPKI                                                  : 0.159             
Branch PKI                                                   : 156.566           
Branch Mis-prediction Rate                                   : 0.101             

INST_RETIRED       : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 9.493B
BR_RETIRED         : ▇▇▇▇▇▇▇ 1.486B
BR_MIS_PRED_RETIRED: ▏ 1.508M
BR_IMMED_SPEC      : ▇▇▇▇▇▇▇ 1.395B
BR_RETURN_SPEC     : ▏ 105.203M
BR_INDIRECT_SPEC   : ▏ 106.044M

Modified implementation:

==============
Branch Metrics
==============
Branch MPKI                                                  : 0.282             
Branch PKI                                                   : 156.566           
Branch Mis-prediction Rate                                   : 0.180             

INST_RETIRED       : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 9.444B
BR_RETIRED         : ▇▇▇▇▇▇▇ 1.479B
BR_MIS_PRED_RETIRED: ▏ 2.662M
BR_IMMED_SPEC      : ▇▇▇▇▇▇▇ 1.388B
BR_RETURN_SPEC     : ▏ 104.518M
BR_INDIRECT_SPEC   : ▏ 105.354M


I will retain the current implementation in the next patch.

Thanks,
Pavan.
  

Patch

diff --git a/lib/hash/rte_cmp_arm64.h b/lib/hash/rte_cmp_arm64.h
index e9e26f9abd..a3e85635eb 100644
--- a/lib/hash/rte_cmp_arm64.h
+++ b/lib/hash/rte_cmp_arm64.h
@@ -3,7 +3,7 @@ 
  */

 /* Functions to compare multiple of 16 byte keys (up to 128 bytes) */
-static int
+static inline int
 rte_hash_k16_cmp_eq(const void *key1, const void *key2,
 		    size_t key_len __rte_unused)
 {
@@ -24,7 +24,7 @@  rte_hash_k16_cmp_eq(const void *key1, const void *key2,
 	return !(x0 == 0 && x1 == 0);
 }

-static int
+static inline int
 rte_hash_k32_cmp_eq(const void *key1, const void *key2, size_t key_len)
 {
 	return rte_hash_k16_cmp_eq(key1, key2, key_len) ||
@@ -32,7 +32,7 @@  rte_hash_k32_cmp_eq(const void *key1, const void *key2, size_t key_len)
 				(const char *) key2 + 16, key_len);
 }

-static int
+static inline int
 rte_hash_k48_cmp_eq(const void *key1, const void *key2, size_t key_len)
 {
 	return rte_hash_k16_cmp_eq(key1, key2, key_len) ||
@@ -42,7 +42,7 @@  rte_hash_k48_cmp_eq(const void *key1, const void *key2, size_t key_len)
 				(const char *) key2 + 32, key_len);
 }

-static int
+static inline int
 rte_hash_k64_cmp_eq(const void *key1, const void *key2, size_t key_len)
 {
 	return rte_hash_k32_cmp_eq(key1, key2, key_len) ||
@@ -50,7 +50,7 @@  rte_hash_k64_cmp_eq(const void *key1, const void *key2, size_t key_len)
 				(const char *) key2 + 32, key_len);
 }

-static int
+static inline int
 rte_hash_k80_cmp_eq(const void *key1, const void *key2, size_t key_len)
 {
 	return rte_hash_k64_cmp_eq(key1, key2, key_len) ||
@@ -58,7 +58,7 @@  rte_hash_k80_cmp_eq(const void *key1, const void *key2, size_t key_len)
 				(const char *) key2 + 64, key_len);
 }

-static int
+static inline int
 rte_hash_k96_cmp_eq(const void *key1, const void *key2, size_t key_len)
 {
 	return rte_hash_k64_cmp_eq(key1, key2, key_len) ||
@@ -66,7 +66,7 @@  rte_hash_k96_cmp_eq(const void *key1, const void *key2, size_t key_len)
 				(const char *) key2 + 64, key_len);
 }

-static int
+static inline int
 rte_hash_k112_cmp_eq(const void *key1, const void *key2, size_t key_len)
 {
 	return rte_hash_k64_cmp_eq(key1, key2, key_len) ||
@@ -76,7 +76,7 @@  rte_hash_k112_cmp_eq(const void *key1, const void *key2, size_t key_len)
 				(const char *) key2 + 96, key_len);
 }

-static int
+static inline int
 rte_hash_k128_cmp_eq(const void *key1, const void *key2, size_t key_len)
 {
 	return rte_hash_k64_cmp_eq(key1, key2, key_len) ||
diff --git a/lib/hash/rte_cmp_x86.h b/lib/hash/rte_cmp_x86.h
index 13a5836351..ddfbef462f 100644
--- a/lib/hash/rte_cmp_x86.h
+++ b/lib/hash/rte_cmp_x86.h
@@ -5,7 +5,7 @@ 
 #include <rte_vect.h>

 /* Functions to compare multiple of 16 byte keys (up to 128 bytes) */
-static int
+static inline int
 rte_hash_k16_cmp_eq(const void *key1, const void *key2, size_t key_len __rte_unused)
 {
 	const __m128i k1 = _mm_loadu_si128((const __m128i *) key1);
@@ -15,7 +15,7 @@  rte_hash_k16_cmp_eq(const void *key1, const void *key2, size_t key_len __rte_unu
 	return !_mm_test_all_zeros(x, x);
 }

-static int
+static inline int
 rte_hash_k32_cmp_eq(const void *key1, const void *key2, size_t key_len)
 {
 	return rte_hash_k16_cmp_eq(key1, key2, key_len) ||
@@ -23,7 +23,7 @@  rte_hash_k32_cmp_eq(const void *key1, const void *key2, size_t key_len)
 				(const char *) key2 + 16, key_len);
 }

-static int
+static inline int
 rte_hash_k48_cmp_eq(const void *key1, const void *key2, size_t key_len)
 {
 	return rte_hash_k16_cmp_eq(key1, key2, key_len) ||
@@ -33,7 +33,7 @@  rte_hash_k48_cmp_eq(const void *key1, const void *key2, size_t key_len)
 				(const char *) key2 + 32, key_len);
 }

-static int
+static inline int
 rte_hash_k64_cmp_eq(const void *key1, const void *key2, size_t key_len)
 {
 	return rte_hash_k32_cmp_eq(key1, key2, key_len) ||
@@ -41,7 +41,7 @@  rte_hash_k64_cmp_eq(const void *key1, const void *key2, size_t key_len)
 				(const char *) key2 + 32, key_len);
 }

-static int
+static inline int
 rte_hash_k80_cmp_eq(const void *key1, const void *key2, size_t key_len)
 {
 	return rte_hash_k64_cmp_eq(key1, key2, key_len) ||
@@ -49,7 +49,7 @@  rte_hash_k80_cmp_eq(const void *key1, const void *key2, size_t key_len)
 				(const char *) key2 + 64, key_len);
 }

-static int
+static inline int
 rte_hash_k96_cmp_eq(const void *key1, const void *key2, size_t key_len)
 {
 	return rte_hash_k64_cmp_eq(key1, key2, key_len) ||
@@ -57,7 +57,7 @@  rte_hash_k96_cmp_eq(const void *key1, const void *key2, size_t key_len)
 				(const char *) key2 + 64, key_len);
 }

-static int
+static inline int
 rte_hash_k112_cmp_eq(const void *key1, const void *key2, size_t key_len)
 {
 	return rte_hash_k64_cmp_eq(key1, key2, key_len) ||
@@ -67,7 +67,7 @@  rte_hash_k112_cmp_eq(const void *key1, const void *key2, size_t key_len)
 				(const char *) key2 + 96, key_len);
 }

-static int
+static inline int
 rte_hash_k128_cmp_eq(const void *key1, const void *key2, size_t key_len)
 {
 	return rte_hash_k64_cmp_eq(key1, key2, key_len) ||
diff --git a/lib/ip_frag/ip_frag_common.h b/lib/ip_frag/ip_frag_common.h
index 0d8ce6a1e1..5cdd98c8fe 100644
--- a/lib/ip_frag/ip_frag_common.h
+++ b/lib/ip_frag/ip_frag_common.h
@@ -5,7 +5,13 @@ 
 #ifndef _IP_FRAG_COMMON_H_
 #define _IP_FRAG_COMMON_H_

-#include <sys/queue.h>
+#include <rte_common.h>
+
+#if defined(RTE_ARCH_ARM64)
+#include <rte_cmp_arm64.h>
+#elif defined(RTE_ARCH_X86)
+#include <rte_cmp_x86.h>
+#endif

 #include "rte_ip_frag.h"
 #include "ip_reassembly.h"
@@ -75,12 +81,18 @@  ip_frag_key_invalidate(struct ip_frag_key * key)
 static inline uint64_t
 ip_frag_key_cmp(const struct ip_frag_key * k1, const struct ip_frag_key * k2)
 {
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64)
+	return (k1->id_key_len != k2->id_key_len) ||
+	       (k1->key_len == IPV4_KEYLEN ? k1->src_dst[0] != k2->src_dst[0] :
+					     rte_hash_k32_cmp_eq(k1, k2, 32));
+#else
 	uint32_t i;
 	uint64_t val;
 	val = k1->id_key_len ^ k2->id_key_len;
 	for (i = 0; i < k1->key_len; i++)
 		val |= k1->src_dst[i] ^ k2->src_dst[i];
 	return val;
+#endif
 }

 /*
diff --git a/lib/ip_frag/ip_frag_internal.c b/lib/ip_frag/ip_frag_internal.c
index b436a4c931..7cbef647df 100644
--- a/lib/ip_frag/ip_frag_internal.c
+++ b/lib/ip_frag/ip_frag_internal.c
@@ -45,7 +45,7 @@  ipv4_frag_hash(const struct ip_frag_key *key, uint32_t *v1, uint32_t *v2)

 	p = (const uint32_t *)&key->src_dst;

-#ifdef RTE_ARCH_X86
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64)
 	v = rte_hash_crc_4byte(p[0], PRIME_VALUE);
 	v = rte_hash_crc_4byte(p[1], v);
 	v = rte_hash_crc_4byte(key->id, v);
@@ -66,7 +66,7 @@  ipv6_frag_hash(const struct ip_frag_key *key, uint32_t *v1, uint32_t *v2)

 	p = (const uint32_t *) &key->src_dst;

-#ifdef RTE_ARCH_X86
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64)
 	v = rte_hash_crc_4byte(p[0], PRIME_VALUE);
 	v = rte_hash_crc_4byte(p[1], v);
 	v = rte_hash_crc_4byte(p[2], v);