[v6,4/4] hash: add SVE support for bulk key lookup

Message ID 20240311232112.736613-5-yoan.picchi@arm.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers
Series hash: add SVE support for bulk key lookup |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/loongarch-compilation success Compilation OK
ci/loongarch-unit-testing success Unit Testing PASS
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS
ci/intel-Functional success Functional PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/github-robot: build success github build: passed
ci/iol-compile-amd64-testing success Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-unit-amd64-testing success Testing PASS
ci/iol-sample-apps-testing success Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-abi-testing success Testing PASS
ci/iol-unit-arm64-testing success Testing PASS
ci/iol-compile-arm64-testing success Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS

Commit Message

Yoan Picchi March 11, 2024, 11:21 p.m. UTC
- Implemented SVE code for comparing signatures in bulk lookup.
- Added Defines in code for SVE code support.
- Optimise NEON code
- New SVE code is ~5% slower than optimized NEON for N2 processor.

Signed-off-by: Yoan Picchi <yoan.picchi@arm.com>
Signed-off-by: Harjot Singh <harjot.singh@arm.com>
Reviewed-by: Nathan Brown <nathan.brown@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/hash/arch/arm/compare_signatures.h | 58 ++++++++++++++++++++++++++
 lib/hash/rte_cuckoo_hash.c             |  2 +
 2 files changed, 60 insertions(+)
  

Comments

fengchengwen March 12, 2024, 3:57 a.m. UTC | #1
Hi Yoan,

On 2024/3/12 7:21, Yoan Picchi wrote:
> - Implemented SVE code for comparing signatures in bulk lookup.
> - Added Defines in code for SVE code support.
> - Optimise NEON code

This commit does not include this part. Pls only describe the content in this commit.

> - New SVE code is ~5% slower than optimized NEON for N2 processor.
> 
> Signed-off-by: Yoan Picchi <yoan.picchi@arm.com>
> Signed-off-by: Harjot Singh <harjot.singh@arm.com>
> Reviewed-by: Nathan Brown <nathan.brown@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---
>  lib/hash/arch/arm/compare_signatures.h | 58 ++++++++++++++++++++++++++
>  lib/hash/rte_cuckoo_hash.c             |  2 +
>  2 files changed, 60 insertions(+)
> 
> diff --git a/lib/hash/arch/arm/compare_signatures.h b/lib/hash/arch/arm/compare_signatures.h
> index b5a457f936..8a0627e119 100644
> --- a/lib/hash/arch/arm/compare_signatures.h
> +++ b/lib/hash/arch/arm/compare_signatures.h
> @@ -47,6 +47,64 @@ compare_signatures_dense(uint16_t *hitmask_buffer,
>  		*hitmask_buffer = vaddvq_u16(hit2);
>  		}
>  		break;
> +#endif
> +#if defined(RTE_HAS_SVE_ACLE)
> +	case RTE_HASH_COMPARE_SVE: {

...

>  #endif
>  	default:
>  		for (unsigned int i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
> diff --git a/lib/hash/rte_cuckoo_hash.c b/lib/hash/rte_cuckoo_hash.c
> index e41f03270a..7a474267f0 100644
> --- a/lib/hash/rte_cuckoo_hash.c
> +++ b/lib/hash/rte_cuckoo_hash.c
> @@ -452,6 +452,8 @@ rte_hash_create(const struct rte_hash_parameters *params)
>  #elif defined(RTE_ARCH_ARM64)
>  	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
>  		h->sig_cmp_fn = RTE_HASH_COMPARE_NEON;
> +		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SVE))
> +			h->sig_cmp_fn = RTE_HASH_COMPARE_SVE;

The RTE_HASH_COMPARE_SVE was defined in "PATCH v6 1/4] hash: pack the hitmask for hash in bulk lookup",
but its first use is in this commit, so I think it should defined in this commit.

If RTE_CPUFLAG_SVE and RTE_HAS_SVE_ACLE both set, then SVE impl will be chosen.
If RTE_CPUFLAG_SVE defined, but RTE_HAS_SVE_ACLE was not, then scalar will be chosen. --- in this case we could back to NEON impl.
So I suggest direct use "#if defined(RTE_HAS_SVE_ACLE)" here.

>  	}
>  	else
>  #endif
> 

Plus:
I notice the commit log said the SVE performance is slower than NEON.

And I also notice other platform SVE also lower than NEON,
1. b4ee9c07bd config/arm: disable SVE ACLE for CN10K
2. 4eea7c6461 config/arm: add SVE ACLE control flag

So maybe we should disable RTE_HAS_SVE_ACLE default by:
diff --git a/config/arm/meson.build b/config/arm/meson.build
index 9d6fb87d7f..a5b890d100 100644
--- a/config/arm/meson.build
+++ b/config/arm/meson.build
@@ -875,7 +875,7 @@ endif

 if cc.get_define('__ARM_FEATURE_SVE', args: machine_args) != ''
     compile_time_cpuflags += ['RTE_CPUFLAG_SVE']
-    if (cc.check_header('arm_sve.h') and soc_config.get('sve_acle', true))
+    if (cc.check_header('arm_sve.h') and soc_config.get('sve_acle', false))
         dpdk_conf.set('RTE_HAS_SVE_ACLE', 1)
     endif
 endif

If the platform verify SVE has higher performance, then it could enable SVE by add "sve_acle: true" in soc_xxx config.

Thanks
  
Yoan Picchi March 12, 2024, 3:08 p.m. UTC | #2
On 3/12/24 03:57, fengchengwen wrote:
> Hi Yoan,
> 
> On 2024/3/12 7:21, Yoan Picchi wrote:
>> - Implemented SVE code for comparing signatures in bulk lookup.
>> - Added Defines in code for SVE code support.
>> - Optimise NEON code
> 
> This commit does not include this part. Pls only describe the content in this commit.

Thank you. I forgot to edit that out after moving commit around.

> 
>> - New SVE code is ~5% slower than optimized NEON for N2 processor.
>>
>> Signed-off-by: Yoan Picchi <yoan.picchi@arm.com>
>> Signed-off-by: Harjot Singh <harjot.singh@arm.com>
>> Reviewed-by: Nathan Brown <nathan.brown@arm.com>
>> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
>> ---
>>   lib/hash/arch/arm/compare_signatures.h | 58 ++++++++++++++++++++++++++
>>   lib/hash/rte_cuckoo_hash.c             |  2 +
>>   2 files changed, 60 insertions(+)
>>
>> diff --git a/lib/hash/arch/arm/compare_signatures.h b/lib/hash/arch/arm/compare_signatures.h
>> index b5a457f936..8a0627e119 100644
>> --- a/lib/hash/arch/arm/compare_signatures.h
>> +++ b/lib/hash/arch/arm/compare_signatures.h
>> @@ -47,6 +47,64 @@ compare_signatures_dense(uint16_t *hitmask_buffer,
>>   		*hitmask_buffer = vaddvq_u16(hit2);
>>   		}
>>   		break;
>> +#endif
>> +#if defined(RTE_HAS_SVE_ACLE)
>> +	case RTE_HASH_COMPARE_SVE: {
> 
> ...
> 
>>   #endif
>>   	default:
>>   		for (unsigned int i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
>> diff --git a/lib/hash/rte_cuckoo_hash.c b/lib/hash/rte_cuckoo_hash.c
>> index e41f03270a..7a474267f0 100644
>> --- a/lib/hash/rte_cuckoo_hash.c
>> +++ b/lib/hash/rte_cuckoo_hash.c
>> @@ -452,6 +452,8 @@ rte_hash_create(const struct rte_hash_parameters *params)
>>   #elif defined(RTE_ARCH_ARM64)
>>   	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
>>   		h->sig_cmp_fn = RTE_HASH_COMPARE_NEON;
>> +		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SVE))
>> +			h->sig_cmp_fn = RTE_HASH_COMPARE_SVE;
> 
> The RTE_HASH_COMPARE_SVE was defined in "PATCH v6 1/4] hash: pack the hitmask for hash in bulk lookup",
> but its first use is in this commit, so I think it should defined in this commit.
> 
> If RTE_CPUFLAG_SVE and RTE_HAS_SVE_ACLE both set, then SVE impl will be chosen.
> If RTE_CPUFLAG_SVE defined, but RTE_HAS_SVE_ACLE was not, then scalar will be chosen. --- in this case we could back to NEON impl.
> So I suggest direct use "#if defined(RTE_HAS_SVE_ACLE)" here.

Sounds fair. I'll do it.

> 
>>   	}
>>   	else
>>   #endif
>>
> 
> Plus:
> I notice the commit log said the SVE performance is slower than NEON.
> 
> And I also notice other platform SVE also lower than NEON,
> 1. b4ee9c07bd config/arm: disable SVE ACLE for CN10K
> 2. 4eea7c6461 config/arm: add SVE ACLE control flag
> 
> So maybe we should disable RTE_HAS_SVE_ACLE default by:
> diff --git a/config/arm/meson.build b/config/arm/meson.build
> index 9d6fb87d7f..a5b890d100 100644
> --- a/config/arm/meson.build
> +++ b/config/arm/meson.build
> @@ -875,7 +875,7 @@ endif
> 
>   if cc.get_define('__ARM_FEATURE_SVE', args: machine_args) != ''
>       compile_time_cpuflags += ['RTE_CPUFLAG_SVE']
> -    if (cc.check_header('arm_sve.h') and soc_config.get('sve_acle', true))
> +    if (cc.check_header('arm_sve.h') and soc_config.get('sve_acle', false))
>           dpdk_conf.set('RTE_HAS_SVE_ACLE', 1)
>       endif
>   endif
> 
> If the platform verify SVE has higher performance, then it could enable SVE by add "sve_acle: true" in soc_xxx config.
> 
> Thanks

Here I kinda disagree. In this particular instance, SVE is a bit slower 
with narrow vectors (128b), but could be faster with some wider vector 
sizes.
Even in general SVE 128b is not just slower than neon. It's a case by 
case basis. Sometime it's slower, sometime it's faster, so I don't think 
we should just disable it by default. In any case, disabling it should 
be its own patch with much discussion, not just a offhand thing we 
include in the middle of this patch.
This SVE version is still faster than the upstream neon version. I just 
happen to have improved the neon version even more.
  

Patch

diff --git a/lib/hash/arch/arm/compare_signatures.h b/lib/hash/arch/arm/compare_signatures.h
index b5a457f936..8a0627e119 100644
--- a/lib/hash/arch/arm/compare_signatures.h
+++ b/lib/hash/arch/arm/compare_signatures.h
@@ -47,6 +47,64 @@  compare_signatures_dense(uint16_t *hitmask_buffer,
 		*hitmask_buffer = vaddvq_u16(hit2);
 		}
 		break;
+#endif
+#if defined(RTE_HAS_SVE_ACLE)
+	case RTE_HASH_COMPARE_SVE: {
+		svuint16_t vsign, shift, sv_matches;
+		svbool_t pred, match, bucket_wide_pred;
+		int i = 0;
+		uint64_t vl = svcnth();
+
+		vsign = svdup_u16(sig);
+		shift = svindex_u16(0, 1);
+
+		if (vl >= 2 * RTE_HASH_BUCKET_ENTRIES && RTE_HASH_BUCKET_ENTRIES <= 8) {
+			svuint16_t primary_array_vect, secondary_array_vect;
+			bucket_wide_pred = svwhilelt_b16(0, RTE_HASH_BUCKET_ENTRIES);
+			primary_array_vect = svld1_u16(bucket_wide_pred, prim_bucket_sigs);
+			secondary_array_vect = svld1_u16(bucket_wide_pred, sec_bucket_sigs);
+
+			/* We merged the two vectors so we can do both comparison at once */
+			primary_array_vect = svsplice_u16(bucket_wide_pred,
+				primary_array_vect,
+				secondary_array_vect);
+			pred = svwhilelt_b16(0, 2*RTE_HASH_BUCKET_ENTRIES);
+
+			/* Compare all signatures in the buckets */
+			match = svcmpeq_u16(pred, vsign, primary_array_vect);
+			if (svptest_any(svptrue_b16(), match)) {
+				sv_matches = svdup_u16(1);
+				sv_matches = svlsl_u16_z(match, sv_matches, shift);
+				*hitmask_buffer = svorv_u16(svptrue_b16(), sv_matches);
+			}
+		} else {
+			do {
+				pred = svwhilelt_b16(i, RTE_HASH_BUCKET_ENTRIES);
+				uint16_t lower_half = 0;
+				uint16_t upper_half = 0;
+				/* Compare all signatures in the primary bucket */
+				match = svcmpeq_u16(pred, vsign, svld1_u16(pred,
+							&prim_bucket_sigs[i]));
+				if (svptest_any(svptrue_b16(), match)) {
+					sv_matches = svdup_u16(1);
+					sv_matches = svlsl_u16_z(match, sv_matches, shift);
+					lower_half = svorv_u16(svptrue_b16(), sv_matches);
+				}
+				/* Compare all signatures in the secondary bucket */
+				match = svcmpeq_u16(pred, vsign, svld1_u16(pred,
+							&sec_bucket_sigs[i]));
+				if (svptest_any(svptrue_b16(), match)) {
+					sv_matches = svdup_u16(1);
+					sv_matches = svlsl_u16_z(match, sv_matches, shift);
+					upper_half = svorv_u16(svptrue_b16(), sv_matches)
+						<< RTE_HASH_BUCKET_ENTRIES;
+				}
+				hitmask_buffer[i/8] = upper_half | lower_half;
+				i += vl;
+			} while (i < RTE_HASH_BUCKET_ENTRIES);
+		}
+		}
+		break;
 #endif
 	default:
 		for (unsigned int i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
diff --git a/lib/hash/rte_cuckoo_hash.c b/lib/hash/rte_cuckoo_hash.c
index e41f03270a..7a474267f0 100644
--- a/lib/hash/rte_cuckoo_hash.c
+++ b/lib/hash/rte_cuckoo_hash.c
@@ -452,6 +452,8 @@  rte_hash_create(const struct rte_hash_parameters *params)
 #elif defined(RTE_ARCH_ARM64)
 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON)) {
 		h->sig_cmp_fn = RTE_HASH_COMPARE_NEON;
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SVE))
+			h->sig_cmp_fn = RTE_HASH_COMPARE_SVE;
 	}
 	else
 #endif