[v2] eal: fix unaligned loads/stores in rte_memcpy_generic

Message ID 20220115213949.449313-1-lucp.at.work@gmail.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series [v2] eal: fix unaligned loads/stores in rte_memcpy_generic

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS
ci/github-robot: build success github build: passed
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS
ci/iol-abi-testing success Testing PASS
ci/iol-aarch64-unit-testing success Testing PASS
ci/iol-x86_64-compile-testing success Testing PASS
ci/iol-x86_64-unit-testing success Testing PASS

Commit Message

Luc Pelletier Jan. 15, 2022, 9:39 p.m. UTC
  Calls to rte_memcpy_generic could result in unaligned loads/stores for
1 < n < 16. This is undefined behavior according to the C standard,
and it gets flagged by the clang undefined behavior sanitizer.

rte_memcpy_generic is called with unaligned src and dst addresses.
When 1 < n < 16, the code would cast both src and dst to a qword,
dword or word pointer, without verifying the alignment of src/dst. The
code was changed to use a for loop to copy the bytes one by one.
Experimentation on compiler explorer indicates that gcc 7+
(released in 2017) and clang 7+ (released in 2018) both optimize the
for loop into the fewest possible memory loads and stores when n is known
at compile-time. If n is only known at run-time, gcc and clang behave
differently, but both still recognize that a memcpy is being done. More
recent versions of both compilers produce even better code.

Fixes: d35cc1fe6a7a ("eal/x86: revert select optimized memcpy at run-time")
Cc: Xiaoyun Li <xiaoyun.li@intel.com>
Cc: stable@dpdk.org

Signed-off-by: Luc Pelletier <lucp.at.work@gmail.com>
---

I forgot that code under x86 also needs to compile for 32-bit
(obviously). So, I did some more experimentation and replaced the
assembly code with a regular for loop. Explanations are in the updated
commit message. Experimentation was done on compiler explorer here:
https://godbolt.org/z/zK54rzPEn
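
For reference, a minimal sketch of the kind of test case inspected there
(illustrative only; the helper name and constant-size caller are assumptions,
not part of the patch):

#include <stddef.h>

/* Byte-wise copy loop of the same shape as the patch; with -O2 and a
 * compile-time-constant n, gcc 7+/clang 7+ recognize it as a memcpy and
 * emit a few wide loads/stores instead of a byte loop. */
static void *
copy_bytes(void *dst, const void *src, size_t n)
{
	char *d = dst;
	const char *s = src;
	while (n--)
		*d++ = *s++;
	return dst;
}

/* Constant-size caller: for n == 15 this typically compiles to two
 * overlapping 8-byte loads/stores. */
void
copy15(void *dst, const void *src)
{
	copy_bytes(dst, src, 15);
}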

 lib/eal/x86/include/rte_memcpy.h | 82 ++++++++------------------------
 1 file changed, 20 insertions(+), 62 deletions(-)
  

Comments

Stephen Hemminger Jan. 15, 2022, 10:13 p.m. UTC | #1
On Sat, 15 Jan 2022 16:39:50 -0500
Luc Pelletier <lucp.at.work@gmail.com> wrote:

> diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
> index 1b6c6e585f..e422397e49 100644
> --- a/lib/eal/x86/include/rte_memcpy.h
> +++ b/lib/eal/x86/include/rte_memcpy.h
> @@ -45,6 +45,23 @@ extern "C" {
>  static __rte_always_inline void *
>  rte_memcpy(void *dst, const void *src, size_t n);
>  
> +/**
> + * Copy bytes from one location to another,
> + * locations should not overlap.
> + * Use with unaligned src/dst, and n <= 15.
> + */
> +static __rte_always_inline void *
> +rte_mov15_or_less_unaligned(void *dst, const void *src, size_t n)
> +{
> +	void *ret = dst;
> +	for (; n; n--) {
> +		*((char *)dst) = *((const char *) src);
> +		dst = ((char *)dst) + 1;
> +		src = ((const char *)src) + 1;
> +	}
> +	return ret;
> +}

X86 always allows unaligned access, regardless of what the tools say.
Why impose additional overhead in performance-critical code?
  
Luc Pelletier Jan. 16, 2022, 2:09 p.m. UTC | #2
> X86 always allows unaligned access, regardless of what the tools say.
> Why impose additional overhead in performance-critical code?

Let me preface my response by saying that I'm not a C compiler developer.
Hopefully someone who is will read this and chime in.

I agree that X86 allows unaligned stores/loads. However, the C standard
doesn't: it says such accesses are undefined behavior. This means that the
code relies on undefined behavior. It may do the right thing all the time,
almost all the time, some of the time... it's undefined. It may work now,
but it may stop working in the future.
Here's a good discussion on SO about unaligned accesses in C on x86:

https://stackoverflow.com/questions/46790550/c-undefined-behavior-strict-aliasing-rule-or-incorrect-alignment/46790815#46790815

There's no way that I know of to do an unaligned store/load in C without
invoking undefined behavior. I can see two options: either write the code
in assembly, or use some other C construct that doesn't rely on undefined
behavior.

While the for loop may seem slower than the other options, it surprisingly
results in fewer load/store operations in certain scenarios. For example,
if n == 15 and it's known at compile-time, the compiler will generate 2
overlapping qword load/store operations (rather than the 4 generated by
the current code).

All that being said, I can go back to something similar to my first patch,
using inline assembly and making sure this time that it works for 32-bit
too. I will post a patch in a few minutes that does exactly that.
Maintainers can then chime in with their preferred option.
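
(For illustration of the second option — a C construct that avoids the
unaligned dereference — a minimal sketch; the helper names are assumptions
and this is not part of the patch:)

#include <stdint.h>
#include <string.h>

/* The memcpy idiom: no pointer is ever dereferenced at the wrong alignment,
 * and for a fixed size gcc/clang lower the call to a single unaligned
 * load/store instruction on x86. */
static inline uint64_t
load_u64(const void *p)
{
	uint64_t v;
	memcpy(&v, p, sizeof(v));
	return v;
}

static inline void
store_u64(void *p, uint64_t v)
{
	memcpy(p, &v, sizeof(v));
}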
  
Stephen Hemminger Jan. 16, 2022, 4:32 p.m. UTC | #3
On Sun, 16 Jan 2022 09:09:49 -0500
Luc Pelletier <lucp.at.work@gmail.com> wrote:

> > X86 always allows unaligned access, regardless of what the tools say.
> > Why impose additional overhead in performance-critical code?
> 
> Let me preface my response by saying that I'm not a C compiler developer.
> Hopefully someone who is will read this and chime in.
> 
> I agree that X86 allows unaligned store/load. However, the C standard doesn't,
> and says that it's undefined behavior. This means that the code relies on
> undefined behavior. It may do the right thing all the time, almost all the time,
> some of the time... it's undefined. It may work now but it may stop
> working in the future.
> Here's a good discussion on SO about unaligned accesses in C on x86:
> 
> https://stackoverflow.com/questions/46790550/c-undefined-behavior-strict-aliasing-rule-or-incorrect-alignment/46790815#46790815
> 
> There's no way to do the unaligned store/load in C (that I know of)
> without invoking
> undefined behavior. I can see 2 options, either write the code in
> assembly, or use
> some other C construct that doesn't rely on undefined behavior.
> 
> While the for loop may seem slower than the other options, it
> surprisingly results in
> fewer load/store operations in certain scenarios. For example, if n ==
> 15 and it's
> known at compile-time, the compiler will generate 2 overlapping qword load/store
> operations (rather than the 4 that are currently being done with the
> current code).
> 
> All that being said, I can go back to something similar to my first
> patch. Using inline
> assembly, and making sure this time that it works for 32-bit too. I
> will post a patch in
> a few minutes that does exactly that. Maintainers can then chime in
> with their preferred
> option.

I would propose that DPDK have the same kind of define as the kernel for
SAFE_UNALIGNED_ACCESS. The C standard has to apply to all architectures,
but DPDK will make the choice to be fast rather than standards-conformant.
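
(A hypothetical sketch of such a gate; RTE_SAFE_UNALIGNED_ACCESS is an
illustrative name, not an existing DPDK define:)

#include <stdint.h>

static inline uint64_t
load_u64_fast(const void *p)
{
#ifdef RTE_SAFE_UNALIGNED_ACCESS
	/* Platforms known to tolerate unaligned access (e.g. x86): keep the
	 * direct load, trading strict C conformance for speed. */
	return *(const uint64_t *)p;
#else
	/* Strict-alignment platforms: copy byte by byte through char-typed
	 * accesses, which are always permitted. */
	uint64_t v;
	uint8_t *out = (uint8_t *)&v;
	const uint8_t *in = (const uint8_t *)p;
	for (unsigned int i = 0; i < sizeof(v); i++)
		out[i] = in[i];
	return v;
#endif
}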
  
Georg Sauthoff Jan. 24, 2022, 11:21 p.m. UTC | #4
Hello,

Stephen Hemminger wrote (Sun, 16 Jan 2022 08:32:20 -0800):

> I would propose that DPDK have same kind of define as the kernel
> for SAFE_UNALIGNED_ACCESS.  The C standard has to apply to all architectures
> but DPDK will make the choice to be fast rather than standards conformant.

Perhaps it's worth mentioning that the Linux kernel is compiled with
-fno-strict-aliasing, because it contains code which violates the C
strict-aliasing rules. Such code yields undefined behavior and is thus
unsafe, by default, when compiled with optimizing compilers such as GCC and
Clang. But since the kernel supplies -fno-strict-aliasing, its code is safe
in the Linux kernel context.

In contrast, DPDK isn't compiled with -fno-strict-aliasing, in general.
Only a few drivers are built with -Wno-strict-aliasing.

Thus, one has to be careful when importing (or being inspired by) the
above-mentioned kernel defines.

In particular, it looks like version 5 of this patch yields undefined
behavior because it violates the strict-aliasing rules. It's possible that
GCC makes some extra guarantees for the constructs used, even when
compiling without -fno-strict-aliasing, but I'm not aware of any.

Please note that there is a reason GCC enables -fstrict-aliasing when
compiling with optimizations: performance. In other words,
-fno-strict-aliasing has performance implications, so one is advised not
to use it if possible. Likewise, when violating strict-aliasing rules one
can easily end up with low-performing and/or unsafe (buggy) code.
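
(For illustration — not from the patch — a minimal sketch of the kind of
code that the strict-aliasing rule makes unsafe, and of the optimization
the rule enables:)

#include <stdint.h>

/* Under -fstrict-aliasing (the default at -O2), the compiler may assume a
 * store through uint32_t* cannot modify a float object. If f and u actually
 * point to the same memory, this is undefined behavior and the function may
 * return the stale 1.0f instead of re-reading *f. */
static float
punned_store(float *f, uint32_t *u)
{
	*f = 1.0f;
	*u = 0;
	return *f;
}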

I haven't inspected the DPDK drivers which supply -Wno-strict-aliasing,
but I wouldn't be surprised if there are better alternatives, meaning the
code could be written in a way that doesn't require -Wno-strict-aliasing.

BTW, are there still systems that DPDK supports but have a memcpy() which
performs worse than rte_memcpy()?


Best regards,
Georg
  
Morten Brørup Jan. 25, 2022, 7:59 a.m. UTC | #5
> From: Georg Sauthoff [mailto:mail@gms.tf]
> Sent: Tuesday, 25 January 2022 00.22
> 
> Hello,
> 
> Stephen Hemminger wrote (Sun, 16 Jan 2022 08:32:20 -0800):
> 
> > I would propose that DPDK have same kind of define as the kernel
> > for SAFE_UNALIGNED_ACCESS.  The C standard has to apply to all
> architectures
> > but DPDK will make the choice to be fast rather than standards
> conformant.
> 
> perhaps it's worth mentioning that the Linux Kernel is compiled with
> -fno-strict-aliasing, because it contains code which violates the C
> strict aliasing rules. Such code yields undefined behavior and is thus
> unsafe when compiling with optimizing compilers such as GCC and Clang, by
> default. But since the Linux kernel supplies -fno-strict-aliasing its code is
> safe, in the Linux Kernel context.
> 
> In contrast, DPDK isn't compiled with -fno-strict-aliasing, in general.
> Only a few drivers are built with -Wno-strict-aliasing.
> 
> Thus, one has to be careful when importing (or being inspired) by the
> above mentioned kernel defines.
> 
> Especially, it looks like version 5 of this patch yields undefined
> behavior because it violates strict-aliasing rules.
> It's possible that GCC makes some extra guarantees for the used
> constructs, even when compiling without -fno-strict-aliasing. But I'm
> not aware of any.
> 
> Please note that there is a reason GCC enables -fstrict-aliasing when
> compiling with optimizations: Performance
> IOW, -fno-strict-aliasing has performance implications, thus one is
> advised to not use it, if possible.
> IOW, when violating strict-aliasing rules one can easily end up with
> low-performing and/or unsafe (buggy) code.

Generally, I have an extremely strong preference for being able to compile code with all warnings enabled in the compiler (-Wall and then some). This gives the developer a great bug-detector tool (through the compiler warnings).

In my experience, compiling a Linux loadable module with all warnings enabled spews out too many warnings about the kernel itself to be helpful. This stems from bad practice, which should not be adopted by DPDK.

> 
> I haven't inspected the DPDK drivers which supply -Wno-strict-aliasing -
> but I wouldn't be surprised if there are better alternatives. Meaning
> writing the code in a way that doesn't require -Wno-strict-aliasing.
> 
> BTW, are there still systems that DPDK supports but have a memcpy()
> which
> performs worse than rte_memcpy()?

Excellent question! I have also seen memmove() being replaced, with a comment about some systems using temporary memory in their implementation of memmove(). And most DPDK core libraries implement their own copy functions, following the general DPDK advice not to use standard C library functions. It's a shame there's no documentation of why all this was re-implemented in DPDK, so it could be checked whether it is still relevant or could be removed.

> 
> 
> Best regards,
> Georg
  
Luc Pelletier Jan. 25, 2022, 7:57 p.m. UTC | #6
Hi Georg,

> perhaps it's worth mentioning that the Linux Kernel is compiled with
> -fno-strict-aliasing, because it contains code which violates the C
> strict aliasing rules. Such code yields undefined behavior and is thus
> unsafe when compiling with optimizing compilers such as GCC and Clang, by
> default. But since the Linux supplies -fno-strict-aliasing its code is
> safe, in the Linux Kernel context.
>
> In contrast, DPDK isn't compiled with -fno-strict-aliasing, in general.
> Only a few drivers are built with -Wno-strict-aliasing.
>
> Thus, one has to be careful when importing (or being inspired) by the
> above mentioned kernel defines.
>
> Especially, it looks like version 5 of this patch yields undefined
> behavior because it violates strict-aliasing rules.
> It's possible that GCC makes some extra guarantees for the used
> constructs, even when compiling without -fno-strict-aliasing. But I'm
> not aware of any.

Thank you for pointing this out. That's an excellent comment and
something I had missed. Would adding the __may_alias__ attribute to
the unaligned_uint structs fix the issue?

I think it would, but I have to admit that documentation is a little
sparse and I don't have much experience with this attribute. If it
doesn't fix the problem, please let me know if you can think of
another solution that wouldn't involve going back to assembly (v4 of
this patch).

Also, I think the __may_alias__ attribute is supported by all
compilers that DPDK supports when targeting x86, but please let me
know if you don't think that's the case.
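
(A minimal sketch of the kind of __may_alias__-annotated construct being
discussed — illustrative names, not the actual v5 patch code:)

#include <stdint.h>

/* A packed struct has alignment 1, and __may_alias__ exempts accesses made
 * through it from strict-aliasing assumptions, so gcc/clang treat the copy
 * below much like a fixed-size memcpy. */
struct unaligned_uint64 {
	uint64_t val;
} __attribute__((__packed__, __may_alias__));

static inline void
copy8_any_alignment(void *dst, const void *src)
{
	((struct unaligned_uint64 *)dst)->val =
		((const struct unaligned_uint64 *)src)->val;
}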

> BTW, are there still systems that DPDK supports but have a memcpy() which
> performs worse than rte_memcpy()?

I concur with Morten, that's an excellent question. Analyzing the
memcpy source code on every platform/architecture that DPDK supports
would be one way to answer the question. However, that seems like a
significant undertaking and I'm hoping someone who knows the answer
will chime in.
  

Patch

diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
index 1b6c6e585f..e422397e49 100644
--- a/lib/eal/x86/include/rte_memcpy.h
+++ b/lib/eal/x86/include/rte_memcpy.h
@@ -45,6 +45,23 @@  extern "C" {
 static __rte_always_inline void *
 rte_memcpy(void *dst, const void *src, size_t n);
 
+/**
+ * Copy bytes from one location to another,
+ * locations should not overlap.
+ * Use with unaligned src/dst, and n <= 15.
+ */
+static __rte_always_inline void *
+rte_mov15_or_less_unaligned(void *dst, const void *src, size_t n)
+{
+	void *ret = dst;
+	for (; n; n--) {
+		*((char *)dst) = *((const char *) src);
+		dst = ((char *)dst) + 1;
+		src = ((const char *)src) + 1;
+	}
+	return ret;
+}
+
 #if defined __AVX512F__ && defined RTE_MEMCPY_AVX512
 
 #define ALIGNMENT_MASK 0x3F
@@ -171,8 +188,6 @@  rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
 static __rte_always_inline void *
 rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
 	void *ret = dst;
 	size_t dstofss;
 	size_t bits;
@@ -181,24 +196,7 @@  rte_memcpy_generic(void *dst, const void *src, size_t n)
 	 * Copy less than 16 bytes
 	 */
 	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08)
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		return ret;
+		return rte_mov15_or_less_unaligned(dst, src, n);
 	}
 
 	/**
@@ -379,8 +377,6 @@  rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
 static __rte_always_inline void *
 rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
 	void *ret = dst;
 	size_t dstofss;
 	size_t bits;
@@ -389,25 +385,7 @@  rte_memcpy_generic(void *dst, const void *src, size_t n)
 	 * Copy less than 16 bytes
 	 */
 	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08) {
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		}
-		return ret;
+		return rte_mov15_or_less_unaligned(dst, src, n);
 	}
 
 	/**
@@ -672,8 +650,6 @@  static __rte_always_inline void *
 rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
 	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
 	void *ret = dst;
 	size_t dstofss;
 	size_t srcofs;
@@ -682,25 +658,7 @@  rte_memcpy_generic(void *dst, const void *src, size_t n)
 	 * Copy less than 16 bytes
 	 */
 	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08) {
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		}
-		return ret;
+		return rte_mov15_or_less_unaligned(dst, src, n);
 	}
 
 	/**