[v5] eal: fix unaligned loads/stores in rte_memcpy_generic

Message ID 20220117153711.32829-1-lucp.at.work@gmail.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series [v5] eal: fix unaligned loads/stores in rte_memcpy_generic

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/github-robot: build success github build: passed
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-x86_64-unit-testing success Testing PASS
ci/iol-x86_64-compile-testing success Testing PASS
ci/iol-aarch64-unit-testing success Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS
ci/iol-abi-testing success Testing PASS

Commit Message

Luc Pelletier Jan. 17, 2022, 3:37 p.m. UTC
  Calls to rte_memcpy_generic could result in unaligned loads/stores for
1 < n < 16. This is undefined behavior according to the C standard,
and it gets flagged by the clang undefined behavior sanitizer.

rte_memcpy_generic is called with unaligned src and dst addresses.
When 1 < n < 16, the code would cast both src and dst to a qword,
dword or word pointer, without verifying the alignment of src/dst. The
code was changed to use a packed structure to perform the unaligned
load/store operations, which makes the unaligned load/store operations
C-standard-compliant.

Fixes: d35cc1fe6a7a ("eal/x86: revert select optimized memcpy at run-time")
Cc: Xiaoyun Li <xiaoyun.li@intel.com>
Cc: stable@dpdk.org

Signed-off-by: Luc Pelletier <lucp.at.work@gmail.com>
---

Thanks to Stephen's pointer to look at the linux kernel, I was able to
find a way to perform the unaligned load/store using pure C code. The
new functions added to perform the load/store could likely be moved to a
different file and the code duplication could likely be eliminated by
using a macro. However, I will hold off on making these changes until I
get confirmation from maintainers that this technique is acceptable and
this is what we want to move forward with.
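
For illustration only, the macro-based dedup could take a shape along these
lines (a sketch, not part of this patch; the macro name is made up):

#define RTE_GEN_UNALIGNED_ACCESS(bits) \
static __rte_always_inline uint##bits##_t \
rte_load_unaligned_uint##bits(const void *ptr) \
{ \
        /* packed struct member gives the access 1-byte alignment */ \
        struct unaligned_uint##bits { uint##bits##_t val; } __rte_packed; \
        return ((const struct unaligned_uint##bits *)ptr)->val; \
} \
static __rte_always_inline void \
rte_store_unaligned_uint##bits(void *ptr, uint##bits##_t val) \
{ \
        struct unaligned_uint##bits { uint##bits##_t val; } __rte_packed; \
        ((struct unaligned_uint##bits *)ptr)->val = val; \
}

RTE_GEN_UNALIGNED_ACCESS(16)
RTE_GEN_UNALIGNED_ACCESS(32)
RTE_GEN_UNALIGNED_ACCESS(64)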

 lib/eal/x86/include/rte_memcpy.h | 142 +++++++++++++++++--------------
 1 file changed, 80 insertions(+), 62 deletions(-)
  

Comments

Luc Pelletier Feb. 4, 2022, 4:42 p.m. UTC | #1
Hi,

It's been several days and I haven't received any additional feedback
on this patch (and the other patch related to this one:
http://mails.dpdk.org/archives/dev/2022-January/232491.html). I would
welcome any kind of feedback. Ideally, I'm hoping an authoritative
voice would indicate if there's any interest in updating rte_memcpy
and if so, comment on the different possible solutions presented thus
far in this thread.

Thanks
  
Ananyev, Konstantin Feb. 4, 2022, 5:16 p.m. UTC | #2
> Calls to rte_memcpy_generic could result in unaligned loads/stores for
> 1 < n < 16. This is undefined behavior according to the C standard,
> and it gets flagged by the clang undefined behavior sanitizer.
> 
> rte_memcpy_generic is called with unaligned src and dst addresses.
> When 1 < n < 16, the code would cast both src and dst to a qword,
> dword or word pointer, without verifying the alignment of src/dst. The
> code was changed to use a packed structure to perform the unaligned
> load/store operations, which makes the unaligned load/store operations
> C-standard-compliant.

Still not sure we need to fix that:
This is an x86-specific code path, and as I remember, on x86 there are no
penalties for unaligned access to 2/4/8 byte values.
Though I like the introduction of the rte_mov15_or_less() function - it helps
with code dedup.

> 
> Fixes: d35cc1fe6a7a ("eal/x86: revert select optimized memcpy at run-time")
> Cc: Xiaoyun Li <xiaoyun.li@intel.com>
> Cc: stable@dpdk.org
> 
> Signed-off-by: Luc Pelletier <lucp.at.work@gmail.com>
> ---
> 
> Thanks to Stephen's pointer to look at the linux kernel, I was able to
> find a way to perform the unaligned load/store using pure C code. The
> new functions added to perform the load/store could likely be moved to a
> different file and the code duplication could likely be eliminated by
> using a macro. However, I will hold off on making these changes until I
> get confirmation from maintainers that this technique is acceptable and
> this is what we want to move forward with.
> 
>  lib/eal/x86/include/rte_memcpy.h | 142 +++++++++++++++++--------------
>  1 file changed, 80 insertions(+), 62 deletions(-)
> 
> diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
> index 1b6c6e585f..4e876d39eb 100644
> --- a/lib/eal/x86/include/rte_memcpy.h
> +++ b/lib/eal/x86/include/rte_memcpy.h
> @@ -45,6 +45,83 @@ extern "C" {
>  static __rte_always_inline void *
>  rte_memcpy(void *dst, const void *src, size_t n);
> 
> +static __rte_always_inline uint64_t
> +rte_load_unaligned_uint64(const void *ptr)
> +{
> +	struct unaligned_uint64 { uint64_t val; } __rte_packed;
> +	return ((const struct unaligned_uint64 *)ptr)->val;
> +}
> +
> +static __rte_always_inline uint32_t
> +rte_load_unaligned_uint32(const void *ptr)
> +{
> +	struct unaligned_uint32 { uint32_t val; } __rte_packed;
> +	return ((const struct unaligned_uint32 *)ptr)->val;
> +}
> +
> +static __rte_always_inline uint16_t
> +rte_load_unaligned_uint16(const void *ptr)
> +{
> +	struct unaligned_uint16 { uint16_t val; } __rte_packed;
> +	return ((const struct unaligned_uint16 *)ptr)->val;
> +}
> +
> +static __rte_always_inline void
> +rte_store_unaligned_uint64(void *ptr, uint64_t val)
> +{
> +	struct unaligned_uint64 { uint64_t val; } __rte_packed;
> +	((struct unaligned_uint64 *)ptr)->val = val;
> +}
> +
> +static __rte_always_inline void
> +rte_store_unaligned_uint32(void *ptr, uint32_t val)
> +{
> +	struct unaligned_uint32 { uint32_t val; } __rte_packed;
> +	((struct unaligned_uint32 *)ptr)->val = val;
> +}
> +
> +static __rte_always_inline void
> +rte_store_unaligned_uint16(void *ptr, uint16_t val)
> +{
> +	struct unaligned_uint16 { uint16_t val; } __rte_packed;
> +	((struct unaligned_uint16 *)ptr)->val = val;
> +}
> +
> +/**
> + * Copy bytes from one location to another,
> + * locations should not overlap.
> + * Use with unaligned src/dst, and n <= 15.
> + */
> +static __rte_always_inline void *
> +rte_mov15_or_less_unaligned(void *dst, const void *src, size_t n)
> +{
> +	void *ret = dst;
> +	if (n & 8) {
> +		rte_store_unaligned_uint64(
> +			dst,
> +			rte_load_unaligned_uint64(src));
> +		src = ((const uint64_t *)src + 1);
> +		dst = ((uint64_t *)dst + 1);
> +	}
> +	if (n & 4) {
> +		rte_store_unaligned_uint32(
> +			dst,
> +			rte_load_unaligned_uint32(src));
> +		src = ((const uint32_t *)src + 1);
> +		dst = ((uint32_t *)dst + 1);
> +	}
> +	if (n & 2) {
> +		rte_store_unaligned_uint16(
> +			dst,
> +			rte_load_unaligned_uint16(src));
> +		src = ((const uint16_t *)src + 1);
> +		dst = ((uint16_t *)dst + 1);
> +	}
> +	if (n & 1)
> +		*(uint8_t *)dst = *(const uint8_t *)src;
> +	return ret;
> +}
> +
>  #if defined __AVX512F__ && defined RTE_MEMCPY_AVX512
> 
>  #define ALIGNMENT_MASK 0x3F
> @@ -171,8 +248,6 @@ rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
>  static __rte_always_inline void *
>  rte_memcpy_generic(void *dst, const void *src, size_t n)
>  {
> -	uintptr_t dstu = (uintptr_t)dst;
> -	uintptr_t srcu = (uintptr_t)src;
>  	void *ret = dst;
>  	size_t dstofss;
>  	size_t bits;
> @@ -181,24 +256,7 @@ rte_memcpy_generic(void *dst, const void *src, size_t n)
>  	 * Copy less than 16 bytes
>  	 */
>  	if (n < 16) {
> -		if (n & 0x01) {
> -			*(uint8_t *)dstu = *(const uint8_t *)srcu;
> -			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint8_t *)dstu + 1);
> -		}
> -		if (n & 0x02) {
> -			*(uint16_t *)dstu = *(const uint16_t *)srcu;
> -			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint16_t *)dstu + 1);
> -		}
> -		if (n & 0x04) {
> -			*(uint32_t *)dstu = *(const uint32_t *)srcu;
> -			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint32_t *)dstu + 1);
> -		}
> -		if (n & 0x08)
> -			*(uint64_t *)dstu = *(const uint64_t *)srcu;
> -		return ret;
> +		return rte_mov15_or_less_unaligned(dst, src, n);
>  	}
> 
>  	/**
> @@ -379,8 +437,6 @@ rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
>  static __rte_always_inline void *
>  rte_memcpy_generic(void *dst, const void *src, size_t n)
>  {
> -	uintptr_t dstu = (uintptr_t)dst;
> -	uintptr_t srcu = (uintptr_t)src;
>  	void *ret = dst;
>  	size_t dstofss;
>  	size_t bits;
> @@ -389,25 +445,7 @@ rte_memcpy_generic(void *dst, const void *src, size_t n)
>  	 * Copy less than 16 bytes
>  	 */
>  	if (n < 16) {
> -		if (n & 0x01) {
> -			*(uint8_t *)dstu = *(const uint8_t *)srcu;
> -			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint8_t *)dstu + 1);
> -		}
> -		if (n & 0x02) {
> -			*(uint16_t *)dstu = *(const uint16_t *)srcu;
> -			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint16_t *)dstu + 1);
> -		}
> -		if (n & 0x04) {
> -			*(uint32_t *)dstu = *(const uint32_t *)srcu;
> -			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint32_t *)dstu + 1);
> -		}
> -		if (n & 0x08) {
> -			*(uint64_t *)dstu = *(const uint64_t *)srcu;
> -		}
> -		return ret;
> +		return rte_mov15_or_less_unaligned(dst, src, n);
>  	}
> 
>  	/**
> @@ -672,8 +710,6 @@ static __rte_always_inline void *
>  rte_memcpy_generic(void *dst, const void *src, size_t n)
>  {
>  	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
> -	uintptr_t dstu = (uintptr_t)dst;
> -	uintptr_t srcu = (uintptr_t)src;
>  	void *ret = dst;
>  	size_t dstofss;
>  	size_t srcofs;
> @@ -682,25 +718,7 @@ rte_memcpy_generic(void *dst, const void *src, size_t n)
>  	 * Copy less than 16 bytes
>  	 */
>  	if (n < 16) {
> -		if (n & 0x01) {
> -			*(uint8_t *)dstu = *(const uint8_t *)srcu;
> -			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint8_t *)dstu + 1);
> -		}
> -		if (n & 0x02) {
> -			*(uint16_t *)dstu = *(const uint16_t *)srcu;
> -			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint16_t *)dstu + 1);
> -		}
> -		if (n & 0x04) {
> -			*(uint32_t *)dstu = *(const uint32_t *)srcu;
> -			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint32_t *)dstu + 1);
> -		}
> -		if (n & 0x08) {
> -			*(uint64_t *)dstu = *(const uint64_t *)srcu;
> -		}
> -		return ret;
> +		return rte_mov15_or_less_unaligned(dst, src, n);
>  	}
> 
>  	/**
> --
> 2.25.1
  
Thomas Monjalon Feb. 8, 2022, 4:53 p.m. UTC | #3
04/02/2022 18:16, Ananyev, Konstantin:
> 
> > Calls to rte_memcpy_generic could result in unaligned loads/stores for
> > 1 < n < 16. This is undefined behavior according to the C standard,
> > and it gets flagged by the clang undefined behavior sanitizer.
> > 
> > rte_memcpy_generic is called with unaligned src and dst addresses.
> > When 1 < n < 16, the code would cast both src and dst to a qword,
> > dword or word pointer, without verifying the alignment of src/dst. The
> > code was changed to use a packed structure to perform the unaligned
> > load/store operations, which makes the unaligned load/store operations
> > C-standard-compliant.
> 
> Still not sure we need to fix that:
> This is an x86-specific code path, and as I remember, on x86 there are no
> penalties for unaligned access to 2/4/8 byte values.
> Though I like the introduction of the rte_mov15_or_less() function - it helps
> with code dedup.

Any more opinions?
  
Luc Pelletier Feb. 9, 2022, 3:05 p.m. UTC | #4
> > Calls to rte_memcpy_generic could result in unaligned loads/stores for
> > 1 < n < 16. This is undefined behavior according to the C standard,
> > and it gets flagged by the clang undefined behavior sanitizer.
> >
> > rte_memcpy_generic is called with unaligned src and dst addresses.
> > When 1 < n < 16, the code would cast both src and dst to a qword,
> > dword or word pointer, without verifying the alignment of src/dst. The
> > code was changed to use a packed structure to perform the unaligned
> > load/store operations, which makes the unaligned load/store operations
> > C-standard-compliant.
>
> Still not sure we need to fix that:
> This is an x86-specific code path, and as I remember, on x86 there are no
> penalties for unaligned access to 2/4/8 byte values.
> Though I like the introduction of the rte_mov15_or_less() function - it helps
> with code dedup.

Thanks for your input Konstantin. Much appreciated. Just to make sure
I understand, can you please confirm that we do not want to fix the
fact that unaligned access in C is undefined behaviour? I understand
that X86 allows unaligned accesses but since this isn't assembly code,
we have to go by what the C standard allows.

Also, do you have any opinion on the strict aliasing violation (as was
pointed out by Georg)? I suggested adding __may_alias__ thinking it
would likely fix the issue, but I'd really like to get confirmation
from someone who has better knowledge of how the attribute works
exactly.

Thanks
  
Ananyev, Konstantin Feb. 10, 2022, 2:04 p.m. UTC | #5
Hi Luc,

> > > Calls to rte_memcpy_generic could result in unaligned loads/stores for
> > > 1 < n < 16. This is undefined behavior according to the C standard,
> > > and it gets flagged by the clang undefined behavior sanitizer.
> > >
> > > rte_memcpy_generic is called with unaligned src and dst addresses.
> > > When 1 < n < 16, the code would cast both src and dst to a qword,
> > > dword or word pointer, without verifying the alignment of src/dst. The
> > > code was changed to use a packed structure to perform the unaligned
> > > load/store operations, which makes the unaligned load/store operations
> > > C-standard-compliant.
> >
> > Still not sure we need to fix that:
> > This is an x86-specific code path, and as I remember, on x86 there are no
> > penalties for unaligned access to 2/4/8 byte values.
> > Though I like the introduction of the rte_mov15_or_less() function - it helps
> > with code dedup.
> 
> Thanks for your input Konstantin. Much appreciated. Just to make sure
> I understand, can you please confirm that we do not want to fix the
> fact that unaligned access in C is undefined behaviour?

Yes, I don't think it is a real problem in that particular case.

> I understand
> that X86 allows unaligned accesses but since this isn't assembly code,
> we have to go by what the C standard allows.

> Also, do you have any opinion on the strict aliasing violation (as was
> pointed out by Georg)? I suggested adding __may_alias__ thinking it
> would likely fix the issue, but I'd really like to get confirmation
> from someone who has better knowledge of how the attribute works
> exactly.

Not sure I understand the problem you are referring to.
Are you saying that original rte_memcpy() code breaks strict aliasing? 
If so, could you point where exactly?
  
Luc Pelletier Feb. 10, 2022, 4:56 p.m. UTC | #6
Hi Konstantin,

> > Thanks for your input Konstantin. Much appreciated. Just to make sure
> > I understand, can you please confirm that we do not want to fix the
> > fact that unaligned access in C is undefined behaviour?
>
> Yes, I don't think it is a real problem in that particular case.

Perfect. Thank you.

> > Also, do you have any opinion on the strict aliasing violation (as was
> > pointed out by Georg)? I suggested adding __may_alias__ thinking it
> > would likely fix the issue, but I'd really like to get confirmation
> > from someone who has better knowledge of how the attribute works
> > exactly.
>
> Not sure I understand the problem you are referring to.
> Are you saying that original rte_memcpy() code breaks strict aliasing?
> If so, could you point where exactly?

As far as I understand, yes, it does break strict aliasing. For
example, in the following line:

*(uint64_t *)dstu = *(const uint64_t *)srcu;

IIUC, both casts break strict aliasing rules. While the src/dst
parameters are void* and can therefore be cast to something else
without breaking strict aliasing rules, the type of src/dst in the
calling code might be something other than uint64_t*. This can result
in src/dst pointers being cast to different unrelated types. AFAICT,
the fact that rte_memcpy is "always inline" increases the risk of the
compiler making an optimization that results in broken code.

I was able to come up with an example where the latest version of GCC
produces broken code when strict aliasing is enabled:

https://godbolt.org/z/3Yzvjr97c

With -fstrict-aliasing, it reorders a write and results in broken
code. If you change the compiler flags to -fno-strict-aliasing, it
produces the expected result.
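
A minimal sketch of the same class of problem (an illustration only, not the
exact code behind the godbolt link; copy8() and update_then_copy() are
made-up names):

#include <stdint.h>

struct pkt { uint32_t a; uint32_t b; };

/* stand-in for the inlined n == 8 branch of rte_memcpy_generic() */
static inline void
copy8(void *dst, const void *src)
{
        *(uint64_t *)dst = *(const uint64_t *)src;
}

uint32_t
update_then_copy(struct pkt *dst, struct pkt *src)
{
        src->a = 1;      /* store through a uint32_t lvalue */
        copy8(dst, src); /* load/store through uint64_t lvalues */
        return dst->a;
}

With -O2 -fstrict-aliasing the compiler may assume that the uint64_t accesses
in copy8() cannot alias the uint32_t members, so it is free to reorder the
store to src->a relative to the copy or keep a stale dst->a, which is the
kind of write reordering described above.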
  
Ananyev, Konstantin Feb. 11, 2022, 3:51 p.m. UTC | #7
> > Not sure I understand the problem you are referring to.
> > Are you saying that original rte_memcpy() code breaks strict aliasing?
> > If so, could you point where exactly?
> 
> As far as I understand, yes, it does break strict aliasing. For
> example, in the following line:
> 
> *(uint64_t *)dstu = *(const uint64_t *)srcu;
> 
> IIUC, both casts break strict aliasing rules. While the src/dst
> parameters are void* and can therefore be cast to something else
> without breaking strict aliasing rules, the type of src/dst in the
> calling code might be something other than uint64_t*. This can result
> in src/dst pointers being cast to different unrelated types. AFAICT,
> the fact that rte_memcpy is "always inline" increases the risk of the
> compiler making an optimization that results in broken code.
> 
> I was able to come up with an example where the latest version of GCC
> produces broken code when strict aliasing is enabled:
> 
> https://godbolt.org/z/3Yzvjr97c
> 
> With -fstrict-aliasing, it reorders a write and results in broken
> code. If you change the compiler flags to -fno-strict-aliasing, it
> produces the expected result.

Indeed it looks like a problem.
Thanks for pointing it out.
Was able to reproduce it with gcc 11 (clang 13 seems fine). 
Actually, adding '__attribute__ ((__may_alias__))' for both dst and src
didn't cure the problem.
To overcome it, I had to either:
add the '-fno-strict-aliasing' CC flag (as you mentioned above),
or add:
if (__builtin_constant_p(n))
        return memcpy(dst, src, n);

on top of rte_memcpy() code.

Though I suppose the problem might be much wider than just rte_memcpy().
We do have similar inline copying code in other places too.
As I understand it, some such cases might also be affected.
Let's say: '_rte_ring_(enqueue|dequeue_elems_*'.
Not sure what would be the best approach in general for such cases:
- always compile DPDK code with '-fno-strict-aliasing'
  But that wouldn't prevent people from using our inline functions without that flag.
  I also wonder what performance impact it would have.
- Try to fix all such occurrences manually (but it would be hard to catch all of them upfront)
- Something else ...?

I wonder what people think about it?
Konstantin
  
Luc Pelletier Feb. 13, 2022, 10:31 p.m. UTC | #8
Hi Konstantin,

> Indeed it looks like a problem.
> Thanks for pointing it out.
> Was able to reproduce it with gcc 11 (clang 13 seems fine).
> Actually, adding ' __attribute__ ((__may_alias__))' for both dst and src
> didn't cure the problem.

__may_alias__ works if it's applied to a typedef; see the following
for a modified version of my original example that works and uses
__may_alias__:

https://godbolt.org/z/W83zzoePq

The documentation I found for __may_alias__ is quite sparse, so I'm a
little wary of assuming it'll always work. I'm hoping someone with
more experience with the attribute would be able to add more
confidence to the assumption that it'll work in all cases.
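
The typedef form would look roughly like this (a sketch only; the typedef
name is a placeholder, not necessarily what the linked example uses):

#include <stdint.h>

typedef uint64_t __attribute__((__may_alias__)) uint64_alias_t;

/* accesses through a may_alias lvalue are exempt from type-based alias
 * analysis, so the compiler keeps them ordered against other accesses */
static inline void
copy8(void *dst, const void *src)
{
        *(uint64_alias_t *)dst = *(const uint64_alias_t *)src;
}

Applying the attribute to the dst/src pointer parameters (as tried above)
is not enough; it has to be attached to the type used for the access.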
  
Ananyev, Konstantin Feb. 14, 2022, 1:41 p.m. UTC | #9
Hi Luc,
 
> > Indeed it looks like a problem.
> > Thanks for pointing it out.
> > Was able to reproduce it with gcc 11 (clang 13 seems fine).
> > Actually, adding ' __attribute__ ((__may_alias__))' for both dst and src
> > didn't cure the problem.
> 
> __may_alias__ works if it's applied to a typedef; see the following
> for a modified version of my original example that works and uses
> __may_alias__:
> 
> https://godbolt.org/z/W83zzoePq

Interesting, I played with the '__may_alias__' attribute a bit more,
and yes, as you said, it seems to work properly only with typedefs.
If so, then we can probably add new typedefs (in rte_common.h or so)
and use them in rte_memcpy and probably a few other places where we
think it will be necessary.
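
As a rough sketch of what that could look like (names and placement are
hypothetical; this is just one way the typedef could be folded into the
helpers from this patch, not an agreed-upon form):

/* e.g. in rte_common.h */
typedef uint64_t __attribute__((__may_alias__)) rte_may_alias_uint64_t;

/* combined with the packed-struct helper from this patch */
static __rte_always_inline uint64_t
rte_load_unaligned_uint64(const void *ptr)
{
        struct unaligned_uint64 {
                rte_may_alias_uint64_t val;
        } __rte_packed;
        return ((const struct unaligned_uint64 *)ptr)->val;
}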
 
> The documentation I found for __may_alias__ is quite sparse, so I'm a
> little wary of assuming it'll always work.
> I'm hoping someone with
> more experience with the attribute would be able to add more
> confidence to the assumption that it'll work in all cases.

Agree, I would like to see more input too.
CC-ing the other guys who seemed to be interested in that discussion before.

Thanks
Konstantin
  

Patch

diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
index 1b6c6e585f..4e876d39eb 100644
--- a/lib/eal/x86/include/rte_memcpy.h
+++ b/lib/eal/x86/include/rte_memcpy.h
@@ -45,6 +45,83 @@  extern "C" {
 static __rte_always_inline void *
 rte_memcpy(void *dst, const void *src, size_t n);
 
+static __rte_always_inline uint64_t
+rte_load_unaligned_uint64(const void *ptr)
+{
+	struct unaligned_uint64 { uint64_t val; } __rte_packed;
+	return ((const struct unaligned_uint64 *)ptr)->val;
+}
+
+static __rte_always_inline uint32_t
+rte_load_unaligned_uint32(const void *ptr)
+{
+	struct unaligned_uint32 { uint32_t val; } __rte_packed;
+	return ((const struct unaligned_uint32 *)ptr)->val;
+}
+
+static __rte_always_inline uint16_t
+rte_load_unaligned_uint16(const void *ptr)
+{
+	struct unaligned_uint16 { uint16_t val; } __rte_packed;
+	return ((const struct unaligned_uint16 *)ptr)->val;
+}
+
+static __rte_always_inline void
+rte_store_unaligned_uint64(void *ptr, uint64_t val)
+{
+	struct unaligned_uint64 { uint64_t val; } __rte_packed;
+	((struct unaligned_uint64 *)ptr)->val = val;
+}
+
+static __rte_always_inline void
+rte_store_unaligned_uint32(void *ptr, uint32_t val)
+{
+	struct unaligned_uint32 { uint32_t val; } __rte_packed;
+	((struct unaligned_uint32 *)ptr)->val = val;
+}
+
+static __rte_always_inline void
+rte_store_unaligned_uint16(void *ptr, uint16_t val)
+{
+	struct unaligned_uint16 { uint16_t val; } __rte_packed;
+	((struct unaligned_uint16 *)ptr)->val = val;
+}
+
+/**
+ * Copy bytes from one location to another,
+ * locations should not overlap.
+ * Use with unaligned src/dst, and n <= 15.
+ */
+static __rte_always_inline void *
+rte_mov15_or_less_unaligned(void *dst, const void *src, size_t n)
+{
+	void *ret = dst;
+	if (n & 8) {
+		rte_store_unaligned_uint64(
+			dst,
+			rte_load_unaligned_uint64(src));
+		src = ((const uint64_t *)src + 1);
+		dst = ((uint64_t *)dst + 1);
+	}
+	if (n & 4) {
+		rte_store_unaligned_uint32(
+			dst,
+			rte_load_unaligned_uint32(src));
+		src = ((const uint32_t *)src + 1);
+		dst = ((uint32_t *)dst + 1);
+	}
+	if (n & 2) {
+		rte_store_unaligned_uint16(
+			dst,
+			rte_load_unaligned_uint16(src));
+		src = ((const uint16_t *)src + 1);
+		dst = ((uint16_t *)dst + 1);
+	}
+	if (n & 1)
+		*(uint8_t *)dst = *(const uint8_t *)src;
+	return ret;
+}
+
 #if defined __AVX512F__ && defined RTE_MEMCPY_AVX512
 
 #define ALIGNMENT_MASK 0x3F
@@ -171,8 +248,6 @@  rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
 static __rte_always_inline void *
 rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
 	void *ret = dst;
 	size_t dstofss;
 	size_t bits;
@@ -181,24 +256,7 @@  rte_memcpy_generic(void *dst, const void *src, size_t n)
 	 * Copy less than 16 bytes
 	 */
 	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08)
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		return ret;
+		return rte_mov15_or_less_unaligned(dst, src, n);
 	}
 
 	/**
@@ -379,8 +437,6 @@  rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
 static __rte_always_inline void *
 rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
 	void *ret = dst;
 	size_t dstofss;
 	size_t bits;
@@ -389,25 +445,7 @@  rte_memcpy_generic(void *dst, const void *src, size_t n)
 	 * Copy less than 16 bytes
 	 */
 	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08) {
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		}
-		return ret;
+		return rte_mov15_or_less_unaligned(dst, src, n);
 	}
 
 	/**
@@ -672,8 +710,6 @@  static __rte_always_inline void *
 rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
 	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
 	void *ret = dst;
 	size_t dstofss;
 	size_t srcofs;
@@ -682,25 +718,7 @@  rte_memcpy_generic(void *dst, const void *src, size_t n)
 	 * Copy less than 16 bytes
 	 */
 	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08) {
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		}
-		return ret;
+		return rte_mov15_or_less_unaligned(dst, src, n);
 	}
 
 	/**