[v4,2/2] lib/eal: add temporal store memcpy support for AMD platform

Message ID 20211027072810.257795-2-aman.kumar@vvdntech.in (mailing list archive)
State Changes Requested, archived
Delegated to: Thomas Monjalon
Series [v4,1/2] config/x86: add support for AMD platform

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/iol-spell-check-testing warning Testing issues
ci/Intel-compilation success Compilation OK
ci/iol-broadcom-Functional success Functional Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS
ci/github-robot: build success github build: passed
ci/intel-Testing success Testing PASS
ci/iol-x86_64-unit-testing success Testing PASS
ci/iol-x86_64-compile-testing success Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-aarch64-unit-testing success Testing PASS

Commit Message

Aman Kumar Oct. 27, 2021, 7:28 a.m. UTC
  This patch provides a rte_memcpy* call with temporal stores.
Use -Dcpu_instruction_set=znverX with build to enable this API.

Signed-off-by: Aman Kumar <aman.kumar@vvdntech.in>
---
 config/x86/meson.build           |   2 +
 lib/eal/x86/include/rte_memcpy.h | 114 +++++++++++++++++++++++++++++++
 2 files changed, 116 insertions(+)
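
To build DPDK with this code path enabled, the invocation would look like the following (a sketch only; the build directory name "build" is just an example):

    meson setup build -Dcpu_instruction_set=znver2
    ninja -C build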
  

Comments

Thomas Monjalon Oct. 27, 2021, 8:13 a.m. UTC | #1
27/10/2021 09:28, Aman Kumar:
> This patch provides a rte_memcpy* call with temporal stores.
> Use -Dcpu_instruction_set=znverX with build to enable this API.
> 
> Signed-off-by: Aman Kumar <aman.kumar@vvdntech.in>

For the series, Acked-by: Thomas Monjalon <thomas@monjalon.net>
With the hope that such an optimization will go into libc in the near future.

If there is no objection, I will merge this AMD-specific series in 21.11-rc2.
It should not affect other platforms.
  
Van Haaren, Harry Oct. 27, 2021, 11:03 a.m. UTC | #2
> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Thomas Monjalon
> Sent: Wednesday, October 27, 2021 9:13 AM
> To: Aman Kumar <aman.kumar@vvdntech.in>
> Cc: dev@dpdk.org; viacheslavo@nvidia.com; Burakov, Anatoly
> <anatoly.burakov@intel.com>; keesang.song@amd.com;
> aman.kumar@vvdntech.in; jerinjacobk@gmail.com; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Richardson, Bruce
> <bruce.richardson@intel.com>; honnappa.nagarahalli@arm.com; Ruifeng Wang
> <ruifeng.wang@arm.com>; David Christensen <drc@linux.vnet.ibm.com>;
> david.marchand@redhat.com; stephen@networkplumber.org
> Subject: Re: [dpdk-dev] [PATCH v4 2/2] lib/eal: add temporal store memcpy
> support for AMD platform
> 
> 27/10/2021 09:28, Aman Kumar:
> > This patch provides a rte_memcpy* call with temporal stores.
> > Use -Dcpu_instruction_set=znverX with build to enable this API.
> >
> > Signed-off-by: Aman Kumar <aman.kumar@vvdntech.in>
> 
> For the series, Acked-by: Thomas Monjalon <thomas@monjalon.net>
> With the hope that such optimization will go in libc in a near future.
> 
> If there is no objection, I will merge this AMD-specific series in 21.11-rc2.
> It should not affect other platforms.

Hi Folks,

This patchset was brought to my attention, and I have a few concerns.
I'll add short snippets of context from the patch here so I can refer to it below;

+/**
+ * Copy 16 bytes from one location to another,
+ * with temporal stores
+ */
+static __rte_always_inline void
+rte_copy16_ts(uint8_t *dst, uint8_t *src)
+{
+	__m128i var128;
+
+	var128 = _mm_stream_load_si128((__m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, var128);
+}

1) What is fundamentally specific to the znverX CPU? Is there any reason this can not just be enabled for x86-64 generic with SSE4.1 ISA requirements?
_mm_stream_load_si128() is part of SSE4.1
_mm_storeu_si128() is SSE2. 
Using the intrinsics guide for lookup of intrinsics to ISA level: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html?wapkw=intrinsics%20guide#text=_mm_stream_load&ig_expand=6884

2) Are -D options allowed to change/break API/ABI?
By allowing -Dcpu_instruction_set= to change the set of available functions, any application using it is no longer source-code (API) compatible with "DPDK" proper.
This patch essentially makes a "DPDK" app depend on "DPDK + CPU version -D flag", in an incompatible way (no fallback?).

3) The stream load instruction used here *requires* 16-byte alignment for its operand.
This is not documented, and worse, a uint8_t* is accepted, which is cast to (__m128i *).
This cast hides the compiler warning for expanding type-alignments.
And the code itself is broken - passing a "src" parameter that is not 16-byte aligned will segfault.
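
(Illustrating point 3: a minimal sketch of how the alignment requirement could at least be made explicit; the assert-based check is my own addition, not something from the patch. Needs <assert.h>, <stdint.h> and <immintrin.h>.)

    static __rte_always_inline void
    rte_copy16_ts(uint8_t *dst, uint8_t *src)
    {
            __m128i var128;

            /* movntdqa requires a 16-byte aligned source; make that explicit */
            assert(((uintptr_t)src & 15) == 0);
            var128 = _mm_stream_load_si128((__m128i *)src);
            _mm_storeu_si128((__m128i *)dst, var128);
    }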

4) Temporal and Non-temporal are not logically presented here.
Temporal loads/stores are normal loads/stores. They use the L1/L2 caches.
Non-temporal loads/stores indicate that the data will *not* be used again in a short space of time.
Non-temporal means "having no relation to time" according to my internet search.

5) The *store* here uses a normal store (temporal, targets cache). The *load*, however, is a streaming (non-temporal, no cache) load.
It is not clearly documented that a stream load will be used.
Only the inverse is documented: "copy with ts", i.e. copy with temporal store.
Is documenting the store as temporal meant to imply that the load is non-temporal?

6) What is the use-case for this? When would a user *want* to use this instead of rte_memcpy()?
If the data being loaded is relevant to datapath/packets, presumably other packets might require the
loaded data, so temporal (normal) loads should be used to cache the source data?

7) Why are streaming (non-temporal) loads & stores not used together? I guess maybe this comes down to the use-case,
but it's not clear to me right now why the loads are NT and the stores are T.
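
(Illustrating points 4-7: a short sketch, not from the patch, contrasting the four load/store flavours on x86; the non-temporal forms require 16-byte aligned pointers. Needs <immintrin.h>.)

    static inline void
    copy16_variants(void *dst, void *src)
    {
            __m128i t, nt;

            /* temporal (normal) load and store: go through the cache hierarchy */
            t = _mm_loadu_si128((const __m128i *)src);
            _mm_storeu_si128((__m128i *)dst, t);

            /* non-temporal load (SSE4.1, MOVNTDQA): hints the source will not be reused */
            nt = _mm_stream_load_si128((__m128i *)src);
            /* non-temporal store (SSE2, MOVNTDQ): writes without allocating a cache line */
            _mm_stream_si128((__m128i *)dst, nt);
    }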

All in all, I do not think merging this patch is a good idea. I would like to understand the motivation for adding
this type of function, and then see it being done in a way that is clearly documented regarding temporal loads/stores,
and not changing/adding APIs for specific CPUs.

So apologies for late feedback, but this is not of high enough quality to be merged to DPDK right now, NACK.
  
Mattias Rönnblom Oct. 27, 2021, 11:33 a.m. UTC | #3
On 2021-10-27 09:28, Aman Kumar wrote:
> This patch provides a rte_memcpy* call with temporal stores.
> Use -Dcpu_instruction_set=znverX with build to enable this API.
>
> Signed-off-by: Aman Kumar <aman.kumar@vvdntech.in>
> ---
>   config/x86/meson.build           |   2 +
>   lib/eal/x86/include/rte_memcpy.h | 114 +++++++++++++++++++++++++++++++
>   2 files changed, 116 insertions(+)
>
> diff --git a/config/x86/meson.build b/config/x86/meson.build
> index 21cda6fd33..56dae4aca7 100644
> --- a/config/x86/meson.build
> +++ b/config/x86/meson.build
> @@ -78,6 +78,8 @@ if get_option('cpu_instruction_set') == 'znver1'
>       dpdk_conf.set('RTE_MAX_LCORE', 256)
>   elif get_option('cpu_instruction_set') == 'znver2'
>       dpdk_conf.set('RTE_MAX_LCORE', 512)
> +    dpdk_conf.set('RTE_MEMCPY_AMDEPYC', 1)
>   elif get_option('cpu_instruction_set') == 'znver3'
>       dpdk_conf.set('RTE_MAX_LCORE', 512)
> +    dpdk_conf.set('RTE_MEMCPY_AMDEPYC', 1)
>   endif
> diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
> index 1b6c6e585f..8fe7822cb4 100644
> --- a/lib/eal/x86/include/rte_memcpy.h
> +++ b/lib/eal/x86/include/rte_memcpy.h
> @@ -376,6 +376,120 @@ rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
>   	}
>   }
>   
> +#if defined RTE_MEMCPY_AMDEPYC
> +
> +/**
> + * Copy 16 bytes from one location to another,
> + * with temporal stores
> + */
> +static __rte_always_inline void
> +rte_copy16_ts(uint8_t *dst, uint8_t *src)
> +{
> +	__m128i var128;
> +
> +	var128 = _mm_stream_load_si128((__m128i *)src);
> +	_mm_storeu_si128((__m128i *)dst, var128);
> +}
> +
> +/**
> + * Copy 32 bytes from one location to another,
> + * with temporal stores
> + */
> +static __rte_always_inline void
> +rte_copy32_ts(uint8_t *dst, uint8_t *src)
> +{
> +	__m256i ymm0;
> +
> +	ymm0 = _mm256_stream_load_si256((const __m256i *)src);
> +	_mm256_storeu_si256((__m256i *)dst, ymm0);
> +}
> +
> +/**
> + * Copy 64 bytes from one location to another,
> + * with temporal stores
> + */
> +static __rte_always_inline void
> +rte_copy64_ts(uint8_t *dst, uint8_t *src)
> +{
> +	rte_copy32_ts(dst + 0 * 32, src + 0 * 32);
> +	rte_copy32_ts(dst + 1 * 32, src + 1 * 32);
> +}
> +
> +/**
> + * Copy 128 bytes from one location to another,
> + * with temporal stores
> + */
> +static __rte_always_inline void
> +rte_copy128_ts(uint8_t *dst, uint8_t *src)
> +{
> +	rte_copy32_ts(dst + 0 * 32, src + 0 * 32);
> +	rte_copy32_ts(dst + 1 * 32, src + 1 * 32);
> +	rte_copy32_ts(dst + 2 * 32, src + 2 * 32);
> +	rte_copy32_ts(dst + 3 * 32, src + 3 * 32);
> +}
> +
> +/**
> + * Copy len bytes from one location to another,
> + * with temporal stores 16B aligned
> + */
> +static __rte_always_inline void *
> +rte_memcpy_aligned_tstore16_generic(void *dst, void *src, int len)
> +{
> +	void *dest = dst;
> +
> +	while (len >= 128) {
> +		rte_copy128_ts((uint8_t *)dst, (uint8_t *)src);
> +		dst = (uint8_t *)dst + 128;
> +		src = (uint8_t *)src + 128;
> +		len -= 128;
> +	}
> +	while (len >= 64) {
> +		rte_copy64_ts((uint8_t *)dst, (uint8_t *)src);
> +		dst = (uint8_t *)dst + 64;
> +		src = (uint8_t *)src + 64;
> +		len -= 64;
> +	}
> +	while (len >= 32) {
> +		rte_copy32_ts((uint8_t *)dst, (uint8_t *)src);
> +		dst = (uint8_t *)dst + 32;
> +		src = (uint8_t *)src + 32;
> +		len -= 32;
> +	}
> +	if (len >= 16) {
> +		rte_copy16_ts((uint8_t *)dst, (uint8_t *)src);
> +		dst = (uint8_t *)dst + 16;
> +		src = (uint8_t *)src + 16;
> +		len -= 16;
> +	}
> +	if (len >= 8) {
> +		*(uint64_t *)dst = *(const uint64_t *)src;
> +		dst = (uint8_t *)dst + 8;
> +		src = (uint8_t *)src + 8;
> +		len -= 8;
> +	}
> +	if (len >= 4) {
> +		*(uint32_t *)dst = *(const uint32_t *)src;
> +		dst = (uint8_t *)dst + 4;
> +		src = (uint8_t *)src + 4;
> +		len -= 4;
> +	}
> +	if (len != 0) {
> +		dst = (uint8_t *)dst - (4 - len);
> +		src = (uint8_t *)src - (4 - len);
> +		*(uint32_t *)dst = *(const uint32_t *)src;
> +	}
> +
> +	return dest;


Don't you need an _mm_sfence after the NT stores to avoid surprises (e.g.,
if you use this NT memcpy() in combination with DPDK rings)? NT stores
are weakly ordered on x86_64, from what I understand.
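
(A minimal sketch of the concern, assuming a hypothetical NT-store copy named rte_memcpy_nt(); the buffer and ring names are illustrative only.)

    rte_memcpy_nt(archive_buf, pkt_payload, len);   /* weakly-ordered NT stores */
    _mm_sfence();                                   /* order them before publishing */
    rte_ring_enqueue(ring, archive_buf);            /* consumer now sees complete data */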


> +}
> +
> +static __rte_always_inline void *
> +rte_memcpy_aligned_tstore16(void *dst, void *src, int len)


Shouldn't both dst and src be marked __restrict? Goes for all these 
functions.
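
(For example, a sketch of the suggested prototype, not the patch's actual signature:)

    static __rte_always_inline void *
    rte_memcpy_aligned_tstore16(void *__restrict dst, void *__restrict src, int len);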

> +{
> +	return rte_memcpy_aligned_tstore16_generic(dst, src, len);
> +}
> +
> +#endif /* RTE_MEMCPY_AMDEPYC */


What do x86_64 NT stores have to do with EPYC?


> +
>   static __rte_always_inline void *
>   rte_memcpy_generic(void *dst, const void *src, size_t n)
>   {
  
Mattias Rönnblom Oct. 27, 2021, 11:41 a.m. UTC | #4
On 2021-10-27 13:03, Van Haaren, Harry wrote:
>> -----Original Message-----
>> From: dev <dev-bounces@dpdk.org> On Behalf Of Thomas Monjalon
>> Sent: Wednesday, October 27, 2021 9:13 AM
>> To: Aman Kumar <aman.kumar@vvdntech.in>
>> Cc: dev@dpdk.org; viacheslavo@nvidia.com; Burakov, Anatoly
>> <anatoly.burakov@intel.com>; keesang.song@amd.com;
>> aman.kumar@vvdntech.in; jerinjacobk@gmail.com; Ananyev, Konstantin
>> <konstantin.ananyev@intel.com>; Richardson, Bruce
>> <bruce.richardson@intel.com>; honnappa.nagarahalli@arm.com; Ruifeng Wang
>> <ruifeng.wang@arm.com>; David Christensen <drc@linux.vnet.ibm.com>;
>> david.marchand@redhat.com; stephen@networkplumber.org
>> Subject: Re: [dpdk-dev] [PATCH v4 2/2] lib/eal: add temporal store memcpy
>> support for AMD platform
>>
>> 27/10/2021 09:28, Aman Kumar:
>>> This patch provides a rte_memcpy* call with temporal stores.
>>> Use -Dcpu_instruction_set=znverX with build to enable this API.
>>>
>>> Signed-off-by: Aman Kumar <aman.kumar@vvdntech.in>
>> For the series, Acked-by: Thomas Monjalon <thomas@monjalon.net>
>> With the hope that such optimization will go in libc in a near future.
>>
>> If there is no objection, I will merge this AMD-specific series in 21.11-rc2.
>> It should not affect other platforms.
> Hi Folks,
>
> This patchset was brought to my attention, and I have a few concerns.
> I'll add short snippets of context from the patch here so I can refer to it below;
>
> +/**
> + * Copy 16 bytes from one location to another,
> + * with temporal stores
> + */
> +static __rte_always_inline void
> +rte_copy16_ts(uint8_t *dst, uint8_t *src)
> +{
> +	__m128i var128;
> +
> +	var128 = _mm_stream_load_si128((__m128i *)src);
> +	_mm_storeu_si128((__m128i *)dst, var128);
> +}
>
> 1) What is fundamentally specific to the znverX CPU? Is there any reason this can not just be enabled for x86-64 generic with SSE4.1 ISA requirements?
> _mm_stream_load_si128() is part of SSE4.1
> _mm_storeu_si128() is SSE2.
> Using the intrinsics guide for lookup of intrinsics to ISA level: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html?wapkw=intrinsics%20guide#text=_mm_stream_load&ig_expand=6884
>
> 2) Are -D options allowed to change/break API/ABI?
> By allowing -Dcpu_instruction_set= to change available functions, any application using it is no longer source-code (API) compatible with "DPDK" proper.
> This patch essentially splits a "DPDK" app to depend on "DPDK + CPU version -D flag", in an incompatible way (no fallback?).
>
> 3) The stream load instruction used here *requires* 16-byte alignment for its operand.
> This is not documented, and worse, a uint8_t* is accepted, which is cast to (__m128i *).
> This cast hides the compiler warning for expanding type-alignments.
> And the code itself is broken - passing a "src" parameter that is not 16-byte aligned will segfault.
>
> 4) Temporal and Non-temporal are not logically presented here.
> Temporal loads/stores are normal loads/stores. They use the L1/L2 caches.
> Non-temporal loads/stores indicate that the data will *not* be used again in a short space of time.
> Non-temporal means "having no relation to time" according to my internet search.
>
> 5) The *store* here uses a normal store (temporal, targets cache). The *load* however is a streaming (non-temporal, no cache) load.
> It is not clearly documented that A) stream load will be used.
> The inverse is documented "copy with ts" aka, copy with temporal store.
> Is documenting the store as temporal meant to imply that the load is non-temporal?
>
> 6) What is the use-case for this? When would a user *want* to use this instead of rte_memcpy()?
> If the data being loaded is relevant to datapath/packets, presumably other packets might require the
> loaded data, so temporal (normal) loads should be used to cache the source data?


I'm not sure if your first question is rhetorical or not, but a memcpy() 
in a NT variant is certainly useful. One use case for a memcpy() with 
temporal loads and non-temporal stores is if you need to archive packet 
payload for (distant, potential) future use, and want to avoid causing 
unnecessary LLC evictions while doing so.
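
A minimal sketch of such a variant (temporal loads, non-temporal stores), purely illustrative and assuming a 16-byte aligned dst and a length that is a multiple of 16:

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    static inline void
    copy_nt_store(void *dst, const void *src, size_t len)
    {
            size_t i;

            for (i = 0; i < len; i += 16) {
                    /* temporal load: the source may still be useful to the datapath */
                    __m128i v = _mm_loadu_si128((const __m128i *)((const uint8_t *)src + i));
                    /* non-temporal store: don't evict hot lines for archived data */
                    _mm_stream_si128((__m128i *)((uint8_t *)dst + i), v);
            }
            _mm_sfence(); /* NT stores are weakly ordered; fence before publishing */
    }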


> 7) Why is streaming (non-temporal) loads & stores not used? I guess maybe this is regarding the use-case,
> but its not clear to me right now why loads are NT, and stores are T.
>
> All in all, I do not think merging this patch is a good idea. I would like to understand the motivation for adding
> this type of function, and then see it being done in a way that is clearly documented regarding temporal loads/stores,
> and not changing/adding APIs for specific CPUs.
>
> So apologies for late feedback, but this is not of high enough quality to be merged to DPDK right now, NACK.
  
Van Haaren, Harry Oct. 27, 2021, 12:15 p.m. UTC | #5
> -----Original Message-----
> From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Sent: Wednesday, October 27, 2021 12:42 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Thomas Monjalon
> <thomas@monjalon.net>; Aman Kumar <aman.kumar@vvdntech.in>
> Cc: dev@dpdk.org; viacheslavo@nvidia.com; Burakov, Anatoly
> <anatoly.burakov@intel.com>; Song, Keesang <Keesang.Song@amd.com>;
> jerinjacobk@gmail.com; Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> Richardson, Bruce <bruce.richardson@intel.com>;
> honnappa.nagarahalli@arm.com; Ruifeng Wang <ruifeng.wang@arm.com>;
> David Christensen <drc@linux.vnet.ibm.com>; david.marchand@redhat.com;
> stephen@networkplumber.org
> Subject: Re: [dpdk-dev] [PATCH v4 2/2] lib/eal: add temporal store memcpy
> support for AMD platform
> 
> On 2021-10-27 13:03, Van Haaren, Harry wrote:
> >> -----Original Message-----

<snip>

Hi Mattias,

> > 6) What is the use-case for this? When would a user *want* to use this instead
> of rte_memcpy()?
> > If the data being loaded is relevant to datapath/packets, presumably other
> packets might require the
> > loaded data, so temporal (normal) loads should be used to cache the source
> data?
> 
> 
> I'm not sure if your first question is rhetorical or not, but a memcpy()
> in a NT variant is certainly useful. One use case for a memcpy() with
> temporal loads and non-temporal stores is if you need to archive packet
> payload for (distant, potential) future use, and want to avoid causing
> unnecessary LLC evictions while doing so.

Yes I agree that there are certainly benefits in using cache-locality hints.
There is an open question around if the src or dst or both are non-temporal.

In the implementation of this patch, the NT/T type of store is reversed from your use-case:
1) Loads are NT (so loaded data is not cached for future packets)
2) Stores are T (so copied/dst data is now resident in L1/L2)

In theory there might even be valid uses for this type of memcpy where loaded
data is not needed again soon and stored data is referenced again soon,
although I cannot think of any here while typing this mail..

I think some use-case examples, and clear documentation on when/how to choose
between rte_memcpy() or any (potential future) rte_memcpy_nt() variants is required
to progress this patch.

Assuming a strong use-case exists, and it can be clearly indicated to users of DPDK APIs which
rte_memcpy() to use, we can look at technical details around enabling the implementation.

-Harry

<snip remaining points>
  
Ananyev, Konstantin Oct. 27, 2021, 12:22 p.m. UTC | #6
> 
> Hi Mattias,
> 
> > > 6) What is the use-case for this? When would a user *want* to use this instead
> > of rte_memcpy()?
> > > If the data being loaded is relevant to datapath/packets, presumably other
> > packets might require the
> > > loaded data, so temporal (normal) loads should be used to cache the source
> > data?
> >
> >
> > I'm not sure if your first question is rhetorical or not, but a memcpy()
> > in a NT variant is certainly useful. One use case for a memcpy() with
> > temporal loads and non-temporal stores is if you need to archive packet
> > payload for (distant, potential) future use, and want to avoid causing
> > unnecessary LLC evictions while doing so.
> 
> Yes I agree that there are certainly benefits in using cache-locality hints.
> There is an open question around if the src or dst or both are non-temporal.
> 
> In the implementation of this patch, the NT/T type of store is reversed from your use-case:
> 1) Loads are NT (so loaded data is not cached for future packets)
> 2) Stores are T (so copied/dst data is now resident in L1/L2)
> 
> In theory there might even be valid uses for this type of memcpy where loaded
> data is not needed again soon and stored data is referenced again soon,
> although I cannot think of any here while typing this mail..
> 
> I think some use-case examples, and clear documentation on when/how to choose
> between rte_memcpy() or any (potential future) rte_memcpy_nt() variants is required
> to progress this patch.
> 
> Assuming a strong use-case exists, and it can be clearly indicators to users of DPDK APIs which
> rte_memcpy() to use, we can look at technical details around enabling the implementation.
> 

+1 here.
Function behaviour and restrictions (src parameter needs to be 16/32 B aligned, etc.),
along with expected usage scenarios, have to be documented properly.
Again, as Harry pointed out, I don't see any AMD-specific instructions in this function,
so presumably such a function can go into an __AVX2__ code block and no new defines will
be required.
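
As a sketch of that suggestion (the function name here is hypothetical; the body matches the patch, only the guard changes):

    #if defined(__AVX2__)
    /* NT-load copy of 32 bytes; no platform-specific build define needed */
    static __rte_always_inline void
    rte_copy32_nt_load(uint8_t *dst, const uint8_t *src)
    {
            __m256i ymm0 = _mm256_stream_load_si256((const __m256i *)src);
            _mm256_storeu_si256((__m256i *)dst, ymm0);
    }
    #endif /* __AVX2__ */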
  
Aman Kumar Oct. 27, 2021, 1:34 p.m. UTC | #7
On Wed, Oct 27, 2021 at 5:53 PM Ananyev, Konstantin <konstantin.ananyev@intel.com> wrote:

> >
> > Hi Mattias,
> >
> > > > 6) What is the use-case for this? When would a user *want* to use
> this instead
> > > of rte_memcpy()?
> > > > If the data being loaded is relevant to datapath/packets, presumably
> other
> > > packets might require the
> > > > loaded data, so temporal (normal) loads should be used to cache the
> source
> > > data?
> > >
> > >
> > > I'm not sure if your first question is rhetorical or not, but a
> memcpy()
> > > in a NT variant is certainly useful. One use case for a memcpy() with
> > > temporal loads and non-temporal stores is if you need to archive packet
> > > payload for (distant, potential) future use, and want to avoid causing
> > > unnecessary LLC evictions while doing so.
> >
> > Yes I agree that there are certainly benefits in using cache-locality
> hints.
> > There is an open question around if the src or dst or both are
> non-temporal.
> >
> > In the implementation of this patch, the NT/T type of store is reversed
> from your use-case:
> > 1) Loads are NT (so loaded data is not cached for future packets)
> > 2) Stores are T (so copied/dst data is now resident in L1/L2)
> >
> > In theory there might even be valid uses for this type of memcpy where
> loaded
> > data is not needed again soon and stored data is referenced again soon,
> > although I cannot think of any here while typing this mail..
> >
> > I think some use-case examples, and clear documentation on when/how to
> choose
> > between rte_memcpy() or any (potential future) rte_memcpy_nt() variants
> is required
> > to progress this patch.
> >
> > Assuming a strong use-case exists, and it can be clearly indicators to
> users of DPDK APIs which
> > rte_memcpy() to use, we can look at technical details around enabling
> the implementation.
> >
>
> +1 here.
> Function behaviour and restrictions (src parameter needs to be 16/32 B
> aligned, etc.),
> along with expected usage scenarios have to be documented properly.
> Again, as Harry pointed out, I don't see any AMD specific instructions in
> this function,
> so presumably such function can go into __AVX2__ code block and no new
> defines will
> be required.
>
Agreed that the APIs are generic, but we've kept them under an AMD flag for a
simple reason: they are NOT tested on any other platform.
A use-case showing how to use this was planned earlier for the mlx5 PMD but dropped
in this version of the patch, as the data path of mlx5 is going to be refactored
soon and this may not be useful for future versions of mlx5 (>22.02).
Ref link: adaptation to mlx5 mprq
<https://patchwork.dpdk.org/project/dpdk/patch/20211019104724.19416-2-aman.kumar@vvdntech.in/>
(*we plan to adapt this in a future version*)
The patch in the link basically enhances the mlx5 mprq implementation for our
specific use-case, and with 128B packet size we achieve ~60% better perf.
We understand that the use of this copy function should be documented, which we
plan to do along with a few other platform-specific optimizations in future
versions of DPDK. As this does not conflict with other platforms, can we still
keep it under the AMD flag for now, as suggested by Thomas?
  
Van Haaren, Harry Oct. 27, 2021, 2:10 p.m. UTC | #8
From: Aman Kumar <aman.kumar@vvdntech.in> 
Sent: Wednesday, October 27, 2021 2:35 PM
To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; mattias.ronnblom <mattias.ronnblom@ericsson.com>; Thomas Monjalon <thomas@monjalon.net>; dev@dpdk.org; viacheslavo@nvidia.com; Burakov, Anatoly <anatoly.burakov@intel.com>; Song, Keesang <Keesang.Song@amd.com>; jerinjacobk@gmail.com; Richardson, Bruce <bruce.richardson@intel.com>; honnappa.nagarahalli@arm.com; Ruifeng Wang <ruifeng.wang@arm.com>; David Christensen <drc@linux.vnet.ibm.com>; david.marchand@redhat.com; stephen@networkplumber.org
Subject: Re: [dpdk-dev] [PATCH v4 2/2] lib/eal: add temporal store memcpy support for AMD platform

Hi Aman,

Please send plain-text email; converting to other formats makes writing inline replies difficult.
I've converted this reply email back to plain text, and will annotate the email below with [<author> wrote]:

On Wed, Oct 27, 2021 at 5:53 PM Ananyev, Konstantin <konstantin.ananyev@intel.com> wrote:
> 
> Hi Mattias,
> 
> > > 6) What is the use-case for this? When would a user *want* to use this instead
> > of rte_memcpy()?
> > > If the data being loaded is relevant to datapath/packets, presumably other
> > packets might require the
> > > loaded data, so temporal (normal) loads should be used to cache the source
> > data?
> >
> >
> > I'm not sure if your first question is rhetorical or not, but a memcpy()
> > in a NT variant is certainly useful. One use case for a memcpy() with
> > temporal loads and non-temporal stores is if you need to archive packet
> > payload for (distant, potential) future use, and want to avoid causing
> > unnecessary LLC evictions while doing so.
> 
> Yes I agree that there are certainly benefits in using cache-locality hints.
> There is an open question around if the src or dst or both are non-temporal.
> 
> In the implementation of this patch, the NT/T type of store is reversed from your use-case:
> 1) Loads are NT (so loaded data is not cached for future packets)
> 2) Stores are T (so copied/dst data is now resident in L1/L2)
> 
> In theory there might even be valid uses for this type of memcpy where loaded
> data is not needed again soon and stored data is referenced again soon,
> although I cannot think of any here while typing this mail..
> 
> I think some use-case examples, and clear documentation on when/how to choose
> between rte_memcpy() or any (potential future) rte_memcpy_nt() variants is required
> to progress this patch.
> 
> Assuming a strong use-case exists, and it can be clearly indicators to users of DPDK APIs which
> rte_memcpy() to use, we can look at technical details around enabling the implementation.
> 

[Konstantin wrote]:
+1 here.
Function behaviour and restrictions (src parameter needs to be 16/32 B aligned, etc.),
along with expected usage scenarios have to be documented properly.
Again, as Harry pointed out, I don't see any AMD specific instructions in this function,
so presumably such function can go into __AVX2__ code block and no new defines will
be required. 


[Aman wrote]:
Agreed that APIs are generic but we've kept under an AMD flag for a simple reason that it is NOT tested on any other platform.
A use-case on how to use this was planned earlier for mlx5 pmd but dropped in this version of patch as the data path of mlx5 is going to be refactored soon and may not be useful for future versions of mlx5 (>22.02). 
Ref link: https://patchwork.dpdk.org/project/dpdk/patch/20211019104724.19416-2-aman.kumar@vvdntech.in/ (we've plan to adapt this into future version)
The patch in the link basically enhances mlx5 mprq implementation for our specific use-case and with 128B packet size, we achieve ~60% better perf. We understand the use of this copy function should be documented which we shall plan along with few other platform specific optimizations in future versions of DPDK. As this does not conflict with other platforms, can we still keep under AMD flag for now as suggested by Thomas?


[HvH wrote]:
As an open-source community, any contributions should aim to improve the whole.
In the past, numerous improvements have been merged to DPDK that improve performance.
Sometimes these are architecture-specific (x86/arm/ppc), sometimes they are ISA-specific (SSE, AVX512, NEON).

I am not familiar with any cases in DPDK, where there is a #ifdef based on a *specific platform*.
A quick "grep" through the "dpdk/lib" directory does not show any place where PMD or generic code
has been explicitly optimized for a *specific platform*.

Obviously, in cases where ISA either exists or does not exist, yes there is an optimization to enable it.
But this is not exposed as a top-level compile-time option, it uses runtime CPU ISA detection.
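
For illustration, a sketch of that runtime-detection pattern (the memcpy_nt_* variant names are hypothetical; needs <rte_common.h> and <rte_cpuflags.h>):

    /* memcpy_nt_fallback() and memcpy_nt_avx2() are hypothetical implementations */
    static void *(*memcpy_nt)(void *dst, const void *src, size_t n) = memcpy_nt_fallback;

    RTE_INIT(memcpy_nt_init)
    {
            if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
                    memcpy_nt = memcpy_nt_avx2;
    }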

Please take a step back from the code, and look at what this patch asks of DPDK:
"Please accept & maintain these changes upstream, which benefit only platform X, even though these ISA features are also available on other platforms".

Other patches that enhance performance of DPDK ask this:
"Please accept & maintain these changes upstream, which benefit all platforms which have ISA capability X".


=== Question "As this does not conflict with other platforms, can we still keep under AMD flag for now"?
I feel the contribution is too specific to a platform. Make it generic by enabling it at an ISA capability level.

Please yes, contribute to the DPDK community by improving performance of a PMD by enabling/leveraging ISA.
But do so in a way that does not benefit only a specific platform - do so in a way that enhances all of DPDK, as
other patches have done for the DPDK that this patch is built on.

If you have concerns that the PMD maintainers will not accept the changes due to potential regressions on
other platforms, then discuss those, make a plan on how to performance validate, and work to a solution.


=== Regarding specifically the request for "can we still keep under AMD flag for now"?
I do not believe we should introduce APIs for specific platforms. DPDK's EAL is an abstraction layer.
The value of EAL is to provide a common abstraction. This platform-specific flag breaks the abstraction,
and results in packaging issues, as well as API/ABI instability based on -Dcpu_instruction_set choice.
So, no, we should not introduce APIs based on any compile-time flag.
  
Ananyev, Konstantin Oct. 27, 2021, 2:26 p.m. UTC | #9
From: Aman Kumar <aman.kumar@vvdntech.in> 
Sent: Wednesday, October 27, 2021 2:35 PM
To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; mattias.ronnblom <mattias.ronnblom@ericsson.com>; Thomas Monjalon <thomas@monjalon.net>; dev@dpdk.org; viacheslavo@nvidia.com; Burakov, Anatoly <anatoly.burakov@intel.com>; Song, Keesang <Keesang.Song@amd.com>; jerinjacobk@gmail.com; Richardson, Bruce <bruce.richardson@intel.com>; honnappa.nagarahalli@arm.com; Ruifeng Wang <ruifeng.wang@arm.com>; David Christensen <drc@linux.vnet.ibm.com>; david.marchand@redhat.com; stephen@networkplumber.org
Subject: Re: [dpdk-dev] [PATCH v4 2/2] lib/eal: add temporal store memcpy support for AMD platform

>> 
>> Hi Mattias,
>> 
>> > > 6) What is the use-case for this? When would a user *want* to use this instead
>> > of rte_memcpy()?
>> > > If the data being loaded is relevant to datapath/packets, presumably other
>> > packets might require the
>> > > loaded data, so temporal (normal) loads should be used to cache the source
>> > data?
>> >
>> >
>> > I'm not sure if your first question is rhetorical or not, but a memcpy()
>> > in a NT variant is certainly useful. One use case for a memcpy() with
>> > temporal loads and non-temporal stores is if you need to archive packet
>> > payload for (distant, potential) future use, and want to avoid causing
>> > unnecessary LLC evictions while doing so.
>> 
>> Yes I agree that there are certainly benefits in using cache-locality hints.
>> There is an open question around if the src or dst or both are non-temporal.
>> 
>> In the implementation of this patch, the NT/T type of store is reversed from your use-case:
>> 1) Loads are NT (so loaded data is not cached for future packets)
>> 2) Stores are T (so copied/dst data is now resident in L1/L2)
>> 
>> In theory there might even be valid uses for this type of memcpy where loaded
>> data is not needed again soon and stored data is referenced again soon,
>> although I cannot think of any here while typing this mail..
>> 
>> I think some use-case examples, and clear documentation on when/how to choose
>> between rte_memcpy() or any (potential future) rte_memcpy_nt() variants is required
>> to progress this patch.
>> 
>> Assuming a strong use-case exists, and it can be clearly indicators to users of DPDK APIs which
>> rte_memcpy() to use, we can look at technical details around enabling the implementation.
>> 
>
> +1 here.
> Function behaviour and restrictions (src parameter needs to be 16/32 B aligned, etc.),
> along with expected usage scenarios have to be documented properly.
> Again, as Harry pointed out, I don't see any AMD specific instructions in this function,
> so presumably such function can go into __AVX2__ code block and no new defines will
> be required. 
> Agreed that APIs are generic but we've kept under an AMD flag for a simple reason that it is NOT tested on any other platform.
> A use-case on how to use this was planned earlier for mlx5 pmd but dropped in this version of patch as the data path of mlx5 is going to be refactored soon and may not be useful for > future versions of mlx5 (>22.02). 
> Ref link: https://patchwork.dpdk.org/project/dpdk/patch/20211019104724.19416-2-aman.kumar@vvdntech.in/ (we've plan to adapt this into future version)
> The patch in the link basically enhances mlx5 mprq implementation for our specific use-case and with 128B packet size, we achieve ~60% better perf. We understand the use of this
> copy function should be documented which we shall plan along with few other platform specific optimizations in future versions of DPDK. As this does not conflict with other  >platforms, can we still keep under AMD flag for now as suggested by Thomas?

From what I read above, the patch is sort of in a half-ready stage.
Why rush here and try to push into DPDK things that don't fulfil DPDK policy?
Probably better to do all the missing parts first (docs, tests, etc.) and then come up with an updated version.
  
Thomas Monjalon Oct. 27, 2021, 2:31 p.m. UTC | #10
27/10/2021 16:10, Van Haaren, Harry:
> From: Aman Kumar <aman.kumar@vvdntech.in> 
> On Wed, Oct 27, 2021 at 5:53 PM Ananyev, Konstantin <konstantin.ananyev@intel.com> wrote:
> > 
> > Hi Mattias,
> > 
> > > > 6) What is the use-case for this? When would a user *want* to use this instead
> > > of rte_memcpy()?
> > > > If the data being loaded is relevant to datapath/packets, presumably other
> > > packets might require the
> > > > loaded data, so temporal (normal) loads should be used to cache the source
> > > data?
> > >
> > >
> > > I'm not sure if your first question is rhetorical or not, but a memcpy()
> > > in a NT variant is certainly useful. One use case for a memcpy() with
> > > temporal loads and non-temporal stores is if you need to archive packet
> > > payload for (distant, potential) future use, and want to avoid causing
> > > unnecessary LLC evictions while doing so.
> > 
> > Yes I agree that there are certainly benefits in using cache-locality hints.
> > There is an open question around if the src or dst or both are non-temporal.
> > 
> > In the implementation of this patch, the NT/T type of store is reversed from your use-case:
> > 1) Loads are NT (so loaded data is not cached for future packets)
> > 2) Stores are T (so copied/dst data is now resident in L1/L2)
> > 
> > In theory there might even be valid uses for this type of memcpy where loaded
> > data is not needed again soon and stored data is referenced again soon,
> > although I cannot think of any here while typing this mail..
> > 
> > I think some use-case examples, and clear documentation on when/how to choose
> > between rte_memcpy() or any (potential future) rte_memcpy_nt() variants is required
> > to progress this patch.
> > 
> > Assuming a strong use-case exists, and it can be clearly indicators to users of DPDK APIs which
> > rte_memcpy() to use, we can look at technical details around enabling the implementation.
> > 
> 
> [Konstantin wrote]:
> +1 here.
> Function behaviour and restrictions (src parameter needs to be 16/32 B aligned, etc.),
> along with expected usage scenarios have to be documented properly.
> Again, as Harry pointed out, I don't see any AMD specific instructions in this function,
> so presumably such function can go into __AVX2__ code block and no new defines will
> be required. 
> 
> 
> [Aman wrote]:
> Agreed that APIs are generic but we've kept under an AMD flag for a simple reason that it is NOT tested on any other platform.
> A use-case on how to use this was planned earlier for mlx5 pmd but dropped in this version of patch as the data path of mlx5 is going to be refactored soon and may not be useful for future versions of mlx5 (>22.02). 
> Ref link: https://patchwork.dpdk.org/project/dpdk/patch/20211019104724.19416-2-aman.kumar@vvdntech.in/(we've plan to adapt this into future version)
> The patch in the link basically enhances mlx5 mprq implementation for our specific use-case and with 128B packet size, we achieve ~60% better perf. We understand the use of this copy function should be documented which we shall plan along with few other platform specific optimizations in future versions of DPDK. As this does not conflict with other platforms, can we still keep under AMD flag for now as suggested by Thomas?

I said I could merge if there is no objection.
I had overlooked that it's adding completely new functions to the API.
And the comments go in the direction of what I asked in the previous version:
what is specific to AMD here?
Now, seeing the valid objections, I agree it should be reworked.
We must provide applications with an API that is generic, stable and well documented.


> [HvH wrote]:
> As an open-source community, any contributions should aim to improve the whole.
> In the past, numerous improvements have been merged to DPDK that improve performance.
> Sometimes these are architecture specific (x86/arm/ppc) sometimes the are ISA specific (SSE, AVX512, NEON).
> 
> I am not familiar with any cases in DPDK, where there is a #ifdef based on a *specific platform*.
> A quick "grep" through the "dpdk/lib" directory does not show any place where PMD or generic code
> has been explicitly optimized for a *specific platform*.
> 
> Obviously, in cases where ISA either exists or does not exist, yes there is an optimization to enable it.
> But this is not exposed as a top-level compile-time option, it uses runtime CPU ISA detection.
> 
> Please take a step back from the code, and look at what this patch asks of DPDK:
> "Please accept & maintain these changes upstream, which benefit only platform X, even though these ISA features are also available on other platforms".
> 
> Other patches that enhance performance of DPDK ask this:
> "Please accept & maintain these changes upstream, which benefit all platforms which have ISA capability X".
> 
> 
> === Question "As this does not conflict with other platforms, can we still keep under AMD flag for now"?
> I feel the contribution is too specific to a platform. Make it generic by enabling it at an ISA capability level.
> 
> Please yes, contribute to the DPDK community by improving performance of a PMD by enabling/leveraging ISA.
> But do so in a way that does not benefit only a specific platform - do so in a way that enhances all of DPDK, as
> other patches have done for the DPDK that this patch is built on.
> 
> If you have concerns that the PMD maintainers will not accept the changes due to potential regressions on
> other platforms, then discuss those, make a plan on how to performance validate, and work to a solution.
> 
> 
> === Regarding specifically the request for "can we still keep under AMD flag for now"?
> I do not believe we should introduce APIs for specific platforms. DPDK's EAL is an abstraction layer.
> The value of EAL is to provide a common abstraction. This platform-specific flag breaks the abstraction,
> and results in packaging issues, as well as API/ABI instability based on -Dcpu_instruction_set choice.
> So, no, we should not introduce APIs based on any compile-time flag.

I agree
  
Song, Keesang Oct. 29, 2021, 4:01 p.m. UTC | #11

Hi Thomas,

There are some gaps among us, so I think we really need another quick call to discuss. I will set up a call on Monday, like last time.
Please join the call if possible.

Thanks,
Keesang

-----Original Message-----
From: Thomas Monjalon <thomas@monjalon.net>
Sent: Wednesday, October 27, 2021 7:31 AM
To: Aman Kumar <aman.kumar@vvdntech.in>; Ananyev, Konstantin <konstantin.ananyev@intel.com>; Van Haaren, Harry <harry.van.haaren@intel.com>
Cc: mattias. ronnblom <mattias.ronnblom@ericsson.com>; dev@dpdk.org; viacheslavo@nvidia.com; Burakov, Anatoly <anatoly.burakov@intel.com>; Song, Keesang <Keesang.Song@amd.com>; jerinjacobk@gmail.com; Richardson, Bruce <bruce.richardson@intel.com>; honnappa.nagarahalli@arm.com; Ruifeng Wang <ruifeng.wang@arm.com>; David Christensen <drc@linux.vnet.ibm.com>; david.marchand@redhat.com; stephen@networkplumber.org
Subject: Re: [dpdk-dev] [PATCH v4 2/2] lib/eal: add temporal store memcpy support for AMD platform


27/10/2021 16:10, Van Haaren, Harry:
> From: Aman Kumar <aman.kumar@vvdntech.in> On Wed, Oct 27, 2021 at 5:53
> PM Ananyev, Konstantin <mailto:konstantin.ananyev@intel.com> wrote
> >
> > Hi Mattias,
> >
> > > > 6) What is the use-case for this? When would a user *want* to
> > > > use this instead
> > > of rte_memcpy()?
> > > > If the data being loaded is relevant to datapath/packets,
> > > > presumably other
> > > packets might require the
> > > > loaded data, so temporal (normal) loads should be used to cache
> > > > the source
> > > data?
> > >
> > >
> > > I'm not sure if your first question is rhetorical or not, but a
> > > memcpy() in a NT variant is certainly useful. One use case for a
> > > memcpy() with temporal loads and non-temporal stores is if you
> > > need to archive packet payload for (distant, potential) future
> > > use, and want to avoid causing unnecessary LLC evictions while doing so.
> >
> > Yes I agree that there are certainly benefits in using cache-locality hints.
> > There is an open question around if the src or dst or both are non-temporal.
> >
> > In the implementation of this patch, the NT/T type of store is reversed from your use-case:
> > 1) Loads are NT (so loaded data is not cached for future packets)
> > 2) Stores are T (so copied/dst data is now resident in L1/L2)
> >
> > In theory there might even be valid uses for this type of memcpy
> > where loaded data is not needed again soon and stored data is
> > referenced again soon, although I cannot think of any here while typing this mail..
> >
> > I think some use-case examples, and clear documentation on when/how
> > to choose between rte_memcpy() or any (potential future)
> > rte_memcpy_nt() variants is required to progress this patch.
> >
> > Assuming a strong use-case exists, and it can be clearly indicators
> > to users of DPDK APIs which
> > rte_memcpy() to use, we can look at technical details around enabling the implementation.
> >
>
> [Konstantin wrote]:
> +1 here.
> Function behaviour and restrictions (src parameter needs to be 16/32 B
> aligned, etc.), along with expected usage scenarios have to be documented properly.
> Again, as Harry pointed out, I don't see any AMD specific instructions
> in this function, so presumably such function can go into __AVX2__
> code block and no new defines will be required.
>
>
> [Aman wrote]:
> Agreed that APIs are generic but we've kept under an AMD flag for a simple reason that it is NOT tested on any other platform.
> A use-case on how to use this was planned earlier for mlx5 pmd but dropped in this version of patch as the data path of mlx5 is going to be refactored soon and may not be useful for future versions of mlx5 (>22.02).
> Ref link:
> https://patchwork.dpdk.org/project/dpdk/patch/20211019104724.19416-2-aman.kumar@vvdntech.in/ (we've plan to adapt this into future version) The patch in the link basically enhances mlx5 mprq implementation for our specific use-case and with 128B packet size, we achieve ~60% better perf. We understand the use of this copy function should be documented which we shall plan along with few other platform specific optimizations in future versions of DPDK. As this does not conflict with other platforms, can we still keep under AMD flag for now as suggested by Thomas?

I said I could merge if there is no objection.
I've overlooked that it's adding completely new functions in the API.
And the comments go in the direction of what I asked in previous version:
what is specific to AMD here?
Now seeing the valid objections, I agree it should be reworked.
We must provide API to applications which is generic, stable and well documented.


> [HvH wrote]:
> As an open-source community, any contributions should aim to improve the whole.
> In the past, numerous improvements have been merged to DPDK that improve performance.
> Sometimes these are architecture specific (x86/arm/ppc) sometimes the are ISA specific (SSE, AVX512, NEON).
>
> I am not familiar with any cases in DPDK, where there is a #ifdef based on a *specific platform*.
> A quick "grep" through the "dpdk/lib" directory does not show any
> place where PMD or generic code has been explicitly optimized for a *specific platform*.
>
> Obviously, in cases where ISA either exists or does not exist, yes there is an optimization to enable it.
> But this is not exposed as a top-level compile-time option, it uses runtime CPU ISA detection.
>
> Please take a step back from the code, and look at what this patch asks of DPDK:
> "Please accept & maintain these changes upstream, which benefit only platform X, even though these ISA features are also available on other platforms".
>
> Other patches that enhance performance of DPDK ask this:
> "Please accept & maintain these changes upstream, which benefit all platforms which have ISA capability X".
>
>
> === Question "As this does not conflict with other platforms, can we still keep under AMD flag for now"?
> I feel the contribution is too specific to a platform. Make it generic by enabling it at an ISA capability level.
>
> Please yes, contribute to the DPDK community by improving performance of a PMD by enabling/leveraging ISA.
> But do so in a way that does not benefit only a specific platform - do
> so in a way that enhances all of DPDK, as other patches have done for the DPDK that this patch is built on.
>
> If you have concerns that the PMD maintainers will not accept the
> changes due to potential regressions on other platforms, then discuss those, make a plan on how to performance validate, and work to a solution.
>
>
> === Regarding specifically the request for "can we still keep under AMD flag for now"?
> I do not believe we should introduce APIs for specific platforms. DPDK's EAL is an abstraction layer.
> The value of EAL is to provide a common abstraction. This
> platform-specific flag breaks the abstraction, and results in packaging issues, as well as API/ABI instability based on -Dcpu_instruction_set choice.
> So, no, we should not introduce APIs based on any compile-time flag.

I agree
  

Patch

diff --git a/config/x86/meson.build b/config/x86/meson.build
index 21cda6fd33..56dae4aca7 100644
--- a/config/x86/meson.build
+++ b/config/x86/meson.build
@@ -78,6 +78,8 @@  if get_option('cpu_instruction_set') == 'znver1'
     dpdk_conf.set('RTE_MAX_LCORE', 256)
 elif get_option('cpu_instruction_set') == 'znver2'
     dpdk_conf.set('RTE_MAX_LCORE', 512)
+    dpdk_conf.set('RTE_MEMCPY_AMDEPYC', 1)
 elif get_option('cpu_instruction_set') == 'znver3'
     dpdk_conf.set('RTE_MAX_LCORE', 512)
+    dpdk_conf.set('RTE_MEMCPY_AMDEPYC', 1)
 endif
diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
index 1b6c6e585f..8fe7822cb4 100644
--- a/lib/eal/x86/include/rte_memcpy.h
+++ b/lib/eal/x86/include/rte_memcpy.h
@@ -376,6 +376,120 @@  rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
 	}
 }
 
+#if defined RTE_MEMCPY_AMDEPYC
+
+/**
+ * Copy 16 bytes from one location to another,
+ * with temporal stores
+ */
+static __rte_always_inline void
+rte_copy16_ts(uint8_t *dst, uint8_t *src)
+{
+	__m128i var128;
+
+	var128 = _mm_stream_load_si128((__m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, var128);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * with temporal stores
+ */
+static __rte_always_inline void
+rte_copy32_ts(uint8_t *dst, uint8_t *src)
+{
+	__m256i ymm0;
+
+	ymm0 = _mm256_stream_load_si256((const __m256i *)src);
+	_mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * with temporal stores
+ */
+static __rte_always_inline void
+rte_copy64_ts(uint8_t *dst, uint8_t *src)
+{
+	rte_copy32_ts(dst + 0 * 32, src + 0 * 32);
+	rte_copy32_ts(dst + 1 * 32, src + 1 * 32);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * with temporal stores
+ */
+static __rte_always_inline void
+rte_copy128_ts(uint8_t *dst, uint8_t *src)
+{
+	rte_copy32_ts(dst + 0 * 32, src + 0 * 32);
+	rte_copy32_ts(dst + 1 * 32, src + 1 * 32);
+	rte_copy32_ts(dst + 2 * 32, src + 2 * 32);
+	rte_copy32_ts(dst + 3 * 32, src + 3 * 32);
+}
+
+/**
+ * Copy len bytes from one location to another,
+ * with temporal stores 16B aligned
+ */
+static __rte_always_inline void *
+rte_memcpy_aligned_tstore16_generic(void *dst, void *src, int len)
+{
+	void *dest = dst;
+
+	while (len >= 128) {
+		rte_copy128_ts((uint8_t *)dst, (uint8_t *)src);
+		dst = (uint8_t *)dst + 128;
+		src = (uint8_t *)src + 128;
+		len -= 128;
+	}
+	while (len >= 64) {
+		rte_copy64_ts((uint8_t *)dst, (uint8_t *)src);
+		dst = (uint8_t *)dst + 64;
+		src = (uint8_t *)src + 64;
+		len -= 64;
+	}
+	while (len >= 32) {
+		rte_copy32_ts((uint8_t *)dst, (uint8_t *)src);
+		dst = (uint8_t *)dst + 32;
+		src = (uint8_t *)src + 32;
+		len -= 32;
+	}
+	if (len >= 16) {
+		rte_copy16_ts((uint8_t *)dst, (uint8_t *)src);
+		dst = (uint8_t *)dst + 16;
+		src = (uint8_t *)src + 16;
+		len -= 16;
+	}
+	if (len >= 8) {
+		*(uint64_t *)dst = *(const uint64_t *)src;
+		dst = (uint8_t *)dst + 8;
+		src = (uint8_t *)src + 8;
+		len -= 8;
+	}
+	if (len >= 4) {
+		*(uint32_t *)dst = *(const uint32_t *)src;
+		dst = (uint8_t *)dst + 4;
+		src = (uint8_t *)src + 4;
+		len -= 4;
+	}
+	if (len != 0) {
+		dst = (uint8_t *)dst - (4 - len);
+		src = (uint8_t *)src - (4 - len);
+		*(uint32_t *)dst = *(const uint32_t *)src;
+	}
+
+	return dest;
+}
+
+static __rte_always_inline void *
+rte_memcpy_aligned_tstore16(void *dst, void *src, int len)
+{
+	return rte_memcpy_aligned_tstore16_generic(dst, src, len);
+}
+
+#endif /* RTE_MEMCPY_AMDEPYC */
+
 static __rte_always_inline void *
 rte_memcpy_generic(void *dst, const void *src, size_t n)
 {