[v2] mempool cache: add zero-copy get and put functions

Message ID 20221116180419.98937-1-mb@smartsharesystems.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series [v2] mempool cache: add zero-copy get and put functions

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/loongarch-compilation success Compilation OK
ci/loongarch-unit-testing success Unit Testing PASS
ci/Intel-compilation success Compilation OK
ci/iol-broadcom-Performance success Performance Testing PASS
ci/intel-Testing success Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-aarch64-unit-testing success Testing PASS
ci/iol-x86_64-compile-testing success Testing PASS
ci/github-robot: build success github build: passed
ci/iol-x86_64-unit-testing success Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS

Commit Message

Morten Brørup Nov. 16, 2022, 6:04 p.m. UTC
  Zero-copy access to mempool caches is beneficial for PMD performance, and
must be provided by the mempool library to fix [Bug 1052] without a
performance regression.

[Bug 1052]: https://bugs.dpdk.org/show_bug.cgi?id=1052

v2:
* Fix checkpatch warnings.
* Fix missing registration of trace points.
* The functions are inline, so they don't go into the map file.
v1 changes from the RFC:
* Removed run-time parameter checks. (Honnappa)
  This is a hot fast path function, so correct application behaviour
  is required, i.e. function parameters must be valid.
* Added RTE_ASSERT for parameters instead.
  Code for this is only generated if built with RTE_ENABLE_ASSERT.
* Removed fallback when 'cache' parameter is not set. (Honnappa)
* Chose the simple get function; i.e. do not move the existing objects in
  the cache to the top of the new stack, just leave them at the bottom.
* Renamed the functions. Other suggestions are welcome, of course. ;-)
* Updated the function descriptions.
* Added the functions to trace_fp and version.map.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/mempool_trace_points.c |   6 ++
 lib/mempool/rte_mempool.h          | 124 +++++++++++++++++++++++++++++
 lib/mempool/rte_mempool_trace_fp.h |  16 ++++
 lib/mempool/version.map            |   4 +
 4 files changed, 150 insertions(+)
  

Comments

Kamalakshitha Aligeri Nov. 29, 2022, 8:54 p.m. UTC | #1
> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Wednesday, November 16, 2022 12:04 PM
> To: olivier.matz@6wind.com; andrew.rybchenko@oktetlabs.ru; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; Kamalakshitha Aligeri
> <Kamalakshitha.Aligeri@arm.com>; bruce.richardson@intel.com;
> dev@dpdk.org
> Cc: nd <nd@arm.com>; Morten Brørup <mb@smartsharesystems.com>
> Subject: [PATCH v2] mempool cache: add zero-copy get and put functions
> 
> Zero-copy access to mempool caches is beneficial for PMD performance, and
> must be provided by the mempool library to fix [Bug 1052] without a
> performance regression.
> 
> [Bug 1052]: https://bugs.dpdk.org/show_bug.cgi?id=1052
> 
> v2:
> * Fix checkpatch warnings.
> * Fix missing registration of trace points.
> * The functions are inline, so they don't go into the map file.
> v1 changes from the RFC:
> * Removed run-time parameter checks. (Honnappa)
>   This is a hot fast path function; requiring correct application
>   behaviour, i.e. function parameters must be valid.
> * Added RTE_ASSERT for parameters instead.
>   Code for this is only generated if built with RTE_ENABLE_ASSERT.
> * Removed fallback when 'cache' parameter is not set. (Honnappa)
> * Chose the simple get function; i.e. do not move the existing objects in
>   the cache to the top of the new stack, just leave them at the bottom.
> * Renamed the functions. Other suggestions are welcome, of course. ;-)
> * Updated the function descriptions.
> * Added the functions to trace_fp and version.map.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>  lib/mempool/mempool_trace_points.c |   6 ++
>  lib/mempool/rte_mempool.h          | 124 +++++++++++++++++++++++++++++
>  lib/mempool/rte_mempool_trace_fp.h |  16 ++++
>  lib/mempool/version.map            |   4 +
>  4 files changed, 150 insertions(+)
> 
> diff --git a/lib/mempool/mempool_trace_points.c
> b/lib/mempool/mempool_trace_points.c
> index 4ad76deb34..a6070799af 100644
> --- a/lib/mempool/mempool_trace_points.c
> +++ b/lib/mempool/mempool_trace_points.c
> @@ -77,3 +77,9 @@
> RTE_TRACE_POINT_REGISTER(rte_mempool_trace_ops_free,
> 
>  RTE_TRACE_POINT_REGISTER(rte_mempool_trace_set_ops_byname,
>  	lib.mempool.set.ops.byname)
> +
> +RTE_TRACE_POINT_REGISTER(rte_mempool_trace_cache_zc_put_bulk,
> +	lib.mempool.cache.zc.put.bulk)
> +
> +RTE_TRACE_POINT_REGISTER(rte_mempool_trace_cache_zc_get_bulk,
> +	lib.mempool.cache.zc.get.bulk)
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 9f530db24b..5e6da06bc7 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -47,6 +47,7 @@
>  #include <rte_ring.h>
>  #include <rte_memcpy.h>
>  #include <rte_common.h>
> +#include <rte_errno.h>
> 
>  #include "rte_mempool_trace_fp.h"
> 
> @@ -1346,6 +1347,129 @@ rte_mempool_cache_flush(struct
> rte_mempool_cache *cache,
>  	cache->len = 0;
>  }
> 
> +/**
> + * @warning
> + * @b EXPERIMENTAL: This API may change, or be removed, without prior
> notice.
> + *
> + * Zero-copy put objects in a user-owned mempool cache backed by the
> specified mempool.
> + *
> + * @param cache
> + *   A pointer to the mempool cache.
> + * @param mp
> + *   A pointer to the mempool.
> + * @param n
> + *   The number of objects to be put in the mempool cache.
> + *   Must not exceed RTE_MEMPOOL_CACHE_MAX_SIZE.
> + * @return
> + *   The pointer to where to put the objects in the mempool cache.
> + */

The rte_mempool_cache_zc_put_bulk function takes *cache as an input parameter, which means the rte_mempool_default_cache function must be called in the PMD code, because there is no pointer to the mempool stored in i40e_tx_queue. It's there in i40e_rx_queue though.
So, should we change the APIs?

> +__rte_experimental
> +static __rte_always_inline void *
> +rte_mempool_cache_zc_put_bulk(struct rte_mempool_cache *cache,
> +		struct rte_mempool *mp,
> +		unsigned int n)
> +{
> +	void **cache_objs;
> +
> +	RTE_ASSERT(cache != NULL);
> +	RTE_ASSERT(mp != NULL);
> +	RTE_ASSERT(n <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> +
> +	rte_mempool_trace_cache_zc_put_bulk(cache, mp, n);
> +
> +	/* Increment stats now, adding in mempool always succeeds. */
> +	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
> +	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
> +
> +	/*
> +	 * The cache follows the following algorithm:
> +	 *   1. If the objects cannot be added to the cache without crossing
> +	 *      the flush threshold, flush the cache to the backend.
> +	 *   2. Add the objects to the cache.
> +	 */
> +
> +	if (cache->len + n <= cache->flushthresh) {
> +		cache_objs = &cache->objs[cache->len];
> +		cache->len += n;
> +	} else {
> +		cache_objs = &cache->objs[0];
> +		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> +		cache->len = n;
> +	}
> +
> +	return cache_objs;
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: This API may change, or be removed, without prior
> notice.
> + *
> + * Zero-copy get objects from a user-owned mempool cache backed by the
> specified mempool.
> + *
> + * @param cache
> + *   A pointer to the mempool cache.
> + * @param mp
> + *   A pointer to the mempool.
> + * @param n
> + *   The number of objects to prefetch into the mempool cache.
> + *   Must not exceed RTE_MEMPOOL_CACHE_MAX_SIZE.
> + * @return
> + *   The pointer to the objects in the mempool cache.
> + *   NULL on error; i.e. the cache + the pool does not contain n objects.
> + *   With rte_errno set to the error code of the mempool dequeue function.
> + */
> +__rte_experimental
> +static __rte_always_inline void *
> +rte_mempool_cache_zc_get_bulk(struct rte_mempool_cache *cache,
> +		struct rte_mempool *mp,
> +		unsigned int n)
> +{
> +	unsigned int len;
> +
> +	RTE_ASSERT(cache != NULL);
> +	RTE_ASSERT(mp != NULL);
> +	RTE_ASSERT(n <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> +
> +	rte_mempool_trace_cache_zc_get_bulk(cache, mp, n);
> +
> +	len = cache->len;
> +
> +	if (unlikely(n > len)) {
> +		/* Fill the cache from the backend; fetch size + requested - len objects. */
> +		int ret;
> +		const unsigned int size = cache->size;
> +
> +		ret = rte_mempool_ops_dequeue_bulk(mp, &cache->objs[len], size + n - len);
> +		if (unlikely(ret < 0)) {
> +			/*
> +			 * We are buffer constrained.
> +			 * Do not fill the cache, just satisfy the request.
> +			 */
> +			ret = rte_mempool_ops_dequeue_bulk(mp, &cache->objs[len], n - len);
> +			if (unlikely(ret < 0)) {
> +				/* Unable to satisfy the request. */
> +
> +				RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
> +				RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
> +
> +				rte_errno = -ret;
> +				return NULL;
> +			}
> +
> +			len = 0;
> +		} else
> +			len = size;
> +	} else
> +		len -= n;
> +
> +	cache->len = len;
> +
> +	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> +	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
> +
> +	return &cache->objs[len];
> +}
> +
>  /**
>   * @internal Put several objects back in the mempool; used internally.
>   * @param mp
> diff --git a/lib/mempool/rte_mempool_trace_fp.h
> b/lib/mempool/rte_mempool_trace_fp.h
> index ed060e887c..00567fb1cf 100644
> --- a/lib/mempool/rte_mempool_trace_fp.h
> +++ b/lib/mempool/rte_mempool_trace_fp.h
> @@ -109,6 +109,22 @@ RTE_TRACE_POINT_FP(
>  	rte_trace_point_emit_ptr(mempool);
>  )
> 
> +RTE_TRACE_POINT_FP(
> +	rte_mempool_trace_cache_zc_put_bulk,
> +	RTE_TRACE_POINT_ARGS(void *cache, void *mempool, uint32_t
> nb_objs),
> +	rte_trace_point_emit_ptr(cache);
> +	rte_trace_point_emit_ptr(mempool);
> +	rte_trace_point_emit_u32(nb_objs);
> +)
> +
> +RTE_TRACE_POINT_FP(
> +	rte_mempool_trace_cache_zc_get_bulk,
> +	RTE_TRACE_POINT_ARGS(void *cache, void *mempool, uint32_t
> nb_objs),
> +	rte_trace_point_emit_ptr(cache);
> +	rte_trace_point_emit_ptr(mempool);
> +	rte_trace_point_emit_u32(nb_objs);
> +)
> +
>  #ifdef __cplusplus
>  }
>  #endif
> diff --git a/lib/mempool/version.map b/lib/mempool/version.map index
> b67d7aace7..927477b977 100644
> --- a/lib/mempool/version.map
> +++ b/lib/mempool/version.map
> @@ -63,6 +63,10 @@ EXPERIMENTAL {
>  	__rte_mempool_trace_ops_alloc;
>  	__rte_mempool_trace_ops_free;
>  	__rte_mempool_trace_set_ops_byname;
> +
> +	# added in 23.03
> +	__rte_mempool_trace_cache_zc_put_bulk;
> +	__rte_mempool_trace_cache_zc_get_bulk;
>  };
> 
>  INTERNAL {
> --
> 2.17.1
  
Morten Brørup Nov. 30, 2022, 10:21 a.m. UTC | #2
> From: Kamalakshitha Aligeri [mailto:Kamalakshitha.Aligeri@arm.com]
> Sent: Tuesday, 29 November 2022 21.54
> 
> > From: Morten Brørup <mb@smartsharesystems.com>
> > Sent: Wednesday, November 16, 2022 12:04 PM
> >
> > Zero-copy access to mempool caches is beneficial for PMD performance,
> and
> > must be provided by the mempool library to fix [Bug 1052] without a
> > performance regression.
> >
> > [Bug 1052]: https://bugs.dpdk.org/show_bug.cgi?id=1052
> >
> > v2:
> > * Fix checkpatch warnings.
> > * Fix missing registration of trace points.
> > * The functions are inline, so they don't go into the map file.
> > v1 changes from the RFC:
> > * Removed run-time parameter checks. (Honnappa)
> >   This is a hot fast path function; requiring correct application
> >   behaviour, i.e. function parameters must be valid.
> > * Added RTE_ASSERT for parameters instead.
> >   Code for this is only generated if built with RTE_ENABLE_ASSERT.
> > * Removed fallback when 'cache' parameter is not set. (Honnappa)
> > * Chose the simple get function; i.e. do not move the existing
> objects in
> >   the cache to the top of the new stack, just leave them at the
> bottom.
> > * Renamed the functions. Other suggestions are welcome, of course. ;-
> )
> > * Updated the function descriptions.
> > * Added the functions to trace_fp and version.map.
> >
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > ---

[...]

> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: This API may change, or be removed, without
> prior
> > notice.
> > + *
> > + * Zero-copy put objects in a user-owned mempool cache backed by the
> > specified mempool.
> > + *
> > + * @param cache
> > + *   A pointer to the mempool cache.
> > + * @param mp
> > + *   A pointer to the mempool.
> > + * @param n
> > + *   The number of objects to be put in the mempool cache.
> > + *   Must not exceed RTE_MEMPOOL_CACHE_MAX_SIZE.
> > + * @return
> > + *   The pointer to where to put the objects in the mempool cache.
> > + */
> 
> rte_mempool_cache_zc_put_bulk function takes *cache as an input
> parameter, which means rte_mempool_default_cache function must be
> called in the PMD code, because there is no pointer to mempool stored
> in i40e_tx_queue. It's there in i40e_rx_queue though.
> So, should we change the APIs?

Excellent question!

This is a "mempool cache" API. So we must keep in mind that it can be consumed by applications and other libraries using mempool caches, not just PMDs.

If some consumer of the API, e.g. i40e_tx, doesn’t know the cache pointer, it must look it up itself before calling the function, e.g. by rte_mempool_default_cache().

I think we should keep the API clean, as proposed. Otherwise, the added lookup (although conditional) would degrade the performance of all other consumers of the API.

And there is no performance difference for the PMD whether it calls rte_mempool_default_cache() in the PMD itself, or if it is called from within the rte_mempool_cache_zc_put_bulk() function.
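
For illustration, a minimal sketch (not actual i40e code; the function and array names are placeholders) of how a TX free path could do the lookup once and then use the zero-copy put, assuming the lcore has a default cache, all mbufs belong to the same mempool, and n does not exceed RTE_MEMPOOL_CACHE_MAX_SIZE:

/* Sketch only - not from the patch: free 'n' mbufs, all from 'mp', via the zc API. */
static inline void
example_zc_free_bulk(struct rte_mempool *mp, struct rte_mbuf **pkts, unsigned int n)
{
	struct rte_mempool_cache *cache;
	void **cache_objs;
	unsigned int i;

	cache = rte_mempool_default_cache(mp, rte_lcore_id());
	cache_objs = rte_mempool_cache_zc_put_bulk(cache, mp, n);

	for (i = 0; i < n; i++)
		cache_objs[i] = pkts[i]; /* copy the pointers directly into the cache */
}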

> 
> > +__rte_experimental
> > +static __rte_always_inline void *
> > +rte_mempool_cache_zc_put_bulk(struct rte_mempool_cache *cache,
> > +		struct rte_mempool *mp,
> > +		unsigned int n)
> > +{
> > +	void **cache_objs;
> > +
> > +	RTE_ASSERT(cache != NULL);
> > +	RTE_ASSERT(mp != NULL);
> > +	RTE_ASSERT(n <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> > +
> > +	rte_mempool_trace_cache_zc_put_bulk(cache, mp, n);
> > +
> > +	/* Increment stats now, adding in mempool always succeeds. */
> > +	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
> > +	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
> > +
> > +	/*
> > +	 * The cache follows the following algorithm:
> > +	 *   1. If the objects cannot be added to the cache without
> crossing
> > +	 *      the flush threshold, flush the cache to the backend.
> > +	 *   2. Add the objects to the cache.
> > +	 */
> > +
> > +	if (cache->len + n <= cache->flushthresh) {
> > +		cache_objs = &cache->objs[cache->len];
> > +		cache->len += n;
> > +	} else {
> > +		cache_objs = &cache->objs[0];
> > +		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> > +		cache->len = n;
> > +	}
> > +
> > +	return cache_objs;
> > +}
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: This API may change, or be removed, without
> prior
> > notice.
> > + *
> > + * Zero-copy get objects from a user-owned mempool cache backed by
> the
> > specified mempool.
> > + *
> > + * @param cache
> > + *   A pointer to the mempool cache.
> > + * @param mp
> > + *   A pointer to the mempool.
> > + * @param n
> > + *   The number of objects to prefetch into the mempool cache.
> > + *   Must not exceed RTE_MEMPOOL_CACHE_MAX_SIZE.
> > + * @return
> > + *   The pointer to the objects in the mempool cache.
> > + *   NULL on error; i.e. the cache + the pool does not contain n
> objects.
> > + *   With rte_errno set to the error code of the mempool dequeue
> function.
> > + */
> > +__rte_experimental
> > +static __rte_always_inline void *
> > +rte_mempool_cache_zc_get_bulk(struct rte_mempool_cache *cache,
> > +		struct rte_mempool *mp,
> > +		unsigned int n)
> > +{
> > +	unsigned int len;
> > +
> > +	RTE_ASSERT(cache != NULL);
> > +	RTE_ASSERT(mp != NULL);
> > +	RTE_ASSERT(n <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> > +
> > +	rte_mempool_trace_cache_zc_get_bulk(cache, mp, n);
> > +
> > +	len = cache->len;
> > +
> > +	if (unlikely(n > len)) {
> > +		/* Fill the cache from the backend; fetch size + requested - len objects. */
> > +		int ret;
> > +		const unsigned int size = cache->size;
> > +
> > +		ret = rte_mempool_ops_dequeue_bulk(mp, &cache->objs[len], size + n - len);
> > +		if (unlikely(ret < 0)) {
> > +			/*
> > +			 * We are buffer constrained.
> > +			 * Do not fill the cache, just satisfy the request.
> > +			 */
> > +			ret = rte_mempool_ops_dequeue_bulk(mp, &cache->objs[len], n - len);
> > +			if (unlikely(ret < 0)) {
> > +				/* Unable to satisfy the request. */
> > +
> > +				RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
> > +				RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
> > +
> > +				rte_errno = -ret;
> > +				return NULL;
> > +			}
> > +
> > +			len = 0;
> > +		} else
> > +			len = size;
> > +	} else
> > +		len -= n;
> > +
> > +	cache->len = len;
> > +
> > +	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
> > +	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
> > +
> > +	return &cache->objs[len];
> > +}
> > +
  
Konstantin Ananyev Dec. 22, 2022, 3:57 p.m. UTC | #3
> Zero-copy access to mempool caches is beneficial for PMD performance, and
> must be provided by the mempool library to fix [Bug 1052] without a
> performance regression.

LGTM in general, thank you for working on it.
Few comments below.
 
> 
> [Bug 1052]: https://bugs.dpdk.org/show_bug.cgi?id=1052
> 
> v2:
> * Fix checkpatch warnings.
> * Fix missing registration of trace points.
> * The functions are inline, so they don't go into the map file.
> v1 changes from the RFC:
> * Removed run-time parameter checks. (Honnappa)
>   This is a hot fast path function; requiring correct application
>   behaviour, i.e. function parameters must be valid.
> * Added RTE_ASSERT for parameters instead.

RTE_ASSERT(n <= RTE_MEMPOOL_CACHE_MAX_SIZE);
I think it is excessive.
Just:
if (n > RTE_MEMPOOL_CACHE_MAX_SIZE) return NULL;
seems much more convenient for the users here and
closer to other mempool/ring API behavior.
In terms of performance - I don’t think one extra comparison here
would really count.

I also think it would be really good to add
zc_(get|put)_bulk_start() and zc_(get|put)_bulk_finish(),
where _start would check/fill the cache and return the pointer,
while _finish would update cache->len.
Similar to what we have for the rte_ring _peek_ API.
That would allow extending this API usage - let's say inside PMDs
it could be used not only for the MBUF_FAST_FREE case, but also for the
generic TX code path (the one that has to call rte_mbuf_prefree()), as
sketched below.
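
Roughly the shape I have in mind (hypothetical names and bodies, just to illustrate the idea; it reuses the flush logic from zc_put_bulk() above and is not part of this patch):

/* Hypothetical API shape - sketch only, not code from the patch. */
__rte_experimental
static __rte_always_inline void **
rte_mempool_cache_zc_put_bulk_start(struct rte_mempool_cache *cache,
		struct rte_mempool *mp, unsigned int n)
{
	/* Flush if the objects would not fit, but do not commit the length yet. */
	if (cache->len + n > cache->flushthresh) {
		rte_mempool_ops_enqueue_bulk(mp, cache->objs, cache->len);
		cache->len = 0;
	}
	return &cache->objs[cache->len];
}

__rte_experimental
static __rte_always_inline void
rte_mempool_cache_zc_put_bulk_finish(struct rte_mempool_cache *cache,
		unsigned int n)
{
	/* Commit only the objects the caller actually wrote. */
	cache->len += n;
}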

>   Code for this is only generated if built with RTE_ENABLE_ASSERT.
> * Removed fallback when 'cache' parameter is not set. (Honnappa)
> * Chose the simple get function; i.e. do not move the existing objects in
>   the cache to the top of the new stack, just leave them at the bottom.
> * Renamed the functions. Other suggestions are welcome, of course. ;-)
> * Updated the function descriptions.
> * Added the functions to trace_fp and version.map.

Would be great to add some test-cases in app/test to cover this new API.
  
Morten Brørup Dec. 22, 2022, 5:55 p.m. UTC | #4
> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> Sent: Thursday, 22 December 2022 16.57
> 
> > Zero-copy access to mempool caches is beneficial for PMD performance,
> and
> > must be provided by the mempool library to fix [Bug 1052] without a
> > performance regression.
> 
> LGTM in general, thank you for working on it.
> Few comments below.
> 
> >
> > [Bug 1052]: https://bugs.dpdk.org/show_bug.cgi?id=1052
> >
> > v2:
> > * Fix checkpatch warnings.
> > * Fix missing registration of trace points.
> > * The functions are inline, so they don't go into the map file.
> > v1 changes from the RFC:
> > * Removed run-time parameter checks. (Honnappa)
> >   This is a hot fast path function; requiring correct application
> >   behaviour, i.e. function parameters must be valid.
> > * Added RTE_ASSERT for parameters instead.
> 
> RTE_ASSERT(n <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> I think it is too excessive.
> Just:
> if (n > RTE_MEMPOOL_CACHE_MAX_SIZE) return NULL;
> seems much more convenient for the users here and
> more close to other mempool/ring API behavior.
> In terms of performance - I don’t think one extra comparison here
> would really count.

The insignificant performance degradation seems like a good tradeoff for making the function more generic.
I will update the function documentation and place the run-time check here, so both trace and stats reflect what happened:

	RTE_ASSERT(cache != NULL);
	RTE_ASSERT(mp != NULL);
-	RTE_ASSERT(n <= RTE_MEMPOOL_CACHE_MAX_SIZE);

	rte_mempool_trace_cache_zc_put_bulk(cache, mp, n);
+
+	if (unlikely(n > RTE_MEMPOOL_CACHE_MAX_SIZE)) {
+		rte_errno = -ENOSPC; // Or EINVAL?
+		return NULL;
+	}

	/* Increment stats now, adding in mempool always succeeds. */

I will probably also be able to come up with a solution for zc_get_bulk(), so both trace and stats make sense if called with n > RTE_MEMPOOL_CACHE_MAX_SIZE.

> 
> I also think would be really good to add:
> add zc_(get|put)_bulk_start(),  zc_(get|put)_bulk_finish().
> Where _start would check/fill the cache and return the pointer,
> while _finish would update cache->len.
> Similar to what we have for rte_ring _peek_ API.
> That would allow to extend this API usage - let say inside PMDs
> it could be used not only for MBUF_FAST_FREE case,  but for generic
> TX code path (one that have to call rte_mbuf_prefree()) also.

I don't see a use case for zc_get_start()/_finish().

And since the mempool cache is a stack, it would *require* that the application read the array in reverse order. In that case, the function should not return a pointer to the array of objects, but a pointer to the top of the stack.

So I prefer to stick with the single-function zero-copy get, i.e. without start/finish.


I do agree with you about the use case for zc_put_start()/_finish().

Unlike the ring, there is no need for locking with the mempool cache, so we can implement something much simpler:

Instead of requiring a call to both zc_put_start() and _finish() for every zero-copy burst, we could add a zc_put_rewind() function, to be called only if some number of objects were not added anyway:

/* FIXME: Function documentation here. */
__rte_experimental
static __rte_always_inline void
rte_mempool_cache_zc_put_rewind(struct rte_mempool_cache *cache,
		unsigned int n)
{
	RTE_ASSERT(cache != NULL);
	RTE_ASSERT(n <= cache->len);

	rte_mempool_trace_cache_zc_put_rewind(cache, n);

	/* Rewind stats. */
	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, -n);

	cache->len -= n;
}
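
Usage in a generic TX free path could look roughly like this (a sketch only - it assumes all segments belong to 'mp', and uses rte_pktmbuf_prefree_seg() as a stand-in for whatever per-mbuf check the driver needs):

/* Sketch only: return up to 'n' TX mbufs from 'pkts' to 'mp' via the zc API. */
void **cache_objs = rte_mempool_cache_zc_put_bulk(cache, mp, n);
unsigned int i, copied = 0;

for (i = 0; i < n; i++) {
	struct rte_mbuf *m = rte_pktmbuf_prefree_seg(pkts[i]);

	if (likely(m != NULL))
		cache_objs[copied++] = m;
}

/* Give back the slots that were reserved but not used. */
if (copied != n)
	rte_mempool_cache_zc_put_rewind(cache, n - copied);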

I have a strong preference for _rewind() over _start() and _finish(), because in the full burst case, it only touches the rte_mempool_cache structure once, whereas splitting it up into _start() and _finish() touches the rte_mempool_cache structure both before and after copying the array of objects.

What do you think?

I am open for other names than _rewind(), so feel free to speak up if you have a better name.


> 
> >   Code for this is only generated if built with RTE_ENABLE_ASSERT.
> > * Removed fallback when 'cache' parameter is not set. (Honnappa)
> > * Chose the simple get function; i.e. do not move the existing
> objects in
> >   the cache to the top of the new stack, just leave them at the
> bottom.
> > * Renamed the functions. Other suggestions are welcome, of course. ;-
> )
> > * Updated the function descriptions.
> > * Added the functions to trace_fp and version.map.
> 
> Would be great to add some test-cases in app/test to cover this new
> API.

Good point. I will look at it.

BTW: Akshitha already has zc_put_bulk working in the i40e PMD.
  
Konstantin Ananyev Dec. 23, 2022, 4:58 p.m. UTC | #5
> > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > Sent: Thursday, 22 December 2022 16.57
> >
> > > Zero-copy access to mempool caches is beneficial for PMD performance,
> > and
> > > must be provided by the mempool library to fix [Bug 1052] without a
> > > performance regression.
> >
> > LGTM in general, thank you for working on it.
> > Few comments below.
> >
> > >
> > > [Bug 1052]: https://bugs.dpdk.org/show_bug.cgi?id=1052
> > >
> > > v2:
> > > * Fix checkpatch warnings.
> > > * Fix missing registration of trace points.
> > > * The functions are inline, so they don't go into the map file.
> > > v1 changes from the RFC:
> > > * Removed run-time parameter checks. (Honnappa)
> > >   This is a hot fast path function; requiring correct application
> > >   behaviour, i.e. function parameters must be valid.
> > > * Added RTE_ASSERT for parameters instead.
> >
> > RTE_ASSERT(n <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> > I think it is too excessive.
> > Just:
> > if (n > RTE_MEMPOOL_CACHE_MAX_SIZE) return NULL;
> > seems much more convenient for the users here and
> > more close to other mempool/ring API behavior.
> > In terms of performance - I don’t think one extra comparison here
> > would really count.
> 
> The insignificant performance degradation seems like a good tradeoff for making the function more generic.
> I will update the function documentation and place the run-time check here, so both trace and stats reflect what happened:
> 
> 	RTE_ASSERT(cache != NULL);
> 	RTE_ASSERT(mp != NULL);
> -	RTE_ASSERT(n <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> 
> 	rte_mempool_trace_cache_zc_put_bulk(cache, mp, n);
> +
> +	if (unlikely(n > RTE_MEMPOOL_CACHE_MAX_SIZE)) {
> +		rte_errno = -ENOSPC; // Or EINVAL?
> +		return NULL;
> +	}
> 
> 	/* Increment stats now, adding in mempool always succeeds. */
> 
> I will probably also be able to come up with solution for zc_get_bulk(), so both trace and stats make sense if called with n >
> RTE_MEMPOOL_CACHE_MAX_SIZE.
> 
> >
> > I also think would be really good to add:
> > add zc_(get|put)_bulk_start(),  zc_(get|put)_bulk_finish().
> > Where _start would check/fill the cache and return the pointer,
> > while _finish would update cache->len.
> > Similar to what we have for rte_ring _peek_ API.
> > That would allow to extend this API usage - let say inside PMDs
> > it could be used not only for MBUF_FAST_FREE case,  but for generic
> > TX code path (one that have to call rte_mbuf_prefree()) also.
> 
> I don't see a use case for zc_get_start()/_finish().
> 
> And since the mempool cache is a stack, it would *require* that the application reads the array in reverse order. In such case, the
> function should not return a pointer to the array of objects, but a pointer to the top of the stack.
> 
> So I prefer to stick with the single-function zero-copy get, i.e. without start/finish.

Yes, it would be more complicated than just updating cache->len.
I don't have any real use-case for _get_ either - mostly just for symmetry with put.
 
> 
> 
> I do agree with you about the use case for zc_put_start()/_finish().
> 
> Unlike the ring, there is no need for locking with the mempool cache, so we can implement something much simpler:
> 
> Instead of requiring calling both zc_put_start() and _finish() for every zero-copy burst, we could add a zc_put_rewind() function, only
> to be called if some number of objects were not added anyway:
> 
> /* FIXME: Function documentation here. */
> __rte_experimental
> static __rte_always_inline void
> rte_mempool_cache_zc_put_rewind(struct rte_mempool_cache *cache,
> 		unsigned int n)
> {
> 	RTE_ASSERT(cache != NULL);
> 	RTE_ASSERT(n <= cache->len);
> 
> 	rte_mempool_trace_cache_zc_put_rewind(cache, n);
> 
> 	/* Rewind stats. */
> 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, -n);
> 
> 	cache->len -= n;
> }
> 
> I have a strong preference for _rewind() over _start() and _finish(), because in the full burst case, it only touches the
> rte_mempool_cache structure once, whereas splitting it up into _start() and _finish() touches the rte_mempool_cache structure both
> before and after copying the array of objects.
> 
> What do you think?

And your concern is that between _get_start(_C_) and get_finish(_C_) the _C_
cache line can be bumped out of CPU Dcache, right?
I don't think such a situation would be a common one.
But, if you think _rewind_ is a better approach - I am ok with it. 
 

> I am open for other names than _rewind(), so feel free to speak up if you have a better name.
> 
> 
> >
> > >   Code for this is only generated if built with RTE_ENABLE_ASSERT.
> > > * Removed fallback when 'cache' parameter is not set. (Honnappa)
> > > * Chose the simple get function; i.e. do not move the existing
> > objects in
> > >   the cache to the top of the new stack, just leave them at the
> > bottom.
> > > * Renamed the functions. Other suggestions are welcome, of course. ;-
> > )
> > > * Updated the function descriptions.
> > > * Added the functions to trace_fp and version.map.
> >
> > Would be great to add some test-cases in app/test to cover this new
> > API.
> 
> Good point. I will look at it.
> 
> BTW: Akshitha already has zc_put_bulk working in the i40e PMD.

That's great news, but I suppose it would be good to have some UT for it anyway.
Konstantin
  
Morten Brørup Dec. 24, 2022, 12:17 p.m. UTC | #6
> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> Sent: Friday, 23 December 2022 17.58
> 
> > > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > > Sent: Thursday, 22 December 2022 16.57
> > >
> > > > Zero-copy access to mempool caches is beneficial for PMD
> performance,
> > > and
> > > > must be provided by the mempool library to fix [Bug 1052] without
> a
> > > > performance regression.
> > >
> > > LGTM in general, thank you for working on it.
> > > Few comments below.

[...]

> > > RTE_ASSERT(n <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> > > I think it is too excessive.
> > > Just:
> > > if (n > RTE_MEMPOOL_CACHE_MAX_SIZE) return NULL;
> > > seems much more convenient for the users here and
> > > more close to other mempool/ring API behavior.
> > > In terms of performance - I don’t think one extra comparison here
> > > would really count.
> >
> > The insignificant performance degradation seems like a good tradeoff
> for making the function more generic.
> > I will update the function documentation and place the run-time check
> here, so both trace and stats reflect what happened:
> >
> > 	RTE_ASSERT(cache != NULL);
> > 	RTE_ASSERT(mp != NULL);
> > -	RTE_ASSERT(n <= RTE_MEMPOOL_CACHE_MAX_SIZE);
> >
> > 	rte_mempool_trace_cache_zc_put_bulk(cache, mp, n);
> > +
> > +	if (unlikely(n > RTE_MEMPOOL_CACHE_MAX_SIZE)) {
> > +		rte_errno = -ENOSPC; // Or EINVAL?
> > +		return NULL;
> > +	}
> >
> > 	/* Increment stats now, adding in mempool always succeeds. */
> >
> > I will probably also be able to come up with solution for
> zc_get_bulk(), so both trace and stats make sense if called with n >
> > RTE_MEMPOOL_CACHE_MAX_SIZE.

I have sent a new patch, where I switched to the same code flow as in the micro-optimization patch, so this run-time check doesn't affect the most common case.

Also, I realized that I need to compare against the cache flush threshold instead of RTE_MEMPOOL_CACHE_MAX_SIZE, to respect the cache size. Otherwise, a zc_cache_get() operation could deplete a small mempool; and zc_cache_put() could leave the cache with too many objects, thus violating the invariant that cache->len <= cache->flushthresh.
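
In other words, the bound check becomes something along these lines (sketch only, not the exact code in the new patch):

	/* Reject requests the cache cannot legally hold or serve. */
	if (unlikely(n > cache->flushthresh)) {
		rte_errno = EINVAL; /* or ENOSPC, as discussed above */
		return NULL;
	}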

> >
> > >
> > > I also think would be really good to add:
> > > add zc_(get|put)_bulk_start(),  zc_(get|put)_bulk_finish().
> > > Where _start would check/fill the cache and return the pointer,
> > > while _finish would update cache->len.
> > > Similar to what we have for rte_ring _peek_ API.
> > > That would allow to extend this API usage - let say inside PMDs
> > > it could be used not only for MBUF_FAST_FREE case,  but for generic
> > > TX code path (one that have to call rte_mbuf_prefree()) also.
> >
> > I don't see a use case for zc_get_start()/_finish().
> >
> > And since the mempool cache is a stack, it would *require* that the
> application reads the array in reverse order. In such case, the
> > function should not return a pointer to the array of objects, but a
> pointer to the top of the stack.
> >
> > So I prefer to stick with the single-function zero-copy get, i.e.
> without start/finish.
> 
> Yes, it would be more complicated than just update cache->len.
> I don't have any real use-case for _get_ too - mostly just for symmetry
> with put.
> 
> >
> >
> > I do agree with you about the use case for zc_put_start()/_finish().
> >
> > Unlike the ring, there is no need for locking with the mempool cache,
> so we can implement something much simpler:
> >
> > Instead of requiring calling both zc_put_start() and _finish() for
> every zero-copy burst, we could add a zc_put_rewind() function, only
> > to be called if some number of objects were not added anyway:
> >
> > /* FIXME: Function documentation here. */
> > __rte_experimental
> > static __rte_always_inline void
> > rte_mempool_cache_zc_put_rewind(struct rte_mempool_cache *cache,
> > 		unsigned int n)
> > {
> > 	RTE_ASSERT(cache != NULL);
> > 	RTE_ASSERT(n <= cache->len);
> >
> > 	rte_mempool_trace_cache_zc_put_rewind(cache, n);
> >
> > 	/* Rewind stats. */
> > 	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, -n);
> >
> > 	cache->len -= n;
> > }
> >
> > I have a strong preference for _rewind() over _start() and _finish(),
> because in the full burst case, it only touches the
> > rte_mempool_cache structure once, whereas splitting it up into
> _start() and _finish() touches the rte_mempool_cache structure both
> > before and after copying the array of objects.
> >
> > What do you think?
> 
> And your concern is that between _get_start(_C_) and get_finish(_C_)
> the _C_
> cache line can be bumped out of CPU Dcache, right?
> I don't think such situation would be a common one.

Yes, that is the essence of my concern. And I agree that it is probably uncommon.

There might also be some performance benefit from having the load/store/modify of _C_ close together, but I don't know enough about CPU internals to determine whether it is significant or not.

> But, if you think _rewind_ is a better approach - I am ok with it.

Thank you.

[...]

> > > Would be great to add some test-cases in app/test to cover this new
> > > API.
> >
> > Good point. I will look at it.
> >
> > BTW: Akshitha already has zc_put_bulk working in the i40e PMD.
> 
> That's great news, but I suppose it would be good to have some UT for
> it anyway.
> Konstantin

I don't have time to add unit tests now, but I sent an updated patch anyway, so the invariant bug doesn't bite Akshitha.

Merry Christmas, everyone!

-Morten
  

Patch

diff --git a/lib/mempool/mempool_trace_points.c b/lib/mempool/mempool_trace_points.c
index 4ad76deb34..a6070799af 100644
--- a/lib/mempool/mempool_trace_points.c
+++ b/lib/mempool/mempool_trace_points.c
@@ -77,3 +77,9 @@  RTE_TRACE_POINT_REGISTER(rte_mempool_trace_ops_free,
 
 RTE_TRACE_POINT_REGISTER(rte_mempool_trace_set_ops_byname,
 	lib.mempool.set.ops.byname)
+
+RTE_TRACE_POINT_REGISTER(rte_mempool_trace_cache_zc_put_bulk,
+	lib.mempool.cache.zc.put.bulk)
+
+RTE_TRACE_POINT_REGISTER(rte_mempool_trace_cache_zc_get_bulk,
+	lib.mempool.cache.zc.get.bulk)
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 9f530db24b..5e6da06bc7 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -47,6 +47,7 @@ 
 #include <rte_ring.h>
 #include <rte_memcpy.h>
 #include <rte_common.h>
+#include <rte_errno.h>
 
 #include "rte_mempool_trace_fp.h"
 
@@ -1346,6 +1347,129 @@  rte_mempool_cache_flush(struct rte_mempool_cache *cache,
 	cache->len = 0;
 }
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: This API may change, or be removed, without prior notice.
+ *
+ * Zero-copy put objects in a user-owned mempool cache backed by the specified mempool.
+ *
+ * @param cache
+ *   A pointer to the mempool cache.
+ * @param mp
+ *   A pointer to the mempool.
+ * @param n
+ *   The number of objects to be put in the mempool cache.
+ *   Must not exceed RTE_MEMPOOL_CACHE_MAX_SIZE.
+ * @return
+ *   The pointer to where to put the objects in the mempool cache.
+ */
+__rte_experimental
+static __rte_always_inline void *
+rte_mempool_cache_zc_put_bulk(struct rte_mempool_cache *cache,
+		struct rte_mempool *mp,
+		unsigned int n)
+{
+	void **cache_objs;
+
+	RTE_ASSERT(cache != NULL);
+	RTE_ASSERT(mp != NULL);
+	RTE_ASSERT(n <= RTE_MEMPOOL_CACHE_MAX_SIZE);
+
+	rte_mempool_trace_cache_zc_put_bulk(cache, mp, n);
+
+	/* Increment stats now, adding in mempool always succeeds. */
+	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
+	RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
+
+	/*
+	 * The cache follows the following algorithm:
+	 *   1. If the objects cannot be added to the cache without crossing
+	 *      the flush threshold, flush the cache to the backend.
+	 *   2. Add the objects to the cache.
+	 */
+
+	if (cache->len + n <= cache->flushthresh) {
+		cache_objs = &cache->objs[cache->len];
+		cache->len += n;
+	} else {
+		cache_objs = &cache->objs[0];
+		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
+		cache->len = n;
+	}
+
+	return cache_objs;
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: This API may change, or be removed, without prior notice.
+ *
+ * Zero-copy get objects from a user-owned mempool cache backed by the specified mempool.
+ *
+ * @param cache
+ *   A pointer to the mempool cache.
+ * @param mp
+ *   A pointer to the mempool.
+ * @param n
+ *   The number of objects to prefetch into the mempool cache.
+ *   Must not exceed RTE_MEMPOOL_CACHE_MAX_SIZE.
+ * @return
+ *   The pointer to the objects in the mempool cache.
+ *   NULL on error; i.e. the cache + the pool does not contain n objects.
+ *   With rte_errno set to the error code of the mempool dequeue function.
+ */
+__rte_experimental
+static __rte_always_inline void *
+rte_mempool_cache_zc_get_bulk(struct rte_mempool_cache *cache,
+		struct rte_mempool *mp,
+		unsigned int n)
+{
+	unsigned int len;
+
+	RTE_ASSERT(cache != NULL);
+	RTE_ASSERT(mp != NULL);
+	RTE_ASSERT(n <= RTE_MEMPOOL_CACHE_MAX_SIZE);
+
+	rte_mempool_trace_cache_zc_get_bulk(cache, mp, n);
+
+	len = cache->len;
+
+	if (unlikely(n > len)) {
+		/* Fill the cache from the backend; fetch size + requested - len objects. */
+		int ret;
+		const unsigned int size = cache->size;
+
+		ret = rte_mempool_ops_dequeue_bulk(mp, &cache->objs[len], size + n - len);
+		if (unlikely(ret < 0)) {
+			/*
+			 * We are buffer constrained.
+			 * Do not fill the cache, just satisfy the request.
+			 */
+			ret = rte_mempool_ops_dequeue_bulk(mp, &cache->objs[len], n - len);
+			if (unlikely(ret < 0)) {
+				/* Unable to satisfy the request. */
+
+				RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
+				RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
+
+				rte_errno = -ret;
+				return NULL;
+			}
+
+			len = 0;
+		} else
+			len = size;
+	} else
+		len -= n;
+
+	cache->len = len;
+
+	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
+	RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
+
+	return &cache->objs[len];
+}
+
 /**
  * @internal Put several objects back in the mempool; used internally.
  * @param mp
diff --git a/lib/mempool/rte_mempool_trace_fp.h b/lib/mempool/rte_mempool_trace_fp.h
index ed060e887c..00567fb1cf 100644
--- a/lib/mempool/rte_mempool_trace_fp.h
+++ b/lib/mempool/rte_mempool_trace_fp.h
@@ -109,6 +109,22 @@  RTE_TRACE_POINT_FP(
 	rte_trace_point_emit_ptr(mempool);
 )
 
+RTE_TRACE_POINT_FP(
+	rte_mempool_trace_cache_zc_put_bulk,
+	RTE_TRACE_POINT_ARGS(void *cache, void *mempool, uint32_t nb_objs),
+	rte_trace_point_emit_ptr(cache);
+	rte_trace_point_emit_ptr(mempool);
+	rte_trace_point_emit_u32(nb_objs);
+)
+
+RTE_TRACE_POINT_FP(
+	rte_mempool_trace_cache_zc_get_bulk,
+	RTE_TRACE_POINT_ARGS(void *cache, void *mempool, uint32_t nb_objs),
+	rte_trace_point_emit_ptr(cache);
+	rte_trace_point_emit_ptr(mempool);
+	rte_trace_point_emit_u32(nb_objs);
+)
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/mempool/version.map b/lib/mempool/version.map
index b67d7aace7..927477b977 100644
--- a/lib/mempool/version.map
+++ b/lib/mempool/version.map
@@ -63,6 +63,10 @@  EXPERIMENTAL {
 	__rte_mempool_trace_ops_alloc;
 	__rte_mempool_trace_ops_free;
 	__rte_mempool_trace_set_ops_byname;
+
+	# added in 23.03
+	__rte_mempool_trace_cache_zc_put_bulk;
+	__rte_mempool_trace_cache_zc_get_bulk;
 };
 
 INTERNAL {