mempool: fix get objects from mempool with cache

Message ID 20220114163650.94288-1-mb@smartsharesystems.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series mempool: fix get objects from mempool with cache

Checks

Context Check Description
ci/checkpatch warning coding style issues
ci/Intel-compilation success Compilation OK
ci/iol-broadcom-Functional success Functional Testing PASS
ci/intel-Testing success Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/github-robot: build success github build: passed
ci/iol-abi-testing success Testing PASS
ci/iol-aarch64-unit-testing success Testing PASS
ci/iol-x86_64-unit-testing success Testing PASS
ci/iol-x86_64-compile-testing success Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS

Commit Message

Morten Brørup Jan. 14, 2022, 4:36 p.m. UTC
  A flush threshold for the mempool cache was introduced in DPDK version
1.3, but rte_mempool_do_generic_get() was not completely updated back
then, and some inefficiencies were introduced.

This patch fixes the following in rte_mempool_do_generic_get():

1. The code that initially screens the cache request was not updated
with the change in DPDK version 1.3.
The initial screening compared the request length to the cache size,
which was correct before, but became irrelevant with the introduction of
the flush threshold. E.g. the cache can hold up to flushthresh objects,
which is more than its size, so some requests were not served from the
cache, even though they could be.
The initial screening has now been corrected to match the initial
screening in rte_mempool_do_generic_put(), which verifies that a cache
is present, and that the length of the request does not overflow the
memory allocated for the cache.

2. The function is a helper for rte_mempool_generic_get(), so it must
behave according to the description of that function.
Specifically, objects must first be returned from the cache,
subsequently from the ring.
After the change in DPDK version 1.3, this was not the behavior when
the request was partially satisfied from the cache; instead, the objects
from the ring were returned ahead of the objects from the cache. This is
bad for CPUs with a small L1 cache, which benefit from having the hot
objects first in the returned array. (This is also the reason why
the function returns the objects in reverse order.)
Now, all code paths first return objects from the cache, subsequently
from the ring.

3. If the cache could not be backfilled, the function would attempt
to get all the requested objects from the ring (instead of only the
number of requested objects minus the objects already available in the
cache), and the function would fail if that failed.
Now, the first part of the request is always satisfied from the cache,
and if the subsequent backfilling of the cache from the ring fails, only
the remaining requested objects are retrieved from the ring.

4. The code flow for satisfying the request from the cache was slightly
inefficient:
The likely code path where the objects are simply served from the cache
was treated as unlikely. Now it is treated as likely.
And in the code path where the cache was backfilled first, numbers were
added and subtracted from the cache length; now this code path simply
sets the cache length to its final value.

5. Some comments were not correct anymore.
The comments have been updated.
Most importantly, the description of the successful return value was
inaccurate. Success only returns 0, not >= 0.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 81 ++++++++++++++++++++++++++++-----------
 1 file changed, 59 insertions(+), 22 deletions(-)
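
To make the changes described above easier to follow, here is a compact sketch of the resulting control flow, modelled on the patch at the bottom of this page. The toy_cache structure, the TOY_CACHE_MAX_SIZE value, and the toy_ring/ring_get() helpers are simplified stand-ins for rte_mempool_cache, RTE_MEMPOOL_CACHE_MAX_SIZE and rte_mempool_ops_dequeue_bulk(), introduced here only for illustration; they are not part of the patch or of the DPDK API.

  /* Simplified model of rte_mempool_do_generic_get() after this patch;
   * all names below are stand-ins for illustration only. */
  #include <stddef.h>

  #define TOY_CACHE_MAX_SIZE 512  /* stands in for RTE_MEMPOOL_CACHE_MAX_SIZE */

  struct toy_cache {
      unsigned int size;                  /* nominal size, <= TOY_CACHE_MAX_SIZE */
      unsigned int len;                   /* number of objects currently cached */
      void *objs[TOY_CACHE_MAX_SIZE * 2]; /* the cache may legally exceed 'size' */
  };

  /* Stand-in for rte_mempool_ops_dequeue_bulk(): all-or-nothing dequeue. */
  static void *toy_ring[1024];
  static unsigned int toy_ring_len = 1024; /* pretend the ring holds 1024 objects */

  static int ring_get(void **obj_table, unsigned int n)
  {
      if (n > toy_ring_len)
          return -1;
      while (n-- > 0)
          *obj_table++ = toy_ring[--toy_ring_len];
      return 0;
  }

  int toy_get(struct toy_cache *cache, void **obj_table, unsigned int n)
  {
      unsigned int index, len;
      void **cache_objs;

      /* Point 1: screen against the memory allocated for the cache,
       * not against the nominal cache size. */
      if (cache == NULL || n > TOY_CACHE_MAX_SIZE)
          return ring_get(obj_table, n);

      cache_objs = &cache->objs[cache->len];

      /* Point 4: the likely case, served entirely from the cache (LIFO). */
      if (n <= cache->len) {
          cache->len -= n;
          for (index = 0; index < n; index++)
              *obj_table++ = *--cache_objs;
          return 0;
      }

      /* Point 2: cache first, ring second. Deplete the cache ... */
      len = cache->len;
      for (index = 0; index < len; index++)
          *obj_table++ = *--cache_objs;
      len = n - len; /* objects still missing */

      /* ... then refill the cache plus the missing objects from the ring. */
      if (ring_get(cache->objs, cache->size + len) < 0) {
          /* Point 3: on failure, fetch only the remainder, not all n. */
          if (ring_get(obj_table, len) < 0)
              return -1; /* cache->len and the cached objects are intact */
          cache->len = 0; /* commit that the cache was emptied above */
          return 0;
      }

      /* Serve the remainder from the freshly filled cache, newest first. */
      cache_objs = &cache->objs[cache->size + len];
      cache->len = cache->size;
      for (index = 0; index < len; index++)
          *obj_table++ = *--cache_objs;
      return 0;
  }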
  

Comments

Bruce Richardson Jan. 17, 2022, 5:35 p.m. UTC | #1
On Fri, Jan 14, 2022 at 05:36:50PM +0100, Morten Brørup wrote:
> A flush threshold for the mempool cache was introduced in DPDK version
> 1.3, but rte_mempool_do_generic_get() was not completely updated back
> then, and some inefficiencies were introduced.
> 
> This patch fixes the following in rte_mempool_do_generic_get():
> 
> 1. The code that initially screens the cache request was not updated
> with the change in DPDK version 1.3.
> The initial screening compared the request length to the cache size,
> which was correct before, but became irrelevant with the introduction of
> the flush threshold. E.g. the cache can hold up to flushthresh objects,
> which is more than its size, so some requests were not served from the
> cache, even though they could be.
> The initial screening has now been corrected to match the initial
> screening in rte_mempool_do_generic_put(), which verifies that a cache
> is present, and that the length of the request does not overflow the
> memory allocated for the cache.
> 
> 2. The function is a helper for rte_mempool_generic_get(), so it must
> behave according to the description of that function.
> Specifically, objects must first be returned from the cache,
> subsequently from the ring.
> After the change in DPDK version 1.3, this was not the behavior when
> the request was partially satisfied from the cache; instead, the objects
> from the ring were returned ahead of the objects from the cache. This is
> bad for CPUs with a small L1 cache, which benefit from having the hot
> objects first in the returned array. (This is also the reason why
> the function returns the objects in reverse order.)
> Now, all code paths first return objects from the cache, subsequently
> from the ring.
> 
> 3. If the cache could not be backfilled, the function would attempt
> to get all the requested objects from the ring (instead of only the
> number of requested objects minus the objects available in the ring),
> and the function would fail if that failed.
> Now, the first part of the request is always satisfied from the cache,
> and if the subsequent backfilling of the cache from the ring fails, only
> the remaining requested objects are retrieved from the ring.
> 
> 4. The code flow for satisfying the request from the cache was slightly
> inefficient:
> The likely code path where the objects are simply served from the cache
> was treated as unlikely. Now it is treated as likely.
> And in the code path where the cache was backfilled first, numbers were
> added and subtracted from the cache length; now this code path simply
> sets the cache length to its final value.
> 
> 5. Some comments were not correct anymore.
> The comments have been updated.
> Most importanly, the description of the succesful return value was
> inaccurate. Success only returns 0, not >= 0.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---

I am a little uncertain about the reversing of the copies taking things out
of the mempool - for machines where we are not that cache constrained, will
we lose out on possible optimizations where the compiler optimizes the copy
loop as a memcpy?

Otherwise the logic all looks correct to me.

/Bruce
  
Morten Brørup Jan. 18, 2022, 8:25 a.m. UTC | #2
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Monday, 17 January 2022 18.35
> 
> On Fri, Jan 14, 2022 at 05:36:50PM +0100, Morten Brørup wrote:
> > A flush threshold for the mempool cache was introduced in DPDK
> version
> > 1.3, but rte_mempool_do_generic_get() was not completely updated back
> > then, and some inefficiencies were introduced.
> >
> > This patch fixes the following in rte_mempool_do_generic_get():
> >
> > 1. The code that initially screens the cache request was not updated
> > with the change in DPDK version 1.3.
> > The initial screening compared the request length to the cache size,
> > which was correct before, but became irrelevant with the introduction
> of
> > the flush threshold. E.g. the cache can hold up to flushthresh
> objects,
> > which is more than its size, so some requests were not served from
> the
> > cache, even though they could be.
> > The initial screening has now been corrected to match the initial
> > screening in rte_mempool_do_generic_put(), which verifies that a
> cache
> > is present, and that the length of the request does not overflow the
> > memory allocated for the cache.
> >
> > 2. The function is a helper for rte_mempool_generic_get(), so it must
> > behave according to the description of that function.
> > Specifically, objects must first be returned from the cache,
> > subsequently from the ring.
> > After the change in DPDK version 1.3, this was not the behavior when
> > the request was partially satisfied from the cache; instead, the
> objects
> > from the ring were returned ahead of the objects from the cache. This
> is
> > bad for CPUs with a small L1 cache, which benefit from having the hot
> > objects first in the returned array. (This is also the reason why
> > the function returns the objects in reverse order.)
> > Now, all code paths first return objects from the cache, subsequently
> > from the ring.
> >
> > 3. If the cache could not be backfilled, the function would attempt
> > to get all the requested objects from the ring (instead of only the
> > number of requested objects minus the objects available in the ring),
> > and the function would fail if that failed.
> > Now, the first part of the request is always satisfied from the
> cache,
> > and if the subsequent backfilling of the cache from the ring fails,
> only
> > the remaining requested objects are retrieved from the ring.
> >
> > 4. The code flow for satisfying the request from the cache was
> slightly
> > inefficient:
> > The likely code path where the objects are simply served from the
> cache
> > was treated as unlikely. Now it is treated as likely.
> > And in the code path where the cache was backfilled first, numbers
> were
> > added and subtracted from the cache length; now this code path simply
> > sets the cache length to its final value.
> >
> > 5. Some comments were not correct anymore.
> > The comments have been updated.
> > Most importanly, the description of the succesful return value was
> > inaccurate. Success only returns 0, not >= 0.
> >
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > ---
> 
> I am a little uncertain about the reversing of the copies taking things
> out
> of the mempool - for machines where we are not that cache constrainted
> will
> we lose out in possible optimizations where the compiler optimizes the
> copy
> loop as a memcpy?

The objects are also returned in reverse order in the code it replaces, so this behavior is not introduced by this patch; I only describe the reason for it.

I floated a previous patch, in which the objects were returned in order, but Jerin argued [1] that we should keep it the way it was, unless I could show a performance improvement.

So I retracted that patch and split it up into two independent patches instead: this patch for get(), and [3] for put().

While experimenting with rte_memcpy() for these, I couldn't achieve a performance boost - quite the opposite - so I gave up on it.

Reviewing the x86 variant of rte_memcpy() [2] makes me think that it is inefficient for copying small bulks of pointers, especially when n is unknown at compile time, as its code path goes through a great many branches.

> 
> Otherwise the logic all looks correct to me.
> 
> /Bruce

[1]: http://inbox.dpdk.org/dev/CALBAE1OjCswxUfaNLWg5y-tnPkFhvvKQ8sJ3JpBoo7ObgeB5OA@mail.gmail.com/
[2]: http://code.dpdk.org/dpdk/latest/source/lib/eal/x86/include/rte_memcpy.h
[3]: http://inbox.dpdk.org/dev/20220117115231.8060-1-mb@smartsharesystems.com/
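
For reference, the two copy patterns being discussed look roughly like this. The reverse loop body mirrors the patch; whether a given compiler turns the forward variant into a memcpy() call, and how well it vectorizes the reverse variant, depends on the compiler and flags, so the comments below are general expectations rather than guarantees.

  /* Forward copy: destination and source both ascend. Compilers commonly
   * recognize this idiom and may replace the loop with a block copy. */
  void copy_fwd(void **dst, void * const *src, unsigned int n)
  {
      unsigned int i;

      for (i = 0; i < n; i++)
          dst[i] = src[i];
  }

  /* Reverse copy, as used when draining the cache: destination ascends
   * while source descends, so the result is a reversal rather than a
   * plain block copy; a memcpy() substitution does not apply, although
   * vectorization with shuffles is still possible. */
  void copy_rev(void **dst, void **src_end, unsigned int n)
  {
      unsigned int i;

      for (i = 0; i < n; i++)
          *dst++ = *--src_end;
  }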
  
Bruce Richardson Jan. 18, 2022, 9:07 a.m. UTC | #3
On Tue, Jan 18, 2022 at 09:25:22AM +0100, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Monday, 17 January 2022 18.35
> > 
> > On Fri, Jan 14, 2022 at 05:36:50PM +0100, Morten Brørup wrote:
> > > A flush threshold for the mempool cache was introduced in DPDK
> > version
> > > 1.3, but rte_mempool_do_generic_get() was not completely updated back
> > > then, and some inefficiencies were introduced.
> > >
> > > This patch fixes the following in rte_mempool_do_generic_get():
> > >
> > > 1. The code that initially screens the cache request was not updated
> > > with the change in DPDK version 1.3.
> > > The initial screening compared the request length to the cache size,
> > > which was correct before, but became irrelevant with the introduction
> > of
> > > the flush threshold. E.g. the cache can hold up to flushthresh
> > objects,
> > > which is more than its size, so some requests were not served from
> > the
> > > cache, even though they could be.
> > > The initial screening has now been corrected to match the initial
> > > screening in rte_mempool_do_generic_put(), which verifies that a
> > cache
> > > is present, and that the length of the request does not overflow the
> > > memory allocated for the cache.
> > >
> > > 2. The function is a helper for rte_mempool_generic_get(), so it must
> > > behave according to the description of that function.
> > > Specifically, objects must first be returned from the cache,
> > > subsequently from the ring.
> > > After the change in DPDK version 1.3, this was not the behavior when
> > > the request was partially satisfied from the cache; instead, the
> > objects
> > > from the ring were returned ahead of the objects from the cache. This
> > is
> > > bad for CPUs with a small L1 cache, which benefit from having the hot
> > > objects first in the returned array. (This is also the reason why
> > > the function returns the objects in reverse order.)
> > > Now, all code paths first return objects from the cache, subsequently
> > > from the ring.
> > >
> > > 3. If the cache could not be backfilled, the function would attempt
> > > to get all the requested objects from the ring (instead of only the
> > > number of requested objects minus the objects available in the ring),
> > > and the function would fail if that failed.
> > > Now, the first part of the request is always satisfied from the
> > cache,
> > > and if the subsequent backfilling of the cache from the ring fails,
> > only
> > > the remaining requested objects are retrieved from the ring.
> > >
> > > 4. The code flow for satisfying the request from the cache was
> > slightly
> > > inefficient:
> > > The likely code path where the objects are simply served from the
> > cache
> > > was treated as unlikely. Now it is treated as likely.
> > > And in the code path where the cache was backfilled first, numbers
> > were
> > > added and subtracted from the cache length; now this code path simply
> > > sets the cache length to its final value.
> > >
> > > 5. Some comments were not correct anymore.
> > > The comments have been updated.
> > > Most importanly, the description of the succesful return value was
> > > inaccurate. Success only returns 0, not >= 0.
> > >
> > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > ---
> > 
> > I am a little uncertain about the reversing of the copies taking things
> > out
> > of the mempool - for machines where we are not that cache constrainted
> > will
> > we lose out in possible optimizations where the compiler optimizes the
> > copy
> > loop as a memcpy?
> 
> The objects are also returned in reverse order in the code it replaces, so this behavior is not introduced by this patch; I only describe the reason for it.
> 
> I floated a previous patch, in which the objects were returned in order, but Jerin argued [1] that we should keep it the way it was, unless I could show a performance improvement.
> 
> So I retracted that patch to split it up in two independent patches instead. This patch for get(), and [3] for put().
> 
> While experimenting using rte_memcpy() for these, I couldn't achieve a performance boost - quite the opposite. So I gave up on it.
> 
> Reviewing the x86 variant of rte_memcpy() [2] makes me think that it is inefficient for copying small bulks of pointers, especially when n is unknown at compile time, and its code path goes through a great deal of branches.
>
Thanks for all the explanation.

Reviewed-by: Bruce Richardson <bruce.richardson@intel.com>
  
Olivier Matz Jan. 24, 2022, 3:38 p.m. UTC | #4
Hi Morten,

Few comments below.

On Fri, Jan 14, 2022 at 05:36:50PM +0100, Morten Brørup wrote:
> A flush threshold for the mempool cache was introduced in DPDK version
> 1.3, but rte_mempool_do_generic_get() was not completely updated back
> then, and some inefficiencies were introduced.
> 
> This patch fixes the following in rte_mempool_do_generic_get():
> 
> 1. The code that initially screens the cache request was not updated
> with the change in DPDK version 1.3.
> The initial screening compared the request length to the cache size,
> which was correct before, but became irrelevant with the introduction of
> the flush threshold. E.g. the cache can hold up to flushthresh objects,
> which is more than its size, so some requests were not served from the
> cache, even though they could be.
> The initial screening has now been corrected to match the initial
> screening in rte_mempool_do_generic_put(), which verifies that a cache
> is present, and that the length of the request does not overflow the
> memory allocated for the cache.
> 
> 2. The function is a helper for rte_mempool_generic_get(), so it must
> behave according to the description of that function.
> Specifically, objects must first be returned from the cache,
> subsequently from the ring.
> After the change in DPDK version 1.3, this was not the behavior when
> the request was partially satisfied from the cache; instead, the objects
> from the ring were returned ahead of the objects from the cache. This is
> bad for CPUs with a small L1 cache, which benefit from having the hot
> objects first in the returned array. (This is also the reason why
> the function returns the objects in reverse order.)
> Now, all code paths first return objects from the cache, subsequently
> from the ring.
> 
> 3. If the cache could not be backfilled, the function would attempt
> to get all the requested objects from the ring (instead of only the
> number of requested objects minus the objects available in the ring),
> and the function would fail if that failed.
> Now, the first part of the request is always satisfied from the cache,
> and if the subsequent backfilling of the cache from the ring fails, only
> the remaining requested objects are retrieved from the ring.

This is the only point I'd consider to be a fix. The problem, from the
user perspective, is that a get() can fail even though there are enough
objects in the cache plus the common pool.

To be honest, I feel a bit uncomfortable having such a list of
problems solved in one commit, even if I understand that they are part
of the same code rework.

Ideally, this fix should be a separate commit. What do you think of
having this simple patch for this fix, and then do the
optimizations/rework in another commit?

  --- a/lib/mempool/rte_mempool.h
  +++ b/lib/mempool/rte_mempool.h
  @@ -1484,7 +1484,22 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
                           * the ring directly. If that fails, we are truly out of
                           * buffers.
                           */
  -                       goto ring_dequeue;
  +                       req = n - cache->len;
  +                       ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, req);
  +                       if (ret < 0) {
  +                               RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
  +                               RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
  +                               return ret;
  +                       }
  +                       obj_table += req;
  +                       len = cache->len;
  +                       while (len > 0)
  +                               *obj_table++ = cache_objs[--len];
  +                       cache->len = 0;
  +                       RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
  +                       RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
  +
  +                       return 0;
                  }
   
                  cache->len += req;

The title of this commit could then be more precise to describe
the solved issue.

> 4. The code flow for satisfying the request from the cache was slightly
> inefficient:
> The likely code path where the objects are simply served from the cache
> was treated as unlikely. Now it is treated as likely.
> And in the code path where the cache was backfilled first, numbers were
> added and subtracted from the cache length; now this code path simply
> sets the cache length to its final value.
> 
> 5. Some comments were not correct anymore.
> The comments have been updated.
> Most importanly, the description of the succesful return value was
> inaccurate. Success only returns 0, not >= 0.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>  lib/mempool/rte_mempool.h | 81 ++++++++++++++++++++++++++++-----------
>  1 file changed, 59 insertions(+), 22 deletions(-)
> 
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 1e7a3c1527..88f1b8b7ab 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -1443,6 +1443,10 @@ rte_mempool_put(struct rte_mempool *mp, void *obj)
>  
>  /**
>   * @internal Get several objects from the mempool; used internally.
> + *
> + * If cache is enabled, objects are returned from the cache in Last In First
> + * Out (LIFO) order for the benefit of CPUs with small L1 cache.
> + *
>   * @param mp
>   *   A pointer to the mempool structure.
>   * @param obj_table
> @@ -1452,7 +1456,7 @@ rte_mempool_put(struct rte_mempool *mp, void *obj)
>   * @param cache
>   *   A pointer to a mempool cache structure. May be NULL if not needed.
>   * @return
> - *   - >=0: Success; number of objects supplied.
> + *   - 0: Success; got n objects.
>   *   - <0: Error; code of ring dequeue function.
>   */
>  static __rte_always_inline int

I think that part should be in a separate commit too. This is a
documentation fix, which is easily backportable (and should be
backported) (Fixes: af75078fece3 ("first public release")).

> @@ -1463,38 +1467,71 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
>  	uint32_t index, len;
>  	void **cache_objs;
>  
> -	/* No cache provided or cannot be satisfied from cache */
> -	if (unlikely(cache == NULL || n >= cache->size))
> +	/* No cache provided or if get would overflow mem allocated for cache */
> +	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
>  		goto ring_dequeue;
>  
> -	cache_objs = cache->objs;
> +	cache_objs = &cache->objs[cache->len];
> +
> +	if (n <= cache->len) {
> +		/* The entire request can be satisfied from the cache. */
> +		cache->len -= n;
> +		for (index = 0; index < n; index++)
> +			*obj_table++ = *--cache_objs;
>  
> -	/* Can this be satisfied from the cache? */
> -	if (cache->len < n) {
> -		/* No. Backfill the cache first, and then fill from it */
> -		uint32_t req = n + (cache->size - cache->len);
> +		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> +		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
>  
> -		/* How many do we require i.e. number to fill the cache + the request */
> -		ret = rte_mempool_ops_dequeue_bulk(mp,
> -			&cache->objs[cache->len], req);
> +		return 0;
> +	}
> +
> +	/* Satisfy the first part of the request by depleting the cache. */
> +	len = cache->len;
> +	for (index = 0; index < len; index++)
> +		*obj_table++ = *--cache_objs;
> +
> +	/* Number of objects remaining to satisfy the request. */
> +	len = n - len;
> +
> +	/* Fill the cache from the ring; fetch size + remaining objects. */
> +	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
> +			cache->size + len);
> +	if (unlikely(ret < 0)) {
> +		/*
> +		 * We are buffer constrained, and not able to allocate
> +		 * cache + remaining.
> +		 * Do not fill the cache, just satisfy the remaining part of
> +		 * the request directly from the ring.
> +		 */
> +		ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, len);
>  		if (unlikely(ret < 0)) {
>  			/*
> -			 * In the off chance that we are buffer constrained,
> -			 * where we are not able to allocate cache + n, go to
> -			 * the ring directly. If that fails, we are truly out of
> -			 * buffers.
> +			 * That also failed.
> +			 * No furter action is required to roll the first
> +			 * part of the request back into the cache, as both
> +			 * cache->len and the objects in the cache are intact.
>  			 */
> -			goto ring_dequeue;
> +			RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
> +			RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
> +
> +			return ret;
>  		}
>  
> -		cache->len += req;
> +		/* Commit that the cache was emptied. */
> +		cache->len = 0;
> +
> +		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> +		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> +
> +		return 0;
>  	}
>  
> -	/* Now fill in the response ... */
> -	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
> -		*obj_table = cache_objs[len];
> +	cache_objs = &cache->objs[cache->size + len];
>  
> -	cache->len -= n;
> +	/* Satisfy the remaining part of the request from the filled cache. */
> +	cache->len = cache->size;
> +	for (index = 0; index < len; index++)
> +		*obj_table++ = *--cache_objs;
>  
>  	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
>  	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> @@ -1503,7 +1540,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
>  
>  ring_dequeue:
>  
> -	/* get remaining objects from ring */
> +	/* Get the objects from the ring. */
>  	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, n);
>  
>  	if (ret < 0) {

About the code itself, it is more readable now, and probably more
efficient. Did you notice any performance change in the mempool perf
autotests?

Thanks,
Olivier
  
Olivier Matz Jan. 24, 2022, 4:11 p.m. UTC | #5
On Mon, Jan 24, 2022 at 04:38:58PM +0100, Olivier Matz wrote:
> On Fri, Jan 14, 2022 at 05:36:50PM +0100, Morten Brørup wrote:
> > --- a/lib/mempool/rte_mempool.h
> > +++ b/lib/mempool/rte_mempool.h
> > @@ -1443,6 +1443,10 @@ rte_mempool_put(struct rte_mempool *mp, void *obj)
> >  
> >  /**
> >   * @internal Get several objects from the mempool; used internally.
> > + *
> > + * If cache is enabled, objects are returned from the cache in Last In First
> > + * Out (LIFO) order for the benefit of CPUs with small L1 cache.
> > + *
> >   * @param mp
> >   *   A pointer to the mempool structure.
> >   * @param obj_table
> > @@ -1452,7 +1456,7 @@ rte_mempool_put(struct rte_mempool *mp, void *obj)
> >   * @param cache
> >   *   A pointer to a mempool cache structure. May be NULL if not needed.
> >   * @return
> > - *   - >=0: Success; number of objects supplied.
> > + *   - 0: Success; got n objects.
> >   *   - <0: Error; code of ring dequeue function.
> >   */
> >  static __rte_always_inline int
> 
> I think that part should be in a separate commit too. This is a
> documentation fix, which is easily backportable (and should be
> backported) (Fixes: af75078fece3 ("first public release")).

I see that the same change is also part of this commit:
https://patches.dpdk.org/project/dpdk/patch/20211223100741.21292-1-chenzhiheng0227@gmail.com/

I think it is better to have a doc fix commit, and remove this chunk
from this patch.
  
Morten Brørup Jan. 28, 2022, 10:22 a.m. UTC | #6
Olivier, thank you for the detailed feedback on my mempool RFC and patches.

We might disagree on some points, but that is the point of having a discussion. :-)

> From: Olivier Matz [mailto:olivier.matz@6wind.com]
> Sent: Monday, 24 January 2022 16.39
> 
> Hi Morten,
> 
> Few comments below.
> 
> On Fri, Jan 14, 2022 at 05:36:50PM +0100, Morten Brørup wrote:
> > A flush threshold for the mempool cache was introduced in DPDK
> version
> > 1.3, but rte_mempool_do_generic_get() was not completely updated back
> > then, and some inefficiencies were introduced.
> >
> > This patch fixes the following in rte_mempool_do_generic_get():
> >
> > 1. The code that initially screens the cache request was not updated
> > with the change in DPDK version 1.3.
> > The initial screening compared the request length to the cache size,
> > which was correct before, but became irrelevant with the introduction
> of
> > the flush threshold. E.g. the cache can hold up to flushthresh
> objects,
> > which is more than its size, so some requests were not served from
> the
> > cache, even though they could be.
> > The initial screening has now been corrected to match the initial
> > screening in rte_mempool_do_generic_put(), which verifies that a
> cache
> > is present, and that the length of the request does not overflow the
> > memory allocated for the cache.

This bug will cause a major performance degradation in the scenario where the application burst length is the same as the cache size: with the old screening (n >= cache->size), such requests always bypass the cache, so the objects are never fetched from the mempool cache. For example, with a cache size of 32, every get of a 32-object burst goes straight to the ring, even when the cache is full.

This scenario occurs if an application has configured its mempool cache size to match the application's burst size. Do any such applications exist? I don't know.

> >
> > 2. The function is a helper for rte_mempool_generic_get(), so it must
> > behave according to the description of that function.
> > Specifically, objects must first be returned from the cache,
> > subsequently from the ring.
> > After the change in DPDK version 1.3, this was not the behavior when
> > the request was partially satisfied from the cache; instead, the
> objects
> > from the ring were returned ahead of the objects from the cache. This
> is
> > bad for CPUs with a small L1 cache, which benefit from having the hot
> > objects first in the returned array. (This is also the reason why
> > the function returns the objects in reverse order.)
> > Now, all code paths first return objects from the cache, subsequently
> > from the ring.

Formally, the function is buggy when it isn't doing what it is supposed to do.

But yes, it only has a performance impact, and perhaps mostly on low-end hardware.

> >
> > 3. If the cache could not be backfilled, the function would attempt
> > to get all the requested objects from the ring (instead of only the
> > number of requested objects minus the objects available in the ring),
> > and the function would fail if that failed.
> > Now, the first part of the request is always satisfied from the
> cache,
> > and if the subsequent backfilling of the cache from the ring fails,
> only
> > the remaining requested objects are retrieved from the ring.
> 
> This is the only point I'd consider to be a fix. The problem, from the
> user perspective, is that a get() can fail despite there are enough
> objects in cache + common pool.
> 
> To be honnest, I feel a bit uncomfortable to have such a list of
> problems solved in one commit, even if I understand that they are part
> of the same code rework.
> 
> Ideally, this fix should be a separate commit. What do you think of
> having this simple patch for this fix, and then do the
> optimizations/rework in another commit?
> 
>   --- a/lib/mempool/rte_mempool.h
>   +++ b/lib/mempool/rte_mempool.h
>   @@ -1484,7 +1484,22 @@ rte_mempool_do_generic_get(struct rte_mempool
> *mp, void **obj_table,
>                            * the ring directly. If that fails, we are
> truly out of
>                            * buffers.
>                            */
>   -                       goto ring_dequeue;
>   +                       req = n - cache->len;
>   +                       ret = rte_mempool_ops_dequeue_bulk(mp,
> obj_table, req);
>   +                       if (ret < 0) {
>   +                               RTE_MEMPOOL_STAT_ADD(mp,
> get_fail_bulk, 1);
>   +                               RTE_MEMPOOL_STAT_ADD(mp,
> get_fail_objs, n);
>   +                               return ret;
>   +                       }
>   +                       obj_table += req;
>   +                       len = cache->len;
>   +                       while (len > 0)
>   +                               *obj_table++ = cache_objs[--len];
>   +                       cache->len = 0;
>   +                       RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk,
> 1);
>   +                       RTE_MEMPOOL_STAT_ADD(mp, get_success_objs,
> n);
>   +
>   +                       return 0;
>                   }
> 
>                   cache->len += req;
> 
> The title of this commit could then be more precise to describe
> the solved issue.

I get your point here, but I still consider the other modifications bug fixes too, so a unified rework patch works better for me.

As you also noticed yourself, the resulting code is clear and easily readable. If that were not the case, I might have agreed to break it up into a couple of steps.

> 
> > 4. The code flow for satisfying the request from the cache was
> slightly
> > inefficient:
> > The likely code path where the objects are simply served from the
> cache
> > was treated as unlikely. Now it is treated as likely.
> > And in the code path where the cache was backfilled first, numbers
> were
> > added and subtracted from the cache length; now this code path simply
> > sets the cache length to its final value.
> >
> > 5. Some comments were not correct anymore.
> > The comments have been updated.
> > Most importanly, the description of the succesful return value was
> > inaccurate. Success only returns 0, not >= 0.
> >
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > ---
> >  lib/mempool/rte_mempool.h | 81 ++++++++++++++++++++++++++++---------
> --
> >  1 file changed, 59 insertions(+), 22 deletions(-)
> >
> > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > index 1e7a3c1527..88f1b8b7ab 100644
> > --- a/lib/mempool/rte_mempool.h
> > +++ b/lib/mempool/rte_mempool.h
> > @@ -1443,6 +1443,10 @@ rte_mempool_put(struct rte_mempool *mp, void
> *obj)
> >
> >  /**
> >   * @internal Get several objects from the mempool; used internally.
> > + *
> > + * If cache is enabled, objects are returned from the cache in Last
> In First
> > + * Out (LIFO) order for the benefit of CPUs with small L1 cache.
> > + *
> >   * @param mp
> >   *   A pointer to the mempool structure.
> >   * @param obj_table
> > @@ -1452,7 +1456,7 @@ rte_mempool_put(struct rte_mempool *mp, void
> *obj)
> >   * @param cache
> >   *   A pointer to a mempool cache structure. May be NULL if not
> needed.
> >   * @return
> > - *   - >=0: Success; number of objects supplied.
> > + *   - 0: Success; got n objects.
> >   *   - <0: Error; code of ring dequeue function.
> >   */
> >  static __rte_always_inline int
> 
> I think that part should be in a separate commit too. This is a
> documentation fix, which is easily backportable (and should be
> backported) (Fixes: af75078fece3 ("first public release")).

OK. I'll take this out of the patch. And as you mentioned in your follow-up, this is also addressed by another commit, so I'll leave it at that.

> 
> > @@ -1463,38 +1467,71 @@ rte_mempool_do_generic_get(struct rte_mempool
> *mp, void **obj_table,
> >  	uint32_t index, len;
> >  	void **cache_objs;
> >
> > -	/* No cache provided or cannot be satisfied from cache */
> > -	if (unlikely(cache == NULL || n >= cache->size))
> > +	/* No cache provided or if get would overflow mem allocated for
> cache */
> > +	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
> >  		goto ring_dequeue;
> >
> > -	cache_objs = cache->objs;
> > +	cache_objs = &cache->objs[cache->len];
> > +
> > +	if (n <= cache->len) {
> > +		/* The entire request can be satisfied from the cache. */
> > +		cache->len -= n;
> > +		for (index = 0; index < n; index++)
> > +			*obj_table++ = *--cache_objs;
> >
> > -	/* Can this be satisfied from the cache? */
> > -	if (cache->len < n) {
> > -		/* No. Backfill the cache first, and then fill from it */
> > -		uint32_t req = n + (cache->size - cache->len);
> > +		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> > +		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> >
> > -		/* How many do we require i.e. number to fill the cache +
> the request */
> > -		ret = rte_mempool_ops_dequeue_bulk(mp,
> > -			&cache->objs[cache->len], req);
> > +		return 0;
> > +	}
> > +
> > +	/* Satisfy the first part of the request by depleting the cache.
> */
> > +	len = cache->len;
> > +	for (index = 0; index < len; index++)
> > +		*obj_table++ = *--cache_objs;
> > +
> > +	/* Number of objects remaining to satisfy the request. */
> > +	len = n - len;
> > +
> > +	/* Fill the cache from the ring; fetch size + remaining objects.
> */
> > +	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
> > +			cache->size + len);
> > +	if (unlikely(ret < 0)) {
> > +		/*
> > +		 * We are buffer constrained, and not able to allocate
> > +		 * cache + remaining.
> > +		 * Do not fill the cache, just satisfy the remaining part
> of
> > +		 * the request directly from the ring.
> > +		 */
> > +		ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, len);
> >  		if (unlikely(ret < 0)) {
> >  			/*
> > -			 * In the off chance that we are buffer constrained,
> > -			 * where we are not able to allocate cache + n, go to
> > -			 * the ring directly. If that fails, we are truly out
> of
> > -			 * buffers.
> > +			 * That also failed.
> > +			 * No furter action is required to roll the first
> > +			 * part of the request back into the cache, as both
> > +			 * cache->len and the objects in the cache are
> intact.
> >  			 */
> > -			goto ring_dequeue;
> > +			RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
> > +			RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
> > +
> > +			return ret;
> >  		}
> >
> > -		cache->len += req;
> > +		/* Commit that the cache was emptied. */
> > +		cache->len = 0;
> > +
> > +		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> > +		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> > +
> > +		return 0;
> >  	}
> >
> > -	/* Now fill in the response ... */
> > -	for (index = 0, len = cache->len - 1; index < n; ++index, len--,
> obj_table++)
> > -		*obj_table = cache_objs[len];
> > +	cache_objs = &cache->objs[cache->size + len];
> >
> > -	cache->len -= n;
> > +	/* Satisfy the remaining part of the request from the filled
> cache. */
> > +	cache->len = cache->size;
> > +	for (index = 0; index < len; index++)
> > +		*obj_table++ = *--cache_objs;
> >
> >  	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
> >  	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
> > @@ -1503,7 +1540,7 @@ rte_mempool_do_generic_get(struct rte_mempool
> *mp, void **obj_table,
> >
> >  ring_dequeue:
> >
> > -	/* get remaining objects from ring */
> > +	/* Get the objects from the ring. */
> >  	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, n);
> >
> >  	if (ret < 0) {
> 
> About the code itself, it is more readable now, and probably more
> efficient. Did you notice any performance change in mempool perf
> autotests ?

No significant performance change in my test environment. Probably also because mempool_perf_autotest doesn't test flushing/refilling the cache.

> 
> Thanks,
> Olivier
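
For anyone who wants to measure this locally, a rough sketch of a test that does hit the refill/flush paths is shown below: it uses bursts larger than the cache size, so the gets and puts regularly spill past the cache. Only the DPDK calls themselves are real API; the pool name, element size, burst size and iteration count are arbitrary values chosen for the example, and error handling is minimal.

  /* Rough sketch: time get/put bursts larger than the mempool cache,
   * so the cache refill (get) and flush (put) paths are exercised. */
  #include <stdio.h>

  #include <rte_eal.h>
  #include <rte_mempool.h>
  #include <rte_cycles.h>

  #define POOL_SIZE  8191      /* number of objects in the pool */
  #define ELT_SIZE   64        /* object size, arbitrary */
  #define CACHE_SIZE 32        /* per-lcore cache size */
  #define BURST      48        /* > CACHE_SIZE: spills past the cache */
  #define ITERATIONS 1000000

  int main(int argc, char **argv)
  {
      struct rte_mempool *mp;
      void *objs[BURST];
      uint64_t start, cycles;
      unsigned int i;

      if (rte_eal_init(argc, argv) < 0)
          return 1;

      mp = rte_mempool_create("cache_flush_test", POOL_SIZE, ELT_SIZE,
                              CACHE_SIZE, 0, NULL, NULL, NULL, NULL,
                              SOCKET_ID_ANY, 0);
      if (mp == NULL)
          return 1;

      start = rte_rdtsc();
      for (i = 0; i < ITERATIONS; i++) {
          if (rte_mempool_get_bulk(mp, objs, BURST) < 0)
              break; /* should not happen: the pool is large enough */
          rte_mempool_put_bulk(mp, objs, BURST);
      }
      cycles = rte_rdtsc() - start;

      printf("%.1f cycles per get+put burst of %u objects\n",
             (double)cycles / (i ? i : 1), BURST);
      return 0;
  }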
  

Patch

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1e7a3c1527..88f1b8b7ab 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1443,6 +1443,10 @@  rte_mempool_put(struct rte_mempool *mp, void *obj)
 
 /**
  * @internal Get several objects from the mempool; used internally.
+ *
+ * If cache is enabled, objects are returned from the cache in Last In First
+ * Out (LIFO) order for the benefit of CPUs with small L1 cache.
+ *
  * @param mp
  *   A pointer to the mempool structure.
  * @param obj_table
@@ -1452,7 +1456,7 @@  rte_mempool_put(struct rte_mempool *mp, void *obj)
  * @param cache
  *   A pointer to a mempool cache structure. May be NULL if not needed.
  * @return
- *   - >=0: Success; number of objects supplied.
+ *   - 0: Success; got n objects.
  *   - <0: Error; code of ring dequeue function.
  */
 static __rte_always_inline int
@@ -1463,38 +1467,71 @@  rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 	uint32_t index, len;
 	void **cache_objs;
 
-	/* No cache provided or cannot be satisfied from cache */
-	if (unlikely(cache == NULL || n >= cache->size))
+	/* No cache provided or if get would overflow mem allocated for cache */
+	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
 		goto ring_dequeue;
 
-	cache_objs = cache->objs;
+	cache_objs = &cache->objs[cache->len];
+
+	if (n <= cache->len) {
+		/* The entire request can be satisfied from the cache. */
+		cache->len -= n;
+		for (index = 0; index < n; index++)
+			*obj_table++ = *--cache_objs;
 
-	/* Can this be satisfied from the cache? */
-	if (cache->len < n) {
-		/* No. Backfill the cache first, and then fill from it */
-		uint32_t req = n + (cache->size - cache->len);
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
 
-		/* How many do we require i.e. number to fill the cache + the request */
-		ret = rte_mempool_ops_dequeue_bulk(mp,
-			&cache->objs[cache->len], req);
+		return 0;
+	}
+
+	/* Satisfy the first part of the request by depleting the cache. */
+	len = cache->len;
+	for (index = 0; index < len; index++)
+		*obj_table++ = *--cache_objs;
+
+	/* Number of objects remaining to satisfy the request. */
+	len = n - len;
+
+	/* Fill the cache from the ring; fetch size + remaining objects. */
+	ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
+			cache->size + len);
+	if (unlikely(ret < 0)) {
+		/*
+		 * We are buffer constrained, and not able to allocate
+		 * cache + remaining.
+		 * Do not fill the cache, just satisfy the remaining part of
+		 * the request directly from the ring.
+		 */
+		ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, len);
 		if (unlikely(ret < 0)) {
 			/*
-			 * In the off chance that we are buffer constrained,
-			 * where we are not able to allocate cache + n, go to
-			 * the ring directly. If that fails, we are truly out of
-			 * buffers.
+			 * That also failed.
+			 * No further action is required to roll the first
+			 * part of the request back into the cache, as both
+			 * cache->len and the objects in the cache are intact.
 			 */
-			goto ring_dequeue;
+			RTE_MEMPOOL_STAT_ADD(mp, get_fail_bulk, 1);
+			RTE_MEMPOOL_STAT_ADD(mp, get_fail_objs, n);
+
+			return ret;
 		}
 
-		cache->len += req;
+		/* Commit that the cache was emptied. */
+		cache->len = 0;
+
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
+		RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
+
+		return 0;
 	}
 
-	/* Now fill in the response ... */
-	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
-		*obj_table = cache_objs[len];
+	cache_objs = &cache->objs[cache->size + len];
 
-	cache->len -= n;
+	/* Satisfy the remaining part of the request from the filled cache. */
+	cache->len = cache->size;
+	for (index = 0; index < len; index++)
+		*obj_table++ = *--cache_objs;
 
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
@@ -1503,7 +1540,7 @@  rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 
 ring_dequeue:
 
-	/* get remaining objects from ring */
+	/* Get the objects from the ring. */
 	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, n);
 
 	if (ret < 0) {
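
For context, a minimal caller-side sketch of how this path is normally reached through the public API (not part of the patch; EAL setup and error handling are omitted, and get_burst() is just an example name):

  #include <rte_lcore.h>
  #include <rte_mempool.h>

  /* rte_mempool_generic_get() forwards to rte_mempool_do_generic_get()
   * with the lcore's default cache. */
  int get_burst(struct rte_mempool *mp, void **objs, unsigned int n)
  {
      struct rte_mempool_cache *cache;

      cache = rte_mempool_default_cache(mp, rte_lcore_id());

      /* Returns 0 when all n objects were obtained, <0 otherwise; with a
       * cache, the objects come out in LIFO order (hottest first). */
      if (rte_mempool_generic_get(mp, objs, n, cache) != 0)
          return -1;

      /* ... use the objects ... */

      rte_mempool_generic_put(mp, objs, n, cache);
      return 0;
  }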