[v4] mempool: fix mempool cache flushing algorithm

Message ID 20220202103354.79832-1-mb@smartsharesystems.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series [v4] mempool: fix mempool cache flushing algorithm

Checks

Context Check Description
ci/checkpatch warning coding style issues
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/github-robot: build success github build: passed
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-aarch64-unit-testing success Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS
ci/iol-abi-testing success Testing PASS

Commit Message

Morten Brørup Feb. 2, 2022, 10:33 a.m. UTC
  This patch fixes the rte_mempool_do_generic_put() caching algorithm,
which was fundamentally wrong, causing multiple performance issues when
flushing.

Although the bugs do have serious performance implications when
flushing, the function did not fail when flushing (or otherwise).
Backporting could be considered optional.

The algorithm was:
 1. Add the objects to the cache
 2. Anything greater than the cache size (if it crosses the cache flush
    threshold) is flushed to the ring.

Please note that the description in the source code said that it kept
"cache min value" objects after flushing, but the function actually kept
the cache full after flushing, which the above description reflects.

Now, the algorithm is:
 1. If the objects cannot be added to the cache without crossing the
    flush threshold, flush the cache to the ring.
 2. Add the objects to the cache.
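
For illustration, here is a condensed sketch of the put path that the two
steps above describe (paraphrasing the diff at the bottom of this page; the
RTE_MEMPOOL_CACHE_MAX_SIZE check and the debug-build error handling are
omitted for brevity):

	if (unlikely(cache == NULL || n > cache->flushthresh))
		goto ring_enqueue;	/* the request bypasses the cache */

	if (cache->len + n <= cache->flushthresh) {
		/* The objects fit without crossing the flush threshold:
		 * simply append them to the cache. */
		cache_objs = &cache->objs[cache->len];
		cache->len += n;
	} else {
		/* 1. They do not fit: flush the entire cache to the ring. */
		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[0], cache->len);
		cache_objs = &cache->objs[0];
		cache->len = n;
	}

	/* 2. Add the new (hot) objects to the cache. */
	rte_memcpy(cache_objs, obj_table, sizeof(void *) * n);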

This patch fixes these bugs:

1. The cache was still full after flushing.
In the opposite direction, i.e. when getting objects from the cache, the
cache is refilled to full level when it crosses the low watermark (which
happens to be zero).
Similarly, the cache should be flushed to empty level when it crosses
the high watermark (which happens to be 1.5 x the size of the cache).
The existing flushing behaviour was suboptimal for real applications,
because crossing the low or high watermark typically happens when the
application is in a state where the number of put/get events are out of
balance, e.g. when absorbing a burst of packets into a QoS queue
(getting more mbufs from the mempool), or when a burst of packets is
trickling out from the QoS queue (putting the mbufs back into the
mempool).
Now, the mempool cache is completely flushed when crossing the flush
threshold, so only the newly put (hot) objects remain in the mempool
cache afterwards.

This bug degraded performance by causing too frequent flushing.

Consider this application scenario:

Either, an lcore thread in the application is in a state of balance,
where it uses the mempool cache within its flush/refill boundaries; in
this situation, the flush method is less important, and this fix is
irrelevant.

Or, an lcore thread in the application is out of balance (either
permanently or temporarily), and mostly gets or puts objects from/to the
mempool. If it mostly puts objects, not flushing all of the objects will
cause more frequent flushing. This is the scenario addressed by this
fix. E.g.:

Cache size=256, flushthresh=384 (1.5x size), initial len=256;
application burst len=32.

If there are "size" objects in the cache after flushing, the cache is
flushed at every 4th burst.

If the cache is flushed completely, the cache is only flushed at every
16th burst.

As you can see, this bug caused the cache to be flushed 4x too
frequently in this example.

And when/if the application thread breaks its pattern of continuously
putting objects, and suddenly starts to get objects instead, it will
either get objects already in the cache, or the get() function will
refill the cache.

The concept of not flushing the cache completely was probably based on
an assumption that it is more likely for an application's lcore thread
to get() after flushing than to put() after flushing.
I strongly disagree with this assumption! If an application thread is
continuously putting so much that it overflows the cache, it is much
more likely to keep putting than it is to start getting. If in doubt,
consider how CPU branch predictors work: When the application has done
something many times consecutively, the branch predictor will expect the
application to do the same again, rather than suddenly do something
else.

Also, if you consider the description of the algorithm in the source
code, and agree that "cache min value" cannot mean "cache size", the
function did not behave as intended. This in itself is a bug.

2. The flush threshold comparison was off by one.
It must be "len > flushthresh", not "len >= flushthresh".
Consider a flush multiplier of 1 instead of 1.5; the cache would be
flushed already when reaching size objects, not when exceeding size
objects. In other words, the cache would not be able to hold "size"
objects, which is clearly a bug.
Now, flushing is triggered when the flush threshold is exceeded, not
when reached.
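
To make the boundary concrete (a minimal worked example, assuming
flushthresh == size == 256 and bursts of n = 32): with the old
"len >= flushthresh" check, evaluated after the objects were added, the put
that brings len from 224 to 256 already triggers a flush, so the cache can
never actually hold its nominal 256 objects. With the patched logic, which
flushes only when len + n would exceed flushthresh, len may legitimately
reach 256, and only the following put triggers a flush.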

This bug degraded performance due to premature flushing. In my example
above, this bug caused flushing every 3rd burst instead of every 4th.

3. The most recent (hot) objects were flushed, leaving the oldest (cold)
objects in the mempool cache.
This bug degraded performance, because flushing prevented immediate
reuse of the (hot) objects already in the CPU cache.
Now, the existing (cold) objects in the mempool cache are flushed before
the new (hot) objects are added to the mempool cache.
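
For reference, the two enqueue calls differ as follows (taken from the diff
at the bottom of this page, with explanatory comments added here):

	/* Before: flush only the entries above "size", i.e. the top of the
	 * stack, holding the most recently put (hot) objects. */
	rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
			cache->len - cache->size);

	/* After: flush the whole stack, including the old (cold) entries at
	 * the bottom, then store the newly put (hot) objects from index 0. */
	rte_mempool_ops_enqueue_bulk(mp, &cache->objs[0], cache->len);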

4. With RTE_LIBRTE_MEMPOOL_DEBUG defined, the return value of
rte_mempool_ops_enqueue_bulk() was not checked when flushing the cache.
Now, it is checked in both locations where used; and obviously still
only if RTE_LIBRTE_MEMPOOL_DEBUG is defined.

v2 changes:

- Not adding the new objects to the mempool cache before flushing it
also allows the memory allocated for the mempool cache to be reduced
from 3 x to 2 x RTE_MEMPOOL_CACHE_MAX_SIZE.
However, such a change would break the ABI, so it was removed in v2.

- The mempool cache should be cache line aligned for the benefit of the
copying method, which on some CPU architectures performs worse on data
crossing a cache boundary.
However, such a change would break the ABI, so it was removed in v2;
and yet another alternative copying method replaced the rte_memcpy().

v3 changes:

- Actually remove my modifications of the rte_mempool_cache structure.

v4 changes:

- Updated patch title to reflect that the scope of the patch is only
mempool cache flushing.

- Do not replace rte_memcpy() with alternative copying method. This was
a pure optimization, not a fix.

- Elaborate even more on the bugs fixed by the modifications.

- Added 4th bullet item to the patch description, regarding
rte_mempool_ops_enqueue_bulk() with RTE_LIBRTE_MEMPOOL_DEBUG.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 34 ++++++++++++++++++++++------------
 1 file changed, 22 insertions(+), 12 deletions(-)
  

Comments

Morten Brørup April 7, 2022, 9:04 a.m. UTC | #1
> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Wednesday, 2 February 2022 11.34
> 
> This patch fixes the rte_mempool_do_generic_put() caching algorithm,
> which was fundamentally wrong, causing multiple performance issues when
> flushing.
> 

[...]

Olivier,

Will you please consider this patch [1] and the other one [2].

The primary bug here is this: When a mempool cache becomes full (i.e. exceeds the "flush threshold"), and is flushed to the backing ring, it is still full afterwards; but it should be empty afterwards. It is not flushed entirely, only the elements exceeding "size" are flushed.

E.g. pipelined applications having ingress threads and egress threads running on different lcores are affected by this bug.

I don't think the real performance impact is very big, but these algorithm level bugs really annoy me.

I'm still wondering how the patch introducing the mempool cache flush threshold could pass internal code review with so many bugs.

[1] https://patchwork.dpdk.org/project/dpdk/patch/20220202103354.79832-1-mb@smartsharesystems.com/
[2] https://patchwork.dpdk.org/project/dpdk/patch/20220202081426.77975-1-mb@smartsharesystems.com/

-Morten

> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>  lib/mempool/rte_mempool.h | 34 ++++++++++++++++++++++------------
>  1 file changed, 22 insertions(+), 12 deletions(-)
> 
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 1e7a3c1527..e7e09e48fc 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -1344,31 +1344,41 @@ rte_mempool_do_generic_put(struct rte_mempool
> *mp, void * const *obj_table,
>  	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
>  		goto ring_enqueue;
> 
> -	cache_objs = &cache->objs[cache->len];
> +	/* If the request itself is too big for the cache */
> +	if (unlikely(n > cache->flushthresh))
> +		goto ring_enqueue;
> 
>  	/*
>  	 * The cache follows the following algorithm
> -	 *   1. Add the objects to the cache
> -	 *   2. Anything greater than the cache min value (if it crosses
> the
> -	 *   cache flush threshold) is flushed to the ring.

In the code, "the cache min value" is actually "the cache size". This indicates an intention to do something more. Perhaps the patch introducing the "flush threshold" was committed while still incomplete, and just never got completed?

> +	 *   1. If the objects cannot be added to the cache without
> +	 *   crossing the flush threshold, flush the cache to the ring.
> +	 *   2. Add the objects to the cache.
>  	 */
> 
> -	/* Add elements back into the cache */
> -	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
> +	if (cache->len + n <= cache->flushthresh) {
> +		cache_objs = &cache->objs[cache->len];
> 
> -	cache->len += n;
> +		cache->len += n;
> +	} else {
> +		cache_objs = &cache->objs[0];
> 
> -	if (cache->len >= cache->flushthresh) {
> -		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
> -				cache->len - cache->size);
> -		cache->len = cache->size;
> +#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
> +		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache-
> >len) < 0)
> +			rte_panic("cannot put objects in mempool\n");
> +#else
> +		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> +#endif
> +		cache->len = n;
>  	}
> 
> +	/* Add the objects to the cache. */
> +	rte_memcpy(cache_objs, obj_table, sizeof(void *) * n);
> +
>  	return;
> 
>  ring_enqueue:
> 
> -	/* push remaining objects in ring */
> +	/* Put the objects into the ring */
>  #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
>  	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
>  		rte_panic("cannot put objects in mempool\n");
> --
> 2.17.1
  
Bruce Richardson April 7, 2022, 9:14 a.m. UTC | #2
On Thu, Apr 07, 2022 at 11:04:53AM +0200, Morten Brørup wrote:
> > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > Sent: Wednesday, 2 February 2022 11.34
> > 
> > This patch fixes the rte_mempool_do_generic_put() caching algorithm,
> > which was fundamentally wrong, causing multiple performance issues when
> > flushing.
> > 
> 
> [...]
> 
> Olivier,
> 
> Will you please consider this patch [1] and the other one [2].
> 
> The primary bug here is this: When a mempool cache becomes full (i.e. exceeds the "flush threshold"), and is flushed to the backing ring, it is still full afterwards; but it should be empty afterwards. It is not flushed entirely, only the elements exceeding "size" are flushed.
> 

I don't believe it should be flushed entirely, there should always be some
elements left so that even after flush we can still allocate an additional
burst. We want to avoid the situation where a flush of all elements is
immediately followed by a refill of new elements. However, we can flush to
maybe size/2, and improve things. In short, this not emptying is by design
rather than a bug, though we can look to tweak the behaviour.

> E.g. pipelined applications having ingress threads and egress threads running on different lcores are affected by this bug.
> 
If we are looking at improvements for pipelined applications, I think a
bigger win would be to change the default mempool from ring-based to
stack-based. For apps using a run-to-completion model, they should run out
of cache and should therefore be largely unaffected by such a change.

> I don't think the real performance impact is very big, but these algorithm level bugs really annoy me.
> 
> I'm still wondering how the patch introducing the mempool cache flush threshold could pass internal code review with so many bugs.
> 
> [1] https://patchwork.dpdk.org/project/dpdk/patch/20220202103354.79832-1-mb@smartsharesystems.com/
> [2] https://patchwork.dpdk.org/project/dpdk/patch/20220202081426.77975-1-mb@smartsharesystems.com/
> 
> -Morten
> 
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > ---
> >  lib/mempool/rte_mempool.h | 34 ++++++++++++++++++++++------------
> >  1 file changed, 22 insertions(+), 12 deletions(-)
> > 
> > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > index 1e7a3c1527..e7e09e48fc 100644
> > --- a/lib/mempool/rte_mempool.h
> > +++ b/lib/mempool/rte_mempool.h
> > @@ -1344,31 +1344,41 @@ rte_mempool_do_generic_put(struct rte_mempool
> > *mp, void * const *obj_table,
> >  	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
> >  		goto ring_enqueue;
> > 
> > -	cache_objs = &cache->objs[cache->len];
> > +	/* If the request itself is too big for the cache */
> > +	if (unlikely(n > cache->flushthresh))
> > +		goto ring_enqueue;
> > 
> >  	/*
> >  	 * The cache follows the following algorithm
> > -	 *   1. Add the objects to the cache
> > -	 *   2. Anything greater than the cache min value (if it crosses
> > the
> > -	 *   cache flush threshold) is flushed to the ring.
> 
> In the code, "the cache min value" is actually "the cache size". This indicates an intention to do something more. Perhaps the patch introducing the "flush threshold" was committed while still incomplete, and just never got completed?
> 
> > +	 *   1. If the objects cannot be added to the cache without
> > +	 *   crossing the flush threshold, flush the cache to the ring.
> > +	 *   2. Add the objects to the cache.
> >  	 */
> > 
> > -	/* Add elements back into the cache */
> > -	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
> > +	if (cache->len + n <= cache->flushthresh) {
> > +		cache_objs = &cache->objs[cache->len];
> > 
> > -	cache->len += n;
> > +		cache->len += n;
> > +	} else {
> > +		cache_objs = &cache->objs[0];
> > 
> > -	if (cache->len >= cache->flushthresh) {
> > -		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
> > -				cache->len - cache->size);
> > -		cache->len = cache->size;
> > +#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
> > +		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache-
> > >len) < 0)
> > +			rte_panic("cannot put objects in mempool\n");
> > +#else
> > +		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> > +#endif
> > +		cache->len = n;
> >  	}
> > 
> > +	/* Add the objects to the cache. */
> > +	rte_memcpy(cache_objs, obj_table, sizeof(void *) * n);
> > +
> >  	return;
> > 
> >  ring_enqueue:
> > 
> > -	/* push remaining objects in ring */
> > +	/* Put the objects into the ring */
> >  #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
> >  	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
> >  		rte_panic("cannot put objects in mempool\n");
> > --
> > 2.17.1
>
  
Morten Brørup April 7, 2022, 9:26 a.m. UTC | #3
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Thursday, 7 April 2022 11.14
> 
> On Thu, Apr 07, 2022 at 11:04:53AM +0200, Morten Brørup wrote:
> > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > Sent: Wednesday, 2 February 2022 11.34
> > >
> > > This patch fixes the rte_mempool_do_generic_put() caching
> algorithm,
> > > which was fundamentally wrong, causing multiple performance issues
> when
> > > flushing.
> > >
> >
> > [...]
> >
> > Olivier,
> >
> > Will you please consider this patch [1] and the other one [2].
> >
> > The primary bug here is this: When a mempool cache becomes full (i.e.
> exceeds the "flush threshold"), and is flushed to the backing ring, it
> is still full afterwards; but it should be empty afterwards. It is not
> flushed entirely, only the elements exceeding "size" are flushed.
> >
> 
> I don't believe it should be flushed entirely, there should always be
> some
> elements left so that even after flush we can still allocate an
> additional
> burst. We want to avoid the situation where a flush of all elements is
> immediately followed by a refill of new elements. However, we can flush
> to
> maybe size/2, and improve things. In short, this not emptying is by
> design
> rather than a bug, though we can look to tweak the behaviour.
> 

I initially agreed with you about flushing to size/2.

However, I did think further about it when I wrote the patch, and came to this conclusion: If an application thread repeatedly puts objects into the mempool, and does it so often that the cache overflows (i.e. reaches the flush threshold) and needs to be flushed, it is far more likely that the application thread will continue doing that, rather than start getting objects from the mempool. This speaks for flushing the cache entirely.

Both solutions are better than flushing to size, so if there is a preference for keeping some objects in the cache after flushing, I can update the patch accordingly.

> > E.g. pipelined applications having ingress threads and egress threads
> running on different lcores are affected by this bug.
> >
> If we are looking at improvements for pipelined applications, I think a
> bigger win would be to change the default mempool from ring-based to
> stack-based. For apps using a run-to-completion model, they should run
> out
> of cache and should therefore be largely unaffected by such a change.
> 
> > I don't think the real performance impact is very big, but these
> algorithm level bugs really annoy me.
> >
> > I'm still wondering how the patch introducing the mempool cache flush
> threshold could pass internal code review with so many bugs.
> >
> > [1]
> https://patchwork.dpdk.org/project/dpdk/patch/20220202103354.79832-1-
> mb@smartsharesystems.com/
> > [2]
> https://patchwork.dpdk.org/project/dpdk/patch/20220202081426.77975-1-
> mb@smartsharesystems.com/
> >
> > -Morten
> >
> > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > ---
> > >  lib/mempool/rte_mempool.h | 34 ++++++++++++++++++++++------------
> > >  1 file changed, 22 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > > index 1e7a3c1527..e7e09e48fc 100644
> > > --- a/lib/mempool/rte_mempool.h
> > > +++ b/lib/mempool/rte_mempool.h
> > > @@ -1344,31 +1344,41 @@ rte_mempool_do_generic_put(struct
> rte_mempool
> > > *mp, void * const *obj_table,
> > >  	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
> > >  		goto ring_enqueue;
> > >
> > > -	cache_objs = &cache->objs[cache->len];
> > > +	/* If the request itself is too big for the cache */
> > > +	if (unlikely(n > cache->flushthresh))
> > > +		goto ring_enqueue;
> > >
> > >  	/*
> > >  	 * The cache follows the following algorithm
> > > -	 *   1. Add the objects to the cache
> > > -	 *   2. Anything greater than the cache min value (if it crosses
> > > the
> > > -	 *   cache flush threshold) is flushed to the ring.
> >
> > In the code, "the cache min value" is actually "the cache size". This
> indicates an intention to do something more. Perhaps the patch
> introducing the "flush threshold" was committed while still incomplete,
> and just never got completed?
> >
> > > +	 *   1. If the objects cannot be added to the cache without
> > > +	 *   crossing the flush threshold, flush the cache to the ring.
> > > +	 *   2. Add the objects to the cache.
> > >  	 */
> > >
> > > -	/* Add elements back into the cache */
> > > -	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
> > > +	if (cache->len + n <= cache->flushthresh) {
> > > +		cache_objs = &cache->objs[cache->len];
> > >
> > > -	cache->len += n;
> > > +		cache->len += n;
> > > +	} else {
> > > +		cache_objs = &cache->objs[0];
> > >
> > > -	if (cache->len >= cache->flushthresh) {
> > > -		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
> > > -				cache->len - cache->size);
> > > -		cache->len = cache->size;
> > > +#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
> > > +		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache-
> > > >len) < 0)
> > > +			rte_panic("cannot put objects in mempool\n");
> > > +#else
> > > +		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> > > +#endif
> > > +		cache->len = n;
> > >  	}
> > >
> > > +	/* Add the objects to the cache. */
> > > +	rte_memcpy(cache_objs, obj_table, sizeof(void *) * n);
> > > +
> > >  	return;
> > >
> > >  ring_enqueue:
> > >
> > > -	/* push remaining objects in ring */
> > > +	/* Put the objects into the ring */
> > >  #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
> > >  	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
> > >  		rte_panic("cannot put objects in mempool\n");
> > > --
> > > 2.17.1
> >
  
Bruce Richardson April 7, 2022, 10:32 a.m. UTC | #4
On Thu, Apr 07, 2022 at 11:26:53AM +0200, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Thursday, 7 April 2022 11.14
> > 
> > On Thu, Apr 07, 2022 at 11:04:53AM +0200, Morten Brørup wrote:
> > > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > > Sent: Wednesday, 2 February 2022 11.34
> > > >
> > > > This patch fixes the rte_mempool_do_generic_put() caching
> > algorithm,
> > > > which was fundamentally wrong, causing multiple performance issues
> > when
> > > > flushing.
> > > >
> > >
> > > [...]
> > >
> > > Olivier,
> > >
> > > Will you please consider this patch [1] and the other one [2].
> > >
> > > The primary bug here is this: When a mempool cache becomes full (i.e.
> > exceeds the "flush threshold"), and is flushed to the backing ring, it
> > is still full afterwards; but it should be empty afterwards. It is not
> > flushed entirely, only the elements exceeding "size" are flushed.
> > >
> > 
> > I don't believe it should be flushed entirely, there should always be
> > some
> > elements left so that even after flush we can still allocate an
> > additional
> > burst. We want to avoid the situation where a flush of all elements is
> > immediately followed by a refill of new elements. However, we can flush
> > to
> > maybe size/2, and improve things. In short, this not emptying is by
> > design
> > rather than a bug, though we can look to tweak the behaviour.
> > 
> 
> I initially agreed with you about flushing to size/2.
> 
> However, I did think further about it when I wrote the patch, and came to this conclusion: If an application thread repeatedly puts objects into the mempool, and does it so often that the cache overflows (i.e. reaches the flush threshold) and needs to be flushed, it is far more likely that the application thread will continue doing that, rather than start getting objects from the mempool. This speaks for flushing the cache entirely.
> 
> Both solutions are better than flushing to size, so if there is a preference for keeping some objects in the cache after flushing, I can update the patch accordingly.
> 

Would it be worth looking at adding per-core hinting to the mempool?
Indicate for a core that it allocates-only, i.e. RX thread, frees-only,
i.e. TX-thread, or does both alloc and free (the default)? That hint could
be used only on flush or refill to specify whether to flush all or partial,
and similarly to refill to max possible or just to size.

/Bruce
  
Bruce Richardson April 7, 2022, 10:43 a.m. UTC | #5
On Thu, Apr 07, 2022 at 11:32:12AM +0100, Bruce Richardson wrote:
> On Thu, Apr 07, 2022 at 11:26:53AM +0200, Morten Brørup wrote:
> > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > Sent: Thursday, 7 April 2022 11.14
> > > 
> > > On Thu, Apr 07, 2022 at 11:04:53AM +0200, Morten Brørup wrote:
> > > > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > > > Sent: Wednesday, 2 February 2022 11.34
> > > > >
> > > > > This patch fixes the rte_mempool_do_generic_put() caching
> > > algorithm,
> > > > > which was fundamentally wrong, causing multiple performance issues
> > > when
> > > > > flushing.
> > > > >
> > > >
> > > > [...]
> > > >
> > > > Olivier,
> > > >
> > > > Will you please consider this patch [1] and the other one [2].
> > > >
> > > > The primary bug here is this: When a mempool cache becomes full (i.e.
> > > exceeds the "flush threshold"), and is flushed to the backing ring, it
> > > is still full afterwards; but it should be empty afterwards. It is not
> > > flushed entirely, only the elements exceeding "size" are flushed.
> > > >
> > > 
> > > I don't believe it should be flushed entirely, there should always be
> > > some
> > > elements left so that even after flush we can still allocate an
> > > additional
> > > burst. We want to avoid the situation where a flush of all elements is
> > > immediately followed by a refill of new elements. However, we can flush
> > > to
> > > maybe size/2, and improve things. In short, this not emptying is by
> > > design
> > > rather than a bug, though we can look to tweak the behaviour.
> > > 
> > 
> > I initially agreed with you about flushing to size/2.
> > 
> > However, I did think further about it when I wrote the patch, and came to this conclusion: If an application thread repeatedly puts objects into the mempool, and does it so often that the cache overflows (i.e. reaches the flush threshold) and needs to be flushed, it is far more likely that the application thread will continue doing that, rather than start getting objects from the mempool. This speaks for flushing the cache entirely.
> > 
> > Both solutions are better than flushing to size, so if there is a preference for keeping some objects in the cache after flushing, I can update the patch accordingly.
> > 
> 
> Would it be worth looking at adding per-core hinting to the mempool?
> Indicate for a core that it allocates-only, i.e. RX thread, frees-only,
> i.e. TX-thread, or does both alloc and free (the default)? That hint could
> be used only on flush or refill to specify whether to flush all or partial,
> and similarly to refill to max possible or just to size.
> 
Actually, taking the idea further, we could always track per-core whether a
core has ever done a flush/refill and use that as the hint instead. It
could even be done in a branch-free manner if we want. For example:

on flush:
	keep_entries = (size >> 1) & (never_refills - 1);

which will set the entries to keep to be 0 if we have never had to refill, or
half of size, if the thread has previously done refills.
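
Spelled out, assuming never_refills is an unsigned per-lcore flag that is 1
until the lcore does its first refill and 0 afterwards (just a sketch):

	unsigned int keep_entries;

	/* never_refills == 1: (1 - 1) == 0, the mask clears the value,
	 * so keep_entries == 0 and the cache is flushed completely.
	 * never_refills == 0: (0 - 1) wraps to all ones, the mask is a
	 * no-op, so keep_entries == size / 2. */
	keep_entries = (size >> 1) & (never_refills - 1);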

/Bruce
  
Morten Brørup April 7, 2022, 11:36 a.m. UTC | #6
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Thursday, 7 April 2022 12.44
> 
> On Thu, Apr 07, 2022 at 11:32:12AM +0100, Bruce Richardson wrote:
> > On Thu, Apr 07, 2022 at 11:26:53AM +0200, Morten Brørup wrote:
> > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > Sent: Thursday, 7 April 2022 11.14
> > > >
> > > > On Thu, Apr 07, 2022 at 11:04:53AM +0200, Morten Brørup wrote:
> > > > > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > > > > Sent: Wednesday, 2 February 2022 11.34
> > > > > >
> > > > > > This patch fixes the rte_mempool_do_generic_put() caching
> > > > algorithm,
> > > > > > which was fundamentally wrong, causing multiple performance
> issues
> > > > when
> > > > > > flushing.
> > > > > >
> > > > >
> > > > > [...]
> > > > >
> > > > > Olivier,
> > > > >
> > > > > Will you please consider this patch [1] and the other one [2].
> > > > >
> > > > > The primary bug here is this: When a mempool cache becomes full
> (i.e.
> > > > exceeds the "flush threshold"), and is flushed to the backing
> ring, it
> > > > is still full afterwards; but it should be empty afterwards. It
> is not
> > > > flushed entirely, only the elements exceeding "size" are flushed.
> > > > >
> > > >
> > > > I don't believe it should be flushed entirely, there should
> always be
> > > > some
> > > > elements left so that even after flush we can still allocate an
> > > > additional
> > > > burst. We want to avoid the situation where a flush of all
> elements is
> > > > immediately followed by a refill of new elements. However, we can
> flush
> > > > to
> > > > maybe size/2, and improve things. In short, this not emptying is
> by
> > > > design
> > > > rather than a bug, though we can look to tweak the behaviour.
> > > >
> > >
> > > I initially agreed with you about flushing to size/2.
> > >
> > > However, I did think further about it when I wrote the patch, and
> came to this conclusion: If an application thread repeatedly puts
> objects into the mempool, and does it so often that the cache overflows
> (i.e. reaches the flush threshold) and needs to be flushed, it is far
> more likely that the application thread will continue doing that,
> rather than start getting objects from the mempool. This speaks for
> flushing the cache entirely.
> > >
> > > Both solutions are better than flushing to size, so if there is a
> preference for keeping some objects in the cache after flushing, I can
> update the patch accordingly.

I forgot to mention some details here...

The cache is a stack, so leaving objects in it after flushing can be done in one of two ways:

1. Flush the top objects, and leave the bottom objects, which are extremely cold.
2. Flush the bottom objects, and move the objects from the top to the bottom, which is a costly operation.

Theoretically, there is a third option: Make the stack a circular buffer, so its "bottom pointer" can be moved around, instead of copying objects from the top to the bottom after flushing. However, this will add complexity when copying arrays to/from the stack in the normal cases, i.e. to/from the application. And it introduces requirements on the cache size. So I quickly discarded the idea when it first came to me.
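
For concreteness, option 2 from the list above would amount to something
like this (a hypothetical sketch, with "keep" being the number of objects
to retain):

	/* Flush the cold objects at the bottom of the stack ... */
	rte_mempool_ops_enqueue_bulk(mp, &cache->objs[0], cache->len - keep);
	/* ... and move the hot objects from the top down to the bottom. */
	memmove(&cache->objs[0], &cache->objs[cache->len - keep],
			keep * sizeof(void *));
	cache->len = keep;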

The provided patch flushes the entire cache, and then stores the newly added objects (the ones causing the flush) in the cache. So it is not completely empty after flushing. It contains some (but not many) objects, and they are hot.

> > >
> >
> > Would it be worth looking at adding per-core hinting to the mempool?
> > Indicate for a core that it allocates-only, i.e. RX thread, frees-
> only,
> > i.e. TX-thread, or does both alloc and free (the default)? That hint
> could
> > be used only on flush or refill to specify whether to flush all or
> partial,
> > and similarly to refill to max possible or just to size.
> >
> Actually, taking the idea further, we could always track per-core
> whether a
> core has ever done a flush/refill and use that as the hint instead. It
> could even be done in a branch-free manner if we want. For example:
> 
> on flush:
> 	keep_entries = (size >> 1) & (never_refills - 1);
> 
> which will set the entries to keep to be 0 if we have never had to
> refill, or
> half of size, if the thread has previously done refills.
> 

Your suggestion is a good idea for a performance improvement.

We would also need "mostly" variants in addition to the "only" variants. Otherwise, the automatic detection will cause problems if triggered by some rare event.

And applications using the "service cores" concept will just fall back to the default alloc-free balanced variant.


Perhaps we should fix the current bugs (my term, not consensus) first, and then look at further performance improvements. It's already uphill getting Acks for my fixes as they are.


Another performance improvement could be hard coding the mempool cache size to RTE_MEMPOOL_CACHE_MAX_SIZE, so the copying between the cache and the backing ring can be done by a fixed size optimized memcpy using vector instructions.

Obviously, whatever we do to optimize this, we should ensure optimal handling of the common case, where objects are only moved between the application and the mempool cache, and don't touch the backing ring.
  
Morten Brørup Oct. 4, 2022, 8:01 p.m. UTC | #7
> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Wednesday, 2 February 2022 11.34
> 
> This patch fixes the rte_mempool_do_generic_put() caching algorithm,
> which was fundamentally wrong, causing multiple performance issues when
> flushing.
> 
> Although the bugs do have serious performance implications when
> flushing, the function did not fail when flushing (or otherwise).
> Backporting could be considered optional.
> 
> The algorithm was:
>  1. Add the objects to the cache
>  2. Anything greater than the cache size (if it crosses the cache flush
>     threshold) is flushed to the ring.
> 
> Please note that the description in the source code said that it kept
> "cache min value" objects after flushing, but the function actually
> kept
> the cache full after flushing, which the above description reflects.
> 
> Now, the algorithm is:
>  1. If the objects cannot be added to the cache without crossing the
>     flush threshold, flush the cache to the ring.
>  2. Add the objects to the cache.
> 
> This patch fixes these bugs:
> 
> 1. The cache was still full after flushing.
> In the opposite direction, i.e. when getting objects from the cache,
> the
> cache is refilled to full level when it crosses the low watermark
> (which
> happens to be zero).
> Similarly, the cache should be flushed to empty level when it crosses
> the high watermark (which happens to be 1.5 x the size of the cache).
> The existing flushing behaviour was suboptimal for real applications,
> because crossing the low or high watermark typically happens when the
> application is in a state where the number of put/get events are out of
> balance, e.g. when absorbing a burst of packets into a QoS queue
> (getting more mbufs from the mempool), or when a burst of packets is
> trickling out from the QoS queue (putting the mbufs back into the
> mempool).
> Now, the mempool cache is completely flushed when crossing the flush
> threshold, so only the newly put (hot) objects remain in the mempool
> cache afterwards.
> 
> This bug degraded performance by causing too frequent flushing.
> 
> Consider this application scenario:
> 
> Either, an lcore thread in the application is in a state of balance,
> where it uses the mempool cache within its flush/refill boundaries; in
> this situation, the flush method is less important, and this fix is
> irrelevant.
> 
> Or, an lcore thread in the application is out of balance (either
> permanently or temporarily), and mostly gets or puts objects from/to
> the
> mempool. If it mostly puts objects, not flushing all of the objects
> will
> cause more frequent flushing. This is the scenario addressed by this
> fix. E.g.:
> 
> Cache size=256, flushthresh=384 (1.5x size), initial len=256;
> application burst len=32.
> 
> If there are "size" objects in the cache after flushing, the cache is
> flushed at every 4th burst.
> 
> If the cache is flushed completely, the cache is only flushed at every
> 16th burst.
> 
> As you can see, this bug caused the cache to be flushed 4x too
> frequently in this example.
> 
> And when/if the application thread breaks its pattern of continuously
> putting objects, and suddenly starts to get objects instead, it will
> either get objects already in the cache, or the get() function will
> refill the cache.
> 
> The concept of not flushing the cache completely was probably based on
> an assumption that it is more likely for an application's lcore thread
> to get() after flushing than to put() after flushing.
> I strongly disagree with this assumption! If an application thread is
> continuously putting so much that it overflows the cache, it is much
> more likely to keep putting than it is to start getting. If in doubt,
> consider how CPU branch predictors work: When the application has done
> something many times consecutively, the branch predictor will expect
> the
> application to do the same again, rather than suddenly do something
> else.
> 
> Also, if you consider the description of the algorithm in the source
> code, and agree that "cache min value" cannot mean "cache size", the
> function did not behave as intended. This in itself is a bug.
> 
> 2. The flush threshold comparison was off by one.
> It must be "len > flushthresh", not "len >= flushthresh".
> Consider a flush multiplier of 1 instead of 1.5; the cache would be
> flushed already when reaching size objects, not when exceeding size
> objects. In other words, the cache would not be able to hold "size"
> objects, which is clearly a bug.
> Now, flushing is triggered when the flush threshold is exceeded, not
> when reached.
> 
> This bug degraded performance due to premature flushing. In my example
> above, this bug caused flushing every 3rd burst instead of every 4th.
> 
> 3. The most recent (hot) objects were flushed, leaving the oldest
> (cold)
> objects in the mempool cache.
> This bug degraded performance, because flushing prevented immediate
> reuse of the (hot) objects already in the CPU cache.
> Now, the existing (cold) objects in the mempool cache are flushed
> before
> the new (hot) objects are added to the mempool cache.
> 
> 4. With RTE_LIBRTE_MEMPOOL_DEBUG defined, the return value of
> rte_mempool_ops_enqueue_bulk() was not checked when flushing the cache.
> Now, it is checked in both locations where used; and obviously still
> only if RTE_LIBRTE_MEMPOOL_DEBUG is defined.
> 
> v2 changes:
> 
> - Not adding the new objects to the mempool cache before flushing it
> also allows the memory allocated for the mempool cache to be reduced
> from 3 x to 2 x RTE_MEMPOOL_CACHE_MAX_SIZE.
> However, such a change would break the ABI, so it was removed in v2.
> 
> - The mempool cache should be cache line aligned for the benefit of the
> copying method, which on some CPU architectures performs worse on data
> crossing a cache boundary.
> However, such a change would break the ABI, so it was removed in v2;
> and yet another alternative copying method replaced the rte_memcpy().
> 
> v3 changes:
> 
> - Actually remove my modifications of the rte_mempool_cache structure.
> 
> v4 changes:
> 
> - Updated patch title to reflect that the scope of the patch is only
> mempool cache flushing.
> 
> - Do not replace rte_memcpy() with alternative copying method. This was
> a pure optimization, not a fix.
> 
> - Elaborate even more on the bugs fixed by the modifications.
> 
> - Added 4th bullet item to the patch description, regarding
> rte_mempool_ops_enqueue_bulk() with RTE_LIBRTE_MEMPOOL_DEBUG.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>  lib/mempool/rte_mempool.h | 34 ++++++++++++++++++++++------------
>  1 file changed, 22 insertions(+), 12 deletions(-)
> 
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 1e7a3c1527..e7e09e48fc 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -1344,31 +1344,41 @@ rte_mempool_do_generic_put(struct rte_mempool
> *mp, void * const *obj_table,
>  	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
>  		goto ring_enqueue;
> 
> -	cache_objs = &cache->objs[cache->len];
> +	/* If the request itself is too big for the cache */
> +	if (unlikely(n > cache->flushthresh))
> +		goto ring_enqueue;
> 
>  	/*
>  	 * The cache follows the following algorithm
> -	 *   1. Add the objects to the cache
> -	 *   2. Anything greater than the cache min value (if it crosses
> the
> -	 *   cache flush threshold) is flushed to the ring.
> +	 *   1. If the objects cannot be added to the cache without
> +	 *   crossing the flush threshold, flush the cache to the ring.
> +	 *   2. Add the objects to the cache.
>  	 */
> 
> -	/* Add elements back into the cache */
> -	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
> +	if (cache->len + n <= cache->flushthresh) {
> +		cache_objs = &cache->objs[cache->len];
> 
> -	cache->len += n;
> +		cache->len += n;
> +	} else {
> +		cache_objs = &cache->objs[0];
> 
> -	if (cache->len >= cache->flushthresh) {
> -		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
> -				cache->len - cache->size);
> -		cache->len = cache->size;
> +#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
> +		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache-
> >len) < 0)
> +			rte_panic("cannot put objects in mempool\n");
> +#else
> +		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> +#endif
> +		cache->len = n;
>  	}
> 
> +	/* Add the objects to the cache. */
> +	rte_memcpy(cache_objs, obj_table, sizeof(void *) * n);
> +
>  	return;
> 
>  ring_enqueue:
> 
> -	/* push remaining objects in ring */
> +	/* Put the objects into the ring */
>  #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
>  	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
>  		rte_panic("cannot put objects in mempool\n");
> --
> 2.17.1

Andrew, would you please also take a look at this patch and share your opinion.

I guess that the most controversial change in the patch is that it leaves the mempool cache nearly empty after flushing it.

Without the patch, the mempool cache is left full (at 100% size) after flushing. (Flushing is triggered by crossing the flush threshold, which is 50% above the cache size. This is not changed by the patch.)

As described with the patch, I consider this behavior incorrect: In periods where an application is sending more from its QoS queues than goes into the QoS queues, the mempool_put() function is called more often than the mempool_get() function, so there will naturally be consecutive cache flushing.

Many applications use QoS queues or similar traffic shapers, so mempool cache flushing is not as infrequent and exotic as some might think! (And flushing a burst of packets from the mempool cache to the underlying mempool is considered costly.)

Without the patch, consecutive cache flushing will be processed as many small flushes, because only the 50% objects above the cache size (the objects between the cache size and the cache threshold) are flushed each time.

With the patch, the flushes will be fewer and larger, because the full 150% cache size (every object in the cache up to the cache threshold) will be flushed each time.
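
With the example numbers from the patch description (cache size 256, flush threshold 384), that is 128 objects per flush operation without the patch, versus up to 384 objects per flush operation with it.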

PS: Bruce and I discussed this patch back in April, but didn't reach a conclusion. You might find some insights in that mail thread.
  
Andrew Rybchenko Oct. 9, 2022, 1:19 p.m. UTC | #8
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 1e7a3c1527..e7e09e48fc 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -1344,31 +1344,41 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
>   	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
>   		goto ring_enqueue;
>   
> -	cache_objs = &cache->objs[cache->len];
> +	/* If the request itself is too big for the cache */
> +	if (unlikely(n > cache->flushthresh))
> +		goto ring_enqueue;
>   

n is checked twice above, and the first check is not actually required;
just the later check is required.
The check vs RTE_MEMPOOL_CACHE_MAX_SIZE was required to ensure that we
do not overflow the cache objects array, since we copied into the cache
before flushing, but that is not the case now.
We never cross the flush threshold now.

I'll send v5 where I split the most questionable part into
a separate patch at the end.
  

Patch

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1e7a3c1527..e7e09e48fc 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1344,31 +1344,41 @@  rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
 		goto ring_enqueue;
 
-	cache_objs = &cache->objs[cache->len];
+	/* If the request itself is too big for the cache */
+	if (unlikely(n > cache->flushthresh))
+		goto ring_enqueue;
 
 	/*
 	 * The cache follows the following algorithm
-	 *   1. Add the objects to the cache
-	 *   2. Anything greater than the cache min value (if it crosses the
-	 *   cache flush threshold) is flushed to the ring.
+	 *   1. If the objects cannot be added to the cache without
+	 *   crossing the flush threshold, flush the cache to the ring.
+	 *   2. Add the objects to the cache.
 	 */
 
-	/* Add elements back into the cache */
-	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
+	if (cache->len + n <= cache->flushthresh) {
+		cache_objs = &cache->objs[cache->len];
 
-	cache->len += n;
+		cache->len += n;
+	} else {
+		cache_objs = &cache->objs[0];
 
-	if (cache->len >= cache->flushthresh) {
-		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
-				cache->len - cache->size);
-		cache->len = cache->size;
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
+		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len) < 0)
+			rte_panic("cannot put objects in mempool\n");
+#else
+		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
+#endif
+		cache->len = n;
 	}
 
+	/* Add the objects to the cache. */
+	rte_memcpy(cache_objs, obj_table, sizeof(void *) * n);
+
 	return;
 
 ring_enqueue:
 
-	/* push remaining objects in ring */
+	/* Put the objects into the ring */
 #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
 	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
 		rte_panic("cannot put objects in mempool\n");