[v6,3/4] mempool: fix cache flushing algorithm

Message ID 20221009133737.795377-4-andrew.rybchenko@oktetlabs.ru (mailing list archive)
State Accepted, archived
Delegated to: Thomas Monjalon
Series: mempool: fix mempool cache flushing algorithm

Checks

ci/checkpatch: success (coding style OK)

Commit Message

Andrew Rybchenko Oct. 9, 2022, 1:37 p.m. UTC
  From: Morten Brørup <mb@smartsharesystems.com>

Fix the rte_mempool_do_generic_put() cache flushing algorithm to
keep hot objects in cache instead of cold ones.

The algorithm was:
 1. Add the objects to the cache.
 2. Anything greater than the cache size (if it crosses the cache flush
    threshold) is flushed to the backend.

Please note that the description in the source code said that it kept
"cache min value" objects after flushing, but the function actually kept
the cache full after flushing, which the above description reflects.

Now, the algorithm is:
 1. If the objects cannot be added to the cache without crossing the
    flush threshold, flush some cached objects to the backend to
    free up required space.
 2. Add the objects to the cache.

The most recent (hot) objects were flushed, leaving the oldest (cold)
objects in the mempool cache. The bug degraded performance, because
flushing prevented immediate reuse of the (hot) objects already in
the CPU cache.  Now, the existing (cold) objects in the mempool cache
are flushed before the new (hot) objects are added to the mempool
cache.
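
In code terms the reordering looks roughly as follows (a condensed
sketch with hypothetical append()/flush() helpers, not the actual diff,
which appears at the bottom of this page; flush(cache, keep) stands for
"enqueue everything above the first "keep" entries to the backend"):

	/* Old order: append first, flush afterwards. The flush kept the
	 * bottom "size" entries and pushed out the top of the stack,
	 * i.e. the just-added (hot) objects. */
	append(cache, obj_table, n);
	if (cache->len >= cache->flushthresh)
		flush(cache, cache->size);

	/* New order: make room first by flushing pre-existing (cold)
	 * entries, then append, so the hot objects stay in the cache. */
	if (cache->len + n > cache->flushthresh)
		flush(cache, n >= cache->size ? 0 : cache->size - n);
	append(cache, obj_table, n);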

Since nearby code is touched anyway, fix the flush threshold comparison
to do flushing only if the threshold is really exceeded, not just
reached. I.e. it must be "len > flushthresh", not "len >= flushthresh".
Consider a flush multiplier of 1 instead of 1.5; the cache would be
flushed already when reaching size objects, not when exceeding size
objects. In other words, the cache would not be able to hold "size"
objects, which is clearly a bug. The bug could degrade performance
due to premature flushing.
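
To see the off-by-one with concrete numbers, consider the hypothetical
multiplier-1 case from the paragraph above (illustrative values only;
CALC_CACHE_FLUSHTHRESH() and its actual 1.5 multiplier are internal):

	uint32_t size = 32;
	uint32_t flushthresh = size;	/* multiplier 1: flushthresh == size */
	uint32_t len = size;		/* the cache just reached "size" objects */

	/* Old check: "len >= flushthresh" is true, so the cache is flushed
	 * as soon as it reaches "size" objects and can never hold them. */

	/* New check: "len > flushthresh" is false, so the cache can hold
	 * "size" objects before any flush occurs. */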

Since we never exceed the flush threshold now, the cache array size in
the mempool may be decreased from RTE_MEMPOOL_CACHE_MAX_SIZE * 3 to
RTE_MEMPOOL_CACHE_MAX_SIZE * 2. In fact it could be
CALC_CACHE_FLUSHTHRESH(RTE_MEMPOOL_CACHE_MAX_SIZE), but the flush
threshold multiplier is internal.
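
The arithmetic behind the new bound (a sketch assuming the current
internal 1.5 multiplier, which is not part of the API):

	/* Maximum cache occupancy after this patch:
	 *   max len == flushthresh == 1.5 * size
	 *           <= 1.5 * RTE_MEMPOOL_CACHE_MAX_SIZE
	 *           <  2 * RTE_MEMPOOL_CACHE_MAX_SIZE
	 * so objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] can no longer overflow;
	 * the RTE_BUILD_BUG_ON() in the patch below enforces this.
	 */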

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
 lib/mempool/rte_mempool.c |  5 +++++
 lib/mempool/rte_mempool.h | 43 +++++++++++++++++++++++----------------
 2 files changed, 31 insertions(+), 17 deletions(-)
  

Comments

Morten Brørup Oct. 9, 2022, 2:31 p.m. UTC | #1
> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> Sent: Sunday, 9 October 2022 15.38
> 
> From: Morten Brørup <mb@smartsharesystems.com>
> 
> Fix the rte_mempool_do_generic_put() cache flushing algorithm to
> keep hot objects in cache instead of cold ones.
> 
> The algorithm was:
>  1. Add the objects to the cache.
>  2. Anything greater than the cache size (if it crosses the cache flush
>     threshold) is flushed to the backend.
> 
> Please note that the description in the source code said that it kept
> "cache min value" objects after flushing, but the function actually
> kept
> the cache full after flushing, which the above description reflects.
> 
> Now, the algorithm is:
>  1. If the objects cannot be added to the cache without crossing the
>     flush threshold, flush some cached objects to the backend to
>     free up required space.
>  2. Add the objects to the cache.
> 
> The most recent (hot) objects were flushed, leaving the oldest (cold)
> objects in the mempool cache. The bug degraded performance, because
> flushing prevented immediate reuse of the (hot) objects already in
> the CPU cache.  Now, the existing (cold) objects in the mempool cache
> are flushed before the new (hot) objects are added to the mempool
> cache.
> 
> Since nearby code is touched anyway, fix the flush threshold comparison
> to do flushing only if the threshold is really exceeded, not just
> reached. I.e. it must be "len > flushthresh", not "len >= flushthresh".
> Consider a flush multiplier of 1 instead of 1.5; the cache would be
> flushed already when reaching size objects, not when exceeding size
> objects. In other words, the cache would not be able to hold "size"
> objects, which is clearly a bug. The bug could degrade performance
> due to premature flushing.
> 
> Since we never exceed the flush threshold now, the cache array size in
> the mempool may be decreased from RTE_MEMPOOL_CACHE_MAX_SIZE * 3 to
> RTE_MEMPOOL_CACHE_MAX_SIZE * 2. In fact it could be
> CALC_CACHE_FLUSHTHRESH(RTE_MEMPOOL_CACHE_MAX_SIZE), but the flush
> threshold multiplier is internal.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> ---

[...]

> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -90,7 +90,7 @@ struct rte_mempool_cache {
>  	 * Cache is allocated to this size to allow it to overflow in
> certain
>  	 * cases to avoid needless emptying of cache.
>  	 */
> -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
> +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache objects */
>  } __rte_cache_aligned;

How much are we allowed to break the ABI here?

This patch reduces the size of the structure by removing a now unused part at the end, which should be harmless.

If we may also move the position of the objs array, I would add __rte_cache_aligned to the objs array. It makes no difference in the general case, but if get/put operations are always 32 objects, it will reduce the number of memory (or last level cache) accesses from five to four 64 B cache lines for every get/put operation.

	uint32_t len;	      /**< Current cache count */
-	/*
-	 * Cache is allocated to this size to allow it to overflow in certain
-	 * cases to avoid needless emptying of cache.
-	 */
-	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
+	/**
+	 * Cache objects
+	 *
+	 * Cache is allocated to this size to allow it to overflow in certain
+	 * cases to avoid needless emptying of cache.
+	 */
+	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
} __rte_cache_aligned;
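
A minimal sketch of the cache-line arithmetic behind the five-versus-four
claim (assumptions: 64 B cache lines, 8 B pointers, and objs[] starting
16 B into the struct when unaligned; the exact offset is hypothetical):

	#include <stdio.h>
	#include <stdint.h>

	#define CACHE_LINE 64

	/* Number of cache lines touched by an access of "len" bytes
	 * starting at byte offset "start". */
	static unsigned int lines_touched(uintptr_t start, uintptr_t len)
	{
		return (unsigned int)((start + len - 1) / CACHE_LINE -
				      start / CACHE_LINE + 1);
	}

	int main(void)
	{
		uintptr_t burst = 32 * sizeof(void *);	/* 256 B */

		printf("unaligned objs[]: %u lines\n", lines_touched(16, burst));
		printf("aligned objs[]:   %u lines\n", lines_touched(64, burst));
		return 0;
	}

This prints 5 lines for the unaligned layout and 4 for the aligned one,
matching the claim above (the line holding cache->len is counted
separately, as discussed later in this thread).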

With or without the above suggested optimization...

Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
  
Andrew Rybchenko Oct. 9, 2022, 2:51 p.m. UTC | #2
On 10/9/22 17:31, Morten Brørup wrote:
>> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
>> Sent: Sunday, 9 October 2022 15.38
>>
>> From: Morten Brørup <mb@smartsharesystems.com>
>>
>> Fix the rte_mempool_do_generic_put() cache flushing algorithm to
>> keep hot objects in cache instead of cold ones.
>>
>> The algorithm was:
>>   1. Add the objects to the cache.
>>   2. Anything greater than the cache size (if it crosses the cache flush
>>      threshold) is flushed to the backend.
>>
>> Please note that the description in the source code said that it kept
>> "cache min value" objects after flushing, but the function actually
>> kept
>> the cache full after flushing, which the above description reflects.
>>
>> Now, the algorithm is:
>>   1. If the objects cannot be added to the cache without crossing the
>>      flush threshold, flush some cached objects to the backend to
>>      free up required space.
>>   2. Add the objects to the cache.
>>
>> The most recent (hot) objects were flushed, leaving the oldest (cold)
>> objects in the mempool cache. The bug degraded performance, because
>> flushing prevented immediate reuse of the (hot) objects already in
>> the CPU cache.  Now, the existing (cold) objects in the mempool cache
>> are flushed before the new (hot) objects are added to the mempool
>> cache.
>>
>> Since nearby code is touched anyway, fix the flush threshold comparison
>> to do flushing only if the threshold is really exceeded, not just
>> reached. I.e. it must be "len > flushthresh", not "len >= flushthresh".
>> Consider a flush multiplier of 1 instead of 1.5; the cache would be
>> flushed already when reaching size objects, not when exceeding size
>> objects. In other words, the cache would not be able to hold "size"
>> objects, which is clearly a bug. The bug could degrade performance
>> due to premature flushing.
>>
>> Since we never exceed the flush threshold now, the cache array size in
>> the mempool may be decreased from RTE_MEMPOOL_CACHE_MAX_SIZE * 3 to
>> RTE_MEMPOOL_CACHE_MAX_SIZE * 2. In fact it could be
>> CALC_CACHE_FLUSHTHRESH(RTE_MEMPOOL_CACHE_MAX_SIZE), but the flush
>> threshold multiplier is internal.
>>
>> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
>> Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
>> ---
> 
> [...]
> 
>> --- a/lib/mempool/rte_mempool.h
>> +++ b/lib/mempool/rte_mempool.h
>> @@ -90,7 +90,7 @@ struct rte_mempool_cache {
>>   	 * Cache is allocated to this size to allow it to overflow in
>> certain
>>   	 * cases to avoid needless emptying of cache.
>>   	 */
>> -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
>> +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache objects */
>>   } __rte_cache_aligned;
> 
> How much are we allowed to break the ABI here?
> 
> This patch reduces the size of the structure by removing a now unused part at the end, which should be harmless.
> 
> If we may also move the position of the objs array, I would add __rte_cache_aligned to the objs array. It makes no difference in the general case, but if get/put operations are always 32 objects, it will reduce the number of memory (or last level cache) accesses from five to four 64 B cache lines for every get/put operation.
> 
> 	uint32_t len;	      /**< Current cache count */
> -	/*
> -	 * Cache is allocated to this size to allow it to overflow in certain
> -	 * cases to avoid needless emptying of cache.
> -	 */
> -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
> +	/**
> +	 * Cache objects
> +	 *
> +	 * Cache is allocated to this size to allow it to overflow in certain
> +	 * cases to avoid needless emptying of cache.
> +	 */
> +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
> } __rte_cache_aligned;

I think aligning objs on cacheline should be a separate patch.

> 
> With or without the above suggested optimization...
> 
> Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
>
  
Morten Brørup Oct. 9, 2022, 3:08 p.m. UTC | #3
> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> Sent: Sunday, 9 October 2022 16.52
> 
> On 10/9/22 17:31, Morten Brørup wrote:
> >> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> >> Sent: Sunday, 9 October 2022 15.38
> >>
> >> From: Morten Brørup <mb@smartsharesystems.com>
> >>

[...]

> >> --- a/lib/mempool/rte_mempool.h
> >> +++ b/lib/mempool/rte_mempool.h
> >> @@ -90,7 +90,7 @@ struct rte_mempool_cache {
> >>   	 * Cache is allocated to this size to allow it to overflow in
> >> certain
> >>   	 * cases to avoid needless emptying of cache.
> >>   	 */
> >> -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
> >> +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache objects */
> >>   } __rte_cache_aligned;
> >
> > How much are we allowed to break the ABI here?
> >
> > This patch reduces the size of the structure by removing a now unused
> part at the end, which should be harmless.
> >
> > If we may also move the position of the objs array, I would add
> __rte_cache_aligned to the objs array. It makes no difference in the
> general case, but if get/put operations are always 32 objects, it will
> reduce the number of memory (or last level cache) accesses from five to
> four 64 B cache lines for every get/put operation.
> >
> > 	uint32_t len;	      /**< Current cache count */
> > -	/*
> > -	 * Cache is allocated to this size to allow it to overflow in
> certain
> > -	 * cases to avoid needless emptying of cache.
> > -	 */
> > -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
> > +	/**
> > +	 * Cache objects
> > +	 *
> > +	 * Cache is allocated to this size to allow it to overflow in
> certain
> > +	 * cases to avoid needless emptying of cache.
> > +	 */
> > +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
> > } __rte_cache_aligned;
> 
> I think aligning objs on cacheline should be a separate patch.

Good point. I'll let you do it. :-)

PS: Thank you for following up on this patch series, Andrew!

-Morten
  
Olivier Matz Oct. 14, 2022, 2:01 p.m. UTC | #4
Hi Morten, Andrew,

On Sun, Oct 09, 2022 at 05:08:39PM +0200, Morten Brørup wrote:
> > From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> > Sent: Sunday, 9 October 2022 16.52
> > 
> > On 10/9/22 17:31, Morten Brørup wrote:
> > >> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> > >> Sent: Sunday, 9 October 2022 15.38
> > >>
> > >> From: Morten Brørup <mb@smartsharesystems.com>
> > >>
> 
> [...]

I finally took a couple of hours to carefully review the mempool-related
series (including the ones that have already been pushed).

The new behavior looks better to me in all situations I can think about.

> 
> > >> --- a/lib/mempool/rte_mempool.h
> > >> +++ b/lib/mempool/rte_mempool.h
> > >> @@ -90,7 +90,7 @@ struct rte_mempool_cache {
> > >>   	 * Cache is allocated to this size to allow it to overflow in
> > >> certain
> > >>   	 * cases to avoid needless emptying of cache.
> > >>   	 */
> > >> -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
> > >> +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache objects */
> > >>   } __rte_cache_aligned;
> > >
> > > How much are we allowed to break the ABI here?
> > >
> > > This patch reduces the size of the structure by removing a now unused
> > part at the end, which should be harmless.

It is an ABI breakage: an existing application will use the new 22.11
function to create the mempool (with a smaller cache), but the old
inlined get/put, which can exceed MAX_SIZE x 2, will remain.

But this is a nice memory consumption improvement, in my opinion we
should accept it for 22.11 with an entry in the release note.


> > >
> > > If we may also move the position of the objs array, I would add
> > __rte_cache_aligned to the objs array. It makes no difference in the
> > general case, but if get/put operations are always 32 objects, it will
> > reduce the number of memory (or last level cache) accesses from five to
> > four 64 B cache lines for every get/put operation.

Will it really be the case? Since cache->len has to be accessed too,
I don't think it would make a difference.


> > >
> > > 	uint32_t len;	      /**< Current cache count */
> > > -	/*
> > > -	 * Cache is allocated to this size to allow it to overflow in
> > certain
> > > -	 * cases to avoid needless emptying of cache.
> > > -	 */
> > > -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
> > > +	/**
> > > +	 * Cache objects
> > > +	 *
> > > +	 * Cache is allocated to this size to allow it to overflow in
> > certain
> > > +	 * cases to avoid needless emptying of cache.
> > > +	 */
> > > +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
> > > } __rte_cache_aligned;
> > 
> > I think aligning objs on cacheline should be a separate patch.
> 
> Good point. I'll let you do it. :-)
> 
> PS: Thank you for following up on this patch series, Andrew!

Many thanks for this rework.

Acked-by: Olivier Matz <olivier.matz@6wind.com>
  
Morten Brørup Oct. 14, 2022, 3:57 p.m. UTC | #5
> From: Olivier Matz [mailto:olivier.matz@6wind.com]
> Sent: Friday, 14 October 2022 16.01
> 
> Hi Morten, Andrew,
> 
> On Sun, Oct 09, 2022 at 05:08:39PM +0200, Morten Brørup wrote:
> > > From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> > > Sent: Sunday, 9 October 2022 16.52
> > >
> > > On 10/9/22 17:31, Morten Brørup wrote:
> > > >> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> > > >> Sent: Sunday, 9 October 2022 15.38
> > > >>
> > > >> From: Morten Brørup <mb@smartsharesystems.com>
> > > >>
> >
> > [...]
> 
> I finally took a couple of hours to carefully review the mempool-
> related
> series (including the ones that have already been pushed).
> 
> The new behavior looks better to me in all situations I can think
> about.

Extreme care is required when touching a core library like the mempool.

Thank you, Olivier.

> 
> >
> > > >> --- a/lib/mempool/rte_mempool.h
> > > >> +++ b/lib/mempool/rte_mempool.h
> > > >> @@ -90,7 +90,7 @@ struct rte_mempool_cache {
> > > >>   	 * Cache is allocated to this size to allow it to overflow
> in
> > > >> certain
> > > >>   	 * cases to avoid needless emptying of cache.
> > > >>   	 */
> > > >> -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache
> objects */
> > > >> +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache
> objects */
> > > >>   } __rte_cache_aligned;
> > > >
> > > > How much are we allowed to break the ABI here?
> > > >
> > > > This patch reduces the size of the structure by removing a now
> unused
> > > part at the end, which should be harmless.
> 
> It is an ABI breakage: an existing application will use the new 22.11
> function to create the mempool (with a smaller cache), but the old
> inlined get/put, which can exceed MAX_SIZE x 2, will remain.
> 
> But this is a nice memory consumption improvement, in my opinion we
> should accept it for 22.11 with an entry in the release note.
> 
> 
> > > >
> > > > If we may also move the position of the objs array, I would add
> > > __rte_cache_aligned to the objs array. It makes no difference in
> the
> > > general case, but if get/put operations are always 32 objects, it
> will
> > > reduce the number of memory (or last level cache) accesses from
> five to
> > > four 64 B cache lines for every get/put operation.
> 
> Will it really be the case? Since cache->len has to be accessed too,
> I don't think it would make a difference.

Yes, the first cache line, containing cache->len, will always be accessed. I forgot to count that; so the improvement by aligning cache->objs will be five cache line accesses instead of six.

Let me try to explain the scenario in other words:

In an application where a mempool cache is only accessed in bursts of 32 objects (256 B), it matters if those 256 B accesses in the mempool cache start at a cache line aligned address or not. If cache line aligned, accessing those 256 B in the mempool cache will only touch 4 cache lines; if not, 5 cache lines will be touched. (For architectures with 128 B cache line, it will be 2 instead of 3 touched cache lines per mempool cache get/put operation in applications using only bursts of 32 objects.)

If we cache line align cache->objs, those bursts of 32 objects (256 B) will be cache line aligned: Any address at cache->objs[N * 32 objects] is cache line aligned if cache->objs[0] is cache line aligned.

Currently, the cache->objs directly follows cache->len, which makes cache->objs[0] cache line unaligned.

If we decide to break the mempool cache ABI, we might as well include my suggested cache line alignment performance improvement. It doesn't degrade performance for mempool caches that are not accessed exclusively in bursts of 32 objects.

> 
> 
> > > >
> > > > 	uint32_t len;	      /**< Current cache count */
> > > > -	/*
> > > > -	 * Cache is allocated to this size to allow it to overflow
> in
> > > certain
> > > > -	 * cases to avoid needless emptying of cache.
> > > > -	 */
> > > > -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache
> objects */
> > > > +	/**
> > > > +	 * Cache objects
> > > > +	 *
> > > > +	 * Cache is allocated to this size to allow it to overflow
> in
> > > certain
> > > > +	 * cases to avoid needless emptying of cache.
> > > > +	 */
> > > > +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]
> __rte_cache_aligned;
> > > > } __rte_cache_aligned;
> > >
> > > I think aligning objs on cacheline should be a separate patch.
> >
> > Good point. I'll let you do it. :-)
> >
> > PS: Thank you for following up on this patch series, Andrew!
> 
> Many thanks for this rework.
> 
> Acked-by: Olivier Matz <olivier.matz@6wind.com>

Perhaps Reviewed-by would be appropriate?
  
Olivier Matz Oct. 14, 2022, 7:50 p.m. UTC | #6
On Fri, Oct 14, 2022 at 05:57:39PM +0200, Morten Brørup wrote:
> > From: Olivier Matz [mailto:olivier.matz@6wind.com]
> > Sent: Friday, 14 October 2022 16.01
> > 
> > Hi Morten, Andrew,
> > 
> > On Sun, Oct 09, 2022 at 05:08:39PM +0200, Morten Brørup wrote:
> > > > From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> > > > Sent: Sunday, 9 October 2022 16.52
> > > >
> > > > On 10/9/22 17:31, Morten Brørup wrote:
> > > > >> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> > > > >> Sent: Sunday, 9 October 2022 15.38
> > > > >>
> > > > >> From: Morten Brørup <mb@smartsharesystems.com>
> > > > >>
> > >
> > > [...]
> > 
> > I finally took a couple of hours to carefully review the mempool-
> > related
> > series (including the ones that have already been pushed).
> > 
> > The new behavior looks better to me in all situations I can think
> > about.
> 
> Extreme care is required when touching a core library like the mempool.
> 
> Thank you, Olivier.
> 
> > 
> > >
> > > > >> --- a/lib/mempool/rte_mempool.h
> > > > >> +++ b/lib/mempool/rte_mempool.h
> > > > >> @@ -90,7 +90,7 @@ struct rte_mempool_cache {
> > > > >>   	 * Cache is allocated to this size to allow it to overflow
> > in
> > > > >> certain
> > > > >>   	 * cases to avoid needless emptying of cache.
> > > > >>   	 */
> > > > >> -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache
> > objects */
> > > > >> +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache
> > objects */
> > > > >>   } __rte_cache_aligned;
> > > > >
> > > > > How much are we allowed to break the ABI here?
> > > > >
> > > > > This patch reduces the size of the structure by removing a now
> > unused
> > > > part at the end, which should be harmless.
> > 
> > It is an ABI breakage: an existing application will use the new 22.11
> > function to create the mempool (with a smaller cache), but the old
> > inlined get/put, which can exceed MAX_SIZE x 2, will remain.
> > 
> > But this is a nice memory consumption improvement, in my opinion we
> > should accept it for 22.11 with an entry in the release note.
> > 
> > 
> > > > >
> > > > > If we may also move the position of the objs array, I would add
> > > > __rte_cache_aligned to the objs array. It makes no difference in
> > the
> > > > general case, but if get/put operations are always 32 objects, it
> > will
> > > > reduce the number of memory (or last level cache) accesses from
> > five to
> > > > four 64 B cache lines for every get/put operation.
> > 
> > Will it really be the case? Since cache->len has to be accessed too,
> > I don't think it would make a difference.
> 
> Yes, the first cache line, containing cache->len, will be accessed always. I forgot to count that; so the improvement by aligning cache->objs will be five cache line accesses instead of six.
> 
> Let me try to explain the scenario in other words:
> 
> In an application where a mempool cache is only accessed in bursts of 32 objects (256 B), it matters if those 256 B accesses in the mempool cache start at a cache line aligned address or not. If cache line aligned, accessing those 256 B in the mempool cache will only touch 4 cache lines; if not, 5 cache lines will be touched. (For architectures with 128 B cache line, it will be 2 instead of 3 touched cache lines per mempool cache get/put operation in applications using only bursts of 32 objects.)
> 
> If we cache line align cache->objs, those bursts of 32 objects (256 B) will be cache line aligned: Any address at cache->objs[N * 32 objects] is cache line aligned if objs->objs[0] is cache line aligned.
> 
> Currently, the cache->objs directly follows cache->len, which makes cache->objs[0] cache line unaligned.
> 
> If we decide to break the mempool cache ABI, we might as well include my suggested cache line alignment performance improvement. It doesn't degrade performance for mempool caches not only accessed in bursts of 32 objects.

I don't follow you. Currently, with 16 objects (128B), we access 3
cache lines:

      ┌────────┐
      │len     │
cache │********│---
line0 │********│ ^
      │********│ |
      ├────────┤ | 16 objects
      │********│ | 128B
cache │********│ |
line1 │********│ |
      │********│ |
      ├────────┤ |
      │********│_v_
cache │        │
line2 │        │
      │        │
      └────────┘

With the alignment, it is also 3 cache lines:

      ┌────────┐
      │len     │
cache │        │
line0 │        │
      │        │
      ├────────┤---
      │********│ ^
cache │********│ |
line1 │********│ |
      │********│ |
      ├────────┤ | 16 objects
      │********│ | 128B
cache │********│ |
line2 │********│ |
      │********│ v
      └────────┘---


Am I missing something?

> 
> > 
> > 
> > > > >
> > > > > 	uint32_t len;	      /**< Current cache count */
> > > > > -	/*
> > > > > -	 * Cache is allocated to this size to allow it to overflow
> > in
> > > > certain
> > > > > -	 * cases to avoid needless emptying of cache.
> > > > > -	 */
> > > > > -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache
> > objects */
> > > > > +	/**
> > > > > +	 * Cache objects
> > > > > +	 *
> > > > > +	 * Cache is allocated to this size to allow it to overflow
> > in
> > > > certain
> > > > > +	 * cases to avoid needless emptying of cache.
> > > > > +	 */
> > > > > +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]
> > __rte_cache_aligned;
> > > > > } __rte_cache_aligned;
> > > >
> > > > I think aligning objs on cacheline should be a separate patch.
> > >
> > > Good point. I'll let you do it. :-)
> > >
> > > PS: Thank you for following up on this patch series, Andrew!
> > 
> > Many thanks for this rework.
> > 
> > Acked-by: Olivier Matz <olivier.matz@6wind.com>
> 
> Perhaps Reviewed-by would be appropriate?

I was thinking that "Acked-by" was commonly used by maintainers, and
"Reviewed-by" for reviews by community members. After reading the
documentation again, it's not that clear now in my mind :)

Thanks,
Olivier
  
Morten Brørup Oct. 15, 2022, 6:57 a.m. UTC | #7
> From: Olivier Matz [mailto:olivier.matz@6wind.com]
> Sent: Friday, 14 October 2022 21.51
> 
> On Fri, Oct 14, 2022 at 05:57:39PM +0200, Morten Brørup wrote:
> > > From: Olivier Matz [mailto:olivier.matz@6wind.com]
> > > Sent: Friday, 14 October 2022 16.01
> > >
> > > Hi Morten, Andrew,
> > >
> > > On Sun, Oct 09, 2022 at 05:08:39PM +0200, Morten Brørup wrote:
> > > > > From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> > > > > Sent: Sunday, 9 October 2022 16.52
> > > > >
> > > > > On 10/9/22 17:31, Morten Brørup wrote:
> > > > > >> From: Andrew Rybchenko
> [mailto:andrew.rybchenko@oktetlabs.ru]
> > > > > >> Sent: Sunday, 9 October 2022 15.38
> > > > > >>
> > > > > >> From: Morten Brørup <mb@smartsharesystems.com>
> > > > > >>
> > > >
> > > > [...]
> > >
> > > I finally took a couple of hours to carefully review the mempool-
> > > related
> > > series (including the ones that have already been pushed).
> > >
> > > The new behavior looks better to me in all situations I can think
> > > about.
> >
> > Extreme care is required when touching a core library like the
> mempool.
> >
> > Thank you, Olivier.
> >
> > >
> > > >
> > > > > >> --- a/lib/mempool/rte_mempool.h
> > > > > >> +++ b/lib/mempool/rte_mempool.h
> > > > > >> @@ -90,7 +90,7 @@ struct rte_mempool_cache {
> > > > > >>   	 * Cache is allocated to this size to allow it to
> overflow
> > > in
> > > > > >> certain
> > > > > >>   	 * cases to avoid needless emptying of cache.
> > > > > >>   	 */
> > > > > >> -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**<
> Cache
> > > objects */
> > > > > >> +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**<
> Cache
> > > objects */
> > > > > >>   } __rte_cache_aligned;
> > > > > >
> > > > > > How much are we allowed to break the ABI here?
> > > > > >
> > > > > > This patch reduces the size of the structure by removing a
> now
> > > unused
> > > > > part at the end, which should be harmless.
> > >
> > > It is an ABI breakage: an existing application will use the new
> > > 22.11 function to create the mempool (with a smaller cache), but
> > > the old inlined get/put, which can exceed MAX_SIZE x 2, will
> > > remain.
> > >
> > > But this is a nice memory consumption improvement, in my opinion we
> > > should accept it for 22.11 with an entry in the release note.
> > >
> > >
> > > > > >
> > > > > > If we may also move the position of the objs array, I would
> add
> > > > > __rte_cache_aligned to the objs array. It makes no difference
> in
> > > the
> > > > > general case, but if get/put operations are always 32 objects,
> it
> > > will
> > > > > reduce the number of memory (or last level cache) accesses from
> > > five to
> > > > > four 64 B cache lines for every get/put operation.
> > >
> > > Will it really be the case? Since cache->len has to be accessed
> too,
> > > I don't think it would make a difference.
> >
> > Yes, the first cache line, containing cache->len, will always be
> accessed. I forgot to count that; so the improvement by aligning
> cache->objs will be five cache line accesses instead of six.
> >
> > Let me try to explain the scenario in other words:
> >
> > In an application where a mempool cache is only accessed in bursts of
> 32 objects (256 B), it matters if those 256 B accesses in the mempool
> cache start at a cache line aligned address or not. If cache line
> aligned, accessing those 256 B in the mempool cache will only touch 4
> cache lines; if not, 5 cache lines will be touched. (For architectures
> with 128 B cache line, it will be 2 instead of 3 touched cache lines
> per mempool cache get/put operation in applications using only bursts
> of 32 objects.)
> >
> > If we cache line align cache->objs, those bursts of 32 objects (256
> B) will be cache line aligned: Any address at cache->objs[N * 32
> objects] is cache line aligned if cache->objs[0] is cache line aligned.
> >
> > Currently, the cache->objs directly follows cache->len, which makes
> cache->objs[0] cache line unaligned.
> >
> > If we decide to break the mempool cache ABI, we might as well include
> my suggested cache line alignment performance improvement. It doesn't
> degrade performance for mempool caches that are not accessed
> exclusively in bursts of 32 objects.
> 
> I don't follow you. Currently, with 16 objects (128B), we access 3
> cache lines:
> 
>       ┌────────┐
>       │len     │
> cache │********│---
> line0 │********│ ^
>       │********│ |
>       ├────────┤ | 16 objects
>       │********│ | 128B
> cache │********│ |
> line1 │********│ |
>       │********│ |
>       ├────────┤ |
>       │********│_v_
> cache │        │
> line2 │        │
>       │        │
>       └────────┘
> 
> With the alignment, it is also 3 cache lines:
> 
>       ┌────────┐
>       │len     │
> cache │        │
> line0 │        │
>       │        │
>       ├────────┤---
>       │********│ ^
> cache │********│ |
> line1 │********│ |
>       │********│ |
>       ├────────┤ | 16 objects
>       │********│ | 128B
> cache │********│ |
> line2 │********│ |
>       │********│ v
>       └────────┘---
> 
> 
> Am I missing something?

Accessing the objects at the bottom of the mempool cache is a special case, where cache line0 is also used for objects.

Consider the next burst (and any following bursts):

Current:
      ┌────────┐
      │len     │
cache │        │
line0 │        │
      │        │
      ├────────┤
      │        │
cache │        │
line1 │        │
      │        │
      ├────────┤
      │        │
cache │********│---
line2 │********│ ^
      │********│ |
      ├────────┤ | 16 objects
      │********│ | 128B
cache │********│ |
line3 │********│ |
      │********│ |
      ├────────┤ |
      │********│_v_
cache │        │
line4 │        │
      │        │
      └────────┘
4 cache lines touched, incl. line0 for len.

With the proposed alignment:
      ┌────────┐
      │len     │
cache │        │
line0 │        │
      │        │
      ├────────┤
      │        │
cache │        │
line1 │        │
      │        │
      ├────────┤
      │        │
cache │        │
line2 │        │
      │        │
      ├────────┤
      │********│---
cache │********│ ^
line3 │********│ |
      │********│ | 16 objects
      ├────────┤ | 128B
      │********│ |
cache │********│ |
line4 │********│ |
      │********│_v_
      └────────┘
Only 3 cache lines touched, incl. line0 for len.


> 
> >
> > >
> > >
> > > > > >
> > > > > > 	uint32_t len;	      /**< Current cache count */
> > > > > > -	/*
> > > > > > -	 * Cache is allocated to this size to allow it to overflow
> > > in
> > > > > certain
> > > > > > -	 * cases to avoid needless emptying of cache.
> > > > > > -	 */
> > > > > > -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache
> > > objects */
> > > > > > +	/**
> > > > > > +	 * Cache objects
> > > > > > +	 *
> > > > > > +	 * Cache is allocated to this size to allow it to overflow
> > > in
> > > > > certain
> > > > > > +	 * cases to avoid needless emptying of cache.
> > > > > > +	 */
> > > > > > +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]
> > > __rte_cache_aligned;
> > > > > > } __rte_cache_aligned;
> > > > >
> > > > > I think aligning objs on cacheline should be a separate patch.
> > > >
> > > > Good point. I'll let you do it. :-)
> > > >
> > > > PS: Thank you for following up on this patch series, Andrew!
> > >
> > > Many thanks for this rework.
> > >
> > > Acked-by: Olivier Matz <olivier.matz@6wind.com>
> >
> > Perhaps Reviewed-by would be appropriate?
> 
> I was thinking that "Acked-by" was commonly used by maintainers, and
> "Reviewed-by" for reviews by community members. After reading the
> documentation again, it's not that clear now in my mind :)
> 
> Thanks,
> Olivier
  
Jerin Jacob Oct. 18, 2022, 4:32 p.m. UTC | #8
On Sat, Oct 15, 2022 at 12:27 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > From: Olivier Matz [mailto:olivier.matz@6wind.com]
> > Sent: Friday, 14 October 2022 21.51
> >
> > On Fri, Oct 14, 2022 at 05:57:39PM +0200, Morten Brørup wrote:
> > > > From: Olivier Matz [mailto:olivier.matz@6wind.com]
> > > > Sent: Friday, 14 October 2022 16.01
> > > >
> > > > Hi Morten, Andrew,
> > > >
> > > > On Sun, Oct 09, 2022 at 05:08:39PM +0200, Morten Brørup wrote:
> > > > > > From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> > > > > > Sent: Sunday, 9 October 2022 16.52
> > > > > >
> > > > > > On 10/9/22 17:31, Morten Brørup wrote:
> > > > > > >> From: Andrew Rybchenko
> > [mailto:andrew.rybchenko@oktetlabs.ru]
> > > > > > >> Sent: Sunday, 9 October 2022 15.38
> > > > > > >>
> > > > > > >> From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > >>
> > > > >
> > > > > [...]
> > > >
> > > > I finally took a couple of hours to carefully review the mempool-
> > > > related
> > > > series (including the ones that have already been pushed).
> > > >
> > > > The new behavior looks better to me in all situations I can think
> > > > about.
> > >
> > > Extreme care is required when touching a core library like the
> > mempool.
> > >
> > > Thank you, Olivier.
> > >
> > > >
> > > > >
> > > > > > >> --- a/lib/mempool/rte_mempool.h
> > > > > > >> +++ b/lib/mempool/rte_mempool.h
> > > > > > >> @@ -90,7 +90,7 @@ struct rte_mempool_cache {
> > > > > > >>     * Cache is allocated to this size to allow it to
> > overflow
> > > > in
> > > > > > >> certain
> > > > > > >>     * cases to avoid needless emptying of cache.
> > > > > > >>     */
> > > > > > >> -  void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**<
> > Cache
> > > > objects */
> > > > > > >> +  void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**<
> > Cache
> > > > objects */
> > > > > > >>   } __rte_cache_aligned;
> > > > > > >
> > > > > > > How much are we allowed to break the ABI here?
> > > > > > >
> > > > > > > This patch reduces the size of the structure by removing a
> > now
> > > > unused
> > > > > > part at the end, which should be harmless.
> > > >
> > > > It is an ABI breakage: an existing application will use the new
> > > > 22.11 function to create the mempool (with a smaller cache), but
> > > > the old inlined get/put, which can exceed MAX_SIZE x 2, will
> > > > remain.
> > > >
> > > > But this is a nice memory consumption improvement, in my opinion we
> > > > should accept it for 22.11 with an entry in the release note.
> > > >
> > > >
> > > > > > >
> > > > > > > If we may also move the position of the objs array, I would
> > add
> > > > > > __rte_cache_aligned to the objs array. It makes no difference
> > in
> > > > the
> > > > > > general case, but if get/put operations are always 32 objects,
> > it
> > > > will
> > > > > > reduce the number of memory (or last level cache) accesses from
> > > > five to
> > > > > > four 64 B cache lines for every get/put operation.
> > > >
> > > > Will it really be the case? Since cache->len has to be accessed
> > too,
> > > > I don't think it would make a difference.
> > >
> > > Yes, the first cache line, containing cache->len, will always be
> > accessed. I forgot to count that; so the improvement by aligning
> > cache->objs will be five cache line accesses instead of six.
> > >
> > > Let me try to explain the scenario in other words:
> > >
> > > In an application where a mempool cache is only accessed in bursts of
> > 32 objects (256 B), it matters if those 256 B accesses in the mempool
> > cache start at a cache line aligned address or not. If cache line
> > aligned, accessing those 256 B in the mempool cache will only touch 4
> > cache lines; if not, 5 cache lines will be touched. (For architectures
> > with 128 B cache line, it will be 2 instead of 3 touched cache lines
> > per mempool cache get/put operation in applications using only bursts
> > of 32 objects.)
> > >
> > > If we cache line align cache->objs, those bursts of 32 objects (256
> > B) will be cache line aligned: Any address at cache->objs[N * 32
> > objects] is cache line aligned if cache->objs[0] is cache line aligned.
> > >
> > > Currently, the cache->objs directly follows cache->len, which makes
> > cache->objs[0] cache line unaligned.
> > >
> > > If we decide to break the mempool cache ABI, we might as well include
> > my suggested cache line alignment performance improvement. It doesn't
> > degrade performance for mempool caches that are not accessed
> > exclusively in bursts of 32 objects.
> >
> > I don't follow you. Currently, with 16 objects (128B), we access 3
> > cache lines:
> >
> >       ┌────────┐
> >       │len     │
> > cache │********│---
> > line0 │********│ ^
> >       │********│ |
> >       ├────────┤ | 16 objects
> >       │********│ | 128B
> > cache │********│ |
> > line1 │********│ |
> >       │********│ |
> >       ├────────┤ |
> >       │********│_v_
> > cache │        │
> > line2 │        │
> >       │        │
> >       └────────┘
> >
> > With the alignment, it is also 3 cache lines:
> >
> >       ┌────────┐
> >       │len     │
> > cache │        │
> > line0 │        │
> >       │        │
> >       ├────────┤---
> >       │********│ ^
> > cache │********│ |
> > line1 │********│ |
> >       │********│ |
> >       ├────────┤ | 16 objects
> >       │********│ | 128B
> > cache │********│ |
> > line2 │********│ |
> >       │********│ v
> >       └────────┘---
> >
> >
> > Am I missing something?
>
> Accessing the objects at the bottom of the mempool cache is a special case, where cache line0 is also used for objects.
>
> Consider the next burst (and any following bursts):
>
> Current:
>       ┌────────┐
>       │len     │
> cache │        │
> line0 │        │
>       │        │
>       ├────────┤
>       │        │
> cache │        │
> line1 │        │
>       │        │
>       ├────────┤
>       │        │
> cache │********│---
> line2 │********│ ^
>       │********│ |
>       ├────────┤ | 16 objects
>       │********│ | 128B
> cache │********│ |
> line3 │********│ |
>       │********│ |
>       ├────────┤ |
>       │********│_v_
> cache │        │
> line4 │        │
>       │        │
>       └────────┘
> 4 cache lines touched, incl. line0 for len.
>
> With the proposed alignment:
>       ┌────────┐
>       │len     │
> cache │        │
> line0 │        │
>       │        │
>       ├────────┤
>       │        │
> cache │        │
> line1 │        │
>       │        │
>       ├────────┤
>       │        │
> cache │        │
> line2 │        │
>       │        │
>       ├────────┤
>       │********│---
> cache │********│ ^
> line3 │********│ |
>       │********│ | 16 objects
>       ├────────┤ | 128B
>       │********│ |
> cache │********│ |
> line4 │********│ |
>       │********│_v_
>       └────────┘
> Only 3 cache lines touched, incl. line0 for len.


When tested with testpmd and l3fwd there was less than 1% regression;
it could be noise. But making the array cache line aligned fixes that.

In addition to @Morten Brørup's point, I think there is a "load" stall
factor on the cache->len read. What I meant by that is:

In the case where len and objs are in the same cache line: compare objs
being written as store operations with nothing read on that cache line,
versus a few stores done for the objects followed by a subsequent len
read via the enqueue operation, which may stall depending on where
those objects have reached in the cache hierarchy and on the cache
policy (write-back vs write-through).

If we are seeing no regression with the cache-aligned layout across
various platform testing, then I think it makes sense to make it cache
aligned.

>
>
> >
> > >
> > > >
> > > >
> > > > > > >
> > > > > > >     uint32_t len;         /**< Current cache count */
> > > > > > > -   /*
> > > > > > > -    * Cache is allocated to this size to allow it to overflow
> > > > in
> > > > > > certain
> > > > > > > -    * cases to avoid needless emptying of cache.
> > > > > > > -    */
> > > > > > > -   void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache
> > > > objects */
> > > > > > > +   /**
> > > > > > > +    * Cache objects
> > > > > > > +    *
> > > > > > > +    * Cache is allocated to this size to allow it to overflow
> > > > in
> > > > > > certain
> > > > > > > +    * cases to avoid needless emptying of cache.
> > > > > > > +    */
> > > > > > > +   void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]
> > > > __rte_cache_aligned;
> > > > > > > } __rte_cache_aligned;
> > > > > >
> > > > > > I think aligning objs on cacheline should be a separate patch.
> > > > >
> > > > > Good point. I'll let you do it. :-)
> > > > >
> > > > > PS: Thank you for following up on this patch series, Andrew!
> > > >
> > > > Many thanks for this rework.
> > > >
> > > > Acked-by: Olivier Matz <olivier.matz@6wind.com>
> > >
> > > Perhaps Reviewed-by would be appropriate?
> >
> > I was thinking that "Acked-by" was commonly used by maintainers, and
> > "Reviewed-by" for reviews by community members. After reading the
> > documentation again, it's not that clear now in my mind :)
> >
> > Thanks,
> > Olivier
>
  

Patch

diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index de59009baf..4ba8ab7b63 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -746,6 +746,11 @@  rte_mempool_free(struct rte_mempool *mp)
 static void
 mempool_cache_init(struct rte_mempool_cache *cache, uint32_t size)
 {
+	/* Check that the cache has enough space for the flush threshold */
+	RTE_BUILD_BUG_ON(CALC_CACHE_FLUSHTHRESH(RTE_MEMPOOL_CACHE_MAX_SIZE) >
+			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs) /
+			 RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs[0]));
+
 	cache->size = size;
 	cache->flushthresh = CALC_CACHE_FLUSHTHRESH(size);
 	cache->len = 0;
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index a072e5554b..e3364ed7b8 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -90,7 +90,7 @@  struct rte_mempool_cache {
 	 * Cache is allocated to this size to allow it to overflow in certain
 	 * cases to avoid needless emptying of cache.
 	 */
-	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 3]; /**< Cache objects */
+	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache objects */
 } __rte_cache_aligned;
 
 /**
@@ -1329,30 +1329,39 @@  rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	RTE_MEMPOOL_STAT_ADD(mp, put_bulk, 1);
 	RTE_MEMPOOL_STAT_ADD(mp, put_objs, n);
 
-	/* No cache provided or if put would overflow mem allocated for cache */
-	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
+	/* No cache provided or the request itself is too big for the cache */
+	if (unlikely(cache == NULL || n > cache->flushthresh))
 		goto driver_enqueue;
 
-	cache_objs = &cache->objs[cache->len];
-
 	/*
-	 * The cache follows the following algorithm
-	 *   1. Add the objects to the cache
-	 *   2. Anything greater than the cache min value (if it crosses the
-	 *   cache flush threshold) is flushed to the backend.
+	 * The cache follows the following algorithm:
+	 *   1. If the objects cannot be added to the cache without crossing
+	 *      the flush threshold, flush the cache to the backend.
+	 *   2. Add the objects to the cache.
 	 */
 
-	/* Add elements back into the cache */
-	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
-
-	cache->len += n;
+	if (cache->len + n <= cache->flushthresh) {
+		cache_objs = &cache->objs[cache->len];
+		cache->len += n;
+	} else {
+		unsigned int keep = (n >= cache->size) ? 0 : (cache->size - n);
 
-	if (cache->len >= cache->flushthresh) {
-		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
-				cache->len - cache->size);
-		cache->len = cache->size;
+		/*
+		 * If the number of objects to keep in the cache is positive:
+		 * keep = cache->size - n < cache->flushthresh - n < cache->len
+		 * since cache->flushthresh > cache->size.
+		 * If keep is 0, cache->len cannot be 0 anyway since
+		 * n <= cache->flushthresh and we'd not be here with
+		 * cache->len == 0.
+		 */
+		cache_objs = &cache->objs[keep];
+		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len - keep);
+		cache->len = keep + n;
 	}
 
+	/* Add the objects to the cache. */
+	rte_memcpy(cache_objs, obj_table, sizeof(void *) * n);
+
 	return;
 
 driver_enqueue:
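
For reference, the new put path in isolation (a simplified,
self-contained sketch of the logic above, with a stub standing in for
rte_mempool_ops_enqueue_bulk() and without the statistics and driver
plumbing):

	#include <string.h>

	struct cache {
		unsigned int size;		/* nominal cache size */
		unsigned int flushthresh;	/* > size (1.5 * size today) */
		unsigned int len;		/* current object count */
		void *objs[1024];
	};

	/* Stand-in for rte_mempool_ops_enqueue_bulk(). */
	static void backend_enqueue(void * const *objs, unsigned int n)
	{
		(void)objs;
		(void)n;
	}

	/* Callers are expected to have handled n > flushthresh already,
	 * as the patch does before reaching this point. */
	static void cache_put(struct cache *c, void * const *obj_table,
			      unsigned int n)
	{
		void **cache_objs;

		if (c->len + n <= c->flushthresh) {
			/* common case: append below the flush threshold */
			cache_objs = &c->objs[c->len];
			c->len += n;
		} else {
			/* flush pre-existing (cold) entries first, keeping
			 * at most size - n of them, so that after adding n
			 * objects the cache holds about "size" entries */
			unsigned int keep =
				(n >= c->size) ? 0 : (c->size - n);

			cache_objs = &c->objs[keep];
			backend_enqueue(cache_objs, c->len - keep);
			c->len = keep + n;
		}

		/* add the new (hot) objects to the cache */
		memcpy(cache_objs, obj_table, sizeof(void *) * n);
	}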