[v3] mempool: fix put objects to mempool with cache

Message ID 20220119150301.42583-1-mb@smartsharesystems.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series [v3] mempool: fix put objects to mempool with cache

Checks

Context Check Description
ci/checkpatch warning coding style issues
ci/github-robot: build success github build: passed
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-aarch64-unit-testing success Testing PASS
ci/iol-x86_64-unit-testing success Testing PASS
ci/iol-x86_64-compile-testing success Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS
ci/iol-abi-testing success Testing PASS
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS

Commit Message

Morten Brørup Jan. 19, 2022, 3:03 p.m. UTC
  mempool: fix put objects to mempool with cache

This patch optimizes the rte_mempool_do_generic_put() caching algorithm,
and fixes a bug in it.

The existing algorithm was:
 1. Add the objects to the cache
 2. Anything greater than the cache size (if it crosses the cache flush
    threshold) is flushed to the ring.

Please note that the description in the source code said that it kept
"cache min value" objects after flushing, but the function actually kept
"size" objects, which is reflected in the above description.

Now, the algorithm is:
 1. If the objects cannot be added to the cache without crossing the
    flush threshold, flush the cache to the ring.
 2. Add the objects to the cache.
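
A condensed, standalone sketch of the new flow (simplified for illustration:
the struct, the array size and ring_enqueue_bulk() are stand-ins for the DPDK
types, and a plain memcpy() stands in for the copy loop discussed under point
4 below -- the authoritative code is in the diff at the end):

#include <stdint.h>
#include <string.h>

struct example_cache {
	uint32_t len;          /* number of objects currently cached */
	uint32_t flushthresh;  /* high watermark, 1.5 x the cache size */
	void *objs[1024];      /* cached object pointers */
};

/* Stand-in for rte_mempool_ops_enqueue_bulk(). */
static void ring_enqueue_bulk(void **objs, unsigned int n)
{
	(void)objs; (void)n;
}

static void
example_cache_put(struct example_cache *cache, void * const *obj_table,
		  unsigned int n)
{
	void **cache_objs;

	if (cache->len + n <= cache->flushthresh) {
		/* 1a. The objects fit below the flush threshold: append. */
		cache_objs = &cache->objs[cache->len];
		cache->len += n;
	} else {
		/* 1b. They do not fit: flush the whole cache to the ring
		 * first, so only the new (hot) objects remain afterwards.
		 */
		ring_enqueue_bulk(cache->objs, cache->len);
		cache_objs = cache->objs;
		cache->len = n;
	}

	/* 2. Add the objects to the cache. */
	memcpy(cache_objs, obj_table, n * sizeof(void *));
}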

This patch changes these details:

1. Bug: The cache was still full after flushing.
In the opposite direction, i.e. when getting objects from the cache, the
cache is refilled to full level when it crosses the low watermark (which
happens to be zero).
Similarly, the cache should be flushed to empty level when it crosses
the high watermark (which happens to be 1.5 x the size of the cache).
The existing flushing behaviour was suboptimal for real applications,
because crossing the low or high watermark typically happens when the
application is in a state where the number of put/get events is out of
balance, e.g. when absorbing a burst of packets into a QoS queue
(getting more mbufs from the mempool), or when a burst of packets is
trickling out from the QoS queue (putting the mbufs back into the
mempool).
NB: When the application is in a state where put/get events are in
balance, the cache should remain within its low and high watermarks, and
the algorithms for refilling/flushing the cache should not come into
play.
Now, the mempool cache is completely flushed when crossing the flush
threshold, so only the newly put (hot) objects remain in the mempool
cache afterwards.
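
To illustrate the effect on flush frequency, here is a small self-contained
simulation of both policies; the cache size (32), flush threshold (48) and
burst size (8) are arbitrary numbers chosen only for this example:

#include <stdio.h>

#define CACHE_SIZE   32
#define FLUSH_THRESH 48  /* 1.5 x CACHE_SIZE */
#define BURST        8
#define NUM_PUTS     600

int main(void)
{
	unsigned int len, i, flushes, flushed_objs;

	/* Old behaviour: add first, then trim the cache back to CACHE_SIZE
	 * whenever the flush threshold is reached, so the cache stays
	 * (almost) full after every flush.
	 */
	len = 0; flushes = 0; flushed_objs = 0;
	for (i = 0; i < NUM_PUTS; i++) {
		len += BURST;
		if (len >= FLUSH_THRESH) {
			flushed_objs += len - CACHE_SIZE;
			len = CACHE_SIZE;
			flushes++;
		}
	}
	printf("old: %u ring enqueues, %u objects enqueued\n",
	       flushes, flushed_objs);

	/* New behaviour: flush the whole cache before it would cross the
	 * threshold, so flushes are rarer and each one is larger.
	 */
	len = 0; flushes = 0; flushed_objs = 0;
	for (i = 0; i < NUM_PUTS; i++) {
		if (len + BURST > FLUSH_THRESH) {
			flushed_objs += len;
			len = 0;
			flushes++;
		}
		len += BURST;
	}
	printf("new: %u ring enqueues, %u objects enqueued\n",
	       flushes, flushed_objs);

	return 0;
}

With these numbers, the old code performs 298 ring enqueues of 16 objects
each over the 600 puts, while the new code performs 99 enqueues of 48 objects
each: roughly a third as many, and larger, backend operations.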

2. Minor bug: The flush threshold comparison has been corrected; it must
be "len > flushthresh", not "len >= flushthresh".
Reasoning: Consider a flush multiplier of 1 instead of 1.5; the cache
would be flushed already when reaching size elements, not when exceeding
size elements.
Now, flushing is triggered when the flush threshold is exceeded, not
when reached.
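
As a concrete illustration with made-up numbers (flush multiplier 1, so the
flush threshold equals the cache size of 32, and a put fills the cache
exactly):

#include <assert.h>

int main(void)
{
	unsigned int flushthresh = 32;    /* flush multiplier 1: threshold == size */
	unsigned int len_after_put = 32;  /* the put fills the cache exactly */

	/* Old comparison: flushes although the cache only reaches its size. */
	assert((len_after_put >= flushthresh) == 1);
	/* Corrected comparison: flushes only when the threshold is exceeded. */
	assert((len_after_put > flushthresh) == 0);

	return 0;
}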

3. Optimization: The most recent (hot) objects are flushed, leaving the
oldest (cold) objects in the mempool cache.
This is bad for CPUs with a small L1 cache, because when they get
objects from the mempool after the mempool cache has been flushed, they
get cold objects instead of hot objects.
Now, the existing (cold) objects in the mempool cache are flushed before
the new (hot) objects are added to the mempool cache.

4. Optimization: Using the x86 variant of rte_memcpy() is inefficient
here, where n is relatively small and unknown at compile time.
Now, it has been replaced by an alternative copying method, optimized
for the fact that most Ethernet PMDs operate in bursts of 4 or 8 mbufs
or multiples thereof.
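
The copy loop added by the patch, pulled out here as a standalone helper for
readability (the real code inlines it in rte_mempool_do_generic_put(), see
the diff below): four pointers are copied per iteration, and the remaining
one to three pointers are handled by a fall-through switch.

#include <stdint.h>

static void
copy_objs(void **cache_objs, void * const *obj_table, unsigned int n)
{
	uint32_t index;

	/* Copy four object pointers per iteration. */
	for (index = 0; index < (n & ~0x3); index += 4) {
		cache_objs[index] = obj_table[index];
		cache_objs[index + 1] = obj_table[index + 1];
		cache_objs[index + 2] = obj_table[index + 2];
		cache_objs[index + 3] = obj_table[index + 3];
	}
	/* Copy the remaining 1-3 object pointers. */
	switch (n & 0x3) {
	case 3:
		cache_objs[index] = obj_table[index];
		index++; /* fallthrough */
	case 2:
		cache_objs[index] = obj_table[index];
		index++; /* fallthrough */
	case 1:
		cache_objs[index] = obj_table[index];
	}
}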

v2 changes:

- Not adding the new objects to the mempool cache before flushing it
also allows the memory allocated for the mempool cache to be reduced
from 3 x to 2 x RTE_MEMPOOL_CACHE_MAX_SIZE.
However, such a change would break the ABI, so it was removed in v2.

- The mempool cache should be cache line aligned for the benefit of the
copying method, which on some CPU architectures performs worse on data
crossing a cache boundary.
However, such a change would break the ABI, so it was removed in v2;
and yet another alternative copying method replaced the rte_memcpy().

v3 changes:

- Actually remove my modifications of the rte_mempool_cache structure.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 51 +++++++++++++++++++++++++++++----------
 1 file changed, 38 insertions(+), 13 deletions(-)
  

Comments

Olivier Matz Jan. 24, 2022, 3:39 p.m. UTC | #1
Hi Morten,

On Wed, Jan 19, 2022 at 04:03:01PM +0100, Morten Brørup wrote:
> mempool: fix put objects to mempool with cache
> 
> This patch optimizes the rte_mempool_do_generic_put() caching algorithm,
> and fixes a bug in it.

I think we should avoid grouping fixes and optimizations in one
patch. The main reason is that fixes aim to be backported, which
is not the case for optimizations.

> The existing algorithm was:
>  1. Add the objects to the cache
>  2. Anything greater than the cache size (if it crosses the cache flush
>     threshold) is flushed to the ring.
> 
> Please note that the description in the source code said that it kept
> "cache min value" objects after flushing, but the function actually kept
> "size" objects, which is reflected in the above description.
> 
> Now, the algorithm is:
>  1. If the objects cannot be added to the cache without crossing the
>     flush threshold, flush the cache to the ring.
>  2. Add the objects to the cache.
> 
> This patch changes these details:
> 
> 1. Bug: The cache was still full after flushing.
> In the opposite direction, i.e. when getting objects from the cache, the
> cache is refilled to full level when it crosses the low watermark (which
> happens to be zero).
> Similarly, the cache should be flushed to empty level when it crosses
> the high watermark (which happens to be 1.5 x the size of the cache).
> The existing flushing behaviour was suboptimal for real applications,
> because crossing the low or high watermark typically happens when the
> application is in a state where the number of put/get events is out of
> balance, e.g. when absorbing a burst of packets into a QoS queue
> (getting more mbufs from the mempool), or when a burst of packets is
> trickling out from the QoS queue (putting the mbufs back into the
> mempool).
> NB: When the application is in a state where put/get events are in
> balance, the cache should remain within its low and high watermarks, and
> the algorithms for refilling/flushing the cache should not come into
> play.
> Now, the mempool cache is completely flushed when crossing the flush
> threshold, so only the newly put (hot) objects remain in the mempool
> cache afterwards.

I'm not sure we should call this behavior a bug. What is the impact
on applications, from a user perspective? Can it break a use-case, or
have an important performance impact?


> 2. Minor bug: The flush threshold comparison has been corrected; it must
> be "len > flushthresh", not "len >= flushthresh".
> Reasoning: Consider a flush multiplier of 1 instead of 1.5; the cache
> would be flushed already when reaching size elements, not when exceeding
> size elements.
> Now, flushing is triggered when the flush threshold is exceeded, not
> when reached.

Same here, we should ask ourselves what is the impact before calling
it a bug.


> 3. Optimization: The most recent (hot) objects are flushed, leaving the
> oldest (cold) objects in the mempool cache.
> This is bad for CPUs with a small L1 cache, because when they get
> objects from the mempool after the mempool cache has been flushed, they
> get cold objects instead of hot objects.
> Now, the existing (cold) objects in the mempool cache are flushed before
> the new (hot) objects are added to the mempool cache.
> 
> 4. Optimization: Using the x86 variant of rte_memcpy() is inefficient
> here, where n is relatively small and unknown at compile time.
> Now, it has been replaced by an alternative copying method, optimized
> for the fact that most Ethernet PMDs operate in bursts of 4 or 8 mbufs
> or multiples thereof.

For these optimizations, do you have an idea of what is the performance
gain? Ideally (I understand it is not always possible), each optimization
is done separately, and its impact is measured.


> v2 changes:
> 
> - Not adding the new objects to the mempool cache before flushing it
> also allows the memory allocated for the mempool cache to be reduced
> from 3 x to 2 x RTE_MEMPOOL_CACHE_MAX_SIZE.
> However, such a change would break the ABI, so it was removed in v2.
> 
> - The mempool cache should be cache line aligned for the benefit of the
> copying method, which on some CPU architectures performs worse on data
> crossing a cache boundary.
> However, such a change would break the ABI, so it was removed in v2;
> and yet another alternative copying method replaced the rte_memcpy().

OK, we may want to keep this in mind for the next abi breakage.


> 
> v3 changes:
> 
> - Actually remove my modifications of the rte_mempool_cache structure.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>  lib/mempool/rte_mempool.h | 51 +++++++++++++++++++++++++++++----------
>  1 file changed, 38 insertions(+), 13 deletions(-)
> 
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 1e7a3c1527..7b364cfc74 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -1334,6 +1334,7 @@ static __rte_always_inline void
>  rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
>  			   unsigned int n, struct rte_mempool_cache *cache)
>  {
> +	uint32_t index;
>  	void **cache_objs;
>  
>  	/* increment stat now, adding in mempool always success */
> @@ -1344,31 +1345,56 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
>  	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
>  		goto ring_enqueue;
>  
> -	cache_objs = &cache->objs[cache->len];
> +	/* If the request itself is too big for the cache */
> +	if (unlikely(n > cache->flushthresh))
> +		goto ring_enqueue;
>  
>  	/*
>  	 * The cache follows the following algorithm
> -	 *   1. Add the objects to the cache
> -	 *   2. Anything greater than the cache min value (if it crosses the
> -	 *   cache flush threshold) is flushed to the ring.
> +	 *   1. If the objects cannot be added to the cache without
> +	 *   crossing the flush threshold, flush the cache to the ring.
> +	 *   2. Add the objects to the cache.
>  	 */
>  
> -	/* Add elements back into the cache */
> -	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
> +	if (cache->len + n <= cache->flushthresh) {
> +		cache_objs = &cache->objs[cache->len];
>  
> -	cache->len += n;
> +		cache->len += n;
> +	} else {
> +		cache_objs = cache->objs;
>  
> -	if (cache->len >= cache->flushthresh) {
> -		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
> -				cache->len - cache->size);
> -		cache->len = cache->size;
> +#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
> +		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len) < 0)
> +			rte_panic("cannot put objects in mempool\n");
> +#else
> +		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> +#endif
> +		cache->len = n;
> +	}
> +
> +	/* Add the objects to the cache. */
> +	for (index = 0; index < (n & ~0x3); index += 4) {
> +		cache_objs[index] = obj_table[index];
> +		cache_objs[index + 1] = obj_table[index + 1];
> +		cache_objs[index + 2] = obj_table[index + 2];
> +		cache_objs[index + 3] = obj_table[index + 3];
> +	}
> +	switch (n & 0x3) {
> +	case 3:
> +		cache_objs[index] = obj_table[index];
> +		index++; /* fallthrough */
> +	case 2:
> +		cache_objs[index] = obj_table[index];
> +		index++; /* fallthrough */
> +	case 1:
> +		cache_objs[index] = obj_table[index];
>  	}
>  
>  	return;
>  
>  ring_enqueue:
>  
> -	/* push remaining objects in ring */
> +	/* Put the objects into the ring */
>  #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
>  	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
>  		rte_panic("cannot put objects in mempool\n");
> @@ -1377,7 +1403,6 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
>  #endif
>  }
>  
> -
>  /**
>   * Put several objects back in the mempool.
>   *
> -- 
> 2.17.1
>
  
Morten Brørup Jan. 28, 2022, 9:37 a.m. UTC | #2
> From: Olivier Matz [mailto:olivier.matz@6wind.com]
> Sent: Monday, 24 January 2022 16.39
> 
> Hi Morten,
> 
> On Wed, Jan 19, 2022 at 04:03:01PM +0100, Morten Brørup wrote:
> > mempool: fix put objects to mempool with cache
> >
> > This patch optimizes the rte_mempool_do_generic_put() caching
> algorithm,
> > and fixes a bug in it.
> 
> I think we should avoid grouping fixes and optimizations in one
> patch. The main reason is that fixes aim to be backported, which
> is not the case for optimizations.

OK. I'll separate them.

> 
> > The existing algorithm was:
> >  1. Add the objects to the cache
> >  2. Anything greater than the cache size (if it crosses the cache
> flush
> >     threshold) is flushed to the ring.
> >
> > Please note that the description in the source code said that it kept
> > "cache min value" objects after flushing, but the function actually
> kept
> > "size" objects, which is reflected in the above description.
> >
> > Now, the algorithm is:
> >  1. If the objects cannot be added to the cache without crossing the
> >     flush threshold, flush the cache to the ring.
> >  2. Add the objects to the cache.
> >
> > This patch changes these details:
> >
> > 1. Bug: The cache was still full after flushing.
> > In the opposite direction, i.e. when getting objects from the cache,
> the
> > cache is refilled to full level when it crosses the low watermark
> (which
> > happens to be zero).
> > Similarly, the cache should be flushed to empty level when it crosses
> > the high watermark (which happens to be 1.5 x the size of the cache).
> > The existing flushing behaviour was suboptimal for real applications,
> > because crossing the low or high watermark typically happens when the
> > application is in a state where the number of put/get events is out
> of
> > balance, e.g. when absorbing a burst of packets into a QoS queue
> > (getting more mbufs from the mempool), or when a burst of packets is
> > trickling out from the QoS queue (putting the mbufs back into the
> > mempool).
> > NB: When the application is in a state where put/get events are in
> > balance, the cache should remain within its low and high watermarks,
> and
> > the algorithms for refilling/flushing the cache should not come into
> > play.
> > Now, the mempool cache is completely flushed when crossing the flush
> > threshold, so only the newly put (hot) objects remain in the mempool
> > cache afterwards.
> 
> I'm not sure we should call this behavior a bug. What is the impact
> on applications, from a user perspective? Can it break a use-case, or
> have an important performance impact?

It doesn't break anything.

But it doesn't behave as intended (according to its description in the source code), so I do consider it a bug! Any professional tester, when seeing an implementation that doesn't do what is intended, would also flag the implementation as faulty.

It has a performance impact: It causes many more mempool cache flushes than was intended. I have elaborated by an example here: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D86E54@smartserver.smartshare.dk/T/#t

> 
> 
> > 2. Minor bug: The flush threshold comparison has been corrected; it
> must
> > be "len > flushthresh", not "len >= flushthresh".
> > Reasoning: Consider a flush multiplier of 1 instead of 1.5; the cache
> > would be flushed already when reaching size elements, not when
> exceeding
> > size elements.
> > Now, flushing is triggered when the flush threshold is exceeded, not
> > when reached.
> 
> Same here, we should ask ourselves what is the impact before calling
> it a bug.

It's a classic off-by-one bug.

It only impacts performance, causing premature mempool cache flushing.

Referring to my example in the RFC discussion, this bug causes flushing every 3rd application put() instead of every 4th.

> 
> 
> > 3. Optimization: The most recent (hot) objects are flushed, leaving
> the
> > oldest (cold) objects in the mempool cache.
> > This is bad for CPUs with a small L1 cache, because when they get
> > objects from the mempool after the mempool cache has been flushed,
> they
> > get cold objects instead of hot objects.
> > Now, the existing (cold) objects in the mempool cache are flushed
> before
> > the new (hot) objects are added to the mempool cache.
> >
> > 4. Optimization: Using the x86 variant of rte_memcpy() is inefficient
> > here, where n is relatively small and unknown at compile time.
> > Now, it has been replaced by an alternative copying method, optimized
> > for the fact that most Ethernet PMDs operate in bursts of 4 or 8
> mbufs
> > or multiples thereof.
> 
> For these optimizations, do you have an idea of what is the performance
> gain? Ideally (I understand it is not always possible), each
> optimization
> is done separately, and its impact is measured.

Regarding 3: I don't have access to hardware with a CPU with small L1 cache. But the algorithm was structurally wrong, so I think it should be fixed. Not working with such hardware ourselves, I labeled it an "optimization"... if the patch came from someone with affected hardware, it could reasonably have been labeled a "bug fix".

Regarding 4: I'll stick with rte_memcpy() in the "fix" patch, and provide a separate optimization patch with performance information.

> 
> 
> > v2 changes:
> >
> > - Not adding the new objects to the mempool cache before flushing it
> > also allows the memory allocated for the mempool cache to be reduced
> > from 3 x to 2 x RTE_MEMPOOL_CACHE_MAX_SIZE.
> > However, such a change would break the ABI, so it was removed in
> v2.
> >
> > - The mempool cache should be cache line aligned for the benefit of
> the
> > copying method, which on some CPU architectures performs worse on
> data
> > crossing a cache boundary.
> > However, such a change would break the ABI, so it was removed in
> v2;
> > and yet another alternative copying method replaced the rte_memcpy().
> 
> OK, we may want to keep this in mind for the next abi breakage.

Sounds good.

> 
> 
> >
> > v3 changes:
> >
> > - Actually remove my modifications of the rte_mempool_cache
> structure.
> >
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > ---
> >  lib/mempool/rte_mempool.h | 51 +++++++++++++++++++++++++++++--------
> --
> >  1 file changed, 38 insertions(+), 13 deletions(-)
> >
> > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > index 1e7a3c1527..7b364cfc74 100644
> > --- a/lib/mempool/rte_mempool.h
> > +++ b/lib/mempool/rte_mempool.h
> > @@ -1334,6 +1334,7 @@ static __rte_always_inline void
> >  rte_mempool_do_generic_put(struct rte_mempool *mp, void * const
> *obj_table,
> >  			   unsigned int n, struct rte_mempool_cache *cache)
> >  {
> > +	uint32_t index;
> >  	void **cache_objs;
> >
> >  	/* increment stat now, adding in mempool always success */
> > @@ -1344,31 +1345,56 @@ rte_mempool_do_generic_put(struct rte_mempool
> *mp, void * const *obj_table,
> >  	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
> >  		goto ring_enqueue;
> >
> > -	cache_objs = &cache->objs[cache->len];
> > +	/* If the request itself is too big for the cache */
> > +	if (unlikely(n > cache->flushthresh))
> > +		goto ring_enqueue;
> >
> >  	/*
> >  	 * The cache follows the following algorithm
> > -	 *   1. Add the objects to the cache
> > -	 *   2. Anything greater than the cache min value (if it crosses
> the
> > -	 *   cache flush threshold) is flushed to the ring.
> > +	 *   1. If the objects cannot be added to the cache without
> > +	 *   crossing the flush threshold, flush the cache to the ring.
> > +	 *   2. Add the objects to the cache.
> >  	 */
> >
> > -	/* Add elements back into the cache */
> > -	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
> > +	if (cache->len + n <= cache->flushthresh) {
> > +		cache_objs = &cache->objs[cache->len];
> >
> > -	cache->len += n;
> > +		cache->len += n;
> > +	} else {
> > +		cache_objs = cache->objs;
> >
> > -	if (cache->len >= cache->flushthresh) {
> > -		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
> > -				cache->len - cache->size);
> > -		cache->len = cache->size;
> > +#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
> > +		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache-
> >len) < 0)
> > +			rte_panic("cannot put objects in mempool\n");
> > +#else
> > +		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
> > +#endif
> > +		cache->len = n;
> > +	}
> > +
> > +	/* Add the objects to the cache. */
> > +	for (index = 0; index < (n & ~0x3); index += 4) {
> > +		cache_objs[index] = obj_table[index];
> > +		cache_objs[index + 1] = obj_table[index + 1];
> > +		cache_objs[index + 2] = obj_table[index + 2];
> > +		cache_objs[index + 3] = obj_table[index + 3];
> > +	}
> > +	switch (n & 0x3) {
> > +	case 3:
> > +		cache_objs[index] = obj_table[index];
> > +		index++; /* fallthrough */
> > +	case 2:
> > +		cache_objs[index] = obj_table[index];
> > +		index++; /* fallthrough */
> > +	case 1:
> > +		cache_objs[index] = obj_table[index];
> >  	}
> >
> >  	return;
> >
> >  ring_enqueue:
> >
> > -	/* push remaining objects in ring */
> > +	/* Put the objects into the ring */
> >  #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
> >  	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
> >  		rte_panic("cannot put objects in mempool\n");
> > @@ -1377,7 +1403,6 @@ rte_mempool_do_generic_put(struct rte_mempool
> *mp, void * const *obj_table,
> >  #endif
> >  }
> >
> > -
> >  /**
> >   * Put several objects back in the mempool.
> >   *
> > --
> > 2.17.1
> >
  

Patch

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1e7a3c1527..7b364cfc74 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1334,6 +1334,7 @@  static __rte_always_inline void
 rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 			   unsigned int n, struct rte_mempool_cache *cache)
 {
+	uint32_t index;
 	void **cache_objs;
 
 	/* increment stat now, adding in mempool always success */
@@ -1344,31 +1345,56 @@  rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
 		goto ring_enqueue;
 
-	cache_objs = &cache->objs[cache->len];
+	/* If the request itself is too big for the cache */
+	if (unlikely(n > cache->flushthresh))
+		goto ring_enqueue;
 
 	/*
 	 * The cache follows the following algorithm
-	 *   1. Add the objects to the cache
-	 *   2. Anything greater than the cache min value (if it crosses the
-	 *   cache flush threshold) is flushed to the ring.
+	 *   1. If the objects cannot be added to the cache without
+	 *   crossing the flush threshold, flush the cache to the ring.
+	 *   2. Add the objects to the cache.
 	 */
 
-	/* Add elements back into the cache */
-	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
+	if (cache->len + n <= cache->flushthresh) {
+		cache_objs = &cache->objs[cache->len];
 
-	cache->len += n;
+		cache->len += n;
+	} else {
+		cache_objs = cache->objs;
 
-	if (cache->len >= cache->flushthresh) {
-		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
-				cache->len - cache->size);
-		cache->len = cache->size;
+#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
+		if (rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len) < 0)
+			rte_panic("cannot put objects in mempool\n");
+#else
+		rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
+#endif
+		cache->len = n;
+	}
+
+	/* Add the objects to the cache. */
+	for (index = 0; index < (n & ~0x3); index += 4) {
+		cache_objs[index] = obj_table[index];
+		cache_objs[index + 1] = obj_table[index + 1];
+		cache_objs[index + 2] = obj_table[index + 2];
+		cache_objs[index + 3] = obj_table[index + 3];
+	}
+	switch (n & 0x3) {
+	case 3:
+		cache_objs[index] = obj_table[index];
+		index++; /* fallthrough */
+	case 2:
+		cache_objs[index] = obj_table[index];
+		index++; /* fallthrough */
+	case 1:
+		cache_objs[index] = obj_table[index];
 	}
 
 	return;
 
 ring_enqueue:
 
-	/* push remaining objects in ring */
+	/* Put the objects into the ring */
 #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
 	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
 		rte_panic("cannot put objects in mempool\n");
@@ -1377,7 +1403,6 @@  rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 #endif
 }
 
-
 /**
  * Put several objects back in the mempool.
  *