mempool: optimize incomplete cache handling

Message ID 20220106122345.82740-1-mb@smartsharesystems.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series: mempool: optimize incomplete cache handling

Checks

Context Check Description
ci/checkpatch warning coding style issues
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/github-robot: build success github build: passed
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-abi-testing success Testing PASS
ci/iol-aarch64-unit-testing success Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS
ci/iol-x86_64-compile-testing success Testing PASS
ci/iol-x86_64-unit-testing success Testing PASS
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS

Commit Message

Morten Brørup Jan. 6, 2022, 12:23 p.m. UTC
  A flush threshold for the mempool cache was introduced in DPDK version
1.3, but rte_mempool_do_generic_get() was not completely updated back
then.

The incompleteness did not cause any functional bugs, so this patch
could be considered refactoring for the purpose of cleaning up.

This patch completes the update of rte_mempool_do_generic_get() as
follows:

1. A few comments were misplaced or no longer correct.
Some comments have been updated/added/corrected.

2. The code that initially screens the cache request was not updated.
The initial screening compared the request length to the cache size,
which was correct before, but became irrelevant with the introduction of
the flush threshold. E.g. the cache can hold up to flushthresh objects,
which is more than its size, so some requests were not served from the
cache, even though they could be.
The initial screening has now been corrected to match the initial
screening in rte_mempool_do_generic_put(), which verifies that a cache
is present, and that the length of the request does not overflow the
memory allocated for the cache.

3. The code flow for satisfying the request from the cache was weird.
The likely code path where the objects are simply served from the cache
was treated as unlikely; now it is treated as likely.
And in the code path where the cache was backfilled first, numbers were
added and subtracted from the cache length; now this code path simply
sets the cache length to its final value.

4. The objects were returned in reverse order.
Returning the objects in reverse order is not necessary, so rte_memcpy()
is now used instead.

This patch also updates/adds/corrects some comments in
rte_mempool_do_generic_put().

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 36 +++++++++++++++++++-----------------
 1 file changed, 19 insertions(+), 17 deletions(-)
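
For reference, a condensed sketch of the new get path, pieced together from the diff at the bottom of this page (sketch only: the function name is invented, and the statistics and debug macros of the real function are omitted):

#include <rte_branch_prediction.h>
#include <rte_memcpy.h>
#include <rte_mempool.h>

/* Sketch of the patched rte_mempool_do_generic_get() flow; condensed from
 * the diff below, with stats and debug code omitted. */
static inline int
sketch_do_generic_get(struct rte_mempool *mp, void **obj_table,
        unsigned int n, struct rte_mempool_cache *cache)
{
    void **cache_objs;

    /* No cache, or the get would overflow the memory allocated for the cache */
    if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
        return rte_mempool_ops_dequeue_bulk(mp, obj_table, n);

    cache_objs = cache->objs;

    if (n <= cache->len) {
        /* Likely case: the request is served directly from the cache */
        cache->len -= n;
    } else {
        /* Backfill the cache from the ring first: fill-to-size + n objects */
        uint32_t req = n + (cache->size - cache->len);

        if (unlikely(rte_mempool_ops_dequeue_bulk(mp,
                &cache->objs[cache->len], req) < 0)) {
            /* Ring too short to backfill; fall back to a plain ring dequeue */
            return rte_mempool_ops_dequeue_bulk(mp, obj_table, n);
        }

        /* The cache length ends at its final value: backfilled length - n */
        cache->len = cache->size;
    }

    /* Copy out n objects from the top of the stack, in forward order */
    rte_memcpy(obj_table, &cache_objs[cache->len], sizeof(void *) * n);
    return 0;
}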
  

Comments

Jerin Jacob Jan. 6, 2022, 4:55 p.m. UTC | #1
On Thu, Jan 6, 2022 at 5:54 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> A flush threshold for the mempool cache was introduced in DPDK version
> 1.3, but rte_mempool_do_generic_get() was not completely updated back
> then.
>
> The incompleteness did not cause any functional bugs, so this patch
> could be considered refactoring for the purpose of cleaning up.
>
> This patch completes the update of rte_mempool_do_generic_get() as
> follows:
>
> 1. A few comments were misplaced or no longer correct.
> Some comments have been updated/added/corrected.
>
> 2. The code that initially screens the cache request was not updated.
> The initial screening compared the request length to the cache size,
> which was correct before, but became irrelevant with the introduction of
> the flush threshold. E.g. the cache can hold up to flushthresh objects,
> which is more than its size, so some requests were not served from the
> cache, even though they could be.
> The initial screening has now been corrected to match the initial
> screening in rte_mempool_do_generic_put(), which verifies that a cache
> is present, and that the length of the request does not overflow the
> memory allocated for the cache.
>
> 3. The code flow for satisfying the request from the cache was weird.
> The likely code path where the objects are simply served from the cache
> was treated as unlikely; now it is treated as likely.
> And in the code path where the cache was backfilled first, numbers were
> added and subtracted from the cache length; now this code path simply
> sets the cache length to its final value.
>
> 4. The objects were returned in reverse order.
> Returning the objects in reverse order is not necessary, so rte_memcpy()
> is now used instead.

Have you checked the performance with network workload?
IMO, reverse order makes sense (LIFO vs FIFO).
The LIFO makes the cache warm as the same buffers are reused frequently.
  
Morten Brørup Jan. 7, 2022, 8:46 a.m. UTC | #2
> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> Sent: Thursday, 6 January 2022 17.55
> 
> On Thu, Jan 6, 2022 at 5:54 PM Morten Brørup <mb@smartsharesystems.com>
> wrote:
> >
> > A flush threshold for the mempool cache was introduced in DPDK
> version
> > 1.3, but rte_mempool_do_generic_get() was not completely updated back
> > then.
> >
> > The incompleteness did not cause any functional bugs, so this patch
> > could be considered refactoring for the purpose of cleaning up.
> >
> > This patch completes the update of rte_mempool_do_generic_get() as
> > follows:
> >
> > 1. A few comments were misplaced or no longer correct.
> > Some comments have been updated/added/corrected.
> >
> > 2. The code that initially screens the cache request was not updated.
> > The initial screening compared the request length to the cache size,
> > which was correct before, but became irrelevant with the introduction
> of
> > the flush threshold. E.g. the cache can hold up to flushthresh
> objects,
> > which is more than its size, so some requests were not served from
> the
> > cache, even though they could be.
> > The initial screening has now been corrected to match the initial
> > screening in rte_mempool_do_generic_put(), which verifies that a
> cache
> > is present, and that the length of the request does not overflow the
> > memory allocated for the cache.
> >
> > 3. The code flow for satisfying the request from the cache was weird.
> > The likely code path where the objects are simply served from the
> cache
> > was treated as unlikely; now it is treated as likely.
> > And in the code path where the cache was backfilled first, numbers
> were
> > added and subtracted from the cache length; now this code path simply
> > sets the cache length to its final value.
> >
> > 4. The objects were returned in reverse order.
> > Returning the objects in reverse order is not necessary, so
> rte_memcpy()
> > is now used instead.
> 
> Have you checked the performance with network workload?
> IMO, reverse order makes sense (LIFO vs FIFO).
> The LIFO makes the cache warm as the same buffers are reused
> frequently.

I have not done any performance testing. We probably agree that the only major difference lies in how the objects are returned. And we probably also agree that rte_memcpy() is faster than the copy loop it replaced, especially when n is constant at compile time. So the performance difference mainly depends on the application, which I will discuss below.

Let's first consider LIFO vs. FIFO.

The key argument for the rte_memcpy() optimization is that we are still getting the burst of objects from the top of the stack (LIFO); only the order of the objects inside the burst is not reverse anymore.

Here is an example:

The cache initially contains 8 objects: 01234567.

8 more objects are put into the cache: 89ABCDEF.

The cache now holds: 0123456789ABCDEF.

Getting 4 objects from the cache gives us CDEF instead of FEDC, i.e. we are still getting the 4 objects most recently put into the cache.

Furthermore, if the application is working with fixed size bursts, it will usually put and get the same size burst, i.e. put the burst 89ABCDEF into the cache, and then get the burst 89ABCDEF from the cache again.
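
A tiny stand-alone demo of the point above (plain C, not DPDK code; the object pointers are modelled as characters):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char cache[] = "0123456789ABCDEF"; /* the cache stack, top at index 15 */
    unsigned int len = 16, n = 4;
    char out_old[5] = "", out_new[5] = "";

    /* Old code: copy n objects from the top of the stack, in reverse order */
    for (unsigned int i = 0, j = len - 1; i < n; i++, j--)
        out_old[i] = cache[j];

    /* New code: decrease the cache length first, then bulk-copy forward */
    len -= n;
    memcpy(out_new, &cache[len], n);

    printf("old: %s, new: %s\n", out_old, out_new); /* old: FEDC, new: CDEF */
    return 0;
}

Either way, the burst is taken from the top of the stack; only the order inside the burst differs.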


Here is an example unfavorable scenario:

The cache initially contains 4 objects, which have gone cold: 0123.

4 more objects, which happen to be hot, are put into the cache: 4567.

Getting 8 objects from the cache gives us 01234567 instead of 76543210.

Now, if the application only processes the first 4 of the 8 objects in the burst, it would have benefitted from those objects being the hot 7654 objects instead of the cold 0123 objects.

However, I think that most applications process the complete burst, so I do consider this scenario unlikely.

Similarly, a pipelined application doesn't process objects in reverse order at every other step in the pipeline, even though the previous step in the pipeline most recently touched the last object of the burst.


My overall conclusion was that the benefit of using rte_memcpy() outweighs the disadvantage of the unfavorable scenario, because I consider the probability of the unfavorable scenario occurring very low. But again: it mainly depends on the application.

If anyone disagrees with the risk analysis described above, I will happily provide a version 2 of the patch, where the objects are still returned in reverse order. After all, the rte_memcpy() benefit is relatively small compared to the impact if the unlikely scenario occurs.
  
Jerin Jacob Jan. 10, 2022, 7:26 a.m. UTC | #3
On Fri, Jan 7, 2022 at 2:16 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> > Sent: Thursday, 6 January 2022 17.55
> >
> > On Thu, Jan 6, 2022 at 5:54 PM Morten Brørup <mb@smartsharesystems.com>
> > wrote:
> > >
> > > A flush threshold for the mempool cache was introduced in DPDK
> > version
> > > 1.3, but rte_mempool_do_generic_get() was not completely updated back
> > > then.
> > >
> > > The incompleteness did not cause any functional bugs, so this patch
> > > could be considered refactoring for the purpose of cleaning up.
> > >
> > > This patch completes the update of rte_mempool_do_generic_get() as
> > > follows:
> > >
> > > 1. A few comments were misplaced or no longer correct.
> > > Some comments have been updated/added/corrected.
> > >
> > > 2. The code that initially screens the cache request was not updated.
> > > The initial screening compared the request length to the cache size,
> > > which was correct before, but became irrelevant with the introduction
> > of
> > > the flush threshold. E.g. the cache can hold up to flushthresh
> > objects,
> > > which is more than its size, so some requests were not served from
> > the
> > > cache, even though they could be.
> > > The initial screening has now been corrected to match the initial
> > > screening in rte_mempool_do_generic_put(), which verifies that a
> > cache
> > > is present, and that the length of the request does not overflow the
> > > memory allocated for the cache.
> > >
> > > 3. The code flow for satisfying the request from the cache was weird.
> > > The likely code path where the objects are simply served from the
> > cache
> > > was treated as unlikely; now it is treated as likely.
> > > And in the code path where the cache was backfilled first, numbers
> > were
> > > added and subtracted from the cache length; now this code path simply
> > > sets the cache length to its final value.
> > >
> > > 4. The objects were returned in reverse order.
> > > Returning the objects in reverse order is not necessary, so
> > rte_memcpy()
> > > is now used instead.
> >
> > Have you checked the performance with network workload?
> > IMO, reverse order makes sense (LIFO vs FIFO).
> > The LIFO makes the cache warm as the same buffers are reused
> > frequently.
>
> I have not done any performance testing. We probably agree that the only major difference lies in how the objects are returned. And we probably also agree that rte_memcpy() is faster than the copy loop it replaced, especially when n is constant at compile time. So the performance difference mainly depends on the application, which I will discuss below.
>
> Let's first consider LIFO vs. FIFO.
>
> The key argument for the rte_memcpy() optimization is that we are still getting the burst of objects from the top of the stack (LIFO); only the order of the objects inside the burst is not reverse anymore.
>
> Here is an example:
>
> The cache initially contains 8 objects: 01234567.
>
> 8 more objects are put into the cache: 89ABCDEF.
>
> The cache now holds: 0123456789ABCDEF.

Agree. However, I think it may matter on machines with smaller L1
caches and with larger burst sizes, where what ends up in L1 under
this scheme plays a bigger role.

I would suggest splitting each performance improvement into a separate
patch, for better tracking and quantification of the performance improvement.

I think the mempool performance test and the tx-only stream mode in
testpmd can quantify the patches.



>
> Getting 4 objects from the cache gives us CDEF instead of FEDC, i.e. we are still getting the 4 objects most recently put into the cache.
>
> Furthermore, if the application is working with fixed size bursts, it will usually put and get the same size burst, i.e. put the burst 89ABCDEF into the cache, and then get the burst 89ABCDEF from the cache again.
>
>
> Here is an example unfavorable scenario:
>
> The cache initially contains 4 objects, which have gone cold: 0123.
>
> 4 more objects, which happen to be hot, are put into the cache: 4567.
>
> Getting 8 objects from the cache gives us 01234567 instead of 76543210.
>
> Now, if the application only processes the first 4 of the 8 objects in the burst, it would have benefitted from those objects being the hot 7654 objects instead of the cold 0123 objects.
>
> However, I think that most applications process the complete burst, so I do consider this scenario unlikely.
>
> Similarly, a pipelined application doesn't process objects in reverse order at every other step in the pipeline, even though the previous step in the pipeline most recently touched the last object of the burst.
>
>
> My overall conclusion was that the benefit of using rte_memcpy() outweighs the disadvantage of the unfavorable scenario, because I consider the probability of the unfavorable scenario occurring very low. But again: it mainly depends on the application.
>
> If anyone disagrees with the risk analysis described above, I will happily provide a version 2 of the patch, where the objects are still returned in reverse order. After all, the rte_memcpy() benefit is relatively small compared to the impact if the unlikely scenario occurs.
>
  
Morten Brørup Jan. 10, 2022, 10:55 a.m. UTC | #4
+Bruce; you seemed interested in my work in this area.

> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> Sent: Monday, 10 January 2022 08.27
> 
> On Fri, Jan 7, 2022 at 2:16 PM Morten Brørup <mb@smartsharesystems.com>
> wrote:
> >
> > > From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> > > Sent: Thursday, 6 January 2022 17.55
> > >
> > > On Thu, Jan 6, 2022 at 5:54 PM Morten Brørup
> <mb@smartsharesystems.com>
> > > wrote:
> > > >
> > > > A flush threshold for the mempool cache was introduced in DPDK
> > > version
> > > > 1.3, but rte_mempool_do_generic_get() was not completely updated
> back
> > > > then.
> > > >
> > > > The incompleteness did not cause any functional bugs, so this
> patch
> > > > could be considered refactoring for the purpose of cleaning up.
> > > >
> > > > This patch completes the update of rte_mempool_do_generic_get()
> as
> > > > follows:
> > > >
> > > > 1. A few comments were misplaced or no longer correct.
> > > > Some comments have been updated/added/corrected.
> > > >
> > > > 2. The code that initially screens the cache request was not
> updated.
> > > > The initial screening compared the request length to the cache
> size,
> > > > which was correct before, but became irrelevant with the
> introduction
> > > of
> > > > the flush threshold. E.g. the cache can hold up to flushthresh
> > > objects,
> > > > which is more than its size, so some requests were not served
> from
> > > the
> > > > cache, even though they could be.
> > > > The initial screening has now been corrected to match the initial
> > > > screening in rte_mempool_do_generic_put(), which verifies that a
> > > cache
> > > > is present, and that the length of the request does not overflow
> the
> > > > memory allocated for the cache.
> > > >
> > > > 3. The code flow for satisfying the request from the cache was
> weird.
> > > > The likely code path where the objects are simply served from the
> > > cache
> > > > was treated as unlikely; now it is treated as likely.
> > > > And in the code path where the cache was backfilled first,
> numbers
> > > were
> > > > added and subtracted from the cache length; now this code path
> simply
> > > > sets the cache length to its final value.
> > > >
> > > > 4. The objects were returned in reverse order.
> > > > Returning the objects in reverse order is not necessary, so
> > > rte_memcpy()
> > > > is now used instead.
> > >
> > > Have you checked the performance with network workload?
> > > IMO, reverse order makes sense (LIFO vs FIFO).
> > > The LIFO makes the cache warm as the same buffers are reused
> > > frequently.
> >
> > I have not done any performance testing. We probably agree that the
> only major difference lies in how the objects are returned. And we
> probably also agree that rte_memcpy() is faster than the copy loop it
> replaced, especially when n is constant at compile time. So the
> performance difference mainly depends on the application, which I will
> discuss below.
> >
> > Let's first consider LIFO vs. FIFO.
> >
> > The key argument for the rte_memcpy() optimization is that we are
> still getting the burst of objects from the top of the stack (LIFO);
> only the order of the objects inside the burst is not reverse anymore.
> >
> > Here is an example:
> >
> > The cache initially contains 8 objects: 01234567.
> >
> > 8 more objects are put into the cache: 89ABCDEF.
> >
> > The cache now holds: 0123456789ABCDEF.
> 
> Agree. However, I think it may matter on machines with smaller L1
> caches and with larger burst sizes, where what ends up in L1 under
> this scheme plays a bigger role.

Good point! Thinking further about it made me realize that the mempool cache flushing algorithm is fundamentally flawed, at least in some cases...


rte_mempool_do_generic_put():

When putting objects into the cache, and the cache length exceeds the flush threshold, the most recent (hot) objects are flushed to the ring, thus leaving the less recent (colder) objects at the top of the cache stack.

Example (cache size: 8, flush threshold: 12, put 8 objects):

Initial cache: 01234567

Cache after putting (hot) objects 89ABCDEF: 0123456789ABCDEF

Cache flush threshold reached. Resulting cache: 01234567

Furthermore, the cache has to be completely depleted before the hot objects that were flushed to the ring are retrieved from the ring again.
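
The put example above as a small stand-alone model (plain C, not DPDK code; objects modelled as characters, cache size 8, flush threshold 12):

#include <stdio.h>
#include <string.h>

#define CACHE_SIZE   8
#define FLUSHTHRESH 12

int main(void)
{
    char cache[32] = "01234567"; /* existing (cold) objects */
    unsigned int len = 8;

    /* Put 8 hot objects, as rte_mempool_do_generic_put() does */
    memcpy(&cache[len], "89ABCDEF", 8);
    len += 8;

    if (len >= FLUSHTHRESH) {
        /* Flush cache[CACHE_SIZE..len-1], i.e. the most recent (hot) objects */
        printf("flushed to ring: %.*s\n",
               (int)(len - CACHE_SIZE), &cache[CACHE_SIZE]);
        len = CACHE_SIZE;
    }
    cache[len] = '\0';
    printf("left in cache:   %s\n", cache); /* 01234567, i.e. the cold ones */
    return 0;
}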


rte_mempool_do_generic_get():

When getting objects from the cache, and the cache does not hold the requested number of objects, the cache will first be backfilled from the ring, thus putting colder objects at the top of the cache stack, and then the objects will be returned from the top of the cache stack, i.e. the backfilled (cold) objects will be returned first.

Example (cache size: 8, get 8 objects):

Initial cache: 0123 (hot or lukewarm)

Cache after backfill to size + requested objects: 0123456789ABCDEF

Returned objects: FEDCBA98 (cold)

Cache after returning objects: 01234567 (i.e. cold objects at the top)
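
And the get example above as a matching stand-alone model (plain C, not DPDK code; cache size 8, using the old reverse-order copy):

#include <stdio.h>
#include <string.h>

#define CACHE_SIZE 8

int main(void)
{
    char cache[32] = "0123"; /* hot or lukewarm objects */
    unsigned int len = 4, n = 8;
    char out[16] = "";

    if (n > len) {
        /* Backfill "number to fill the cache" + n objects from the ring */
        unsigned int req = n + (CACHE_SIZE - len);

        memcpy(&cache[len], "456789ABCDEF", req); /* 12 cold ring objects */
        len += req;
    }

    /* Old code: return the top of the stack in reverse order */
    for (unsigned int i = 0, j = len - 1; i < n; i++, j--)
        out[i] = cache[j];
    len -= n;

    cache[len] = '\0';
    printf("returned: %s, cache afterwards: %s\n", out, cache);
    /* returned: FEDCBA98, cache afterwards: 01234567 (cold objects on top) */
    return 0;
}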


> 
> I would suggest splitting each performance improvement into a separate
> patch, for better tracking and quantification of the performance improvement.

With the new realizations above, I should reconsider my patch from scratch.

I have also been wondering if the mempool cache size really needs to be configurable, or if it could be a fixed size.

Bruce mentioned in another thread (http://inbox.dpdk.org/dev/Ydv%2FIMz8eIRPSguY@bricha3-MOBL.ger.corp.intel.com/T/#m02cabb25655c08a0980888df8c41aba9ac8dd6ff) that the typical configuration of the cache size is RTE_MEMPOOL_CACHE_MAX_SIZE.

I would dare to say that it probably suffices to configure whether the mempool has a cache or not! The cache_size parameter of rte_mempool_create() is not respected 1:1 anyway (because each per-lcore cache may consume up to 1.5 x cache_size objects from the mempool backing store), so the cache_size parameter could simply be parsed as non-zero vs. zero to determine whether a cache is wanted.
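
A rough stand-alone illustration of the 1.5 x cache_size point (plain C; the 1.5 multiplier is the flush-threshold factor referred to above, hard-coded here for illustration):

#include <stdio.h>

int main(void)
{
    unsigned int cache_size = 256; /* what the application asked for */
    unsigned int flushthresh = (unsigned int)(cache_size * 1.5);

    /* Each per-lcore cache may hold up to flushthresh objects, so it can
     * consume more than cache_size objects from the backing store. */
    printf("cache_size=%u -> flushthresh=%u\n", cache_size, flushthresh);
    return 0;
}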

> 
> I think the mempool performance test and the tx-only stream mode in
> testpmd can quantify the patches.
> 
> 
> 
> >
> > Getting 4 objects from the cache gives us CDEF instead of FEDC, i.e.
> we are still getting the 4 objects most recently put into the cache.
> >
> > Furthermore, if the application is working with fixed size bursts, it
> will usually put and get the same size burst, i.e. put the burst
> 89ABCDEF into the cache, and then get the burst 89ABCDEF from the cache
> again.
> >
> >
> > Here is an example unfavorable scenario:
> >
> > The cache initially contains 4 objects, which have gone cold: 0123.
> >
> > 4 more objects, which happen to be hot, are put into the cache: 4567.
> >
> > Getting 8 objects from the cache gives us 01234567 instead of
> 76543210.
> >
> > Now, if the application only processes the first 4 of the 8 objects
> in the burst, it would have benefitted from those objects being the hot
> 7654 objects instead of the cold 0123 objects.
> >
> > However, I think that most applications process the complete burst,
> so I do consider this scenario unlikely.
> >
> > Similarly, a pipelined application doesn't process objects in reverse
> order at every other step in the pipeline, even though the previous
> step in the pipeline most recently touched the last object of the
> burst.
> >
> >
> > My overall conclusion was that the benefit of using rte_memcpy()
> outweighs the disadvantage of the unfavorable scenario, because I
> consider the probability of the unfavorable scenario occurring very
> low. But again: it mainly depends on the application.
> >
> > If anyone disagrees with the risk analysis described above, I will
> happily provide a version 2 of the patch, where the objects are still
> returned in reverse order. After all, the rte_memcpy() benefit is
> relatively small compared to the impact if the unlikely scenario
> occurs.
> >
  

Patch

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1e7a3c1527..4c36ad6dd1 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -1353,12 +1353,12 @@  rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 	 *   cache flush threshold) is flushed to the ring.
 	 */
 
-	/* Add elements back into the cache */
+	/* Add the objects to the cache */
 	rte_memcpy(&cache_objs[0], obj_table, sizeof(void *) * n);
-
 	cache->len += n;
 
 	if (cache->len >= cache->flushthresh) {
+		/* Flush excess objects in the cache to the ring */
 		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
 				cache->len - cache->size);
 		cache->len = cache->size;
@@ -1368,7 +1368,7 @@  rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
 
 ring_enqueue:
 
-	/* push remaining objects in ring */
+	/* Put the objects into the ring */
 #ifdef RTE_LIBRTE_MEMPOOL_DEBUG
 	if (rte_mempool_ops_enqueue_bulk(mp, obj_table, n) < 0)
 		rte_panic("cannot put objects in mempool\n");
@@ -1460,21 +1460,25 @@  rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 			   unsigned int n, struct rte_mempool_cache *cache)
 {
 	int ret;
-	uint32_t index, len;
 	void **cache_objs;
 
-	/* No cache provided or cannot be satisfied from cache */
-	if (unlikely(cache == NULL || n >= cache->size))
+	/* No cache provided or if get would overflow mem allocated for cache */
+	if (unlikely(cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE))
 		goto ring_dequeue;
 
 	cache_objs = cache->objs;
 
-	/* Can this be satisfied from the cache? */
-	if (cache->len < n) {
-		/* No. Backfill the cache first, and then fill from it */
+	/* Can the request be satisfied from the cache? */
+	if (n <= cache->len) {
+		/* Yes. Simply decrease the cache length */
+		cache->len -= n;
+	} else {
+		/* No. Backfill the cache from the ring first */
+
+		/* Number required to fill the cache + n */
 		uint32_t req = n + (cache->size - cache->len);
 
-		/* How many do we require i.e. number to fill the cache + the request */
+		/* Backfill the cache from the ring */
 		ret = rte_mempool_ops_dequeue_bulk(mp,
 			&cache->objs[cache->len], req);
 		if (unlikely(ret < 0)) {
@@ -1487,14 +1491,12 @@  rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 			goto ring_dequeue;
 		}
 
-		cache->len += req;
+		/* Set the length of the backfilled cache - n */
+		cache->len = cache->size;
 	}
 
-	/* Now fill in the response ... */
-	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
-		*obj_table = cache_objs[len];
-
-	cache->len -= n;
+	/* Get the objects from the cache, at the already decreased offset */
+	rte_memcpy(obj_table, &cache_objs[cache->len], sizeof(void *) * n);
 
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_bulk, 1);
 	RTE_MEMPOOL_STAT_ADD(mp, get_success_objs, n);
@@ -1503,7 +1505,7 @@  rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
 
 ring_dequeue:
 
-	/* get remaining objects from ring */
+	/* Get the objects from the ring */
 	ret = rte_mempool_ops_dequeue_bulk(mp, obj_table, n);
 
 	if (ret < 0) {