mempool: cache align mempool cache objects

Message ID 20221026144436.71068-1-mb@smartsharesystems.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series mempool: cache align mempool cache objects

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS
ci/github-robot: build success github build: passed
ci/iol-x86_64-unit-testing success Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-aarch64-unit-testing success Testing PASS
ci/iol-intel-Performance success Performance Testing PASS

Commit Message

Morten Brørup Oct. 26, 2022, 2:44 p.m. UTC
  Add __rte_cache_aligned to the objs array.

It makes no difference in the general case, but if get/put operations
always move 32 objects, it will reduce the number of 64 B cache lines
accessed in memory (or the last-level cache) from five to four for
every get/put operation.

For readability reasons, an example using 16 objects follows:

Currently, with 16 objects (128B), we access 3 cache lines:

      ┌────────┐
      │len     │
cache │********│---
line0 │********│ ^
      │********│ |
      ├────────┤ | 16 objects
      │********│ | 128B
cache │********│ |
line1 │********│ |
      │********│ |
      ├────────┤ |
      │********│_v_
cache │        │
line2 │        │
      │        │
      └────────┘

With the alignment, it is also 3 cache lines:

      ┌────────┐
      │len     │
cache │        │
line0 │        │
      │        │
      ├────────┤---
      │********│ ^
cache │********│ |
line1 │********│ |
      │********│ |
      ├────────┤ | 16 objects
      │********│ | 128B
cache │********│ |
line2 │********│ |
      │********│ v
      └────────┘---

However, accessing the objects at the bottom of the mempool cache is a
special case, where cache line0 is also used for objects.

Consider the next burst (and any following bursts):

Current:
      ┌────────┐
      │len     │
cache │        │
line0 │        │
      │        │
      ├────────┤
      │        │
cache │        │
line1 │        │
      │        │
      ├────────┤
      │        │
cache │********│---
line2 │********│ ^
      │********│ |
      ├────────┤ | 16 objects
      │********│ | 128B
cache │********│ |
line3 │********│ |
      │********│ |
      ├────────┤ |
      │********│_v_
cache │        │
line4 │        │
      │        │
      └────────┘
4 cache lines touched, incl. line0 for len.

With the proposed alignment:
      ┌────────┐
      │len     │
cache │        │
line0 │        │
      │        │
      ├────────┤
      │        │
cache │        │
line1 │        │
      │        │
      ├────────┤
      │        │
cache │        │
line2 │        │
      │        │
      ├────────┤
      │********│---
cache │********│ ^
line3 │********│ |
      │********│ | 16 objects
      ├────────┤ | 128B
      │********│ |
cache │********│ |
line4 │********│ |
      │********│_v_
      └────────┘
Only 3 cache lines touched, incl. line0 for len.

Credits go to Olivier Matz for the nice ASCII graphics.

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/mempool/rte_mempool.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
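
To make the layout concrete, here is a minimal standalone sketch (not
part of the patch; the struct and macro names are illustrative) of the
layout before and after, assuming 64 B cache lines and 64-bit pointers,
with standard C11 alignas standing in for DPDK's __rte_cache_aligned
and the fields mirroring struct rte_mempool_cache:

#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64
#define CACHE_MAX_SIZE  512 /* mirrors RTE_MEMPOOL_CACHE_MAX_SIZE */

struct cache_before {                   /* without the patch */
	uint32_t size;
	uint32_t flushthresh;
	uint32_t len;
	void *objs[CACHE_MAX_SIZE * 2]; /* starts at offset 16, mid-line */
};

struct cache_after {                    /* with the patch */
	uint32_t size;
	uint32_t flushthresh;
	uint32_t len;
	alignas(CACHE_LINE_SIZE)
	void *objs[CACHE_MAX_SIZE * 2]; /* starts at offset 64 */
};

/* With the alignment, objs never shares a cache line with len. */
_Static_assert(offsetof(struct cache_after, objs) % CACHE_LINE_SIZE == 0,
	       "objs must start on a cache line boundary");

A 32-object burst (256 B) that does not begin on a cache line boundary
spans five cache lines; with objs aligned, a burst starting at a
multiple of 8 objects spans exactly four.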
  

Comments

Andrew Rybchenko Oct. 26, 2022, 7:44 p.m. UTC | #1
On 10/26/22 17:44, Morten Brørup wrote:
> [... full patch description and diagrams trimmed ...]
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>

Reviewed-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
  
Olivier Matz Oct. 27, 2022, 8:34 a.m. UTC | #2
Hi Morten,

On Wed, Oct 26, 2022 at 04:44:36PM +0200, Morten Brørup wrote:
> [... full patch description and diagrams trimmed ...]
> Only 3 cache lines touched, incl. line0 for len.

I understand your logic, but are we sure that an application working
in bulks of 32 means the cache will stay aligned to 32 elements for
the whole life of the application?

In an application, the alignment of the cache can change if you have
any of:
- software queues (reassembly for instance)
- packet duplication (bridge, multicast)
- locally generated packets (keepalive, control protocol)
- pipeline to other cores

Even with testpmd, which works in bulks of 32, I can see that the
cache fill level is not aligned to 32. Right after starting the
application, we already have this:

  internal cache infos:
    cache_size=250
    cache_count[0]=231

This is probably related to the hw rx rings size, number of queues,
number of ports.

The "250" default value for cache size in testpmd is questionable, but
with --mbcache=256, the behavior is similar.

Also, when we transmit to a NIC, the mbufs are not returned immediately
to the pool; they may stay in the hw tx ring for some time, which is
a driver decision.

After processing traffic on cores 8 and 24 with this testpmd, I get:
    cache_count[0]=231
    cache_count[8]=123
    cache_count[24]=122

In my opinion, it is not realistic to think that the mempool cache will
remain aligned to cachelines. In these conditions, it looks better to
keep the structure packed to avoid wasting memory.

Olivier


> 
> Credits go to Olivier Matz for the nice ASCII graphics.
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>  lib/mempool/rte_mempool.h | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> index 1f5707f46a..3725a72951 100644
> --- a/lib/mempool/rte_mempool.h
> +++ b/lib/mempool/rte_mempool.h
> @@ -86,11 +86,13 @@ struct rte_mempool_cache {
>  	uint32_t size;	      /**< Size of the cache */
>  	uint32_t flushthresh; /**< Threshold before we flush excess elements */
>  	uint32_t len;	      /**< Current cache count */
> -	/*
> +	/**
> +	 * Cache objects
> +	 *
>  	 * Cache is allocated to this size to allow it to overflow in certain
>  	 * cases to avoid needless emptying of cache.
>  	 */
> -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache objects */
> +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
>  } __rte_cache_aligned;
>  
>  /**
> -- 
> 2.17.1
>
  
Morten Brørup Oct. 27, 2022, 9:22 a.m. UTC | #3
> From: Olivier Matz [mailto:olivier.matz@6wind.com]
> Sent: Thursday, 27 October 2022 10.35
> 
> [... earlier discussion trimmed ...]
> 
> In my opinion, it is not realistic to think that the mempool cache will
> remain aligned to cachelines. In these conditions, it looks better to
> keep the structure packed to avoid wasting memory.

I agree that it is a special use case to access the mempool cache only in bursts of 32 objects, so that the accesses are always cache line aligned. (Generalized, the burst size need not be 32; any burst size that is a multiple of RTE_CACHE_LINE_SIZE/sizeof(void*), i.e. a multiple of 8 on a 64-bit architecture, will do.)
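
As a sketch of that condition, assuming a 64 B cache line (the helper
name is illustrative): a burst keeps the objs accesses line aligned
exactly when it moves a whole number of cache lines' worth of pointers:

#include <stdbool.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64

/* True when n_objs pointers cover a whole number of cache lines,
 * i.e. when n_objs is a multiple of 8 with 64-bit pointers
 * (8, 16, 24, 32, ...).
 */
static inline bool
burst_keeps_line_alignment(size_t n_objs)
{
	return (n_objs * sizeof(void *)) % CACHE_LINE_SIZE == 0;
}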

Adding a hole of 52 bytes per mempool cache is nothing, considering that the mempool cache already uses 8 KB (RTE_MEMPOOL_CACHE_MAX_SIZE * 2 * sizeof(void*) = 1024 * 8 bytes) for the objects.

Also - assuming that memory allocations are cache line aligned - the 52 bytes of unused memory cannot be used regardless of whether they come before or after the objects. Instead of having 52 B unused after the objects, we might as well have a hole of 52 B unused before the objects. In other words: there is really no downside to this.

Jerin also presented a separate argument for moving the objects to a different cache line than the len field: the risk of a "load-after-store stall" when loading the len field after storing objects in cache line0 [1].

[1]: http://inbox.dpdk.org/dev/CALBAE1P4zFYdLwoQukn5Q-V-nTvc_UBWmWjhaV2uVBXQRytSSA@mail.gmail.com/

A new idea just popped into my head: The hot debug statistics counters (put_bulk, put_objs, get_success_bulk, get_success_objs) could be moved to this free space, reducing the need to touch another cache line for debug counters. I haven’t thought this idea through yet; it might conflict with Jerin's comment.
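
A rough, hypothetical sketch of that idea - untested, with the counter
names borrowed from the existing debug stats, and the DPDK definitions
of __rte_cache_aligned and RTE_MEMPOOL_CACHE_MAX_SIZE assumed:

struct rte_mempool_cache { /* hypothetical layout, not this patch */
	uint32_t size;
	uint32_t flushthresh;
	uint32_t len;
#ifdef RTE_LIBRTE_MEMPOOL_DEBUG
	/*
	 * Hot debug counters placed in the otherwise unused hole,
	 * so they share cache line0 with len instead of touching
	 * a separate cache line.
	 */
	uint64_t put_bulk;
	uint64_t put_objs;
	uint64_t get_success_bulk;
	uint64_t get_success_objs;
#endif
	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
} __rte_cache_aligned;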

> [... remainder of quoted message trimmed ...]
  
Olivier Matz Oct. 27, 2022, 11:42 a.m. UTC | #4
On Thu, Oct 27, 2022 at 11:22:07AM +0200, Morten Brørup wrote:
> [... earlier discussion trimmed ...]
> 
> I agree that it is a special use case to access the mempool cache
> only in bursts of 32 objects, so that the accesses are always cache
> line aligned. (Generalized, the burst size need not be 32; any burst
> size that is a multiple of RTE_CACHE_LINE_SIZE/sizeof(void*), i.e. a
> multiple of 8 on a 64-bit architecture, will do.)

Is there a real situation where read/write accesses always happen in
bulks of 32? From what I see in my quick test, it is not the case,
even with testpmd.

> Adding a hole of 52 bytes per mempool cache is nothing, considering
> that the mempool cache already uses 8 KB (RTE_MEMPOOL_CACHE_MAX_SIZE
> * 2 * sizeof(void*) = 1024 * 8 bytes) for the objects.
>
> Also - assuming that memory allocations are cache line aligned - the
> 52 bytes of unused memory cannot be used regardless of whether they
> come before or after the objects. Instead of having 52 B unused after
> the objects, we might as well have a hole of 52 B unused before the
> objects. In other words: there is really no downside to this.

Correct, the memory waste argument to nack the patch is invalid.

> Jerin also presented a separate argument for moving the objects to a
> different cache line than the len field: the risk of a
> "load-after-store stall" when loading the len field after storing
> objects in cache line0 [1].
> 
> [1]: http://inbox.dpdk.org/dev/CALBAE1P4zFYdLwoQukn5Q-V-nTvc_UBWmWjhaV2uVBXQRytSSA@mail.gmail.com/

I'll be prudent about this justification without numbers. The case
where we access the objects in the first cache line (out of several
KB) is maybe not that frequent.

> A new idea just popped into my head: The hot debug statistics
> counters (put_bulk, put_objs, get_success_bulk, get_success_objs)
> could be moved to this free space, reducing the need to touch another
> cache line for debug counters. I haven’t thought this idea through
> yet; it might conflict with Jerin's comment.

Yes, but since the stats are only enabled when RTE_LIBRTE_MEMPOOL_DEBUG
is set, it won't have any impact on non-debug builds.


Honestly, I find it hard to convince myself that it is a real
optimization. I don't see any reason why it would be slower though. So
since we already broke the mempool cache struct ABI in a previous
commit, and since it won't consume more memory, I'm ok to include that
patch. It would be great to have numbers to put some weight in the
balance.




> [... remainder of quoted message trimmed ...]
  
Morten Brørup Oct. 27, 2022, 12:11 p.m. UTC | #5
> From: Olivier Matz [mailto:olivier.matz@6wind.com]
> Sent: Thursday, 27 October 2022 13.43
> 
> [... earlier discussion trimmed ...]
> 
> > A new idea just popped into my head: The hot debug statistics
> > counters (put_bulk, put_objs, get_success_bulk, get_success_objs)
> > could be moved to this free space, reducing the need to touch another
> > cache line for debug counters. I haven’t thought this idea through
> > yet; it might conflict with Jerin's comment.
> 
> Yes, but since the stats are only enabled when RTE_LIBRTE_MEMPOOL_DEBUG
> is set, it won't have any impact on non-debug builds.

Correct, but I do expect that it would reduce the performance cost of using RTE_LIBRTE_MEMPOOL_DEBUG. I'll provide such a patch shortly.

> 
> 
> Honestly, I find it hard to convince myself that it is a real
> optimization. I don't see any reason why it would be slower though. So
> since we already broke the mempool cache struct ABI in a previous
> commit, and since it won't consume more memory, I'm ok to include that
> patch.

I don't know if there are any such applications today, and you are probably right that there are not. But this patch opens a road towards them.

Acked-by ?

> It would be great to have numbers to put some weight in the
> balance.

Yes, it would also be great if drivers didn't copy-paste code from the mempool library, so the performance effect of modifications in the mempool library would be reflected in such tests.

> [... remainder of quoted message trimmed ...]
  
Olivier Matz Oct. 27, 2022, 3:20 p.m. UTC | #6
On Thu, Oct 27, 2022 at 02:11:29PM +0200, Morten Brørup wrote:
> [... earlier discussion trimmed ...]
> 
> Correct, but I do expect that it would reduce the performance cost of using RTE_LIBRTE_MEMPOOL_DEBUG. I'll provide such a patch shortly.
> 
> > 
> > 
> > Honestly, I find it hard to convince myself that it is a real
> > optimization. I don't see any reason why it would be slower though. So
> > since we already broke the mempool cache struct ABI in a previous
> > commit, and since it won't consume more memory, I'm ok to include that
> > patch.
> 
> I don't know if there are any such applications today, and you are probably right that there are not. But this patch opens a road towards them.
> 
> Acked-by ?

Acked-by: Olivier Matz <olivier.matz@6wind.com>

Thanks Morten

> [... remainder of quoted message trimmed ...]
  

Patch

diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 1f5707f46a..3725a72951 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -86,11 +86,13 @@  struct rte_mempool_cache {
 	uint32_t size;	      /**< Size of the cache */
 	uint32_t flushthresh; /**< Threshold before we flush excess elements */
 	uint32_t len;	      /**< Current cache count */
-	/*
+	/**
+	 * Cache objects
+	 *
 	 * Cache is allocated to this size to allow it to overflow in certain
 	 * cases to avoid needless emptying of cache.
 	 */
-	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache objects */
+	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
 } __rte_cache_aligned;
 
 /**