[v9,1/8] eal: generic 64 bit counter

Message ID 20240521201801.126886-2-stephen@networkplumber.org (mailing list archive)
State: Changes Requested, archived
Delegated to: Thomas Monjalon
Series: Common statistics for SW PMD's

Checks

Context Check Description
ci/checkpatch success coding style OK

Commit Message

Stephen Hemminger May 21, 2024, 8:16 p.m. UTC
This header implements 64 bit counters using atomic
operations but with a weak memory ordering so that
they are safe against load/store splits on 32 bit platforms.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/eal/include/meson.build   |   1 +
 lib/eal/include/rte_counter.h | 150 ++++++++++++++++++++++++++++++++++
 2 files changed, 151 insertions(+)
 create mode 100644 lib/eal/include/rte_counter.h
  

Comments

Morten Brørup May 22, 2024, 8:31 a.m. UTC | #1
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Tuesday, 21 May 2024 22.17
> 
> This header implements 64 bit counters using atomic
> operations but with a weak memory ordering so that
> they are safe against load/store splits on 32 bit platforms.
> 
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>  lib/eal/include/meson.build   |   1 +
>  lib/eal/include/rte_counter.h | 150 ++++++++++++++++++++++++++++++++++
>  2 files changed, 151 insertions(+)
>  create mode 100644 lib/eal/include/rte_counter.h
> 
> diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
> index e94b056d46..c070dd0079 100644
> --- a/lib/eal/include/meson.build
> +++ b/lib/eal/include/meson.build
> @@ -12,6 +12,7 @@ headers += files(
>          'rte_class.h',
>          'rte_common.h',
>          'rte_compat.h',
> +        'rte_counter.h',
>          'rte_debug.h',
>          'rte_dev.h',
>          'rte_devargs.h',
> diff --git a/lib/eal/include/rte_counter.h b/lib/eal/include/rte_counter.h
> new file mode 100644
> index 0000000000..f0c2b71a6c
> --- /dev/null
> +++ b/lib/eal/include/rte_counter.h
> @@ -0,0 +1,150 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright (c) Stephen Hemminger <stephen@networkplumber.org>
> + */
> +
> +#ifndef _RTE_COUNTER_H_
> +#define _RTE_COUNTER_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <stdint.h>
> +#include <rte_compat.h>
> +#include <rte_common.h>
> +#include <rte_stdatomic.h>
> +
> +/**
> + * @file
> + * RTE Counter
> + *
> + * A counter is a 64 bit value that is safe from split read/write.
> + * It assumes that only one CPU at a time will update the counter,
> + * and another CPU may want to read it.
> + *
> + * This is a weaker subset of full atomic variables.
> + *
> + * The counters are subject to the same restrictions as atomic
> + * variables in packed structures or at unaligned addresses.
> + */
> +
> +#ifdef RTE_ARCH_64
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * On a native 64 bit platform, the counter is implemented as a
> + * basic 64 bit unsigned integer that only increases.
> + */
> +typedef struct {
> +	uint64_t current;
> +	uint64_t offset;
> +} rte_counter64_t;

As discussed in the other thread [1], I strongly prefer having "current" and "offset" separate, for performance reasons.
Keeping each offset close together with its counter will require more cache lines than necessary, because the offsets take up space in the hot part of a fast path data structure. E.g. the size_bins[] counters could fit into one cache line instead of two.

[1]: https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F422@smartserver.smartshare.dk/

The disadvantages of a slightly larger API are insignificant, compared to the performance cost of not separating them.
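
To illustrate the difference, here is a minimal sketch of the two layouts (the field names are hypothetical, not from the patch):

/* Combined, as in this patch: each offset sits next to its counter,
 * so four counters occupy 64 bytes of hot cache line. */
struct stats_combined {
	rte_counter64_t pkts;	/* { current, offset } = 16 bytes each */
	rte_counter64_t bytes;
	rte_counter64_t errors;
	rte_counter64_t missed;
};

/* Separated: the four hot counters pack into 32 bytes, and the offsets,
 * touched only on reset, can live in a separate, cold cache line. */
struct stats_separated {
	uint64_t pkts, bytes, errors, missed;			/* hot */
	uint64_t pkts_off, bytes_off, errors_off, missed_off;	/* cold */
};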

> +
> +/**
> + * @internal
> + * Macro to implement read once (compiler barrier) using stdatomic.
> + * This is compiler barrier only.
> + */
> +#define __rte_read_once(var)						\
> +	rte_atomic_load_explicit((__rte_atomic typeof(&(var)))&(var),	\
> +		rte_memory_order_consume)
> +
> +/**
> + * @internal
> + * Macro to implement write once (compiler barrier) using stdatomic.
> + * This is compiler barrier only.
> + */
> +#define __rte_write_once(var, val)					    \
> +	rte_atomic_store_explicit((__rte_atomic typeof(&(var)))&(var), val, \
> +		rte_memory_order_release)

These macros certainly make the code using them shorter.
But IMHO, they don't improve the readability of the code using them; quite the opposite. Reviewing code using atomics is hard, so I prefer having the memory order directly shown in the code, not hidden behind a macro.
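
For example, written out explicitly (a sketch only, reusing the patch's field names), the read would look something like this, with the memory orders visible at the call site:

static inline uint64_t
rte_counter64_read(const rte_counter64_t *counter)
{
	/* The reviewer sees the memory orders directly, no macro indirection. */
	uint64_t cur = rte_atomic_load_explicit(
		(const __rte_atomic uint64_t *)&counter->current,
		rte_memory_order_consume);
	uint64_t off = rte_atomic_load_explicit(
		(const __rte_atomic uint64_t *)&counter->offset,
		rte_memory_order_consume);

	return cur - off;
}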

> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Add value to counter.
> + * Assumes this operation is only done by one thread on the object.
> + *
> + * @param counter
> + *    A pointer to the counter.
> + * @param val
> + *    The value to add to the counter.
> + */
> +static inline void
> +rte_counter64_add(rte_counter64_t *counter, uint32_t val)
> +{
> +	counter->current += val;
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Read a counter.
> + * This operation can be done by any thread.
> + *
> + * @param counter
> + *    A pointer to the counter.
> + * @return
> + *  The current value of the counter.
> + */
> +__rte_experimental
> +static inline uint64_t
> +rte_counter64_read(const rte_counter64_t *counter)
> +{
> +	return __rte_read_once(counter->current) - __rte_read_once(counter->offset);

I'm not sure that "current" needs to be read using rte_memory_order_consume here; I think rte_memory_order_consume for "offset" and rte_memory_order_relaxed (or perhaps just volatile) for "counter" suffices. But let's settle on the high level design before we start micro-optimizing. :-)

> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Reset a counter to zero.
> + * This operation can be done by any thread.
> + *
> + * @param counter
> + *    A pointer to the counter.
> + */
> +__rte_experimental
> +static inline void
> +rte_counter64_reset(rte_counter64_t *counter)
> +{
> +	__rte_write_once(counter->offset, __rte_read_once(counter->current));
> +}
> +
> +#else
> +
> +/* On 32 bit platform, need to use atomic to avoid load/store tearing */
> +typedef RTE_ATOMIC(uint64_t) rte_counter64_t;

As shown by Godbolt experiments discussed in a previous thread [2], non-tearing 64 bit counters can be implemented without using atomic instructions on all 32 bit architectures supported by DPDK. So we should use the counter/offset design pattern for RTE_ARCH_32 too.

[2]: https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F433@smartserver.smartshare.dk/

> +
> +__rte_experimental
> +static inline void
> +rte_counter64_add(rte_counter64_t *counter, uint32_t val)
> +{
> +	rte_atomic_fetch_add_explicit(counter, val, rte_memory_order_relaxed);
> +}
> +
> +__rte_experimental
> +static inline uint64_t
> +rte_counter64_read(const rte_counter64_t *counter)
> +{
> +	return rte_atomic_load_explicit(counter, rte_memory_order_consume);
> +}
> +
> +__rte_experimental
> +static inline void
> +rte_counter64_reset(rte_counter64_t *counter)
> +{
> +	rte_atomic_store_explicit(counter, 0, rte_memory_order_release);
> +}
> +
> +#endif /* RTE_ARCH_64 */
> +
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_COUNTER_H_ */
> --
> 2.43.0
  
Stephen Hemminger May 22, 2024, 3:33 p.m. UTC | #2
On Wed, 22 May 2024 10:31:39 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change without prior notice.
> > + *
> > + * On a native 64 bit platform, the counter is implemented as a
> > + * basic 64 bit unsigned integer that only increases.
> > + */
> > +typedef struct {
> > +	uint64_t current;
> > +	uint64_t offset;
> > +} rte_counter64_t;  
> 
> As discussed in the other thread [1], I strongly prefer having "current" and "offset" separate, for performance reasons.
> Keeping each offset close together with its counter will require more cache lines than necessary, because the offsets take up space in the hot part of a fast path data structure. E.g. the size_bins[] counters could fit into one cache line instead of two.
> 
> [1]: https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F422@smartserver.smartshare.dk/

There are no size_bins in the current version of the patch.
And the number of counters in the ethdev part is small, so it is less of a concern.
The code is easier to maintain if the counter object is self-contained.

> > + * @internal
> > + * Macro to implement write once (compiler barrier) using stdatomic.
> > + * This is compiler barrier only.
> > + */
> > +#define __rte_write_once(var, val)					    \
> > +	rte_atomic_store_explicit((__rte_atomic typeof(&(var)))&(var), val, \
> > +		rte_memory_order_release)  
> 
> These macros certainly make the code using them shorter.
> But IMHO, they don't improve the readability of the code using them; quite the opposite. Reviewing code using atomics is hard, so I prefer having the memory order directly shown in the code, not hidden behind a macro.

Agree, was going to drop them in next version.

> > +__rte_experimental
> > +static inline uint64_t
> > +rte_counter64_read(const rte_counter64_t *counter)
> > +{
> > +	return __rte_read_once(counter->current) - __rte_read_once(counter->offset);  
> 
> I'm not sure that "current" needs to be read using rte_memory_order_consume here; I think rte_memory_order_consume for "offset" and rte_memory_order_relaxed (or perhaps just volatile) for "counter" suffices. But let's settle on the high level design before we start micro-optimizing. :-)

Memory order consume is a compiler barrier only. I was trying to choose what was best here.

> > +
> > +/* On 32 bit platform, need to use atomic to avoid load/store tearing */
> > +typedef RTE_ATOMIC(uint64_t) rte_counter64_t;  
> 
> As shown by Godbolt experiments discussed in a previous thread [2], non-tearing 64 bit counters can be implemented without using atomic instructions on all 32 bit architectures supported by DPDK. So we should use the counter/offset design pattern for RTE_ARCH_32 too.
> 
> [2]: https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F433@smartserver.smartshare.dk/

Bruce found that some 32 bit versions of x86 have the problem.
Godbolt doesn't seem to list a 32 bit x86 compiler for GCC.
If you try MSVC in 32 bit mode, it will split the loads.
I see no problem on ARM, which is the only other 32 bit architecture we care about.
  
Stephen Hemminger May 22, 2024, 3:37 p.m. UTC | #3
On Wed, 22 May 2024 10:31:39 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> > +/* On 32 bit platform, need to use atomic to avoid load/store tearing */
> > +typedef RTE_ATOMIC(uint64_t) rte_counter64_t;  
> 
> As shown by Godbolt experiments discussed in a previous thread [2], non-tearing 64 bit counters can be implemented without using atomic instructions on all 32 bit architectures supported by DPDK. So we should use the counter/offset design pattern for RTE_ARCH_32 too.
> 
> [2]: https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F433@smartserver.smartshare.dk/


This code built with -O3 and -m32 on godbolt shows the split problem.

#include <stdint.h>

typedef uint64_t rte_counter64_t;

void
rte_counter64_add(rte_counter64_t *counter, uint32_t val)
{
	*counter += val;
}
…	*counter = val;
}

rte_counter64_add:
        push    ebx
        mov     eax, DWORD PTR [esp+8]
        xor     ebx, ebx
        mov     ecx, DWORD PTR [esp+12]
        add     DWORD PTR [eax], ecx
        adc     DWORD PTR [eax+4], ebx
        pop     ebx
        ret

rte_counter64_read:
        mov     eax, DWORD PTR [esp+4]
        mov     edx, DWORD PTR [eax+4]
        mov     eax, DWORD PTR [eax]
        ret
rte_counter64_set:
        movq    xmm0, QWORD PTR [esp+8]
        mov     eax, DWORD PTR [esp+4]
        movq    QWORD PTR [eax], xmm0
        ret
  
Morten Brørup May 22, 2024, 5:57 p.m. UTC | #4
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Wednesday, 22 May 2024 17.38
> 
> On Wed, 22 May 2024 10:31:39 +0200
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> > > +/* On 32 bit platform, need to use atomic to avoid load/store
> tearing */
> > > +typedef RTE_ATOMIC(uint64_t) rte_counter64_t;
> >
> > As shown by Godbolt experiments discussed in a previous thread [2],
> non-tearing 64 bit counters can be implemented without using atomic
> instructions on all 32 bit architectures supported by DPDK. So we should
> use the counter/offset design pattern for RTE_ARCH_32 too.
> >
> > [2]:
> https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F433@smartserver.smartshare.dk/
> 
> 
> This code built with -O3 and -m32 on godbolt shows the split problem.
> 
> #include <stdint.h>
> 
> typedef uint64_t rte_counter64_t;
> 
> void
> rte_counter64_add(rte_counter64_t *counter, uint32_t val)
> {
> 	*counter += val;
> }
> …	*counter = val;
> }
> 
> rte_counter64_add:
>         push    ebx
>         mov     eax, DWORD PTR [esp+8]
>         xor     ebx, ebx
>         mov     ecx, DWORD PTR [esp+12]
>         add     DWORD PTR [eax], ecx
>         adc     DWORD PTR [eax+4], ebx
>         pop     ebx
>         ret
> 
> rte_counter64_read:
>         mov     eax, DWORD PTR [esp+4]
>         mov     edx, DWORD PTR [eax+4]
>         mov     eax, DWORD PTR [eax]
>         ret
> rte_counter64_set:
>         movq    xmm0, QWORD PTR [esp+8]
>         mov     eax, DWORD PTR [esp+4]
>         movq    QWORD PTR [eax], xmm0
>         ret

Sure, atomic might be required on some 32 bit architectures and/or with some compilers.

I envision a variety of 32 bit implementations, optimized for certain architectures/compilers.

Some of them can provide non-tearing 64 bit load/store, so we should also use the counter/offset design pattern for those.
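
For example, on 32 bit x86 with SSE2 available, a relaxed 64 bit atomic load typically compiles to a single movq through an SSE register, which is non-tearing without any lock prefix (this codegen claim is an assumption based on common GCC/Clang behavior):

static inline uint64_t
counter64_load(const uint64_t *addr)
{
	/* On i686 + SSE2 this should be one movq, not a locked op. */
	return __atomic_load_n(addr, __ATOMIC_RELAXED);
}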
  
Morten Brørup May 22, 2024, 6:09 p.m. UTC | #5
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Wednesday, 22 May 2024 17.33
> 
> On Wed, 22 May 2024 10:31:39 +0200
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> > > +/**
> > > + * @warning
> > > + * @b EXPERIMENTAL: this API may change without prior notice.
> > > + *
> > > + * On a native 64 bit platform, the counter is implemented as a
> > > + * basic 64 bit unsigned integer that only increases.
> > > + */
> > > +typedef struct {
> > > +	uint64_t current;
> > > +	uint64_t offset;
> > > +} rte_counter64_t;
> >
> > As discussed in the other thread [1], I strongly prefer having
> "current" and "offset" separate, for performance reasons.
> > Keeping each offset close together with its counter will require more
> cache lines than necessary, because the offsets take up space in the hot
> part of a fast path data structure. E.g. the size_bins[] counters could
> fit into one cache line instead of two.
> >
> > [1]:
> https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F422@smartserver.smartshare.dk/
> 
> There are no size_bins in the current version of the patch.
> And the number of counters in the ethdev part is small, so it is less
> of a concern.
> The code is easier to maintain if the counter object is self-contained.

I agree that there are advantages to keeping the counter object self contained.

However, these counters are generic, so we cannot assume that there are only very few, based on how the current software device drivers use them.

Someone might want to add size_bins to the software device drivers.
And someone else might want to collect many counters in some application or library structure.

> 
> > > +
> > > +/* On 32 bit platform, need to use atomic to avoid load/store
> tearing */
> > > +typedef RTE_ATOMIC(uint64_t) rte_counter64_t;
> >
> > As shown by Godbolt experiments discussed in a previous thread [2],
> non-tearing 64 bit counters can be implemented without using atomic
> instructions on all 32 bit architectures supported by DPDK. So we should
> use the counter/offset design pattern for RTE_ARCH_32 too.
> >
> > [2]:
> https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F433@smartserver.smartshare.dk/
> 
> Bruce found some 32 bit versions of x86 have the problem.
> Godbolt doesn't seem to list 32 bit x86 compiler for Gcc.
> If you try MSVC in 32 bit mode it will split the loads.
> I see no problem on ARM which is the only other 32 bit we care about.
> 

Yeah, there seems to be a lot of work testing compiler behavior here. Let's start with a generic 32 bit implementation based on atomics, which should work correctly on all architectures/compilers.
Architecture- and compiler-specific optimized variants can be added by interested CPU vendors later.
  
Tyler Retzlaff May 22, 2024, 7:01 p.m. UTC | #6
On Wed, May 22, 2024 at 07:57:01PM +0200, Morten Brørup wrote:
> > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > Sent: Wednesday, 22 May 2024 17.38
> > 
> > On Wed, 22 May 2024 10:31:39 +0200
> > Morten Brørup <mb@smartsharesystems.com> wrote:
> > 
> > > > +/* On 32 bit platform, need to use atomic to avoid load/store
> > tearing */
> > > > +typedef RTE_ATOMIC(uint64_t) rte_counter64_t;
> > >
> > > As shown by Godbolt experiments discussed in a previous thread [2],
> > non-tearing 64 bit counters can be implemented without using atomic
> > instructions on all 32 bit architectures supported by DPDK. So we should
> > use the counter/offset design pattern for RTE_ARCH_32 too.
> > >
> > > [2]:
> > https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F433@smartserver.smartshare.dk/
> > 
> > 
> > This code built with -O3 and -m32 on godbolt shows the split problem.
> > 
> > #include <stdint.h>
> > 
> > typedef uint64_t rte_counter64_t;
> > 
> > void
> > rte_counter64_add(rte_counter64_t *counter, uint32_t val)
> > {
> > 	*counter += val;
> > }
> > …	*counter = val;
> > }
> > 
> > rte_counter64_add:
> >         push    ebx
> >         mov     eax, DWORD PTR [esp+8]
> >         xor     ebx, ebx
> >         mov     ecx, DWORD PTR [esp+12]
> >         add     DWORD PTR [eax], ecx
> >         adc     DWORD PTR [eax+4], ebx
> >         pop     ebx
> >         ret
> > 
> > rte_counter64_read:
> >         mov     eax, DWORD PTR [esp+4]
> >         mov     edx, DWORD PTR [eax+4]
> >         mov     eax, DWORD PTR [eax]
> >         ret
> > rte_counter64_set:
> >         movq    xmm0, QWORD PTR [esp+8]
> >         mov     eax, DWORD PTR [esp+4]
> >         movq    QWORD PTR [eax], xmm0
> >         ret
> 
> Sure, atomic might be required on some 32 bit architectures and/or with some compilers.

in theory i think you should be able to use generic atomics and
depending on the target you get codegen that works. it might be
something more expensive on 32-bit and nothing on 64-bit etc..

what's the damage if we just use atomic generic and relaxed ordering? is
the codegen not optimal?
 
> I envision a variety of 32 bit implementations, optimized for certain architectures/compilers.
> 
> Some of them can provide non-tearing 64 bit load/store, so we should also use the counter/offset design pattern for those.
>
  
Stephen Hemminger May 22, 2024, 7:51 p.m. UTC | #7
On Wed, 22 May 2024 12:01:12 -0700
Tyler Retzlaff <roretzla@linux.microsoft.com> wrote:

> On Wed, May 22, 2024 at 07:57:01PM +0200, Morten Brørup wrote:
> > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > Sent: Wednesday, 22 May 2024 17.38
> > > 
> > > On Wed, 22 May 2024 10:31:39 +0200
> > > Morten Brørup <mb@smartsharesystems.com> wrote:
> > >   
> > > > > +/* On 32 bit platform, need to use atomic to avoid load/store  
> > > tearing */  
> > > > > +typedef RTE_ATOMIC(uint64_t) rte_counter64_t;  
> > > >
> > > > As shown by Godbolt experiments discussed in a previous thread [2],  
> > > non-tearing 64 bit counters can be implemented without using atomic
> > > instructions on all 32 bit architectures supported by DPDK. So we should
> > > use the counter/offset design pattern for RTE_ARCH_32 too.  
> > > >
> > > > [2]:  
> > > https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F433@smartserver.smartshare.dk/
> > > 
> > > 
> > > This code built with -O3 and -m32 on godbolt shows the split problem.
> > > 
> > > #include <stdint.h>
> > > 
> > > typedef uint64_t rte_counter64_t;
> > > 
> > > void
> > > rte_counter64_add(rte_counter64_t *counter, uint32_t val)
> > > {
> > > 	*counter += val;
> > > }
> > > …	*counter = val;
> > > }
> > > 
> > > rte_counter64_add:
> > >         push    ebx
> > >         mov     eax, DWORD PTR [esp+8]
> > >         xor     ebx, ebx
> > >         mov     ecx, DWORD PTR [esp+12]
> > >         add     DWORD PTR [eax], ecx
> > >         adc     DWORD PTR [eax+4], ebx
> > >         pop     ebx
> > >         ret
> > > 
> > > rte_counter64_read:
> > >         mov     eax, DWORD PTR [esp+4]
> > >         mov     edx, DWORD PTR [eax+4]
> > >         mov     eax, DWORD PTR [eax]
> > >         ret
> > > rte_counter64_set:
> > >         movq    xmm0, QWORD PTR [esp+8]
> > >         mov     eax, DWORD PTR [esp+4]
> > >         movq    QWORD PTR [eax], xmm0
> > >         ret  
> > 
> > Sure, atomic might be required on some 32 bit architectures and/or with some compilers.  
> 
> in theory i think you should be able to use generic atomics and
> depending on the target you get codegen that works. it might be
> something more expensive on 32-bit and nothing on 64-bit etc..
> 
> what's the damage if we just use atomic generic and relaxed ordering? is
> the codegen not optimal?

If we use atomic with relaxed memory order, then the compiler for x86 still generates
a locked increment in the fast path. This costs about 100 extra cycles due
to cache and prefetch stalls. This whole endeavor is an attempt to avoid that.

PS: the locked increment code for 32 bit involves a locked compare
exchange and a potential retry. We probably don't care about performance
on that platform anymore.
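
For reference, the two code shapes being compared are roughly as follows (the codegen noted in the comments is an assumption based on common x86-64 compiler behavior):

/* Atomic RMW: even with relaxed ordering, x86 compilers emit a
 * "lock add"/"lock xadd", which is what stalls the fast path. */
static inline void
counter_add_locked(uint64_t *counter, uint64_t val)
{
	__atomic_fetch_add(counter, val, __ATOMIC_RELAXED);
}

/* Split load/add/store: compiles to plain mov/add/mov, safe here
 * because only one thread ever writes the counter. */
static inline void
counter_add_plain(uint64_t *counter, uint64_t val)
{
	uint64_t tmp = __atomic_load_n(counter, __ATOMIC_RELAXED);

	__atomic_store_n(counter, tmp + val, __ATOMIC_RELAXED);
}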
  
Stephen Hemminger May 22, 2024, 7:53 p.m. UTC | #8
On Wed, 22 May 2024 20:09:23 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> > There are no size_bins in the current version of the patch.
> > And the number of counters in ethdev part are small so it is less of a
> > concern.
> > The code is easier to maintain if the counter object is self contained.  
> 
> I agree that there are advantages to keeping the counter object self contained.
> 
> However, these counters are generic, so we cannot assume that there are only very few, based on how the current software device drivers use them.
> 
> Someone might want to add size_bins to the software device drivers.
> And someone else might want to collect many counters in some application or library structure.


No.
The implementation should be as simple and as small as possible for the use case
that is presented in the patch series. Doing something more complex leads to the
classic YAGNI situation, where, when the new case really happens, the
implementation just doesn't quite fit.
  
Morten Brørup May 22, 2024, 8:56 p.m. UTC | #9
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Wednesday, 22 May 2024 21.54
> 
> On Wed, 22 May 2024 20:09:23 +0200
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> > > There are no size_bins in the current version of the patch.
> > > And the number of counters in the ethdev part is small, so it is
> > > less of a concern.
> > > The code is easier to maintain if the counter object is
> > > self-contained.
> >
> > I agree that there are advantages to keeping the counter object self
> contained.
> >
> > However, these counters are generic, so we cannot assume that there
> are only very few, based on how the current software device drivers use
> them.
> >
> > Someone might want to add size_bins to the software device drivers.
> > And someone else might want to collect many counters in some
> application or library structure.
> 
> 
> No.
> The implementation should be as simple and as small as possible for the
> use case
> that is presented in the patch series.

I checked a random one of the use cases presented, rte_eth_af_packet:

The pkt_rx_queue struct grows to two cache lines, with the packets counter in the first cache line and the other counters in the second cache line.
By moving pkt_rx_queue's "sockfd" field below the pointers, the structure is better packed. If following my proposal, i.e. keeping the counters grouped together (and the offsets grouped together), all three counters stay within the first cache line (and the offsets go into the second).

The pkt_tx_queue struct also grows to two cache lines, with the first two counters in the first cache line and the third counter in the second cache line. With my proposal, the counters fit within the first cache line.

> Doing something more complex leads to the classic YAGNI situation,
> where, when the new case really happens, the implementation just
> doesn't quite fit.

I disagree about YAGNI for this patch.
We *do* need the counter API to have the offset separate from the counter to avoid a performance degradation.
It is only slightly more complex, so I'm not convinced it's going to be a problem.
Passing a pointer to the offset as an additional parameter to fetch() and reset() is straightforward.
And using two instances of struct rte_eth_counters, one for counters and one for offsets, isn't complex either.

The rte_eth_af_packet use case shows that the risk of touching an increased number of cache lines (by moving the offsets into the hot part of the structures) is real.
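
A rough sketch of the two-instance proposal above (struct layout and names are hypothetical):

struct rte_eth_counters {
	uint64_t packets;
	uint64_t bytes;
	uint64_t errors;
};

struct pmd_queue {
	struct rte_eth_counters stats;		/* hot: updated per burst */
	/* ... other hot queue state ... */
	struct rte_eth_counters offsets;	/* cold: written only on reset */
};

static inline uint64_t
counter_fetch(const uint64_t *counter, const uint64_t *offset)
{
	/* Counter and offset are passed separately, so they can live
	 * in different cache lines. */
	return __atomic_load_n(counter, __ATOMIC_RELAXED) -
		__atomic_load_n(offset, __ATOMIC_RELAXED);
}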
  
Mattias Rönnblom May 26, 2024, 2:39 p.m. UTC | #10
On 2024-05-22 21:01, Tyler Retzlaff wrote:
> On Wed, May 22, 2024 at 07:57:01PM +0200, Morten Brørup wrote:
>>> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
>>> Sent: Wednesday, 22 May 2024 17.38
>>>
>>> On Wed, 22 May 2024 10:31:39 +0200
>>> Morten Brørup <mb@smartsharesystems.com> wrote:
>>>
>>>>> +/* On 32 bit platform, need to use atomic to avoid load/store
>>> tearing */
>>>>> +typedef RTE_ATOMIC(uint64_t) rte_counter64_t;
>>>>
>>>> As shown by Godbolt experiments discussed in a previous thread [2],
>>> non-tearing 64 bit counters can be implemented without using atomic
>>> instructions on all 32 bit architectures supported by DPDK. So we should
>>> use the counter/offset design pattern for RTE_ARCH_32 too.
>>>>
>>>> [2]:
>>> https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F433@smartserver.smartshare.dk/
>>>
>>>
>>> This code built with -O3 and -m32 on godbolt shows the split problem.
>>>
>>> #include <stdint.h>
>>>
>>> typedef uint64_t rte_counter64_t;
>>>
>>> void
>>> rte_counter64_add(rte_counter64_t *counter, uint32_t val)
>>> {
>>> 	*counter += val;
>>> }
>>> …	*counter = val;
>>> }
>>>
>>> rte_counter64_add:
>>>          push    ebx
>>>          mov     eax, DWORD PTR [esp+8]
>>>          xor     ebx, ebx
>>>          mov     ecx, DWORD PTR [esp+12]
>>>          add     DWORD PTR [eax], ecx
>>>          adc     DWORD PTR [eax+4], ebx
>>>          pop     ebx
>>>          ret
>>>
>>> rte_counter64_read:
>>>          mov     eax, DWORD PTR [esp+4]
>>>          mov     edx, DWORD PTR [eax+4]
>>>          mov     eax, DWORD PTR [eax]
>>>          ret
>>> rte_counter64_set:
>>>          movq    xmm0, QWORD PTR [esp+8]
>>>          mov     eax, DWORD PTR [esp+4]
>>>          movq    QWORD PTR [eax], xmm0
>>>          ret
>>
>> Sure, atomic might be required on some 32 bit architectures and/or with some compilers.
> 
> in theory i think you should be able to use generic atomics and
> depending on the target you get codegen that works. it might be
> something more expensive on 32-bit and nothing on 64-bit etc..
> 
> what's the damage if we just use atomic generic and relaxed ordering? is
> the codegen not optimal?
>   

Below is what I originally proposed in the "make stats reset reliable" 
thread.

struct counter
{
     uint64_t count;
     uint64_t offset;
};

/../
     struct counter rx_pkts;
     struct counter rx_bytes;
/../

static uint64_t
counter_value(const struct counter *counter)
{
     uint64_t count = __atomic_load_n(&counter->count, __ATOMIC_RELAXED);
     uint64_t offset = __atomic_load_n(&counter->offset, __ATOMIC_RELAXED);

     return count - offset;
}

static void
counter_reset(struct counter *counter)
{
     uint64_t count = __atomic_load_n(&counter->count, __ATOMIC_RELAXED);

     __atomic_store_n(&counter->offset, count, __ATOMIC_RELAXED);
}

static void
counter_add(struct counter *counter, uint64_t operand)
{
     __atomic_store_n(&counter->count, counter->count + operand,
         __ATOMIC_RELAXED);
}

I think this solution generally compiles to something that's equivalent
to just using non-atomic loads/stores and hoping for the best.

Using a non-atomic load in counter_add() will generate better code, but
it doesn't work if you are using _Atomic (w/o casts).

Atomic loads/stores seem to have volatile semantics, so multiple updates
to the same counter cannot be merged. That is a drawback.
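
For instance, with the atomic stores above, the compiler must emit two separate load/add/store sequences for the calls below, whereas with a plain uint64_t it could fold them into a single "+= 2":

static void
count_twice(struct counter *c)
{
	counter_add(c, 1);	/* load, add, store */
	counter_add(c, 1);	/* load, add, store again -- not merged */
}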

>> I envision a variety of 32 bit implementations, optimized for certain architectures/compilers.
>>
>> Some of them can provide non-tearing 64 bit load/store, so we should also use the counter/offset design pattern for those.
>>
  
Mattias Rönnblom May 26, 2024, 2:46 p.m. UTC | #11
On 2024-05-22 21:51, Stephen Hemminger wrote:
> On Wed, 22 May 2024 12:01:12 -0700
> Tyler Retzlaff <roretzla@linux.microsoft.com> wrote:
> 
>> On Wed, May 22, 2024 at 07:57:01PM +0200, Morten Brørup wrote:
>>>> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
>>>> Sent: Wednesday, 22 May 2024 17.38
>>>>
>>>> On Wed, 22 May 2024 10:31:39 +0200
>>>> Morten Brørup <mb@smartsharesystems.com> wrote:
>>>>    
>>>>>> +/* On 32 bit platform, need to use atomic to avoid load/store
>>>> tearing */
>>>>>> +typedef RTE_ATOMIC(uint64_t) rte_counter64_t;
>>>>>
>>>>> As shown by Godbolt experiments discussed in a previous thread [2],
>>>> non-tearing 64 bit counters can be implemented without using atomic
>>>> instructions on all 32 bit architectures supported by DPDK. So we should
>>>> use the counter/offset design pattern for RTE_ARCH_32 too.
>>>>>
>>>>> [2]:
>>>> https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F433@smartserver.smartshare.dk/
>>>>
>>>>
>>>> This code built with -O3 and -m32 on godbolt shows the split problem.
>>>>
>>>> #include <stdint.h>
>>>>
>>>> typedef uint64_t rte_counter64_t;
>>>>
>>>> void
>>>> rte_counter64_add(rte_counter64_t *counter, uint32_t val)
>>>> {
>>>> 	*counter += val;
>>>> }
>>>> …	*counter = val;
>>>> }
>>>>
>>>> rte_counter64_add:
>>>>          push    ebx
>>>>          mov     eax, DWORD PTR [esp+8]
>>>>          xor     ebx, ebx
>>>>          mov     ecx, DWORD PTR [esp+12]
>>>>          add     DWORD PTR [eax], ecx
>>>>          adc     DWORD PTR [eax+4], ebx
>>>>          pop     ebx
>>>>          ret
>>>>
>>>> rte_counter64_read:
>>>>          mov     eax, DWORD PTR [esp+4]
>>>>          mov     edx, DWORD PTR [eax+4]
>>>>          mov     eax, DWORD PTR [eax]
>>>>          ret
>>>> rte_counter64_set:
>>>>          movq    xmm0, QWORD PTR [esp+8]
>>>>          mov     eax, DWORD PTR [esp+4]
>>>>          movq    QWORD PTR [eax], xmm0
>>>>          ret
>>>
>>> Sure, atomic might be required on some 32 bit architectures and/or with some compilers.
>>
>> in theory i think you should be able to use generic atomics and
>> depending on the target you get codegen that works. it might be
>> something more expensive on 32-bit and nothing on 64-bit etc..
>>
>> what's the damage if we just use atomic generic and relaxed ordering? is
>> the codegen not optimal?
> 
> If we use atomic with relaxed memory order, then the compiler for x86 still generates
> a locked increment in the fast path. This costs about 100 extra cycles due
> to cache and prefetch stalls. This whole endeavor is an attempt to avoid that.
> 

It's because the code is overly restrictive in that case (e.g., needlessly
forcing the whole read-modify-write to be atomic), and no fault of
the compiler.

void add(uint64_t *addr, uint64_t operand)
{
     uint64_t value = __atomic_load_n(addr, __ATOMIC_RELAXED);
     value += operand;
     __atomic_store_n(addr, value, __ATOMIC_RELAXED);
}

->

x86_64

add:
         mov     rax, QWORD PTR [rdi]
         add     rax, rsi
         mov     QWORD PTR [rdi], rax
         ret


x86

add:
         sub     esp, 12
         mov     ecx, DWORD PTR [esp+16]
         movq    xmm0, QWORD PTR [ecx]
         movq    QWORD PTR [esp], xmm0
         mov     eax, DWORD PTR [esp]
         mov     edx, DWORD PTR [esp+4]
         add     eax, DWORD PTR [esp+20]
         adc     edx, DWORD PTR [esp+24]
         mov     DWORD PTR [esp], eax
         mov     DWORD PTR [esp+4], edx
         movq    xmm1, QWORD PTR [esp]
         movq    QWORD PTR [ecx], xmm1
         add     esp, 12
         ret

No locked instructions.

> PS: the locked increment code for 32 bit involves a locked compare
> exchange and a potential retry. We probably don't care about performance
> on that platform anymore.
> 
>
  

Patch

diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..c070dd0079 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -12,6 +12,7 @@  headers += files(
         'rte_class.h',
         'rte_common.h',
         'rte_compat.h',
+        'rte_counter.h',
         'rte_debug.h',
         'rte_dev.h',
         'rte_devargs.h',
diff --git a/lib/eal/include/rte_counter.h b/lib/eal/include/rte_counter.h
new file mode 100644
index 0000000000..f0c2b71a6c
--- /dev/null
+++ b/lib/eal/include/rte_counter.h
@@ -0,0 +1,150 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) Stephen Hemminger <stephen@networkplumber.org>
+ */
+
+#ifndef _RTE_COUNTER_H_
+#define _RTE_COUNTER_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stdint.h>
+#include <rte_compat.h>
+#include <rte_common.h>
+#include <rte_stdatomic.h>
+
+/**
+ * @file
+ * RTE Counter
+ *
+ * A counter is a 64 bit value that is safe from split read/write.
+ * It assumes that only one CPU at a time will update the counter,
+ * and another CPU may want to read it.
+ *
+ * This is a weaker subset of full atomic variables.
+ *
+ * The counters are subject to the same restrictions as atomic
+ * variables in packed structures or at unaligned addresses.
+ */
+
+#ifdef RTE_ARCH_64
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * On a native 64 bit platform, the counter is implemented as a
+ * basic 64 bit unsigned integer that only increases.
+ */
+typedef struct {
+	uint64_t current;
+	uint64_t offset;
+} rte_counter64_t;
+
+/**
+ * @internal
+ * Macro to implement read once (compiler barrier) using stdatomic.
+ * This is compiler barrier only.
+ */
+#define __rte_read_once(var)						\
+	rte_atomic_load_explicit((__rte_atomic typeof(&(var)))&(var),	\
+		rte_memory_order_consume)
+
+/**
+ * @internal
+ * Macro to implement write once (compiler barrier) using stdatomic.
+ * This is compiler barrier only.
+ */
+#define __rte_write_once(var, val)					    \
+	rte_atomic_store_explicit((__rte_atomic typeof(&(var)))&(var), val, \
+		rte_memory_order_release)
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Add value to counter.
+ * Assumes this operation is only done by one thread on the object.
+ *
+ * @param counter
+ *    A pointer to the counter.
+ * @param val
+ *    The value to add to the counter.
+ */
+static inline void
+rte_counter64_add(rte_counter64_t *counter, uint32_t val)
+{
+	counter->current += val;
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Read a counter.
+ * This operation can be done by any thread.
+ *
+ * @param counter
+ *    A pointer to the counter.
+ * @return
+ *  The current value of the counter.
+ */
+__rte_experimental
+static inline uint64_t
+rte_counter64_read(const rte_counter64_t *counter)
+{
+	return __rte_read_once(counter->current) - __rte_read_once(counter->offset);
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Reset a counter to zero.
+ * This operation can be done by any thread.
+ *
+ * @param counter
+ *    A pointer to the counter.
+ */
+__rte_experimental
+static inline void
+rte_counter64_reset(rte_counter64_t *counter)
+{
+	__rte_write_once(counter->offset, __rte_read_once(counter->current));
+}
+
+#else
+
+/* On 32 bit platform, need to use atomic to avoid load/store tearing */
+typedef RTE_ATOMIC(uint64_t) rte_counter64_t;
+
+__rte_experimental
+static inline void
+rte_counter64_add(rte_counter64_t *counter, uint32_t val)
+{
+	rte_atomic_fetch_add_explicit(counter, val, rte_memory_order_relaxed);
+}
+
+__rte_experimental
+static inline uint64_t
+rte_counter64_read(const rte_counter64_t *counter)
+{
+	return rte_atomic_load_explicit(counter, rte_memory_order_consume);
+}
+
+__rte_experimental
+static inline void
+rte_counter64_reset(rte_counter64_t *counter)
+{
+	rte_atomic_store_explicit(counter, 0, rte_memory_order_release);
+}
+
+#endif /* RTE_ARCH_64 */
+
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_COUNTER_H_ */