[v3,3/7] eal: add lcore variable performance test

Message ID 20240912084429.703405-4-mattias.ronnblom@ericsson.com (mailing list archive)
State Superseded
Delegated to: Thomas Monjalon
Series: Lcore variables

Checks

ci/checkpatch: warning (coding style issues)

Commit Message

Mattias Rönnblom Sept. 12, 2024, 8:44 a.m. UTC
Add basic micro benchmark for lcore variables, in an attempt to assure
that the overhead isn't significantly greater than alternative
approaches, in scenarios where the benefits aren't expected to show up
(i.e., when plenty of cache is available compared to the working set
size of the per-lcore data).

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 app/test/meson.build           |   1 +
 app/test/test_lcore_var_perf.c | 160 +++++++++++++++++++++++++++++++++
 2 files changed, 161 insertions(+)
 create mode 100644 app/test/test_lcore_var_perf.c
  

Comments

Morten Brørup Sept. 12, 2024, 9:39 a.m. UTC | #1
> +struct lcore_state {
> +	uint64_t a;
> +	uint64_t b;
> +	uint64_t sum;
> +};
> +
> +static __rte_always_inline void
> +update(struct lcore_state *state)
> +{
> +	state->sum += state->a * state->b;
> +}
> +
> +static RTE_DEFINE_PER_LCORE(struct lcore_state, tls_lcore_state);
> +
> +static __rte_noinline void
> +tls_update(void)
> +{
> +	update(&RTE_PER_LCORE(tls_lcore_state));

I would normally access TLS variables directly, not through a pointer, i.e.:

RTE_PER_LCORE(tls_lcore_state.sum) += RTE_PER_LCORE(tls_lcore_state.a) * RTE_PER_LCORE(tls_lcore_state.b);

On the other hand, then it wouldn't be 1:1 comparable to the two other test cases.

Besides, I expect the compiler to optimize away the indirect access, and produce the same output (as for the alternative implementation) anyway.

No change requested. Just noticing.

> +}
> +
> +struct __rte_cache_aligned lcore_state_aligned {
> +	uint64_t a;
> +	uint64_t b;
> +	uint64_t sum;

Please add RTE_CACHE_GUARD here, for 100 % matching the common design pattern.

> +};
> +
> +static struct lcore_state_aligned sarray_lcore_state[RTE_MAX_LCORE];


> +	printf("Latencies [ns/update]\n");
> +	printf("Thread-local storage  Static array  Lcore variables\n");
> +	printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> +	       sarray_latency * 1e9, lvar_latency * 1e9);

I prefer cycles over ns. Perhaps you could show both?


With RTE_CACHE_GUARD added where mentioned,

Acked-by: Morten Brørup <mb@smartsharesystems.com>
  
Mattias Rönnblom Sept. 12, 2024, 1:01 p.m. UTC | #2
On 2024-09-12 11:39, Morten Brørup wrote:
>> +struct lcore_state {
>> +	uint64_t a;
>> +	uint64_t b;
>> +	uint64_t sum;
>> +};
>> +
>> +static __rte_always_inline void
>> +update(struct lcore_state *state)
>> +{
>> +	state->sum += state->a * state->b;
>> +}
>> +
>> +static RTE_DEFINE_PER_LCORE(struct lcore_state, tls_lcore_state);
>> +
>> +static __rte_noinline void
>> +tls_update(void)
>> +{
>> +	update(&RTE_PER_LCORE(tls_lcore_state));
> 
> I would normally access TLS variables directly, not through a pointer, i.e.:
> 
> RTE_PER_LCORE(tls_lcore_state.sum) += RTE_PER_LCORE(tls_lcore_state.a) * RTE_PER_LCORE(tls_lcore_state.b);
> 
> On the other hand, then it wouldn't be 1:1 comparable to the two other test cases.
> 
> Besides, I expect the compiler to optimize away the indirect access, and produce the same output (as for the alternative implementation) anyway.
> 
> No change requested. Just noticing.
> 
>> +}
>> +
>> +struct __rte_cache_aligned lcore_state_aligned {
>> +	uint64_t a;
>> +	uint64_t b;
>> +	uint64_t sum;
> 
> Please add RTE_CACHE_GUARD here, for 100 % matching the common design pattern.
> 

Will do.

>> +};
>> +
>> +static struct lcore_state_aligned sarray_lcore_state[RTE_MAX_LCORE];
> 
> 
>> +	printf("Latencies [ns/update]\n");
>> +	printf("Thread-local storage  Static array  Lcore variables\n");
>> +	printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
>> +	       sarray_latency * 1e9, lvar_latency * 1e9);
> 
> I prefer cycles over ns. Perhaps you could show both?
> 

That makes you an x86 guy. :) Only on x86 do those cycles make any sense.

I didn't want to use cycles, since the value would be very small on certain (e.g., old ARM) platforms.

But elsewhere in the perf tests TSC cycles are used, so maybe I should switch to those nevertheless.

> 
> With RTE_CACHE_GUARD added where mentioned,
> 
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>
  
Jerin Jacob Sept. 12, 2024, 1:09 p.m. UTC | #3
On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
<mattias.ronnblom@ericsson.com> wrote:
>
> Add basic micro benchmark for lcore variables, in an attempt to assure
> that the overhead isn't significantly greater than alternative
> approaches, in scenarios where the benefits aren't expected to show up
> (i.e., when plenty of cache is available compared to the working set
> size of the per-lcore data).
>
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---
>  app/test/meson.build           |   1 +
>  app/test/test_lcore_var_perf.c | 160 +++++++++++++++++++++++++++++++++
>  2 files changed, 161 insertions(+)
>  create mode 100644 app/test/test_lcore_var_perf.c


> +static double
> +benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
> +{
> +       uint64_t i;
> +       uint64_t start;
> +       uint64_t end;
> +       double latency;
> +
> +       init_fun();
> +
> +       start = rte_get_timer_cycles();
> +
> +       for (i = 0; i < ITERATIONS; i++)
> +               update_fun();
> +
> +       end = rte_get_timer_cycles();

Use the precise variant, rte_rdtsc_precise() or so, to be accurate.
  
Mattias Rönnblom Sept. 12, 2024, 1:20 p.m. UTC | #4
On 2024-09-12 15:09, Jerin Jacob wrote:
> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
> <mattias.ronnblom@ericsson.com> wrote:
>>
>> Add basic micro benchmark for lcore variables, in an attempt to assure
>> that the overhead isn't significantly greater than alternative
>> approaches, in scenarios where the benefits aren't expected to show up
>> (i.e., when plenty of cache is available compared to the working set
>> size of the per-lcore data).
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
>>   app/test/meson.build           |   1 +
>>   app/test/test_lcore_var_perf.c | 160 +++++++++++++++++++++++++++++++++
>>   2 files changed, 161 insertions(+)
>>   create mode 100644 app/test/test_lcore_var_perf.c
> 
> 
>> +static double
>> +benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
>> +{
>> +       uint64_t i;
>> +       uint64_t start;
>> +       uint64_t end;
>> +       double latency;
>> +
>> +       init_fun();
>> +
>> +       start = rte_get_timer_cycles();
>> +
>> +       for (i = 0; i < ITERATIONS; i++)
>> +               update_fun();
>> +
>> +       end = rte_get_timer_cycles();
> 
> Use precise variant. rte_rdtsc_precise() or so to be accurate

With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.
  
Jerin Jacob Sept. 12, 2024, 3:11 p.m. UTC | #5
On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>
> On 2024-09-12 15:09, Jerin Jacob wrote:
> > On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
> > <mattias.ronnblom@ericsson.com> wrote:
> >>
> >> Add basic micro benchmark for lcore variables, in an attempt to assure
> >> that the overhead isn't significantly greater than alternative
> >> approaches, in scenarios where the benefits aren't expected to show up
> >> (i.e., when plenty of cache is available compared to the working set
> >> size of the per-lcore data).
> >>
> >> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >> ---
> >>   app/test/meson.build           |   1 +
> >>   app/test/test_lcore_var_perf.c | 160 +++++++++++++++++++++++++++++++++
> >>   2 files changed, 161 insertions(+)
> >>   create mode 100644 app/test/test_lcore_var_perf.c
> >
> >
> >> +static double
> >> +benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
> >> +{
> >> +       uint64_t i;
> >> +       uint64_t start;
> >> +       uint64_t end;
> >> +       double latency;
> >> +
> >> +       init_fun();
> >> +
> >> +       start = rte_get_timer_cycles();
> >> +
> >> +       for (i = 0; i < ITERATIONS; i++)
> >> +               update_fun();
> >> +
> >> +       end = rte_get_timer_cycles();
> >
> > Use precise variant. rte_rdtsc_precise() or so to be accurate
>
> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.

I was thinking of it another way: with 1e7 iterations, the additional
barrier on the precise variant will be amortized, and we get more
_deterministic_ behavior, especially if we print cycles and need to
catch regressions.
Furthermore, you may consider replacing rte_random() in the fast path
with a running number or so, if it is not deterministic in cycle
computation.
  
Mattias Rönnblom Sept. 13, 2024, 6:47 a.m. UTC | #6
On 2024-09-12 17:11, Jerin Jacob wrote:
> On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>>
>> On 2024-09-12 15:09, Jerin Jacob wrote:
>>> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
>>> <mattias.ronnblom@ericsson.com> wrote:
>>>>
>>>> Add basic micro benchmark for lcore variables, in an attempt to assure
>>>> that the overhead isn't significantly greater than alternative
>>>> approaches, in scenarios where the benefits aren't expected to show up
>>>> (i.e., when plenty of cache is available compared to the working set
>>>> size of the per-lcore data).
>>>>
>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>> ---
>>>>    app/test/meson.build           |   1 +
>>>>    app/test/test_lcore_var_perf.c | 160 +++++++++++++++++++++++++++++++++
>>>>    2 files changed, 161 insertions(+)
>>>>    create mode 100644 app/test/test_lcore_var_perf.c
>>>
>>>
>>>> +static double
>>>> +benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
>>>> +{
>>>> +       uint64_t i;
>>>> +       uint64_t start;
>>>> +       uint64_t end;
>>>> +       double latency;
>>>> +
>>>> +       init_fun();
>>>> +
>>>> +       start = rte_get_timer_cycles();
>>>> +
>>>> +       for (i = 0; i < ITERATIONS; i++)
>>>> +               update_fun();
>>>> +
>>>> +       end = rte_get_timer_cycles();
>>>
>>> Use precise variant. rte_rdtsc_precise() or so to be accurate
>>
>> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.
> 
> I was thinking in another way, with 1e7 iteration, the additional
> barrier on precise will be amortized, and we get more _deterministic_
> behavior e.s.p in case if we print cycles and if we need to catch
> regressions.

If you time a section of code which spends ~40000000 cycles, it doesn't 
matter if you add or remove a few cycles at the beginning and the end.

The rte_rdtsc_precise() is both better (more precise in the sense of 
more serialization), and worse (because it's more costly, and thus more 
intrusive).

You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It 
doesn't matter.

> Furthermore, you may consider replacing rte_random() in fast path to
> running number or so if it is not deterministic in cycle computation.

rte_rand() is not used in the fast path. I don't understand what you 
mean by "running number".
  
Jerin Jacob Sept. 13, 2024, 11:23 a.m. UTC | #7
On Fri, Sep 13, 2024 at 12:17 PM Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>
> On 2024-09-12 17:11, Jerin Jacob wrote:
> > On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> >>
> >> On 2024-09-12 15:09, Jerin Jacob wrote:
> >>> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
> >>> <mattias.ronnblom@ericsson.com> wrote:
> >>>>
> >>>> Add basic micro benchmark for lcore variables, in an attempt to assure
> >>>> that the overhead isn't significantly greater than alternative
> >>>> approaches, in scenarios where the benefits aren't expected to show up
> >>>> (i.e., when plenty of cache is available compared to the working set
> >>>> size of the per-lcore data).
> >>>>
> >>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >>>> ---
> >>>>    app/test/meson.build           |   1 +
> >>>>    app/test/test_lcore_var_perf.c | 160 +++++++++++++++++++++++++++++++++
> >>>>    2 files changed, 161 insertions(+)
> >>>>    create mode 100644 app/test/test_lcore_var_perf.c
> >>>
> >>>
> >>>> +static double
> >>>> +benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
> >>>> +{
> >>>> +       uint64_t i;
> >>>> +       uint64_t start;
> >>>> +       uint64_t end;
> >>>> +       double latency;
> >>>> +
> >>>> +       init_fun();
> >>>> +
> >>>> +       start = rte_get_timer_cycles();
> >>>> +
> >>>> +       for (i = 0; i < ITERATIONS; i++)
> >>>> +               update_fun();
> >>>> +
> >>>> +       end = rte_get_timer_cycles();
> >>>
> >>> Use precise variant. rte_rdtsc_precise() or so to be accurate
> >>
> >> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.
> >
> > I was thinking in another way, with 1e7 iteration, the additional
> > barrier on precise will be amortized, and we get more _deterministic_
> > behavior e.s.p in case if we print cycles and if we need to catch
> > regressions.
>
> If you time a section of code which spends ~40000000 cycles, it doesn't
> matter if you add or remove a few cycles at the beginning and the end.
>
> The rte_rdtsc_precise() is both better (more precise in the sense of
> more serialization), and worse (because it's more costly, and thus more
> intrusive).

We can calibrate the overhead to remove the cost.

>
> You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It
> doesn't matter.

Yes. In this setup. And it is pretty inaccurate PER iteration. Please
refer to the patch below to see the difference.

Patch 1: Change nanoseconds to cycles per iteration
------------------------------------------------------------------

diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
index ea1d7ba90b52..b8d25400f593 100644
--- a/app/test/test_lcore_var_perf.c
+++ b/app/test/test_lcore_var_perf.c
@@ -110,7 +110,7 @@ benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))

        end = rte_get_timer_cycles();

-       latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
+       latency = ((end - start)) / ITERATIONS;

        return latency;
 }
@@ -137,8 +137,7 @@ test_lcore_var_access(void)

-       printf("Latencies [ns/update]\n");
+       printf("Latencies [cycles/update]\n");
        printf("Thread-local storage  Static array  Lcore variables\n");
-       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
-              sarray_latency * 1e9, lvar_latency * 1e9);
+       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
+              lvar_latency);

        return TEST_SUCCESS;
 }


Patch 2: Change to precise with calibration
-----------------------------------------------------------

diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
index ea1d7ba90b52..8142ecd56241 100644
--- a/app/test/test_lcore_var_perf.c
+++ b/app/test/test_lcore_var_perf.c
@@ -96,23 +96,28 @@ lvar_update(void)
 static double
 benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
 {
-       uint64_t i;
+       double tsc_latency;
+       double latency;
        uint64_t start;
        uint64_t end;
-       double latency;
+       uint64_t i;

-       init_fun();
+       /* calculate rte_rdtsc_precise overhead */
+       start = rte_rdtsc_precise();
+       end = rte_rdtsc_precise();
+       tsc_latency = (end - start);

-       start = rte_get_timer_cycles();
+       init_fun();

-       for (i = 0; i < ITERATIONS; i++)
+       latency = 0;
+       for (i = 0; i < ITERATIONS; i++) {
+               start = rte_rdtsc_precise();
                update_fun();
+               end = rte_rdtsc_precise();
+               latency += (end - start) - tsc_latency;
+       }

-       end = rte_get_timer_cycles();
-
-       latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
-
-       return latency;
+       return latency / (double)ITERATIONS;
 }

 static int
@@ -135,10 +140,9 @@ test_lcore_var_access(void)
        sarray_latency = benchmark_access_method(sarray_init, sarray_update);
        lvar_latency = benchmark_access_method(lvar_init, lvar_update);

-       printf("Latencies [ns/update]\n");
+       printf("Latencies [cycles/update]\n");
        printf("Thread-local storage  Static array  Lcore variables\n");
-       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
-              sarray_latency * 1e9, lvar_latency * 1e9);
+       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
+              lvar_latency);

        return TEST_SUCCESS;
 }

ARM N2 core with patch 1 (aka current scheme)
-----------------------------------

 + ------------------------------------------------------- +
 + Test Suite : lcore variable perf autotest
 + ------------------------------------------------------- +
Latencies [cycles/update]
Thread-local storage  Static array  Lcore variables
                 7.0           7.0              7.0


ARM N2 core with patch 2
-----------------------------------

 + ------------------------------------------------------- +
 + Test Suite : lcore variable perf autotest
 + ------------------------------------------------------- +
Latencies [cycles/update]
Thread-local storage  Static array  Lcore variables
                11.4          15.5             15.5

x86 i9 core with patch 1 (aka current scheme)
------------------------------------------------------------

 + ------------------------------------------------------- +
 + Test Suite : lcore variable perf autotest
 + ------------------------------------------------------- +
Latencies [ns/update]
Thread-local storage  Static array  Lcore variables
                 5.0           6.0              6.0

x86 i9 core with patch 2
--------------------------------
 + ------------------------------------------------------- +
 + Test Suite : lcore variable perf autotest
 + ------------------------------------------------------- +
Latencies [cycles/update]
Thread-local storage  Static array  Lcore variables
                 5.3          10.6             11.7





>
> > Furthermore, you may consider replacing rte_random() in fast path to
> > running number or so if it is not deterministic in cycle computation.
>
> rte_rand() is not used in the fast path. I don't understand what you

I missed that. Ignore this comment.

> mean by "running number".
  
Morten Brørup Sept. 13, 2024, 2:40 p.m. UTC | #8
> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> Sent: Friday, 13 September 2024 13.24
> 
> On Fri, Sep 13, 2024 at 12:17 PM Mattias Rönnblom <hofors@lysator.liu.se>
> wrote:
> >
> > On 2024-09-12 17:11, Jerin Jacob wrote:
> > > On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom <hofors@lysator.liu.se>
> wrote:
> > >>
> > >> On 2024-09-12 15:09, Jerin Jacob wrote:
> > >>> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
> > >>> <mattias.ronnblom@ericsson.com> wrote:
> > >>>> +static double
> > >>>> +benchmark_access_method(void (*init_fun)(void), void
> (*update_fun)(void))
> > >>>> +{
> > >>>> +       uint64_t i;
> > >>>> +       uint64_t start;
> > >>>> +       uint64_t end;
> > >>>> +       double latency;
> > >>>> +
> > >>>> +       init_fun();
> > >>>> +
> > >>>> +       start = rte_get_timer_cycles();
> > >>>> +
> > >>>> +       for (i = 0; i < ITERATIONS; i++)
> > >>>> +               update_fun();
> > >>>> +
> > >>>> +       end = rte_get_timer_cycles();
> > >>>
> > >>> Use precise variant. rte_rdtsc_precise() or so to be accurate
> > >>
> > >> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.
> > >
> > > I was thinking in another way, with 1e7 iteration, the additional
> > > barrier on precise will be amortized, and we get more _deterministic_
> > > behavior e.s.p in case if we print cycles and if we need to catch
> > > regressions.
> >
> > If you time a section of code which spends ~40000000 cycles, it doesn't
> > matter if you add or remove a few cycles at the beginning and the end.
> >
> > The rte_rdtsc_precise() is both better (more precise in the sense of
> > more serialization), and worse (because it's more costly, and thus more
> > intrusive).
> 
> We can calibrate the overhead to remove the cost.
> 
> >
> > You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It
> > doesn't matter.
> 
> Yes. In this setup and it is pretty inaccurate PER iteration. Please
> refer to the below patch to see the difference.

No, Mattias is right. The time is sampled once before the loop, then the function is executed 10 million (ITERATIONS) times in the loop, and then the time is sampled once again.

So the overhead and accuracy of the timing function is amortized across the 10 million calls to the function being measured, and becomes insignificant.

Other perf tests also do it this way, and also use rte_get_timer_cycles(). E.g. the mempool_perf test.

Another detail: The for loop itself may cost a few cycles, which may not be irrelevant when measuring a function using very few cycles. If the compiler doesn't unroll the loop, it should be done manually:

        for (i = 0; i < ITERATIONS / 100; i++) {
                update_fun();
                update_fun();
                ... repeated 100 times
        }


> 
> Patch 1: Make nanoseconds to cycles per iteration
> ------------------------------------------------------------------
> 
> diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
> index ea1d7ba90b52..b8d25400f593 100644
> --- a/app/test/test_lcore_var_perf.c
> +++ b/app/test/test_lcore_var_perf.c
> @@ -110,7 +110,7 @@ benchmark_access_method(void (*init_fun)(void),
> void (*update_fun)(void))
> 
>         end = rte_get_timer_cycles();
> 
> -       latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
> +       latency = ((end - start)) / ITERATIONS;

This calculation uses integer arithmetic, which will round down the resulting latency.
Please use floating point arithmetic: latency = (end - start) / (double)ITERATIONS;

> 
>         return latency;
>  }
> @@ -137,8 +137,7 @@ test_lcore_var_access(void)
> 
> -       printf("Latencies [ns/update]\n");
> +       printf("Latencies [cycles/update]\n");
>         printf("Thread-local storage  Static array  Lcore variables\n");
> -       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> -              sarray_latency * 1e9, lvar_latency * 1e9);
> > +       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
> > +              lvar_latency);
> 
>         return TEST_SUCCESS;
>  }
  
Jerin Jacob Sept. 16, 2024, 8:12 a.m. UTC | #9
On Fri, Sep 13, 2024 at 8:10 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> > Sent: Friday, 13 September 2024 13.24
> >
> > On Fri, Sep 13, 2024 at 12:17 PM Mattias Rönnblom <hofors@lysator.liu.se>
> > wrote:
> > >
> > > On 2024-09-12 17:11, Jerin Jacob wrote:
> > > > On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom <hofors@lysator.liu.se>
> > wrote:
> > > >>
> > > >> On 2024-09-12 15:09, Jerin Jacob wrote:
> > > >>> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
> > > >>> <mattias.ronnblom@ericsson.com> wrote:
> > > >>>> +static double
> > > >>>> +benchmark_access_method(void (*init_fun)(void), void
> > (*update_fun)(void))
> > > >>>> +{
> > > >>>> +       uint64_t i;
> > > >>>> +       uint64_t start;
> > > >>>> +       uint64_t end;
> > > >>>> +       double latency;
> > > >>>> +
> > > >>>> +       init_fun();
> > > >>>> +
> > > >>>> +       start = rte_get_timer_cycles();
> > > >>>> +
> > > >>>> +       for (i = 0; i < ITERATIONS; i++)
> > > >>>> +               update_fun();
> > > >>>> +
> > > >>>> +       end = rte_get_timer_cycles();
> > > >>>
> > > >>> Use precise variant. rte_rdtsc_precise() or so to be accurate
> > > >>
> > > >> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.
> > > >
> > > > I was thinking in another way, with 1e7 iteration, the additional
> > > > barrier on precise will be amortized, and we get more _deterministic_
> > > > behavior e.s.p in case if we print cycles and if we need to catch
> > > > regressions.
> > >
> > > If you time a section of code which spends ~40000000 cycles, it doesn't
> > > matter if you add or remove a few cycles at the beginning and the end.
> > >
> > > The rte_rdtsc_precise() is both better (more precise in the sense of
> > > more serialization), and worse (because it's more costly, and thus more
> > > intrusive).
> >
> > We can calibrate the overhead to remove the cost.
> >
> > >
> > > You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It
> > > doesn't matter.
> >
> > Yes. In this setup and it is pretty inaccurate PER iteration. Please
> > refer to the below patch to see the difference.
>
> No, Mattias is right. The time is sampled once before the loop, then the function is executed 10 million (ITERATIONS) times in the loop, and then the time is sampled once again.

No. I am not disagreeing. That's why I said, “Yes. In this setup”.

All I am saying is that there is a more accurate way of doing the
measurement for this test, along with “data”, at
https://mails.dpdk.org/archives/dev/2024-September/301227.html


>
> So the overhead and accuracy of the timing function is amortized across the 10 million calls to the function being measured, and becomes insignificant.
>
> Other perf tests also do it this way, and also use rte_get_timer_cycles(). E.g. the mempool_perf test.
>
> Another detail: The for loop itself may cost a few cycles, which may not be irrelevant when measuring a function using very few cycles. If the compiler doesn't unroll the loop, it should be done manually:
>
>         for (i = 0; i < ITERATIONS / 100; i++) {
>                 update_fun();
>                 update_fun();
>                 ... repeated 100 times

I have done a similar scheme for the trace perf inline function test
at https://github.com/DPDK/dpdk/blob/main/app/test/test_trace_perf.c#L30

Either the above scheme or the below scheme needs to be used as
mentioned in https://mails.dpdk.org/archives/dev/2024-September/301227.html

+       for (i = 0; i < ITERATIONS; i++) {
+               start = rte_rdtsc_precise();
                update_fun();
+               end = rte_rdtsc_precise();
+               latency += (end - start) - tsc_latency;
+       }




>         }
>
>
> >
> > Patch 1: Make nanoseconds to cycles per iteration
> > ------------------------------------------------------------------
> >
> > diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
> > index ea1d7ba90b52..b8d25400f593 100644
> > --- a/app/test/test_lcore_var_perf.c
> > +++ b/app/test/test_lcore_var_perf.c
> > @@ -110,7 +110,7 @@ benchmark_access_method(void (*init_fun)(void),
> > void (*update_fun)(void))
> >
> >         end = rte_get_timer_cycles();
> >
> > -       latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
> > +       latency = ((end - start)) / ITERATIONS;
>
> This calculation uses integer arithmetic, which will round down the resulting latency.
> Please use floating point arithmetic: latency = (end - start) / (double)ITERATIONS;

Yup. It is in patch 2
https://mails.dpdk.org/archives/dev/2024-September/301227.html

>
> >
> >         return latency;
> >  }
> > @@ -137,8 +137,7 @@ test_lcore_var_access(void)
> >
> > -       printf("Latencies [ns/update]\n");
> > +       printf("Latencies [cycles/update]\n");
> >         printf("Thread-local storage  Static array  Lcore variables\n");
> > -       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> > -              sarray_latency * 1e9, lvar_latency * 1e9);
> > +       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
> > lvar_latency);
> >
> >         return TEST_SUCCESS;
> >  }
  
Morten Brørup Sept. 16, 2024, 9:51 a.m. UTC | #10
> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> Sent: Monday, 16 September 2024 10.12
> 
> On Fri, Sep 13, 2024 at 8:10 PM Morten Brørup <mb@smartsharesystems.com>
> wrote:
> >
> > > From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> > > Sent: Friday, 13 September 2024 13.24
> > >
> > > On Fri, Sep 13, 2024 at 12:17 PM Mattias Rönnblom <hofors@lysator.liu.se>
> > > wrote:
> > > >
> > > > On 2024-09-12 17:11, Jerin Jacob wrote:
> > > > > On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom
> <hofors@lysator.liu.se>
> > > wrote:
> > > > >>
> > > > >> On 2024-09-12 15:09, Jerin Jacob wrote:
> > > > >>> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
> > > > >>> <mattias.ronnblom@ericsson.com> wrote:
> > > > >>>> +static double
> > > > >>>> +benchmark_access_method(void (*init_fun)(void), void
> > > (*update_fun)(void))
> > > > >>>> +{
> > > > >>>> +       uint64_t i;
> > > > >>>> +       uint64_t start;
> > > > >>>> +       uint64_t end;
> > > > >>>> +       double latency;
> > > > >>>> +
> > > > >>>> +       init_fun();
> > > > >>>> +
> > > > >>>> +       start = rte_get_timer_cycles();
> > > > >>>> +
> > > > >>>> +       for (i = 0; i < ITERATIONS; i++)
> > > > >>>> +               update_fun();
> > > > >>>> +
> > > > >>>> +       end = rte_get_timer_cycles();
> > > > >>>
> > > > >>> Use precise variant. rte_rdtsc_precise() or so to be accurate
> > > > >>
> > > > >> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.
> > > > >
> > > > > I was thinking in another way, with 1e7 iteration, the additional
> > > > > barrier on precise will be amortized, and we get more _deterministic_
> > > > > behavior e.s.p in case if we print cycles and if we need to catch
> > > > > regressions.
> > > >
> > > > If you time a section of code which spends ~40000000 cycles, it doesn't
> > > > matter if you add or remove a few cycles at the beginning and the end.
> > > >
> > > > The rte_rdtsc_precise() is both better (more precise in the sense of
> > > > more serialization), and worse (because it's more costly, and thus more
> > > > intrusive).
> > >
> > > We can calibrate the overhead to remove the cost.
> > >
> > > >
> > > > You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It
> > > > doesn't matter.
> > >
> > > Yes. In this setup and it is pretty inaccurate PER iteration. Please
> > > refer to the below patch to see the difference.
> >
> > No, Mattias is right. The time is sampled once before the loop, then the
> function is executed 10 million (ITERATIONS) times in the loop, and then the
> time is sampled once again.
> 
> No. I am not disagreeing. That why I said, “Yes. In this setup”.

Sorry, I misunderstood. Then we're all on the same page here. :-)

> 
> All I am saying, there is a more accurate way of doing measurement for
> this test along with “data” at
> https://mails.dpdk.org/archives/dev/2024-September/301227.html
> 
> 
> >
> > So the overhead and accuracy of the timing function is amortized across the
> 10 million calls to the function being measured, and becomes insignificant.
> >
> > Other perf tests also do it this way, and also use rte_get_timer_cycles().
> E.g. the mempool_perf test.
> >
> > Another detail: The for loop itself may cost a few cycles, which may not be
> irrelevant when measuring a function using very few cycles. If the compiler
> doesn't unroll the loop, it should be done manually:
> >
> >         for (i = 0; i < ITERATIONS / 100; i++) {
> >                 update_fun();
> >                 update_fun();
> >                 ... repeated 100 times
> 
> I have done a similar scheme for trace perf for inline function test
> at https://github.com/DPDK/dpdk/blob/main/app/test/test_trace_perf.c#L30

Nice macro. :-)

> 
> Either the above scheme or the below scheme needs to be used as
> mentioned in https://mails.dpdk.org/archives/dev/2024-September/301227.html
> 
> +       for (i = 0; i < ITERATIONS; i++) {
> +               start = rte_rdtsc_precise();
>                 update_fun();
> +               end = rte_rdtsc_precise();
> +               latency += (end - start) - tsc_latency;
> +       }
> 

I prefer reading the timestamps outside the loop.
If there is any jitter in the execution time (or cycles used) by rte_rdtsc_precise(), it gets amortized when used outside the loop. If used inside the loop, the jitter adds up, and may affect the result.

On the other hand, I guess using rte_rdtsc_precise() inside the loop may show different results, due to its memory barriers. I don't know; just speculating.

Maybe we want to use both methods to measure this? Considering that we are measuring the time to access frequently used variables in hot parts of the code, as implemented by three different design patterns. Performance here is quite important.

And if we want to subtract the overhead from rte_rdtsc_precise() itself - which I think is a good idea if used inside the loop - we probably need another loop to measure that, rather than just calling it twice and subtracting the returned values.

> 
> 
> 
> >         }
> >
> >
> > >
> > > Patch 1: Make nanoseconds to cycles per iteration
> > > ------------------------------------------------------------------
> > >
> > > diff --git a/app/test/test_lcore_var_perf.c
> b/app/test/test_lcore_var_perf.c
> > > index ea1d7ba90b52..b8d25400f593 100644
> > > --- a/app/test/test_lcore_var_perf.c
> > > +++ b/app/test/test_lcore_var_perf.c
> > > @@ -110,7 +110,7 @@ benchmark_access_method(void (*init_fun)(void),
> > > void (*update_fun)(void))
> > >
> > >         end = rte_get_timer_cycles();
> > >
> > > -       latency = ((end - start) / (double)rte_get_timer_hz()) /
> ITERATIONS;
> > > +       latency = ((end - start)) / ITERATIONS;
> >
> > This calculation uses integer arithmetic, which will round down the
> resulting latency.
> > Please use floating point arithmetic: latency = (end - start) /
> (double)ITERATIONS;
> 
> Yup. It is in patch 2
> https://mails.dpdk.org/archives/dev/2024-September/301227.html

Yep; my comment was mostly meant for Mattias, if he switched from nanoseconds to cycles, to remember using floating point calculation here.

> 
> >
> > >
> > >         return latency;
> > >  }
> > > @@ -137,8 +137,7 @@ test_lcore_var_access(void)
> > >
> > > -       printf("Latencies [ns/update]\n");
> > > +       printf("Latencies [cycles/update]\n");
> > >         printf("Thread-local storage  Static array  Lcore variables\n");
> > > -       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> > > -              sarray_latency * 1e9, lvar_latency * 1e9);
> > > +       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
> > > lvar_latency);
> > >
> > >         return TEST_SUCCESS;
> > >  }
  
Mattias Rönnblom Sept. 16, 2024, 10:50 a.m. UTC | #11
On 2024-09-13 13:23, Jerin Jacob wrote:
> On Fri, Sep 13, 2024 at 12:17 PM Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>>
>> On 2024-09-12 17:11, Jerin Jacob wrote:
>>> On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>>>>
>>>> On 2024-09-12 15:09, Jerin Jacob wrote:
>>>>> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
>>>>> <mattias.ronnblom@ericsson.com> wrote:
>>>>>>
>>>>>> Add basic micro benchmark for lcore variables, in an attempt to assure
>>>>>> that the overhead isn't significantly greater than alternative
>>>>>> approaches, in scenarios where the benefits aren't expected to show up
>>>>>> (i.e., when plenty of cache is available compared to the working set
>>>>>> size of the per-lcore data).
>>>>>>
>>>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>>>> ---
>>>>>>     app/test/meson.build           |   1 +
>>>>>>     app/test/test_lcore_var_perf.c | 160 +++++++++++++++++++++++++++++++++
>>>>>>     2 files changed, 161 insertions(+)
>>>>>>     create mode 100644 app/test/test_lcore_var_perf.c
>>>>>
>>>>>
>>>>>> +static double
>>>>>> +benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
>>>>>> +{
>>>>>> +       uint64_t i;
>>>>>> +       uint64_t start;
>>>>>> +       uint64_t end;
>>>>>> +       double latency;
>>>>>> +
>>>>>> +       init_fun();
>>>>>> +
>>>>>> +       start = rte_get_timer_cycles();
>>>>>> +
>>>>>> +       for (i = 0; i < ITERATIONS; i++)
>>>>>> +               update_fun();
>>>>>> +
>>>>>> +       end = rte_get_timer_cycles();
>>>>>
>>>>> Use precise variant. rte_rdtsc_precise() or so to be accurate
>>>>
>>>> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.
>>>
>>> I was thinking in another way, with 1e7 iteration, the additional
>>> barrier on precise will be amortized, and we get more _deterministic_
>>> behavior e.s.p in case if we print cycles and if we need to catch
>>> regressions.
>>
>> If you time a section of code which spends ~40000000 cycles, it doesn't
>> matter if you add or remove a few cycles at the beginning and the end.
>>
>> The rte_rdtsc_precise() is both better (more precise in the sense of
>> more serialization), and worse (because it's more costly, and thus more
>> intrusive).
> 
> We can calibrate the overhead to remove the cost.
> 
What you are interested in is primarily the impact on (instruction) 
throughput, not the latency of the sequence of instructions that must be 
retired in order to load the lcore variable values, when you switch from
(say) lcore id-indexed static arrays to lcore variables in your module.

Usually, there is no reason to make a distinction between latency and 
throughput in this context, but as you zoom into very short snippets of 
code being executed, the difference becomes relevant. For example, 
adding an div instruction won't necessarily add 12 cc to your program's 
execution time on a Zen 4, even though that is its latency. Rather, the 
effects may, depending on data dependencies and what other instructions 
are executed in parallel, be much smaller.

So, one could argue the ILP you get with the loop is a feature, not a bug.
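To illustrate the latency-versus-throughput point, here is a minimal sketch (hypothetical, not part of the patch) contrasting a dependency-chained update with independent accumulators; on an out-of-order core the second loop is typically limited by throughput rather than instruction latency:

```c
#include <stdint.h>

/* Dependency chain: each multiply depends on the previous result, so
 * execution time approaches n times the instruction latency.
 */
static uint64_t
chained(uint64_t x, int n)
{
	int i;

	for (i = 0; i < n; i++)
		x = x * 3 + 1;

	return x;
}

/* Independent accumulators: the multiplies have no data dependency on
 * each other, so they can execute in parallel and throughput dominates.
 */
static uint64_t
independent(uint64_t a, uint64_t b, int n)
{
	uint64_t s0 = 0, s1 = 0;
	int i;

	for (i = 0; i < n; i++) {
		s0 += a * 3;
		s1 += b * 5;
	}

	return s0 + s1;
}
```

A benchmark loop over independent update_fun() calls measures something closer to the second case, which is also what most real fast paths resemble.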

With or without per-iteration latency measurements, these benchmarks are 
not very useful at best, and misleading at worst. I will rework them to 
include more than a single module/lcore variable, which I think would be 
somewhat of an improvement.

Even better would be to have some real domain logic, instead of just a 
dummy multiplication.

>>
>> You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It
>> doesn't matter.
> 
> Yes. In this setup and it is pretty inaccurate PER iteration. Please
> refer to the below patch to see the difference.
> 
> Patch 1: Make nanoseconds to cycles per iteration
> ------------------------------------------------------------------
> 
> diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
> index ea1d7ba90b52..b8d25400f593 100644
> --- a/app/test/test_lcore_var_perf.c
> +++ b/app/test/test_lcore_var_perf.c
> @@ -110,7 +110,7 @@ benchmark_access_method(void (*init_fun)(void),
> void (*update_fun)(void))
> 
>          end = rte_get_timer_cycles();
> 
> -       latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
> +       latency = ((end - start)) / ITERATIONS;
> 
>          return latency;
>   }
> @@ -137,8 +137,7 @@ test_lcore_var_access(void)
> 
> -       printf("Latencies [ns/update]\n");
> +       printf("Latencies [cycles/update]\n");
>          printf("Thread-local storage  Static array  Lcore variables\n");
> -       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> -              sarray_latency * 1e9, lvar_latency * 1e9);
> +       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
> lvar_latency);
> 
>          return TEST_SUCCESS;
>   }
> 
> 
> Patch 2: Change to precise with calibration
> -----------------------------------------------------------
> 
> diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
> index ea1d7ba90b52..8142ecd56241 100644
> --- a/app/test/test_lcore_var_perf.c
> +++ b/app/test/test_lcore_var_perf.c
> @@ -96,23 +96,28 @@ lvar_update(void)
>   static double
>   benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
>   {
> -       uint64_t i;
> +       double tsc_latency;
> +       double latency;
>          uint64_t start;
>          uint64_t end;
> -       double latency;
> +       uint64_t i;
> 
> -       init_fun();
> +       /* calculate rte_rdtsc_precise overhead */
> +       start = rte_rdtsc_precise();
> +       end = rte_rdtsc_precise();
> +       tsc_latency = (end - start);
> 
> -       start = rte_get_timer_cycles();
> +       init_fun();
> 
> -       for (i = 0; i < ITERATIONS; i++)
> +       latency = 0;
> +       for (i = 0; i < ITERATIONS; i++) {
> +               start = rte_rdtsc_precise();
>                  update_fun();
> +               end = rte_rdtsc_precise();
> +               latency += (end - start) - tsc_latency;
> +       }
> 
> -       end = rte_get_timer_cycles();
> -
> -       latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
> -
> -       return latency;
> +       return latency / (double)ITERATIONS;
>   }
> 
>   static int
> @@ -135,10 +140,9 @@ test_lcore_var_access(void)
>          sarray_latency = benchmark_access_method(sarray_init, sarray_update);
>          lvar_latency = benchmark_access_method(lvar_init, lvar_update);
> 
> -       printf("Latencies [ns/update]\n");
> +       printf("Latencies [cycles/update]\n");
>          printf("Thread-local storage  Static array  Lcore variables\n");
> -       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> -              sarray_latency * 1e9, lvar_latency * 1e9);
> +       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
> lvar_latency);
> 
>          return TEST_SUCCESS;
>   }
> 
> ARM N2 core with patch 1(aka current scheme)
> -----------------------------------
> 
>   + ------------------------------------------------------- +
>   + Test Suite : lcore variable perf autotest
>   + ------------------------------------------------------- +
> Latencies [cycles/update]
> Thread-local storage  Static array  Lcore variables
>                   7.0           7.0              7.0
> 
> 
> ARM N2 core with patch 2
> -----------------------------------
> 
>   + ------------------------------------------------------- +
>   + Test Suite : lcore variable perf autotest
>   + ------------------------------------------------------- +
> Latencies [cycles/update]
> Thread-local storage  Static array  Lcore variables
>                  11.4          15.5             15.5
> 
> x86 i9 core with patch 1(aka current scheme)
> ------------------------------------------------------------
> 
>   + ------------------------------------------------------- +
>   + Test Suite : lcore variable perf autotest
>   + ------------------------------------------------------- +
> Latencies [ns/update]
> Thread-local storage  Static array  Lcore variables
>                   5.0           6.0              6.0
> 
> x86 i9 core with patch 2
> --------------------------------
>   + ------------------------------------------------------- +
>   + Test Suite : lcore variable perf autotest
>   + ------------------------------------------------------- +
> Latencies [cycles/update]
> Thread-local storage  Static array  Lcore variables
>                   5.3          10.6             11.7
> 
> 
> 
> 
> 
>>
>>> Furthermore, you may consider replacing rte_random() in fast path to
>>> running number or so if it is not deterministic in cycle computation.
>>
>> rte_rand() is not used in the fast path. I don't understand what you
> 
> I missed that. Ignore this comment.
> 
>> mean by "running number".
  
Jerin Jacob Sept. 18, 2024, 10:04 a.m. UTC | #12
On Mon, Sep 16, 2024 at 4:20 PM Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>
> On 2024-09-13 13:23, Jerin Jacob wrote:
> > On Fri, Sep 13, 2024 at 12:17 PM Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> >>
> >> On 2024-09-12 17:11, Jerin Jacob wrote:
> >>> On Thu, Sep 12, 2024 at 6:50 PM Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> >>>>
> >>>> On 2024-09-12 15:09, Jerin Jacob wrote:
> >>>>> On Thu, Sep 12, 2024 at 2:34 PM Mattias Rönnblom
> >>>>> <mattias.ronnblom@ericsson.com> wrote:
> >>>>>>
> >>>>>> Add basic micro benchmark for lcore variables, in an attempt to assure
> >>>>>> that the overhead isn't significantly greater than alternative
> >>>>>> approaches, in scenarios where the benefits aren't expected to show up
> >>>>>> (i.e., when plenty of cache is available compared to the working set
> >>>>>> size of the per-lcore data).
> >>>>>>
> >>>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >>>>>> ---
> >>>>>>     app/test/meson.build           |   1 +
> >>>>>>     app/test/test_lcore_var_perf.c | 160 +++++++++++++++++++++++++++++++++
> >>>>>>     2 files changed, 161 insertions(+)
> >>>>>>     create mode 100644 app/test/test_lcore_var_perf.c
> >>>>>
> >>>>>
> >>>>>> +static double
> >>>>>> +benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
> >>>>>> +{
> >>>>>> +       uint64_t i;
> >>>>>> +       uint64_t start;
> >>>>>> +       uint64_t end;
> >>>>>> +       double latency;
> >>>>>> +
> >>>>>> +       init_fun();
> >>>>>> +
> >>>>>> +       start = rte_get_timer_cycles();
> >>>>>> +
> >>>>>> +       for (i = 0; i < ITERATIONS; i++)
> >>>>>> +               update_fun();
> >>>>>> +
> >>>>>> +       end = rte_get_timer_cycles();
> >>>>>
> >>>>> Use precise variant. rte_rdtsc_precise() or so to be accurate
> >>>>
> >>>> With 1e7 iterations, do you need rte_rdtsc_precise()? I suspect not.
> >>>
> >>> I was thinking in another way, with 1e7 iteration, the additional
> >>> barrier on precise will be amortized, and we get more _deterministic_
> >>> behavior e.s.p in case if we print cycles and if we need to catch
> >>> regressions.
> >>
> >> If you time a section of code which spends ~40000000 cycles, it doesn't
> >> matter if you add or remove a few cycles at the beginning and the end.
> >>
> >> The rte_rdtsc_precise() is both better (more precise in the sense of
> >> more serialization), and worse (because it's more costly, and thus more
> >> intrusive).
> >
> > We can calibrate the overhead to remove the cost.
> >
> What you are interested in is primarily the impact on (instruction)
> throughput, not the latency of the sequence of instructions that must be
> retired in order to load the lcore variable values, when you switch from
> (say) lcore id-indexed static arrays to lcore variables in your module.
>
> Usually, there is no reason to make a distinction between latency and
> throughput in this context, but as you zoom into very short snippets of
> code being executed, the difference becomes relevant. For example,
> adding an div instruction won't necessarily add 12 cc to your program's
> execution time on a Zen 4, even though that is its latency. Rather, the
> effects may, depending on data dependencies and what other instructions
> are executed in parallel, be much smaller.
>
> So, one could argue the ILP you get with the loop is a feature, not a bug.
>
> With or without per-iteration latency measurements, these benchmarks are
> not very useful at best, and misleading at worst. I will rework them to
> include more than a single module/lcore variable, which I think would be
> somewhat of an improvement.

OK. The multiple-module approach will remove the compiler optimization
and be more accurate.
I was doing manual loop unrolling[1] in a trace test case (for small
inline functions).
Either way is fine. Thanks for the rework.

[1]
https://github.com/DPDK/dpdk/blob/main/app/test/test_trace_perf.c#L30


>
> Even better would be to have some real domain logic, instead of just a
> dummy multiplication.
>
> >>
> >> You can use rte_rdtsc_precise(), rte_rdtsc(), or gettimeofday(). It
> >> doesn't matter.
> >
> > Yes. In this setup and it is pretty inaccurate PER iteration. Please
> > refer to the below patch to see the difference.
> >
> > Patch 1: Make nanoseconds to cycles per iteration
> > ------------------------------------------------------------------
> >
> > diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
> > index ea1d7ba90b52..b8d25400f593 100644
> > --- a/app/test/test_lcore_var_perf.c
> > +++ b/app/test/test_lcore_var_perf.c
> > @@ -110,7 +110,7 @@ benchmark_access_method(void (*init_fun)(void),
> > void (*update_fun)(void))
> >
> >          end = rte_get_timer_cycles();
> >
> > -       latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
> > +       latency = ((end - start)) / ITERATIONS;
> >
> >          return latency;
> >   }
> > @@ -137,8 +137,7 @@ test_lcore_var_access(void)
> >
> > -       printf("Latencies [ns/update]\n");
> > +       printf("Latencies [cycles/update]\n");
> >          printf("Thread-local storage  Static array  Lcore variables\n");
> > -       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> > -              sarray_latency * 1e9, lvar_latency * 1e9);
> > +       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
> > lvar_latency);
> >
> >          return TEST_SUCCESS;
> >   }
> >
> >
> > Patch 2: Change to precise with calibration
> > -----------------------------------------------------------
> >
> > diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
> > index ea1d7ba90b52..8142ecd56241 100644
> > --- a/app/test/test_lcore_var_perf.c
> > +++ b/app/test/test_lcore_var_perf.c
> > @@ -96,23 +96,28 @@ lvar_update(void)
> >   static double
> >   benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
> >   {
> > -       uint64_t i;
> > +       double tsc_latency;
> > +       double latency;
> >          uint64_t start;
> >          uint64_t end;
> > -       double latency;
> > +       uint64_t i;
> >
> > -       init_fun();
> > +       /* calculate rte_rdtsc_precise overhead */
> > +       start = rte_rdtsc_precise();
> > +       end = rte_rdtsc_precise();
> > +       tsc_latency = (end - start);
> >
> > -       start = rte_get_timer_cycles();
> > +       init_fun();
> >
> > -       for (i = 0; i < ITERATIONS; i++)
> > +       latency = 0;
> > +       for (i = 0; i < ITERATIONS; i++) {
> > +               start = rte_rdtsc_precise();
> >                  update_fun();
> > +               end = rte_rdtsc_precise();
> > +               latency += (end - start) - tsc_latency;
> > +       }
> >
> > -       end = rte_get_timer_cycles();
> > -
> > -       latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
> > -
> > -       return latency;
> > +       return latency / (double)ITERATIONS;
> >   }
> >
> >   static int
> > @@ -135,10 +140,9 @@ test_lcore_var_access(void)
> >          sarray_latency = benchmark_access_method(sarray_init, sarray_update);
> >          lvar_latency = benchmark_access_method(lvar_init, lvar_update);
> >
> > -       printf("Latencies [ns/update]\n");
> > +       printf("Latencies [cycles/update]\n");
> >          printf("Thread-local storage  Static array  Lcore variables\n");
> > -       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> > -              sarray_latency * 1e9, lvar_latency * 1e9);
> > +       printf("%20.1f %13.1f %16.1f\n", tls_latency, sarray_latency,
> > lvar_latency);
> >
> >          return TEST_SUCCESS;
> >   }
> >
> > ARM N2 core with patch 1(aka current scheme)
> > -----------------------------------
> >
> >   + ------------------------------------------------------- +
> >   + Test Suite : lcore variable perf autotest
> >   + ------------------------------------------------------- +
> > Latencies [cycles/update]
> > Thread-local storage  Static array  Lcore variables
> >                   7.0           7.0              7.0
> >
> >
> > ARM N2 core with patch 2
> > -----------------------------------
> >
> >   + ------------------------------------------------------- +
> >   + Test Suite : lcore variable perf autotest
> >   + ------------------------------------------------------- +
> > Latencies [cycles/update]
> > Thread-local storage  Static array  Lcore variables
> >                  11.4          15.5             15.5
> >
> > x86 i9 core with patch 1(aka current scheme)
> > ------------------------------------------------------------
> >
> >   + ------------------------------------------------------- +
> >   + Test Suite : lcore variable perf autotest
> >   + ------------------------------------------------------- +
> > Latencies [ns/update]
> > Thread-local storage  Static array  Lcore variables
> >                   5.0           6.0              6.0
> >
> > x86 i9 core with patch 2
> > --------------------------------
> >   + ------------------------------------------------------- +
> >   + Test Suite : lcore variable perf autotest
> >   + ------------------------------------------------------- +
> > Latencies [cycles/update]
> > Thread-local storage  Static array  Lcore variables
> >                   5.3          10.6             11.7
> >
> >
> >
> >
> >
> >>
> >>> Furthermore, you may consider replacing rte_random() in fast path to
> >>> running number or so if it is not deterministic in cycle computation.
> >>
> >> rte_rand() is not used in the fast path. I don't understand what you
> >
> > I missed that. Ignore this comment.
> >
> >> mean by "running number".
  

Patch

diff --git a/app/test/meson.build b/app/test/meson.build
index 48279522f0..d4e0c59900 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -104,6 +104,7 @@  source_file_deps = {
     'test_kvargs.c': ['kvargs'],
     'test_latencystats.c': ['ethdev', 'latencystats', 'metrics'] + sample_packet_forward_deps,
     'test_lcore_var.c': [],
+    'test_lcore_var_perf.c': [],
     'test_lcores.c': [],
     'test_link_bonding.c': ['ethdev', 'net_bond',
         'net'] + packet_burst_generator_deps + virtual_pmd_deps,
diff --git a/app/test/test_lcore_var_perf.c b/app/test/test_lcore_var_perf.c
new file mode 100644
index 0000000000..ea1d7ba90b
--- /dev/null
+++ b/app/test/test_lcore_var_perf.c
@@ -0,0 +1,160 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Ericsson AB
+ */
+
+#include <stdio.h>
+
+#include <rte_cycles.h>
+#include <rte_lcore_var.h>
+#include <rte_per_lcore.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+struct lcore_state {
+	uint64_t a;
+	uint64_t b;
+	uint64_t sum;
+};
+
+static void
+init(struct lcore_state *state)
+{
+	state->a = rte_rand();
+	state->b = rte_rand();
+	state->sum = 0;
+}
+
+static __rte_always_inline void
+update(struct lcore_state *state)
+{
+	state->sum += state->a * state->b;
+}
+
+static RTE_DEFINE_PER_LCORE(struct lcore_state, tls_lcore_state);
+
+static void
+tls_init(void)
+{
+	init(&RTE_PER_LCORE(tls_lcore_state));
+}
+
+static __rte_noinline void
+tls_update(void)
+{
+	update(&RTE_PER_LCORE(tls_lcore_state));
+}
+
+struct __rte_cache_aligned lcore_state_aligned {
+	uint64_t a;
+	uint64_t b;
+	uint64_t sum;
+};
+
+static struct lcore_state_aligned sarray_lcore_state[RTE_MAX_LCORE];
+
+static void
+sarray_init(void)
+{
+	struct lcore_state *state =
+		(struct lcore_state *)&sarray_lcore_state[rte_lcore_id()];
+
+	init(state);
+}
+
+static __rte_noinline void
+sarray_update(void)
+{
+	struct lcore_state *state =
+		(struct lcore_state *)&sarray_lcore_state[rte_lcore_id()];
+
+	update(state);
+}
+
+RTE_LCORE_VAR_HANDLE(struct lcore_state, lvar_lcore_state);
+
+static void
+lvar_init(void)
+{
+	RTE_LCORE_VAR_ALLOC(lvar_lcore_state);
+
+	struct lcore_state *state = RTE_LCORE_VAR_VALUE(lvar_lcore_state);
+
+	init(state);
+}
+
+static __rte_noinline void
+lvar_update(void)
+{
+	struct lcore_state *state = RTE_LCORE_VAR_VALUE(lvar_lcore_state);
+
+	update(state);
+}
+
+#define ITERATIONS UINT64_C(10000000)
+
+static double
+benchmark_access_method(void (*init_fun)(void), void (*update_fun)(void))
+{
+	uint64_t i;
+	uint64_t start;
+	uint64_t end;
+	double latency;
+
+	init_fun();
+
+	start = rte_get_timer_cycles();
+
+	for (i = 0; i < ITERATIONS; i++)
+		update_fun();
+
+	end = rte_get_timer_cycles();
+
+	latency = ((end - start) / (double)rte_get_timer_hz()) / ITERATIONS;
+
+	return latency;
+}
+
+static int
+test_lcore_var_access(void)
+{
+	/* Note: the potential performance benefit of lcore variables
+	 * compared to thread-local storage or the use of statically
+	 * sized, lcore id-indexed arrays is not shorter latencies in
+	 * a scenario with low cache pressure, but rather fewer cache
+	 * misses in a real-world scenario, with extensive cache
+	 * usage. These tests just try to assure that the lcore
+	 * variable overhead is not significantly greater than that of
+	 * other alternatives, when the per-lcore data is in L1.
+	 */
+	double tls_latency;
+	double sarray_latency;
+	double lvar_latency;
+
+	tls_latency = benchmark_access_method(tls_init, tls_update);
+	sarray_latency = benchmark_access_method(sarray_init, sarray_update);
+	lvar_latency = benchmark_access_method(lvar_init, lvar_update);
+
+	printf("Latencies [ns/update]\n");
+	printf("Thread-local storage  Static array  Lcore variables\n");
+	printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
+	       sarray_latency * 1e9, lvar_latency * 1e9);
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_var_testsuite = {
+	.suite_name = "lcore variable perf autotest",
+	.unit_test_cases = {
+		TEST_CASE(test_lcore_var_access),
+		TEST_CASES_END()
+	},
+};
+
+static int
+test_lcore_var_perf(void)
+{
+	return unit_test_suite_runner(&lcore_var_testsuite);
+}
+
+REGISTER_PERF_TEST(lcore_var_perf_autotest, test_lcore_var_perf);