[v3,1/3] eal: add lcore poll busyness telemetry

Message ID 20220825152852.1231849-2-kevin.laatz@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series: Add lcore poll busyness telemetry

Checks

ci/checkpatch: warning (coding style issues)

Commit Message

Kevin Laatz Aug. 25, 2022, 3:28 p.m. UTC
  From: Anatoly Burakov <anatoly.burakov@intel.com>

Currently, there is no way to measure lcore poll busyness passively,
without any modifications to the application. This patch adds a new EAL API
that can passively track core polling busyness.

The poll busyness is calculated by relying on the fact that most DPDK APIs
poll for packets. Empty polls can be counted as "idle", while non-empty
polls can be counted as busy. To measure lcore poll busyness, we simply
call the telemetry timestamping function with the number of polls a
particular code section has processed, and count the number of cycles we've
spent processing empty bursts. The more empty bursts we encounter, the
fewer cycles we spend in the "busy" state, and the lower the reported core
poll busyness.
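As a sketch, the accounting just described boils down to attributing the cycles elapsed between two timestamping calls to either a busy or an idle bucket, depending on whether the poll returned any work. All names below are illustrative stand-ins, not the actual EAL symbols:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative per-lcore state; the real implementation keeps one such
 * record per lcore inside EAL. */
struct lcore_poll_stats {
	uint64_t last_ts;      /* TSC value at the previous timestamping call */
	uint64_t busy_cycles;  /* cycles attributed to non-empty polls */
	uint64_t total_cycles; /* all cycles observed so far */
};

/* Called after each poll with the number of items it returned. The cycles
 * elapsed since the previous call are counted as "busy" only when the poll
 * was non-empty. */
static void
poll_busyness_timestamp(struct lcore_poll_stats *s, uint64_t now, int nb_polled)
{
	uint64_t diff = now - s->last_ts;

	s->total_cycles += diff;
	if (nb_polled > 0)
		s->busy_cycles += diff;
	s->last_ts = now;
}

/* Reported busyness: the busy share of all observed cycles, in percent. */
static int
poll_busyness_pct(const struct lcore_poll_stats *s)
{
	if (s->total_cycles == 0)
		return 0;
	return (int)(s->busy_cycles * 100 / s->total_cycles);
}
```

With this scheme, a stream of empty polls drives the busy share toward zero without the application having to report anything itself.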

In order for all of the above to work without modifications to the
application, the library code needs to be instrumented with calls to the
lcore telemetry busyness timestamping function. The following parts of DPDK
are instrumented with lcore telemetry calls:

- All major driver APIs:
  - ethdev
  - cryptodev
  - compressdev
  - regexdev
  - bbdev
  - rawdev
  - eventdev
  - dmadev
- Some additional libraries:
  - ring
  - distributor

To avoid performance impact from having lcore telemetry support, a global
variable is exported by EAL, and the call to the timestamping function is
wrapped in a macro, so that whenever telemetry is disabled, only one
additional branch is taken and no function calls are performed. It is also
possible to disable it at compile time by commenting out RTE_LCORE_BUSYNESS
in the build config.
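The macro wrapping described above can be sketched as follows; the symbol names here are illustrative, not the exact ones from the patch:

```c
#include <assert.h>
#include <stdint.h>

#define RTE_LCORE_BUSYNESS 1	/* stand-in for the build-config entry */

/* Global exported by EAL and flipped by the enable/disable telemetry
 * endpoints at runtime. */
int lcore_telemetry_enabled;

static uint64_t timestamp_calls;	/* stands in for the real accounting */

static void
lcore_poll_busyness_timestamp(uint16_t nb_polled)
{
	(void)nb_polled;
	timestamp_calls++;
}

/* When telemetry is disabled at runtime, the only cost is the branch on
 * the exported global: no function call is made. Commenting out the
 * build-config entry compiles the macro down to nothing. */
#ifdef RTE_LCORE_BUSYNESS
#define LCORE_POLL_BUSYNESS_TIMESTAMP(nb) do {			\
	if (lcore_telemetry_enabled)				\
		lcore_poll_busyness_timestamp(nb);		\
} while (0)
#else
#define LCORE_POLL_BUSYNESS_TIMESTAMP(nb) do { } while (0)
#endif
```

The do-while(0) wrapper keeps the macro safe to use as a single statement inside the instrumented driver poll paths.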

This patch also adds a telemetry endpoint to report lcore poll busyness, as
well as telemetry endpoints to enable/disable lcore telemetry. A
documentation entry has been added to the howto guides to explain the usage
of the new telemetry endpoints and API.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Conor Walsh <conor.walsh@intel.com>
Signed-off-by: David Hunt <david.hunt@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>

---
v3:
  * Fix missed renaming to poll busyness
  * Fix clang compilation
  * Fix arm compilation

v2:
  * Use rte_get_tsc_hz() to adjust the telemetry period
  * Rename to reflect polling busyness vs general busyness
  * Fix segfault when calling telemetry timestamp from an unregistered
    non-EAL thread.
  * Minor cleanup
---
 config/meson.build                          |   1 +
 config/rte_config.h                         |   1 +
 lib/bbdev/rte_bbdev.h                       |  17 +-
 lib/compressdev/rte_compressdev.c           |   2 +
 lib/cryptodev/rte_cryptodev.h               |   2 +
 lib/distributor/rte_distributor.c           |  21 +-
 lib/distributor/rte_distributor_single.c    |  14 +-
 lib/dmadev/rte_dmadev.h                     |  15 +-
 lib/eal/common/eal_common_lcore_telemetry.c | 293 ++++++++++++++++++++
 lib/eal/common/meson.build                  |   1 +
 lib/eal/include/rte_lcore.h                 |  80 ++++++
 lib/eal/meson.build                         |   3 +
 lib/eal/version.map                         |   7 +
 lib/ethdev/rte_ethdev.h                     |   2 +
 lib/eventdev/rte_eventdev.h                 |  10 +-
 lib/rawdev/rte_rawdev.c                     |   6 +-
 lib/regexdev/rte_regexdev.h                 |   5 +-
 lib/ring/rte_ring_elem_pvt.h                |   1 +
 meson_options.txt                           |   2 +
 19 files changed, 459 insertions(+), 24 deletions(-)
 create mode 100644 lib/eal/common/eal_common_lcore_telemetry.c
  

Comments

Jerin Jacob Aug. 26, 2022, 7:05 a.m. UTC | #1
On Thu, Aug 25, 2022 at 8:56 PM Kevin Laatz <kevin.laatz@intel.com> wrote:
> [snip commit message]

> diff --git a/meson_options.txt b/meson_options.txt
> index 7c220ad68d..725b851f69 100644
> --- a/meson_options.txt
> +++ b/meson_options.txt
> @@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean', value: false, description:
>         'Install headers to build drivers.')
>  option('enable_kmods', type: 'boolean', value: false, description:
>         'build kernel modules')
> +option('enable_lcore_poll_busyness', type: 'boolean', value: true, description:
> +       'enable collection of lcore poll busyness telemetry')

IMO, all fastpath features should be opt-in, i.e. the default should be false.
For the trace fastpath related changes, we did the same thing,
even though it costs one additional cycle for disabled trace points.



>  option('examples', type: 'string', value: '', description:
>         'Comma-separated list of examples to build by default')
>  option('flexran_sdk', type: 'string', value: '', description:
> --
> 2.31.1
>
  
Bruce Richardson Aug. 26, 2022, 8:07 a.m. UTC | #2
On Fri, Aug 26, 2022 at 12:35:16PM +0530, Jerin Jacob wrote:
> On Thu, Aug 25, 2022 at 8:56 PM Kevin Laatz <kevin.laatz@intel.com> wrote:
> > [snip commit message]
> 
> > diff --git a/meson_options.txt b/meson_options.txt
> > index 7c220ad68d..725b851f69 100644
> > --- a/meson_options.txt
> > +++ b/meson_options.txt
> > @@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean', value: false, description:
> >         'Install headers to build drivers.')
> >  option('enable_kmods', type: 'boolean', value: false, description:
> >         'build kernel modules')
> > +option('enable_lcore_poll_busyness', type: 'boolean', value: true, description:
> > +       'enable collection of lcore poll busyness telemetry')
> 
> IMO, All fastpath features should be opt-in. i.e default should be false.
> For the trace fastpath related changes, We have done the similar thing
> even though it cost additional one cycle for disabled trace points
> 

We do need to consider runtime and build defaults differently, though.
Since this has also runtime enabling, I think having build-time enabling
true as default is ok, so long as the runtime enabling is false (assuming
no noticable overhead when the feature is disabled.)

/Bruce
  
Jerin Jacob Aug. 26, 2022, 8:16 a.m. UTC | #3
On Fri, Aug 26, 2022 at 1:37 PM Bruce Richardson
<bruce.richardson@intel.com> wrote:
>
> On Fri, Aug 26, 2022 at 12:35:16PM +0530, Jerin Jacob wrote:
> > On Thu, Aug 25, 2022 at 8:56 PM Kevin Laatz <kevin.laatz@intel.com> wrote:
> > > [snip commit message]
> >
> > > diff --git a/meson_options.txt b/meson_options.txt
> > > index 7c220ad68d..725b851f69 100644
> > > --- a/meson_options.txt
> > > +++ b/meson_options.txt
> > > @@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean', value: false, description:
> > >         'Install headers to build drivers.')
> > >  option('enable_kmods', type: 'boolean', value: false, description:
> > >         'build kernel modules')
> > > +option('enable_lcore_poll_busyness', type: 'boolean', value: true, description:
> > > +       'enable collection of lcore poll busyness telemetry')
> >
> > IMO, All fastpath features should be opt-in. i.e default should be false.
> > For the trace fastpath related changes, We have done the similar thing
> > even though it cost additional one cycle for disabled trace points
> >
>
> We do need to consider runtime and build defaults differently, though.
> Since this has also runtime enabling, I think having build-time enabling
> true as default is ok, so long as the runtime enabling is false (assuming
> no noticable overhead when the feature is disabled.)

I was talking about buildtime only. The "enable_trace_fp" meson option
defaults to false.

If the concern is enabling this on generic distros, then the distro's
generic config can opt in.

>
> /Bruce
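For reference, the opt-in build-time default being suggested would mirror the existing trace option in meson_options.txt; this is an illustrative fragment, not the merged change:

```meson
option('enable_trace_fp', type: 'boolean', value: false, description:
       'enable fast path trace points.')
option('enable_lcore_poll_busyness', type: 'boolean', value: false, description:
       'enable collection of lcore poll busyness telemetry')
```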
  
Morten Brørup Aug. 26, 2022, 8:29 a.m. UTC | #4
> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> Sent: Friday, 26 August 2022 10.16
> 
> On Fri, Aug 26, 2022 at 1:37 PM Bruce Richardson
> <bruce.richardson@intel.com> wrote:
> >
> > On Fri, Aug 26, 2022 at 12:35:16PM +0530, Jerin Jacob wrote:
> > > On Thu, Aug 25, 2022 at 8:56 PM Kevin Laatz <kevin.laatz@intel.com>
> wrote:
> > > >
> > > > [snip commit message and meson diff]
> > >
> > > IMO, All fastpath features should be opt-in. i.e default should be
> false.
> > > For the trace fastpath related changes, We have done the similar
> thing
> > > even though it cost additional one cycle for disabled trace points
> > >
> >
> > We do need to consider runtime and build defaults differently,
> though.
> > Since this has also runtime enabling, I think having build-time
> enabling
> > true as default is ok, so long as the runtime enabling is false
> (assuming
> > no noticable overhead when the feature is disabled.)
> 
> I was talking about buildtime only. "enable_trace_fp" meson option
> selected as
> false as default.

Agree. "enable_lcore_poll_busyness" is in the fast path, so it should follow the design pattern of "enable_trace_fp".

> 
> If the concern is enabling on generic distros then distro generic
> config can opt in this
> 
> >
> > /Bruce

@Kevin, are you considering a roadmap for using RTE_LCORE_TELEMETRY_TIMESTAMP() for other purposes? Otherwise, it should also be renamed to indicate that it is part of the "poll busyness" telemetry.
  
Kevin Laatz Aug. 26, 2022, 3:27 p.m. UTC | #5
On 26/08/2022 09:29, Morten Brørup wrote:
>> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
>> Sent: Friday, 26 August 2022 10.16
>>
>> On Fri, Aug 26, 2022 at 1:37 PM Bruce Richardson
>> <bruce.richardson@intel.com> wrote:
>>> On Fri, Aug 26, 2022 at 12:35:16PM +0530, Jerin Jacob wrote:
>>>> On Thu, Aug 25, 2022 at 8:56 PM Kevin Laatz <kevin.laatz@intel.com>
>> wrote:
>>>>> [snip commit message and meson diff]
>>>> IMO, All fastpath features should be opt-in. i.e default should be
>> false.
>>>> For the trace fastpath related changes, We have done the similar
>> thing
>>>> even though it cost additional one cycle for disabled trace points
>>>>
>>> We do need to consider runtime and build defaults differently,
>> though.
>>> Since this has also runtime enabling, I think having build-time
>> enabling
>>> true as default is ok, so long as the runtime enabling is false
>> (assuming
>>> no noticable overhead when the feature is disabled.)
>> I was talking about buildtime only. "enable_trace_fp" meson option
>> selected as
>> false as default.
> Agree. "enable_lcore_poll_busyness" is in the fast path, so it should follow the design pattern of "enable_trace_fp".

+1 to making this opt-in. However, I'd lean more towards having the 
buildtime option enabled and the runtime option disabled by default. 
There is no measurable impact caused by the extra branch (the check for 
enabled/disabled in the macro) when disabled at runtime, and we gain the 
benefit of avoiding a recompile to enable it later.

>
>> If the concern is enabling on generic distros then distro generic
>> config can opt in this
>>
>>> /Bruce
> @Kevin, are you considering a roadmap for using RTE_LCORE_TELEMETRY_TIMESTAMP() for other purposes? Otherwise, it should also be renamed to indicate that it is part of the "poll busyness" telemetry.

No further purposes are planned for this macro, I'll rename it in the 
next revision.

-Kevin
  
Morten Brørup Aug. 26, 2022, 3:46 p.m. UTC | #6
> From: Kevin Laatz [mailto:kevin.laatz@intel.com]
> Sent: Friday, 26 August 2022 17.27
> 
> On 26/08/2022 09:29, Morten Brørup wrote:
> >> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> >> Sent: Friday, 26 August 2022 10.16
> >>
> >> On Fri, Aug 26, 2022 at 1:37 PM Bruce Richardson
> >> <bruce.richardson@intel.com> wrote:
> >>> On Fri, Aug 26, 2022 at 12:35:16PM +0530, Jerin Jacob wrote:
> >>>> On Thu, Aug 25, 2022 at 8:56 PM Kevin Laatz
> <kevin.laatz@intel.com>
> >> wrote:
> >>>>> [snip commit message and meson diff]
> >>>> IMO, All fastpath features should be opt-in. i.e default should be
> >> false.
> >>>> For the trace fastpath related changes, We have done the similar
> >> thing
> >>>> even though it cost additional one cycle for disabled trace points
> >>>>
> >>> We do need to consider runtime and build defaults differently,
> >> though.
> >>> Since this has also runtime enabling, I think having build-time
> >> enabling
> >>> true as default is ok, so long as the runtime enabling is false
> >> (assuming
> >>> no noticable overhead when the feature is disabled.)
> >> I was talking about buildtime only. "enable_trace_fp" meson option
> >> selected as
> >> false as default.
> > Agree. "enable_lcore_poll_busyness" is in the fast path, so it should
> follow the design pattern of "enable_trace_fp".
> 
> +1 to making this opt-in. However, I'd lean more towards having the
> buildtime option enabled and the runtime option disabled by default.
> There is no measurable impact cause by the extra branch (the check for
> enabled/disabled in the macro) by disabling at runtime, and we gain the
> benefit of avoiding a recompile to enable it later.

The exact same thing could be said about "enable_trace_fp"; however, the development effort was put into separating it from "enable_trace", so it could be disabled by default.

Your patch is unlikely to get approved if you don't follow the "enable_trace_fp" design pattern as suggested.

> 
> >
> >> If the concern is enabling on generic distros then distro generic
> >> config can opt in this
> >>
> >>> /Bruce
> > @Kevin, are you considering a roadmap for using
> RTE_LCORE_TELEMETRY_TIMESTAMP() for other purposes? Otherwise, it
> should also be renamed to indicate that it is part of the "poll
> busyness" telemetry.
> 
> No further purposes are planned for this macro, I'll rename it in the
> next revision.

OK. Thank you.

Also, there's a new discussion about EAL bloat [1]. Perhaps I'm stretching it here, but it would be nice if your library was made a separate library, instead of part of the EAL library. (Since this kind of feature is not new to the EAL, I will categorize this suggestion as "nice to have", not "must have".)

[1] http://inbox.dpdk.org/dev/2594603.Isy0gbHreE@thomas/T/
  
Mattias Rönnblom Aug. 26, 2022, 10:06 p.m. UTC | #7
On 2022-08-25 17:28, Kevin Laatz wrote:
> From: Anatoly Burakov <anatoly.burakov@intel.com>
> 
> Currently, there is no way to measure lcore poll busyness in a passive way,
> without any modifications to the application. This patch adds a new EAL API
> that will be able to passively track core polling busyness.

There's no generic way, but the DSW event device keeps track of lcore 
utilization (i.e., the fraction of cycles used to perform actual work, 
as opposed to just polling empty queues), and it does so using the same 
basic principles as, from what it seems after a quick look, are used in 
this patch.

> 
> The poll busyness is calculated by relying on the fact that most DPDK API's
> will poll for packets. Empty polls can be counted as "idle", while

Lcore worker threads poll for work. Packets, timeouts, completions, 
event device events, etc.

> non-empty polls can be counted as busy. To measure lcore poll busyness, we

I guess what is meant here is that cycles spent after non-empty polls 
can be counted as busy (useful) cycles? Potentially including the cycles 
spent for the actual poll operation. ("Poll busyness" is a very vague 
term, in my opinion.)

Similarly, cycles spent after an empty poll would not be counted.

> simply call the telemetry timestamping function with the number of polls a
> particular code section has processed, and count the number of cycles we've
> spent processing empty bursts. The more empty bursts we encounter, the less
> cycles we spend in "busy" state, and the less core poll busyness will be
> reported.
> 

Is this the same scheme as DSW? Where a non-zero burst in idle state 
means a transition from idle to busy, and a zero burst poll in busy 
state means a transition from busy to idle?

The issue with this scheme is that you might potentially end up with a 
state transition for every iteration of the application's main loop, if 
packets (or other items of work) only come in on one of the lcore's 
potentially many RX queues (or other input queues, such as eventdev 
ports). That means an rdtsc for every loop, which isn't too bad, but 
still might be noticeable.

An application that gathers items of work from multiple sources before 
actually doing anything breaks this model. For example, consider an lcore 
worker owning two RX queues, performing rte_eth_rx_burst() on both 
before attempting to process any of the received packets. If the last poll 
is empty, the cycles spent will be considered idle, even though they were busy.

An lcore worker might also decide to poll the same RX queue multiple 
times (until it hits an empty poll, or reaches some high upper bound), 
before initiating processing of the packets.

I didn't read your code in detail, so I might be jumping to conclusions.
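To make the multi-queue concern above concrete, here is a minimal model of the "last poll wins" accounting. All names are illustrative stand-ins, not the patch's API: the point is only that classifying a whole loop iteration by the final poll's burst size misattributes iterations where an earlier queue did have work.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of last-poll-wins accounting: the cycles of a
 * whole main-loop iteration are classified busy or idle based solely
 * on the burst size returned by the *last* poll in that iteration.
 */
static bool
iteration_counted_busy(const uint16_t *burst_sizes, size_t nb_polls)
{
	/* the scheme only looks at the final poll's result */
	return nb_polls > 0 && burst_sizes[nb_polls - 1] > 0;
}
```

With two RX queues polled back to back, an iteration that received 32 packets on the first queue but none on the second is counted as idle, which is exactly the misclassification described above.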

> In order for all of the above to work without modifications to the
> application, the library code needs to be instrumented with calls to the
> lcore telemetry busyness timestamping function. The following parts of DPDK
> are instrumented with lcore telemetry calls:
> 
> - All major driver API's:
>    - ethdev
>    - cryptodev
>    - compressdev
>    - regexdev
>    - bbdev
>    - rawdev
>    - eventdev
>    - dmadev
> - Some additional libraries:
>    - ring
>    - distributor

In the past, I've suggested this kind of functionality should go into 
the service framework instead, with the service function explicitly 
signaling whether or not the cycles were spent on something useful.

That seems to me like a more straightforward and more accurate 
solution, but it does require the application to deploy everything as 
services, and also requires a change of the service function signature.

> 
> To avoid performance impact from having lcore telemetry support, a global
> variable is exported by EAL, and a call to timestamping function is wrapped
> into a macro, so that whenever telemetry is disabled, it only takes one

Use a static inline function if you don't need the additional 
expressive power of a macro.

I suggest you also mention the performance implications, when this 
function is enabled.
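For illustration, a static inline wrapper keeps the same single-branch fast path as the macro. The names below are stand-ins, not the patch's identifiers; the stub slow path just counts calls so the branch behaviour is observable:

```c
#include <stdbool.h>
#include <stdint.h>

/* illustrative stand-ins for the exported global and the slow path */
static bool stats_enabled;
static uint64_t timestamp_calls;

static void
stats_timestamp(uint16_t nb_polled)
{
	(void)nb_polled;
	timestamp_calls++;
}

/* one predictable branch when disabled, and no function call */
static inline void
stats_maybe_timestamp(uint16_t nb_polled)
{
	if (stats_enabled)
		stats_timestamp(nb_polled);
}
```

The compiler can inline this just as well as the macro expands, so the disabled-path cost should be the same single branch.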

> additional branch and no function calls are performed. It is also possible
> to disable it at compile time by commenting out RTE_LCORE_BUSYNESS from
> build config.
> 
> This patch also adds a telemetry endpoint to report lcore poll busyness, as
> well as telemetry endpoints to enable/disable lcore telemetry. A
> documentation entry has been added to the howto guides to explain the usage
> of the new telemetry endpoints and API.
> 

Should there really be a dependency from the EAL to the telemetry 
library? That creates a cycle. Maybe some dependency inversion would be 
in order? The telemetry library could instead register an interest in 
getting busy/idle cycle reports from lcores.

> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> Signed-off-by: Conor Walsh <conor.walsh@intel.com>
> Signed-off-by: David Hunt <david.hunt@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> 
> ---
> v3:
>    * Fix missed renaming to poll busyness
>    * Fix clang compilation
>    * Fix arm compilation
> 
> v2:
>    * Use rte_get_tsc_hz() to adjust the telemetry period
>    * Rename to reflect polling busyness vs general busyness
>    * Fix segfault when calling telemetry timestamp from an unregistered
>      non-EAL thread.
>    * Minor cleanup
> ---
>   config/meson.build                          |   1 +
>   config/rte_config.h                         |   1 +
>   lib/bbdev/rte_bbdev.h                       |  17 +-
>   lib/compressdev/rte_compressdev.c           |   2 +
>   lib/cryptodev/rte_cryptodev.h               |   2 +
>   lib/distributor/rte_distributor.c           |  21 +-
>   lib/distributor/rte_distributor_single.c    |  14 +-
>   lib/dmadev/rte_dmadev.h                     |  15 +-
>   lib/eal/common/eal_common_lcore_telemetry.c | 293 ++++++++++++++++++++
>   lib/eal/common/meson.build                  |   1 +
>   lib/eal/include/rte_lcore.h                 |  80 ++++++
>   lib/eal/meson.build                         |   3 +
>   lib/eal/version.map                         |   7 +
>   lib/ethdev/rte_ethdev.h                     |   2 +
>   lib/eventdev/rte_eventdev.h                 |  10 +-
>   lib/rawdev/rte_rawdev.c                     |   6 +-
>   lib/regexdev/rte_regexdev.h                 |   5 +-
>   lib/ring/rte_ring_elem_pvt.h                |   1 +
>   meson_options.txt                           |   2 +
>   19 files changed, 459 insertions(+), 24 deletions(-)
>   create mode 100644 lib/eal/common/eal_common_lcore_telemetry.c
> 
> diff --git a/config/meson.build b/config/meson.build
> index 7f7b6c92fd..d5954a059c 100644
> --- a/config/meson.build
> +++ b/config/meson.build
> @@ -297,6 +297,7 @@ endforeach
>   dpdk_conf.set('RTE_MAX_ETHPORTS', get_option('max_ethports'))
>   dpdk_conf.set('RTE_LIBEAL_USE_HPET', get_option('use_hpet'))
>   dpdk_conf.set('RTE_ENABLE_TRACE_FP', get_option('enable_trace_fp'))
> +dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', get_option('enable_lcore_poll_busyness'))
>   # values which have defaults which may be overridden
>   dpdk_conf.set('RTE_MAX_VFIO_GROUPS', 64)
>   dpdk_conf.set('RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB', 64)
> diff --git a/config/rte_config.h b/config/rte_config.h
> index 46549cb062..498702c9c7 100644
> --- a/config/rte_config.h
> +++ b/config/rte_config.h
> @@ -39,6 +39,7 @@
>   #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
>   #define RTE_BACKTRACE 1
>   #define RTE_MAX_VFIO_CONTAINERS 64
> +#define RTE_LCORE_POLL_BUSYNESS_PERIOD_MS 2
>   
>   /* bsd module defines */
>   #define RTE_CONTIGMEM_MAX_NUM_BUFS 64
> diff --git a/lib/bbdev/rte_bbdev.h b/lib/bbdev/rte_bbdev.h
> index b88c88167e..d6ed176cce 100644
> --- a/lib/bbdev/rte_bbdev.h
> +++ b/lib/bbdev/rte_bbdev.h
> @@ -28,6 +28,7 @@ extern "C" {
>   #include <stdbool.h>
>   
>   #include <rte_cpuflags.h>
> +#include <rte_lcore.h>
>   
>   #include "rte_bbdev_op.h"
>   
> @@ -599,7 +600,9 @@ rte_bbdev_dequeue_enc_ops(uint16_t dev_id, uint16_t queue_id,
>   {
>   	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
>   	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
> -	return dev->dequeue_enc_ops(q_data, ops, num_ops);
> +	const uint16_t nb_ops = dev->dequeue_enc_ops(q_data, ops, num_ops);
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   /**
> @@ -631,7 +634,9 @@ rte_bbdev_dequeue_dec_ops(uint16_t dev_id, uint16_t queue_id,
>   {
>   	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
>   	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
> -	return dev->dequeue_dec_ops(q_data, ops, num_ops);
> +	const uint16_t nb_ops = dev->dequeue_dec_ops(q_data, ops, num_ops);
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   
> @@ -662,7 +667,9 @@ rte_bbdev_dequeue_ldpc_enc_ops(uint16_t dev_id, uint16_t queue_id,
>   {
>   	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
>   	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
> -	return dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
> +	const uint16_t nb_ops = dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   /**
> @@ -692,7 +699,9 @@ rte_bbdev_dequeue_ldpc_dec_ops(uint16_t dev_id, uint16_t queue_id,
>   {
>   	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
>   	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
> -	return dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
> +	const uint16_t nb_ops = dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   /** Definitions of device event types */
> diff --git a/lib/compressdev/rte_compressdev.c b/lib/compressdev/rte_compressdev.c
> index 22c438f2dd..912cee9a16 100644
> --- a/lib/compressdev/rte_compressdev.c
> +++ b/lib/compressdev/rte_compressdev.c
> @@ -580,6 +580,8 @@ rte_compressdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
>   	nb_ops = (*dev->dequeue_burst)
>   			(dev->data->queue_pairs[qp_id], ops, nb_ops);
>   
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
> +
>   	return nb_ops;
>   }
>   
> diff --git a/lib/cryptodev/rte_cryptodev.h b/lib/cryptodev/rte_cryptodev.h
> index 56f459c6a0..072874020d 100644
> --- a/lib/cryptodev/rte_cryptodev.h
> +++ b/lib/cryptodev/rte_cryptodev.h
> @@ -1915,6 +1915,8 @@ rte_cryptodev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
>   		rte_rcu_qsbr_thread_offline(list->qsbr, 0);
>   	}
>   #endif
> +
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
>   	return nb_ops;
>   }
>   
> diff --git a/lib/distributor/rte_distributor.c b/lib/distributor/rte_distributor.c
> index 3035b7a999..35b0d8d36b 100644
> --- a/lib/distributor/rte_distributor.c
> +++ b/lib/distributor/rte_distributor.c
> @@ -56,6 +56,8 @@ rte_distributor_request_pkt(struct rte_distributor *d,
>   
>   		while (rte_rdtsc() < t)
>   			rte_pause();
> +		/* this was an empty poll */
> +		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
>   	}
>   
>   	/*
> @@ -134,24 +136,29 @@ rte_distributor_get_pkt(struct rte_distributor *d,
>   
>   	if (unlikely(d->alg_type == RTE_DIST_ALG_SINGLE)) {
>   		if (return_count <= 1) {
> +			uint16_t cnt;
>   			pkts[0] = rte_distributor_get_pkt_single(d->d_single,
> -				worker_id, return_count ? oldpkt[0] : NULL);
> -			return (pkts[0]) ? 1 : 0;
> -		} else
> -			return -EINVAL;
> +								 worker_id,
> +								 return_count ? oldpkt[0] : NULL);
> +			cnt = (pkts[0] != NULL) ? 1 : 0;
> +			RTE_LCORE_TELEMETRY_TIMESTAMP(cnt);
> +			return cnt;
> +		}
> +		return -EINVAL;
>   	}
>   
>   	rte_distributor_request_pkt(d, worker_id, oldpkt, return_count);
>   
> -	count = rte_distributor_poll_pkt(d, worker_id, pkts);
> -	while (count == -1) {
> +	while ((count = rte_distributor_poll_pkt(d, worker_id, pkts)) == -1) {
>   		uint64_t t = rte_rdtsc() + 100;
>   
>   		while (rte_rdtsc() < t)
>   			rte_pause();
>   
> -		count = rte_distributor_poll_pkt(d, worker_id, pkts);
> +		/* this was an empty poll */
> +		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
>   	}
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(count);
>   	return count;
>   }
>   
> diff --git a/lib/distributor/rte_distributor_single.c b/lib/distributor/rte_distributor_single.c
> index 2c77ac454a..63cc9aab69 100644
> --- a/lib/distributor/rte_distributor_single.c
> +++ b/lib/distributor/rte_distributor_single.c
> @@ -31,8 +31,13 @@ rte_distributor_request_pkt_single(struct rte_distributor_single *d,
>   	union rte_distributor_buffer_single *buf = &d->bufs[worker_id];
>   	int64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
>   			| RTE_DISTRIB_GET_BUF;
> -	RTE_WAIT_UNTIL_MASKED(&buf->bufptr64, RTE_DISTRIB_FLAGS_MASK,
> -		==, 0, __ATOMIC_RELAXED);
> +
> +	while (!((__atomic_load_n(&buf->bufptr64, __ATOMIC_RELAXED)
> +			& RTE_DISTRIB_FLAGS_MASK) == 0)) {
> +		rte_pause();
> +		/* this was an empty poll */
> +		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
> +	}
>   
>   	/* Sync with distributor on GET_BUF flag. */
>   	__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
> @@ -59,8 +64,11 @@ rte_distributor_get_pkt_single(struct rte_distributor_single *d,
>   {
>   	struct rte_mbuf *ret;
>   	rte_distributor_request_pkt_single(d, worker_id, oldpkt);
> -	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL)
> +	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL) {
>   		rte_pause();
> +		/* this was an empty poll */
> +		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
> +	}
>   	return ret;
>   }
>   
> diff --git a/lib/dmadev/rte_dmadev.h b/lib/dmadev/rte_dmadev.h
> index e7f992b734..98176a6a7a 100644
> --- a/lib/dmadev/rte_dmadev.h
> +++ b/lib/dmadev/rte_dmadev.h
> @@ -149,6 +149,7 @@
>   #include <rte_bitops.h>
>   #include <rte_common.h>
>   #include <rte_compat.h>
> +#include <rte_lcore.h>
>   
>   #ifdef __cplusplus
>   extern "C" {
> @@ -1027,7 +1028,7 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
>   		  uint16_t *last_idx, bool *has_error)
>   {
>   	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
> -	uint16_t idx;
> +	uint16_t idx, nb_ops;
>   	bool err;
>   
>   #ifdef RTE_DMADEV_DEBUG
> @@ -1050,8 +1051,10 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
>   		has_error = &err;
>   
>   	*has_error = false;
> -	return (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
> -				 has_error);
> +	nb_ops = (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
> +				   has_error);
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   /**
> @@ -1090,7 +1093,7 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
>   			 enum rte_dma_status_code *status)
>   {
>   	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
> -	uint16_t idx;
> +	uint16_t idx, nb_ops;
>   
>   #ifdef RTE_DMADEV_DEBUG
>   	if (!rte_dma_is_valid(dev_id) || nb_cpls == 0 || status == NULL)
> @@ -1101,8 +1104,10 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
>   	if (last_idx == NULL)
>   		last_idx = &idx;
>   
> -	return (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
> +	nb_ops = (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
>   					last_idx, status);
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   /**
> diff --git a/lib/eal/common/eal_common_lcore_telemetry.c b/lib/eal/common/eal_common_lcore_telemetry.c
> new file mode 100644
> index 0000000000..bba0afc26d
> --- /dev/null
> +++ b/lib/eal/common/eal_common_lcore_telemetry.c
> @@ -0,0 +1,293 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2014 Intel Corporation
> + */
> +
> +#include <unistd.h>
> +#include <limits.h>
> +#include <string.h>
> +
> +#include <rte_common.h>
> +#include <rte_cycles.h>
> +#include <rte_errno.h>
> +#include <rte_lcore.h>
> +
> +#ifdef RTE_LCORE_POLL_BUSYNESS
> +#include <rte_telemetry.h>
> +#endif
> +
> +int __rte_lcore_telemetry_enabled;

Is "telemetry" really the term to use here? Isn't this just another 
piece of statistics? It can be used for telemetry, or in some other fashion.

(Use bool not int.)

> +
> +#ifdef RTE_LCORE_POLL_BUSYNESS
> +
> +struct lcore_telemetry {
> +	int poll_busyness;
> +	/**< Calculated poll busyness (gets set/returned by the API) */
> +	int raw_poll_busyness;
> +	/**< Calculated poll busyness times 100. */
> +	uint64_t interval_ts;
> +	/**< when previous telemetry interval started */
> +	uint64_t empty_cycles;
> +	/**< empty cycle count since last interval */
> +	uint64_t last_poll_ts;
> +	/**< last poll timestamp */
> +	bool last_empty;
> +	/**< if last poll was empty */
> +	unsigned int contig_poll_cnt;
> +	/**< contiguous (always empty/non empty) poll counter */
> +} __rte_cache_aligned;
> +
> +static struct lcore_telemetry *telemetry_data;
> +
> +#define LCORE_POLL_BUSYNESS_MAX 100
> +#define LCORE_POLL_BUSYNESS_NOT_SET -1
> +#define LCORE_POLL_BUSYNESS_MIN 0
> +
> +#define SMOOTH_COEFF 5
> +#define STATE_CHANGE_OPT 32
> +
> +static void lcore_config_init(void)
> +{
> +	int lcore_id;
> +
> +	telemetry_data = calloc(RTE_MAX_LCORE, sizeof(telemetry_data[0]));
> +	if (telemetry_data == NULL)
> +		rte_panic("Could not init lcore telemetry data: Out of memory\n");
> +
> +	RTE_LCORE_FOREACH(lcore_id) {
> +		struct lcore_telemetry *td = &telemetry_data[lcore_id];
> +
> +		td->interval_ts = 0;
> +		td->last_poll_ts = 0;
> +		td->empty_cycles = 0;
> +		td->last_empty = true;
> +		td->contig_poll_cnt = 0;
> +		td->poll_busyness = LCORE_POLL_BUSYNESS_NOT_SET;
> +		td->raw_poll_busyness = 0;
> +	}
> +}
> +
> +int rte_lcore_poll_busyness(unsigned int lcore_id)
> +{
> +	const uint64_t active_thresh = rte_get_tsc_hz() * RTE_LCORE_POLL_BUSYNESS_PERIOD_MS;
> +	struct lcore_telemetry *tdata;
> +
> +	if (lcore_id >= RTE_MAX_LCORE)
> +		return -EINVAL;
> +	tdata = &telemetry_data[lcore_id];
> +
> +	/* if the lcore is not active */
> +	if (tdata->interval_ts == 0)
> +		return LCORE_POLL_BUSYNESS_NOT_SET;
> +	/* if the core hasn't been active in a while */
> +	else if ((rte_rdtsc() - tdata->interval_ts) > active_thresh)
> +		return LCORE_POLL_BUSYNESS_NOT_SET;
> +
> +	/* this core is active, report its poll busyness */
> +	return telemetry_data[lcore_id].poll_busyness;
> +}
> +
> +int rte_lcore_poll_busyness_enabled(void)
> +{
> +	return __rte_lcore_telemetry_enabled;
> +}
> +
> +void rte_lcore_poll_busyness_enabled_set(int enable)

Use bool.

> +{
> +	__rte_lcore_telemetry_enabled = !!enable;

!!Another reason to use bool!! :)

If you are allowed to call this function during operation, you'll need 
an atomic store here (and an atomic load on the read side).
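A sketch of what that could look like with C11 atomics; the names are illustrative (the patch itself uses a plain int), and the release/acquire pairing is one reasonable choice so a reader that observes "enabled" also observes any state reset done before the store:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* illustrative enable flag; the patch under review uses a plain int */
static atomic_bool stats_enabled_flag;

static void
stats_enabled_set(bool enable)
{
	/* release: publish any state reset done before enabling */
	atomic_store_explicit(&stats_enabled_flag, enable,
			      memory_order_release);
}

static bool
stats_enabled_get(void)
{
	/* acquire: pairs with the release store above */
	return atomic_load_explicit(&stats_enabled_flag,
				    memory_order_acquire);
}
```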

> +
> +	if (!enable)
> +		lcore_config_init();
> +}
> +
> +static inline int calc_raw_poll_busyness(const struct lcore_telemetry *tdata,
> +				    const uint64_t empty, const uint64_t total)
> +{
> +	/*
> +	 * we don't want to use floating point math here, but we want for our poll
> +	 * busyness to react smoothly to sudden changes, while still keeping the
> +	 * accuracy and making sure that over time the average follows poll busyness
> +	 * as measured just-in-time. therefore, we will calculate the average poll
> +	 * busyness using integer math, but shift the decimal point two places
> +	 * to the right, so that 100.0 becomes 10000. this allows us to report
> +	 * integer values (0..100) while still allowing ourselves to follow the
> +	 * just-in-time measurements when we calculate our averages.
> +	 */
> +	const int max_raw_idle = LCORE_POLL_BUSYNESS_MAX * 100;
> +

Why not just store/manage the number of busy (or idle, or both) cycles? 
Then the user can decide what time period to average over, to what 
extent the lcore utilization from previous periods should be factored 
in, etc.

In DSW, I initially presented only a load statistic (which averaged over 
250 us, with some contribution from previous periods). I later came to 
realize that just exposing the number of busy cycles gave the calling 
application many more options. For example, to present the average load 
during 1 s, you needed to have some control thread sampling the load 
statistic during that time period; whereas once the busy cycles 
statistic was introduced, it just had to read that value twice (at the 
beginning of the period, and at the end), and compare it with the 
amount of wallclock time passed.
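The busy-cycle approach sketched above reduces to simple delta arithmetic on the caller's side. This is a hypothetical illustration (the struct and function names are not from DSW or the patch): utilization over any window is the busy-cycle delta divided by the wallclock (TSC) delta.

```c
#include <stdint.h>

/* hypothetical per-lcore sample a control thread would take */
struct cycle_sample {
	uint64_t tsc;         /* wallclock timestamp, in TSC cycles */
	uint64_t busy_cycles; /* free-running busy-cycle counter */
};

/* utilization in percent (0..100) over the window [start, end] */
static unsigned int
utilization_pct(const struct cycle_sample *start,
		const struct cycle_sample *end)
{
	uint64_t wall = end->tsc - start->tsc;

	if (wall == 0)
		return 0;
	return (unsigned int)((end->busy_cycles - start->busy_cycles)
			      * 100 / wall);
}
```

The averaging window, smoothing, and any bias are then entirely the caller's policy rather than being baked into the library.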

> +	/*
> +	 * at upper end of the poll busyness scale, going up from 90->100 will take
> +	 * longer than going from 10->20 because of the averaging. to address
> +	 * this, we invert the scale when doing calculations: that is, we
> +	 * effectively calculate average *idle* cycle percentage, not average
> +	 * *busy* cycle percentage. this means that the scale is naturally
> +	 * biased towards fast scaling up, and slow scaling down.
> +	 */
> +	const int prev_raw_idle = max_raw_idle - tdata->raw_poll_busyness;
> +
> +	/* calculate rate of idle cycles, times 100 */
> +	const int cur_raw_idle = (int)((empty * max_raw_idle) / total);
> +
> +	/* smoothen the idleness */
> +	const int smoothened_idle =
> +			(cur_raw_idle + prev_raw_idle * (SMOOTH_COEFF - 1)) / SMOOTH_COEFF;
> +
> +	/* convert idleness back to poll busyness */
> +	return max_raw_idle - smoothened_idle;
> +}
> +
> +void __rte_lcore_telemetry_timestamp(uint16_t nb_rx)
> +{
> +	const unsigned int lcore_id = rte_lcore_id();
> +	uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts;
> +	struct lcore_telemetry *tdata;
> +	const bool empty = nb_rx == 0;
> +	uint64_t diff_int, diff_last;
> +	bool last_empty;
> +
> +	/* This telemetry is not supported for unregistered non-EAL threads */
> +	if (lcore_id >= RTE_MAX_LCORE)
> +		return;
> +
> +	tdata = &telemetry_data[lcore_id];
> +	last_empty = tdata->last_empty;
> +
> +	/* optimization: don't do anything if status hasn't changed */
> +	if (last_empty == empty && tdata->contig_poll_cnt++ < STATE_CHANGE_OPT)
> +		return;
> +	/* status changed or we're waiting for too long, reset counter */
> +	tdata->contig_poll_cnt = 0;
> +
> +	cur_tsc = rte_rdtsc();
> +
> +	interval_ts = tdata->interval_ts;
> +	empty_cycles = tdata->empty_cycles;
> +	last_poll_ts = tdata->last_poll_ts;
> +
> +	diff_int = cur_tsc - interval_ts;
> +	diff_last = cur_tsc - last_poll_ts;
> +
> +	/* is this the first time we're here? */
> +	if (interval_ts == 0) {
> +		tdata->poll_busyness = LCORE_POLL_BUSYNESS_MIN;
> +		tdata->raw_poll_busyness = 0;
> +		tdata->interval_ts = cur_tsc;
> +		tdata->empty_cycles = 0;
> +		tdata->contig_poll_cnt = 0;
> +		goto end;
> +	}
> +
> +	/* update the empty counter if we got an empty poll earlier */
> +	if (last_empty)
> +		empty_cycles += diff_last;
> +
> +	/* have we passed the interval? */
> +	uint64_t interval = ((rte_get_tsc_hz() / MS_PER_S) * RTE_LCORE_POLL_BUSYNESS_PERIOD_MS);
> +	if (diff_int > interval) {
> +		int raw_poll_busyness;
> +
> +		/* get updated poll_busyness value */
> +		raw_poll_busyness = calc_raw_poll_busyness(tdata, empty_cycles, diff_int);
> +
> +		/* set a new interval, reset empty counter */
> +		tdata->interval_ts = cur_tsc;
> +		tdata->empty_cycles = 0;
> +		tdata->raw_poll_busyness = raw_poll_busyness;
> +		/* bring poll busyness back to 0..100 range, biased to round up */
> +		tdata->poll_busyness = (raw_poll_busyness + 50) / 100;
> +	} else
> +		/* we may have updated empty counter */
> +		tdata->empty_cycles = empty_cycles;
> +
> +end:
> +	/* update status for next poll */
> +	tdata->last_poll_ts = cur_tsc;
> +	tdata->last_empty = empty;
> +}
> +
> +static int
> +lcore_poll_busyness_enable(const char *cmd __rte_unused,
> +		      const char *params __rte_unused,
> +		      struct rte_tel_data *d)
> +{
> +	rte_lcore_poll_busyness_enabled_set(1);
> +
> +	rte_tel_data_start_dict(d);
> +
> +	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 1);
> +
> +	return 0;
> +}
> +
> +static int
> +lcore_poll_busyness_disable(const char *cmd __rte_unused,
> +		       const char *params __rte_unused,
> +		       struct rte_tel_data *d)
> +{
> +	rte_lcore_poll_busyness_enabled_set(0);
> +
> +	rte_tel_data_start_dict(d);
> +
> +	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 0);
> +
> +	if (telemetry_data != NULL)
> +		free(telemetry_data);
> +
> +	return 0;
> +}
> +
> +static int
> +lcore_handle_poll_busyness(const char *cmd __rte_unused,
> +		      const char *params __rte_unused, struct rte_tel_data *d)
> +{
> +	char corenum[64];
> +	int i;
> +
> +	rte_tel_data_start_dict(d);
> +
> +	RTE_LCORE_FOREACH(i) {
> +		if (!rte_lcore_is_enabled(i))
> +			continue;
> +		snprintf(corenum, sizeof(corenum), "%d", i);
> +		rte_tel_data_add_dict_int(d, corenum, rte_lcore_poll_busyness(i));
> +	}
> +
> +	return 0;
> +}
> +
> +RTE_INIT(lcore_init_telemetry)
> +{
> +	__rte_lcore_telemetry_enabled = true;
> +
> +	lcore_config_init();
> +
> +	rte_telemetry_register_cmd("/eal/lcore/poll_busyness", lcore_handle_poll_busyness,
> +				   "return percentage poll busyness of cores");
> +
> +	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_enable", lcore_poll_busyness_enable,
> +				   "enable lcore poll busyness measurement");
> +
> +	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_disable", lcore_poll_busyness_disable,
> +				   "disable lcore poll busyness measurement");
> +}
> +
> +#else
> +
> +int rte_lcore_poll_busyness(unsigned int lcore_id __rte_unused)
> +{
> +	return -ENOTSUP;
> +}
> +
> +int rte_lcore_poll_busyness_enabled(void)
> +{
> +	return -ENOTSUP;
> +}
> +
> +void rte_lcore_poll_busyness_enabled_set(int enable __rte_unused)
> +{
> +}
> +
> +void __rte_lcore_telemetry_timestamp(uint16_t nb_rx __rte_unused)
> +{
> +}
> +
> +#endif
> diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
> index 917758cc65..a743e66a7d 100644
> --- a/lib/eal/common/meson.build
> +++ b/lib/eal/common/meson.build
> @@ -17,6 +17,7 @@ sources += files(
>           'eal_common_hexdump.c',
>           'eal_common_interrupts.c',
>           'eal_common_launch.c',
> +        'eal_common_lcore_telemetry.c',
>           'eal_common_lcore.c',
>           'eal_common_log.c',
>           'eal_common_mcfg.c',
> diff --git a/lib/eal/include/rte_lcore.h b/lib/eal/include/rte_lcore.h
> index b598e1b9ec..75c1f874cb 100644
> --- a/lib/eal/include/rte_lcore.h
> +++ b/lib/eal/include/rte_lcore.h
> @@ -415,6 +415,86 @@ rte_ctrl_thread_create(pthread_t *thread, const char *name,
>   		const pthread_attr_t *attr,
>   		void *(*start_routine)(void *), void *arg);
>   
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Read poll busyness value corresponding to an lcore.
> + *
> + * @param lcore_id
> + *   Lcore to read poll busyness value for.
> + * @return
> + *   - value between 0 and 100 on success
> + *   - -1 if lcore is not active
> + *   - -EINVAL if lcore is invalid
> + *   - -ENOMEM if not enough memory available
> + *   - -ENOTSUP if not supported
> + */
> +__rte_experimental
> +int
> +rte_lcore_poll_busyness(unsigned int lcore_id);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Check if lcore poll busyness telemetry is enabled.
> + *
> + * @return
> + *   - 1 if lcore telemetry is enabled
> + *   - 0 if lcore telemetry is disabled
> + *   - -ENOTSUP if not lcore telemetry supported
> + */
> +__rte_experimental
> +int
> +rte_lcore_poll_busyness_enabled(void);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Enable or disable poll busyness telemetry.
> + *
> + * @param enable
> + *   1 to enable, 0 to disable
> + */
> +__rte_experimental
> +void
> +rte_lcore_poll_busyness_enabled_set(int enable);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Lcore telemetry timestamping function.
> + *
> + * @param nb_rx
> + *   Number of buffers processed by lcore.
> + */
> +__rte_experimental
> +void
> +__rte_lcore_telemetry_timestamp(uint16_t nb_rx);
> +
> +/** @internal lcore telemetry enabled status */
> +extern int __rte_lcore_telemetry_enabled;
> +
> +/**
> + * Call lcore telemetry timestamp function.
> + *
> + * @param nb_rx
> + *   Number of buffers processed by lcore.
> + */
> +#ifdef RTE_LCORE_POLL_BUSYNESS
> +#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx)                    \
> +	do {                                                    \
> +		if (__rte_lcore_telemetry_enabled)              \
> +			__rte_lcore_telemetry_timestamp(nb_rx); \
> +	} while (0)
> +#else
> +#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx) \
> +	while (0)
> +#endif
> +
>   #ifdef __cplusplus
>   }
>   #endif
> diff --git a/lib/eal/meson.build b/lib/eal/meson.build
> index 056beb9461..2fb90d446b 100644
> --- a/lib/eal/meson.build
> +++ b/lib/eal/meson.build
> @@ -25,6 +25,9 @@ subdir(arch_subdir)
>   deps += ['kvargs']
>   if not is_windows
>       deps += ['telemetry']
> +else
> +	# core poll busyness telemetry depends on telemetry library
> +	dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', false)
>   endif
>   if dpdk_conf.has('RTE_USE_LIBBSD')
>       ext_deps += libbsd
> diff --git a/lib/eal/version.map b/lib/eal/version.map
> index 1f293e768b..f84d2dc319 100644
> --- a/lib/eal/version.map
> +++ b/lib/eal/version.map
> @@ -424,6 +424,13 @@ EXPERIMENTAL {
>   	rte_thread_self;
>   	rte_thread_set_affinity_by_id;
>   	rte_thread_set_priority;
> +
> +	# added in 22.11
> +	__rte_lcore_telemetry_timestamp;
> +	__rte_lcore_telemetry_enabled;
> +	rte_lcore_poll_busyness;
> +	rte_lcore_poll_busyness_enabled;
> +	rte_lcore_poll_busyness_enabled_set;
>   };
>   
>   INTERNAL {
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index de9e970d4d..1caecd5a11 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -5675,6 +5675,8 @@ rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
>   #endif
>   
>   	rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
> +
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx);
>   	return nb_rx;
>   }
>   
> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
> index 6a6f6ea4c1..a1d42d9214 100644
> --- a/lib/eventdev/rte_eventdev.h
> +++ b/lib/eventdev/rte_eventdev.h
> @@ -2153,6 +2153,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
>   			uint16_t nb_events, uint64_t timeout_ticks)
>   {
>   	const struct rte_event_fp_ops *fp_ops;
> +	uint16_t nb_evts;
>   	void *port;
>   
>   	fp_ops = &rte_event_fp_ops[dev_id];
> @@ -2175,10 +2176,13 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
>   	 * requests nb_events as const one
>   	 */
>   	if (nb_events == 1)
> -		return (fp_ops->dequeue)(port, ev, timeout_ticks);
> +		nb_evts = (fp_ops->dequeue)(port, ev, timeout_ticks);
>   	else
> -		return (fp_ops->dequeue_burst)(port, ev, nb_events,
> -					       timeout_ticks);
> +		nb_evts = (fp_ops->dequeue_burst)(port, ev, nb_events,
> +					timeout_ticks);
> +
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_evts);
> +	return nb_evts;
>   }
>   
>   #define RTE_EVENT_DEV_MAINT_OP_FLUSH          (1 << 0)
> diff --git a/lib/rawdev/rte_rawdev.c b/lib/rawdev/rte_rawdev.c
> index 2f0a4f132e..f6c0ed196f 100644
> --- a/lib/rawdev/rte_rawdev.c
> +++ b/lib/rawdev/rte_rawdev.c
> @@ -16,6 +16,7 @@
>   #include <rte_common.h>
>   #include <rte_malloc.h>
>   #include <rte_telemetry.h>
> +#include <rte_lcore.h>
>   
>   #include "rte_rawdev.h"
>   #include "rte_rawdev_pmd.h"
> @@ -226,12 +227,15 @@ rte_rawdev_dequeue_buffers(uint16_t dev_id,
>   			   rte_rawdev_obj_t context)
>   {
>   	struct rte_rawdev *dev;
> +	int nb_ops;
>   
>   	RTE_RAWDEV_VALID_DEVID_OR_ERR_RET(dev_id, -EINVAL);
>   	dev = &rte_rawdevs[dev_id];
>   
>   	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->dequeue_bufs, -ENOTSUP);
> -	return (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
> +	nb_ops = (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   int
> diff --git a/lib/regexdev/rte_regexdev.h b/lib/regexdev/rte_regexdev.h
> index 3bce8090f6..781055b4eb 100644
> --- a/lib/regexdev/rte_regexdev.h
> +++ b/lib/regexdev/rte_regexdev.h
> @@ -1530,6 +1530,7 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
>   			   struct rte_regex_ops **ops, uint16_t nb_ops)
>   {
>   	struct rte_regexdev *dev = &rte_regex_devices[dev_id];
> +	uint16_t deq_ops;
>   #ifdef RTE_LIBRTE_REGEXDEV_DEBUG
>   	RTE_REGEXDEV_VALID_DEV_ID_OR_ERR_RET(dev_id, -EINVAL);
>   	RTE_FUNC_PTR_OR_ERR_RET(*dev->dequeue, -ENOTSUP);
> @@ -1538,7 +1539,9 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
>   		return -EINVAL;
>   	}
>   #endif
> -	return (*dev->dequeue)(dev, qp_id, ops, nb_ops);
> +	deq_ops = (*dev->dequeue)(dev, qp_id, ops, nb_ops);
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(deq_ops);
> +	return deq_ops;
>   }
>   
>   #ifdef __cplusplus
> diff --git a/lib/ring/rte_ring_elem_pvt.h b/lib/ring/rte_ring_elem_pvt.h
> index 83788c56e6..6db09d4291 100644
> --- a/lib/ring/rte_ring_elem_pvt.h
> +++ b/lib/ring/rte_ring_elem_pvt.h
> @@ -379,6 +379,7 @@ __rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
>   end:
>   	if (available != NULL)
>   		*available = entries - n;
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(n);
>   	return n;
>   }
>   
> diff --git a/meson_options.txt b/meson_options.txt
> index 7c220ad68d..725b851f69 100644
> --- a/meson_options.txt
> +++ b/meson_options.txt
> @@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean', value: false, description:
>          'Install headers to build drivers.')
>   option('enable_kmods', type: 'boolean', value: false, description:
>          'build kernel modules')
> +option('enable_lcore_poll_busyness', type: 'boolean', value: true, description:
> +       'enable collection of lcore poll busyness telemetry')
>   option('examples', type: 'string', value: '', description:
>          'Comma-separated list of examples to build by default')
>   option('flexran_sdk', type: 'string', value: '', description:
  
Bruce Richardson Aug. 29, 2022, 8:23 a.m. UTC | #8
On Sat, Aug 27, 2022 at 12:06:19AM +0200, Mattias Rönnblom wrote:
> On 2022-08-25 17:28, Kevin Laatz wrote:
> > From: Anatoly Burakov <anatoly.burakov@intel.com>
> > 
<snip>
> > This patch also adds a telemetry endpoint to report lcore poll busyness, as
> > well as telemetry endpoints to enable/disable lcore telemetry. A
> > documentation entry has been added to the howto guides to explain the usage
> > of the new telemetry endpoints and API.
> > 
> 
> Should there really be a dependency from the EAL to the telemetry library? A
> cycle. Maybe some dependency inversion would be in order? The telemetry
> library could instead register an interest in getting busy/idle cycles
> reports from lcores.
> 
Just on this point, EAL already exposes telemetry and already depends upon
the telemetry library, so there would be no new dependency introduced here.

With the existing code, we avoid a cycle by having telemetry avoid using
EAL functions - and for the couple it does need, e.g. the log function, we
inject the function pointer at init.

/Bruce
  
Bruce Richardson Aug. 29, 2022, 10:41 a.m. UTC | #9
On Fri, Aug 26, 2022 at 05:46:48PM +0200, Morten Brørup wrote:
> > From: Kevin Laatz [mailto:kevin.laatz@intel.com]
> > Sent: Friday, 26 August 2022 17.27
> > 
> > On 26/08/2022 09:29, Morten Brørup wrote:
> > >> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> > >> Sent: Friday, 26 August 2022 10.16
> > >>
> > >> On Fri, Aug 26, 2022 at 1:37 PM Bruce Richardson
> > >> <bruce.richardson@intel.com> wrote:
> > >>> On Fri, Aug 26, 2022 at 12:35:16PM +0530, Jerin Jacob wrote:
> > >>>> On Thu, Aug 25, 2022 at 8:56 PM Kevin Laatz
> > <kevin.laatz@intel.com>
> > >> wrote:
> > >>>>> <snip>
> > >>>>> ---
> > >>>>> diff --git a/meson_options.txt b/meson_options.txt
> > >>>>> index 7c220ad68d..725b851f69 100644
> > >>>>> --- a/meson_options.txt
> > >>>>> +++ b/meson_options.txt
> > >>>>> @@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean',
> > >> value: false, description:
> > >>>>>          'Install headers to build drivers.')
> > >>>>>   option('enable_kmods', type: 'boolean', value: false,
> > >> description:
> > >>>>>          'build kernel modules')
> > >>>>> +option('enable_lcore_poll_busyness', type: 'boolean', value:
> > >> true, description:
> > >>>>> +       'enable collection of lcore poll busyness telemetry')
> > >>>> IMO, all fastpath features should be opt-in, i.e. the default should be
> > >>>> false.
> > >>>> For the trace fastpath related changes, we have done a similar thing,
> > >>>> even though it costs one additional cycle for disabled trace points
> > >>>>
> > >>> We do need to consider runtime and build defaults differently,
> > >> though.
> > >>> Since this has also runtime enabling, I think having build-time
> > >> enabling
> > >>> true as default is ok, so long as the runtime enabling is false
> > >> (assuming
> > >>> no noticeable overhead when the feature is disabled.)
> > >> I was talking about buildtime only. "enable_trace_fp" meson option
> > >> selected as
> > >> false as default.
> > > Agree. "enable_lcore_poll_busyness" is in the fast path, so it should
> > follow the design pattern of "enable_trace_fp".
> > 
> > +1 to making this opt-in. However, I'd lean more towards having the
> > buildtime option enabled and the runtime option disabled by default.
> > There is no measurable impact caused by the extra branch (the check for
> > enabled/disabled in the macro) when disabled at runtime, and we gain the
> > benefit of avoiding a recompile to enable it later.
> 
> The exact same thing could be said about "enable_trace_fp"; however, the development effort was put into separating it from "enable_trace", so it could be disabled by default.
> 
> Your patch is unlikely to get approved if you don't follow the "enable_trace_fp" design pattern as suggested.
> 
> > 
> > >
> > >> If the concern is enabling on generic distros then distro generic
> > >> config can opt in this
> > >>
> > >>> /Bruce
> > > @Kevin, are you considering a roadmap for using
> > RTE_LCORE_TELEMETRY_TIMESTAMP() for other purposes? Otherwise, it
> > should also be renamed to indicate that it is part of the "poll
> > busyness" telemetry.
> > 
> > No further purposes are planned for this macro, I'll rename it in the
> > next revision.
> 
> OK. Thank you.
> 
> Also, there's a new discussion about EAL bloat [1]. Perhaps I'm stretching it here, but it would be nice if your library was made a separate library, instead of part of the EAL library. (Since this kind of feature is not new to the EAL, I will categorize this suggestion as "nice to have", not "must have".)
> 
> [1] http://inbox.dpdk.org/dev/2594603.Isy0gbHreE@thomas/T/
> 

I was actually discussing this with Kevin and Dave H. on Friday, and trying
to make this a separate library is indeed a big stretch. :-)

From that discussion, the key point/gap is that we are really missing a
clean way of providing undefs or macro fallbacks for when a library is just
not present. For example, if this was a separate library we would gain a
number of advantages, e.g. no need for a separate enable/disable flag, but the
big disadvantage is that every header include for it, and every reference
to the macros used in that header, would need to be surrounded by big ugly ifdefs.

For now, adding this into EAL is the far more practical approach, since it
means that even if support for the feature is disabled at build time the
header is still available to provide the appropriate no-op macros.

/Bruce
  
Thomas Monjalon Aug. 29, 2022, 10:53 a.m. UTC | #10
29/08/2022 12:41, Bruce Richardson:
> On Fri, Aug 26, 2022 at 05:46:48PM +0200, Morten Brørup wrote:
> > > From: Kevin Laatz [mailto:kevin.laatz@intel.com]
> > > Sent: Friday, 26 August 2022 17.27
> > > 
> > > On 26/08/2022 09:29, Morten Brørup wrote:
> > > >> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> > > >> Sent: Friday, 26 August 2022 10.16
> > > >>
> > > >> On Fri, Aug 26, 2022 at 1:37 PM Bruce Richardson
> > > >> <bruce.richardson@intel.com> wrote:
> > > >>> On Fri, Aug 26, 2022 at 12:35:16PM +0530, Jerin Jacob wrote:
> > > >>>> On Thu, Aug 25, 2022 at 8:56 PM Kevin Laatz
> > > <kevin.laatz@intel.com>
> > > >> wrote:
> > > >>>>> <snip>
> > > >>>>> ---
> > > >>>>> diff --git a/meson_options.txt b/meson_options.txt
> > > >>>>> index 7c220ad68d..725b851f69 100644
> > > >>>>> --- a/meson_options.txt
> > > >>>>> +++ b/meson_options.txt
> > > >>>>> @@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean',
> > > >> value: false, description:
> > > >>>>>          'Install headers to build drivers.')
> > > >>>>>   option('enable_kmods', type: 'boolean', value: false,
> > > >> description:
> > > >>>>>          'build kernel modules')
> > > >>>>> +option('enable_lcore_poll_busyness', type: 'boolean', value:
> > > >> true, description:
> > > >>>>> +       'enable collection of lcore poll busyness telemetry')
> > > >>>> IMO, All fastpath features should be opt-in. i.e default should be
> > > >> false.
> > > >>>> For the trace fastpath related changes, We have done the similar
> > > >> thing
> > > >>>> even though it cost additional one cycle for disabled trace points
> > > >>>>
> > > >>> We do need to consider runtime and build defaults differently,
> > > >> though.
> > > >>> Since this has also runtime enabling, I think having build-time
> > > >> enabling
> > > >>> true as default is ok, so long as the runtime enabling is false
> > > >> (assuming
> > > >>> no noticable overhead when the feature is disabled.)
> > > >> I was talking about buildtime only. "enable_trace_fp" meson option
> > > >> selected as
> > > >> false as default.
> > > > Agree. "enable_lcore_poll_busyness" is in the fast path, so it should
> > > follow the design pattern of "enable_trace_fp".
> > > 
> > > +1 to making this opt-in. However, I'd lean more towards having the
> > > buildtime option enabled and the runtime option disabled by default.
> > > There is no measurable impact caused by the extra branch (the check for
> > > enabled/disabled in the macro) when disabled at runtime, and we gain the
> > > benefit of avoiding a recompile to enable it later.
> > 
> > The exact same thing could be said about "enable_trace_fp"; however, the development effort was put into separating it from "enable_trace", so it could be disabled by default.
> > 
> > Your patch is unlikely to get approved if you don't follow the "enable_trace_fp" design pattern as suggested.
> > 
> > > 
> > > >
> > > >> If the concern is enabling on generic distros then distro generic
> > > >> config can opt in this
> > > >>
> > > >>> /Bruce
> > > > @Kevin, are you considering a roadmap for using
> > > RTE_LCORE_TELEMETRY_TIMESTAMP() for other purposes? Otherwise, it
> > > should also be renamed to indicate that it is part of the "poll
> > > busyness" telemetry.
> > > 
> > > No further purposes are planned for this macro, I'll rename it in the
> > > next revision.
> > 
> > OK. Thank you.
> > 
> > Also, there's a new discussion about EAL bloat [1]. Perhaps I'm stretching it here, but it would be nice if your library was made a separate library, instead of part of the EAL library. (Since this kind of feature is not new to the EAL, I will categorize this suggestion as "nice to have", not "must have".)
> > 
> > [1] http://inbox.dpdk.org/dev/2594603.Isy0gbHreE@thomas/T/
> > 
> 
> I was actually discussing this with Kevin and Dave H. on Friday, and trying
> to make this a separate library is indeed a big stretch. :-)
> 
> From that discussion, the key point/gap is that we are really missing a
> clean way of providing undefs or macro fallbacks for when a library is just
> not present. For example, if this was a separate library we would gain a
> number of advantages e.g. no need for separate enable/disable flag, but the
> big disadvantage is that every header include for it, and every reference
> to the macros used in that header need to be surrounded by big ugly ifdefs.
> 
> For now, adding this into EAL is the far more practical approach, since it
> means that even if support for the feature is disabled at build time the
> header is still available to provide the appropriate no-op macros.

We can make the library always available with different implementations
based on a build option.
But is it a good idea to have a different behaviour based on a build option?
Why not make it a runtime option?
Can the performance hit be cached in some way?
  
Morten Brørup Aug. 29, 2022, 11:22 a.m. UTC | #11
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Monday, 29 August 2022 12.41
> 
> On Fri, Aug 26, 2022 at 05:46:48PM +0200, Morten Brørup wrote:
> >
> > Also, there's a new discussion about EAL bloat [1]. Perhaps I'm
> stretching it here, but it would be nice if your library was made a
> separate library, instead of part of the EAL library. (Since this kind
> of feature is not new to the EAL, I will categorize this suggestion as
> "nice to have", not "must have".)
> >
> > [1] http://inbox.dpdk.org/dev/2594603.Isy0gbHreE@thomas/T/
> >
> 
> I was actually discussing this with Kevin and Dave H. on Friay, and
> trying
> to make this a separate library is indeed a big stretch. :-)
> 
> From that discussion, the key point/gap is that we are really missing a
> clean way of providing undefs or macro fallbacks for when a library is
> just
> not present. For example, if this was a separate library we would gain
> a
> number of advantages e.g. no need for separate enable/disable flag, but
> the
> big disadvantage is that every header include for it, and every
> reference
> to the macros used in that header need to be surrounded by big ugly
> ifdefs.

I agree that we don't want everything surrounded by big ugly ifdefs.

I think solving this should be the responsibility of a library itself, not of the application or other libraries.

E.g. like the mbuf library's use of RTE_LIBRTE_MBUF_DEBUG.

Additionally, a no-op variant of the library could be required, to be compiled in when the library is disabled at build time.

> 
> For now, adding this into EAL is the far more practical approach, since
> it
> means that even if support for the feature is disabled at build time
> the
> header is still available to provide the appropriate no-op macros.

I think you already discovered the key to solving this problem, but perhaps didn't even notice it yourselves:

It must be possible to omit the *feature* at build time - not necessarily the *library*.
  
Kevin Laatz Aug. 29, 2022, 12:36 p.m. UTC | #12
On 29/08/2022 11:53, Thomas Monjalon wrote:
> 29/08/2022 12:41, Bruce Richardson:
>> On Fri, Aug 26, 2022 at 05:46:48PM +0200, Morten Brørup wrote:
>>>> From: Kevin Laatz [mailto:kevin.laatz@intel.com]
>>>> Sent: Friday, 26 August 2022 17.27
>>>>
>>>> On 26/08/2022 09:29, Morten Brørup wrote:
>>>>>> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
>>>>>> Sent: Friday, 26 August 2022 10.16
>>>>>>
>>>>>> On Fri, Aug 26, 2022 at 1:37 PM Bruce Richardson
>>>>>> <bruce.richardson@intel.com>  wrote:
>>>>>>> On Fri, Aug 26, 2022 at 12:35:16PM +0530, Jerin Jacob wrote:
>>>>>>>> On Thu, Aug 25, 2022 at 8:56 PM Kevin Laatz
>>>> <kevin.laatz@intel.com>
>>>>>> wrote:
>>>>>>>>> <snip>
>>>>>>>>> ---
>>>>>>>>> diff --git a/meson_options.txt b/meson_options.txt
>>>>>>>>> index 7c220ad68d..725b851f69 100644
>>>>>>>>> --- a/meson_options.txt
>>>>>>>>> +++ b/meson_options.txt
>>>>>>>>> @@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean',
>>>>>> value: false, description:
>>>>>>>>>           'Install headers to build drivers.')
>>>>>>>>>    option('enable_kmods', type: 'boolean', value: false,
>>>>>> description:
>>>>>>>>>           'build kernel modules')
>>>>>>>>> +option('enable_lcore_poll_busyness', type: 'boolean', value:
>>>>>> true, description:
>>>>>>>>> +       'enable collection of lcore poll busyness telemetry')
>>>>>>>> IMO, All fastpath features should be opt-in. i.e default should be
>>>>>> false.
>>>>>>>> For the trace fastpath related changes, We have done the similar
>>>>>> thing
>>>>>>>> even though it cost additional one cycle for disabled trace points
>>>>>>>>
>>>>>>> We do need to consider runtime and build defaults differently,
>>>>>> though.
>>>>>>> Since this has also runtime enabling, I think having build-time
>>>>>> enabling
>>>>>>> true as default is ok, so long as the runtime enabling is false
>>>>>> (assuming
>>>>>>> no noticable overhead when the feature is disabled.)
>>>>>> I was talking about buildtime only. "enable_trace_fp" meson option
>>>>>> selected as
>>>>>> false as default.
>>>>> Agree. "enable_lcore_poll_busyness" is in the fast path, so it should
>>>> follow the design pattern of "enable_trace_fp".
>>>>
>>>> +1 to making this opt-in. However, I'd lean more towards having the
>>>> buildtime option enabled and the runtime option disabled by default.
>>>> There is no measurable impact caused by the extra branch (the check for
>>>> enabled/disabled in the macro) when disabled at runtime, and we gain the
>>>> benefit of avoiding a recompile to enable it later.
>>> The exact same thing could be said about "enable_trace_fp"; however, the development effort was put into separating it from "enable_trace", so it could be disabled by default.
>>>
>>> Your patch is unlikely to get approved if you don't follow the "enable_trace_fp" design pattern as suggested.
>>>
>>>>>> If the concern is enabling on generic distros then distro generic
>>>>>> config can opt in this
>>>>>>
>>>>>>> /Bruce
>>>>> @Kevin, are you considering a roadmap for using
>>>> RTE_LCORE_TELEMETRY_TIMESTAMP() for other purposes? Otherwise, it
>>>> should also be renamed to indicate that it is part of the "poll
>>>> busyness" telemetry.
>>>>
>>>> No further purposes are planned for this macro, I'll rename it in the
>>>> next revision.
>>> OK. Thank you.
>>>
>>> Also, there's a new discussion about EAL bloat [1]. Perhaps I'm stretching it here, but it would be nice if your library was made a separate library, instead of part of the EAL library. (Since this kind of feature is not new to the EAL, I will categorize this suggestion as "nice to have", not "must have".)
>>>
>>> [1]http://inbox.dpdk.org/dev/2594603.Isy0gbHreE@thomas/T/
>>>
>> I was actually discussing this with Kevin and Dave H. on Friday, and trying
>> to make this a separate library is indeed a big stretch. :-)
>>
>>  From that discussion, the key point/gap is that we are really missing a
>> clean way of providing undefs or macro fallbacks for when a library is just
>> not present. For example, if this was a separate library we would gain a
>> number of advantages e.g. no need for separate enable/disable flag, but the
>> big disadvantage is that every header include for it, and every reference
>> to the macros used in that header need to be surrounded by big ugly ifdefs.
>>
>> For now, adding this into EAL is the far more practical approach, since it
>> means that even if support for the feature is disabled at build time the
>> header is still available to provide the appropriate no-op macros.
> We can make the library always available with different implementations
> based on a build option.
> But is it a good idea to have a different behaviour based on build option?
> Why not making it a runtime option?
> Can the performance hit be cached in some way?
>
The patches currently include runtime options to enable/disable the 
feature via the API and via telemetry endpoints. We have run performance 
tests and have measured no performance impact with the feature disabled 
at runtime.

We added the buildtime option following previous feedback that an 
absolute guarantee is required that performance would not be affected by 
the addition of this feature. To avoid clutter in the meson options, we 
could either a) remove the buildtime option, or b) allow a CFLAG to 
disable the feature rather than an explicit buildtime option. Either 
way, it would only serve as reassurance.
  
Morten Brørup Aug. 29, 2022, 12:49 p.m. UTC | #13
From: Kevin Laatz [mailto:kevin.laatz@intel.com] 
Sent: Monday, 29 August 2022 14.37
>
> The patches currently include runtime options to enable/disable the feature via API and via telemetry endpoints. We have run performance tests and have failed to measure any performance impact with the feature runtime disabled.

Lots of features are added to DPDK all the time, and they all use the same "insignificant performance impact" argument. But the fact is, each added test-and-branch has some small performance impact (and consumes some entries in the branch prediction table, which may impact performance elsewhere). If you add a million features using this argument, there will be a significant and measurable performance impact.

Which is why I keep insisting on the ability to omit non-core features from DPDK at build time.
  
Kevin Laatz Aug. 29, 2022, 1:16 p.m. UTC | #14
On 26/08/2022 23:06, Mattias Rönnblom wrote:
> On 2022-08-25 17:28, Kevin Laatz wrote:
>> From: Anatoly Burakov <anatoly.burakov@intel.com>
>>
>> Currently, there is no way to measure lcore poll busyness in a 
>> passive way,
>> without any modifications to the application. This patch adds a new 
>> EAL API
>> that will be able to passively track core polling busyness.
>
> There's no generic way, but the DSW event device keeps track of lcore 
> utilization (i.e., the fraction of cycles used to perform actual work, 
> as opposed to just polling empty queues), and it does so with the 
> same basic principles as, from what it seems after a quick look, are 
> used in this patch.
>
>>
>> The poll busyness is calculated by relying on the fact that most DPDK 
>> API's
>> will poll for packets. Empty polls can be counted as "idle", while
>
> Lcore worker threads poll for work. Packets, timeouts, completions, 
> event device events, etc.

Yes, the wording was too restrictive here - the patch includes changes 
to drivers and libraries such as dmadev, eventdev, ring, etc., that 
poll for work and would want to mark it as "idle" or "busy".


>
>> non-empty polls can be counted as busy. To measure lcore poll 
>> busyness, we
>
> I guess what is meant here is that cycles spent after non-empty polls 
> can be counted as busy (useful) cycles? Potentially including the 
> cycles spent for the actual poll operation. ("Poll busyness" is a very 
> vague term, in my opinion.)
>
> Similarly, cycles spent after an empty poll would not be counted.

Correct, the generic functionality works this way. Any cycles between a 
"busy poll" and the next "idle poll" will be counted as busy/useful work 
(and vice versa).


>
>> simply call the telemetry timestamping function with the number of 
>> polls a
>> particular code section has processed, and count the number of cycles 
>> we've
>> spent processing empty bursts. The more empty bursts we encounter, 
>> the less
>> cycles we spend in "busy" state, and the less core poll busyness will be
>> reported.
>>
>
> Is this the same scheme as DSW? Where a non-zero burst in idle state 
> means a transition from idle to busy? And a zero-burst poll in busy 
> state means a transition from busy to idle?
>
> The issue with this scheme, is that you might potentially end up with 
> a state transition for every iteration of the application's main loop, 
> if packets (or other items of work) only comes in on one of the 
> lcore's potentially many RX queues (or other input queues, such as 
> eventdev ports). That means a rdtsc for every loop, which isn't too 
> bad, but still might be noticeable.
>
> An application that gathers items of work from multiple sources before 
> actually doing anything breaks this model. For example, consider an 
> lcore worker owning two RX queues, performing rte_eth_rx_burst() on 
> both before attempting to process any of the received packets. If the 
> last poll is empty, the cycles spent will be considered idle, even 
> though they were busy.
>
> An lcore worker might also decide to poll the same RX queue multiple 
> times (until it hits an empty poll, or reaches some high upper 
> bound), before initiating processing of the packets.

Yes, more complex applications will need to be modified to gain a more 
fine-grained busyness metric. In order to achieve this level of 
accuracy, application context is required. To provide it, the 
'RTE_LCORE_POLL_BUSYNESS_TIMESTAMP()' macro can be used within the 
application to mark sections as "busy" or "not busy". Using 
your example above, the application could keep track of multiple bursts 
(whether they have work or not) and call the macro before initiating the 
processing to signal that there is, in fact, work to be done.

There's a section in the documentation update in this patchset that 
describes it. It might need more work if it's not clear :-)


>
> I didn't read your code in detail, so I might be jumping to conclusions.
>
>> In order for all of the above to work without modifications to the
>> application, the library code needs to be instrumented with calls to the
>> lcore telemetry busyness timestamping function. The following parts 
>> of DPDK
>> are instrumented with lcore telemetry calls:
>>
>> - All major driver API's:
>>    - ethdev
>>    - cryptodev
>>    - compressdev
>>    - regexdev
>>    - bbdev
>>    - rawdev
>>    - eventdev
>>    - dmadev
>> - Some additional libraries:
>>    - ring
>>    - distributor
>
> In the past, I've suggested this kind of functionality should go into 
> the service framework instead, with the service function explicitly 
> signaling whether or not the cycles were spent on something useful.
>
> That seems to me like a more straightforward and more accurate 
> solution, but does require the application to deploy everything as 
> services, and also requires a change of the service function signature.
>
>>
>> To avoid performance impact from having lcore telemetry support, a 
>> global
>> variable is exported by EAL, and a call to timestamping function is 
>> wrapped
>> into a macro, so that whenever telemetry is disabled, it only takes one
>
> Use a static inline function if you don't need the additional 
> expressive power of a macro.
>
> I suggest you also mention the performance implications, when this 
> function is enabled.

Sure, I can add a note in the next revision.


>
>> additional branch and no function calls are performed. It is also 
>> possible
>> to disable it at compile time by commenting out RTE_LCORE_BUSYNESS from
>> build config.
>>
>> This patch also adds a telemetry endpoint to report lcore poll 
>> busyness, as
>> well as telemetry endpoints to enable/disable lcore telemetry. A
>> documentation entry has been added to the howto guides to explain the 
>> usage
>> of the new telemetry endpoints and API.
>>
>
> Should there really be a dependency from the EAL to the telemetry 
> library? A cycle. Maybe some dependency inversion would be in order? 
> The telemetry library could instead register an interest in getting 
> busy/idle cycles reports from lcores.
>
>> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
>> Signed-off-by: Conor Walsh <conor.walsh@intel.com>
>> Signed-off-by: David Hunt <david.hunt@intel.com>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>
>> ---
>> v3:
>>    * Fix missed renaming to poll busyness
>>    * Fix clang compilation
>>    * Fix arm compilation
>>
>> v2:
>>    * Use rte_get_tsc_hz() to adjust the telemetry period
>>    * Rename to reflect polling busyness vs general busyness
>>    * Fix segfault when calling telemetry timestamp from an unregistered
>>      non-EAL thread.
>>    * Minor cleanup
>> ---
>>   config/meson.build                          |   1 +
>>   config/rte_config.h                         |   1 +
>>   lib/bbdev/rte_bbdev.h                       |  17 +-
>>   lib/compressdev/rte_compressdev.c           |   2 +
>>   lib/cryptodev/rte_cryptodev.h               |   2 +
>>   lib/distributor/rte_distributor.c           |  21 +-
>>   lib/distributor/rte_distributor_single.c    |  14 +-
>>   lib/dmadev/rte_dmadev.h                     |  15 +-
>>   lib/eal/common/eal_common_lcore_telemetry.c | 293 ++++++++++++++++++++
>>   lib/eal/common/meson.build                  |   1 +
>>   lib/eal/include/rte_lcore.h                 |  80 ++++++
>>   lib/eal/meson.build                         |   3 +
>>   lib/eal/version.map                         |   7 +
>>   lib/ethdev/rte_ethdev.h                     |   2 +
>>   lib/eventdev/rte_eventdev.h                 |  10 +-
>>   lib/rawdev/rte_rawdev.c                     |   6 +-
>>   lib/regexdev/rte_regexdev.h                 |   5 +-
>>   lib/ring/rte_ring_elem_pvt.h                |   1 +
>>   meson_options.txt                           |   2 +
>>   19 files changed, 459 insertions(+), 24 deletions(-)
>>   create mode 100644 lib/eal/common/eal_common_lcore_telemetry.c
>>
<snip>
>> diff --git a/lib/eal/common/eal_common_lcore_telemetry.c 
>> b/lib/eal/common/eal_common_lcore_telemetry.c
>> new file mode 100644
>> index 0000000000..bba0afc26d
>> --- /dev/null
>> +++ b/lib/eal/common/eal_common_lcore_telemetry.c
>> @@ -0,0 +1,293 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2010-2014 Intel Corporation
>> + */
>> +
>> +#include <unistd.h>
>> +#include <limits.h>
>> +#include <string.h>
>> +
>> +#include <rte_common.h>
>> +#include <rte_cycles.h>
>> +#include <rte_errno.h>
>> +#include <rte_lcore.h>
>> +
>> +#ifdef RTE_LCORE_POLL_BUSYNESS
>> +#include <rte_telemetry.h>
>> +#endif
>> +
>> +int __rte_lcore_telemetry_enabled;
>
> Is "telemetry" really the term to use here? Isn't this just another 
> piece of statistics? It can be used for telemetry, or in some other 
> fashion.
>
> (Use bool not int.)

Can rename to '__rte_lcore_stats_enabled' in next revision.


>
>> +
>> +#ifdef RTE_LCORE_POLL_BUSYNESS
>> +
>> +struct lcore_telemetry {
>> +    int poll_busyness;
>> +    /**< Calculated poll busyness (gets set/returned by the API) */
>> +    int raw_poll_busyness;
>> +    /**< Calculated poll busyness times 100. */
>> +    uint64_t interval_ts;
>> +    /**< when previous telemetry interval started */
>> +    uint64_t empty_cycles;
>> +    /**< empty cycle count since last interval */
>> +    uint64_t last_poll_ts;
>> +    /**< last poll timestamp */
>> +    bool last_empty;
>> +    /**< if last poll was empty */
>> +    unsigned int contig_poll_cnt;
>> +    /**< contiguous (always empty/non empty) poll counter */
>> +} __rte_cache_aligned;
>> +
>> +static struct lcore_telemetry *telemetry_data;
>> +
>> +#define LCORE_POLL_BUSYNESS_MAX 100
>> +#define LCORE_POLL_BUSYNESS_NOT_SET -1
>> +#define LCORE_POLL_BUSYNESS_MIN 0
>> +
>> +#define SMOOTH_COEFF 5
>> +#define STATE_CHANGE_OPT 32
>> +
>> +static void lcore_config_init(void)
>> +{
>> +    int lcore_id;
>> +
>> +    telemetry_data = calloc(RTE_MAX_LCORE, sizeof(telemetry_data[0]));
>> +    if (telemetry_data == NULL)
>> +        rte_panic("Could not init lcore telemetry data: Out of 
>> memory\n");
>> +
>> +    RTE_LCORE_FOREACH(lcore_id) {
>> +        struct lcore_telemetry *td = &telemetry_data[lcore_id];
>> +
>> +        td->interval_ts = 0;
>> +        td->last_poll_ts = 0;
>> +        td->empty_cycles = 0;
>> +        td->last_empty = true;
>> +        td->contig_poll_cnt = 0;
>> +        td->poll_busyness = LCORE_POLL_BUSYNESS_NOT_SET;
>> +        td->raw_poll_busyness = 0;
>> +    }
>> +}
>> +
>> +int rte_lcore_poll_busyness(unsigned int lcore_id)
>> +{
>> +    const uint64_t active_thresh = rte_get_tsc_hz() * 
>> RTE_LCORE_POLL_BUSYNESS_PERIOD_MS;
>> +    struct lcore_telemetry *tdata;
>> +
>> +    if (lcore_id >= RTE_MAX_LCORE)
>> +        return -EINVAL;
>> +    tdata = &telemetry_data[lcore_id];
>> +
>> +    /* if the lcore is not active */
>> +    if (tdata->interval_ts == 0)
>> +        return LCORE_POLL_BUSYNESS_NOT_SET;
>> +    /* if the core hasn't been active in a while */
>> +    else if ((rte_rdtsc() - tdata->interval_ts) > active_thresh)
>> +        return LCORE_POLL_BUSYNESS_NOT_SET;
>> +
>> +    /* this core is active, report its poll busyness */
>> +    return telemetry_data[lcore_id].poll_busyness;
>> +}
>> +
>> +int rte_lcore_poll_busyness_enabled(void)
>> +{
>> +    return __rte_lcore_telemetry_enabled;
>> +}
>> +
>> +void rte_lcore_poll_busyness_enabled_set(int enable)
>
> Use bool.
>
>> +{
>> +    __rte_lcore_telemetry_enabled = !!enable;
>
> !!Another reason to use bool!! :)
>
> If you are allowed to call this function during operation, you'll need 
> an atomic store here (and an atomic load on the read side).

Ack


>
>> +
>> +    if (!enable)
>> +        lcore_config_init();
>> +}
>> +
>> +static inline int calc_raw_poll_busyness(const struct 
>> lcore_telemetry *tdata,
>> +                    const uint64_t empty, const uint64_t total)
>> +{
>> +    /*
>> +     * we don't want to use floating point math here, but we want 
>> for our poll
>> +     * busyness to react smoothly to sudden changes, while still 
>> keeping the
>> +     * accuracy and making sure that over time the average follows 
>> poll busyness
>> +     * as measured just-in-time. therefore, we will calculate the 
>> average poll
>> +     * busyness using integer math, but shift the decimal point two 
>> places
>> +     * to the right, so that 100.0 becomes 10000. this allows us to 
>> report
>> +     * integer values (0..100) while still allowing ourselves to 
>> follow the
>> +     * just-in-time measurements when we calculate our averages.
>> +     */
>> +    const int max_raw_idle = LCORE_POLL_BUSYNESS_MAX * 100;
>> +
>
> Why not just store/manage the number of busy (or idle, or both) 
> cycles? Then the user can decide what time period to average over, to 
> what extent the lcore utilization from previous periods should be 
> factored in, etc.

There's an option 'RTE_LCORE_POLL_BUSYNESS_PERIOD_MS' added to 
rte_config.h which would allow the user to define the time period over 
which the utilization should be reported. We only do this calculation if 
that time interval has elapsed.


>
> In DSW, I initially presented only a load statistic (which averaged 
> over 250 us, with some contribution from previous period). I later 
> came to realize that just exposing the number of busy cycles left the 
> calling application much more options. For example, to present the 
> average load during 1 s, you needed to have some control thread 
> sampling the load statistic during that time period; whereas once the 
> busy cycles statistic was introduced, it just had to read that value 
> twice (at the beginning of the period, and at the end), and compare 
> the difference with the amount of wallclock time passed.
>
<snip>
  
Kevin Laatz Aug. 29, 2022, 1:37 p.m. UTC | #15
On 29/08/2022 13:49, Morten Brørup wrote:
> From: Kevin Laatz [mailto:kevin.laatz@intel.com]
> Sent: Monday, 29 August 2022 14.37
>> The patches currently include runtime options to enable/disable the feature via API and via telemetry endpoints. We have run performance tests and have failed to measure any performance impact with the feature runtime disabled.
> Lots of features are added to DPDK all the time, and they all use the same "insignificant performance impact" argument. But the fact is, each added test-and-branch has some small performance impact (and consume some entries in the branch prediction table, which may impact performance elsewhere). If you add a million features using this argument, there will be a significant and measurable performance impact.
>
> Which is why I keep insisting on the ability to omit non-core features from DPDK at build time.

I think there's general consensus on having a buildtime option to 
disable it.

Do we agree that it should be buildtime enabled, and runtime disabled by 
default (so just the single additional branch by default), with the 
meson option available to disable it completely at buildtime?
  
Morten Brørup Aug. 29, 2022, 1:44 p.m. UTC | #16
> From: Kevin Laatz [mailto:kevin.laatz@intel.com]
> Sent: Monday, 29 August 2022 15.37
> 
> On 29/08/2022 13:49, Morten Brørup wrote:
> > From: Kevin Laatz [mailto:kevin.laatz@intel.com]
> > Sent: Monday, 29 August 2022 14.37
> >> The patches currently include runtime options to enable/disable the
> feature via API and via telemetry endpoints. We have run performance
> tests and have failed to measure any performance impact with the
> feature runtime disabled.
> > Lots of features are added to DPDK all the time, and they all use the
> same "insignificant performance impact" argument. But the fact is, each
> added test-and-branch has some small performance impact (and consume
> some entries in the branch prediction table, which may impact
> performance elsewhere). If you add a million features using this
> argument, there will be a significant and measurable performance
> impact.
> >
> > Which is why I keep insisting on the ability to omit non-core
> features from DPDK at build time.
> 
> I think there's general consensus in having a buildtime option to
> disable it.
> 
> Do we agree that it should be buildtime enabled, and runtime disabled
> by
> default (so just the single additional branch by default), with the
> meson option available to disable it completely at buildtime?

No. This feature is in the fast path, so please follow the "enable_trace_fp" design pattern, which also has fast path trace disabled at build time.

-Morten
  
Kevin Laatz Aug. 29, 2022, 2:21 p.m. UTC | #17
On 29/08/2022 14:44, Morten Brørup wrote:
>> From: Kevin Laatz [mailto:kevin.laatz@intel.com]
>> Sent: Monday, 29 August 2022 15.37
>>
>> On 29/08/2022 13:49, Morten Brørup wrote:
>>> From: Kevin Laatz [mailto:kevin.laatz@intel.com]
>>> Sent: Monday, 29 August 2022 14.37
>>>> The patches currently include runtime options to enable/disable the
>> feature via API and via telemetry endpoints. We have run performance
>> tests and have failed to measure any performance impact with the
>> feature runtime disabled.
>>> Lots of features are added to DPDK all the time, and they all use the
>> same "insignificant performance impact" argument. But the fact is, each
>> added test-and-branch has some small performance impact (and consume
>> some entries in the branch prediction table, which may impact
>> performance elsewhere). If you add a million features using this
>> argument, there will be a significant and measurable performance
>> impact.
>>> Which is why I keep insisting on the ability to omit non-core
>> features from DPDK at build time.
>>
>> I think there's general consensus in having a buildtime option to
>> disable it.
>>
>> Do we agree that it should be buildtime enabled, and runtime disabled
>> by
>> default (so just the single additional branch by default), with the
>> meson option available to disable it completely at buildtime?
> No. This feature is in the fast path, so please follow the "enable_trace_fp" design pattern, which also has fast path trace disabled at build time.
>
Ok, will make this change for the v4. Thanks!
  
Kevin Laatz Aug. 30, 2022, 10:26 a.m. UTC | #18
On 26/08/2022 23:06, Mattias Rönnblom wrote:
> On 2022-08-25 17:28, Kevin Laatz wrote:
>> From: Anatoly Burakov <anatoly.burakov@intel.com>
<snip>
>>
>> To avoid performance impact from having lcore telemetry support, a 
>> global
>> variable is exported by EAL, and a call to timestamping function is 
>> wrapped
>> into a macro, so that whenever telemetry is disabled, it only takes one
>
> Use a static inline function if you don't need the additional 
> expressive power of a macro.
>
> I suggest you also mention the performance implications, when this 
> function is enabled.

Keeping the performance implications of having the feature enabled in 
mind, I think the expressive power of the macro is beneficial here.


<snip>

>> diff --git a/lib/eal/common/eal_common_lcore_telemetry.c 
>> b/lib/eal/common/eal_common_lcore_telemetry.c
>> new file mode 100644
>> index 0000000000..bba0afc26d
>> --- /dev/null
>> +++ b/lib/eal/common/eal_common_lcore_telemetry.c
>> @@ -0,0 +1,293 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2010-2014 Intel Corporation
>> + */
>> +
>> +#include <unistd.h>
>> +#include <limits.h>
>> +#include <string.h>
>> +
>> +#include <rte_common.h>
>> +#include <rte_cycles.h>
>> +#include <rte_errno.h>
>> +#include <rte_lcore.h>
>> +
>> +#ifdef RTE_LCORE_POLL_BUSYNESS
>> +#include <rte_telemetry.h>
>> +#endif
>> +
>> +int __rte_lcore_telemetry_enabled;
>
> Is "telemetry" really the term to use here? Isn't this just another 
> piece of statistics? It can be used for telemetry, or in some other 
> fashion.
>
> (Use bool not int.)

Will change to bool.

Looking at this again, the telemetry naming is more accurate here since 
'__rte_lcore_telemetry_enabled' is used to enable/disable the telemetry 
endpoints.

-Kevin
  

Patch

diff --git a/config/meson.build b/config/meson.build
index 7f7b6c92fd..d5954a059c 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -297,6 +297,7 @@  endforeach
 dpdk_conf.set('RTE_MAX_ETHPORTS', get_option('max_ethports'))
 dpdk_conf.set('RTE_LIBEAL_USE_HPET', get_option('use_hpet'))
 dpdk_conf.set('RTE_ENABLE_TRACE_FP', get_option('enable_trace_fp'))
+dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', get_option('enable_lcore_poll_busyness'))
 # values which have defaults which may be overridden
 dpdk_conf.set('RTE_MAX_VFIO_GROUPS', 64)
 dpdk_conf.set('RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB', 64)
diff --git a/config/rte_config.h b/config/rte_config.h
index 46549cb062..498702c9c7 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -39,6 +39,7 @@ 
 #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
 #define RTE_BACKTRACE 1
 #define RTE_MAX_VFIO_CONTAINERS 64
+#define RTE_LCORE_POLL_BUSYNESS_PERIOD_MS 2
 
 /* bsd module defines */
 #define RTE_CONTIGMEM_MAX_NUM_BUFS 64
diff --git a/lib/bbdev/rte_bbdev.h b/lib/bbdev/rte_bbdev.h
index b88c88167e..d6ed176cce 100644
--- a/lib/bbdev/rte_bbdev.h
+++ b/lib/bbdev/rte_bbdev.h
@@ -28,6 +28,7 @@  extern "C" {
 #include <stdbool.h>
 
 #include <rte_cpuflags.h>
+#include <rte_lcore.h>
 
 #include "rte_bbdev_op.h"
 
@@ -599,7 +600,9 @@  rte_bbdev_dequeue_enc_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_enc_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_enc_ops(q_data, ops, num_ops);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -631,7 +634,9 @@  rte_bbdev_dequeue_dec_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_dec_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_dec_ops(q_data, ops, num_ops);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 
@@ -662,7 +667,9 @@  rte_bbdev_dequeue_ldpc_enc_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -692,7 +699,9 @@  rte_bbdev_dequeue_ldpc_dec_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /** Definitions of device event types */
diff --git a/lib/compressdev/rte_compressdev.c b/lib/compressdev/rte_compressdev.c
index 22c438f2dd..912cee9a16 100644
--- a/lib/compressdev/rte_compressdev.c
+++ b/lib/compressdev/rte_compressdev.c
@@ -580,6 +580,8 @@  rte_compressdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 	nb_ops = (*dev->dequeue_burst)
 			(dev->data->queue_pairs[qp_id], ops, nb_ops);
 
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+
 	return nb_ops;
 }
 
diff --git a/lib/cryptodev/rte_cryptodev.h b/lib/cryptodev/rte_cryptodev.h
index 56f459c6a0..072874020d 100644
--- a/lib/cryptodev/rte_cryptodev.h
+++ b/lib/cryptodev/rte_cryptodev.h
@@ -1915,6 +1915,8 @@  rte_cryptodev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 		rte_rcu_qsbr_thread_offline(list->qsbr, 0);
 	}
 #endif
+
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
 	return nb_ops;
 }
 
diff --git a/lib/distributor/rte_distributor.c b/lib/distributor/rte_distributor.c
index 3035b7a999..35b0d8d36b 100644
--- a/lib/distributor/rte_distributor.c
+++ b/lib/distributor/rte_distributor.c
@@ -56,6 +56,8 @@  rte_distributor_request_pkt(struct rte_distributor *d,
 
 		while (rte_rdtsc() < t)
 			rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
 	}
 
 	/*
@@ -134,24 +136,29 @@  rte_distributor_get_pkt(struct rte_distributor *d,
 
 	if (unlikely(d->alg_type == RTE_DIST_ALG_SINGLE)) {
 		if (return_count <= 1) {
+			uint16_t cnt;
 			pkts[0] = rte_distributor_get_pkt_single(d->d_single,
-				worker_id, return_count ? oldpkt[0] : NULL);
-			return (pkts[0]) ? 1 : 0;
-		} else
-			return -EINVAL;
+								 worker_id,
+								 return_count ? oldpkt[0] : NULL);
+			cnt = (pkts[0] != NULL) ? 1 : 0;
+			RTE_LCORE_TELEMETRY_TIMESTAMP(cnt);
+			return cnt;
+		}
+		return -EINVAL;
 	}
 
 	rte_distributor_request_pkt(d, worker_id, oldpkt, return_count);
 
-	count = rte_distributor_poll_pkt(d, worker_id, pkts);
-	while (count == -1) {
+	while ((count = rte_distributor_poll_pkt(d, worker_id, pkts)) == -1) {
 		uint64_t t = rte_rdtsc() + 100;
 
 		while (rte_rdtsc() < t)
 			rte_pause();
 
-		count = rte_distributor_poll_pkt(d, worker_id, pkts);
+		/* this was an empty poll */
+		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
 	}
+	RTE_LCORE_TELEMETRY_TIMESTAMP(count);
 	return count;
 }
 
diff --git a/lib/distributor/rte_distributor_single.c b/lib/distributor/rte_distributor_single.c
index 2c77ac454a..63cc9aab69 100644
--- a/lib/distributor/rte_distributor_single.c
+++ b/lib/distributor/rte_distributor_single.c
@@ -31,8 +31,13 @@  rte_distributor_request_pkt_single(struct rte_distributor_single *d,
 	union rte_distributor_buffer_single *buf = &d->bufs[worker_id];
 	int64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
 			| RTE_DISTRIB_GET_BUF;
-	RTE_WAIT_UNTIL_MASKED(&buf->bufptr64, RTE_DISTRIB_FLAGS_MASK,
-		==, 0, __ATOMIC_RELAXED);
+
+	while (!((__atomic_load_n(&buf->bufptr64, __ATOMIC_RELAXED)
+			& RTE_DISTRIB_FLAGS_MASK) == 0)) {
+		rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
+	}
 
 	/* Sync with distributor on GET_BUF flag. */
 	__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
@@ -59,8 +64,11 @@  rte_distributor_get_pkt_single(struct rte_distributor_single *d,
 {
 	struct rte_mbuf *ret;
 	rte_distributor_request_pkt_single(d, worker_id, oldpkt);
-	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL)
+	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL) {
 		rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
+	}
 	return ret;
 }
 
diff --git a/lib/dmadev/rte_dmadev.h b/lib/dmadev/rte_dmadev.h
index e7f992b734..98176a6a7a 100644
--- a/lib/dmadev/rte_dmadev.h
+++ b/lib/dmadev/rte_dmadev.h
@@ -149,6 +149,7 @@ 
 #include <rte_bitops.h>
 #include <rte_common.h>
 #include <rte_compat.h>
+#include <rte_lcore.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -1027,7 +1028,7 @@  rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
 		  uint16_t *last_idx, bool *has_error)
 {
 	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
-	uint16_t idx;
+	uint16_t idx, nb_ops;
 	bool err;
 
 #ifdef RTE_DMADEV_DEBUG
@@ -1050,8 +1051,10 @@  rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
 		has_error = &err;
 
 	*has_error = false;
-	return (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
-				 has_error);
+	nb_ops = (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
+				   has_error);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -1090,7 +1093,7 @@  rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
 			 enum rte_dma_status_code *status)
 {
 	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
-	uint16_t idx;
+	uint16_t idx, nb_ops;
 
 #ifdef RTE_DMADEV_DEBUG
 	if (!rte_dma_is_valid(dev_id) || nb_cpls == 0 || status == NULL)
@@ -1101,8 +1104,10 @@  rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
 	if (last_idx == NULL)
 		last_idx = &idx;
 
-	return (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
+	nb_ops = (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
 					last_idx, status);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
diff --git a/lib/eal/common/eal_common_lcore_telemetry.c b/lib/eal/common/eal_common_lcore_telemetry.c
new file mode 100644
index 0000000000..bba0afc26d
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_telemetry.c
@@ -0,0 +1,293 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2014 Intel Corporation
+ */
+
+#include <unistd.h>
+#include <limits.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_errno.h>
+#include <rte_lcore.h>
+
+#ifdef RTE_LCORE_POLL_BUSYNESS
+#include <rte_telemetry.h>
+#endif
+
+int __rte_lcore_telemetry_enabled;
+
+#ifdef RTE_LCORE_POLL_BUSYNESS
+
+struct lcore_telemetry {
+	int poll_busyness;
+	/**< Calculated poll busyness (gets set/returned by the API) */
+	int raw_poll_busyness;
+	/**< Calculated poll busyness times 100. */
+	uint64_t interval_ts;
+	/**< when previous telemetry interval started */
+	uint64_t empty_cycles;
+	/**< empty cycle count since last interval */
+	uint64_t last_poll_ts;
+	/**< last poll timestamp */
+	bool last_empty;
+	/**< if last poll was empty */
+	unsigned int contig_poll_cnt;
+	/**< contiguous (always empty/non empty) poll counter */
+} __rte_cache_aligned;
+
+static struct lcore_telemetry *telemetry_data;
+
+#define LCORE_POLL_BUSYNESS_MAX 100
+#define LCORE_POLL_BUSYNESS_NOT_SET -1
+#define LCORE_POLL_BUSYNESS_MIN 0
+
+#define SMOOTH_COEFF 5
+#define STATE_CHANGE_OPT 32
+
+static void lcore_config_init(void)
+{
+	int lcore_id;
+
+	telemetry_data = calloc(RTE_MAX_LCORE, sizeof(telemetry_data[0]));
+	if (telemetry_data == NULL)
+		rte_panic("Could not init lcore telemetry data: Out of memory\n");
+
+	RTE_LCORE_FOREACH(lcore_id) {
+		struct lcore_telemetry *td = &telemetry_data[lcore_id];
+
+		td->interval_ts = 0;
+		td->last_poll_ts = 0;
+		td->empty_cycles = 0;
+		td->last_empty = true;
+		td->contig_poll_cnt = 0;
+		td->poll_busyness = LCORE_POLL_BUSYNESS_NOT_SET;
+		td->raw_poll_busyness = 0;
+	}
+}
+
+int rte_lcore_poll_busyness(unsigned int lcore_id)
+{
+	const uint64_t active_thresh = rte_get_tsc_hz() * RTE_LCORE_POLL_BUSYNESS_PERIOD_MS;
+	struct lcore_telemetry *tdata;
+
+	if (lcore_id >= RTE_MAX_LCORE)
+		return -EINVAL;
+	tdata = &telemetry_data[lcore_id];
+
+	/* if the lcore is not active */
+	if (tdata->interval_ts == 0)
+		return LCORE_POLL_BUSYNESS_NOT_SET;
+	/* if the core hasn't been active in a while */
+	else if ((rte_rdtsc() - tdata->interval_ts) > active_thresh)
+		return LCORE_POLL_BUSYNESS_NOT_SET;
+
+	/* this core is active, report its poll busyness */
+	return telemetry_data[lcore_id].poll_busyness;
+}
+
+int rte_lcore_poll_busyness_enabled(void)
+{
+	return __rte_lcore_telemetry_enabled;
+}
+
+void rte_lcore_poll_busyness_enabled_set(int enable)
+{
+	__rte_lcore_telemetry_enabled = !!enable;
+
+	if (!enable)
+		lcore_config_init();
+}
+
+static inline int calc_raw_poll_busyness(const struct lcore_telemetry *tdata,
+				    const uint64_t empty, const uint64_t total)
+{
+	/*
+	 * we don't want to use floating point math here, but we want for our poll
+	 * busyness to react smoothly to sudden changes, while still keeping the
+	 * accuracy and making sure that over time the average follows poll busyness
+	 * as measured just-in-time. therefore, we will calculate the average poll
+	 * busyness using integer math, but shift the decimal point two places
+	 * to the right, so that 100.0 becomes 10000. this allows us to report
+	 * integer values (0..100) while still allowing ourselves to follow the
+	 * just-in-time measurements when we calculate our averages.
+	 */
+	const int max_raw_idle = LCORE_POLL_BUSYNESS_MAX * 100;
+
+	/*
+	 * at upper end of the poll busyness scale, going up from 90->100 will take
+	 * longer than going from 10->20 because of the averaging. to address
+	 * this, we invert the scale when doing calculations: that is, we
+	 * effectively calculate average *idle* cycle percentage, not average
+	 * *busy* cycle percentage. this means that the scale is naturally
+	 * biased towards fast scaling up, and slow scaling down.
+	 */
+	const int prev_raw_idle = max_raw_idle - tdata->raw_poll_busyness;
+
+	/* calculate rate of idle cycles, times 100 */
+	const int cur_raw_idle = (int)((empty * max_raw_idle) / total);
+
+	/* smoothen the idleness */
+	const int smoothened_idle =
+			(cur_raw_idle + prev_raw_idle * (SMOOTH_COEFF - 1)) / SMOOTH_COEFF;
+
+	/* convert idleness back to poll busyness */
+	return max_raw_idle - smoothened_idle;
+}
+
+void __rte_lcore_telemetry_timestamp(uint16_t nb_rx)
+{
+	const unsigned int lcore_id = rte_lcore_id();
+	uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts;
+	struct lcore_telemetry *tdata;
+	const bool empty = nb_rx == 0;
+	uint64_t diff_int, diff_last;
+	bool last_empty;
+
+	/* This telemetry is not supported for unregistered non-EAL threads */
+	if (lcore_id >= RTE_MAX_LCORE)
+		return;
+
+	tdata = &telemetry_data[lcore_id];
+	last_empty = tdata->last_empty;
+
+	/* optimization: don't do anything if status hasn't changed */
+	if (last_empty == empty && tdata->contig_poll_cnt++ < STATE_CHANGE_OPT)
+		return;
+	/* status changed, or it has stayed the same for too long; reset counter */
+	tdata->contig_poll_cnt = 0;
+
+	cur_tsc = rte_rdtsc();
+
+	interval_ts = tdata->interval_ts;
+	empty_cycles = tdata->empty_cycles;
+	last_poll_ts = tdata->last_poll_ts;
+
+	diff_int = cur_tsc - interval_ts;
+	diff_last = cur_tsc - last_poll_ts;
+
+	/* is this the first time we're here? */
+	if (interval_ts == 0) {
+		tdata->poll_busyness = LCORE_POLL_BUSYNESS_MIN;
+		tdata->raw_poll_busyness = 0;
+		tdata->interval_ts = cur_tsc;
+		tdata->empty_cycles = 0;
+		tdata->contig_poll_cnt = 0;
+		goto end;
+	}
+
+	/* update the empty counter if we got an empty poll earlier */
+	if (last_empty)
+		empty_cycles += diff_last;
+
+	/* have we passed the interval? */
+	uint64_t interval = ((rte_get_tsc_hz() / MS_PER_S) * RTE_LCORE_POLL_BUSYNESS_PERIOD_MS);
+	if (diff_int > interval) {
+		int raw_poll_busyness;
+
+		/* get updated poll_busyness value */
+		raw_poll_busyness = calc_raw_poll_busyness(tdata, empty_cycles, diff_int);
+
+		/* set a new interval, reset empty counter */
+		tdata->interval_ts = cur_tsc;
+		tdata->empty_cycles = 0;
+		tdata->raw_poll_busyness = raw_poll_busyness;
+		/* bring poll busyness back to 0..100 range, biased to round up */
+		tdata->poll_busyness = (raw_poll_busyness + 50) / 100;
+	} else
+		/* we may have updated empty counter */
+		tdata->empty_cycles = empty_cycles;
+
+end:
+	/* update status for next poll */
+	tdata->last_poll_ts = cur_tsc;
+	tdata->last_empty = empty;
+}
+
+static int
+lcore_poll_busyness_enable(const char *cmd __rte_unused,
+		      const char *params __rte_unused,
+		      struct rte_tel_data *d)
+{
+	rte_lcore_poll_busyness_enabled_set(1);
+
+	rte_tel_data_start_dict(d);
+
+	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 1);
+
+	return 0;
+}
+
+static int
+lcore_poll_busyness_disable(const char *cmd __rte_unused,
+		       const char *params __rte_unused,
+		       struct rte_tel_data *d)
+{
+	rte_lcore_poll_busyness_enabled_set(0);
+
+	rte_tel_data_start_dict(d);
+
+	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 0);
+
+	free(telemetry_data);
+	telemetry_data = NULL;
+
+	return 0;
+}
+
+static int
+lcore_handle_poll_busyness(const char *cmd __rte_unused,
+		      const char *params __rte_unused, struct rte_tel_data *d)
+{
+	char corenum[64];
+	int i;
+
+	rte_tel_data_start_dict(d);
+
+	RTE_LCORE_FOREACH(i) {
+		if (!rte_lcore_is_enabled(i))
+			continue;
+		snprintf(corenum, sizeof(corenum), "%d", i);
+		rte_tel_data_add_dict_int(d, corenum, rte_lcore_poll_busyness(i));
+	}
+
+	return 0;
+}
+
+RTE_INIT(lcore_init_telemetry)
+{
+	__rte_lcore_telemetry_enabled = true;
+
+	lcore_config_init();
+
+	rte_telemetry_register_cmd("/eal/lcore/poll_busyness", lcore_handle_poll_busyness,
+				   "return percentage poll busyness of cores");
+
+	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_enable", lcore_poll_busyness_enable,
+				   "enable lcore poll busyness measurement");
+
+	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_disable", lcore_poll_busyness_disable,
+				   "disable lcore poll busyness measurement");
+}
+
+#else
+
+int rte_lcore_poll_busyness(unsigned int lcore_id __rte_unused)
+{
+	return -ENOTSUP;
+}
+
+int rte_lcore_poll_busyness_enabled(void)
+{
+	return -ENOTSUP;
+}
+
+void rte_lcore_poll_busyness_enabled_set(int enable __rte_unused)
+{
+}
+
+void __rte_lcore_telemetry_timestamp(uint16_t nb_rx __rte_unused)
+{
+}
+
+#endif
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 917758cc65..a743e66a7d 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -17,6 +17,7 @@  sources += files(
         'eal_common_hexdump.c',
         'eal_common_interrupts.c',
         'eal_common_launch.c',
+        'eal_common_lcore_telemetry.c',
         'eal_common_lcore.c',
         'eal_common_log.c',
         'eal_common_mcfg.c',
diff --git a/lib/eal/include/rte_lcore.h b/lib/eal/include/rte_lcore.h
index b598e1b9ec..75c1f874cb 100644
--- a/lib/eal/include/rte_lcore.h
+++ b/lib/eal/include/rte_lcore.h
@@ -415,6 +415,86 @@  rte_ctrl_thread_create(pthread_t *thread, const char *name,
 		const pthread_attr_t *attr,
 		void *(*start_routine)(void *), void *arg);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Read poll busyness value corresponding to an lcore.
+ *
+ * @param lcore_id
+ *   Lcore to read poll busyness value for.
+ * @return
+ *   - value between 0 and 100 on success
+ *   - -1 if lcore is not active
+ *   - -EINVAL if lcore is invalid
+ *   - -ENOMEM if not enough memory available
+ *   - -ENOTSUP if not supported
+ */
+__rte_experimental
+int
+rte_lcore_poll_busyness(unsigned int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Check if lcore poll busyness telemetry is enabled.
+ *
+ * @return
+ *   - 1 if lcore telemetry is enabled
+ *   - 0 if lcore telemetry is disabled
+ *   - -ENOTSUP if lcore telemetry is not supported
+ */
+__rte_experimental
+int
+rte_lcore_poll_busyness_enabled(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Enable or disable poll busyness telemetry.
+ *
+ * @param enable
+ *   1 to enable, 0 to disable
+ */
+__rte_experimental
+void
+rte_lcore_poll_busyness_enabled_set(int enable);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Lcore telemetry timestamping function (called by instrumented DPDK APIs).
+ *
+ * @param nb_rx
+ *   Number of buffers processed by lcore.
+ */
+__rte_experimental
+void
+__rte_lcore_telemetry_timestamp(uint16_t nb_rx);
+
+/** @internal lcore telemetry enabled status */
+extern int __rte_lcore_telemetry_enabled;
+
+/**
+ * Call lcore telemetry timestamp function.
+ *
+ * @param nb_rx
+ *   Number of buffers processed by lcore.
+ */
+#ifdef RTE_LCORE_POLL_BUSYNESS
+#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx)                    \
+	do {                                                    \
+		if (__rte_lcore_telemetry_enabled)              \
+			__rte_lcore_telemetry_timestamp(nb_rx); \
+	} while (0)
+#else
+#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx) \
+	do { (void)(nb_rx); } while (0)
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/meson.build b/lib/eal/meson.build
index 056beb9461..2fb90d446b 100644
--- a/lib/eal/meson.build
+++ b/lib/eal/meson.build
@@ -25,6 +25,9 @@  subdir(arch_subdir)
 deps += ['kvargs']
 if not is_windows
     deps += ['telemetry']
+else
+    # lcore poll busyness telemetry depends on the telemetry library
+    dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', false)
 endif
 if dpdk_conf.has('RTE_USE_LIBBSD')
     ext_deps += libbsd
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 1f293e768b..f84d2dc319 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -424,6 +424,13 @@  EXPERIMENTAL {
 	rte_thread_self;
 	rte_thread_set_affinity_by_id;
 	rte_thread_set_priority;
+
+	# added in 22.11
+	__rte_lcore_telemetry_timestamp;
+	__rte_lcore_telemetry_enabled;
+	rte_lcore_poll_busyness;
+	rte_lcore_poll_busyness_enabled;
+	rte_lcore_poll_busyness_enabled_set;
 };
 
 INTERNAL {
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index de9e970d4d..1caecd5a11 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -5675,6 +5675,8 @@  rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
 #endif
 
 	rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
+
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx);
 	return nb_rx;
 }
 
diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
index 6a6f6ea4c1..a1d42d9214 100644
--- a/lib/eventdev/rte_eventdev.h
+++ b/lib/eventdev/rte_eventdev.h
@@ -2153,6 +2153,7 @@  rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
 			uint16_t nb_events, uint64_t timeout_ticks)
 {
 	const struct rte_event_fp_ops *fp_ops;
+	uint16_t nb_evts;
 	void *port;
 
 	fp_ops = &rte_event_fp_ops[dev_id];
@@ -2175,10 +2176,13 @@  rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
 	 * requests nb_events as const one
 	 */
 	if (nb_events == 1)
-		return (fp_ops->dequeue)(port, ev, timeout_ticks);
+		nb_evts = (fp_ops->dequeue)(port, ev, timeout_ticks);
 	else
-		return (fp_ops->dequeue_burst)(port, ev, nb_events,
-					       timeout_ticks);
+		nb_evts = (fp_ops->dequeue_burst)(port, ev, nb_events,
+					timeout_ticks);
+
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_evts);
+	return nb_evts;
 }
 
 #define RTE_EVENT_DEV_MAINT_OP_FLUSH          (1 << 0)
diff --git a/lib/rawdev/rte_rawdev.c b/lib/rawdev/rte_rawdev.c
index 2f0a4f132e..f6c0ed196f 100644
--- a/lib/rawdev/rte_rawdev.c
+++ b/lib/rawdev/rte_rawdev.c
@@ -16,6 +16,7 @@ 
 #include <rte_common.h>
 #include <rte_malloc.h>
 #include <rte_telemetry.h>
+#include <rte_lcore.h>
 
 #include "rte_rawdev.h"
 #include "rte_rawdev_pmd.h"
@@ -226,12 +227,15 @@  rte_rawdev_dequeue_buffers(uint16_t dev_id,
 			   rte_rawdev_obj_t context)
 {
 	struct rte_rawdev *dev;
+	int nb_ops;
 
 	RTE_RAWDEV_VALID_DEVID_OR_ERR_RET(dev_id, -EINVAL);
 	dev = &rte_rawdevs[dev_id];
 
 	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->dequeue_bufs, -ENOTSUP);
-	return (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
+	nb_ops = (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops > 0 ? nb_ops : 0);
+	return nb_ops;
 }
 
 int
diff --git a/lib/regexdev/rte_regexdev.h b/lib/regexdev/rte_regexdev.h
index 3bce8090f6..781055b4eb 100644
--- a/lib/regexdev/rte_regexdev.h
+++ b/lib/regexdev/rte_regexdev.h
@@ -1530,6 +1530,7 @@  rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 			   struct rte_regex_ops **ops, uint16_t nb_ops)
 {
 	struct rte_regexdev *dev = &rte_regex_devices[dev_id];
+	uint16_t deq_ops;
 #ifdef RTE_LIBRTE_REGEXDEV_DEBUG
 	RTE_REGEXDEV_VALID_DEV_ID_OR_ERR_RET(dev_id, -EINVAL);
 	RTE_FUNC_PTR_OR_ERR_RET(*dev->dequeue, -ENOTSUP);
@@ -1538,7 +1539,9 @@  rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 		return -EINVAL;
 	}
 #endif
-	return (*dev->dequeue)(dev, qp_id, ops, nb_ops);
+	deq_ops = (*dev->dequeue)(dev, qp_id, ops, nb_ops);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(deq_ops);
+	return deq_ops;
 }
 
 #ifdef __cplusplus
diff --git a/lib/ring/rte_ring_elem_pvt.h b/lib/ring/rte_ring_elem_pvt.h
index 83788c56e6..6db09d4291 100644
--- a/lib/ring/rte_ring_elem_pvt.h
+++ b/lib/ring/rte_ring_elem_pvt.h
@@ -379,6 +379,7 @@  __rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
 end:
 	if (available != NULL)
 		*available = entries - n;
+	RTE_LCORE_TELEMETRY_TIMESTAMP(n);
 	return n;
 }
 
diff --git a/meson_options.txt b/meson_options.txt
index 7c220ad68d..725b851f69 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -20,6 +20,8 @@  option('enable_driver_sdk', type: 'boolean', value: false, description:
        'Install headers to build drivers.')
 option('enable_kmods', type: 'boolean', value: false, description:
        'build kernel modules')
+option('enable_lcore_poll_busyness', type: 'boolean', value: true, description:
+       'enable collection of lcore poll busyness telemetry')
 option('examples', type: 'string', value: '', description:
        'Comma-separated list of examples to build by default')
 option('flexran_sdk', type: 'string', value: '', description: