[v1,1/2] eal: add lcore busyness telemetry

Message ID 24c49429394294cfbf0d9c506b205029bac77c8b.1657890378.git.anatoly.burakov@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series: [v1,1/2] eal: add lcore busyness telemetry

Checks

ci/checkpatch: warning (coding style issues)

Commit Message

Anatoly Burakov July 15, 2022, 1:12 p.m. UTC
  Currently, there is no way to measure lcore busyness passively, without
any modification to the application. This patch adds a new EAL API that
passively tracks core busyness.

Busyness is calculated by relying on the fact that most DPDK APIs poll
for packets. Empty polls can be counted as "idle", while non-empty polls
can be counted as busy. To measure lcore busyness, we simply call the
telemetry timestamping function with the number of polls a particular
code section has processed, and count the number of cycles spent
processing empty bursts. The more empty bursts we encounter, the fewer
cycles we spend in the "busy" state, and the lower the reported
busyness.
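The arithmetic described above can be sketched as follows. The function name and structure here are illustrative, not the EAL implementation: over one measurement interval, cycles attributed to empty polls count as idle, and the remainder as busy.

```c
#include <stdint.h>

/* Illustrative sketch of the busyness arithmetic described above.
 * Over one measurement interval, cycles attributed to empty polls
 * count as idle; everything else counts as busy. The name
 * calc_busyness_pct is hypothetical. */
static int
calc_busyness_pct(uint64_t empty_cycles, uint64_t interval_cycles)
{
	if (interval_cycles == 0)
		return 0;
	/* busy cycles = interval minus cycles spent on empty polls */
	uint64_t busy_cycles = interval_cycles - empty_cycles;
	return (int)((busy_cycles * 100) / interval_cycles);
}
```

For example, a core that spends 3 of every 4 million cycles on empty polls would report 25% busyness.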

In order for all of the above to work without modifications to the
application, the library code needs to be instrumented with calls to
the lcore telemetry busyness timestamping function. The following parts
of DPDK are instrumented with lcore telemetry calls:

- All major driver APIs:
  - ethdev
  - cryptodev
  - compressdev
  - regexdev
  - bbdev
  - rawdev
  - eventdev
  - dmadev
- Some additional libraries:
  - ring
  - distributor

To avoid a performance impact from lcore telemetry support, EAL exports
a global variable, and the call to the timestamping function is wrapped
in a macro, so that whenever telemetry is disabled it costs only one
additional branch and no function call is made. The feature can also be
disabled at compile time by commenting out RTE_LCORE_BUSYNESS in the
build config.
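The guard pattern described above can be sketched like so. The flag and macro names mirror the patch, but the stub function body is added here purely so the snippet is self-contained:

```c
#include <stdint.h>

#define RTE_LCORE_BUSYNESS 1

/* Exported by EAL in the patch; defaults to 0 (disabled). */
int __rte_lcore_telemetry_enabled;

/* Stub body standing in for the real timestamping function, so the
 * sketch can be compiled on its own. It just counts invocations. */
static unsigned int timestamp_calls;
static void
__rte_lcore_telemetry_timestamp(uint16_t nb_rx)
{
	(void)nb_rx;
	timestamp_calls++;
}

/* The macro: when telemetry is disabled at runtime, the only cost is
 * the branch on the global flag; when disabled at compile time, the
 * macro expands to nothing at all. */
#ifdef RTE_LCORE_BUSYNESS
#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx)            \
	do {                                            \
		if (__rte_lcore_telemetry_enabled)      \
			__rte_lcore_telemetry_timestamp(nb_rx); \
	} while (0)
#else
#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx) do { } while (0)
#endif
```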

This patch also adds a telemetry endpoint to report lcore busyness, as
well as telemetry endpoints to enable/disable lcore telemetry.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Conor Walsh <conor.walsh@intel.com>
Signed-off-by: David Hunt <david.hunt@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    We did a couple of quick smoke tests to see if this patch causes any performance
    degradation, and it seemed to have none that we could measure. Telemetry can be
    disabled at compile time via a config option, while at runtime it can be
    disabled at the apparent cost of one additional branch.
    
    That said, our benchmarking efforts were admittedly not very rigorous, so
    comments welcome!

 config/rte_config.h                         |   2 +
 lib/bbdev/rte_bbdev.h                       |  17 +-
 lib/compressdev/rte_compressdev.c           |   2 +
 lib/cryptodev/rte_cryptodev.h               |   2 +
 lib/distributor/rte_distributor.c           |  21 +-
 lib/distributor/rte_distributor_single.c    |  14 +-
 lib/dmadev/rte_dmadev.h                     |  15 +-
 lib/eal/common/eal_common_lcore_telemetry.c | 274 ++++++++++++++++++++
 lib/eal/common/meson.build                  |   1 +
 lib/eal/include/rte_lcore.h                 |  80 ++++++
 lib/eal/meson.build                         |   3 +
 lib/eal/version.map                         |   7 +
 lib/ethdev/rte_ethdev.h                     |   2 +
 lib/eventdev/rte_eventdev.h                 |  10 +-
 lib/rawdev/rte_rawdev.c                     |   5 +-
 lib/regexdev/rte_regexdev.h                 |   5 +-
 lib/ring/rte_ring_elem_pvt.h                |   1 +
 17 files changed, 437 insertions(+), 24 deletions(-)
 create mode 100644 lib/eal/common/eal_common_lcore_telemetry.c
  

Comments

Anatoly Burakov July 15, 2022, 1:35 p.m. UTC | #1
On 15-Jul-22 2:12 PM, Anatoly Burakov wrote:
> [snip commit message and diffstat]
> 
> diff --git a/config/rte_config.h b/config/rte_config.h
> index 46549cb062..583cb6f7a5 100644
> --- a/config/rte_config.h
> +++ b/config/rte_config.h
> @@ -39,6 +39,8 @@
>   #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
>   #define RTE_BACKTRACE 1
>   #define RTE_MAX_VFIO_CONTAINERS 64
> +#define RTE_LCORE_BUSYNESS 1
> +#define RTE_LCORE_BUSYNESS_PERIOD 4000000ULL

One possible improvement here would be to specify the period in 
microseconds and use rte_get_tsc_hz() to convert it to a cycle count. 
This would require adding code to EAL init, because we can't use that 
API until EAL has called `rte_eal_timer_init()`, but it would make the 
telemetry period independent of CPU frequency.
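A minimal sketch of that conversion, assuming a hypothetical helper and that the TSC frequency has already been established by EAL init:

```c
#include <stdint.h>

/* Hypothetical helper for the improvement suggested above: convert a
 * telemetry period given in microseconds into TSC cycles. tsc_hz
 * stands in for the value rte_get_tsc_hz() would return once
 * rte_eal_timer_init() has run. */
static uint64_t
telemetry_period_cycles(uint64_t period_us, uint64_t tsc_hz)
{
	return (tsc_hz * period_us) / 1000000ULL;
}
```

On a 2 GHz core, a 2000 us period would yield the same 4,000,000-cycle interval as the current hard-coded default.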
  
Jerin Jacob July 15, 2022, 1:46 p.m. UTC | #2
On Fri, Jul 15, 2022 at 6:42 PM Anatoly Burakov
<anatoly.burakov@intel.com> wrote:
>
> [snip commit message]


It is a good feature. Thanks for this work.


> ---
>
> Notes:
>     We did a couple of quick smoke tests to see if this patch causes any performance
>     degradation, and it seemed to have none that we could measure. Telemetry can be
>     disabled at compile time via a config option, while at runtime it can be
>     disabled, seemingly at a cost of one additional branch.
>
>     That said, our benchmarking efforts were admittedly not very rigorous, so
>     comments welcome!

>
> diff --git a/config/rte_config.h b/config/rte_config.h
> index 46549cb062..583cb6f7a5 100644
> --- a/config/rte_config.h
> +++ b/config/rte_config.h
> @@ -39,6 +39,8 @@
>  #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
>  #define RTE_BACKTRACE 1
>  #define RTE_MAX_VFIO_CONTAINERS 64
> +#define RTE_LCORE_BUSYNESS 1

Please don't enable debug features in the fast path by default.

> +#define RTE_LCORE_BUSYNESS_PERIOD 4000000ULL
>
> +
> +#include <unistd.h>
> +#include <limits.h>
> +#include <string.h>
> +
> +#include <rte_common.h>
> +#include <rte_cycles.h>
> +#include <rte_errno.h>
> +#include <rte_lcore.h>
> +
> +#ifdef RTE_LCORE_BUSYNESS

This clutter may not be required. Let it compile in all cases.

> +#include <rte_telemetry.h>
> +#endif
> +
> +int __rte_lcore_telemetry_enabled;
> +
> +#ifdef RTE_LCORE_BUSYNESS
> +
> +struct lcore_telemetry {
> +       int busyness;
> +       /**< Calculated busyness (gets set/returned by the API) */
> +       int raw_busyness;
> +       /**< Calculated busyness times 100. */
> +       uint64_t interval_ts;
> +       /**< when previous telemetry interval started */
> +       uint64_t empty_cycles;
> +       /**< empty cycle count since last interval */
> +       uint64_t last_poll_ts;
> +       /**< last poll timestamp */
> +       bool last_empty;
> +       /**< if last poll was empty */
> +       unsigned int contig_poll_cnt;
> +       /**< contiguous (always empty/non empty) poll counter */
> +} __rte_cache_aligned;
> +static struct lcore_telemetry telemetry_data[RTE_MAX_LCORE];

Allocate this from hugepage.

> +
> +void __rte_lcore_telemetry_timestamp(uint16_t nb_rx)
> +{
> +       const unsigned int lcore_id = rte_lcore_id();
> +       uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts;
> +       struct lcore_telemetry *tdata = &telemetry_data[lcore_id];
> +       const bool empty = nb_rx == 0;
> +       uint64_t diff_int, diff_last;
> +       bool last_empty;
> +
> +       last_empty = tdata->last_empty;
> +
> +       /* optimization: don't do anything if status hasn't changed */
> +       if (last_empty == empty && tdata->contig_poll_cnt++ < 32)
> +               return;
> +       /* status changed or we're waiting for too long, reset counter */
> +       tdata->contig_poll_cnt = 0;
> +
> +       cur_tsc = rte_rdtsc();
> +
> +       interval_ts = tdata->interval_ts;
> +       empty_cycles = tdata->empty_cycles;
> +       last_poll_ts = tdata->last_poll_ts;
> +
> +       diff_int = cur_tsc - interval_ts;
> +       diff_last = cur_tsc - last_poll_ts;
> +
> +       /* is this the first time we're here? */
> +       if (interval_ts == 0) {
> +               tdata->busyness = LCORE_BUSYNESS_MIN;
> +               tdata->raw_busyness = 0;
> +               tdata->interval_ts = cur_tsc;
> +               tdata->empty_cycles = 0;
> +               tdata->contig_poll_cnt = 0;
> +               goto end;
> +       }
> +
> +       /* update the empty counter if we got an empty poll earlier */
> +       if (last_empty)
> +               empty_cycles += diff_last;
> +
> +       /* have we passed the interval? */
> +       if (diff_int > RTE_LCORE_BUSYNESS_PERIOD) {


I think this function's logic could be limited to just recording the
timestamp in a ring buffer, with a separate telemetry control function
running on a control core doing the heavy lifting, to reduce the
performance impact on the fast path.
> +               int raw_busyness;
> +
> +               /* get updated busyness value */
> +               raw_busyness = calc_raw_busyness(tdata, empty_cycles, diff_int);
> +
> +               /* set a new interval, reset empty counter */
> +               tdata->interval_ts = cur_tsc;
> +               tdata->empty_cycles = 0;
> +               tdata->raw_busyness = raw_busyness;
> +               /* bring busyness back to 0..100 range, biased to round up */
> +               tdata->busyness = (raw_busyness + 50) / 100;
> +       } else
> +               /* we may have updated empty counter */
> +               tdata->empty_cycles = empty_cycles;
> +
> +end:
> +       /* update status for next poll */
> +       tdata->last_poll_ts = cur_tsc;
> +       tdata->last_empty = empty;
> +}
> +
> +#ifdef RTE_LCORE_BUSYNESS
> +#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx)                    \
> +       do {                                                    \
> +               if (__rte_lcore_telemetry_enabled)              \

I think, rather than reading memory, we could patch the instruction
stream with a NOP vs. a timestamp-capture call, like the Linux perf
infrastructure does.


Also, instead of changing all the libraries, maybe we could use
"-finstrument-functions" and just mark the functions with an attribute.

Just my 2c.

> +                       __rte_lcore_telemetry_timestamp(nb_rx); \
> +       } while (0)
> +#else
> +#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx) \
> +       while (0)
> +#endif
> +
  
Bruce Richardson July 15, 2022, 2:11 p.m. UTC | #3
On Fri, Jul 15, 2022 at 07:16:17PM +0530, Jerin Jacob wrote:
> On Fri, Jul 15, 2022 at 6:42 PM Anatoly Burakov
> <anatoly.burakov@intel.com> wrote:
> >
> > [snip commit message]
> 
> 
> It is a good feature. Thanks for this work.
> 

+1
Some follow-up comments inline below.

/Bruce

> 
> > ---
> >
> > Notes:
> >     We did a couple of quick smoke tests to see if this patch causes any performance
> >     degradation, and it seemed to have none that we could measure. Telemetry can be
> >     disabled at compile time via a config option, while at runtime it can be
> >     disabled, seemingly at a cost of one additional branch.
> >
> >     That said, our benchmarking efforts were admittedly not very rigorous, so
> >     comments welcome!
> 
> >
> > diff --git a/config/rte_config.h b/config/rte_config.h
> > index 46549cb062..583cb6f7a5 100644
> > --- a/config/rte_config.h
> > +++ b/config/rte_config.h
> > @@ -39,6 +39,8 @@
> >  #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
> >  #define RTE_BACKTRACE 1
> >  #define RTE_MAX_VFIO_CONTAINERS 64
> > +#define RTE_LCORE_BUSYNESS 1
> 
> Please don't enable debug features in fastpath as default.
> 

I would disagree that this is a debug feature. I have lost count of the
number of times I have heard from DPDK users that they wish to get more
visibility into what the app is doing, rather than just seeing 100% CPU
busy. Therefore, I'd see this as enabled by default rather than
disabled, unless it's shown to cause a performance regression.

That said, since this impacts multiple components and it's something that
end users might want to disable at build-time, I'd suggest moving it to
meson_options.txt file and have it as an official DPDK build-time option.
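Following that suggestion, a hypothetical meson_options.txt entry might look like this; the option name and description are illustrative, not part of the patch:

```meson
# Hypothetical build-time switch for the feature suggested above
option('lcore_busyness', type: 'boolean', value: true,
       description: 'Enable passive lcore busyness telemetry')
```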

> > +#define RTE_LCORE_BUSYNESS_PERIOD 4000000ULL
> >
> > +
> > +#include <unistd.h>
> > +#include <limits.h>
> > +#include <string.h>
> > +
> > +#include <rte_common.h>
> > +#include <rte_cycles.h>
> > +#include <rte_errno.h>
> > +#include <rte_lcore.h>
> > +
> > +#ifdef RTE_LCORE_BUSYNESS
> 
> This clutter may not be required. Let it compile all cases.
> 
> > +#include <rte_telemetry.h>
> > +#endif
> > +
> > +int __rte_lcore_telemetry_enabled;
> > +
> > +#ifdef RTE_LCORE_BUSYNESS
> > +
> > +struct lcore_telemetry {
> > +       int busyness;
> > +       /**< Calculated busyness (gets set/returned by the API) */
> > +       int raw_busyness;
> > +       /**< Calculated busyness times 100. */
> > +       uint64_t interval_ts;
> > +       /**< when previous telemetry interval started */
> > +       uint64_t empty_cycles;
> > +       /**< empty cycle count since last interval */
> > +       uint64_t last_poll_ts;
> > +       /**< last poll timestamp */
> > +       bool last_empty;
> > +       /**< if last poll was empty */
> > +       unsigned int contig_poll_cnt;
> > +       /**< contiguous (always empty/non empty) poll counter */
> > +} __rte_cache_aligned;
> > +static struct lcore_telemetry telemetry_data[RTE_MAX_LCORE];
> 
> Allocate this from hugepage.
> 

Yes, whether or not it's allocated from hugepages, dynamic allocation would
be better than having static vars.
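A hypothetical sketch of that suggestion: allocate the per-lcore telemetry array at init time, cache-line aligned, instead of using a static array. In DPDK this would presumably use rte_zmalloc_socket() to get hugepage-backed, NUMA-local memory; posix_memalign() stands in here so the sketch is self-contained, and the constants are stand-ins for the RTE_ equivalents.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_LCORE 128   /* stand-in for RTE_MAX_LCORE */
#define CACHE_LINE 64   /* stand-in for RTE_CACHE_LINE_SIZE */

/* Same fields as the patch's struct lcore_telemetry. */
struct lcore_telemetry {
	int busyness;
	int raw_busyness;
	uint64_t interval_ts;
	uint64_t empty_cycles;
	uint64_t last_poll_ts;
	bool last_empty;
	unsigned int contig_poll_cnt;
} __attribute__((aligned(CACHE_LINE)));

static struct lcore_telemetry *telemetry_data;

/* Dynamic, aligned allocation in place of a static array; in DPDK
 * this is where rte_zmalloc_socket() would be called per socket. */
static int
lcore_telemetry_init(void)
{
	void *mem = NULL;

	if (posix_memalign(&mem, CACHE_LINE,
			MAX_LCORE * sizeof(*telemetry_data)) != 0)
		return -1;
	telemetry_data = mem;
	for (unsigned int i = 0; i < MAX_LCORE; i++)
		telemetry_data[i] = (struct lcore_telemetry){0};
	return 0;
}
```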

> > +
> > +void __rte_lcore_telemetry_timestamp(uint16_t nb_rx)
> > +{
> > +       const unsigned int lcore_id = rte_lcore_id();
> > +       uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts;
> > +       struct lcore_telemetry *tdata = &telemetry_data[lcore_id];
> > +       const bool empty = nb_rx == 0;
> > +       uint64_t diff_int, diff_last;
> > +       bool last_empty;
> > +
> > +       last_empty = tdata->last_empty;
> > +
> > +       /* optimization: don't do anything if status hasn't changed */
> > +       if (last_empty == empty && tdata->contig_poll_cnt++ < 32)
> > +               return;
> > +       /* status changed or we're waiting for too long, reset counter */
> > +       tdata->contig_poll_cnt = 0;
> > +
> > +       cur_tsc = rte_rdtsc();
> > +
> > +       interval_ts = tdata->interval_ts;
> > +       empty_cycles = tdata->empty_cycles;
> > +       last_poll_ts = tdata->last_poll_ts;
> > +
> > +       diff_int = cur_tsc - interval_ts;
> > +       diff_last = cur_tsc - last_poll_ts;
> > +
> > +       /* is this the first time we're here? */
> > +       if (interval_ts == 0) {
> > +               tdata->busyness = LCORE_BUSYNESS_MIN;
> > +               tdata->raw_busyness = 0;
> > +               tdata->interval_ts = cur_tsc;
> > +               tdata->empty_cycles = 0;
> > +               tdata->contig_poll_cnt = 0;
> > +               goto end;
> > +       }
> > +
> > +       /* update the empty counter if we got an empty poll earlier */
> > +       if (last_empty)
> > +               empty_cycles += diff_last;
> > +
> > +       /* have we passed the interval? */
> > +       if (diff_int > RTE_LCORE_BUSYNESS_PERIOD) {
> 
> 
> I think, this function logic can be limited to just updating the
> timestamp in the ring buffer,
> and another control function of telemetry which runs in control core to do
> heavy lifting to reduce the performance impact on fast path,
> 
> > +               int raw_busyness;
> > +
> > +               /* get updated busyness value */
> > +               raw_busyness = calc_raw_busyness(tdata, empty_cycles, diff_int);
> > +
> > +               /* set a new interval, reset empty counter */
> > +               tdata->interval_ts = cur_tsc;
> > +               tdata->empty_cycles = 0;
> > +               tdata->raw_busyness = raw_busyness;
> > +               /* bring busyness back to 0..100 range, biased to round up */
> > +               tdata->busyness = (raw_busyness + 50) / 100;
> > +       } else
> > +               /* we may have updated empty counter */
> > +               tdata->empty_cycles = empty_cycles;
> > +
> > +end:
> > +       /* update status for next poll */
> > +       tdata->last_poll_ts = cur_tsc;
> > +       tdata->last_empty = empty;
> > +}
> > +
> > +#ifdef RTE_LCORE_BUSYNESS
> > +#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx)                    \
> > +       do {                                                    \
> > +               if (__rte_lcore_telemetry_enabled)              \
> 
> I think, rather than reading memory, Like Linux perf infrastructure,
> we can patch up the instruction steam as NOP vs Timestamp capture
> function.
> 
Surely that requires much more complicated tooling? How would that work in
this situation?

> 
> Also instead of changing all libraries, Maybe we can use
> "-finstrument-functions".
> and just mark the function with attribute.
> 
> Just 2c.
> 
> > +                       __rte_lcore_telemetry_timestamp(nb_rx); \
> > +       } while (0)
> > +#else
> > +#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx) \
> > +       while (0)
> > +#endif
> > +
  
Anatoly Burakov July 15, 2022, 2:18 p.m. UTC | #4
On 15-Jul-22 2:46 PM, Jerin Jacob wrote:
> On Fri, Jul 15, 2022 at 6:42 PM Anatoly Burakov
> <anatoly.burakov@intel.com> wrote:
>>
>> [snip commit message]
> 
> 
> It is a good feature. Thanks for this work.

Hi Jerin,

Thanks for your review! Comments below.

> 
> 
>> ---
>>
>> Notes:
>>      We did a couple of quick smoke tests to see if this patch causes any performance
>>      degradation, and it seemed to have none that we could measure. Telemetry can be
>>      disabled at compile time via a config option, while at runtime it can be
>>      disabled, seemingly at a cost of one additional branch.
>>
>>      That said, our benchmarking efforts were admittedly not very rigorous, so
>>      comments welcome!
> 
>>
>> diff --git a/config/rte_config.h b/config/rte_config.h
>> index 46549cb062..583cb6f7a5 100644
>> --- a/config/rte_config.h
>> +++ b/config/rte_config.h
>> @@ -39,6 +39,8 @@
>>   #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
>>   #define RTE_BACKTRACE 1
>>   #define RTE_MAX_VFIO_CONTAINERS 64
>> +#define RTE_LCORE_BUSYNESS 1
> 
> Please don't enable debug features in fastpath as default.

It is not meant to be a debug feature. The ability to measure CPU usage 
in DPDK applications consistently comes up as one of the top asks from 
all kinds of people working on DPDK, and this is an attempt to address 
that at a fundamental level. This is more of a quality of life 
improvement than a debug feature.

> 
>> +#define RTE_LCORE_BUSYNESS_PERIOD 4000000ULL
>>
>> +
>> +#include <unistd.h>
>> +#include <limits.h>
>> +#include <string.h>
>> +
>> +#include <rte_common.h>
>> +#include <rte_cycles.h>
>> +#include <rte_errno.h>
>> +#include <rte_lcore.h>
>> +
>> +#ifdef RTE_LCORE_BUSYNESS
> 
> This clutter may not be required. Let it compile all cases.

Windows does not have the telemetry library (so this define never exists 
on Windows), and we want to be able to compile without lcore telemetry 
enabled as well (by commenting out the config option). We would have to 
have #ifdef-ery here in any case. Am I missing an obvious way to do 
this without #ifdefs?

> 
>> +#include <rte_telemetry.h>
>> +#endif
>> +
>> +int __rte_lcore_telemetry_enabled;
>> +
>> +#ifdef RTE_LCORE_BUSYNESS
>> +
>> +struct lcore_telemetry {
>> +       int busyness;
>> +       /**< Calculated busyness (gets set/returned by the API) */
>> +       int raw_busyness;
>> +       /**< Calculated busyness times 100. */
>> +       uint64_t interval_ts;
>> +       /**< when previous telemetry interval started */
>> +       uint64_t empty_cycles;
>> +       /**< empty cycle count since last interval */
>> +       uint64_t last_poll_ts;
>> +       /**< last poll timestamp */
>> +       bool last_empty;
>> +       /**< if last poll was empty */
>> +       unsigned int contig_poll_cnt;
>> +       /**< contiguous (always empty/non empty) poll counter */
>> +} __rte_cache_aligned;
>> +static struct lcore_telemetry telemetry_data[RTE_MAX_LCORE];
> 
> Allocate this from hugepage.

Good suggestion, probably needs per-socket structures as well to avoid 
cross-socket accesses.

> 
>> +
>> +void __rte_lcore_telemetry_timestamp(uint16_t nb_rx)
>> +{
>> +       const unsigned int lcore_id = rte_lcore_id();
>> +       uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts;
>> +       struct lcore_telemetry *tdata = &telemetry_data[lcore_id];
>> +       const bool empty = nb_rx == 0;
>> +       uint64_t diff_int, diff_last;
>> +       bool last_empty;
>> +
>> +       last_empty = tdata->last_empty;
>> +
>> +       /* optimization: don't do anything if status hasn't changed */
>> +       if (last_empty == empty && tdata->contig_poll_cnt++ < 32)
>> +               return;
>> +       /* status changed or we're waiting for too long, reset counter */
>> +       tdata->contig_poll_cnt = 0;
>> +
>> +       cur_tsc = rte_rdtsc();
>> +
>> +       interval_ts = tdata->interval_ts;
>> +       empty_cycles = tdata->empty_cycles;
>> +       last_poll_ts = tdata->last_poll_ts;
>> +
>> +       diff_int = cur_tsc - interval_ts;
>> +       diff_last = cur_tsc - last_poll_ts;
>> +
>> +       /* is this the first time we're here? */
>> +       if (interval_ts == 0) {
>> +               tdata->busyness = LCORE_BUSYNESS_MIN;
>> +               tdata->raw_busyness = 0;
>> +               tdata->interval_ts = cur_tsc;
>> +               tdata->empty_cycles = 0;
>> +               tdata->contig_poll_cnt = 0;
>> +               goto end;
>> +       }
>> +
>> +       /* update the empty counter if we got an empty poll earlier */
>> +       if (last_empty)
>> +               empty_cycles += diff_last;
>> +
>> +       /* have we passed the interval? */
>> +       if (diff_int > RTE_LCORE_BUSYNESS_PERIOD) {
> 
> 
> I think, this function logic can be limited to just updating the
> timestamp in the ring buffer,
> and another control function of telemetry which runs in control core to do
> heavy lifting to reduce the performance impact on fast path,

That ring buffer would have to be rather large, because telemetry calls 
can come in as often as once every couple of hundred cycles (when polls 
are empty) while the telemetry measuring period is in the millions of 
cycles. That said, this is an interesting area to explore in further 
revisions, so I'll think of something along those lines, thanks!
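To put rough numbers on that concern, a back-of-envelope sketch of the entries such a ring would accumulate per interval if every poll were recorded (both figures are illustrative):

```c
#include <stdint.h>

/* Back-of-envelope sizing for the ring-buffer approach discussed
 * above: how many entries accumulate per telemetry interval if every
 * poll is recorded. The function name and inputs are illustrative. */
static uint64_t
ring_entries_per_interval(uint64_t period_cycles, uint64_t cycles_per_poll)
{
	return period_cycles / cycles_per_poll;
}
```

With the current 4,000,000-cycle period and an empty poll every ~200 cycles, that is 20,000 entries per lcore per interval.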

> 
>> +               int raw_busyness;
>> +
>> +               /* get updated busyness value */
>> +               raw_busyness = calc_raw_busyness(tdata, empty_cycles, diff_int);
>> +
>> +               /* set a new interval, reset empty counter */
>> +               tdata->interval_ts = cur_tsc;
>> +               tdata->empty_cycles = 0;
>> +               tdata->raw_busyness = raw_busyness;
>> +               /* bring busyness back to 0..100 range, biased to round up */
>> +               tdata->busyness = (raw_busyness + 50) / 100;
>> +       } else
>> +               /* we may have updated empty counter */
>> +               tdata->empty_cycles = empty_cycles;
>> +
>> +end:
>> +       /* update status for next poll */
>> +       tdata->last_poll_ts = cur_tsc;
>> +       tdata->last_empty = empty;
>> +}
>> +
>> +#ifdef RTE_LCORE_BUSYNESS
>> +#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx)                    \
>> +       do {                                                    \
>> +               if (__rte_lcore_telemetry_enabled)              \
> 
> I think, rather than reading memory, Like Linux perf infrastructure,
> we can patch up the instruction steam as NOP vs Timestamp capture
> function.

Would you be so kind as to provide a link to examples of how this can be 
implemented?

> 
> 
> Also instead of changing all libraries, Maybe we can use
> "-finstrument-functions".
> and just mark the function with attribute.

This is not meant to be a profiling solution; rather, it is a lightweight 
CPU usage measuring tool, meant to expose (albeit limited) CPU usage data. I'm 
by no means an expert on the `-finstrument-functions` flag, but from a 
cursory reading of its description in the GCC manual, it seems rather 
heavyweight, as it captures function enter/exit and other data 
we're not really interested in. This *would* make this feature a debug 
feature, but that was not the intention here :)
  
Morten Brørup July 15, 2022, 10:13 p.m. UTC | #5
> From: Anatoly Burakov [mailto:anatoly.burakov@intel.com]
> Sent: Friday, 15 July 2022 15.13
> 
> Currently, there is no way to measure lcore busyness in a passive way,
> without any modifications to the application. This patch adds a new EAL
> API that will be able to passively track core busyness.
> 
> The busyness is calculated by relying on the fact that most DPDK API's
> will poll for packets.

This is an "alternative fact"! Only run-to-completion applications poll for RX. Pipelined applications do not poll for packets in every pipeline stage.

> Empty polls can be counted as "idle", while
> non-empty polls can be counted as busy. To measure lcore busyness, we
> simply call the telemetry timestamping function with the number of
> polls
> a particular code section has processed, and count the number of cycles
> we've spent processing empty bursts. The more empty bursts we
> encounter,
> the less cycles we spend in "busy" state, and the less core busyness
> will be reported.
> 
> In order for all of the above to work without modifications to the
> application, the library code needs to be instrumented with calls to
> the lcore telemetry busyness timestamping function. The following parts
> of DPDK are instrumented with lcore telemetry calls:
> 
> - All major driver API's:
>   - ethdev
>   - cryptodev
>   - compressdev
>   - regexdev
>   - bbdev
>   - rawdev
>   - eventdev
>   - dmadev
> - Some additional libraries:
>   - ring
>   - distributor
> 
> To avoid performance impact from having lcore telemetry support, a
> global variable is exported by EAL, and a call to timestamping function
> is wrapped into a macro, so that whenever telemetry is disabled, it
> only
> takes one additional branch and no function calls are performed. It is
> also possible to disable it at compile time by commenting out
> RTE_LCORE_BUSYNESS from build config.

Since all of this can be completely disabled at build time, and thus has exactly zero performance impact, I will not object to this patch.

> 
> This patch also adds a telemetry endpoint to report lcore busyness, as
> well as telemetry endpoints to enable/disable lcore telemetry.
> 
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> Signed-off-by: Conor Walsh <conor.walsh@intel.com>
> Signed-off-by: David Hunt <david.hunt@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
> 
> Notes:
>     We did a couple of quick smoke tests to see if this patch causes
> any performance
>     degradation, and it seemed to have none that we could measure.
> Telemetry can be
>     disabled at compile time via a config option, while at runtime it
> can be
>     disabled, seemingly at a cost of one additional branch.
> 
>     That said, our benchmarking efforts were admittedly not very
> rigorous, so
>     comments welcome!

This patch does not reflect lcore busyness; it reflects some sort of ingress activity level.

All the considerations regarding non-intrusiveness and low overhead are good, but everything in this patch needs to be renamed to reflect what it truly does, so it is clear that pipelined applications cannot use this telemetry for measuring lcore busyness (except on the ingress pipeline stage).

It's a shame that so much effort clearly has gone into this patch, and no one stopped to consider pipelined applications. :-(
  
Thomas Monjalon July 16, 2022, 2:38 p.m. UTC | #6
16/07/2022 00:13, Morten Brørup:
> > From: Anatoly Burakov [mailto:anatoly.burakov@intel.com]
> > Sent: Friday, 15 July 2022 15.13
> > 
> > Currently, there is no way to measure lcore busyness in a passive way,
> > without any modifications to the application. This patch adds a new EAL
> > API that will be able to passively track core busyness.
> > 
> > The busyness is calculated by relying on the fact that most DPDK API's
> > will poll for packets.
> 
> This is an "alternative fact"! Only run-to-completion applications polls for RX. Pipelined applications do not poll for packets in every pipeline stage.
> 
> > Empty polls can be counted as "idle", while
> > non-empty polls can be counted as busy. To measure lcore busyness, we
> > simply call the telemetry timestamping function with the number of
> > polls
> > a particular code section has processed, and count the number of cycles
> > we've spent processing empty bursts. The more empty bursts we
> > encounter,
> > the less cycles we spend in "busy" state, and the less core busyness
> > will be reported.
> > 
> > In order for all of the above to work without modifications to the
> > application, the library code needs to be instrumented with calls to
> > the lcore telemetry busyness timestamping function. The following parts
> > of DPDK are instrumented with lcore telemetry calls:
> > 
> > - All major driver API's:
> >   - ethdev
> >   - cryptodev
> >   - compressdev
> >   - regexdev
> >   - bbdev
> >   - rawdev
> >   - eventdev
> >   - dmadev
> > - Some additional libraries:
> >   - ring
> >   - distributor
> > 
> > To avoid performance impact from having lcore telemetry support, a
> > global variable is exported by EAL, and a call to timestamping function
> > is wrapped into a macro, so that whenever telemetry is disabled, it
> > only
> > takes one additional branch and no function calls are performed. It is
> > also possible to disable it at compile time by commenting out
> > RTE_LCORE_BUSYNESS from build config.
> 
> Since all of this can be completely disabled at build time, and thus has exactly zero performance impact, I will not object to this patch.
> 
> > 
> > This patch also adds a telemetry endpoint to report lcore busyness, as
> > well as telemetry endpoints to enable/disable lcore telemetry.
> > 
> > Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> > Signed-off-by: Conor Walsh <conor.walsh@intel.com>
> > Signed-off-by: David Hunt <david.hunt@intel.com>
> > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> > ---
> > 
> > Notes:
> >     We did a couple of quick smoke tests to see if this patch causes
> > any performance
> >     degradation, and it seemed to have none that we could measure.
> > Telemetry can be
> >     disabled at compile time via a config option, while at runtime it
> > can be
> >     disabled, seemingly at a cost of one additional branch.
> > 
> >     That said, our benchmarking efforts were admittedly not very
> > rigorous, so
> >     comments welcome!
> 
> This patch does not reflect lcore business, it reflects some sort of ingress activity level.
> 
> All the considerations regarding non-intrusiveness and low overhead are good, but everything in this patch needs to be renamed to reflect what it truly does, so it is clear that pipelined applications cannot use this telemetry for measuring lcore business (except on the ingress pipeline stage).

+1
Anatoly, please reflect polling activity in naming.

> It's a shame that so much effort clearly has gone into this patch, and no one stopped to consider pipelined applications. :-(

That's because no RFC was sent, I think.
  
Honnappa Nagarahalli July 17, 2022, 3:10 a.m. UTC | #7
<snip>

> Subject: RE: [PATCH v1 1/2] eal: add lcore busyness telemetry
> 
> > From: Anatoly Burakov [mailto:anatoly.burakov@intel.com]
> > Sent: Friday, 15 July 2022 15.13
> >
> > Currently, there is no way to measure lcore busyness in a passive way,
> > without any modifications to the application. This patch adds a new
> > EAL API that will be able to passively track core busyness.
> >
> > The busyness is calculated by relying on the fact that most DPDK API's
> > will poll for packets.
> 
> This is an "alternative fact"! Only run-to-completion applications polls for RX.
> Pipelined applications do not poll for packets in every pipeline stage.
I guess you meant polling for packets from the NIC. They still need to receive packets from queues. We could do a similar thing for the rte_ring APIs.

> 
> > Empty polls can be counted as "idle", while non-empty polls can be
> > counted as busy. To measure lcore busyness, we simply call the
> > telemetry timestamping function with the number of polls a particular
> > code section has processed, and count the number of cycles we've spent
> > processing empty bursts. The more empty bursts we encounter, the less
> > cycles we spend in "busy" state, and the less core busyness will be
> > reported.
> >
> > In order for all of the above to work without modifications to the
> > application, the library code needs to be instrumented with calls to
> > the lcore telemetry busyness timestamping function. The following
> > parts of DPDK are instrumented with lcore telemetry calls:
> >
> > - All major driver API's:
> >   - ethdev
> >   - cryptodev
> >   - compressdev
> >   - regexdev
> >   - bbdev
> >   - rawdev
> >   - eventdev
> >   - dmadev
> > - Some additional libraries:
> >   - ring
> >   - distributor
> >
> > To avoid performance impact from having lcore telemetry support, a
> > global variable is exported by EAL, and a call to timestamping
> > function is wrapped into a macro, so that whenever telemetry is
> > disabled, it only takes one additional branch and no function calls
> > are performed. It is also possible to disable it at compile time by
> > commenting out RTE_LCORE_BUSYNESS from build config.
> 
> Since all of this can be completely disabled at build time, and thus has exactly
> zero performance impact, I will not object to this patch.
> 
> >
> > This patch also adds a telemetry endpoint to report lcore busyness, as
> > well as telemetry endpoints to enable/disable lcore telemetry.
> >
> > Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> > Signed-off-by: Conor Walsh <conor.walsh@intel.com>
> > Signed-off-by: David Hunt <david.hunt@intel.com>
> > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> > ---
> >
> > Notes:
> >     We did a couple of quick smoke tests to see if this patch causes
> > any performance
> >     degradation, and it seemed to have none that we could measure.
> > Telemetry can be
> >     disabled at compile time via a config option, while at runtime it
> > can be
> >     disabled, seemingly at a cost of one additional branch.
> >
> >     That said, our benchmarking efforts were admittedly not very
> > rigorous, so
> >     comments welcome!
> 
> This patch does not reflect lcore business, it reflects some sort of ingress
> activity level.
> 
> All the considerations regarding non-intrusiveness and low overhead are
> good, but everything in this patch needs to be renamed to reflect what it truly
> does, so it is clear that pipelined applications cannot use this telemetry for
> measuring lcore business (except on the ingress pipeline stage).
> 
> It's a shame that so much effort clearly has gone into this patch, and no one
> stopped to consider pipelined applications. :-(
  
Morten Brørup July 17, 2022, 9:56 a.m. UTC | #8
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Sunday, 17 July 2022 05.10
> 
> <snip>
> 
> > Subject: RE: [PATCH v1 1/2] eal: add lcore busyness telemetry
> >
> > > From: Anatoly Burakov [mailto:anatoly.burakov@intel.com]
> > > Sent: Friday, 15 July 2022 15.13
> > >
> > > Currently, there is no way to measure lcore busyness in a passive
> way,
> > > without any modifications to the application. This patch adds a new
> > > EAL API that will be able to passively track core busyness.
> > >
> > > The busyness is calculated by relying on the fact that most DPDK
> API's
> > > will poll for packets.
> >
> > This is an "alternative fact"! Only run-to-completion applications
> polls for RX.
> > Pipelined applications do not poll for packets in every pipeline
> stage.
> I guess you meant, poll for packets from NIC. They still need to
> receive packets from queues. We could do a similar thing for rte_ring
> APIs.

But it would mix apples, pears and bananas.

Let's say you have a pipeline with three ingress preprocessing threads, two advanced packet processing threads in the next pipeline stage and one egress thread as the third pipeline stage.

Now, the metrics reflect busyness for six threads, but three of them are apples, two of them are pears, and one is bananas.

I just realized another example, where this patch might give misleading results on a run-to-completion application:

One thread handles a specific type of packets received on an Ethdev ingress queue set up via the rte_flow APIs, and another thread handles ingress packets from a different Ethdev ingress queue. E.g. the first queue may contain packets for well-known flows, which can be processed quickly, while the other queue holds packets requiring more scrutiny. Both threads are run-to-completion and handle Ethdev ingress packets.

*So: Only applications where the threads perform the exact same task can use this patch.*

Also, rings may be used for other purposes than queueing packets between pipeline stages. E.g. our application uses rings for fast bulk allocation and freeing of other resources.
  
Anatoly Burakov July 18, 2022, 9:43 a.m. UTC | #9
On 17-Jul-22 10:56 AM, Morten Brørup wrote:
>> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
>> Sent: Sunday, 17 July 2022 05.10
>>
>> <snip>
>>
>>> Subject: RE: [PATCH v1 1/2] eal: add lcore busyness telemetry
>>>
>>>> From: Anatoly Burakov [mailto:anatoly.burakov@intel.com]
>>>> Sent: Friday, 15 July 2022 15.13
>>>>
>>>> Currently, there is no way to measure lcore busyness in a passive
>> way,
>>>> without any modifications to the application. This patch adds a new
>>>> EAL API that will be able to passively track core busyness.
>>>>
>>>> The busyness is calculated by relying on the fact that most DPDK
>> API's
>>>> will poll for packets.
>>>
>>> This is an "alternative fact"! Only run-to-completion applications
>> polls for RX.
>>> Pipelined applications do not poll for packets in every pipeline
>> stage.
>> I guess you meant, poll for packets from NIC. They still need to
>> receive packets from queues. We could do a similar thing for rte_ring
>> APIs.

The ring API is already instrumented to report telemetry in the same way, so 
any rte_ring-based pipeline will be able to track it. Obviously, 
non-DPDK APIs would have to be instrumented too; we really can't do 
anything about that from inside DPDK.

> 
> But it would mix apples, pears and bananas.
> 
> Let's say you have a pipeline with three ingress preprocessing threads, two advanced packet processing threads in the next pipeline stage and one egress thread as the third pipeline stage.
> 
> Now, the metrics reflects busyness for six threads, but three of them are apples, two of them are pears, and one is bananas.
> 
> I just realized another example, where this patch might give misleading results on a run-to-completion application:
> 
> One thread handles a specific type of packets received on an Ethdev ingress queue set up by the rte_flow APIs, and another thread handles ingress packets from another Ethdev ingress queue. E.g. the first queue may contain packets for well known flows, where packets can be processed quickly, and the other queue for other packets requiring more scrutiny. Both threads are run-to-completion and handle Ethdev ingress packets.
> 
> *So: Only applications where the threads perform the exact same task can use this patch.*

I do not see how that follows. I think you're falling for an "it's not 
100% useful, therefore it's 0% useful" fallacy here. Some use cases 
would obviously make telemetry more informative than others, that's 
true, however I do not see how it's a mandatory requirement for lcore 
busyness to report the same thing. We can document the limitations and 
assumptions made, can we not?

It is true that this patchset is mostly written from the standpoint of a 
run-to-completion application, but can we improve it? What would be your 
suggestions to make it better suit use cases you are familiar with?

> 
> Also, rings may be used for other purposes than queueing packets between pipeline stages. E.g. our application uses rings for fast bulk allocation and freeing of other resources.
> 

Well, this is the tradeoff for simplicity. Of course we could add all 
sorts of stuff like dynamic enable/disable of this and that and the 
other... but the end goal was something easy and automatic and that 
doesn't require any work to implement, not something that suits 100% of 
the cases 100% of the time. Having such flexibility as you described 
comes at a cost that this patch was not meant to pay!
  
Morten Brørup July 18, 2022, 10:59 a.m. UTC | #10
> From: Burakov, Anatoly [mailto:anatoly.burakov@intel.com]
> Sent: Monday, 18 July 2022 11.44
> 
> On 17-Jul-22 10:56 AM, Morten Brørup wrote:
> >> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> >> Sent: Sunday, 17 July 2022 05.10
> >>
> >> <snip>
> >>
> >>> Subject: RE: [PATCH v1 1/2] eal: add lcore busyness telemetry
> >>>
> >>>> From: Anatoly Burakov [mailto:anatoly.burakov@intel.com]
> >>>> Sent: Friday, 15 July 2022 15.13
> >>>>
> >>>> Currently, there is no way to measure lcore busyness in a passive
> >> way,
> >>>> without any modifications to the application. This patch adds a
> new
> >>>> EAL API that will be able to passively track core busyness.
> >>>>
> >>>> The busyness is calculated by relying on the fact that most DPDK
> >> API's
> >>>> will poll for packets.
> >>>
> >>> This is an "alternative fact"! Only run-to-completion applications
> >> polls for RX.
> >>> Pipelined applications do not poll for packets in every pipeline
> >> stage.
> >> I guess you meant, poll for packets from NIC. They still need to
> >> receive packets from queues. We could do a similar thing for
> rte_ring
> >> APIs.
> 
> Ring API is already instrumented to report telemetry in the same way,
> so
> any rte_ring-based pipeline will be able to track it. Obviously,
> non-DPDK API's will have to be instrumented too, we really can't do
> anything about that from inside DPDK.
> 
> >
> > But it would mix apples, pears and bananas.
> >
> > Let's say you have a pipeline with three ingress preprocessing
> threads, two advanced packet processing threads in the next pipeline
> stage and one egress thread as the third pipeline stage.
> >
> > Now, the metrics reflects busyness for six threads, but three of them
> are apples, two of them are pears, and one is bananas.
> >
> > I just realized another example, where this patch might give
> misleading results on a run-to-completion application:
> >
> > One thread handles a specific type of packets received on an Ethdev
> ingress queue set up by the rte_flow APIs, and another thread handles
> ingress packets from another Ethdev ingress queue. E.g. the first queue
> may contain packets for well known flows, where packets can be
> processed quickly, and the other queue for other packets requiring more
> scrutiny. Both threads are run-to-completion and handle Ethdev ingress
> packets.
> >
> > *So: Only applications where the threads perform the exact same task
> can use this patch.*
> 
> I do not see how that follows. I think you're falling for a "it's not
> 100% useful, therefore it's 0% useful" fallacy here. Some use cases
> would obviously make telemetry more informative than others, that's
> true, however I do not see how it's a mandatory requirement for lcore
> busyness to report the same thing. We can document the limitations and
> assumptions made, can we not?

I did use strong wording in my email to get my message through. However, I do consider the scope "applications where the threads perform the exact same task" more than 0 % of all deployed applications, and thus the patch is more than 0 % useful. But I certainly don't consider the scope for this patch 100 %  of all deployed applications, and perhaps not even 80 %.

I didn't reject the patch or oppose it, but requested it to be updated so the names reflect the information provided by it. I strongly oppose using "CPU Busyness" as the telemetry name for something that only reflects ingress activity and is zero for a thread that only performs egress or other non-ingress tasks. That would be strongly misleading.

If you by "document the limitations and assumptions" also mean renaming telemetry names and variables/functions in the patch to reflect what it actually does, then yes, documenting the limitations and assumptions suffices. However, adding a notice in some documentation that "CPU Busyness" telemetry is only correct/relevant for specific applications doesn't suffice.

> 
> It is true that this patchset is mostly written from the standpoint of
> a
> run-to-completion application, but can we improve it? What would be
> your
> suggestions to make it better suit use cases you are familiar with?

Our application uses our own run-time profiler library to measure time spent in the application's various threads and pipeline stages, and the application needs to call the profiler library functions to feed it the information it needs. We still haven't found a good way to transform the profiler data into a generic summary CPU Utilization percentage, which should reflect how much of the system's CPU capacity is being used (preferably on a linear scale). (Our profiler library is designed specifically for our own purposes, and would require a complete rewrite to meet even basic DPDK library standards, so I won't even try to contribute it.)

I don't think it is possible to measure and report detailed CPU Busyness without involving the application. Only the application has knowledge about what the individual lcores are doing. Even for my example above (with two run-to-completion threads serving rte_flow configured Ethdev ingress queues), this patch would not provide information about which of the two types of traffic is causing the higher busyness. The telemetry might expose which specific thread is busy, but it doesn't tell which of the two tasks is being performed by that thread, and thus which kind of traffic is causing the busyness.

> 
> >
> > Also, rings may be used for other purposes than queueing packets
> between pipeline stages. E.g. our application uses rings for fast bulk
> allocation and freeing of other resources.
> >
> 
> Well, this is the tradeoff for simplicity. Of course we could add all
> sorts of stuff like dynamic enable/disable of this and that and the
> other... but the end goal was something easy and automatic and that
> doesn't require any work to implement, not something that suits 100% of
> the cases 100% of the time. Having such flexibility as you described
> comes at a cost that this patch was not meant to pay!

I do see the benefit of adding instrumentation like this to the DPDK libraries, so information becomes available at zero application development effort. The alternative would be a profiler/busyness library requiring application modifications.

I only request that:
1. The patch clearly reflects what is does, and
2. The instrumentation can be omitted at build time, so it has zero performance impact on applications where it is useless.

> 
> --
> Thanks,
> Anatoly

PS: The busyness counters in the DPDK Service Cores library are also being updated [1].

[1] http://inbox.dpdk.org/dev/20220711131825.3373195-2-harry.van.haaren@intel.com/T/#u
  
Stephen Hemminger July 18, 2022, 3:46 p.m. UTC | #11
On Mon, 18 Jul 2022 10:43:52 +0100
"Burakov, Anatoly" <anatoly.burakov@intel.com> wrote:

> >>> This is an "alternative fact"! Only run-to-completion applications  
> >> polls for RX.  
> >>> Pipelined applications do not poll for packets in every pipeline  
> >> stage.
> >> I guess you meant, poll for packets from NIC. They still need to
> >> receive packets from queues. We could do a similar thing for rte_ring
> >> APIs.  
> 
> Ring API is already instrumented to report telemetry in the same way, so 
> any rte_ring-based pipeline will be able to track it. Obviously, 
> non-DPDK API's will have to be instrumented too, we really can't do 
> anything about that from inside DPDK.

The eventdev API is used to build pipeline-based apps, and it supports
telemetry.
  
Thomas Monjalon July 19, 2022, 12:20 p.m. UTC | #12
18/07/2022 12:59, Morten Brørup:
> I only request that:
> 1. The patch clearly reflects what is does, and

+1

> 2. The instrumentation can be omitted at build time, so it has zero performance impact on applications where it is useless.

+1

(+2 then)
  

Patch

diff --git a/config/rte_config.h b/config/rte_config.h
index 46549cb062..583cb6f7a5 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -39,6 +39,8 @@ 
 #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
 #define RTE_BACKTRACE 1
 #define RTE_MAX_VFIO_CONTAINERS 64
+#define RTE_LCORE_BUSYNESS 1
+#define RTE_LCORE_BUSYNESS_PERIOD 4000000ULL
 
 /* bsd module defines */
 #define RTE_CONTIGMEM_MAX_NUM_BUFS 64
diff --git a/lib/bbdev/rte_bbdev.h b/lib/bbdev/rte_bbdev.h
index b88c88167e..d6ed176cce 100644
--- a/lib/bbdev/rte_bbdev.h
+++ b/lib/bbdev/rte_bbdev.h
@@ -28,6 +28,7 @@  extern "C" {
 #include <stdbool.h>
 
 #include <rte_cpuflags.h>
+#include <rte_lcore.h>
 
 #include "rte_bbdev_op.h"
 
@@ -599,7 +600,9 @@  rte_bbdev_dequeue_enc_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_enc_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_enc_ops(q_data, ops, num_ops);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -631,7 +634,9 @@  rte_bbdev_dequeue_dec_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_dec_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_dec_ops(q_data, ops, num_ops);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 
@@ -662,7 +667,9 @@  rte_bbdev_dequeue_ldpc_enc_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -692,7 +699,9 @@  rte_bbdev_dequeue_ldpc_dec_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /** Definitions of device event types */
diff --git a/lib/compressdev/rte_compressdev.c b/lib/compressdev/rte_compressdev.c
index 22c438f2dd..912cee9a16 100644
--- a/lib/compressdev/rte_compressdev.c
+++ b/lib/compressdev/rte_compressdev.c
@@ -580,6 +580,8 @@  rte_compressdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 	nb_ops = (*dev->dequeue_burst)
 			(dev->data->queue_pairs[qp_id], ops, nb_ops);
 
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+
 	return nb_ops;
 }
 
diff --git a/lib/cryptodev/rte_cryptodev.h b/lib/cryptodev/rte_cryptodev.h
index 56f459c6a0..072874020d 100644
--- a/lib/cryptodev/rte_cryptodev.h
+++ b/lib/cryptodev/rte_cryptodev.h
@@ -1915,6 +1915,8 @@  rte_cryptodev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 		rte_rcu_qsbr_thread_offline(list->qsbr, 0);
 	}
 #endif
+
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
 	return nb_ops;
 }
 
diff --git a/lib/distributor/rte_distributor.c b/lib/distributor/rte_distributor.c
index 3035b7a999..35b0d8d36b 100644
--- a/lib/distributor/rte_distributor.c
+++ b/lib/distributor/rte_distributor.c
@@ -56,6 +56,8 @@  rte_distributor_request_pkt(struct rte_distributor *d,
 
 		while (rte_rdtsc() < t)
 			rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
 	}
 
 	/*
@@ -134,24 +136,29 @@  rte_distributor_get_pkt(struct rte_distributor *d,
 
 	if (unlikely(d->alg_type == RTE_DIST_ALG_SINGLE)) {
 		if (return_count <= 1) {
+			uint16_t cnt;
 			pkts[0] = rte_distributor_get_pkt_single(d->d_single,
-				worker_id, return_count ? oldpkt[0] : NULL);
-			return (pkts[0]) ? 1 : 0;
-		} else
-			return -EINVAL;
+								 worker_id,
+								 return_count ? oldpkt[0] : NULL);
+			cnt = (pkts[0] != NULL) ? 1 : 0;
+			RTE_LCORE_TELEMETRY_TIMESTAMP(cnt);
+			return cnt;
+		}
+		return -EINVAL;
 	}
 
 	rte_distributor_request_pkt(d, worker_id, oldpkt, return_count);
 
-	count = rte_distributor_poll_pkt(d, worker_id, pkts);
-	while (count == -1) {
+	while ((count = rte_distributor_poll_pkt(d, worker_id, pkts)) == -1) {
 		uint64_t t = rte_rdtsc() + 100;
 
 		while (rte_rdtsc() < t)
 			rte_pause();
 
-		count = rte_distributor_poll_pkt(d, worker_id, pkts);
+		/* this was an empty poll */
+		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
 	}
+	RTE_LCORE_TELEMETRY_TIMESTAMP(count);
 	return count;
 }
 
diff --git a/lib/distributor/rte_distributor_single.c b/lib/distributor/rte_distributor_single.c
index 2c77ac454a..dc58791bf4 100644
--- a/lib/distributor/rte_distributor_single.c
+++ b/lib/distributor/rte_distributor_single.c
@@ -31,8 +31,13 @@  rte_distributor_request_pkt_single(struct rte_distributor_single *d,
 	union rte_distributor_buffer_single *buf = &d->bufs[worker_id];
 	int64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
 			| RTE_DISTRIB_GET_BUF;
-	RTE_WAIT_UNTIL_MASKED(&buf->bufptr64, RTE_DISTRIB_FLAGS_MASK,
-		==, 0, __ATOMIC_RELAXED);
+
+	while (!(__atomic_load_n(&buf->bufptr64, __ATOMIC_RELAXED)
+			& RTE_DISTRIB_FLAGS_MASK) == 0) {
+		rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
+	}
 
 	/* Sync with distributor on GET_BUF flag. */
 	__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
@@ -59,8 +64,11 @@  rte_distributor_get_pkt_single(struct rte_distributor_single *d,
 {
 	struct rte_mbuf *ret;
 	rte_distributor_request_pkt_single(d, worker_id, oldpkt);
-	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL)
+	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL) {
 		rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
+	}
 	return ret;
 }
 
diff --git a/lib/dmadev/rte_dmadev.h b/lib/dmadev/rte_dmadev.h
index e7f992b734..98176a6a7a 100644
--- a/lib/dmadev/rte_dmadev.h
+++ b/lib/dmadev/rte_dmadev.h
@@ -149,6 +149,7 @@ 
 #include <rte_bitops.h>
 #include <rte_common.h>
 #include <rte_compat.h>
+#include <rte_lcore.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -1027,7 +1028,7 @@  rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
 		  uint16_t *last_idx, bool *has_error)
 {
 	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
-	uint16_t idx;
+	uint16_t idx, nb_ops;
 	bool err;
 
 #ifdef RTE_DMADEV_DEBUG
@@ -1050,8 +1051,10 @@  rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
 		has_error = &err;
 
 	*has_error = false;
-	return (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
-				 has_error);
+	nb_ops = (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
+				   has_error);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -1090,7 +1093,7 @@  rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
 			 enum rte_dma_status_code *status)
 {
 	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
-	uint16_t idx;
+	uint16_t idx, nb_ops;
 
 #ifdef RTE_DMADEV_DEBUG
 	if (!rte_dma_is_valid(dev_id) || nb_cpls == 0 || status == NULL)
@@ -1101,8 +1104,10 @@  rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
 	if (last_idx == NULL)
 		last_idx = &idx;
 
-	return (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
+	nb_ops = (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
 					last_idx, status);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
diff --git a/lib/eal/common/eal_common_lcore_telemetry.c b/lib/eal/common/eal_common_lcore_telemetry.c
new file mode 100644
index 0000000000..5e4ea15ff5
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_telemetry.c
@@ -0,0 +1,274 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Intel Corporation
+ */
+
+#include <unistd.h>
+#include <limits.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_errno.h>
+#include <rte_lcore.h>
+
+#ifdef RTE_LCORE_BUSYNESS
+#include <rte_telemetry.h>
+#endif
+
+int __rte_lcore_telemetry_enabled;
+
+#ifdef RTE_LCORE_BUSYNESS
+
+struct lcore_telemetry {
+	int busyness;
+	/**< Calculated busyness (gets set/returned by the API) */
+	int raw_busyness;
+	/**< Calculated busyness times 100. */
+	uint64_t interval_ts;
+	/**< when previous telemetry interval started */
+	uint64_t empty_cycles;
+	/**< empty cycle count since last interval */
+	uint64_t last_poll_ts;
+	/**< last poll timestamp */
+	bool last_empty;
+	/**< if last poll was empty */
+	unsigned int contig_poll_cnt;
+	/**< contiguous (always empty/non empty) poll counter */
+} __rte_cache_aligned;
+static struct lcore_telemetry telemetry_data[RTE_MAX_LCORE];
+
+#define LCORE_BUSYNESS_MAX 100
+#define LCORE_BUSYNESS_NOT_SET (-1)
+#define LCORE_BUSYNESS_MIN 0
+
+static void lcore_config_init(void)
+{
+	int lcore_id;
+	RTE_LCORE_FOREACH(lcore_id) {
+		struct lcore_telemetry *td = &telemetry_data[lcore_id];
+
+		td->interval_ts = 0;
+		td->last_poll_ts = 0;
+		td->empty_cycles = 0;
+		td->last_empty = true;
+		td->contig_poll_cnt = 0;
+		td->busyness = LCORE_BUSYNESS_NOT_SET;
+		td->raw_busyness = 0;
+	}
+}
+
+int rte_lcore_busyness(unsigned int lcore_id)
+{
+	const uint64_t active_thresh = RTE_LCORE_BUSYNESS_PERIOD * 1000;
+	struct lcore_telemetry *tdata;
+
+	if (lcore_id >= RTE_MAX_LCORE)
+		return -EINVAL;
+	tdata = &telemetry_data[lcore_id];
+
+	/* if the lcore is not active */
+	if (tdata->interval_ts == 0)
+		return LCORE_BUSYNESS_NOT_SET;
+	/* if the core hasn't been active in a while */
+	else if ((rte_rdtsc() - tdata->interval_ts) > active_thresh)
+		return LCORE_BUSYNESS_NOT_SET;
+
+	/* this core is active, report its busyness */
+	return telemetry_data[lcore_id].busyness;
+}
+
+int rte_lcore_busyness_enabled(void)
+{
+	return __rte_lcore_telemetry_enabled;
+}
+
+void rte_lcore_busyness_enabled_set(int enable)
+{
+	__rte_lcore_telemetry_enabled = !!enable;
+
+	if (!enable)
+		lcore_config_init();
+}
+
+static inline int calc_raw_busyness(const struct lcore_telemetry *tdata,
+				    const uint64_t empty, const uint64_t total)
+{
+	/*
+	 * we don't want to use floating point math here, but we want our
+	 * busyness to react smoothly to sudden changes, while still keeping the
+	 * accuracy and making sure that over time the average follows busyness
+	 * as measured just-in-time. therefore, we will calculate the average
+	 * busyness using integer math, but shift the decimal point two places
+	 * to the right, so that 100.0 becomes 10000. this allows us to report
+	 * integer values (0..100) while still allowing ourselves to follow the
+	 * just-in-time measurements when we calculate our averages.
+	 */
+	const int max_raw_idle = LCORE_BUSYNESS_MAX * 100;
+
+	/*
+	 * at upper end of the busyness scale, going up from 90->100 will take
+	 * longer than going from 10->20 because of the averaging. to address
+	 * this, we invert the scale when doing calculations: that is, we
+	 * effectively calculate average *idle* cycle percentage, not average
+	 * *busy* cycle percentage. this means that the scale is naturally
+	 * biased towards fast scaling up, and slow scaling down.
+	 */
+	const int prev_raw_idle = max_raw_idle - tdata->raw_busyness;
+
+	/* calculate rate of idle cycles, times 100 */
+	const int cur_raw_idle = (int)((empty * max_raw_idle) / total);
+
+	/* smoothen the idleness */
+	const int smoothened_idle = (cur_raw_idle + prev_raw_idle * 4) / 5;
+
+	/* convert idleness back to busyness */
+	return max_raw_idle - smoothened_idle;
+}
+
+void __rte_lcore_telemetry_timestamp(uint16_t nb_rx)
+{
+	const unsigned int lcore_id = rte_lcore_id();
+	uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts;
+	struct lcore_telemetry *tdata = &telemetry_data[lcore_id];
+	const bool empty = nb_rx == 0;
+	uint64_t diff_int, diff_last;
+	bool last_empty;
+
+	last_empty = tdata->last_empty;
+
+	/* optimization: don't do anything if status hasn't changed */
+	if (last_empty == empty && tdata->contig_poll_cnt++ < 32)
+		return;
+	/* status changed or we're waiting for too long, reset counter */
+	tdata->contig_poll_cnt = 0;
+
+	cur_tsc = rte_rdtsc();
+
+	interval_ts = tdata->interval_ts;
+	empty_cycles = tdata->empty_cycles;
+	last_poll_ts = tdata->last_poll_ts;
+
+	diff_int = cur_tsc - interval_ts;
+	diff_last = cur_tsc - last_poll_ts;
+
+	/* is this the first time we're here? */
+	if (interval_ts == 0) {
+		tdata->busyness = LCORE_BUSYNESS_MIN;
+		tdata->raw_busyness = 0;
+		tdata->interval_ts = cur_tsc;
+		tdata->empty_cycles = 0;
+		tdata->contig_poll_cnt = 0;
+		goto end;
+	}
+
+	/* update the empty counter if we got an empty poll earlier */
+	if (last_empty)
+		empty_cycles += diff_last;
+
+	/* have we passed the interval? */
+	if (diff_int > RTE_LCORE_BUSYNESS_PERIOD) {
+		int raw_busyness;
+
+		/* get updated busyness value */
+		raw_busyness = calc_raw_busyness(tdata, empty_cycles, diff_int);
+
+		/* set a new interval, reset empty counter */
+		tdata->interval_ts = cur_tsc;
+		tdata->empty_cycles = 0;
+		tdata->raw_busyness = raw_busyness;
+		/* bring busyness back to 0..100 range, biased to round up */
+		tdata->busyness = (raw_busyness + 50) / 100;
+	} else
+		/* we may have updated empty counter */
+		tdata->empty_cycles = empty_cycles;
+
+end:
+	/* update status for next poll */
+	tdata->last_poll_ts = cur_tsc;
+	tdata->last_empty = empty;
+}
+
+static int
+lcore_busyness_enable(const char *cmd __rte_unused,
+		      const char *params __rte_unused,
+		      struct rte_tel_data *d)
+{
+	rte_lcore_busyness_enabled_set(1);
+
+	rte_tel_data_start_dict(d);
+
+	rte_tel_data_add_dict_int(d, "busyness_enabled", 1);
+
+	return 0;
+}
+
+static int
+lcore_busyness_disable(const char *cmd __rte_unused,
+		       const char *params __rte_unused,
+		       struct rte_tel_data *d)
+{
+	rte_lcore_busyness_enabled_set(0);
+
+	rte_tel_data_start_dict(d);
+
+	rte_tel_data_add_dict_int(d, "busyness_enabled", 0);
+
+	return 0;
+}
+
+static int
+lcore_handle_busyness(const char *cmd __rte_unused,
+		      const char *params __rte_unused, struct rte_tel_data *d)
+{
+	char corenum[64];
+	int i;
+
+	rte_tel_data_start_dict(d);
+
+	RTE_LCORE_FOREACH(i) {
+		if (!rte_lcore_is_enabled(i))
+			continue;
+		snprintf(corenum, sizeof(corenum), "%d", i);
+		rte_tel_data_add_dict_int(d, corenum, rte_lcore_busyness(i));
+	}
+
+	return 0;
+}
+
+RTE_INIT(lcore_init_telemetry)
+{
+	__rte_lcore_telemetry_enabled = true;
+
+	lcore_config_init();
+
+	rte_telemetry_register_cmd("/eal/lcore/busyness", lcore_handle_busyness,
+				   "return percentage busyness of cores");
+
+	rte_telemetry_register_cmd("/eal/lcore/busyness_enable", lcore_busyness_enable,
+				   "enable lcore busyness measurement");
+
+	rte_telemetry_register_cmd("/eal/lcore/busyness_disable", lcore_busyness_disable,
+				   "disable lcore busyness measurement");
+}
+
+#else
+
+int rte_lcore_busyness(unsigned int lcore_id __rte_unused)
+{
+	return -ENOTSUP;
+}
+
+int rte_lcore_busyness_enabled(void)
+{
+	return -ENOTSUP;
+}
+
+void rte_lcore_busyness_enabled_set(int enable __rte_unused)
+{
+}
+
+void __rte_lcore_telemetry_timestamp(uint16_t nb_rx __rte_unused)
+{
+}
+
+#endif
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 917758cc65..a743e66a7d 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -17,6 +17,7 @@  sources += files(
         'eal_common_hexdump.c',
         'eal_common_interrupts.c',
         'eal_common_launch.c',
+        'eal_common_lcore_telemetry.c',
         'eal_common_lcore.c',
         'eal_common_log.c',
         'eal_common_mcfg.c',
diff --git a/lib/eal/include/rte_lcore.h b/lib/eal/include/rte_lcore.h
index b598e1b9ec..ab7a8e1e26 100644
--- a/lib/eal/include/rte_lcore.h
+++ b/lib/eal/include/rte_lcore.h
@@ -415,6 +415,86 @@  rte_ctrl_thread_create(pthread_t *thread, const char *name,
 		const pthread_attr_t *attr,
 		void *(*start_routine)(void *), void *arg);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Read busyness value corresponding to an lcore.
+ *
+ * @param lcore_id
+ *   Lcore to read busyness value for.
+ * @return
+ *   - value between 0 and 100 on success
+ *   - -1 if lcore is not active
+ *   - -EINVAL if lcore is invalid
+ *   - -ENOMEM if not enough memory available
+ *   - -ENOTSUP if not supported
+ */
+__rte_experimental
+int
+rte_lcore_busyness(unsigned int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Check if lcore busyness telemetry is enabled.
+ *
+ * @return
+ *   - 1 if lcore telemetry is enabled
+ *   - 0 if lcore telemetry is disabled
+ *   - -ENOTSUP if not lcore telemetry supported
+ */
+__rte_experimental
+int
+rte_lcore_busyness_enabled(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Enable or disable busyness telemetry.
+ *
+ * @param enable
+ *   1 to enable, 0 to disable
+ */
+__rte_experimental
+void
+rte_lcore_busyness_enabled_set(int enable);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Lcore telemetry timestamping function.
+ *
+ * @param nb_rx
+ *   Number of buffers processed by lcore.
+ */
+__rte_experimental
+void
+__rte_lcore_telemetry_timestamp(uint16_t nb_rx);
+
+/** @internal lcore telemetry enabled status */
+extern int __rte_lcore_telemetry_enabled;
+
+/**
+ * Call lcore telemetry timestamp function.
+ *
+ * @param nb_rx
+ *   Number of buffers processed by lcore.
+ */
+#ifdef RTE_LCORE_BUSYNESS
+#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx)                    \
+	do {                                                    \
+		if (__rte_lcore_telemetry_enabled)              \
+			__rte_lcore_telemetry_timestamp(nb_rx); \
+	} while (0)
+#else
+#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx) \
+	do { (void)(nb_rx); } while (0)
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/meson.build b/lib/eal/meson.build
index 056beb9461..7199aa03c2 100644
--- a/lib/eal/meson.build
+++ b/lib/eal/meson.build
@@ -25,6 +25,9 @@  subdir(arch_subdir)
 deps += ['kvargs']
 if not is_windows
     deps += ['telemetry']
+else
+    # lcore busyness telemetry depends on the telemetry library,
+    # which is not available on Windows
+    dpdk_conf.set('RTE_LCORE_BUSYNESS', false)
 endif
 if dpdk_conf.has('RTE_USE_LIBBSD')
     ext_deps += libbsd
diff --git a/lib/eal/version.map b/lib/eal/version.map
index c2a2cebf69..52061b30f0 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -424,6 +424,13 @@  EXPERIMENTAL {
 	rte_thread_self;
 	rte_thread_set_affinity_by_id;
 	rte_thread_set_priority;
+
+	# added in 22.11
+	__rte_lcore_telemetry_timestamp;
+	__rte_lcore_telemetry_enabled;
+	rte_lcore_busyness;
+	rte_lcore_busyness_enabled;
+	rte_lcore_busyness_enabled_set;
 };
 
 INTERNAL {
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index de9e970d4d..1caecd5a11 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -5675,6 +5675,8 @@  rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
 #endif
 
 	rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
+
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx);
 	return nb_rx;
 }
 
diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
index 6a6f6ea4c1..a1d42d9214 100644
--- a/lib/eventdev/rte_eventdev.h
+++ b/lib/eventdev/rte_eventdev.h
@@ -2153,6 +2153,7 @@  rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
 			uint16_t nb_events, uint64_t timeout_ticks)
 {
 	const struct rte_event_fp_ops *fp_ops;
+	uint16_t nb_evts;
 	void *port;
 
 	fp_ops = &rte_event_fp_ops[dev_id];
@@ -2175,10 +2176,13 @@  rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
 	 * requests nb_events as const one
 	 */
 	if (nb_events == 1)
-		return (fp_ops->dequeue)(port, ev, timeout_ticks);
+		nb_evts = (fp_ops->dequeue)(port, ev, timeout_ticks);
 	else
-		return (fp_ops->dequeue_burst)(port, ev, nb_events,
-					       timeout_ticks);
+		nb_evts = (fp_ops->dequeue_burst)(port, ev, nb_events,
+					timeout_ticks);
+
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_evts);
+	return nb_evts;
 }
 
 #define RTE_EVENT_DEV_MAINT_OP_FLUSH          (1 << 0)
diff --git a/lib/rawdev/rte_rawdev.c b/lib/rawdev/rte_rawdev.c
index 2f0a4f132e..27163e87cb 100644
--- a/lib/rawdev/rte_rawdev.c
+++ b/lib/rawdev/rte_rawdev.c
@@ -226,12 +226,15 @@  rte_rawdev_dequeue_buffers(uint16_t dev_id,
 			   rte_rawdev_obj_t context)
 {
 	struct rte_rawdev *dev;
+	int nb_ops;
 
 	RTE_RAWDEV_VALID_DEVID_OR_ERR_RET(dev_id, -EINVAL);
 	dev = &rte_rawdevs[dev_id];
 
 	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->dequeue_bufs, -ENOTSUP);
-	return (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
+	nb_ops = (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 int
diff --git a/lib/regexdev/rte_regexdev.h b/lib/regexdev/rte_regexdev.h
index 3bce8090f6..781055b4eb 100644
--- a/lib/regexdev/rte_regexdev.h
+++ b/lib/regexdev/rte_regexdev.h
@@ -1530,6 +1530,7 @@  rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 			   struct rte_regex_ops **ops, uint16_t nb_ops)
 {
 	struct rte_regexdev *dev = &rte_regex_devices[dev_id];
+	uint16_t deq_ops;
 #ifdef RTE_LIBRTE_REGEXDEV_DEBUG
 	RTE_REGEXDEV_VALID_DEV_ID_OR_ERR_RET(dev_id, -EINVAL);
 	RTE_FUNC_PTR_OR_ERR_RET(*dev->dequeue, -ENOTSUP);
@@ -1538,7 +1539,9 @@  rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 		return -EINVAL;
 	}
 #endif
-	return (*dev->dequeue)(dev, qp_id, ops, nb_ops);
+	deq_ops = (*dev->dequeue)(dev, qp_id, ops, nb_ops);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(deq_ops);
+	return deq_ops;
 }
 
 #ifdef __cplusplus
diff --git a/lib/ring/rte_ring_elem_pvt.h b/lib/ring/rte_ring_elem_pvt.h
index 83788c56e6..6db09d4291 100644
--- a/lib/ring/rte_ring_elem_pvt.h
+++ b/lib/ring/rte_ring_elem_pvt.h
@@ -379,6 +379,7 @@  __rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
 end:
 	if (available != NULL)
 		*available = entries - n;
+	RTE_LCORE_TELEMETRY_TIMESTAMP(n);
 	return n;
 }