
[v7,0/4] Add lcore poll busyness telemetry

Message ID 20220914092929.1159773-1-kevin.laatz@intel.com (mailing list archive)

Message

Kevin Laatz Sept. 14, 2022, 9:29 a.m. UTC
  Currently, there is no way to measure lcore polling busyness in a passive
way, without any modifications to the application. This patchset adds a new
EAL API that will be able to passively track core polling busyness. As part
of the set, new telemetry endpoints are added to read the generated metrics.

---
v7:
  * Rename funcs, vars, files to include "poll" where missing.

v6:
  * Add API and perf unit tests

v5:
  * Fix Windows build
  * Make lcore_telemetry_free() an internal interface
  * Minor cleanup

v4:
  * Fix doc build
  * Rename timestamp macro to RTE_LCORE_POLL_BUSYNESS_TIMESTAMP
  * Make enable/disable read and write atomic
  * Change rte_lcore_poll_busyness_enabled_set() param to bool
  * Move mem alloc from enable/disable to init/cleanup
  * Other minor fixes

v3:
  * Fix missing renaming to poll busyness
  * Fix clang compilation
  * Fix arm compilation

v2:
  * Use rte_get_tsc_hz() to adjust the telemetry period
  * Rename to reflect polling busyness vs general busyness
  * Fix segfault when calling telemetry timestamp from an unregistered
    non-EAL thread.
  * Minor cleanup

Anatoly Burakov (2):
  eal: add lcore poll busyness telemetry
  eal: add cpuset lcore telemetry entries

Kevin Laatz (2):
  app/test: add unit tests for lcore poll busyness
  doc: add howto guide for lcore poll busyness

 app/test/meson.build                          |   4 +
 app/test/test_lcore_poll_busyness_api.c       | 134 +++++++
 app/test/test_lcore_poll_busyness_perf.c      |  72 ++++
 config/meson.build                            |   1 +
 config/rte_config.h                           |   1 +
 doc/guides/howto/index.rst                    |   1 +
 doc/guides/howto/lcore_poll_busyness.rst      |  93 +++++
 lib/bbdev/rte_bbdev.h                         |  17 +-
 lib/compressdev/rte_compressdev.c             |   2 +
 lib/cryptodev/rte_cryptodev.h                 |   2 +
 lib/distributor/rte_distributor.c             |  21 +-
 lib/distributor/rte_distributor_single.c      |  14 +-
 lib/dmadev/rte_dmadev.h                       |  15 +-
 .../common/eal_common_lcore_poll_telemetry.c  | 350 ++++++++++++++++++
 lib/eal/common/meson.build                    |   1 +
 lib/eal/freebsd/eal.c                         |   1 +
 lib/eal/include/rte_lcore.h                   |  85 ++++-
 lib/eal/linux/eal.c                           |   1 +
 lib/eal/meson.build                           |   3 +
 lib/eal/version.map                           |   7 +
 lib/ethdev/rte_ethdev.h                       |   2 +
 lib/eventdev/rte_eventdev.h                   |  10 +-
 lib/rawdev/rte_rawdev.c                       |   6 +-
 lib/regexdev/rte_regexdev.h                   |   5 +-
 lib/ring/rte_ring_elem_pvt.h                  |   1 +
 meson_options.txt                             |   2 +
 26 files changed, 826 insertions(+), 25 deletions(-)
 create mode 100644 app/test/test_lcore_poll_busyness_api.c
 create mode 100644 app/test/test_lcore_poll_busyness_perf.c
 create mode 100644 doc/guides/howto/lcore_poll_busyness.rst
 create mode 100644 lib/eal/common/eal_common_lcore_poll_telemetry.c
  

Comments

Stephen Hemminger Sept. 14, 2022, 2:33 p.m. UTC | #1
On Wed, 14 Sep 2022 10:29:25 +0100
Kevin Laatz <kevin.laatz@intel.com> wrote:

> Currently, there is no way to measure lcore polling busyness in a passive
> way, without any modifications to the application. This patchset adds a new
> EAL API that will be able to passively track core polling busyness. As part
> of the set, new telemetry endpoints are added to read the generate metrics.

How much does measuring busyness impact performance?

In the past, calling rte_rdtsc() would slow down the packet rate
because it stalls the CPU pipeline. It may be better on more modern
processors; I haven't measured it lately.
  
Kevin Laatz Sept. 16, 2022, 12:35 p.m. UTC | #2
On 14/09/2022 15:33, Stephen Hemminger wrote:
> On Wed, 14 Sep 2022 10:29:25 +0100
> Kevin Laatz <kevin.laatz@intel.com> wrote:
>
>> Currently, there is no way to measure lcore polling busyness in a passive
>> way, without any modifications to the application. This patchset adds a new
>> EAL API that will be able to passively track core polling busyness. As part
>> of the set, new telemetry endpoints are added to read the generate metrics.
> How much does measuring busyness impact performance??
>
> In the past, calling rte_rdsc() would slow down packet rate
> because it is stops CPU pipeline. Maybe better on more modern
> processors, haven't measured it lately.

Hi Stephen,

I've run some 0.001% loss tests using 2x 100G ports, with 64B packets 
using testpmd for forwarding. Those tests show a ~2.7% performance 
impact when the lcore poll busyness feature is enabled vs compile-time 
disabled.
Applications with more compute intensive workloads should see less 
performance impact since the proportion of time spent time-stamping will 
be smaller.

In addition, a performance autotest has been added in this patchset 
which measures the cycle cost of calling the timestamp macro. Please 
feel free to test it on your system (lcore_poll_busyness_perf_autotest).
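
For reference, here is a minimal standalone sketch that measures the raw
cost of rte_rdtsc() on a given machine. It is only an illustration built
on existing DPDK APIs, not the autotest from the patchset:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <rte_eal.h>
#include <rte_cycles.h>

int main(int argc, char **argv)
{
	const int iters = 10 * 1000 * 1000;
	uint64_t sum = 0;

	if (rte_eal_init(argc, argv) < 0)
		return -1;

	uint64_t start = rte_rdtsc_precise();
	for (int i = 0; i < iters; i++)
		sum += rte_rdtsc();	/* the call under test */
	uint64_t end = rte_rdtsc_precise();

	/* sum is printed only to keep the loop from being optimized away */
	printf("avg cycles per rte_rdtsc(): %.2f (checksum %" PRIu64 ")\n",
	       (double)(end - start) / iters, sum);

	rte_eal_cleanup();
	return 0;
}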

-Kevin
  
Kevin Laatz Sept. 16, 2022, 2:10 p.m. UTC | #3
On 16/09/2022 13:35, Kevin Laatz wrote:
> On 14/09/2022 15:33, Stephen Hemminger wrote:
>> On Wed, 14 Sep 2022 10:29:25 +0100
>> Kevin Laatz <kevin.laatz@intel.com> wrote:
>>
>>> Currently, there is no way to measure lcore polling busyness in a 
>>> passive
>>> way, without any modifications to the application. This patchset 
>>> adds a new
>>> EAL API that will be able to passively track core polling busyness. 
>>> As part
>>> of the set, new telemetry endpoints are added to read the generate 
>>> metrics.
>> How much does measuring busyness impact performance??
>>
>> In the past, calling rte_rdsc() would slow down packet rate
>> because it is stops CPU pipeline. Maybe better on more modern
>> processors, haven't measured it lately.
>
> Hi Stephen,
>
> I've run some 0.001% loss tests using 2x 100G ports, with 64B packets 
> using testpmd for forwarding. Those tests show a ~2.7% performance 
> impact when the lcore poll busyness feature is enabled vs compile-time 
> disabled.
> Applications with more compute intensive workloads should see less 
> performance impact since the proportion of time spent time-stamping 
> will be smaller.
>
> In addition, a performance autotest has been added in this patchset 
> which measures the cycles cost of calling the timestamp macro. Please 
> feel free to test it on your system (lcore_poll_busyness_perf_autotest).
>
Also worth mentioning: when lcore poll busyness is enabled at 
compile-time but disabled at run-time, we see *zero* performance impact.
  
Kevin Laatz Oct. 5, 2022, 1:44 p.m. UTC | #4
On 14/09/2022 10:29, Kevin Laatz wrote:
> Currently, there is no way to measure lcore polling busyness in a passive
> way, without any modifications to the application. This patchset adds a new
> EAL API that will be able to passively track core polling busyness. As part
> of the set, new telemetry endpoints are added to read the generate metrics.
>
> ---
> v7:
>    * Rename funcs, vars, files to include "poll" where missing.
>
> v6:
>    * Add API and perf unit tests
>
> v5:
>    * Fix Windows build
>    * Make lcore_telemetry_free() an internal interface
>    * Minor cleanup
>
> v4:
>    * Fix doc build
>    * Rename timestamp macro to RTE_LCORE_POLL_BUSYNESS_TIMESTAMP
>    * Make enable/disable read and write atomic
>    * Change rte_lcore_poll_busyness_enabled_set() param to bool
>    * Move mem alloc from enable/disable to init/cleanup
>    * Other minor fixes
>
> v3:
>    * Fix missing renaming to poll busyness
>    * Fix clang compilation
>    * Fix arm compilation
>
> v2:
>    * Use rte_get_tsc_hz() to adjust the telemetry period
>    * Rename to reflect polling busyness vs general busyness
>    * Fix segfault when calling telemetry timestamp from an unregistered
>      non-EAL thread.
>    * Minor cleanup
>
> Anatoly Burakov (2):
>    eal: add lcore poll busyness telemetry
>    eal: add cpuset lcore telemetry entries
>
> Kevin Laatz (2):
>    app/test: add unit tests for lcore poll busyness
>    doc: add howto guide for lcore poll busyness
>
>   app/test/meson.build                          |   4 +
>   app/test/test_lcore_poll_busyness_api.c       | 134 +++++++
>   app/test/test_lcore_poll_busyness_perf.c      |  72 ++++
>   config/meson.build                            |   1 +
>   config/rte_config.h                           |   1 +
>   doc/guides/howto/index.rst                    |   1 +
>   doc/guides/howto/lcore_poll_busyness.rst      |  93 +++++
>   lib/bbdev/rte_bbdev.h                         |  17 +-
>   lib/compressdev/rte_compressdev.c             |   2 +
>   lib/cryptodev/rte_cryptodev.h                 |   2 +
>   lib/distributor/rte_distributor.c             |  21 +-
>   lib/distributor/rte_distributor_single.c      |  14 +-
>   lib/dmadev/rte_dmadev.h                       |  15 +-
>   .../common/eal_common_lcore_poll_telemetry.c  | 350 ++++++++++++++++++
>   lib/eal/common/meson.build                    |   1 +
>   lib/eal/freebsd/eal.c                         |   1 +
>   lib/eal/include/rte_lcore.h                   |  85 ++++-
>   lib/eal/linux/eal.c                           |   1 +
>   lib/eal/meson.build                           |   3 +
>   lib/eal/version.map                           |   7 +
>   lib/ethdev/rte_ethdev.h                       |   2 +
>   lib/eventdev/rte_eventdev.h                   |  10 +-
>   lib/rawdev/rte_rawdev.c                       |   6 +-
>   lib/regexdev/rte_regexdev.h                   |   5 +-
>   lib/ring/rte_ring_elem_pvt.h                  |   1 +
>   meson_options.txt                             |   2 +
>   26 files changed, 826 insertions(+), 25 deletions(-)
>   create mode 100644 app/test/test_lcore_poll_busyness_api.c
>   create mode 100644 app/test/test_lcore_poll_busyness_perf.c
>   create mode 100644 doc/guides/howto/lcore_poll_busyness.rst
>   create mode 100644 lib/eal/common/eal_common_lcore_poll_telemetry.c

Based on the feedback in the discussions on this patchset, we have 
decided to withdraw this patchset from the 22.11 release.

We will re-evaluate the design with the aim of providing a more acceptable 
solution in a future release.

---
Kevin
  
Morten Brørup Oct. 6, 2022, 1:25 p.m. UTC | #5
> From: Kevin Laatz [mailto:kevin.laatz@intel.com]
> Sent: Wednesday, 5 October 2022 15.45
> 
> On 14/09/2022 10:29, Kevin Laatz wrote:
> > Currently, there is no way to measure lcore polling busyness in a
> passive
> > way, without any modifications to the application. This patchset adds
> a new
> > EAL API that will be able to passively track core polling busyness.
> As part
> > of the set, new telemetry endpoints are added to read the generate
> metrics.
> >
> > ---
> 
> Based on the feedback in the discussions on this patchset, we have
> decided to revoke the submission of this patchset for the 22.11
> release.
> 
> We will re-evaluate the design with the aim to provide a more
> acceptable
> solution in a future release.

Good call. Thank you!

I suggest having an open discussion about requirements/expectations for such a solution, before you implement any code.

We haven't found the golden solution for our application, but we have discussed it quite a lot internally. Here are some of our thoughts:

The application must feed the library with information about how much work it is doing.

E.g. A pipeline stage that polls the NIC for N ingress packets could feed the busyness library with values such as:
 - "no work": zero packets received,
 - "25 % utilization": less than N packets received (in this example: 8 of max 32 packets = 25 %), or
 - "100% utilization, possibly more work to do": all N packets received (more packets could be ready in the queue, but we don't know).

A pipeline stage that services a QoS scheduler could additionally feed the library with values such as:
 - "100% utilization, definitely more work to do": stopped processing due to some "max work per call" limitation.
 - "waiting, no work until [DELAY] ns": current timeslot has been filled, waiting for the next timeslot to start.

It is important to note that any pipeline stage processing packets (or some other objects!) might process a different maximum number of objects than the ingress pipeline stage. What I mean is: The number N might not be the same for all pipeline stages.
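
To make the "feeding" idea concrete, a rough sketch of what the reporting
side could look like is shown below. All names here are made up for
illustration; this is not an existing or proposed DPDK API:

#include <stdint.h>
#include <rte_common.h>
#include <rte_lcore.h>

struct stage_stats {
	uint64_t polls;		/* poll operations reported */
	uint64_t items;		/* items actually handled (e.g. packets) */
	uint64_t items_max;	/* items that could have been handled */
} __rte_cache_aligned;		/* one entry per lcore, to avoid false sharing */

static struct stage_stats stats[RTE_MAX_LCORE];

/* Report one poll operation for the calling lcore. */
static inline void
stage_report_poll(uint16_t nb_done, uint16_t nb_max)
{
	unsigned int lcore = rte_lcore_id();

	if (lcore >= RTE_MAX_LCORE)	/* unregistered non-EAL thread */
		return;

	stats[lcore].polls++;
	stats[lcore].items += nb_done;
	stats[lcore].items_max += nb_max;
}

An RX stage would then call stage_report_poll(nb_rx, burst_size) right
after rte_eth_rx_burst(), and a utilization percentage per lcore falls out
as 100 * items / items_max over any sampling interval.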


The information should be collected per lcore or thread, also to prevent cache thrashing.

Additionally, it could be collected per pipeline stage too, making the collection two-dimensional. This would essentially make it a profiling library, where you - in addition to seeing how much time is spent working - also can see which work the time is spent on.

As mentioned during the previous discussions, APIs should be provided to make the collected information machine readable, so the application can use it for power management and other purposes.

One of the simple things I would like to be able to extract from such a library is CPU Utilization (percentage) per lcore.

And since I want the CPU Utilization to be shown for multiple time intervals (usually 1, 5 or 15 minutes; but perhaps also 1 second or 1 millisecond), the output data should be exposed as a counter type, so my "loadavg application" can calculate the rate by subtracting the previously obtained value from the current value and dividing the difference by the time interval.
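
As an example of the consuming side: assuming the library exposed a
monotonically increasing "busy TSC cycles" counter per lcore (a
hypothetical interface, not the one from this patchset), a loadavg-style
reader could compute the utilization over any interval like this:

#include <stdint.h>
#include <rte_cycles.h>

struct lcore_sample {
	uint64_t busy_cycles;	/* busy-cycle counter read from the library */
	uint64_t tsc;		/* rte_rdtsc() at the time of the sample */
};

/* CPU utilization in percent over the interval between two samples. */
static double
lcore_utilization(const struct lcore_sample *prev,
		  const struct lcore_sample *cur)
{
	uint64_t wall = cur->tsc - prev->tsc;

	if (wall == 0)
		return 0.0;
	return 100.0 * (double)(cur->busy_cycles - prev->busy_cycles) / wall;
}

Sampling once per second, per minute, and so on then gives the 1 s, 1 min,
etc. utilization figures directly.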

-Morten
  
Mattias Rönnblom Oct. 6, 2022, 3:26 p.m. UTC | #6
On 2022-10-06 15:25, Morten Brørup wrote:
>> From: Kevin Laatz [mailto:kevin.laatz@intel.com]
>> Sent: Wednesday, 5 October 2022 15.45
>>
>> On 14/09/2022 10:29, Kevin Laatz wrote:
>>> Currently, there is no way to measure lcore polling busyness in a
>> passive
>>> way, without any modifications to the application. This patchset adds
>> a new
>>> EAL API that will be able to passively track core polling busyness.
>> As part
>>> of the set, new telemetry endpoints are added to read the generate
>> metrics.
>>>
>>> ---
>>
>> Based on the feedback in the discussions on this patchset, we have
>> decided to revoke the submission of this patchset for the 22.11
>> release.
>>
>> We will re-evaluate the design with the aim to provide a more
>> acceptable
>> solution in a future release.
> 
> Good call. Thank you!
> 
> I suggest having an open discussion about requirements/expectations for such a solution, before you implement any code.
> 
> We haven't found the golden solution for our application, but we have discussed it quite a lot internally. Here are some of our thoughts:
> 
> The application must feed the library with information about how much work it is doing.
> 
> E.g. A pipeline stage that polls the NIC for N ingress packets could feed the busyness library with values such as:
>   - "no work": zero packets received,
>   - "25 % utilization": less than N packets received (in this example: 8 of max 32 packets = 25 %), or
>   - "100% utilization, possibly more work to do": all N packets received (more packets could be ready in the queue, but we don't know).
> 

If some lcore's NIC RX queue always, for every poll operation, produces 
8 packets out of a max burst of 32, I would argue that lcore is 100% 
busy. With always something to do, it doesn't have a single cycle to spare.

It seems to me that you basically have two options, if you do 
application-level "busyness" reporting.

Either the application
a) reports when a section of useful work begins, and when it ends, as 
two separate function calls.
b) after having taken a time stamp, and having completed a section of 
code which turned out to be something useful, it reports back to the 
busyness module with one function call, containing the busy cycles spent.

In a), the two calls could be to the same function, with a boolean 
argument informing the busyness module if this is the beginning of a 
busy or an idle period. In such case, just pass "num_pkts_dequeued > 0" 
to the call.
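
A minimal sketch of variant a), with a single call taking a boolean (all
names are made up; this is not the API from the patchset):

#include <stdbool.h>
#include <stdint.h>
#include <rte_common.h>
#include <rte_cycles.h>
#include <rte_lcore.h>

struct poll_state {
	uint64_t busy_cycles;	/* accumulated busy TSC cycles */
	uint64_t last_tsc;	/* timestamp of the last reported transition */
	bool busy;		/* state we are currently in */
} __rte_cache_aligned;

static struct poll_state poll_state[RTE_MAX_LCORE];

/* Call after every poll, e.g. lcore_poll_report(num_pkts_dequeued > 0);
 * the time since the previous call is attributed to whichever state was
 * active, and the new state begins now. */
static inline void
lcore_poll_report(bool did_work)
{
	unsigned int lcore = rte_lcore_id();
	uint64_t now;

	if (lcore >= RTE_MAX_LCORE)	/* unregistered non-EAL thread */
		return;

	now = rte_rdtsc();
	if (poll_state[lcore].busy)
		poll_state[lcore].busy_cycles += now - poll_state[lcore].last_tsc;
	poll_state[lcore].last_tsc = now;
	poll_state[lcore].busy = did_work;
}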

What you would like is a solution which avoids ping-pong between idle and 
busy states (with the resulting time stamping and computations) in 
scenarios where an lcore thread mixes sources of work which often have 
items available with sources that do not (e.g., packets in an RX queue 
versus reassembly timeouts in a core-local timer wheel). In that 
situation, it would be better to attribute the timer wheel poll cycles as 
busy cycles.

Another crucial aspect is that you want the API to be simple, and code 
changes to be minimal.

It's unclear to me if you need to account for both idle and busy cycles, 
or only busy cycles, and assume all other cycles are idle. They will be 
for a traditional 1:1 EAL thread <-> CPU core mapping, but not if the 
"--lcores" parameter is used to create floating EAL threads, and EAL 
threads which share the same core, and thus may not be able to use 100% 
of the TSC cycles.

> A pipeline stage that services a QoS scheduler could additionally feed the library with values such as:
>   - "100% utilization, definitely more work to do": stopped processing due to some "max work per call" limitation.
>   - "waiting, no work until [DELAY] ns": current timeslot has been filled, waiting for the next timeslot to start.
> 
> It is important to note that any pipeline stage processing packets (or some other objects!) might process a different maximum number of objects than the ingress pipeline stage. What I mean is: The number N might not be the same for all pipeline stages.
> 
> 
> The information should be collected per lcore or thread, also to prevent cache trashing.
> 
> Additionally, it could be collected per pipeline stage too, making the collection two-dimensional. This would essentially make it a profiling library, where you - in addition to seeing how much time is spent working - also can see which work the time is spent on.
> 

If you introduce subcategories of "busy", like "busy-with-X" and 
"busy-with-Y", the bookkeeping will be more expensive, since you will 
transition between states even for 100% busy lcores (which in principle 
you never, or at least very rarely, need to do if you have only busy and 
idle as states).

If your application is organized as DPDK services, you will get this 
already today, on a DPDK service level.

If you have your application organized as a pipeline, and you use an 
event device as a scheduler between the stages, that event device has a 
good opportunity to do this kind of bookkeeping. DSW, for example, keeps 
track of the average processing latency for events, and how many events 
of various types have been processed.

> As mentioned during the previous discussions, APIs should be provided to make the collected information machine readable, so the application can use it for power management and other purposes.
> 
> One of the simple things I would like to be able to extract from such a library is CPU Utilization (percentage) per lcore. >
> And since I want the CPU Utilization to be shown for multiple the time intervals (usually 1, 5 or 15 minutes; but perhaps also 1 second or 1 millisecond) the output data should be exposed as a counter type, so my "loadavg application" can calculate the rate by subtracting the previously obtained value from the current value and divide the difference by the time interval.
> 

I agree. In addition, you also want the "raw data" (lcore busy cycles) 
so you can do your own sampling, at your own favorite-length intervals.

> -Morten
>
  
Morten Brørup Oct. 10, 2022, 3:22 p.m. UTC | #7
> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Thursday, 6 October 2022 17.27
> 
> On 2022-10-06 15:25, Morten Brørup wrote:
> >> From: Kevin Laatz [mailto:kevin.laatz@intel.com]
> >> Sent: Wednesday, 5 October 2022 15.45
> >>
> >> On 14/09/2022 10:29, Kevin Laatz wrote:
> >>> Currently, there is no way to measure lcore polling busyness in a
> >> passive
> >>> way, without any modifications to the application. This patchset
> adds
> >> a new
> >>> EAL API that will be able to passively track core polling busyness.
> >> As part
> >>> of the set, new telemetry endpoints are added to read the generate
> >> metrics.
> >>>
> >>> ---
> >>
> >> Based on the feedback in the discussions on this patchset, we have
> >> decided to revoke the submission of this patchset for the 22.11
> >> release.
> >>
> >> We will re-evaluate the design with the aim to provide a more
> >> acceptable
> >> solution in a future release.
> >
> > Good call. Thank you!
> >
> > I suggest having an open discussion about requirements/expectations
> for such a solution, before you implement any code.
> >
> > We haven't found the golden solution for our application, but we have
> discussed it quite a lot internally. Here are some of our thoughts:
> >
> > The application must feed the library with information about how much
> work it is doing.
> >
> > E.g. A pipeline stage that polls the NIC for N ingress packets could
> feed the busyness library with values such as:
> >   - "no work": zero packets received,
> >   - "25 % utilization": less than N packets received (in this
> example: 8 of max 32 packets = 25 %), or
> >   - "100% utilization, possibly more work to do": all N packets
> received (more packets could be ready in the queue, but we don't know).
> >
> 
> If some lcore's NIC RX queue always, for every poll operation, produces
> 8 packets out of a max burst of 32, I would argue that lcore is 100%
> busy. With always something to do, it doesn't have a single cycle to
> spare.

I would argue that if I have four cores each only processing 25 % of the packets, one core would suffice instead. Or, the application could schedule the function at 1/4 of the frequency it does now (e.g. call the function once every 40 microseconds instead of once every 10 microseconds).

However, the busyness does not scale linearly with the number of packets processed - which is an intended benefit of bursting.

Here are some real life numbers from our in-house profiler library in a production environment, which says that polling the NIC for packets takes on average:

104 cycles when the NIC returns 0 packets,
529 cycles when the NIC returns 1 packet,
679 cycles when the NIC returns 8 packets, and
1275 cycles when the NIC returns a full burst of 32 packets.

(This includes some overhead from our application, so you will see other numbers in your application.)

> 
> It seems to me that you basically have two options, if you do
> application-level "busyness" reporting.
> 
> Either the application
> a) reports when a section of useful work begins, and when it ends, as
> two separate function calls.
> b) after having taken a time stamp, and having completed a section of
> code which turned out to be something useful, it reports back to the
> busyness module with one function call, containing the busy cycles
> spent.
> 
> In a), the two calls could be to the same function, with a boolean
> argument informing the busyness module if this is the beginning of a
> busy or an idle period. In such case, just pass "num_pkts_dequeued > 0"
> to the call.

Our profiler library has a start() and an end() function, and an end_and_start() function for when a section directly follows the preceding section (to only take one timestamp instead of two).
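
For illustration, such an interface could be approximated as below (the
names and details are guesses, not the actual in-house library):

#include <stdint.h>
#include <rte_cycles.h>

struct prof_section {
	uint64_t cycles;	/* total cycles spent in this section */
	uint64_t calls;		/* number of times the section was run */
};

static __thread uint64_t prof_start_tsc;

static inline void prof_start(void)
{
	prof_start_tsc = rte_rdtsc();
}

static inline void prof_end(struct prof_section *s)
{
	uint64_t now = rte_rdtsc();

	s->cycles += now - prof_start_tsc;
	s->calls++;
}

/* Close the previous section and open the next one with one timestamp. */
static inline void prof_end_and_start(struct prof_section *prev)
{
	uint64_t now = rte_rdtsc();

	prev->cycles += now - prof_start_tsc;
	prev->calls++;
	prof_start_tsc = now;
}

Average cycles per section then come out as cycles / calls.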

> 
> What you would like is a solution which avoid ping-pong between idle
> and
> busy states (with the resulting time stamping and computations) in
> scenarios where a lcore thread mix sources of work which often have
> items available, with sources that do not (e.g., packets in a RX queue
> versus reassembly timeouts in a core-local timer wheel). It would be
> better in that situation, to attribute the timer wheel poll cycles as
> busy cycles.
> 
> Another crucial aspect is that you want the API to be simple, and code
> changes to be minimal.
> 
> It's unclear to me if you need to account for both idle and busy
> cycles,
> or only busy cycles, and assume all other cycles are idle. The will be
> for a traditional 1:1 EAL thread <-> CPU core mapping, but not if the
> "--lcores" parameter is used to create floating EAL threads, and EAL
> threads which share the same core, and thus may not be able to use 100%
> of the TSC cycles.
> 
> > A pipeline stage that services a QoS scheduler could additionally
> feed the library with values such as:
> >   - "100% utilization, definitely more work to do": stopped
> processing due to some "max work per call" limitation.
> >   - "waiting, no work until [DELAY] ns": current timeslot has been
> filled, waiting for the next timeslot to start.
> >
> > It is important to note that any pipeline stage processing packets
> (or some other objects!) might process a different maximum number of
> objects than the ingress pipeline stage. What I mean is: The number N
> might not be the same for all pipeline stages.
> >
> >
> > The information should be collected per lcore or thread, also to
> prevent cache trashing.
> >
> > Additionally, it could be collected per pipeline stage too, making
> the collection two-dimensional. This would essentially make it a
> profiling library, where you - in addition to seeing how much time is
> spent working - also can see which work the time is spent on.
> >
> 
> If you introduce subcategories of "busy", like "busy-with-X", and
> "busy-with-Y", the book keeping will be more expensive, since you will
> transit between states even for 100% busy lcores (which in principle
> you
> never, or at least very rarely, need to do if you have only busy and
> idle as states).
> 
> If your application is organized as DPDK services, you will get this
> already today, on a DPDK service level.
> 
> If you have your application organized as a pipeline, and you use an
> event device as a scheduler between the stages, that event device has a
> good opportunity to do this kind of bookkeeping. DSW, for example,
> keeps
> track of the average processing latency for events, and how many events
> of various types have been processed.
> 

Lots of good input, Mattias. Let's see what others suggest. :-)

> > As mentioned during the previous discussions, APIs should be provided
> to make the collected information machine readable, so the application
> can use it for power management and other purposes.
> >
> > One of the simple things I would like to be able to extract from such
> a library is CPU Utilization (percentage) per lcore. >
> > And since I want the CPU Utilization to be shown for multiple the
> time intervals (usually 1, 5 or 15 minutes; but perhaps also 1 second
> or 1 millisecond) the output data should be exposed as a counter type,
> so my "loadavg application" can calculate the rate by subtracting the
> previously obtained value from the current value and divide the
> difference by the time interval.
> >
> 
> I agree. In addition, you also want the "raw data" (lcore busy cycles)
> so you can do you own sampling, at your own favorite-length intervals.
> 
> > -Morten
> >
  
Mattias Rönnblom Oct. 10, 2022, 5:38 p.m. UTC | #8
On 2022-10-10 17:22, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Thursday, 6 October 2022 17.27
>>
>> On 2022-10-06 15:25, Morten Brørup wrote:
>>>> From: Kevin Laatz [mailto:kevin.laatz@intel.com]
>>>> Sent: Wednesday, 5 October 2022 15.45
>>>>
>>>> On 14/09/2022 10:29, Kevin Laatz wrote:
>>>>> Currently, there is no way to measure lcore polling busyness in a
>>>> passive
>>>>> way, without any modifications to the application. This patchset
>> adds
>>>> a new
>>>>> EAL API that will be able to passively track core polling busyness.
>>>> As part
>>>>> of the set, new telemetry endpoints are added to read the generate
>>>> metrics.
>>>>>
>>>>> ---
>>>>
>>>> Based on the feedback in the discussions on this patchset, we have
>>>> decided to revoke the submission of this patchset for the 22.11
>>>> release.
>>>>
>>>> We will re-evaluate the design with the aim to provide a more
>>>> acceptable
>>>> solution in a future release.
>>>
>>> Good call. Thank you!
>>>
>>> I suggest having an open discussion about requirements/expectations
>> for such a solution, before you implement any code.
>>>
>>> We haven't found the golden solution for our application, but we have
>> discussed it quite a lot internally. Here are some of our thoughts:
>>>
>>> The application must feed the library with information about how much
>> work it is doing.
>>>
>>> E.g. A pipeline stage that polls the NIC for N ingress packets could
>> feed the busyness library with values such as:
>>>    - "no work": zero packets received,
>>>    - "25 % utilization": less than N packets received (in this
>> example: 8 of max 32 packets = 25 %), or
>>>    - "100% utilization, possibly more work to do": all N packets
>> received (more packets could be ready in the queue, but we don't know).
>>>
>>
>> If some lcore's NIC RX queue always, for every poll operation, produces
>> 8 packets out of a max burst of 32, I would argue that lcore is 100%
>> busy. With always something to do, it doesn't have a single cycle to
>> spare.
> 
> I would argue that if I have four cores each only processing 25 % of the packets, one core would suffice instead. Or, the application could schedule the function at 1/4 of the frequency it does now (e.g. call the function once every 40 microseconds instead of once every 10 microseconds).
> 

Do you mean "only processing packets 25% of the time"? If yes, being 
able to replace four cores @ 25% utilization with one core @ 100% might 
be a reasonable first guess. I'm not sure how it relates to what I 
wrote, though.

> However, the business does not scale linearly with the number of packets processed - which an intended benefit of bursting.
> 

Sure, there's usually a non-linear relationship between the system 
capacity used and the resulting CPU utilization. It can be both in the 
manner you describe below, with the per-packet processing latency 
reduced at higher rates, or the other way around. For example, NIC RX 
LLC stashing may cause a lot of LLC evictions, and generally the 
application might have a larger working set size during high load, so 
there may be forces working in the other direction as well.

It seems to me "busyness" telemetry value should just be lcore thread 
CPU utilization (in total, or with some per-module breakdown). If you 
want to know how much of the system's capacity is used, you need help 
from an application-specific agent, equipped with a model of how CPU 
utilization and capacity relates. Such a heuristic could take other 
factors into account as well, e.g. the average queue sizes, packet 
rates, packet sizes etc.

In my experience, for high touch applications (i.e., those that spend 
thousands of cycles per packet), CPU utilization is a pretty decent 
approximation of how much of the system's capacity is used.

> Here are some real life numbers from our in-house profiler library in a production environment, which says that polling the NIC for packets takes on average:
> 
> 104 cycles when the NIC returns 0 packets,
> 529 cycles when the NIC returns 1 packet,
> 679 cycles when the NIC returns 8 packets, and
> 1275 cycles when the NIC returns a full burst of 32 packets.
> 
> (This includes some overhead from our application, so you will see other numbers in your application.)
> 
>>
>> It seems to me that you basically have two options, if you do
>> application-level "busyness" reporting.
>>
>> Either the application
>> a) reports when a section of useful work begins, and when it ends, as
>> two separate function calls.
>> b) after having taken a time stamp, and having completed a section of
>> code which turned out to be something useful, it reports back to the
>> busyness module with one function call, containing the busy cycles
>> spent.
>>
>> In a), the two calls could be to the same function, with a boolean
>> argument informing the busyness module if this is the beginning of a
>> busy or an idle period. In such case, just pass "num_pkts_dequeued > 0"
>> to the call.
> 
> Our profiler library has a start()and an end() function, and an end_and_start() function for when a section directly follows the preceding section (to only take one timestamp instead of two).
> 

I like the idea of an end_and_start() (except for the name, maybe).

>>
>> What you would like is a solution which avoid ping-pong between idle
>> and
>> busy states (with the resulting time stamping and computations) in
>> scenarios where a lcore thread mix sources of work which often have
>> items available, with sources that do not (e.g., packets in a RX queue
>> versus reassembly timeouts in a core-local timer wheel). It would be
>> better in that situation, to attribute the timer wheel poll cycles as
>> busy cycles.
>>
>> Another crucial aspect is that you want the API to be simple, and code
>> changes to be minimal.
>>
>> It's unclear to me if you need to account for both idle and busy
>> cycles,
>> or only busy cycles, and assume all other cycles are idle. The will be
>> for a traditional 1:1 EAL thread <-> CPU core mapping, but not if the
>> "--lcores" parameter is used to create floating EAL threads, and EAL
>> threads which share the same core, and thus may not be able to use 100%
>> of the TSC cycles.
>>
>>> A pipeline stage that services a QoS scheduler could additionally
>> feed the library with values such as:
>>>    - "100% utilization, definitely more work to do": stopped
>> processing due to some "max work per call" limitation.
>>>    - "waiting, no work until [DELAY] ns": current timeslot has been
>> filled, waiting for the next timeslot to start.
>>>
>>> It is important to note that any pipeline stage processing packets
>> (or some other objects!) might process a different maximum number of
>> objects than the ingress pipeline stage. What I mean is: The number N
>> might not be the same for all pipeline stages.
>>>
>>>
>>> The information should be collected per lcore or thread, also to
>> prevent cache trashing.
>>>
>>> Additionally, it could be collected per pipeline stage too, making
>> the collection two-dimensional. This would essentially make it a
>> profiling library, where you - in addition to seeing how much time is
>> spent working - also can see which work the time is spent on.
>>>
>>
>> If you introduce subcategories of "busy", like "busy-with-X", and
>> "busy-with-Y", the book keeping will be more expensive, since you will
>> transit between states even for 100% busy lcores (which in principle
>> you
>> never, or at least very rarely, need to do if you have only busy and
>> idle as states).
>>
>> If your application is organized as DPDK services, you will get this
>> already today, on a DPDK service level.
>>
>> If you have your application organized as a pipeline, and you use an
>> event device as a scheduler between the stages, that event device has a
>> good opportunity to do this kind of bookkeeping. DSW, for example,
>> keeps
>> track of the average processing latency for events, and how many events
>> of various types have been processed.
>>
> 
> Lots of good input, Mattias. Let's see what others suggest. :-)
> 
>>> As mentioned during the previous discussions, APIs should be provided
>> to make the collected information machine readable, so the application
>> can use it for power management and other purposes.
>>>
>>> One of the simple things I would like to be able to extract from such
>> a library is CPU Utilization (percentage) per lcore. >
>>> And since I want the CPU Utilization to be shown for multiple the
>> time intervals (usually 1, 5 or 15 minutes; but perhaps also 1 second
>> or 1 millisecond) the output data should be exposed as a counter type,
>> so my "loadavg application" can calculate the rate by subtracting the
>> previously obtained value from the current value and divide the
>> difference by the time interval.
>>>
>>
>> I agree. In addition, you also want the "raw data" (lcore busy cycles)
>> so you can do you own sampling, at your own favorite-length intervals.
>>
>>> -Morten
>>>
>
  
Morten Brørup Oct. 12, 2022, 12:25 p.m. UTC | #9
> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Monday, 10 October 2022 19.39
> 
> On 2022-10-10 17:22, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >> Sent: Thursday, 6 October 2022 17.27
> >>
> >> On 2022-10-06 15:25, Morten Brørup wrote:
> >>>> From: Kevin Laatz [mailto:kevin.laatz@intel.com]
> >>>> Sent: Wednesday, 5 October 2022 15.45
> >>>>
> >>>> On 14/09/2022 10:29, Kevin Laatz wrote:
> >>>>> Currently, there is no way to measure lcore polling busyness in a
> >>>> passive
> >>>>> way, without any modifications to the application. This patchset
> >> adds
> >>>> a new
> >>>>> EAL API that will be able to passively track core polling
> busyness.
> >>>> As part
> >>>>> of the set, new telemetry endpoints are added to read the
> generate
> >>>> metrics.
> >>>>>
> >>>>> ---
> >>>>
> >>>> Based on the feedback in the discussions on this patchset, we have
> >>>> decided to revoke the submission of this patchset for the 22.11
> >>>> release.
> >>>>
> >>>> We will re-evaluate the design with the aim to provide a more
> >>>> acceptable
> >>>> solution in a future release.
> >>>
> >>> Good call. Thank you!
> >>>
> >>> I suggest having an open discussion about requirements/expectations
> >> for such a solution, before you implement any code.
> >>>
> >>> We haven't found the golden solution for our application, but we
> have
> >> discussed it quite a lot internally. Here are some of our thoughts:
> >>>
> >>> The application must feed the library with information about how
> much
> >> work it is doing.
> >>>
> >>> E.g. A pipeline stage that polls the NIC for N ingress packets
> could
> >> feed the busyness library with values such as:
> >>>    - "no work": zero packets received,
> >>>    - "25 % utilization": less than N packets received (in this
> >> example: 8 of max 32 packets = 25 %), or
> >>>    - "100% utilization, possibly more work to do": all N packets
> >> received (more packets could be ready in the queue, but we don't
> know).
> >>>
> >>
> >> If some lcore's NIC RX queue always, for every poll operation,
> produces
> >> 8 packets out of a max burst of 32, I would argue that lcore is 100%
> >> busy. With always something to do, it doesn't have a single cycle to
> >> spare.
> >
> > I would argue that if I have four cores each only processing 25 % of
> the packets, one core would suffice instead. Or, the application could
> schedule the function at 1/4 of the frequency it does now (e.g. call
> the function once every 40 microseconds instead of once every 10
> microseconds).
> >
> 
> Do you mean "only processing packets 25% of the time"? If yes, being
> able to replace four core @ 25% utilization with one core @ 100% might
> be a reasonable first guess. I'm not sure how it relates to what I
> wrote, though.

I meant: "only processing 25 % of the maximum number of packets it could have processed (if 100 % utilized)"

A service is allowed to do some fixed maximum amount of work every time its function is called; in my example, receive 32 packets. But if there were only 8 packets to receive in a call, the utilization in that call was only 25 % (although the lcore was 100 % active for the duration of the function call). So the function can be called less frequently.

If the utilization per packet is linear, then it makes no difference. However, it is non-linear, so it is more efficient to call the function 1/4 as often, receiving 32 packets each time, than to call it too frequently and only receive 8 packets each time.

> 
> > However, the business does not scale linearly with the number of
> packets processed - which an intended benefit of bursting.
> >
> 
> Sure, there's usually a non-linear relationship between the system
> capacity used and the resulting CPU utilization. It can be both in the
> manner you describe below, with the per-packet processing latency
> reduced at higher rates, or the other way around. For example, NIC RX
> LLC stashing may cause a lot of LLC evictions, and generally the
> application might have a larger working set size during high load, so
> there may be forces working in the other direction as well.
> 
> It seems to me "busyness" telemetry value should just be lcore thread
> CPU utilization (in total, or with some per-module breakdown).

We probably agree that if we cannot discriminate between useless and useful work, then the utilization will always be 100 % minus some cycles known by the library to be non-productive overhead.

I prefer a scale from 0 to 100 % useful, rather than a Boolean discriminating between useless and useful cycles spent in a function.

> If you
> want to know how much of the system's capacity is used, you need help
> from an application-specific agent, equipped with a model of how CPU
> utilization and capacity relates. Such a heuristic could take other
> factors into account as well, e.g. the average queue sizes, packet
> rates, packet sizes etc.

That is exactly what I'm exploring: whether this library can approximate it in some simple way.

> 
> In my experience, for high touch applications (i.e., those that spends
> thousands of cycles per packet), CPU utilization is a pretty decent
> approximation on how much of the system's capacity is used.

I agree.

We can probably also agree that the number of cycles spent roughly follows a formula like A + B * x, where A is the number of cycles spent to handle the call itself (the fixed per-call overhead), B is the number of cycles it takes to handle an additional packet (or other unit of work), and x is the number of packets in the burst.

In high touch services, A might be relatively insignificant; but in other applications, A is very significant.

Going back to my example, and using my profiler numbers from below...

If 679 cycles are spent in every RX function call to receive 8 packets, and the function is called 4 times per time unit, that is 679 * 4 = 2,716 cycles spent per time unit.

If the library can tell us that the function is only doing 25 % useful work, we can instead call the function 1 time per time unit. We will still get the same 32 packets per time unit, but only spend 1,275 cycles per time unit.
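
In per-packet terms that is 2,716 / 32 ≈ 85 cycles per packet when polling too frequently, versus 1,275 / 32 ≈ 40 cycles per packet with full bursts.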

This is one of the things I am hoping for the library to help the application achieve. (By feeding the library with information about the work done.)

> 
> > Here are some real life numbers from our in-house profiler library in
> a production environment, which says that polling the NIC for packets
> takes on average:
> >
> > 104 cycles when the NIC returns 0 packets,
> > 529 cycles when the NIC returns 1 packet,
> > 679 cycles when the NIC returns 8 packets, and
> > 1275 cycles when the NIC returns a full burst of 32 packets.
> >
> > (This includes some overhead from our application, so you will see
> other numbers in your application.)
> >
> >>
> >> It seems to me that you basically have two options, if you do
> >> application-level "busyness" reporting.
> >>
> >> Either the application
> >> a) reports when a section of useful work begins, and when it ends,
> as
> >> two separate function calls.
> >> b) after having taken a time stamp, and having completed a section
> of
> >> code which turned out to be something useful, it reports back to the
> >> busyness module with one function call, containing the busy cycles
> >> spent.
> >>
> >> In a), the two calls could be to the same function, with a boolean
> >> argument informing the busyness module if this is the beginning of a
> >> busy or an idle period. In such case, just pass "num_pkts_dequeued >
> 0"
> >> to the call.
> >
> > Our profiler library has a start()and an end() function, and an
> end_and_start() function for when a section directly follows the
> preceding section (to only take one timestamp instead of two).
> >
> 
> I like the idea of a end_and_start() (except for the name, maybe).
> 
> >>
> >> What you would like is a solution which avoid ping-pong between idle
> >> and
> >> busy states (with the resulting time stamping and computations) in
> >> scenarios where a lcore thread mix sources of work which often have
> >> items available, with sources that do not (e.g., packets in a RX
> queue
> >> versus reassembly timeouts in a core-local timer wheel). It would be
> >> better in that situation, to attribute the timer wheel poll cycles
> as
> >> busy cycles.
> >>
> >> Another crucial aspect is that you want the API to be simple, and
> code
> >> changes to be minimal.
> >>
> >> It's unclear to me if you need to account for both idle and busy
> >> cycles,
> >> or only busy cycles, and assume all other cycles are idle. The will
> be
> >> for a traditional 1:1 EAL thread <-> CPU core mapping, but not if
> the
> >> "--lcores" parameter is used to create floating EAL threads, and EAL
> >> threads which share the same core, and thus may not be able to use
> 100%
> >> of the TSC cycles.
> >>
> >>> A pipeline stage that services a QoS scheduler could additionally
> >> feed the library with values such as:
> >>>    - "100% utilization, definitely more work to do": stopped
> >> processing due to some "max work per call" limitation.
> >>>    - "waiting, no work until [DELAY] ns": current timeslot has been
> >> filled, waiting for the next timeslot to start.
> >>>
> >>> It is important to note that any pipeline stage processing packets
> >> (or some other objects!) might process a different maximum number of
> >> objects than the ingress pipeline stage. What I mean is: The number
> N
> >> might not be the same for all pipeline stages.
> >>>
> >>>
> >>> The information should be collected per lcore or thread, also to
> >> prevent cache trashing.
> >>>
> >>> Additionally, it could be collected per pipeline stage too, making
> >> the collection two-dimensional. This would essentially make it a
> >> profiling library, where you - in addition to seeing how much time
> is
> >> spent working - also can see which work the time is spent on.
> >>>
> >>
> >> If you introduce subcategories of "busy", like "busy-with-X", and
> >> "busy-with-Y", the book keeping will be more expensive, since you
> will
> >> transit between states even for 100% busy lcores (which in principle
> >> you
> >> never, or at least very rarely, need to do if you have only busy and
> >> idle as states).
> >>
> >> If your application is organized as DPDK services, you will get this
> >> already today, on a DPDK service level.
> >>
> >> If you have your application organized as a pipeline, and you use an
> >> event device as a scheduler between the stages, that event device
> has a
> >> good opportunity to do this kind of bookkeeping. DSW, for example,
> >> keeps
> >> track of the average processing latency for events, and how many
> events
> >> of various types have been processed.
> >>
> >
> > Lots of good input, Mattias. Let's see what others suggest. :-)
> >
> >>> As mentioned during the previous discussions, APIs should be
> provided
> >> to make the collected information machine readable, so the
> application
> >> can use it for power management and other purposes.
> >>>
> >>> One of the simple things I would like to be able to extract from
> such
> >> a library is CPU Utilization (percentage) per lcore. >
> >>> And since I want the CPU Utilization to be shown for multiple the
> >> time intervals (usually 1, 5 or 15 minutes; but perhaps also 1
> second
> >> or 1 millisecond) the output data should be exposed as a counter
> type,
> >> so my "loadavg application" can calculate the rate by subtracting
> the
> >> previously obtained value from the current value and divide the
> >> difference by the time interval.
> >>>
> >>
> >> I agree. In addition, you also want the "raw data" (lcore busy
> cycles)
> >> so you can do you own sampling, at your own favorite-length
> intervals.
> >>
> >>> -Morten
> >>>
> >