mbox series

[v12,00/11] Add PMD power management

Message ID cover.1608213657.git.anatoly.burakov@intel.com (mailing list archive)
Headers
Series Add PMD power management |

Message

Anatoly Burakov Dec. 17, 2020, 2:05 p.m. UTC
  This patchset proposes a simple API for Ethernet drivers to cause the  
CPU to enter a power-optimized state while waiting for packets to  
arrive. This is achieved through cooperation with the NIC driver that 
will allow us to know address of wake up event, and wait for writes on 
it.

On IA, this is achieved through using UMONITOR/UMWAIT instructions. They 
are used in their raw opcode form because there is no widespread 
compiler support for them yet. Still, the API is made generic enough to
hopefully support other architectures, if they happen to implement 
similar instructions.

To achieve power savings, there is a very simple mechanism used: we're 
counting empty polls, and if a certain threshold is reached, we get the 
address of next RX ring descriptor from the NIC driver, arm the 
monitoring hardware, and enter a power-optimized state. We will then 
wake up when either a timeout happens, or a write happens (or generally 
whenever CPU feels like waking up - this is platform-specific), and  
proceed as normal. The empty poll counter is reset whenever we actually 
get packets, so we only go to sleep when we know nothing is going on. 
The mechanism is generic which can be used for any write back 
descriptor.

This patchset also introduces a few changes into existing power 
management-related intrinsics, namely to provide a native way of waking 
up a sleeping core without application being responsible for it, as well 
as general robustness improvements. There's quite a bit of locking going 
on, but these locks are per-thread and very little (if any) contention 
is expected, so the performance impact shouldn't be that bad (and in any 
case the locking happens when we're about to sleep anyway, not on a 
hotpath).

Why are we putting it into ethdev as opposed to leaving this up to the 
application? Our customers specifically requested a way to do it wit 
minimal changes to the application code. The current approach allows to 
just flip a switch and automatically have power savings.

- Only 1:1 core to queue mapping is supported, meaning that each lcore 
  must at most handle RX on a single queue
- Support 3 type policies. Monitor/Pause/Frequency Scaling
- Power management is enabled per-queue
- The API doesn't extend to other device types

Anatoly Burakov (5):
  eal: uninline power intrinsics
  eal: avoid invalid API usage in power intrinsics
  eal: change API of power intrinsics
  eal: remove sync version of power monitor
  eal: add monitor wakeup function

Liang Ma (6):
  ethdev: add simple power management API
  power: add PMD power management API and callback
  net/ixgbe: implement power management API
  net/i40e: implement power management API
  net/ice: implement power management API
  examples/l3fwd-power: enable PMD power mgmt

 doc/guides/prog_guide/power_man.rst           |  44 +++
 doc/guides/rel_notes/release_21_02.rst        |  14 +
 .../sample_app_ug/l3_forward_power_man.rst    |  35 ++
 drivers/event/dlb/dlb.c                       |  10 +-
 drivers/event/dlb2/dlb2.c                     |  10 +-
 drivers/net/i40e/i40e_ethdev.c                |   1 +
 drivers/net/i40e/i40e_rxtx.c                  |  25 ++
 drivers/net/i40e/i40e_rxtx.h                  |   1 +
 drivers/net/ice/ice_ethdev.c                  |   1 +
 drivers/net/ice/ice_rxtx.c                    |  26 ++
 drivers/net/ice/ice_rxtx.h                    |   1 +
 drivers/net/ixgbe/ixgbe_ethdev.c              |   1 +
 drivers/net/ixgbe/ixgbe_rxtx.c                |  25 ++
 drivers/net/ixgbe/ixgbe_rxtx.h                |   1 +
 examples/l3fwd-power/main.c                   |  89 ++++-
 .../arm/include/rte_power_intrinsics.h        |  39 +-
 .../include/generic/rte_power_intrinsics.h    |  78 ++--
 .../ppc/include/rte_power_intrinsics.h        |  39 +-
 lib/librte_eal/version.map                    |   5 +
 .../x86/include/rte_power_intrinsics.h        | 115 ------
 lib/librte_eal/x86/meson.build                |   1 +
 lib/librte_eal/x86/rte_power_intrinsics.c     | 169 +++++++++
 lib/librte_ethdev/rte_ethdev.c                |  28 ++
 lib/librte_ethdev/rte_ethdev.h                |  25 ++
 lib/librte_ethdev/rte_ethdev_driver.h         |  22 ++
 lib/librte_ethdev/version.map                 |   3 +
 lib/librte_power/meson.build                  |   5 +-
 lib/librte_power/rte_power_pmd_mgmt.c         | 349 ++++++++++++++++++
 lib/librte_power/rte_power_pmd_mgmt.h         |  90 +++++
 lib/librte_power/version.map                  |   5 +
 30 files changed, 1028 insertions(+), 229 deletions(-)
 create mode 100644 lib/librte_eal/x86/rte_power_intrinsics.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.c
 create mode 100644 lib/librte_power/rte_power_pmd_mgmt.h
  

Comments

David Marchand Dec. 17, 2020, 4:12 p.m. UTC | #1
On Thu, Dec 17, 2020 at 3:06 PM Anatoly Burakov
<anatoly.burakov@intel.com> wrote:
>
> This patchset proposes a simple API for Ethernet drivers to cause the
> CPU to enter a power-optimized state while waiting for packets to
> arrive. This is achieved through cooperation with the NIC driver that
> will allow us to know address of wake up event, and wait for writes on
> it.
>
> On IA, this is achieved through using UMONITOR/UMWAIT instructions. They
> are used in their raw opcode form because there is no widespread
> compiler support for them yet. Still, the API is made generic enough to
> hopefully support other architectures, if they happen to implement
> similar instructions.
>
> To achieve power savings, there is a very simple mechanism used: we're
> counting empty polls, and if a certain threshold is reached, we get the
> address of next RX ring descriptor from the NIC driver, arm the
> monitoring hardware, and enter a power-optimized state. We will then
> wake up when either a timeout happens, or a write happens (or generally
> whenever CPU feels like waking up - this is platform-specific), and
> proceed as normal. The empty poll counter is reset whenever we actually
> get packets, so we only go to sleep when we know nothing is going on.
> The mechanism is generic which can be used for any write back
> descriptor.
>
> This patchset also introduces a few changes into existing power
> management-related intrinsics, namely to provide a native way of waking
> up a sleeping core without application being responsible for it, as well
> as general robustness improvements. There's quite a bit of locking going
> on, but these locks are per-thread and very little (if any) contention
> is expected, so the performance impact shouldn't be that bad (and in any
> case the locking happens when we're about to sleep anyway, not on a
> hotpath).
>
> Why are we putting it into ethdev as opposed to leaving this up to the
> application? Our customers specifically requested a way to do it wit
> minimal changes to the application code. The current approach allows to
> just flip a switch and automatically have power savings.
>
> - Only 1:1 core to queue mapping is supported, meaning that each lcore
>   must at most handle RX on a single queue
> - Support 3 type policies. Monitor/Pause/Frequency Scaling
> - Power management is enabled per-queue
> - The API doesn't extend to other device types

Fyi, ovsrobot Travis being KO, you probably missed that GHA CI caught this:
https://github.com/ovsrobot/dpdk/runs/1571056574?check_suite_focus=true#step:13:16082

We will have to put an exception on driver only ABI.
  
Anatoly Burakov Jan. 8, 2021, 4:42 p.m. UTC | #2
On 17-Dec-20 4:12 PM, David Marchand wrote:
> On Thu, Dec 17, 2020 at 3:06 PM Anatoly Burakov
> <anatoly.burakov@intel.com> wrote:
>>
>> This patchset proposes a simple API for Ethernet drivers to cause the
>> CPU to enter a power-optimized state while waiting for packets to
>> arrive. This is achieved through cooperation with the NIC driver that
>> will allow us to know address of wake up event, and wait for writes on
>> it.
>>
>> On IA, this is achieved through using UMONITOR/UMWAIT instructions. They
>> are used in their raw opcode form because there is no widespread
>> compiler support for them yet. Still, the API is made generic enough to
>> hopefully support other architectures, if they happen to implement
>> similar instructions.
>>
>> To achieve power savings, there is a very simple mechanism used: we're
>> counting empty polls, and if a certain threshold is reached, we get the
>> address of next RX ring descriptor from the NIC driver, arm the
>> monitoring hardware, and enter a power-optimized state. We will then
>> wake up when either a timeout happens, or a write happens (or generally
>> whenever CPU feels like waking up - this is platform-specific), and
>> proceed as normal. The empty poll counter is reset whenever we actually
>> get packets, so we only go to sleep when we know nothing is going on.
>> The mechanism is generic which can be used for any write back
>> descriptor.
>>
>> This patchset also introduces a few changes into existing power
>> management-related intrinsics, namely to provide a native way of waking
>> up a sleeping core without application being responsible for it, as well
>> as general robustness improvements. There's quite a bit of locking going
>> on, but these locks are per-thread and very little (if any) contention
>> is expected, so the performance impact shouldn't be that bad (and in any
>> case the locking happens when we're about to sleep anyway, not on a
>> hotpath).
>>
>> Why are we putting it into ethdev as opposed to leaving this up to the
>> application? Our customers specifically requested a way to do it wit
>> minimal changes to the application code. The current approach allows to
>> just flip a switch and automatically have power savings.
>>
>> - Only 1:1 core to queue mapping is supported, meaning that each lcore
>>    must at most handle RX on a single queue
>> - Support 3 type policies. Monitor/Pause/Frequency Scaling
>> - Power management is enabled per-queue
>> - The API doesn't extend to other device types
> 
> Fyi, ovsrobot Travis being KO, you probably missed that GHA CI caught this:
> https://github.com/ovsrobot/dpdk/runs/1571056574?check_suite_focus=true#step:13:16082
> 
> We will have to put an exception on driver only ABI.
> 
> 

Why does aarch64 build fail there? The functions in question are in the 
version map file, but the build complains that they aren't.
  
David Marchand Jan. 11, 2021, 8:44 a.m. UTC | #3
On Fri, Jan 8, 2021 at 5:42 PM Burakov, Anatoly
<anatoly.burakov@intel.com> wrote:
>
> On 17-Dec-20 4:12 PM, David Marchand wrote:
> > On Thu, Dec 17, 2020 at 3:06 PM Anatoly Burakov
> > <anatoly.burakov@intel.com> wrote:
> >>
> >> This patchset proposes a simple API for Ethernet drivers to cause the
> >> CPU to enter a power-optimized state while waiting for packets to
> >> arrive. This is achieved through cooperation with the NIC driver that
> >> will allow us to know address of wake up event, and wait for writes on
> >> it.
> >>
> >> On IA, this is achieved through using UMONITOR/UMWAIT instructions. They
> >> are used in their raw opcode form because there is no widespread
> >> compiler support for them yet. Still, the API is made generic enough to
> >> hopefully support other architectures, if they happen to implement
> >> similar instructions.
> >>
> >> To achieve power savings, there is a very simple mechanism used: we're
> >> counting empty polls, and if a certain threshold is reached, we get the
> >> address of next RX ring descriptor from the NIC driver, arm the
> >> monitoring hardware, and enter a power-optimized state. We will then
> >> wake up when either a timeout happens, or a write happens (or generally
> >> whenever CPU feels like waking up - this is platform-specific), and
> >> proceed as normal. The empty poll counter is reset whenever we actually
> >> get packets, so we only go to sleep when we know nothing is going on.
> >> The mechanism is generic which can be used for any write back
> >> descriptor.
> >>
> >> This patchset also introduces a few changes into existing power
> >> management-related intrinsics, namely to provide a native way of waking
> >> up a sleeping core without application being responsible for it, as well
> >> as general robustness improvements. There's quite a bit of locking going
> >> on, but these locks are per-thread and very little (if any) contention
> >> is expected, so the performance impact shouldn't be that bad (and in any
> >> case the locking happens when we're about to sleep anyway, not on a
> >> hotpath).
> >>
> >> Why are we putting it into ethdev as opposed to leaving this up to the
> >> application? Our customers specifically requested a way to do it wit
> >> minimal changes to the application code. The current approach allows to
> >> just flip a switch and automatically have power savings.
> >>
> >> - Only 1:1 core to queue mapping is supported, meaning that each lcore
> >>    must at most handle RX on a single queue
> >> - Support 3 type policies. Monitor/Pause/Frequency Scaling
> >> - Power management is enabled per-queue
> >> - The API doesn't extend to other device types
> >
> > Fyi, ovsrobot Travis being KO, you probably missed that GHA CI caught this:
> > https://github.com/ovsrobot/dpdk/runs/1571056574?check_suite_focus=true#step:13:16082
> >
> > We will have to put an exception on driver only ABI.
> >
> >
>
> Why does aarch64 build fail there? The functions in question are in the
> version map file, but the build complains that they aren't.

From what I can see, this series puts rte_power_* symbols in a .h.
So it will be seen as symbols exported by any library including such a header.

The check then complains about this as it sees exported symbols
unknown of the library version.map.
  
David Marchand Jan. 11, 2021, 8:52 a.m. UTC | #4
On Mon, Jan 11, 2021 at 9:44 AM David Marchand
<david.marchand@redhat.com> wrote:
>
> On Fri, Jan 8, 2021 at 5:42 PM Burakov, Anatoly
> <anatoly.burakov@intel.com> wrote:
> > Why does aarch64 build fail there? The functions in question are in the
> > version map file, but the build complains that they aren't.
>
> From what I can see, this series puts rte_power_* symbols in a .h.
> So it will be seen as symbols exported by any library including such a header.
>
> The check then complains about this as it sees exported symbols
> unknown of the library version.map.

Quick fix:

diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h
b/lib/librte_eal/arm/include/rte_power_intrinsics.h
index 39e49cc45b..9e498e9ebf 100644
--- a/lib/librte_eal/arm/include/rte_power_intrinsics.h
+++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
@@ -13,35 +13,6 @@ extern "C" {

 #include "generic/rte_power_intrinsics.h"

-/**
- * This function is not supported on ARM.
- */
-void
-rte_power_monitor(const struct rte_power_monitor_cond *pmc,
-               const uint64_t tsc_timestamp)
-{
-       RTE_SET_USED(pmc);
-       RTE_SET_USED(tsc_timestamp);
-}
-
-/**
- * This function is not supported on ARM.
- */
-void
-rte_power_pause(const uint64_t tsc_timestamp)
-{
-       RTE_SET_USED(tsc_timestamp);
-}
-
-/**
- * This function is not supported on ARM.
- */
-void
-rte_power_monitor_wakeup(const unsigned int lcore_id)
-{
-       RTE_SET_USED(lcore_id);
-}
-
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/librte_eal/arm/meson.build b/lib/librte_eal/arm/meson.build
index d62875ebae..6ec53ea03a 100644
--- a/lib/librte_eal/arm/meson.build
+++ b/lib/librte_eal/arm/meson.build
@@ -7,4 +7,5 @@ sources += files(
        'rte_cpuflags.c',
        'rte_cycles.c',
        'rte_hypervisor.c',
+       'rte_power_intrinsics.c',
 )
diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c
b/lib/librte_eal/arm/rte_power_intrinsics.c
new file mode 100644
index 0000000000..998f9898ad
--- /dev/null
+++ b/lib/librte_eal/arm/rte_power_intrinsics.c
@@ -0,0 +1,31 @@
+#include <rte_common.h>
+#include <rte_power_intrinsics.h>
+
+/**
+ * This function is not supported on ARM.
+ */
+void
+rte_power_monitor(const struct rte_power_monitor_cond *pmc,
+               const uint64_t tsc_timestamp)
+{
+       RTE_SET_USED(pmc);
+       RTE_SET_USED(tsc_timestamp);
+}
+
+/**
+ * This function is not supported on ARM.
+ */
+void
+rte_power_pause(const uint64_t tsc_timestamp)
+{
+       RTE_SET_USED(tsc_timestamp);
+}
+
+/**
+ * This function is not supported on ARM.
+ */
+void
+rte_power_monitor_wakeup(const unsigned int lcore_id)
+{
+       RTE_SET_USED(lcore_id);
+}


HTH.
  
Anatoly Burakov Jan. 11, 2021, 10:21 a.m. UTC | #5
On 11-Jan-21 8:52 AM, David Marchand wrote:
> On Mon, Jan 11, 2021 at 9:44 AM David Marchand
> <david.marchand@redhat.com> wrote:
>>
>> On Fri, Jan 8, 2021 at 5:42 PM Burakov, Anatoly
>> <anatoly.burakov@intel.com> wrote:
>>> Why does aarch64 build fail there? The functions in question are in the
>>> version map file, but the build complains that they aren't.
>>
>>  From what I can see, this series puts rte_power_* symbols in a .h.
>> So it will be seen as symbols exported by any library including such a header.
>>
>> The check then complains about this as it sees exported symbols
>> unknown of the library version.map.
> 
> Quick fix:
> 
> diff --git a/lib/librte_eal/arm/include/rte_power_intrinsics.h
> b/lib/librte_eal/arm/include/rte_power_intrinsics.h
> index 39e49cc45b..9e498e9ebf 100644
> --- a/lib/librte_eal/arm/include/rte_power_intrinsics.h
> +++ b/lib/librte_eal/arm/include/rte_power_intrinsics.h
> @@ -13,35 +13,6 @@ extern "C" {
> 
>   #include "generic/rte_power_intrinsics.h"
> 
> -/**
> - * This function is not supported on ARM.
> - */
> -void
> -rte_power_monitor(const struct rte_power_monitor_cond *pmc,
> -               const uint64_t tsc_timestamp)
> -{
> -       RTE_SET_USED(pmc);
> -       RTE_SET_USED(tsc_timestamp);
> -}
> -
> -/**
> - * This function is not supported on ARM.
> - */
> -void
> -rte_power_pause(const uint64_t tsc_timestamp)
> -{
> -       RTE_SET_USED(tsc_timestamp);
> -}
> -
> -/**
> - * This function is not supported on ARM.
> - */
> -void
> -rte_power_monitor_wakeup(const unsigned int lcore_id)
> -{
> -       RTE_SET_USED(lcore_id);
> -}
> -
>   #ifdef __cplusplus
>   }
>   #endif
> diff --git a/lib/librte_eal/arm/meson.build b/lib/librte_eal/arm/meson.build
> index d62875ebae..6ec53ea03a 100644
> --- a/lib/librte_eal/arm/meson.build
> +++ b/lib/librte_eal/arm/meson.build
> @@ -7,4 +7,5 @@ sources += files(
>          'rte_cpuflags.c',
>          'rte_cycles.c',
>          'rte_hypervisor.c',
> +       'rte_power_intrinsics.c',
>   )
> diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c
> b/lib/librte_eal/arm/rte_power_intrinsics.c
> new file mode 100644
> index 0000000000..998f9898ad
> --- /dev/null
> +++ b/lib/librte_eal/arm/rte_power_intrinsics.c
> @@ -0,0 +1,31 @@
> +#include <rte_common.h>
> +#include <rte_power_intrinsics.h>
> +
> +/**
> + * This function is not supported on ARM.
> + */
> +void
> +rte_power_monitor(const struct rte_power_monitor_cond *pmc,
> +               const uint64_t tsc_timestamp)
> +{
> +       RTE_SET_USED(pmc);
> +       RTE_SET_USED(tsc_timestamp);
> +}
> +
> +/**
> + * This function is not supported on ARM.
> + */
> +void
> +rte_power_pause(const uint64_t tsc_timestamp)
> +{
> +       RTE_SET_USED(tsc_timestamp);
> +}
> +
> +/**
> + * This function is not supported on ARM.
> + */
> +void
> +rte_power_monitor_wakeup(const unsigned int lcore_id)
> +{
> +       RTE_SET_USED(lcore_id);
> +}
> 
> 
> HTH.
> 
> 

OK, will add into v14 so. Thanks!