[RFC,4/5] power: add eventdev support for power management

Message ID 20230419095427.563185-4-sivaprasad.tummala@amd.com (mailing list archive)
State Changes Requested, archived
Delegated to: Jerin Jacob
Series [RFC,1/5] eventdev: add power monitoring API on event port

Checks

Context Check Description
ci/checkpatch success coding style OK

Commit Message

Sivaprasad Tummala April 19, 2023, 9:54 a.m. UTC
  Add eventdev support to enable power saving when no events
are arriving. It is based on counting the number of empty
polls and, when the number reaches a certain threshold, entering
an architecture-defined optimized power state that will wait
either until a TSC timestamp expires or until events arrive.

This API mandates a core-to-single-port mapping (i.e. one core polling
multiple ports of an event device is not supported). This should be
acceptable, as the general use case will have one CPU core using one
port to enqueue/dequeue events from an eventdev.

This design uses eventdev PMD dequeue callbacks.

1. MWAITX/MONITORX:

   When a certain threshold of empty polls is reached, the core will go
   into a power-optimized sleep while waiting on the address of the next
   RX descriptor to be written to.

2. Pause instruction

   This method uses the pause instruction to avoid busy polling.
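For illustration, the empty-poll accounting described above amounts to the
following standalone sketch (the names `EMPTYPOLL_MAX`, `struct poll_state`
and `should_sleep` are made up for this sketch and are not the patch's
actual symbols):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative threshold of consecutive empty dequeues before sleeping. */
#define EMPTYPOLL_MAX 512

struct poll_state {
	uint64_t n_empty_polls; /* consecutive dequeues that returned 0 events */
};

/*
 * Called after every dequeue; returns true once enough consecutive empty
 * polls have accumulated that the core may enter the architecture-defined
 * power-optimized state (MONITORX/MWAITX wait, or a pause loop).
 */
static bool
should_sleep(struct poll_state *st, uint16_t nb_events)
{
	if (nb_events != 0) {
		st->n_empty_polls = 0; /* traffic resumed: reset the counter */
		return false;
	}
	return ++st->n_empty_polls > EMPTYPOLL_MAX;
}
```

Any non-empty dequeue resets the counter, so the core only sleeps after a
sustained idle period rather than on a single empty poll.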

Signed-off-by: Sivaprasad Tummala <sivaprasad.tummala@amd.com>
---
 lib/power/meson.build          |   2 +-
 lib/power/rte_power_pmd_mgmt.c | 226 +++++++++++++++++++++++++++++++++
 lib/power/rte_power_pmd_mgmt.h |  55 ++++++++
 lib/power/version.map          |   4 +
 4 files changed, 286 insertions(+), 1 deletion(-)
  

Comments

Anatoly Burakov May 17, 2023, 2:43 p.m. UTC | #1
On 4/19/2023 10:54 AM, Sivaprasad Tummala wrote:
> Add eventdev support to enable power saving when no events
> are arriving. It is based on counting the number of empty
> polls and, when the number reaches a certain threshold, entering
> an architecture-defined optimized power state that will either wait
> until a TSC timestamp expires, or when events arrive.
> 
> This API mandates a core-to-single-port mapping (i.e. one core polling
> multiple ports of event device is not supported). This should be ok
> as the general use case will have one CPU core using one port to
> enqueue/dequeue events from an eventdev.
> 
> This design is using Eventdev PMD Dequeue callbacks.
> 
> 1. MWAITX/MONITORX:
> 
>     When a certain threshold of empty polls is reached, the core will go
>     into a power optimized sleep while waiting on an address of next RX
>     descriptor to be written to.
> 
> 2. Pause instruction
> 
>     This method uses the pause instruction to avoid busy polling.
> 
> Signed-off-by: Sivaprasad Tummala <sivaprasad.tummala@amd.com>
> ---

Hi, few comments

> +
> +static uint16_t
> +evt_clb_pause(uint8_t dev_id __rte_unused, uint8_t port_id __rte_unused,
> +		struct rte_event *ev __rte_unused,
> +		uint16_t nb_events, void *arg)
> +{
> +	const unsigned int lcore = rte_lcore_id();
> +	struct queue_list_entry *queue_conf = arg;
> +	struct pmd_core_cfg *lcore_conf;
> +	const bool empty = nb_events == 0;
> +	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
> +
> +	lcore_conf = &lcore_cfgs[lcore];
> +
> +	if (likely(!empty))
> +		/* early exit */
> +		queue_reset(lcore_conf, queue_conf);
> +	else {
> +		/* can this queue sleep? */
> +		if (!queue_can_sleep(lcore_conf, queue_conf))
> +			return nb_events;
> +
> +		/* can this lcore sleep? */
> +		if (!lcore_can_sleep(lcore_conf))
> +			return nb_events;
> +
> +		uint64_t i;
> +		for (i = 0; i < global_data.pause_per_us * pause_duration; i++)
> +			rte_pause();

Why not add support for TPAUSE? This is generic code, ethdev code path 
supports it, and most of this function is duplicated from ethdev 
implementation. I wish we could unify them somehow, but I can't think of 
an elegant way to do it off the top of my head.

> +
> +	/* we need this in various places */
> +	rte_cpu_get_intrinsics_support(&global_data.intrinsics_support);
> +
> +	switch (mode) {
> +	case RTE_POWER_MGMT_TYPE_MONITOR:
> +		/* check if we can add a new port */
> +		ret = check_evt_monitor(lcore_cfg, &qdata);
> +		if (ret < 0)
> +			goto end;
> +
> +		clb = evt_clb_umwait;
> +		break;
> +	case RTE_POWER_MGMT_TYPE_PAUSE:
> +		/* figure out various time-to-tsc conversions */
> +		if (global_data.tsc_per_us == 0)
> +			calc_tsc();
> +
> +		clb = evt_clb_pause;
> +		break;
> +	default:
> +		RTE_LOG(DEBUG, POWER, "Invalid power management type\n");

Technically, if we specify "scale" here, the power management scheme 
would be *unsupported* rather than *invalid*, and thus should return 
-ENOTSUP rather than -EINVAL.

Also, since this is generic code, theoretically this code could in fact 
support SCALE mode? Would it make sense for eventdev to use that scheme?
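A minimal sketch of the suggested errno distinction (hypothetical helper and
enum names that mirror, but are not identical to, the DPDK ones):

```c
#include <errno.h>

/* Mirrors rte_power_pmd_mgmt_type for illustration only. */
enum pmgmt_type { TYPE_MONITOR, TYPE_PAUSE, TYPE_SCALE, TYPE_MAX };

/*
 * A valid-but-unimplemented scheme (SCALE, if eventdev chose not to support
 * it) should yield -ENOTSUP; only an out-of-range value is -EINVAL.
 */
static int
check_mgmt_type(enum pmgmt_type mode)
{
	switch (mode) {
	case TYPE_MONITOR:
	case TYPE_PAUSE:
		return 0;        /* implemented for eventdev in this patch */
	case TYPE_SCALE:
		return -ENOTSUP; /* valid scheme, just not supported here */
	default:
		return -EINVAL;  /* not a power management scheme at all */
	}
}
```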

> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice.
> + *
> + * Disable power management on a specified Ethernet device Rx queue and lcore.
> + *
> + * @note This function is not thread-safe.
> + *
> + * @warning This function must be called when all affected Ethernet queues are
> + *   stopped and no Rx/Tx is in progress!
> + *
> + * @param lcore_id
> + *   The lcore the Rx queue is polled from.
> + * @param dev_id
> + *   The identifier of the device.
> + * @param port_id
> + *   Event port identifier of the Event device.
> + * @return
> + *   0 on success
> + *   <0 on error
> + */
> +__rte_experimental
> +int
> +rte_power_eventdev_pmgmt_port_disable(unsigned int lcore_id,
> +		uint8_t dev_id, uint8_t port_id);

It would've been nice if we didn't have to reimplement the same logic 
for every new device type, but seeing how we do not have any unified 
driver API, I don't have any bright ideas on how to do it better.
  
Sivaprasad Tummala May 24, 2023, 12:34 p.m. UTC | #2

Hi,

> -----Original Message-----
> From: Burakov, Anatoly <anatoly.burakov@intel.com>
> Sent: Wednesday, May 17, 2023 8:14 PM
> To: Tummala, Sivaprasad <Sivaprasad.Tummala@amd.com>;
> david.hunt@intel.com; jerinj@marvell.com; harry.van.haaren@intel.com
> Cc: dev@dpdk.org
> Subject: Re: [RFC PATCH 4/5] power: add eventdev support for power
> management
> 
> On 4/19/2023 10:54 AM, Sivaprasad Tummala wrote:
> > Add eventdev support to enable power saving when no events are
> > arriving. It is based on counting the number of empty polls and, when
> > the number reaches a certain threshold, entering an
> > architecture-defined optimized power state that will either wait until
> > a TSC timestamp expires, or when events arrive.
> >
> > This API mandates a core-to-single-port mapping (i.e. one core polling
> > multiple ports of event device is not supported). This should be ok as
> > the general use case will have one CPU core using one port to
> > enqueue/dequeue events from an eventdev.
> >
> > This design is using Eventdev PMD Dequeue callbacks.
> >
> > 1. MWAITX/MONITORX:
> >
> >     When a certain threshold of empty polls is reached, the core will go
> >     into a power optimized sleep while waiting on an address of next RX
> >     descriptor to be written to.
> >
> > 2. Pause instruction
> >
> >     This method uses the pause instruction to avoid busy polling.
> >
> > Signed-off-by: Sivaprasad Tummala <sivaprasad.tummala@amd.com>
> > ---
> 
> Hi, few comments
> 
> > +
> > +static uint16_t
> > +evt_clb_pause(uint8_t dev_id __rte_unused, uint8_t port_id __rte_unused,
> > +             struct rte_event *ev __rte_unused,
> > +             uint16_t nb_events, void *arg) {
> > +     const unsigned int lcore = rte_lcore_id();
> > +     struct queue_list_entry *queue_conf = arg;
> > +     struct pmd_core_cfg *lcore_conf;
> > +     const bool empty = nb_events == 0;
> > +     uint32_t pause_duration =
> > +rte_power_pmd_mgmt_get_pause_duration();
> > +
> > +     lcore_conf = &lcore_cfgs[lcore];
> > +
> > +     if (likely(!empty))
> > +             /* early exit */
> > +             queue_reset(lcore_conf, queue_conf);
> > +     else {
> > +             /* can this queue sleep? */
> > +             if (!queue_can_sleep(lcore_conf, queue_conf))
> > +                     return nb_events;
> > +
> > +             /* can this lcore sleep? */
> > +             if (!lcore_can_sleep(lcore_conf))
> > +                     return nb_events;
> > +
> > +             uint64_t i;
> > +             for (i = 0; i < global_data.pause_per_us * pause_duration; i++)
> > +                     rte_pause();
> 
> Why not add support for TPAUSE? This is generic code, ethdev code path supports
> it, and most of this function is duplicated from ethdev implementation. I wish we
> could unify them somehow, but I can't think of an elegant way to do it off the top
> of my head.
> 
Sure, will fix this in v1.
> > +
> > +     /* we need this in various places */
> > +     rte_cpu_get_intrinsics_support(&global_data.intrinsics_support);
> > +
> > +     switch (mode) {
> > +     case RTE_POWER_MGMT_TYPE_MONITOR:
> > +             /* check if we can add a new port */
> > +             ret = check_evt_monitor(lcore_cfg, &qdata);
> > +             if (ret < 0)
> > +                     goto end;
> > +
> > +             clb = evt_clb_umwait;
> > +             break;
> > +     case RTE_POWER_MGMT_TYPE_PAUSE:
> > +             /* figure out various time-to-tsc conversions */
> > +             if (global_data.tsc_per_us == 0)
> > +                     calc_tsc();
> > +
> > +             clb = evt_clb_pause;
> > +             break;
> > +     default:
> > +             RTE_LOG(DEBUG, POWER, "Invalid power management
> > + type\n");
> 
> Technically, if we specify "scale" here, the power management scheme would be
> *unsupported* rather than *invalid*, and thus should return -ENOTSUP rather than
> -EINVAL.
> 
> Also, since this is generic code, theoretically this code could in fact support SCALE
> mode? Would it make sense for eventdev to use that scheme?
> 
Agreed. Will fix this in v1 patch. 

> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change, or be removed, without prior
> notice.
> > + *
> > + * Disable power management on a specified Ethernet device Rx queue and
> lcore.
> > + *
> > + * @note This function is not thread-safe.
> > + *
> > + * @warning This function must be called when all affected Ethernet queues are
> > + *   stopped and no Rx/Tx is in progress!
> > + *
> > + * @param lcore_id
> > + *   The lcore the Rx queue is polled from.
> > + * @param dev_id
> > + *   The identifier of the device.
> > + * @param port_id
> > + *   Event port identifier of the Event device.
> > + * @return
> > + *   0 on success
> > + *   <0 on error
> > + */
> > +__rte_experimental
> > +int
> > +rte_power_eventdev_pmgmt_port_disable(unsigned int lcore_id,
> > +             uint8_t dev_id, uint8_t port_id);
> 
> It would've been nice if we didn't have to reimplement the same logic for every
> new device type, but seeing how we do not have any unified driver API, I don't
> have any bright ideas on how to do it better.
> 
> --
> Thanks,
> Anatoly
  

Patch

diff --git a/lib/power/meson.build b/lib/power/meson.build
index 1ce8b7c07d..528340e291 100644
--- a/lib/power/meson.build
+++ b/lib/power/meson.build
@@ -31,4 +31,4 @@  headers = files(
 if cc.has_argument('-Wno-cast-qual')
     cflags += '-Wno-cast-qual'
 endif
-deps += ['timer', 'ethdev']
+deps += ['timer', 'ethdev', 'eventdev']
diff --git a/lib/power/rte_power_pmd_mgmt.c b/lib/power/rte_power_pmd_mgmt.c
index ca1840387c..3c194872c3 100644
--- a/lib/power/rte_power_pmd_mgmt.c
+++ b/lib/power/rte_power_pmd_mgmt.c
@@ -9,8 +9,10 @@ 
 #include <rte_cpuflags.h>
 #include <rte_malloc.h>
 #include <rte_ethdev.h>
+#include <rte_eventdev.h>
 #include <rte_power_intrinsics.h>
 
+#include <eventdev_pmd.h>
 #include "rte_power_pmd_mgmt.h"
 #include "power_common.h"
 
@@ -53,6 +55,7 @@  struct queue_list_entry {
 	uint64_t n_empty_polls;
 	uint64_t n_sleeps;
 	const struct rte_eth_rxtx_callback *cb;
+	const struct rte_event_dequeue_callback *evt_cb;
 };
 
 struct pmd_core_cfg {
@@ -414,6 +417,64 @@  cfg_queues_stopped(struct pmd_core_cfg *queue_cfg)
 	return 1;
 }
 
+static uint16_t
+evt_clb_umwait(uint8_t dev_id, uint8_t port_id, struct rte_event *ev __rte_unused,
+		uint16_t nb_events, void *arg)
+{
+	struct queue_list_entry *queue_conf = arg;
+
+	/* this callback can't do more than one queue, omit multiqueue logic */
+	if (unlikely(nb_events == 0)) {
+		queue_conf->n_empty_polls++;
+		if (unlikely(queue_conf->n_empty_polls > emptypoll_max)) {
+			struct rte_power_monitor_cond pmc;
+			int ret;
+
+			/* use monitoring condition to sleep */
+			ret = rte_event_port_get_monitor_addr(dev_id, port_id,
+					&pmc);
+			if (ret == 0)
+				rte_power_monitor(&pmc, UINT64_MAX);
+		}
+	} else
+		queue_conf->n_empty_polls = 0;
+
+	return nb_events;
+}
+
+static uint16_t
+evt_clb_pause(uint8_t dev_id __rte_unused, uint8_t port_id __rte_unused,
+		struct rte_event *ev __rte_unused,
+		uint16_t nb_events, void *arg)
+{
+	const unsigned int lcore = rte_lcore_id();
+	struct queue_list_entry *queue_conf = arg;
+	struct pmd_core_cfg *lcore_conf;
+	const bool empty = nb_events == 0;
+	uint32_t pause_duration = rte_power_pmd_mgmt_get_pause_duration();
+
+	lcore_conf = &lcore_cfgs[lcore];
+
+	if (likely(!empty))
+		/* early exit */
+		queue_reset(lcore_conf, queue_conf);
+	else {
+		/* can this queue sleep? */
+		if (!queue_can_sleep(lcore_conf, queue_conf))
+			return nb_events;
+
+		/* can this lcore sleep? */
+		if (!lcore_can_sleep(lcore_conf))
+			return nb_events;
+
+		uint64_t i;
+		for (i = 0; i < global_data.pause_per_us * pause_duration; i++)
+			rte_pause();
+	}
+
+	return nb_events;
+}
+
 static int
 check_scale(unsigned int lcore)
 {
@@ -479,6 +540,171 @@  get_monitor_callback(void)
 		clb_multiwait : clb_umwait;
 }
 
+static int
+check_evt_monitor(struct pmd_core_cfg *cfg __rte_unused,
+		const union queue *qdata)
+{
+	struct rte_power_monitor_cond dummy;
+
+	/* check if rte_power_monitor is supported */
+	if (!global_data.intrinsics_support.power_monitor) {
+		RTE_LOG(DEBUG, POWER, "Monitoring intrinsics are not supported\n");
+		return -ENOTSUP;
+	}
+
+	/* check if the device supports the necessary PMD API */
+	if (rte_event_port_get_monitor_addr((uint8_t)qdata->portid, (uint8_t)qdata->qid,
+				&dummy) == -ENOTSUP) {
+		RTE_LOG(DEBUG, POWER, "event port does not support rte_event_get_monitor_addr\n");
+		return -ENOTSUP;
+	}
+
+	/* we're done */
+	return 0;
+}
+
+int
+rte_power_eventdev_pmgmt_port_enable(unsigned int lcore_id, uint8_t dev_id,
+		uint8_t port_id, enum rte_power_pmd_mgmt_type mode)
+{
+	const union queue qdata = {.portid = dev_id, .qid = port_id};
+	struct pmd_core_cfg *lcore_cfg;
+	struct queue_list_entry *queue_cfg;
+	struct rte_event_dev_info info;
+	rte_dequeue_callback_fn clb;
+	int ret;
+
+	RTE_EVENTDEV_VALID_DEVID_OR_ERR_RET(dev_id, -EINVAL);
+
+	if (lcore_id >= RTE_MAX_LCORE) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	if (rte_event_dev_info_get(dev_id, &info) < 0) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	/* check if event port id is valid */
+	if (port_id >= info.max_event_ports) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	lcore_cfg = &lcore_cfgs[lcore_id];
+
+	/* if callback was already enabled, check current callback type */
+	if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_DISABLED &&
+		lcore_cfg->cb_mode != mode) {
+		ret = -EINVAL;
+		goto end;
+	}
+
+	/* we need this in various places */
+	rte_cpu_get_intrinsics_support(&global_data.intrinsics_support);
+
+	switch (mode) {
+	case RTE_POWER_MGMT_TYPE_MONITOR:
+		/* check if we can add a new port */
+		ret = check_evt_monitor(lcore_cfg, &qdata);
+		if (ret < 0)
+			goto end;
+
+		clb = evt_clb_umwait;
+		break;
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		/* figure out various time-to-tsc conversions */
+		if (global_data.tsc_per_us == 0)
+			calc_tsc();
+
+		clb = evt_clb_pause;
+		break;
+	default:
+		RTE_LOG(DEBUG, POWER, "Invalid power management type\n");
+		ret = -EINVAL;
+		goto end;
+	}
+	/* add this queue to the list */
+	ret = queue_list_add(lcore_cfg, &qdata);
+	if (ret < 0) {
+		RTE_LOG(DEBUG, POWER, "Failed to add queue to list: %s\n",
+				strerror(-ret));
+		goto end;
+	}
+	/* new queue is always added last */
+	queue_cfg = TAILQ_LAST(&lcore_cfg->head, queue_list_head);
+
+	/* when enabling first queue, ensure sleep target is not 0 */
+	if (lcore_cfg->n_queues == 1 && lcore_cfg->sleep_target == 0)
+		lcore_cfg->sleep_target = 1;
+
+	/* initialize data before enabling the callback */
+	if (lcore_cfg->n_queues == 1) {
+		lcore_cfg->cb_mode = mode;
+		lcore_cfg->pwr_mgmt_state = PMD_MGMT_ENABLED;
+	}
+	queue_cfg->evt_cb = rte_event_add_dequeue_callback(dev_id, port_id,
+						clb, queue_cfg);
+
+	ret = 0;
+end:
+	return ret;
+}
+
+int
+rte_power_eventdev_pmgmt_port_disable(unsigned int lcore_id,
+		uint8_t dev_id, uint8_t port_id)
+{
+	const union queue qdata = {.portid = dev_id, .qid = port_id};
+	struct pmd_core_cfg *lcore_cfg;
+	struct queue_list_entry *queue_cfg;
+
+	RTE_EVENTDEV_VALID_DEVID_OR_ERR_RET(dev_id, -EINVAL);
+
+	if (lcore_id >= RTE_MAX_LCORE)
+		return -EINVAL;
+
+	/* no need to check queue id as wrong queue id would not be enabled */
+	lcore_cfg = &lcore_cfgs[lcore_id];
+
+	if (lcore_cfg->pwr_mgmt_state != PMD_MGMT_ENABLED)
+		return -EINVAL;
+
+	/*
+	 * There is no good/easy way to do this without race conditions, so we
+	 * are just going to throw our hands in the air and hope that the user
+	 * has read the documentation and has ensured that ports are stopped at
+	 * the time we enter the API functions.
+	 */
+	queue_cfg = queue_list_take(lcore_cfg, &qdata);
+	if (queue_cfg == NULL)
+		return -ENOENT;
+
+	/* if we've removed all queues from the lists, set state to disabled */
+	if (lcore_cfg->n_queues == 0)
+		lcore_cfg->pwr_mgmt_state = PMD_MGMT_DISABLED;
+
+	switch (lcore_cfg->cb_mode) {
+	case RTE_POWER_MGMT_TYPE_MONITOR: /* fall-through */
+	case RTE_POWER_MGMT_TYPE_SCALE:
+	case RTE_POWER_MGMT_TYPE_PAUSE:
+		rte_event_remove_dequeue_callback(dev_id, port_id,
+			queue_cfg->evt_cb);
+		break;
+	}
+	/*
+	 * the API doc mandates that the user stops all processing on affected
+	 * ports before calling any of these API's, so we can assume that the
+	 * callbacks can be freed. we're intentionally casting away const-ness.
+	 */
+	rte_free((void *)queue_cfg->evt_cb);
+	free(queue_cfg);
+
+	return 0;
+}
+
+
 int
 rte_power_ethdev_pmgmt_queue_enable(unsigned int lcore_id, uint16_t port_id,
 		uint16_t queue_id, enum rte_power_pmd_mgmt_type mode)
diff --git a/lib/power/rte_power_pmd_mgmt.h b/lib/power/rte_power_pmd_mgmt.h
index 0f1a2eb22e..e1966b9777 100644
--- a/lib/power/rte_power_pmd_mgmt.h
+++ b/lib/power/rte_power_pmd_mgmt.h
@@ -87,6 +87,61 @@  int
 rte_power_ethdev_pmgmt_queue_disable(unsigned int lcore_id,
 		uint16_t port_id, uint16_t queue_id);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice.
+ *
+ * Enable power management on a specified Event device port and lcore.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @warning This function must be called when the event device is stopped and
+ * no enqueue/dequeue is in progress!
+ *
+ * @param lcore_id
+ *   The lcore the event port will be polled from.
+ * @param dev_id
+ *   The identifier of the device.
+ * @param port_id
+ *   Event port identifier of the Event device.
+ * @param mode
+ *   The power management scheme to use for specified event port.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_eventdev_pmgmt_port_enable(unsigned int lcore_id,
+		uint8_t dev_id, uint8_t port_id,
+		enum rte_power_pmd_mgmt_type mode);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice.
+ *
+ * Disable power management on a specified Event device port and lcore.
+ *
+ * @note This function is not thread-safe.
+ *
+ * @warning This function must be called when the event device is stopped and
+ *   no enqueue/dequeue is in progress!
+ *
+ * @param lcore_id
+ *   The lcore the event port is polled from.
+ * @param dev_id
+ *   The identifier of the device.
+ * @param port_id
+ *   Event port identifier of the Event device.
+ * @return
+ *   0 on success
+ *   <0 on error
+ */
+__rte_experimental
+int
+rte_power_eventdev_pmgmt_port_disable(unsigned int lcore_id,
+		uint8_t dev_id, uint8_t port_id);
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change, or be removed, without prior notice.
diff --git a/lib/power/version.map b/lib/power/version.map
index 05d544e947..a8711779b6 100644
--- a/lib/power/version.map
+++ b/lib/power/version.map
@@ -52,4 +52,8 @@  EXPERIMENTAL {
 	rte_power_uncore_get_num_freqs;
 	rte_power_uncore_get_num_pkgs;
 	rte_power_uncore_init;
+
+	# added in 23.07
+	rte_power_eventdev_pmgmt_port_disable;
+	rte_power_eventdev_pmgmt_port_enable;
 };