diff mbox series

[v4,2/3] ethdev: add standby state for live migration

Message ID	20230118154447.595231-3-rongweil@nvidia.com (mailing list archive)
State	Superseded, archived
Delegated to:	Ferruh Yigit
Headers	Received-SPF: Pass (protection.outlook.com: domain of nvidia.com designates 216.228.117.160 as permitted sender) receiver=protection.outlook.com; client-ip=216.228.117.160; helo=mail.nvidia.com; pr=C From: Rongwei Liu <rongweil@nvidia.com> To: <dev@dpdk.org>, <matan@nvidia.com>, <viacheslavo@nvidia.com>, <orika@nvidia.com>, <thomas@monjalon.net>, <jerinjacobk@gmail.com>, <stephen@networkplumber.org> CC: <rasland@nvidia.com>, Ferruh Yigit <ferruh.yigit@amd.com>, "Andrew Rybchenko" <andrew.rybchenko@oktetlabs.ru> Subject: [PATCH v4 2/3] ethdev: add standby state for live migration Date: Wed, 18 Jan 2023 17:44:45 +0200 Message-ID: <20230118154447.595231-3-rongweil@nvidia.com> In-Reply-To: <20230118154447.595231-1-rongweil@nvidia.com> References: <20221221090017.3715030-2-rongweil@nvidia.com> <20230118154447.595231-1-rongweil@nvidia.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain Precedence: list Errors-To: dev-bounces@dpdk.org
Series	add API for live migration \| [v4,0/3] add API for live migration [v4,1/3] ethdev: add flow rule group description [v4,2/3] ethdev: add standby state for live migration [v4,3/3] ethdev: add standby flags for live migration

Checks

Context	Check	Description
ci/checkpatch	success	coding style OK

Commit Message

Rongwei Liu Jan. 18, 2023, 3:44 p.m. UTC

  When a DPDK application must be upgraded,
the traffic downtime should be shortened as much as possible.
During the migration time, the old application may stay alive
while the new application is starting and being configured.

In order to optimize the switch to the new application,
the old application may need to be aware of the presence
of the new application being prepared.
This is achieved with a new API allowing the user to change the
new application state to standby and active later.

The added function is trying to apply the new state to all probed
ethdev ports. To make this API simple and easy to use,
the same flags have to be accepted by all devices.

This is the scenario of operations in the old and new applications:
.       device: already configured by the old application
.       new:    start as active
.       new:    probe the same device
.       new:    set as standby
.       new:    configure the device
.       device: has configurations from old and new applications
.       old:    clear its device configuration
.       device: has only 1 configuration from new application
.       new:    set as active
.       device: downtime for connecting all to the new application
.       old:    shutdown

The active role means network handling configurations are programmed
to the HW immediately, and no behavior changed. This is the default state.
The standby role means configurations are queued in the HW.
If there is no application with active role,
any configuration is effective immediately.

Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
---
 doc/guides/rel_notes/release_23_03.rst |  7 ++++
 lib/ethdev/ethdev_driver.h             | 20 +++++++++
 lib/ethdev/rte_ethdev.c                | 42 +++++++++++++++++++
 lib/ethdev/rte_ethdev.h                | 56 ++++++++++++++++++++++++++
 lib/ethdev/version.map                 |  3 ++
 5 files changed, 128 insertions(+)

Comments

Ori Kam Jan. 31, 2023, 1:50 p.m. UTC | #1

Hi Rongwei,

> -----Original Message-----
> From: Rongwei Liu <rongweil@nvidia.com>
> Sent: Wednesday, 18 January 2023 17:45
> 
> When a DPDK application must be upgraded,
> the traffic downtime should be shortened as much as possible.
> During the migration time, the old application may stay alive
> while the new application is starting and being configured.
> 
> In order to optimize the switch to the new application,
> the old application may need to be aware of the presence
> of the new application being prepared.
> This is achieved with a new API allowing the user to change the
> new application state to standby and active later.
> 
> The added function is trying to apply the new state to all probed
> ethdev ports. To make this API simple and easy to use,
> the same flags have to be accepted by all devices.
> 
> This is the scenario of operations in the old and new applications:
> .       device: already configured by the old application
> .       new:    start as active
> .       new:    probe the same device
> .       new:    set as standby
> .       new:    configure the device
> .       device: has configurations from old and new applications
> .       old:    clear its device configuration
> .       device: has only 1 configuration from new application
> .       new:    set as active
> .       device: downtime for connecting all to the new application
> .       old:    shutdown
> 
> The active role means network handling configurations are programmed
> to the HW immediately, and no behavior changed. This is the default state.
> The standby role means configurations are queued in the HW.
> If there is no application with active role,
> any configuration is effective immediately.
> 
> Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
> ---
>  doc/guides/rel_notes/release_23_03.rst |  7 ++++
>  lib/ethdev/ethdev_driver.h             | 20 +++++++++
>  lib/ethdev/rte_ethdev.c                | 42 +++++++++++++++++++
>  lib/ethdev/rte_ethdev.h                | 56 ++++++++++++++++++++++++++
>  lib/ethdev/version.map                 |  3 ++
>  5 files changed, 128 insertions(+)
> 
> diff --git a/doc/guides/rel_notes/release_23_03.rst
> b/doc/guides/rel_notes/release_23_03.rst
> index b8c5b68d6c..5367123f24 100644
> --- a/doc/guides/rel_notes/release_23_03.rst
> +++ b/doc/guides/rel_notes/release_23_03.rst
> @@ -55,6 +55,13 @@ New Features
>       Also, make sure to start the actual text at the margin.
>       =======================================================
> 
> +* **Added process state in ethdev to improve live migration.**
> +
> +  Hot upgrade of an application may be accelerated by configuring
> +  the new application in standby state while the old one is still active.
> +  Such double ethdev configuration of the same device is possible
> +  with the added process state API.
> +
> 
>  Removed Items
>  -------------
> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> index 6a550cfc83..4a098410d5 100644
> --- a/lib/ethdev/ethdev_driver.h
> +++ b/lib/ethdev/ethdev_driver.h
> @@ -219,6 +219,23 @@ typedef int (*eth_dev_reset_t)(struct rte_eth_dev
> *dev);
>  /** @internal Function used to detect an Ethernet device removal. */
>  typedef int (*eth_is_removed_t)(struct rte_eth_dev *dev);
> 
> +/**
> + * @internal
> + * Set the role of the process to active or standby during live migration.
> + *
> + * @param dev
> + *   Port (ethdev) handle.
> + * @param standby
> + *   Role active if false, standby if true.
> + * @param flags
> + *   Role specific flags.
> + *
> + * @return
> + *   Negative value on error, 0 on success.
> + */
> +typedef int (*eth_dev_process_set_role_t)(struct rte_eth_dev *dev,
> +		bool standby, uint32_t flags);
> +
>  /**
>   * @internal
>   * Function used to enable the Rx promiscuous mode of an Ethernet device.
> @@ -1186,6 +1203,9 @@ struct eth_dev_ops {
>  	/** Check if the device was physically removed */
>  	eth_is_removed_t           is_removed;
> 
> +	/** Set role during live migration */
> +	eth_dev_process_set_role_t process_set_role;
> +
>  	eth_promiscuous_enable_t   promiscuous_enable; /**< Promiscuous
> ON */
>  	eth_promiscuous_disable_t  promiscuous_disable;/**< Promiscuous
> OFF */
>  	eth_allmulticast_enable_t  allmulticast_enable;/**< Rx multicast ON
> */
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index 5d5e18db1e..3a1fb64053 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -558,6 +558,48 @@ rte_eth_dev_owner_get(const uint16_t port_id,
> struct rte_eth_dev_owner *owner)
>  	return 0;
>  }
> 
> +int
> +rte_eth_process_set_role(bool standby, uint32_t flags)
> +{
> +	struct rte_eth_dev_info dev_info = {0};
> +	struct rte_eth_dev *dev;
> +	uint16_t port_id;
> +	int ret = 0;
> +
> +	/* Check if all devices support process role. */
> +	RTE_ETH_FOREACH_DEV(port_id) {
> +		dev = &rte_eth_devices[port_id];
> +		if (*dev->dev_ops->process_set_role != NULL &&
> +			*dev->dev_ops->dev_infos_get != NULL &&
> +			(*dev->dev_ops->dev_infos_get)(dev, &dev_info) ==
> 0 &&
> +			(dev_info.dev_capa &
> RTE_ETH_DEV_CAPA_PROCESS_ROLE) != 0)
> +			continue;
> +		rte_errno = ENOTSUP;
> +		return -rte_errno;
> +	}
> +	/* Call the driver callbacks. */
> +	RTE_ETH_FOREACH_DEV(port_id) {
> +		dev = &rte_eth_devices[port_id];
> +		if ((*dev->dev_ops->process_set_role)(dev, standby, flags) <
> 0)
> +			goto failure;
> +		ret++;
> +	}
> +	return ret;
> +
> +failure:
> +	/* Rollback all changed devices in case one failed. */
> +	if (ret) {
> +		RTE_ETH_FOREACH_DEV(port_id) {
> +			dev = &rte_eth_devices[port_id];
> +			(*dev->dev_ops->process_set_role)(dev, !standby,
> flags);
> +			if (--ret == 0)
> +				break;
> +		}
> +	}
> +	rte_errno = EPERM;
> +	return -rte_errno;
> +}
> +
>  int
>  rte_eth_dev_socket_id(uint16_t port_id)
>  {
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index c129ca1eaf..1505396ced 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1606,6 +1606,8 @@ struct rte_eth_conf {
>  #define RTE_ETH_DEV_CAPA_FLOW_RULE_KEEP         RTE_BIT64(3)
>  /** Device supports keeping shared flow objects across restart. */
>  #define RTE_ETH_DEV_CAPA_FLOW_SHARED_OBJECT_KEEP RTE_BIT64(4)
> +/** Device supports process role changing. @see
> rte_eth_process_set_active */
> +#define RTE_ETH_DEV_CAPA_PROCESS_ROLE           RTE_BIT64(5)
>  /**@}*/
> 
>  /*
> @@ -2204,6 +2206,60 @@ int rte_eth_dev_owner_delete(const uint64_t
> owner_id);
>  int rte_eth_dev_owner_get(const uint16_t port_id,
>  		struct rte_eth_dev_owner *owner);
> 
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Set the role of the process to active or standby,
> + * affecting network traffic handling.
> + *
> + * If one device does not support this operation or fails,
> + * the whole operation is failed and rolled back.
> + *
> + * It is forbidden to have multiple processes with the same role
> + * unless only one of them is configured to handle the traffic.
> + *
> + * The application is active by default.
> + * The configuration from the active process is effective immediately
> + * while the configuration from the standby process is queued by hardware.
> + * When configuring the device from a standby process,
> + * it has no effect except for below situations:
> + *   - traffic not handled by the active process configuration
> + *   - no active process
> + *
> + * When a process is changed from a standby to an active role,
> + * all preceding configurations that are queued by hardware
> + * should become effective immediately.
> + * Before role transition, all the traffic handling configurations
> + * set by the active process should be flushed first.
> + *
> + * In summary, the operations are expected to happen in this order
> + * in "old" and "new" applications:
> + *   device: already configured by the old application
> + *   new:    start as active
> + *   new:    probe the same device
> + *   new:    set as standby
> + *   new:    configure the device
> + *   device: has configurations from old and new applications
> + *   old:    clear its device configuration
> + *   device: has only 1 configuration from new application
> + *   new:    set as active
> + *   device: downtime for connecting all to the new application
> + *   old:    shutdown
> + *
> + * @param standby
> + *   Role active if false, standby if true.
> + * @param flags
> + *   Role specific flags.
> + * @return
> + *   Positive value on success, -rte_errno value on error:
> + *   - (> 0) Number of switched devices.
> + *   - (-ENOTSUP) if not supported by a device.
> + *   - (-EPERM) if operation failed with a device.
> + */
> +__rte_experimental
> +int rte_eth_process_set_role(bool standby, uint32_t flags);
> +
>  /**
>   * Get the number of ports which are usable for the application.
>   *
> diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
> index 17201fbe0f..d5d3ea5421 100644
> --- a/lib/ethdev/version.map
> +++ b/lib/ethdev/version.map
> @@ -298,6 +298,9 @@ EXPERIMENTAL {
>  	rte_flow_get_q_aged_flows;
>  	rte_mtr_meter_policy_get;
>  	rte_mtr_meter_profile_get;
> +
> +	# added in 23.03
> +	rte_eth_process_set_role;
>  };
> 
>  INTERNAL {
> --
> 2.27.0


Acked-by: Ori Kam <orika@nvidia.com>
Best,
Ori

Jerin Jacob Jan. 31, 2023, 6:14 p.m. UTC | #2

On Wed, Jan 18, 2023 at 9:15 PM Rongwei Liu <rongweil@nvidia.com> wrote:
>
> When a DPDK application must be upgraded,
> the traffic downtime should be shortened as much as possible.
> During the migration time, the old application may stay alive
> while the new application is starting and being configured.
>
> In order to optimize the switch to the new application,
> the old application may need to be aware of the presence
> of the new application being prepared.
> This is achieved with a new API allowing the user to change the
> new application state to standby and active later.
>
> The added function is trying to apply the new state to all probed
> ethdev ports. To make this API simple and easy to use,
> the same flags have to be accepted by all devices.
>
> This is the scenario of operations in the old and new applications:
> .       device: already configured by the old application
> .       new:    start as active
> .       new:    probe the same device

How to probe same device if is already bind to another application?
vfio-pci wont allow this.

Good to see any implementation and example application to exercise this APIs


> .       new:    set as standby
> .       new:    configure the device
> .       device: has configurations from old and new applications
> .       old:    clear its device configuration
> .       device: has only 1 configuration from new application
> .       new:    set as active
> .       device: downtime for connecting all to the new application
> .       old:    shutdown
>
> The active role means network handling configurations are programmed
> to the HW immediately, and no behavior changed. This is the default state.
> The standby role means configurations are queued in the HW.
> If there is no application with active role,
> any configuration is effective immediately.
>
> Signed-off-by: Rongwei Liu <rongweil@nvidia.com>

Thomas Monjalon Jan. 31, 2023, 10:55 p.m. UTC | #3

31/01/2023 19:14, Jerin Jacob:
> On Wed, Jan 18, 2023 at 9:15 PM Rongwei Liu <rongweil@nvidia.com> wrote:
> >
> > When a DPDK application must be upgraded,
> > the traffic downtime should be shortened as much as possible.
> > During the migration time, the old application may stay alive
> > while the new application is starting and being configured.
> >
> > In order to optimize the switch to the new application,
> > the old application may need to be aware of the presence
> > of the new application being prepared.
> > This is achieved with a new API allowing the user to change the
> > new application state to standby and active later.
> >
> > The added function is trying to apply the new state to all probed
> > ethdev ports. To make this API simple and easy to use,
> > the same flags have to be accepted by all devices.
> >
> > This is the scenario of operations in the old and new applications:
> > .       device: already configured by the old application
> > .       new:    start as active
> > .       new:    probe the same device
> 
> How to probe same device if is already bind to another application?
> vfio-pci wont allow this.

I missed that part.
There is no way to share a VFIO device between 2 applications?


> > .       new:    set as standby
> > .       new:    configure the device
> > .       device: has configurations from old and new applications
> > .       old:    clear its device configuration
> > .       device: has only 1 configuration from new application
> > .       new:    set as active
> > .       device: downtime for connecting all to the new application
> > .       old:    shutdown

Andrew Rybchenko Feb. 1, 2023, 7:32 a.m. UTC | #4

On 2/1/23 01:55, Thomas Monjalon wrote:
> 31/01/2023 19:14, Jerin Jacob:
>> On Wed, Jan 18, 2023 at 9:15 PM Rongwei Liu <rongweil@nvidia.com> wrote:
>>>
>>> When a DPDK application must be upgraded,
>>> the traffic downtime should be shortened as much as possible.
>>> During the migration time, the old application may stay alive
>>> while the new application is starting and being configured.
>>>
>>> In order to optimize the switch to the new application,
>>> the old application may need to be aware of the presence
>>> of the new application being prepared.
>>> This is achieved with a new API allowing the user to change the
>>> new application state to standby and active later.
>>>
>>> The added function is trying to apply the new state to all probed
>>> ethdev ports. To make this API simple and easy to use,
>>> the same flags have to be accepted by all devices.
>>>
>>> This is the scenario of operations in the old and new applications:
>>> .       device: already configured by the old application
>>> .       new:    start as active
>>> .       new:    probe the same device
>>
>> How to probe same device if is already bind to another application?
>> vfio-pci wont allow this.
> 
> I missed that part.
> There is no way to share a VFIO device between 2 applications?

As I understand multi-process shares an VFIO device between
many application. As far as I remember it is just required to
pass corresponding file descriptor to another application.

Anyway I fully agree that the patch requires more documentation
in doc/guides/ with the description of live migration theory of
operations.

>>> .       new:    set as standby
>>> .       new:    configure the device
>>> .       device: has configurations from old and new applications
>>> .       old:    clear its device configuration
>>> .       device: has only 1 configuration from new application
>>> .       new:    set as active
>>> .       device: downtime for connecting all to the new application
>>> .       old:    shutdown
> 
> 
>

Andrew Rybchenko Feb. 1, 2023, 7:52 a.m. UTC | #5

On 1/18/23 18:44, Rongwei Liu wrote:
> When a DPDK application must be upgraded,
> the traffic downtime should be shortened as much as possible.
> During the migration time, the old application may stay alive
> while the new application is starting and being configured.
> 
> In order to optimize the switch to the new application,
> the old application may need to be aware of the presence
> of the new application being prepared.
> This is achieved with a new API allowing the user to change the
> new application state to standby and active later.
> 
> The added function is trying to apply the new state to all probed
> ethdev ports. To make this API simple and easy to use,
> the same flags have to be accepted by all devices.
> 
> This is the scenario of operations in the old and new applications:
> .       device: already configured by the old application
> .       new:    start as active
> .       new:    probe the same device
> .       new:    set as standby
> .       new:    configure the device
> .       device: has configurations from old and new applications
> .       old:    clear its device configuration
> .       device: has only 1 configuration from new application
> .       new:    set as active
> .       device: downtime for connecting all to the new application
> .       old:    shutdown
> 
> The active role means network handling configurations are programmed
> to the HW immediately, and no behavior changed. This is the default state.
> The standby role means configurations are queued in the HW.
> If there is no application with active role,
> any configuration is effective immediately.
> 
> Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
> ---
>   doc/guides/rel_notes/release_23_03.rst |  7 ++++
>   lib/ethdev/ethdev_driver.h             | 20 +++++++++
>   lib/ethdev/rte_ethdev.c                | 42 +++++++++++++++++++
>   lib/ethdev/rte_ethdev.h                | 56 ++++++++++++++++++++++++++
>   lib/ethdev/version.map                 |  3 ++
>   5 files changed, 128 insertions(+)
> 
> diff --git a/doc/guides/rel_notes/release_23_03.rst b/doc/guides/rel_notes/release_23_03.rst
> index b8c5b68d6c..5367123f24 100644
> --- a/doc/guides/rel_notes/release_23_03.rst
> +++ b/doc/guides/rel_notes/release_23_03.rst
> @@ -55,6 +55,13 @@ New Features
>        Also, make sure to start the actual text at the margin.
>        =======================================================
>   
> +* **Added process state in ethdev to improve live migration.**
> +
> +  Hot upgrade of an application may be accelerated by configuring
> +  the new application in standby state while the old one is still active.
> +  Such double ethdev configuration of the same device is possible
> +  with the added process state API.
> +

Will we have the same API for other device classes? Any ideas
how to avoid duplication? (I'm afraid we don't have generic
place to put such functionality).

>   
>   Removed Items
>   -------------
> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> index 6a550cfc83..4a098410d5 100644
> --- a/lib/ethdev/ethdev_driver.h
> +++ b/lib/ethdev/ethdev_driver.h
> @@ -219,6 +219,23 @@ typedef int (*eth_dev_reset_t)(struct rte_eth_dev *dev);
>   /** @internal Function used to detect an Ethernet device removal. */
>   typedef int (*eth_is_removed_t)(struct rte_eth_dev *dev);
>   
> +/**
> + * @internal
> + * Set the role of the process to active or standby during live migration.
> + *
> + * @param dev
> + *   Port (ethdev) handle.
> + * @param standby
> + *   Role active if false, standby if true.
> + * @param flags
> + *   Role specific flags.
> + *
> + * @return
> + *   Negative value on error, 0 on success.
> + */
> +typedef int (*eth_dev_process_set_role_t)(struct rte_eth_dev *dev,
> +		bool standby, uint32_t flags);
> +
>   /**
>    * @internal
>    * Function used to enable the Rx promiscuous mode of an Ethernet device.
> @@ -1186,6 +1203,9 @@ struct eth_dev_ops {
>   	/** Check if the device was physically removed */
>   	eth_is_removed_t           is_removed;
>   
> +	/** Set role during live migration */
> +	eth_dev_process_set_role_t process_set_role;
> +
>   	eth_promiscuous_enable_t   promiscuous_enable; /**< Promiscuous ON */
>   	eth_promiscuous_disable_t  promiscuous_disable;/**< Promiscuous OFF */
>   	eth_allmulticast_enable_t  allmulticast_enable;/**< Rx multicast ON */
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index 5d5e18db1e..3a1fb64053 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -558,6 +558,48 @@ rte_eth_dev_owner_get(const uint16_t port_id, struct rte_eth_dev_owner *owner)
>   	return 0;
>   }
>   
> +int
> +rte_eth_process_set_role(bool standby, uint32_t flags)
> +{
> +	struct rte_eth_dev_info dev_info = {0};
> +	struct rte_eth_dev *dev;
> +	uint16_t port_id;
> +	int ret = 0;
> +
> +	/* Check if all devices support process role. */
> +	RTE_ETH_FOREACH_DEV(port_id) {
> +		dev = &rte_eth_devices[port_id];
> +		if (*dev->dev_ops->process_set_role != NULL &&
> +			*dev->dev_ops->dev_infos_get != NULL &&
> +			(*dev->dev_ops->dev_infos_get)(dev, &dev_info) == 0 &&
> +			(dev_info.dev_capa & RTE_ETH_DEV_CAPA_PROCESS_ROLE) != 0)

Why do we need both non-NULL callback and dev_capa flag?
Isn't just non-NULL callback sufficient?

> +			continue;
> +		rte_errno = ENOTSUP;
> +		return -rte_errno;
> +	}
> +	/* Call the driver callbacks. */
> +	RTE_ETH_FOREACH_DEV(port_id) {
> +		dev = &rte_eth_devices[port_id];
> +		if ((*dev->dev_ops->process_set_role)(dev, standby, flags) < 0)

Please, add error logging on failure.

> +			goto failure;

IMHO it would be useful to have info logs on success.

> +		ret++;

ret variable name is very confusing, please, be
more specific in variable naming.

> +	}
> +	return ret;
> +
> +failure:
> +	/* Rollback all changed devices in case one failed. */
> +	if (ret) {

Compare vs 0

> +		RTE_ETH_FOREACH_DEV(port_id) {
> +			dev = &rte_eth_devices[port_id];
> +			(*dev->dev_ops->process_set_role)(dev, !standby, flags);

IMHO, it would be useful to have error logs on rollback
failure and info logs on success.

> +			if (--ret == 0)
> +				break;
> +		}
> +	}
> +	rte_errno = EPERM;

Why is it always EPERM? Isn't it better to forward
failure code from driver?

> +	return -rte_errno;
> +}
> +
>   int
>   rte_eth_dev_socket_id(uint16_t port_id)
>   {
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index c129ca1eaf..1505396ced 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1606,6 +1606,8 @@ struct rte_eth_conf {
>   #define RTE_ETH_DEV_CAPA_FLOW_RULE_KEEP         RTE_BIT64(3)
>   /** Device supports keeping shared flow objects across restart. */
>   #define RTE_ETH_DEV_CAPA_FLOW_SHARED_OBJECT_KEEP RTE_BIT64(4)
> +/** Device supports process role changing. @see rte_eth_process_set_active */

There is no rte_eth_process_set_active()

> +#define RTE_ETH_DEV_CAPA_PROCESS_ROLE           RTE_BIT64(5)
>   /**@}*/
>   
>   /*
> @@ -2204,6 +2206,60 @@ int rte_eth_dev_owner_delete(const uint64_t owner_id);
>   int rte_eth_dev_owner_get(const uint16_t port_id,
>   		struct rte_eth_dev_owner *owner);
>   
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Set the role of the process to active or standby,
> + * affecting network traffic handling.
> + *
> + * If one device does not support this operation or fails,
> + * the whole operation is failed and rolled back.
> + *
> + * It is forbidden to have multiple processes with the same role
> + * unless only one of them is configured to handle the traffic.

Why is it not allowed to have many processes in standby?

> + *
> + * The application is active by default.
> + * The configuration from the active process is effective immediately
> + * while the configuration from the standby process is queued by hardware.

I'm afraid error handling could be tricky in this case.
I think it makes sense to highlight that configuration
propagation failures will be returned on attempt to set
the process active.

> + * When configuring the device from a standby process,
> + * it has no effect except for below situations:
> + *   - traffic not handled by the active process configuration
> + *   - no active process
> + *
> + * When a process is changed from a standby to an active role,
> + * all preceding configurations that are queued by hardware
> + * should become effective immediately.
> + * Before role transition, all the traffic handling configurations
> + * set by the active process should be flushed first.
> + *
> + * In summary, the operations are expected to happen in this order
> + * in "old" and "new" applications:
> + *   device: already configured by the old application
> + *   new:    start as active
> + *   new:    probe the same device
> + *   new:    set as standby
> + *   new:    configure the device
> + *   device: has configurations from old and new applications
> + *   old:    clear its device configuration
> + *   device: has only 1 configuration from new application
> + *   new:    set as active
> + *   device: downtime for connecting all to the new application
> + *   old:    shutdown
> + *
> + * @param standby
> + *   Role active if false, standby if true.

Typically API with boolean parameters is bad. May be in this
particular case it is better to have two functions:
rte_eth_process_set_active() and rte_eth_process_set_standby().

> + * @param flags
> + *   Role specific flags.

Please, define namespace for flags.

> + * @return
> + *   Positive value on success, -rte_errno value on error:
> + *   - (> 0) Number of switched devices.

Isn't is success if there is no devices to switch?

> + *   - (-ENOTSUP) if not supported by a device.
> + *   - (-EPERM) if operation failed with a device.
> + */
> +__rte_experimental
> +int rte_eth_process_set_role(bool standby, uint32_t flags);
> +
>   /**
>    * Get the number of ports which are usable for the application.
>    *
> diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
> index 17201fbe0f..d5d3ea5421 100644
> --- a/lib/ethdev/version.map
> +++ b/lib/ethdev/version.map
> @@ -298,6 +298,9 @@ EXPERIMENTAL {
>   	rte_flow_get_q_aged_flows;
>   	rte_mtr_meter_policy_get;
>   	rte_mtr_meter_profile_get;
> +
> +	# added in 23.03
> +	rte_eth_process_set_role;
>   };
>   
>   INTERNAL {

Thomas Monjalon Feb. 1, 2023, 8:27 a.m. UTC | #6

01/02/2023 08:52, Andrew Rybchenko:
> On 1/18/23 18:44, Rongwei Liu wrote:
> > +* **Added process state in ethdev to improve live migration.**
> > +
> > +  Hot upgrade of an application may be accelerated by configuring
> > +  the new application in standby state while the old one is still active.
> > +  Such double ethdev configuration of the same device is possible
> > +  with the added process state API.
> > +
> 
> Will we have the same API for other device classes? Any ideas
> how to avoid duplication? (I'm afraid we don't have generic
> place to put such functionality).

It could be needed to have such feature in other classes, yes.
That's a device feature, so it must be in each class,
with different flags/options per class.

> > +	/* Check if all devices support process role. */
> > +	RTE_ETH_FOREACH_DEV(port_id) {
> > +		dev = &rte_eth_devices[port_id];
> > +		if (*dev->dev_ops->process_set_role != NULL &&
> > +			*dev->dev_ops->dev_infos_get != NULL &&
> > +			(*dev->dev_ops->dev_infos_get)(dev, &dev_info) == 0 &&
> > +			(dev_info.dev_capa & RTE_ETH_DEV_CAPA_PROCESS_ROLE) != 0)
> 
> Why do we need both non-NULL callback and dev_capa flag?
> Isn't just non-NULL callback sufficient?

We could have a callback returning ENOTSUP.
Anyway we need the capability for application query.

> > +			continue;
> > +		rte_errno = ENOTSUP;
> > +		return -rte_errno;
> > +	}
> > +	/* Call the driver callbacks. */
> > +	RTE_ETH_FOREACH_DEV(port_id) {
> > +		dev = &rte_eth_devices[port_id];
> > +		if ((*dev->dev_ops->process_set_role)(dev, standby, flags) < 0)
> 
> Please, add error logging on failure.

good idea

> 
> > +			goto failure;
> 
> IMHO it would be useful to have info logs on success.

When setting the role, we could have debug log, yes.

> > +		ret++;
> 
> ret variable name is very confusing, please, be
> more specific in variable naming.
> 
> > +	}
> > +	return ret;
> > +
> > +failure:
> > +	/* Rollback all changed devices in case one failed. */
> > +	if (ret) {
> 
> Compare vs 0
> 
> > +		RTE_ETH_FOREACH_DEV(port_id) {
> > +			dev = &rte_eth_devices[port_id];
> > +			(*dev->dev_ops->process_set_role)(dev, !standby, flags);
> 
> IMHO, it would be useful to have error logs on rollback
> failure and info logs on success.
> 
> > +			if (--ret == 0)
> > +				break;
> > +		}
> > +	}
> > +	rte_errno = EPERM;
> 
> Why is it always EPERM? Isn't it better to forward
> failure code from driver?
> 
> > +	return -rte_errno;
> > +}

Thanks for the good detailed review Andrew.

> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -1606,6 +1606,8 @@ struct rte_eth_conf {
> >   #define RTE_ETH_DEV_CAPA_FLOW_RULE_KEEP         RTE_BIT64(3)
> >   /** Device supports keeping shared flow objects across restart. */
> >   #define RTE_ETH_DEV_CAPA_FLOW_SHARED_OBJECT_KEEP RTE_BIT64(4)
> > +/** Device supports process role changing. @see rte_eth_process_set_active */
> 
> There is no rte_eth_process_set_active()
> 
> > +#define RTE_ETH_DEV_CAPA_PROCESS_ROLE           RTE_BIT64(5)
> >   /**@}*/
> >   
> >   /*
> > @@ -2204,6 +2206,60 @@ int rte_eth_dev_owner_delete(const uint64_t owner_id);
> >   int rte_eth_dev_owner_get(const uint16_t port_id,
> >   		struct rte_eth_dev_owner *owner);
> >   
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change without prior notice
> > + *
> > + * Set the role of the process to active or standby,
> > + * affecting network traffic handling.
> > + *
> > + * If one device does not support this operation or fails,
> > + * the whole operation is failed and rolled back.
> > + *
> > + * It is forbidden to have multiple processes with the same role
> > + * unless only one of them is configured to handle the traffic.
> 
> Why is it not allowed to have many processes in standby?
> 
> > + *
> > + * The application is active by default.
> > + * The configuration from the active process is effective immediately
> > + * while the configuration from the standby process is queued by hardware.
> 
> I'm afraid error handling could be tricky in this case.
> I think it makes sense to highlight that configuration
> propagation failures will be returned on attempt to set
> the process active.
> 
> > + * When configuring the device from a standby process,
> > + * it has no effect except for below situations:
> > + *   - traffic not handled by the active process configuration
> > + *   - no active process
> > + *
> > + * When a process is changed from a standby to an active role,
> > + * all preceding configurations that are queued by hardware
> > + * should become effective immediately.
> > + * Before role transition, all the traffic handling configurations
> > + * set by the active process should be flushed first.
> > + *
> > + * In summary, the operations are expected to happen in this order
> > + * in "old" and "new" applications:
> > + *   device: already configured by the old application
> > + *   new:    start as active
> > + *   new:    probe the same device
> > + *   new:    set as standby
> > + *   new:    configure the device
> > + *   device: has configurations from old and new applications
> > + *   old:    clear its device configuration
> > + *   device: has only 1 configuration from new application
> > + *   new:    set as active
> > + *   device: downtime for connecting all to the new application
> > + *   old:    shutdown
> > + *
> > + * @param standby
> > + *   Role active if false, standby if true.
> 
> Typically API with boolean parameters is bad. May be in this
> particular case it is better to have two functions:
> rte_eth_process_set_active() and rte_eth_process_set_standby().

Why?
It could be an enum as well.

> > + * @param flags
> > + *   Role specific flags.
> 
> Please, define namespace for flags.
> 
> > + * @return
> > + *   Positive value on success, -rte_errno value on error:
> > + *   - (> 0) Number of switched devices.
> 
> Isn't is success if there is no devices to switch?

It should be >= 0 I guess.

Thomas Monjalon Feb. 1, 2023, 8:31 a.m. UTC | #7

01/02/2023 08:32, Andrew Rybchenko:
> On 2/1/23 01:55, Thomas Monjalon wrote:
> > 31/01/2023 19:14, Jerin Jacob:
> >> On Wed, Jan 18, 2023 at 9:15 PM Rongwei Liu <rongweil@nvidia.com> wrote:
> >>>
> >>> When a DPDK application must be upgraded,
> >>> the traffic downtime should be shortened as much as possible.
> >>> During the migration time, the old application may stay alive
> >>> while the new application is starting and being configured.
> >>>
> >>> In order to optimize the switch to the new application,
> >>> the old application may need to be aware of the presence
> >>> of the new application being prepared.
> >>> This is achieved with a new API allowing the user to change the
> >>> new application state to standby and active later.
> >>>
> >>> The added function is trying to apply the new state to all probed
> >>> ethdev ports. To make this API simple and easy to use,
> >>> the same flags have to be accepted by all devices.
> >>>
> >>> This is the scenario of operations in the old and new applications:
> >>> .       device: already configured by the old application
> >>> .       new:    start as active
> >>> .       new:    probe the same device
> >>
> >> How to probe same device if is already bind to another application?
> >> vfio-pci wont allow this.
> > 
> > I missed that part.
> > There is no way to share a VFIO device between 2 applications?
> 
> As I understand multi-process shares an VFIO device between
> many application. As far as I remember it is just required to
> pass corresponding file descriptor to another application.
> 
> Anyway I fully agree that the patch requires more documentation
> in doc/guides/ with the description of live migration theory of
> operations.

Yes that part is missing.
Rongwei, you should try a fake migration with e1000 or virtio
(both available in QEMU), so you can play with VFIO device probing.

> >>> .       new:    set as standby
> >>> .       new:    configure the device
> >>> .       device: has configurations from old and new applications
> >>> .       old:    clear its device configuration
> >>> .       device: has only 1 configuration from new application
> >>> .       new:    set as active
> >>> .       device: downtime for connecting all to the new application
> >>> .       old:    shutdown

Andrew Rybchenko Feb. 1, 2023, 8:40 a.m. UTC | #8

On 2/1/23 11:27, Thomas Monjalon wrote:
> 01/02/2023 08:52, Andrew Rybchenko:
>> On 1/18/23 18:44, Rongwei Liu wrote:
>>> + * When configuring the device from a standby process,
>>> + * it has no effect except for below situations:
>>> + *   - traffic not handled by the active process configuration
>>> + *   - no active process
>>> + *
>>> + * When a process is changed from a standby to an active role,
>>> + * all preceding configurations that are queued by hardware
>>> + * should become effective immediately.
>>> + * Before role transition, all the traffic handling configurations
>>> + * set by the active process should be flushed first.
>>> + *
>>> + * In summary, the operations are expected to happen in this order
>>> + * in "old" and "new" applications:
>>> + *   device: already configured by the old application
>>> + *   new:    start as active
>>> + *   new:    probe the same device
>>> + *   new:    set as standby
>>> + *   new:    configure the device
>>> + *   device: has configurations from old and new applications
>>> + *   old:    clear its device configuration
>>> + *   device: has only 1 configuration from new application
>>> + *   new:    set as active
>>> + *   device: downtime for connecting all to the new application
>>> + *   old:    shutdown
>>> + *
>>> + * @param standby
>>> + *   Role active if false, standby if true.
>>
>> Typically API with boolean parameters is bad. May be in this
>> particular case it is better to have two functions:
>> rte_eth_process_set_active() and rte_eth_process_set_standby().
> 
> Why?
> It could be an enum as well.

It is simply hard to read (what is set_role(true)???) and not
extensible. enum instead of boolean is an acceptable
alternative of many functions.

Jerin Jacob Feb. 1, 2023, 8:40 a.m. UTC | #9

On Wed, Feb 1, 2023 at 1:02 PM Andrew Rybchenko
<andrew.rybchenko@oktetlabs.ru> wrote:
>
> On 2/1/23 01:55, Thomas Monjalon wrote:
> > 31/01/2023 19:14, Jerin Jacob:
> >> On Wed, Jan 18, 2023 at 9:15 PM Rongwei Liu <rongweil@nvidia.com> wrote:
> >>>
> >>> When a DPDK application must be upgraded,
> >>> the traffic downtime should be shortened as much as possible.
> >>> During the migration time, the old application may stay alive
> >>> while the new application is starting and being configured.
> >>>
> >>> In order to optimize the switch to the new application,
> >>> the old application may need to be aware of the presence
> >>> of the new application being prepared.
> >>> This is achieved with a new API allowing the user to change the
> >>> new application state to standby and active later.
> >>>
> >>> The added function is trying to apply the new state to all probed
> >>> ethdev ports. To make this API simple and easy to use,
> >>> the same flags have to be accepted by all devices.
> >>>
> >>> This is the scenario of operations in the old and new applications:
> >>> .       device: already configured by the old application
> >>> .       new:    start as active
> >>> .       new:    probe the same device
> >>
> >> How to probe same device if is already bind to another application?
> >> vfio-pci wont allow this.
> >
> > I missed that part.
> > There is no way to share a VFIO device between 2 applications?
>
> As I understand multi-process shares an VFIO device between
> many application. As far as I remember it is just required to
> pass corresponding file descriptor to another application.

I think, Here, it is two different primary application.

Even if it is primary-secondary kind of case, bus or pci driver layer
needs fixup,
Currently, if we add allow list same PCI device on primary and secondary,
the second app start will fail.

>
> Anyway I fully agree that the patch requires more documentation
> in doc/guides/ with the description of live migration theory of
> operations.
>
> >>> .       new:    set as standby
> >>> .       new:    configure the device
> >>> .       device: has configurations from old and new applications
> >>> .       old:    clear its device configuration
> >>> .       device: has only 1 configuration from new application
> >>> .       new:    set as active
> >>> .       device: downtime for connecting all to the new application
> >>> .       old:    shutdown
> >
> >
> >
>

Thomas Monjalon Feb. 1, 2023, 8:46 a.m. UTC | #10

01/02/2023 09:40, Jerin Jacob:
> On Wed, Feb 1, 2023 at 1:02 PM Andrew Rybchenko
> <andrew.rybchenko@oktetlabs.ru> wrote:
> >
> > On 2/1/23 01:55, Thomas Monjalon wrote:
> > > 31/01/2023 19:14, Jerin Jacob:
> > >> On Wed, Jan 18, 2023 at 9:15 PM Rongwei Liu <rongweil@nvidia.com> wrote:
> > >>>
> > >>> When a DPDK application must be upgraded,
> > >>> the traffic downtime should be shortened as much as possible.
> > >>> During the migration time, the old application may stay alive
> > >>> while the new application is starting and being configured.
> > >>>
> > >>> In order to optimize the switch to the new application,
> > >>> the old application may need to be aware of the presence
> > >>> of the new application being prepared.
> > >>> This is achieved with a new API allowing the user to change the
> > >>> new application state to standby and active later.
> > >>>
> > >>> The added function is trying to apply the new state to all probed
> > >>> ethdev ports. To make this API simple and easy to use,
> > >>> the same flags have to be accepted by all devices.
> > >>>
> > >>> This is the scenario of operations in the old and new applications:
> > >>> .       device: already configured by the old application
> > >>> .       new:    start as active
> > >>> .       new:    probe the same device
> > >>
> > >> How to probe same device if is already bind to another application?
> > >> vfio-pci wont allow this.
> > >
> > > I missed that part.
> > > There is no way to share a VFIO device between 2 applications?
> >
> > As I understand multi-process shares an VFIO device between
> > many application. As far as I remember it is just required to
> > pass corresponding file descriptor to another application.
> 
> I think, Here, it is two different primary application.

Yes it is 2 primary applications.

> Even if it is primary-secondary kind of case, bus or pci driver layer
> needs fixup,
> Currently, if we add allow list same PCI device on primary and secondary,
> the second app start will fail.

Can we imagine passing a VFIO handle to the new application?

Rongwei Liu Feb. 2, 2023, 10:23 a.m. UTC | #11

Hi Jerin and Thomas

BR
Rongwei

> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Wednesday, February 1, 2023 16:47
> To: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>; Jerin Jacob
> <jerinjacobk@gmail.com>
> Cc: Rongwei Liu <rongweil@nvidia.com>; dev@dpdk.org; Matan Azrad
> <matan@nvidia.com>; Slava Ovsiienko <viacheslavo@nvidia.com>; Ori Kam
> <orika@nvidia.com>; stephen@networkplumber.org; Raslan Darawsheh
> <rasland@nvidia.com>; Ferruh Yigit <ferruh.yigit@amd.com>;
> bruce.richardson@intel.com
> Subject: Re: [PATCH v4 2/3] ethdev: add standby state for live migration
> 
> External email: Use caution opening links or attachments
> 
> 
> 01/02/2023 09:40, Jerin Jacob:
> > On Wed, Feb 1, 2023 at 1:02 PM Andrew Rybchenko
> > <andrew.rybchenko@oktetlabs.ru> wrote:
> > >
> > > On 2/1/23 01:55, Thomas Monjalon wrote:
> > > > 31/01/2023 19:14, Jerin Jacob:
> > > >> On Wed, Jan 18, 2023 at 9:15 PM Rongwei Liu <rongweil@nvidia.com>
> wrote:
> > > >>>
> > > >>> When a DPDK application must be upgraded, the traffic downtime
> > > >>> should be shortened as much as possible.
> > > >>> During the migration time, the old application may stay alive
> > > >>> while the new application is starting and being configured.
> > > >>>
> > > >>> In order to optimize the switch to the new application, the old
> > > >>> application may need to be aware of the presence of the new
> > > >>> application being prepared.
> > > >>> This is achieved with a new API allowing the user to change the
> > > >>> new application state to standby and active later.
> > > >>>
> > > >>> The added function is trying to apply the new state to all
> > > >>> probed ethdev ports. To make this API simple and easy to use,
> > > >>> the same flags have to be accepted by all devices.
> > > >>>
> > > >>> This is the scenario of operations in the old and new applications:
> > > >>> .       device: already configured by the old application
> > > >>> .       new:    start as active
> > > >>> .       new:    probe the same device
> > > >>
> > > >> How to probe same device if is already bind to another application?
> > > >> vfio-pci wont allow this.
> > > >
> > > > I missed that part.
> > > > There is no way to share a VFIO device between 2 applications?
> > >
> > > As I understand multi-process shares an VFIO device between many
> > > application. As far as I remember it is just required to pass
> > > corresponding file descriptor to another application.
Thanks for the comments. I missed this part when designed this feature.
Per my test, vfio-pci doesn't support two primary applications using the same handle(/dev/vfio/***).
We will move this back to mlx5 specific. 
> >
> > I think, Here, it is two different primary application.
> 
> Yes it is 2 primary applications.
> 
> > Even if it is primary-secondary kind of case, bus or pci driver layer
> > needs fixup, Currently, if we add allow list same PCI device on
> > primary and secondary, the second app start will fail.
> 
> Can we imagine passing a VFIO handle to the new application?
> 
>

diff mbox series

Patch

diff --git a/doc/guides/rel_notes/release_23_03.rst b/doc/guides/rel_notes/release_23_03.rst
index b8c5b68d6c..5367123f24 100644
--- a/doc/guides/rel_notes/release_23_03.rst
+++ b/doc/guides/rel_notes/release_23_03.rst
@@ -55,6 +55,13 @@  New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added process state in ethdev to improve live migration.**
+
+  Hot upgrade of an application may be accelerated by configuring
+  the new application in standby state while the old one is still active.
+  Such double ethdev configuration of the same device is possible
+  with the added process state API.
+
 
 Removed Items
 -------------
diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 6a550cfc83..4a098410d5 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -219,6 +219,23 @@  typedef int (*eth_dev_reset_t)(struct rte_eth_dev *dev);
 /** @internal Function used to detect an Ethernet device removal. */
 typedef int (*eth_is_removed_t)(struct rte_eth_dev *dev);
 
+/**
+ * @internal
+ * Set the role of the process to active or standby during live migration.
+ *
+ * @param dev
+ *   Port (ethdev) handle.
+ * @param standby
+ *   Role active if false, standby if true.
+ * @param flags
+ *   Role specific flags.
+ *
+ * @return
+ *   Negative value on error, 0 on success.
+ */
+typedef int (*eth_dev_process_set_role_t)(struct rte_eth_dev *dev,
+		bool standby, uint32_t flags);
+
 /**
  * @internal
  * Function used to enable the Rx promiscuous mode of an Ethernet device.
@@ -1186,6 +1203,9 @@  struct eth_dev_ops {
 	/** Check if the device was physically removed */
 	eth_is_removed_t           is_removed;
 
+	/** Set role during live migration */
+	eth_dev_process_set_role_t process_set_role;
+
 	eth_promiscuous_enable_t   promiscuous_enable; /**< Promiscuous ON */
 	eth_promiscuous_disable_t  promiscuous_disable;/**< Promiscuous OFF */
 	eth_allmulticast_enable_t  allmulticast_enable;/**< Rx multicast ON */
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 5d5e18db1e..3a1fb64053 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -558,6 +558,48 @@  rte_eth_dev_owner_get(const uint16_t port_id, struct rte_eth_dev_owner *owner)
 	return 0;
 }
 
+int
+rte_eth_process_set_role(bool standby, uint32_t flags)
+{
+	struct rte_eth_dev_info dev_info = {0};
+	struct rte_eth_dev *dev;
+	uint16_t port_id;
+	int ret = 0;
+
+	/* Check if all devices support process role. */
+	RTE_ETH_FOREACH_DEV(port_id) {
+		dev = &rte_eth_devices[port_id];
+		if (*dev->dev_ops->process_set_role != NULL &&
+			*dev->dev_ops->dev_infos_get != NULL &&
+			(*dev->dev_ops->dev_infos_get)(dev, &dev_info) == 0 &&
+			(dev_info.dev_capa & RTE_ETH_DEV_CAPA_PROCESS_ROLE) != 0)
+			continue;
+		rte_errno = ENOTSUP;
+		return -rte_errno;
+	}
+	/* Call the driver callbacks. */
+	RTE_ETH_FOREACH_DEV(port_id) {
+		dev = &rte_eth_devices[port_id];
+		if ((*dev->dev_ops->process_set_role)(dev, standby, flags) < 0)
+			goto failure;
+		ret++;
+	}
+	return ret;
+
+failure:
+	/* Rollback all changed devices in case one failed. */
+	if (ret) {
+		RTE_ETH_FOREACH_DEV(port_id) {
+			dev = &rte_eth_devices[port_id];
+			(*dev->dev_ops->process_set_role)(dev, !standby, flags);
+			if (--ret == 0)
+				break;
+		}
+	}
+	rte_errno = EPERM;
+	return -rte_errno;
+}
+
 int
 rte_eth_dev_socket_id(uint16_t port_id)
 {
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index c129ca1eaf..1505396ced 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1606,6 +1606,8 @@  struct rte_eth_conf {
 #define RTE_ETH_DEV_CAPA_FLOW_RULE_KEEP         RTE_BIT64(3)
 /** Device supports keeping shared flow objects across restart. */
 #define RTE_ETH_DEV_CAPA_FLOW_SHARED_OBJECT_KEEP RTE_BIT64(4)
+/** Device supports process role changing. @see rte_eth_process_set_active */
+#define RTE_ETH_DEV_CAPA_PROCESS_ROLE           RTE_BIT64(5)
 /**@}*/
 
 /*
@@ -2204,6 +2206,60 @@  int rte_eth_dev_owner_delete(const uint64_t owner_id);
 int rte_eth_dev_owner_get(const uint16_t port_id,
 		struct rte_eth_dev_owner *owner);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Set the role of the process to active or standby,
+ * affecting network traffic handling.
+ *
+ * If one device does not support this operation or fails,
+ * the whole operation is failed and rolled back.
+ *
+ * It is forbidden to have multiple processes with the same role
+ * unless only one of them is configured to handle the traffic.
+ *
+ * The application is active by default.
+ * The configuration from the active process is effective immediately
+ * while the configuration from the standby process is queued by hardware.
+ * When configuring the device from a standby process,
+ * it has no effect except for below situations:
+ *   - traffic not handled by the active process configuration
+ *   - no active process
+ *
+ * When a process is changed from a standby to an active role,
+ * all preceding configurations that are queued by hardware
+ * should become effective immediately.
+ * Before role transition, all the traffic handling configurations
+ * set by the active process should be flushed first.
+ *
+ * In summary, the operations are expected to happen in this order
+ * in "old" and "new" applications:
+ *   device: already configured by the old application
+ *   new:    start as active
+ *   new:    probe the same device
+ *   new:    set as standby
+ *   new:    configure the device
+ *   device: has configurations from old and new applications
+ *   old:    clear its device configuration
+ *   device: has only 1 configuration from new application
+ *   new:    set as active
+ *   device: downtime for connecting all to the new application
+ *   old:    shutdown
+ *
+ * @param standby
+ *   Role active if false, standby if true.
+ * @param flags
+ *   Role specific flags.
+ * @return
+ *   Positive value on success, -rte_errno value on error:
+ *   - (> 0) Number of switched devices.
+ *   - (-ENOTSUP) if not supported by a device.
+ *   - (-EPERM) if operation failed with a device.
+ */
+__rte_experimental
+int rte_eth_process_set_role(bool standby, uint32_t flags);
+
 /**
  * Get the number of ports which are usable for the application.
  *
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index 17201fbe0f..d5d3ea5421 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -298,6 +298,9 @@  EXPERIMENTAL {
 	rte_flow_get_q_aged_flows;
 	rte_mtr_meter_policy_get;
 	rte_mtr_meter_profile_get;
+
+	# added in 23.03
+	rte_eth_process_set_role;
 };
 
 INTERNAL {