[v3] net/ice: fix crash on closing representor ports

Message ID 20231101101347.3570418-1-mingjinx.ye@intel.com (mailing list archive)
State Superseded, archived
Delegated to: Qi Zhang
Headers
Series [v3] net/ice: fix crash on closing representor ports |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/loongarch-compilation success Compilation OK
ci/loongarch-unit-testing success Unit Testing PASS
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS
ci/intel-Functional success Functional PASS
ci/github-robot: build success github build: passed
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-compile-amd64-testing success Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS
ci/iol-unit-arm64-testing success Testing PASS
ci/iol-unit-amd64-testing success Testing PASS
ci/iol-sample-apps-testing success Testing PASS
ci/iol-compile-arm64-testing success Testing PASS

Commit Message

Mingjin Ye Nov. 1, 2023, 10:13 a.m. UTC
  The data resource in struct rte_eth_dev is cleared and points to NULL
when the DCF port is closed.

If the DCF representor port is closed after the DCF port is closed,
a segmentation fault occurs because the representor port accesses
the data resource released by the DCF port.

This patch checks if the resource is present before accessing.

Fixes: 5674465a32c8 ("net/ice: add DCF VLAN handling")
Fixes: da9cdcd1f372 ("net/ice: fix crash on representor port closing")
Cc: stable@dpdk.org

Signed-off-by: Mingjin Ye <mingjinx.ye@intel.com>
---
v3: New solution.
---
 drivers/net/ice/ice_dcf_vf_representor.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)
  

Comments

Qi Zhang Nov. 1, 2023, 10:48 a.m. UTC | #1
Unaddressed
> -----Original Message-----
> From: Ye, MingjinX <mingjinx.ye@intel.com>
> Sent: Wednesday, November 1, 2023 6:14 PM
> To: dev@dpdk.org
> Cc: Yang, Qiming <qiming.yang@intel.com>; Zhou, YidingX
> <yidingx.zhou@intel.com>; Ye, MingjinX <mingjinx.ye@intel.com>;
> stable@dpdk.org; Zhang, Qi Z <qi.z.zhang@intel.com>
> Subject: [PATCH v3] net/ice: fix crash on closing representor ports
> 
> The data resource in struct rte_eth_dev is cleared and points to NULL when
> the DCF port is closed.
> 
> If the DCF representor port is closed after the DCF port is closed, a
> segmentation fault occurs because the representor port accesses the data
> resource released by the DCF port.
> 
> This patch checks if the resource is present before accessing.
> 
> Fixes: 5674465a32c8 ("net/ice: add DCF VLAN handling")
> Fixes: da9cdcd1f372 ("net/ice: fix crash on representor port closing")
> Cc: stable@dpdk.org
> 
> Signed-off-by: Mingjin Ye <mingjinx.ye@intel.com>
> ---
> v3: New solution.
> ---
>  drivers/net/ice/ice_dcf_vf_representor.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ice/ice_dcf_vf_representor.c
> b/drivers/net/ice/ice_dcf_vf_representor.c
> index b9fcfc80ad..8c45e28f02 100644
> --- a/drivers/net/ice/ice_dcf_vf_representor.c
> +++ b/drivers/net/ice/ice_dcf_vf_representor.c
> @@ -111,14 +111,16 @@ ice_dcf_vf_repr_link_update(__rte_unused struct
> rte_eth_dev *ethdev,  static __rte_always_inline struct ice_dcf_hw *
> ice_dcf_vf_repr_hw(struct ice_dcf_vf_repr *repr)  {
> -	struct ice_dcf_adapter *dcf_adapter =
> -			repr->dcf_eth_dev->data->dev_private;
> +	struct rte_eth_dev_data *dcf_data = repr->dcf_eth_dev->data;

Seems this expose another issue, if dcf port already be closed, the dcf_eth_dev instance could already be reused by another driver.
So we can't assume dcf_eth_dev->data is NULL,  I think you can refine based on v2's method, but don't update dcf_valid flag in representor port's dev_stop.



> +	struct ice_dcf_adapter *dcf_adapter;
> 
> -	if (!dcf_adapter) {
> +	if (!dcf_data || !dcf_data->dev_private) {
>  		PMD_DRV_LOG(ERR, "DCF for VF representor has been
> released\n");
>  		return NULL;
>  	}
> 
> +	dcf_adapter = dcf_data->dev_private;
> +
>  	return &dcf_adapter->real_hw;
>  }
> 
> --
> 2.25.1
  
Mingjin Ye Nov. 2, 2023, 2:38 a.m. UTC | #2
> -----Original Message-----
> From: Zhang, Qi Z <qi.z.zhang@intel.com>
> Sent: Wednesday, November 1, 2023 6:49 PM
> To: Ye, MingjinX <mingjinx.ye@intel.com>; dev@dpdk.org
> Cc: Yang, Qiming <qiming.yang@intel.com>; Zhou, YidingX
> <yidingx.zhou@intel.com>; stable@dpdk.org
> Subject: RE: [PATCH v3] net/ice: fix crash on closing representor ports
> 
> 
> 
> > -----Original Message-----
> > From: Ye, MingjinX <mingjinx.ye@intel.com>
> > Sent: Wednesday, November 1, 2023 6:14 PM
> > To: dev@dpdk.org
> > Cc: Yang, Qiming <qiming.yang@intel.com>; Zhou, YidingX
> > <yidingx.zhou@intel.com>; Ye, MingjinX <mingjinx.ye@intel.com>;
> > stable@dpdk.org; Zhang, Qi Z <qi.z.zhang@intel.com>
> > Subject: [PATCH v3] net/ice: fix crash on closing representor ports
> >
> > The data resource in struct rte_eth_dev is cleared and points to NULL
> > when the DCF port is closed.
> >
> > If the DCF representor port is closed after the DCF port is closed, a
> > segmentation fault occurs because the representor port accesses the
> > data resource released by the DCF port.
> >
> > This patch checks if the resource is present before accessing.
> >
> > Fixes: 5674465a32c8 ("net/ice: add DCF VLAN handling")
> > Fixes: da9cdcd1f372 ("net/ice: fix crash on representor port closing")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Mingjin Ye <mingjinx.ye@intel.com>
> > ---
> > v3: New solution.
> > ---
> >  drivers/net/ice/ice_dcf_vf_representor.c | 8 +++++---
> >  1 file changed, 5 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/net/ice/ice_dcf_vf_representor.c
> > b/drivers/net/ice/ice_dcf_vf_representor.c
> > index b9fcfc80ad..8c45e28f02 100644
> > --- a/drivers/net/ice/ice_dcf_vf_representor.c
> > +++ b/drivers/net/ice/ice_dcf_vf_representor.c
> > @@ -111,14 +111,16 @@ ice_dcf_vf_repr_link_update(__rte_unused
> struct
> > rte_eth_dev *ethdev,  static __rte_always_inline struct ice_dcf_hw *
> > ice_dcf_vf_repr_hw(struct ice_dcf_vf_repr *repr)  {
> > -	struct ice_dcf_adapter *dcf_adapter =
> > -			repr->dcf_eth_dev->data->dev_private;
> > +	struct rte_eth_dev_data *dcf_data = repr->dcf_eth_dev->data;
> 
> Seems this expose another issue, if dcf port already be closed, the
> dcf_eth_dev instance could already be reused by another driver.
Your analysis is spot on.

The reason for the issue:
Affected by 36c46e738120c381cf663c96692454c5aa75e203 commit.
da9cdcd1f37220e87db23993d6352637d71df25b commit does not fix the issue.

v2 patch:
Notify every proxy port of DCF port status.

v3 patch:
Based on da9cdcd1f37220e87db23993d636352637d71df25b, made optimization to fix the issue. 

> So we can't assume dcf_eth_dev->data is NULL,  I think you can refine based
> on v2's method, but don't update dcf_valid flag in representor port's
> dev_stop.
Implementation difficulties:
1. When the DCF is created, all associated proxy ports are created and each proxy port (struct rte_eth_dev) pointer is recorded in an array.
2. When shutting down the DCF, all all proxy ports are notified at ice_dcf_vf_repr_stop_all. Finally the array is deleted.

Based on the original scenario, notifying all agent ports of DCF state failure is the simplest and most effective way. This is very similar to the v2 patch scenario.

Check each DCF port flag?
If the DCF port has failed and the resource has been released. the DCF port status, naturally, is not available. This is very similar to the v3 patch scenario.

Other ideas
1. Re-implement DCF port and proxy port communication, can't directly use (struct rte_eth_dev) pointer to access the device resources on the other end.

2. Designed to return an error when removing the proxy port directly. The proxy port device is managed by the control DCF (the proxy port is created by the DCF port and the removal work should be done by it).

The above 2 options will bring a huge amount of work. It seems beyond the scope of this discussion.
  
Qi Zhang Nov. 2, 2023, 3:59 a.m. UTC | #3
> -----Original Message-----
> From: Ye, MingjinX <mingjinx.ye@intel.com>
> Sent: Thursday, November 2, 2023 10:39 AM
> To: Zhang, Qi Z <qi.z.zhang@intel.com>; dev@dpdk.org
> Cc: Yang, Qiming <qiming.yang@intel.com>; Zhou, YidingX
> <yidingx.zhou@intel.com>; stable@dpdk.org
> Subject: RE: [PATCH v3] net/ice: fix crash on closing representor ports
> 
> 
> 
> > -----Original Message-----
> > From: Zhang, Qi Z <qi.z.zhang@intel.com>
> > Sent: Wednesday, November 1, 2023 6:49 PM
> > To: Ye, MingjinX <mingjinx.ye@intel.com>; dev@dpdk.org
> > Cc: Yang, Qiming <qiming.yang@intel.com>; Zhou, YidingX
> > <yidingx.zhou@intel.com>; stable@dpdk.org
> > Subject: RE: [PATCH v3] net/ice: fix crash on closing representor
> > ports
> >
> >
> >
> > > -----Original Message-----
> > > From: Ye, MingjinX <mingjinx.ye@intel.com>
> > > Sent: Wednesday, November 1, 2023 6:14 PM
> > > To: dev@dpdk.org
> > > Cc: Yang, Qiming <qiming.yang@intel.com>; Zhou, YidingX
> > > <yidingx.zhou@intel.com>; Ye, MingjinX <mingjinx.ye@intel.com>;
> > > stable@dpdk.org; Zhang, Qi Z <qi.z.zhang@intel.com>
> > > Subject: [PATCH v3] net/ice: fix crash on closing representor ports
> > >
> > > The data resource in struct rte_eth_dev is cleared and points to
> > > NULL when the DCF port is closed.
> > >
> > > If the DCF representor port is closed after the DCF port is closed,
> > > a segmentation fault occurs because the representor port accesses
> > > the data resource released by the DCF port.
> > >
> > > This patch checks if the resource is present before accessing.
> > >
> > > Fixes: 5674465a32c8 ("net/ice: add DCF VLAN handling")
> > > Fixes: da9cdcd1f372 ("net/ice: fix crash on representor port
> > > closing")
> > > Cc: stable@dpdk.org
> > >
> > > Signed-off-by: Mingjin Ye <mingjinx.ye@intel.com>
> > > ---
> > > v3: New solution.
> > > ---
> > >  drivers/net/ice/ice_dcf_vf_representor.c | 8 +++++---
> > >  1 file changed, 5 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/net/ice/ice_dcf_vf_representor.c
> > > b/drivers/net/ice/ice_dcf_vf_representor.c
> > > index b9fcfc80ad..8c45e28f02 100644
> > > --- a/drivers/net/ice/ice_dcf_vf_representor.c
> > > +++ b/drivers/net/ice/ice_dcf_vf_representor.c
> > > @@ -111,14 +111,16 @@ ice_dcf_vf_repr_link_update(__rte_unused
> > struct
> > > rte_eth_dev *ethdev,  static __rte_always_inline struct ice_dcf_hw *
> > > ice_dcf_vf_repr_hw(struct ice_dcf_vf_repr *repr)  {
> > > -	struct ice_dcf_adapter *dcf_adapter =
> > > -			repr->dcf_eth_dev->data->dev_private;
> > > +	struct rte_eth_dev_data *dcf_data = repr->dcf_eth_dev->data;
> >
> > Seems this expose another issue, if dcf port already be closed, the
> > dcf_eth_dev instance could already be reused by another driver.
> Your analysis is spot on.
> 
> The reason for the issue:
> Affected by 36c46e738120c381cf663c96692454c5aa75e203 commit.
> da9cdcd1f37220e87db23993d6352637d71df25b commit does not fix the
> issue.
> 
> v2 patch:
> Notify every proxy port of DCF port status.
> 
> v3 patch:
> Based on da9cdcd1f37220e87db23993d636352637d71df25b, made
> optimization to fix the issue.
> 
> > So we can't assume dcf_eth_dev->data is NULL,  I think you can refine
> > based on v2's method, but don't update dcf_valid flag in representor
> > port's dev_stop.
> Implementation difficulties:
> 1. When the DCF is created, all associated proxy ports are created and each
> proxy port (struct rte_eth_dev) pointer is recorded in an array.
> 2. When shutting down the DCF, all all proxy ports are notified at
> ice_dcf_vf_repr_stop_all. Finally the array is deleted.
> 
> Based on the original scenario, notifying all agent ports of DCF state failure is
> the simplest and most effective way. This is very similar to the v2 patch
> scenario.

Yes, that's why I suggest to refine based on v2 after notice the issue on v3, seems we need to maintain a per repr flag for DCF status, but need to update it carefully.

> 
> Check each DCF port flag?
> If the DCF port has failed and the resource has been released. the DCF port
> status, naturally, is not available. This is very similar to the v3 patch scenario.

Check dev->data is NULL does not work, a better way is to check eth_dev->state and device name to make sure it is still used by DCF.
But I think v2's method is more reliable.


> 
> Other ideas
> 1. Re-implement DCF port and proxy port communication, can't directly use
> (struct rte_eth_dev) pointer to access the device resources on the other end.
> 
> 2. Designed to return an error when removing the proxy port directly. The
> proxy port device is managed by the control DCF (the proxy port is created by
> the DCF port and the removal work should be done by it).
> 
> The above 2 options will bring a huge amount of work. It seems beyond the
> scope of this discussion.
>
  

Patch

diff --git a/drivers/net/ice/ice_dcf_vf_representor.c b/drivers/net/ice/ice_dcf_vf_representor.c
index b9fcfc80ad..8c45e28f02 100644
--- a/drivers/net/ice/ice_dcf_vf_representor.c
+++ b/drivers/net/ice/ice_dcf_vf_representor.c
@@ -111,14 +111,16 @@  ice_dcf_vf_repr_link_update(__rte_unused struct rte_eth_dev *ethdev,
 static __rte_always_inline struct ice_dcf_hw *
 ice_dcf_vf_repr_hw(struct ice_dcf_vf_repr *repr)
 {
-	struct ice_dcf_adapter *dcf_adapter =
-			repr->dcf_eth_dev->data->dev_private;
+	struct rte_eth_dev_data *dcf_data = repr->dcf_eth_dev->data;
+	struct ice_dcf_adapter *dcf_adapter;
 
-	if (!dcf_adapter) {
+	if (!dcf_data || !dcf_data->dev_private) {
 		PMD_DRV_LOG(ERR, "DCF for VF representor has been released\n");
 		return NULL;
 	}
 
+	dcf_adapter = dcf_data->dev_private;
+
 	return &dcf_adapter->real_hw;
 }