[v2,2/8] net/ice: enhance debug when HW fails to transmit

Message ID 20240405144604.906695-3-david.marchand@redhat.com (mailing list archive)
State Superseded
Delegated to: Thomas Monjalon
Headers
Series Fix outer UDP checksum for Intel nics |

Checks

Context Check Description
ci/checkpatch success coding style OK

Commit Message

David Marchand April 5, 2024, 2:45 p.m. UTC
  At the moment, if the driver sets an incorrect Tx descriptor, the HW
will raise a MDD event reported as:
ice_interrupt_handler(): OICR: MDD event

Add some debug info for this case and the VF index in all logs.

Signed-off-by: David Marchand <david.marchand@redhat.com>
---
 drivers/net/ice/ice_ethdev.c | 29 +++++++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)
  

Comments

Bruce Richardson April 8, 2024, 3:23 p.m. UTC | #1
On Fri, Apr 05, 2024 at 04:45:56PM +0200, David Marchand wrote:
> At the moment, if the driver sets an incorrect Tx descriptor, the HW
> will raise a MDD event reported as:
> ice_interrupt_handler(): OICR: MDD event
> 
> Add some debug info for this case and the VF index in all logs.
> 
> Signed-off-by: David Marchand <david.marchand@redhat.com>
> ---
>  drivers/net/ice/ice_ethdev.c | 29 +++++++++++++++++++++++++----
>  1 file changed, 25 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
> index 87385d2649..fd494e6b3b 100644
> --- a/drivers/net/ice/ice_ethdev.c
> +++ b/drivers/net/ice/ice_ethdev.c
> @@ -1389,6 +1389,7 @@ ice_interrupt_handler(void *param)
>  	uint32_t oicr;
>  	uint32_t reg;
>  	uint8_t pf_num;
> +	uint16_t vf_num;
>  	uint8_t event;
>  	uint16_t queue;
>  	int ret;
> @@ -1432,28 +1433,48 @@ ice_interrupt_handler(void *param)
>  		if (reg & GL_MDET_TX_PQM_VALID_M) {
>  			pf_num = (reg & GL_MDET_TX_PQM_PF_NUM_M) >>
>  				 GL_MDET_TX_PQM_PF_NUM_S;
> +			vf_num = (reg & GL_MDET_TX_PQM_VF_NUM_M) >>
> +				 GL_MDET_TX_PQM_VF_NUM_S;
>  			event = (reg & GL_MDET_TX_PQM_MAL_TYPE_M) >>
>  				GL_MDET_TX_PQM_MAL_TYPE_S;
>  			queue = (reg & GL_MDET_TX_PQM_QNUM_M) >>
>  				GL_MDET_TX_PQM_QNUM_S;
>  
>  			PMD_DRV_LOG(WARNING, "Malicious Driver Detection event "
> -				    "%d by PQM on TX queue %d PF# %d",
> -				    event, queue, pf_num);
> +				    "%d by PQM on TX queue %d PF# %d VF# %d",
> +				    event, queue, pf_num, vf_num);
>  		}
>  
Would this output be misleading in the case where there is no VF involved
and the actual MDD error comes from the PF?

/Bruce
  
David Marchand April 11, 2024, 8:30 a.m. UTC | #2
On Mon, Apr 8, 2024 at 5:23 PM Bruce Richardson
<bruce.richardson@intel.com> wrote:
>
> On Fri, Apr 05, 2024 at 04:45:56PM +0200, David Marchand wrote:
> > At the moment, if the driver sets an incorrect Tx descriptor, the HW
> > will raise a MDD event reported as:
> > ice_interrupt_handler(): OICR: MDD event
> >
> > Add some debug info for this case and the VF index in all logs.
> >
> > Signed-off-by: David Marchand <david.marchand@redhat.com>
> > ---
> >  drivers/net/ice/ice_ethdev.c | 29 +++++++++++++++++++++++++----
> >  1 file changed, 25 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
> > index 87385d2649..fd494e6b3b 100644
> > --- a/drivers/net/ice/ice_ethdev.c
> > +++ b/drivers/net/ice/ice_ethdev.c
> > @@ -1389,6 +1389,7 @@ ice_interrupt_handler(void *param)
> >       uint32_t oicr;
> >       uint32_t reg;
> >       uint8_t pf_num;
> > +     uint16_t vf_num;
> >       uint8_t event;
> >       uint16_t queue;
> >       int ret;
> > @@ -1432,28 +1433,48 @@ ice_interrupt_handler(void *param)
> >               if (reg & GL_MDET_TX_PQM_VALID_M) {
> >                       pf_num = (reg & GL_MDET_TX_PQM_PF_NUM_M) >>
> >                                GL_MDET_TX_PQM_PF_NUM_S;
> > +                     vf_num = (reg & GL_MDET_TX_PQM_VF_NUM_M) >>
> > +                              GL_MDET_TX_PQM_VF_NUM_S;
> >                       event = (reg & GL_MDET_TX_PQM_MAL_TYPE_M) >>
> >                               GL_MDET_TX_PQM_MAL_TYPE_S;
> >                       queue = (reg & GL_MDET_TX_PQM_QNUM_M) >>
> >                               GL_MDET_TX_PQM_QNUM_S;
> >
> >                       PMD_DRV_LOG(WARNING, "Malicious Driver Detection event "
> > -                                 "%d by PQM on TX queue %d PF# %d",
> > -                                 event, queue, pf_num);
> > +                                 "%d by PQM on TX queue %d PF# %d VF# %d",
> > +                                 event, queue, pf_num, vf_num);
> >               }
> >
> Would this output be misleading in the case where there is no VF involved
> and the actual MDD error comes from the PF?

I will check, but IIRC, the queue here is an "absolute" queue number
that should help figure out whether it is the PF or a VF in the case
vf_num == 0.
  
Bruce Richardson April 11, 2024, 8:42 a.m. UTC | #3
On Thu, Apr 11, 2024 at 10:30:19AM +0200, David Marchand wrote:
> On Mon, Apr 8, 2024 at 5:23 PM Bruce Richardson
> <bruce.richardson@intel.com> wrote:
> >
> > On Fri, Apr 05, 2024 at 04:45:56PM +0200, David Marchand wrote:
> > > At the moment, if the driver sets an incorrect Tx descriptor, the HW
> > > will raise a MDD event reported as:
> > > ice_interrupt_handler(): OICR: MDD event
> > >
> > > Add some debug info for this case and the VF index in all logs.
> > >
> > > Signed-off-by: David Marchand <david.marchand@redhat.com>
> > > ---
> > >  drivers/net/ice/ice_ethdev.c | 29 +++++++++++++++++++++++++----
> > >  1 file changed, 25 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
> > > index 87385d2649..fd494e6b3b 100644
> > > --- a/drivers/net/ice/ice_ethdev.c
> > > +++ b/drivers/net/ice/ice_ethdev.c
> > > @@ -1389,6 +1389,7 @@ ice_interrupt_handler(void *param)
> > >       uint32_t oicr;
> > >       uint32_t reg;
> > >       uint8_t pf_num;
> > > +     uint16_t vf_num;
> > >       uint8_t event;
> > >       uint16_t queue;
> > >       int ret;
> > > @@ -1432,28 +1433,48 @@ ice_interrupt_handler(void *param)
> > >               if (reg & GL_MDET_TX_PQM_VALID_M) {
> > >                       pf_num = (reg & GL_MDET_TX_PQM_PF_NUM_M) >>
> > >                                GL_MDET_TX_PQM_PF_NUM_S;
> > > +                     vf_num = (reg & GL_MDET_TX_PQM_VF_NUM_M) >>
> > > +                              GL_MDET_TX_PQM_VF_NUM_S;
> > >                       event = (reg & GL_MDET_TX_PQM_MAL_TYPE_M) >>
> > >                               GL_MDET_TX_PQM_MAL_TYPE_S;
> > >                       queue = (reg & GL_MDET_TX_PQM_QNUM_M) >>
> > >                               GL_MDET_TX_PQM_QNUM_S;
> > >
> > >                       PMD_DRV_LOG(WARNING, "Malicious Driver Detection event "
> > > -                                 "%d by PQM on TX queue %d PF# %d",
> > > -                                 event, queue, pf_num);
> > > +                                 "%d by PQM on TX queue %d PF# %d VF# %d",
> > > +                                 event, queue, pf_num, vf_num);
> > >               }
> > >
> > Would this output be misleading in the case where there is no VF involved
> > and the actual MDD error comes from the PF?
> 
> I will check, but IIRC, the queue here is an "absolute" queue number
> that should help figure out whether it is the PF or a VF in the case
> vf_num == 0.
> 
Yes, that is my understanding too. Maybe in future we can use the queue
number to identify if it's a VF of not. If the PF is the error cause, I
think the VF details should be omitted.

/Bruce
  
David Marchand April 15, 2024, 8:32 a.m. UTC | #4
On Thu, Apr 11, 2024 at 10:42 AM Bruce Richardson
<bruce.richardson@intel.com> wrote:
>
> On Thu, Apr 11, 2024 at 10:30:19AM +0200, David Marchand wrote:
> > On Mon, Apr 8, 2024 at 5:23 PM Bruce Richardson
> > <bruce.richardson@intel.com> wrote:
> > >
> > > On Fri, Apr 05, 2024 at 04:45:56PM +0200, David Marchand wrote:
> > > > At the moment, if the driver sets an incorrect Tx descriptor, the HW
> > > > will raise a MDD event reported as:
> > > > ice_interrupt_handler(): OICR: MDD event
> > > >
> > > > Add some debug info for this case and the VF index in all logs.
> > > >
> > > > Signed-off-by: David Marchand <david.marchand@redhat.com>
> > > > ---
> > > >  drivers/net/ice/ice_ethdev.c | 29 +++++++++++++++++++++++++----
> > > >  1 file changed, 25 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
> > > > index 87385d2649..fd494e6b3b 100644
> > > > --- a/drivers/net/ice/ice_ethdev.c
> > > > +++ b/drivers/net/ice/ice_ethdev.c
> > > > @@ -1389,6 +1389,7 @@ ice_interrupt_handler(void *param)
> > > >       uint32_t oicr;
> > > >       uint32_t reg;
> > > >       uint8_t pf_num;
> > > > +     uint16_t vf_num;
> > > >       uint8_t event;
> > > >       uint16_t queue;
> > > >       int ret;
> > > > @@ -1432,28 +1433,48 @@ ice_interrupt_handler(void *param)
> > > >               if (reg & GL_MDET_TX_PQM_VALID_M) {
> > > >                       pf_num = (reg & GL_MDET_TX_PQM_PF_NUM_M) >>
> > > >                                GL_MDET_TX_PQM_PF_NUM_S;
> > > > +                     vf_num = (reg & GL_MDET_TX_PQM_VF_NUM_M) >>
> > > > +                              GL_MDET_TX_PQM_VF_NUM_S;
> > > >                       event = (reg & GL_MDET_TX_PQM_MAL_TYPE_M) >>
> > > >                               GL_MDET_TX_PQM_MAL_TYPE_S;
> > > >                       queue = (reg & GL_MDET_TX_PQM_QNUM_M) >>
> > > >                               GL_MDET_TX_PQM_QNUM_S;
> > > >
> > > >                       PMD_DRV_LOG(WARNING, "Malicious Driver Detection event "
> > > > -                                 "%d by PQM on TX queue %d PF# %d",
> > > > -                                 event, queue, pf_num);
> > > > +                                 "%d by PQM on TX queue %d PF# %d VF# %d",
> > > > +                                 event, queue, pf_num, vf_num);
> > > >               }
> > > >
> > > Would this output be misleading in the case where there is no VF involved
> > > and the actual MDD error comes from the PF?
> >
> > I will check, but IIRC, the queue here is an "absolute" queue number
> > that should help figure out whether it is the PF or a VF in the case
> > vf_num == 0.
> >
> Yes, that is my understanding too. Maybe in future we can use the queue
> number to identify if it's a VF of not. If the PF is the error cause, I
> think the VF details should be omitted.

IIUC there is a set of other registers (per PF and per VF) that would
make it easier to point and blame.
On the other hand, such a change seems more risky to me and I would
prefer someone from Intel owns it :-).

I'll go back with simply adding the TXPDU check which already helped
me understand where to look when debugging.
  

Patch

diff --git a/drivers/net/ice/ice_ethdev.c b/drivers/net/ice/ice_ethdev.c
index 87385d2649..fd494e6b3b 100644
--- a/drivers/net/ice/ice_ethdev.c
+++ b/drivers/net/ice/ice_ethdev.c
@@ -1389,6 +1389,7 @@  ice_interrupt_handler(void *param)
 	uint32_t oicr;
 	uint32_t reg;
 	uint8_t pf_num;
+	uint16_t vf_num;
 	uint8_t event;
 	uint16_t queue;
 	int ret;
@@ -1432,28 +1433,48 @@  ice_interrupt_handler(void *param)
 		if (reg & GL_MDET_TX_PQM_VALID_M) {
 			pf_num = (reg & GL_MDET_TX_PQM_PF_NUM_M) >>
 				 GL_MDET_TX_PQM_PF_NUM_S;
+			vf_num = (reg & GL_MDET_TX_PQM_VF_NUM_M) >>
+				 GL_MDET_TX_PQM_VF_NUM_S;
 			event = (reg & GL_MDET_TX_PQM_MAL_TYPE_M) >>
 				GL_MDET_TX_PQM_MAL_TYPE_S;
 			queue = (reg & GL_MDET_TX_PQM_QNUM_M) >>
 				GL_MDET_TX_PQM_QNUM_S;
 
 			PMD_DRV_LOG(WARNING, "Malicious Driver Detection event "
-				    "%d by PQM on TX queue %d PF# %d",
-				    event, queue, pf_num);
+				    "%d by PQM on TX queue %d PF# %d VF# %d",
+				    event, queue, pf_num, vf_num);
 		}
 
 		reg = ICE_READ_REG(hw, GL_MDET_TX_TCLAN);
 		if (reg & GL_MDET_TX_TCLAN_VALID_M) {
 			pf_num = (reg & GL_MDET_TX_TCLAN_PF_NUM_M) >>
 				 GL_MDET_TX_TCLAN_PF_NUM_S;
+			vf_num = (reg & GL_MDET_TX_TCLAN_VF_NUM_M) >>
+				 GL_MDET_TX_TCLAN_VF_NUM_S;
 			event = (reg & GL_MDET_TX_TCLAN_MAL_TYPE_M) >>
 				GL_MDET_TX_TCLAN_MAL_TYPE_S;
 			queue = (reg & GL_MDET_TX_TCLAN_QNUM_M) >>
 				GL_MDET_TX_TCLAN_QNUM_S;
 
 			PMD_DRV_LOG(WARNING, "Malicious Driver Detection event "
-				    "%d by TCLAN on TX queue %d PF# %d",
-				    event, queue, pf_num);
+				    "%d by TCLAN on TX queue %d PF# %d VF# %d",
+				    event, queue, pf_num, vf_num);
+		}
+
+		reg = ICE_READ_REG(hw, GL_MDET_TX_TDPU);
+		if (reg & GL_MDET_TX_TDPU_VALID_M) {
+			pf_num = (reg & GL_MDET_TX_TDPU_PF_NUM_M) >>
+				 GL_MDET_TX_TDPU_PF_NUM_S;
+			vf_num = (reg & GL_MDET_TX_TDPU_VF_NUM_M) >>
+				 GL_MDET_TX_TDPU_VF_NUM_S;
+			event = (reg & GL_MDET_TX_TDPU_MAL_TYPE_M) >>
+				GL_MDET_TX_TDPU_MAL_TYPE_S;
+			queue = (reg & GL_MDET_TX_TDPU_QNUM_M) >>
+				GL_MDET_TX_TDPU_QNUM_S;
+
+			PMD_DRV_LOG(WARNING, "Malicious Driver Detection event "
+				    "%d by TDPU on TX queue %d PF# %d VF# %d",
+				    event, queue, pf_num, vf_num);
 		}
 	}
 done: