[v2] net/ixgbe: add proper memory barriers for some Rx functions

Message ID 20230424090532.367194-1-zhoumin@loongson.cn (mailing list archive)
State Superseded, archived
Delegated to: Qi Zhang
Series: [v2] net/ixgbe: add proper memory barriers for some Rx functions

Checks

Context Check Description
ci/checkpatch warning coding style issues
ci/loongarch-compilation success Compilation OK
ci/loongarch-unit-testing success Unit Testing PASS
ci/Intel-compilation success Compilation OK
ci/github-robot: build success github build: passed
ci/intel-Testing success Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-aarch64-unit-testing success Testing PASS
ci/iol-abi-testing success Testing PASS
ci/iol-unit-testing success Testing PASS
ci/iol-x86_64-compile-testing success Testing PASS
ci/iol-testing success Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS
ci/iol-x86_64-unit-testing success Testing PASS
ci/intel-Functional success Functional PASS

Commit Message

zhoumin April 24, 2023, 9:05 a.m. UTC
Segmentation fault has been observed while running the
ixgbe_recv_pkts_lro() function to receive packets on the Loongson 3C5000
processor, which has 64 cores and 4 NUMA nodes.

From the ixgbe_recv_pkts_lro() function, we found that as long as the first
packet has the EOP bit set and the length of this packet is less than or
equal to rxq->crc_len, the segmentation fault will definitely happen, even
on other platforms, such as x86.

This is because, when processing the first packet, first_seg->next will be
NULL. If at the same time this packet has the EOP bit set and its length is
less than or equal to rxq->crc_len, the following loop will be executed:

    for (lp = first_seg; lp->next != rxm; lp = lp->next)
        ;

We know that first_seg->next will be NULL under this condition, so the
expression lp->next->next will cause the segmentation fault.
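
A minimal sketch of how that dereference unfolds (illustrative only, not the
exact driver code):

    /* First packet, EOP bit set, data_len <= rxq->crc_len, so at this point
     * first_seg == rxm and first_seg->next == NULL. */
    for (lp = first_seg; lp->next != rxm; lp = lp->next)
        ;
    /* Iteration 1: lp->next is NULL and rxm is not, so the body runs and lp
     * becomes first_seg->next, i.e. NULL.
     * Iteration 2: the condition evaluates lp->next, which is effectively
     * first_seg->next->next, dereferencing a NULL pointer. */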

Normally, the length of the first packet with the EOP bit set will be
greater than rxq->crc_len. However, out-of-order execution by the CPU may
break the read ordering between the status field and the rest of the
descriptor fields in this function. The related code is as follows:

        rxdp = &rx_ring[rx_id];
 #1     staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);

        if (!(staterr & IXGBE_RXDADV_STAT_DD))
            break;

 #2     rxd = *rxdp;

Statement #2 may be executed before statement #1. This reordering can make
a ready packet appear to have zero length. If that packet is the first
packet and has the EOP bit set, the segmentation fault described above will
happen.

So, we should add rte_rmb() to ensure the read ordering is correct. We also
did the same in the ixgbe_recv_pkts() function to keep the rxd data valid,
even though we did not observe a segmentation fault in that function.

Signed-off-by: Min Zhou <zhoumin@loongson.cn>
---
v2:
- Call rte_rmb() on all platforms
---
 drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
 1 file changed, 3 insertions(+)
  

Comments

Qi Zhang April 28, 2023, 3:43 a.m. UTC | #1
> -----Original Message-----
> From: Min Zhou <zhoumin@loongson.cn>
> Sent: Monday, April 24, 2023 5:06 PM
> To: Yang, Qiming <qiming.yang@intel.com>; Wu, Wenjun1
> <wenjun1.wu@intel.com>; zhoumin@loongson.cn
> Cc: dev@dpdk.org; maobibo@loongson.cn
> Subject: [PATCH v2] net/ixgbe: add proper memory barriers for some Rx
> functions
> 
> Segmentation fault has been observed while running the
> ixgbe_recv_pkts_lro() function to receive packets on the Loongson 3C5000
> processor which has 64 cores and 4 NUMA nodes.
> 
> From the ixgbe_recv_pkts_lro() function, we found that as long as the first
> packet has the EOP bit set, and the length of this packet is less than or equal
> to rxq->crc_len, the segmentation fault will definitely happen even though
> on the other platforms, such as X86.
> 
> Because when processd the first packet the first_seg->next will be NULL, if at
> the same time this packet has the EOP bit set and its length is less than or
> equal to rxq->crc_len, the following loop will be excecuted:
> 
>     for (lp = first_seg; lp->next != rxm; lp = lp->next)
>         ;
> 
> We know that the first_seg->next will be NULL under this condition. So the
> expression of lp->next->next will cause the segmentation fault.
> 
> Normally, the length of the first packet with EOP bit set will be greater than
> rxq->crc_len. However, the out-of-order execution of CPU may make the
> read ordering of the status and the rest of the descriptor fields in this
> function not be correct. The related codes are as following:
> 
>         rxdp = &rx_ring[rx_id];
>  #1     staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
> 
>         if (!(staterr & IXGBE_RXDADV_STAT_DD))
>             break;
> 
>  #2     rxd = *rxdp;
> 
> The sentence #2 may be executed before sentence #1. This action is likely to
> make the ready packet zero length. If the packet is the first packet and has
> the EOP bit set, the above segmentation fault will happen.
> 
> So, we should add rte_rmb() to ensure the read ordering be correct. We also
> did the same thing in the ixgbe_recv_pkts() function to make the rxd data be
> valid even thougth we did not find segmentation fault in this function.
> 
> Signed-off-by: Min Zhou <zhoumin@loongson.cn>
> ---
> v2:
> - Make the calling of rte_rmb() for all platforms
> ---
>  drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
> index c9d6ca9efe..302a5ab7ff 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf
> **rx_pkts,
>  		staterr = rxdp->wb.upper.status_error;
>  		if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
>  			break;
> +
> +		rte_rmb();

So "volatile" does not prevent re-order with Loongson compiler?


>  		rxd = *rxdp;
> 
>  		/*
> @@ -2122,6 +2124,7 @@ ixgbe_recv_pkts_lro(void *rx_queue, struct
> rte_mbuf **rx_pkts, uint16_t nb_pkts,
>  		if (!(staterr & IXGBE_RXDADV_STAT_DD))
>  			break;
> 
> +		rte_rmb();
>  		rxd = *rxdp;
> 
>  		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
> --
> 2.31.1
  
Morten Brørup April 28, 2023, 6:27 a.m. UTC | #2
> From: Zhang, Qi Z [mailto:qi.z.zhang@intel.com]
> Sent: Friday, 28 April 2023 05.44
> 
> > From: Min Zhou <zhoumin@loongson.cn>
> > Sent: Monday, April 24, 2023 5:06 PM
> >
> > Segmentation fault has been observed while running the
> > ixgbe_recv_pkts_lro() function to receive packets on the Loongson 3C5000
> > processor which has 64 cores and 4 NUMA nodes.
> >
> > From the ixgbe_recv_pkts_lro() function, we found that as long as the first
> > packet has the EOP bit set, and the length of this packet is less than or
> equal
> > to rxq->crc_len, the segmentation fault will definitely happen even though
> > on the other platforms, such as X86.
> >
> > Because when processd the first packet the first_seg->next will be NULL, if
> at
> > the same time this packet has the EOP bit set and its length is less than or
> > equal to rxq->crc_len, the following loop will be excecuted:
> >
> >     for (lp = first_seg; lp->next != rxm; lp = lp->next)
> >         ;
> >
> > We know that the first_seg->next will be NULL under this condition. So the
> > expression of lp->next->next will cause the segmentation fault.
> >
> > Normally, the length of the first packet with EOP bit set will be greater
> than
> > rxq->crc_len. However, the out-of-order execution of CPU may make the
> > read ordering of the status and the rest of the descriptor fields in this
> > function not be correct. The related codes are as following:
> >
> >         rxdp = &rx_ring[rx_id];
> >  #1     staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
> >
> >         if (!(staterr & IXGBE_RXDADV_STAT_DD))
> >             break;
> >
> >  #2     rxd = *rxdp;
> >
> > The sentence #2 may be executed before sentence #1. This action is likely to
> > make the ready packet zero length. If the packet is the first packet and has
> > the EOP bit set, the above segmentation fault will happen.
> >
> > So, we should add rte_rmb() to ensure the read ordering be correct. We also
> > did the same thing in the ixgbe_recv_pkts() function to make the rxd data be
> > valid even thougth we did not find segmentation fault in this function.
> >
> > Signed-off-by: Min Zhou <zhoumin@loongson.cn>
> > ---
> > v2:
> > - Make the calling of rte_rmb() for all platforms
> > ---
> >  drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
> > index c9d6ca9efe..302a5ab7ff 100644
> > --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> > +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> > @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf
> > **rx_pkts,
> >  		staterr = rxdp->wb.upper.status_error;
> >  		if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
> >  			break;
> > +
> > +		rte_rmb();
> 
> So "volatile" does not prevent re-order with Loongson compiler?

"Volatile" does not prevent re-ordering on any compiler. "Volatile" only prevents caching of the variable marked volatile.

https://wiki.sei.cmu.edu/confluence/display/c/CON02-C.+Do+not+use+volatile+as+a+synchronization+primitive

Thinking out loud: I don't know the performance cost of rte_rmb(); perhaps using atomic accesses with the optimal memory ordering would be a better solution in the long term.
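
To illustrate the volatile point with the code in question (annotated sketch
only; these are exactly the reads the patch touches): both accesses go
through a volatile pointer, so the compiler emits both loads in program
order, but a weakly ordered CPU may still satisfy the second load with data
fetched before the first one.

    staterr = rxdp->wb.upper.status_error;  /* volatile read of the DD status */
    if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
        break;
    rxd = *rxdp;  /* volatile read of the whole descriptor; the CPU may have
                   * fetched this data before the status read above */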

> 
> 
> >  		rxd = *rxdp;
> >
> >  		/*
> > @@ -2122,6 +2124,7 @@ ixgbe_recv_pkts_lro(void *rx_queue, struct
> > rte_mbuf **rx_pkts, uint16_t nb_pkts,
> >  		if (!(staterr & IXGBE_RXDADV_STAT_DD))
> >  			break;
> >
> > +		rte_rmb();
> >  		rxd = *rxdp;
> >
> >  		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
> > --
> > 2.31.1
  
Konstantin Ananyev May 1, 2023, 1:29 p.m. UTC | #3
> Segmentation fault has been observed while running the
> ixgbe_recv_pkts_lro() function to receive packets on the Loongson 3C5000
> processor which has 64 cores and 4 NUMA nodes.
> 
> From the ixgbe_recv_pkts_lro() function, we found that as long as the first
> packet has the EOP bit set, and the length of this packet is less than or
> equal to rxq->crc_len, the segmentation fault will definitely happen even
> though on the other platforms, such as X86.
> 
> Because when processd the first packet the first_seg->next will be NULL, if
> at the same time this packet has the EOP bit set and its length is less
> than or equal to rxq->crc_len, the following loop will be excecuted:
> 
>     for (lp = first_seg; lp->next != rxm; lp = lp->next)
>         ;
> 
> We know that the first_seg->next will be NULL under this condition. So the
> expression of lp->next->next will cause the segmentation fault.
> 
> Normally, the length of the first packet with EOP bit set will be greater
> than rxq->crc_len. However, the out-of-order execution of CPU may make the
> read ordering of the status and the rest of the descriptor fields in this
> function not be correct. The related codes are as following:
> 
>         rxdp = &rx_ring[rx_id];
>  #1     staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
> 
>         if (!(staterr & IXGBE_RXDADV_STAT_DD))
>             break;
> 
>  #2     rxd = *rxdp;
> 
> The sentence #2 may be executed before sentence #1. This action is likely
> to make the ready packet zero length. If the packet is the first packet and
> has the EOP bit set, the above segmentation fault will happen.
> 
> So, we should add rte_rmb() to ensure the read ordering be correct. We also
> did the same thing in the ixgbe_recv_pkts() function to make the rxd data
> be valid even thougth we did not find segmentation fault in this function.
> 
> Signed-off-by: Min Zhou <zhoumin@loongson.cn>
> ---
> v2:
> - Make the calling of rte_rmb() for all platforms
> ---
>  drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
> index c9d6ca9efe..302a5ab7ff 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>  		staterr = rxdp->wb.upper.status_error;
>  		if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
>  			break;
> +
> +		rte_rmb();
>  		rxd = *rxdp;



Indeed, looks like a problem to me on systems with relaxed MO.
Strange that it was never hit on arm or ppc - cc-ing ARM/PPC maintainers.
About a fix - looks right, but a bit excessive to me -
as I understand all we need here is to prevent re-ordering by CPU itself.
So rte_smp_rmb() seems enough here.
Or might be just:
staterr = __atomic_load_n(&rxdp->wb.upper.status_error, __ATOMIC_ACQUIRE);
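
For reference, a rough sketch of how the acquire-load variant would look in
the receive loop (field and macro names as already used by the driver; only
the load and barrier placement are shown):

    /* Acquire load: reads that follow in program order, including the full
     * descriptor copy below, cannot be reordered before this load, so no
     * separate rte_rmb() is needed. */
    staterr = rte_le_to_cpu_32(__atomic_load_n(&rxdp->wb.upper.status_error,
                                               __ATOMIC_ACQUIRE));
    if (!(staterr & IXGBE_RXDADV_STAT_DD))
        break;

    rxd = *rxdp;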


>  		/*
> @@ -2122,6 +2124,7 @@ ixgbe_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts,
>  		if (!(staterr & IXGBE_RXDADV_STAT_DD))
>  			break;
>  
> +		rte_rmb();
>  		rxd = *rxdp;
>  
>  		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
> -- 
> 2.31.1
  
Ruifeng Wang May 4, 2023, 6:13 a.m. UTC | #4
> -----Original Message-----
> From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
> Sent: Monday, May 1, 2023 9:29 PM
> To: zhoumin@loongson.cn
> Cc: dev@dpdk.org; maobibo@loongson.cn; qiming.yang@intel.com; wenjun1.wu@intel.com;
> Ruifeng Wang <Ruifeng.Wang@arm.com>; drc@linux.vnet.ibm.com
> Subject: Re: [PATCH v2] net/ixgbe: add proper memory barriers for some Rx functions
> 
> > Segmentation fault has been observed while running the
> > ixgbe_recv_pkts_lro() function to receive packets on the Loongson
> > 3C5000 processor which has 64 cores and 4 NUMA nodes.
> >
> > From the ixgbe_recv_pkts_lro() function, we found that as long as the
> > first packet has the EOP bit set, and the length of this packet is
> > less than or equal to rxq->crc_len, the segmentation fault will
> > definitely happen even though on the other platforms, such as X86.
> >
> > Because when processd the first packet the first_seg->next will be
> > NULL, if at the same time this packet has the EOP bit set and its
> > length is less than or equal to rxq->crc_len, the following loop will be excecuted:
> >
> >     for (lp = first_seg; lp->next != rxm; lp = lp->next)
> >         ;
> >
> > We know that the first_seg->next will be NULL under this condition. So
> > the expression of lp->next->next will cause the segmentation fault.
> >
> > Normally, the length of the first packet with EOP bit set will be
> > greater than rxq->crc_len. However, the out-of-order execution of CPU
> > may make the read ordering of the status and the rest of the
> > descriptor fields in this function not be correct. The related codes are as following:
> >
> >         rxdp = &rx_ring[rx_id];
> >  #1     staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
> >
> >         if (!(staterr & IXGBE_RXDADV_STAT_DD))
> >             break;
> >
> >  #2     rxd = *rxdp;
> >
> > The sentence #2 may be executed before sentence #1. This action is
> > likely to make the ready packet zero length. If the packet is the
> > first packet and has the EOP bit set, the above segmentation fault will happen.
> >
> > So, we should add rte_rmb() to ensure the read ordering be correct. We
> > also did the same thing in the ixgbe_recv_pkts() function to make the
> > rxd data be valid even thougth we did not find segmentation fault in this function.
> >
> > Signed-off-by: Min Zhou <zhoumin@loongson.cn>

"Fixes" tag for backport.

> > ---
> > v2:
> > - Make the calling of rte_rmb() for all platforms
> > ---
> >  drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c
> > b/drivers/net/ixgbe/ixgbe_rxtx.c index c9d6ca9efe..302a5ab7ff 100644
> > --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> > +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> > @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
> >  		staterr = rxdp->wb.upper.status_error;
> >  		if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
> >  			break;
> > +
> > +		rte_rmb();
> >  		rxd = *rxdp;
> 
> 
> 
> Indeed, looks like a problem to me on systems with relaxed MO.
> Strange that it was never hit on arm or ppc - cc-ing ARM/PPC maintainers.

Thanks, Konstantin.

> About a fix - looks right, but a bit excessive to me - as I understand all we need here is
> to prevent re-ordering by CPU itself.
> So rte_smp_rmb() seems enough here.

Agree that rte_rmb() is excessive.
rte_smp_rmb() or rte_atomic_thread_fence(__ATOMIC_ACQUIRE) is enough.
And it is better to add a comment to justify the barrier.
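
A sketch of that variant, with the kind of justifying comment suggested here
(comment wording is mine, not taken from the driver):

    staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
    if (!(staterr & IXGBE_RXDADV_STAT_DD))
        break;

    /*
     * Use an acquire fence so that status_error (which carries the DD bit)
     * is read before the other descriptor fields are loaded below.
     */
    rte_atomic_thread_fence(__ATOMIC_ACQUIRE);
    rxd = *rxdp;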

> Or might be just:
> staterr = __atomic_load_n(&rxdp->wb.upper.status_error, __ATOMIC_ACQUIRE);
> 
> 
> >  		/*
> > @@ -2122,6 +2124,7 @@ ixgbe_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts,
> uint16_t nb_pkts,

With the proper barrier in place, I think the long comments at the beginning of this loop can be removed.

> >  		if (!(staterr & IXGBE_RXDADV_STAT_DD))
> >  			break;
> >
> > +		rte_rmb();
> >  		rxd = *rxdp;
> >
> >  		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
> > --
> > 2.31.1
  
zhoumin May 4, 2023, 12:42 p.m. UTC | #5
Hi Qi,

Thanks for your review.

On Fri, Apr 28, 2023 at 11:43AM, Zhang, Qi Z wrote:
>
>> -----Original Message-----
>> From: Min Zhou <zhoumin@loongson.cn>
>> Sent: Monday, April 24, 2023 5:06 PM
>> To: Yang, Qiming <qiming.yang@intel.com>; Wu, Wenjun1
>> <wenjun1.wu@intel.com>; zhoumin@loongson.cn
>> Cc: dev@dpdk.org; maobibo@loongson.cn
>> Subject: [PATCH v2] net/ixgbe: add proper memory barriers for some Rx
>> functions
>>
>> Segmentation fault has been observed while running the
>> ixgbe_recv_pkts_lro() function to receive packets on the Loongson 3C5000
>> processor which has 64 cores and 4 NUMA nodes.
>>
>>  From the ixgbe_recv_pkts_lro() function, we found that as long as the first
>> packet has the EOP bit set, and the length of this packet is less than or equal
>> to rxq->crc_len, the segmentation fault will definitely happen even though
>> on the other platforms, such as X86.
>>
>> Because when processd the first packet the first_seg->next will be NULL, if at
>> the same time this packet has the EOP bit set and its length is less than or
>> equal to rxq->crc_len, the following loop will be excecuted:
>>
>>      for (lp = first_seg; lp->next != rxm; lp = lp->next)
>>          ;
>>
>> We know that the first_seg->next will be NULL under this condition. So the
>> expression of lp->next->next will cause the segmentation fault.
>>
>> Normally, the length of the first packet with EOP bit set will be greater than
>> rxq->crc_len. However, the out-of-order execution of CPU may make the
>> read ordering of the status and the rest of the descriptor fields in this
>> function not be correct. The related codes are as following:
>>
>>          rxdp = &rx_ring[rx_id];
>>   #1     staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
>>
>>          if (!(staterr & IXGBE_RXDADV_STAT_DD))
>>              break;
>>
>>   #2     rxd = *rxdp;
>>
>> The sentence #2 may be executed before sentence #1. This action is likely to
>> make the ready packet zero length. If the packet is the first packet and has
>> the EOP bit set, the above segmentation fault will happen.
>>
>> So, we should add rte_rmb() to ensure the read ordering be correct. We also
>> did the same thing in the ixgbe_recv_pkts() function to make the rxd data be
>> valid even thougth we did not find segmentation fault in this function.
>>
>> Signed-off-by: Min Zhou <zhoumin@loongson.cn>
>> ---
>> v2:
>> - Make the calling of rte_rmb() for all platforms
>> ---
>>   drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>> index c9d6ca9efe..302a5ab7ff 100644
>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>> @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf
>> **rx_pkts,
>>   		staterr = rxdp->wb.upper.status_error;
>>   		if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
>>   			break;
>> +
>> +		rte_rmb();
> So "volatile" does not prevent re-order with Loongson compiler?

The LoongArch architecture [1] uses the Weak Consistency memory model, in
which memory operations can be reordered.

[1] 
https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN#overview-of-memory-consistency

>>   		rxd = *rxdp;
>>
>>   		/*
>> @@ -2122,6 +2124,7 @@ ixgbe_recv_pkts_lro(void *rx_queue, struct
>> rte_mbuf **rx_pkts, uint16_t nb_pkts,
>>   		if (!(staterr & IXGBE_RXDADV_STAT_DD))
>>   			break;
>>
>> +		rte_rmb();
>>   		rxd = *rxdp;
>>
>>   		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
>> --
>> 2.31.1

Best regards,

Min
  
zhoumin May 4, 2023, 12:58 p.m. UTC | #6
Hi Morten,

Thanks for your comments.

On Fri, Apr 28, 2023 at 2:27PM, Morten Brørup wrote:
>> From: Zhang, Qi Z [mailto:qi.z.zhang@intel.com]
>> Sent: Friday, 28 April 2023 05.44
>>
>>> From: Min Zhou <zhoumin@loongson.cn>
>>> Sent: Monday, April 24, 2023 5:06 PM
>>>
>>> Segmentation fault has been observed while running the
>>> ixgbe_recv_pkts_lro() function to receive packets on the Loongson 3C5000
>>> processor which has 64 cores and 4 NUMA nodes.
>>>
>>>  From the ixgbe_recv_pkts_lro() function, we found that as long as the first
>>> packet has the EOP bit set, and the length of this packet is less than or
>> equal
>>> to rxq->crc_len, the segmentation fault will definitely happen even though
>>> on the other platforms, such as X86.
>>>
>>> Because when processd the first packet the first_seg->next will be NULL, if
>> at
>>> the same time this packet has the EOP bit set and its length is less than or
>>> equal to rxq->crc_len, the following loop will be excecuted:
>>>
>>>      for (lp = first_seg; lp->next != rxm; lp = lp->next)
>>>          ;
>>>
>>> We know that the first_seg->next will be NULL under this condition. So the
>>> expression of lp->next->next will cause the segmentation fault.
>>>
>>> Normally, the length of the first packet with EOP bit set will be greater
>> than
>>> rxq->crc_len. However, the out-of-order execution of CPU may make the
>>> read ordering of the status and the rest of the descriptor fields in this
>>> function not be correct. The related codes are as following:
>>>
>>>          rxdp = &rx_ring[rx_id];
>>>   #1     staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
>>>
>>>          if (!(staterr & IXGBE_RXDADV_STAT_DD))
>>>              break;
>>>
>>>   #2     rxd = *rxdp;
>>>
>>> The sentence #2 may be executed before sentence #1. This action is likely to
>>> make the ready packet zero length. If the packet is the first packet and has
>>> the EOP bit set, the above segmentation fault will happen.
>>>
>>> So, we should add rte_rmb() to ensure the read ordering be correct. We also
>>> did the same thing in the ixgbe_recv_pkts() function to make the rxd data be
>>> valid even thougth we did not find segmentation fault in this function.
>>>
>>> Signed-off-by: Min Zhou <zhoumin@loongson.cn>
>>> ---
>>> v2:
>>> - Make the calling of rte_rmb() for all platforms
>>> ---
>>>   drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
>>>   1 file changed, 3 insertions(+)
>>>
>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>>> index c9d6ca9efe..302a5ab7ff 100644
>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>>> @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf
>>> **rx_pkts,
>>>   		staterr = rxdp->wb.upper.status_error;
>>>   		if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
>>>   			break;
>>> +
>>> +		rte_rmb();
>> So "volatile" does not prevent re-order with Loongson compiler?
> "Volatile" does not prevent re-ordering on any compiler. "Volatile" only prevents caching of the variable marked volatile.
>
> https://wiki.sei.cmu.edu/confluence/display/c/CON02-C.+Do+not+use+volatile+as+a+synchronization+primitive
>
> Thinking out loud: I don't know the performance cost of rte_rmb(); perhaps using atomic accesses with the optimal memory ordering would be a better solution in the long term.
Yes, rte_rmb() probably has a cost on performance. I will use a better
solution to solve the problem in the v3 patch.
>>
>>>   		rxd = *rxdp;
>>>
>>>   		/*
>>> @@ -2122,6 +2124,7 @@ ixgbe_recv_pkts_lro(void *rx_queue, struct
>>> rte_mbuf **rx_pkts, uint16_t nb_pkts,
>>>   		if (!(staterr & IXGBE_RXDADV_STAT_DD))
>>>   			break;
>>>
>>> +		rte_rmb();
>>>   		rxd = *rxdp;
>>>
>>>   		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
>>> --
>>> 2.31.1

Best regards,

Min
  
zhoumin May 4, 2023, 1:16 p.m. UTC | #7
Hi Konstantin,

Thanks for your comments.

On 2023/5/1 下午9:29, Konstantin Ananyev wrote:
>> Segmentation fault has been observed while running the
>> ixgbe_recv_pkts_lro() function to receive packets on the Loongson 3C5000
>> processor which has 64 cores and 4 NUMA nodes.
>>
>> From the ixgbe_recv_pkts_lro() function, we found that as long as the 
>> first
>> packet has the EOP bit set, and the length of this packet is less 
>> than or
>> equal to rxq->crc_len, the segmentation fault will definitely happen 
>> even
>> though on the other platforms, such as X86.
>>
>> Because when processd the first packet the first_seg->next will be 
>> NULL, if
>> at the same time this packet has the EOP bit set and its length is less
>> than or equal to rxq->crc_len, the following loop will be excecuted:
>>
>>     for (lp = first_seg; lp->next != rxm; lp = lp->next)
>>         ;
>>
>> We know that the first_seg->next will be NULL under this condition. 
>> So the
>> expression of lp->next->next will cause the segmentation fault.
>>
>> Normally, the length of the first packet with EOP bit set will be 
>> greater
>> than rxq->crc_len. However, the out-of-order execution of CPU may 
>> make the
>> read ordering of the status and the rest of the descriptor fields in 
>> this
>> function not be correct. The related codes are as following:
>>
>>         rxdp = &rx_ring[rx_id];
>>  #1     staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
>>
>>         if (!(staterr & IXGBE_RXDADV_STAT_DD))
>>             break;
>>
>>  #2     rxd = *rxdp;
>>
>> The sentence #2 may be executed before sentence #1. This action is 
>> likely
>> to make the ready packet zero length. If the packet is the first 
>> packet and
>> has the EOP bit set, the above segmentation fault will happen.
>>
>> So, we should add rte_rmb() to ensure the read ordering be correct. 
>> We also
>> did the same thing in the ixgbe_recv_pkts() function to make the rxd 
>> data
>> be valid even thougth we did not find segmentation fault in this 
>> function.
>>
>> Signed-off-by: Min Zhou <zhoumin@loongson.cn>
>> ---
>> v2:
>> - Make the calling of rte_rmb() for all platforms
>> ---
>>  drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c 
>> b/drivers/net/ixgbe/ixgbe_rxtx.c
>> index c9d6ca9efe..302a5ab7ff 100644
>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>> @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf 
>> **rx_pkts,
>>          staterr = rxdp->wb.upper.status_error;
>>          if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
>>              break;
>> +
>> +        rte_rmb();
>>          rxd = *rxdp;
>
>
>
> Indeed, looks like a problem to me on systems with relaxed MO.
> Strange that it was never hit on arm or ppc - cc-ing ARM/PPC maintainers.
The LoongArch architecture uses the Weak Consistency model, which can
cause the problem, especially in scenarios with many cores, such as the
Loongson 3C5000 with four NUMA nodes, which has 64 cores. I cannot
reproduce it on the Loongson 3C5000 with one NUMA node, which has just 16 cores.
> About a fix - looks right, but a bit excessive to me -
> as I understand all we need here is to prevent re-ordering by CPU itself.
Yes, thanks for cc-ing.
> So rte_smp_rmb() seems enough here.
> Or might be just:
> staterr = __atomic_load_n(&rxdp->wb.upper.status_error, 
> __ATOMIC_ACQUIRE);
>
Does __atomic_load_n() work on Windows if we use it to solve this problem?
>
>>          /*
>> @@ -2122,6 +2124,7 @@ ixgbe_recv_pkts_lro(void *rx_queue, struct 
>> rte_mbuf **rx_pkts, uint16_t nb_pkts,
>>          if (!(staterr & IXGBE_RXDADV_STAT_DD))
>>              break;
>>
>> +        rte_rmb();
>>          rxd = *rxdp;
>>
>>          PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
>> -- 
>> 2.31.1

Best regards,

Min
  
Morten Brørup May 4, 2023, 1:21 p.m. UTC | #8
> From: zhoumin [mailto:zhoumin@loongson.cn]
> Sent: Thursday, 4 May 2023 15.17
> 
> Hi Konstantin,
> 
> Thanks for your  comments.
> 
> On 2023/5/1 下午9:29, Konstantin Ananyev wrote:
> >> Segmentation fault has been observed while running the
> >> ixgbe_recv_pkts_lro() function to receive packets on the Loongson 3C5000
> >> processor which has 64 cores and 4 NUMA nodes.
> >>
> >> From the ixgbe_recv_pkts_lro() function, we found that as long as the
> >> first
> >> packet has the EOP bit set, and the length of this packet is less
> >> than or
> >> equal to rxq->crc_len, the segmentation fault will definitely happen
> >> even
> >> though on the other platforms, such as X86.
> >>
> >> Because when processd the first packet the first_seg->next will be
> >> NULL, if
> >> at the same time this packet has the EOP bit set and its length is less
> >> than or equal to rxq->crc_len, the following loop will be excecuted:
> >>
> >>     for (lp = first_seg; lp->next != rxm; lp = lp->next)
> >>         ;
> >>
> >> We know that the first_seg->next will be NULL under this condition.
> >> So the
> >> expression of lp->next->next will cause the segmentation fault.
> >>
> >> Normally, the length of the first packet with EOP bit set will be
> >> greater
> >> than rxq->crc_len. However, the out-of-order execution of CPU may
> >> make the
> >> read ordering of the status and the rest of the descriptor fields in
> >> this
> >> function not be correct. The related codes are as following:
> >>
> >>         rxdp = &rx_ring[rx_id];
> >>  #1     staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
> >>
> >>         if (!(staterr & IXGBE_RXDADV_STAT_DD))
> >>             break;
> >>
> >>  #2     rxd = *rxdp;
> >>
> >> The sentence #2 may be executed before sentence #1. This action is
> >> likely
> >> to make the ready packet zero length. If the packet is the first
> >> packet and
> >> has the EOP bit set, the above segmentation fault will happen.
> >>
> >> So, we should add rte_rmb() to ensure the read ordering be correct.
> >> We also
> >> did the same thing in the ixgbe_recv_pkts() function to make the rxd
> >> data
> >> be valid even thougth we did not find segmentation fault in this
> >> function.
> >>
> >> Signed-off-by: Min Zhou <zhoumin@loongson.cn>
> >> ---
> >> v2:
> >> - Make the calling of rte_rmb() for all platforms
> >> ---
> >>  drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
> >>  1 file changed, 3 insertions(+)
> >>
> >> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c
> >> b/drivers/net/ixgbe/ixgbe_rxtx.c
> >> index c9d6ca9efe..302a5ab7ff 100644
> >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> >> @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf
> >> **rx_pkts,
> >>          staterr = rxdp->wb.upper.status_error;
> >>          if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
> >>              break;
> >> +
> >> +        rte_rmb();
> >>          rxd = *rxdp;
> >
> >
> >
> > Indeed, looks like a problem to me on systems with relaxed MO.
> > Strange that it was never hit on arm or ppc - cc-ing ARM/PPC maintainers.
> The LoongArch architecture uses the Weak Consistency model which can
> cause the problem, especially in scenario with many cores, such as
> Loongson 3C5000 with four NUMA node, which has 64 cores. I cannot
> reproduce it on Loongson 3C5000 with one NUMA node, which just has 16 cores.
> > About a fix - looks right, but a bit excessive to me -
> > as I understand all we need here is to prevent re-ordering by CPU itself.
> Yes, thanks for cc-ing.
> > So rte_smp_rmb() seems enough here.
> > Or might be just:
> > staterr = __atomic_load_n(&rxdp->wb.upper.status_error,
> > __ATOMIC_ACQUIRE);
> >
> Does __atomic_load_n() work on Windows if we use it to solve this problem ?

Yes, __atomic_load_n() works on Windows too.
  
Qi Zhang May 4, 2023, 1:33 p.m. UTC | #9
> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Thursday, May 4, 2023 9:22 PM
> To: zhoumin <zhoumin@loongson.cn>; Konstantin Ananyev
> <konstantin.v.ananyev@yandex.ru>
> Cc: dev@dpdk.org; maobibo@loongson.cn; Yang, Qiming
> <qiming.yang@intel.com>; Wu, Wenjun1 <wenjun1.wu@intel.com>;
> ruifeng.wang@arm.com; drc@linux.vnet.ibm.com; Tyler Retzlaff
> <roretzla@linux.microsoft.com>
> Subject: RE: [PATCH v2] net/ixgbe: add proper memory barriers for some Rx
> functions
> 
> > From: zhoumin [mailto:zhoumin@loongson.cn]
> > Sent: Thursday, 4 May 2023 15.17
> >
> > Hi Konstantin,
> >
> > Thanks for your  comments.
> >
> > On 2023/5/1 下午9:29, Konstantin Ananyev wrote:
> > >> Segmentation fault has been observed while running the
> > >> ixgbe_recv_pkts_lro() function to receive packets on the Loongson
> > >> 3C5000 processor which has 64 cores and 4 NUMA nodes.
> > >>
> > >> From the ixgbe_recv_pkts_lro() function, we found that as long as
> > >> the first packet has the EOP bit set, and the length of this packet
> > >> is less than or equal to rxq->crc_len, the segmentation fault will
> > >> definitely happen even though on the other platforms, such as X86.

Sorry to interrupt, but I am curious why this issue still exists on x86 architecture. Can volatile be used to instruct the compiler to generate read instructions in a specific order, and does x86 guarantee not to reorder load operations?

> > >>
> > >> Because when processd the first packet the first_seg->next will be
> > >> NULL, if at the same time this packet has the EOP bit set and its
> > >> length is less than or equal to rxq->crc_len, the following loop
> > >> will be excecuted:
> > >>
> > >>     for (lp = first_seg; lp->next != rxm; lp = lp->next)
> > >>         ;
> > >>
> > >> We know that the first_seg->next will be NULL under this condition.
> > >> So the
> > >> expression of lp->next->next will cause the segmentation fault.
> > >>
> > >> Normally, the length of the first packet with EOP bit set will be
> > >> greater than rxq->crc_len. However, the out-of-order execution of
> > >> CPU may make the read ordering of the status and the rest of the
> > >> descriptor fields in this function not be correct. The related
> > >> codes are as following:
> > >>
> > >>         rxdp = &rx_ring[rx_id];
> > >>  #1     staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
> > >>
> > >>         if (!(staterr & IXGBE_RXDADV_STAT_DD))
> > >>             break;
> > >>
> > >>  #2     rxd = *rxdp;
> > >>
> > >> The sentence #2 may be executed before sentence #1. This action is
> > >> likely to make the ready packet zero length. If the packet is the
> > >> first packet and has the EOP bit set, the above segmentation fault
> > >> will happen.
> > >>
> > >> So, we should add rte_rmb() to ensure the read ordering be correct.
> > >> We also
> > >> did the same thing in the ixgbe_recv_pkts() function to make the
> > >> rxd data be valid even thougth we did not find segmentation fault
> > >> in this function.
> > >>
> > >> Signed-off-by: Min Zhou <zhoumin@loongson.cn>
> > >> ---
> > >> v2:
> > >> - Make the calling of rte_rmb() for all platforms
> > >> ---
> > >>  drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
> > >>  1 file changed, 3 insertions(+)
> > >>
> > >> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c
> > >> b/drivers/net/ixgbe/ixgbe_rxtx.c index c9d6ca9efe..302a5ab7ff
> > >> 100644
> > >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> > >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> > >> @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct
> > >> rte_mbuf **rx_pkts,
> > >>          staterr = rxdp->wb.upper.status_error;
> > >>          if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
> > >>              break;
> > >> +
> > >> +        rte_rmb();
> > >>          rxd = *rxdp;
> > >
> > >
> > >
> > > Indeed, looks like a problem to me on systems with relaxed MO.
> > > Strange that it was never hit on arm or ppc - cc-ing ARM/PPC maintainers.
> > The LoongArch architecture uses the Weak Consistency model which can
> > cause the problem, especially in scenario with many cores, such as
> > Loongson 3C5000 with four NUMA node, which has 64 cores. I cannot
> > reproduce it on Loongson 3C5000 with one NUMA node, which just has 16
> cores.
> > > About a fix - looks right, but a bit excessive to me - as I
> > > understand all we need here is to prevent re-ordering by CPU itself.
> > Yes, thanks for cc-ing.
> > > So rte_smp_rmb() seems enough here.
> > > Or might be just:
> > > staterr = __atomic_load_n(&rxdp->wb.upper.status_error,
> > > __ATOMIC_ACQUIRE);
> > >
> > Does __atomic_load_n() work on Windows if we use it to solve this
> problem ?
> 
> Yes, __atomic_load_n() works on Windows too.
>
  
zhoumin May 5, 2023, 1:45 a.m. UTC | #10
Hi Ruifeng,

Thanks for your review.

On Thur, May 4, 2023 at 2:13PM, Ruifeng Wang wrote:
>> -----Original Message-----
>> From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
>> Sent: Monday, May 1, 2023 9:29 PM
>> To: zhoumin@loongson.cn
>> Cc: dev@dpdk.org; maobibo@loongson.cn; qiming.yang@intel.com; wenjun1.wu@intel.com;
>> Ruifeng Wang <Ruifeng.Wang@arm.com>; drc@linux.vnet.ibm.com
>> Subject: Re: [PATCH v2] net/ixgbe: add proper memory barriers for some Rx functions
>>
>>> Segmentation fault has been observed while running the
>>> ixgbe_recv_pkts_lro() function to receive packets on the Loongson
>>> 3C5000 processor which has 64 cores and 4 NUMA nodes.
>>>
>>>  From the ixgbe_recv_pkts_lro() function, we found that as long as the
>>> first packet has the EOP bit set, and the length of this packet is
>>> less than or equal to rxq->crc_len, the segmentation fault will
>>> definitely happen even though on the other platforms, such as X86.
>>>
>>> Because when processd the first packet the first_seg->next will be
>>> NULL, if at the same time this packet has the EOP bit set and its
>>> length is less than or equal to rxq->crc_len, the following loop will be excecuted:
>>>
>>>      for (lp = first_seg; lp->next != rxm; lp = lp->next)
>>>          ;
>>>
>>> We know that the first_seg->next will be NULL under this condition. So
>>> the expression of lp->next->next will cause the segmentation fault.
>>>
>>> Normally, the length of the first packet with EOP bit set will be
>>> greater than rxq->crc_len. However, the out-of-order execution of CPU
>>> may make the read ordering of the status and the rest of the
>>> descriptor fields in this function not be correct. The related codes are as following:
>>>
>>>          rxdp = &rx_ring[rx_id];
>>>   #1     staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
>>>
>>>          if (!(staterr & IXGBE_RXDADV_STAT_DD))
>>>              break;
>>>
>>>   #2     rxd = *rxdp;
>>>
>>> The sentence #2 may be executed before sentence #1. This action is
>>> likely to make the ready packet zero length. If the packet is the
>>> first packet and has the EOP bit set, the above segmentation fault will happen.
>>>
>>> So, we should add rte_rmb() to ensure the read ordering be correct. We
>>> also did the same thing in the ixgbe_recv_pkts() function to make the
>>> rxd data be valid even thougth we did not find segmentation fault in this function.
>>>
>>> Signed-off-by: Min Zhou <zhoumin@loongson.cn>
> "Fixes" tag for backport.
OK, I will add the "Fixes" tag in the V3 patch.
>   
>>> ---
>>> v2:
>>> - Make the calling of rte_rmb() for all platforms
>>> ---
>>>   drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
>>>   1 file changed, 3 insertions(+)
>>>
>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c
>>> b/drivers/net/ixgbe/ixgbe_rxtx.c index c9d6ca9efe..302a5ab7ff 100644
>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>>> @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>>>   		staterr = rxdp->wb.upper.status_error;
>>>   		if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
>>>   			break;
>>> +
>>> +		rte_rmb();
>>>   		rxd = *rxdp;
>>
>>
>> Indeed, looks like a problem to me on systems with relaxed MO.
>> Strange that it was never hit on arm or ppc - cc-ing ARM/PPC maintainers.
> Thanks, Konstantin.
>
>> About a fix - looks right, but a bit excessive to me - as I understand all we need here is
>> to prevent re-ordering by CPU itself.
>> So rte_smp_rmb() seems enough here.
> Agree that rte_rmb() is excessive.
> rte_smp_rmb() or rte_atomic_thread_fence(__ATOMIC_ACQUIRE) is enough.
Thanks for your advice. I will compare rte_smp_rmb(),
__atomic_load_n() and rte_atomic_thread_fence() and choose the best one.
> And it is better to add a comment to justify the barrier.
OK, I will add a comment for this change.
>> Or might be just:
>> staterr = __atomic_load_n(&rxdp->wb.upper.status_error, __ATOMIC_ACQUIRE);
>>
>>
>>>   		/*
>>> @@ -2122,6 +2124,7 @@ ixgbe_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts,
>> uint16_t nb_pkts,
> With the proper barrier in place, I think the long comments at the beginning of this loop can be removed.
Yes, I think the long comments can be simplified once the proper barrier
is in place.
>>>   		if (!(staterr & IXGBE_RXDADV_STAT_DD))
>>>   			break;
>>>
>>> +		rte_rmb();
>>>   		rxd = *rxdp;
>>>
>>>   		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
>>> --
>>> 2.31.1

Best regards,

Min
  
zhoumin May 5, 2023, 1:54 a.m. UTC | #11
Hi Morten,

On Thur, May 4, 2023 at 9:21PM, Morten Brørup wrote:
>> From: zhoumin [mailto:zhoumin@loongson.cn]
>> Sent: Thursday, 4 May 2023 15.17
>>
>> Hi Konstantin,
>>
>> Thanks for your  comments.
>>
>> On 2023/5/1 下午9:29, Konstantin Ananyev wrote:
>>>> Segmentation fault has been observed while running the
>>>> ixgbe_recv_pkts_lro() function to receive packets on the Loongson 3C5000
>>>> processor which has 64 cores and 4 NUMA nodes.
>>>>
>>>>  From the ixgbe_recv_pkts_lro() function, we found that as long as the
>>>> first
>>>> packet has the EOP bit set, and the length of this packet is less
>>>> than or
>>>> equal to rxq->crc_len, the segmentation fault will definitely happen
>>>> even
>>>> though on the other platforms, such as X86.
>>>>
>>>> Because when processd the first packet the first_seg->next will be
>>>> NULL, if
>>>> at the same time this packet has the EOP bit set and its length is less
>>>> than or equal to rxq->crc_len, the following loop will be excecuted:
>>>>
>>>>      for (lp = first_seg; lp->next != rxm; lp = lp->next)
>>>>          ;
>>>>
>>>> We know that the first_seg->next will be NULL under this condition.
>>>> So the
>>>> expression of lp->next->next will cause the segmentation fault.
>>>>
>>>> Normally, the length of the first packet with EOP bit set will be
>>>> greater
>>>> than rxq->crc_len. However, the out-of-order execution of CPU may
>>>> make the
>>>> read ordering of the status and the rest of the descriptor fields in
>>>> this
>>>> function not be correct. The related codes are as following:
>>>>
>>>>          rxdp = &rx_ring[rx_id];
>>>>   #1     staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
>>>>
>>>>          if (!(staterr & IXGBE_RXDADV_STAT_DD))
>>>>              break;
>>>>
>>>>   #2     rxd = *rxdp;
>>>>
>>>> The sentence #2 may be executed before sentence #1. This action is
>>>> likely
>>>> to make the ready packet zero length. If the packet is the first
>>>> packet and
>>>> has the EOP bit set, the above segmentation fault will happen.
>>>>
>>>> So, we should add rte_rmb() to ensure the read ordering be correct.
>>>> We also
>>>> did the same thing in the ixgbe_recv_pkts() function to make the rxd
>>>> data
>>>> be valid even thougth we did not find segmentation fault in this
>>>> function.
>>>>
>>>> Signed-off-by: Min Zhou <zhoumin@loongson.cn>
>>>> ---
>>>> v2:
>>>> - Make the calling of rte_rmb() for all platforms
>>>> ---
>>>>   drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
>>>>   1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c
>>>> b/drivers/net/ixgbe/ixgbe_rxtx.c
>>>> index c9d6ca9efe..302a5ab7ff 100644
>>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>>>> @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf
>>>> **rx_pkts,
>>>>           staterr = rxdp->wb.upper.status_error;
>>>>           if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
>>>>               break;
>>>> +
>>>> +        rte_rmb();
>>>>           rxd = *rxdp;
>>>
>>>
>>> Indeed, looks like a problem to me on systems with relaxed MO.
>>> Strange that it was never hit on arm or ppc - cc-ing ARM/PPC maintainers.
>> The LoongArch architecture uses the Weak Consistency model which can
>> cause the problem, especially in scenario with many cores, such as
>> Loongson 3C5000 with four NUMA node, which has 64 cores. I cannot
>> reproduce it on Loongson 3C5000 with one NUMA node, which just has 16 cores.
>>> About a fix - looks right, but a bit excessive to me -
>>> as I understand all we need here is to prevent re-ordering by CPU itself.
>> Yes, thanks for cc-ing.
>>> So rte_smp_rmb() seems enough here.
>>> Or might be just:
>>> staterr = __atomic_load_n(&rxdp->wb.upper.status_error,
>>> __ATOMIC_ACQUIRE);
>>>
>> Does __atomic_load_n() work on Windows if we use it to solve this problem ?
> Yes, __atomic_load_n() works on Windows too.
>
Thank you, Morten. I got it.

I will compare those barriers and choose a proper one for this problem.


Best regards,

Min
  
zhoumin May 5, 2023, 2:42 a.m. UTC | #12
Hi Qi,

On Thur, May 4, 2023 at 9:33PM, Zhang, Qi Z wrote:
>
>> -----Original Message-----
>> From: Morten Brørup <mb@smartsharesystems.com>
>> Sent: Thursday, May 4, 2023 9:22 PM
>> To: zhoumin <zhoumin@loongson.cn>; Konstantin Ananyev
>> <konstantin.v.ananyev@yandex.ru>
>> Cc: dev@dpdk.org; maobibo@loongson.cn; Yang, Qiming
>> <qiming.yang@intel.com>; Wu, Wenjun1 <wenjun1.wu@intel.com>;
>> ruifeng.wang@arm.com; drc@linux.vnet.ibm.com; Tyler Retzlaff
>> <roretzla@linux.microsoft.com>
>> Subject: RE: [PATCH v2] net/ixgbe: add proper memory barriers for some Rx
>> functions
>>
>>> From: zhoumin [mailto:zhoumin@loongson.cn]
>>> Sent: Thursday, 4 May 2023 15.17
>>>
>>> Hi Konstantin,
>>>
>>> Thanks for your  comments.
>>>
>>> On 2023/5/1 下午9:29, Konstantin Ananyev wrote:
>>>>> Segmentation fault has been observed while running the
>>>>> ixgbe_recv_pkts_lro() function to receive packets on the Loongson
>>>>> 3C5000 processor which has 64 cores and 4 NUMA nodes.
>>>>>
>>>>>  From the ixgbe_recv_pkts_lro() function, we found that as long as
>>>>> the first packet has the EOP bit set, and the length of this packet
>>>>> is less than or equal to rxq->crc_len, the segmentation fault will
>>>>> definitely happen even though on the other platforms, such as X86.
> Sorry to interrupt, but I am curious why this issue still exists on x86 architecture. Can volatile be used to instruct the compiler to generate read instructions in a specific order, and does x86 guarantee not to reorder load operations?
Actually, I did not see the segmentation fault on x86. I just forced the
first packet with the EOP bit set to have a zero length, and then the
segmentation fault would also happen on x86. So I thought that out-of-order
access to the descriptor might make a ready packet appear to have zero
length, and that this case was the most likely cause of the segmentation
fault.
>>>>> Because when processd the first packet the first_seg->next will be
>>>>> NULL, if at the same time this packet has the EOP bit set and its
>>>>> length is less than or equal to rxq->crc_len, the following loop
>>>>> will be excecuted:
>>>>>
>>>>>      for (lp = first_seg; lp->next != rxm; lp = lp->next)
>>>>>          ;
>>>>>
>>>>> We know that the first_seg->next will be NULL under this condition.
>>>>> So the
>>>>> expression of lp->next->next will cause the segmentation fault.
>>>>>
>>>>> Normally, the length of the first packet with EOP bit set will be
>>>>> greater than rxq->crc_len. However, the out-of-order execution of
>>>>> CPU may make the read ordering of the status and the rest of the
>>>>> descriptor fields in this function not be correct. The related
>>>>> codes are as following:
>>>>>
>>>>>          rxdp = &rx_ring[rx_id];
>>>>>   #1     staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
>>>>>
>>>>>          if (!(staterr & IXGBE_RXDADV_STAT_DD))
>>>>>              break;
>>>>>
>>>>>   #2     rxd = *rxdp;
>>>>>
>>>>> The sentence #2 may be executed before sentence #1. This action is
>>>>> likely to make the ready packet zero length. If the packet is the
>>>>> first packet and has the EOP bit set, the above segmentation fault
>>>>> will happen.
>>>>>
>>>>> So, we should add rte_rmb() to ensure the read ordering be correct.
>>>>> We also
>>>>> did the same thing in the ixgbe_recv_pkts() function to make the
>>>>> rxd data be valid even thougth we did not find segmentation fault
>>>>> in this function.
>>>>>
>>>>> Signed-off-by: Min Zhou <zhoumin@loongson.cn>
>>>>> ---
>>>>> v2:
>>>>> - Make the calling of rte_rmb() for all platforms
>>>>> ---
>>>>>   drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
>>>>>   1 file changed, 3 insertions(+)
>>>>>
>>>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>> b/drivers/net/ixgbe/ixgbe_rxtx.c index c9d6ca9efe..302a5ab7ff
>>>>> 100644
>>>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>> @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct
>>>>> rte_mbuf **rx_pkts,
>>>>>           staterr = rxdp->wb.upper.status_error;
>>>>>           if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
>>>>>               break;
>>>>> +
>>>>> +        rte_rmb();
>>>>>           rxd = *rxdp;
>>>>
>>>>
>>>> Indeed, looks like a problem to me on systems with relaxed MO.
>>>> Strange that it was never hit on arm or ppc - cc-ing ARM/PPC maintainers.
>>> The LoongArch architecture uses the Weak Consistency model which can
>>> cause the problem, especially in scenario with many cores, such as
>>> Loongson 3C5000 with four NUMA node, which has 64 cores. I cannot
>>> reproduce it on Loongson 3C5000 with one NUMA node, which just has 16
>> cores.
>>>> About a fix - looks right, but a bit excessive to me - as I
>>>> understand all we need here is to prevent re-ordering by CPU itself.
>>> Yes, thanks for cc-ing.
>>>> So rte_smp_rmb() seems enough here.
>>>> Or might be just:
>>>> staterr = __atomic_load_n(&rxdp->wb.upper.status_error,
>>>> __ATOMIC_ACQUIRE);
>>>>
>>> Does __atomic_load_n() work on Windows if we use it to solve this
>> problem ?
>>
>> Yes, __atomic_load_n() works on Windows too.
>>
Best regards,

Min
  
Qi Zhang May 6, 2023, 1:30 a.m. UTC | #13
> -----Original Message-----
> From: zhoumin <zhoumin@loongson.cn>
> Sent: Friday, May 5, 2023 10:43 AM
> To: Zhang, Qi Z <qi.z.zhang@intel.com>; Morten Brørup
> <mb@smartsharesystems.com>; Konstantin Ananyev
> <konstantin.v.ananyev@yandex.ru>
> Cc: dev@dpdk.org; maobibo@loongson.cn; Yang, Qiming
> <qiming.yang@intel.com>; Wu, Wenjun1 <wenjun1.wu@intel.com>;
> ruifeng.wang@arm.com; drc@linux.vnet.ibm.com; Tyler Retzlaff
> <roretzla@linux.microsoft.com>
> Subject: Re: [PATCH v2] net/ixgbe: add proper memory barriers for some Rx
> functions
> 
> Hi Qi,
> 
> On Thur, May 4, 2023 at 9:33PM, Zhang, Qi Z wrote:
> >
> >> -----Original Message-----
> >> From: Morten Brørup <mb@smartsharesystems.com>
> >> Sent: Thursday, May 4, 2023 9:22 PM
> >> To: zhoumin <zhoumin@loongson.cn>; Konstantin Ananyev
> >> <konstantin.v.ananyev@yandex.ru>
> >> Cc: dev@dpdk.org; maobibo@loongson.cn; Yang, Qiming
> >> <qiming.yang@intel.com>; Wu, Wenjun1 <wenjun1.wu@intel.com>;
> >> ruifeng.wang@arm.com; drc@linux.vnet.ibm.com; Tyler Retzlaff
> >> <roretzla@linux.microsoft.com>
> >> Subject: RE: [PATCH v2] net/ixgbe: add proper memory barriers for
> >> some Rx functions
> >>
> >>> From: zhoumin [mailto:zhoumin@loongson.cn]
> >>> Sent: Thursday, 4 May 2023 15.17
> >>>
> >>> Hi Konstantin,
> >>>
> >>> Thanks for your  comments.
> >>>
> >>> On 2023/5/1 下午9:29, Konstantin Ananyev wrote:
> >>>>> Segmentation fault has been observed while running the
> >>>>> ixgbe_recv_pkts_lro() function to receive packets on the Loongson
> >>>>> 3C5000 processor which has 64 cores and 4 NUMA nodes.
> >>>>>
> >>>>>  From the ixgbe_recv_pkts_lro() function, we found that as long as
> >>>>> the first packet has the EOP bit set, and the length of this
> >>>>> packet is less than or equal to rxq->crc_len, the segmentation
> >>>>> fault will definitely happen even though on the other platforms, such
> as X86.
> > Sorry to interrupt, but I am curious why this issue still exists on x86
> architecture. Can volatile be used to instruct the compiler to generate read
> instructions in a specific order, and does x86 guarantee not to reorder load
> operations?
> Actually, I did not see the segmentation fault on X86. I just made the first
> packet which had the EOP bit set had a zero length, then the segmentation
> fault would happen on X86. So, I thought that the out-of-order access to the
> descriptor might be possible to make the ready packet zero length, and this
> case was more likely to cause the segmentation fault.

I see, thanks for the explanation.

> >>>>> Because when processd the first packet the first_seg->next will be
> >>>>> NULL, if at the same time this packet has the EOP bit set and its
> >>>>> length is less than or equal to rxq->crc_len, the following loop
> >>>>> will be excecuted:
> >>>>>
> >>>>>      for (lp = first_seg; lp->next != rxm; lp = lp->next)
> >>>>>          ;
> >>>>>
> >>>>> We know that the first_seg->next will be NULL under this condition.
> >>>>> So the
> >>>>> expression of lp->next->next will cause the segmentation fault.
> >>>>>
> >>>>> Normally, the length of the first packet with EOP bit set will be
> >>>>> greater than rxq->crc_len. However, the out-of-order execution of
> >>>>> CPU may make the read ordering of the status and the rest of the
> >>>>> descriptor fields in this function not be correct. The related
> >>>>> codes are as following:
> >>>>>
> >>>>>          rxdp = &rx_ring[rx_id];
> >>>>>   #1     staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
> >>>>>
> >>>>>          if (!(staterr & IXGBE_RXDADV_STAT_DD))
> >>>>>              break;
> >>>>>
> >>>>>   #2     rxd = *rxdp;
> >>>>>
> >>>>> The sentence #2 may be executed before sentence #1. This action is
> >>>>> likely to make the ready packet zero length. If the packet is the
> >>>>> first packet and has the EOP bit set, the above segmentation fault
> >>>>> will happen.
> >>>>>
> >>>>> So, we should add rte_rmb() to ensure the read ordering be correct.
> >>>>> We also
> >>>>> did the same thing in the ixgbe_recv_pkts() function to make the
> >>>>> rxd data be valid even thougth we did not find segmentation fault
> >>>>> in this function.
> >>>>>
> >>>>> Signed-off-by: Min Zhou <zhoumin@loongson.cn>
> >>>>> ---
> >>>>> v2:
> >>>>> - Make the calling of rte_rmb() for all platforms
> >>>>> ---
> >>>>>   drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
> >>>>>   1 file changed, 3 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c
> >>>>> b/drivers/net/ixgbe/ixgbe_rxtx.c index c9d6ca9efe..302a5ab7ff
> >>>>> 100644
> >>>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> >>>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> >>>>> @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct
> >>>>> rte_mbuf **rx_pkts,
> >>>>>           staterr = rxdp->wb.upper.status_error;
> >>>>>           if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
> >>>>>               break;
> >>>>> +
> >>>>> +        rte_rmb();
> >>>>>           rxd = *rxdp;
> >>>>
> >>>>
> >>>> Indeed, looks like a problem to me on systems with relaxed MO.
> >>>> Strange that it was never hit on arm or ppc - cc-ing ARM/PPC
> maintainers.
> >>> The LoongArch architecture uses the Weak Consistency model which can
> >>> cause the problem, especially in scenario with many cores, such as
> >>> Loongson 3C5000 with four NUMA node, which has 64 cores. I cannot
> >>> reproduce it on Loongson 3C5000 with one NUMA node, which just has
> >>> 16
> >> cores.
> >>>> About a fix - looks right, but a bit excessive to me - as I
> >>>> understand all we need here is to prevent re-ordering by CPU itself.
> >>> Yes, thanks for cc-ing.
> >>>> So rte_smp_rmb() seems enough here.
> >>>> Or might be just:
> >>>> staterr = __atomic_load_n(&rxdp->wb.upper.status_error,
> >>>> __ATOMIC_ACQUIRE);
> >>>>
> >>> Does __atomic_load_n() work on Windows if we use it to solve this
> >> problem ?
> >>
> >> Yes, __atomic_load_n() works on Windows too.
> >>
> Best regards,
> 
> Min
>
  

Patch

diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
index c9d6ca9efe..302a5ab7ff 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx.c
@@ -1823,6 +1823,8 @@  ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		staterr = rxdp->wb.upper.status_error;
 		if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
 			break;
+
+		rte_rmb();
 		rxd = *rxdp;
 
 		/*
@@ -2122,6 +2124,7 @@  ixgbe_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts,
 		if (!(staterr & IXGBE_RXDADV_STAT_DD))
 			break;
 
+		rte_rmb();
 		rxd = *rxdp;
 
 		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "