[v2] net/ixgbe: add proper memory barriers for some Rx functions
Commit Message
A segmentation fault has been observed while running the
ixgbe_recv_pkts_lro() function to receive packets on a Loongson 3C5000
processor, which has 64 cores and 4 NUMA nodes.
From the ixgbe_recv_pkts_lro() function, we found that as long as the first
packet has the EOP bit set and its length is less than or equal to
rxq->crc_len, the segmentation fault will definitely happen, even on
other platforms such as x86.
This is because, when processing the first packet, first_seg->next is
NULL; if at the same time this packet has the EOP bit set and its length
is less than or equal to rxq->crc_len, the following loop will be
executed:
for (lp = first_seg; lp->next != rxm; lp = lp->next)
;
We know that first_seg->next is NULL under this condition, so the
expression lp->next->next causes the segmentation fault.
Normally, the length of the first packet with the EOP bit set is greater
than rxq->crc_len. However, out-of-order execution by the CPU may break
the read ordering between the status field and the rest of the descriptor
fields in this function. The related code is as follows:
rxdp = &rx_ring[rx_id];
#1 staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
if (!(staterr & IXGBE_RXDADV_STAT_DD))
break;
#2 rxd = *rxdp;
Statement #2 may be executed before statement #1. This can make the ready
packet appear to have zero length. If that packet is the first packet and
has the EOP bit set, the segmentation fault described above happens.
Therefore, we should add rte_rmb() to ensure the correct read ordering. We
also do the same thing in the ixgbe_recv_pkts() function to make the rxd
data valid, even though we did not observe a segmentation fault there.
Signed-off-by: Min Zhou <zhoumin@loongson.cn>
---
v2:
- Call rte_rmb() on all platforms
---
drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
1 file changed, 3 insertions(+)
Comments
> -----Original Message-----
> From: Min Zhou <zhoumin@loongson.cn>
> Sent: Monday, April 24, 2023 5:06 PM
> To: Yang, Qiming <qiming.yang@intel.com>; Wu, Wenjun1
> <wenjun1.wu@intel.com>; zhoumin@loongson.cn
> Cc: dev@dpdk.org; maobibo@loongson.cn
> Subject: [PATCH v2] net/ixgbe: add proper memory barriers for some Rx
> functions
>
> Segmentation fault has been observed while running the
> ixgbe_recv_pkts_lro() function to receive packets on the Loongson 3C5000
> processor which has 64 cores and 4 NUMA nodes.
>
> From the ixgbe_recv_pkts_lro() function, we found that as long as the first
> packet has the EOP bit set, and the length of this packet is less than or equal
> to rxq->crc_len, the segmentation fault will definitely happen even though
> on the other platforms, such as X86.
>
> Because when processing the first packet the first_seg->next will be NULL, if at
> the same time this packet has the EOP bit set and its length is less than or
> equal to rxq->crc_len, the following loop will be executed:
>
> for (lp = first_seg; lp->next != rxm; lp = lp->next)
> ;
>
> We know that the first_seg->next will be NULL under this condition. So the
> expression of lp->next->next will cause the segmentation fault.
>
> Normally, the length of the first packet with EOP bit set will be greater than
> rxq->crc_len. However, the out-of-order execution of CPU may make the
> read ordering of the status and the rest of the descriptor fields in this
> function not be correct. The related codes are as following:
>
> rxdp = &rx_ring[rx_id];
> #1 staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
>
> if (!(staterr & IXGBE_RXDADV_STAT_DD))
> break;
>
> #2 rxd = *rxdp;
>
> The sentence #2 may be executed before sentence #1. This action is likely to
> make the ready packet zero length. If the packet is the first packet and has
> the EOP bit set, the above segmentation fault will happen.
>
> So, we should add rte_rmb() to ensure the read ordering be correct. We also
> did the same thing in the ixgbe_recv_pkts() function to make the rxd data be
> valid even though we did not find segmentation fault in this function.
>
> Signed-off-by: Min Zhou <zhoumin@loongson.cn>
> ---
> v2:
> - Make the calling of rte_rmb() for all platforms
> ---
> drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
> index c9d6ca9efe..302a5ab7ff 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf
> **rx_pkts,
> staterr = rxdp->wb.upper.status_error;
> if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
> break;
> +
> + rte_rmb();
So "volatile" does not prevent reordering with the Loongson compiler?
> rxd = *rxdp;
>
> /*
> @@ -2122,6 +2124,7 @@ ixgbe_recv_pkts_lro(void *rx_queue, struct
> rte_mbuf **rx_pkts, uint16_t nb_pkts,
> if (!(staterr & IXGBE_RXDADV_STAT_DD))
> break;
>
> + rte_rmb();
> rxd = *rxdp;
>
> PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
> --
> 2.31.1
> From: Zhang, Qi Z [mailto:qi.z.zhang@intel.com]
> Sent: Friday, 28 April 2023 05.44
>
> > From: Min Zhou <zhoumin@loongson.cn>
> > Sent: Monday, April 24, 2023 5:06 PM
> >
> > Segmentation fault has been observed while running the
> > ixgbe_recv_pkts_lro() function to receive packets on the Loongson 3C5000
> > processor which has 64 cores and 4 NUMA nodes.
> >
> > From the ixgbe_recv_pkts_lro() function, we found that as long as the first
> > packet has the EOP bit set, and the length of this packet is less than or
> equal
> > to rxq->crc_len, the segmentation fault will definitely happen even though
> > on the other platforms, such as X86.
> >
> > Because when processing the first packet the first_seg->next will be NULL, if
> at
> > the same time this packet has the EOP bit set and its length is less than or
> > equal to rxq->crc_len, the following loop will be executed:
> >
> > for (lp = first_seg; lp->next != rxm; lp = lp->next)
> > ;
> >
> > We know that the first_seg->next will be NULL under this condition. So the
> > expression of lp->next->next will cause the segmentation fault.
> >
> > Normally, the length of the first packet with EOP bit set will be greater
> than
> > rxq->crc_len. However, the out-of-order execution of CPU may make the
> > read ordering of the status and the rest of the descriptor fields in this
> > function not be correct. The related codes are as following:
> >
> > rxdp = &rx_ring[rx_id];
> > #1 staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
> >
> > if (!(staterr & IXGBE_RXDADV_STAT_DD))
> > break;
> >
> > #2 rxd = *rxdp;
> >
> > The sentence #2 may be executed before sentence #1. This action is likely to
> > make the ready packet zero length. If the packet is the first packet and has
> > the EOP bit set, the above segmentation fault will happen.
> >
> > So, we should add rte_rmb() to ensure the read ordering be correct. We also
> > did the same thing in the ixgbe_recv_pkts() function to make the rxd data be
> > valid even though we did not find segmentation fault in this function.
> >
> > Signed-off-by: Min Zhou <zhoumin@loongson.cn>
> > ---
> > v2:
> > - Make the calling of rte_rmb() for all platforms
> > ---
> > drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
> > index c9d6ca9efe..302a5ab7ff 100644
> > --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> > +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> > @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf
> > **rx_pkts,
> > staterr = rxdp->wb.upper.status_error;
> > if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
> > break;
> > +
> > + rte_rmb();
>
> So "volatile" does not prevent reordering with the Loongson compiler?
"Volatile" does not prevent reordering on any compiler; it only prevents the compiler from caching or optimizing away accesses to the variable marked volatile.
https://wiki.sei.cmu.edu/confluence/display/c/CON02-C.+Do+not+use+volatile+as+a+synchronization+primitive
Thinking out loud: I don't know the performance cost of rte_rmb(); perhaps using atomic accesses with the optimal memory ordering would be a better solution in the long term.
>
>
> > rxd = *rxdp;
> >
> > /*
> > @@ -2122,6 +2124,7 @@ ixgbe_recv_pkts_lro(void *rx_queue, struct
> > rte_mbuf **rx_pkts, uint16_t nb_pkts,
> > if (!(staterr & IXGBE_RXDADV_STAT_DD))
> > break;
> >
> > + rte_rmb();
> > rxd = *rxdp;
> >
> > PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
> > --
> > 2.31.1
> Segmentation fault has been observed while running the
> ixgbe_recv_pkts_lro() function to receive packets on the Loongson 3C5000
> processor which has 64 cores and 4 NUMA nodes.
>
> From the ixgbe_recv_pkts_lro() function, we found that as long as the first
> packet has the EOP bit set, and the length of this packet is less than or
> equal to rxq->crc_len, the segmentation fault will definitely happen even
> though on the other platforms, such as X86.
>
> Because when processing the first packet the first_seg->next will be NULL, if
> at the same time this packet has the EOP bit set and its length is less
> than or equal to rxq->crc_len, the following loop will be executed:
>
> for (lp = first_seg; lp->next != rxm; lp = lp->next)
> ;
>
> We know that the first_seg->next will be NULL under this condition. So the
> expression of lp->next->next will cause the segmentation fault.
>
> Normally, the length of the first packet with EOP bit set will be greater
> than rxq->crc_len. However, the out-of-order execution of CPU may make the
> read ordering of the status and the rest of the descriptor fields in this
> function not be correct. The related codes are as following:
>
> rxdp = &rx_ring[rx_id];
> #1 staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
>
> if (!(staterr & IXGBE_RXDADV_STAT_DD))
> break;
>
> #2 rxd = *rxdp;
>
> The sentence #2 may be executed before sentence #1. This action is likely
> to make the ready packet zero length. If the packet is the first packet and
> has the EOP bit set, the above segmentation fault will happen.
>
> So, we should add rte_rmb() to ensure the read ordering be correct. We also
> did the same thing in the ixgbe_recv_pkts() function to make the rxd data
> be valid even though we did not find segmentation fault in this function.
>
> Signed-off-by: Min Zhou <zhoumin@loongson.cn>
> ---
> v2:
> - Make the calling of rte_rmb() for all platforms
> ---
> drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
> index c9d6ca9efe..302a5ab7ff 100644
> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
> staterr = rxdp->wb.upper.status_error;
> if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
> break;
> +
> + rte_rmb();
> rxd = *rxdp;
Indeed, looks like a problem to me on systems with relaxed MO.
Strange that it was never hit on arm or ppc - cc-ing ARM/PPC maintainers.
About the fix - it looks right, but a bit excessive to me -
as I understand, all we need here is to prevent reordering by the CPU itself.
So rte_smp_rmb() seems enough here.
Or might be just:
staterr = __atomic_load_n(&rxdp->wb.upper.status_error, __ATOMIC_ACQUIRE);
> /*
> @@ -2122,6 +2124,7 @@ ixgbe_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts,
> if (!(staterr & IXGBE_RXDADV_STAT_DD))
> break;
>
> + rte_rmb();
> rxd = *rxdp;
>
> PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
> --
> 2.31.1
> -----Original Message-----
> From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
> Sent: Monday, May 1, 2023 9:29 PM
> To: zhoumin@loongson.cn
> Cc: dev@dpdk.org; maobibo@loongson.cn; qiming.yang@intel.com; wenjun1.wu@intel.com;
> Ruifeng Wang <Ruifeng.Wang@arm.com>; drc@linux.vnet.ibm.com
> Subject: Re: [PATCH v2] net/ixgbe: add proper memory barriers for some Rx functions
>
> > Segmentation fault has been observed while running the
> > ixgbe_recv_pkts_lro() function to receive packets on the Loongson
> > 3C5000 processor which has 64 cores and 4 NUMA nodes.
> >
> > From the ixgbe_recv_pkts_lro() function, we found that as long as the
> > first packet has the EOP bit set, and the length of this packet is
> > less than or equal to rxq->crc_len, the segmentation fault will
> > definitely happen even though on the other platforms, such as X86.
> >
> > Because when processing the first packet the first_seg->next will be
> > NULL, if at the same time this packet has the EOP bit set and its
> > length is less than or equal to rxq->crc_len, the following loop will be executed:
> >
> > for (lp = first_seg; lp->next != rxm; lp = lp->next)
> > ;
> >
> > We know that the first_seg->next will be NULL under this condition. So
> > the expression of lp->next->next will cause the segmentation fault.
> >
> > Normally, the length of the first packet with EOP bit set will be
> > greater than rxq->crc_len. However, the out-of-order execution of CPU
> > may make the read ordering of the status and the rest of the
> > descriptor fields in this function not be correct. The related codes are as following:
> >
> > rxdp = &rx_ring[rx_id];
> > #1 staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
> >
> > if (!(staterr & IXGBE_RXDADV_STAT_DD))
> > break;
> >
> > #2 rxd = *rxdp;
> >
> > The sentence #2 may be executed before sentence #1. This action is
> > likely to make the ready packet zero length. If the packet is the
> > first packet and has the EOP bit set, the above segmentation fault will happen.
> >
> > So, we should add rte_rmb() to ensure the read ordering be correct. We
> > also did the same thing in the ixgbe_recv_pkts() function to make the
> > rxd data be valid even though we did not find segmentation fault in this function.
> >
> > Signed-off-by: Min Zhou <zhoumin@loongson.cn>
A "Fixes" tag is needed for backport.
> > ---
> > v2:
> > - Make the calling of rte_rmb() for all platforms
> > ---
> > drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c
> > b/drivers/net/ixgbe/ixgbe_rxtx.c index c9d6ca9efe..302a5ab7ff 100644
> > --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> > +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> > @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
> > staterr = rxdp->wb.upper.status_error;
> > if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
> > break;
> > +
> > + rte_rmb();
> > rxd = *rxdp;
>
>
>
> Indeed, looks like a problem to me on systems with relaxed MO.
> Strange that it was never hit on arm or ppc - cc-ing ARM/PPC maintainers.
Thanks, Konstantin.
> About a fix - looks right, but a bit excessive to me - as I understand all we need here is
> to prevent re-ordering by CPU itself.
> So rte_smp_rmb() seems enough here.
Agree that rte_rmb() is excessive.
rte_smp_rmb() or rte_atomic_thread_fence(__ATOMIC_ACQUIRE) is enough.
And it is better to add a comment to justify the barrier.
> Or might be just:
> staterr = __atomic_load_n(&rxdp->wb.upper.status_error, __ATOMIC_ACQUIRE);
>
>
> > /*
> > @@ -2122,6 +2124,7 @@ ixgbe_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts,
> uint16_t nb_pkts,
With the proper barrier in place, I think the long comment at the beginning of this loop can be removed.
> > if (!(staterr & IXGBE_RXDADV_STAT_DD))
> > break;
> >
> > + rte_rmb();
> > rxd = *rxdp;
> >
> > PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
> > --
> > 2.31.1
Hi Qi,
Thanks for your review.
On Fri, Apr 28, 2023 at 11:43AM, Zhang, Qi Z wrote:
>
>> -----Original Message-----
>> From: Min Zhou <zhoumin@loongson.cn>
>> Sent: Monday, April 24, 2023 5:06 PM
>> To: Yang, Qiming <qiming.yang@intel.com>; Wu, Wenjun1
>> <wenjun1.wu@intel.com>; zhoumin@loongson.cn
>> Cc: dev@dpdk.org; maobibo@loongson.cn
>> Subject: [PATCH v2] net/ixgbe: add proper memory barriers for some Rx
>> functions
>>
>> Segmentation fault has been observed while running the
>> ixgbe_recv_pkts_lro() function to receive packets on the Loongson 3C5000
>> processor which has 64 cores and 4 NUMA nodes.
>>
>> From the ixgbe_recv_pkts_lro() function, we found that as long as the first
>> packet has the EOP bit set, and the length of this packet is less than or equal
>> to rxq->crc_len, the segmentation fault will definitely happen even though
>> on the other platforms, such as X86.
>>
>> Because when processing the first packet the first_seg->next will be NULL, if at
>> the same time this packet has the EOP bit set and its length is less than or
>> equal to rxq->crc_len, the following loop will be executed:
>>
>> for (lp = first_seg; lp->next != rxm; lp = lp->next)
>> ;
>>
>> We know that the first_seg->next will be NULL under this condition. So the
>> expression of lp->next->next will cause the segmentation fault.
>>
>> Normally, the length of the first packet with EOP bit set will be greater than
>> rxq->crc_len. However, the out-of-order execution of CPU may make the
>> read ordering of the status and the rest of the descriptor fields in this
>> function not be correct. The related codes are as following:
>>
>> rxdp = &rx_ring[rx_id];
>> #1 staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
>>
>> if (!(staterr & IXGBE_RXDADV_STAT_DD))
>> break;
>>
>> #2 rxd = *rxdp;
>>
>> The sentence #2 may be executed before sentence #1. This action is likely to
>> make the ready packet zero length. If the packet is the first packet and has
>> the EOP bit set, the above segmentation fault will happen.
>>
>> So, we should add rte_rmb() to ensure the read ordering be correct. We also
>> did the same thing in the ixgbe_recv_pkts() function to make the rxd data be
>> valid even though we did not find segmentation fault in this function.
>>
>> Signed-off-by: Min Zhou <zhoumin@loongson.cn>
>> ---
>> v2:
>> - Make the calling of rte_rmb() for all platforms
>> ---
>> drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>> index c9d6ca9efe..302a5ab7ff 100644
>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>> @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf
>> **rx_pkts,
>> staterr = rxdp->wb.upper.status_error;
>> if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
>> break;
>> +
>> + rte_rmb();
> So "volatile" does not prevent reordering with the Loongson compiler?
The LoongArch architecture [1] uses a weak consistency memory model, in
which memory operations can be reordered.
[1]
https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN#overview-of-memory-consistency
>> rxd = *rxdp;
>>
>> /*
>> @@ -2122,6 +2124,7 @@ ixgbe_recv_pkts_lro(void *rx_queue, struct
>> rte_mbuf **rx_pkts, uint16_t nb_pkts,
>> if (!(staterr & IXGBE_RXDADV_STAT_DD))
>> break;
>>
>> + rte_rmb();
>> rxd = *rxdp;
>>
>> PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
>> --
>> 2.31.1
Best regards,
Min
Hi Morten,
Thanks for your comments.
On Fri, Apr 28, 2023 at 2:27PM, Morten Brørup wrote:
>> From: Zhang, Qi Z [mailto:qi.z.zhang@intel.com]
>> Sent: Friday, 28 April 2023 05.44
>>
>>> From: Min Zhou <zhoumin@loongson.cn>
>>> Sent: Monday, April 24, 2023 5:06 PM
>>>
>>> Segmentation fault has been observed while running the
>>> ixgbe_recv_pkts_lro() function to receive packets on the Loongson 3C5000
>>> processor which has 64 cores and 4 NUMA nodes.
>>>
>>> From the ixgbe_recv_pkts_lro() function, we found that as long as the first
>>> packet has the EOP bit set, and the length of this packet is less than or
>> equal
>>> to rxq->crc_len, the segmentation fault will definitely happen even though
>>> on the other platforms, such as X86.
>>>
>>> Because when processing the first packet the first_seg->next will be NULL, if
>> at
>>> the same time this packet has the EOP bit set and its length is less than or
>>> equal to rxq->crc_len, the following loop will be executed:
>>>
>>> for (lp = first_seg; lp->next != rxm; lp = lp->next)
>>> ;
>>>
>>> We know that the first_seg->next will be NULL under this condition. So the
>>> expression of lp->next->next will cause the segmentation fault.
>>>
>>> Normally, the length of the first packet with EOP bit set will be greater
>> than
>>> rxq->crc_len. However, the out-of-order execution of CPU may make the
>>> read ordering of the status and the rest of the descriptor fields in this
>>> function not be correct. The related codes are as following:
>>>
>>> rxdp = &rx_ring[rx_id];
>>> #1 staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
>>>
>>> if (!(staterr & IXGBE_RXDADV_STAT_DD))
>>> break;
>>>
>>> #2 rxd = *rxdp;
>>>
>>> The sentence #2 may be executed before sentence #1. This action is likely to
>>> make the ready packet zero length. If the packet is the first packet and has
>>> the EOP bit set, the above segmentation fault will happen.
>>>
>>> So, we should add rte_rmb() to ensure the read ordering be correct. We also
>>> did the same thing in the ixgbe_recv_pkts() function to make the rxd data be
>>> valid even though we did not find segmentation fault in this function.
>>>
>>> Signed-off-by: Min Zhou <zhoumin@loongson.cn>
>>> ---
>>> v2:
>>> - Make the calling of rte_rmb() for all platforms
>>> ---
>>> drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
>>> 1 file changed, 3 insertions(+)
>>>
>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>>> index c9d6ca9efe..302a5ab7ff 100644
>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>>> @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf
>>> **rx_pkts,
>>> staterr = rxdp->wb.upper.status_error;
>>> if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
>>> break;
>>> +
>>> + rte_rmb();
>> So "volatile" does not prevent reordering with the Loongson compiler?
> "Volatile" does not prevent re-ordering on any compiler. "Volatile" only prevents caching of the variable marked volatile.
>
> https://wiki.sei.cmu.edu/confluence/display/c/CON02-C.+Do+not+use+volatile+as+a+synchronization+primitive
>
> Thinking out loud: I don't know the performance cost of rte_rmb(); perhaps using atomic accesses with the optimal memory ordering would be a better solution in the long term.
Yes, rte_rmb() probably has a performance cost. I will use a better
solution to solve the problem in the next version of the patch.
>>
>>> rxd = *rxdp;
>>>
>>> /*
>>> @@ -2122,6 +2124,7 @@ ixgbe_recv_pkts_lro(void *rx_queue, struct
>>> rte_mbuf **rx_pkts, uint16_t nb_pkts,
>>> if (!(staterr & IXGBE_RXDADV_STAT_DD))
>>> break;
>>>
>>> + rte_rmb();
>>> rxd = *rxdp;
>>>
>>> PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
>>> --
>>> 2.31.1
Best regards,
Min
Hi Konstantin,
Thanks for your comments.
On 2023/5/1 9:29 PM, Konstantin Ananyev wrote:
>> Segmentation fault has been observed while running the
>> ixgbe_recv_pkts_lro() function to receive packets on the Loongson 3C5000
>> processor which has 64 cores and 4 NUMA nodes.
>>
>> From the ixgbe_recv_pkts_lro() function, we found that as long as the
>> first
>> packet has the EOP bit set, and the length of this packet is less
>> than or
>> equal to rxq->crc_len, the segmentation fault will definitely happen
>> even
>> though on the other platforms, such as X86.
>>
>> Because when processing the first packet the first_seg->next will be
>> NULL, if
>> at the same time this packet has the EOP bit set and its length is less
>> than or equal to rxq->crc_len, the following loop will be executed:
>>
>> for (lp = first_seg; lp->next != rxm; lp = lp->next)
>> ;
>>
>> We know that the first_seg->next will be NULL under this condition.
>> So the
>> expression of lp->next->next will cause the segmentation fault.
>>
>> Normally, the length of the first packet with EOP bit set will be
>> greater
>> than rxq->crc_len. However, the out-of-order execution of CPU may
>> make the
>> read ordering of the status and the rest of the descriptor fields in
>> this
>> function not be correct. The related codes are as following:
>>
>> rxdp = &rx_ring[rx_id];
>> #1 staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
>>
>> if (!(staterr & IXGBE_RXDADV_STAT_DD))
>> break;
>>
>> #2 rxd = *rxdp;
>>
>> The sentence #2 may be executed before sentence #1. This action is
>> likely
>> to make the ready packet zero length. If the packet is the first
>> packet and
>> has the EOP bit set, the above segmentation fault will happen.
>>
>> So, we should add rte_rmb() to ensure the read ordering be correct.
>> We also
>> did the same thing in the ixgbe_recv_pkts() function to make the rxd
>> data
>> be valid even though we did not find segmentation fault in this
>> function.
>>
>> Signed-off-by: Min Zhou <zhoumin@loongson.cn>
>> ---
>> v2:
>> - Make the calling of rte_rmb() for all platforms
>> ---
>> drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c
>> b/drivers/net/ixgbe/ixgbe_rxtx.c
>> index c9d6ca9efe..302a5ab7ff 100644
>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>> @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf
>> **rx_pkts,
>> staterr = rxdp->wb.upper.status_error;
>> if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
>> break;
>> +
>> + rte_rmb();
>> rxd = *rxdp;
>
>
>
> Indeed, looks like a problem to me on systems with relaxed MO.
> Strange that it was never hit on arm or ppc - cc-ing ARM/PPC maintainers.
The LoongArch architecture uses a weak consistency memory model, which can
cause the problem, especially in scenarios with many cores, such as the
Loongson 3C5000 with four NUMA nodes, which has 64 cores. I cannot
reproduce it on a Loongson 3C5000 with one NUMA node, which has just 16 cores.
> About a fix - looks right, but a bit excessive to me -
> as I understand all we need here is to prevent re-ordering by CPU itself.
Yes, thanks for cc-ing.
> So rte_smp_rmb() seems enough here.
> Or might be just:
> staterr = __atomic_load_n(&rxdp->wb.upper.status_error,
> __ATOMIC_ACQUIRE);
>
Does __atomic_load_n() work on Windows if we use it to solve this problem?
>
>> /*
>> @@ -2122,6 +2124,7 @@ ixgbe_recv_pkts_lro(void *rx_queue, struct
>> rte_mbuf **rx_pkts, uint16_t nb_pkts,
>> if (!(staterr & IXGBE_RXDADV_STAT_DD))
>> break;
>>
>> + rte_rmb();
>> rxd = *rxdp;
>>
>> PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
>> --
>> 2.31.1
Best regards,
Min
> From: zhoumin [mailto:zhoumin@loongson.cn]
> Sent: Thursday, 4 May 2023 15.17
>
> Hi Konstantin,
>
> Thanks for your comments.
>
> On 2023/5/1 9:29 PM, Konstantin Ananyev wrote:
> >> Segmentation fault has been observed while running the
> >> ixgbe_recv_pkts_lro() function to receive packets on the Loongson 3C5000
> >> processor which has 64 cores and 4 NUMA nodes.
> >>
> >> From the ixgbe_recv_pkts_lro() function, we found that as long as the
> >> first
> >> packet has the EOP bit set, and the length of this packet is less
> >> than or
> >> equal to rxq->crc_len, the segmentation fault will definitely happen
> >> even
> >> though on the other platforms, such as X86.
> >>
> >> Because when processing the first packet the first_seg->next will be
> >> NULL, if
> >> at the same time this packet has the EOP bit set and its length is less
> >> than or equal to rxq->crc_len, the following loop will be executed:
> >>
> >> for (lp = first_seg; lp->next != rxm; lp = lp->next)
> >> ;
> >>
> >> We know that the first_seg->next will be NULL under this condition.
> >> So the
> >> expression of lp->next->next will cause the segmentation fault.
> >>
> >> Normally, the length of the first packet with EOP bit set will be
> >> greater
> >> than rxq->crc_len. However, the out-of-order execution of CPU may
> >> make the
> >> read ordering of the status and the rest of the descriptor fields in
> >> this
> >> function not be correct. The related codes are as following:
> >>
> >> rxdp = &rx_ring[rx_id];
> >> #1 staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
> >>
> >> if (!(staterr & IXGBE_RXDADV_STAT_DD))
> >> break;
> >>
> >> #2 rxd = *rxdp;
> >>
> >> The sentence #2 may be executed before sentence #1. This action is
> >> likely
> >> to make the ready packet zero length. If the packet is the first
> >> packet and
> >> has the EOP bit set, the above segmentation fault will happen.
> >>
> >> So, we should add rte_rmb() to ensure the read ordering be correct.
> >> We also
> >> did the same thing in the ixgbe_recv_pkts() function to make the rxd
> >> data
> >> be valid even though we did not find segmentation fault in this
> >> function.
> >>
> >> Signed-off-by: Min Zhou <zhoumin@loongson.cn>
> >> ---
> >> v2:
> >> - Make the calling of rte_rmb() for all platforms
> >> ---
> >> drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
> >> 1 file changed, 3 insertions(+)
> >>
> >> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c
> >> b/drivers/net/ixgbe/ixgbe_rxtx.c
> >> index c9d6ca9efe..302a5ab7ff 100644
> >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> >> @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf
> >> **rx_pkts,
> >> staterr = rxdp->wb.upper.status_error;
> >> if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
> >> break;
> >> +
> >> + rte_rmb();
> >> rxd = *rxdp;
> >
> >
> >
> > Indeed, looks like a problem to me on systems with relaxed MO.
> > Strange that it was never hit on arm or ppc - cc-ing ARM/PPC maintainers.
> The LoongArch architecture uses a weak consistency memory model, which can
> cause the problem, especially in scenarios with many cores, such as the
> Loongson 3C5000 with four NUMA nodes, which has 64 cores. I cannot
> reproduce it on a Loongson 3C5000 with one NUMA node, which has just 16 cores.
> > About a fix - looks right, but a bit excessive to me -
> > as I understand all we need here is to prevent re-ordering by CPU itself.
> Yes, thanks for cc-ing.
> > So rte_smp_rmb() seems enough here.
> > Or might be just:
> > staterr = __atomic_load_n(&rxdp->wb.upper.status_error,
> > __ATOMIC_ACQUIRE);
> >
> Does __atomic_load_n() work on Windows if we use it to solve this problem?
Yes, __atomic_load_n() works on Windows too.
> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Thursday, May 4, 2023 9:22 PM
> To: zhoumin <zhoumin@loongson.cn>; Konstantin Ananyev
> <konstantin.v.ananyev@yandex.ru>
> Cc: dev@dpdk.org; maobibo@loongson.cn; Yang, Qiming
> <qiming.yang@intel.com>; Wu, Wenjun1 <wenjun1.wu@intel.com>;
> ruifeng.wang@arm.com; drc@linux.vnet.ibm.com; Tyler Retzlaff
> <roretzla@linux.microsoft.com>
> Subject: RE: [PATCH v2] net/ixgbe: add proper memory barriers for some Rx
> functions
>
> > From: zhoumin [mailto:zhoumin@loongson.cn]
> > Sent: Thursday, 4 May 2023 15.17
> >
> > Hi Konstantin,
> >
> > Thanks for your comments.
> >
> > On 2023/5/1 9:29 PM, Konstantin Ananyev wrote:
> > >> Segmentation fault has been observed while running the
> > >> ixgbe_recv_pkts_lro() function to receive packets on the Loongson
> > >> 3C5000 processor which has 64 cores and 4 NUMA nodes.
> > >>
> > >> From the ixgbe_recv_pkts_lro() function, we found that as long as
> > >> the first packet has the EOP bit set, and the length of this packet
> > >> is less than or equal to rxq->crc_len, the segmentation fault will
> > >> definitely happen even though on the other platforms, such as X86.
Sorry to interrupt, but I am curious why this issue still exists on the x86 architecture. Can volatile be used to instruct the compiler to generate read instructions in a specific order, and does x86 guarantee not to reorder load operations?
> > >>
> > >> Because when processing the first packet the first_seg->next will be
> > >> NULL, if at the same time this packet has the EOP bit set and its
> > >> length is less than or equal to rxq->crc_len, the following loop
> > >> will be executed:
> > >>
> > >> for (lp = first_seg; lp->next != rxm; lp = lp->next)
> > >> ;
> > >>
> > >> We know that the first_seg->next will be NULL under this condition.
> > >> So the
> > >> expression of lp->next->next will cause the segmentation fault.
> > >>
> > >> Normally, the length of the first packet with EOP bit set will be
> > >> greater than rxq->crc_len. However, the out-of-order execution of
> > >> CPU may make the read ordering of the status and the rest of the
> > >> descriptor fields in this function not be correct. The related
> > >> codes are as following:
> > >>
> > >> rxdp = &rx_ring[rx_id];
> > >> #1 staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
> > >>
> > >> if (!(staterr & IXGBE_RXDADV_STAT_DD))
> > >> break;
> > >>
> > >> #2 rxd = *rxdp;
> > >>
> > >> The sentence #2 may be executed before sentence #1. This action is
> > >> likely to make the ready packet zero length. If the packet is the
> > >> first packet and has the EOP bit set, the above segmentation fault
> > >> will happen.
> > >>
> > >> So, we should add rte_rmb() to ensure the read ordering is correct.
> > >> We also
> > >> did the same thing in the ixgbe_recv_pkts() function to make the
> > >> rxd data be valid even though we did not find the segmentation fault
> > >> in this function.
> > >>
> > >> Signed-off-by: Min Zhou <zhoumin@loongson.cn>
> > >> ---
> > >> v2:
> > >> - Make the calling of rte_rmb() for all platforms
> > >> ---
> > >> drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
> > >> 1 file changed, 3 insertions(+)
> > >>
> > >> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c
> > >> b/drivers/net/ixgbe/ixgbe_rxtx.c index c9d6ca9efe..302a5ab7ff
> > >> 100644
> > >> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> > >> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> > >> @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct
> > >> rte_mbuf **rx_pkts,
> > >> staterr = rxdp->wb.upper.status_error;
> > >> if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
> > >> break;
> > >> +
> > >> + rte_rmb();
> > >> rxd = *rxdp;
> > >
> > >
> > >
> > > Indeed, looks like a problem to me on systems with relaxed MO.
> > > Strange that it was never hit on arm or ppc - cc-ing ARM/PPC maintainers.
> > The LoongArch architecture uses the Weak Consistency model, which can
> > cause the problem, especially in scenarios with many cores, such as the
> > Loongson 3C5000 with four NUMA nodes, which has 64 cores. I cannot
> > reproduce it on the Loongson 3C5000 with one NUMA node, which has just
> > 16 cores.
> > > About a fix - looks right, but a bit excessive to me - as I
> > > understand all we need here is to prevent re-ordering by CPU itself.
> > Yes, thanks for cc-ing.
> > > So rte_smp_rmb() seems enough here.
> > > Or might be just:
> > > staterr = __atomic_load_n(&rxdp->wb.upper.status_error,
> > > __ATOMIC_ACQUIRE);
> > >
> > Does __atomic_load_n() work on Windows if we use it to solve this
> problem?
>
> Yes, __atomic_load_n() works on Windows too.
>
Hi Ruifeng,
Thanks for your review.
On Thu, May 4, 2023 at 2:13 PM, Ruifeng Wang wrote:
>> -----Original Message-----
>> From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
>> Sent: Monday, May 1, 2023 9:29 PM
>> To: zhoumin@loongson.cn
>> Cc: dev@dpdk.org; maobibo@loongson.cn; qiming.yang@intel.com; wenjun1.wu@intel.com;
>> Ruifeng Wang <Ruifeng.Wang@arm.com>; drc@linux.vnet.ibm.com
>> Subject: Re: [PATCH v2] net/ixgbe: add proper memory barriers for some Rx functions
>>
>>> Segmentation fault has been observed while running the
>>> ixgbe_recv_pkts_lro() function to receive packets on the Loongson
>>> 3C5000 processor which has 64 cores and 4 NUMA nodes.
>>>
>>> From the ixgbe_recv_pkts_lro() function, we found that as long as the
>>> first packet has the EOP bit set, and the length of this packet is
>>> less than or equal to rxq->crc_len, the segmentation fault will
>>> definitely happen, even on other platforms such as x86.
>>>
>>> Because when processing the first packet, the first_seg->next will be
>>> NULL, if at the same time this packet has the EOP bit set and its
>>> length is less than or equal to rxq->crc_len, the following loop will be executed:
>>>
>>> for (lp = first_seg; lp->next != rxm; lp = lp->next)
>>> ;
>>>
>>> We know that the first_seg->next will be NULL under this condition. So
>>> the expression of lp->next->next will cause the segmentation fault.
>>>
>>> Normally, the length of the first packet with EOP bit set will be
>>> greater than rxq->crc_len. However, the out-of-order execution of CPU
>>> may make the read ordering of the status and the rest of the
>>> descriptor fields in this function not be correct. The related codes are as following:
>>>
>>> rxdp = &rx_ring[rx_id];
>>> #1 staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
>>>
>>> if (!(staterr & IXGBE_RXDADV_STAT_DD))
>>> break;
>>>
>>> #2 rxd = *rxdp;
>>>
>>> The sentence #2 may be executed before sentence #1. This action is
>>> likely to make the ready packet zero length. If the packet is the
>>> first packet and has the EOP bit set, the above segmentation fault will happen.
>>>
>>> So, we should add rte_rmb() to ensure the read ordering is correct. We
>>> also did the same thing in the ixgbe_recv_pkts() function to make the
>>> rxd data be valid even though we did not find the segmentation fault in this function.
>>>
>>> Signed-off-by: Min Zhou <zhoumin@loongson.cn>
> "Fixes" tag for backport.
OK, I will add the "Fixes" tag in the V3 patch.
>
>>> ---
>>> v2:
>>> - Make the calling of rte_rmb() for all platforms
>>> ---
>>> drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
>>> 1 file changed, 3 insertions(+)
>>>
>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c
>>> b/drivers/net/ixgbe/ixgbe_rxtx.c index c9d6ca9efe..302a5ab7ff 100644
>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>>> @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>>> staterr = rxdp->wb.upper.status_error;
>>> if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
>>> break;
>>> +
>>> + rte_rmb();
>>> rxd = *rxdp;
>>
>>
>> Indeed, looks like a problem to me on systems with relaxed MO.
>> Strange that it was never hit on arm or ppc - cc-ing ARM/PPC maintainers.
> Thanks, Konstantin.
>
>> About a fix - looks right, but a bit excessive to me - as I understand all we need here is
>> to prevent re-ordering by CPU itself.
>> So rte_smp_rmb() seems enough here.
> Agree that rte_rmb() is excessive.
> rte_smp_rmb() or rte_atomic_thread_fence(__ATOMIC_ACQUIRE) is enough.
Thanks for your advice. I will compare rte_smp_rmb(),
__atomic_load_n() and rte_atomic_thread_fence() and choose the most suitable one.
> And it is better to add a comment to justify the barrier.
OK, I will add a comment for this change.
>> Or might be just:
>> staterr = __atomic_load_n(&rxdp->wb.upper.status_error, __ATOMIC_ACQUIRE);
>>
>>
>>> /*
>>> @@ -2122,6 +2124,7 @@ ixgbe_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts,
>> uint16_t nb_pkts,
> With the proper barrier in place, I think the long comments at the beginning of this loop can be removed.
Yes, I think the long comments can be simplified once the proper barrier
is in place.
>>> if (!(staterr & IXGBE_RXDADV_STAT_DD))
>>> break;
>>>
>>> + rte_rmb();
>>> rxd = *rxdp;
>>>
>>> PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
>>> --
>>> 2.31.1
Best regards,
Min
Hi Morten,
On Thu, May 4, 2023 at 9:21 PM, Morten Brørup wrote:
>> From: zhoumin [mailto:zhoumin@loongson.cn]
>> Sent: Thursday, 4 May 2023 15.17
>>
>> Hi Konstantin,
>>
>> Thanks for your comments.
>>
>> On 2023/5/1 9:29 PM, Konstantin Ananyev wrote:
>>>> Segmentation fault has been observed while running the
>>>> ixgbe_recv_pkts_lro() function to receive packets on the Loongson 3C5000
>>>> processor which has 64 cores and 4 NUMA nodes.
>>>>
>>>> From the ixgbe_recv_pkts_lro() function, we found that as long as the
>>>> first
>>>> packet has the EOP bit set, and the length of this packet is less
>>>> than or
>>>> equal to rxq->crc_len, the segmentation fault will definitely happen,
>>>> even on other platforms such as x86.
>>>>
>>>> Because when processing the first packet, the first_seg->next will be
>>>> NULL, if
>>>> at the same time this packet has the EOP bit set and its length is less
>>>> than or equal to rxq->crc_len, the following loop will be executed:
>>>>
>>>> for (lp = first_seg; lp->next != rxm; lp = lp->next)
>>>> ;
>>>>
>>>> We know that the first_seg->next will be NULL under this condition.
>>>> So the
>>>> expression of lp->next->next will cause the segmentation fault.
>>>>
>>>> Normally, the length of the first packet with EOP bit set will be
>>>> greater
>>>> than rxq->crc_len. However, the out-of-order execution of CPU may
>>>> make the
>>>> read ordering of the status and the rest of the descriptor fields in
>>>> this
>>>> function not be correct. The related codes are as following:
>>>>
>>>> rxdp = &rx_ring[rx_id];
>>>> #1 staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
>>>>
>>>> if (!(staterr & IXGBE_RXDADV_STAT_DD))
>>>> break;
>>>>
>>>> #2 rxd = *rxdp;
>>>>
>>>> The sentence #2 may be executed before sentence #1. This action is
>>>> likely
>>>> to make the ready packet zero length. If the packet is the first
>>>> packet and
>>>> has the EOP bit set, the above segmentation fault will happen.
>>>>
>>>> So, we should add rte_rmb() to ensure the read ordering is correct.
>>>> We also
>>>> did the same thing in the ixgbe_recv_pkts() function to make the rxd
>>>> data
>>>> be valid even though we did not find the segmentation fault in this
>>>> function.
>>>>
>>>> Signed-off-by: Min Zhou <zhoumin@loongson.cn>
>>>> ---
>>>> v2:
>>>> - Make the calling of rte_rmb() for all platforms
>>>> ---
>>>> drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
>>>> 1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c
>>>> b/drivers/net/ixgbe/ixgbe_rxtx.c
>>>> index c9d6ca9efe..302a5ab7ff 100644
>>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>>>> @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf
>>>> **rx_pkts,
>>>> staterr = rxdp->wb.upper.status_error;
>>>> if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
>>>> break;
>>>> +
>>>> + rte_rmb();
>>>> rxd = *rxdp;
>>>
>>>
>>> Indeed, looks like a problem to me on systems with relaxed MO.
>>> Strange that it was never hit on arm or ppc - cc-ing ARM/PPC maintainers.
>> The LoongArch architecture uses the Weak Consistency model, which can
>> cause the problem, especially in scenarios with many cores, such as the
>> Loongson 3C5000 with four NUMA nodes, which has 64 cores. I cannot
>> reproduce it on the Loongson 3C5000 with one NUMA node, which has just 16 cores.
>>> About a fix - looks right, but a bit excessive to me -
>>> as I understand all we need here is to prevent re-ordering by CPU itself.
>> Yes, thanks for cc-ing.
>>> So rte_smp_rmb() seems enough here.
>>> Or might be just:
>>> staterr = __atomic_load_n(&rxdp->wb.upper.status_error,
>>> __ATOMIC_ACQUIRE);
>>>
>> Does __atomic_load_n() work on Windows if we use it to solve this problem?
> Yes, __atomic_load_n() works on Windows too.
>
Thank you, Morten. I got it.
I will compare those barriers and choose a proper one for this problem.
Best regards,
Min
Hi Qi,
On Thu, May 4, 2023 at 9:33 PM, Zhang, Qi Z wrote:
>
>> -----Original Message-----
>> From: Morten Brørup <mb@smartsharesystems.com>
>> Sent: Thursday, May 4, 2023 9:22 PM
>> To: zhoumin <zhoumin@loongson.cn>; Konstantin Ananyev
>> <konstantin.v.ananyev@yandex.ru>
>> Cc: dev@dpdk.org; maobibo@loongson.cn; Yang, Qiming
>> <qiming.yang@intel.com>; Wu, Wenjun1 <wenjun1.wu@intel.com>;
>> ruifeng.wang@arm.com; drc@linux.vnet.ibm.com; Tyler Retzlaff
>> <roretzla@linux.microsoft.com>
>> Subject: RE: [PATCH v2] net/ixgbe: add proper memory barriers for some Rx
>> functions
>>
>>> From: zhoumin [mailto:zhoumin@loongson.cn]
>>> Sent: Thursday, 4 May 2023 15.17
>>>
>>> Hi Konstantin,
>>>
>>> Thanks for your comments.
>>>
>>> On 2023/5/1 9:29 PM, Konstantin Ananyev wrote:
>>>>> Segmentation fault has been observed while running the
>>>>> ixgbe_recv_pkts_lro() function to receive packets on the Loongson
>>>>> 3C5000 processor which has 64 cores and 4 NUMA nodes.
>>>>>
>>>>> From the ixgbe_recv_pkts_lro() function, we found that as long as
>>>>> the first packet has the EOP bit set, and the length of this packet
>>>>> is less than or equal to rxq->crc_len, the segmentation fault will
>>>>> definitely happen, even on other platforms such as x86.
> Sorry to interrupt, but I am curious why this issue still exists on the x86 architecture. Can volatile be used to instruct the compiler to generate read instructions in a specific order, and does x86 guarantee not to reorder load operations?
Actually, I did not see the segmentation fault on x86. I just forced the
first packet with the EOP bit set to have a zero length, and then the
segmentation fault happened on x86. So, I suspect that out-of-order
access to the descriptor could make a ready packet appear to have zero
length, and that case is the most likely trigger for the segmentation
fault.
>>>>> Because when processing the first packet, the first_seg->next will be
>>>>> NULL, if at the same time this packet has the EOP bit set and its
>>>>> length is less than or equal to rxq->crc_len, the following loop
>>>>> will be executed:
>>>>>
>>>>> for (lp = first_seg; lp->next != rxm; lp = lp->next)
>>>>> ;
>>>>>
>>>>> We know that the first_seg->next will be NULL under this condition.
>>>>> So the
>>>>> expression of lp->next->next will cause the segmentation fault.
>>>>>
>>>>> Normally, the length of the first packet with EOP bit set will be
>>>>> greater than rxq->crc_len. However, the out-of-order execution of
>>>>> CPU may make the read ordering of the status and the rest of the
>>>>> descriptor fields in this function not be correct. The related
>>>>> codes are as following:
>>>>>
>>>>> rxdp = &rx_ring[rx_id];
>>>>> #1 staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
>>>>>
>>>>> if (!(staterr & IXGBE_RXDADV_STAT_DD))
>>>>> break;
>>>>>
>>>>> #2 rxd = *rxdp;
>>>>>
>>>>> The sentence #2 may be executed before sentence #1. This action is
>>>>> likely to make the ready packet zero length. If the packet is the
>>>>> first packet and has the EOP bit set, the above segmentation fault
>>>>> will happen.
>>>>>
>>>>> So, we should add rte_rmb() to ensure the read ordering is correct.
>>>>> We also
>>>>> did the same thing in the ixgbe_recv_pkts() function to make the
>>>>> rxd data be valid even though we did not find the segmentation fault
>>>>> in this function.
>>>>>
>>>>> Signed-off-by: Min Zhou <zhoumin@loongson.cn>
>>>>> ---
>>>>> v2:
>>>>> - Make the calling of rte_rmb() for all platforms
>>>>> ---
>>>>> drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
>>>>> 1 file changed, 3 insertions(+)
>>>>>
>>>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>> b/drivers/net/ixgbe/ixgbe_rxtx.c index c9d6ca9efe..302a5ab7ff
>>>>> 100644
>>>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>> @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct
>>>>> rte_mbuf **rx_pkts,
>>>>> staterr = rxdp->wb.upper.status_error;
>>>>> if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
>>>>> break;
>>>>> +
>>>>> + rte_rmb();
>>>>> rxd = *rxdp;
>>>>
>>>>
>>>> Indeed, looks like a problem to me on systems with relaxed MO.
>>>> Strange that it was never hit on arm or ppc - cc-ing ARM/PPC maintainers.
>>> The LoongArch architecture uses the Weak Consistency model, which can
>>> cause the problem, especially in scenarios with many cores, such as the
>>> Loongson 3C5000 with four NUMA nodes, which has 64 cores. I cannot
>>> reproduce it on the Loongson 3C5000 with one NUMA node, which has just
>>> 16 cores.
>>>> About a fix - looks right, but a bit excessive to me - as I
>>>> understand all we need here is to prevent re-ordering by CPU itself.
>>> Yes, thanks for cc-ing.
>>>> So rte_smp_rmb() seems enough here.
>>>> Or might be just:
>>>> staterr = __atomic_load_n(&rxdp->wb.upper.status_error,
>>>> __ATOMIC_ACQUIRE);
>>>>
>>> Does __atomic_load_n() work on Windows if we use it to solve this
>> problem?
>>
>> Yes, __atomic_load_n() works on Windows too.
>>
Best regards,
Min
> -----Original Message-----
> From: zhoumin <zhoumin@loongson.cn>
> Sent: Friday, May 5, 2023 10:43 AM
> To: Zhang, Qi Z <qi.z.zhang@intel.com>; Morten Brørup
> <mb@smartsharesystems.com>; Konstantin Ananyev
> <konstantin.v.ananyev@yandex.ru>
> Cc: dev@dpdk.org; maobibo@loongson.cn; Yang, Qiming
> <qiming.yang@intel.com>; Wu, Wenjun1 <wenjun1.wu@intel.com>;
> ruifeng.wang@arm.com; drc@linux.vnet.ibm.com; Tyler Retzlaff
> <roretzla@linux.microsoft.com>
> Subject: Re: [PATCH v2] net/ixgbe: add proper memory barriers for some Rx
> functions
>
> Hi Qi,
>
> On Thu, May 4, 2023 at 9:33 PM, Zhang, Qi Z wrote:
> >
> >> -----Original Message-----
> >> From: Morten Brørup <mb@smartsharesystems.com>
> >> Sent: Thursday, May 4, 2023 9:22 PM
> >> To: zhoumin <zhoumin@loongson.cn>; Konstantin Ananyev
> >> <konstantin.v.ananyev@yandex.ru>
> >> Cc: dev@dpdk.org; maobibo@loongson.cn; Yang, Qiming
> >> <qiming.yang@intel.com>; Wu, Wenjun1 <wenjun1.wu@intel.com>;
> >> ruifeng.wang@arm.com; drc@linux.vnet.ibm.com; Tyler Retzlaff
> >> <roretzla@linux.microsoft.com>
> >> Subject: RE: [PATCH v2] net/ixgbe: add proper memory barriers for
> >> some Rx functions
> >>
> >>> From: zhoumin [mailto:zhoumin@loongson.cn]
> >>> Sent: Thursday, 4 May 2023 15.17
> >>>
> >>> Hi Konstantin,
> >>>
> >>> Thanks for your comments.
> >>>
> >>> On 2023/5/1 9:29 PM, Konstantin Ananyev wrote:
> >>>>> Segmentation fault has been observed while running the
> >>>>> ixgbe_recv_pkts_lro() function to receive packets on the Loongson
> >>>>> 3C5000 processor which has 64 cores and 4 NUMA nodes.
> >>>>>
> >>>>> From the ixgbe_recv_pkts_lro() function, we found that as long as
> >>>>> the first packet has the EOP bit set, and the length of this
> >>>>> packet is less than or equal to rxq->crc_len, the segmentation
> >>>>> fault will definitely happen, even on other platforms such as x86.
> > Sorry to interrupt, but I am curious why this issue still exists on the x86
> architecture. Can volatile be used to instruct the compiler to generate read
> instructions in a specific order, and does x86 guarantee not to reorder load
> operations?
> Actually, I did not see the segmentation fault on x86. I just forced the first
> packet with the EOP bit set to have a zero length, and then the segmentation
> fault happened on x86. So, I suspect that out-of-order access to the
> descriptor could make a ready packet appear to have zero length, and that
> case is the most likely trigger for the segmentation fault.
I see, thanks for the explanation.
> >>>>> Because when processing the first packet, the first_seg->next will be
> >>>>> NULL, if at the same time this packet has the EOP bit set and its
> >>>>> length is less than or equal to rxq->crc_len, the following loop
> >>>>> will be executed:
> >>>>>
> >>>>> for (lp = first_seg; lp->next != rxm; lp = lp->next)
> >>>>> ;
> >>>>>
> >>>>> We know that the first_seg->next will be NULL under this condition.
> >>>>> So the
> >>>>> expression of lp->next->next will cause the segmentation fault.
> >>>>>
> >>>>> Normally, the length of the first packet with EOP bit set will be
> >>>>> greater than rxq->crc_len. However, the out-of-order execution of
> >>>>> CPU may make the read ordering of the status and the rest of the
> >>>>> descriptor fields in this function not be correct. The related
> >>>>> codes are as following:
> >>>>>
> >>>>> rxdp = &rx_ring[rx_id];
> >>>>> #1 staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
> >>>>>
> >>>>> if (!(staterr & IXGBE_RXDADV_STAT_DD))
> >>>>> break;
> >>>>>
> >>>>> #2 rxd = *rxdp;
> >>>>>
> >>>>> The sentence #2 may be executed before sentence #1. This action is
> >>>>> likely to make the ready packet zero length. If the packet is the
> >>>>> first packet and has the EOP bit set, the above segmentation fault
> >>>>> will happen.
> >>>>>
> >>>>> So, we should add rte_rmb() to ensure the read ordering is correct.
> >>>>> We also
> >>>>> did the same thing in the ixgbe_recv_pkts() function to make the
> >>>>> rxd data be valid even though we did not find the segmentation fault
> >>>>> in this function.
> >>>>>
> >>>>> Signed-off-by: Min Zhou <zhoumin@loongson.cn>
> >>>>> ---
> >>>>> v2:
> >>>>> - Make the calling of rte_rmb() for all platforms
> >>>>> ---
> >>>>> drivers/net/ixgbe/ixgbe_rxtx.c | 3 +++
> >>>>> 1 file changed, 3 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c
> >>>>> b/drivers/net/ixgbe/ixgbe_rxtx.c index c9d6ca9efe..302a5ab7ff
> >>>>> 100644
> >>>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> >>>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> >>>>> @@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct
> >>>>> rte_mbuf **rx_pkts,
> >>>>> staterr = rxdp->wb.upper.status_error;
> >>>>> if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
> >>>>> break;
> >>>>> +
> >>>>> + rte_rmb();
> >>>>> rxd = *rxdp;
> >>>>
> >>>>
> >>>> Indeed, looks like a problem to me on systems with relaxed MO.
> >>>> Strange that it was never hit on arm or ppc - cc-ing ARM/PPC
> maintainers.
> >>> The LoongArch architecture uses the Weak Consistency model, which can
> >>> cause the problem, especially in scenarios with many cores, such as the
> >>> Loongson 3C5000 with four NUMA nodes, which has 64 cores. I cannot
> >>> reproduce it on the Loongson 3C5000 with one NUMA node, which has just
> >>> 16 cores.
> >>>> About a fix - looks right, but a bit excessive to me - as I
> >>>> understand all we need here is to prevent re-ordering by CPU itself.
> >>> Yes, thanks for cc-ing.
> >>>> So rte_smp_rmb() seems enough here.
> >>>> Or might be just:
> >>>> staterr = __atomic_load_n(&rxdp->wb.upper.status_error,
> >>>> __ATOMIC_ACQUIRE);
> >>>>
> >>> Does __atomic_load_n() work on Windows if we use it to solve this
> >> problem?
> >>
> >> Yes, __atomic_load_n() works on Windows too.
> >>
> Best regards,
>
> Min
>
@@ -1823,6 +1823,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
staterr = rxdp->wb.upper.status_error;
if (!(staterr & rte_cpu_to_le_32(IXGBE_RXDADV_STAT_DD)))
break;
+
+ rte_rmb();
rxd = *rxdp;
/*
@@ -2122,6 +2124,7 @@ ixgbe_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts,
if (!(staterr & IXGBE_RXDADV_STAT_DD))
break;
+ rte_rmb();
rxd = *rxdp;
PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "