ethdev: document that stats reset APIs are not thread-safe

Message ID 20240425165308.1078454-1-ferruh.yigit@amd.com (mailing list archive)
State New
Delegated to: Ferruh Yigit
Headers
Series ethdev: document that stats reset APIs are not thread-safe |

Checks

Context Check Description
ci/loongarch-compilation success Compilation OK
ci/loongarch-unit-testing success Unit Testing PASS
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS
ci/intel-Functional success Functional PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-abi-testing success Testing PASS
ci/iol-sample-apps-testing success Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-unit-amd64-testing success Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-unit-arm64-testing success Testing PASS
ci/iol-compile-arm64-testing success Testing PASS
ci/iol-compile-amd64-testing success Testing PASS

Commit Message

Ferruh Yigit April 25, 2024, 4:53 p.m. UTC
  Making 'rte_eth_stats_reset()' and 'rte_eth_xstats_reset()' APIs thread
safe has performance impact on datapath.

Instead document APIs as not thread safe and add condition for reliable
stats reset functionality, forwarding should be stopped.

Cc: stable@dpdk.org

Signed-off-by: Ferruh Yigit <ferruh.yigit@amd.com>
---
Cc: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Morten Brørup <mb@smartsharesystems.com>

This update triggered by mail list discussion [1].

[1]
https://inbox.dpdk.org/dev/3b2cf48e-2293-4226-b6cd-5f4dd3969f99@lysator.liu.se/
---
 lib/ethdev/rte_ethdev.h | 6 ++++++
 1 file changed, 6 insertions(+)
  

Comments

Morten Brørup April 26, 2024, 12:20 p.m. UTC | #1
> From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
> Sent: Thursday, 25 April 2024 18.53
> 
> Making 'rte_eth_stats_reset()' and 'rte_eth_xstats_reset()' APIs thread
> safe has performance impact on datapath.
> 
> Instead document APIs as not thread safe and add condition for reliable
> stats reset functionality, forwarding should be stopped.

I'm not sure stopping forwarding suffices.
NIC hardware counters will keep progressing unless RX and TX is stopped at NIC level.

I don't have any suggestions for a better wording, though. :-(

Anyway, better with the patch than without...
Acked-by: Morten Brørup <mb@smartsharesystems.com>
  
Stephen Hemminger April 26, 2024, 3:13 p.m. UTC | #2
On Fri, 26 Apr 2024 14:20:01 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> > From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
> > Sent: Thursday, 25 April 2024 18.53
> > 
> > Making 'rte_eth_stats_reset()' and 'rte_eth_xstats_reset()' APIs thread
> > safe has performance impact on datapath.
> > 
> > Instead document APIs as not thread safe and add condition for reliable
> > stats reset functionality, forwarding should be stopped.  
> 
> I'm not sure stopping forwarding suffices.
> NIC hardware counters will keep progressing unless RX and TX is stopped at NIC level.
> 
> I don't have any suggestions for a better wording, though. :-(
> 
> Anyway, better with the patch than without...
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> 

The safest option would be:
	rte_eth_dev_stop
	rte_eth_stats_reset
	rte_eth_dev_start
  
Morten Brørup April 26, 2024, 3:17 p.m. UTC | #3
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Friday, 26 April 2024 17.14
> 
> On Fri, 26 Apr 2024 14:20:01 +0200
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> > > From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
> > > Sent: Thursday, 25 April 2024 18.53
> > >
> > > Making 'rte_eth_stats_reset()' and 'rte_eth_xstats_reset()' APIs thread
> > > safe has performance impact on datapath.
> > >
> > > Instead document APIs as not thread safe and add condition for reliable
> > > stats reset functionality, forwarding should be stopped.
> >
> > I'm not sure stopping forwarding suffices.
> > NIC hardware counters will keep progressing unless RX and TX is stopped at
> NIC level.
> >
> > I don't have any suggestions for a better wording, though. :-(
> >
> > Anyway, better with the patch than without...
> > Acked-by: Morten Brørup <mb@smartsharesystems.com>
> >
> 
> The safest option would be:
> 	rte_eth_dev_stop
> 	rte_eth_stats_reset
> 	rte_eth_dev_start

Yes, but this will cause packet loss.
Network admins should be able to clear the counters without causing packet loss.
  
Patrick Robb April 26, 2024, 9:33 p.m. UTC | #4
Recheck-request: iol-compile-amd64-testing

The DPDK Community Lab updated to the latest Alpine image yesterday, which
resulted in all Alpine builds failing. The failure is unrelated to your
patch, and this recheck should remove the fail on Patchwork, as we have
disabled Alpine testing for now.
  
Stephen Hemminger April 26, 2024, 10:57 p.m. UTC | #5
On Fri, 26 Apr 2024 17:17:25 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > Sent: Friday, 26 April 2024 17.14
> > 
> > On Fri, 26 Apr 2024 14:20:01 +0200
> > Morten Brørup <mb@smartsharesystems.com> wrote:
> >   
> > > > From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
> > > > Sent: Thursday, 25 April 2024 18.53
> > > >
> > > > Making 'rte_eth_stats_reset()' and 'rte_eth_xstats_reset()' APIs thread
> > > > safe has performance impact on datapath.
> > > >
> > > > Instead document APIs as not thread safe and add condition for reliable
> > > > stats reset functionality, forwarding should be stopped.  
> > >
> > > I'm not sure stopping forwarding suffices.
> > > NIC hardware counters will keep progressing unless RX and TX is stopped at  
> > NIC level.  
> > >
> > > I don't have any suggestions for a better wording, though. :-(
> > >
> > > Anyway, better with the patch than without...
> > > Acked-by: Morten Brørup <mb@smartsharesystems.com>
> > >  
> > 
> > The safest option would be:
> > 	rte_eth_dev_stop
> > 	rte_eth_stats_reset
> > 	rte_eth_dev_start  
> 
> Yes, but this will cause packet loss.
> Network admins should be able to clear the counters without causing packet loss.
> 

Another alternative would be to a "reset zero" approach.

When reset is done, capture the current state of all the relevant counters.
Record that, and subtract those on next request. Never modify the
underlying counters.

To make it MT safe, use lock on reset and something super lightweight
like seqlock on the stats request side.  Might be hard with the size and variable
length nature of xstats thoug.
  
Mattias Rönnblom April 28, 2024, 3:48 p.m. UTC | #6
On 2024-04-26 14:20, Morten Brørup wrote:
>> From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
>> Sent: Thursday, 25 April 2024 18.53
>>
>> Making 'rte_eth_stats_reset()' and 'rte_eth_xstats_reset()' APIs thread
>> safe has performance impact on datapath.
>>
>> Instead document APIs as not thread safe and add condition for reliable
>> stats reset functionality, forwarding should be stopped.
> 

What does "forwarding" mean here? Sounds like something from testpmd.

Isn't what needs to stop here other threads calling into the ethdev API 
for this particular device?

> I'm not sure stopping forwarding suffices.
> NIC hardware counters will keep progressing unless RX and TX is stopped at NIC level.
> 

Why would that be an issue?

> I don't have any suggestions for a better wording, though. :-(
> 
> Anyway, better with the patch than without...
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>
  
Mattias Rönnblom April 28, 2024, 3:52 p.m. UTC | #7
On 2024-04-26 17:17, Morten Brørup wrote:
>> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
>> Sent: Friday, 26 April 2024 17.14
>>
>> On Fri, 26 Apr 2024 14:20:01 +0200
>> Morten Brørup <mb@smartsharesystems.com> wrote:
>>
>>>> From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
>>>> Sent: Thursday, 25 April 2024 18.53
>>>>
>>>> Making 'rte_eth_stats_reset()' and 'rte_eth_xstats_reset()' APIs thread
>>>> safe has performance impact on datapath.
>>>>
>>>> Instead document APIs as not thread safe and add condition for reliable
>>>> stats reset functionality, forwarding should be stopped.
>>>
>>> I'm not sure stopping forwarding suffices.
>>> NIC hardware counters will keep progressing unless RX and TX is stopped at
>> NIC level.
>>>
>>> I don't have any suggestions for a better wording, though. :-(
>>>
>>> Anyway, better with the patch than without...
>>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>>>
>>
>> The safest option would be:
>> 	rte_eth_dev_stop
>> 	rte_eth_stats_reset
>> 	rte_eth_dev_start
> 
> Yes, but this will cause packet loss.
> Network admins should be able to clear the counters without causing packet loss.
> 

For sure, and they would do it via some O&M interface, which would in 
turn talk the control plane, which in turn would talk to the data plane 
(including <rte_ethdev.h>).

Either the control plane or the O&M layer could keep track of a reset 
offset. Doesn't have to be on the PMD level.
  
Morten Brørup April 29, 2024, 6:20 a.m. UTC | #8
> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Sunday, 28 April 2024 17.53
> 
> On 2024-04-26 17:17, Morten Brørup wrote:
> >> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> >> Sent: Friday, 26 April 2024 17.14
> >>
> >> On Fri, 26 Apr 2024 14:20:01 +0200
> >> Morten Brørup <mb@smartsharesystems.com> wrote:
> >>
> >>>> From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
> >>>> Sent: Thursday, 25 April 2024 18.53
> >>>>
> >>>> Making 'rte_eth_stats_reset()' and 'rte_eth_xstats_reset()' APIs thread
> >>>> safe has performance impact on datapath.
> >>>>
> >>>> Instead document APIs as not thread safe and add condition for reliable
> >>>> stats reset functionality, forwarding should be stopped.
> >>>
> >>> I'm not sure stopping forwarding suffices.
> >>> NIC hardware counters will keep progressing unless RX and TX is stopped at
> >> NIC level.
> >>>
> >>> I don't have any suggestions for a better wording, though. :-(
> >>>
> >>> Anyway, better with the patch than without...
> >>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> >>>
> >>
> >> The safest option would be:
> >> 	rte_eth_dev_stop
> >> 	rte_eth_stats_reset
> >> 	rte_eth_dev_start
> >
> > Yes, but this will cause packet loss.
> > Network admins should be able to clear the counters without causing packet
> loss.
> >
> 
> For sure, and they would do it via some O&M interface, which would in
> turn talk the control plane, which in turn would talk to the data plane
> (including <rte_ethdev.h>).
> 
> Either the control plane or the O&M layer could keep track of a reset
> offset. Doesn't have to be on the PMD level.

Exactly.

Snapshotting the "reset" offset in the control plane has many advantages.

Let's promote that design pattern, rather than trying to work around different quirks required by different drivers.
  
Morten Brørup April 29, 2024, 7:57 a.m. UTC | #9
> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Sunday, 28 April 2024 17.49
> 
> On 2024-04-26 14:20, Morten Brørup wrote:
> >> From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
> >> Sent: Thursday, 25 April 2024 18.53
> >>
> >> Making 'rte_eth_stats_reset()' and 'rte_eth_xstats_reset()' APIs thread
> >> safe has performance impact on datapath.
> >>
> >> Instead document APIs as not thread safe and add condition for reliable
> >> stats reset functionality, forwarding should be stopped.
> >
> 
> What does "forwarding" mean here? Sounds like something from testpmd.
> 
> Isn't what needs to stop here other threads calling into the ethdev API
> for this particular device?

Doesn't suffice.
The NIC hardware counters increment if the NIC's DMA engine receives or transmits packets.
E.g. in the RX direction, the NIC hardware counters increment according to what the NIC detects happening on the wire. If some packets are DMA'ed into memory and the associated RX descriptors are filled, the NIC's hardware counters progress. It doesn't matter if the software has looked at those RX descriptors or not.

> 
> > I'm not sure stopping forwarding suffices.
> > NIC hardware counters will keep progressing unless RX and TX is stopped at
> NIC level.
> >
> 
> Why would that be an issue?

My point is:
Unless you stop everything, including the NIC hardware RX & TX DMA, the counters might not be zero when returning from the stats_reset() functions.

> 
> > I don't have any suggestions for a better wording, though. :-(

I think it is fine providing a notice about MT-unsafety in the API documentation.
But the only way to prevent it is useless in real applications, as it would cause packet loss. So it's not helpful describing how to do that.

> >
> > Anyway, better with the patch than without...
> > Acked-by: Morten Brørup <mb@smartsharesystems.com>
> >
  
Mattias Rönnblom April 29, 2024, 9:30 a.m. UTC | #10
On 2024-04-29 09:57, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Sunday, 28 April 2024 17.49
>>
>> On 2024-04-26 14:20, Morten Brørup wrote:
>>>> From: Ferruh Yigit [mailto:ferruh.yigit@amd.com]
>>>> Sent: Thursday, 25 April 2024 18.53
>>>>
>>>> Making 'rte_eth_stats_reset()' and 'rte_eth_xstats_reset()' APIs thread
>>>> safe has performance impact on datapath.
>>>>
>>>> Instead document APIs as not thread safe and add condition for reliable
>>>> stats reset functionality, forwarding should be stopped.
>>>
>>
>> What does "forwarding" mean here? Sounds like something from testpmd.
>>
>> Isn't what needs to stop here other threads calling into the ethdev API
>> for this particular device?
> 
> Doesn't suffice.
> The NIC hardware counters increment if the NIC's DMA engine receives or transmits packets.
> E.g. in the RX direction, the NIC hardware counters increment according to what the NIC detects happening on the wire. If some packets are DMA'ed into memory and the associated RX descriptors are filled, the NIC's hardware counters progress. It doesn't matter if the software has looked at those RX descriptors or not.
> 
>>
>>> I'm not sure stopping forwarding suffices.
>>> NIC hardware counters will keep progressing unless RX and TX is stopped at
>> NIC level.
>>>
>>
>> Why would that be an issue?
> 
> My point is:
> Unless you stop everything, including the NIC hardware RX & TX DMA, the counters might not be zero when returning from the stats_reset() functions.
> 

The reset can still be MT safe, even though there is no way to retrieve 
the exact counter state prior to the reset and also is no guarantee that 
the counters are zero immediately after the function returns (whatever 
that means).

If you do want to be able to do that, the means to that end is to teach 
the reset function to return the prior counter state as a part of the 
reset call, not to force the user to have the system grind to a halt.

>>
>>> I don't have any suggestions for a better wording, though. :-(
> 
> I think it is fine providing a notice about MT-unsafety in the API documentation.
> But the only way to prevent it is useless in real applications, as it would cause packet loss. So it's not helpful describing how to do that.
> 
>>>
>>> Anyway, better with the patch than without...
>>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>>>
  
Stephen Hemminger April 29, 2024, 3:33 p.m. UTC | #11
On Mon, 29 Apr 2024 08:20:13 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> > turn talk the control plane, which in turn would talk to the data plane
> > (including <rte_ethdev.h>).
> > 
> > Either the control plane or the O&M layer could keep track of a reset
> > offset. Doesn't have to be on the PMD level.  
> 
> Exactly.
> 
> Snapshotting the "reset" offset in the control plane has many advantages.
> 
> Let's promote that design pattern, rather than trying to work around different quirks required by different drivers.
> 


The other thing that can help in software drivers is keeping per-queue stats.
  

Patch

diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 548fada1c7ad..40f04c0e191b 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -3136,6 +3136,9 @@  int rte_eth_stats_get(uint16_t port_id, struct rte_eth_stats *stats);
 /**
  * Reset the general I/O statistics of an Ethernet device.
  *
+ * API is not multi-thread safe.
+ * Application should stop forwarding before calling this API.
+ *
  * @param port_id
  *   The port identifier of the Ethernet device.
  * @return
@@ -3296,6 +3299,9 @@  int rte_eth_xstats_get_id_by_name(uint16_t port_id, const char *xstat_name,
 /**
  * Reset extended statistics of an Ethernet device.
  *
+ * API is not multi-thread safe.
+ * Application should stop forwarding before calling this API.
+ *
  * @param port_id
  *   The port identifier of the Ethernet device.
  * @return