net/af_packet: cache align Rx/Tx structs

Message ID 20240423090813.94110-1-mattias.ronnblom@ericsson.com (mailing list archive)
State Superseded
Delegated to: Ferruh Yigit
Series net/af_packet: cache align Rx/Tx structs

Checks

Context Check Description
ci/checkpatch warning coding style issues
ci/loongarch-compilation success Compilation OK
ci/loongarch-unit-testing success Unit Testing PASS
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS
ci/intel-Functional success Functional PASS
ci/github-robot: build success github build: passed
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-abi-testing success Testing PASS
ci/iol-sample-apps-testing success Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-unit-amd64-testing success Testing PASS
ci/iol-unit-arm64-testing success Testing PASS
ci/iol-broadcom-Functional success Functional Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-compile-arm64-testing success Testing PASS
ci/iol-compile-amd64-testing success Testing PASS

Commit Message

Mattias Rönnblom April 23, 2024, 9:08 a.m. UTC
  Cache align Rx and Tx queue struct to avoid false sharing.

RX struct happens to be 64 bytes on x86_64 already, so cache alignment
makes no change there, but it does on 32-bit ISAs.

TX struct is 56 bytes on x86_64.

Both structs keep counters, and in the RX case they are updated even
for empty polls.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
  

Comments

Ferruh Yigit April 23, 2024, 11:15 a.m. UTC | #1
On 4/23/2024 10:08 AM, Mattias Rönnblom wrote:
> Cache align Rx and Tx queue struct to avoid false sharing.
> 
> RX struct happens to be 64 bytes on x86_64 already, so cache alignment
> makes no change there, but it does on 32-bit ISAs.
> 
> TX struct is 56 bytes on x86_64.
> 

Hi Mattias,

No objection to the patch. Is the improvement theoretical, or did you
measure an improvement in practice? If so, how much is the improvement?

> Both structs keep counters, and in the RX case they are updated even
> for empty polls.
> 

Do you think it would help to move the 'rx_pkts' & 'rx_bytes' updates
within the loop?

> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---
>  drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
> index 397a32db58..28aeb7d08e 100644
> --- a/drivers/net/af_packet/rte_eth_af_packet.c
> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
> @@ -6,6 +6,7 @@
>   * All rights reserved.
>   */
>  
> +#include <rte_common.h>
>  #include <rte_string_fns.h>
>  #include <rte_mbuf.h>
>  #include <ethdev_driver.h>
> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>  
>  	volatile unsigned long rx_pkts;
>  	volatile unsigned long rx_bytes;
> -};
> +} __rte_cache_aligned;
>  

The latest convention is to place the '__rte_cache_aligned' tag at the
beginning of the struct [1], so something like:
`struct __rte_cache_aligned pkt_rx_queue {`

[1]
https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both
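
A minimal sketch of the suggested placement, applied to the fields from
this patch:

struct __rte_cache_aligned pkt_rx_queue {
	/* ... existing fields unchanged ... */
	volatile unsigned long rx_pkts;
	volatile unsigned long rx_bytes;
};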
  
Mattias Rönnblom April 23, 2024, 8:56 p.m. UTC | #2
On 2024-04-23 13:15, Ferruh Yigit wrote:
> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote:
>> Cache align Rx and Tx queue struct to avoid false sharing.
>>
>> RX struct happens to be 64 bytes on x86_64 already, so cache alignment
>> makes no change there, but it does on 32-bit ISAs.
>>
>> TX struct is 56 bytes on x86_64.
>>
> 
> Hi Mattias,
> 
> No objection to the patch. Is the improvement theoretical or do you
> measure any improvement practically, if so how much is the improvement?
> 

I didn't run any benchmarks.

Two cores storing to a (falsely) shared cache line on a per-packet basis 
is going to be very expensive, at least for "light touch" applications.

>> Both structs keep counters, and in the RX case they are updated even
>> for empty polls.
>>
> 
> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within
> the loop?
> 

No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes 
are declared volatile, so you are forcing a load-modify-store cycle for 
every increment.

I would drop "volatile", or replace it with an atomic (although *not* 
use an atomic add for incrementing, but rather atomic load + <n> 
non-atomic adds + atomic store).
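
A minimal sketch of that pattern, in C11 atomics (illustrative names,
not the actual driver code):

#include <stdatomic.h>

/* Single-writer burst update: one atomic load and one atomic store per
 * counter, with plain additions in between, instead of an atomic
 * read-modify-write per packet. */
static inline void
rx_stats_add(_Atomic unsigned long *rx_pkts,
	     _Atomic unsigned long *rx_bytes,
	     unsigned long pkts, unsigned long bytes)
{
	unsigned long p = atomic_load_explicit(rx_pkts, memory_order_relaxed);
	unsigned long b = atomic_load_explicit(rx_bytes, memory_order_relaxed);

	atomic_store_explicit(rx_pkts, p + pkts, memory_order_relaxed);
	atomic_store_explicit(rx_bytes, b + bytes, memory_order_relaxed);
}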

>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
>>   drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
>> index 397a32db58..28aeb7d08e 100644
>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>> @@ -6,6 +6,7 @@
>>    * All rights reserved.
>>    */
>>   
>> +#include <rte_common.h>
>>   #include <rte_string_fns.h>
>>   #include <rte_mbuf.h>
>>   #include <ethdev_driver.h>
>> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>>   
>>   	volatile unsigned long rx_pkts;
>>   	volatile unsigned long rx_bytes;
>> -};
>> +} __rte_cache_aligned;
>>   
> 
> Latest location for '__rte_cache_aligned' tag is at the beginning of the
> struct [1], so something like:
> `struct __rte_cache_aligned pkt_rx_queue {`
> 
> [1]
> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both
  
Honnappa Nagarahalli April 24, 2024, 12:27 a.m. UTC | #3
> On Apr 23, 2024, at 3:56 PM, Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> 
> On 2024-04-23 13:15, Ferruh Yigit wrote:
>> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote:
>>> Cache align Rx and Tx queue struct to avoid false sharing.
>>> 
>>> RX struct happens to be 64 bytes on x86_64 already, so cache alignment
>>> makes no change there, but it does on 32-bit ISAs.
>>> 
>>> TX struct is 56 bytes on x86_64.
>>> 
>> Hi Mattias,
>> No objection to the patch. Is the improvement theoretical or do you
>> measure any improvement practically, if so how much is the improvement?
> 
> I didn't run any benchmarks.
> 
> Two cores storing to a (falsely) shared cache line on a per-packet basis is going to be very expensive, at least for "light touch" applications.
> 
>>> Both structs keep counters, and in the RX case they are updated even
>>> for empty polls.
>>> 
>> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within
>> the loop?
> 
> No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes are declared volatile, so you are forcing a load-modify-store cycle for every increment.
> 
> I would drop "volatile", or replace it with an atomic (although *not* use an atomic add for incrementing, but rather atomic load + <n> non-atomic adds + atomic store).
(Slightly unrelated discussion)
Does the atomic load + increment + atomic store help in a non-contended
case like this? Some platforms have optimizations for atomic increments as
well, which would be missed.

> 
>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>> ---
>>>  drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>> 
>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
>>> index 397a32db58..28aeb7d08e 100644
>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>> @@ -6,6 +6,7 @@
>>>   * All rights reserved.
>>>   */
>>>  +#include <rte_common.h>
>>>  #include <rte_string_fns.h>
>>>  #include <rte_mbuf.h>
>>>  #include <ethdev_driver.h>
>>> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>>>     volatile unsigned long rx_pkts;
>>>   volatile unsigned long rx_bytes;
>>> -};
>>> +} __rte_cache_aligned;
>>>  
>> Latest location for '__rte_cache_aligned' tag is at the beginning of the
>> struct [1], so something like:
>> `struct __rte_cache_aligned pkt_rx_queue {`
>> [1]
>> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both
  
Mattias Rönnblom April 24, 2024, 6:28 a.m. UTC | #4
On 2024-04-24 02:27, Honnappa Nagarahalli wrote:
> 
> 
>> On Apr 23, 2024, at 3:56 PM, Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>>
>> On 2024-04-23 13:15, Ferruh Yigit wrote:
>>> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote:
>>>> Cache align Rx and Tx queue struct to avoid false sharing.
>>>>
>>>> RX struct happens to be 64 bytes on x86_64 already, so cache alignment
>>>> makes no change there, but it does on 32-bit ISAs.
>>>>
>>>> TX struct is 56 bytes on x86_64.
>>>>
>>> Hi Mattias,
>>> No objection to the patch. Is the improvement theoretical or do you
>>> measure any improvement practically, if so how much is the improvement?
>>
>> I didn't run any benchmarks.
>>
>> Two cores storing to a (falsely) shared cache line on a per-packet basis is going to be very expensive, at least for "light touch" applications.
>>
>>>> Both structs keep counters, and in the RX case they are updated even
>>>> for empty polls.
>>>>
>>> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within
>>> the loop?
>>
>> No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes are declared volatile, so you are forcing a load-modify-store cycle for every increment.
>>
>> I would drop "volatile", or replace it with an atomic (although *not* use an atomic add for incrementing, but rather atomic load + <n> non-atomic adds + atomic store).
> (Slightly unrelated discussion)
> Does the atomic load + increment + atomic store help in a non-contended case like this? Some platforms have optimizations for atomic-increments as well which would be missed.
> 

Is it "far atomics" you have in mind? A C11 complaint compiler won't 
generate STADD (even for relaxed type ops), since it doesn't fit well 
into the C11 memory model. In particular, the issue is that STADD isn't 
ordered by the instructions generated by C11 fences, from what I understand.

GCC did generate STADD for a while, until this issue was discovered.

On x86_64, ADD <addr> is generally much faster than LOCK ADD <addr>. No 
wonder, since the latter is a full barrier.

If I'd worked for ARM I would probably have proposed some API extension 
to the DPDK atomics API to allow the use of STADD (through inline 
assembler). One way to integrate it might be to add a new memory model, 
rte_memory_order_unordered or rte_memory_order_jumps_all_fences, where 
loads and stores are totally unordered.

However, in this particular case, it's a single-writer scenario, so 
rte_memory_order_kangaroo wouldn't really help, since it would be 
equivalent to rte_memory_order_relaxed on non-far atomics-capable ISAs, 
which would be too expensive. In the past I've argued for a 
single-writer version of atomic add/sub etc in the DPDK atomics API 
(whatever that is), for convenience and to make the issue/pattern known.
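
For contrast, a sketch of the two variants (C11 atomics; an assumed
example, not a proposed API):

#include <stdatomic.h>
#include <stdbool.h>

static _Atomic unsigned long cnt;

static void
count(unsigned long n, bool multi_writer)
{
	if (multi_writer)
		/* Atomic read-modify-write; a full barrier (LOCK ADD) on
		 * x86_64. */
		atomic_fetch_add_explicit(&cnt, n, memory_order_relaxed);
	else
		/* Single writer: separate load and store, compiled to
		 * ordinary memory accesses plus an add; no LOCK prefix. */
		atomic_store_explicit(&cnt,
		    atomic_load_explicit(&cnt, memory_order_relaxed) + n,
		    memory_order_relaxed);
}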

>>
>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>> ---
>>>>   drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>>>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
>>>> index 397a32db58..28aeb7d08e 100644
>>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>>> @@ -6,6 +6,7 @@
>>>>    * All rights reserved.
>>>>    */
>>>>   +#include <rte_common.h>
>>>>   #include <rte_string_fns.h>
>>>>   #include <rte_mbuf.h>
>>>>   #include <ethdev_driver.h>
>>>> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>>>>      volatile unsigned long rx_pkts;
>>>>    volatile unsigned long rx_bytes;
>>>> -};
>>>> +} __rte_cache_aligned;
>>>>   
>>> Latest location for '__rte_cache_aligned' tag is at the beginning of the
>>> struct [1], so something like:
>>> `struct __rte_cache_aligned pkt_rx_queue {`
>>> [1]
>>> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both
>
  
Ferruh Yigit April 24, 2024, 10:21 a.m. UTC | #5
On 4/23/2024 9:56 PM, Mattias Rönnblom wrote:
> On 2024-04-23 13:15, Ferruh Yigit wrote:
>> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote:
>>> Cache align Rx and Tx queue struct to avoid false sharing.
>>>
>>> RX struct happens to be 64 bytes on x86_64 already, so cache alignment
>>> makes no change there, but it does on 32-bit ISAs.
>>>
>>> TX struct is 56 bytes on x86_64.
>>>
>>
>> Hi Mattias,
>>
>> No objection to the patch. Is the improvement theoretical or do you
>> measure any improvement practically, if so how much is the improvement?
>>
> 
> I didn't run any benchmarks.
> 
> Two cores storing to a (falsely) shared cache line on a per-packet basis
> is going to be very expensive, at least for "light touch" applications.
> 

ack
I expect the bottleneck for af_packet to be on the kernel side, so I don't
expect any visible improvement in practice, but OK to fix this theoretical
issue.

>>> Both structs keep counters, and in the RX case they are updated even
>>> for empty polls.
>>>
>>
>> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within
>> the loop?
>>
> 
> No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes
> are declared volatile, so you are forcing a load-modify-store cycle for
> every increment.
> 

My intention was to prevent updating stats in empty polls; I thought the
stats would stay hot in the cache, but that won't work with volatile.


> I would drop "volatile", or replace it with an atomic (although *not*
> use an atomic add for incrementing, but rather atomic load + <n>
> non-atomic adds + atomic store).
> 

As only a single core polls an Rx queue, I assume atomics are required
only for the stats reset case. And for that case, won't we need an atomic add?

And do you know if there is a performance difference between an atomic add
and keeping the volatile qualifier?


>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>> ---
>>>   drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
>>> b/drivers/net/af_packet/rte_eth_af_packet.c
>>> index 397a32db58..28aeb7d08e 100644
>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>> @@ -6,6 +6,7 @@
>>>    * All rights reserved.
>>>    */
>>>   +#include <rte_common.h>
>>>   #include <rte_string_fns.h>
>>>   #include <rte_mbuf.h>
>>>   #include <ethdev_driver.h>
>>> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>>>         volatile unsigned long rx_pkts;
>>>       volatile unsigned long rx_bytes;
>>> -};
>>> +} __rte_cache_aligned;
>>>   
>>
>> Latest location for '__rte_cache_aligned' tag is at the beginning of the
>> struct [1], so something like:
>> `struct __rte_cache_aligned pkt_rx_queue {`
>>
>> [1]
>> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both
  
Bruce Richardson April 24, 2024, 10:28 a.m. UTC | #6
On Wed, Apr 24, 2024 at 11:21:52AM +0100, Ferruh Yigit wrote:
> On 4/23/2024 9:56 PM, Mattias Rönnblom wrote:
> > On 2024-04-23 13:15, Ferruh Yigit wrote:
> >> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote:
> >>> Cache align Rx and Tx queue struct to avoid false sharing.
> >>>
> >>> RX struct happens to be 64 bytes on x86_64 already, so cache
> >>> alignment makes no change there, but it does on 32-bit ISAs.
> >>>
> >>> TX struct is 56 bytes on x86_64.
> >>>
> >>
> >> Hi Mattias,
> >>
> >> No objection to the patch. Is the improvement theoretical or do you
> >> measure any improvement practically, if so how much is the
> >> improvement?
> >>
> > 
> > I didn't run any benchmarks.
> > 
> > Two cores storing to a (falsely) shared cache line on a per-packet
> > basis is going to be very expensive, at least for "light touch"
> > applications.
> > 
> 
> ack I expect for af_packet bottleneck is the kernel side, so I don't
> expect any visible improvement practically, but OK to fix this
> theoretical issue.
> 
> >>> Both structs keep counters, and in the RX case they are updated even
> >>> for empty polls.
> >>>
> >>
> >> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within
> >> the loop?
> >>
> > 
> > No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes
> > are declared volatile, so you are forcing a load-modify-store cycle for
> > every increment.
> > 
> 
> My intention was to prevent updating stats in empty polls, I thought
> stats will be hot in the cache but won't work with volatile.
> 
Yes, it will. Volatile only prevents caching in registers; it does not
affect the storing of data within the cache hierarchy. Reads/writes of
stats on empty polls should indeed hit the L1 as expected. However, that is
still less efficient than just doing a register increment, which could
theoretically be the result without the volatile.
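
A small sketch of the difference (illustrative; poll_once() is a
stand-in for the PMD receive path):

extern unsigned int poll_once(void);

volatile unsigned long rx_pkts_v; /* every update: load + add + store */
unsigned long rx_pkts;

void
poll_loop(void)
{
	unsigned long pkts = 0; /* local: free to live in a register */
	int i;

	for (i = 0; i < 1000000; i++) {
		unsigned int n = poll_once();

		rx_pkts_v += n; /* volatile: forced through L1 every time */
		pkts += n;      /* may stay in a register the whole loop */
	}

	rx_pkts += pkts; /* one store at the end */
}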

/Bruce
  
Mattias Rönnblom April 24, 2024, 11:57 a.m. UTC | #7
On 2024-04-24 12:21, Ferruh Yigit wrote:
> On 4/23/2024 9:56 PM, Mattias Rönnblom wrote:
>> On 2024-04-23 13:15, Ferruh Yigit wrote:
>>> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote:
>>>> Cache align Rx and Tx queue struct to avoid false sharing.
>>>>
>>>> RX struct happens to be 64 bytes on x86_64 already, so cache alignment
>>>> makes no change there, but it does on 32-bit ISAs.
>>>>
>>>> TX struct is 56 bytes on x86_64.
>>>>
>>>
>>> Hi Mattias,
>>>
>>> No objection to the patch. Is the improvement theoretical or do you
>>> measure any improvement practically, if so how much is the improvement?
>>>
>>
>> I didn't run any benchmarks.
>>
>> Two cores storing to a (falsely) shared cache line on a per-packet basis
>> is going to be very expensive, at least for "light touch" applications.
>>
> 
> ack
> I expect for af_packet bottleneck is the kernel side, so I don't expect
> any visible improvement practically, but OK to fix this theoretical issue.
> 

If you use af_packet as a slow path interface, you may well poll it 
often. If you are unlucky and aren't on an ISA where the RX struct *by 
accident* is cache-line aligned, you will suffer greatly from false 
sharing, because the counters are written to regardless of whether any 
packets were polled or not.

If you don't care about synchronization overhead, why even support 
multi-queue in the af_packet driver!?

>>>> Both structs keep counters, and in the RX case they are updated even
>>>> for empty polls.
>>>>
>>>
>>> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within
>>> the loop?
>>>
>>
>> No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes
>> are declared volatile, so you are forcing a load-modify-store cycle for
>> every increment.
>>
> 
> My intention was to prevent updating stats in empty polls, I thought
> stats will be hot in the cache but won't work with volatile.
> 
> 
>> I would drop "volatile", or replace it with an atomic (although *not*
>> use an atomic add for incrementing, but rather atomic load + <n>
>> non-atomic adds + atomic store).
>>
> 
> As only single core polls from an Rx queue, I assume atomics is required
> only for stats reset case. And for that case won't we need atomic add?
> 

Correct me if I'm wrong here, but my impression is that statistics reset 
tends to be best-effort in DPDK. Or, expressed in a different way, it is 
not thread-safe.

If you want to support MT safe counter reset, the counter increment 
operation must be an atomic add, since now you have two writers. (Unless 
you do something fancy and keep only a new offset upon reset, and never 
actually zero the counter.)
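
A sketch of the offset approach (assumed names; the counter has a
single datapath writer, and the control plane only ever writes the
offset):

#include <stdatomic.h>
#include <stdint.h>

struct counter {
	_Atomic uint64_t count;  /* written only by the owning lcore */
	_Atomic uint64_t offset; /* written only on reset */
};

/* "Reset": record the current value as the new baseline. The datapath
 * writer is never disturbed, so no increment can be lost. */
static void
counter_reset(struct counter *c)
{
	atomic_store_explicit(&c->offset,
	    atomic_load_explicit(&c->count, memory_order_relaxed),
	    memory_order_relaxed);
}

/* The externally visible value is the raw count minus the baseline. */
static uint64_t
counter_read(struct counter *c)
{
	return atomic_load_explicit(&c->count, memory_order_relaxed) -
	    atomic_load_explicit(&c->offset, memory_order_relaxed);
}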

> And do you know if there is a performance difference between atomic add
> and keeping volatile qualifier?
> 
> 

I don't know how slow af_packet is, but if you care about performance, 
you don't want to use atomic add for statistics.

volatile in this case is not a big issue, performance-wise, I would 
expect.

>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>> ---
>>>>    drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>>>>    1 file changed, 3 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
>>>> b/drivers/net/af_packet/rte_eth_af_packet.c
>>>> index 397a32db58..28aeb7d08e 100644
>>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>>> @@ -6,6 +6,7 @@
>>>>     * All rights reserved.
>>>>     */
>>>>    +#include <rte_common.h>
>>>>    #include <rte_string_fns.h>
>>>>    #include <rte_mbuf.h>
>>>>    #include <ethdev_driver.h>
>>>> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>>>>          volatile unsigned long rx_pkts;
>>>>        volatile unsigned long rx_bytes;
>>>> -};
>>>> +} __rte_cache_aligned;
>>>>    
>>>
>>> Latest location for '__rte_cache_aligned' tag is at the beginning of the
>>> struct [1], so something like:
>>> `struct __rte_cache_aligned pkt_rx_queue {`
>>>
>>> [1]
>>> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both
>
  
Ferruh Yigit April 24, 2024, 5:50 p.m. UTC | #8
On 4/24/2024 12:57 PM, Mattias Rönnblom wrote:
> On 2024-04-24 12:21, Ferruh Yigit wrote:
>> On 4/23/2024 9:56 PM, Mattias Rönnblom wrote:
>>> On 2024-04-23 13:15, Ferruh Yigit wrote:
>>>> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote:
>>>>> Cache align Rx and Tx queue struct to avoid false sharing.
>>>>>
>>>>> RX struct happens to be 64 bytes on x86_64 already, so cache alignment
>>>>> makes no change there, but it does on 32-bit ISAs.
>>>>>
>>>>> TX struct is 56 bytes on x86_64.
>>>>>
>>>>
>>>> Hi Mattias,
>>>>
>>>> No objection to the patch. Is the improvement theoretical or do you
>>>> measure any improvement practically, if so how much is the improvement?
>>>>
>>>
>>> I didn't run any benchmarks.
>>>
>>> Two cores storing to a (falsely) shared cache line on a per-packet basis
>>> is going to be very expensive, at least for "light touch" applications.
>>>
>>
>> ack
>> I expect for af_packet bottleneck is the kernel side, so I don't expect
>> any visible improvement practically, but OK to fix this theoretical
>> issue.
>>
> 
> If you use af_packet as a slow path interface, you may well poll it
> often. If you are unlucky and aren't on an ISA where the RX struct *by
> accident* is cache-line aligned, you will suffer greatly from false
> sharing, because the counters are written to regardless if there are
> packets polled or not.
> 
> If you don't care about synchronization overhead, why even support
> multi-queue in the af_packet driver!?
> 

We do care; it's just that I expect no measurable performance improvement,
and I was asking if you have data for it.

>>>>> Both structs keep counters, and in the RX case they are updated even
>>>>> for empty polls.
>>>>>
>>>>
>>>> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within
>>>> the loop?
>>>>
>>>
>>> No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes
>>> are declared volatile, so you are forcing a load-modify-store cycle for
>>> every increment.
>>>
>>
>> My intention was to prevent updating stats in empty polls, I thought
>> stats will be hot in the cache but won't work with volatile.
>>
>>
>>> I would drop "volatile", or replace it with an atomic (although *not*
>>> use an atomic add for incrementing, but rather atomic load + <n>
>>> non-atomic adds + atomic store).
>>>
>>
>> As only single core polls from an Rx queue, I assume atomics is required
>> only for stats reset case. And for that case won't we need atomic add?
>>
> 
> Correct me if I'm wrong here, but my impression is that statistics reset
> tend to be best-effort in DPDK. Or, expressed in a different way, are
> not thread-safe.
> 

At least I don't see this documented anywhere. For cases where HW tracks
the stats, we don't have this problem. Or some drivers use the offset
mechanism you mentioned below.
But I can see some soft drivers will have this problem with stats clear.

> If you want to support MT safe counter reset, the counter increment
> operation must be an atomic add, since now you have two writers. (Unless
> you do something fancy and keep only a new offset upon reset, and never
> actually zero the counter.)
> 

Asking to learn: if we ignore the stats reset case and assume stats are
only incremented (the single-writer case), do we still need either the
volatile qualifier or atomic operations at all?

>> And do you know if there is a performance difference between atomic add
>> and keeping volatile qualifier?
>>
>>
> 
> I don't know how slow af_packet is, but if you care about performance,
> you don't want to use atomic add for statistics.
> 

There are a few soft drivers already using atomic adds for updating stats.
If we document expectations from 'rte_eth_stats_reset()', we can update
those usages.

> volatile in this case it's not a big issue, performance-wise, I would
> expect.
> 
>>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>>> ---
>>>>>    drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>>>>>    1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> b/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> index 397a32db58..28aeb7d08e 100644
>>>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> @@ -6,6 +6,7 @@
>>>>>     * All rights reserved.
>>>>>     */
>>>>>    +#include <rte_common.h>
>>>>>    #include <rte_string_fns.h>
>>>>>    #include <rte_mbuf.h>
>>>>>    #include <ethdev_driver.h>
>>>>> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>>>>>          volatile unsigned long rx_pkts;
>>>>>        volatile unsigned long rx_bytes;
>>>>> -};
>>>>> +} __rte_cache_aligned;
>>>>>    
>>>>
>>>> Latest location for '__rte_cache_aligned' tag is at the beginning of
>>>> the
>>>> struct [1], so something like:
>>>> `struct __rte_cache_aligned pkt_rx_queue {`
>>>>
>>>> [1]
>>>> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both
>>
  
Ferruh Yigit April 24, 2024, 6:02 p.m. UTC | #9
On 4/24/2024 11:28 AM, Bruce Richardson wrote:
> On Wed, Apr 24, 2024 at 11:21:52AM +0100, Ferruh Yigit wrote:
>> On 4/23/2024 9:56 PM, Mattias Rönnblom wrote:
>>> On 2024-04-23 13:15, Ferruh Yigit wrote:
>>>> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote:
>>>>> Cache align Rx and Tx queue struct to avoid false sharing.
>>>>>
>>>>> RX struct happens to be 64 bytes on x86_64 already, so cache
>>>>> alignment makes no change there, but it does on 32-bit ISAs.
>>>>>
>>>>> TX struct is 56 bytes on x86_64.
>>>>>
>>>>
>>>> Hi Mattias,
>>>>
>>>> No objection to the patch. Is the improvement theoretical or do you
>>>> measure any improvement practically, if so how much is the
>>>> improvement?
>>>>
>>>
>>> I didn't run any benchmarks.
>>>
>>> Two cores storing to a (falsely) shared cache line on a per-packet
>>> basis is going to be very expensive, at least for "light touch"
>>> applications.
>>>
>>
>> ack I expect for af_packet bottleneck is the kernel side, so I don't
>> expect any visible improvement practically, but OK to fix this
>> theoretical issue.
>>
>>>>> Both structs keep counters, and in the RX case they are updated even
>>>>> for empty polls.
>>>>>
>>>>
>>>> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within
>>>> the loop?
>>>>
>>>
>>> No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes
>>> are declared volatile, so you are forcing a load-modify-store cycle for
>>> every increment.
>>>
>>
>> My intention was to prevent updating stats in empty polls, I thought
>> stats will be hot in the cache but won't work with volatile.
>>
> Yes, it will. Volatile only prevents caching in registers, it does not
> affect the storing of data within the cache hierarchy. Reads/writes of
> stats on empty polls should indeed hit the L1 as expected. However, that is
> still less efficient than just doing a register increment which could
> theoretically be the result without the volatile.
> 
> 

Thanks Bruce for clarification.
  
Stephen Hemminger April 24, 2024, 7:13 p.m. UTC | #10
On Wed, 24 Apr 2024 18:50:50 +0100
Ferruh Yigit <ferruh.yigit@amd.com> wrote:

> > I don't know how slow af_packet is, but if you care about performance,
> > you don't want to use atomic add for statistics.
> >   
> 
> There are a few soft drivers already using atomics adds for updating stats.
> If we document expectations from 'rte_eth_stats_reset()', we can update
> those usages.

Using an atomic add is a lot of extra overhead. The statistics are not
guaranteed to be perfect. If nothing else, the bytes and packets can be skewed.

The performance of the soft drivers af_xdp, af_packet, and tun is dominated
by the overhead of the kernel system calls and copies. Yes, alignment is
good, but it won't be noticeable.
  
Mattias Rönnblom April 24, 2024, 10:27 p.m. UTC | #11
On 2024-04-24 21:13, Stephen Hemminger wrote:
> On Wed, 24 Apr 2024 18:50:50 +0100
> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
> 
>>> I don't know how slow af_packet is, but if you care about performance,
>>> you don't want to use atomic add for statistics.
>>>    
>>
>> There are a few soft drivers already using atomics adds for updating stats.
>> If we document expectations from 'rte_eth_stats_reset()', we can update
>> those usages.
> 
> Using atomic add is lots of extra overhead. The statistics are not guaranteed
> to be perfect.  If nothing else, the bytes and packets can be skewed.
> 

The sad thing here is that in case the counters are reset within the 
load-modify-store cycle of the lcore counter update, the reset may end 
up being a nop. So, it's not like you missed a packet or two, or suffered 
some transient inconsistency; you completely and permanently ignored 
the reset request.

> The soft drivers af_xdp, af_packet, and tun performance is dominated by the
> overhead of the kernel system call and copies. Yes, alignment is good
> but won't be noticeable.

There aren't any syscalls in the RX path in the af_packet PMD.

I added the same statistics updates as the af_packet PMD uses into a 
benchmark app which consumes ~1000 cc in-between stats updates.

If the equivalent of the RX queue struct was cache aligned, the 
statistics overhead was so small it was difficult to measure. Less than 
3-4 cc per update. This was with volatile, but without atomics.

If the RX queue struct wasn't cache aligned, and sized so a cache line 
generally was used by two (neighboring) cores, the stats incurred a cost 
of ~55 cc per update.

Shaving off 55 cc should translate to a couple of hundred percent 
increased performance for an empty af_packet poll. If your lcore has 
some other primary source of work than the af_packet RX queue, and the 
RX queue is polled often, then this may well be a noticeable gain.

The benchmark was run on 16 Gracemont cores, which in my experience 
seems to have a little shorter core-to-core latency than many other 
systems, provided the remote core/cache line owner is located in the 
same cluster.
  
Stephen Hemminger April 24, 2024, 11:55 p.m. UTC | #12
On Thu, 25 Apr 2024 00:27:36 +0200
Mattias Rönnblom <hofors@lysator.liu.se> wrote:

> On 2024-04-24 21:13, Stephen Hemminger wrote:
> > On Wed, 24 Apr 2024 18:50:50 +0100
> > Ferruh Yigit <ferruh.yigit@amd.com> wrote:
> >   
> >>> I don't know how slow af_packet is, but if you care about performance,
> >>> you don't want to use atomic add for statistics.
> >>>      
> >>
> >> There are a few soft drivers already using atomics adds for updating stats.
> >> If we document expectations from 'rte_eth_stats_reset()', we can update
> >> those usages.  
> > 
> > Using atomic add is lots of extra overhead. The statistics are not guaranteed
> > to be perfect.  If nothing else, the bytes and packets can be skewed.
> >   
> 
> The sad thing here is that in case the counters are reset within the 
> load-modify-store cycle of the lcore counter update, the reset may end 
> up being a nop. So, it's not like you missed a packet or two, or suffer 
> some transient inconsistency, but you completed and permanently ignored 
> the reset request.

That is one of the many reasons the Linux kernel intentionally did not
implement a reset statistics operation.
  
Mattias Rönnblom April 25, 2024, 9:26 a.m. UTC | #13
On 2024-04-25 01:55, Stephen Hemminger wrote:
> On Thu, 25 Apr 2024 00:27:36 +0200
> Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> 
>> On 2024-04-24 21:13, Stephen Hemminger wrote:
>>> On Wed, 24 Apr 2024 18:50:50 +0100
>>> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
>>>    
>>>>> I don't know how slow af_packet is, but if you care about performance,
>>>>> you don't want to use atomic add for statistics.
>>>>>       
>>>>
>>>> There are a few soft drivers already using atomics adds for updating stats.
>>>> If we document expectations from 'rte_eth_stats_reset()', we can update
>>>> those usages.
>>>
>>> Using atomic add is lots of extra overhead. The statistics are not guaranteed
>>> to be perfect.  If nothing else, the bytes and packets can be skewed.
>>>    
>>
>> The sad thing here is that in case the counters are reset within the
>> load-modify-store cycle of the lcore counter update, the reset may end
>> up being a nop. So, it's not like you missed a packet or two, or suffer
>> some transient inconsistency, but you completed and permanently ignored
>> the reset request.
> 
> That is one of the many reasons the Linux kernel intentionally did not
> implement a reset statistics operation.

DPDK should deprecate statistics reset, it seems to me.

The current API is broken anyway, if you care about correctness. A reset 
function must return the current state of the counters at the point of 
them being reset. Otherwise, a higher layer may miss counter updates.

The af_packet PMD should return -ENOTSUP on reset (which is allowed by 
the API).

Maintaining an offset-since-last-reset for counters is a control plane 
thing, and shouldn't be in PMDs. (If an MT-safe reset for SW-managed 
counters is to be expected from the PMDs, we should have some helper 
API to facilitate its efficient & correct-enough implementation.)
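
As an illustration of reset-returning-state semantics, a sketch of what
such a helper could look like (assumed name; lossless only if the
update side uses an atomic read-modify-write):

#include <stdatomic.h>
#include <stdint.h>

/* Swap in zero and hand back the value the counter had at the instant
 * of the reset, so the caller cannot miss any updates. */
static uint64_t
counter_reset_read(_Atomic uint64_t *counter)
{
	return atomic_exchange_explicit(counter, 0, memory_order_relaxed);
}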
  
Morten Brørup April 25, 2024, 9:49 a.m. UTC | #14
> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Thursday, 25 April 2024 11.26
> 
> On 2024-04-25 01:55, Stephen Hemminger wrote:
> > On Thu, 25 Apr 2024 00:27:36 +0200
> > Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> >
> >> On 2024-04-24 21:13, Stephen Hemminger wrote:
> >>> On Wed, 24 Apr 2024 18:50:50 +0100
> >>> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
> >>>
> >>>>> I don't know how slow af_packet is, but if you care about performance,
> >>>>> you don't want to use atomic add for statistics.
> >>>>>
> >>>>
> >>>> There are a few soft drivers already using atomics adds for updating
> stats.
> >>>> If we document expectations from 'rte_eth_stats_reset()', we can update
> >>>> those usages.
> >>>
> >>> Using atomic add is lots of extra overhead. The statistics are not
> guaranteed
> >>> to be perfect.  If nothing else, the bytes and packets can be skewed.
> >>>
> >>
> >> The sad thing here is that in case the counters are reset within the
> >> load-modify-store cycle of the lcore counter update, the reset may end
> >> up being a nop. So, it's not like you missed a packet or two, or suffer
> >> some transient inconsistency, but you completed and permanently ignored
> >> the reset request.
> >
> > That is one of the many reasons the Linux kernel intentionally did not
> > implement a reset statistics operation.
> 
> DPDK should deprecate statistics reset, it seems to me.

+1

> 
> The current API is broken anyway, if you care about correctness. A reset
> function must return the current state of the counters, at the point
> them being reset. Otherwise, a higher layer may miss counter updates.
> 
> The af_packet PMD should return -ENOTSUP on reset (which is allowed by
> the API).
> 
> Maintaining an offset-since-last-reset for counters is a control plane
> thing, and shouldn't be in PMDs. (If MT-safe reset for SW-managed
> counters are to be expected from the PMDs, we should have some helper
> API to facilitate its efficient & correct-enough implementation.)
  
Ferruh Yigit April 25, 2024, 2:04 p.m. UTC | #15
On 4/25/2024 10:26 AM, Mattias Rönnblom wrote:
> On 2024-04-25 01:55, Stephen Hemminger wrote:
>> On Thu, 25 Apr 2024 00:27:36 +0200
>> Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>>
>>> On 2024-04-24 21:13, Stephen Hemminger wrote:
>>>> On Wed, 24 Apr 2024 18:50:50 +0100
>>>> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
>>>>   
>>>>>> I don't know how slow af_packet is, but if you care about
>>>>>> performance,
>>>>>> you don't want to use atomic add for statistics.
>>>>>>       
>>>>>
>>>>> There are a few soft drivers already using atomics adds for
>>>>> updating stats.
>>>>> If we document expectations from 'rte_eth_stats_reset()', we can
>>>>> update
>>>>> those usages.
>>>>
>>>> Using atomic add is lots of extra overhead. The statistics are not
>>>> guaranteed
>>>> to be perfect.  If nothing else, the bytes and packets can be skewed.
>>>>    
>>>
>>> The sad thing here is that in case the counters are reset within the
>>> load-modify-store cycle of the lcore counter update, the reset may end
>>> up being a nop. So, it's not like you missed a packet or two, or suffer
>>> some transient inconsistency, but you completed and permanently ignored
>>> the reset request.
>>
>> That is one of the many reasons the Linux kernel intentionally did not
>> implement a reset statistics operation.
> 
> DPDK should deprecate statistics reset, it seems to me.
> 
> The current API is broken anyway, if you care about correctness. A reset
> function must return the current state of the counters, at the point
> them being reset. Otherwise, a higher layer may miss counter updates.
> 
> The af_packet PMD should return -ENOTSUP on reset (which is allowed by
> the API).
> 
> Maintaining an offset-since-last-reset for counters is a control plane
> thing, and shouldn't be in PMDs. (If MT-safe reset for SW-managed
> counters are to be expected from the PMDs, we should have some helper
> API to facilitate its efficient & correct-enough implementation.)
>

Statistics reset works for HW devices; instead of removing statistics
reset, I am for documenting in the API that it may not be reliable. I can
send a patch for it.

With the above change, we can be more relaxed on stats updates, especially
for soft drivers, and can convert atomic-add stats updates to "atomic load +
add + atomic store".

Does this plan make sense?
  
Ferruh Yigit April 25, 2024, 2:08 p.m. UTC | #16
On 4/23/2024 12:15 PM, Ferruh Yigit wrote:
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
>>  drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
>> index 397a32db58..28aeb7d08e 100644
>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>> @@ -6,6 +6,7 @@
>>   * All rights reserved.
>>   */
>>  
>> +#include <rte_common.h>
>>  #include <rte_string_fns.h>
>>  #include <rte_mbuf.h>
>>  #include <ethdev_driver.h>
>> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>>  
>>  	volatile unsigned long rx_pkts;
>>  	volatile unsigned long rx_bytes;
>> -};
>> +} __rte_cache_aligned;
>>  
> Latest location for '__rte_cache_aligned' tag is at the beginning of the
> struct [1], so something like:
> `struct __rte_cache_aligned pkt_rx_queue {`
> 
> [1]
> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both
>

Hi Mattias,

We dived into side discussions but with above change I can proceed with
the patch.
  
Mattias Rönnblom April 25, 2024, 3:06 p.m. UTC | #17
On 2024-04-25 16:04, Ferruh Yigit wrote:
> On 4/25/2024 10:26 AM, Mattias Rönnblom wrote:
>> On 2024-04-25 01:55, Stephen Hemminger wrote:
>>> On Thu, 25 Apr 2024 00:27:36 +0200
>>> Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>>>
>>>> On 2024-04-24 21:13, Stephen Hemminger wrote:
>>>>> On Wed, 24 Apr 2024 18:50:50 +0100
>>>>> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
>>>>>    
>>>>>>> I don't know how slow af_packet is, but if you care about
>>>>>>> performance,
>>>>>>> you don't want to use atomic add for statistics.
>>>>>>>        
>>>>>>
>>>>>> There are a few soft drivers already using atomics adds for
>>>>>> updating stats.
>>>>>> If we document expectations from 'rte_eth_stats_reset()', we can
>>>>>> update
>>>>>> those usages.
>>>>>
>>>>> Using atomic add is lots of extra overhead. The statistics are not
>>>>> guaranteed
>>>>> to be perfect.  If nothing else, the bytes and packets can be skewed.
>>>>>     
>>>>
>>>> The sad thing here is that in case the counters are reset within the
>>>> load-modify-store cycle of the lcore counter update, the reset may end
>>>> up being a nop. So, it's not like you missed a packet or two, or suffer
>>>> some transient inconsistency, but you completed and permanently ignored
>>>> the reset request.
>>>
>>> That is one of the many reasons the Linux kernel intentionally did not
>>> implement a reset statistics operation.
>>
>> DPDK should deprecate statistics reset, it seems to me.
>>
>> The current API is broken anyway, if you care about correctness. A reset
>> function must return the current state of the counters, at the point
>> them being reset. Otherwise, a higher layer may miss counter updates.
>>
>> The af_packet PMD should return -ENOTSUP on reset (which is allowed by
>> the API).
>>
>> Maintaining an offset-since-last-reset for counters is a control plane
>> thing, and shouldn't be in PMDs. (If MT-safe reset for SW-managed
>> counters are to be expected from the PMDs, we should have some helper
>> API to facilitate its efficient & correct-enough implementation.)
>>
> 
> statistics reset works for HW devices, instead of removing statics reset
> I am for documenting API that it may be not reliable, I can send a patch
> for it.
> 

With API you mean <rte_ethdev.h>?

If rte_ethdev_stats_reset() sometimes resets the counters, and sometimes 
doesn't, it should also have a name that reflects those semantics:
rte_ethdev_stats_reset_or_perhaps_not()
rte_ethdev_stats_usually_reset()

Rather than expecting the application to deal with unspecified 
non-determinism, it seems better to specify under which conditions the 
reset is reliable (i.e., it's not MT safe). A non-MT-safe reset will 
limit its usefulness, though. Also, it will make having an MT-safe reset 
in a PMD pretty useless, except if the app is programmed not toward the 
API, but toward some particular PMD.

> With above change, we can be more relax on stats update specially for
> soft drivers, and can convert atomic_add stats updates to "atomic load +
> add + atomic store".
> 
> Does this plan make sense?

Not really.

Short-term the -ENOTSUP seems like the best option. Second best option 
is to implement a proper MT safe reset.

What is unfortunate is that the API is silent on MT safety. I've 
*assumed* that many users will have assumed it MT safe, but there's 
nothing in the documentation to support that. Rather the opposite, the 
generic section of DPDK MT safety only mentions PMD RX and TX functions.

This issue is not limited to this PMD or even to ethdev. 
rte_event_dev_xstats_reset() and some of the event devs have the same 
problem.
  
Stephen Hemminger April 25, 2024, 3:07 p.m. UTC | #18
On Thu, 25 Apr 2024 15:04:27 +0100
Ferruh Yigit <ferruh.yigit@amd.com> wrote:

> > Maintaining an offset-since-last-reset for counters is a control plane
> > thing, and shouldn't be in PMDs. (If MT-safe reset for SW-managed
> > counters are to be expected from the PMDs, we should have some helper
> > API to facilitate its efficient & correct-enough implementation.)
> >  
> 
> statistics reset works for HW devices, instead of removing statics reset
> I am for documenting API that it may be not reliable, I can send a patch
> for it.
> 
> With above change, we can be more relax on stats update specially for
> soft drivers, and can convert atomic_add stats updates to "atomic load +
> add + atomic store".
> 
> Does this plan make sense?


+1
  
Mattias Rönnblom April 25, 2024, 3:08 p.m. UTC | #19
On 2024-04-25 16:08, Ferruh Yigit wrote:
> On 4/23/2024 12:15 PM, Ferruh Yigit wrote:
>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>> ---
>>>   drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
>>> index 397a32db58..28aeb7d08e 100644
>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>> @@ -6,6 +6,7 @@
>>>    * All rights reserved.
>>>    */
>>>   
>>> +#include <rte_common.h>
>>>   #include <rte_string_fns.h>
>>>   #include <rte_mbuf.h>
>>>   #include <ethdev_driver.h>
>>> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>>>   
>>>   	volatile unsigned long rx_pkts;
>>>   	volatile unsigned long rx_bytes;
>>> -};
>>> +} __rte_cache_aligned;
>>>   
>> Latest location for '__rte_cache_aligned' tag is at the beginning of the
>> struct [1], so something like:
>> `struct __rte_cache_aligned pkt_rx_queue {`
>>
>> [1]
>> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both
>>
> 
> Hi Mattias,
> 
> We dived into side discussions but with above change I can proceed with
> the patch.

OK.

Should this go into some stable branch as well?
  
Ferruh Yigit April 25, 2024, 3:35 p.m. UTC | #20
On 4/25/2024 4:08 PM, Mattias Rönnblom wrote:
> On 2024-04-25 16:08, Ferruh Yigit wrote:
>> On 4/23/2024 12:15 PM, Ferruh Yigit wrote:
>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>> ---
>>>>   drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>>>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
>>>> b/drivers/net/af_packet/rte_eth_af_packet.c
>>>> index 397a32db58..28aeb7d08e 100644
>>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>>> @@ -6,6 +6,7 @@
>>>>    * All rights reserved.
>>>>    */
>>>>   +#include <rte_common.h>
>>>>   #include <rte_string_fns.h>
>>>>   #include <rte_mbuf.h>
>>>>   #include <ethdev_driver.h>
>>>> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>>>>         volatile unsigned long rx_pkts;
>>>>       volatile unsigned long rx_bytes;
>>>> -};
>>>> +} __rte_cache_aligned;
>>>>   
>>> Latest location for '__rte_cache_aligned' tag is at the beginning of the
>>> struct [1], so something like:
>>> `struct __rte_cache_aligned pkt_rx_queue {`
>>>
>>> [1]
>>> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both
>>>
>>
>> Hi Mattias,
>>
>> We dived into side discussions but with above change I can proceed with
>> the patch.
> 
> OK.
> 
> Should this go into some stable branch as well?
>

I don't see any reason for not merging to stable trees. For this, please
add the Fixes and stable tags.
  
Ferruh Yigit April 25, 2024, 4:21 p.m. UTC | #21
On 4/25/2024 4:06 PM, Mattias Rönnblom wrote:
> On 2024-04-25 16:04, Ferruh Yigit wrote:
>> On 4/25/2024 10:26 AM, Mattias Rönnblom wrote:
>>> On 2024-04-25 01:55, Stephen Hemminger wrote:
>>>> On Thu, 25 Apr 2024 00:27:36 +0200
>>>> Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>>>>
>>>>> On 2024-04-24 21:13, Stephen Hemminger wrote:
>>>>>> On Wed, 24 Apr 2024 18:50:50 +0100
>>>>>> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
>>>>>>   
>>>>>>>> I don't know how slow af_packet is, but if you care about
>>>>>>>> performance,
>>>>>>>> you don't want to use atomic add for statistics.
>>>>>>>>        
>>>>>>>
>>>>>>> There are a few soft drivers already using atomics adds for
>>>>>>> updating stats.
>>>>>>> If we document expectations from 'rte_eth_stats_reset()', we can
>>>>>>> update
>>>>>>> those usages.
>>>>>>
>>>>>> Using atomic add is lots of extra overhead. The statistics are not
>>>>>> guaranteed
>>>>>> to be perfect.  If nothing else, the bytes and packets can be skewed.
>>>>>>     
>>>>>
>>>>> The sad thing here is that in case the counters are reset within the
>>>>> load-modify-store cycle of the lcore counter update, the reset may end
>>>>> up being a nop. So, it's not like you missed a packet or two, or
>>>>> suffer
>>>>> some transient inconsistency, but you completed and permanently
>>>>> ignored
>>>>> the reset request.
>>>>
>>>> That is one of the many reasons the Linux kernel intentionally did not
>>>> implement a reset statistics operation.
>>>
>>> DPDK should deprecate statistics reset, it seems to me.
>>>
>>> The current API is broken anyway, if you care about correctness. A reset
>>> function must return the current state of the counters, at the point
>>> them being reset. Otherwise, a higher layer may miss counter updates.
>>>
>>> The af_packet PMD should return -ENOTSUP on reset (which is allowed by
>>> the API).
>>>
>>> Maintaining an offset-since-last-reset for counters is a control plane
>>> thing, and shouldn't be in PMDs. (If MT-safe reset for SW-managed
>>> counters are to be expected from the PMDs, we should have some helper
>>> API to facilitate its efficient & correct-enough implementation.)
>>>
>>
>> statistics reset works for HW devices, instead of removing statics reset
>> I am for documenting API that it may be not reliable, I can send a patch
>> for it.
>>
> 
> With API you mean <rte_ethdev.h>?
> 
> If rte_ethdev_stats_reset() sometimes reset the counters, and sometimes
> doesn't, it should also have a name that reflect those semantics.
> rte_ethdev_stats_reset_or_perhaps_not()
> rte_ethdev_stats_usually_reset()
> 

:) point taken

> Rather than expecting the application to deal with unspecified
> non-determinism is seems better to specify under which conditions the
> reset is reliable (i.e., it's not MT safe). A non-MT-safe reset will
> limit it usefulness though. Also, it will make having an MT safe reset
> in a PMD pretty useless, except if the app is programmed not toward the
> API, but toward some particular PMD.
> 
>> With above change, we can be more relax on stats update specially for
>> soft drivers, and can convert atomic_add stats updates to "atomic load +
>> add + atomic store".
>>
>> Does this plan make sense?
> 
> Not really.
> 
> Short-term the -ENOTSUP seems like the best option. Second best option
> is to implement a proper MT safe reset.
> 
> What is unfortunate is that the API is silent on MT safety. I've
> *assumed* that many users will have assumed it MT safe, but there's
> nothing in the documentation to support that. Rather the opposite, the
> generic section of DPDK MT safety only mentions PMD RX and TX functions.
> 
> This issue is not limited to this PMD or even to ethdev.
> rte_event_dev_xstats_reset() and some of the event devs have the same
> problem.
>

True, and we can at least document the multi-thread safety of the APIs;
this is a small effort.

For reliable stats reset, implementing the offset logic is not that hard,
and it enables more optimized stats updates in the datapath.
We can go with this option instead of an unreliable stats reset.

I can add this at least for af_packet, as a sample.
As far as I can see, with this update we can get rid of the volatile
qualifier and the atomic set for the stats variables.
  
Mattias Rönnblom April 26, 2024, 7:25 a.m. UTC | #22
On 2024-04-25 17:35, Ferruh Yigit wrote:
> On 4/25/2024 4:08 PM, Mattias Rönnblom wrote:
>> On 2024-04-25 16:08, Ferruh Yigit wrote:
>>> On 4/23/2024 12:15 PM, Ferruh Yigit wrote:
>>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>>> ---
>>>>>    drivers/net/af_packet/rte_eth_af_packet.c | 5 +++--
>>>>>    1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> b/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> index 397a32db58..28aeb7d08e 100644
>>>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> @@ -6,6 +6,7 @@
>>>>>     * All rights reserved.
>>>>>     */
>>>>>    +#include <rte_common.h>
>>>>>    #include <rte_string_fns.h>
>>>>>    #include <rte_mbuf.h>
>>>>>    #include <ethdev_driver.h>
>>>>> @@ -53,7 +54,7 @@ struct pkt_rx_queue {
>>>>>          volatile unsigned long rx_pkts;
>>>>>        volatile unsigned long rx_bytes;
>>>>> -};
>>>>> +} __rte_cache_aligned;
>>>>>    
>>>> Latest location for '__rte_cache_aligned' tag is at the beginning of the
>>>> struct [1], so something like:
>>>> `struct __rte_cache_aligned pkt_rx_queue {`
>>>>
>>>> [1]
>>>> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both
>>>>
>>>
>>> Hi Mattias,
>>>
>>> We dived into side discussions but with above change I can proceed with
>>> the patch.
>>
>> OK.
>>
>> Should this go into some stable branch as well?
>>
> 
> I don't see any reason for not merging to stable trees. For this please
> add fixes and stable tags.

OK, I'll submit a v2. Thanks.
  
Patrick Robb April 26, 2024, 9:27 p.m. UTC | #23
Recheck-request: iol-compile-amd64-testing

The DPDK Community Lab updated to the latest Alpine image yesterday, which
resulted in all Alpine builds failing. The failure is unrelated to your
patch, and this recheck should remove the fail on Patchwork, as we have
disabled Alpine testing for now.
  

Patch

diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
index 397a32db58..28aeb7d08e 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -6,6 +6,7 @@ 
  * All rights reserved.
  */
 
+#include <rte_common.h>
 #include <rte_string_fns.h>
 #include <rte_mbuf.h>
 #include <ethdev_driver.h>
@@ -53,7 +54,7 @@  struct pkt_rx_queue {
 
 	volatile unsigned long rx_pkts;
 	volatile unsigned long rx_bytes;
-};
+} __rte_cache_aligned;
 
 struct pkt_tx_queue {
 	int sockfd;
@@ -67,7 +68,7 @@  struct pkt_tx_queue {
 	volatile unsigned long tx_pkts;
 	volatile unsigned long err_pkts;
 	volatile unsigned long tx_bytes;
-};
+} __rte_cache_aligned;
 
 struct pmd_internals {
 	unsigned nb_queues;