[RFC,v7,3/6] eal: add exactly-once bit access functions

Message ID 20240505083737.118649-4-mattias.ronnblom@ericsson.com (mailing list archive)
State New
Delegated to: Thomas Monjalon
Headers
Series Improve EAL bit operations API |

Checks

Context Check Description
ci/checkpatch warning coding style issues

Commit Message

Mattias Rönnblom May 5, 2024, 8:37 a.m. UTC
  Add test/set/clear/assign/flip functions which prevents certain
compiler optimizations and guarantees that program-level memory loads
and/or stores will actually occur.

These functions are useful when interacting with memory-mapped
hardware devices.

The "once" family of functions does not promise atomicity and provides
no memory ordering guarantees beyond the C11 relaxed memory model.

RFC v7:
 * Fix various minor issues in documentation.

RFC v6:
 * Have rte_bit_once_test() accept const-marked bitsets.

RFC v3:
 * Work around lack of C++ support for _Generic (Tyler Retzlaff).

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Tyler Retzlaff <roretzla@linux.microsoft.com>
---
 lib/eal/include/rte_bitops.h | 201 +++++++++++++++++++++++++++++++++++
 1 file changed, 201 insertions(+)
  

Comments

Morten Brørup May 7, 2024, 7:17 p.m. UTC | #1
> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Sunday, 5 May 2024 10.38
> 
> Add test/set/clear/assign/flip functions which prevents certain
> compiler optimizations and guarantees that program-level memory loads
> and/or stores will actually occur.
> 
> These functions are useful when interacting with memory-mapped
> hardware devices.
> 
> The "once" family of functions does not promise atomicity and provides
> no memory ordering guarantees beyond the C11 relaxed memory model.

In another thread, Stephen referred to the extended discussion on memory models in Linux kernel documentation:
https://www.kernel.org/doc/html/latest/core-api/wrappers/memory-barriers.html

Unlike the "once" family of functions in this RFC, the "once" family of functions in the kernel also guarantee memory ordering, specifically for memory-mapped hardware devices. The document describes the rationale with examples.

It makes me think that DPDK "once" family of functions should behave similarly.
Alternatively, if the "once" family of functions cannot be generically implemented with a memory ordering that is optimal for all use cases, drop this family of functions, and instead rely on the "atomic" family of functions for interacting with memory-mapped hardware devices.

> 
> RFC v7:
>  * Fix various minor issues in documentation.
> 
> RFC v6:
>  * Have rte_bit_once_test() accept const-marked bitsets.
> 
> RFC v3:
>  * Work around lack of C++ support for _Generic (Tyler Retzlaff).
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> Acked-by: Tyler Retzlaff <roretzla@linux.microsoft.com>
> ---
  
Mattias Rönnblom May 8, 2024, 6:47 a.m. UTC | #2
On 2024-05-07 21:17, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Sunday, 5 May 2024 10.38
>>
>> Add test/set/clear/assign/flip functions which prevents certain
>> compiler optimizations and guarantees that program-level memory loads
>> and/or stores will actually occur.
>>
>> These functions are useful when interacting with memory-mapped
>> hardware devices.
>>
>> The "once" family of functions does not promise atomicity and provides
>> no memory ordering guarantees beyond the C11 relaxed memory model.
> 
> In another thread, Stephen referred to the extended discussion on memory models in Linux kernel documentation:
> https://www.kernel.org/doc/html/latest/core-api/wrappers/memory-barriers.html
> 
> Unlike the "once" family of functions in this RFC, the "once" family of functions in the kernel also guarantee memory ordering, specifically for memory-mapped hardware devices. The document describes the rationale with examples.
> 

What more specifically did you have in mind? READ_ONCE() and 
WRITE_ONCE()? They give almost no guarantees. Very much relaxed.

I've read that document.

What you should keep in mind if you read that document, is that DPDK 
doesn't use the kernel's memory model, and doesn't have the kernel's 
barrier and atomics APIs. What we have are an obsolete, miniature 
look-alike in <rte_atomic.h> and something C11-like in <rte_stdatomic.h>.

My general impression is that DPDK was moving in the C11 direction 
memory model-wise, which is not the model the kernel uses.

> It makes me think that DPDK "once" family of functions should behave similarly.

I think they do already.

Also, rte_bit_once_set() works as the kernel's __set_bit().

> Alternatively, if the "once" family of functions cannot be generically implemented with a memory ordering that is optimal for all use cases, drop this family of functions, and instead rely on the "atomic" family of functions for interacting with memory-mapped hardware devices.
> 
>>
>> RFC v7:
>>   * Fix various minor issues in documentation.
>>
>> RFC v6:
>>   * Have rte_bit_once_test() accept const-marked bitsets.
>>
>> RFC v3:
>>   * Work around lack of C++ support for _Generic (Tyler Retzlaff).
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>> Acked-by: Tyler Retzlaff <roretzla@linux.microsoft.com>
>> ---
>
  
Morten Brørup May 8, 2024, 7:33 a.m. UTC | #3
> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Wednesday, 8 May 2024 08.47
> 
> On 2024-05-07 21:17, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >> Sent: Sunday, 5 May 2024 10.38
> >>
> >> Add test/set/clear/assign/flip functions which prevents certain
> >> compiler optimizations and guarantees that program-level memory loads
> >> and/or stores will actually occur.
> >>
> >> These functions are useful when interacting with memory-mapped
> >> hardware devices.
> >>
> >> The "once" family of functions does not promise atomicity and provides
> >> no memory ordering guarantees beyond the C11 relaxed memory model.
> >
> > In another thread, Stephen referred to the extended discussion on memory
> models in Linux kernel documentation:
> > https://www.kernel.org/doc/html/latest/core-api/wrappers/memory-
> barriers.html
> >
> > Unlike the "once" family of functions in this RFC, the "once" family of
> functions in the kernel also guarantee memory ordering, specifically for
> memory-mapped hardware devices. The document describes the rationale with
> examples.
> >
> 
> What more specifically did you have in mind? READ_ONCE() and
> WRITE_ONCE()? They give almost no guarantees. Very much relaxed.

The way I read it, they do provide memory ordering guarantees.

Ignore that the kernel's "once" functions operates on words and this RFC operates on bits, the behavior is the same. Either there are memory ordering guarantees, or there are not.

> 
> I've read that document.
> 
> What you should keep in mind if you read that document, is that DPDK
> doesn't use the kernel's memory model, and doesn't have the kernel's
> barrier and atomics APIs. What we have are an obsolete, miniature
> look-alike in <rte_atomic.h> and something C11-like in <rte_stdatomic.h>.
> 
> My general impression is that DPDK was moving in the C11 direction
> memory model-wise, which is not the model the kernel uses.

I think you and I agree that using legacy methods only because "the kernel does it that way" would not be the optimal roadmap for DPDK.

We should keep moving in the C11 direction memory model-wise.
I consider it more descriptive, and thus expect compilers to eventually produce better optimized code.

> 
> > It makes me think that DPDK "once" family of functions should behave
> similarly.
> 
> I think they do already.

I haven't looked deep into it, but the RFC's documentation says otherwise:
The "once" family of functions does not promise atomicity and provides *no memory ordering* guarantees beyond the C11 relaxed memory model.

> 
> Also, rte_bit_once_set() works as the kernel's __set_bit().
> 
> > Alternatively, if the "once" family of functions cannot be generically
> implemented with a memory ordering that is optimal for all use cases, drop
> this family of functions, and instead rely on the "atomic" family of functions
> for interacting with memory-mapped hardware devices.
> >
> >>
> >> RFC v7:
> >>   * Fix various minor issues in documentation.
> >>
> >> RFC v6:
> >>   * Have rte_bit_once_test() accept const-marked bitsets.
> >>
> >> RFC v3:
> >>   * Work around lack of C++ support for _Generic (Tyler Retzlaff).
> >>
> >> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> >> Acked-by: Tyler Retzlaff <roretzla@linux.microsoft.com>
> >> ---
> >
  
Mattias Rönnblom May 8, 2024, 8 a.m. UTC | #4
On 2024-05-08 09:33, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Wednesday, 8 May 2024 08.47
>>
>> On 2024-05-07 21:17, Morten Brørup wrote:
>>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>>>> Sent: Sunday, 5 May 2024 10.38
>>>>
>>>> Add test/set/clear/assign/flip functions which prevents certain
>>>> compiler optimizations and guarantees that program-level memory loads
>>>> and/or stores will actually occur.
>>>>
>>>> These functions are useful when interacting with memory-mapped
>>>> hardware devices.
>>>>
>>>> The "once" family of functions does not promise atomicity and provides
>>>> no memory ordering guarantees beyond the C11 relaxed memory model.
>>>
>>> In another thread, Stephen referred to the extended discussion on memory
>> models in Linux kernel documentation:
>>> https://www.kernel.org/doc/html/latest/core-api/wrappers/memory-
>> barriers.html
>>>
>>> Unlike the "once" family of functions in this RFC, the "once" family of
>> functions in the kernel also guarantee memory ordering, specifically for
>> memory-mapped hardware devices. The document describes the rationale with
>> examples.
>>>
>>
>> What more specifically did you have in mind? READ_ONCE() and
>> WRITE_ONCE()? They give almost no guarantees. Very much relaxed.
> 
> The way I read it, they do provide memory ordering guarantees.
> 

Sure. All types memory operations comes with some kind guarantees. A 
series of non-atomic, non-volatile stores issued by a particular thread 
are guaranteed to happen in program order, from the point of view of 
that thread, for example. Would be hard to write a program if that 
wasn't true.

"This macro does not give any guarantees in regards to memory ordering /../"

This is not true. I will rephrase to "any *additional* guarantees" for 
both plain and "once" family documentation.

> Ignore that the kernel's "once" functions operates on words and this RFC operates on bits, the behavior is the same. Either there are memory ordering guarantees, or there are not.
> 
>>
>> I've read that document.
>>
>> What you should keep in mind if you read that document, is that DPDK
>> doesn't use the kernel's memory model, and doesn't have the kernel's
>> barrier and atomics APIs. What we have are an obsolete, miniature
>> look-alike in <rte_atomic.h> and something C11-like in <rte_stdatomic.h>.
>>
>> My general impression is that DPDK was moving in the C11 direction
>> memory model-wise, which is not the model the kernel uses.
> 
> I think you and I agree that using legacy methods only because "the kernel does it that way" would not be the optimal roadmap for DPDK.
> 
> We should keep moving in the C11 direction memory model-wise.
> I consider it more descriptive, and thus expect compilers to eventually produce better optimized code.
> 
>>
>>> It makes me think that DPDK "once" family of functions should behave
>> similarly.
>>
>> I think they do already.
> 
> I haven't looked deep into it, but the RFC's documentation says otherwise:
> The "once" family of functions does not promise atomicity and provides *no memory ordering* guarantees beyond the C11 relaxed memory model.
> 
>>
>> Also, rte_bit_once_set() works as the kernel's __set_bit().
>>
>>> Alternatively, if the "once" family of functions cannot be generically
>> implemented with a memory ordering that is optimal for all use cases, drop
>> this family of functions, and instead rely on the "atomic" family of functions
>> for interacting with memory-mapped hardware devices.
>>>
>>>>
>>>> RFC v7:
>>>>    * Fix various minor issues in documentation.
>>>>
>>>> RFC v6:
>>>>    * Have rte_bit_once_test() accept const-marked bitsets.
>>>>
>>>> RFC v3:
>>>>    * Work around lack of C++ support for _Generic (Tyler Retzlaff).
>>>>
>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>>>> Acked-by: Tyler Retzlaff <roretzla@linux.microsoft.com>
>>>> ---
>>>
  
Morten Brørup May 8, 2024, 8:11 a.m. UTC | #5
> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Wednesday, 8 May 2024 10.00
> 
> On 2024-05-08 09:33, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >> Sent: Wednesday, 8 May 2024 08.47
> >>
> >> On 2024-05-07 21:17, Morten Brørup wrote:
> >>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >>>> Sent: Sunday, 5 May 2024 10.38
> >>>>
> >>>> Add test/set/clear/assign/flip functions which prevents certain
> >>>> compiler optimizations and guarantees that program-level memory loads
> >>>> and/or stores will actually occur.
> >>>>
> >>>> These functions are useful when interacting with memory-mapped
> >>>> hardware devices.
> >>>>
> >>>> The "once" family of functions does not promise atomicity and provides
> >>>> no memory ordering guarantees beyond the C11 relaxed memory model.
> >>>
> >>> In another thread, Stephen referred to the extended discussion on memory
> >> models in Linux kernel documentation:
> >>> https://www.kernel.org/doc/html/latest/core-api/wrappers/memory-
> >> barriers.html
> >>>
> >>> Unlike the "once" family of functions in this RFC, the "once" family of
> >> functions in the kernel also guarantee memory ordering, specifically for
> >> memory-mapped hardware devices. The document describes the rationale with
> >> examples.
> >>>
> >>
> >> What more specifically did you have in mind? READ_ONCE() and
> >> WRITE_ONCE()? They give almost no guarantees. Very much relaxed.
> >
> > The way I read it, they do provide memory ordering guarantees.
> >
> 
> Sure. All types memory operations comes with some kind guarantees. A
> series of non-atomic, non-volatile stores issued by a particular thread
> are guaranteed to happen in program order, from the point of view of
> that thread, for example. Would be hard to write a program if that
> wasn't true.
> 
> "This macro does not give any guarantees in regards to memory ordering /../"
> 
> This is not true. I will rephrase to "any *additional* guarantees" for
> both plain and "once" family documentation.

Consider code like this:
set_once(HW_START_BIT);
while (!get_once(HW_DONE_BIT)) /*busy wait*/;

If the "once" functions are used for hardware access, they must guarantee that HW_START_BIT has been written before HW_DONE_BIT is read.

The documentation must reflect this ordering guarantee.

> 
> > Ignore that the kernel's "once" functions operates on words and this RFC
> operates on bits, the behavior is the same. Either there are memory ordering
> guarantees, or there are not.
> >
> >>
> >> I've read that document.
> >>
> >> What you should keep in mind if you read that document, is that DPDK
> >> doesn't use the kernel's memory model, and doesn't have the kernel's
> >> barrier and atomics APIs. What we have are an obsolete, miniature
> >> look-alike in <rte_atomic.h> and something C11-like in <rte_stdatomic.h>.
> >>
> >> My general impression is that DPDK was moving in the C11 direction
> >> memory model-wise, which is not the model the kernel uses.
> >
> > I think you and I agree that using legacy methods only because "the kernel
> does it that way" would not be the optimal roadmap for DPDK.
> >
> > We should keep moving in the C11 direction memory model-wise.
> > I consider it more descriptive, and thus expect compilers to eventually
> produce better optimized code.
> >
> >>
> >>> It makes me think that DPDK "once" family of functions should behave
> >> similarly.
> >>
> >> I think they do already.
> >
> > I haven't looked deep into it, but the RFC's documentation says otherwise:
> > The "once" family of functions does not promise atomicity and provides *no
> memory ordering* guarantees beyond the C11 relaxed memory model.
> >
> >>
> >> Also, rte_bit_once_set() works as the kernel's __set_bit().
> >>
> >>> Alternatively, if the "once" family of functions cannot be generically
> >> implemented with a memory ordering that is optimal for all use cases, drop
> >> this family of functions, and instead rely on the "atomic" family of
> functions
> >> for interacting with memory-mapped hardware devices.
> >>>
> >>>>
> >>>> RFC v7:
> >>>>    * Fix various minor issues in documentation.
> >>>>
> >>>> RFC v6:
> >>>>    * Have rte_bit_once_test() accept const-marked bitsets.
> >>>>
> >>>> RFC v3:
> >>>>    * Work around lack of C++ support for _Generic (Tyler Retzlaff).
> >>>>
> >>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >>>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> >>>> Acked-by: Tyler Retzlaff <roretzla@linux.microsoft.com>
> >>>> ---
> >>>
  
Mattias Rönnblom May 8, 2024, 9:27 a.m. UTC | #6
On 2024-05-08 10:11, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Wednesday, 8 May 2024 10.00
>>
>> On 2024-05-08 09:33, Morten Brørup wrote:
>>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>>>> Sent: Wednesday, 8 May 2024 08.47
>>>>
>>>> On 2024-05-07 21:17, Morten Brørup wrote:
>>>>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>>>>>> Sent: Sunday, 5 May 2024 10.38
>>>>>>
>>>>>> Add test/set/clear/assign/flip functions which prevents certain
>>>>>> compiler optimizations and guarantees that program-level memory loads
>>>>>> and/or stores will actually occur.
>>>>>>
>>>>>> These functions are useful when interacting with memory-mapped
>>>>>> hardware devices.
>>>>>>
>>>>>> The "once" family of functions does not promise atomicity and provides
>>>>>> no memory ordering guarantees beyond the C11 relaxed memory model.
>>>>>
>>>>> In another thread, Stephen referred to the extended discussion on memory
>>>> models in Linux kernel documentation:
>>>>> https://www.kernel.org/doc/html/latest/core-api/wrappers/memory-
>>>> barriers.html
>>>>>
>>>>> Unlike the "once" family of functions in this RFC, the "once" family of
>>>> functions in the kernel also guarantee memory ordering, specifically for
>>>> memory-mapped hardware devices. The document describes the rationale with
>>>> examples.
>>>>>
>>>>
>>>> What more specifically did you have in mind? READ_ONCE() and
>>>> WRITE_ONCE()? They give almost no guarantees. Very much relaxed.
>>>
>>> The way I read it, they do provide memory ordering guarantees.
>>>
>>
>> Sure. All types memory operations comes with some kind guarantees. A
>> series of non-atomic, non-volatile stores issued by a particular thread
>> are guaranteed to happen in program order, from the point of view of
>> that thread, for example. Would be hard to write a program if that
>> wasn't true.
>>
>> "This macro does not give any guarantees in regards to memory ordering /../"
>>
>> This is not true. I will rephrase to "any *additional* guarantees" for
>> both plain and "once" family documentation.
> 
> Consider code like this:
> set_once(HW_START_BIT);
> while (!get_once(HW_DONE_BIT)) /*busy wait*/;
> 
> If the "once" functions are used for hardware access, they must guarantee that HW_START_BIT has been written before HW_DONE_BIT is read.
> 

Provided bits reside in the same word, there is (or at least, should be) 
such guarantee, and otherwise, you'll need a barrier.

I'm guessing in most cases the requirements are actually not as strict 
as you pose them: DONE starts as 0, so it may actually be read before 
START is written to, but not all DONE reads can be reordered ahead of 
the single START write. In that case, a compiler barrier between set and 
the get loop should suffice. Otherwise, you need a full barrier, or an 
I/O barrier.

Anyway, since the exact purpose of the "once" type bit operations is 
unclear, maybe I should drop them from the patch set.

Now, they are much like the Linux kernel's __set_bit(), but for hardware 
access, maybe they should be more like writel().

> The documentation must reflect this ordering guarantee.
> 
>>
>>> Ignore that the kernel's "once" functions operates on words and this RFC
>> operates on bits, the behavior is the same. Either there are memory ordering
>> guarantees, or there are not.
>>>
>>>>
>>>> I've read that document.
>>>>
>>>> What you should keep in mind if you read that document, is that DPDK
>>>> doesn't use the kernel's memory model, and doesn't have the kernel's
>>>> barrier and atomics APIs. What we have are an obsolete, miniature
>>>> look-alike in <rte_atomic.h> and something C11-like in <rte_stdatomic.h>.
>>>>
>>>> My general impression is that DPDK was moving in the C11 direction
>>>> memory model-wise, which is not the model the kernel uses.
>>>
>>> I think you and I agree that using legacy methods only because "the kernel
>> does it that way" would not be the optimal roadmap for DPDK.
>>>
>>> We should keep moving in the C11 direction memory model-wise.
>>> I consider it more descriptive, and thus expect compilers to eventually
>> produce better optimized code.
>>>
>>>>
>>>>> It makes me think that DPDK "once" family of functions should behave
>>>> similarly.
>>>>
>>>> I think they do already.
>>>
>>> I haven't looked deep into it, but the RFC's documentation says otherwise:
>>> The "once" family of functions does not promise atomicity and provides *no
>> memory ordering* guarantees beyond the C11 relaxed memory model.
>>>
>>>>
>>>> Also, rte_bit_once_set() works as the kernel's __set_bit().
>>>>
>>>>> Alternatively, if the "once" family of functions cannot be generically
>>>> implemented with a memory ordering that is optimal for all use cases, drop
>>>> this family of functions, and instead rely on the "atomic" family of
>> functions
>>>> for interacting with memory-mapped hardware devices.
>>>>>
>>>>>>
>>>>>> RFC v7:
>>>>>>     * Fix various minor issues in documentation.
>>>>>>
>>>>>> RFC v6:
>>>>>>     * Have rte_bit_once_test() accept const-marked bitsets.
>>>>>>
>>>>>> RFC v3:
>>>>>>     * Work around lack of C++ support for _Generic (Tyler Retzlaff).
>>>>>>
>>>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>>>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>>>>>> Acked-by: Tyler Retzlaff <roretzla@linux.microsoft.com>
>>>>>> ---
>>>>>
  
Morten Brørup May 8, 2024, 10:08 a.m. UTC | #7
> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Wednesday, 8 May 2024 11.27
> 
> On 2024-05-08 10:11, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >> Sent: Wednesday, 8 May 2024 10.00
> >>
> >> On 2024-05-08 09:33, Morten Brørup wrote:
> >>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >>>> Sent: Wednesday, 8 May 2024 08.47
> >>>>
> >>>> On 2024-05-07 21:17, Morten Brørup wrote:
> >>>>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >>>>>> Sent: Sunday, 5 May 2024 10.38
> >>>>>>
> >>>>>> Add test/set/clear/assign/flip functions which prevents certain
> >>>>>> compiler optimizations and guarantees that program-level memory loads
> >>>>>> and/or stores will actually occur.
> >>>>>>
> >>>>>> These functions are useful when interacting with memory-mapped
> >>>>>> hardware devices.
> >>>>>>
> >>>>>> The "once" family of functions does not promise atomicity and provides
> >>>>>> no memory ordering guarantees beyond the C11 relaxed memory model.
> >>>>>
> >>>>> In another thread, Stephen referred to the extended discussion on memory
> >>>> models in Linux kernel documentation:
> >>>>> https://www.kernel.org/doc/html/latest/core-api/wrappers/memory-
> >>>> barriers.html
> >>>>>
> >>>>> Unlike the "once" family of functions in this RFC, the "once" family of
> >>>> functions in the kernel also guarantee memory ordering, specifically for
> >>>> memory-mapped hardware devices. The document describes the rationale with
> >>>> examples.
> >>>>>
> >>>>
> >>>> What more specifically did you have in mind? READ_ONCE() and
> >>>> WRITE_ONCE()? They give almost no guarantees. Very much relaxed.
> >>>
> >>> The way I read it, they do provide memory ordering guarantees.
> >>>
> >>
> >> Sure. All types memory operations comes with some kind guarantees. A
> >> series of non-atomic, non-volatile stores issued by a particular thread
> >> are guaranteed to happen in program order, from the point of view of
> >> that thread, for example. Would be hard to write a program if that
> >> wasn't true.
> >>
> >> "This macro does not give any guarantees in regards to memory ordering
> /../"
> >>
> >> This is not true. I will rephrase to "any *additional* guarantees" for
> >> both plain and "once" family documentation.
> >
> > Consider code like this:
> > set_once(HW_START_BIT);
> > while (!get_once(HW_DONE_BIT)) /*busy wait*/;
> >
> > If the "once" functions are used for hardware access, they must guarantee
> that HW_START_BIT has been written before HW_DONE_BIT is read.
> >
> 
> Provided bits reside in the same word, there is (or at least, should be)
> such guarantee, and otherwise, you'll need a barrier.
> 
> I'm guessing in most cases the requirements are actually not as strict
> as you pose them: DONE starts as 0, so it may actually be read before
> START is written to, but not all DONE reads can be reordered ahead of
> the single START write. In that case, a compiler barrier between set and
> the get loop should suffice. Otherwise, you need a full barrier, or an
> I/O barrier.
> 
> Anyway, since the exact purpose of the "once" type bit operations is
> unclear, maybe I should drop them from the patch set.

I agree.

The "once" family, unless designed for accessing hardware registers, somehow seems like a subset of the "atomic" family.

Looking at DPDK drivers, they access hardware registers using e.g. rte_read32(), which looks like this:

static __rte_always_inline uint32_t
rte_read32(const volatile void *addr)
{
	uint32_t val;
	val = rte_read32_relaxed(addr);
	rte_io_rmb();
	return val;
}

If the "once" family of functions is for hardware access, they should do something similar regarding ordering and barriers.
And even if they do, I'm not sure the hardware driver developers are going to use them, unless other environments (e.g. Linux, Windows, BSD) supported by the hardware driver's common low-level code provide similar functions.

> 
> Now, they are much like the Linux kernel's __set_bit(), but for hardware
> access, maybe they should be more like writel().
> 
> > The documentation must reflect this ordering guarantee.
> >
> >>
> >>> Ignore that the kernel's "once" functions operates on words and this RFC
> >> operates on bits, the behavior is the same. Either there are memory
> ordering
> >> guarantees, or there are not.
> >>>
> >>>>
> >>>> I've read that document.
> >>>>
> >>>> What you should keep in mind if you read that document, is that DPDK
> >>>> doesn't use the kernel's memory model, and doesn't have the kernel's
> >>>> barrier and atomics APIs. What we have are an obsolete, miniature
> >>>> look-alike in <rte_atomic.h> and something C11-like in <rte_stdatomic.h>.
> >>>>
> >>>> My general impression is that DPDK was moving in the C11 direction
> >>>> memory model-wise, which is not the model the kernel uses.
> >>>
> >>> I think you and I agree that using legacy methods only because "the kernel
> >> does it that way" would not be the optimal roadmap for DPDK.
> >>>
> >>> We should keep moving in the C11 direction memory model-wise.
> >>> I consider it more descriptive, and thus expect compilers to eventually
> >> produce better optimized code.
> >>>
> >>>>
> >>>>> It makes me think that DPDK "once" family of functions should behave
> >>>> similarly.
> >>>>
> >>>> I think they do already.
> >>>
> >>> I haven't looked deep into it, but the RFC's documentation says otherwise:
> >>> The "once" family of functions does not promise atomicity and provides *no
> >> memory ordering* guarantees beyond the C11 relaxed memory model.
> >>>
> >>>>
> >>>> Also, rte_bit_once_set() works as the kernel's __set_bit().
> >>>>
> >>>>> Alternatively, if the "once" family of functions cannot be generically
> >>>> implemented with a memory ordering that is optimal for all use cases,
> drop
> >>>> this family of functions, and instead rely on the "atomic" family of
> >> functions
> >>>> for interacting with memory-mapped hardware devices.
> >>>>>
> >>>>>>
> >>>>>> RFC v7:
> >>>>>>     * Fix various minor issues in documentation.
> >>>>>>
> >>>>>> RFC v6:
> >>>>>>     * Have rte_bit_once_test() accept const-marked bitsets.
> >>>>>>
> >>>>>> RFC v3:
> >>>>>>     * Work around lack of C++ support for _Generic (Tyler Retzlaff).
> >>>>>>
> >>>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >>>>>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> >>>>>> Acked-by: Tyler Retzlaff <roretzla@linux.microsoft.com>
> >>>>>> ---
> >>>>>
  
Stephen Hemminger May 8, 2024, 3:15 p.m. UTC | #8
On Wed, 8 May 2024 09:33:43 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> > What more specifically did you have in mind? READ_ONCE() and
> > WRITE_ONCE()? They give almost no guarantees. Very much relaxed.  
> 
> The way I read it, they do provide memory ordering guarantees.
> 
> Ignore that the kernel's "once" functions operates on words and this RFC operates on bits, the behavior is the same. Either there are memory ordering guarantees, or there are not.

The kernel's READ_ONCE/WRITE_ONCE are compiler only ordering, i.e only apply to single CPU.
RTFM memory-barriers.txt..

GUARANTEES
----------

There are some minimal guarantees that may be expected of a CPU:

 (*) On any given CPU, dependent memory accesses will be issued in order, with
     respect to itself.  This means that for:

	Q = READ_ONCE(P); D = READ_ONCE(*Q);

     the CPU will issue the following memory operations:

	Q = LOAD P, D = LOAD *Q

     and always in that order.  However, on DEC Alpha, READ_ONCE() also
     emits a memory-barrier instruction, so that a DEC Alpha CPU will
     instead issue the following memory operations:

	Q = LOAD P, MEMORY_BARRIER, D = LOAD *Q, MEMORY_BARRIER

     Whether on DEC Alpha or not, the READ_ONCE() also prevents compiler
     mischief.

 (*) Overlapping loads and stores within a particular CPU will appear to be
     ordered within that CPU.  This means that for:

	a = READ_ONCE(*X); WRITE_ONCE(*X, b);

     the CPU will only issue the following sequence of memory operations:

	a = LOAD *X, STORE *X = b

     And for:

	WRITE_ONCE(*X, c); d = READ_ONCE(*X);

     the CPU will only issue:

	STORE *X = c, d = LOAD *X

     (Loads and stores overlap if they are targeted at overlapping pieces of
     memory).
  
Morten Brørup May 8, 2024, 4:16 p.m. UTC | #9
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Wednesday, 8 May 2024 17.16
> 
> On Wed, 8 May 2024 09:33:43 +0200
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> > > What more specifically did you have in mind? READ_ONCE() and
> > > WRITE_ONCE()? They give almost no guarantees. Very much relaxed.
> >
> > The way I read it, they do provide memory ordering guarantees.
> >
> > Ignore that the kernel's "once" functions operates on words and this RFC
> operates on bits, the behavior is the same. Either there are memory ordering
> guarantees, or there are not.
> 
> The kernel's READ_ONCE/WRITE_ONCE are compiler only ordering, i.e only apply
> to single CPU.
> RTFM memory-barriers.txt..
> 
> GUARANTEES
> ----------
> 
> There are some minimal guarantees that may be expected of a CPU:
> 
>  (*) On any given CPU, dependent memory accesses will be issued in order, with
>      respect to itself.  This means that for:
> 
> 	Q = READ_ONCE(P); D = READ_ONCE(*Q);
> 
>      the CPU will issue the following memory operations:
> 
> 	Q = LOAD P, D = LOAD *Q
> 
>      and always in that order.
>      However, on DEC Alpha, READ_ONCE() also
>      emits a memory-barrier instruction, so that a DEC Alpha CPU will
>      instead issue the following memory operations:
> 
> 	Q = LOAD P, MEMORY_BARRIER, D = LOAD *Q, MEMORY_BARRIER
> 
>      Whether on DEC Alpha or not, the READ_ONCE() also prevents compiler
>      mischief.
> 
>  (*) Overlapping loads and stores within a particular CPU will appear to be
>      ordered within that CPU.  This means that for:
> 
> 	a = READ_ONCE(*X); WRITE_ONCE(*X, b);
> 
>      the CPU will only issue the following sequence of memory operations:
> 
> 	a = LOAD *X, STORE *X = b
> 
>      And for:
> 
> 	WRITE_ONCE(*X, c); d = READ_ONCE(*X);
> 
>      the CPU will only issue:
> 
> 	STORE *X = c, d = LOAD *X
> 
>      (Loads and stores overlap if they are targeted at overlapping pieces of
>      memory).

It says "*the CPU* will issue the following [sequence of] *memory operations*",
not "*the compiler* will generate the following *CPU instructions*".

To me, that reads like a memory ordering guarantee.
  

Patch

diff --git a/lib/eal/include/rte_bitops.h b/lib/eal/include/rte_bitops.h
index 3297133e22..3644aa115c 100644
--- a/lib/eal/include/rte_bitops.h
+++ b/lib/eal/include/rte_bitops.h
@@ -226,6 +226,183 @@  extern "C" {
 		 uint32_t *: __rte_bit_flip32,				\
 		 uint64_t *: __rte_bit_flip64)(addr, nr)
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Generic selection macro to test exactly once the value of a bit in
+ * a 32-bit or 64-bit word. The type of operation depends on the type
+ * of the @c addr parameter.
+ *
+ * rte_bit_once_test() is guaranteed to result in exactly one memory
+ * load (e.g., it may not be eliminate or merged by the compiler).
+ *
+ * \code{.c}
+ * rte_bit_once_set(addr, 17);
+ * if (rte_bit_once_test(addr, 17)) {
+ *     ...
+ * }
+ * \endcode
+ *
+ * In the above example, rte_bit_once_set() may not be removed by
+ * the compiler, which would be allowed in case rte_bit_set() and
+ * rte_bit_test() was used.
+ *
+ * \code{.c}
+ * while (rte_bit_once_test(addr, 17);
+ *     ;
+ * \endcode
+ *
+ * In case rte_bit_test(addr, 17) was used instead, the resulting
+ * object code could (and in many cases would be) replaced with
+ * the equivalent to
+ * \code{.c}
+ * if (rte_bit_test(addr, 17)) {
+ *   for (;;) // spin forever
+ *       ;
+ * }
+ * \endcode
+ *
+ * rte_bit_once_test() does not give any guarantees in regards to
+ * memory ordering or atomicity.
+ *
+ * The regular bit set operations (e.g., rte_bit_test()) should be
+ * preferred over the "once" family of operations (e.g.,
+ * rte_bit_once_test()) if possible, since the latter may prevent
+ * optimizations crucial for run-time performance.
+ *
+ * @param addr
+ *   A pointer to the word to query.
+ * @param nr
+ *   The index of the bit.
+ * @return
+ *   Returns true if the bit is set, and false otherwise.
+ */
+
+#define rte_bit_once_test(addr, nr)					\
+	_Generic((addr),						\
+		uint32_t *: __rte_bit_once_test32,			\
+		const uint32_t *: __rte_bit_once_test32,		\
+		uint64_t *: __rte_bit_once_test64,			\
+		const uint64_t *: __rte_bit_once_test64)(addr, nr)
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Set bit in word exactly once.
+ *
+ * Generic selection macro to set bit specified by @c nr in the word
+ * pointed to by @c addr to '1' exactly once.
+ *
+ * rte_bit_once_set() is guaranteed to result in exactly one memory
+ * load and exactly one memory store, *or* an atomic bit set
+ * operation.
+ *
+ * See rte_bit_test_once32() for more information and uses cases for
+ * the "once" class of functions.
+ *
+ * This macro does not give any guarantees in regards to memory
+ * ordering or atomicity.
+ *
+ * @param addr
+ *   A pointer to the word to modify.
+ * @param nr
+ *   The index of the bit.
+ */
+
+#define rte_bit_once_set(addr, nr)				\
+	_Generic((addr),					\
+		 uint32_t *: __rte_bit_once_set32,		\
+		 uint64_t *: __rte_bit_once_set64)(addr, nr)
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Clear bit in word exactly once.
+ *
+ * Generic selection macro to set bit specified by @c nr in the word
+ * pointed to by @c addr to '0' exactly once.
+ *
+ * rte_bit_once_clear() is guaranteed to result in exactly one memory load
+ * and exactly one memory store, *or* an atomic bit clear operation.
+ *
+ * See rte_bit_test_once() for more information and uses cases for
+ * the "once" class of functions.
+ *
+ * This macro does not give any guarantees in regards to memory
+ * ordering or atomicity.
+ *
+ * @param addr
+ *   A pointer to the word to modify.
+ * @param nr
+ *   The index of the bit.
+ */
+#define rte_bit_once_clear(addr, nr)				\
+	_Generic((addr),					\
+		 uint32_t *: __rte_bit_once_clear32,		\
+		 uint64_t *: __rte_bit_once_clear64)(addr, nr)
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Assign a value to bit in a word exactly once.
+ *
+ * Generic selection macro to set bit specified by @c nr in the word
+ * pointed to by @c addr to the value indicated by @c value exactly
+ * once.
+ *
+ * rte_bit_once_assign() is guaranteed to result in exactly one memory
+ * load and exactly one memory store, *or* an atomic bit clear
+ * operation.
+ *
+ * This macro does not give any guarantees in regards to memory
+ * ordering or atomicity.
+ *
+ * @param addr
+ *   A pointer to the word to modify.
+ * @param nr
+ *   The index of the bit.
+ * @param value
+ *   The new value of the bit - true for '1', or false for '0'.
+ */
+#define rte_bit_once_assign(addr, nr, value)				\
+	_Generic((addr),						\
+		 uint32_t *: __rte_bit_once_assign32,			\
+		 uint64_t *: __rte_bit_once_assign64)(addr, nr, value)
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Flip bit in word, reading and writing exactly once.
+ *
+ * Generic selection macro to change the value of a bit to '0' if '1'
+ * or '1' if '0' in a 32-bit or 64-bit word. The type of operation
+ * depends on the type of the @c addr parameter.
+ *
+ * rte_bit_once_flip() is guaranteed to result in exactly one memory
+ * load and exactly one memory store, *or* an atomic bit flip
+ * operation.
+ *
+ * See rte_bit_test_once() for more information and uses cases for the
+ * "once" class of functions.
+ *
+ * This macro does not give any guarantees in regards to memory
+ * ordering or atomicity.
+ *
+ * @param addr
+ *   A pointer to the word to modify.
+ * @param nr
+ *   The index of the bit.
+ */
+#define rte_bit_once_flip(addr, nr)				\
+	_Generic((addr),					\
+		 uint32_t *: __rte_bit_once_flip32,		\
+		 uint64_t *: __rte_bit_once_flip64)(addr, nr)
+
 #define __RTE_GEN_BIT_TEST(family, fun, qualifier, size)		\
 	__rte_experimental						\
 	static inline bool						\
@@ -298,6 +475,18 @@  __RTE_GEN_BIT_CLEAR(, clear,, 64)
 __RTE_GEN_BIT_ASSIGN(, assign,, 64)
 __RTE_GEN_BIT_FLIP(, flip,, 64)
 
+__RTE_GEN_BIT_TEST(once_, test, volatile, 32)
+__RTE_GEN_BIT_SET(once_, set, volatile, 32)
+__RTE_GEN_BIT_CLEAR(once_, clear, volatile, 32)
+__RTE_GEN_BIT_ASSIGN(once_, assign, volatile, 32)
+__RTE_GEN_BIT_FLIP(once_, flip, volatile, 32)
+
+__RTE_GEN_BIT_TEST(once_, test, volatile, 64)
+__RTE_GEN_BIT_SET(once_, set, volatile, 64)
+__RTE_GEN_BIT_CLEAR(once_, clear, volatile, 64)
+__RTE_GEN_BIT_ASSIGN(once_, assign, volatile, 64)
+__RTE_GEN_BIT_FLIP(once_, flip, volatile, 64)
+
 /*------------------------ 32-bit relaxed operations ------------------------*/
 
 /**
@@ -993,6 +1182,12 @@  rte_log2_u64(uint64_t v)
 #undef rte_bit_assign
 #undef rte_bit_flip
 
+#undef rte_bit_once_test
+#undef rte_bit_once_set
+#undef rte_bit_once_clear
+#undef rte_bit_once_assign
+#undef rte_bit_once_flip
+
 #define __RTE_BIT_OVERLOAD_SZ_2(fun, qualifier, size, arg1_type, arg1_name) \
 	static inline void						\
 	rte_bit_ ## fun(qualifier uint ## size ## _t *addr,		\
@@ -1042,6 +1237,12 @@  __RTE_BIT_OVERLOAD_2(clear,, unsigned int, nr)
 __RTE_BIT_OVERLOAD_3(assign,, unsigned int, nr, bool, value)
 __RTE_BIT_OVERLOAD_2(flip,, unsigned int, nr)
 
+__RTE_BIT_OVERLOAD_2R(once_test, const volatile, bool, unsigned int, nr)
+__RTE_BIT_OVERLOAD_2(once_set, volatile, unsigned int, nr)
+__RTE_BIT_OVERLOAD_2(once_clear, volatile, unsigned int, nr)
+__RTE_BIT_OVERLOAD_3(once_assign, volatile, unsigned int, nr, bool, value)
+__RTE_BIT_OVERLOAD_2(once_flip, volatile, unsigned int, nr)
+
 #endif
 
 #endif /* _RTE_BITOPS_H_ */