[v5,1/3] rcu: add RCU library supporting QSBR mechanism

Message ID 20190412202039.46902-2-honnappa.nagarahalli@arm.com
State Superseded, archived
Headers show
Series
  • lib/rcu: add RCU library supporting QSBR mechanism
Related show

Checks

Context Check Description
ci/Intel-compilation success Compilation OK
ci/mellanox-Performance-Testing success Performance Testing PASS
ci/intel-Performance-Testing success Performance Testing PASS
ci/checkpatch success coding style OK

Commit Message

Honnappa Nagarahalli April 12, 2019, 8:20 p.m.
Add RCU library supporting quiescent state based memory reclamation method.
This library helps identify the quiescent state of the reader threads so
that the writers can free the memory associated with the lock less data
structures.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 MAINTAINERS                        |   5 +
 config/common_base                 |   6 +
 lib/Makefile                       |   2 +
 lib/librte_rcu/Makefile            |  23 +
 lib/librte_rcu/meson.build         |   5 +
 lib/librte_rcu/rte_rcu_qsbr.c      | 237 +++++++++++
 lib/librte_rcu/rte_rcu_qsbr.h      | 645 +++++++++++++++++++++++++++++
 lib/librte_rcu/rte_rcu_version.map |  11 +
 lib/meson.build                    |   2 +-
 mk/rte.app.mk                      |   1 +
 10 files changed, 936 insertions(+), 1 deletion(-)
 create mode 100644 lib/librte_rcu/Makefile
 create mode 100644 lib/librte_rcu/meson.build
 create mode 100644 lib/librte_rcu/rte_rcu_qsbr.c
 create mode 100644 lib/librte_rcu/rte_rcu_qsbr.h
 create mode 100644 lib/librte_rcu/rte_rcu_version.map

Comments

Stephen Hemminger April 12, 2019, 10:06 p.m. | #1
On Fri, 12 Apr 2019 15:20:37 -0500
Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> wrote:

> Add RCU library supporting quiescent state based memory reclamation method.
> This library helps identify the quiescent state of the reader threads so
> that the writers can free the memory associated with the lock less data
> structures.
> 
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Steve Capper <steve.capper@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

After evaluating long term API/ABI issues, I think you need to get rid
of almost all use of inline and visible structures. Yes it might be marginally
slower, but you thank me the first time you have to fix something.

Even the log macro should be private.
Honnappa Nagarahalli April 12, 2019, 10:24 p.m. | #2
> 
> On Fri, 12 Apr 2019 15:20:37 -0500
> Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> wrote:
> 
> > Add RCU library supporting quiescent state based memory reclamation
> method.
> > This library helps identify the quiescent state of the reader threads
> > so that the writers can free the memory associated with the lock less
> > data structures.
> >
> > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> 
> After evaluating long term API/ABI issues, I think you need to get rid of almost
> all use of inline and visible structures. Yes it might be marginally slower, but
> you thank me the first time you have to fix something.
> 
Agree, I was planning on another version to address this (I am yet to take a look at your patch addressing the ABI).
The structure visibility definitely needs to be addressed.
For the inline functions, is the plan to convert all the inline functions in DPDK? If yes, I think we need to consider the performance difference. May be consider L3-fwd application, change all the inline functions in its path and run a test?

> Even the log macro should be private.
Stephen Hemminger April 12, 2019, 11:06 p.m. | #3
On Fri, 12 Apr 2019 22:24:45 +0000
Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:

> > 
> > On Fri, 12 Apr 2019 15:20:37 -0500
> > Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> wrote:
> >   
> > > Add RCU library supporting quiescent state based memory reclamation  
> > method.  
> > > This library helps identify the quiescent state of the reader threads
> > > so that the writers can free the memory associated with the lock less
> > > data structures.
> > >
> > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > > Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>  
> > 
> > After evaluating long term API/ABI issues, I think you need to get rid of almost
> > all use of inline and visible structures. Yes it might be marginally slower, but
> > you thank me the first time you have to fix something.
> >   
> Agree, I was planning on another version to address this (I am yet to take a look at your patch addressing the ABI).
> The structure visibility definitely needs to be addressed.
> For the inline functions, is the plan to convert all the inline functions in DPDK? If yes, I think we need to consider the performance difference. May be consider L3-fwd application, change all the inline functions in its path and run a test?

Every function that is not in the direct datapath should not be inline.
Exceptions or things like rx/tx burst, ring enqueue/dequeue, and packet alloc/free
Ananyev, Konstantin April 15, 2019, 12:24 p.m. | #4
> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Saturday, April 13, 2019 12:06 AM
> To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; paulmck@linux.ibm.com; Kovacevic, Marko <marko.kovacevic@intel.com>;
> dev@dpdk.org; Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Dharmik Thakkar <Dharmik.Thakkar@arm.com>; Malvika Gupta
> <Malvika.Gupta@arm.com>; nd <nd@arm.com>
> Subject: Re: [PATCH v5 1/3] rcu: add RCU library supporting QSBR mechanism
> 
> On Fri, 12 Apr 2019 22:24:45 +0000
> Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:
> 
> > >
> > > On Fri, 12 Apr 2019 15:20:37 -0500
> > > Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> wrote:
> > >
> > > > Add RCU library supporting quiescent state based memory reclamation
> > > method.
> > > > This library helps identify the quiescent state of the reader threads
> > > > so that the writers can free the memory associated with the lock less
> > > > data structures.
> > > >
> > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > > > Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > >
> > > After evaluating long term API/ABI issues, I think you need to get rid of almost
> > > all use of inline and visible structures. Yes it might be marginally slower, but
> > > you thank me the first time you have to fix something.
> > >
> > Agree, I was planning on another version to address this (I am yet to take a look at your patch addressing the ABI).
> > The structure visibility definitely needs to be addressed.
> > For the inline functions, is the plan to convert all the inline functions in DPDK? If yes, I think we need to consider the performance
> difference. May be consider L3-fwd application, change all the inline functions in its path and run a test?
> 
> Every function that is not in the direct datapath should not be inline.
> Exceptions or things like rx/tx burst, ring enqueue/dequeue, and packet alloc/free

Plus synchronization routines: spin/rwlock/barrier, etc.
I think rcu should be one of such exceptions - it is just another synchronization mechanism after all
(just a bit more sophisticated).
Konstantin
Stephen Hemminger April 15, 2019, 3:38 p.m. | #5
On Mon, 15 Apr 2019 12:24:47 +0000
"Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:

> > -----Original Message-----
> > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > Sent: Saturday, April 13, 2019 12:06 AM
> > To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; paulmck@linux.ibm.com; Kovacevic, Marko <marko.kovacevic@intel.com>;
> > dev@dpdk.org; Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Dharmik Thakkar <Dharmik.Thakkar@arm.com>; Malvika Gupta
> > <Malvika.Gupta@arm.com>; nd <nd@arm.com>
> > Subject: Re: [PATCH v5 1/3] rcu: add RCU library supporting QSBR mechanism
> > 
> > On Fri, 12 Apr 2019 22:24:45 +0000
> > Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:
> >   
> > > >
> > > > On Fri, 12 Apr 2019 15:20:37 -0500
> > > > Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> wrote:
> > > >  
> > > > > Add RCU library supporting quiescent state based memory reclamation  
> > > > method.  
> > > > > This library helps identify the quiescent state of the reader threads
> > > > > so that the writers can free the memory associated with the lock less
> > > > > data structures.
> > > > >
> > > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > > > > Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>  
> > > >
> > > > After evaluating long term API/ABI issues, I think you need to get rid of almost
> > > > all use of inline and visible structures. Yes it might be marginally slower, but
> > > > you thank me the first time you have to fix something.
> > > >  
> > > Agree, I was planning on another version to address this (I am yet to take a look at your patch addressing the ABI).
> > > The structure visibility definitely needs to be addressed.
> > > For the inline functions, is the plan to convert all the inline functions in DPDK? If yes, I think we need to consider the performance  
> > difference. May be consider L3-fwd application, change all the inline functions in its path and run a test?
> > 
> > Every function that is not in the direct datapath should not be inline.
> > Exceptions or things like rx/tx burst, ring enqueue/dequeue, and packet alloc/free  
> 
> Plus synchronization routines: spin/rwlock/barrier, etc.
> I think rcu should be one of such exceptions - it is just another synchronization mechanism after all
> (just a bit more sophisticated).
> Konstantin

If you look at the other userspace RCU, you wil see that the only inlines
are the rcu_read_lock,rcu_read_unlock and rcu_reference/rcu_assign_pointer.

The synchronization logic is all real functions.
Ananyev, Konstantin April 15, 2019, 5:39 p.m. | #6
> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Monday, April 15, 2019 4:39 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; paulmck@linux.ibm.com; Kovacevic, Marko
> <marko.kovacevic@intel.com>; dev@dpdk.org; Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Dharmik Thakkar
> <Dharmik.Thakkar@arm.com>; Malvika Gupta <Malvika.Gupta@arm.com>; nd <nd@arm.com>
> Subject: Re: [PATCH v5 1/3] rcu: add RCU library supporting QSBR mechanism
> 
> On Mon, 15 Apr 2019 12:24:47 +0000
> "Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:
> 
> > > -----Original Message-----
> > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > Sent: Saturday, April 13, 2019 12:06 AM
> > > To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; paulmck@linux.ibm.com; Kovacevic, Marko <marko.kovacevic@intel.com>;
> > > dev@dpdk.org; Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Dharmik Thakkar <Dharmik.Thakkar@arm.com>; Malvika
> Gupta
> > > <Malvika.Gupta@arm.com>; nd <nd@arm.com>
> > > Subject: Re: [PATCH v5 1/3] rcu: add RCU library supporting QSBR mechanism
> > >
> > > On Fri, 12 Apr 2019 22:24:45 +0000
> > > Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:
> > >
> > > > >
> > > > > On Fri, 12 Apr 2019 15:20:37 -0500
> > > > > Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> wrote:
> > > > >
> > > > > > Add RCU library supporting quiescent state based memory reclamation
> > > > > method.
> > > > > > This library helps identify the quiescent state of the reader threads
> > > > > > so that the writers can free the memory associated with the lock less
> > > > > > data structures.
> > > > > >
> > > > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > > > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > > > > > Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > > > >
> > > > > After evaluating long term API/ABI issues, I think you need to get rid of almost
> > > > > all use of inline and visible structures. Yes it might be marginally slower, but
> > > > > you thank me the first time you have to fix something.
> > > > >
> > > > Agree, I was planning on another version to address this (I am yet to take a look at your patch addressing the ABI).
> > > > The structure visibility definitely needs to be addressed.
> > > > For the inline functions, is the plan to convert all the inline functions in DPDK? If yes, I think we need to consider the performance
> > > difference. May be consider L3-fwd application, change all the inline functions in its path and run a test?
> > >
> > > Every function that is not in the direct datapath should not be inline.
> > > Exceptions or things like rx/tx burst, ring enqueue/dequeue, and packet alloc/free
> >
> > Plus synchronization routines: spin/rwlock/barrier, etc.
> > I think rcu should be one of such exceptions - it is just another synchronization mechanism after all
> > (just a bit more sophisticated).
> > Konstantin
> 
> If you look at the other userspace RCU, you wil see that the only inlines
> are the rcu_read_lock,rcu_read_unlock and rcu_reference/rcu_assign_pointer.
> 
> The synchronization logic is all real functions.

In fact, I think urcu provides both flavors:
https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-qsbr.h
I still don't understand why we have to treat it differently then let say spin-lock/ticket-lock or rwlock.
If we gone all the way to create our own version of rcu, we probably want it to be as fast as possible
(I know that main speedup should come from the fact that readers don't have to wait for writer to finish, but still...)

Konstantin
Honnappa Nagarahalli April 15, 2019, 6:56 p.m. | #7
> > > > > >
> > > > > > After evaluating long term API/ABI issues, I think you need to
> > > > > > get rid of almost all use of inline and visible structures.
> > > > > > Yes it might be marginally slower, but you thank me the first time
> you have to fix something.
> > > > > >
> > > > > Agree, I was planning on another version to address this (I am yet
> to take a look at your patch addressing the ABI).
> > > > > The structure visibility definitely needs to be addressed.
> > > > > For the inline functions, is the plan to convert all the inline
> > > > > functions in DPDK? If yes, I think we need to consider the
> > > > > performance
> > > > difference. May be consider L3-fwd application, change all the inline
> functions in its path and run a test?
> > > >
> > > > Every function that is not in the direct datapath should not be inline.
> > > > Exceptions or things like rx/tx burst, ring enqueue/dequeue, and
> > > > packet alloc/free
> > >
> > > Plus synchronization routines: spin/rwlock/barrier, etc.
> > > I think rcu should be one of such exceptions - it is just another
> > > synchronization mechanism after all (just a bit more sophisticated).
> > > Konstantin
> >
> > If you look at the other userspace RCU, you wil see that the only
> > inlines are the rcu_read_lock,rcu_read_unlock and
> rcu_reference/rcu_assign_pointer.
> >
> > The synchronization logic is all real functions.
> 
> In fact, I think urcu provides both flavors:
> https://github.com/urcu/userspace-
> rcu/blob/master/include/urcu/static/urcu-qsbr.h
> I still don't understand why we have to treat it differently then let say
> spin-lock/ticket-lock or rwlock.
> If we gone all the way to create our own version of rcu, we probably want
> it to be as fast as possible (I know that main speedup should come from
> the fact that readers don't have to wait for writer to finish, but still...)
> 
Except for ' rte_rcu_qsbr_synchronize' (will correct in the next version), we have the correct APIs marked as inline. They all are part of the fast path.

> Konstantin
Stephen Hemminger April 15, 2019, 9:26 p.m. | #8
On Mon, 15 Apr 2019 17:39:07 +0000
"Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:

> > -----Original Message-----
> > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > Sent: Monday, April 15, 2019 4:39 PM
> > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; paulmck@linux.ibm.com; Kovacevic, Marko
> > <marko.kovacevic@intel.com>; dev@dpdk.org; Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Dharmik Thakkar
> > <Dharmik.Thakkar@arm.com>; Malvika Gupta <Malvika.Gupta@arm.com>; nd <nd@arm.com>
> > Subject: Re: [PATCH v5 1/3] rcu: add RCU library supporting QSBR mechanism
> > 
> > On Mon, 15 Apr 2019 12:24:47 +0000
> > "Ananyev, Konstantin" <konstantin.ananyev@intel.com> wrote:
> >   
> > > > -----Original Message-----
> > > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > > Sent: Saturday, April 13, 2019 12:06 AM
> > > > To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > > Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; paulmck@linux.ibm.com; Kovacevic, Marko <marko.kovacevic@intel.com>;
> > > > dev@dpdk.org; Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Dharmik Thakkar <Dharmik.Thakkar@arm.com>; Malvika  
> > Gupta  
> > > > <Malvika.Gupta@arm.com>; nd <nd@arm.com>
> > > > Subject: Re: [PATCH v5 1/3] rcu: add RCU library supporting QSBR mechanism
> > > >
> > > > On Fri, 12 Apr 2019 22:24:45 +0000
> > > > Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:
> > > >  
> > > > > >
> > > > > > On Fri, 12 Apr 2019 15:20:37 -0500
> > > > > > Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> wrote:
> > > > > >  
> > > > > > > Add RCU library supporting quiescent state based memory reclamation  
> > > > > > method.  
> > > > > > > This library helps identify the quiescent state of the reader threads
> > > > > > > so that the writers can free the memory associated with the lock less
> > > > > > > data structures.
> > > > > > >
> > > > > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > > > > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > > > > > > Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>  
> > > > > >
> > > > > > After evaluating long term API/ABI issues, I think you need to get rid of almost
> > > > > > all use of inline and visible structures. Yes it might be marginally slower, but
> > > > > > you thank me the first time you have to fix something.
> > > > > >  
> > > > > Agree, I was planning on another version to address this (I am yet to take a look at your patch addressing the ABI).
> > > > > The structure visibility definitely needs to be addressed.
> > > > > For the inline functions, is the plan to convert all the inline functions in DPDK? If yes, I think we need to consider the performance  
> > > > difference. May be consider L3-fwd application, change all the inline functions in its path and run a test?
> > > >
> > > > Every function that is not in the direct datapath should not be inline.
> > > > Exceptions or things like rx/tx burst, ring enqueue/dequeue, and packet alloc/free  
> > >
> > > Plus synchronization routines: spin/rwlock/barrier, etc.
> > > I think rcu should be one of such exceptions - it is just another synchronization mechanism after all
> > > (just a bit more sophisticated).
> > > Konstantin  
> > 
> > If you look at the other userspace RCU, you wil see that the only inlines
> > are the rcu_read_lock,rcu_read_unlock and rcu_reference/rcu_assign_pointer.
> > 
> > The synchronization logic is all real functions.  
> 
> In fact, I think urcu provides both flavors:
> https://github.com/urcu/userspace-rcu/blob/master/include/urcu/static/urcu-qsbr.h
> I still don't understand why we have to treat it differently then let say spin-lock/ticket-lock or rwlock.
> If we gone all the way to create our own version of rcu, we probably want it to be as fast as possible
> (I know that main speedup should come from the fact that readers don't have to wait for writer to finish, but still...)
> 
> Konstantin
> 

Having locking functions inline is already a problem in current releases.
The implementation can not be improved without breaking ABI (or doing special
workarounds like lock v2)
Honnappa Nagarahalli April 16, 2019, 5:29 a.m. | #9
> > > > > > > On Fri, 12 Apr 2019 15:20:37 -0500 Honnappa Nagarahalli
> > > > > > > <honnappa.nagarahalli@arm.com> wrote:
> > > > > > >
> > > > > > > > Add RCU library supporting quiescent state based memory
> > > > > > > > reclamation
> > > > > > > method.
> > > > > > > > This library helps identify the quiescent state of the
> > > > > > > > reader threads so that the writers can free the memory
> > > > > > > > associated with the lock less data structures.
> > > > > > > >
> > > > > > > > Signed-off-by: Honnappa Nagarahalli
> > > > > > > > <honnappa.nagarahalli@arm.com>
> > > > > > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > > > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > > > > > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > > > > > > > Acked-by: Konstantin Ananyev
> > > > > > > > <konstantin.ananyev@intel.com>
> > > > > > >
> > > > > > > After evaluating long term API/ABI issues, I think you need
> > > > > > > to get rid of almost all use of inline and visible
> > > > > > > structures. Yes it might be marginally slower, but you thank me
> the first time you have to fix something.
> > > > > > >
> > > > > > Agree, I was planning on another version to address this (I am yet
> to take a look at your patch addressing the ABI).
> > > > > > The structure visibility definitely needs to be addressed.
> > > > > > For the inline functions, is the plan to convert all the
> > > > > > inline functions in DPDK? If yes, I think we need to consider
> > > > > > the performance
> > > > > difference. May be consider L3-fwd application, change all the
> inline functions in its path and run a test?
> > > > >
> > > > > Every function that is not in the direct datapath should not be
> inline.
> > > > > Exceptions or things like rx/tx burst, ring enqueue/dequeue, and
> > > > > packet alloc/free
I do not understand how DPDK can claim ABI compatibility if we have inline functions (unless we freeze any development in these inline functions forever).

> > > >
> > > > Plus synchronization routines: spin/rwlock/barrier, etc.
> > > > I think rcu should be one of such exceptions - it is just another
> > > > synchronization mechanism after all (just a bit more sophisticated).
> > > > Konstantin
> > >
> > > If you look at the other userspace RCU, you wil see that the only
> > > inlines are the rcu_read_lock,rcu_read_unlock and
> rcu_reference/rcu_assign_pointer.
> > >
> > > The synchronization logic is all real functions.
> >
> > In fact, I think urcu provides both flavors:
> > https://github.com/urcu/userspace-
> rcu/blob/master/include/urcu/static/
> > urcu-qsbr.h I still don't understand why we have to treat it
> > differently then let say spin-lock/ticket-lock or rwlock.
> > If we gone all the way to create our own version of rcu, we probably
> > want it to be as fast as possible (I know that main speedup should
> > come from the fact that readers don't have to wait for writer to
> > finish, but still...)
> >
> > Konstantin
> >
> 
> Having locking functions inline is already a problem in current releases.
> The implementation can not be improved without breaking ABI (or doing
> special workarounds like lock v2)
I think ABI and inline function discussion needs to be taken up in a different thread.

Currently, I am looking to hide the structure visibility. I looked at your patch [1], it is a different case than what I have in this patch. It is a pretty generic use case as well (similar situation exists in other libraries). I think a generic solution should be agreed upon.

If we have to hide the structure content, the handle to QS variable returned to the application needs to be opaque. I suggest using 'void *' behind which any structure can be used.

typedef void * rte_rcu_qsbr_t;
typedef void * rte_hash_t;

But it requires typecasting.

[1] http://patchwork.dpdk.org/cover/52609/
Stephen Hemminger April 16, 2019, 2:54 p.m. | #10
On Tue, 16 Apr 2019 05:29:21 +0000
Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:

> > > > > > > > On Fri, 12 Apr 2019 15:20:37 -0500 Honnappa Nagarahalli
> > > > > > > > <honnappa.nagarahalli@arm.com> wrote:
> > > > > > > >  
> > > > > > > > > Add RCU library supporting quiescent state based memory
> > > > > > > > > reclamation  
> > > > > > > > method.  
> > > > > > > > > This library helps identify the quiescent state of the
> > > > > > > > > reader threads so that the writers can free the memory
> > > > > > > > > associated with the lock less data structures.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Honnappa Nagarahalli
> > > > > > > > > <honnappa.nagarahalli@arm.com>
> > > > > > > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > > > > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > > > > > > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > > > > > > > > Acked-by: Konstantin Ananyev
> > > > > > > > > <konstantin.ananyev@intel.com>  
> > > > > > > >
> > > > > > > > After evaluating long term API/ABI issues, I think you need
> > > > > > > > to get rid of almost all use of inline and visible
> > > > > > > > structures. Yes it might be marginally slower, but you thank me  
> > the first time you have to fix something.  
> > > > > > > >  
> > > > > > > Agree, I was planning on another version to address this (I am yet  
> > to take a look at your patch addressing the ABI).  
> > > > > > > The structure visibility definitely needs to be addressed.
> > > > > > > For the inline functions, is the plan to convert all the
> > > > > > > inline functions in DPDK? If yes, I think we need to consider
> > > > > > > the performance  
> > > > > > difference. May be consider L3-fwd application, change all the  
> > inline functions in its path and run a test?  
> > > > > >
> > > > > > Every function that is not in the direct datapath should not be  
> > inline.  
> > > > > > Exceptions or things like rx/tx burst, ring enqueue/dequeue, and
> > > > > > packet alloc/free  
> I do not understand how DPDK can claim ABI compatibility if we have inline functions (unless we freeze any development in these inline functions forever).
> 
> > > > >
> > > > > Plus synchronization routines: spin/rwlock/barrier, etc.
> > > > > I think rcu should be one of such exceptions - it is just another
> > > > > synchronization mechanism after all (just a bit more sophisticated).
> > > > > Konstantin  
> > > >
> > > > If you look at the other userspace RCU, you wil see that the only
> > > > inlines are the rcu_read_lock,rcu_read_unlock and  
> > rcu_reference/rcu_assign_pointer.  
> > > >
> > > > The synchronization logic is all real functions.  
> > >
> > > In fact, I think urcu provides both flavors:
> > > https://github.com/urcu/userspace-  
> > rcu/blob/master/include/urcu/static/  
> > > urcu-qsbr.h I still don't understand why we have to treat it
> > > differently then let say spin-lock/ticket-lock or rwlock.
> > > If we gone all the way to create our own version of rcu, we probably
> > > want it to be as fast as possible (I know that main speedup should
> > > come from the fact that readers don't have to wait for writer to
> > > finish, but still...)
> > >
> > > Konstantin
> > >  
> > 
> > Having locking functions inline is already a problem in current releases.
> > The implementation can not be improved without breaking ABI (or doing
> > special workarounds like lock v2)  
> I think ABI and inline function discussion needs to be taken up in a different thread.
> 
> Currently, I am looking to hide the structure visibility. I looked at your patch [1], it is a different case than what I have in this patch. It is a pretty generic use case as well (similar situation exists in other libraries). I think a generic solution should be agreed upon.
> 
> If we have to hide the structure content, the handle to QS variable returned to the application needs to be opaque. I suggest using 'void *' behind which any structure can be used.
> 
> typedef void * rte_rcu_qsbr_t;
> typedef void * rte_hash_t;
> 
> But it requires typecasting.
> 
> [1] http://patchwork.dpdk.org/cover/52609/

C allows structure to be defined without knowing what is in it
therefore.

typedef struct rte_rcu_qsbr rte_rcu_qsbr_t;

is preferred (or do it without typedef)

struct rte_rcu_qsbr;
Honnappa Nagarahalli April 16, 2019, 4:56 p.m. | #11
> 
> > > > > > > > > On Fri, 12 Apr 2019 15:20:37 -0500 Honnappa Nagarahalli
> > > > > > > > > <honnappa.nagarahalli@arm.com> wrote:
> > > > > > > > >
> > > > > > > > > > Add RCU library supporting quiescent state based
> > > > > > > > > > memory reclamation
> > > > > > > > > method.
> > > > > > > > > > This library helps identify the quiescent state of the
> > > > > > > > > > reader threads so that the writers can free the memory
> > > > > > > > > > associated with the lock less data structures.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Honnappa Nagarahalli
> > > > > > > > > > <honnappa.nagarahalli@arm.com>
> > > > > > > > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > > > > > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > > > > > > > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > > > > > > > > > Acked-by: Konstantin Ananyev
> > > > > > > > > > <konstantin.ananyev@intel.com>
> > > > > > > > >
> > > > > > > > > After evaluating long term API/ABI issues, I think you
> > > > > > > > > need to get rid of almost all use of inline and visible
> > > > > > > > > structures. Yes it might be marginally slower, but you
> > > > > > > > > thank me
> > > the first time you have to fix something.
> > > > > > > > >
> > > > > > > > Agree, I was planning on another version to address this
> > > > > > > > (I am yet
> > > to take a look at your patch addressing the ABI).
> > > > > > > > The structure visibility definitely needs to be addressed.
> > > > > > > > For the inline functions, is the plan to convert all the
> > > > > > > > inline functions in DPDK? If yes, I think we need to
> > > > > > > > consider the performance
> > > > > > > difference. May be consider L3-fwd application, change all
> > > > > > > the
> > > inline functions in its path and run a test?
> > > > > > >
> > > > > > > Every function that is not in the direct datapath should not
> > > > > > > be
> > > inline.
> > > > > > > Exceptions or things like rx/tx burst, ring enqueue/dequeue,
> > > > > > > and packet alloc/free
> > I do not understand how DPDK can claim ABI compatibility if we have
> inline functions (unless we freeze any development in these inline functions
> forever).
> >
> > > > > >
> > > > > > Plus synchronization routines: spin/rwlock/barrier, etc.
> > > > > > I think rcu should be one of such exceptions - it is just
> > > > > > another synchronization mechanism after all (just a bit more
> sophisticated).
> > > > > > Konstantin
> > > > >
> > > > > If you look at the other userspace RCU, you wil see that the
> > > > > only inlines are the rcu_read_lock,rcu_read_unlock and
> > > rcu_reference/rcu_assign_pointer.
> > > > >
> > > > > The synchronization logic is all real functions.
> > > >
> > > > In fact, I think urcu provides both flavors:
> > > > https://github.com/urcu/userspace-
> > > rcu/blob/master/include/urcu/static/
> > > > urcu-qsbr.h I still don't understand why we have to treat it
> > > > differently then let say spin-lock/ticket-lock or rwlock.
> > > > If we gone all the way to create our own version of rcu, we
> > > > probably want it to be as fast as possible (I know that main
> > > > speedup should come from the fact that readers don't have to wait
> > > > for writer to finish, but still...)
> > > >
> > > > Konstantin
> > > >
> > >
> > > Having locking functions inline is already a problem in current releases.
> > > The implementation can not be improved without breaking ABI (or
> > > doing special workarounds like lock v2)
> > I think ABI and inline function discussion needs to be taken up in a
> different thread.
> >
> > Currently, I am looking to hide the structure visibility. I looked at your
> patch [1], it is a different case than what I have in this patch. It is a pretty
> generic use case as well (similar situation exists in other libraries). I think a
> generic solution should be agreed upon.
> >
> > If we have to hide the structure content, the handle to QS variable
> returned to the application needs to be opaque. I suggest using 'void *'
> behind which any structure can be used.
> >
> > typedef void * rte_rcu_qsbr_t;
> > typedef void * rte_hash_t;
> >
> > But it requires typecasting.
> >
> > [1] http://patchwork.dpdk.org/cover/52609/
> 
> C allows structure to be defined without knowing what is in it therefore.
> 
> typedef struct rte_rcu_qsbr rte_rcu_qsbr_t;
> 
> is preferred (or do it without typedef)
> 
> struct rte_rcu_qsbr;

I see that rte_hash library uses the same approach (struct rte_hash in rte_hash.h, though it is marking as internal). But the ABI Laboratory tool [1] seems to be reporting incorrect numbers for this library even though the internal structure is changed.

[1] https://abi-laboratory.pro/index.php?view=compat_report&l=dpdk&v1=19.02&v2=current&obj=66794&kind=abi
Stephen Hemminger April 16, 2019, 9:22 p.m. | #12
On Tue, 16 Apr 2019 16:56:32 +0000
Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com> wrote:

> >   
> > > > > > > > > > On Fri, 12 Apr 2019 15:20:37 -0500 Honnappa Nagarahalli
> > > > > > > > > > <honnappa.nagarahalli@arm.com> wrote:
> > > > > > > > > >  
> > > > > > > > > > > Add RCU library supporting quiescent state based
> > > > > > > > > > > memory reclamation  
> > > > > > > > > > method.  
> > > > > > > > > > > This library helps identify the quiescent state of the
> > > > > > > > > > > reader threads so that the writers can free the memory
> > > > > > > > > > > associated with the lock less data structures.
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Honnappa Nagarahalli
> > > > > > > > > > > <honnappa.nagarahalli@arm.com>
> > > > > > > > > > > Reviewed-by: Steve Capper <steve.capper@arm.com>
> > > > > > > > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > > > > > > > > Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
> > > > > > > > > > > Acked-by: Konstantin Ananyev
> > > > > > > > > > > <konstantin.ananyev@intel.com>  
> > > > > > > > > >
> > > > > > > > > > After evaluating long term API/ABI issues, I think you
> > > > > > > > > > need to get rid of almost all use of inline and visible
> > > > > > > > > > structures. Yes it might be marginally slower, but you
> > > > > > > > > > thank me  
> > > > the first time you have to fix something.  
> > > > > > > > > >  
> > > > > > > > > Agree, I was planning on another version to address this
> > > > > > > > > (I am yet  
> > > > to take a look at your patch addressing the ABI).  
> > > > > > > > > The structure visibility definitely needs to be addressed.
> > > > > > > > > For the inline functions, is the plan to convert all the
> > > > > > > > > inline functions in DPDK? If yes, I think we need to
> > > > > > > > > consider the performance  
> > > > > > > > difference. May be consider L3-fwd application, change all
> > > > > > > > the  
> > > > inline functions in its path and run a test?  
> > > > > > > >
> > > > > > > > Every function that is not in the direct datapath should not
> > > > > > > > be  
> > > > inline.  
> > > > > > > > Exceptions or things like rx/tx burst, ring enqueue/dequeue,
> > > > > > > > and packet alloc/free  
> > > I do not understand how DPDK can claim ABI compatibility if we have  
> > inline functions (unless we freeze any development in these inline functions
> > forever).  
> > >  
> > > > > > >
> > > > > > > Plus synchronization routines: spin/rwlock/barrier, etc.
> > > > > > > I think rcu should be one of such exceptions - it is just
> > > > > > > another synchronization mechanism after all (just a bit more  
> > sophisticated).  
> > > > > > > Konstantin  
> > > > > >
> > > > > > If you look at the other userspace RCU, you wil see that the
> > > > > > only inlines are the rcu_read_lock,rcu_read_unlock and  
> > > > rcu_reference/rcu_assign_pointer.  
> > > > > >
> > > > > > The synchronization logic is all real functions.  
> > > > >
> > > > > In fact, I think urcu provides both flavors:
> > > > > https://github.com/urcu/userspace-  
> > > > rcu/blob/master/include/urcu/static/  
> > > > > urcu-qsbr.h I still don't understand why we have to treat it
> > > > > differently then let say spin-lock/ticket-lock or rwlock.
> > > > > If we gone all the way to create our own version of rcu, we
> > > > > probably want it to be as fast as possible (I know that main
> > > > > speedup should come from the fact that readers don't have to wait
> > > > > for writer to finish, but still...)
> > > > >
> > > > > Konstantin
> > > > >  
> > > >
> > > > Having locking functions inline is already a problem in current releases.
> > > > The implementation can not be improved without breaking ABI (or
> > > > doing special workarounds like lock v2)  
> > > I think ABI and inline function discussion needs to be taken up in a  
> > different thread.  
> > >
> > > Currently, I am looking to hide the structure visibility. I looked at your  
> > patch [1], it is a different case than what I have in this patch. It is a pretty
> > generic use case as well (similar situation exists in other libraries). I think a
> > generic solution should be agreed upon.  
> > >
> > > If we have to hide the structure content, the handle to QS variable  
> > returned to the application needs to be opaque. I suggest using 'void *'
> > behind which any structure can be used.  
> > >
> > > typedef void * rte_rcu_qsbr_t;
> > > typedef void * rte_hash_t;
> > >
> > > But it requires typecasting.
> > >
> > > [1] http://patchwork.dpdk.org/cover/52609/  
> > 
> > C allows structure to be defined without knowing what is in it therefore.
> > 
> > typedef struct rte_rcu_qsbr rte_rcu_qsbr_t;
> > 
> > is preferred (or do it without typedef)
> > 
> > struct rte_rcu_qsbr;  
> 
> I see that rte_hash library uses the same approach (struct rte_hash in rte_hash.h, though it is marking as internal). But the ABI Laboratory tool [1] seems to be reporting incorrect numbers for this library even though the internal structure is changed.
> 
> [1] https://abi-laboratory.pro/index.php?view=compat_report&l=dpdk&v1=19.02&v2=current&obj=66794&kind=abi

The problem is rte_hash structure is exposed as part of ABI in rte_cuckoo_hash.h
This was a mistake.
Honnappa Nagarahalli April 17, 2019, 1:45 a.m. | #13
> > > > > > > > > > >
> > > > > > > > > > > After evaluating long term API/ABI issues, I think
> > > > > > > > > > > you need to get rid of almost all use of inline and
> > > > > > > > > > > visible structures. Yes it might be marginally
> > > > > > > > > > > slower, but you thank me
> > > > > the first time you have to fix something.
> > > > > > > > > > >
> > > > > > > > > > Agree, I was planning on another version to address
> > > > > > > > > > this (I am yet
> > > > > to take a look at your patch addressing the ABI).
> > > > > > > > > > The structure visibility definitely needs to be addressed.
> > > > > > > > > > For the inline functions, is the plan to convert all
> > > > > > > > > > the inline functions in DPDK? If yes, I think we need
> > > > > > > > > > to consider the performance
> > > > > > > > > difference. May be consider L3-fwd application, change
> > > > > > > > > all the
> > > > > inline functions in its path and run a test?
> > > > > > > > >
> > > > > > > > > Every function that is not in the direct datapath should
> > > > > > > > > not be
> > > > > inline.
> > > > > > > > > Exceptions or things like rx/tx burst, ring
> > > > > > > > > enqueue/dequeue, and packet alloc/free
> > > > I do not understand how DPDK can claim ABI compatibility if we
> > > > have
> > > inline functions (unless we freeze any development in these inline
> > > functions forever).
> > > >
> > > > > > > >
> > > > > > > > Plus synchronization routines: spin/rwlock/barrier, etc.
> > > > > > > > I think rcu should be one of such exceptions - it is just
> > > > > > > > another synchronization mechanism after all (just a bit
> > > > > > > > more
> > > sophisticated).
> > > > > > > > Konstantin
> > > > > > >
> > > > > > > If you look at the other userspace RCU, you wil see that the
> > > > > > > only inlines are the rcu_read_lock,rcu_read_unlock and
> > > > > rcu_reference/rcu_assign_pointer.
> > > > > > >
> > > > > > > The synchronization logic is all real functions.
> > > > > >
> > > > > > In fact, I think urcu provides both flavors:
> > > > > > https://github.com/urcu/userspace-
> > > > > rcu/blob/master/include/urcu/static/
> > > > > > urcu-qsbr.h I still don't understand why we have to treat it
> > > > > > differently then let say spin-lock/ticket-lock or rwlock.
> > > > > > If we gone all the way to create our own version of rcu, we
> > > > > > probably want it to be as fast as possible (I know that main
> > > > > > speedup should come from the fact that readers don't have to
> > > > > > wait for writer to finish, but still...)
> > > > > >
> > > > > > Konstantin
> > > > > >
> > > > >
> > > > > Having locking functions inline is already a problem in current
> releases.
> > > > > The implementation can not be improved without breaking ABI (or
> > > > > doing special workarounds like lock v2)
> > > > I think ABI and inline function discussion needs to be taken up in
> > > > a
> > > different thread.
> > > >
> > > > Currently, I am looking to hide the structure visibility. I looked
> > > > at your
> > > patch [1], it is a different case than what I have in this patch. It
> > > is a pretty generic use case as well (similar situation exists in
> > > other libraries). I think a generic solution should be agreed upon.
> > > >
> > > > If we have to hide the structure content, the handle to QS
> > > > variable
> > > returned to the application needs to be opaque. I suggest using 'void *'
> > > behind which any structure can be used.
> > > >
> > > > typedef void * rte_rcu_qsbr_t;
> > > > typedef void * rte_hash_t;
> > > >
> > > > But it requires typecasting.
> > > >
> > > > [1] http://patchwork.dpdk.org/cover/52609/
> > >
> > > C allows structure to be defined without knowing what is in it
> therefore.
> > >
> > > typedef struct rte_rcu_qsbr rte_rcu_qsbr_t;
> > >
> > > is preferred (or do it without typedef)
> > >
> > > struct rte_rcu_qsbr;
> >
> > I see that rte_hash library uses the same approach (struct rte_hash in
> rte_hash.h, though it is marking as internal). But the ABI Laboratory tool
> [1] seems to be reporting incorrect numbers for this library even though
> the internal structure is changed.
> >
> > [1]
> > https://abi-
> laboratory.pro/index.php?view=compat_report&l=dpdk&v1=19.0
> > 2&v2=current&obj=66794&kind=abi
> 
> The problem is rte_hash structure is exposed as part of ABI in
> rte_cuckoo_hash.h This was a mistake.
Do you mean, due to the use of structure with the same name? I am wondering if it is just a tools issue. The application is not supposed to include rte_cuckoo_hash.h.

For the RCU library, we either need to go all functions or leave it the way it is. I do not see a point in trying to hide the internal structure while having inline functions.

I converted the inline functions to function calls.

Testing on Arm platform (results *are* repeatable) shows very minimal drop (0.1% to 0.2%) in performance while using lock-free rte_hash data structure. But one of the test cases which is just spinning shows good amount of drop (41%).

Testing on x86 (Xeon Gold 6132 CPU @ 2.60GHz, results *are* pretty repeatable) shows performance improvements (7% to 8%) while using lock-free rte_hash data structure. The test cases which is just spinning show significant drop (14%, 155%, 231%).

Konstantin, any thoughts on the results?

I will send out V6 which will fix issues reported so far. The function vs inline part is still open, need to close it soon.
Ananyev, Konstantin April 17, 2019, 1:39 p.m. | #14
> > > > > > > > > > > >
> > > > > > > > > > > > After evaluating long term API/ABI issues, I think
> > > > > > > > > > > > you need to get rid of almost all use of inline and
> > > > > > > > > > > > visible structures. Yes it might be marginally
> > > > > > > > > > > > slower, but you thank me
> > > > > > the first time you have to fix something.
> > > > > > > > > > > >
> > > > > > > > > > > Agree, I was planning on another version to address
> > > > > > > > > > > this (I am yet
> > > > > > to take a look at your patch addressing the ABI).
> > > > > > > > > > > The structure visibility definitely needs to be addressed.
> > > > > > > > > > > For the inline functions, is the plan to convert all
> > > > > > > > > > > the inline functions in DPDK? If yes, I think we need
> > > > > > > > > > > to consider the performance
> > > > > > > > > > difference. May be consider L3-fwd application, change
> > > > > > > > > > all the
> > > > > > inline functions in its path and run a test?
> > > > > > > > > >
> > > > > > > > > > Every function that is not in the direct datapath should
> > > > > > > > > > not be
> > > > > > inline.
> > > > > > > > > > Exceptions or things like rx/tx burst, ring
> > > > > > > > > > enqueue/dequeue, and packet alloc/free
> > > > > I do not understand how DPDK can claim ABI compatibility if we
> > > > > have
> > > > inline functions (unless we freeze any development in these inline
> > > > functions forever).
> > > > >
> > > > > > > > >
> > > > > > > > > Plus synchronization routines: spin/rwlock/barrier, etc.
> > > > > > > > > I think rcu should be one of such exceptions - it is just
> > > > > > > > > another synchronization mechanism after all (just a bit
> > > > > > > > > more
> > > > sophisticated).
> > > > > > > > > Konstantin
> > > > > > > >
> > > > > > > > If you look at the other userspace RCU, you wil see that the
> > > > > > > > only inlines are the rcu_read_lock,rcu_read_unlock and
> > > > > > rcu_reference/rcu_assign_pointer.
> > > > > > > >
> > > > > > > > The synchronization logic is all real functions.
> > > > > > >
> > > > > > > In fact, I think urcu provides both flavors:
> > > > > > > https://github.com/urcu/userspace-
> > > > > > rcu/blob/master/include/urcu/static/
> > > > > > > urcu-qsbr.h I still don't understand why we have to treat it
> > > > > > > differently then let say spin-lock/ticket-lock or rwlock.
> > > > > > > If we gone all the way to create our own version of rcu, we
> > > > > > > probably want it to be as fast as possible (I know that main
> > > > > > > speedup should come from the fact that readers don't have to
> > > > > > > wait for writer to finish, but still...)
> > > > > > >
> > > > > > > Konstantin
> > > > > > >
> > > > > >
> > > > > > Having locking functions inline is already a problem in current
> > releases.
> > > > > > The implementation can not be improved without breaking ABI (or
> > > > > > doing special workarounds like lock v2)
> > > > > I think ABI and inline function discussion needs to be taken up in
> > > > > a
> > > > different thread.
> > > > >
> > > > > Currently, I am looking to hide the structure visibility. I looked
> > > > > at your
> > > > patch [1], it is a different case than what I have in this patch. It
> > > > is a pretty generic use case as well (similar situation exists in
> > > > other libraries). I think a generic solution should be agreed upon.
> > > > >
> > > > > If we have to hide the structure content, the handle to QS
> > > > > variable
> > > > returned to the application needs to be opaque. I suggest using 'void *'
> > > > behind which any structure can be used.
> > > > >
> > > > > typedef void * rte_rcu_qsbr_t;
> > > > > typedef void * rte_hash_t;
> > > > >
> > > > > But it requires typecasting.
> > > > >
> > > > > [1] http://patchwork.dpdk.org/cover/52609/
> > > >
> > > > C allows structure to be defined without knowing what is in it
> > therefore.
> > > >
> > > > typedef struct rte_rcu_qsbr rte_rcu_qsbr_t;
> > > >
> > > > is preferred (or do it without typedef)
> > > >
> > > > struct rte_rcu_qsbr;
> > >
> > > I see that rte_hash library uses the same approach (struct rte_hash in
> > rte_hash.h, though it is marking as internal). But the ABI Laboratory tool
> > [1] seems to be reporting incorrect numbers for this library even though
> > the internal structure is changed.
> > >
> > > [1]
> > > https://abi-
> > laboratory.pro/index.php?view=compat_report&l=dpdk&v1=19.0
> > > 2&v2=current&obj=66794&kind=abi
> >
> > The problem is rte_hash structure is exposed as part of ABI in
> > rte_cuckoo_hash.h This was a mistake.
> Do you mean, due to the use of structure with the same name? I am wondering if it is just a tools issue. The application is not supposed to
> include rte_cuckoo_hash.h.
> 
> For the RCU library, we either need to go all functions or leave it the way it is. I do not see a point in trying to hide the internal structure
> while having inline functions.
> 
> I converted the inline functions to function calls.
> 
> Testing on Arm platform (results *are* repeatable) shows very minimal drop (0.1% to 0.2%) in performance while using lock-free rte_hash
> data structure. But one of the test cases which is just spinning shows good amount of drop (41%).
> 
> Testing on x86 (Xeon Gold 6132 CPU @ 2.60GHz, results *are* pretty repeatable) shows performance improvements (7% to 8%) while using
> lock-free rte_hash data structure. The test cases which is just spinning show significant drop (14%, 155%, 231%).
> Konstantin, any thoughts on the results?

The fact that function show better result than inline (even for hash) is sort of surprise to me.
Don't have any good explanation off-hand, but the actual numbers for hash test are huge by itself...

In general, I still think that sync primitives better to stay inlined - there is no much point to create ones
and then figure out that no-one using them because they are too slow.
Though if there is no real perf difference between inlined and normal - no point to keep it inlined.
About RCU lib, my thought to have inlined version for 19.05 and do further perf testing with it
(as I remember there were suggestions about using it in l3fwd for guarding routing table or so).
If we'll find there is no real difference - move it to not-inlined version in 19.08.
It is experimental for now  - so could be changed without formal ABI breakage.

 Konstantin
Honnappa Nagarahalli April 17, 2019, 2:02 p.m. | #15
> > > > > > > > > > > > >
> > > > > > > > > > > > > After evaluating long term API/ABI issues, I
> > > > > > > > > > > > > think you need to get rid of almost all use of
> > > > > > > > > > > > > inline and visible structures. Yes it might be
> > > > > > > > > > > > > marginally slower, but you thank me
> > > > > > > the first time you have to fix something.
> > > > > > > > > > > > >
> > > > > > > > > > > > Agree, I was planning on another version to
> > > > > > > > > > > > address this (I am yet
> > > > > > > to take a look at your patch addressing the ABI).
> > > > > > > > > > > > The structure visibility definitely needs to be
> addressed.
> > > > > > > > > > > > For the inline functions, is the plan to convert
> > > > > > > > > > > > all the inline functions in DPDK? If yes, I think
> > > > > > > > > > > > we need to consider the performance
> > > > > > > > > > > difference. May be consider L3-fwd application,
> > > > > > > > > > > change all the
> > > > > > > inline functions in its path and run a test?
> > > > > > > > > > >
> > > > > > > > > > > Every function that is not in the direct datapath
> > > > > > > > > > > should not be
> > > > > > > inline.
> > > > > > > > > > > Exceptions or things like rx/tx burst, ring
> > > > > > > > > > > enqueue/dequeue, and packet alloc/free
> > > > > > I do not understand how DPDK can claim ABI compatibility if we
> > > > > > have
> > > > > inline functions (unless we freeze any development in these
> > > > > inline functions forever).
> > > > > >
> > > > > > > > > >
> > > > > > > > > > Plus synchronization routines: spin/rwlock/barrier, etc.
> > > > > > > > > > I think rcu should be one of such exceptions - it is
> > > > > > > > > > just another synchronization mechanism after all (just
> > > > > > > > > > a bit more
> > > > > sophisticated).
> > > > > > > > > > Konstantin
> > > > > > > > >
> > > > > > > > > If you look at the other userspace RCU, you wil see that
> > > > > > > > > the only inlines are the rcu_read_lock,rcu_read_unlock
> > > > > > > > > and
> > > > > > > rcu_reference/rcu_assign_pointer.
> > > > > > > > >
> > > > > > > > > The synchronization logic is all real functions.
> > > > > > > >
> > > > > > > > In fact, I think urcu provides both flavors:
> > > > > > > > https://github.com/urcu/userspace-
> > > > > > > rcu/blob/master/include/urcu/static/
> > > > > > > > urcu-qsbr.h I still don't understand why we have to treat
> > > > > > > > it differently then let say spin-lock/ticket-lock or rwlock.
> > > > > > > > If we gone all the way to create our own version of rcu,
> > > > > > > > we probably want it to be as fast as possible (I know that
> > > > > > > > main speedup should come from the fact that readers don't
> > > > > > > > have to wait for writer to finish, but still...)
> > > > > > > >
> > > > > > > > Konstantin
> > > > > > > >
> > > > > > >
> > > > > > > Having locking functions inline is already a problem in
> > > > > > > current
> > > releases.
> > > > > > > The implementation can not be improved without breaking ABI
> > > > > > > (or doing special workarounds like lock v2)
> > > > > > I think ABI and inline function discussion needs to be taken
> > > > > > up in a
> > > > > different thread.
> > > > > >
> > > > > > Currently, I am looking to hide the structure visibility. I
> > > > > > looked at your
> > > > > patch [1], it is a different case than what I have in this
> > > > > patch. It is a pretty generic use case as well (similar
> > > > > situation exists in other libraries). I think a generic solution should
> be agreed upon.
> > > > > >
> > > > > > If we have to hide the structure content, the handle to QS
> > > > > > variable
> > > > > returned to the application needs to be opaque. I suggest using
> 'void *'
> > > > > behind which any structure can be used.
> > > > > >
> > > > > > typedef void * rte_rcu_qsbr_t; typedef void * rte_hash_t;
> > > > > >
> > > > > > But it requires typecasting.
> > > > > >
> > > > > > [1] http://patchwork.dpdk.org/cover/52609/
> > > > >
> > > > > C allows structure to be defined without knowing what is in it
> > > therefore.
> > > > >
> > > > > typedef struct rte_rcu_qsbr rte_rcu_qsbr_t;
> > > > >
> > > > > is preferred (or do it without typedef)
> > > > >
> > > > > struct rte_rcu_qsbr;
> > > >
> > > > I see that rte_hash library uses the same approach (struct
> > > > rte_hash in
> > > rte_hash.h, though it is marking as internal). But the ABI
> > > Laboratory tool [1] seems to be reporting incorrect numbers for this
> > > library even though the internal structure is changed.
> > > >
> > > > [1]
> > > > https://abi-
> > > laboratory.pro/index.php?view=compat_report&l=dpdk&v1=19.0
> > > > 2&v2=current&obj=66794&kind=abi
> > >
> > > The problem is rte_hash structure is exposed as part of ABI in
> > > rte_cuckoo_hash.h This was a mistake.
> > Do you mean, due to the use of structure with the same name? I am
> > wondering if it is just a tools issue. The application is not supposed to
> include rte_cuckoo_hash.h.
> >
> > For the RCU library, we either need to go all functions or leave it
> > the way it is. I do not see a point in trying to hide the internal structure
> while having inline functions.
> >
> > I converted the inline functions to function calls.
> >
> > Testing on Arm platform (results *are* repeatable) shows very minimal
> > drop (0.1% to 0.2%) in performance while using lock-free rte_hash data
> structure. But one of the test cases which is just spinning shows good
> amount of drop (41%).
> >
> > Testing on x86 (Xeon Gold 6132 CPU @ 2.60GHz, results *are* pretty
> > repeatable) shows performance improvements (7% to 8%) while using
> lock-free rte_hash data structure. The test cases which is just spinning
> show significant drop (14%, 155%, 231%).
> > Konstantin, any thoughts on the results?
> 
> The fact that function show better result than inline (even for hash) is sort
> of surprise to me.
It was a surprise to me too and counter-intuitive to my understanding.
 
> Don't have any good explanation off-hand, but the actual numbers for
> hash test are huge by itself...
> 
> In general, I still think that sync primitives better to stay inlined - there is
> no much point to create ones and then figure out that no-one using them
> because they are too slow.
> Though if there is no real perf difference between inlined and normal - no
> point to keep it inlined.
> About RCU lib, my thought to have inlined version for 19.05 and do
> further perf testing with it (as I remember there were suggestions about
> using it in l3fwd for guarding routing table or so).
Yes, there is more work planned to integrate the library better which might provide more insight. 

> If we'll find there is no real difference - move it to not-inlined version in
> 19.08.
+1.

> It is experimental for now  - so could be changed without formal ABI
> breakage.
> 
>  Konstantin
> 
>
Thomas Monjalon April 17, 2019, 2:18 p.m. | #16
17/04/2019 15:39, Ananyev, Konstantin:
> In general, I still think that sync primitives better to stay inlined - there is no much point to create ones
> and then figure out that no-one using them because they are too slow.
> Though if there is no real perf difference between inlined and normal - no point to keep it inlined.
> About RCU lib, my thought to have inlined version for 19.05 and do further perf testing with it
> (as I remember there were suggestions about using it in l3fwd for guarding routing table or so).
> If we'll find there is no real difference - move it to not-inlined version in 19.08.
> It is experimental for now  - so could be changed without formal ABI breakage.

I agree, it looks reasonnable to take v6 of RCU patches
as an experimental implementation.

Then we can run some tests and discuss about inlining or not
before promoting it as a stable API.

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index 9774344dd..6e9766eed 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1267,6 +1267,11 @@  F: examples/bpf/
 F: app/test/test_bpf.c
 F: doc/guides/prog_guide/bpf_lib.rst
 
+RCU - EXPERIMENTAL
+M: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
+F: lib/librte_rcu/
+F: doc/guides/prog_guide/rcu_lib.rst
+
 
 Test Applications
 -----------------
diff --git a/config/common_base b/config/common_base
index 8da08105b..ad70c79e1 100644
--- a/config/common_base
+++ b/config/common_base
@@ -829,6 +829,12 @@  CONFIG_RTE_LIBRTE_LATENCY_STATS=y
 #
 CONFIG_RTE_LIBRTE_TELEMETRY=n
 
+#
+# Compile librte_rcu
+#
+CONFIG_RTE_LIBRTE_RCU=y
+CONFIG_RTE_LIBRTE_RCU_DEBUG=n
+
 #
 # Compile librte_lpm
 #
diff --git a/lib/Makefile b/lib/Makefile
index 26021d0c0..791e0d991 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -111,6 +111,8 @@  DIRS-$(CONFIG_RTE_LIBRTE_IPSEC) += librte_ipsec
 DEPDIRS-librte_ipsec := librte_eal librte_mbuf librte_cryptodev librte_security
 DIRS-$(CONFIG_RTE_LIBRTE_TELEMETRY) += librte_telemetry
 DEPDIRS-librte_telemetry := librte_eal librte_metrics librte_ethdev
+DIRS-$(CONFIG_RTE_LIBRTE_RCU) += librte_rcu
+DEPDIRS-librte_rcu := librte_eal
 
 ifeq ($(CONFIG_RTE_EXEC_ENV_LINUX),y)
 DIRS-$(CONFIG_RTE_LIBRTE_KNI) += librte_kni
diff --git a/lib/librte_rcu/Makefile b/lib/librte_rcu/Makefile
new file mode 100644
index 000000000..6aa677bd1
--- /dev/null
+++ b/lib/librte_rcu/Makefile
@@ -0,0 +1,23 @@ 
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2018 Arm Limited
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+# library name
+LIB = librte_rcu.a
+
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3
+LDLIBS += -lrte_eal
+
+EXPORT_MAP := rte_rcu_version.map
+
+LIBABIVER := 1
+
+# all source are stored in SRCS-y
+SRCS-$(CONFIG_RTE_LIBRTE_RCU) := rte_rcu_qsbr.c
+
+# install includes
+SYMLINK-$(CONFIG_RTE_LIBRTE_RCU)-include := rte_rcu_qsbr.h
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_rcu/meson.build b/lib/librte_rcu/meson.build
new file mode 100644
index 000000000..c009ae4b7
--- /dev/null
+++ b/lib/librte_rcu/meson.build
@@ -0,0 +1,5 @@ 
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2018 Arm Limited
+
+sources = files('rte_rcu_qsbr.c')
+headers = files('rte_rcu_qsbr.h')
diff --git a/lib/librte_rcu/rte_rcu_qsbr.c b/lib/librte_rcu/rte_rcu_qsbr.c
new file mode 100644
index 000000000..4aeb5f37f
--- /dev/null
+++ b/lib/librte_rcu/rte_rcu_qsbr.c
@@ -0,0 +1,237 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Copyright (c) 2018 Arm Limited
+ */
+
+#include <stdio.h>
+#include <string.h>
+#include <stdint.h>
+#include <errno.h>
+
+#include <rte_common.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+#include <rte_malloc.h>
+#include <rte_eal.h>
+#include <rte_eal_memconfig.h>
+#include <rte_atomic.h>
+#include <rte_per_lcore.h>
+#include <rte_lcore.h>
+#include <rte_errno.h>
+
+#include "rte_rcu_qsbr.h"
+
+/* Get the memory size of QSBR variable */
+size_t __rte_experimental
+rte_rcu_qsbr_get_memsize(uint32_t max_threads)
+{
+	size_t sz;
+
+	if (max_threads == 0) {
+		rte_log(RTE_LOG_ERR, rcu_log_type,
+			"%s(): Invalid max_threads %u\n",
+			__func__, max_threads);
+		rte_errno = EINVAL;
+
+		return 1;
+	}
+
+	sz = sizeof(struct rte_rcu_qsbr);
+
+	/* Add the size of quiescent state counter array */
+	sz += sizeof(struct rte_rcu_qsbr_cnt) * max_threads;
+
+	/* Add the size of the registered thread ID bitmap array */
+	sz += RTE_QSBR_THRID_ARRAY_SIZE(max_threads);
+
+	return sz;
+}
+
+/* Initialize a quiescent state variable */
+int __rte_experimental
+rte_rcu_qsbr_init(struct rte_rcu_qsbr *v, uint32_t max_threads)
+{
+	size_t sz;
+
+	if (v == NULL) {
+		rte_log(RTE_LOG_ERR, rcu_log_type,
+			"%s(): Invalid input parameter\n", __func__);
+		rte_errno = EINVAL;
+
+		return 1;
+	}
+
+	sz = rte_rcu_qsbr_get_memsize(max_threads);
+	if (sz == 1)
+		return 1;
+
+	/* Set all the threads to offline */
+	memset(v, 0, sz);
+	v->max_threads = max_threads;
+	v->num_elems = RTE_ALIGN_MUL_CEIL(max_threads,
+			RTE_QSBR_THRID_ARRAY_ELM_SIZE) /
+			RTE_QSBR_THRID_ARRAY_ELM_SIZE;
+	v->token = RTE_QSBR_CNT_INIT;
+
+	return 0;
+}
+
+/* Register a reader thread to report its quiescent state
+ * on a QS variable.
+ */
+int __rte_experimental
+rte_rcu_qsbr_thread_register(struct rte_rcu_qsbr *v, unsigned int thread_id)
+{
+	unsigned int i, id, success;
+	uint64_t old_bmap, new_bmap;
+
+	if (v == NULL || thread_id >= v->max_threads) {
+		rte_log(RTE_LOG_ERR, rcu_log_type,
+			"%s(): Invalid input parameter\n", __func__);
+		rte_errno = EINVAL;
+
+		return 1;
+	}
+
+	id = thread_id & RTE_QSBR_THRID_MASK;
+	i = thread_id >> RTE_QSBR_THRID_INDEX_SHIFT;
+
+	/* Make sure that the counter for registered threads does not
+	 * go out of sync. Hence, additional checks are required.
+	 */
+	/* Check if the thread is already registered */
+	old_bmap = __atomic_load_n(RTE_QSBR_THRID_ARRAY_ELM(v, i),
+					__ATOMIC_RELAXED);
+	if (old_bmap & 1UL << id)
+		return 0;
+
+	do {
+		new_bmap = old_bmap | (1UL << id);
+		success = __atomic_compare_exchange(
+					RTE_QSBR_THRID_ARRAY_ELM(v, i),
+					&old_bmap, &new_bmap, 0,
+					__ATOMIC_RELEASE, __ATOMIC_RELAXED);
+
+		if (success)
+			__atomic_fetch_add(&v->num_threads,
+						1, __ATOMIC_RELAXED);
+		else if (old_bmap & (1UL << id))
+			/* Someone else registered this thread.
+			 * Counter should not be incremented.
+			 */
+			return 0;
+	} while (success == 0);
+
+	return 0;
+}
+
+/* Remove a reader thread, from the list of threads reporting their
+ * quiescent state on a QS variable.
+ */
+int __rte_experimental
+rte_rcu_qsbr_thread_unregister(struct rte_rcu_qsbr *v, unsigned int thread_id)
+{
+	unsigned int i, id, success;
+	uint64_t old_bmap, new_bmap;
+
+	if (v == NULL || thread_id >= v->max_threads) {
+		rte_log(RTE_LOG_ERR, rcu_log_type,
+			"%s(): Invalid input parameter\n", __func__);
+		rte_errno = EINVAL;
+
+		return 1;
+	}
+
+	id = thread_id & RTE_QSBR_THRID_MASK;
+	i = thread_id >> RTE_QSBR_THRID_INDEX_SHIFT;
+
+	/* Make sure that the counter for registered threads does not
+	 * go out of sync. Hence, additional checks are required.
+	 */
+	/* Check if the thread is already unregistered */
+	old_bmap = __atomic_load_n(RTE_QSBR_THRID_ARRAY_ELM(v, i),
+					__ATOMIC_RELAXED);
+	if (old_bmap & ~(1UL << id))
+		return 0;
+
+	do {
+		new_bmap = old_bmap & ~(1UL << id);
+		/* Make sure any loads of the shared data structure are
+		 * completed before removal of the thread from the list of
+		 * reporting threads.
+		 */
+		success = __atomic_compare_exchange(
+					RTE_QSBR_THRID_ARRAY_ELM(v, i),
+					&old_bmap, &new_bmap, 0,
+					__ATOMIC_RELEASE, __ATOMIC_RELAXED);
+
+		if (success)
+			__atomic_fetch_sub(&v->num_threads,
+						1, __ATOMIC_RELAXED);
+		else if (old_bmap & ~(1UL << id))
+			/* Someone else unregistered this thread.
+			 * Counter should not be incremented.
+			 */
+			return 0;
+	} while (success == 0);
+
+	return 0;
+}
+
+/* Dump the details of a single quiescent state variable to a file. */
+int __rte_experimental
+rte_rcu_qsbr_dump(FILE *f, struct rte_rcu_qsbr *v)
+{
+	uint64_t bmap;
+	uint32_t i, t;
+
+	if (v == NULL || f == NULL) {
+		rte_log(RTE_LOG_ERR, rcu_log_type,
+			"%s(): Invalid input parameter\n", __func__);
+		rte_errno = EINVAL;
+
+		return 1;
+	}
+
+	fprintf(f, "\nQuiescent State Variable @%p\n", v);
+
+	fprintf(f, "  QS variable memory size = %lu\n",
+				rte_rcu_qsbr_get_memsize(v->max_threads));
+	fprintf(f, "  Given # max threads = %u\n", v->max_threads);
+	fprintf(f, "  Current # threads = %u\n", v->num_threads);
+
+	fprintf(f, "  Registered thread ID mask = 0x");
+	for (i = 0; i < v->num_elems; i++)
+		fprintf(f, "%lx", __atomic_load_n(
+					RTE_QSBR_THRID_ARRAY_ELM(v, i),
+					__ATOMIC_ACQUIRE));
+	fprintf(f, "\n");
+
+	fprintf(f, "  Token = %lu\n",
+			__atomic_load_n(&v->token, __ATOMIC_ACQUIRE));
+
+	fprintf(f, "Quiescent State Counts for readers:\n");
+	for (i = 0; i < v->num_elems; i++) {
+		bmap = __atomic_load_n(RTE_QSBR_THRID_ARRAY_ELM(v, i),
+					__ATOMIC_ACQUIRE);
+		while (bmap) {
+			t = __builtin_ctzl(bmap);
+			fprintf(f, "thread ID = %d, count = %lu\n", t,
+				__atomic_load_n(
+					&v->qsbr_cnt[i].cnt,
+					__ATOMIC_RELAXED));
+			bmap &= ~(1UL << t);
+		}
+	}
+
+	return 0;
+}
+
+int rcu_log_type;
+
+RTE_INIT(rte_rcu_register)
+{
+	rcu_log_type = rte_log_register("lib.rcu");
+	if (rcu_log_type >= 0)
+		rte_log_set_level(rcu_log_type, RTE_LOG_ERR);
+}
diff --git a/lib/librte_rcu/rte_rcu_qsbr.h b/lib/librte_rcu/rte_rcu_qsbr.h
new file mode 100644
index 000000000..304534a2d
--- /dev/null
+++ b/lib/librte_rcu/rte_rcu_qsbr.h
@@ -0,0 +1,645 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2018 Arm Limited
+ */
+
+#ifndef _RTE_RCU_QSBR_H_
+#define _RTE_RCU_QSBR_H_
+
+/**
+ * @file
+ * RTE Quiescent State Based Reclamation (QSBR)
+ *
+ * Quiescent State (QS) is any point in the thread execution
+ * where the thread does not hold a reference to a data structure
+ * in shared memory. While using lock-less data structures, the writer
+ * can safely free memory once all the reader threads have entered
+ * quiescent state.
+ *
+ * This library provides the ability for the readers to report quiescent
+ * state and for the writers to identify when all the readers have
+ * entered quiescent state.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stdio.h>
+#include <stdint.h>
+#include <errno.h>
+#include <rte_common.h>
+#include <rte_memory.h>
+#include <rte_lcore.h>
+#include <rte_debug.h>
+#include <rte_atomic.h>
+#define RTE_LIBRTE_RCU_DEBUG
+extern int rcu_log_type;
+
+#if RTE_LOG_DP_LEVEL >= RTE_LOG_DEBUG
+#define RCU_DP_LOG(level, fmt, args...) \
+	rte_log(RTE_LOG_ ## level, rcu_log_type, \
+		"%s(): " fmt "\n", __func__, ## args)
+#else
+#define RCU_DP_LOG(level, fmt, args...)
+#endif
+
+/* Registered thread IDs are stored as a bitmap of 64b element array.
+ * Given thread id needs to be converted to index into the array and
+ * the id within the array element.
+ */
+#define RTE_QSBR_THRID_ARRAY_ELM_SIZE (sizeof(uint64_t) * 8)
+#define RTE_QSBR_THRID_ARRAY_SIZE(max_threads) \
+	RTE_ALIGN(RTE_ALIGN_MUL_CEIL(max_threads, \
+		RTE_QSBR_THRID_ARRAY_ELM_SIZE) >> 3, RTE_CACHE_LINE_SIZE)
+#define RTE_QSBR_THRID_ARRAY_ELM(v, i) ((uint64_t *) \
+	((struct rte_rcu_qsbr_cnt *)(v + 1) + v->max_threads) + i)
+#define RTE_QSBR_THRID_INDEX_SHIFT 6
+#define RTE_QSBR_THRID_MASK 0x3f
+#define RTE_QSBR_THRID_INVALID 0xffffffff
+
+/* Worker thread counter */
+struct rte_rcu_qsbr_cnt {
+	uint64_t cnt;
+	/**< Quiescent state counter. Value 0 indicates the thread is offline
+	 *   64b counter is used to avoid adding more code to address
+	 *   counter overflow. Changing this to 32b would require additional
+	 *   changes to various APIs.
+	 */
+	uint32_t lock_cnt;
+	/**< Lock counter. Used when CONFIG_RTE_LIBRTE_RCU_DEBUG is enabled */
+} __rte_cache_aligned;
+
+#define RTE_QSBR_CNT_THR_OFFLINE 0
+#define RTE_QSBR_CNT_INIT 1
+
+/* RTE Quiescent State variable structure.
+ * This structure has two elements that vary in size based on the
+ * 'max_threads' parameter.
+ * 1) Quiescent state counter array
+ * 2) Register thread ID array
+ */
+struct rte_rcu_qsbr {
+	uint64_t token __rte_cache_aligned;
+	/**< Counter to allow for multiple concurrent quiescent state queries */
+
+	uint32_t num_elems __rte_cache_aligned;
+	/**< Number of elements in the thread ID array */
+	uint32_t num_threads;
+	/**< Number of threads currently using this QS variable */
+	uint32_t max_threads;
+	/**< Maximum number of threads using this QS variable */
+
+	struct rte_rcu_qsbr_cnt qsbr_cnt[0] __rte_cache_aligned;
+	/**< Quiescent state counter array of 'max_threads' elements */
+
+	/**< Registered thread IDs are stored in a bitmap array,
+	 *   after the quiescent state counter array.
+	 */
+} __rte_cache_aligned;
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Return the size of the memory occupied by a Quiescent State variable.
+ *
+ * @param max_threads
+ *   Maximum number of threads reporting quiescent state on this variable.
+ * @return
+ *   On success - size of memory in bytes required for this QS variable.
+ *   On error - 1 with error code set in rte_errno.
+ *   Possible rte_errno codes are:
+ *   - EINVAL - max_threads is 0
+ */
+size_t __rte_experimental
+rte_rcu_qsbr_get_memsize(uint32_t max_threads);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Initialize a Quiescent State (QS) variable.
+ *
+ * @param v
+ *   QS variable
+ * @param max_threads
+ *   Maximum number of threads reporting quiescent state on this variable.
+ *   This should be the same value as passed to rte_rcu_qsbr_get_memsize.
+ * @return
+ *   On success - 0
+ *   On error - 1 with error code set in rte_errno.
+ *   Possible rte_errno codes are:
+ *   - EINVAL - max_threads is 0 or 'v' is NULL.
+ *
+ */
+int __rte_experimental
+rte_rcu_qsbr_init(struct rte_rcu_qsbr *v, uint32_t max_threads);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Register a reader thread to report its quiescent state
+ * on a QS variable.
+ *
+ * This is implemented as a lock-free function. It is multi-thread
+ * safe.
+ * Any reader thread that wants to report its quiescent state must
+ * call this API. This can be called during initialization or as part
+ * of the packet processing loop.
+ *
+ * Note that rte_rcu_qsbr_thread_online must be called before the
+ * thread updates its quiescent state using rte_rcu_qsbr_quiescent.
+ *
+ * @param v
+ *   QS variable
+ * @param thread_id
+ *   Reader thread with this thread ID will report its quiescent state on
+ *   the QS variable. thread_id is a value between 0 and (max_threads - 1).
+ *   'max_threads' is the parameter passed in 'rte_rcu_qsbr_init' API.
+ */
+int __rte_experimental
+rte_rcu_qsbr_thread_register(struct rte_rcu_qsbr *v, unsigned int thread_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Remove a reader thread, from the list of threads reporting their
+ * quiescent state on a QS variable.
+ *
+ * This is implemented as a lock-free function. It is multi-thread safe.
+ * This API can be called from the reader threads during shutdown.
+ * Ongoing quiescent state queries will stop waiting for the status from this
+ * unregistered reader thread.
+ *
+ * @param v
+ *   QS variable
+ * @param thread_id
+ *   Reader thread with this thread ID will stop reporting its quiescent
+ *   state on the QS variable.
+ */
+int __rte_experimental
+rte_rcu_qsbr_thread_unregister(struct rte_rcu_qsbr *v, unsigned int thread_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add a registered reader thread, to the list of threads reporting their
+ * quiescent state on a QS variable.
+ *
+ * This is implemented as a lock-free function. It is multi-thread
+ * safe.
+ *
+ * Any registered reader thread that wants to report its quiescent state must
+ * call this API before calling rte_rcu_qsbr_quiescent. This can be called
+ * during initialization or as part of the packet processing loop.
+ *
+ * The reader thread must call rte_rcu_thread_offline API, before
+ * calling any functions that block, to ensure that rte_rcu_qsbr_check
+ * API does not wait indefinitely for the reader thread to update its QS.
+ *
+ * The reader thread must call rte_rcu_thread_online API, after the blocking
+ * function call returns, to ensure that rte_rcu_qsbr_check API
+ * waits for the reader thread to update its quiescent state.
+ *
+ * @param v
+ *   QS variable
+ * @param thread_id
+ *   Reader thread with this thread ID will report its quiescent state on
+ *   the QS variable.
+ */
+static __rte_always_inline void __rte_experimental
+rte_rcu_qsbr_thread_online(struct rte_rcu_qsbr *v, unsigned int thread_id)
+{
+	uint64_t t;
+
+	RTE_ASSERT(v != NULL && thread_id < v->max_threads);
+
+	/* Copy the current value of token.
+	 * The fence at the end of the function will ensure that
+	 * the following will not move down after the load of any shared
+	 * data structure.
+	 */
+	t = __atomic_load_n(&v->token, __ATOMIC_RELAXED);
+
+	/* __atomic_store_n(cnt, __ATOMIC_RELAXED) is used to ensure
+	 * 'cnt' (64b) is accessed atomically.
+	 */
+	__atomic_store_n(&v->qsbr_cnt[thread_id].cnt,
+		t, __ATOMIC_RELAXED);
+
+	/* The subsequent load of the data structure should not
+	 * move above the store. Hence a store-load barrier
+	 * is required.
+	 * If the load of the data structure moves above the store,
+	 * writer might not see that the reader is online, even though
+	 * the reader is referencing the shared data structure.
+	 */
+#ifdef RTE_ARCH_X86_64
+	/* rte_smp_mb() for x86 is lighter */
+	rte_smp_mb();
+#else
+	__atomic_thread_fence(__ATOMIC_SEQ_CST);
+#endif
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Remove a registered reader thread from the list of threads reporting their
+ * quiescent state on a QS variable.
+ *
+ * This is implemented as a lock-free function. It is multi-thread
+ * safe.
+ *
+ * This can be called during initialization or as part of the packet
+ * processing loop.
+ *
+ * The reader thread must call rte_rcu_thread_offline API, before
+ * calling any functions that block, to ensure that rte_rcu_qsbr_check
+ * API does not wait indefinitely for the reader thread to update its QS.
+ *
+ * @param v
+ *   QS variable
+ * @param thread_id
+ *   rte_rcu_qsbr_check API will not wait for the reader thread with
+ *   this thread ID to report its quiescent state on the QS variable.
+ */
+static __rte_always_inline void __rte_experimental
+rte_rcu_qsbr_thread_offline(struct rte_rcu_qsbr *v, unsigned int thread_id)
+{
+	RTE_ASSERT(v != NULL && thread_id < v->max_threads);
+
+	/* The reader can go offline only after the load of the
+	 * data structure is completed. i.e. any load of the
+	 * data strcture can not move after this store.
+	 */
+
+	__atomic_store_n(&v->qsbr_cnt[thread_id].cnt,
+		RTE_QSBR_CNT_THR_OFFLINE, __ATOMIC_RELEASE);
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Acquire a lock for accessing a shared data structure.
+ *
+ * This is implemented as a lock-free function. It is multi-thread
+ * safe.
+ *
+ * This API is provided to aid debugging. This should be called before
+ * accessing a shared data structure.
+ *
+ * When CONFIG_RTE_LIBRTE_RCU_DEBUG is enabled a lock counter is incremented.
+ * Similarly rte_rcu_qsbr_unlock will decrement the counter. When the
+ * rte_rcu_qsbr_check API will verify that this counter is 0.
+ *
+ * When CONFIG_RTE_LIBRTE_RCU_DEBUG is disabled, this API will do nothing.
+ *
+ * @param v
+ *   QS variable
+ * @param thread_id
+ *   Reader thread id
+ */
+static __rte_always_inline void __rte_experimental
+rte_rcu_qsbr_lock(struct rte_rcu_qsbr *v __rte_unused,
+			unsigned int thread_id __rte_unused)
+{
+#if defined(RTE_LIBRTE_RCU_DEBUG)
+	RTE_ASSERT(v != NULL && thread_id < v->max_threads);
+
+	/* Increment the lock counter */
+	__atomic_fetch_add(&v->qsbr_cnt[thread_id].lock_cnt,
+				1, __ATOMIC_ACQUIRE);
+#endif
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Release a lock after accessing a shared data structure.
+ *
+ * This is implemented as a lock-free function. It is multi-thread
+ * safe.
+ *
+ * This API is provided to aid debugging. This should be called after
+ * accessing a shared data structure.
+ *
+ * When CONFIG_RTE_LIBRTE_RCU_DEBUG is enabled, rte_rcu_qsbr_unlock will
+ * decrement a lock counter. rte_rcu_qsbr_check API will verify that this
+ * counter is 0.
+ *
+ * When CONFIG_RTE_LIBRTE_RCU_DEBUG is disabled, this API will do nothing.
+ *
+ * @param v
+ *   QS variable
+ * @param thread_id
+ *   Reader thread id
+ */
+static __rte_always_inline void __rte_experimental
+rte_rcu_qsbr_unlock(struct rte_rcu_qsbr *v __rte_unused,
+			unsigned int thread_id __rte_unused)
+{
+#if defined(RTE_LIBRTE_RCU_DEBUG)
+	RTE_ASSERT(v != NULL && thread_id < v->max_threads);
+
+	/* Decrement the lock counter */
+	__atomic_fetch_sub(&v->qsbr_cnt[thread_id].lock_cnt,
+				1, __ATOMIC_RELEASE);
+
+	if (v->qsbr_cnt[thread_id].lock_cnt)
+		rte_log(RTE_LOG_WARNING, rcu_log_type,
+			"%s(): Lock counter %u. Nested locks?\n",
+			__func__, v->qsbr_cnt[thread_id].lock_cnt);
+#endif
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Ask the reader threads to report the quiescent state
+ * status.
+ *
+ * This is implemented as a lock-free function. It is multi-thread
+ * safe and can be called from worker threads.
+ *
+ * @param v
+ *   QS variable
+ * @return
+ *   - This is the token for this call of the API. This should be
+ *     passed to rte_rcu_qsbr_check API.
+ */
+static __rte_always_inline uint64_t __rte_experimental
+rte_rcu_qsbr_start(struct rte_rcu_qsbr *v)
+{
+	uint64_t t;
+
+	RTE_ASSERT(v != NULL);
+
+	/* Release the changes to the shared data structure.
+	 * This store release will ensure that changes to any data
+	 * structure are visible to the workers before the token
+	 * update is visible.
+	 */
+	t = __atomic_add_fetch(&v->token, 1, __ATOMIC_RELEASE);
+
+	return t;
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Update quiescent state for a reader thread.
+ *
+ * This is implemented as a lock-free function. It is multi-thread safe.
+ * All the reader threads registered to report their quiescent state
+ * on the QS variable must call this API.
+ *
+ * @param v
+ *   QS variable
+ * @param thread_id
+ *   Update the quiescent state for the reader with this thread ID.
+ */
+static __rte_always_inline void __rte_experimental
+rte_rcu_qsbr_quiescent(struct rte_rcu_qsbr *v, unsigned int thread_id)
+{
+	uint64_t t;
+
+	RTE_ASSERT(v != NULL && thread_id < v->max_threads);
+
+#if defined(RTE_LIBRTE_RCU_DEBUG)
+	/* Validate that the lock counter is 0 */
+	if (v->qsbr_cnt[thread_id].lock_cnt)
+		rte_log(RTE_LOG_ERR, rcu_log_type,
+			"%s(): Lock counter %u, should be 0\n",
+			__func__, v->qsbr_cnt[thread_id].lock_cnt);
+#endif
+
+	/* Acquire the changes to the shared data structure released
+	 * by rte_rcu_qsbr_start.
+	 * Later loads of the shared data structure should not move
+	 * above this load. Hence, use load-acquire.
+	 */
+	t = __atomic_load_n(&v->token, __ATOMIC_ACQUIRE);
+
+	/* Inform the writer that updates are visible to this reader.
+	 * Prior loads of the shared data structure should not move
+	 * beyond this store. Hence use store-release.
+	 */
+	__atomic_store_n(&v->qsbr_cnt[thread_id].cnt,
+			 t, __ATOMIC_RELEASE);
+
+	RCU_DP_LOG(DEBUG, "%s: update: token = %lu, Thread ID = %d",
+		__func__, t, thread_id);
+}
+
+/* Check the quiescent state counter for registered threads only, assuming
+ * that not all threads have registered.
+ */
+static __rte_always_inline int
+__rcu_qsbr_check_selective(struct rte_rcu_qsbr *v, uint64_t t, bool wait)
+{
+	uint32_t i, j, id;
+	uint64_t bmap;
+	uint64_t c;
+	uint64_t *reg_thread_id;
+
+	for (i = 0, reg_thread_id = RTE_QSBR_THRID_ARRAY_ELM(v, 0);
+		i < v->num_elems;
+		i++, reg_thread_id++) {
+		/* Load the current registered thread bit map before
+		 * loading the reader thread quiescent state counters.
+		 */
+		bmap = __atomic_load_n(reg_thread_id, __ATOMIC_ACQUIRE);
+		id = i << RTE_QSBR_THRID_INDEX_SHIFT;
+
+		while (bmap) {
+			j = __builtin_ctzl(bmap);
+			RCU_DP_LOG(DEBUG,
+				"%s: check: token = %lu, wait = %d, Bit Map = 0x%lx, Thread ID = %d",
+				__func__, t, wait, bmap, id + j);
+			c = __atomic_load_n(
+					&v->qsbr_cnt[id + j].cnt,
+					__ATOMIC_ACQUIRE);
+			RCU_DP_LOG(DEBUG,
+				"%s: status: token = %lu, wait = %d, Thread QS cnt = %lu, Thread ID = %d",
+				__func__, t, wait, c, id+j);
+			/* Counter is not checked for wrap-around condition
+			 * as it is a 64b counter.
+			 */
+			if (unlikely(c != RTE_QSBR_CNT_THR_OFFLINE && c < t)) {
+				/* This thread is not in quiescent state */
+				if (!wait)
+					return 0;
+
+				rte_pause();
+				/* This thread might have unregistered.
+				 * Re-read the bitmap.
+				 */
+				bmap = __atomic_load_n(reg_thread_id,
+						__ATOMIC_ACQUIRE);
+
+				continue;
+			}
+
+			bmap &= ~(1UL << j);
+		}
+	}
+
+	return 1;
+}
+
+/* Check the quiescent state counter for all threads, assuming that
+ * all the threads have registered.
+ */
+static __rte_always_inline int
+__rcu_qsbr_check_all(struct rte_rcu_qsbr *v, uint64_t t, bool wait)
+{
+	uint32_t i;
+	struct rte_rcu_qsbr_cnt *cnt;
+	uint64_t c;
+
+	for (i = 0, cnt = v->qsbr_cnt; i < v->max_threads; i++, cnt++) {
+		RCU_DP_LOG(DEBUG,
+			"%s: check: token = %lu, wait = %d, Thread ID = %d",
+			__func__, t, wait, i);
+		while (1) {
+			c = __atomic_load_n(&cnt->cnt, __ATOMIC_ACQUIRE);
+			RCU_DP_LOG(DEBUG,
+				"%s: status: token = %lu, wait = %d, Thread QS cnt = %lu, Thread ID = %d",
+				__func__, t, wait, c, i);
+			/* Counter is not checked for wrap-around condition
+			 * as it is a 64b counter.
+			 */
+			if (likely(c == RTE_QSBR_CNT_THR_OFFLINE || c >= t))
+				break;
+
+			/* This thread is not in quiescent state */
+			if (!wait)
+				return 0;
+
+			rte_pause();
+		}
+	}
+
+	return 1;
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Checks if all the reader threads have entered the quiescent state
+ * referenced by token.
+ *
+ * This is implemented as a lock-free function. It is multi-thread
+ * safe and can be called from the worker threads as well.
+ *
+ * If this API is called with 'wait' set to true, the following
+ * factors must be considered:
+ *
+ * 1) If the calling thread is also reporting the status on the
+ * same QS variable, it must update the quiescent state status, before
+ * calling this API.
+ *
+ * 2) In addition, while calling from multiple threads, only
+ * one of those threads can be reporting the quiescent state status
+ * on a given QS variable.
+ *
+ * @param v
+ *   QS variable
+ * @param t
+ *   Token returned by rte_rcu_qsbr_start API
+ * @param wait
+ *   If true, block till all the reader threads have completed entering
+ *   the quiescent state referenced by token 't'.
+ * @return
+ *   - 0 if all reader threads have NOT passed through specified number
+ *     of quiescent states.
+ *   - 1 if all reader threads have passed through specified number
+ *     of quiescent states.
+ */
+static __rte_always_inline int __rte_experimental
+rte_rcu_qsbr_check(struct rte_rcu_qsbr *v, uint64_t t, bool wait)
+{
+	RTE_ASSERT(v != NULL);
+
+	if (likely(v->num_threads == v->max_threads))
+		return __rcu_qsbr_check_all(v, t, wait);
+	else
+		return __rcu_qsbr_check_selective(v, t, wait);
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Wait till the reader threads have entered quiescent state.
+ *
+ * This is implemented as a lock-free function. It is multi-thread safe.
+ * This API can be thought of as a wrapper around rte_rcu_qsbr_start and
+ * rte_rcu_qsbr_check APIs.
+ *
+ * If this API is called from multiple threads, only one of
+ * those threads can be reporting the quiescent state status on a
+ * given QS variable.
+ *
+ * @param v
+ *   QS variable
+ * @param thread_id
+ *   Thread ID of the caller if it is registered to report quiescent state
+ *   on this QS variable (i.e. the calling thread is also part of the
+ *   readside critical section). If not, pass RTE_QSBR_THRID_INVALID.
+ */
+static __rte_always_inline void __rte_experimental
+rte_rcu_qsbr_synchronize(struct rte_rcu_qsbr *v, unsigned int thread_id)
+{
+	uint64_t t;
+
+	RTE_ASSERT(v != NULL);
+
+	t = rte_rcu_qsbr_start(v);
+
+	/* If the current thread has readside critical section,
+	 * update its quiescent state status.
+	 */
+	if (thread_id != RTE_QSBR_THRID_INVALID)
+		rte_rcu_qsbr_quiescent(v, thread_id);
+
+	/* Wait for other readers to enter quiescent state */
+	rte_rcu_qsbr_check(v, t, true);
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Dump the details of a single QS variables to a file.
+ *
+ * It is NOT multi-thread safe.
+ *
+ * @param f
+ *   A pointer to a file for output
+ * @param v
+ *   QS variable
+ * @return
+ *   On success - 0
+ *   On error - 1 with error code set in rte_errno.
+ *   Possible rte_errno codes are:
+ *   - EINVAL - NULL parameters are passed
+ */
+int __rte_experimental
+rte_rcu_qsbr_dump(FILE *f, struct rte_rcu_qsbr *v);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_RCU_QSBR_H_ */
diff --git a/lib/librte_rcu/rte_rcu_version.map b/lib/librte_rcu/rte_rcu_version.map
new file mode 100644
index 000000000..ad8cb517c
--- /dev/null
+++ b/lib/librte_rcu/rte_rcu_version.map
@@ -0,0 +1,11 @@ 
+EXPERIMENTAL {
+	global:
+
+	rte_rcu_qsbr_get_memsize;
+	rte_rcu_qsbr_init;
+	rte_rcu_qsbr_thread_register;
+	rte_rcu_qsbr_thread_unregister;
+	rte_rcu_qsbr_dump;
+
+	local: *;
+};
diff --git a/lib/meson.build b/lib/meson.build
index 595314d7d..67be10659 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -22,7 +22,7 @@  libraries = [
 	'gro', 'gso', 'ip_frag', 'jobstats',
 	'kni', 'latencystats', 'lpm', 'member',
 	'power', 'pdump', 'rawdev',
-	'reorder', 'sched', 'security', 'stack', 'vhost',
+	'reorder', 'sched', 'security', 'stack', 'vhost', 'rcu',
 	#ipsec lib depends on crypto and security
 	'ipsec',
 	# add pkt framework libs which use other libs from above
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 7d994bece..e93cc366d 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -97,6 +97,7 @@  _LDLIBS-$(CONFIG_RTE_LIBRTE_EAL)            += -lrte_eal
 _LDLIBS-$(CONFIG_RTE_LIBRTE_CMDLINE)        += -lrte_cmdline
 _LDLIBS-$(CONFIG_RTE_LIBRTE_REORDER)        += -lrte_reorder
 _LDLIBS-$(CONFIG_RTE_LIBRTE_SCHED)          += -lrte_sched
+_LDLIBS-$(CONFIG_RTE_LIBRTE_RCU)            += -lrte_rcu
 
 ifeq ($(CONFIG_RTE_EXEC_ENV_LINUX),y)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_KNI)            += -lrte_kni