mbox series

[RFC,v2,0/7] introduce new barrier class and use it for mlx5 PMD

Message ID 20200410164127.54229-1-gavin.hu@arm.com (mailing list archive)
Headers
Series introduce new barrier class and use it for mlx5 PMD |

Message

Gavin Hu April 10, 2020, 4:41 p.m. UTC
To order writes to various memory types, 'sfence' is required for x86,
and 'dmb oshst' is required for aarch64. 

But within DPDK, there is no abstracted barriers covers this
combination: sfence(x86)/dmb(aarch64).

So introduce a new barrier class - rte_dma_*mb for this combination, 

Doorbell rings are typical use cases of this new barrier class, which
requires something ready in the memory before letting HW aware.

As a note, rte_io_wmb and rte_cio_wmb are compiler barriers for x86, while
rte_wmb is 'dsb' for aarch64.

In the joint preliminary testing between Arm and Ampere, 8%~13%
performance boost was measured.

As there is no functionality changes, it will not impact x86. 

Gavin Hu (6):
  eal: introduce new class of barriers for DMA use cases
  net/mlx5: dmb for immediate doorbell ring on aarch64
  net/mlx5: relax barrier to order UAR writes on aarch64
  net/mlx5: relax barrier for aarch64
  net/mlx5: add descriptive comment for a barrier
  doc: clarify one configuration in mlx5 guide

Phil Yang (1):
  net/mlx5: relax ordering for multi-packet RQ buffer refcnt

 doc/guides/nics/mlx5.rst                    |  6 ++--
 drivers/net/mlx5/mlx5_rxq.c                 |  2 +-
 drivers/net/mlx5/mlx5_rxtx.c                | 16 ++++++-----
 drivers/net/mlx5/mlx5_rxtx.h                | 14 ++++++----
 lib/librte_eal/arm/include/rte_atomic_32.h  |  6 ++++
 lib/librte_eal/arm/include/rte_atomic_64.h  |  6 ++++
 lib/librte_eal/include/generic/rte_atomic.h | 31 +++++++++++++++++++++
 lib/librte_eal/ppc/include/rte_atomic.h     |  6 ++++
 lib/librte_eal/x86/include/rte_atomic.h     |  6 ++++
 9 files changed, 78 insertions(+), 15 deletions(-)
  

Comments

Andrew Rybchenko April 10, 2020, 5:20 p.m. UTC | #1
On 4/10/20 7:41 PM, Gavin Hu wrote:
> To order writes to various memory types, 'sfence' is required for x86,
> and 'dmb oshst' is required for aarch64.
> 
> But within DPDK, there is no abstracted barriers covers this
> combination: sfence(x86)/dmb(aarch64).
> 
> So introduce a new barrier class - rte_dma_*mb for this combination,
> 
> Doorbell rings are typical use cases of this new barrier class, which
> requires something ready in the memory before letting HW aware.
> 
> As a note, rte_io_wmb and rte_cio_wmb are compiler barriers for x86, while
> rte_wmb is 'dsb' for aarch64.

As far as I can see rte_cio_wmb() is exactly definition of the barrier
to be used for doorbells. Am I missing something?
May be it is just a bug in rte_cio_wmb() on x86?

> In the joint preliminary testing between Arm and Ampere, 8%~13%
> performance boost was measured.
> 
> As there is no functionality changes, it will not impact x86.
> 
> Gavin Hu (6):
>    eal: introduce new class of barriers for DMA use cases
>    net/mlx5: dmb for immediate doorbell ring on aarch64
>    net/mlx5: relax barrier to order UAR writes on aarch64
>    net/mlx5: relax barrier for aarch64
>    net/mlx5: add descriptive comment for a barrier
>    doc: clarify one configuration in mlx5 guide
> 
> Phil Yang (1):
>    net/mlx5: relax ordering for multi-packet RQ buffer refcnt
> 
>   doc/guides/nics/mlx5.rst                    |  6 ++--
>   drivers/net/mlx5/mlx5_rxq.c                 |  2 +-
>   drivers/net/mlx5/mlx5_rxtx.c                | 16 ++++++-----
>   drivers/net/mlx5/mlx5_rxtx.h                | 14 ++++++----
>   lib/librte_eal/arm/include/rte_atomic_32.h  |  6 ++++
>   lib/librte_eal/arm/include/rte_atomic_64.h  |  6 ++++
>   lib/librte_eal/include/generic/rte_atomic.h | 31 +++++++++++++++++++++
>   lib/librte_eal/ppc/include/rte_atomic.h     |  6 ++++
>   lib/librte_eal/x86/include/rte_atomic.h     |  6 ++++
>   9 files changed, 78 insertions(+), 15 deletions(-)
>
  
Gavin Hu April 11, 2020, 3:46 a.m. UTC | #2
Hi Andrew,

> -----Original Message-----
> From: Andrew Rybchenko <arybchenko@solarflare.com>
> Sent: Saturday, April 11, 2020 1:21 AM
> To: Gavin Hu <Gavin.Hu@arm.com>; dev@dpdk.org
> Cc: nd <nd@arm.com>; david.marchand@redhat.com;
> thomas@monjalon.net; rasland@mellanox.com; drc@linux.vnet.ibm.com;
> bruce.richardson@intel.com; konstantin.ananyev@intel.com;
> matan@mellanox.com; shahafs@mellanox.com; viacheslavo@mellanox.com;
> jerinj@marvell.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com>
> Subject: Re: [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class and
> use it for mlx5 PMD
> 
> On 4/10/20 7:41 PM, Gavin Hu wrote:
> > To order writes to various memory types, 'sfence' is required for x86,
> > and 'dmb oshst' is required for aarch64.
> >
> > But within DPDK, there is no abstracted barriers covers this
> > combination: sfence(x86)/dmb(aarch64).
> >
> > So introduce a new barrier class - rte_dma_*mb for this combination,
> >
> > Doorbell rings are typical use cases of this new barrier class, which
> > requires something ready in the memory before letting HW aware.
> >
> > As a note, rte_io_wmb and rte_cio_wmb are compiler barriers for x86,
> while
> > rte_wmb is 'dsb' for aarch64.
> 
> As far as I can see rte_cio_wmb() is exactly definition of the barrier
> to be used for doorbells. Am I missing something?

I understand rte_cio_wmb is for DMA buffers, for examples, descriptors, work queues, located in the host memory, but shared between CPU and IO device.
rte_io_wmb is for MMIO regions. 
We are missing the barriers for various memory types, eg. Doorbell cases.

There is an implication in the definition of rte_cio_wmb, it can not be used for non-coherent MMIO region(WC?)
http://code.dpdk.org/dpdk/v20.02/source/lib/librte_eal/common/include/generic/rte_atomic.h#L124
> May be it is just a bug in rte_cio_wmb() on x86?
rte_cio_wmb is ok for doorbells on aarch64, but looking through the kernel code, 'sfence' is required for various/mixed memory types.
DPDK mlx5 PMD uses rte_cio_wmb widely and wisely, it orders sequences of writes to host memory that shared by IO device.
Strengthening rte_cio_wmb may hurt performance, so a new barrier class is introduced to optimize for aarch64, in the fast path only, while not impacting x86.
http://code.dpdk.org/dpdk/v20.02/source/drivers/net/mlx5/mlx5_rxtx.c#L1087
/Gavin
> 
> > In the joint preliminary testing between Arm and Ampere, 8%~13%
> > performance boost was measured.
> >
> > As there is no functionality changes, it will not impact x86.
> >
> > Gavin Hu (6):
> >    eal: introduce new class of barriers for DMA use cases
> >    net/mlx5: dmb for immediate doorbell ring on aarch64
> >    net/mlx5: relax barrier to order UAR writes on aarch64
> >    net/mlx5: relax barrier for aarch64
> >    net/mlx5: add descriptive comment for a barrier
> >    doc: clarify one configuration in mlx5 guide
> >
> > Phil Yang (1):
> >    net/mlx5: relax ordering for multi-packet RQ buffer refcnt
> >
> >   doc/guides/nics/mlx5.rst                    |  6 ++--
> >   drivers/net/mlx5/mlx5_rxq.c                 |  2 +-
> >   drivers/net/mlx5/mlx5_rxtx.c                | 16 ++++++-----
> >   drivers/net/mlx5/mlx5_rxtx.h                | 14 ++++++----
> >   lib/librte_eal/arm/include/rte_atomic_32.h  |  6 ++++
> >   lib/librte_eal/arm/include/rte_atomic_64.h  |  6 ++++
> >   lib/librte_eal/include/generic/rte_atomic.h | 31 +++++++++++++++++++++
> >   lib/librte_eal/ppc/include/rte_atomic.h     |  6 ++++
> >   lib/librte_eal/x86/include/rte_atomic.h     |  6 ++++
> >   9 files changed, 78 insertions(+), 15 deletions(-)
> >
  
Andrew Rybchenko April 13, 2020, 9:51 a.m. UTC | #3
On 4/11/20 6:46 AM, Gavin Hu wrote:
> Hi Andrew,
> 
>> -----Original Message-----
>> From: Andrew Rybchenko <arybchenko@solarflare.com>
>> Sent: Saturday, April 11, 2020 1:21 AM
>> To: Gavin Hu <Gavin.Hu@arm.com>; dev@dpdk.org
>> Cc: nd <nd@arm.com>; david.marchand@redhat.com;
>> thomas@monjalon.net; rasland@mellanox.com; drc@linux.vnet.ibm.com;
>> bruce.richardson@intel.com; konstantin.ananyev@intel.com;
>> matan@mellanox.com; shahafs@mellanox.com; viacheslavo@mellanox.com;
>> jerinj@marvell.com; Honnappa Nagarahalli
>> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
>> <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong
>> <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com>
>> Subject: Re: [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class and
>> use it for mlx5 PMD
>>
>> On 4/10/20 7:41 PM, Gavin Hu wrote:
>>> To order writes to various memory types, 'sfence' is required for x86,
>>> and 'dmb oshst' is required for aarch64.
>>>
>>> But within DPDK, there is no abstracted barriers covers this
>>> combination: sfence(x86)/dmb(aarch64).
>>>
>>> So introduce a new barrier class - rte_dma_*mb for this combination,
>>>
>>> Doorbell rings are typical use cases of this new barrier class, which
>>> requires something ready in the memory before letting HW aware.
>>>
>>> As a note, rte_io_wmb and rte_cio_wmb are compiler barriers for x86,
>> while
>>> rte_wmb is 'dsb' for aarch64.
>>
>> As far as I can see rte_cio_wmb() is exactly definition of the barrier
>> to be used for doorbells. Am I missing something?
> 
> I understand rte_cio_wmb is for DMA buffers, for examples, descriptors, work queues, located in the host memory, but shared between CPU and IO device.
> rte_io_wmb is for MMIO regions. 
> We are missing the barriers for various memory types, eg. Doorbell cases.

When the patch series is applied, we'll have 5 types of memory
barriers: regular, smp, cio, io, dma. Do we really need so
many? May be we need a table in description which could
help to make the right choice. I.e. type of access on both
axis and type of barrier to use on intersection.

> There is an implication in the definition of rte_cio_wmb, it can not be used for non-coherent MMIO region(WC?)
> http://code.dpdk.org/dpdk/v20.02/source/lib/librte_eal/common/include/generic/rte_atomic.h#L124
>> May be it is just a bug in rte_cio_wmb() on x86?
> rte_cio_wmb is ok for doorbells on aarch64, but looking through the kernel code, 'sfence' is required for various/mixed memory types.
> DPDK mlx5 PMD uses rte_cio_wmb widely and wisely, it orders sequences of writes to host memory that shared by IO device.
> Strengthening rte_cio_wmb may hurt performance, so a new barrier class is introduced to optimize for aarch64, in the fast path only, while not impacting x86.
> http://code.dpdk.org/dpdk/v20.02/source/drivers/net/mlx5/mlx5_rxtx.c#L1087

May be my problem that I don't fully understand real-life
usecases when cio should be used in accordance with its
current definition. Does it make sense without doorbell?
Does HW polling via DMA?

Thanks for explanations,
Andrew.

>>
>>> In the joint preliminary testing between Arm and Ampere, 8%~13%
>>> performance boost was measured.
>>>
>>> As there is no functionality changes, it will not impact x86.
>>>
>>> Gavin Hu (6):
>>>    eal: introduce new class of barriers for DMA use cases
>>>    net/mlx5: dmb for immediate doorbell ring on aarch64
>>>    net/mlx5: relax barrier to order UAR writes on aarch64
>>>    net/mlx5: relax barrier for aarch64
>>>    net/mlx5: add descriptive comment for a barrier
>>>    doc: clarify one configuration in mlx5 guide
>>>
>>> Phil Yang (1):
>>>    net/mlx5: relax ordering for multi-packet RQ buffer refcnt
>>>
>>>   doc/guides/nics/mlx5.rst                    |  6 ++--
>>>   drivers/net/mlx5/mlx5_rxq.c                 |  2 +-
>>>   drivers/net/mlx5/mlx5_rxtx.c                | 16 ++++++-----
>>>   drivers/net/mlx5/mlx5_rxtx.h                | 14 ++++++----
>>>   lib/librte_eal/arm/include/rte_atomic_32.h  |  6 ++++
>>>   lib/librte_eal/arm/include/rte_atomic_64.h  |  6 ++++
>>>   lib/librte_eal/include/generic/rte_atomic.h | 31 +++++++++++++++++++++
>>>   lib/librte_eal/ppc/include/rte_atomic.h     |  6 ++++
>>>   lib/librte_eal/x86/include/rte_atomic.h     |  6 ++++
>>>   9 files changed, 78 insertions(+), 15 deletions(-)
>>>
>
  
Gavin Hu April 13, 2020, 4:46 p.m. UTC | #4
Hi Andrew,

> -----Original Message-----
> From: Andrew Rybchenko <arybchenko@solarflare.com>
> Sent: Monday, April 13, 2020 5:52 PM
> To: Gavin Hu <Gavin.Hu@arm.com>; dev@dpdk.org
> Cc: nd <nd@arm.com>; david.marchand@redhat.com;
> thomas@monjalon.net; rasland@mellanox.com; drc@linux.vnet.ibm.com;
> bruce.richardson@intel.com; konstantin.ananyev@intel.com;
> matan@mellanox.com; shahafs@mellanox.com; viacheslavo@mellanox.com;
> jerinj@marvell.com; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong
> <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com>
> Subject: Re: [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class and
> use it for mlx5 PMD
> 
> On 4/11/20 6:46 AM, Gavin Hu wrote:
> > Hi Andrew,
> >
> >> -----Original Message-----
> >> From: Andrew Rybchenko <arybchenko@solarflare.com>
> >> Sent: Saturday, April 11, 2020 1:21 AM
> >> To: Gavin Hu <Gavin.Hu@arm.com>; dev@dpdk.org
> >> Cc: nd <nd@arm.com>; david.marchand@redhat.com;
> >> thomas@monjalon.net; rasland@mellanox.com; drc@linux.vnet.ibm.com;
> >> bruce.richardson@intel.com; konstantin.ananyev@intel.com;
> >> matan@mellanox.com; shahafs@mellanox.com;
> viacheslavo@mellanox.com;
> >> jerinj@marvell.com; Honnappa Nagarahalli
> >> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang
> >> <Ruifeng.Wang@arm.com>; Phil Yang <Phil.Yang@arm.com>; Joyce Kong
> >> <Joyce.Kong@arm.com>; Steve Capper <Steve.Capper@arm.com>
> >> Subject: Re: [dpdk-dev] [PATCH RFC v2 0/7] introduce new barrier class
> and
> >> use it for mlx5 PMD
> >>
> >> On 4/10/20 7:41 PM, Gavin Hu wrote:
> >>> To order writes to various memory types, 'sfence' is required for x86,
> >>> and 'dmb oshst' is required for aarch64.
> >>>
> >>> But within DPDK, there is no abstracted barriers covers this
> >>> combination: sfence(x86)/dmb(aarch64).
> >>>
> >>> So introduce a new barrier class - rte_dma_*mb for this combination,
> >>>
> >>> Doorbell rings are typical use cases of this new barrier class, which
> >>> requires something ready in the memory before letting HW aware.
> >>>
> >>> As a note, rte_io_wmb and rte_cio_wmb are compiler barriers for x86,
> >> while
> >>> rte_wmb is 'dsb' for aarch64.
> >>
> >> As far as I can see rte_cio_wmb() is exactly definition of the barrier
> >> to be used for doorbells. Am I missing something?
> >
> > I understand rte_cio_wmb is for DMA buffers, for examples, descriptors,
> work queues, located in the host memory, but shared between CPU and IO
> device.
> > rte_io_wmb is for MMIO regions.
> > We are missing the barriers for various memory types, eg. Doorbell cases.
> 
> When the patch series is applied, we'll have 5 types of memory
> barriers: regular, smp, cio, io, dma. Do we really need so
> many? May be we need a table in description which could
> help to make the right choice. I.e. type of access on both
> axis and type of barrier to use on intersection.
Yes, good suggestion!
Actually Honnappa and I already made a table sheet for this.
Will provide it in next release!
Thanks for your opinions!
> 
> > There is an implication in the definition of rte_cio_wmb, it can not be used
> for non-coherent MMIO region(WC?)
> >
> http://code.dpdk.org/dpdk/v20.02/source/lib/librte_eal/common/include/g
> eneric/rte_atomic.h#L124
> >> May be it is just a bug in rte_cio_wmb() on x86?
> > rte_cio_wmb is ok for doorbells on aarch64, but looking through the
> kernel code, 'sfence' is required for various/mixed memory types.
> > DPDK mlx5 PMD uses rte_cio_wmb widely and wisely, it orders sequences
> of writes to host memory that shared by IO device.
> > Strengthening rte_cio_wmb may hurt performance, so a new barrier class
> is introduced to optimize for aarch64, in the fast path only, while not
> impacting x86.
> >
> http://code.dpdk.org/dpdk/v20.02/source/drivers/net/mlx5/mlx5_rxtx.c#L1
> 087
> 
> May be my problem that I don't fully understand real-life
> usecases when cio should be used in accordance with its
> current definition. Does it make sense without doorbell?
> Does HW polling via DMA?
> 
> Thanks for explanations,
> Andrew.
> 
> >>
> >>> In the joint preliminary testing between Arm and Ampere, 8%~13%
> >>> performance boost was measured.
> >>>
> >>> As there is no functionality changes, it will not impact x86.
> >>>
> >>> Gavin Hu (6):
> >>>    eal: introduce new class of barriers for DMA use cases
> >>>    net/mlx5: dmb for immediate doorbell ring on aarch64
> >>>    net/mlx5: relax barrier to order UAR writes on aarch64
> >>>    net/mlx5: relax barrier for aarch64
> >>>    net/mlx5: add descriptive comment for a barrier
> >>>    doc: clarify one configuration in mlx5 guide
> >>>
> >>> Phil Yang (1):
> >>>    net/mlx5: relax ordering for multi-packet RQ buffer refcnt
> >>>
> >>>   doc/guides/nics/mlx5.rst                    |  6 ++--
> >>>   drivers/net/mlx5/mlx5_rxq.c                 |  2 +-
> >>>   drivers/net/mlx5/mlx5_rxtx.c                | 16 ++++++-----
> >>>   drivers/net/mlx5/mlx5_rxtx.h                | 14 ++++++----
> >>>   lib/librte_eal/arm/include/rte_atomic_32.h  |  6 ++++
> >>>   lib/librte_eal/arm/include/rte_atomic_64.h  |  6 ++++
> >>>   lib/librte_eal/include/generic/rte_atomic.h | 31
> +++++++++++++++++++++
> >>>   lib/librte_eal/ppc/include/rte_atomic.h     |  6 ++++
> >>>   lib/librte_eal/x86/include/rte_atomic.h     |  6 ++++
> >>>   9 files changed, 78 insertions(+), 15 deletions(-)
> >>>
> >