[v2,2/2] lpm: add a scalar version of lookupx4 function
Commit Message
From: Michal Mazurek <maz@semihalf.com>
Add an implementation of the rte_lpm_lookupx4() function for platforms
without support for vector operations.
This will be useful in the upcoming RISC-V port as well as any platform
which may want to start with a basic level of LPM support.
Signed-off-by: Michal Mazurek <maz@semihalf.com>
Signed-off-by: Stanislaw Kardach <kda@semihalf.com>
---
doc/guides/rel_notes/release_22_07.rst | 5 ++++
lib/lpm/meson.build | 1 +
lib/lpm/rte_lpm.h | 4 ++-
lib/lpm/rte_lpm_scalar.h | 36 ++++++++++++++++++++++++++
4 files changed, 45 insertions(+), 1 deletion(-)
create mode 100644 lib/lpm/rte_lpm_scalar.h
Comments
On Fri, 27 May 2022 20:18:22 +0200
Stanislaw Kardach <kda@semihalf.com> wrote:
> +static inline void
> +rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
> + uint32_t defv)
> +{
> + uint32_t nh;
> + int i, ret;
> +
> + for (i = 0; i < 4; i++) {
> + ret = rte_lpm_lookup(lpm, ((rte_xmm_t)ip).u32[i], &nh);
> + hop[i] = (ret == 0) ? nh : defv;
> + }
> +}
For performance, manually unroll the loop.
On Fri, May 27, 2022 at 01:15:20PM -0700, Stephen Hemminger wrote:
> On Fri, 27 May 2022 20:18:22 +0200
> Stanislaw Kardach <kda@semihalf.com> wrote:
>
> > +static inline void
> > +rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
> > + uint32_t defv)
> > +{
> > + uint32_t nh;
> > + int i, ret;
> > +
> > + for (i = 0; i < 4; i++) {
> > + ret = rte_lpm_lookup(lpm, ((rte_xmm_t)ip).u32[i], &nh);
> > + hop[i] = (ret == 0) ? nh : defv;
> > + }
> > +}
>
> For performance, manually unroll the loop.
Given a constant 4x iterations, will compilers not unroll this
automatically? I think the loop is a little clearer if it can be kept.
/Bruce
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Monday, 30 May 2022 09.52
>
> On Fri, May 27, 2022 at 01:15:20PM -0700, Stephen Hemminger wrote:
> > On Fri, 27 May 2022 20:18:22 +0200
> > Stanislaw Kardach <kda@semihalf.com> wrote:
> >
> > > +static inline void
> > > +rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t
> hop[4],
> > > + uint32_t defv)
> > > +{
> > > + uint32_t nh;
> > > + int i, ret;
> > > +
> > > + for (i = 0; i < 4; i++) {
> > > + ret = rte_lpm_lookup(lpm, ((rte_xmm_t)ip).u32[i], &nh);
> > > + hop[i] = (ret == 0) ? nh : defv;
> > > + }
> > > +}
> >
> > For performance, manually unroll the loop.
>
> Given a constant 4x iterations, will compilers not unroll this
> automatically. I think the loop is a little clearer if it can be kept
>
> /Bruce
If in doubt, add this and look at the assembler output:
#define REVIEW_INLINE_FUNCTIONS 1
#if REVIEW_INLINE_FUNCTIONS /* For compiler output review purposes only. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wmissing-prototypes"
void review_rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4], uint32_t defv)
{
rte_lpm_lookupx4(lpm, ip, hop, defv);
}
#pragma GCC diagnostic pop
#endif /* REVIEW_INLINE_FUNCTIONS */
On Mon, May 30, 2022 at 10:00:34AM +0200, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Monday, 30 May 2022 09.52
> >
> > On Fri, May 27, 2022 at 01:15:20PM -0700, Stephen Hemminger wrote:
> > > On Fri, 27 May 2022 20:18:22 +0200
> > > Stanislaw Kardach <kda@semihalf.com> wrote:
> > >
> > > > +static inline void
> > > > +rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t
> > hop[4],
> > > > + uint32_t defv)
> > > > +{
> > > > + uint32_t nh;
> > > > + int i, ret;
> > > > +
> > > > + for (i = 0; i < 4; i++) {
> > > > + ret = rte_lpm_lookup(lpm, ((rte_xmm_t)ip).u32[i], &nh);
> > > > + hop[i] = (ret == 0) ? nh : defv;
> > > > + }
> > > > +}
> > >
> > > For performance, manually unroll the loop.
> >
> > Given a constant 4x iterations, will compilers not unroll this
> > automatically. I think the loop is a little clearer if it can be kept
> >
> > /Bruce
>
> If in doubt, add this and look at the assembler output:
>
> #define REVIEW_INLINE_FUNCTIONS 1
>
> #if REVIEW_INLINE_FUNCTIONS /* For compiler output review purposes only. */
> #pragma GCC diagnostic push
> #pragma GCC diagnostic ignored "-Wmissing-prototypes"
> void review_rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4], uint32_t defv)
> {
> rte_lpm_lookupx4(lpm, ip, hop, defv);
> }
> #pragma GCC diagnostic pop
> #endif /* REVIEW_INLINE_FUNCTIONS */
>
I used godbolt.org to check, and indeed the function is not unrolled
(GCC 11.2, with flags "-O3 -march=icelake-server").
Manually unrolling changes the assembly generated in interesting ways. For
example, it appears to generate more cmov-type instructions for the
miss/default-value case rather than using branches as in the looped
version. Whether this is better or not may depend upon usecase - if one
expects most lpm lookup entries to hit, then having (predictable) branches
may well be cheaper.
In any case, I'll withdraw any objection to unrolling, but I'm still not
convinced it's necessary.
/Bruce
On Mon, May 30, 2022 at 12:42 PM Bruce Richardson
<bruce.richardson@intel.com> wrote:
>
> On Mon, May 30, 2022 at 10:00:34AM +0200, Morten Brørup wrote:
> > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > Sent: Monday, 30 May 2022 09.52
> > >
> > > On Fri, May 27, 2022 at 01:15:20PM -0700, Stephen Hemminger wrote:
> > > > On Fri, 27 May 2022 20:18:22 +0200
> > > > Stanislaw Kardach <kda@semihalf.com> wrote:
> > > >
> > > > > +static inline void
> > > > > +rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t
> > > hop[4],
> > > > > + uint32_t defv)
> > > > > +{
> > > > > + uint32_t nh;
> > > > > + int i, ret;
> > > > > +
> > > > > + for (i = 0; i < 4; i++) {
> > > > > + ret = rte_lpm_lookup(lpm, ((rte_xmm_t)ip).u32[i], &nh);
> > > > > + hop[i] = (ret == 0) ? nh : defv;
> > > > > + }
> > > > > +}
> > > >
> > > > For performance, manually unroll the loop.
> > >
> > > Given a constant 4x iterations, will compilers not unroll this
> > > automatically. I think the loop is a little clearer if it can be kept
> > >
> > > /Bruce
> >
> > If in doubt, add this and look at the assembler output:
> >
> > #define REVIEW_INLINE_FUNCTIONS 1
> >
> > #if REVIEW_INLINE_FUNCTIONS /* For compiler output review purposes only. */
> > #pragma GCC diagnostic push
> > #pragma GCC diagnostic ignored "-Wmissing-prototypes"
> > void review_rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4], uint32_t defv)
> > {
> > rte_lpm_lookupx4(lpm, ip, hop, defv);
> > }
> > #pragma GCC diagnostic pop
> > #endif /* REVIEW_INLINE_FUNCTIONS */
> >
>
> Used godbolt.org to check and indeed the function is not unrolled.
> (Gcc 11.2, with flags "-O3 -march=icelake-server").
>
> Manually unrolling changes the assembly generated in interesting ways. For
> example, it appears to generate more cmov-type instructions for the
> miss/default-value case rather than using branches as in the looped
> version. Whether this is better or not may depend upon usecase - if one
> expects most lpm lookup entries to hit, then having (predictable) branches
> may well be cheaper.
>
> In any case, I'll withdraw any objection to unrolling, but I'm still not
> convinced it's necessary.
>
> /Bruce
Interestingly enough, until I defined unlikely() in godbolt, I did
not get any automatic unrolling there (with either x86 or RISC-V
GCC). Did you get any compilation warnings?
That said, the unrolling only happens at -O3, since it implies
-fpeel-loops, and -O3 is the default for DPDK.
On Mon, May 30, 2022 at 01:20:50PM +0200, Stanisław Kardach wrote:
> On Mon, May 30, 2022 at 12:42 PM Bruce Richardson
> <bruce.richardson@intel.com> wrote:
> >
> > On Mon, May 30, 2022 at 10:00:34AM +0200, Morten Brørup wrote:
> > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > Sent: Monday, 30 May 2022 09.52
> > > >
> > > > On Fri, May 27, 2022 at 01:15:20PM -0700, Stephen Hemminger wrote:
> > > > > On Fri, 27 May 2022 20:18:22 +0200
> > > > > Stanislaw Kardach <kda@semihalf.com> wrote:
> > > > >
> > > > > > +static inline void
> > > > > > +rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t
> > > > hop[4],
> > > > > > + uint32_t defv)
> > > > > > +{
> > > > > > + uint32_t nh;
> > > > > > + int i, ret;
> > > > > > +
> > > > > > + for (i = 0; i < 4; i++) {
> > > > > > + ret = rte_lpm_lookup(lpm, ((rte_xmm_t)ip).u32[i], &nh);
> > > > > > + hop[i] = (ret == 0) ? nh : defv;
> > > > > > + }
> > > > > > +}
> > > > >
> > > > > For performance, manually unroll the loop.
> > > >
> > > > Given a constant 4x iterations, will compilers not unroll this
> > > > automatically. I think the loop is a little clearer if it can be kept
> > > >
> > > > /Bruce
> > >
> > > If in doubt, add this and look at the assembler output:
> > >
> > > #define REVIEW_INLINE_FUNCTIONS 1
> > >
> > > #if REVIEW_INLINE_FUNCTIONS /* For compiler output review purposes only. */
> > > #pragma GCC diagnostic push
> > > #pragma GCC diagnostic ignored "-Wmissing-prototypes"
> > > void review_rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4], uint32_t defv)
> > > {
> > > rte_lpm_lookupx4(lpm, ip, hop, defv);
> > > }
> > > #pragma GCC diagnostic pop
> > > #endif /* REVIEW_INLINE_FUNCTIONS */
> > >
> >
> > Used godbolt.org to check and indeed the function is not unrolled.
> > (Gcc 11.2, with flags "-O3 -march=icelake-server").
> >
> > Manually unrolling changes the assembly generated in interesting ways. For
> > example, it appears to generate more cmov-type instructions for the
> > miss/default-value case rather than using branches as in the looped
> > version. Whether this is better or not may depend upon usecase - if one
> > expects most lpm lookup entries to hit, then having (predictable) branches
> > may well be cheaper.
> >
> > In any case, I'll withdraw any objection to unrolling, but I'm still not
> > convinced it's necessary.
> >
> > /Bruce
> Interestingly enough until I've defined unlikely() in godbolt, I did
> not get any automatic unrolling on godbolt (either with x86 or RISC-V
> GCC). Did you get any compilation warnings?
That matches what I saw. I then just used manual unrolling, i.e.
copy-pasting the two lines four times, to see what the output looked like.
> That said it only happens on O3 since it implies -fpeel-loops. O3 is
> the default for DPDK.
@@ -104,6 +104,11 @@ New Features
* ``RTE_EVENT_QUEUE_ATTR_WEIGHT``
* ``RTE_EVENT_QUEUE_ATTR_AFFINITY``
+* **Added scalar version of the LPM library.**
+
+ * Added scalar implementation of ``rte_lpm_lookupx4``. This is a fall-back
+ implementation for platforms that don't support vector operations.
+
Removed Items
-------------
@@ -14,6 +14,7 @@ headers = files('rte_lpm.h', 'rte_lpm6.h')
indirect_headers += files(
'rte_lpm_altivec.h',
'rte_lpm_neon.h',
+ 'rte_lpm_scalar.h',
'rte_lpm_sse.h',
'rte_lpm_sve.h',
)
@@ -405,8 +405,10 @@ rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
#endif
#elif defined(RTE_ARCH_PPC_64)
#include "rte_lpm_altivec.h"
-#else
+#elif defined(RTE_ARCH_X86)
#include "rte_lpm_sse.h"
+#else
+#include "rte_lpm_scalar.h"
#endif
#ifdef __cplusplus
new file mode 100644
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 StarFive
+ * Copyright(c) 2022 SiFive
+ * Copyright(c) 2022 Semihalf
+ */
+
+#ifndef _RTE_LPM_SCALAR_H_
+#define _RTE_LPM_SCALAR_H_
+
+#include <rte_branch_prediction.h>
+#include <rte_byteorder.h>
+#include <rte_common.h>
+#include <rte_vect.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+static inline void
+rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
+ uint32_t defv)
+{
+ uint32_t nh;
+ int i, ret;
+
+ for (i = 0; i < 4; i++) {
+ ret = rte_lpm_lookup(lpm, ((rte_xmm_t)ip).u32[i], &nh);
+ hop[i] = (ret == 0) ? nh : defv;
+ }
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LPM_SCALAR_H_ */