[v2,2/2] lpm: add a scalar version of lookupx4 function

Message ID 20220527181822.716758-2-kda@semihalf.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Series [v2,1/2] lpm: add const to lpm arg of rte_lpm_lookup

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/github-robot: build success github build: passed
ci/iol-aarch64-compile-testing success Testing PASS
ci/iol-aarch64-unit-testing success Testing PASS
ci/iol-x86_64-unit-testing success Testing PASS
ci/iol-x86_64-compile-testing success Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-abi-testing success Testing PASS

Commit Message

Stanislaw Kardach May 27, 2022, 6:18 p.m. UTC
  From: Michal Mazurek <maz@semihalf.com>

Add an implementation of the rte_lpm_lookupx4() function for platforms
without support for vector operations.

This will be useful in the upcoming RISC-V port as well as any platform
which may want to start with a basic level of LPM support.

Signed-off-by: Michal Mazurek <maz@semihalf.com>
Signed-off-by: Stanislaw Kardach <kda@semihalf.com>
---
 doc/guides/rel_notes/release_22_07.rst |  5 ++++
 lib/lpm/meson.build                    |  1 +
 lib/lpm/rte_lpm.h                      |  4 ++-
 lib/lpm/rte_lpm_scalar.h               | 36 ++++++++++++++++++++++++++
 4 files changed, 45 insertions(+), 1 deletion(-)
 create mode 100644 lib/lpm/rte_lpm_scalar.h
  

Comments

Stephen Hemminger May 27, 2022, 8:15 p.m. UTC | #1
On Fri, 27 May 2022 20:18:22 +0200
Stanislaw Kardach <kda@semihalf.com> wrote:

> +static inline void
> +rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
> +		uint32_t defv)
> +{
> +	uint32_t nh;
> +	int i, ret;
> +
> +	for (i = 0; i < 4; i++) {
> +		ret = rte_lpm_lookup(lpm, ((rte_xmm_t)ip).u32[i], &nh);
> +		hop[i] = (ret == 0) ? nh : defv;
> +	}
> +}

For performance, manually unroll the loop.
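
For illustration, a manually unrolled variant might look like the sketch
below (hypothetical code, not part of the posted patch; the _unrolled
suffix is used only to distinguish it here):

static inline void
rte_lpm_lookupx4_unrolled(const struct rte_lpm *lpm, xmm_t ip,
		uint32_t hop[4], uint32_t defv)
{
	uint32_t nh;

	/* One independent lookup per 32-bit lane; a miss falls back to defv. */
	hop[0] = (rte_lpm_lookup(lpm, ((rte_xmm_t)ip).u32[0], &nh) == 0) ? nh : defv;
	hop[1] = (rte_lpm_lookup(lpm, ((rte_xmm_t)ip).u32[1], &nh) == 0) ? nh : defv;
	hop[2] = (rte_lpm_lookup(lpm, ((rte_xmm_t)ip).u32[2], &nh) == 0) ? nh : defv;
	hop[3] = (rte_lpm_lookup(lpm, ((rte_xmm_t)ip).u32[3], &nh) == 0) ? nh : defv;
}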
  
Bruce Richardson May 30, 2022, 7:52 a.m. UTC | #2
On Fri, May 27, 2022 at 01:15:20PM -0700, Stephen Hemminger wrote:
> On Fri, 27 May 2022 20:18:22 +0200
> Stanislaw Kardach <kda@semihalf.com> wrote:
> 
> > +static inline void
> > +rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
> > +		uint32_t defv)
> > +{
> > +	uint32_t nh;
> > +	int i, ret;
> > +
> > +	for (i = 0; i < 4; i++) {
> > +		ret = rte_lpm_lookup(lpm, ((rte_xmm_t)ip).u32[i], &nh);
> > +		hop[i] = (ret == 0) ? nh : defv;
> > +	}
> > +}
> 
> For performance, manually unroll the loop.

Given a constant 4x iteration count, will compilers not unroll this
automatically? I think the loop is a little clearer if it can be kept.

/Bruce
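
A possible middle ground, keeping the loop readable while still forcing an
unroll, is GCC's loop pragma; this is a sketch only (the pragma exists
since GCC 8) and was not proposed in the thread:

static inline void
rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
		uint32_t defv)
{
	uint32_t nh;
	int i, ret;

	/* Ask GCC to fully unroll this constant-trip-count loop. */
#pragma GCC unroll 4
	for (i = 0; i < 4; i++) {
		ret = rte_lpm_lookup(lpm, ((rte_xmm_t)ip).u32[i], &nh);
		hop[i] = (ret == 0) ? nh : defv;
	}
}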
  
Morten Brørup May 30, 2022, 8 a.m. UTC | #3
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Monday, 30 May 2022 09.52
> 
> On Fri, May 27, 2022 at 01:15:20PM -0700, Stephen Hemminger wrote:
> > On Fri, 27 May 2022 20:18:22 +0200
> > Stanislaw Kardach <kda@semihalf.com> wrote:
> >
> > > +static inline void
> > > +rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t
> hop[4],
> > > +		uint32_t defv)
> > > +{
> > > +	uint32_t nh;
> > > +	int i, ret;
> > > +
> > > +	for (i = 0; i < 4; i++) {
> > > +		ret = rte_lpm_lookup(lpm, ((rte_xmm_t)ip).u32[i], &nh);
> > > +		hop[i] = (ret == 0) ? nh : defv;
> > > +	}
> > > +}
> >
> > For performance, manually unroll the loop.
> 
> Given a constant 4x iteration count, will compilers not unroll this
> automatically? I think the loop is a little clearer if it can be kept.
> 
> /Bruce

If in doubt, add this and look at the assembler output:

#define REVIEW_INLINE_FUNCTIONS 1

#if REVIEW_INLINE_FUNCTIONS /* For compiler output review purposes only. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wmissing-prototypes"
void review_rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4], uint32_t defv)
{
	rte_lpm_lookupx4(lpm, ip, hop, defv);
}
#pragma GCC diagnostic pop
#endif /* REVIEW_INLINE_FUNCTIONS */
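
(The non-inline wrapper is what makes this work: it forces the compiler to
emit a standalone copy of the inlined function, which would otherwise not
appear in the assembly listing at all.)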
  
Bruce Richardson May 30, 2022, 10:42 a.m. UTC | #4
On Mon, May 30, 2022 at 10:00:34AM +0200, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Monday, 30 May 2022 09.52
> > 
> > On Fri, May 27, 2022 at 01:15:20PM -0700, Stephen Hemminger wrote:
> > > On Fri, 27 May 2022 20:18:22 +0200
> > > Stanislaw Kardach <kda@semihalf.com> wrote:
> > >
> > > > +static inline void
> > > > +rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t
> > hop[4],
> > > > +		uint32_t defv)
> > > > +{
> > > > +	uint32_t nh;
> > > > +	int i, ret;
> > > > +
> > > > +	for (i = 0; i < 4; i++) {
> > > > +		ret = rte_lpm_lookup(lpm, ((rte_xmm_t)ip).u32[i], &nh);
> > > > +		hop[i] = (ret == 0) ? nh : defv;
> > > > +	}
> > > > +}
> > >
> > > For performance, manually unroll the loop.
> > 
> > Given a constant 4x iteration count, will compilers not unroll this
> > automatically? I think the loop is a little clearer if it can be kept.
> > 
> > /Bruce
> 
> If in doubt, add this and look at the assembler output:
> 
> #define REVIEW_INLINE_FUNCTIONS 1
> 
> #if REVIEW_INLINE_FUNCTIONS /* For compiler output review purposes only. */
> #pragma GCC diagnostic push
> #pragma GCC diagnostic ignored "-Wmissing-prototypes"
> void review_rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4], uint32_t defv)
> {
> 	rte_lpm_lookupx4(lpm, ip, hop, defv);
> }
> #pragma GCC diagnostic pop
> #endif /* REVIEW_INLINE_FUNCTIONS */
> 

Used godbolt.org to check and indeed the function is not unrolled.
(GCC 11.2, with flags "-O3 -march=icelake-server").

Manually unrolling changes the assembly generated in interesting ways. For
example, it appears to generate more cmov-type instructions for the
miss/default-value case rather than using branches as in the looped
version. Whether this is better or not may depend upon the use case - if one
expects most LPM lookup entries to hit, then having (predictable) branches
may well be cheaper.

In any case, I'll withdraw any objection to unrolling, but I'm still not
convinced it's necessary.

/Bruce
  
Stanislaw Kardach May 30, 2022, 11:20 a.m. UTC | #5
On Mon, May 30, 2022 at 12:42 PM Bruce Richardson
<bruce.richardson@intel.com> wrote:
>
> On Mon, May 30, 2022 at 10:00:34AM +0200, Morten Brørup wrote:
> > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > Sent: Monday, 30 May 2022 09.52
> > >
> > > On Fri, May 27, 2022 at 01:15:20PM -0700, Stephen Hemminger wrote:
> > > > On Fri, 27 May 2022 20:18:22 +0200
> > > > Stanislaw Kardach <kda@semihalf.com> wrote:
> > > >
> > > > > +static inline void
> > > > > +rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t
> > > hop[4],
> > > > > +               uint32_t defv)
> > > > > +{
> > > > > +       uint32_t nh;
> > > > > +       int i, ret;
> > > > > +
> > > > > +       for (i = 0; i < 4; i++) {
> > > > > +               ret = rte_lpm_lookup(lpm, ((rte_xmm_t)ip).u32[i], &nh);
> > > > > +               hop[i] = (ret == 0) ? nh : defv;
> > > > > +       }
> > > > > +}
> > > >
> > > > For performance, manually unroll the loop.
> > >
> > > Given a constant 4x iteration count, will compilers not unroll this
> > > automatically? I think the loop is a little clearer if it can be kept.
> > >
> > > /Bruce
> >
> > If in doubt, add this and look at the assembler output:
> >
> > #define REVIEW_INLINE_FUNCTIONS 1
> >
> > #if REVIEW_INLINE_FUNCTIONS /* For compiler output review purposes only. */
> > #pragma GCC diagnostic push
> > #pragma GCC diagnostic ignored "-Wmissing-prototypes"
> > void review_rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4], uint32_t defv)
> > {
> >       rte_lpm_lookupx4(lpm, ip, hop, defv);
> > }
> > #pragma GCC diagnostic pop
> > #endif /* REVIEW_INLINE_FUNCTIONS */
> >
>
> Used godbolt.org to check and indeed the function is not unrolled.
> (GCC 11.2, with flags "-O3 -march=icelake-server").
>
> Manually unrolling changes the assembly generated in interesting ways. For
> example, it appears to generate more cmov-type instructions for the
> miss/default-value case rather than using branches as in the looped
> version. Whether this is better or not may depend upon the use case - if one
> expects most LPM lookup entries to hit, then having (predictable) branches
> may well be cheaper.
>
> In any case, I'll withdraw any objection to unrolling, but I'm still not
> convinced it's necessary.
>
> /Bruce
Interestingly enough, until I defined unlikely() in godbolt I did not
get any automatic unrolling (with either x86 or RISC-V GCC). Did you
get any compilation warnings?
That said, the unrolling only happens at -O3, since it implies
-fpeel-loops. -O3 is the default for DPDK.
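
For anyone reproducing the godbolt experiment, the unlikely() definition
mentioned above can be stood in with macros matching DPDK's
rte_branch_prediction.h:

/* Minimal stand-ins for DPDK's branch-prediction hints, for pasting the
 * lookup code into a standalone compiler-explorer session. */
#ifndef likely
#define likely(x)	__builtin_expect(!!(x), 1)
#endif
#ifndef unlikely
#define unlikely(x)	__builtin_expect(!!(x), 0)
#endif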
  
Bruce Richardson May 30, 2022, 12:46 p.m. UTC | #6
On Mon, May 30, 2022 at 01:20:50PM +0200, Stanisław Kardach wrote:
> On Mon, May 30, 2022 at 12:42 PM Bruce Richardson
> <bruce.richardson@intel.com> wrote:
> >
> > On Mon, May 30, 2022 at 10:00:34AM +0200, Morten Brørup wrote:
> > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > Sent: Monday, 30 May 2022 09.52
> > > >
> > > > On Fri, May 27, 2022 at 01:15:20PM -0700, Stephen Hemminger wrote:
> > > > > On Fri, 27 May 2022 20:18:22 +0200
> > > > > Stanislaw Kardach <kda@semihalf.com> wrote:
> > > > >
> > > > > > +static inline void
> > > > > > +rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t
> > > > hop[4],
> > > > > > +               uint32_t defv)
> > > > > > +{
> > > > > > +       uint32_t nh;
> > > > > > +       int i, ret;
> > > > > > +
> > > > > > +       for (i = 0; i < 4; i++) {
> > > > > > +               ret = rte_lpm_lookup(lpm, ((rte_xmm_t)ip).u32[i], &nh);
> > > > > > +               hop[i] = (ret == 0) ? nh : defv;
> > > > > > +       }
> > > > > > +}
> > > > >
> > > > > For performance, manually unroll the loop.
> > > >
> > > > Given a constant 4x iteration count, will compilers not unroll this
> > > > automatically? I think the loop is a little clearer if it can be kept.
> > > >
> > > > /Bruce
> > >
> > > If in doubt, add this and look at the assembler output:
> > >
> > > #define REVIEW_INLINE_FUNCTIONS 1
> > >
> > > #if REVIEW_INLINE_FUNCTIONS /* For compiler output review purposes only. */
> > > #pragma GCC diagnostic push
> > > #pragma GCC diagnostic ignored "-Wmissing-prototypes"
> > > void review_rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4], uint32_t defv)
> > > {
> > >       rte_lpm_lookupx4(lpm, ip, hop, defv);
> > > }
> > > #pragma GCC diagnostic pop
> > > #endif /* REVIEW_INLINE_FUNCTIONS */
> > >
> >
> > Used godbolt.org to check and indeed the function is not unrolled.
> > (GCC 11.2, with flags "-O3 -march=icelake-server").
> >
> > Manually unrolling changes the assembly generated in interesting ways. For
> > example, it appears to generate more cmov-type instructions for the
> > miss/default-value case rather than using branches as in the looped
> > version. Whether this is better or not may depend upon the use case - if one
> > expects most LPM lookup entries to hit, then having (predictable) branches
> > may well be cheaper.
> >
> > In any case, I'll withdraw any objection to unrolling, but I'm still not
> > convinced it's necessary.
> >
> > /Bruce
> Interestingly enough, until I defined unlikely() in godbolt I did not
> get any automatic unrolling (with either x86 or RISC-V GCC). Did you
> get any compilation warnings?

That matches what I saw. I then just used manual unrolling, i.e.
copy-pasting the two lines four times, to see what the output looked like.

> That said, the unrolling only happens at -O3, since it implies
> -fpeel-loops. -O3 is the default for DPDK.
  

Patch

diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index e49cacecef..0cf3f71269 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -104,6 +104,11 @@  New Features
   * ``RTE_EVENT_QUEUE_ATTR_WEIGHT``
   * ``RTE_EVENT_QUEUE_ATTR_AFFINITY``
 
+* **Added scalar version of the LPM library.**
+
+  * Added scalar implementation of ``rte_lpm_lookupx4``. This is a fall-back
+    implementation for platforms that don't support vector operations.
+
 
 Removed Items
 -------------
diff --git a/lib/lpm/meson.build b/lib/lpm/meson.build
index 78d91d3421..6b47361fce 100644
--- a/lib/lpm/meson.build
+++ b/lib/lpm/meson.build
@@ -14,6 +14,7 @@  headers = files('rte_lpm.h', 'rte_lpm6.h')
 indirect_headers += files(
         'rte_lpm_altivec.h',
         'rte_lpm_neon.h',
+        'rte_lpm_scalar.h',
         'rte_lpm_sse.h',
         'rte_lpm_sve.h',
 )
diff --git a/lib/lpm/rte_lpm.h b/lib/lpm/rte_lpm.h
index 1cf863a146..4f38864fde 100644
--- a/lib/lpm/rte_lpm.h
+++ b/lib/lpm/rte_lpm.h
@@ -405,8 +405,10 @@  rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
 #endif
 #elif defined(RTE_ARCH_PPC_64)
 #include "rte_lpm_altivec.h"
-#else
+#elif defined(RTE_ARCH_X86)
 #include "rte_lpm_sse.h"
+#else
+#include "rte_lpm_scalar.h"
 #endif
 
 #ifdef __cplusplus
diff --git a/lib/lpm/rte_lpm_scalar.h b/lib/lpm/rte_lpm_scalar.h
new file mode 100644
index 0000000000..2fc0e19161
--- /dev/null
+++ b/lib/lpm/rte_lpm_scalar.h
@@ -0,0 +1,36 @@ 
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 StarFive
+ * Copyright(c) 2022 SiFive
+ * Copyright(c) 2022 Semihalf
+ */
+
+#ifndef _RTE_LPM_SCALAR_H_
+#define _RTE_LPM_SCALAR_H_
+
+#include <rte_branch_prediction.h>
+#include <rte_byteorder.h>
+#include <rte_common.h>
+#include <rte_vect.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+static inline void
+rte_lpm_lookupx4(const struct rte_lpm *lpm, xmm_t ip, uint32_t hop[4],
+		uint32_t defv)
+{
+	uint32_t nh;
+	int i, ret;
+
+	for (i = 0; i < 4; i++) {
+		ret = rte_lpm_lookup(lpm, ((rte_xmm_t)ip).u32[i], &nh);
+		hop[i] = (ret == 0) ? nh : defv;
+	}
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_LPM_SCALAR_H_ */
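
For context, a minimal usage sketch of the lookup that this patch provides
a scalar fallback for (hypothetical addresses and function name; LPM setup
and error handling elided):

#include <stdint.h>

#include <rte_ip.h>
#include <rte_lpm.h>
#include <rte_vect.h>

/* Look up four IPv4 destinations in one call; lanes that miss receive
 * the default next hop given as the last argument. */
static void
lookupx4_example(const struct rte_lpm *lpm)
{
	rte_xmm_t ips;
	uint32_t hops[4];

	ips.u32[0] = RTE_IPV4(192, 168, 0, 1);
	ips.u32[1] = RTE_IPV4(10, 0, 0, 1);
	ips.u32[2] = RTE_IPV4(172, 16, 0, 1);
	ips.u32[3] = RTE_IPV4(8, 8, 8, 8);

	rte_lpm_lookupx4(lpm, ips.x, hops, UINT32_MAX);
}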