[v4] node: switch IPv4 metadata to dynamic mbuf field

Message ID 20201028093003.29564-1-ndabilpuram@marvell.com (mailing list archive)
State Superseded, archived
Series [v4] node: switch IPv4 metadata to dynamic mbuf field

Checks

Context Check Description
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-testing success Testing PASS
ci/Intel-compilation success Compilation OK
ci/iol-intel-Performance success Performance Testing PASS
ci/travis-robot success Travis build: passed
ci/checkpatch success coding style OK
ci/iol-mellanox-Performance success Performance Testing PASS

Commit Message

Nithin Dabilpuram Oct. 28, 2020, 9:30 a.m. UTC
  From: Thomas Monjalon <thomas@monjalon.net>

The node_mbuf_priv1 was stored in the deprecated mbuf field udata64.
It is moved to a dynamic field in order to allow removal of udata64.

Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
---
 lib/librte_node/ip4_lookup.c      | 40 ++++++++++++++++++++++++--------
 lib/librte_node/ip4_lookup_neon.h | 20 +++++++---------
 lib/librte_node/ip4_lookup_sse.h  | 36 ++++++++++++++--------------
 lib/librte_node/ip4_rewrite.c     | 49 +++++++++++++++++++++++++++++----------
 lib/librte_node/node_private.h    | 13 ++++++++---
 5 files changed, 103 insertions(+), 55 deletions(-)
  

Comments

Thomas Monjalon Oct. 28, 2020, 10:08 a.m. UTC | #1
28/10/2020 10:30, Nithin Dabilpuram:
> From: Thomas Monjalon <thomas@monjalon.net>
> 
> The node_mbuf_priv1 was stored in the deprecated mbuf field udata64.
> It is moved to a dynamic field in order to allow removal of udata64.
> 
> Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
> Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
[...]
> +	IP4_LOOKUP_NODE_PRIV1_OFF(node->ctx) = node_mbuf_priv1_dynfield_offset;

That's interesting.
You copy the offset into the node context for better performance.
How much better is it than with a global offset variable?
How much does it decrease compared to a static mbuf field?
  
Van Haaren, Harry Oct. 28, 2020, 10:24 a.m. UTC | #2
> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Thomas Monjalon
> Sent: Wednesday, October 28, 2020 10:09 AM
> To: Nithin Dabilpuram <ndabilpuram@marvell.com>
> Cc: Pavan Nikhilesh <pbhagavatula@marvell.com>; Jerin Jacob
> <jerinj@marvell.com>; Ruifeng Wang <ruifeng.wang@arm.com>; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; kirankumark@marvell.com; dev@dpdk.org;
> david.marchand@redhat.com; olivier.matz@6wind.com
> Subject: Re: [dpdk-dev] [PATCH v4] node: switch IPv4 metadata to dynamic mbuf
> field
> 
> 28/10/2020 10:30, Nithin Dabilpuram:
> > From: Thomas Monjalon <thomas@monjalon.net>
> >
> > The node_mbuf_priv1 was stored in the deprecated mbuf field udata64.
> > It is moved to a dynamic field in order to allow removal of udata64.
> >
> > Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
> > Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
> [...]
> > +	IP4_LOOKUP_NODE_PRIV1_OFF(node->ctx) = node_mbuf_priv1_dynfield_offset;
> 
> That's interesting.
> You copy the offset into the node context for better performance.
> How much better is it than with a global offset variable?
> How much does it decrease compared to a static mbuf field?

Also interested in this topic, I'll offer the logical/theory point of view;

With a static field, the offset into the mbuf can be encoded in the instruction
stream, meaning there are no d-cache loads to identify particular dynamic field.

With a static/global variable, the cache line where the value resides is presumably
not hot in cache per burst (assuming an application that does significant work, so not
in cache since last burst). Hence overhead estimate could be 1x cache line load per burst.

With the data copied into the node, the offset is presumably on a hot cache line as the
node is using other data-members of its context. As a result, perhaps a cold static cache
line load is converted to a hot node-context line re-use. 

Real-world overhead likely depends on A) whether the application cache-thrashes
enough to make the static/global line fall out of cache, causing perf degradation
due to reload, and B) whether node->ctx still fits in the same number of lines as
before if the value is copied there.
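
To make the three cases concrete, a minimal sketch of the access patterns
(illustrative only; names are taken from the patch where they exist):

    /* 1) Static mbuf field: the offset is a compile-time constant encoded
     *    in the instruction stream, so no d-cache load is needed to find
     *    the field. */
    mbuf->udata64 = value;

    /* 2) Global offset variable: each non-hoisted access first loads the
     *    offset from a static/global cache line. */
    extern int node_mbuf_priv1_dynfield_offset;
    *RTE_MBUF_DYNFIELD(mbuf, node_mbuf_priv1_dynfield_offset,
                       uint64_t *) = value;

    /* 3) Offset cached in the node context: node->ctx is presumably hot,
     *    so reading the offset re-uses a line the node touches anyway. */
    const int dyn = ((struct ip4_lookup_node_ctx *)node->ctx)->mbuf_priv1_off;
    *RTE_MBUF_DYNFIELD(mbuf, dyn, uint64_t *) = value;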
  
Nithin Dabilpuram Oct. 28, 2020, 10:33 a.m. UTC | #3
On Wed, Oct 28, 2020 at 11:08:47AM +0100, Thomas Monjalon wrote:
> 28/10/2020 10:30, Nithin Dabilpuram:
> > From: Thomas Monjalon <thomas@monjalon.net>
> > 
> > The node_mbuf_priv1 was stored in the deprecated mbuf field udata64.
> > It is moved to a dynamic field in order to allow removal of udata64.
> > 
> > Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
> > Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
> [...]
> > +	IP4_LOOKUP_NODE_PRIV1_OFF(node->ctx) = node_mbuf_priv1_dynfield_offset;
> 
> That's interesting.
> You copy the offset into the node context for better performance.
> How much better is it than with a global offset variable?
> How much does it decrease compared to a static mbuf field?

Moving it to the node context was not for performance but for functionality:
graphs created in the primary process can be looked up in the secondary process,
and graph walk can be called on them. So having the dyn offset in a global
variable doesn't reflect in the secondary process.

This is also slightly better than referring to the global offset variable
directly in node_mbuf_priv1(), as it caches the dyn offset locally and passes
it to node_mbuf_priv1() instead of loading it from the global variable for
every mbuf.

As mentioned earlier, this is done only because there is currently no mechanism
to trigger a callback in the secondary process alone where we could update the
global variable.
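
A sketch of what such a secondary-side hook could look like, using the
existing rte_mbuf_dynfield_lookup() to retrieve by name a field registered
by the primary (the wrapper below is hypothetical; today nothing triggers
it in the secondary):

    #include <rte_errno.h>
    #include <rte_mbuf_dyn.h>

    static int
    node_mbuf_priv1_dynfield_lookup(void)
    {
        int offset;

        /* Field registered by the primary under this name (see
         * node_mbuf_priv1_dynfield_desc in node_private.h). */
        offset = rte_mbuf_dynfield_lookup("rte_node_dynfield_priv1", NULL);
        if (offset < 0)
            return -rte_errno;
        node_mbuf_priv1_dynfield_offset = offset;
        return 0;
    }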


  
Nithin Dabilpuram Oct. 28, 2020, 10:42 a.m. UTC | #4
On Wed, Oct 28, 2020 at 10:24:01AM +0000, Van Haaren, Harry wrote:
> > -----Original Message-----
> > From: dev <dev-bounces@dpdk.org> On Behalf Of Thomas Monjalon
> > Sent: Wednesday, October 28, 2020 10:09 AM
> > To: Nithin Dabilpuram <ndabilpuram@marvell.com>
> > Cc: Pavan Nikhilesh <pbhagavatula@marvell.com>; Jerin Jacob
> > <jerinj@marvell.com>; Ruifeng Wang <ruifeng.wang@arm.com>; Richardson, Bruce
> > <bruce.richardson@intel.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; kirankumark@marvell.com; dev@dpdk.org;
> > david.marchand@redhat.com; olivier.matz@6wind.com
> > Subject: Re: [dpdk-dev] [PATCH v4] node: switch IPv4 metadata to dynamic mbuf
> > field
> > 
> > 28/10/2020 10:30, Nithin Dabilpuram:
> > > From: Thomas Monjalon <thomas@monjalon.net>
> > >
> > > The node_mbuf_priv1 was stored in the deprecated mbuf field udata64.
> > > It is moved to a dynamic field in order to allow removal of udata64.
> > >
> > > Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
> > > Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
> > [...]
> > > +	IP4_LOOKUP_NODE_PRIV1_OFF(node->ctx) = node_mbuf_priv1_dynfield_offset;
> > 
> > That's interesting.
> > You copy the offset into the node context for better performance.
> > How much better is it than with a global offset variable?
> > How much does it decrease compared to a static mbuf field?
> 
> Also interested in this topic, I'll offer the logical/theory point of view;
> 
> With a static field, the offset into the mbuf can be encoded in the instruction
> stream, meaning there are no d-cache loads to identify particular dynamic field.
> 
> With a static/global variable, the cache line where the value resides is presumably
> not hot in cache per burst (assuming an application that does significant work, so not
> in cache since last burst). Hence overhead estimate could be 1x cache line load per burst.
> 
> With the data copied into the node, the offset is presumably on a hot cache line as the
> node is using other data-members of its context. As a result, perhaps a cold static cache
> line load is converted to a hot node-context line re-use. 
> 
> Real-world overhead likely depends on A) whether the application cache-thrashes
> enough to make the static/global line fall out of cache, causing perf degradation
> due to reload, and B) whether node->ctx still fits in the same number of lines as
> before if the value is copied there.

Agreed, node->ctx is already referenced to get other data (the lpm pointer). So
referencing another 4 bytes might even convert that into a load pair, which comes
at no extra cost.

Numbers-wise,
performance decreases by ~1.4% going from a static mbuf field to a global offset
variable, and by ~1% going from a static mbuf field to a node context field
cached per process call.
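
For reference, the ctx layout from the patch that enables this: both fields
sit in the first 16 bytes of node->ctx, so on arm64 a single LDP could fetch
the lpm pointer and the offset together.

    struct ip4_lookup_node_ctx {
        /* Socket's LPM table */
        struct rte_lpm *lpm;    /* bytes 0-7 */
        /* Dynamic offset to mbuf priv1 */
        int mbuf_priv1_off;     /* bytes 8-11, same 16-byte window */
    };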
  
Thomas Monjalon Oct. 28, 2020, 10:43 a.m. UTC | #5
28/10/2020 11:42, Nithin Dabilpuram:
> On Wed, Oct 28, 2020 at 10:24:01AM +0000, Van Haaren, Harry wrote:
> > From: Thomas Monjalon
> > > 28/10/2020 10:30, Nithin Dabilpuram:
> > > > From: Thomas Monjalon <thomas@monjalon.net>
> > > >
> > > > The node_mbuf_priv1 was stored in the deprecated mbuf field udata64.
> > > > It is moved to a dynamic field in order to allow removal of udata64.
> > > >
> > > > Signed-off-by: Thomas Monjalon <thomas@monjalon.net>
> > > > Signed-off-by: Nithin Dabilpuram <ndabilpuram@marvell.com>
> > > [...]
> > > > +	IP4_LOOKUP_NODE_PRIV1_OFF(node->ctx) = node_mbuf_priv1_dynfield_offset;
> > > 
> > > That's interesting.
> > > You copy the offset into the node context for better performance.
> > > How much better is it than with a global offset variable?
> > > How much does it decrease compared to a static mbuf field?
> > 
> > Also interested in this topic, I'll offer the logical/theory point of view;
> > 
> > With a static field, the offset into the mbuf can be encoded in the instruction
> > stream, meaning there are no d-cache loads to identify particular dynamic field.
> > 
> > With a static/global variable, the cache line where the value resides is presumably
> > not hot in cache per burst (assuming an application that does significant work, so not
> > in cache since last burst). Hence overhead estimate could be 1x cache line load per burst.
> > 
> > With the data copied into the node, the offset is presumably on a hot cache line as the
> > node is using other data-members of its context. As a result, perhaps a cold static cache
> > line load is converted to a hot node-context line re-use. 
> > 
> > Real-world overhead likely depends on A) whether the application cache-thrashes
> > enough to make the static/global line fall out of cache, causing perf degradation
> > due to reload, and B) whether node->ctx still fits in the same number of lines as
> > before if the value is copied there.
> 
> Agreed, node->ctx is already referenced to get other data (the lpm pointer). So
> referencing another 4 bytes might even convert that into a load pair, which comes
> at no extra cost.
> 
> Numbers-wise,
> performance decreases by ~1.4% going from a static mbuf field to a global offset
> variable, and by ~1% going from a static mbuf field to a node context field
> cached per process call.

OK thanks for providing these numbers.
  
Thomas Monjalon Oct. 28, 2020, 6:07 p.m. UTC | #6
28/10/2020 11:24, Van Haaren, Harry:
> From: Thomas Monjalon
> > > +	IP4_LOOKUP_NODE_PRIV1_OFF(node->ctx) = node_mbuf_priv1_dynfield_offset;
> > 
> > That's interesting.
> > You copy the offset into the node context for better performance.
> > How much better is it than with a global offset variable?
> > How much does it decrease compared to a static mbuf field?
> 
> Also interested in this topic, I'll offer the logical/theory point of view;
> 
> With a static field, the offset into the mbuf can be encoded in the instruction
> stream, meaning there are no d-cache loads to identify particular dynamic field.
> 
> With a static/global variable, the cache line where the value resides is presumably
> not hot in cache per burst (assuming an application that does significant work, so not
> in cache since last burst). Hence overhead estimate could be 1x cache line load per burst.

Would it help to group all dynfields and dynflags offsets
in the same cache line?
  
Van Haaren, Harry Oct. 29, 2020, 10:17 a.m. UTC | #7
> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Wednesday, October 28, 2020 6:08 PM
> To: Nithin Dabilpuram <ndabilpuram@marvell.com>; Van Haaren, Harry
> <harry.van.haaren@intel.com>
> Cc: dev@dpdk.org; Pavan Nikhilesh <pbhagavatula@marvell.com>; Jerin Jacob
> <jerinj@marvell.com>; Ruifeng Wang <ruifeng.wang@arm.com>; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; kirankumark@marvell.com; dev@dpdk.org;
> david.marchand@redhat.com; olivier.matz@6wind.com
> Subject: Re: [dpdk-dev] [PATCH v4] node: switch IPv4 metadata to dynamic mbuf
> field
> 
> 28/10/2020 11:24, Van Haaren, Harry:
> > From: Thomas Monjalon
> > > > +	IP4_LOOKUP_NODE_PRIV1_OFF(node->ctx) = node_mbuf_priv1_dynfield_offset;
> > >
> > > That's interesting.
> > > You copy the offset into the node context for better performance.
> > > How much better is it than with a global offset variable?
> > > How much does it decrease compared to a static mbuf field?
> >
> > Also interested in this topic, I'll offer the logical/theory point of view;
> >
> > With a static field, the offset into the mbuf can be encoded in the instruction
> > stream, meaning there are no d-cache loads to identify particular dynamic field.
> >
> > With a static/global variable, the cache line where the value resides is presumably
> > not hot in cache per burst (assuming an application that does significant work, so not
> > in cache since last burst). Hence overhead estimate could be 1x cache line load per burst.
> 
> Would it help to group all dynfields and dynflags offsets
> in the same cache line?

It could, but whether and how much it would benefit depends on the workload, I think.

Using each cache line fully is always good, so if grouping the offsets together is
reasonable to do, it seems a good idea.

My assumption is that registration of dynamic fields/flags is expected at init time,
and that the values remain constant at runtime. That would make this a cache line
in "shared" state in each core that uses the mbuf dynfields.

Overall, it is unlikely to have much impact on a real-world application... but DPDK
puts performance first! And packing a single cache line full of hot data is best practice :)
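
As a sketch of that idea (a hypothetical application-side struct, not an
existing DPDK facility): register everything at init and keep the resulting
offsets/masks in one cache-line-aligned, read-mostly place.

    #include <rte_common.h>

    /* Hypothetical: all dynfield offsets and dynflag masks the app uses,
     * packed so a single "shared"-state cache line serves every lookup. */
    struct app_dyn_offsets {
        int priv1_off;      /* from rte_mbuf_dynfield_register() */
        int timestamp_off;  /* from rte_mbuf_dynfield_register() */
        uint64_t tx_flag;   /* 1ULL << rte_mbuf_dynflag_register(...) */
    } __rte_cache_aligned;

    static struct app_dyn_offsets dyn_offsets;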
  

Patch

diff --git a/lib/librte_node/ip4_lookup.c b/lib/librte_node/ip4_lookup.c
index 8835aab..d083a72 100644
--- a/lib/librte_node/ip4_lookup.c
+++ b/lib/librte_node/ip4_lookup.c
@@ -29,8 +29,23 @@  struct ip4_lookup_node_main {
 	struct rte_lpm *lpm_tbl[RTE_MAX_NUMA_NODES];
 };
 
+struct ip4_lookup_node_ctx {
+	/* Socket's LPM table */
+	struct rte_lpm *lpm;
+	/* Dynamic offset to mbuf priv1 */
+	int mbuf_priv1_off;
+};
+
+int node_mbuf_priv1_dynfield_offset = -1;
+
 static struct ip4_lookup_node_main ip4_lookup_nm;
 
+#define IP4_LOOKUP_NODE_LPM(ctx) \
+	(((struct ip4_lookup_node_ctx *)ctx)->lpm)
+
+#define IP4_LOOKUP_NODE_PRIV1_OFF(ctx) \
+	(((struct ip4_lookup_node_ctx *)ctx)->mbuf_priv1_off)
+
 #if defined(__ARM_NEON)
 #include "ip4_lookup_neon.h"
 #elif defined(RTE_ARCH_X86)
@@ -41,12 +56,13 @@  static uint16_t
 ip4_lookup_node_process_scalar(struct rte_graph *graph, struct rte_node *node,
 			void **objs, uint16_t nb_objs)
 {
+	struct rte_lpm *lpm = IP4_LOOKUP_NODE_LPM(node->ctx);
+	const int dyn = IP4_LOOKUP_NODE_PRIV1_OFF(node->ctx);
 	struct rte_ipv4_hdr *ipv4_hdr;
 	void **to_next, **from;
 	uint16_t last_spec = 0;
 	struct rte_mbuf *mbuf;
 	rte_edge_t next_index;
-	struct rte_lpm *lpm;
 	uint16_t held = 0;
 	uint32_t drop_nh;
 	int i, rc;
@@ -55,9 +71,6 @@  ip4_lookup_node_process_scalar(struct rte_graph *graph, struct rte_node *node,
 	next_index = RTE_NODE_IP4_LOOKUP_NEXT_REWRITE;
 	/* Drop node */
 	drop_nh = ((uint32_t)RTE_NODE_IP4_LOOKUP_NEXT_PKT_DROP) << 16;
-
-	/* Get socket specific LPM from ctx */
-	lpm = *((struct rte_lpm **)node->ctx);
 	from = objs;
 
 	/* Get stream for the speculated next node */
@@ -72,14 +85,14 @@  ip4_lookup_node_process_scalar(struct rte_graph *graph, struct rte_node *node,
 		ipv4_hdr = rte_pktmbuf_mtod_offset(mbuf, struct rte_ipv4_hdr *,
 				sizeof(struct rte_ether_hdr));
 		/* Extract cksum, ttl as ipv4 hdr is in cache */
-		node_mbuf_priv1(mbuf)->cksum = ipv4_hdr->hdr_checksum;
-		node_mbuf_priv1(mbuf)->ttl = ipv4_hdr->time_to_live;
+		node_mbuf_priv1(mbuf, dyn)->cksum = ipv4_hdr->hdr_checksum;
+		node_mbuf_priv1(mbuf, dyn)->ttl = ipv4_hdr->time_to_live;
 
 		rc = rte_lpm_lookup(lpm, rte_be_to_cpu_32(ipv4_hdr->dst_addr),
 				    &next_hop);
 		next_hop = (rc == 0) ? next_hop : drop_nh;
 
-		node_mbuf_priv1(mbuf)->nh = (uint16_t)next_hop;
+		node_mbuf_priv1(mbuf, dyn)->nh = (uint16_t)next_hop;
 		next_hop = next_hop >> 16;
 		next = (uint16_t)next_hop;
 
@@ -169,15 +182,19 @@  setup_lpm(struct ip4_lookup_node_main *nm, int socket)
 static int
 ip4_lookup_node_init(const struct rte_graph *graph, struct rte_node *node)
 {
-	struct rte_lpm **lpm_p = (struct rte_lpm **)&node->ctx;
 	uint16_t socket, lcore_id;
 	static uint8_t init_once;
 	int rc;
 
 	RTE_SET_USED(graph);
-	RTE_SET_USED(node);
+	RTE_BUILD_BUG_ON(sizeof(struct ip4_lookup_node_ctx) > RTE_NODE_CTX_SZ);
 
 	if (!init_once) {
+		node_mbuf_priv1_dynfield_offset = rte_mbuf_dynfield_register(
+				&node_mbuf_priv1_dynfield_desc);
+		if (node_mbuf_priv1_dynfield_offset < 0)
+			return -rte_errno;
+
 		/* Setup LPM tables for all sockets */
 		RTE_LCORE_FOREACH(lcore_id)
 		{
@@ -192,7 +209,10 @@  ip4_lookup_node_init(const struct rte_graph *graph, struct rte_node *node)
 		}
 		init_once = 1;
 	}
-	*lpm_p = ip4_lookup_nm.lpm_tbl[graph->socket];
+
+	/* Update socket's LPM and mbuf dyn priv1 offset in node ctx */
+	IP4_LOOKUP_NODE_LPM(node->ctx) = ip4_lookup_nm.lpm_tbl[graph->socket];
+	IP4_LOOKUP_NODE_PRIV1_OFF(node->ctx) = node_mbuf_priv1_dynfield_offset;
 
 #if defined(__ARM_NEON) || defined(RTE_ARCH_X86)
 	if (rte_vect_get_max_simd_bitwidth() >= RTE_VECT_SIMD_128)
diff --git a/lib/librte_node/ip4_lookup_neon.h b/lib/librte_node/ip4_lookup_neon.h
index 0ad2763..d5c8da3 100644
--- a/lib/librte_node/ip4_lookup_neon.h
+++ b/lib/librte_node/ip4_lookup_neon.h
@@ -11,12 +11,13 @@  ip4_lookup_node_process_vec(struct rte_graph *graph, struct rte_node *node,
 			void **objs, uint16_t nb_objs)
 {
 	struct rte_mbuf *mbuf0, *mbuf1, *mbuf2, *mbuf3, **pkts;
+	struct rte_lpm *lpm = IP4_LOOKUP_NODE_LPM(node->ctx);
+	const int dyn = IP4_LOOKUP_NODE_PRIV1_OFF(node->ctx);
 	struct rte_ipv4_hdr *ipv4_hdr;
 	void **to_next, **from;
 	uint16_t last_spec = 0;
 	rte_edge_t next_index;
 	uint16_t n_left_from;
-	struct rte_lpm *lpm;
 	uint16_t held = 0;
 	uint32_t drop_nh;
 	rte_xmm_t result;
@@ -30,9 +31,6 @@  ip4_lookup_node_process_vec(struct rte_graph *graph, struct rte_node *node,
 	/* Drop node */
 	drop_nh = ((uint32_t)RTE_NODE_IP4_LOOKUP_NEXT_PKT_DROP) << 16;
 
-	/* Get socket specific LPM from ctx */
-	lpm = *((struct rte_lpm **)node->ctx);
-
 	pkts = (struct rte_mbuf **)objs;
 	from = objs;
 	n_left_from = nb_objs;
@@ -119,10 +117,10 @@  ip4_lookup_node_process_vec(struct rte_graph *graph, struct rte_node *node,
 		priv23.u16[0] = result.u16[4];
 		priv23.u16[4] = result.u16[6];
 
-		node_mbuf_priv1(mbuf0)->u = priv01.u64[0];
-		node_mbuf_priv1(mbuf1)->u = priv01.u64[1];
-		node_mbuf_priv1(mbuf2)->u = priv23.u64[0];
-		node_mbuf_priv1(mbuf3)->u = priv23.u64[1];
+		node_mbuf_priv1(mbuf0, dyn)->u = priv01.u64[0];
+		node_mbuf_priv1(mbuf1, dyn)->u = priv01.u64[1];
+		node_mbuf_priv1(mbuf2, dyn)->u = priv23.u64[0];
+		node_mbuf_priv1(mbuf3, dyn)->u = priv23.u64[1];
 
 		/* Enqueue four to next node */
 		rte_edge_t fix_spec = ((next_index == result.u16[1]) &&
@@ -197,14 +195,14 @@  ip4_lookup_node_process_vec(struct rte_graph *graph, struct rte_node *node,
 		ipv4_hdr = rte_pktmbuf_mtod_offset(mbuf0, struct rte_ipv4_hdr *,
 						sizeof(struct rte_ether_hdr));
 		/* Extract cksum, ttl as ipv4 hdr is in cache */
-		node_mbuf_priv1(mbuf0)->cksum = ipv4_hdr->hdr_checksum;
-		node_mbuf_priv1(mbuf0)->ttl = ipv4_hdr->time_to_live;
+		node_mbuf_priv1(mbuf0, dyn)->cksum = ipv4_hdr->hdr_checksum;
+		node_mbuf_priv1(mbuf0, dyn)->ttl = ipv4_hdr->time_to_live;
 
 		rc = rte_lpm_lookup(lpm, rte_be_to_cpu_32(ipv4_hdr->dst_addr),
 				    &next_hop);
 		next_hop = (rc == 0) ? next_hop : drop_nh;
 
-		node_mbuf_priv1(mbuf0)->nh = (uint16_t)next_hop;
+		node_mbuf_priv1(mbuf0, dyn)->nh = (uint16_t)next_hop;
 		next_hop = next_hop >> 16;
 		next0 = (uint16_t)next_hop;
 
diff --git a/lib/librte_node/ip4_lookup_sse.h b/lib/librte_node/ip4_lookup_sse.h
index 264c986..74dbf97 100644
--- a/lib/librte_node/ip4_lookup_sse.h
+++ b/lib/librte_node/ip4_lookup_sse.h
@@ -11,13 +11,14 @@  ip4_lookup_node_process_vec(struct rte_graph *graph, struct rte_node *node,
 			void **objs, uint16_t nb_objs)
 {
 	struct rte_mbuf *mbuf0, *mbuf1, *mbuf2, *mbuf3, **pkts;
+	struct rte_lpm *lpm = IP4_LOOKUP_NODE_LPM(node->ctx);
+	const int dyn = IP4_LOOKUP_NODE_PRIV1_OFF(node->ctx);
 	rte_edge_t next0, next1, next2, next3, next_index;
 	struct rte_ipv4_hdr *ipv4_hdr;
 	uint32_t ip0, ip1, ip2, ip3;
 	void **to_next, **from;
 	uint16_t last_spec = 0;
 	uint16_t n_left_from;
-	struct rte_lpm *lpm;
 	uint16_t held = 0;
 	uint32_t drop_nh;
 	rte_xmm_t dst;
@@ -29,9 +30,6 @@  ip4_lookup_node_process_vec(struct rte_graph *graph, struct rte_node *node,
 	/* Drop node */
 	drop_nh = ((uint32_t)RTE_NODE_IP4_LOOKUP_NEXT_PKT_DROP) << 16;
 
-	/* Get socket specific LPM from ctx */
-	lpm = *((struct rte_lpm **)node->ctx);
-
 	pkts = (struct rte_mbuf **)objs;
 	from = objs;
 	n_left_from = nb_objs;
@@ -78,24 +76,24 @@  ip4_lookup_node_process_vec(struct rte_graph *graph, struct rte_node *node,
 						sizeof(struct rte_ether_hdr));
 		ip0 = ipv4_hdr->dst_addr;
 		/* Extract cksum, ttl as ipv4 hdr is in cache */
-		node_mbuf_priv1(mbuf0)->cksum = ipv4_hdr->hdr_checksum;
-		node_mbuf_priv1(mbuf0)->ttl = ipv4_hdr->time_to_live;
+		node_mbuf_priv1(mbuf0, dyn)->cksum = ipv4_hdr->hdr_checksum;
+		node_mbuf_priv1(mbuf0, dyn)->ttl = ipv4_hdr->time_to_live;
 
 		/* Extract DIP of mbuf1 */
 		ipv4_hdr = rte_pktmbuf_mtod_offset(mbuf1, struct rte_ipv4_hdr *,
 						sizeof(struct rte_ether_hdr));
 		ip1 = ipv4_hdr->dst_addr;
 		/* Extract cksum, ttl as ipv4 hdr is in cache */
-		node_mbuf_priv1(mbuf1)->cksum = ipv4_hdr->hdr_checksum;
-		node_mbuf_priv1(mbuf1)->ttl = ipv4_hdr->time_to_live;
+		node_mbuf_priv1(mbuf1, dyn)->cksum = ipv4_hdr->hdr_checksum;
+		node_mbuf_priv1(mbuf1, dyn)->ttl = ipv4_hdr->time_to_live;
 
 		/* Extract DIP of mbuf2 */
 		ipv4_hdr = rte_pktmbuf_mtod_offset(mbuf2, struct rte_ipv4_hdr *,
 						sizeof(struct rte_ether_hdr));
 		ip2 = ipv4_hdr->dst_addr;
 		/* Extract cksum, ttl as ipv4 hdr is in cache */
-		node_mbuf_priv1(mbuf2)->cksum = ipv4_hdr->hdr_checksum;
-		node_mbuf_priv1(mbuf2)->ttl = ipv4_hdr->time_to_live;
+		node_mbuf_priv1(mbuf2, dyn)->cksum = ipv4_hdr->hdr_checksum;
+		node_mbuf_priv1(mbuf2, dyn)->ttl = ipv4_hdr->time_to_live;
 
 		/* Extract DIP of mbuf3 */
 		ipv4_hdr = rte_pktmbuf_mtod_offset(mbuf3, struct rte_ipv4_hdr *,
@@ -111,23 +109,23 @@  ip4_lookup_node_process_vec(struct rte_graph *graph, struct rte_node *node,
 		dip = _mm_shuffle_epi8(dip, bswap_mask);
 
 		/* Extract cksum, ttl as ipv4 hdr is in cache */
-		node_mbuf_priv1(mbuf3)->cksum = ipv4_hdr->hdr_checksum;
-		node_mbuf_priv1(mbuf3)->ttl = ipv4_hdr->time_to_live;
+		node_mbuf_priv1(mbuf3, dyn)->cksum = ipv4_hdr->hdr_checksum;
+		node_mbuf_priv1(mbuf3, dyn)->ttl = ipv4_hdr->time_to_live;
 
 		/* Perform LPM lookup to get NH and next node */
 		rte_lpm_lookupx4(lpm, dip, dst.u32, drop_nh);
 
 		/* Extract next node id and NH */
-		node_mbuf_priv1(mbuf0)->nh = dst.u32[0] & 0xFFFF;
+		node_mbuf_priv1(mbuf0, dyn)->nh = dst.u32[0] & 0xFFFF;
 		next0 = (dst.u32[0] >> 16);
 
-		node_mbuf_priv1(mbuf1)->nh = dst.u32[1] & 0xFFFF;
+		node_mbuf_priv1(mbuf1, dyn)->nh = dst.u32[1] & 0xFFFF;
 		next1 = (dst.u32[1] >> 16);
 
-		node_mbuf_priv1(mbuf2)->nh = dst.u32[2] & 0xFFFF;
+		node_mbuf_priv1(mbuf2, dyn)->nh = dst.u32[2] & 0xFFFF;
 		next2 = (dst.u32[2] >> 16);
 
-		node_mbuf_priv1(mbuf3)->nh = dst.u32[3] & 0xFFFF;
+		node_mbuf_priv1(mbuf3, dyn)->nh = dst.u32[3] & 0xFFFF;
 		next3 = (dst.u32[3] >> 16);
 
 		/* Enqueue four to next node */
@@ -202,14 +200,14 @@  ip4_lookup_node_process_vec(struct rte_graph *graph, struct rte_node *node,
 		ipv4_hdr = rte_pktmbuf_mtod_offset(mbuf0, struct rte_ipv4_hdr *,
 						sizeof(struct rte_ether_hdr));
 		/* Extract cksum, ttl as ipv4 hdr is in cache */
-		node_mbuf_priv1(mbuf0)->cksum = ipv4_hdr->hdr_checksum;
-		node_mbuf_priv1(mbuf0)->ttl = ipv4_hdr->time_to_live;
+		node_mbuf_priv1(mbuf0, dyn)->cksum = ipv4_hdr->hdr_checksum;
+		node_mbuf_priv1(mbuf0, dyn)->ttl = ipv4_hdr->time_to_live;
 
 		rc = rte_lpm_lookup(lpm, rte_be_to_cpu_32(ipv4_hdr->dst_addr),
 				    &next_hop);
 		next_hop = (rc == 0) ? next_hop : drop_nh;
 
-		node_mbuf_priv1(mbuf0)->nh = next_hop & 0xFFFF;
+		node_mbuf_priv1(mbuf0, dyn)->nh = next_hop & 0xFFFF;
 		next0 = (next_hop >> 16);
 
 		if (unlikely(next_index ^ next0)) {
diff --git a/lib/librte_node/ip4_rewrite.c b/lib/librte_node/ip4_rewrite.c
index bb7f671..99ecb45 100644
--- a/lib/librte_node/ip4_rewrite.c
+++ b/lib/librte_node/ip4_rewrite.c
@@ -19,14 +19,28 @@ 
 #include "ip4_rewrite_priv.h"
 #include "node_private.h"
 
+struct ip4_rewrite_node_ctx {
+	/* Dynamic offset to mbuf priv1 */
+	int mbuf_priv1_off;
+	/* Cached next index */
+	uint16_t next_index;
+};
+
 static struct ip4_rewrite_node_main *ip4_rewrite_nm;
 
+#define IP4_REWRITE_NODE_LAST_NEXT(ctx) \
+	(((struct ip4_rewrite_node_ctx *)ctx)->next_index)
+
+#define IP4_REWRITE_NODE_PRIV1_OFF(ctx) \
+	(((struct ip4_rewrite_node_ctx *)ctx)->mbuf_priv1_off)
+
 static uint16_t
 ip4_rewrite_node_process(struct rte_graph *graph, struct rte_node *node,
 			 void **objs, uint16_t nb_objs)
 {
 	struct rte_mbuf *mbuf0, *mbuf1, *mbuf2, *mbuf3, **pkts;
 	struct ip4_rewrite_nh_header *nh = ip4_rewrite_nm->nh;
+	const int dyn = IP4_REWRITE_NODE_PRIV1_OFF(node->ctx);
 	uint16_t next0, next1, next2, next3, next_index;
 	struct rte_ipv4_hdr *ip0, *ip1, *ip2, *ip3;
 	uint16_t n_left_from, held = 0, last_spec = 0;
@@ -37,7 +51,7 @@  ip4_rewrite_node_process(struct rte_graph *graph, struct rte_node *node,
 	int i;
 
 	/* Speculative next as last next */
-	next_index = *(uint16_t *)node->ctx;
+	next_index = IP4_REWRITE_NODE_LAST_NEXT(node->ctx);
 	rte_prefetch0(nh);
 
 	pkts = (struct rte_mbuf **)objs;
@@ -68,10 +82,10 @@  ip4_rewrite_node_process(struct rte_graph *graph, struct rte_node *node,
 
 		pkts += 4;
 		n_left_from -= 4;
-		priv01.u64[0] = node_mbuf_priv1(mbuf0)->u;
-		priv01.u64[1] = node_mbuf_priv1(mbuf1)->u;
-		priv23.u64[0] = node_mbuf_priv1(mbuf2)->u;
-		priv23.u64[1] = node_mbuf_priv1(mbuf3)->u;
+		priv01.u64[0] = node_mbuf_priv1(mbuf0, dyn)->u;
+		priv01.u64[1] = node_mbuf_priv1(mbuf1, dyn)->u;
+		priv23.u64[0] = node_mbuf_priv1(mbuf2, dyn)->u;
+		priv23.u64[1] = node_mbuf_priv1(mbuf3, dyn)->u;
 
 		/* Increment checksum by one. */
 		priv01.u32[1] += rte_cpu_to_be_16(0x0100);
@@ -203,17 +217,17 @@  ip4_rewrite_node_process(struct rte_graph *graph, struct rte_node *node,
 		n_left_from -= 1;
 
 		d0 = rte_pktmbuf_mtod(mbuf0, void *);
-		rte_memcpy(d0, nh[node_mbuf_priv1(mbuf0)->nh].rewrite_data,
-			   nh[node_mbuf_priv1(mbuf0)->nh].rewrite_len);
+		rte_memcpy(d0, nh[node_mbuf_priv1(mbuf0, dyn)->nh].rewrite_data,
+			   nh[node_mbuf_priv1(mbuf0, dyn)->nh].rewrite_len);
 
-		next0 = nh[node_mbuf_priv1(mbuf0)->nh].tx_node;
+		next0 = nh[node_mbuf_priv1(mbuf0, dyn)->nh].tx_node;
 		ip0 = (struct rte_ipv4_hdr *)((uint8_t *)d0 +
 					      sizeof(struct rte_ether_hdr));
-		chksum = node_mbuf_priv1(mbuf0)->cksum +
+		chksum = node_mbuf_priv1(mbuf0, dyn)->cksum +
 			 rte_cpu_to_be_16(0x0100);
 		chksum += chksum >= 0xffff;
 		ip0->hdr_checksum = chksum;
-		ip0->time_to_live = node_mbuf_priv1(mbuf0)->ttl - 1;
+		ip0->time_to_live = node_mbuf_priv1(mbuf0, dyn)->ttl - 1;
 
 		if (unlikely(next_index ^ next0)) {
 			/* Copy things successfully speculated till now */
@@ -240,7 +254,7 @@  ip4_rewrite_node_process(struct rte_graph *graph, struct rte_node *node,
 	rte_memcpy(to_next, from, last_spec * sizeof(from[0]));
 	rte_node_next_stream_put(graph, node, next_index, held);
 	/* Save the last next used */
-	*(uint16_t *)node->ctx = next_index;
+	IP4_REWRITE_NODE_LAST_NEXT(node->ctx) = next_index;
 
 	return nb_objs;
 }
@@ -248,9 +262,20 @@  ip4_rewrite_node_process(struct rte_graph *graph, struct rte_node *node,
 static int
 ip4_rewrite_node_init(const struct rte_graph *graph, struct rte_node *node)
 {
+	static bool init_once;
 
 	RTE_SET_USED(graph);
-	RTE_SET_USED(node);
+	RTE_BUILD_BUG_ON(sizeof(struct ip4_rewrite_node_ctx) > RTE_NODE_CTX_SZ);
+
+	if (!init_once) {
+		node_mbuf_priv1_dynfield_offset = rte_mbuf_dynfield_register(
+				&node_mbuf_priv1_dynfield_desc);
+		if (node_mbuf_priv1_dynfield_offset < 0)
+			return -rte_errno;
+		init_once = true;
+	}
+	IP4_REWRITE_NODE_PRIV1_OFF(node->ctx) = node_mbuf_priv1_dynfield_offset;
+
 	node_dbg("ip4_rewrite", "Initialized ip4_rewrite node initialized");
 
 	return 0;
diff --git a/lib/librte_node/node_private.h b/lib/librte_node/node_private.h
index ab7941c..8c73d5d 100644
--- a/lib/librte_node/node_private.h
+++ b/lib/librte_node/node_private.h
@@ -8,6 +8,7 @@ 
 #include <rte_common.h>
 #include <rte_log.h>
 #include <rte_mbuf.h>
+#include <rte_mbuf_dyn.h>
 
 extern int rte_node_logtype;
 #define NODE_LOG(level, node_name, ...)                                        \
@@ -21,7 +22,6 @@  extern int rte_node_logtype;
 #define node_dbg(node_name, ...) NODE_LOG(DEBUG, node_name, __VA_ARGS__)
 
 /**
- *
  * Node mbuf private data to store next hop, ttl and checksum.
  */
 struct node_mbuf_priv1 {
@@ -37,6 +37,13 @@  struct node_mbuf_priv1 {
 	};
 };
 
+static const struct rte_mbuf_dynfield node_mbuf_priv1_dynfield_desc = {
+	.name = "rte_node_dynfield_priv1",
+	.size = sizeof(struct node_mbuf_priv1),
+	.align = __alignof__(struct node_mbuf_priv1),
+};
+extern int node_mbuf_priv1_dynfield_offset;
+
 /**
  * Node mbuf private area 2.
  */
@@ -58,9 +65,9 @@  struct node_mbuf_priv2 {
  *   Pointer to the mbuf_priv1.
  */
 static __rte_always_inline struct node_mbuf_priv1 *
-node_mbuf_priv1(struct rte_mbuf *m)
+node_mbuf_priv1(struct rte_mbuf *m, const int offset)
 {
-	return (struct node_mbuf_priv1 *)&m->udata64;
+	return RTE_MBUF_DYNFIELD(m, offset, struct node_mbuf_priv1 *);
 }
 
 /**