[dpdk-dev] mbuff rearm_data aligmenet issue on non x86

Message ID 20160512091349.GA10395@localhost.localdomain (mailing list archive)
State Not Applicable, archived
Delegated to: Thomas Monjalon
Headers

Commit Message

Jerin Jacob May 12, 2016, 9:14 a.m. UTC
  Hi All,

I would like align mbuff rearm_data field to 8 byte aligned so that
write to mbuf->rearm_data with uint64_t* will be naturally aligned.
I am not sure about IA but some other architecture/implementation has overhead
in non-naturally aligned stores.

Proposed patch is something like this below, But open for any change to
make fit for all other architectures/platform.

Any thoughts ?

➜ [master] [dpdk-master] $ git diff
 
        /**
@@ -754,6 +752,7 @@ struct rte_mbuf {
        uint8_t nb_segs;          /**< Number of segments. */
        uint8_t port;             /**< Input port. */
 
+       uint16_t buf_len;         /**< Length of segment buffer. */
        uint64_t ol_flags;        /**< Offload features. */
 
        /* remaining bytes are set on RX when pulling packet from
 * descriptor 

/Jerin
  

Comments

Ananyev, Konstantin May 12, 2016, 10:07 a.m. UTC | #1
Hi Jerrin,

> 

> Hi All,

> 

> I would like align mbuff rearm_data field to 8 byte aligned so that

> write to mbuf->rearm_data with uint64_t* will be naturally aligned.

> I am not sure about IA but some other architecture/implementation has overhead

> in non-naturally aligned stores.

> 

> Proposed patch is something like this below, But open for any change to

> make fit for all other architectures/platform.

> 

> Any thoughts ?

> 

> ➜ [master] [dpdk-master] $ git diff

> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h

> index 529debb..5a917d0 100644

> --- a/lib/librte_mbuf/rte_mbuf.h

> +++ b/lib/librte_mbuf/rte_mbuf.h

> @@ -733,10 +733,8 @@ struct rte_mbuf {

>         void *buf_addr;           /**< Virtual address of segment

> buffer. */

>         phys_addr_t buf_physaddr; /**< Physical address of segment

> buffer. */

> 

> -       uint16_t buf_len;         /**< Length of segment buffer. */

> -



There is no need to move buf_len itself, I think.
Just move rearm_data marker prior to buf_len is enough.
Though how do you suggest to deal with the fact, that right now we blindly
update the whole 64bits pointed by rearm_data:

drivers/net/ixgbe/ixgbe_rxtx_vec.c:
	/*
                 * Flush mbuf with pkt template.
                 * Data to be rearmed is 6 bytes long.
                 * Though, RX will overwrite ol_flags that are coming next
                 * anyway. So overwrite whole 8 bytes with one load:
                 * 6 bytes of rearm_data plus first 2 bytes of ol_flags.
                 */
                p0 = (uintptr_t)&mb0->rearm_data;
                *(uint64_t *)p0 = rxq->mbuf_initializer;

?

If buf_len will be inside these 64bits, we can't do it anymore.

Are you suggesting something like:

uint64_t *p0, v0; 

p0 = &mb0->rearm_data;
v0 = *p0 & REARM_MASK;
*p0 = v0 | rxq->mbuf_initializer;
? 

If so I wonder what would be the performance impact of that change.
Konstantin


>         /* next 6 bytes are initialised on RX descriptor rearm */

> -       MARKER8 rearm_data;

> +       MARKER64 rearm_data;

>         uint16_t data_off;

> 

>         /**

> @@ -754,6 +752,7 @@ struct rte_mbuf {

>         uint8_t nb_segs;          /**< Number of segments. */

>         uint8_t port;             /**< Input port. */

> 

> +       uint16_t buf_len;         /**< Length of segment buffer. */

>         uint64_t ol_flags;        /**< Offload features. */

> 

>         /* remaining bytes are set on RX when pulling packet from

>  * descriptor

> 

> /Jerin
  
Jerin Jacob May 12, 2016, 12:17 p.m. UTC | #2
On Thu, May 12, 2016 at 10:07:09AM +0000, Ananyev, Konstantin wrote:
> Hi Jerrin,
> 
> > 
> > Hi All,
> > 
> > I would like align mbuff rearm_data field to 8 byte aligned so that
> > write to mbuf->rearm_data with uint64_t* will be naturally aligned.
> > I am not sure about IA but some other architecture/implementation has overhead
> > in non-naturally aligned stores.
> > 
> > Proposed patch is something like this below, But open for any change to
> > make fit for all other architectures/platform.
> > 
> > Any thoughts ?
> > 
> > ➜ [master] [dpdk-master] $ git diff
> > diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > index 529debb..5a917d0 100644
> > --- a/lib/librte_mbuf/rte_mbuf.h
> > +++ b/lib/librte_mbuf/rte_mbuf.h
> > @@ -733,10 +733,8 @@ struct rte_mbuf {
> >         void *buf_addr;           /**< Virtual address of segment
> > buffer. */
> >         phys_addr_t buf_physaddr; /**< Physical address of segment
> > buffer. */
> > 
> > -       uint16_t buf_len;         /**< Length of segment buffer. */
> > -
> 
> 
> There is no need to move buf_len itself, I think.
> Just move rearm_data marker prior to buf_len is enough.
> Though how do you suggest to deal with the fact, that right now we blindly
> update the whole 64bits pointed by rearm_data:
> 
> drivers/net/ixgbe/ixgbe_rxtx_vec.c:
> 	/*
>                  * Flush mbuf with pkt template.
>                  * Data to be rearmed is 6 bytes long.
>                  * Though, RX will overwrite ol_flags that are coming next
>                  * anyway. So overwrite whole 8 bytes with one load:
>                  * 6 bytes of rearm_data plus first 2 bytes of ol_flags.
>                  */
>                 p0 = (uintptr_t)&mb0->rearm_data;
>                 *(uint64_t *)p0 = rxq->mbuf_initializer;
> 
> ?
> 
> If buf_len will be inside these 64bits, we can't do it anymore.
> 
> Are you suggesting something like:
> 
> uint64_t *p0, v0; 
> 
> p0 = &mb0->rearm_data;
> v0 = *p0 & REARM_MASK;
> *p0 = v0 | rxq->mbuf_initializer;
> ? 

Due to unaligned rearm_data issue, In ThunderX platform, we need to write
multiple half word of aligned stores(so masking was better us).

But I think, if we can put 16bit hole between port and ol_flags then
we may not need the masking stuff in ixgbe. Right?

OR

Even better, if we can fill in a uint16_t variable which will replaced
later in the flow like "data_len"? and move buf_len at end the first
cache line? or any other thoughts to fix unaligned rearm_data issue?

Jerin



> 
> If so I wonder what would be the performance impact of that change.
> Konstantin
> 
> 
> >         /* next 6 bytes are initialised on RX descriptor rearm */
> > -       MARKER8 rearm_data;
> > +       MARKER64 rearm_data;
> >         uint16_t data_off;
> > 
> >         /**
> > @@ -754,6 +752,7 @@ struct rte_mbuf {
> >         uint8_t nb_segs;          /**< Number of segments. */
> >         uint8_t port;             /**< Input port. */
> > 
> > +       uint16_t buf_len;         /**< Length of segment buffer. */
> >         uint64_t ol_flags;        /**< Offload features. */
> > 
> >         /* remaining bytes are set on RX when pulling packet from
> >  * descriptor
> > 
> > /Jerin
  
Ananyev, Konstantin May 12, 2016, 1:14 p.m. UTC | #3
> 

> On Thu, May 12, 2016 at 10:07:09AM +0000, Ananyev, Konstantin wrote:

> > Hi Jerrin,

> >

> > >

> > > Hi All,

> > >

> > > I would like align mbuff rearm_data field to 8 byte aligned so that

> > > write to mbuf->rearm_data with uint64_t* will be naturally aligned.

> > > I am not sure about IA but some other architecture/implementation has overhead

> > > in non-naturally aligned stores.

> > >

> > > Proposed patch is something like this below, But open for any change to

> > > make fit for all other architectures/platform.

> > >

> > > Any thoughts ?

> > >

> > > ➜ [master] [dpdk-master] $ git diff

> > > diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h

> > > index 529debb..5a917d0 100644

> > > --- a/lib/librte_mbuf/rte_mbuf.h

> > > +++ b/lib/librte_mbuf/rte_mbuf.h

> > > @@ -733,10 +733,8 @@ struct rte_mbuf {

> > >         void *buf_addr;           /**< Virtual address of segment

> > > buffer. */

> > >         phys_addr_t buf_physaddr; /**< Physical address of segment

> > > buffer. */

> > >

> > > -       uint16_t buf_len;         /**< Length of segment buffer. */

> > > -

> >

> >

> > There is no need to move buf_len itself, I think.

> > Just move rearm_data marker prior to buf_len is enough.

> > Though how do you suggest to deal with the fact, that right now we blindly

> > update the whole 64bits pointed by rearm_data:

> >

> > drivers/net/ixgbe/ixgbe_rxtx_vec.c:

> > 	/*

> >                  * Flush mbuf with pkt template.

> >                  * Data to be rearmed is 6 bytes long.

> >                  * Though, RX will overwrite ol_flags that are coming next

> >                  * anyway. So overwrite whole 8 bytes with one load:

> >                  * 6 bytes of rearm_data plus first 2 bytes of ol_flags.

> >                  */

> >                 p0 = (uintptr_t)&mb0->rearm_data;

> >                 *(uint64_t *)p0 = rxq->mbuf_initializer;

> >

> > ?

> >

> > If buf_len will be inside these 64bits, we can't do it anymore.

> >

> > Are you suggesting something like:

> >

> > uint64_t *p0, v0;

> >

> > p0 = &mb0->rearm_data;

> > v0 = *p0 & REARM_MASK;

> > *p0 = v0 | rxq->mbuf_initializer;

> > ?

> 

> Due to unaligned rearm_data issue, In ThunderX platform, we need to write

> multiple half word of aligned stores(so masking was better us).


Ok, so what would be the gain on ARM if you'll make that change?
Again, what would be the drop (if any) on IA?

> But I think, if we can put 16bit hole between port and ol_flags then

> we may not need the masking stuff in ixgbe. Right?


You mean move buf_len somewhere else (end of cacheline0) and 
introduce a 2B hole between port and ol_flags, right?
Yep, that probably wouldn't have any performance impact.

> 

> OR

> 

> Even better, if we can fill in a uint16_t variable which will replaced

> later in the flow like "data_len"?


data_len is grouped  with rx_descriptor_fields1 on purpose -
so we can update packet_type,pkt_len, data_len,vlan_tci,rss with one 16B write.

Konstantin

> and move buf_len at end the first  cache line? 

>or any other thoughts to fix unaligned rearm_data issue?

> 

> Jerin

> 

> 

> 

> >

> > If so I wonder what would be the performance impact of that change.

> > Konstantin

> >

> >

> > >         /* next 6 bytes are initialised on RX descriptor rearm */

> > > -       MARKER8 rearm_data;

> > > +       MARKER64 rearm_data;

> > >         uint16_t data_off;

> > >

> > >         /**

> > > @@ -754,6 +752,7 @@ struct rte_mbuf {

> > >         uint8_t nb_segs;          /**< Number of segments. */

> > >         uint8_t port;             /**< Input port. */

> > >

> > > +       uint16_t buf_len;         /**< Length of segment buffer. */

> > >         uint64_t ol_flags;        /**< Offload features. */

> > >

> > >         /* remaining bytes are set on RX when pulling packet from

> > >  * descriptor

> > >

> > > /Jerin
  
Jerin Jacob May 12, 2016, 2:50 p.m. UTC | #4
On Thu, May 12, 2016 at 01:14:34PM +0000, Ananyev, Konstantin wrote:
> > 
> > On Thu, May 12, 2016 at 10:07:09AM +0000, Ananyev, Konstantin wrote:
> > > Hi Jerrin,
> > >
> > > >
> > > > Hi All,
> > > >
> > > > I would like align mbuff rearm_data field to 8 byte aligned so that
> > > > write to mbuf->rearm_data with uint64_t* will be naturally aligned.
> > > > I am not sure about IA but some other architecture/implementation has overhead
> > > > in non-naturally aligned stores.
> > > >
> > > > Proposed patch is something like this below, But open for any change to
> > > > make fit for all other architectures/platform.
> > > >
> > > > Any thoughts ?
> > > >
> > > > ➜ [master] [dpdk-master] $ git diff
> > > > diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > > > index 529debb..5a917d0 100644
> > > > --- a/lib/librte_mbuf/rte_mbuf.h
> > > > +++ b/lib/librte_mbuf/rte_mbuf.h
> > > > @@ -733,10 +733,8 @@ struct rte_mbuf {
> > > >         void *buf_addr;           /**< Virtual address of segment
> > > > buffer. */
> > > >         phys_addr_t buf_physaddr; /**< Physical address of segment
> > > > buffer. */
> > > >
> > > > -       uint16_t buf_len;         /**< Length of segment buffer. */
> > > > -
> > >
> > >
> > > There is no need to move buf_len itself, I think.
> > > Just move rearm_data marker prior to buf_len is enough.
> > > Though how do you suggest to deal with the fact, that right now we blindly
> > > update the whole 64bits pointed by rearm_data:
> > >
> > > drivers/net/ixgbe/ixgbe_rxtx_vec.c:
> > > 	/*
> > >                  * Flush mbuf with pkt template.
> > >                  * Data to be rearmed is 6 bytes long.
> > >                  * Though, RX will overwrite ol_flags that are coming next
> > >                  * anyway. So overwrite whole 8 bytes with one load:
> > >                  * 6 bytes of rearm_data plus first 2 bytes of ol_flags.
> > >                  */
> > >                 p0 = (uintptr_t)&mb0->rearm_data;
> > >                 *(uint64_t *)p0 = rxq->mbuf_initializer;
> > >
> > > ?
> > >
> > > If buf_len will be inside these 64bits, we can't do it anymore.
> > >
> > > Are you suggesting something like:
> > >
> > > uint64_t *p0, v0;
> > >
> > > p0 = &mb0->rearm_data;
> > > v0 = *p0 & REARM_MASK;
> > > *p0 = v0 | rxq->mbuf_initializer;
> > > ?
> > 
> > Due to unaligned rearm_data issue, In ThunderX platform, we need to write
> > multiple half word of aligned stores(so masking was better us).
> 
> Ok, so what would be the gain on ARM if you'll make that change?

~4 cpu cycles per packet.Again it may not be ARM architecture specific
as ARM architecture does not define  instruction latency so it is more of a
implementation specific data.

> Again, what would be the drop (if any) on IA?
> 
> > But I think, if we can put 16bit hole between port and ol_flags then
> > we may not need the masking stuff in ixgbe. Right?
> 
> You mean move buf_len somewhere else (end of cacheline0) and 
> introduce a 2B hole between port and ol_flags, right?

Yes

> Yep, that probably wouldn't have any performance impact.

I will try two options and send a patch which don't have any
performance impact on IA.

> 
> > 
> > OR
> > 
> > Even better, if we can fill in a uint16_t variable which will replaced
> > later in the flow like "data_len"?
> 
> data_len is grouped  with rx_descriptor_fields1 on purpose -
> so we can update packet_type,pkt_len, data_len,vlan_tci,rss with one 16B write.

OK

> 
> Konstantin
> 
> > and move buf_len at end the first  cache line? 
> >or any other thoughts to fix unaligned rearm_data issue?
> > 
> > Jerin
> > 
> > 
> > 
> > >
> > > If so I wonder what would be the performance impact of that change.
> > > Konstantin
> > >
> > >
> > > >         /* next 6 bytes are initialised on RX descriptor rearm */
> > > > -       MARKER8 rearm_data;
> > > > +       MARKER64 rearm_data;
> > > >         uint16_t data_off;
> > > >
> > > >         /**
> > > > @@ -754,6 +752,7 @@ struct rte_mbuf {
> > > >         uint8_t nb_segs;          /**< Number of segments. */
> > > >         uint8_t port;             /**< Input port. */
> > > >
> > > > +       uint16_t buf_len;         /**< Length of segment buffer. */
> > > >         uint64_t ol_flags;        /**< Offload features. */
> > > >
> > > >         /* remaining bytes are set on RX when pulling packet from
> > > >  * descriptor
> > > >
> > > > /Jerin
  

Patch

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 529debb..5a917d0 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -733,10 +733,8 @@  struct rte_mbuf {
        void *buf_addr;           /**< Virtual address of segment
buffer. */
        phys_addr_t buf_physaddr; /**< Physical address of segment
buffer. */
 
-       uint16_t buf_len;         /**< Length of segment buffer. */
-
        /* next 6 bytes are initialised on RX descriptor rearm */
-       MARKER8 rearm_data;
+       MARKER64 rearm_data;
        uint16_t data_off;