[dpdk-dev] mbuf: make rearm_data address naturally aligned

Message ID 1463579863-32053-1-git-send-email-jerin.jacob@caviumnetworks.com (mailing list archive)
State Accepted, archived
Delegated to: Thomas Monjalon
Headers

Commit Message

Jerin Jacob May 18, 2016, 1:57 p.m. UTC
  To avoid multiple stores on fast path, Ethernet drivers
aggregate the writes to data_off, refcnt, nb_segs and port
to an uint64_t data and write the data in one shot
with uint64_t* at &mbuf->rearm_data address.

Some of the non-IA platforms have store operation overhead
if the store address is not naturally aligned.This patch
fixes the performance issue on those targets.

Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
---

Tested this patch on IA and non-IA(ThunderX) platforms.
This patch shows 400Kpps/core improvement on ThunderX + ixgbe + vector environment.
and this patch does not have any overhead on IA platform.

Have tried an another similar approach by replacing "buf_len" with "pad"
(in this patch context),
Since it has additional overhead on read and then mask to keep "buf_len" intact,
not much improvement is not shown.
ref: http://dpdk.org/ml/archives/dev/2016-May/038914.html

---
 drivers/net/fm10k/fm10k_rxtx_vec.c                            | 3 ---
 drivers/net/i40e/i40e_rxtx_vec.c                              | 5 +----
 drivers/net/ixgbe/ixgbe_rxtx_vec.c                            | 3 ---
 lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h | 4 ++--
 lib/librte_mbuf/rte_mbuf.h                                    | 6 +++---
 5 files changed, 6 insertions(+), 15 deletions(-)
  

Comments

Bruce Richardson May 18, 2016, 4:43 p.m. UTC | #1
On Wed, May 18, 2016 at 07:27:43PM +0530, Jerin Jacob wrote:
> To avoid multiple stores on fast path, Ethernet drivers
> aggregate the writes to data_off, refcnt, nb_segs and port
> to an uint64_t data and write the data in one shot
> with uint64_t* at &mbuf->rearm_data address.
> 
> Some of the non-IA platforms have store operation overhead
> if the store address is not naturally aligned.This patch
> fixes the performance issue on those targets.
> 
> Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> ---
> 
> Tested this patch on IA and non-IA(ThunderX) platforms.
> This patch shows 400Kpps/core improvement on ThunderX + ixgbe + vector environment.
> and this patch does not have any overhead on IA platform.
> 
> Have tried an another similar approach by replacing "buf_len" with "pad"
> (in this patch context),
> Since it has additional overhead on read and then mask to keep "buf_len" intact,
> not much improvement is not shown.
> ref: http://dpdk.org/ml/archives/dev/2016-May/038914.html
> 
> ---
While this will work and from your tests doesn't seem to have a performance
impact, I'm not sure I particularly like it. It's extending out the end of
cacheline0 of the mbuf by 16 bytes, though I suppose it's not technically using
up any more space of it.

What I'm wondering about though, is do we have any usecases where we need a
variable buf_len for packets for RX. These mbufs come directly from a mempool,
which is generally understood to be a set of fixed-sized buffers. I realise that
this change was made in the past after some discussion, but one of the key points
there [at least to my reading] was that - even though nobody actually made a
concrete case where they had variable-sized buffers - having support for them
made no performance difference.

The latter part of that has now changed, and supporting variable-sized mbufs
from an mbuf pool has a perf impact. Do we definitely need that functionality,
because the easiest fix here is just to move the rxrearm marker back above
mbuf_len as it was originally in releases like 1.8?

Regards,
/Bruce

Ref: http://dpdk.org/ml/archives/dev/2014-December/009432.html
  
Jerin Jacob May 18, 2016, 6:50 p.m. UTC | #2
On Wed, May 18, 2016 at 05:43:00PM +0100, Bruce Richardson wrote:
> On Wed, May 18, 2016 at 07:27:43PM +0530, Jerin Jacob wrote:
> > To avoid multiple stores on fast path, Ethernet drivers
> > aggregate the writes to data_off, refcnt, nb_segs and port
> > to an uint64_t data and write the data in one shot
> > with uint64_t* at &mbuf->rearm_data address.
> > 
> > Some of the non-IA platforms have store operation overhead
> > if the store address is not naturally aligned.This patch
> > fixes the performance issue on those targets.
> > 
> > Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > ---
> > 
> > Tested this patch on IA and non-IA(ThunderX) platforms.
> > This patch shows 400Kpps/core improvement on ThunderX + ixgbe + vector environment.
> > and this patch does not have any overhead on IA platform.
> > 
> > Have tried an another similar approach by replacing "buf_len" with "pad"
> > (in this patch context),
> > Since it has additional overhead on read and then mask to keep "buf_len" intact,
> > not much improvement is not shown.
> > ref: http://dpdk.org/ml/archives/dev/2016-May/038914.html
> > 
> > ---
> While this will work and from your tests doesn't seem to have a performance
> impact, I'm not sure I particularly like it. It's extending out the end of
> cacheline0 of the mbuf by 16 bytes, though I suppose it's not technically using
> up any more space of it.

Extending by 2 bytes. Right ?. Yes, I guess, Now we using only 56 out of 64 bytes
in the first 64-byte cache line.

> 
> What I'm wondering about though, is do we have any usecases where we need a
> variable buf_len for packets for RX. These mbufs come directly from a mempool,
> which is generally understood to be a set of fixed-sized buffers. I realise that
> this change was made in the past after some discussion, but one of the key points
> there [at least to my reading] was that - even though nobody actually made a
> concrete case where they had variable-sized buffers - having support for them
> made no performance difference.
> 
> The latter part of that has now changed, and supporting variable-sized mbufs
> from an mbuf pool has a perf impact. Do we definitely need that functionality,
> because the easiest fix here is just to move the rxrearm marker back above
> mbuf_len as it was originally in releases like 1.8?

And initialize the buf_len with mp->elt_size - sizeof(struct rte_mbuf).
Right?

I don't have a strong opinion on this, I can do this if there is no
objection on this. Let me know.

However, I do see in future, "buf_len" may belong at the end of the first 64 byte
cache line as currently "port" is defined as uint8_t, IMO, that is less.
We may need to increase that uint16_t. The reason why I think that
because, Currently in ThunderX HW, we do have 128VFs per socket for
built-in NIC, So, the two node configuration and one external PCIe NW card
configuration can easily go beyond 256 ports.

> 
> Regards,
> /Bruce
> 
> Ref: http://dpdk.org/ml/archives/dev/2014-December/009432.html
>
  
Bruce Richardson May 19, 2016, 8:50 a.m. UTC | #3
On Thu, May 19, 2016 at 12:20:16AM +0530, Jerin Jacob wrote:
> On Wed, May 18, 2016 at 05:43:00PM +0100, Bruce Richardson wrote:
> > On Wed, May 18, 2016 at 07:27:43PM +0530, Jerin Jacob wrote:
> > > To avoid multiple stores on fast path, Ethernet drivers
> > > aggregate the writes to data_off, refcnt, nb_segs and port
> > > to an uint64_t data and write the data in one shot
> > > with uint64_t* at &mbuf->rearm_data address.
> > > 
> > > Some of the non-IA platforms have store operation overhead
> > > if the store address is not naturally aligned.This patch
> > > fixes the performance issue on those targets.
> > > 
> > > Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > > ---
> > > 
> > > Tested this patch on IA and non-IA(ThunderX) platforms.
> > > This patch shows 400Kpps/core improvement on ThunderX + ixgbe + vector environment.
> > > and this patch does not have any overhead on IA platform.
> > > 
> > > Have tried an another similar approach by replacing "buf_len" with "pad"
> > > (in this patch context),
> > > Since it has additional overhead on read and then mask to keep "buf_len" intact,
> > > not much improvement is not shown.
> > > ref: http://dpdk.org/ml/archives/dev/2016-May/038914.html
> > > 
> > > ---
> > While this will work and from your tests doesn't seem to have a performance
> > impact, I'm not sure I particularly like it. It's extending out the end of
> > cacheline0 of the mbuf by 16 bytes, though I suppose it's not technically using
> > up any more space of it.
> 
> Extending by 2 bytes. Right ?. Yes, I guess, Now we using only 56 out of 64 bytes
> in the first 64-byte cache line.
> 
> > 
> > What I'm wondering about though, is do we have any usecases where we need a
> > variable buf_len for packets for RX. These mbufs come directly from a mempool,
> > which is generally understood to be a set of fixed-sized buffers. I realise that
> > this change was made in the past after some discussion, but one of the key points
> > there [at least to my reading] was that - even though nobody actually made a
> > concrete case where they had variable-sized buffers - having support for them
> > made no performance difference.
> > 
> > The latter part of that has now changed, and supporting variable-sized mbufs
> > from an mbuf pool has a perf impact. Do we definitely need that functionality,
> > because the easiest fix here is just to move the rxrearm marker back above
> > mbuf_len as it was originally in releases like 1.8?
> 
> And initialize the buf_len with mp->elt_size - sizeof(struct rte_mbuf).
> Right?
> 
> I don't have a strong opinion on this, I can do this if there is no
> objection on this. Let me know.
> 
> However, I do see in future, "buf_len" may belong at the end of the first 64 byte
> cache line as currently "port" is defined as uint8_t, IMO, that is less.
> We may need to increase that uint16_t. The reason why I think that
> because, Currently in ThunderX HW, we do have 128VFs per socket for
> built-in NIC, So, the two node configuration and one external PCIe NW card
> configuration can easily go beyond 256 ports.
> 
Ok, good point. If you think it's needed, and if we are changing the mbuf
structure, it might be a good time to extend that field while you are at it, save
a second ABI break later on.

/Bruce

> > 
> > Regards,
> > /Bruce
> > 
> > Ref: http://dpdk.org/ml/archives/dev/2014-December/009432.html
> >
  
Jan Viktorin May 19, 2016, 11:54 a.m. UTC | #4
On Thu, 19 May 2016 09:50:48 +0100
Bruce Richardson <bruce.richardson@intel.com> wrote:

> On Thu, May 19, 2016 at 12:20:16AM +0530, Jerin Jacob wrote:
> > On Wed, May 18, 2016 at 05:43:00PM +0100, Bruce Richardson wrote:  
> > > On Wed, May 18, 2016 at 07:27:43PM +0530, Jerin Jacob wrote:  
> > > > To avoid multiple stores on fast path, Ethernet drivers
> > > > aggregate the writes to data_off, refcnt, nb_segs and port
> > > > to an uint64_t data and write the data in one shot
> > > > with uint64_t* at &mbuf->rearm_data address.
> > > > 
> > > > Some of the non-IA platforms have store operation overhead
> > > > if the store address is not naturally aligned.This patch
> > > > fixes the performance issue on those targets.
> > > > 
> > > > Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > > > ---
> > > > 
> > > > Tested this patch on IA and non-IA(ThunderX) platforms.
> > > > This patch shows 400Kpps/core improvement on ThunderX + ixgbe + vector environment.
> > > > and this patch does not have any overhead on IA platform.

Hello,

I can confirm a very small improvement in our synthetic tests based on the PMD
null (ARM Cortex-A9). For a single-core (1C) test, there is now a lower overhead
and it is more stable with different packet lengths. However, when running dual-core
(2C), the result is slightly slower but again, it seems to be more stable.

Without this patch (cycles per packet):

 length:   64     128     256     512    1024    1280    1518
  1C      488     544     487     454     543     488     515
  2C      433     433     431     433     433     461     443

Applied this patch (cycles per packet):

 length:   64     128     256     512    1024    1280    1518
  1C      472     472     472     472     473     472     473
  2C      435     435     435     435     436     436     436

Regards
Jan

> > > > 
> > > > Have tried an another similar approach by replacing "buf_len" with "pad"
> > > > (in this patch context),
> > > > Since it has additional overhead on read and then mask to keep "buf_len" intact,
> > > > not much improvement is not shown.
> > > > ref: http://dpdk.org/ml/archives/dev/2016-May/038914.html
> > > > 
> > > > ---  
> > > While this will work and from your tests doesn't seem to have a performance
> > > impact, I'm not sure I particularly like it. It's extending out the end of
> > > cacheline0 of the mbuf by 16 bytes, though I suppose it's not technically using
> > > up any more space of it.  
> > 
> > Extending by 2 bytes. Right ?. Yes, I guess, Now we using only 56 out of 64 bytes
> > in the first 64-byte cache line.
> >   
> > > 
> > > What I'm wondering about though, is do we have any usecases where we need a
> > > variable buf_len for packets for RX. These mbufs come directly from a mempool,
> > > which is generally understood to be a set of fixed-sized buffers. I realise that
> > > this change was made in the past after some discussion, but one of the key points
> > > there [at least to my reading] was that - even though nobody actually made a
> > > concrete case where they had variable-sized buffers - having support for them
> > > made no performance difference.
> > > 
> > > The latter part of that has now changed, and supporting variable-sized mbufs
> > > from an mbuf pool has a perf impact. Do we definitely need that functionality,
> > > because the easiest fix here is just to move the rxrearm marker back above
> > > mbuf_len as it was originally in releases like 1.8?  
> > 
> > And initialize the buf_len with mp->elt_size - sizeof(struct rte_mbuf).
> > Right?
> > 
> > I don't have a strong opinion on this, I can do this if there is no
> > objection on this. Let me know.
> > 
> > However, I do see in future, "buf_len" may belong at the end of the first 64 byte
> > cache line as currently "port" is defined as uint8_t, IMO, that is less.
> > We may need to increase that uint16_t. The reason why I think that
> > because, Currently in ThunderX HW, we do have 128VFs per socket for
> > built-in NIC, So, the two node configuration and one external PCIe NW card
> > configuration can easily go beyond 256 ports.
> >   
> Ok, good point. If you think it's needed, and if we are changing the mbuf
> structure, it might be a good time to extend that field while you are at it, save
> a second ABI break later on.

> 
> /Bruce
> 
> > > 
> > > Regards,
> > > /Bruce
> > > 
> > > Ref: http://dpdk.org/ml/archives/dev/2014-December/009432.html
> > >
  
Ananyev, Konstantin May 19, 2016, 12:18 p.m. UTC | #5
Hi everyone,
 
> On Thu, May 19, 2016 at 12:20:16AM +0530, Jerin Jacob wrote:
> > On Wed, May 18, 2016 at 05:43:00PM +0100, Bruce Richardson wrote:
> > > On Wed, May 18, 2016 at 07:27:43PM +0530, Jerin Jacob wrote:
> > > > To avoid multiple stores on fast path, Ethernet drivers
> > > > aggregate the writes to data_off, refcnt, nb_segs and port
> > > > to an uint64_t data and write the data in one shot
> > > > with uint64_t* at &mbuf->rearm_data address.
> > > >
> > > > Some of the non-IA platforms have store operation overhead
> > > > if the store address is not naturally aligned.This patch
> > > > fixes the performance issue on those targets.
> > > >
> > > > Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > > > ---
> > > >
> > > > Tested this patch on IA and non-IA(ThunderX) platforms.
> > > > This patch shows 400Kpps/core improvement on ThunderX + ixgbe + vector environment.
> > > > and this patch does not have any overhead on IA platform.
> > > >
> > > > Have tried an another similar approach by replacing "buf_len" with "pad"
> > > > (in this patch context),
> > > > Since it has additional overhead on read and then mask to keep "buf_len" intact,
> > > > not much improvement is not shown.
> > > > ref: http://dpdk.org/ml/archives/dev/2016-May/038914.html
> > > >
> > > > ---
> > > While this will work and from your tests doesn't seem to have a performance
> > > impact, I'm not sure I particularly like it. It's extending out the end of
> > > cacheline0 of the mbuf by 16 bytes, though I suppose it's not technically using
> > > up any more space of it.
> >
> > Extending by 2 bytes. Right ?. Yes, I guess, Now we using only 56 out of 64 bytes
> > in the first 64-byte cache line.
> >
> > >
> > > What I'm wondering about though, is do we have any usecases where we need a
> > > variable buf_len for packets for RX. These mbufs come directly from a mempool,
> > > which is generally understood to be a set of fixed-sized buffers. I realise that
> > > this change was made in the past after some discussion, but one of the key points
> > > there [at least to my reading] was that - even though nobody actually made a
> > > concrete case where they had variable-sized buffers - having support for them
> > > made no performance difference.

I was going to point to vhost zcp support, but as Thomas pointed out
that functionality was removed  from dpdk.org recently.
So I am not aware does such case exist right now in the 'real world' or not.
Though I still think RX function should leave buf_len field intact. 

> > >
> > > The latter part of that has now changed, and supporting variable-sized mbufs
> > > from an mbuf pool has a perf impact. Do we definitely need that functionality,
> > > because the easiest fix here is just to move the rxrearm marker back above
> > > mbuf_len as it was originally in releases like 1.8?
> >
> > And initialize the buf_len with mp->elt_size - sizeof(struct rte_mbuf).
> > Right?
> >
> > I don't have a strong opinion on this, I can do this if there is no
> > objection on this. Let me know.
> >
> > However, I do see in future, "buf_len" may belong at the end of the first 64 byte
> > cache line as currently "port" is defined as uint8_t, IMO, that is less.
> > We may need to increase that uint16_t. The reason why I think that
> > because, Currently in ThunderX HW, we do have 128VFs per socket for
> > built-in NIC, So, the two node configuration and one external PCIe NW card
> > configuration can easily go beyond 256 ports.

I wonder does anyone really use mbuf port field?
My though was - could we to drop it completely?
Actually, after discussing it with Bruce offline, an interesting idea came out:
if we'll drop port and make mbuf_prefree() to reset nb_segs=1, then
we can reduce RX rearm_data to 4B. So with that layout:

struct rte_mbuf {

         MARKER cacheline0;

        void *buf_addr;           
        phys_addr_t buf_physaddr; 
        uint16_t buf_len;
        uint8_t nb_segs;
        uint8_t reserved_1byte;   /* former port */
        
        MARKER32 rearm_data;
        uint16_t data_off;
       uint16_t refcnt;
       
        uint64_t ol_flags;
        ...

We can keep buf_len at its place and avoid 2B gap, while making rearm_data
4B long and 4B aligned.

Another similar alternative, is to make mbuf_prefree() to set refcnt=1
(as it update it anyway). Then we can remove refcnt from the RX rearm_data,
and again make rearm_data 4B long and 4B aligned:

struct rte_mbuf {

         MARKER cacheline0;

        void *buf_addr;           
        phys_addr_t buf_physaddr; 
        uint16_t buf_len;
        uint16_t refcnt;

        MARKER32 rearm_data;
        uint16_t data_off;
        uint8_t nb_segs;
        uint8_t port;
        
        uint64_t ol_flags;
         ..

As additional plus, __rte_mbuf_raw_alloc() wouldn't need to modify mbuf contents at all -
which probably is a good thing.
As a drawback - we'll have a free mbufs in pool with refcnt==1, which probably reduce
debug ability of the mbuf code.  

Konstantin

> >
> Ok, good point. If you think it's needed, and if we are changing the mbuf
> structure, it might be a good time to extend that field while you are at it, save
> a second ABI break later on.
> 
> /Bruce
> 
> > >
> > > Regards,
> > > /Bruce
> > >
> > > Ref: http://dpdk.org/ml/archives/dev/2014-December/009432.html
> > >
  
Jerin Jacob May 19, 2016, 1:35 p.m. UTC | #6
On Thu, May 19, 2016 at 12:18:57PM +0000, Ananyev, Konstantin wrote:
> 
> Hi everyone,
>  
> > On Thu, May 19, 2016 at 12:20:16AM +0530, Jerin Jacob wrote:
> > > On Wed, May 18, 2016 at 05:43:00PM +0100, Bruce Richardson wrote:
> > > > On Wed, May 18, 2016 at 07:27:43PM +0530, Jerin Jacob wrote:
> > > > > To avoid multiple stores on fast path, Ethernet drivers
> > > > > aggregate the writes to data_off, refcnt, nb_segs and port
> > > > > to an uint64_t data and write the data in one shot
> > > > > with uint64_t* at &mbuf->rearm_data address.
> > > > >
> > > > > Some of the non-IA platforms have store operation overhead
> > > > > if the store address is not naturally aligned.This patch
> > > > > fixes the performance issue on those targets.
> > > > >
> > > > > Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > > > > ---
> > > > >
> > > > > Tested this patch on IA and non-IA(ThunderX) platforms.
> > > > > This patch shows 400Kpps/core improvement on ThunderX + ixgbe + vector environment.
> > > > > and this patch does not have any overhead on IA platform.
> > > > >
> > > > > Have tried an another similar approach by replacing "buf_len" with "pad"
> > > > > (in this patch context),
> > > > > Since it has additional overhead on read and then mask to keep "buf_len" intact,
> > > > > not much improvement is not shown.
> > > > > ref: http://dpdk.org/ml/archives/dev/2016-May/038914.html
> > > > >
> > > > > ---
> > > > While this will work and from your tests doesn't seem to have a performance
> > > > impact, I'm not sure I particularly like it. It's extending out the end of
> > > > cacheline0 of the mbuf by 16 bytes, though I suppose it's not technically using
> > > > up any more space of it.
> > >
> > > Extending by 2 bytes. Right ?. Yes, I guess, Now we using only 56 out of 64 bytes
> > > in the first 64-byte cache line.
> > >
> > > >
> > > > What I'm wondering about though, is do we have any usecases where we need a
> > > > variable buf_len for packets for RX. These mbufs come directly from a mempool,
> > > > which is generally understood to be a set of fixed-sized buffers. I realise that
> > > > this change was made in the past after some discussion, but one of the key points
> > > > there [at least to my reading] was that - even though nobody actually made a
> > > > concrete case where they had variable-sized buffers - having support for them
> > > > made no performance difference.
> 
> I was going to point to vhost zcp support, but as Thomas pointed out
> that functionality was removed  from dpdk.org recently.
> So I am not aware does such case exist right now in the 'real world' or not.
> Though I still think RX function should leave buf_len field intact. 
> 
> > > >
> > > > The latter part of that has now changed, and supporting variable-sized mbufs
> > > > from an mbuf pool has a perf impact. Do we definitely need that functionality,
> > > > because the easiest fix here is just to move the rxrearm marker back above
> > > > mbuf_len as it was originally in releases like 1.8?
> > >
> > > And initialize the buf_len with mp->elt_size - sizeof(struct rte_mbuf).
> > > Right?
> > >
> > > I don't have a strong opinion on this, I can do this if there is no
> > > objection on this. Let me know.
> > >
> > > However, I do see in future, "buf_len" may belong at the end of the first 64 byte
> > > cache line as currently "port" is defined as uint8_t, IMO, that is less.
> > > We may need to increase that uint16_t. The reason why I think that
> > > because, Currently in ThunderX HW, we do have 128VFs per socket for
> > > built-in NIC, So, the two node configuration and one external PCIe NW card
> > > configuration can easily go beyond 256 ports.
> 
> I wonder does anyone really use mbuf port field?
> My though was - could we to drop it completely?
> Actually, after discussing it with Bruce offline, an interesting idea came out:
> if we'll drop port and make mbuf_prefree() to reset nb_segs=1, then
> we can reduce RX rearm_data to 4B. So with that layout:
> 
> struct rte_mbuf {
> 
>          MARKER cacheline0;
> 
>         void *buf_addr;           
>         phys_addr_t buf_physaddr; 
>         uint16_t buf_len;
>         uint8_t nb_segs;
>         uint8_t reserved_1byte;   /* former port */
>         
>         MARKER32 rearm_data;
>         uint16_t data_off;
>        uint16_t refcnt;
>        
>         uint64_t ol_flags;
>         ...
> 
> We can keep buf_len at its place and avoid 2B gap, while making rearm_data
> 4B long and 4B aligned.

Couple of comments,
- IMO, It is good if nb_segs can move under rearm_data, as some
drivers(not in ixgbe may be) can write nb_segs in one shot also
in segmented rx handler case
- I think, it makes sense to keep port in mbuf so that application
can make use of it(Not sure what real application developers think of
this)
- if Writing 4B and 8B consume same cycles(at least in arm64) then I think it
makes sense to make it as 8B wide with maximum pre-built constants are possible.

> 
> Another similar alternative, is to make mbuf_prefree() to set refcnt=1
> (as it update it anyway). Then we can remove refcnt from the RX rearm_data,
> and again make rearm_data 4B long and 4B aligned:
> 
> struct rte_mbuf {
> 
>          MARKER cacheline0;
> 
>         void *buf_addr;           
>         phys_addr_t buf_physaddr; 
>         uint16_t buf_len;
>         uint16_t refcnt;
> 
>         MARKER32 rearm_data;
>         uint16_t data_off;
>         uint8_t nb_segs;
>         uint8_t port;

The only problem I think with this approach is that, port data type cannot be
extended to uint16_t in future.

>         
>         uint64_t ol_flags;
>          ..
> 
> As additional plus, __rte_mbuf_raw_alloc() wouldn't need to modify mbuf contents at all -
> which probably is a good thing.
> As a drawback - we'll have a free mbufs in pool with refcnt==1, which probably reduce
> debug ability of the mbuf code.  
> 
> Konstantin
> 
> > >
> > Ok, good point. If you think it's needed, and if we are changing the mbuf
> > structure, it might be a good time to extend that field while you are at it, save
> > a second ABI break later on.
> > 
> > /Bruce
> > 
> > > >
> > > > Regards,
> > > > /Bruce
> > > >
> > > > Ref: http://dpdk.org/ml/archives/dev/2014-December/009432.html
> > > >
  
Thomas Monjalon May 19, 2016, 3:50 p.m. UTC | #7
2016-05-19 19:05, Jerin Jacob:
> On Thu, May 19, 2016 at 12:18:57PM +0000, Ananyev, Konstantin wrote:
> > > On Thu, May 19, 2016 at 12:20:16AM +0530, Jerin Jacob wrote:
> > > > On Wed, May 18, 2016 at 05:43:00PM +0100, Bruce Richardson wrote:
> > > > > On Wed, May 18, 2016 at 07:27:43PM +0530, Jerin Jacob wrote:
> > I wonder does anyone really use mbuf port field?
> > My though was - could we to drop it completely?
> > Actually, after discussing it with Bruce offline, an interesting idea came out:
> > if we'll drop port and make mbuf_prefree() to reset nb_segs=1, then
> > we can reduce RX rearm_data to 4B. So with that layout:
> > 
> > struct rte_mbuf {
> > 
> >          MARKER cacheline0;
> > 
> >         void *buf_addr;           
> >         phys_addr_t buf_physaddr; 
> >         uint16_t buf_len;
> >         uint8_t nb_segs;
> >         uint8_t reserved_1byte;   /* former port */
> >         
> >         MARKER32 rearm_data;
> >         uint16_t data_off;
> >        uint16_t refcnt;
> >        
> >         uint64_t ol_flags;
> >         ...
> > 
> > We can keep buf_len at its place and avoid 2B gap, while making rearm_data
> > 4B long and 4B aligned.
> 
> Couple of comments,
> - IMO, It is good if nb_segs can move under rearm_data, as some
> drivers(not in ixgbe may be) can write nb_segs in one shot also
> in segmented rx handler case
> - I think, it makes sense to keep port in mbuf so that application
> can make use of it(Not sure what real application developers think of
> this)

I agree we could try to remove the port id from mbuf.
These mbuf data are a common base to pass infos between drivers and apps.
If you need to store some data which are read and write from the app only,
you can have use the private zone (see rte_pktmbuf_priv_size).

> - if Writing 4B and 8B consume same cycles(at least in arm64) then I think it
> makes sense to make it as 8B wide with maximum pre-built constants are possible.
> 
> > 
> > Another similar alternative, is to make mbuf_prefree() to set refcnt=1
> > (as it update it anyway). Then we can remove refcnt from the RX rearm_data,
> > and again make rearm_data 4B long and 4B aligned:
> > 
> > struct rte_mbuf {
> > 
> >          MARKER cacheline0;
> > 
> >         void *buf_addr;           
> >         phys_addr_t buf_physaddr; 
> >         uint16_t buf_len;
> >         uint16_t refcnt;
> > 
> >         MARKER32 rearm_data;
> >         uint16_t data_off;
> >         uint8_t nb_segs;
> >         uint8_t port;
> 
> The only problem I think with this approach is that, port data type cannot be
> extended to uint16_t in future.

It is a major problem. The port id should be extended to uint16_t.
  
Zoltan Kiss May 20, 2016, 3:30 p.m. UTC | #8
Hi,

On 19/05/16 13:18, Ananyev, Konstantin wrote:
> I wonder does anyone really use mbuf port field?
> My though was - could we to drop it completely?

There are a few example codes which are reading the port field. Although 
they can retain this metadata in the private area of the mbuf, right 
after receiving, it would cause them a minor perf drop to do it 
separately. I'm not sure which one is more important: this perf drop of 
the gain everyone else has by relieving the drivers to do it.

Zoli
  
Olivier Matz May 23, 2016, 11:19 a.m. UTC | #9
Hi,

On 05/19/2016 05:50 PM, Thomas Monjalon wrote:
> 2016-05-19 19:05, Jerin Jacob:
>> On Thu, May 19, 2016 at 12:18:57PM +0000, Ananyev, Konstantin wrote:
>>>> On Thu, May 19, 2016 at 12:20:16AM +0530, Jerin Jacob wrote:
>>>>> On Wed, May 18, 2016 at 05:43:00PM +0100, Bruce Richardson wrote:
>>>>>> On Wed, May 18, 2016 at 07:27:43PM +0530, Jerin Jacob wrote:
>>> I wonder does anyone really use mbuf port field?
>>> My though was - could we to drop it completely?
>>> Actually, after discussing it with Bruce offline, an interesting idea came out:
>>> if we'll drop port and make mbuf_prefree() to reset nb_segs=1, then
>>> we can reduce RX rearm_data to 4B. So with that layout:
>>>
>>> struct rte_mbuf {
>>>
>>>          MARKER cacheline0;
>>>
>>>         void *buf_addr;           
>>>         phys_addr_t buf_physaddr; 
>>>         uint16_t buf_len;
>>>         uint8_t nb_segs;
>>>         uint8_t reserved_1byte;   /* former port */
>>>         
>>>         MARKER32 rearm_data;
>>>         uint16_t data_off;
>>>        uint16_t refcnt;
>>>        
>>>         uint64_t ol_flags;
>>>         ...
>>>
>>> We can keep buf_len at its place and avoid 2B gap, while making rearm_data
>>> 4B long and 4B aligned.
>>
>> Couple of comments,
>> - IMO, It is good if nb_segs can move under rearm_data, as some
>> drivers(not in ixgbe may be) can write nb_segs in one shot also
>> in segmented rx handler case
>> - I think, it makes sense to keep port in mbuf so that application
>> can make use of it(Not sure what real application developers think of
>> this)
> 
> I agree we could try to remove the port id from mbuf.
> These mbuf data are a common base to pass infos between drivers and apps.
> If you need to store some data which are read and write from the app only,
> you can have use the private zone (see rte_pktmbuf_priv_size).

At the first read, I was in favor of keeping the port_id in the
mbuf. But after checking the examples and applications, I'm not
opposed to remove it. Indeed, this information could go in an
application-specific part or it could be an additional function
parameter in the application processing function.

The same question could be raised for nb_seg: it seems this info
is not used a lot, and having a 8 bits value here also prevents from
having long chains (ex: for socket buffer in a tcp stack).

Just an idea thrown in the air: if these 2 fields are removed, it
brings some room for the m->next field to go in the first cache line.
This was mentioned several times (at least [1]).

[1] http://dpdk.org/ml/archives/dev/2015-June/019182.html

By the way, the "pahole" utility gives a nice representation
of structures with the field sizes and offsets. Example on the
current rte_mbuf structure on x86_64:

$ make config T=x86_64-native-linuxapp-gcc
$ make -j4 EXTRA_CFLAGS="-O -g"
$ pahole -C rte_mbuf build/app/testpmd

struct rte_mbuf {
     MARKER                     cacheline0;           /*     0     0 */
     void *                     buf_addr;             /*     0     8 */
     phys_addr_t                buf_physaddr;         /*     8     8 */
     uint16_t                   buf_len;              /*    16     2 */
     MARKER8                    rearm_data;           /*    18     0 */
     uint16_t                   data_off;             /*    18     2 */
     union {
             rte_atomic16_t     refcnt_atomic;        /*           2 */
             uint16_t           refcnt;               /*           2 */
     };                                               /*    20     2 */
     uint8_t                    nb_segs;              /*    22     1 */
     uint8_t                    port;                 /*    23     1 */
     uint64_t                   ol_flags;             /*    24     8 */
     MARKER                     rx_descriptor_fields1; /*    32     0 */
     union {
             uint32_t           packet_type;          /*           4 */
             struct {
                     uint32_t   l2_type:4;            /*    32:28  4 */
                     uint32_t   l3_type:4;            /*    32:24  4 */
                     uint32_t   l4_type:4;            /*    32:20  4 */
                     uint32_t   tun_type:4;           /*    32:16  4 */
                     uint32_t   inner_l2_type:4;      /*    32:12  4 */
                     uint32_t   inner_l3_type:4;      /*    32: 8  4 */
                     uint32_t   inner_l4_type:4;      /*    32: 4  4 */
             };                                       /*           4 */
     };                                               /*    32     4 */
     uint32_t                   pkt_len;              /*    36     4 */
     uint16_t                   data_len;             /*    40     2 */
     uint16_t                   vlan_tci;             /*    42     2 */
     union {
             uint32_t           rss;                  /*           4 */
             struct {
                     union {
                             struct {
                                     uint16_t hash;   /*    44     2 */
                                     uint16_t id;     /*    46     2 */
                             };                       /*           4 */
                             uint32_t lo;             /*           4 */
                     };                               /*    44     4 */
                     uint32_t   hi;                   /*    48     4 */
             } fdir;                                  /*           8 */
             struct {
                     uint32_t   lo;                   /*    44     4 */
                     uint32_t   hi;                   /*    48     4 */
             } sched;                                 /*           8 */
             uint32_t           usr;                  /*           4 */
     } hash;                                          /*    44     8 */
     uint32_t                   seqn;                 /*    52     4 */
     uint16_t                   vlan_tci_outer;       /*    56     2 */

     /* XXX 6 bytes hole, try to pack */

     /* --- cacheline 1 boundary (64 bytes) --- */
     MARKER                     cacheline1;           /*    64     0 */
     union {
             void *             userdata;             /*           8 */
             uint64_t           udata64;              /*           8 */
     };                                               /*    64     8 */
     struct rte_mempool *       pool;                 /*    72     8 */
     struct rte_mbuf *          next;                 /*    80     8 */
     union {
             uint64_t           tx_offload;           /*           8 */
             struct {
                     uint64_t   l2_len:7;             /*    88:57  8 */
                     uint64_t   l3_len:9;             /*    88:48  8 */
                     uint64_t   l4_len:8;             /*    88:40  8 */
                     uint64_t   tso_segsz:16;         /*    88:24  8 */
                     uint64_t   outer_l3_len:9;       /*    88:15  8 */
                     uint64_t   outer_l2_len:7;       /*    88: 8  8 */
             };                                       /*           8 */
     };                                               /*    88     8 */
     uint16_t                   priv_size;            /*    96     2 */
     uint16_t                   timesync;             /*    98     2 */

     /* size: 128, cachelines: 2, members: 25 */
     /* sum members: 94, holes: 1, sum holes: 6 */
     /* padding: 28 */
};
  
Jerin Jacob July 4, 2016, 12:45 p.m. UTC | #10
On Mon, May 23, 2016 at 01:19:46PM +0200, Olivier Matz wrote:
> Hi,
> 
> On 05/19/2016 05:50 PM, Thomas Monjalon wrote:
> > 2016-05-19 19:05, Jerin Jacob:
> >> On Thu, May 19, 2016 at 12:18:57PM +0000, Ananyev, Konstantin wrote:
> >>>> On Thu, May 19, 2016 at 12:20:16AM +0530, Jerin Jacob wrote:
> >>>>> On Wed, May 18, 2016 at 05:43:00PM +0100, Bruce Richardson wrote:
> >>>>>> On Wed, May 18, 2016 at 07:27:43PM +0530, Jerin Jacob wrote:
> >>> I wonder does anyone really use mbuf port field?
> >>> My though was - could we to drop it completely?
> >>> Actually, after discussing it with Bruce offline, an interesting idea came out:
> >>> if we'll drop port and make mbuf_prefree() to reset nb_segs=1, then
> >>> we can reduce RX rearm_data to 4B. So with that layout:
> >>>
> >>> struct rte_mbuf {
> >>>
> >>>          MARKER cacheline0;
> >>>
> >>>         void *buf_addr;           
> >>>         phys_addr_t buf_physaddr; 
> >>>         uint16_t buf_len;
> >>>         uint8_t nb_segs;
> >>>         uint8_t reserved_1byte;   /* former port */
> >>>         
> >>>         MARKER32 rearm_data;
> >>>         uint16_t data_off;
> >>>        uint16_t refcnt;
> >>>        
> >>>         uint64_t ol_flags;
> >>>         ...
> >>>
> >>> We can keep buf_len at its place and avoid 2B gap, while making rearm_data
> >>> 4B long and 4B aligned.
> >>
> >> Couple of comments,
> >> - IMO, It is good if nb_segs can move under rearm_data, as some
> >> drivers(not in ixgbe may be) can write nb_segs in one shot also
> >> in segmented rx handler case
> >> - I think, it makes sense to keep port in mbuf so that application
> >> can make use of it(Not sure what real application developers think of
> >> this)
> > 
> > I agree we could try to remove the port id from mbuf.
> > These mbuf data are a common base to pass infos between drivers and apps.
> > If you need to store some data which are read and write from the app only,
> > you can have use the private zone (see rte_pktmbuf_priv_size).
> 
> At the first read, I was in favor of keeping the port_id in the
> mbuf. But after checking the examples and applications, I'm not
> opposed to remove it. Indeed, this information could go in an
> application-specific part or it could be an additional function
> parameter in the application processing function.
> 
> The same question could be raised for nb_seg: it seems this info
> is not used a lot, and having a 8 bits value here also prevents from
> having long chains (ex: for socket buffer in a tcp stack).
> 
> Just an idea thrown in the air: if these 2 fields are removed, it
> brings some room for the m->next field to go in the first cache line.
> This was mentioned several times (at least [1]).
> 
> [1] http://dpdk.org/ml/archives/dev/2015-June/019182.html


Can we come to some consensus on this for this release. The original problem was
mbuf->rearm_data not being natually aligned and the assosiated performacnce
issues with ARM architecture for non naturally aligned access.
I believe that is fixing in this patch without any performance degradation on IA.
I believe that is a good progress. Can we make further mbuff improvements as
a additional problem to solve.

Thoughts ?

Jerin
  
Olivier Matz July 4, 2016, 12:58 p.m. UTC | #11
Hi Jerin,

On 07/04/2016 02:45 PM, Jerin Jacob wrote:
> On Mon, May 23, 2016 at 01:19:46PM +0200, Olivier Matz wrote:
>> Hi,
>>
>> On 05/19/2016 05:50 PM, Thomas Monjalon wrote:
>>> 2016-05-19 19:05, Jerin Jacob:
>>>> On Thu, May 19, 2016 at 12:18:57PM +0000, Ananyev, Konstantin wrote:
>>>>>> On Thu, May 19, 2016 at 12:20:16AM +0530, Jerin Jacob wrote:
>>>>>>> On Wed, May 18, 2016 at 05:43:00PM +0100, Bruce Richardson wrote:
>>>>>>>> On Wed, May 18, 2016 at 07:27:43PM +0530, Jerin Jacob wrote:
>>>>> I wonder does anyone really use mbuf port field?
>>>>> My though was - could we to drop it completely?
>>>>> Actually, after discussing it with Bruce offline, an interesting idea came out:
>>>>> if we'll drop port and make mbuf_prefree() to reset nb_segs=1, then
>>>>> we can reduce RX rearm_data to 4B. So with that layout:
>>>>>
>>>>> struct rte_mbuf {
>>>>>
>>>>>          MARKER cacheline0;
>>>>>
>>>>>         void *buf_addr;           
>>>>>         phys_addr_t buf_physaddr; 
>>>>>         uint16_t buf_len;
>>>>>         uint8_t nb_segs;
>>>>>         uint8_t reserved_1byte;   /* former port */
>>>>>         
>>>>>         MARKER32 rearm_data;
>>>>>         uint16_t data_off;
>>>>>        uint16_t refcnt;
>>>>>        
>>>>>         uint64_t ol_flags;
>>>>>         ...
>>>>>
>>>>> We can keep buf_len at its place and avoid 2B gap, while making rearm_data
>>>>> 4B long and 4B aligned.
>>>>
>>>> Couple of comments,
>>>> - IMO, It is good if nb_segs can move under rearm_data, as some
>>>> drivers(not in ixgbe may be) can write nb_segs in one shot also
>>>> in segmented rx handler case
>>>> - I think, it makes sense to keep port in mbuf so that application
>>>> can make use of it(Not sure what real application developers think of
>>>> this)
>>>
>>> I agree we could try to remove the port id from mbuf.
>>> These mbuf data are a common base to pass infos between drivers and apps.
>>> If you need to store some data which are read and write from the app only,
>>> you can have use the private zone (see rte_pktmbuf_priv_size).
>>
>> At the first read, I was in favor of keeping the port_id in the
>> mbuf. But after checking the examples and applications, I'm not
>> opposed to remove it. Indeed, this information could go in an
>> application-specific part or it could be an additional function
>> parameter in the application processing function.
>>
>> The same question could be raised for nb_seg: it seems this info
>> is not used a lot, and having a 8 bits value here also prevents from
>> having long chains (ex: for socket buffer in a tcp stack).
>>
>> Just an idea thrown in the air: if these 2 fields are removed, it
>> brings some room for the m->next field to go in the first cache line.
>> This was mentioned several times (at least [1]).
>>
>> [1] http://dpdk.org/ml/archives/dev/2015-June/019182.html
> 
> 
> Can we come to some consensus on this for this release. The original problem was
> mbuf->rearm_data not being natually aligned and the assosiated performacnce
> issues with ARM architecture for non naturally aligned access.
> I believe that is fixing in this patch without any performance degradation on IA.
> I believe that is a good progress. Can we make further mbuff improvements as
> a additional problem to solve.
> 
> Thoughts ?

Changing the mbuf topology is something that should happen as rarely as
possible, so I think we should group all mbuf modifications in one version.

Your issue (mbuf->rearm alignment), the removing of uneeded fields (port
id, maybe nb_segs), and possibly other things should be addressed for
next version (16.11). I'll send a deprecation notice before the 16.07 is
out if there is no opposition.

Regards,
Olivier
  

Patch

diff --git a/drivers/net/fm10k/fm10k_rxtx_vec.c b/drivers/net/fm10k/fm10k_rxtx_vec.c
index 03e4a5c..f3ef1a1 100644
--- a/drivers/net/fm10k/fm10k_rxtx_vec.c
+++ b/drivers/net/fm10k/fm10k_rxtx_vec.c
@@ -314,9 +314,6 @@  fm10k_rxq_rearm(struct fm10k_rx_queue *rxq)
 
 		/* Flush mbuf with pkt template.
 		 * Data to be rearmed is 6 bytes long.
-		 * Though, RX will overwrite ol_flags that are coming next
-		 * anyway. So overwrite whole 8 bytes with one load:
-		 * 6 bytes of rearm_data plus first 2 bytes of ol_flags.
 		 */
 		p0 = (uintptr_t)&mb0->rearm_data;
 		*(uint64_t *)p0 = rxq->mbuf_initializer;
diff --git a/drivers/net/i40e/i40e_rxtx_vec.c b/drivers/net/i40e/i40e_rxtx_vec.c
index f7a62a8..162ce4e 100644
--- a/drivers/net/i40e/i40e_rxtx_vec.c
+++ b/drivers/net/i40e/i40e_rxtx_vec.c
@@ -86,11 +86,8 @@  i40e_rxq_rearm(struct i40e_rx_queue *rxq)
 		mb0 = rxep[0].mbuf;
 		mb1 = rxep[1].mbuf;
 
-		 /* Flush mbuf with pkt template.
+		/* Flush mbuf with pkt template.
 		 * Data to be rearmed is 6 bytes long.
-		 * Though, RX will overwrite ol_flags that are coming next
-		 * anyway. So overwrite whole 8 bytes with one load:
-		 * 6 bytes of rearm_data plus first 2 bytes of ol_flags.
 		 */
 		p0 = (uintptr_t)&mb0->rearm_data;
 		*(uint64_t *)p0 = rxq->mbuf_initializer;
diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec.c b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
index c4d709b..33b378d 100644
--- a/drivers/net/ixgbe/ixgbe_rxtx_vec.c
+++ b/drivers/net/ixgbe/ixgbe_rxtx_vec.c
@@ -89,9 +89,6 @@  ixgbe_rxq_rearm(struct ixgbe_rx_queue *rxq)
 		/*
 		 * Flush mbuf with pkt template.
 		 * Data to be rearmed is 6 bytes long.
-		 * Though, RX will overwrite ol_flags that are coming next
-		 * anyway. So overwrite whole 8 bytes with one load:
-		 * 6 bytes of rearm_data plus first 2 bytes of ol_flags.
 		 */
 		p0 = (uintptr_t)&mb0->rearm_data;
 		*(uint64_t *)p0 = rxq->mbuf_initializer;
diff --git a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
index 2acdfd9..26f61f8 100644
--- a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
+++ b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
@@ -111,11 +111,11 @@  struct rte_kni_fifo {
  */
 struct rte_kni_mbuf {
 	void *buf_addr __attribute__((__aligned__(RTE_CACHE_LINE_SIZE)));
-	char pad0[10];
+	char pad0[8];
 	uint16_t data_off;      /**< Start address of data in segment buffer. */
 	char pad1[2];
 	uint8_t nb_segs;        /**< Number of segments. */
-	char pad4[1];
+	char pad4[3];
 	uint64_t ol_flags;      /**< Offload features. */
 	char pad2[4];
 	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len. */
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 7b92b88..6bc47ed 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -733,10 +733,8 @@  struct rte_mbuf {
 	void *buf_addr;           /**< Virtual address of segment buffer. */
 	phys_addr_t buf_physaddr; /**< Physical address of segment buffer. */
 
-	uint16_t buf_len;         /**< Length of segment buffer. */
-
 	/* next 6 bytes are initialised on RX descriptor rearm */
-	MARKER8 rearm_data;
+	MARKER64 rearm_data;
 	uint16_t data_off;
 
 	/**
@@ -753,6 +751,7 @@  struct rte_mbuf {
 	};
 	uint8_t nb_segs;          /**< Number of segments. */
 	uint8_t port;             /**< Input port. */
+	uint16_t pad;             /**< 2B pad for naturally aligned ol_flags */
 
 	uint64_t ol_flags;        /**< Offload features. */
 
@@ -806,6 +805,7 @@  struct rte_mbuf {
 
 	uint16_t vlan_tci_outer;  /**< Outer VLAN Tag Control Identifier (CPU order) */
 
+	uint16_t buf_len;         /**< Length of segment buffer. */
 	/* second cache line - fields only used in slow path or on TX */
 	MARKER cacheline1 __rte_cache_min_aligned;