[RFC] eventdev: ensure 16-byte alignment for events
Checks
Commit Message
The event structure in DPDK is 16-bytes in size, and events are
regularly passed as parameters directly rather than being passed as
pointers. To help compiler optimize correctly, we can explicitly request
16-byte alignment for events, which means that we should be able
to do aligned vector loads/stores (e.g. with SSE or Neon) when working
with those events.
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
lib/eventdev/rte_eventdev.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Comments
On Thu, Oct 05, 2023 at 12:51:00PM +0100, Bruce Richardson wrote:
> The event structure in DPDK is 16-bytes in size, and events are
> regularly passed as parameters directly rather than being passed as
> pointers. To help compiler optimize correctly, we can explicitly request
> 16-byte alignment for events, which means that we should be able
> to do aligned vector loads/stores (e.g. with SSE or Neon) when working
> with those events.
>
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> ---
> lib/eventdev/rte_eventdev.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
> index 2ba8a7b090..bb0d59b059 100644
> --- a/lib/eventdev/rte_eventdev.h
> +++ b/lib/eventdev/rte_eventdev.h
> @@ -1344,7 +1344,7 @@ struct rte_event {
> struct rte_event_vector *vec;
> /**< Event vector pointer. */
> };
> -};
> +} __rte_aligned(16);
>
Looking for feedback on this idea - hence the fact this is going as an RFC.
It seems to me something that should be done for performance reasons, but
I'm not sure if there are any negative consequences of doing this.
Since this is an ABI-affecting change, a decision on this needs to be made
for 23.11, or else it will be locked in for at least another year. Hence me
sending it now as an RFC late in the release cycle, rather than deferring
to next release.
/Bruce
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Thursday, 5 October 2023 13.51
ure 16-byte alignment for events
>
> The event structure in DPDK is 16-bytes in size, and events are
> regularly passed as parameters directly rather than being passed as
> pointers. To help compiler optimize correctly, we can explicitly request
> 16-byte alignment for events, which means that we should be able
> to do aligned vector loads/stores (e.g. with SSE or Neon) when working
> with those events.
>
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> ---
AFAICS it is not specified anywhere that the size of this struct must be 16 byte.
Consider adding this requirement to the struct documentation, and possibly a static_assert to verify.
Acked-by: Morten Brørup <mb@smartsharesystems.com>
On Thu, Oct 05, 2023 at 02:12:10PM +0200, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Thursday, 5 October 2023 13.51
> ure 16-byte alignment for events
> >
> > The event structure in DPDK is 16-bytes in size, and events are
> > regularly passed as parameters directly rather than being passed as
> > pointers. To help compiler optimize correctly, we can explicitly request
> > 16-byte alignment for events, which means that we should be able
> > to do aligned vector loads/stores (e.g. with SSE or Neon) when working
> > with those events.
> >
> > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > ---
>
> AFAICS it is not specified anywhere that the size of this struct must be 16 byte.
>
> Consider adding this requirement to the struct documentation, and possibly a static_assert to verify.
>
Yes, good point to document and verify.
On Thu, Oct 5, 2023 at 6:01 PM Bruce Richardson
<bruce.richardson@intel.com> wrote:
>
> On Thu, Oct 05, 2023 at 12:51:00PM +0100, Bruce Richardson wrote:
> > The event structure in DPDK is 16-bytes in size, and events are
> > regularly passed as parameters directly rather than being passed as
> > pointers. To help compiler optimize correctly, we can explicitly request
> > 16-byte alignment for events, which means that we should be able
> > to do aligned vector loads/stores (e.g. with SSE or Neon) when working
> > with those events.
> >
> > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > ---
> > lib/eventdev/rte_eventdev.h | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
> > index 2ba8a7b090..bb0d59b059 100644
> > --- a/lib/eventdev/rte_eventdev.h
> > +++ b/lib/eventdev/rte_eventdev.h
> > @@ -1344,7 +1344,7 @@ struct rte_event {
> > struct rte_event_vector *vec;
> > /**< Event vector pointer. */
> > };
> > -};
> > +} __rte_aligned(16);
> >
>
+ Eventdev driver maintainers for review and for performance testing.
> Looking for feedback on this idea - hence the fact this is going as an RFC.
Are you seeing any performance improvement ? Look like only DLB2
driver only using SEE or AVX512 instructions.
> It seems to me something that should be done for performance reasons, but
> I'm not sure if there are any negative consequences of doing this.
In general, it looks OK, However, We may need more testing.
I can only speculate the following as of now, Since event memory is
allocated from stack most case,
there may stack pointer fix up in code for desired alignment ie. some
add/sub instructions which comes for free most likely due to pipeline.
We will test on Marvel HW. Request others to test on their HWs.
>
> Since this is an ABI-affecting change, a decision on this needs to be made
> for 23.11, or else it will be locked in for at least another year. Hence me
> sending it now as an RFC late in the release cycle, rather than deferring
> to next release.
>
> /Bruce
On Thu, Oct 05, 2023 at 06:41:34PM +0530, Jerin Jacob wrote:
> On Thu, Oct 5, 2023 at 6:01 PM Bruce Richardson
> <bruce.richardson@intel.com> wrote:
> >
> > On Thu, Oct 05, 2023 at 12:51:00PM +0100, Bruce Richardson wrote:
> > > The event structure in DPDK is 16-bytes in size, and events are
> > > regularly passed as parameters directly rather than being passed as
> > > pointers. To help compiler optimize correctly, we can explicitly request
> > > 16-byte alignment for events, which means that we should be able
> > > to do aligned vector loads/stores (e.g. with SSE or Neon) when working
> > > with those events.
> > >
> > > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > > ---
> > > lib/eventdev/rte_eventdev.h | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
> > > index 2ba8a7b090..bb0d59b059 100644
> > > --- a/lib/eventdev/rte_eventdev.h
> > > +++ b/lib/eventdev/rte_eventdev.h
> > > @@ -1344,7 +1344,7 @@ struct rte_event {
> > > struct rte_event_vector *vec;
> > > /**< Event vector pointer. */
> > > };
> > > -};
> > > +} __rte_aligned(16);
> > >
> >
>
> + Eventdev driver maintainers for review and for performance testing.
>
> > Looking for feedback on this idea - hence the fact this is going as an RFC.
>
> Are you seeing any performance improvement ? Look like only DLB2
> driver only using SEE or AVX512 instructions.
>
The idea would be that the driver code (and eventdev code) should not need
to use SSE directly. If we mark the event struct as aligned, it should help
encourage the compiler to use these instructions under-the-hood. For
example, when copying an event, the compiler should be emitting 128-bit
loads and stores for most platforms.
/Bruce
On Thu, Oct 5, 2023 at 6:52 PM Bruce Richardson
<bruce.richardson@intel.com> wrote:
>
> On Thu, Oct 05, 2023 at 06:41:34PM +0530, Jerin Jacob wrote:
> > On Thu, Oct 5, 2023 at 6:01 PM Bruce Richardson
> > <bruce.richardson@intel.com> wrote:
> > >
> > > On Thu, Oct 05, 2023 at 12:51:00PM +0100, Bruce Richardson wrote:
> > > > The event structure in DPDK is 16-bytes in size, and events are
> > > > regularly passed as parameters directly rather than being passed as
> > > > pointers. To help compiler optimize correctly, we can explicitly request
> > > > 16-byte alignment for events, which means that we should be able
> > > > to do aligned vector loads/stores (e.g. with SSE or Neon) when working
> > > > with those events.
> > > >
> > > > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > > > ---
> > > > lib/eventdev/rte_eventdev.h | 2 +-
> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
> > > > index 2ba8a7b090..bb0d59b059 100644
> > > > --- a/lib/eventdev/rte_eventdev.h
> > > > +++ b/lib/eventdev/rte_eventdev.h
> > > > @@ -1344,7 +1344,7 @@ struct rte_event {
> > > > struct rte_event_vector *vec;
> > > > /**< Event vector pointer. */
> > > > };
> > > > -};
> > > > +} __rte_aligned(16);
> > > >
> > >
> >
> > + Eventdev driver maintainers for review and for performance testing.
> >
> > > Looking for feedback on this idea - hence the fact this is going as an RFC.
> >
> > Are you seeing any performance improvement ? Look like only DLB2
> > driver only using SEE or AVX512 instructions.
> >
>
> The idea would be that the driver code (and eventdev code) should not need
> to use SSE directly. If we mark the event struct as aligned, it should help
> encourage the compiler to use these instructions under-the-hood. For
> example, when copying an event, the compiler should be emitting 128-bit
> loads and stores for most platforms.
With limited testing, there is no performance regression is seen. In
fact, the compiler is generating the
same instruction stream on both cases.
If there are no objections from others, Please send v1 with "ABI
change" update in doc/guides/rel_notes/release_23_11.rst.
With above change,
Acked-by: Jerin Jacob <jerinj@marvell.com>
NOTE: I already made PR to Thomas for rc1. Since is needs to part of
rc1, I need to sync with @Thomas Monjalon as well.
>
> /Bruce
On 2023-10-05 13:51, Bruce Richardson wrote:
> The event structure in DPDK is 16-bytes in size, and events are
> regularly passed as parameters directly rather than being passed as
> pointers.
When are events passed by-value, rather than by-reference? There are no
such examples in the public eventdev API.
To help compiler optimize correctly, we can explicitly request
> 16-byte alignment for events, which means that we should be able
> to do aligned vector loads/stores (e.g. with SSE or Neon) when working
> with those events.
>
That change is both helping and sabotaging the optimizer's work. Now
every stack allocation needs to be 2-byte aligned - in DPDK code, and in
the application.
The effect this change has on an eventdev app using DSW is a ~3
cycle/event performance degradation on an AMD Zen 3 system, and a ~4
cycle/event performance degradation on a Skylake-generation Intel CPU.
What scenarios do you have in mind, where this change would improve the
generated code? Something where there are no unaligned loads available
in the ISA, or they are much slower than their aligned counterparts?
When I looked into the same issue for the DPDK IP checksumming routines,
there basically were no such. Not that I could find.
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> ---
> lib/eventdev/rte_eventdev.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
> index 2ba8a7b090..bb0d59b059 100644
> --- a/lib/eventdev/rte_eventdev.h
> +++ b/lib/eventdev/rte_eventdev.h
> @@ -1344,7 +1344,7 @@ struct rte_event {
> struct rte_event_vector *vec;
> /**< Event vector pointer. */
> };
> -};
> +} __rte_aligned(16);
>
> /* Ethdev Rx adapter capability bitmap flags */
> #define RTE_EVENT_ETH_RX_ADAPTER_CAP_INTERNAL_PORT 0x1
On Fri, Oct 06, 2023 at 02:15:00PM +0200, Mattias Rönnblom wrote:
> On 2023-10-05 13:51, Bruce Richardson wrote:
> > The event structure in DPDK is 16-bytes in size, and events are
> > regularly passed as parameters directly rather than being passed as
> > pointers.
>
> When are events passed by-value, rather than by-reference? There are no such
> examples in the public eventdev API.
>
> To help compiler optimize correctly, we can explicitly request
> > 16-byte alignment for events, which means that we should be able
> > to do aligned vector loads/stores (e.g. with SSE or Neon) when working
> > with those events.
> >
>
> That change is both helping and sabotaging the optimizer's work. Now every
> stack allocation needs to be 2-byte aligned - in DPDK code, and in the
> application.
>
> The effect this change has on an eventdev app using DSW is a ~3 cycle/event
> performance degradation on an AMD Zen 3 system, and a ~4 cycle/event
> performance degradation on a Skylake-generation Intel CPU.
>
Thanks for checking - this is the sort of feedback needed alright. In SW
eventdev we copy events around alot without using pointers, so I felt that
alignment would be helpful to avoid issues with events spanning cachelines.
However, since it has negative impacts, I'm quite happy to drop the idea,
and keep things as they are. I'll mark the change as rejected in patchwork.
/Bruce
On 2023-10-06 14:19, Bruce Richardson wrote:
> On Fri, Oct 06, 2023 at 02:15:00PM +0200, Mattias Rönnblom wrote:
>> On 2023-10-05 13:51, Bruce Richardson wrote:
>>> The event structure in DPDK is 16-bytes in size, and events are
>>> regularly passed as parameters directly rather than being passed as
>>> pointers.
>>
>> When are events passed by-value, rather than by-reference? There are no such
>> examples in the public eventdev API.
>>
>> To help compiler optimize correctly, we can explicitly request
>>> 16-byte alignment for events, which means that we should be able
>>> to do aligned vector loads/stores (e.g. with SSE or Neon) when working
>>> with those events.
>>>
>>
>> That change is both helping and sabotaging the optimizer's work. Now every
>> stack allocation needs to be 2-byte aligned - in DPDK code, and in the
>> application. >>
>> The effect this change has on an eventdev app using DSW is a ~3 cycle/event
>> performance degradation on an AMD Zen 3 system, and a ~4 cycle/event
>> performance degradation on a Skylake-generation Intel CPU.
>>
>
> Thanks for checking - this is the sort of feedback needed alright. In SW
> eventdev we copy events around alot without using pointers, so I felt that
> alignment would be helpful to avoid issues with events spanning cachelines.
>
> However, since it has negative impacts, I'm quite happy to drop the idea,
> and keep things as they are. I'll mark the change as rejected in patchwork.
>
I think this was an excellent idea, it just didn't happen to have the
desired effect. At least not in the particular cases where I evaluated it.
Given the complexity of compilers and CPUs, it's almost impossible to
predict the outcome, performance-wise, of a particular change.
On Fri, 6 Oct 2023 14:15:00 +0200
Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> On 2023-10-05 13:51, Bruce Richardson wrote:
> > The event structure in DPDK is 16-bytes in size, and events are
> > regularly passed as parameters directly rather than being passed as
> > pointers.
>
> When are events passed by-value, rather than by-reference? There are no
> such examples in the public eventdev API.
>
> To help compiler optimize correctly, we can explicitly request
> > 16-byte alignment for events, which means that we should be able
> > to do aligned vector loads/stores (e.g. with SSE or Neon) when working
> > with those events.
> >
>
> That change is both helping and sabotaging the optimizer's work. Now
> every stack allocation needs to be 2-byte aligned - in DPDK code, and in
> the application.
>
> The effect this change has on an eventdev app using DSW is a ~3
> cycle/event performance degradation on an AMD Zen 3 system, and a ~4
> cycle/event performance degradation on a Skylake-generation Intel CPU.
>
> What scenarios do you have in mind, where this change would improve the
> generated code? Something where there are no unaligned loads available
> in the ISA, or they are much slower than their aligned counterparts?
>
> When I looked into the same issue for the DPDK IP checksumming routines,
> there basically were no such. Not that I could find.
>
> > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Don't understand what the issue is. Isn't the event structure typically
on stack. Adding padding there would not help.
@@ -1344,7 +1344,7 @@ struct rte_event {
struct rte_event_vector *vec;
/**< Event vector pointer. */
};
-};
+} __rte_aligned(16);
/* Ethdev Rx adapter capability bitmap flags */
#define RTE_EVENT_ETH_RX_ADAPTER_CAP_INTERNAL_PORT 0x1