[v2] net/mana: use rte_pktmbuf_alloc_bulk for allocating RX WQEs

Message ID 1706577181-27842-1-git-send-email-longli@linuxonhyperv.com (mailing list archive)
State Changes Requested, archived
Delegated to: Ferruh Yigit
Headers
Series [v2] net/mana: use rte_pktmbuf_alloc_bulk for allocating RX WQEs

Checks

Context Check Description
ci/checkpatch warning coding style issues
ci/loongarch-compilation success Compilation OK
ci/loongarch-unit-testing success Unit Testing PASS
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-abi-testing success Testing PASS
ci/intel-Functional success Functional PASS
ci/iol-broadcom-Functional success Functional Testing PASS
ci/iol-sample-apps-testing success Testing PASS
ci/iol-compile-amd64-testing success Testing PASS
ci/iol-unit-amd64-testing success Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-unit-arm64-testing success Testing PASS
ci/iol-compile-arm64-testing success Testing PASS

Commit Message

Long Li Jan. 30, 2024, 1:13 a.m. UTC
  From: Long Li <longli@microsoft.com>

Instead of allocating mbufs one by one during RX, use rte_pktmbuf_alloc_bulk()
to allocate them in a batch.

Signed-off-by: Long Li <longli@microsoft.com>
---
Change in v2:
use rte_calloc_socket() in place of rte_calloc()

 drivers/net/mana/rx.c | 68 ++++++++++++++++++++++++++++---------------
 1 file changed, 44 insertions(+), 24 deletions(-)
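
For context, a minimal sketch (not the driver's code; the helper names are illustrative) contrasting the two allocation patterns this patch trades between, where 'mp' is an initialized mempool and 'count' is the number of RX WQEs to post:

#include <errno.h>
#include <rte_mbuf.h>

/* before: one mempool operation per mbuf */
static int
refill_one_by_one(struct rte_mempool *mp, struct rte_mbuf **mbufs,
		  uint32_t count)
{
	uint32_t i;

	for (i = 0; i < count; i++) {
		mbufs[i] = rte_pktmbuf_alloc(mp);
		if (mbufs[i] == NULL)
			return -ENOMEM;
	}
	return 0;
}

/* after: a single bulk operation fills the whole array or fails */
static int
refill_bulk(struct rte_mempool *mp, struct rte_mbuf **mbufs, uint32_t count)
{
	return rte_pktmbuf_alloc_bulk(mp, mbufs, count);
}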
  

Comments

Ferruh Yigit Jan. 30, 2024, 10:19 a.m. UTC | #1
On 1/30/2024 1:13 AM, longli@linuxonhyperv.com wrote:
> From: Long Li <longli@microsoft.com>
> 
> Instead of allocating mbufs one by one during RX, use rte_pktmbuf_alloc_bulk()
> to allocate them in a batch.
> 
> Signed-off-by: Long Li <longli@microsoft.com>
>

Can you please quantify the performance improvement (as a percentage)?
This clarifies the impact of the modification.

<...>

> @@ -121,19 +115,32 @@ mana_alloc_and_post_rx_wqe(struct mana_rxq *rxq)
>   * Post work requests for a Rx queue.
>   */
>  static int
> -mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq)
> +mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq, uint32_t count)
>  {
>  	int ret;
>  	uint32_t i;
> +	struct rte_mbuf **mbufs;
> +
> +	mbufs = rte_calloc_socket("mana_rx_mbufs", count, sizeof(struct rte_mbuf *),
> +				  0, rxq->mp->socket_id);
> +	if (!mbufs)
> +		return -ENOMEM;
>

'mbufs' is temporary storage for the allocated mbuf pointers; why not
allocate it from the stack instead, which can be faster and easier to manage:
"struct rte_mbuf *mbufs[count]"


> +
> +	ret = rte_pktmbuf_alloc_bulk(rxq->mp, mbufs, count);
> +	if (ret) {
> +		DP_LOG(ERR, "failed to allocate mbufs for RX");
> +		rxq->stats.nombuf += count;
> +		goto fail;
> +	}
>  
>  #ifdef RTE_ARCH_32
>  	rxq->wqe_cnt_to_short_db = 0;
>  #endif
> -	for (i = 0; i < rxq->num_desc; i++) {
> -		ret = mana_alloc_and_post_rx_wqe(rxq);
> +	for (i = 0; i < count; i++) {
> +		ret = mana_post_rx_wqe(rxq, mbufs[i]);
>  		if (ret) {
>  			DP_LOG(ERR, "failed to post RX ret = %d", ret);
> -			return ret;
> +			goto fail;
>

This may leak memory. There are allocated mbufs; if we exit the loop here
and free the 'mbufs' variable, how will the remaining mbufs be freed?
  
Stephen Hemminger Jan. 30, 2024, 4:43 p.m. UTC | #2
On Tue, 30 Jan 2024 10:19:32 +0000
Ferruh Yigit <ferruh.yigit@amd.com> wrote:

> > -mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq)
> > +mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq, uint32_t count)
> >  {
> >  	int ret;
> >  	uint32_t i;
> > +	struct rte_mbuf **mbufs;
> > +
> > +	mbufs = rte_calloc_socket("mana_rx_mbufs", count, sizeof(struct rte_mbuf *),
> > +				  0, rxq->mp->socket_id);
> > +	if (!mbufs)
> > +		return -ENOMEM;
> >  
> 
> 'mbufs' is temporary storage for the allocated mbuf pointers; why not
> allocate it from the stack instead, which can be faster and easier to manage:
> "struct rte_mbuf *mbufs[count]"

That would introduce a variable length array.
VLA's should be removed, they are not supported on Windows and many
security tools flag them. The problem is that it makes the code brittle
if count gets huge.

But certainly regular calloc() or alloca() would work here.
  
Tyler Retzlaff Jan. 30, 2024, 6:05 p.m. UTC | #3
On Tue, Jan 30, 2024 at 08:43:52AM -0800, Stephen Hemminger wrote:
> On Tue, 30 Jan 2024 10:19:32 +0000
> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
> 
> > > -mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq)
> > > +mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq, uint32_t count)
> > >  {
> > >  	int ret;
> > >  	uint32_t i;
> > > +	struct rte_mbuf **mbufs;
> > > +
> > > +	mbufs = rte_calloc_socket("mana_rx_mbufs", count, sizeof(struct rte_mbuf *),
> > > +				  0, rxq->mp->socket_id);
> > > +	if (!mbufs)
> > > +		return -ENOMEM;
> > >  
> > 
> > 'mbufs' is temporary storage for the allocated mbuf pointers; why not
> > allocate it from the stack instead, which can be faster and easier to manage:
> > "struct rte_mbuf *mbufs[count]"
> 
> That would introduce a variable length array.
> VLA's should be removed, they are not supported on Windows and many
> security tools flag them. The problem is that it makes the code brittle
> if count gets huge.

+1

> 
> But certainly regular calloc() or alloca() would work here.
  
Long Li Jan. 30, 2024, 9:30 p.m. UTC | #4
> Can you please quantify the performance improvement (as a percentage)?
> This clarifies the impact of the modification.

I didn't see any meaningful performance improvements in benchmarks. However, this should save CPU cycles and reduce potential locking conflicts in real-world applications.

Using batch allocation was one of the review comments during the initial driver submission, suggested by Stephen Hemminger. I promised to fix it at that time. Sorry it took a while to submit this patch.

> 
> <...>
> 
> > @@ -121,19 +115,32 @@ mana_alloc_and_post_rx_wqe(struct mana_rxq
> *rxq)
> >   * Post work requests for a Rx queue.
> >   */
> >  static int
> > -mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq)
> > +mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq, uint32_t count)
> >  {
> >  	int ret;
> >  	uint32_t i;
> > +	struct rte_mbuf **mbufs;
> > +
> > +	mbufs = rte_calloc_socket("mana_rx_mbufs", count, sizeof(struct
> rte_mbuf *),
> > +				  0, rxq->mp->socket_id);
> > +	if (!mbufs)
> > +		return -ENOMEM;
> >
> 
> 'mbufs' is temporary storage for the allocated mbuf pointers; why not allocate it
> from the stack instead, which can be faster and easier to manage:
> "struct rte_mbuf *mbufs[count]"
> 
> 
> > +
> > +	ret = rte_pktmbuf_alloc_bulk(rxq->mp, mbufs, count);
> > +	if (ret) {
> > +		DP_LOG(ERR, "failed to allocate mbufs for RX");
> > +		rxq->stats.nombuf += count;
> > +		goto fail;
> > +	}
> >
> >  #ifdef RTE_ARCH_32
> >  	rxq->wqe_cnt_to_short_db = 0;
> >  #endif
> > -	for (i = 0; i < rxq->num_desc; i++) {
> > -		ret = mana_alloc_and_post_rx_wqe(rxq);
> > +	for (i = 0; i < count; i++) {
> > +		ret = mana_post_rx_wqe(rxq, mbufs[i]);
> >  		if (ret) {
> >  			DP_LOG(ERR, "failed to post RX ret = %d", ret);
> > -			return ret;
> > +			goto fail;
> >
> 
> This may leak memory. There are allocated mbufs; if we exit the loop here and free
> the 'mbufs' variable, how will the remaining mbufs be freed?

Mbufs are always freed after fail:

fail:
        rte_free(mbufs);

>
  
Ferruh Yigit Jan. 30, 2024, 10:34 p.m. UTC | #5
On 1/30/2024 9:30 PM, Long Li wrote:
>> Can you please quantify the performance improvement (as a percentage)?
>> This clarifies the impact of the modification.
> 
> I didn't see any meaningful performance improvements in benchmarks. However, this should save CPU cycles and reduce potential locking conflicts in real-world applications.
> 
> Using batch allocation was one of the review comments during the initial driver submission, suggested by Stephen Hemminger. I promised to fix it at that time. Sorry it took a while to submit this patch.
> 

That is OK, using bulk alloc is a reasonable approach, but can you please
document the impact (performance increase) in the commit log.

>>
>> <...>
>>
>>> @@ -121,19 +115,32 @@ mana_alloc_and_post_rx_wqe(struct mana_rxq
>> *rxq)
>>>   * Post work requests for a Rx queue.
>>>   */
>>>  static int
>>> -mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq)
>>> +mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq, uint32_t count)
>>>  {
>>>  	int ret;
>>>  	uint32_t i;
>>> +	struct rte_mbuf **mbufs;
>>> +
>>> +	mbufs = rte_calloc_socket("mana_rx_mbufs", count, sizeof(struct
>> rte_mbuf *),
>>> +				  0, rxq->mp->socket_id);
>>> +	if (!mbufs)
>>> +		return -ENOMEM;
>>>
>>
>> 'mbufs' is temporary storage for the allocated mbuf pointers; why not allocate it
>> from the stack instead, which can be faster and easier to manage:
>> "struct rte_mbuf *mbufs[count]"
>>
>>
>>> +
>>> +	ret = rte_pktmbuf_alloc_bulk(rxq->mp, mbufs, count);
>>> +	if (ret) {
>>> +		DP_LOG(ERR, "failed to allocate mbufs for RX");
>>> +		rxq->stats.nombuf += count;
>>> +		goto fail;
>>> +	}
>>>
>>>  #ifdef RTE_ARCH_32
>>>  	rxq->wqe_cnt_to_short_db = 0;
>>>  #endif
>>> -	for (i = 0; i < rxq->num_desc; i++) {
>>> -		ret = mana_alloc_and_post_rx_wqe(rxq);
>>> +	for (i = 0; i < count; i++) {
>>> +		ret = mana_post_rx_wqe(rxq, mbufs[i]);
>>>  		if (ret) {
>>>  			DP_LOG(ERR, "failed to post RX ret = %d", ret);
>>> -			return ret;
>>> +			goto fail;
>>>
>>
>> This may leak memory. There are allocated mbufs; if we exit the loop here and free
>> the 'mbufs' variable, how will the remaining mbufs be freed?
> 
> Mbufs are always freed after fail:
> 
> fail:
>         rte_free(mbufs);
> 

Nope, I am not talking about the 'mbufs' variable, I am talking about
mbuf pointers stored in the 'mbufs' array which are allocated by
'rte_pktmbuf_alloc_bulk()'.
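
One way to close this leak (a sketch only, not necessarily the fix that went into v3) is to free the bulk-allocated mbufs that were never handed to the hardware before taking the error path; the loop below is the one from the patch, with rte_pktmbuf_free_bulk() added on the failure branch:

	for (i = 0; i < count; i++) {
		ret = mana_post_rx_wqe(rxq, mbufs[i]);
		if (ret) {
			DP_LOG(ERR, "failed to post RX ret = %d", ret);
			/* mbufs[i..count-1] were allocated but never posted */
			rte_pktmbuf_free_bulk(&mbufs[i], count - i);
			goto fail;
		}
	}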
  
Long Li Jan. 30, 2024, 10:36 p.m. UTC | #6
> Subject: Re: [Patch v2] net/mana: use rte_pktmbuf_alloc_bulk for allocating RX
> WQEs
> 
> On 1/30/2024 9:30 PM, Long Li wrote:
> >> Can you please quantify the performance improvement (as a percentage)?
> >> This clarifies the impact of the modification.
> >
> > I didn't see any meaningful performance improvements in benchmarks.
> However, this should save CPU cycles and reduce potential locking conflicts in
> real-world applications.
> >
> > Using batch allocation was one of the review comments during the initial driver
> submission, suggested by Stephen Hemminger. I promised to fix it at that time.
> Sorry it took a while to submit this patch.
> >
> 
> That is OK, using bulk alloc is a reasonable approach, but can you please document
> the impact (performance increase) in the commit log.

Will do that.

> 
> >>
> >> <...>
> >>
> >>> @@ -121,19 +115,32 @@ mana_alloc_and_post_rx_wqe(struct mana_rxq
> >> *rxq)
> >>>   * Post work requests for a Rx queue.
> >>>   */
> >>>  static int
> >>> -mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq)
> >>> +mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq, uint32_t count)
> >>>  {
> >>>  	int ret;
> >>>  	uint32_t i;
> >>> +	struct rte_mbuf **mbufs;
> >>> +
> >>> +	mbufs = rte_calloc_socket("mana_rx_mbufs", count, sizeof(struct
> >> rte_mbuf *),
> >>> +				  0, rxq->mp->socket_id);
> >>> +	if (!mbufs)
> >>> +		return -ENOMEM;
> >>>
> >>
> >> 'mbufs' is temporary storage for the allocated mbuf pointers; why not
> >> allocate it from the stack instead, which can be faster and easier to manage:
> >> "struct rte_mbuf *mbufs[count]"
> >>
> >>
> >>> +
> >>> +	ret = rte_pktmbuf_alloc_bulk(rxq->mp, mbufs, count);
> >>> +	if (ret) {
> >>> +		DP_LOG(ERR, "failed to allocate mbufs for RX");
> >>> +		rxq->stats.nombuf += count;
> >>> +		goto fail;
> >>> +	}
> >>>
> >>>  #ifdef RTE_ARCH_32
> >>>  	rxq->wqe_cnt_to_short_db = 0;
> >>>  #endif
> >>> -	for (i = 0; i < rxq->num_desc; i++) {
> >>> -		ret = mana_alloc_and_post_rx_wqe(rxq);
> >>> +	for (i = 0; i < count; i++) {
> >>> +		ret = mana_post_rx_wqe(rxq, mbufs[i]);
> >>>  		if (ret) {
> >>>  			DP_LOG(ERR, "failed to post RX ret = %d", ret);
> >>> -			return ret;
> >>> +			goto fail;
> >>>
> >>
> >> This may leak memory. There are allocated mbufs; if we exit the loop
> >> here and free the 'mbufs' variable, how will the remaining mbufs be freed?
> >
> > Mbufs are always freed after fail:
> >
> > fail:
> >         rte_free(mbufs);
> >
> 
> Nope, I am not talking about the 'mbufs' variable, I am talking about mbuf
> pointers stored in the 'mbufs' array which are allocated by
> 'rte_pktmbuf_alloc_bulk()'.

You are right, I'm sending v3 to fix those.

Long
  
Ferruh Yigit Jan. 30, 2024, 10:42 p.m. UTC | #7
On 1/30/2024 4:43 PM, Stephen Hemminger wrote:
> On Tue, 30 Jan 2024 10:19:32 +0000
> Ferruh Yigit <ferruh.yigit@amd.com> wrote:
> 
>>> -mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq)
>>> +mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq, uint32_t count)
>>>  {
>>>  	int ret;
>>>  	uint32_t i;
>>> +	struct rte_mbuf **mbufs;
>>> +
>>> +	mbufs = rte_calloc_socket("mana_rx_mbufs", count, sizeof(struct rte_mbuf *),
>>> +				  0, rxq->mp->socket_id);
>>> +	if (!mbufs)
>>> +		return -ENOMEM;
>>>  
>>
>> 'mbufs' is temporary storage for the allocated mbuf pointers; why not
>> allocate it from the stack instead, which can be faster and easier to manage:
>> "struct rte_mbuf *mbufs[count]"
> 
> That would introduce a variable length array.
> VLA's should be removed, they are not supported on Windows and many
> security tools flag them. The problem is that it makes the code brittle
> if count gets huge.
> 
> But certainly regular calloc() or alloca() would work here.
>

Most of the existing bulk alloc code already uses VLAs, but I can see the
problem that they are not supported on Windows.

As this mbuf pointer array is short-lived within the function, and is on
the fast path, I think repeated alloc and free can be avoided:

one option is to define a fixed-size, big-enough array, which requires an
additional loop for the cases where 'count' is bigger than the array
size,

or an array can be allocated at driver init in device-specific data, as
we know it will be required continuously in the datapath, and it can be
freed during device close()/uninit().

I think a fixed-size array from the stack is easier and can be preferred.
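
A rough sketch of the "allocated at driver init" option mentioned above (the 'refill_mbufs' field is purely illustrative, not something the driver actually has); the scratch array of mbuf pointers lives next to the queue and is sized once at queue setup, so the datapath never calls rte_calloc_socket()/rte_free():

	/* at RX queue setup, sized for the worst case (num_desc) */
	rxq->refill_mbufs = rte_calloc_socket("mana_rx_refill", rxq->num_desc,
					      sizeof(struct rte_mbuf *), 0,
					      rxq->mp->socket_id);
	if (rxq->refill_mbufs == NULL)
		return -ENOMEM;

	/* in the replenish path, reuse it instead of allocating each time */
	ret = rte_pktmbuf_alloc_bulk(rxq->mp, rxq->refill_mbufs, count);

	/* at device close()/uninit() */
	rte_free(rxq->refill_mbufs);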
  
Long Li Feb. 1, 2024, 3:55 a.m. UTC | #8
> >> 'mbufs' is temporary storage for the allocated mbuf pointers; why not
> >> allocate it from the stack instead, which can be faster and easier to manage:
> >> "struct rte_mbuf *mbufs[count]"
> >
> > That would introduce a variable length array.
> > VLA's should be removed, they are not supported on Windows and many
> > security tools flag them. The problem is that it makes the code
> > brittle if count gets huge.
> >
> > But certainly regular calloc() or alloca() would work here.
> >
> 
> Most of the existing bulk alloc code already uses VLAs, but I can see the problem
> that they are not supported on Windows.
> 
> As this mbuf pointer array is short-lived within the function, and is on the fast
> path, I think repeated alloc and free can be avoided:
> 
> one option is to define a fixed-size, big-enough array, which requires an
> additional loop for the cases where 'count' is bigger than the array size,
> 
> or an array can be allocated at driver init in device-specific data, as we know it
> will be required continuously in the datapath, and it can be freed during device
> close()/uninit().
> 
> I think a fixed-size array from the stack is easier and can be preferred.

I sent a v3 of the patch, still using alloc().

I found two problems with using a fixed array:
1. The array size needs to be determined in advance, and I don't know what a good number should be. If it is too big, some entries may be wasted (and maybe make a bigger mess of the CPU cache); if it is too small, it ends up doing multiple allocations, which is the problem this patch is trying to solve.
2. It makes the code slightly more complex, but I think 1 is the main problem.

I think another approach is to use VLA by default, but for Windows use alloc().

Long
  
Ferruh Yigit Feb. 1, 2024, 10:52 a.m. UTC | #9
On 2/1/2024 3:55 AM, Long Li wrote:
>>>> 'mbufs' is temporary storage for the allocated mbuf pointers; why not
>>>> allocate it from the stack instead, which can be faster and easier to manage:
>>>> "struct rte_mbuf *mbufs[count]"
>>>
>>> That would introduce a variable length array.
>>> VLA's should be removed, they are not supported on Windows and many
>>> security tools flag them. The problem is that it makes the code
>>> brittle if count gets huge.
>>>
>>> But certainly regular calloc() or alloca() would work here.
>>>
>>
>> Most of the existing bulk alloc code already uses VLAs, but I can see the problem
>> that they are not supported on Windows.
>>
>> As this mbuf pointer array is short-lived within the function, and is on the fast
>> path, I think repeated alloc and free can be avoided:
>>
>> one option is to define a fixed-size, big-enough array, which requires an
>> additional loop for the cases where 'count' is bigger than the array size,
>>
>> or an array can be allocated at driver init in device-specific data, as we know it
>> will be required continuously in the datapath, and it can be freed during device
>> close()/uninit().
>>
>> I think a fixed-size array from the stack is easier and can be preferred.
> 
> I sent a v3 of the patch, still using alloc().
> 
> I found two problems with using a fixed array:
> 1. The array size needs to be determined in advance, and I don't know what a good number should be. If it is too big, some entries may be wasted (and maybe make a bigger mess of the CPU cache); if it is too small, it ends up doing multiple allocations, which is the problem this patch is trying to solve.
>

I think the default burst size of 32 can be used like below:

struct rte_mbuf *mbufs[32];

loop: // use do {} while(); if you prefer
n = RTE_MIN(32, count);
rte_pktmbuf_alloc_bulk(rxq->mp, mbufs, n);
for (i = 0; i < n; i++)
	mana_post_rx_wqe(rxq, mbufs[i]);
count -= n;
if (count > 0) goto loop;


This additional loop doesn't make the code very complex (I think no more
than the additional alloc() & free() does) and it doesn't waste memory.
I suggest doing a performance measurement with the above change, as it may
increase performance;
afterwards, if you insist on going with the original code, we can do that.


> 2. It makes the code slightly more complex, but I think 1 is the main problem.
> 
> I think another approach is to use VLA by default, but for Windows use alloc().
> 
> Long
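
For reference, a compilable sketch of the fixed-array approach suggested above. Assumptions: mana_post_rx_wqe(), DP_LOG(), mana_rq_ring_doorbell(), and the rxq fields are as in the patch below; the function name and MANA_RX_REFILL_BURST are illustrative; the RTE_ARCH_32 short-doorbell handling is omitted; this is not necessarily what v4 does:

#define MANA_RX_REFILL_BURST 32	/* illustrative value, the usual burst size */

static int
mana_refill_rx_wqes_batched(struct mana_rxq *rxq, uint32_t count)
{
	struct rte_mbuf *mbufs[MANA_RX_REFILL_BURST];
	uint32_t i, n, posted = 0;
	int ret = 0;

	while (count > 0) {
		n = RTE_MIN(count, (uint32_t)MANA_RX_REFILL_BURST);

		ret = rte_pktmbuf_alloc_bulk(rxq->mp, mbufs, n);
		if (ret) {
			DP_LOG(ERR, "failed to allocate mbufs for RX");
			rxq->stats.nombuf += count;
			break;
		}

		for (i = 0; i < n; i++) {
			ret = mana_post_rx_wqe(rxq, mbufs[i]);
			if (ret) {
				DP_LOG(ERR, "failed to post RX ret = %d", ret);
				/* free the mbufs not handed to hardware */
				rte_pktmbuf_free_bulk(&mbufs[i], n - i);
				goto out;
			}
			posted++;
		}
		count -= n;
	}

out:
	if (posted)
		mana_rq_ring_doorbell(rxq);
	return ret;
}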
  
Tyler Retzlaff Feb. 1, 2024, 4:33 p.m. UTC | #10
On Thu, Feb 01, 2024 at 03:55:55AM +0000, Long Li wrote:
> > >> 'mbufs' is temporary storage for the allocated mbuf pointers; why not
> > >> allocate it from the stack instead, which can be faster and easier to manage:
> > >> "struct rte_mbuf *mbufs[count]"
> > >
> > > That would introduce a variable length array.
> > > VLA's should be removed, they are not supported on Windows and many
> > > security tools flag them. The problem is that it makes the code
> > > brittle if count gets huge.
> > >
> > > But certainly regular calloc() or alloca() would work here.
> > >
> > 
> > Most of the existing bulk alloc code already uses VLAs, but I can see the problem
> > that they are not supported on Windows.
> > 
> > As this mbuf pointer array is short-lived within the function, and is on the fast
> > path, I think repeated alloc and free can be avoided:
> > 
> > one option is to define a fixed-size, big-enough array, which requires an
> > additional loop for the cases where 'count' is bigger than the array size,
> > 
> > or an array can be allocated at driver init in device-specific data, as we know it
> > will be required continuously in the datapath, and it can be freed during device
> > close()/uninit().
> > 
> > I think a fixed-size array from the stack is easier and can be preferred.
> 
> I sent a v3 of the patch, still using alloc().
> 
> I found two problems with using a fixed array:
> 1. The array size needs to be determined in advance, and I don't know what a good number should be. If it is too big, some entries may be wasted (and maybe make a bigger mess of the CPU cache); if it is too small, it ends up doing multiple allocations, which is the problem this patch is trying to solve.
> 2. It makes the code slightly more complex, but I think 1 is the main problem.
> 
> I think another approach is to use VLA by default, but for Windows use alloc().

a few thoughts on VLAs you may consider. not to be regarded as a strong
objection.

indications are that standard C will gradually phase out VLAs because
they're generally accepted as having been a bad idea. that said
compilers that implement them will probably keep them forever.

VLAs generate a lot of code relative to just using a more permanent
allocation. that may not show up in your performance tests, but you also
may not want it on your hotpath either.

mana doesn't currently support windows; are there plans to support
windows? if never, then i suppose VLAs can be used since all the
toolchains you care about have them. though it does raise the bar, cause
more work and a later refactor, and carry regression risk should you
change your mind and choose to port to windows.

accepting the use of VLAs anywhere in dpdk prohibits general checkpatch
checks and/or compiling with compiler options that detect and flag
their inclusion as part of the CI, without having to add exclusion
logic for drivers that are allowed to use them.

> 
> Long
  
Long Li Feb. 2, 2024, 1:21 a.m. UTC | #11
> I think the default burst size of 32 can be used like below:
> 
> struct rte_mbuf *mbufs[32];
> 
> loop: // use do {} while(); if you prefer
> n = RTE_MIN(32, count);
> rte_pktmbuf_alloc_bulk(rxq->mp, mbufs, n);
> for (i = 0; i < n; i++)
> 	mana_post_rx_wqe(rxq, mbufs[i]);
> count -= n;
> if (count > 0) goto loop;
> 
> 
> This additional loop doesn't make the code very complex (I think no more than
> the additional alloc() & free() does) and it doesn't waste memory.
> I suggest doing a performance measurement with the above change, as it may increase
> performance; afterwards, if you insist on going with the original code, we can do that.
> 

I submitted v4 with your suggestions. The code doesn't end up looking very messy. I measured the same performance with and without the patch.

Thanks,

Long
  
Long Li Feb. 2, 2024, 1:22 a.m. UTC | #12
> > I think another approach is to use VLA by default, but for Windows use alloc().
> 
> a few thoughts on VLAs you may consider. not to be regarded as a strong
> objection.
> 
> indications are that standard C will gradually phase out VLAs because they're
> generally accepted as having been a bad idea. that said compilers that implement
> them will probably keep them forever.
> 
> VLAs generate a lot of code relative to just using a more permanent allocation.
> that may not show up in your performance tests, but you also may not want it on
> your hotpath either.
> 
> mana doesn't currently support windows; are there plans to support windows? if
> never, then i suppose VLAs can be used since all the toolchains you care about
> have them. though it does raise the bar, cause more work and a later refactor, and
> carry regression risk should you change your mind and choose to port to windows.
> 
> accepting the use of VLAs anywhere in dpdk prohibits general checkpatch checks
> and/or compiling with compiler options that detect and flag their inclusion as
> part of the CI, without having to add exclusion logic for drivers that are allowed
> to use them.
> 

I agree we need to keep the code consistent. I submitted v4 using a fixed array.

Thanks,

Long
  

Patch

diff --git a/drivers/net/mana/rx.c b/drivers/net/mana/rx.c
index acad5e26cd..b011bf3ea1 100644
--- a/drivers/net/mana/rx.c
+++ b/drivers/net/mana/rx.c
@@ -2,6 +2,7 @@ 
  * Copyright 2022 Microsoft Corporation
  */
 #include <ethdev_driver.h>
+#include <rte_malloc.h>
 
 #include <infiniband/verbs.h>
 #include <infiniband/manadv.h>
@@ -59,9 +60,8 @@  mana_rq_ring_doorbell(struct mana_rxq *rxq)
 }
 
 static int
-mana_alloc_and_post_rx_wqe(struct mana_rxq *rxq)
+mana_post_rx_wqe(struct mana_rxq *rxq, struct rte_mbuf *mbuf)
 {
-	struct rte_mbuf *mbuf = NULL;
 	struct gdma_sgl_element sgl[1];
 	struct gdma_work_request request;
 	uint32_t wqe_size_in_bu;
@@ -69,12 +69,6 @@  mana_alloc_and_post_rx_wqe(struct mana_rxq *rxq)
 	int ret;
 	struct mana_mr_cache *mr;
 
-	mbuf = rte_pktmbuf_alloc(rxq->mp);
-	if (!mbuf) {
-		rxq->stats.nombuf++;
-		return -ENOMEM;
-	}
-
 	mr = mana_alloc_pmd_mr(&rxq->mr_btree, priv, mbuf);
 	if (!mr) {
 		DP_LOG(ERR, "failed to register RX MR");
@@ -121,19 +115,32 @@  mana_alloc_and_post_rx_wqe(struct mana_rxq *rxq)
  * Post work requests for a Rx queue.
  */
 static int
-mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq)
+mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq, uint32_t count)
 {
 	int ret;
 	uint32_t i;
+	struct rte_mbuf **mbufs;
+
+	mbufs = rte_calloc_socket("mana_rx_mbufs", count, sizeof(struct rte_mbuf *),
+				  0, rxq->mp->socket_id);
+	if (!mbufs)
+		return -ENOMEM;
+
+	ret = rte_pktmbuf_alloc_bulk(rxq->mp, mbufs, count);
+	if (ret) {
+		DP_LOG(ERR, "failed to allocate mbufs for RX");
+		rxq->stats.nombuf += count;
+		goto fail;
+	}
 
 #ifdef RTE_ARCH_32
 	rxq->wqe_cnt_to_short_db = 0;
 #endif
-	for (i = 0; i < rxq->num_desc; i++) {
-		ret = mana_alloc_and_post_rx_wqe(rxq);
+	for (i = 0; i < count; i++) {
+		ret = mana_post_rx_wqe(rxq, mbufs[i]);
 		if (ret) {
 			DP_LOG(ERR, "failed to post RX ret = %d", ret);
-			return ret;
+			goto fail;
 		}
 
 #ifdef RTE_ARCH_32
@@ -146,6 +153,8 @@  mana_alloc_and_post_rx_wqes(struct mana_rxq *rxq)
 
 	mana_rq_ring_doorbell(rxq);
 
+fail:
+	rte_free(mbufs);
 	return ret;
 }
 
@@ -404,7 +413,9 @@  mana_start_rx_queues(struct rte_eth_dev *dev)
 	}
 
 	for (i = 0; i < priv->num_queues; i++) {
-		ret = mana_alloc_and_post_rx_wqes(dev->data->rx_queues[i]);
+		struct mana_rxq *rxq = dev->data->rx_queues[i];
+
+		ret = mana_alloc_and_post_rx_wqes(rxq, rxq->num_desc);
 		if (ret)
 			goto fail;
 	}
@@ -423,7 +434,7 @@  uint16_t
 mana_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 {
 	uint16_t pkt_received = 0;
-	uint16_t wqe_posted = 0;
+	uint16_t wqe_consumed = 0;
 	struct mana_rxq *rxq = dpdk_rxq;
 	struct mana_priv *priv = rxq->priv;
 	struct rte_mbuf *mbuf;
@@ -535,18 +546,23 @@  mana_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 
 		rxq->gdma_rq.tail += desc->wqe_size_in_bu;
 
-		/* Consume this request and post another request */
-		ret = mana_alloc_and_post_rx_wqe(rxq);
-		if (ret) {
-			DP_LOG(ERR, "failed to post rx wqe ret=%d", ret);
-			break;
-		}
-
-		wqe_posted++;
+		/* Record the number of the RX WQE we need to post to replenish
+		 * consumed RX requests
+		 */
+		wqe_consumed++;
 		if (pkt_received == pkts_n)
 			break;
 
 #ifdef RTE_ARCH_32
+		/* Always post WQE as soon as it's consumed for short DB */
+		ret = mana_alloc_and_post_rx_wqes(rxq, wqe_consumed);
+		if (ret) {
+			DRV_LOG(ERR, "failed to post %d WQEs, ret %d",
+				wqe_consumed, ret);
+			return pkt_received;
+		}
+		wqe_consumed = 0;
+
 		/* Ring short doorbell if approaching the wqe increment
 		 * limit.
 		 */
@@ -569,8 +585,12 @@  mana_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 		goto repoll;
 	}
 
-	if (wqe_posted)
-		mana_rq_ring_doorbell(rxq);
+	if (wqe_consumed) {
+		ret = mana_alloc_and_post_rx_wqes(rxq, wqe_consumed);
+		if (ret)
+			DRV_LOG(ERR, "failed to post %d WQEs, ret %d",
+				wqe_consumed, ret);
+	}
 
 	return pkt_received;
 }