[v1,0/5] Direct re-arming of buffers on receive side

Message ID 20220420081650.2043183-1-feifei.wang2@arm.com (mailing list archive)

Message

Feifei Wang April 20, 2022, 8:16 a.m. UTC
  Currently, the transmit side frees the buffers into the lcore cache and
the receive side allocates buffers from the lcore cache. The transmit
side typically frees 32 buffers, resulting in 32*8=256B of stores to the
lcore cache. The receive side allocates 32 buffers and stores them in
the receive side software ring, resulting in 32*8=256B of stores and
256B of loads from the lcore cache.

This patch proposes a mechanism to avoid freeing to / allocating from
the lcore cache: the receive side frees the buffers from the transmit
side directly into its own software ring. This avoids the 256B
of loads and stores introduced by the lcore cache. It also frees up the
cache lines used by the lcore cache.
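
In rough C terms, the rearm path would then look something like the sketch
below. This is purely illustrative: index wrap-around and the refcnt /
FAST_FREE checks on the Tx side are omitted, and everything except the
mempool call is a placeholder name, not the actual driver code.

#include <rte_mbuf.h>
#include <rte_mempool.h>

#define REARM_THRESH 32

/* Sketch: rearm REARM_THRESH RX slots either from the TX SW ring
 * (direct rearm) or from the mempool (normal path). */
static int
rearm_rx_ring(struct rte_mbuf **rx_ring, uint16_t rearm_start,
	      struct rte_mbuf **tx_ring, uint16_t tx_free_start,
	      uint16_t nb_tx_free, struct rte_mempool *mp)
{
	if (nb_tx_free >= REARM_THRESH) {
		/* Fast path: hand the buffers just freed by Tx straight to
		 * Rx, no stores to / loads from the lcore mempool cache. */
		for (int i = 0; i < REARM_THRESH; i++)
			rx_ring[rearm_start + i] = tx_ring[tx_free_start + i];
		return 0;
	}
	/* Fallback: allocate from the mempool as usual. */
	return rte_mempool_get_bulk(mp, (void **)&rx_ring[rearm_start],
				    REARM_THRESH);
}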

However, this solution poses several constraints:

1)The receive queue needs to know which transmit queue it should take
the buffers from. The application logic decides which transmit port to
use to send out the packets. In many use cases the NIC might have a
single port ([1], [2], [3]), in which case a given transmit queue is
always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
is easy to configure.

If the NIC has 2 ports (several such NICs exist), then we will have a
1:2 (RX queue: TX queue) mapping, which is still easy to configure.
However, if this is generalized to 'N' ports, the configuration can
become lengthy. Moreover, the PMD would have to scan a list of transmit
queues to pull the buffers from.

2)The other factor that needs to be considered is the 'run-to-completion'
vs 'pipeline' model. In the run-to-completion model, the receive side and
the transmit side run serially on the same lcore. In the pipeline
model, the receive side and transmit side might run on different
lcores in parallel. This would require locking and is not supported at
this point.

3)Tx and Rx buffers must come from the same mempool. We must also ensure
that the number of Tx buffers freed equals the number of Rx buffers to be
rearmed:
(txq->tx_rs_thresh == RTE_I40E_RXQ_REARM_THRESH)
Thus 'tx_next_dd' can be updated correctly in direct-rearm mode, since
tx_next_dd is the variable used to compute the Tx sw-ring free location;
its value always stays one free-threshold ahead of the position where the
next free will start.
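
Concretely, constraint 3 means the Tx queue must be configured so that its
RS threshold equals the Rx rearm threshold. A minimal setup sketch, assuming
RTE_I40E_RXQ_REARM_THRESH is 32 and that port_id, queue_id, nb_txd and
socket_id are set up elsewhere:

#include <rte_ethdev.h>

struct rte_eth_dev_info dev_info;
struct rte_eth_txconf txconf;
int ret;

ret = rte_eth_dev_info_get(port_id, &dev_info);
/* Start from the PMD defaults, then pin the RS threshold. */
txconf = dev_info.default_txconf;
txconf.tx_rs_thresh = 32;	/* assumed value of RTE_I40E_RXQ_REARM_THRESH */
ret = rte_eth_tx_queue_setup(port_id, queue_id, nb_txd, socket_id, &txconf);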

Current status in this RFC:
1)An API is added to allow mapping a TX queue to an RX queue (a usage
  sketch follows this list). Currently it supports 1:1 mapping.
2)The i40e driver is changed to do the direct re-arm of the receive
  side.
3)The L3fwd application is modified to set up the direct rearm mapping
automatically, without user configuration. It follows the rule that a
thread maps a TX queue to an RX queue based on the destination port of
the first received packet.
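
For reference, the intended application-level usage is a one-time call at
setup time, roughly as below. The function name and signature are my reading
of the ethdev patch in this series and should be treated as an assumption;
the point is only that one RX queue is told which TX queue to pull buffers
from.

/* Name and signature assumed from this series' ethdev patch. */
uint16_t rx_port = 0, rx_queue = 0;
uint16_t tx_port = 0, tx_queue = 0;	/* l3fwd derives these from the first
					 * received packet's destination port */
int ret;

ret = rte_eth_direct_rxrearm_map(rx_port, rx_queue, tx_port, tx_queue);
if (ret != 0)
	printf("direct rearm mapping not enabled for rxq %u\n", rx_queue);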

Testing status:
1. The testing results for L3fwd are as follows:
-------------------------------------------------------------------
enabled direct rearm
-------------------------------------------------------------------
Arm:
N1SDP(neon path):
without fast-free mode		with fast-free mode
	+14.1%				+7.0%

Ampere Altra(neon path):
without fast-free mode		with fast-free mode
	+17.1%				+14.0%

X86:
Dell-8268(limit frequency):
sse path:
without fast-free mode		with fast-free mode
	+6.96%				+2.02%
avx2 path:
without fast-free mode		with fast-free mode
	+9.04%				+7.75%
avx512 path:
without fast-free mode		with fast-free mode
	+5.43%				+1.57%
-------------------------------------------------------------------
This patch does not affect the base performance of normal mode.
Furthermore, the reason for limiting the CPU frequency is that
the Dell-8268 hits the i40e NIC bottleneck when running at maximum
frequency.

2. The testing results for VPP-L3fwd are as follows:
-------------------------------------------------------------------
Arm:
N1SDP(neon path):
with direct re-arm mode enabled
	+7.0%
-------------------------------------------------------------------
For Ampere Altra and x86, the VPP-L3fwd test has not been done.

Reference:
[1] https://store.nvidia.com/en-us/networking/store/product/MCX623105AN-CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECryptoDisabled/
[2] https://www.intel.com/content/www/us/en/products/sku/192561/intel-ethernet-network-adapter-e810cqda1/specifications.html
[3] https://www.broadcom.com/products/ethernet-connectivity/network-adapters/100gb-nic-ocp/n1100g

Feifei Wang (5):
  net/i40e: remove redundant Dtype initialization
  net/i40e: enable direct rearm mode
  ethdev: add API for direct rearm mode
  net/i40e: add direct rearm mode internal API
  examples/l3fwd: enable direct rearm mode

 drivers/net/i40e/i40e_ethdev.c          |  34 +++
 drivers/net/i40e/i40e_rxtx.c            |   4 -
 drivers/net/i40e/i40e_rxtx.h            |   4 +
 drivers/net/i40e/i40e_rxtx_common_avx.h | 269 ++++++++++++++++++++++++
 drivers/net/i40e/i40e_rxtx_vec_avx2.c   |  14 +-
 drivers/net/i40e/i40e_rxtx_vec_avx512.c | 249 +++++++++++++++++++++-
 drivers/net/i40e/i40e_rxtx_vec_neon.c   | 141 ++++++++++++-
 drivers/net/i40e/i40e_rxtx_vec_sse.c    | 170 ++++++++++++++-
 examples/l3fwd/l3fwd_lpm.c              |  16 +-
 lib/ethdev/ethdev_driver.h              |  15 ++
 lib/ethdev/rte_ethdev.c                 |  14 ++
 lib/ethdev/rte_ethdev.h                 |  31 +++
 lib/ethdev/version.map                  |   1 +
 13 files changed, 949 insertions(+), 13 deletions(-)
  

Comments

Konstantin Ananyev May 11, 2022, 11 p.m. UTC | #1
> Currently, the transmit side frees the buffers into the lcore cache and
> the receive side allocates buffers from the lcore cache. The transmit
> side typically frees 32 buffers resulting in 32*8=256B of stores to
> lcore cache. The receive side allocates 32 buffers and stores them in
> the receive side software ring, resulting in 32*8=256B of stores and
> 256B of load from the lcore cache.
> 
> This patch proposes a mechanism to avoid freeing to/allocating from
> the lcore cache. i.e. the receive side will free the buffers from
> transmit side directly into it's software ring. This will avoid the 256B
> of loads and stores introduced by the lcore cache. It also frees up the
> cache lines used by the lcore cache.
> 
> However, this solution poses several constraints:
> 
> 1)The receive queue needs to know which transmit queue it should take
> the buffers from. The application logic decides which transmit port to
> use to send out the packets. In many use cases the NIC might have a
> single port ([1], [2], [3]), in which case a given transmit queue is
> always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
> is easy to configure.
> 
> If the NIC has 2 ports (there are several references), then we will have
> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> However, if this is generalized to 'N' ports, the configuration can be
> long. More over the PMD would have to scan a list of transmit queues to
> pull the buffers from.

Just to re-iterate some generic concerns about this proposal:
  - We effectively link RX and TX queues - when this feature is enabled,
    the user can't stop a TX queue without stopping the linked RX queue
    first. Right now the user is free to start/stop any queue at will.
    If this feature allows linking queues from different ports,
    then even the ports become dependent and the user will have to take
    extra care when managing such ports.
  - Very limited usage scenario - it will have a positive effect only
    when we have a fixed forwarding mapping: all (or nearly all) packets
    from the RX queue are forwarded into the same TX queue.

I wonder, did you have a chance to consider a mempool-cache ZC API,
similar to the one we have for the ring?
It would allow us, on the TX free path, to avoid copying mbufs to a
temporary array on the stack.
Instead we could put them straight from the TX SW ring into the mempool cache.
That should save an extra store/load per mbuf and might help to achieve
some performance gain without bypassing the mempool.
It probably wouldn't be as fast as what you are proposing,
but it might be fast enough to consider as an alternative.
Again, it would be a generic one, so we can avoid all
these implications and limitations.


> 2)The other factor that needs to be considered is 'run-to-completion' vs
> 'pipeline' models. In the run-to-completion model, the receive side and
> the transmit side are running on the same lcore serially. In the pipeline
> model. The receive side and transmit side might be running on different
> lcores in parallel. This requires locking. This is not supported at this
> point.
> 
> 3)Tx and Rx buffers must be from the same mempool. And we also must
> ensure Tx buffer free number is equal to Rx buffer free number:
> (txq->tx_rs_thresh == RTE_I40E_RXQ_REARM_THRESH)
> Thus, 'tx_next_dd' can be updated correctly in direct-rearm mode. This
> is due to tx_next_dd is a variable to compute tx sw-ring free location.
> Its value will be one more round than the position where next time free
> starts.
> 
> Current status in this RFC:
> 1)An API is added to allow for mapping a TX queue to a RX queue.
>    Currently it supports 1:1 mapping.
> 2)The i40e driver is changed to do the direct re-arm of the receive
>    side.
> 3)L3fwd application is modified to do the direct rearm mapping
> automatically without user config. This follows the rules that the
> thread can map TX queue to a RX queue based on the first received
> package destination port.
> 
> Testing status:
> 1.The testing results for L3fwd are as follows:
> -------------------------------------------------------------------
> enabled direct rearm
> -------------------------------------------------------------------
> Arm:
> N1SDP(neon path):
> without fast-free mode		with fast-free mode
> 	+14.1%				+7.0%
> 
> Ampere Altra(neon path):
> without fast-free mode		with fast-free mode
> 	+17.1				+14.0%
> 
> X86:
> Dell-8268(limit frequency):
> sse path:
> without fast-free mode		with fast-free mode
> 	+6.96%				+2.02%
> avx2 path:
> without fast-free mode		with fast-free mode
> 	+9.04%				+7.75%
> avx512 path:
> without fast-free mode		with fast-free mode
> 	+5.43%				+1.57%
> -------------------------------------------------------------------
> This patch can not affect base performance of normal mode.
> Furthermore, the reason for that limiting the CPU frequency is
> that dell-8268 can encounter i40e NIC bottleneck with maximum
> frequency.
> 
> 2.The testing results for VPP-L3fwd are as follows:
> -------------------------------------------------------------------
> Arm:
> N1SDP(neon path):
> with direct re-arm mode enabled
> 	+7.0%
> -------------------------------------------------------------------
> For Ampere Altra and X86,VPP-L3fwd test has not been done.
> 
> Reference:
> [1] https://store.nvidia.com/en-us/networking/store/product/MCX623105AN-CDAT/NVIDIAMCX623105ANCDATConnectX6DxENAdapterCard100GbECryptoDisabled/
> [2] https://www.intel.com/content/www/us/en/products/sku/192561/intel-ethernet-network-adapter-e810cqda1/specifications.html
> [3] https://www.broadcom.com/products/ethernet-connectivity/network-adapters/100gb-nic-ocp/n1100g
> 
> Feifei Wang (5):
>    net/i40e: remove redundant Dtype initialization
>    net/i40e: enable direct rearm mode
>    ethdev: add API for direct rearm mode
>    net/i40e: add direct rearm mode internal API
>    examples/l3fwd: enable direct rearm mode
> 
>   drivers/net/i40e/i40e_ethdev.c          |  34 +++
>   drivers/net/i40e/i40e_rxtx.c            |   4 -
>   drivers/net/i40e/i40e_rxtx.h            |   4 +
>   drivers/net/i40e/i40e_rxtx_common_avx.h | 269 ++++++++++++++++++++++++
>   drivers/net/i40e/i40e_rxtx_vec_avx2.c   |  14 +-
>   drivers/net/i40e/i40e_rxtx_vec_avx512.c | 249 +++++++++++++++++++++-
>   drivers/net/i40e/i40e_rxtx_vec_neon.c   | 141 ++++++++++++-
>   drivers/net/i40e/i40e_rxtx_vec_sse.c    | 170 ++++++++++++++-
>   examples/l3fwd/l3fwd_lpm.c              |  16 +-
>   lib/ethdev/ethdev_driver.h              |  15 ++
>   lib/ethdev/rte_ethdev.c                 |  14 ++
>   lib/ethdev/rte_ethdev.h                 |  31 +++
>   lib/ethdev/version.map                  |   1 +
>   13 files changed, 949 insertions(+), 13 deletions(-)
>
  
Konstantin Ananyev May 24, 2022, 1:25 a.m. UTC | #2
16/05/2022 07:10, Feifei Wang wrote:
> 
>>> Currently, the transmit side frees the buffers into the lcore cache and
>>> the receive side allocates buffers from the lcore cache. The transmit
>>> side typically frees 32 buffers resulting in 32*8=256B of stores to
>>> lcore cache. The receive side allocates 32 buffers and stores them in
>>> the receive side software ring, resulting in 32*8=256B of stores and
>>> 256B of load from the lcore cache.
>>>
>>> This patch proposes a mechanism to avoid freeing to/allocating from
>>> the lcore cache. i.e. the receive side will free the buffers from
>>> transmit side directly into it's software ring. This will avoid the 256B
>>> of loads and stores introduced by the lcore cache. It also frees up the
>>> cache lines used by the lcore cache.
>>>
>>> However, this solution poses several constraints:
>>>
>>> 1)The receive queue needs to know which transmit queue it should take
>>> the buffers from. The application logic decides which transmit port to
>>> use to send out the packets. In many use cases the NIC might have a
>>> single port ([1], [2], [3]), in which case a given transmit queue is
>>> always mapped to a single receive queue (1:1 Rx queue: Tx queue). This
>>> is easy to configure.
>>>
>>> If the NIC has 2 ports (there are several references), then we will have
>>> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
>>> However, if this is generalized to 'N' ports, the configuration can be
>>> long. More over the PMD would have to scan a list of transmit queues to
>>> pull the buffers from.
> 
>> Just to re-iterate some generic concerns about this proposal:
>>   - We effectively link RX and TX queues - when this feature is enabled,
>>     user can't stop TX queue without stopping linked RX queue first.
>>     Right now user is free to start/stop any queues at his will.
>>     If that feature will allow to link queues from different ports,
>>     then even ports will become dependent and user will have to pay extra
>>     care when managing such ports.
> 
> [Feifei] When direct rearm is enabled, there are two paths for the thread
> to choose from. If there are enough freed Tx buffers, Rx can take buffers
> from Tx. Otherwise, Rx will take buffers from the mempool as usual. Thus,
> users do not need to pay much attention to managing ports.

What I am talking about: right now different ports, or different queues of
the same port, can be treated as independent entities:
in general the user is free to start/stop (and in some cases even
reconfigure) one entity without needing to stop the other.
I.e. the user can stop and re-configure a TX queue while continuing to
receive packets from an RX queue.
With direct re-arm enabled, I think that wouldn't be possible any more:
before stopping/reconfiguring a TX queue the user would have to make sure
that the corresponding RX queue isn't being used by the datapath.

> 
>> - very limited usage scenario - it will have a positive effect only
>>    when we have a fixed forwarding mapping: all (or nearly all) packets
>>    from the RX queue are forwarded into the same TX queue.
> 
> [Feifei] Although the usage scenario is limited, it has a wide
> range of applications, such as a NIC with one port.

Yes, there are NICs with one port, but there is no guarantee there won't be
several such NICs within the system.

> Furthermore, I think this is a tradeoff between performance and
> flexibility.
> Our goal is to achieve the best performance; this means we need to give up
> some flexibility decisively. Take 'FAST_FREE mode' as an example: it drops
> most of the buffer checks (refcnt > 1, external buffer, chained buffer),
> chooses the shortest path, and thereby achieves a significant performance
> improvement.
>> Wonder did you had a chance to consider mempool-cache ZC API,
>> similar to one we have for the ring?
>> It would allow us on TX free path to avoid copying mbufs to
>> temporary array on the stack.
>> Instead we can put them straight from TX SW ring to the mempool cache.
>> That should save extra store/load for mbuf and might help to achieve 
>> some performance gain without by-passing mempool.
>> It probably wouldn't be as fast as what you proposing,
>> but might be fast enough to consider as alternative.
>> Again, it would be a generic one, so we can avoid all
>> these implications and limitations.
> 
> [Feifei] I think this is a good idea to try. However, the most important
> question is whether we can bypass the mempool decisively to pursue
> significant performance gains.

I understand the intention, and I personally think this is a wrong
and dangerous attitude.
We have the mempool abstraction in place for a very good reason.
So we need to try to improve mempool performance (and API, if necessary)
in the first place, not avoid it and break our own rules and recommendations.


> For ZC, there may be a problem with it in i40e. The reason for putting Tx
> buffers into a temporary array is that i40e_tx_entry includes both a
> buffer pointer and an index.
> Thus we cannot put Tx SW-ring entries into the mempool directly; we first
> need to extract the mbuf pointer. Finally, even though we use ZC, we still
> can't avoid using a temporary array on the stack to extract the Tx buffer
> pointers.

When talking about a ZC API for the mempool cache I meant something like:

void **mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
	uint32_t *nb_elem, uint32_t flags);
void mempool_cache_put_zc_finish(struct rte_mempool_cache *mc,
	uint32_t nb_elem);

i.e. _start_ will return to the user a pointer inside the mp-cache where to
put the free elems, plus the max number of slots that can be safely filled.
_finish_ will update mc->len.
As an example:

/* expect to free N mbufs */
uint32_t i, n = N;
void **p = mempool_cache_put_zc_start(mc, &n, ...);

/* free up to n elems */
for (i = 0; i != n; i++) {

   /* get next free mbuf from somewhere */
   mb = extract_and_prefree_mbuf(...);

   /* no more free mbufs for now */
   if (mb == NULL)
      break;

   p[i] = mb;
}

/* finalize ZC put, with _i_ freed elems */
mempool_cache_put_zc_finish(mc, i);

That way, I think we can overcome the issue with i40e_tx_entry
you mentioned above. Plus it might be useful in other similar places.

Another alternative is obviously to split i40e_tx_entry into two structs
(one for the mbuf, the second for its metadata) and have a separate array
for each of them.
Though with that approach we need to make sure no perf drop is
introduced, and probably more code changes will be required.
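
For illustration only, such a split might look roughly like this (the
existing i40e_tx_entry layout shown here is from memory and should be
double-checked against the driver):

/* Today (roughly): mbuf pointer and metadata live in one entry. */
struct i40e_tx_entry {
	struct rte_mbuf *mbuf;
	uint16_t next_id;
	uint16_t last_id;
};

/* Possible split: a bare array of mbuf pointers that can be handed to
 * rte_mempool_put_bulk() (or a future ZC put) without any copying,
 * plus a parallel array holding the per-descriptor metadata. */
struct i40e_tx_entry_mbuf {
	struct rte_mbuf *mbuf;
};

struct i40e_tx_entry_meta {
	uint16_t next_id;
	uint16_t last_id;
};

/* In the queue, sw_ring[i] and sw_ring_meta[i] describe descriptor i,
 * so the free path can pass (void **)&sw_ring[first] straight to the
 * mempool instead of first gathering pointers into a stack array. */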
  
Morten Brørup May 24, 2022, 12:40 p.m. UTC | #3
> From: Konstantin Ananyev [mailto:konstantin.v.ananyev@yandex.ru]
> Sent: Tuesday, 24 May 2022 03.26


> > Furtrhermore, I think this is a tradeoff between performance and
> > flexibility.
> > Our goal is to achieve best performance, this means we need to give
> up some
> > flexibility decisively. For example of 'FAST_FREE Mode', it deletes
> most
> > of the buffer check (refcnt > 1, external buffer, chain buffer),
> chooses a
> > shorest path, and then achieve significant performance improvement.
> >> Wonder did you had a chance to consider mempool-cache ZC API,
> >> similar to one we have for the ring?
> >> It would allow us on TX free path to avoid copying mbufs to
> >> temporary array on the stack.
> >> Instead we can put them straight from TX SW ring to the mempool
> cache.
> >> That should save extra store/load for mbuf and might help to achieve
> >> some performance gain without by-passing mempool.
> >> It probably wouldn't be as fast as what you proposing,
> >> but might be fast enough to consider as alternative.
> >> Again, it would be a generic one, so we can avoid all
> >> these implications and limitations.
> >
> > [Feifei] I think this is a good try. However, the most important
> thing
> > is that if we can bypass the mempool decisively to pursue the
> > significant performance gains.
> 
> I understand the intention, and I personally think this is wrong
> and dangerous attitude.
> We have mempool abstraction in place for very good reason.

Yes, but the abstraction is being violated grossly elsewhere, and mempool code is copy-pasted elsewhere too.

A good example of the current situation is [1]. The cache multiplier (a definition private to the mempool library) is required for some copy-pasted code, and the solution is to expose the private definition and make it part of the public API.

[1] http://inbox.dpdk.org/dev/DM4PR12MB53893BF4C7861068FE8A943BDFFE9@DM4PR12MB5389.namprd12.prod.outlook.com/

The game of abstraction has already been lost. Performance won. :-(

Since we allow bypassing the mbuf/mempool library for other features, it should be allowed for this feature too.

I would even say: Why are the drivers using the mempool library, and not the mbuf library, when freeing and allocating mbufs? This behavior bypasses all the debug assertions in the mbuf library.

As you can probably see, I'm certainly not happy about the abstraction violations in DPDK. But they have been allowed for similar features, so they should be allowed here too.

> So we need to try to improve mempool performance (and API if necessary)
> at first place, not to avoid it and break our own rules and
> recommendations.
> 
> 
> > For ZC, there maybe a problem for it in i40e. The reason for that put
> Tx
> > buffers
> > into temporary is that i40e_tx_entry includes buffer pointer and
> index.
> > Thus we cannot put Tx SW_ring entry into mempool directly, we need to
> > firstlt extract mbuf pointer. Finally, though we use ZC, we still
> can't
> > avoid
> > using a temporary stack to extract Tx buffer pointers.
> 
> When talking about ZC API for mempool cache I meant something like:
> void ** mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
> uint32_t *nb_elem, uint32_t flags);
> void mempool_cache_put_zc_finish(struct rte_mempool_cache *mc, uint32_t
> nb_elem);
> i.e. _start_ will return user a pointer inside mp-cache where to put
> free elems and max number of slots that can be safely filled.
> _finish_ will update mc->len.
> As an example:
> 
> /* expect to free N mbufs */
> uint32_t n = N;
> void **p = mempool_cache_put_zc_start(mc, &n, ...);
> 
> /* free up to n elems */
> for (i = 0; i != n; i++) {
> 
>    /* get next free mbuf from somewhere */
>    mb = extract_and_prefree_mbuf(...);
> 
>    /* no more free mbufs for now */
>    if (mb == NULL)
>       break;
> 
>    p[i] = mb;
> }
> 
> /* finalize ZC put, with _i_ freed elems */
> mempool_cache_put_zc_finish(mc, i);
> 
> That way, I think we can overcome the issue with i40e_tx_entry
> you mentioned above. Plus it might be useful in other similar places.

Great example. This would fit perfectly into the i40e driver, if it didn't already implement the exact same thing by accessing the mempool cache structure directly. :-(

BTW: I tried patching the mempool library to fix some asymmetry bugs [2], but couldn't get any ACKs for it. It seems to me that the community is too risk averse to dare to modify such a core library.

[2] http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D86FBB@smartserver.smartshare.dk/

However, adding your mempool ZC feature is not a modification, but an addition, so it should be able to gather support.
  
Honnappa Nagarahalli May 24, 2022, 8:14 p.m. UTC | #4
<snip>

> 
> 16/05/2022 07:10, Feifei Wang wrote:
> >
> >>> Currently, the transmit side frees the buffers into the lcore cache
> >>> and the receive side allocates buffers from the lcore cache. The
> >>> transmit side typically frees 32 buffers resulting in 32*8=256B of
> >>> stores to lcore cache. The receive side allocates 32 buffers and
> >>> stores them in the receive side software ring, resulting in
> >>> 32*8=256B of stores and 256B of load from the lcore cache.
> >>>
> >>> This patch proposes a mechanism to avoid freeing to/allocating from
> >>> the lcore cache. i.e. the receive side will free the buffers from
> >>> transmit side directly into it's software ring. This will avoid the
> >>> 256B of loads and stores introduced by the lcore cache. It also
> >>> frees up the cache lines used by the lcore cache.
> >>>
> >>> However, this solution poses several constraints:
> >>>
> >>> 1)The receive queue needs to know which transmit queue it should
> >>> take the buffers from. The application logic decides which transmit
> >>> port to use to send out the packets. In many use cases the NIC might
> >>> have a single port ([1], [2], [3]), in which case a given transmit
> >>> queue is always mapped to a single receive queue (1:1 Rx queue: Tx
> >>> queue). This is easy to configure.
> >>>
> >>> If the NIC has 2 ports (there are several references), then we will
> >>> have
> >>> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> >>> However, if this is generalized to 'N' ports, the configuration can
> >>> be long. More over the PMD would have to scan a list of transmit
> >>> queues to pull the buffers from.
> >
> >> Just to re-iterate some generic concerns about this proposal:
> >>   - We effectively link RX and TX queues - when this feature is enabled,
> >>     user can't stop TX queue without stopping linked RX queue first.
> >>     Right now user is free to start/stop any queues at his will.
> >>     If that feature will allow to link queues from different ports,
> >>     then even ports will become dependent and user will have to pay extra
> >>     care when managing such ports.
> >
> > [Feifei] When direct rearm enabled, there are two path for thread to
> > choose. If there are enough Tx freed buffers, Rx can put buffers from
> > Tx.
> > Otherwise, Rx will put buffers from mempool as usual. Thus, users do
> > not need to pay much attention managing ports.
> 
> What I am talking about: right now different port or different queues of the
> same port can be treated as independent entities:
> in general user is free to start/stop (and even reconfigure in some
> cases) one entity without need to stop other entity.
> I.E user can stop and re-configure TX queue while keep receiving packets from
> RX queue.
> With direct re-arm enabled, I think it wouldn't be possible any more:
> before stopping/reconfiguring TX queue user would have make sure that
> corresponding RX queue wouldn't be used by datapath.
I am trying to understand the problem better. For the TX queue to be stopped, the user must have blocked the data plane from accessing the TX queue. Like Feifei says, the RX side still has the normal packet allocation path available.
Also, this sounds like a corner case to me; we can handle it through checks in the queue_stop API.

> 
> >
> >> - very limited usage scenario - it will have a positive effect only
> >>    when we have a fixed forwarding mapping: all (or nearly all) packets
> >>    from the RX queue are forwarded into the same TX queue.
> >
> > [Feifei] Although the usage scenario is limited, this usage scenario
> > has a wide range of applications, such as NIC with one port.
> 
> yes, there are NICs with one port, but no guarantee there wouldn't be several
> such NICs within the system.
What I see in my interactions is that a single NIC/DPU is under-utilized in a 2-socket system. Some are adding more sockets to the system to better utilize the DPU. NIC bandwidth continues to grow significantly. I do not think there will be a multi-DPU-per-server scenario.

> 
> > Furtrhermore, I think this is a tradeoff between performance and
> > flexibility.
> > Our goal is to achieve best performance, this means we need to give up
> > some flexibility decisively. For example of 'FAST_FREE Mode', it
> > deletes most of the buffer check (refcnt > 1, external buffer, chain
> > buffer), chooses a shorest path, and then achieve significant performance
> improvement.
> >> Wonder did you had a chance to consider mempool-cache ZC API, similar
> >> to one we have for the ring?
> >> It would allow us on TX free path to avoid copying mbufs to temporary
> >> array on the stack.
> >> Instead we can put them straight from TX SW ring to the mempool cache.
> >> That should save extra store/load for mbuf and might help to achieve
> >> some performance gain without by-passing mempool.
> >> It probably wouldn't be as fast as what you proposing, but might be
> >> fast enough to consider as alternative.
> >> Again, it would be a generic one, so we can avoid all these
> >> implications and limitations.
> >
> > [Feifei] I think this is a good try. However, the most important thing
> > is that if we can bypass the mempool decisively to pursue the
> > significant performance gains.
> 
> I understand the intention, and I personally think this is wrong and dangerous
> attitude.
> We have mempool abstraction in place for very good reason.
> So we need to try to improve mempool performance (and API if necessary) at
> first place, not to avoid it and break our own rules and recommendations.
The abstraction can be thought of at a higher level, i.e. the driver manages buffer allocation/free and this is hidden from the application. The application does not need to be aware of how these changes are implemented.

> 
> 
> > For ZC, there maybe a problem for it in i40e. The reason for that put
> > Tx buffers into temporary is that i40e_tx_entry includes buffer
> > pointer and index.
> > Thus we cannot put Tx SW_ring entry into mempool directly, we need to
> > firstlt extract mbuf pointer. Finally, though we use ZC, we still
> > can't avoid using a temporary stack to extract Tx buffer pointers.
> 
> When talking about ZC API for mempool cache I meant something like:
> void ** mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
> uint32_t *nb_elem, uint32_t flags); void mempool_cache_put_zc_finish(struct
> rte_mempool_cache *mc, uint32_t nb_elem); i.e. _start_ will return user a
> pointer inside mp-cache where to put free elems and max number of slots
> that can be safely filled.
> _finish_ will update mc->len.
> As an example:
> 
> /* expect to free N mbufs */
> uint32_t n = N;
> void **p = mempool_cache_put_zc_start(mc, &n, ...);
> 
> /* free up to n elems */
> for (i = 0; i != n; i++) {
> 
>    /* get next free mbuf from somewhere */
>    mb = extract_and_prefree_mbuf(...);
> 
>    /* no more free mbufs for now */
>    if (mb == NULL)
>       break;
> 
>    p[i] = mb;
> }
> 
> /* finalize ZC put, with _i_ freed elems */ mempool_cache_put_zc_finish(mc,
> i);
> 
> That way, I think we can overcome the issue with i40e_tx_entry you
> mentioned above. Plus it might be useful in other similar places.
> 
> Another alternative is obviously to split i40e_tx_entry into two structs (one
> for mbuf, second for its metadata) and have a separate array for each of
> them.
> Though with that approach we need to make sure no perf drops will be
> introduced, plus probably more code changes will be required.
Commit 5171b4ee6b6 already does this (in a different way), but just for AVX512. Unfortunately, it does not record any performance improvement. We could port this to Arm NEON and look at the performance.

> 
> 
> 
>
  
Konstantin Ananyev May 28, 2022, 12:22 p.m. UTC | #5
24/05/2022 21:14, Honnappa Nagarahalli wrote:
> <snip>
> 
>>
>> 16/05/2022 07:10, Feifei Wang wrote:
>>>
>>>>> Currently, the transmit side frees the buffers into the lcore cache
>>>>> and the receive side allocates buffers from the lcore cache. The
>>>>> transmit side typically frees 32 buffers resulting in 32*8=256B of
>>>>> stores to lcore cache. The receive side allocates 32 buffers and
>>>>> stores them in the receive side software ring, resulting in
>>>>> 32*8=256B of stores and 256B of load from the lcore cache.
>>>>>
>>>>> This patch proposes a mechanism to avoid freeing to/allocating from
>>>>> the lcore cache. i.e. the receive side will free the buffers from
>>>>> transmit side directly into it's software ring. This will avoid the
>>>>> 256B of loads and stores introduced by the lcore cache. It also
>>>>> frees up the cache lines used by the lcore cache.
>>>>>
>>>>> However, this solution poses several constraints:
>>>>>
>>>>> 1)The receive queue needs to know which transmit queue it should
>>>>> take the buffers from. The application logic decides which transmit
>>>>> port to use to send out the packets. In many use cases the NIC might
>>>>> have a single port ([1], [2], [3]), in which case a given transmit
>>>>> queue is always mapped to a single receive queue (1:1 Rx queue: Tx
>>>>> queue). This is easy to configure.
>>>>>
>>>>> If the NIC has 2 ports (there are several references), then we will
>>>>> have
>>>>> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
>>>>> However, if this is generalized to 'N' ports, the configuration can
>>>>> be long. More over the PMD would have to scan a list of transmit
>>>>> queues to pull the buffers from.
>>>
>>>> Just to re-iterate some generic concerns about this proposal:
>>>>    - We effectively link RX and TX queues - when this feature is enabled,
>>>>      user can't stop TX queue without stopping linked RX queue first.
>>>>      Right now user is free to start/stop any queues at his will.
>>>>      If that feature will allow to link queues from different ports,
>>>>      then even ports will become dependent and user will have to pay extra
>>>>      care when managing such ports.
>>>
>>> [Feifei] When direct rearm enabled, there are two path for thread to
>>> choose. If there are enough Tx freed buffers, Rx can put buffers from
>>> Tx.
>>> Otherwise, Rx will put buffers from mempool as usual. Thus, users do
>>> not need to pay much attention managing ports.
>>
>> What I am talking about: right now different port or different queues of the
>> same port can be treated as independent entities:
>> in general user is free to start/stop (and even reconfigure in some
>> cases) one entity without need to stop other entity.
>> I.E user can stop and re-configure TX queue while keep receiving packets from
>> RX queue.
>> With direct re-arm enabled, I think it wouldn't be possible any more:
>> before stopping/reconfiguring TX queue user would have make sure that
>> corresponding RX queue wouldn't be used by datapath.
> I am trying to understand the problem better. For the TX queue to be stopped, the user must have blocked the data plane from accessing the TX queue. 

Surely it is the user's responsibility not to call tx_burst()
for a stopped/released queue.
The problem is that while TX for that queue is stopped,
RX for the related queue can still continue.
So rx_burst() will try to read/modify TX queue data that might
already be freed, or be simultaneously modified by the control path.

Again, it can all be mitigated by carefully re-designing and
modifying the control and data paths inside the user app -
by doing extra checks and synchronizations, etc.
But from a practical point of view - I presume most users would simply
avoid using this feature due to all the potential problems it might cause.

> Like Feifei says, the RX side has the normal packet allocation path still available.
> Also this sounds like a corner case to me, we can handle this through checks in the queue_stop API.

Depends.
If it were allowed to link queues only from the same port,
then yes, extra checks at queue-stop might be enough,
as right now DPDK doesn't allow the user to change the number of queues
without dev_stop() first.
Though if it were allowed to link queues from different ports,
then the situation would be much worse.
Right now ports are totally independent entities
(except for some special cases like link bonding, etc.):
while one port keeps doing RX/TX, the second one can be stopped,
re-configured, even detached, and a newly attached device might
re-use the same port number.

>>
>>>
>>>> - very limited usage scenario - it will have a positive effect only
>>>>     when we have a fixed forwarding mapping: all (or nearly all) packets
>>>>     from the RX queue are forwarded into the same TX queue.
>>>
>>> [Feifei] Although the usage scenario is limited, this usage scenario
>>> has a wide range of applications, such as NIC with one port.
>>
>> yes, there are NICs with one port, but no guarantee there wouldn't be several
>> such NICs within the system.
> What I see in my interactions is, a single NIC/DPU is under utilized for a 2 socket system. Some are adding more sockets to the system to better utilize the DPU. The NIC bandwidth continues to grow significantly. I do not think there will be a multi-DPU per server scenario.


Interesting... from my experience it is vice versa:
in many cases 200Gb/s is not that much these days
and is not enough to saturate a modern 2-socket x86 server.
Though I suppose a lot depends on the particular HW and actual workload.

> 
>>
>>> Furtrhermore, I think this is a tradeoff between performance and
>>> flexibility.
>>> Our goal is to achieve best performance, this means we need to give up
>>> some flexibility decisively. For example of 'FAST_FREE Mode', it
>>> deletes most of the buffer check (refcnt > 1, external buffer, chain
>>> buffer), chooses a shorest path, and then achieve significant performance
>> improvement.
>>>> Wonder did you had a chance to consider mempool-cache ZC API, similar
>>>> to one we have for the ring?
>>>> It would allow us on TX free path to avoid copying mbufs to temporary
>>>> array on the stack.
>>>> Instead we can put them straight from TX SW ring to the mempool cache.
>>>> That should save extra store/load for mbuf and might help to achieve
>>>> some performance gain without by-passing mempool.
>>>> It probably wouldn't be as fast as what you proposing, but might be
>>>> fast enough to consider as alternative.
>>>> Again, it would be a generic one, so we can avoid all these
>>>> implications and limitations.
>>>
>>> [Feifei] I think this is a good try. However, the most important thing
>>> is that if we can bypass the mempool decisively to pursue the
>>> significant performance gains.
>>
>> I understand the intention, and I personally think this is wrong and dangerous
>> attitude.
>> We have mempool abstraction in place for very good reason.
>> So we need to try to improve mempool performance (and API if necessary) at
>> first place, not to avoid it and break our own rules and recommendations.
> The abstraction can be thought of at a higher level. i.e. the driver manages the buffer allocation/free and is hidden from the application. The application does not need to be aware of how these changes are implemented.
> 
>>
>>
>>> For ZC, there maybe a problem for it in i40e. The reason for that put
>>> Tx buffers into temporary is that i40e_tx_entry includes buffer
>>> pointer and index.
>>> Thus we cannot put Tx SW_ring entry into mempool directly, we need to
>>> firstlt extract mbuf pointer. Finally, though we use ZC, we still
>>> can't avoid using a temporary stack to extract Tx buffer pointers.
>>
>> When talking about ZC API for mempool cache I meant something like:
>> void ** mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
>> uint32_t *nb_elem, uint32_t flags); void mempool_cache_put_zc_finish(struct
>> rte_mempool_cache *mc, uint32_t nb_elem); i.e. _start_ will return user a
>> pointer inside mp-cache where to put free elems and max number of slots
>> that can be safely filled.
>> _finish_ will update mc->len.
>> As an example:
>>
>> /* expect to free N mbufs */
>> uint32_t n = N;
>> void **p = mempool_cache_put_zc_start(mc, &n, ...);
>>
>> /* free up to n elems */
>> for (i = 0; i != n; i++) {
>>
>>     /* get next free mbuf from somewhere */
>>     mb = extract_and_prefree_mbuf(...);
>>
>>     /* no more free mbufs for now */
>>     if (mb == NULL)
>>        break;
>>
>>     p[i] = mb;
>> }
>>
>> /* finalize ZC put, with _i_ freed elems */ mempool_cache_put_zc_finish(mc,
>> i);
>>
>> That way, I think we can overcome the issue with i40e_tx_entry you
>> mentioned above. Plus it might be useful in other similar places.
>>
>> Another alternative is obviously to split i40e_tx_entry into two structs (one
>> for mbuf, second for its metadata) and have a separate array for each of
>> them.
>> Though with that approach we need to make sure no perf drops will be
>> introduced, plus probably more code changes will be required.
> Commit '5171b4ee6b6" already does this (in a different way), but just for AVX512. Unfortunately, it does not record any performance improvements. We could port this to Arm NEON and look at the performance.
  
Honnappa Nagarahalli June 1, 2022, 1 a.m. UTC | #6
<snip>
> >
> >>
> >> 16/05/2022 07:10, Feifei Wang wrote:
> >>>
> >>>>> Currently, the transmit side frees the buffers into the lcore
> >>>>> cache and the receive side allocates buffers from the lcore cache.
> >>>>> The transmit side typically frees 32 buffers resulting in
> >>>>> 32*8=256B of stores to lcore cache. The receive side allocates 32
> >>>>> buffers and stores them in the receive side software ring,
> >>>>> resulting in 32*8=256B of stores and 256B of load from the lcore cache.
> >>>>>
> >>>>> This patch proposes a mechanism to avoid freeing to/allocating
> >>>>> from the lcore cache. i.e. the receive side will free the buffers
> >>>>> from transmit side directly into it's software ring. This will
> >>>>> avoid the 256B of loads and stores introduced by the lcore cache.
> >>>>> It also frees up the cache lines used by the lcore cache.
> >>>>>
> >>>>> However, this solution poses several constraints:
> >>>>>
> >>>>> 1)The receive queue needs to know which transmit queue it should
> >>>>> take the buffers from. The application logic decides which
> >>>>> transmit port to use to send out the packets. In many use cases
> >>>>> the NIC might have a single port ([1], [2], [3]), in which case a
> >>>>> given transmit queue is always mapped to a single receive queue
> >>>>> (1:1 Rx queue: Tx queue). This is easy to configure.
> >>>>>
> >>>>> If the NIC has 2 ports (there are several references), then we
> >>>>> will have
> >>>>> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> >>>>> However, if this is generalized to 'N' ports, the configuration
> >>>>> can be long. More over the PMD would have to scan a list of
> >>>>> transmit queues to pull the buffers from.
> >>>
> >>>> Just to re-iterate some generic concerns about this proposal:
> >>>>    - We effectively link RX and TX queues - when this feature is enabled,
> >>>>      user can't stop TX queue without stopping linked RX queue first.
> >>>>      Right now user is free to start/stop any queues at his will.
> >>>>      If that feature will allow to link queues from different ports,
> >>>>      then even ports will become dependent and user will have to pay extra
> >>>>      care when managing such ports.
> >>>
> >>> [Feifei] When direct rearm enabled, there are two path for thread to
> >>> choose. If there are enough Tx freed buffers, Rx can put buffers
> >>> from Tx.
> >>> Otherwise, Rx will put buffers from mempool as usual. Thus, users do
> >>> not need to pay much attention managing ports.
> >>
> >> What I am talking about: right now different port or different queues
> >> of the same port can be treated as independent entities:
> >> in general user is free to start/stop (and even reconfigure in some
> >> cases) one entity without need to stop other entity.
> >> I.E user can stop and re-configure TX queue while keep receiving
> >> packets from RX queue.
> >> With direct re-arm enabled, I think it wouldn't be possible any more:
> >> before stopping/reconfiguring TX queue user would have make sure that
> >> corresponding RX queue wouldn't be used by datapath.
> > I am trying to understand the problem better. For the TX queue to be stopped,
> the user must have blocked the data plane from accessing the TX queue.
> 
> Surely it is user responsibility tnot to call tx_burst() for stopped/released queue.
> The problem is that while TX for that queue is stopped, RX for related queue still
> can continue.
> So rx_burst() will try to read/modify TX queue data, that might be already freed,
> or simultaneously modified by control path.
Understood, agree on the issue

> 
> Again, it all can be mitigated by carefully re-designing and modifying control and
> data-path inside user app - by doing extra checks and synchronizations, etc.
> But from practical point - I presume most of users simply would avoid using this
> feature due all potential problems it might cause.
That is subjective; it all depends on the performance improvements users see in their application.
IMO, the performance improvement seen with this patch is worth a few changes.

> 
> > Like Feifei says, the RX side has the normal packet allocation path still available.
> > Also this sounds like a corner case to me, we can handle this through checks in
> the queue_stop API.
> 
> Depends.
> if it would be allowed to link queues only from the same port, then yes, extra
> checks for queue-stop might be enough.
> As right now DPDK doesn't allow user to change number of queues without
> dev_stop() first.
> Though if it would be allowed to link queues from different ports, then situation
> will be much worse.
> Right now ports are totally independent entities (except some special cases like
> link-bonding, etc.).
> As one port can keep doing RX/TX, second one can be stopped, re-confgured,
> even detached, and newly attached device might re-use same port number.
I see this as a similar restriction to the one discussed above. Do you see any issues if we enforce this with checks?
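For illustration, the kind of check I have in mind could be as simple as the sketch below; the rxq type and the rxq_direct_rearm_txq() helper are hypothetical placeholders, only the ethdev fields are existing ones.

/* Hypothetical guard called from the PMD's tx_queue_stop path
 * (would live inside the PMD, using ethdev_driver.h). */
static int
tx_queue_stop_allowed(struct rte_eth_dev *dev, uint16_t tx_queue_id)
{
	uint16_t q;

	for (q = 0; q < dev->data->nb_rx_queues; q++) {
		if (dev->data->rx_queues[q] == NULL)
			continue;
		/* Placeholder: which TX queue (if any) is this RX queue
		 * mapped to for direct rearm? */
		if (rxq_direct_rearm_txq(dev->data->rx_queues[q]) != tx_queue_id)
			continue;
		/* Refuse to stop the TX queue while a mapped RX queue
		 * is still started. */
		if (dev->data->rx_queue_state[q] == RTE_ETH_QUEUE_STATE_STARTED)
			return -EBUSY;
	}
	return 0;
}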

> 
> 
> >>
> >>>
> >>>> - very limited usage scenario - it will have a positive effect only
> >>>>     when we have a fixed forwarding mapping: all (or nearly all) packets
> >>>>     from the RX queue are forwarded into the same TX queue.
> >>>
> >>> [Feifei] Although the usage scenario is limited, this usage scenario
> >>> has a wide range of applications, such as NIC with one port.
> >>
> >> yes, there are NICs with one port, but no guarantee there wouldn't be
> >> several such NICs within the system.
> > What I see in my interactions is, a single NIC/DPU is under utilized for a 2
> socket system. Some are adding more sockets to the system to better utilize the
> DPU. The NIC bandwidth continues to grow significantly. I do not think there will
> be a multi-DPU per server scenario.
> 
> 
> Interesting... from my experience it is visa-versa:
> in many cases 200Gb/s is not that much these days to saturate modern 2 socket
> x86 server.
> Though I suppose a lot depends on particular HW and actual workload.
> 
> >
> >>
> >>> Furtrhermore, I think this is a tradeoff between performance and
> >>> flexibility.
> >>> Our goal is to achieve best performance, this means we need to give
> >>> up some flexibility decisively. For example of 'FAST_FREE Mode', it
> >>> deletes most of the buffer check (refcnt > 1, external buffer, chain
> >>> buffer), chooses a shorest path, and then achieve significant
> >>> performance
> >> improvement.
> >>>> Wonder did you had a chance to consider mempool-cache ZC API,
> >>>> similar to one we have for the ring?
> >>>> It would allow us on TX free path to avoid copying mbufs to
> >>>> temporary array on the stack.
> >>>> Instead we can put them straight from TX SW ring to the mempool cache.
> >>>> That should save extra store/load for mbuf and might help to
> >>>> achieve some performance gain without by-passing mempool.
> >>>> It probably wouldn't be as fast as what you proposing, but might be
> >>>> fast enough to consider as alternative.
> >>>> Again, it would be a generic one, so we can avoid all these
> >>>> implications and limitations.
> >>>
> >>> [Feifei] I think this is a good try. However, the most important
> >>> thing is that if we can bypass the mempool decisively to pursue the
> >>> significant performance gains.
> >>
> >> I understand the intention, and I personally think this is wrong and
> >> dangerous attitude.
> >> We have mempool abstraction in place for very good reason.
> >> So we need to try to improve mempool performance (and API if
> >> necessary) at first place, not to avoid it and break our own rules and
> recommendations.
> > The abstraction can be thought of at a higher level. i.e. the driver manages the
> buffer allocation/free and is hidden from the application. The application does
> not need to be aware of how these changes are implemented.
> >
> >>
> >>
> >>> For ZC, there maybe a problem for it in i40e. The reason for that
> >>> put Tx buffers into temporary is that i40e_tx_entry includes buffer
> >>> pointer and index.
> >>> Thus we cannot put Tx SW_ring entry into mempool directly, we need
> >>> to firstlt extract mbuf pointer. Finally, though we use ZC, we still
> >>> can't avoid using a temporary stack to extract Tx buffer pointers.
> >>
> >> When talking about ZC API for mempool cache I meant something like:
> >> void ** mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
> >> uint32_t *nb_elem, uint32_t flags); void
> >> mempool_cache_put_zc_finish(struct
> >> rte_mempool_cache *mc, uint32_t nb_elem); i.e. _start_ will return
> >> user a pointer inside mp-cache where to put free elems and max number
> >> of slots that can be safely filled.
> >> _finish_ will update mc->len.
> >> As an example:
> >>
> >> /* expect to free N mbufs */
> >> uint32_t n = N;
> >> void **p = mempool_cache_put_zc_start(mc, &n, ...);
> >>
> >> /* free up to n elems */
> >> for (i = 0; i != n; i++) {
> >>
> >>     /* get next free mbuf from somewhere */
> >>     mb = extract_and_prefree_mbuf(...);
> >>
> >>     /* no more free mbufs for now */
> >>     if (mb == NULL)
> >>        break;
> >>
> >>     p[i] = mb;
> >> }
> >>
> >> /* finalize ZC put, with _i_ freed elems */
> >> mempool_cache_put_zc_finish(mc, i);
> >>
> >> That way, I think we can overcome the issue with i40e_tx_entry you
> >> mentioned above. Plus it might be useful in other similar places.
> >>
> >> Another alternative is obviously to split i40e_tx_entry into two
> >> structs (one for mbuf, second for its metadata) and have a separate
> >> array for each of them.
> >> Though with that approach we need to make sure no perf drops will be
> >> introduced, plus probably more code changes will be required.
> > Commit '5171b4ee6b6" already does this (in a different way), but just for
> AVX512. Unfortunately, it does not record any performance improvements. We
> could port this to Arm NEON and look at the performance.
  
Konstantin Ananyev June 3, 2022, 11:32 p.m. UTC | #7
> <snip>
>>>
>>>>
>>>> 16/05/2022 07:10, Feifei Wang wrote:
>>>>>
>>>>>>> Currently, the transmit side frees the buffers into the lcore
>>>>>>> cache and the receive side allocates buffers from the lcore cache.
>>>>>>> The transmit side typically frees 32 buffers resulting in
>>>>>>> 32*8=256B of stores to lcore cache. The receive side allocates 32
>>>>>>> buffers and stores them in the receive side software ring,
>>>>>>> resulting in 32*8=256B of stores and 256B of load from the lcore cache.
>>>>>>>
>>>>>>> This patch proposes a mechanism to avoid freeing to/allocating
>>>>>>> from the lcore cache. i.e. the receive side will free the buffers
>>>>>>> from transmit side directly into it's software ring. This will
>>>>>>> avoid the 256B of loads and stores introduced by the lcore cache.
>>>>>>> It also frees up the cache lines used by the lcore cache.
>>>>>>>
>>>>>>> However, this solution poses several constraints:
>>>>>>>
>>>>>>> 1)The receive queue needs to know which transmit queue it should
>>>>>>> take the buffers from. The application logic decides which
>>>>>>> transmit port to use to send out the packets. In many use cases
>>>>>>> the NIC might have a single port ([1], [2], [3]), in which case a
>>>>>>> given transmit queue is always mapped to a single receive queue
>>>>>>> (1:1 Rx queue: Tx queue). This is easy to configure.
>>>>>>>
>>>>>>> If the NIC has 2 ports (there are several references), then we
>>>>>>> will have
>>>>>>> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
>>>>>>> However, if this is generalized to 'N' ports, the configuration
>>>>>>> can be long. More over the PMD would have to scan a list of
>>>>>>> transmit queues to pull the buffers from.
>>>>>
>>>>>> Just to re-iterate some generic concerns about this proposal:
>>>>>>     - We effectively link RX and TX queues - when this feature is enabled,
>>>>>>       user can't stop TX queue without stopping linked RX queue first.
>>>>>>       Right now user is free to start/stop any queues at his will.
>>>>>>       If that feature will allow to link queues from different ports,
>>>>>>       then even ports will become dependent and user will have to pay extra
>>>>>>       care when managing such ports.
>>>>>
>>>>> [Feifei] When direct rearm enabled, there are two path for thread to
>>>>> choose. If there are enough Tx freed buffers, Rx can put buffers
>>>>> from Tx.
>>>>> Otherwise, Rx will put buffers from mempool as usual. Thus, users do
>>>>> not need to pay much attention managing ports.
>>>>
>>>> What I am talking about: right now different port or different queues
>>>> of the same port can be treated as independent entities:
>>>> in general user is free to start/stop (and even reconfigure in some
>>>> cases) one entity without need to stop other entity.
>>>> I.E user can stop and re-configure TX queue while keep receiving
>>>> packets from RX queue.
>>>> With direct re-arm enabled, I think it wouldn't be possible any more:
>>>> before stopping/reconfiguring TX queue user would have make sure that
>>>> corresponding RX queue wouldn't be used by datapath.
>>> I am trying to understand the problem better. For the TX queue to be stopped,
>> the user must have blocked the data plane from accessing the TX queue.
>>
>> Surely it is user responsibility tnot to call tx_burst() for stopped/released queue.
>> The problem is that while TX for that queue is stopped, RX for related queue still
>> can continue.
>> So rx_burst() will try to read/modify TX queue data, that might be already freed,
>> or simultaneously modified by control path.
> Understood, agree on the issue
> 
>>
>> Again, it all can be mitigated by carefully re-designing and modifying control and
>> data-path inside user app - by doing extra checks and synchronizations, etc.
>> But from practical point - I presume most of users simply would avoid using this
>> feature due all potential problems it might cause.
> That is subjective, it all depends on the performance improvements users see in their application.
> IMO, the performance improvement seen with this patch is worth few changes.

Yes, it is subjective to some extent, though my feeling is
that it might end up being a sort of synthetic improvement used only
by some show-case benchmarks.
From my perspective, it would be much more plausible
if we could introduce some sort of generic improvement that doesn't
impose all these extra constraints and implications -
like the ZC mempool approach discussed below in this thread.


>>
>>> Like Feifei says, the RX side has the normal packet allocation path still available.
>>> Also this sounds like a corner case to me, we can handle this through checks in
>> the queue_stop API.
>>
>> Depends.
>> if it would be allowed to link queues only from the same port, then yes, extra
>> checks for queue-stop might be enough.
>> As right now DPDK doesn't allow user to change number of queues without
>> dev_stop() first.
>> Though if it would be allowed to link queues from different ports, then situation
>> will be much worse.
>> Right now ports are totally independent entities (except some special cases like
>> link-bonding, etc.).
>> As one port can keep doing RX/TX, second one can be stopped, re-confgured,
>> even detached, and newly attached device might re-use same port number.
> I see this as a similar restriction to the one discussed above. 

Yes, they are similar in principle, though I think that the case
with queues from different ports would make things much more complex.

 > Do you see any issues if we enforce this with checks?

Hard to tell straightway, a lot will depend how smart
such implementation would be.
Usually DPDK tends not to avoid heavy
synchronizations within its data-path functions.
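
For illustration, a minimal sketch of the kind of control-path check being discussed, assuming a hypothetical back-reference from the Tx queue to the Rx queue mapped to it for direct re-arm (none of these names exist in DPDK today):

#include <errno.h>
#include <stddef.h>

/* Hypothetical structures, for illustration only. */
struct hypo_rx_queue;

struct hypo_tx_queue {
	/* Set when an Rx queue is mapped to this Tx queue for direct re-arm. */
	struct hypo_rx_queue *direct_rearm_rxq;
};

/* Called from a (hypothetical) tx_queue_stop path: refuse to stop a Tx
 * queue whose sw-ring may still be read by a mapped Rx queue's rx_burst().
 */
static int
hypo_tx_queue_stop_check(const struct hypo_tx_queue *txq)
{
	if (txq->direct_rearm_rxq != NULL)
		return -EBUSY;
	return 0;
}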

>>
>>
>>>>
>>>>>
>>>>>> - very limited usage scenario - it will have a positive effect only
>>>>>>      when we have a fixed forwarding mapping: all (or nearly all) packets
>>>>>>      from the RX queue are forwarded into the same TX queue.
>>>>>
>>>>> [Feifei] Although the usage scenario is limited, this usage scenario
>>>>> has a wide range of applications, such as NIC with one port.
>>>>
>>>> yes, there are NICs with one port, but no guarantee there wouldn't be
>>>> several such NICs within the system.
>>> What I see in my interactions is, a single NIC/DPU is under utilized for a 2
>> socket system. Some are adding more sockets to the system to better utilize the
>> DPU. The NIC bandwidth continues to grow significantly. I do not think there will
>> be a multi-DPU per server scenario.
>>
>>
>> Interesting... from my experience it is vice-versa:
>> in many cases 200Gb/s is not that much these days to saturate a modern 2-socket
>> x86 server.
>> Though I suppose a lot depends on particular HW and actual workload.
>>
>>>
>>>>
>>>>> Furthermore, I think this is a tradeoff between performance and
>>>>> flexibility.
>>>>> Our goal is to achieve the best performance; this means we need to give
>>>>> up some flexibility decisively. For example, 'FAST_FREE Mode' deletes
>>>>> most of the buffer checks (refcnt > 1, external buffer, chain buffer),
>>>>> chooses the shortest path, and then achieves a significant performance
>>>>> improvement.
>>>>>> I wonder whether you had a chance to consider a mempool-cache ZC API,
>>>>>> similar to the one we have for the ring?
>>>>>> It would allow us on TX free path to avoid copying mbufs to
>>>>>> temporary array on the stack.
>>>>>> Instead we can put them straight from TX SW ring to the mempool cache.
>>>>>> That should save extra store/load for mbuf and might help to
>>>>>> achieve some performance gain without by-passing mempool.
>>>>>> It probably wouldn't be as fast as what you proposing, but might be
>>>>>> fast enough to consider as alternative.
>>>>>> Again, it would be a generic one, so we can avoid all these
>>>>>> implications and limitations.
>>>>>
>>>>> [Feifei] I think this is a good idea to try. However, the most important
>>>>> thing is whether we can bypass the mempool decisively to pursue the
>>>>> significant performance gains.
>>>>
>>>> I understand the intention, and I personally think this is wrong and
>>>> dangerous attitude.
>>>> We have mempool abstraction in place for very good reason.
>>>> So we need to try to improve mempool performance (and API if
>>>> necessary) at first place, not to avoid it and break our own rules and
>> recommendations.
>>> The abstraction can be thought of at a higher level. i.e. the driver manages the
>> buffer allocation/free and is hidden from the application. The application does
>> not need to be aware of how these changes are implemented.
>>>
>>>>
>>>>
>>>>> For ZC, there may be a problem with it in i40e. The reason for
>>>>> putting Tx buffers into a temporary array is that i40e_tx_entry includes both
>>>>> a buffer pointer and an index.
>>>>> Thus we cannot put a Tx SW-ring entry into the mempool directly; we need
>>>>> to first extract the mbuf pointer. So even though we use ZC, we still
>>>>> can't avoid using a temporary stack to extract the Tx buffer pointers.
>>>>
>>>> When talking about ZC API for mempool cache I meant something like:
>>>> void ** mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
>>>> uint32_t *nb_elem, uint32_t flags); void
>>>> mempool_cache_put_zc_finish(struct
>>>> rte_mempool_cache *mc, uint32_t nb_elem); i.e. _start_ will return
>>>> user a pointer inside mp-cache where to put free elems and max number
>>>> of slots that can be safely filled.
>>>> _finish_ will update mc->len.
>>>> As an example:
>>>>
>>>> /* expect to free N mbufs */
>>>> uint32_t n = N;
>>>> void **p = mempool_cache_put_zc_start(mc, &n, ...);
>>>>
>>>> /* free up to n elems */
>>>> for (i = 0; i != n; i++) {
>>>>
>>>>      /* get next free mbuf from somewhere */
>>>>      mb = extract_and_prefree_mbuf(...);
>>>>
>>>>      /* no more free mbufs for now */
>>>>      if (mb == NULL)
>>>>         break;
>>>>
>>>>      p[i] = mb;
>>>> }
>>>>
>>>> /* finalize ZC put, with _i_ freed elems */
>>>> mempool_cache_put_zc_finish(mc, i);
>>>>
>>>> That way, I think we can overcome the issue with i40e_tx_entry you
>>>> mentioned above. Plus it might be useful in other similar places.
>>>>
>>>> Another alternative is obviously to split i40e_tx_entry into two
>>>> structs (one for mbuf, second for its metadata) and have a separate
>>>> array for each of them.
>>>> Though with that approach we need to make sure no perf drops will be
>>>> introduced, plus probably more code changes will be required.
>>> Commit 5171b4ee6b6 already does this (in a different way), but just for
>> AVX512. Unfortunately, it does not record any performance improvements. We
>> could port this to Arm NEON and look at the performance.
>
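
For reference, a minimal sketch of the split-struct alternative mentioned above, with the mbuf pointers kept in their own contiguous array so they could be handed to the mempool (or a future ZC put) without an intermediate copy; the names are illustrative, not the actual i40e definitions:

#include <stdint.h>

struct rte_mbuf;

/* Per-descriptor metadata, kept separate from the mbuf pointer. */
struct sw_tx_entry_meta {
	uint16_t next_id; /* index of next descriptor in the ring */
	uint16_t last_id; /* index of last scattered descriptor */
};

/* Split Tx sw-ring: two parallel arrays instead of one array of
 * {mbuf, next_id, last_id} entries.
 */
struct sw_tx_ring_split {
	struct rte_mbuf **mbufs;       /* nb_desc contiguous mbuf pointers */
	struct sw_tx_entry_meta *meta; /* nb_desc metadata entries */
	uint16_t nb_desc;
};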
  
Morten Brørup June 4, 2022, 8:07 a.m. UTC | #8
> From: Konstantin Ananyev [mailto:konstantin.v.ananyev@yandex.ru]
> Sent: Saturday, 4 June 2022 01.32
> 
> > <snip>
> >>>
> >>>>
> >>>>
> >>>> 16/05/2022 07:10, Feifei Wang wrote:
> >>>>>
> >>>>>>> Currently, the transmit side frees the buffers into the lcore
> >>>>>>> cache and the receive side allocates buffers from the lcore
> cache.
> >>>>>>> The transmit side typically frees 32 buffers resulting in
> >>>>>>> 32*8=256B of stores to lcore cache. The receive side allocates
> 32
> >>>>>>> buffers and stores them in the receive side software ring,
> >>>>>>> resulting in 32*8=256B of stores and 256B of load from the
> lcore cache.
> >>>>>>>
> >>>>>>> This patch proposes a mechanism to avoid freeing to/allocating
> >>>>>>> from the lcore cache. i.e. the receive side will free the
> buffers
> >>>>>>> from transmit side directly into it's software ring. This will
> >>>>>>> avoid the 256B of loads and stores introduced by the lcore
> cache.
> >>>>>>> It also frees up the cache lines used by the lcore cache.
> >>>>>>>
> >>>>>>> However, this solution poses several constraints:
> >>>>>>>
> >>>>>>> 1)The receive queue needs to know which transmit queue it
> should
> >>>>>>> take the buffers from. The application logic decides which
> >>>>>>> transmit port to use to send out the packets. In many use cases
> >>>>>>> the NIC might have a single port ([1], [2], [3]), in which case
> a
> >>>>>>> given transmit queue is always mapped to a single receive queue
> >>>>>>> (1:1 Rx queue: Tx queue). This is easy to configure.
> >>>>>>>
> >>>>>>> If the NIC has 2 ports (there are several references), then we
> >>>>>>> will have
> >>>>>>> 1:2 (RX queue: TX queue) mapping which is still easy to
> configure.
> >>>>>>> However, if this is generalized to 'N' ports, the configuration
> >>>>>>> can be long. More over the PMD would have to scan a list of
> >>>>>>> transmit queues to pull the buffers from.
> >>>>>
> >>>>>> Just to re-iterate some generic concerns about this proposal:
> >>>>>>     - We effectively link RX and TX queues - when this feature
> is enabled,
> >>>>>>       user can't stop TX queue without stopping linked RX queue
> first.
> >>>>>>       Right now user is free to start/stop any queues at his
> will.
> >>>>>>       If that feature will allow to link queues from different
> ports,
> >>>>>>       then even ports will become dependent and user will have
> to pay extra
> >>>>>>       care when managing such ports.
> >>>>>
> >>>>> [Feifei] When direct rearm enabled, there are two path for thread
> to
> >>>>> choose. If there are enough Tx freed buffers, Rx can put buffers
> >>>>> from Tx.
> >>>>> Otherwise, Rx will put buffers from mempool as usual. Thus, users
> do
> >>>>> not need to pay much attention managing ports.
> >>>>
> >>>> What I am talking about: right now different port or different
> queues
> >>>> of the same port can be treated as independent entities:
> >>>> in general user is free to start/stop (and even reconfigure in
> some
> >>>> cases) one entity without need to stop other entity.
> >>>> I.E user can stop and re-configure TX queue while keep receiving
> >>>> packets from RX queue.
> >>>> With direct re-arm enabled, I think it wouldn't be possible any
> more:
> >>>> before stopping/reconfiguring TX queue user would have make sure
> that
> >>>> corresponding RX queue wouldn't be used by datapath.
> >>> I am trying to understand the problem better. For the TX queue to
> be stopped,
> >> the user must have blocked the data plane from accessing the TX
> queue.
> >>
> >> Surely it is user responsibility tnot to call tx_burst() for
> stopped/released queue.
> >> The problem is that while TX for that queue is stopped, RX for
> related queue still
> >> can continue.
> >> So rx_burst() will try to read/modify TX queue data, that might be
> already freed,
> >> or simultaneously modified by control path.
> > Understood, agree on the issue
> >
> >>
> >> Again, it all can be mitigated by carefully re-designing and
> modifying control and
> >> data-path inside user app - by doing extra checks and
> synchronizations, etc.
> >> But from practical point - I presume most of users simply would
> avoid using this
> >> feature due all potential problems it might cause.
> > That is subjective, it all depends on the performance improvements
> users see in their application.
> > IMO, the performance improvement seen with this patch is worth few
> changes.
> 
> Yes, it is subjective till some extent, though my feeling
> that it might end-up being sort of synthetic improvement used only
> by some show-case benchmarks.

I believe that one specific important use case has already been mentioned, so I don't think this is a benchmark only feature.

>  From my perspective, it would be much more plausible,
> if we can introduce some sort of generic improvement, that doesn't
> impose all these extra constraints and implications.
> Like one, discussed below in that thread with ZC mempool approach.
> 

Considering this feature from a high level perspective, I agree with Konstantin's concerns, so I'll also support his views.

If this patch is supposed to be a generic feature, please add support for it in all NIC PMDs, not just one. (Regardless of whether the feature is defined as a 1:1 mapping or an N:M mapping.) It is purely software, so it should be available for all PMDs, not just your favorite hardware! Consider the "fast mbuf free" feature, which is pure software; why is that feature not implemented in all PMDs?

A secondary point I'm making here is that this specific feature will lead to an enormous amount of copy-paste code, instead of a generic library function easily available for all PMDs.
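
To make the "generic library function" point concrete, one could imagine a small shared helper along these lines, so each PMD only describes where its freed Tx mbuf pointers and its Rx sw-ring slots live; this is purely a sketch, nothing like it exists in DPDK today:

#include <stddef.h>
#include <stdint.h>

struct rte_mbuf;

/* Hypothetical shared helper: move up to 'n' already-freed mbuf pointers
 * from a Tx sw-ring segment straight into the Rx sw-ring slots being
 * re-armed, returning how many were moved.
 */
static inline uint16_t
rte_direct_rearm_fill(struct rte_mbuf **rxep, struct rte_mbuf **txep,
		      uint16_t n)
{
	uint16_t i;

	for (i = 0; i < n; i++) {
		rxep[i] = txep[i];
		txep[i] = NULL; /* the Tx ring no longer owns the mbuf */
	}
	return i;
}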

> 
> >>
> >>> Like Feifei says, the RX side has the normal packet allocation path
> still available.
> >>> Also this sounds like a corner case to me, we can handle this
> through checks in
> >> the queue_stop API.
> >>
> >> Depends.
> >> if it would be allowed to link queues only from the same port, then
> yes, extra
> >> checks for queue-stop might be enough.
> >> As right now DPDK doesn't allow user to change number of queues
> without
> >> dev_stop() first.
> >> Though if it would be allowed to link queues from different ports,
> then situation
> >> will be much worse.
> >> Right now ports are totally independent entities (except some
> special cases like
> >> link-bonding, etc.).
> >> As one port can keep doing RX/TX, second one can be stopped, re-
> confgured,
> >> even detached, and newly attached device might re-use same port
> number.
> > I see this as a similar restriction to the one discussed above.
> 
> Yes, they are similar in principal, though I think that the case
> with queues from different port would make things much more complex.
> 
>  > Do you see any issues if we enforce this with checks?
> 
> Hard to tell straightway, a lot will depend how smart
> such implementation would be.
> Usually DPDK tends not to avoid heavy
> synchronizations within its data-path functions.

Certainly! Implementing more and more of such features in the PMDs will lead to longer and longer data plane code paths in the PMDs. It is the "salami method", where each small piece makes no performance difference, but they all add up, and eventually the sum of them does impact the performance of the general use case negatively.

> 
> >>
> >>
> >>>>
> >>>>>
> >>>>>> - very limited usage scenario - it will have a positive effect
> only
> >>>>>>      when we have a fixed forwarding mapping: all (or nearly
> all) packets
> >>>>>>      from the RX queue are forwarded into the same TX queue.
> >>>>>
> >>>>> [Feifei] Although the usage scenario is limited, this usage
> scenario
> >>>>> has a wide range of applications, such as NIC with one port.
> >>>>
> >>>> yes, there are NICs with one port, but no guarantee there wouldn't
> be
> >>>> several such NICs within the system.
> >>> What I see in my interactions is, a single NIC/DPU is under
> utilized for a 2
> >> socket system. Some are adding more sockets to the system to better
> utilize the
> >> DPU. The NIC bandwidth continues to grow significantly. I do not
> think there will
> >> be a multi-DPU per server scenario.
> >>
> >>
> >> Interesting... from my experience it is visa-versa:
> >> in many cases 200Gb/s is not that much these days to saturate modern
> 2 socket
> >> x86 server.
> >> Though I suppose a lot depends on particular HW and actual workload.
> >>
> >>>
> >>>>
> >>>>> Furtrhermore, I think this is a tradeoff between performance and
> >>>>> flexibility.
> >>>>> Our goal is to achieve best performance, this means we need to
> give
> >>>>> up some flexibility decisively. For example of 'FAST_FREE Mode',
> it
> >>>>> deletes most of the buffer check (refcnt > 1, external buffer,
> chain
> >>>>> buffer), chooses a shorest path, and then achieve significant
> >>>>> performance
> >>>> improvement.
> >>>>>> Wonder did you had a chance to consider mempool-cache ZC API,
> >>>>>> similar to one we have for the ring?
> >>>>>> It would allow us on TX free path to avoid copying mbufs to
> >>>>>> temporary array on the stack.
> >>>>>> Instead we can put them straight from TX SW ring to the mempool
> cache.
> >>>>>> That should save extra store/load for mbuf and might help to
> >>>>>> achieve some performance gain without by-passing mempool.
> >>>>>> It probably wouldn't be as fast as what you proposing, but might
> be
> >>>>>> fast enough to consider as alternative.
> >>>>>> Again, it would be a generic one, so we can avoid all these
> >>>>>> implications and limitations.
> >>>>>
> >>>>> [Feifei] I think this is a good try. However, the most important
> >>>>> thing is that if we can bypass the mempool decisively to pursue
> the
> >>>>> significant performance gains.
> >>>>
> >>>> I understand the intention, and I personally think this is wrong
> and
> >>>> dangerous attitude.
> >>>> We have mempool abstraction in place for very good reason.
> >>>> So we need to try to improve mempool performance (and API if
> >>>> necessary) at first place, not to avoid it and break our own rules
> and
> >> recommendations.
> >>> The abstraction can be thought of at a higher level. i.e. the
> driver manages the
> >> buffer allocation/free and is hidden from the application. The
> application does
> >> not need to be aware of how these changes are implemented.
> >>>
> >>>>
> >>>>
> >>>>> For ZC, there maybe a problem for it in i40e. The reason for that
> >>>>> put Tx buffers into temporary is that i40e_tx_entry includes
> buffer
> >>>>> pointer and index.
> >>>>> Thus we cannot put Tx SW_ring entry into mempool directly, we
> need
> >>>>> to firstlt extract mbuf pointer. Finally, though we use ZC, we
> still
> >>>>> can't avoid using a temporary stack to extract Tx buffer
> pointers.
> >>>>
> >>>> When talking about ZC API for mempool cache I meant something
> like:
> >>>> void ** mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
> >>>> uint32_t *nb_elem, uint32_t flags); void
> >>>> mempool_cache_put_zc_finish(struct
> >>>> rte_mempool_cache *mc, uint32_t nb_elem); i.e. _start_ will return
> >>>> user a pointer inside mp-cache where to put free elems and max
> number
> >>>> of slots that can be safely filled.
> >>>> _finish_ will update mc->len.
> >>>> As an example:
> >>>>
> >>>> /* expect to free N mbufs */
> >>>> uint32_t n = N;
> >>>> void **p = mempool_cache_put_zc_start(mc, &n, ...);
> >>>>
> >>>> /* free up to n elems */
> >>>> for (i = 0; i != n; i++) {
> >>>>
> >>>>      /* get next free mbuf from somewhere */
> >>>>      mb = extract_and_prefree_mbuf(...);
> >>>>
> >>>>      /* no more free mbufs for now */
> >>>>      if (mb == NULL)
> >>>>         break;
> >>>>
> >>>>      p[i] = mb;
> >>>> }
> >>>>
> >>>> /* finalize ZC put, with _i_ freed elems */
> >>>> mempool_cache_put_zc_finish(mc, i);
> >>>>
> >>>> That way, I think we can overcome the issue with i40e_tx_entry you
> >>>> mentioned above. Plus it might be useful in other similar places.
> >>>>
> >>>> Another alternative is obviously to split i40e_tx_entry into two
> >>>> structs (one for mbuf, second for its metadata) and have a
> separate
> >>>> array for each of them.
> >>>> Though with that approach we need to make sure no perf drops will
> be
> >>>> introduced, plus probably more code changes will be required.
> >>> Commit '5171b4ee6b6" already does this (in a different way), but
> just for
> >> AVX512. Unfortunately, it does not record any performance
> improvements. We
> >> could port this to Arm NEON and look at the performance.
> >
>
  
Feifei Wang June 13, 2022, 5:55 a.m. UTC | #9
> -----Original Message-----
> From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
> Sent: Tuesday, May 24, 2022 9:26 AM
> To: Feifei Wang <Feifei.Wang2@arm.com>
> Cc: nd <nd@arm.com>; dev@dpdk.org; Ruifeng Wang
> <Ruifeng.Wang@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>
> Subject: Re: [PATCH v1 0/5] Direct re-arming of buffers on receive side
> 
> 
> 16/05/2022 07:10, Feifei Wang wrote:
> >
> >>> Currently, the transmit side frees the buffers into the lcore cache
> >>> and the receive side allocates buffers from the lcore cache. The
> >>> transmit side typically frees 32 buffers resulting in 32*8=256B of
> >>> stores to lcore cache. The receive side allocates 32 buffers and
> >>> stores them in the receive side software ring, resulting in
> >>> 32*8=256B of stores and 256B of load from the lcore cache.
> >>>
> >>> This patch proposes a mechanism to avoid freeing to/allocating from
> >>> the lcore cache. i.e. the receive side will free the buffers from
> >>> transmit side directly into it's software ring. This will avoid the
> >>> 256B of loads and stores introduced by the lcore cache. It also
> >>> frees up the cache lines used by the lcore cache.
> >>>
> >>> However, this solution poses several constraints:
> >>>
> >>> 1)The receive queue needs to know which transmit queue it should
> >>> take the buffers from. The application logic decides which transmit
> >>> port to use to send out the packets. In many use cases the NIC might
> >>> have a single port ([1], [2], [3]), in which case a given transmit
> >>> queue is always mapped to a single receive queue (1:1 Rx queue: Tx
> >>> queue). This is easy to configure.
> >>>
> >>> If the NIC has 2 ports (there are several references), then we will
> >>> have
> >>> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> >>> However, if this is generalized to 'N' ports, the configuration can
> >>> be long. More over the PMD would have to scan a list of transmit
> >>> queues to pull the buffers from.
> >
> >> Just to re-iterate some generic concerns about this proposal:
> >>   - We effectively link RX and TX queues - when this feature is enabled,
> >>     user can't stop TX queue without stopping linked RX queue first.
> >>     Right now user is free to start/stop any queues at his will.
> >>     If that feature will allow to link queues from different ports,
> >>     then even ports will become dependent and user will have to pay extra
> >>     care when managing such ports.
> >
> > [Feifei] When direct rearm enabled, there are two path for thread to
> > choose. If there are enough Tx freed buffers, Rx can put buffers from
> > Tx.
> > Otherwise, Rx will put buffers from mempool as usual. Thus, users do
> > not need to pay much attention managing ports.
> 
> What I am talking about: right now different port or different queues of the
> same port can be treated as independent entities:
> in general user is free to start/stop (and even reconfigure in some
> cases) one entity without need to stop other entity.
> I.E user can stop and re-configure TX queue while keep receiving packets
> from RX queue.
> With direct re-arm enabled, I think it wouldn't be possible any more:
> before stopping/reconfiguring TX queue user would have make sure that
> corresponding RX queue wouldn't be used by datapath.
> 
> >
> >> - very limited usage scenario - it will have a positive effect only
> >>    when we have a fixed forwarding mapping: all (or nearly all) packets
> >>    from the RX queue are forwarded into the same TX queue.
> >
> > [Feifei] Although the usage scenario is limited, this usage scenario
> > has a wide range of applications, such as NIC with one port.
> 
> yes, there are NICs with one port, but no guarantee there wouldn't be several
> such NICs within the system.
> 
> > Furtrhermore, I think this is a tradeoff between performance and
> > flexibility.
> > Our goal is to achieve best performance, this means we need to give up
> > some flexibility decisively. For example of 'FAST_FREE Mode', it
> > deletes most of the buffer check (refcnt > 1, external buffer, chain
> > buffer), chooses a shorest path, and then achieve significant performance
> improvement.
> >> Wonder did you had a chance to consider mempool-cache ZC API, similar
> >> to one we have for the ring?
> >> It would allow us on TX free path to avoid copying mbufs to temporary
> >> array on the stack.
> >> Instead we can put them straight from TX SW ring to the mempool cache.
> >> That should save extra store/load for mbuf and might help to achieve
> >> some performance gain without by-passing mempool.
> >> It probably wouldn't be as fast as what you proposing, but might be
> >> fast enough to consider as alternative.
> >> Again, it would be a generic one, so we can avoid all these
> >> implications and limitations.
> >
> > [Feifei] I think this is a good try. However, the most important thing
> > is that if we can bypass the mempool decisively to pursue the
> > significant performance gains.
> 
> I understand the intention, and I personally think this is wrong and dangerous
> attitude.
> We have mempool abstraction in place for very good reason.
> So we need to try to improve mempool performance (and API if necessary) at
> first place, not to avoid it and break our own rules and recommendations.
> 
> 
> > For ZC, there maybe a problem for it in i40e. The reason for that put
> > Tx buffers into temporary is that i40e_tx_entry includes buffer
> > pointer and index.
> > Thus we cannot put Tx SW_ring entry into mempool directly, we need to
> > firstlt extract mbuf pointer. Finally, though we use ZC, we still
> > can't avoid using a temporary stack to extract Tx buffer pointers.
> 
> When talking about ZC API for mempool cache I meant something like:
> void ** mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
> uint32_t *nb_elem, uint32_t flags); void
> mempool_cache_put_zc_finish(struct rte_mempool_cache *mc, uint32_t
> nb_elem); i.e. _start_ will return user a pointer inside mp-cache where to put
> free elems and max number of slots that can be safely filled.
> _finish_ will update mc->len.
> As an example:
> 
> /* expect to free N mbufs */
> uint32_t n = N;
> void **p = mempool_cache_put_zc_start(mc, &n, ...);
> 
> /* free up to n elems */
> for (i = 0; i != n; i++) {
> 
>    /* get next free mbuf from somewhere */
>    mb = extract_and_prefree_mbuf(...);
> 
>    /* no more free mbufs for now */
>    if (mb == NULL)
>       break;
> 
>    p[i] = mb;
> }
> 
> /* finalize ZC put, with _i_ freed elems */ mempool_cache_put_zc_finish(mc,
> i);
> 
> That way, I think we can overcome the issue with i40e_tx_entry you
> mentioned above. Plus it might be useful in other similar places.
> 
> Another alternative is obviously to split i40e_tx_entry into two structs (one
> for mbuf, second for its metadata) and have a separate array for each of them.
> Though with that approach we need to make sure no perf drops will be
> introduced, plus probably more code changes will be required.

[Feifei] I just uploaded an RFC patch to the community:
http://patches.dpdk.org/project/dpdk/patch/20220613055136.1949784-1-feifei.wang2@arm.com/
It does not use the ZC API; instead, following the i40e avx512 path, it accesses the mempool cache directly, outside the mempool API.
You can take a look when you are free. Thanks very much.
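
For readers following along, a rough sketch of that idea on the Tx free path, appending the freed mbuf pointers straight into the lcore's mempool cache in the spirit of the i40e avx512 code; treat the cache-handling details below as illustrative rather than a definitive implementation:

#include <rte_lcore.h>
#include <rte_memcpy.h>
#include <rte_mempool.h>

/* Sketch: put 'n' freed buffers directly into the per-lcore mempool cache,
 * falling back to the generic put when there is no cache or the burst is
 * too large for it.
 */
static inline void
sketch_tx_free_to_cache(struct rte_mempool *mp, void **bufs, unsigned int n)
{
	struct rte_mempool_cache *cache =
		rte_mempool_default_cache(mp, rte_lcore_id());

	if (cache == NULL || n > RTE_MEMPOOL_CACHE_MAX_SIZE) {
		rte_mempool_generic_put(mp, bufs, n, cache);
		return;
	}

	/* Append the pointers to the cache... */
	rte_memcpy(&cache->objs[cache->len], bufs, sizeof(void *) * n);
	cache->len += n;

	/* ...and flush the excess back to the backing pool when needed. */
	if (cache->len >= cache->flushthresh) {
		rte_mempool_ops_enqueue_bulk(mp, &cache->objs[cache->size],
					     cache->len - cache->size);
		cache->len = cache->size;
	}
}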
> 
> 
> 
>
  
Honnappa Nagarahalli June 29, 2022, 9:58 p.m. UTC | #10
<snip>

> > >>>
> > >>>>
> > >>>>
> > >>>> 16/05/2022 07:10, Feifei Wang wrote:
> > >>>>>
> > >>>>>>> Currently, the transmit side frees the buffers into the lcore
> > >>>>>>> cache and the receive side allocates buffers from the lcore
> > cache.
> > >>>>>>> The transmit side typically frees 32 buffers resulting in
> > >>>>>>> 32*8=256B of stores to lcore cache. The receive side allocates
> > 32
> > >>>>>>> buffers and stores them in the receive side software ring,
> > >>>>>>> resulting in 32*8=256B of stores and 256B of load from the
> > lcore cache.
> > >>>>>>>
> > >>>>>>> This patch proposes a mechanism to avoid freeing to/allocating
> > >>>>>>> from the lcore cache. i.e. the receive side will free the
> > buffers
> > >>>>>>> from transmit side directly into it's software ring. This will
> > >>>>>>> avoid the 256B of loads and stores introduced by the lcore
> > cache.
> > >>>>>>> It also frees up the cache lines used by the lcore cache.
> > >>>>>>>
> > >>>>>>> However, this solution poses several constraints:
> > >>>>>>>
> > >>>>>>> 1)The receive queue needs to know which transmit queue it
> > should
> > >>>>>>> take the buffers from. The application logic decides which
> > >>>>>>> transmit port to use to send out the packets. In many use
> > >>>>>>> cases the NIC might have a single port ([1], [2], [3]), in
> > >>>>>>> which case
> > a
> > >>>>>>> given transmit queue is always mapped to a single receive
> > >>>>>>> queue
> > >>>>>>> (1:1 Rx queue: Tx queue). This is easy to configure.
> > >>>>>>>
> > >>>>>>> If the NIC has 2 ports (there are several references), then we
> > >>>>>>> will have
> > >>>>>>> 1:2 (RX queue: TX queue) mapping which is still easy to
> > configure.
> > >>>>>>> However, if this is generalized to 'N' ports, the
> > >>>>>>> configuration can be long. More over the PMD would have to
> > >>>>>>> scan a list of transmit queues to pull the buffers from.
> > >>>>>
> > >>>>>> Just to re-iterate some generic concerns about this proposal:
> > >>>>>>     - We effectively link RX and TX queues - when this feature
> > is enabled,
> > >>>>>>       user can't stop TX queue without stopping linked RX queue
> > first.
> > >>>>>>       Right now user is free to start/stop any queues at his
> > will.
> > >>>>>>       If that feature will allow to link queues from different
> > ports,
> > >>>>>>       then even ports will become dependent and user will have
> > to pay extra
> > >>>>>>       care when managing such ports.
> > >>>>>
> > >>>>> [Feifei] When direct rearm enabled, there are two path for
> > >>>>> thread
> > to
> > >>>>> choose. If there are enough Tx freed buffers, Rx can put buffers
> > >>>>> from Tx.
> > >>>>> Otherwise, Rx will put buffers from mempool as usual. Thus,
> > >>>>> users
> > do
> > >>>>> not need to pay much attention managing ports.
> > >>>>
> > >>>> What I am talking about: right now different port or different
> > queues
> > >>>> of the same port can be treated as independent entities:
> > >>>> in general user is free to start/stop (and even reconfigure in
> > some
> > >>>> cases) one entity without need to stop other entity.
> > >>>> I.E user can stop and re-configure TX queue while keep receiving
> > >>>> packets from RX queue.
> > >>>> With direct re-arm enabled, I think it wouldn't be possible any
> > more:
> > >>>> before stopping/reconfiguring TX queue user would have make sure
> > that
> > >>>> corresponding RX queue wouldn't be used by datapath.
> > >>> I am trying to understand the problem better. For the TX queue to
> > be stopped,
> > >> the user must have blocked the data plane from accessing the TX
> > queue.
> > >>
> > >> Surely it is user responsibility tnot to call tx_burst() for
> > stopped/released queue.
> > >> The problem is that while TX for that queue is stopped, RX for
> > related queue still
> > >> can continue.
> > >> So rx_burst() will try to read/modify TX queue data, that might be
> > already freed,
> > >> or simultaneously modified by control path.
> > > Understood, agree on the issue
> > >
> > >>
> > >> Again, it all can be mitigated by carefully re-designing and
> > modifying control and
> > >> data-path inside user app - by doing extra checks and
> > synchronizations, etc.
> > >> But from practical point - I presume most of users simply would
> > avoid using this
> > >> feature due all potential problems it might cause.
> > > That is subjective, it all depends on the performance improvements
> > users see in their application.
> > > IMO, the performance improvement seen with this patch is worth few
> > changes.
> >
> > Yes, it is subjective till some extent, though my feeling that it
> > might end-up being sort of synthetic improvement used only by some
> > show-case benchmarks.
> 
> I believe that one specific important use case has already been mentioned, so I
> don't think this is a benchmark only feature.
+1

> 
> >  From my perspective, it would be much more plausible, if we can
> > introduce some sort of generic improvement, that doesn't impose all
> > these extra constraints and implications.
> > Like one, discussed below in that thread with ZC mempool approach.
> >
> 
> Considering this feature from a high level perspective, I agree with Konstantin's
> concerns, so I'll also support his views.
We did hack up the ZC mempool approach [1]; the level of improvement is pretty small compared with this patch.

[1] http://patches.dpdk.org/project/dpdk/patch/20220613055136.1949784-1-feifei.wang2@arm.com/

> 
> If this patch is supposed to be a generic feature, please add support for it in all
> NIC PMDs, not just one. (Regardless if the feature is defined as 1:1 mapping or
> N:M mapping.) It is purely software, so it should be available for all PMDs, not
> just your favorite hardware! Consider the "fast mbuf free" feature, which is
> pure software; why is that feature not implemented in all PMDs?
Agreed, it is good to have it supported in all the drivers. We do not have favorite hardware; we just picked a PMD we are more familiar with. We do plan to implement it in other prominent PMDs.

> 
> A secondary point I'm making here is that this specific feature will lead to an
> enormous amount of copy-paste code, instead of a generic library function
> easily available for all PMDs.
Are you talking about the i40e driver code specifically? If yes, agreed, we should avoid copy-paste, and we will look to reduce that.


> 
> >
> > >>
> > >>> Like Feifei says, the RX side has the normal packet allocation
> > >>> path
> > still available.
> > >>> Also this sounds like a corner case to me, we can handle this
> > through checks in
> > >> the queue_stop API.
> > >>
> > >> Depends.
> > >> if it would be allowed to link queues only from the same port, then
> > yes, extra
> > >> checks for queue-stop might be enough.
> > >> As right now DPDK doesn't allow user to change number of queues
> > without
> > >> dev_stop() first.
> > >> Though if it would be allowed to link queues from different ports,
> > then situation
> > >> will be much worse.
> > >> Right now ports are totally independent entities (except some
> > special cases like
> > >> link-bonding, etc.).
> > >> As one port can keep doing RX/TX, second one can be stopped, re-
> > confgured,
> > >> even detached, and newly attached device might re-use same port
> > number.
> > > I see this as a similar restriction to the one discussed above.
> >
> > Yes, they are similar in principal, though I think that the case with
> > queues from different port would make things much more complex.
> >
> >  > Do you see any issues if we enforce this with checks?
> >
> > Hard to tell straightway, a lot will depend how smart such
> > implementation would be.
> > Usually DPDK tends not to avoid heavy
> > synchronizations within its data-path functions.
> 
> Certainly! Implementing more and more of such features in the PMDs will lead
> to longer and longer data plane code paths in the PMDs. It is the "salami
> method", where each small piece makes no performance difference, but they all
> add up, and eventually the sum of them does impact the performance of the
> general use case negatively.
It would be good to have a test running in UNH that shows the performance trend.

> 
> >
> > >>
> > >>
> > >>>>
> > >>>>>
> > >>>>>> - very limited usage scenario - it will have a positive effect
> > only
> > >>>>>>      when we have a fixed forwarding mapping: all (or nearly
> > all) packets
> > >>>>>>      from the RX queue are forwarded into the same TX queue.
> > >>>>>
> > >>>>> [Feifei] Although the usage scenario is limited, this usage
> > scenario
> > >>>>> has a wide range of applications, such as NIC with one port.
> > >>>>
> > >>>> yes, there are NICs with one port, but no guarantee there
> > >>>> wouldn't
> > be
> > >>>> several such NICs within the system.
> > >>> What I see in my interactions is, a single NIC/DPU is under
> > utilized for a 2
> > >> socket system. Some are adding more sockets to the system to better
> > utilize the
> > >> DPU. The NIC bandwidth continues to grow significantly. I do not
> > think there will
> > >> be a multi-DPU per server scenario.
> > >>
> > >>
> > >> Interesting... from my experience it is visa-versa:
> > >> in many cases 200Gb/s is not that much these days to saturate
> > >> modern
> > 2 socket
> > >> x86 server.
> > >> Though I suppose a lot depends on particular HW and actual workload.
> > >>
> > >>>
> > >>>>
> > >>>>> Furtrhermore, I think this is a tradeoff between performance and
> > >>>>> flexibility.
> > >>>>> Our goal is to achieve best performance, this means we need to
> > give
> > >>>>> up some flexibility decisively. For example of 'FAST_FREE Mode',
> > it
> > >>>>> deletes most of the buffer check (refcnt > 1, external buffer,
> > chain
> > >>>>> buffer), chooses a shorest path, and then achieve significant
> > >>>>> performance
> > >>>> improvement.
> > >>>>>> Wonder did you had a chance to consider mempool-cache ZC API,
> > >>>>>> similar to one we have for the ring?
> > >>>>>> It would allow us on TX free path to avoid copying mbufs to
> > >>>>>> temporary array on the stack.
> > >>>>>> Instead we can put them straight from TX SW ring to the mempool
> > cache.
> > >>>>>> That should save extra store/load for mbuf and might help to
> > >>>>>> achieve some performance gain without by-passing mempool.
> > >>>>>> It probably wouldn't be as fast as what you proposing, but
> > >>>>>> might
> > be
> > >>>>>> fast enough to consider as alternative.
> > >>>>>> Again, it would be a generic one, so we can avoid all these
> > >>>>>> implications and limitations.
> > >>>>>
> > >>>>> [Feifei] I think this is a good try. However, the most important
> > >>>>> thing is that if we can bypass the mempool decisively to pursue
> > the
> > >>>>> significant performance gains.
> > >>>>
> > >>>> I understand the intention, and I personally think this is wrong
> > and
> > >>>> dangerous attitude.
> > >>>> We have mempool abstraction in place for very good reason.
> > >>>> So we need to try to improve mempool performance (and API if
> > >>>> necessary) at first place, not to avoid it and break our own
> > >>>> rules
> > and
> > >> recommendations.
> > >>> The abstraction can be thought of at a higher level. i.e. the
> > driver manages the
> > >> buffer allocation/free and is hidden from the application. The
> > application does
> > >> not need to be aware of how these changes are implemented.
> > >>>
> > >>>>
> > >>>>
> > >>>>> For ZC, there maybe a problem for it in i40e. The reason for
> > >>>>> that put Tx buffers into temporary is that i40e_tx_entry
> > >>>>> includes
> > buffer
> > >>>>> pointer and index.
> > >>>>> Thus we cannot put Tx SW_ring entry into mempool directly, we
> > need
> > >>>>> to firstlt extract mbuf pointer. Finally, though we use ZC, we
> > still
> > >>>>> can't avoid using a temporary stack to extract Tx buffer
> > pointers.
> > >>>>
> > >>>> When talking about ZC API for mempool cache I meant something
> > like:
> > >>>> void ** mempool_cache_put_zc_start(struct rte_mempool_cache *mc,
> > >>>> uint32_t *nb_elem, uint32_t flags); void
> > >>>> mempool_cache_put_zc_finish(struct
> > >>>> rte_mempool_cache *mc, uint32_t nb_elem); i.e. _start_ will
> > >>>> return user a pointer inside mp-cache where to put free elems and
> > >>>> max
> > number
> > >>>> of slots that can be safely filled.
> > >>>> _finish_ will update mc->len.
> > >>>> As an example:
> > >>>>
> > >>>> /* expect to free N mbufs */
> > >>>> uint32_t n = N;
> > >>>> void **p = mempool_cache_put_zc_start(mc, &n, ...);
> > >>>>
> > >>>> /* free up to n elems */
> > >>>> for (i = 0; i != n; i++) {
> > >>>>
> > >>>>      /* get next free mbuf from somewhere */
> > >>>>      mb = extract_and_prefree_mbuf(...);
> > >>>>
> > >>>>      /* no more free mbufs for now */
> > >>>>      if (mb == NULL)
> > >>>>         break;
> > >>>>
> > >>>>      p[i] = mb;
> > >>>> }
> > >>>>
> > >>>> /* finalize ZC put, with _i_ freed elems */
> > >>>> mempool_cache_put_zc_finish(mc, i);
> > >>>>
> > >>>> That way, I think we can overcome the issue with i40e_tx_entry
> > >>>> you mentioned above. Plus it might be useful in other similar places.
> > >>>>
> > >>>> Another alternative is obviously to split i40e_tx_entry into two
> > >>>> structs (one for mbuf, second for its metadata) and have a
> > separate
> > >>>> array for each of them.
> > >>>> Though with that approach we need to make sure no perf drops will
> > be
> > >>>> introduced, plus probably more code changes will be required.
> > >>> Commit '5171b4ee6b6" already does this (in a different way), but
> > just for
> > >> AVX512. Unfortunately, it does not record any performance
> > improvements. We
> > >> could port this to Arm NEON and look at the performance.
> > >
> >
  
Morten Brørup June 30, 2022, 3:21 p.m. UTC | #11
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Wednesday, 29 June 2022 23.58
> 
> <snip>
> 
> > > >>>>
> > > >>>> 16/05/2022 07:10, Feifei Wang wrote:
> > > >>>>>
> > > >>>>>>> Currently, the transmit side frees the buffers into the
> lcore
> > > >>>>>>> cache and the receive side allocates buffers from the lcore
> > > cache.
> > > >>>>>>> The transmit side typically frees 32 buffers resulting in
> > > >>>>>>> 32*8=256B of stores to lcore cache. The receive side
> allocates
> > > 32
> > > >>>>>>> buffers and stores them in the receive side software ring,
> > > >>>>>>> resulting in 32*8=256B of stores and 256B of load from the
> > > lcore cache.
> > > >>>>>>>
> > > >>>>>>> This patch proposes a mechanism to avoid freeing
> to/allocating
> > > >>>>>>> from the lcore cache. i.e. the receive side will free the
> > > buffers
> > > >>>>>>> from transmit side directly into it's software ring. This
> will
> > > >>>>>>> avoid the 256B of loads and stores introduced by the lcore
> > > cache.
> > > >>>>>>> It also frees up the cache lines used by the lcore cache.
> > > >>>>>>>
> > > >>>>>>> However, this solution poses several constraints:
> > > >>>>>>>
> > > >>>>>>> 1)The receive queue needs to know which transmit queue it
> > > should
> > > >>>>>>> take the buffers from. The application logic decides which
> > > >>>>>>> transmit port to use to send out the packets. In many use
> > > >>>>>>> cases the NIC might have a single port ([1], [2], [3]), in
> > > >>>>>>> which case
> > > a
> > > >>>>>>> given transmit queue is always mapped to a single receive
> > > >>>>>>> queue
> > > >>>>>>> (1:1 Rx queue: Tx queue). This is easy to configure.
> > > >>>>>>>
> > > >>>>>>> If the NIC has 2 ports (there are several references), then
> we
> > > >>>>>>> will have
> > > >>>>>>> 1:2 (RX queue: TX queue) mapping which is still easy to
> > > configure.
> > > >>>>>>> However, if this is generalized to 'N' ports, the
> > > >>>>>>> configuration can be long. More over the PMD would have to
> > > >>>>>>> scan a list of transmit queues to pull the buffers from.
> > > >>>>>
> > > >>>>>> Just to re-iterate some generic concerns about this
> proposal:
> > > >>>>>>     - We effectively link RX and TX queues - when this
> feature
> > > is enabled,
> > > >>>>>>       user can't stop TX queue without stopping linked RX
> queue
> > > first.
> > > >>>>>>       Right now user is free to start/stop any queues at his
> > > will.
> > > >>>>>>       If that feature will allow to link queues from
> different
> > > ports,
> > > >>>>>>       then even ports will become dependent and user will
> have
> > > to pay extra
> > > >>>>>>       care when managing such ports.
> > > >>>>>
> > > >>>>> [Feifei] When direct rearm enabled, there are two path for
> > > >>>>> thread
> > > to
> > > >>>>> choose. If there are enough Tx freed buffers, Rx can put
> buffers
> > > >>>>> from Tx.
> > > >>>>> Otherwise, Rx will put buffers from mempool as usual. Thus,
> > > >>>>> users
> > > do
> > > >>>>> not need to pay much attention managing ports.
> > > >>>>
> > > >>>> What I am talking about: right now different port or different
> > > queues
> > > >>>> of the same port can be treated as independent entities:
> > > >>>> in general user is free to start/stop (and even reconfigure in
> > > some
> > > >>>> cases) one entity without need to stop other entity.
> > > >>>> I.E user can stop and re-configure TX queue while keep
> receiving
> > > >>>> packets from RX queue.
> > > >>>> With direct re-arm enabled, I think it wouldn't be possible
> any
> > > more:
> > > >>>> before stopping/reconfiguring TX queue user would have make
> sure
> > > that
> > > >>>> corresponding RX queue wouldn't be used by datapath.
> > > >>> I am trying to understand the problem better. For the TX queue
> to
> > > be stopped,
> > > >> the user must have blocked the data plane from accessing the TX
> > > queue.
> > > >>
> > > >> Surely it is user responsibility tnot to call tx_burst() for
> > > stopped/released queue.
> > > >> The problem is that while TX for that queue is stopped, RX for
> > > related queue still
> > > >> can continue.
> > > >> So rx_burst() will try to read/modify TX queue data, that might
> be
> > > already freed,
> > > >> or simultaneously modified by control path.
> > > > Understood, agree on the issue
> > > >
> > > >>
> > > >> Again, it all can be mitigated by carefully re-designing and
> > > modifying control and
> > > >> data-path inside user app - by doing extra checks and
> > > synchronizations, etc.
> > > >> But from practical point - I presume most of users simply would
> > > avoid using this
> > > >> feature due all potential problems it might cause.
> > > > That is subjective, it all depends on the performance
> improvements
> > > users see in their application.
> > > > IMO, the performance improvement seen with this patch is worth
> few
> > > changes.
> > >
> > > Yes, it is subjective till some extent, though my feeling that it
> > > might end-up being sort of synthetic improvement used only by some
> > > show-case benchmarks.
> >
> > I believe that one specific important use case has already been
> mentioned, so I
> > don't think this is a benchmark only feature.
> +1
> 
> >
> > >  From my perspective, it would be much more plausible, if we can
> > > introduce some sort of generic improvement, that doesn't impose all
> > > these extra constraints and implications.
> > > Like one, discussed below in that thread with ZC mempool approach.
> > >
> >
> > Considering this feature from a high level perspective, I agree with
> Konstantin's
> > concerns, so I'll also support his views.
> We did hack the ZC mempool approach [1], level of improvement is pretty
> small compared with this patch.
> 
> [1] http://patches.dpdk.org/project/dpdk/patch/20220613055136.1949784-
> 1-feifei.wang2@arm.com/
> 
> >
> > If this patch is supposed to be a generic feature, please add support
> for it in all
> > NIC PMDs, not just one. (Regardless if the feature is defined as 1:1
> mapping or
> > N:M mapping.) It is purely software, so it should be available for
> all PMDs, not
> > just your favorite hardware! Consider the "fast mbuf free" feature,
> which is
> > pure software; why is that feature not implemented in all PMDs?
> Agree, it is good to have it supported in all the drivers. We do not
> have a favorite hardware, just picked a PMD which we are more familiar
> with. We do plan to implement in other prominent PMDs.
> 
> >
> > A secondary point I'm making here is that this specific feature will
> lead to an
> > enormous amount of copy-paste code, instead of a generic library
> function
> > easily available for all PMDs.
> Are you talking about the i40e driver code in specific? If yes, agree
> we should avoid copy-paste and we will look to reduce that.

Yes, I am talking about the code that needs to be copied into all prominent PMDs. Perhaps you can move the majority of it into a common directory, if not into a generic library, so the modification per PMD becomes smaller. (I see the same copy-paste issue with the "fast mbuf free" feature, if it is to be supported by PMDs other than i40e.)

Please note that I do not expect you to implement this feature in more PMDs than you need. I was trying to make the point that implementing a software feature in a PMD requires copy-pasting to other PMDs, which can require a big effort, while implementing it in a library and calling the library from the PMDs requires a smaller effort per PMD. I intentionally phrased it somewhat provokingly, and was lucky not to offend anyone. :-)

> 
> 
> >
> > >
> > > >>
> > > >>> Like Feifei says, the RX side has the normal packet allocation
> > > >>> path
> > > still available.
> > > >>> Also this sounds like a corner case to me, we can handle this
> > > through checks in
> > > >> the queue_stop API.
> > > >>
> > > >> Depends.
> > > >> if it would be allowed to link queues only from the same port,
> then
> > > yes, extra
> > > >> checks for queue-stop might be enough.
> > > >> As right now DPDK doesn't allow user to change number of queues
> > > without
> > > >> dev_stop() first.
> > > >> Though if it would be allowed to link queues from different
> ports,
> > > then situation
> > > >> will be much worse.
> > > >> Right now ports are totally independent entities (except some
> > > special cases like
> > > >> link-bonding, etc.).
> > > >> As one port can keep doing RX/TX, second one can be stopped, re-
> > > confgured,
> > > >> even detached, and newly attached device might re-use same port
> > > number.
> > > > I see this as a similar restriction to the one discussed above.
> > >
> > > Yes, they are similar in principal, though I think that the case
> with
> > > queues from different port would make things much more complex.
> > >
> > >  > Do you see any issues if we enforce this with checks?
> > >
> > > Hard to tell straightway, a lot will depend how smart such
> > > implementation would be.
> > > Usually DPDK tends not to avoid heavy
> > > synchronizations within its data-path functions.
> >
> > Certainly! Implementing more and more of such features in the PMDs
> will lead
> > to longer and longer data plane code paths in the PMDs. It is the
> "salami
> > method", where each small piece makes no performance difference, but
> they all
> > add up, and eventually the sum of them does impact the performance of
> the
> > general use case negatively.
> It would be good to have a test running in UNH that shows the
> performance trend.
+1

> 
> >
> > >
> > > >>
> > > >>
> > > >>>>
> > > >>>>>
> > > >>>>>> - very limited usage scenario - it will have a positive
> effect
> > > only
> > > >>>>>>      when we have a fixed forwarding mapping: all (or nearly
> > > all) packets
> > > >>>>>>      from the RX queue are forwarded into the same TX queue.
> > > >>>>>
> > > >>>>> [Feifei] Although the usage scenario is limited, this usage
> > > scenario
> > > >>>>> has a wide range of applications, such as NIC with one port.
> > > >>>>
> > > >>>> yes, there are NICs with one port, but no guarantee there
> > > >>>> wouldn't
> > > be
> > > >>>> several such NICs within the system.
> > > >>> What I see in my interactions is, a single NIC/DPU is under
> > > utilized for a 2
> > > >> socket system. Some are adding more sockets to the system to
> better
> > > utilize the
> > > >> DPU. The NIC bandwidth continues to grow significantly. I do not
> > > think there will
> > > >> be a multi-DPU per server scenario.
> > > >>
> > > >>
> > > >> Interesting... from my experience it is visa-versa:
> > > >> in many cases 200Gb/s is not that much these days to saturate
> > > >> modern
> > > 2 socket
> > > >> x86 server.
> > > >> Though I suppose a lot depends on particular HW and actual
> workload.
> > > >>
> > > >>>
> > > >>>>
> > > >>>>> Furtrhermore, I think this is a tradeoff between performance
> and
> > > >>>>> flexibility.
> > > >>>>> Our goal is to achieve best performance, this means we need
> to
> > > give
> > > >>>>> up some flexibility decisively. For example of 'FAST_FREE
> Mode',
> > > it
> > > >>>>> deletes most of the buffer check (refcnt > 1, external
> buffer,
> > > chain
> > > >>>>> buffer), chooses a shorest path, and then achieve significant
> > > >>>>> performance
> > > >>>> improvement.
> > > >>>>>> Wonder did you had a chance to consider mempool-cache ZC
> API,
> > > >>>>>> similar to one we have for the ring?
> > > >>>>>> It would allow us on TX free path to avoid copying mbufs to
> > > >>>>>> temporary array on the stack.
> > > >>>>>> Instead we can put them straight from TX SW ring to the
> mempool
> > > cache.
> > > >>>>>> That should save extra store/load for mbuf and might help to
> > > >>>>>> achieve some performance gain without by-passing mempool.
> > > >>>>>> It probably wouldn't be as fast as what you proposing, but
> > > >>>>>> might
> > > be
> > > >>>>>> fast enough to consider as alternative.
> > > >>>>>> Again, it would be a generic one, so we can avoid all these
> > > >>>>>> implications and limitations.
> > > >>>>>
> > > >>>>> [Feifei] I think this is a good try. However, the most
> important
> > > >>>>> thing is that if we can bypass the mempool decisively to
> pursue
> > > the
> > > >>>>> significant performance gains.
> > > >>>>
> > > >>>> I understand the intention, and I personally think this is
> wrong
> > > and
> > > >>>> dangerous attitude.
> > > >>>> We have mempool abstraction in place for very good reason.
> > > >>>> So we need to try to improve mempool performance (and API if
> > > >>>> necessary) at first place, not to avoid it and break our own
> > > >>>> rules
> > > and
> > > >> recommendations.
> > > >>> The abstraction can be thought of at a higher level. i.e. the
> > > driver manages the
> > > >> buffer allocation/free and is hidden from the application. The
> > > application does
> > > >> not need to be aware of how these changes are implemented.
> > > >>>
> > > >>>>
> > > >>>>
> > > >>>>> For ZC, there maybe a problem for it in i40e. The reason for
> > > >>>>> that put Tx buffers into temporary is that i40e_tx_entry
> > > >>>>> includes
> > > buffer
> > > >>>>> pointer and index.
> > > >>>>> Thus we cannot put Tx SW_ring entry into mempool directly, we
> > > need
> > > >>>>> to firstlt extract mbuf pointer. Finally, though we use ZC,
> we
> > > still
> > > >>>>> can't avoid using a temporary stack to extract Tx buffer
> > > pointers.
> > > >>>>
> > > >>>> When talking about ZC API for mempool cache I meant something
> > > like:
> > > >>>> void ** mempool_cache_put_zc_start(struct rte_mempool_cache
> *mc,
> > > >>>> uint32_t *nb_elem, uint32_t flags); void
> > > >>>> mempool_cache_put_zc_finish(struct
> > > >>>> rte_mempool_cache *mc, uint32_t nb_elem); i.e. _start_ will
> > > >>>> return user a pointer inside mp-cache where to put free elems
> and
> > > >>>> max
> > > number
> > > >>>> of slots that can be safely filled.
> > > >>>> _finish_ will update mc->len.
> > > >>>> As an example:
> > > >>>>
> > > >>>> /* expect to free N mbufs */
> > > >>>> uint32_t n = N;
> > > >>>> void **p = mempool_cache_put_zc_start(mc, &n, ...);
> > > >>>>
> > > >>>> /* free up to n elems */
> > > >>>> for (i = 0; i != n; i++) {
> > > >>>>
> > > >>>>      /* get next free mbuf from somewhere */
> > > >>>>      mb = extract_and_prefree_mbuf(...);
> > > >>>>
> > > >>>>      /* no more free mbufs for now */
> > > >>>>      if (mb == NULL)
> > > >>>>         break;
> > > >>>>
> > > >>>>      p[i] = mb;
> > > >>>> }
> > > >>>>
> > > >>>> /* finalize ZC put, with _i_ freed elems */
> > > >>>> mempool_cache_put_zc_finish(mc, i);
> > > >>>>
> > > >>>> That way, I think we can overcome the issue with i40e_tx_entry
> > > >>>> you mentioned above. Plus it might be useful in other similar
> places.
> > > >>>>
> > > >>>> Another alternative is obviously to split i40e_tx_entry into
> two
> > > >>>> structs (one for mbuf, second for its metadata) and have a
> > > separate
> > > >>>> array for each of them.
> > > >>>> Though with that approach we need to make sure no perf drops
> will
> > > be
> > > >>>> introduced, plus probably more code changes will be required.
> > > >>> Commit '5171b4ee6b6" already does this (in a different way),
> but
> > > just for
> > > >> AVX512. Unfortunately, it does not record any performance
> > > improvements. We
> > > >> could port this to Arm NEON and look at the performance.
> > > >
> > >
  
Honnappa Nagarahalli July 1, 2022, 7:30 p.m. UTC | #12
<snip>

> > > > >>>>
> > > > >>>> 16/05/2022 07:10, Feifei Wang wrote:
> > > > >>>>>
> > > > >>>>>>> Currently, the transmit side frees the buffers into the
> > lcore
> > > > >>>>>>> cache and the receive side allocates buffers from the
> > > > >>>>>>> lcore
> > > > cache.
> > > > >>>>>>> The transmit side typically frees 32 buffers resulting in
> > > > >>>>>>> 32*8=256B of stores to lcore cache. The receive side
> > allocates
> > > > 32
> > > > >>>>>>> buffers and stores them in the receive side software ring,
> > > > >>>>>>> resulting in 32*8=256B of stores and 256B of load from the
> > > > lcore cache.
> > > > >>>>>>>
> > > > >>>>>>> This patch proposes a mechanism to avoid freeing
> > to/allocating
> > > > >>>>>>> from the lcore cache. i.e. the receive side will free the
> > > > buffers
> > > > >>>>>>> from transmit side directly into it's software ring. This
> > will
> > > > >>>>>>> avoid the 256B of loads and stores introduced by the lcore
> > > > cache.
> > > > >>>>>>> It also frees up the cache lines used by the lcore cache.
> > > > >>>>>>>
> > > > >>>>>>> However, this solution poses several constraints:
> > > > >>>>>>>
> > > > >>>>>>> 1)The receive queue needs to know which transmit queue it
> > > > should
> > > > >>>>>>> take the buffers from. The application logic decides which
> > > > >>>>>>> transmit port to use to send out the packets. In many use
> > > > >>>>>>> cases the NIC might have a single port ([1], [2], [3]), in
> > > > >>>>>>> which case
> > > > a
> > > > >>>>>>> given transmit queue is always mapped to a single receive
> > > > >>>>>>> queue
> > > > >>>>>>> (1:1 Rx queue: Tx queue). This is easy to configure.
> > > > >>>>>>>
> > > > >>>>>>> If the NIC has 2 ports (there are several references),
> > > > >>>>>>> then
> > we
> > > > >>>>>>> will have
> > > > >>>>>>> 1:2 (RX queue: TX queue) mapping which is still easy to
> > > > configure.
> > > > >>>>>>> However, if this is generalized to 'N' ports, the
> > > > >>>>>>> configuration can be long. More over the PMD would have to
> > > > >>>>>>> scan a list of transmit queues to pull the buffers from.
> > > > >>>>>
> > > > >>>>>> Just to re-iterate some generic concerns about this
> > proposal:
> > > > >>>>>>     - We effectively link RX and TX queues - when this
> > feature
> > > > is enabled,
> > > > >>>>>>       user can't stop TX queue without stopping linked RX
> > queue
> > > > first.
> > > > >>>>>>       Right now user is free to start/stop any queues at
> > > > >>>>>> his
> > > > will.
> > > > >>>>>>       If that feature will allow to link queues from
> > different
> > > > ports,
> > > > >>>>>>       then even ports will become dependent and user will
> > have
> > > > to pay extra
> > > > >>>>>>       care when managing such ports.
> > > > >>>>>
> > > > >>>>> [Feifei] When direct rearm is enabled, there are two paths for
> > > > >>>>> thread
> > > > to
> > > > >>>>> choose. If there are enough Tx freed buffers, Rx can put
> > buffers
> > > > >>>>> from Tx.
> > > > >>>>> Otherwise, Rx will put buffers from mempool as usual. Thus,
> > > > >>>>> users
> > > > do
> > > > >>>>> not need to pay much attention managing ports.
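As a rough, self-contained sketch of the two-path rearm described above (all structure and function names here are hypothetical, ring wrap-around is omitted, and this is not the actual patch code):

#include <stdint.h>

struct sw_entry { void *mbuf; };

struct tx_queue_model {
    struct sw_entry *sw_ring;
    uint16_t nb_free;    /* buffers Tx has already freed and can hand over */
    uint16_t next_free;  /* index of the first reusable entry */
};

struct rx_queue_model {
    struct sw_entry *sw_ring;
    uint16_t rearm_start;
    uint16_t rearm_thresh;
    struct tx_queue_model *mapped_txq;  /* set by the mapping API, may be NULL */
};

/* Fallback path: allocate from the mempool as today (stubbed out here). */
static int
rearm_from_mempool(struct rx_queue_model *rxq, uint16_t n)
{
    (void)rxq;
    (void)n;
    return 0;
}

static int
rx_rearm(struct rx_queue_model *rxq)
{
    struct tx_queue_model *txq = rxq->mapped_txq;
    uint16_t i, n = rxq->rearm_thresh;

    /* Path 1: enough Tx-freed buffers are available on the mapped queue. */
    if (txq != NULL && txq->nb_free >= n) {
        for (i = 0; i < n; i++)
            rxq->sw_ring[rxq->rearm_start + i].mbuf =
                txq->sw_ring[txq->next_free + i].mbuf;
        txq->nb_free -= n;
        txq->next_free += n;
        rxq->rearm_start += n;
        return 0;
    }

    /* Path 2: fall back to the normal mempool allocation. */
    return rearm_from_mempool(rxq, n);
}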
> > > > >>>>
> > > > >>>> What I am talking about: right now different port or
> > > > >>>> different
> > > > queues
> > > > >>>> of the same port can be treated as independent entities:
> > > > >>>> in general user is free to start/stop (and even reconfigure
> > > > >>>> in
> > > > some
> > > > >>>> cases) one entity without need to stop other entity.
> > > > >>>> I.E user can stop and re-configure TX queue while keep
> > receiving
> > > > >>>> packets from RX queue.
> > > > >>>> With direct re-arm enabled, I think it wouldn't be possible
> > any
> > > > more:
> > > > >>>> before stopping/reconfiguring TX queue user would have to make
> > sure
> > > > that
> > > > >>>> corresponding RX queue wouldn't be used by datapath.
> > > > >>> I am trying to understand the problem better. For the TX queue
> > to
> > > > be stopped,
> > > > >> the user must have blocked the data plane from accessing the TX
> > > > queue.
> > > > >>
> > > > > >> Surely it is the user's responsibility not to call tx_burst() for
> > > > stopped/released queue.
> > > > >> The problem is that while TX for that queue is stopped, RX for
> > > > related queue still
> > > > >> can continue.
> > > > >> So rx_burst() will try to read/modify TX queue data, that might
> > be
> > > > already freed,
> > > > >> or simultaneously modified by control path.
> > > > > Understood, agree on the issue
> > > > >
> > > > >>
> > > > >> Again, it all can be mitigated by carefully re-designing and
> > > > modifying control and
> > > > >> data-path inside user app - by doing extra checks and
> > > > synchronizations, etc.
> > > > >> But from a practical point of view - I presume most users simply would
> > > > avoid using this
> > > > >> feature due to all the potential problems it might cause.
> > > > > That is subjective, it all depends on the performance
> > improvements
> > > > users see in their application.
> > > > > IMO, the performance improvement seen with this patch is worth
> > few
> > > > changes.
> > > >
> > > > Yes, it is subjective to some extent, though my feeling is that it
> > > > might end up being a sort of synthetic improvement used only by some
> > > > show-case benchmarks.
> > >
> > > I believe that one specific important use case has already been
> > mentioned, so I
> > > don't think this is a benchmark only feature.
> > +1
> >
> > >
> > > >  From my perspective, it would be much more plausible, if we can
> > > > introduce some sort of generic improvement, that doesn't impose
> > > > all these extra constraints and implications.
> > > > Like one, discussed below in that thread with ZC mempool approach.
> > > >
> > >
> > > Considering this feature from a high level perspective, I agree with
> > Konstantin's
> > > concerns, so I'll also support his views.
> > We did hack the ZC mempool approach [1], level of improvement is
> > pretty small compared with this patch.
> >
> > [1] http://patches.dpdk.org/project/dpdk/patch/20220613055136.1949784-
> > 1-feifei.wang2@arm.com/
> >
> > >
> > > If this patch is supposed to be a generic feature, please add
> > > support
> > for it in all
> > > NIC PMDs, not just one. (Regardless if the feature is defined as 1:1
> > mapping or
> > > N:M mapping.) It is purely software, so it should be available for
> > all PMDs, not
> > > just your favorite hardware! Consider the "fast mbuf free" feature,
> > which is
> > > pure software; why is that feature not implemented in all PMDs?
> > Agree, it is good to have it supported in all the drivers. We do not
> > have a favorite hardware, just picked a PMD which we are more familiar
> > with. We do plan to implement in other prominent PMDs.
> >
> > >
> > > A secondary point I'm making here is that this specific feature will
> > lead to an
> > > enormous amount of copy-paste code, instead of a generic library
> > function
> > > easily available for all PMDs.
> > Are you talking about the i40e driver code in specific? If yes, agree
> > we should avoid copy-paste and we will look to reduce that.
> 
> Yes, I am talking about the code that needs to be copied into all prominent
> PMDs. Perhaps you can move the majority of it into a common directory, if not
> in a generic library, so the modification per PMD becomes smaller. (I see the
> same copy-paste issue with the "fast mbuf free" feature, if to be supported by
> other than the i40e PMD.)
The current abstraction does not allow for common code at this (lower) level across all the PMDs. If we look at "fast free", it accesses the device-private structure for the list of buffers to free. If this is to become common code, that access needs to be lifted up, along with the other dependent configuration thresholds etc.
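For context, the per-PMD pattern referred to here looks roughly like the sketch below. The type and field names are illustrative (not the exact i40e code), tx_rs_thresh is assumed to be at most 64, and wrap-around is omitted; the point is that the loop has to be re-implemented per driver because sw_ring and the thresholds live in the device-private structure:

#include <stdint.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Illustrative Tx entry and queue; each real PMD has its own private layout. */
struct ff_tx_entry { struct rte_mbuf *mbuf; };

struct ff_txq {
    struct ff_tx_entry *sw_ring;
    uint16_t tx_next_dd;    /* last descriptor of the batch known to be done */
    uint16_t tx_rs_thresh;  /* number of buffers freed per call (<= 64 here) */
};

/*
 * With MBUF_FAST_FREE the driver may skip refcnt/segment checks and
 * bulk-return the mbufs to their single shared mempool.
 */
static void
tx_free_bufs_fast(struct ff_txq *txq)
{
    struct rte_mbuf *free[64];
    struct ff_tx_entry *txep =
        &txq->sw_ring[txq->tx_next_dd - (txq->tx_rs_thresh - 1)];
    uint16_t i;

    for (i = 0; i < txq->tx_rs_thresh; i++)
        free[i] = txep[i].mbuf;

    rte_mempool_put_bulk(free[0]->pool, (void **)free, txq->tx_rs_thresh);
    txq->tx_next_dd += txq->tx_rs_thresh;
}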

> 
> Please note that I do not expect you to implement this feature in other PMDs
> than you need. I was trying to make the point that implementing a software
> feature in a PMD requires copy-pasting to other PMDs, which can require a big
> effort; while implementing it in a library and calling the library from the PMDs
> require a smaller effort per PMD. I intentionally phrased it somewhat
> provokingly, and was lucky not to offend anyone. :-)
> 
> >
> >
> > >
> > > >
> > > > >>
> > > > >>> Like Feifei says, the RX side has the normal packet allocation
> > > > >>> path
> > > > still available.
> > > > >>> Also this sounds like a corner case to me, we can handle this
> > > > through checks in
> > > > >> the queue_stop API.
> > > > >>
> > > > >> Depends.
> > > > >> if it would be allowed to link queues only from the same port,
> > then
> > > > yes, extra
> > > > >> checks for queue-stop might be enough.
> > > > >> As right now DPDK doesn't allow user to change number of queues
> > > > without
> > > > >> dev_stop() first.
> > > > >> Though if it would be allowed to link queues from different
> > ports,
> > > > then situation
> > > > >> will be much worse.
> > > > >> Right now ports are totally independent entities (except some
> > > > special cases like
> > > > >> link-bonding, etc.).
> > > > >> As one port can keep doing RX/TX, second one can be stopped,
> > > > >> re-
> > > > configured,
> > > > >> even detached, and newly attached device might re-use same port
> > > > number.
> > > > > I see this as a similar restriction to the one discussed above.
> > > >
> > > > Yes, they are similar in principle, though I think that the case
> > with
> > > > queues from different ports would make things much more complex.
> > > >
> > > >  > Do you see any issues if we enforce this with checks?
> > > >
> > > > Hard to tell straight away, a lot will depend on how smart such an
> > > > implementation would be.
> > > > Usually DPDK tends to avoid heavy synchronizations within its
> > > > data-path functions.
> > >
> > > Certainly! Implementing more and more of such features in the PMDs
> > will lead
> > > to longer and longer data plane code paths in the PMDs. It is the
> > "salami
> > > method", where each small piece makes no performance difference, but
> > they all
> > > add up, and eventually the sum of them does impact the performance
> > > of
> > the
> > > general use case negatively.
> > It would be good to have a test running in UNH that shows the
> > performance trend.
> +1
> 
> >
> > >
> > > >
> > > > >>
> > > > >>
> > > > >>>>
> > > > >>>>>
> > > > >>>>>> - very limited usage scenario - it will have a positive
> > effect
> > > > only
> > > > >>>>>>      when we have a fixed forwarding mapping: all (or
> > > > >>>>>> nearly
> > > > all) packets
> > > > >>>>>>      from the RX queue are forwarded into the same TX queue.
> > > > >>>>>
> > > > >>>>> [Feifei] Although the usage scenario is limited, this usage
> > > > scenario
> > > > >>>>> has a wide range of applications, such as NIC with one port.
> > > > >>>>
> > > > >>>> yes, there are NICs with one port, but no guarantee there
> > > > >>>> wouldn't
> > > > be
> > > > >>>> several such NICs within the system.
> > > > >>> What I see in my interactions is, a single NIC/DPU is under
> > > > utilized for a 2
> > > > >> socket system. Some are adding more sockets to the system to
> > better
> > > > utilize the
> > > > >> DPU. The NIC bandwidth continues to grow significantly. I do
> > > > >> not
> > > > think there will
> > > > >> be a multi-DPU per server scenario.
> > > > >>
> > > > >>
> > > > >> Interesting... from my experience it is vice versa:
> > > > >> in many cases 200Gb/s is not that much these days to saturate
> > > > >> modern
> > > > 2 socket
> > > > >> x86 server.
> > > > >> Though I suppose a lot depends on particular HW and actual
> > workload.
> > > > >>
> > > > >>>
> > > > >>>>
> > > > >>>>> Furthermore, I think this is a tradeoff between performance
> > and
> > > > >>>>> flexibility.
> > > > >>>>> Our goal is to achieve best performance, this means we need
> > to
> > > > give
> > > > >>>>> up some flexibility decisively. For example, 'FAST_FREE
> > Mode',
> > > > it
> > > > >>>>> deletes most of the buffer checks (refcnt > 1, external
> > buffer,
> > > > chain
> > > > >>>>> buffer), chooses the shortest path, and then achieves
> > > > >>>>> significant performance
> > > > >>>> improvement.
> > > > >>>>>> I wonder if you had a chance to consider a mempool-cache ZC
> > API,
> > > > >>>>>> similar to the one we have for the ring?
> > > > >>>>>> It would allow us on TX free path to avoid copying mbufs to
> > > > >>>>>> temporary array on the stack.
> > > > >>>>>> Instead we can put them straight from TX SW ring to the
> > mempool
> > > > cache.
> > > > >>>>>> That should save extra store/load for mbuf and might help
> > > > >>>>>> to achieve some performance gain without by-passing mempool.
> > > > >>>>>> It probably wouldn't be as fast as what you are proposing, but
> > > > >>>>>> might
> > > > be
> > > > >>>>>> fast enough to consider as alternative.
> > > > >>>>>> Again, it would be a generic one, so we can avoid all these
> > > > >>>>>> implications and limitations.
> > > > >>>>>
> > > > >>>>> [Feifei] I think this is a good try. However, the most
> > important
> > > > >>>>> thing is whether we can bypass the mempool decisively to
> > pursue
> > > > the
> > > > >>>>> significant performance gains.
> > > > >>>>
> > > > >>>> I understand the intention, and I personally think this is a
> > wrong
> > > > and
> > > > >>>> dangerous attitude.
> > > > >>>> We have mempool abstraction in place for very good reason.
> > > > >>>> So we need to try to improve mempool performance (and API if
> > > > >>>> necessary) in the first place, not to avoid it and break our own
> > > > >>>> rules
> > > > and
> > > > >> recommendations.
> > > > >>> The abstraction can be thought of at a higher level. i.e. the
> > > > driver manages the
> > > > >> buffer allocation/free and is hidden from the application. The
> > > > application does
> > > > >> not need to be aware of how these changes are implemented.
> > > > >>>
> > > > >>>>
> > > > >>>>
> > > > >>>>> For ZC, there may be a problem with it in i40e. The reason for
> > > > >>>>> putting Tx buffers into a temporary array is that i40e_tx_entry
> > > > >>>>> includes
> > > > buffer
> > > > >>>>> pointer and index.
> > > > >>>>> Thus we cannot put Tx SW_ring entry into mempool directly,
> > > > >>>>> we
> > > > need
> > > > >>>>> to first extract the mbuf pointer. Finally, though we use ZC,
> > we
> > > > still
> > > > >>>>> can't avoid using a temporary stack to extract Tx buffer
> > > > pointers.
> > > > >>>>
> > > > >>>> When talking about ZC API for mempool cache I meant something
> > > > like:
> > > > >>>> void ** mempool_cache_put_zc_start(struct rte_mempool_cache
> > *mc,
> > > > >>>> uint32_t *nb_elem, uint32_t flags); void
> > > > >>>> mempool_cache_put_zc_finish(struct
> > > > >>>> rte_mempool_cache *mc, uint32_t nb_elem); i.e. _start_ will
> > > > >>>> return user a pointer inside mp-cache where to put free elems
> > and
> > > > >>>> max
> > > > number
> > > > >>>> of slots that can be safely filled.
> > > > >>>> _finish_ will update mc->len.
> > > > >>>> As an example:
> > > > >>>>
> > > > >>>> /* expect to free N mbufs */
> > > > >>>> uint32_t n = N;
> > > > >>>> void **p = mempool_cache_put_zc_start(mc, &n, ...);
> > > > >>>>
> > > > >>>> /* free up to n elems */
> > > > >>>> for (i = 0; i != n; i++) {
> > > > >>>>
> > > > >>>>      /* get next free mbuf from somewhere */
> > > > >>>>      mb = extract_and_prefree_mbuf(...);
> > > > >>>>
> > > > >>>>      /* no more free mbufs for now */
> > > > >>>>      if (mb == NULL)
> > > > >>>>         break;
> > > > >>>>
> > > > >>>>      p[i] = mb;
> > > > >>>> }
> > > > >>>>
> > > > >>>> /* finalize ZC put, with _i_ freed elems */
> > > > >>>> mempool_cache_put_zc_finish(mc, i);
> > > > >>>>
> > > > >>>> That way, I think we can overcome the issue with
> > > > >>>> i40e_tx_entry you mentioned above. Plus it might be useful in
> > > > >>>> other similar
> > places.
> > > > >>>>
> > > > >>>> Another alternative is obviously to split i40e_tx_entry into
> > two
> > > > >>>> structs (one for mbuf, second for its metadata) and have a
> > > > separate
> > > > >>>> array for each of them.
> > > > >>>> Though with that approach we need to make sure no perf drops
> > will
> > > > be
> > > > >>>> introduced, plus probably more code changes will be required.
> > > > >>> Commit '5171b4ee6b6' already does this (in a different way),
> > but
> > > > just for
> > > > >> AVX512. Unfortunately, it does not record any performance
> > > > improvements. We
> > > > >> could port this to Arm NEON and look at the performance.
> > > > >
> > > >
  
Morten Brørup July 1, 2022, 8:28 p.m. UTC | #13
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Friday, 1 July 2022 21.31
> 
> <snip>
> 
> > > > > >>>>
> > > > > >>>> 16/05/2022 07:10, Feifei Wang writes:
> > > > > >>>>>
> > > > > >>>>>>> Currently, the transmit side frees the buffers into the
> > > lcore
> > > > > >>>>>>> cache and the receive side allocates buffers from the
> > > > > >>>>>>> lcore
> > > > > cache.
> > > > > >>>>>>> The transmit side typically frees 32 buffers resulting
> in
> > > > > >>>>>>> 32*8=256B of stores to lcore cache. The receive side
> > > allocates
> > > > > 32
> > > > > >>>>>>> buffers and stores them in the receive side software
> ring,
> > > > > >>>>>>> resulting in 32*8=256B of stores and 256B of load from
> the
> > > > > lcore cache.
> > > > > >>>>>>>
> > > > > >>>>>>> This patch proposes a mechanism to avoid freeing
> > > to/allocating
> > > > > >>>>>>> from the lcore cache. i.e. the receive side will free
> the
> > > > > buffers
> > > > > >>>>>>> from transmit side directly into it's software ring.
> This
> > > will
> > > > > >>>>>>> avoid the 256B of loads and stores introduced by the
> lcore
> > > > > cache.
> > > > > >>>>>>> It also frees up the cache lines used by the lcore
> cache.
> > > > > >>>>>>>
> > > > > >>>>>>> However, this solution poses several constraints:
> > > > > >>>>>>>
> > > > > >>>>>>> 1)The receive queue needs to know which transmit queue
> it
> > > > > should
> > > > > >>>>>>> take the buffers from. The application logic decides
> which
> > > > > >>>>>>> transmit port to use to send out the packets. In many
> use
> > > > > >>>>>>> cases the NIC might have a single port ([1], [2], [3]),
> in
> > > > > >>>>>>> which case
> > > > > a
> > > > > >>>>>>> given transmit queue is always mapped to a single
> receive
> > > > > >>>>>>> queue
> > > > > >>>>>>> (1:1 Rx queue: Tx queue). This is easy to configure.
> > > > > >>>>>>>
> > > > > >>>>>>> If the NIC has 2 ports (there are several references),
> > > > > >>>>>>> then
> > > we
> > > > > >>>>>>> will have
> > > > > >>>>>>> 1:2 (RX queue: TX queue) mapping which is still easy to
> > > > > configure.
> > > > > >>>>>>> However, if this is generalized to 'N' ports, the
> > > > > >>>>>>> configuration can be long. More over the PMD would have
> to
> > > > > >>>>>>> scan a list of transmit queues to pull the buffers
> from.
> > > > > >>>>>
> > > > > >>>>>> Just to re-iterate some generic concerns about this
> > > proposal:
> > > > > >>>>>>     - We effectively link RX and TX queues - when this
> > > feature
> > > > > is enabled,
> > > > > >>>>>>       user can't stop TX queue without stopping linked
> RX
> > > queue
> > > > > first.
> > > > > >>>>>>       Right now user is free to start/stop any queues at
> > > > > >>>>>> his
> > > > > will.
> > > > > >>>>>>       If that feature will allow to link queues from
> > > different
> > > > > ports,
> > > > > >>>>>>       then even ports will become dependent and user
> will
> > > have
> > > > > to pay extra
> > > > > >>>>>>       care when managing such ports.
> > > > > >>>>>
> > > > > >>>>> [Feifei] When direct rearm is enabled, there are two paths
> for
> > > > > >>>>> thread
> > > > > to
> > > > > >>>>> choose. If there are enough Tx freed buffers, Rx can put
> > > buffers
> > > > > >>>>> from Tx.
> > > > > >>>>> Otherwise, Rx will put buffers from mempool as usual.
> Thus,
> > > > > >>>>> users
> > > > > do
> > > > > >>>>> not need to pay much attention managing ports.
> > > > > >>>>
> > > > > >>>> What I am talking about: right now different port or
> > > > > >>>> different
> > > > > queues
> > > > > >>>> of the same port can be treated as independent entities:
> > > > > >>>> in general user is free to start/stop (and even
> reconfigure
> > > > > >>>> in
> > > > > some
> > > > > >>>> cases) one entity without need to stop other entity.
> > > > > >>>> I.E user can stop and re-configure TX queue while keep
> > > receiving
> > > > > >>>> packets from RX queue.
> > > > > >>>> With direct re-arm enabled, I think it wouldn't be
> possible
> > > any
> > > > > more:
> > > > > >>>> before stopping/reconfiguring TX queue user would have to
> make
> > > sure
> > > > > that
> > > > > >>>> corresponding RX queue wouldn't be used by datapath.
> > > > > >>> I am trying to understand the problem better. For the TX
> queue
> > > to
> > > > > be stopped,
> > > > > >> the user must have blocked the data plane from accessing the
> TX
> > > > > queue.
> > > > > >>
> > > > > >> Surely it is the user's responsibility not to call tx_burst() for
> > > > > stopped/released queue.
> > > > > >> The problem is that while TX for that queue is stopped, RX
> for
> > > > > related queue still
> > > > > >> can continue.
> > > > > >> So rx_burst() will try to read/modify TX queue data, that
> might
> > > be
> > > > > already freed,
> > > > > >> or simultaneously modified by control path.
> > > > > > Understood, agree on the issue
> > > > > >
> > > > > >>
> > > > > >> Again, it all can be mitigated by carefully re-designing and
> > > > > modifying control and
> > > > > >> data-path inside user app - by doing extra checks and
> > > > > synchronizations, etc.
> > > > > >> But from a practical point of view - I presume most users simply
> would
> > > > > avoid using this
> > > > > feature due to all the potential problems it might cause.
> > > > > > That is subjective, it all depends on the performance
> > > improvements
> > > > > users see in their application.
> > > > > > IMO, the performance improvement seen with this patch is
> worth
> > > few
> > > > > changes.
> > > > >
> > > > > Yes, it is subjective to some extent, though my feeling is that
> it
> > > > > might end up being a sort of synthetic improvement used only by
> some
> > > > > show-case benchmarks.
> > > >
> > > > I believe that one specific important use case has already been
> > > mentioned, so I
> > > > don't think this is a benchmark only feature.
> > > +1
> > >
> > > >
> > > > >  From my perspective, it would be much more plausible, if we
> can
> > > > > introduce some sort of generic improvement, that doesn't impose
> > > > > all these extra constraints and implications.
> > > > > Like one, discussed below in that thread with ZC mempool
> approach.
> > > > >
> > > >
> > > > Considering this feature from a high level perspective, I agree
> with
> > > Konstantin's
> > > > concerns, so I'll also support his views.
> > > We did hack the ZC mempool approach [1], level of improvement is
> > > pretty small compared with this patch.
> > >
> > > [1]
> http://patches.dpdk.org/project/dpdk/patch/20220613055136.1949784-
> > > 1-feifei.wang2@arm.com/
> > >
> > > >
> > > > If this patch is supposed to be a generic feature, please add
> > > > support
> > > for it in all
> > > > NIC PMDs, not just one. (Regardless if the feature is defined as
> 1:1
> > > mapping or
> > > > N:M mapping.) It is purely software, so it should be available
> for
> > > all PMDs, not
> > > > just your favorite hardware! Consider the "fast mbuf free"
> feature,
> > > which is
> > > > pure software; why is that feature not implemented in all PMDs?
> > > Agree, it is good to have it supported in all the drivers. We do
> not
> > > have a favorite hardware, just picked a PMD which we are more
> familiar
> > > with. We do plan to implement in other prominent PMDs.
> > >
> > > >
> > > > A secondary point I'm making here is that this specific feature
> will
> > > lead to an
> > > > enormous amount of copy-paste code, instead of a generic library
> > > function
> > > > easily available for all PMDs.
> > > Are you talking about the i40e driver code in specific? If yes,
> agree
> > > we should avoid copy-paste and we will look to reduce that.
> >
> > Yes, I am talking about the code that needs to be copied into all
> prominent
> > PMDs. Perhaps you can move the majority of it into a common
> directory, if not
> > in a generic library, so the modification per PMD becomes smaller. (I
> see the
> > same copy-paste issue with the "fast mbuf free" feature, if to be
> supported by
> > other than the i40e PMD.)
> The current abstraction does not allow for common code at this (lower)
> level across all the PMDs. If we look at "fast free", it accesses the
> device-private structure for the list of buffers to free. If this is to
> become common code, that access needs to be lifted up, along with the
> other dependent configuration thresholds etc.

Exactly. The "direct re-arm" feature has some of the same design properties as "fast free": They are both purely software features, and they are both embedded deeply in the PMD code.

I don't know if it is possible, but perhaps we could redesign the private buffer structures of the PMDs, so the first part of them is common across PMDs, and the rest is truly private. The common part would obviously hold the mbuf pointer. That way, a common library could manipulate the common part.
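As a sketch of that idea (the type names below are hypothetical):

#include <stdint.h>
#include <rte_mbuf.h>

/* Common prefix that a generic library could rely on. */
struct common_tx_entry {
    struct rte_mbuf *mbuf;
};

/* A PMD-private entry would embed the common part as its first member. */
struct example_pmd_tx_entry {
    struct common_tx_entry common;  /* must remain the first member */
    uint16_t next_id;               /* PMD-specific metadata follows */
    uint16_t last_id;
};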

Or perhaps some other means could be provided for a common library to manipulate the common parts of a private structure, e.g. a common fast_free function could take a few parameters to work on private buffer structures: void * buffer_array, size_t element_size, size_t mbuf_offset_in_element, unsigned int num_elements.
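A sketch of what such a helper could look like, following the parameter list above (the function itself is an assumption, not an existing DPDK API):

#include <stddef.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

/*
 * Bulk-free the mbufs referenced from an array of PMD-private buffer
 * structures without knowing the private layout: only the element size and
 * the offset of the mbuf pointer inside each element are needed. Assumes all
 * mbufs come from the same mempool, as MBUF_FAST_FREE already requires.
 */
static void
generic_fast_free(void *buffer_array, size_t element_size,
        size_t mbuf_offset_in_element, unsigned int num_elements)
{
    struct rte_mbuf *free[64];
    unsigned int i, n = 0;
    char *elem = buffer_array;

    for (i = 0; i < num_elements; i++, elem += element_size) {
        free[n++] = *(struct rte_mbuf **)(elem + mbuf_offset_in_element);
        /* Flush the batch when it is full or at the last element. */
        if (n == 64 || i == num_elements - 1) {
            rte_mempool_put_bulk(free[0]->pool, (void **)free, n);
            n = 0;
        }
    }
}

A PMD would then call it with its own private types, e.g. generic_fast_free(txq->sw_ring, sizeof(*txq->sw_ring), offsetof(struct i40e_tx_entry, mbuf), n), keeping knowledge of the private layout inside the driver.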

> 
> >
> > Please note that I do not expect you to implement this feature in
> other PMDs
> > than you need. I was trying to make the point that implementing a
> software
> > feature in a PMD requires copy-pasting to other PMDs, which can
> require a big
> > effort; while implementing it in a library and calling the library
> from the PMDs
> > require a smaller effort per PMD. I intentionally phrased it somewhat
> > provokingly, and was lucky not to offend anyone. :-)

Unless we find a better solution, copy-pasting code across PMDs seems to be the only way to implement such features. And I agree that they should not be blocked due to "code complexity" or "copy-paste" arguments, if they are impossible to implement in a more "correct" way.

The DPDK community should accept such contributions to the common code, and be grateful that they don't just go into private forks of DPDK!