[dpdk-dev] net/mlx5: poll completion queue once per call

Message ID 20170720154835.13571-1-yskoh@mellanox.com (mailing list archive)
State Accepted, archived
Delegated to: Ferruh Yigit
Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK

Commit Message

Yongseok Koh July 20, 2017, 3:48 p.m. UTC
  mlx5_tx_complete() polls completion queue multiple times until it
encounters an invalid entry. As Tx completions are suppressed by
MLX5_TX_COMP_THRESH, it is a waste of cycles to expect multiple completions
in a poll. And freeing too many buffers in a call can cause high jitter.
This patch improves throughput a little.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
---
 drivers/net/mlx5/mlx5_rxtx.h | 32 ++++++++++----------------------
 1 file changed, 10 insertions(+), 22 deletions(-)
  

Comments

Sagi Grimberg July 20, 2017, 4:34 p.m. UTC | #1
> mlx5_tx_complete() polls completion queue multiple times until it
> encounters an invalid entry. As Tx completions are suppressed by
> MLX5_TX_COMP_THRESH, it is a waste of cycles to expect multiple completions
> in a poll. And freeing too many buffers in a call can cause high jitter.
> This patch improves throughput a little.

What if the device generates a burst of completions? Holding these
completions un-reaped can theoretically cause resource stress on
the corresponding mempool(s).

I totally get the need for a stopping condition, but is "loop once"
the best stop condition?

Perhaps an adaptive budget (based on online stats) would perform better?
  
Yongseok Koh July 21, 2017, 3:10 p.m. UTC | #2
On Thu, Jul 20, 2017 at 07:34:04PM +0300, Sagi Grimberg wrote:
> 
> > mlx5_tx_complete() polls completion queue multiple times until it
> > encounters an invalid entry. As Tx completions are suppressed by
> > MLX5_TX_COMP_THRESH, it is a waste of cycles to expect multiple completions
> > in a poll. And freeing too many buffers in a call can cause high jitter.
> > This patch improves throughput a little.
> 
> What if the device generates a burst of completions?
The mlx5 PMD suppresses completions anyway. It requests a completion for every
MLX5_TX_COMP_THRESH Tx mbufs, not for every single mbuf. So, the size of the
completion queue is effectively much smaller.
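
Concretely, the suppression looks roughly like this. This is a simplified
sketch, not the actual PMD code; the helper name is made up, and only struct
txq, its elts_comp counter and MLX5_TX_COMP_THRESH are taken from mlx5_rxtx.h:

/* Sketch: request a CQE only once per MLX5_TX_COMP_THRESH sent mbufs. */
static inline void
txq_maybe_request_comp(struct txq *txq, unsigned int sent)
{
	unsigned int comp = txq->elts_comp + sent;

	if (comp >= MLX5_TX_COMP_THRESH) {
		/* Set the completion-request bits on the burst's last WQE. */
		txq->elts_comp = 0;
	} else {
		/* Suppressed: no CQE will be generated for these mbufs. */
		txq->elts_comp = comp;
	}
}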

> Holding these completions un-reaped can theoretically cause resource stress on
> the corresponding mempool(s).
Can you make your point clearer? Do you think the "stress" can impact
performance? I think stress doesn't matter unless the mempool is depleted. And the app is
responsible for supplying enough mbufs considering the depth of all queues (max
# of outstanding mbufs).

> I totally get the need for a stopping condition, but is "loop once"
> the best stop condition?
Best for what?

> Perhaps an adaptive budget (based on online stats) would perform better?
Please bring up any suggestion or submit a patch if any. Does "budget" mean the
threshold? If so, calculation of stats for adaptive threshold can impact single
core performance. With multiple cores, adjusting threshold doesn't affect much.

Thanks,
Yongseok
  
Sagi Grimberg July 23, 2017, 9:49 a.m. UTC | #3
>>> mlx5_tx_complete() polls completion queue multiple times until it
>>> encounters an invalid entry. As Tx completions are suppressed by
>>> MLX5_TX_COMP_THRESH, it is a waste of cycles to expect multiple completions
>>> in a poll. And freeing too many buffers in a call can cause high jitter.
>>> This patch improves throughput a little.
>>
>> What if the device generates a burst of completions?
> The mlx5 PMD suppresses completions anyway. It requests a completion for every
> MLX5_TX_COMP_THRESH Tx mbufs, not for every single mbuf. So, the size of the
> completion queue is effectively much smaller.

Yes I realize that, but can't the device still complete in a burst (of
unsuppressed completions)? I mean it's not guaranteed that for every
txq_complete a signaled completion is pending right? What happens if
the device has inconsistent completion pacing? Can't the sw grow a
batch of completions if txq_complete will process a single completion
unconditionally?

>> Holding these completions un-reaped can theoretically cause resource stress on
>> the corresponding mempool(s).
> Can you make your point clearer? Do you think the "stress" can impact
> performance? I think stress doesn't matter unless the mempool is depleted. And the app is
> responsible for supplying enough mbufs considering the depth of all queues (max
> # of outstanding mbufs).

I might be missing something, but # of outstanding mbufs should be
relatively small as the pmd reaps every MLX5_TX_COMP_THRESH mbufs right?
Why should the pool account for the entire TX queue depth (which can
be very large)?

Is there a hard requirement documented somewhere that the application
needs to account for the entire TX queue depths for sizing its mbuf
pool?

My question is: with the proposed change, doesn't this mean that the
application might need to allocate a bigger TX mbuf pool? Because the
pmd can theoretically consume completions slower (as in multiple TX
burst calls)?

>> I totally get the need for a stopping condition, but is "loop once"
>> the best stop condition?
> Best for what?

Best condition to stop consuming TX completions. As I said, I think that
leaving TX completions un-reaped can (at least in theory) slow down the
mbuf reclamation, which impacts the application (unless I'm misunderstanding
something fundamental).

>> Perhaps an adaptive budget (based on online stats) would perform better?
> Please bring up any suggestion or submit a patch if any.

I was simply providing a review for the patch. I don't have the time
to come up with a better patch, unfortunately, but I still think it's
fair to raise a point.

> Does "budget" mean the
> threshold? If so, calculation of stats for adaptive threshold can impact single
> core performance. With multiple cores, adjusting threshold doesn't affect much.

If you look at mlx5e driver in the kernel, it maintains online stats on
its RX and TX queues. It maintains these stats mostly for adaptive
interrupt moderation control (but not only).

I was suggesting maintaining per TX queue stats on average completions
consumed for each TX burst call, and adjusting the stopping condition
according to a calculated stat.
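
For illustration, a hypothetical sketch of such a stat-driven budget. None of
these names exist in the mlx5 PMD; the code is only meant to make the
suggestion concrete:

#include <stdint.h>

/* Hypothetical per-txq state: EWMA of CQEs consumed per burst, x8 fixed point. */
struct txq_comp_stats {
	uint32_t avg_comp_x8;
};

/* Update the EWMA with the last sample and return the budget for next time. */
static inline unsigned int
tx_comp_budget(struct txq_comp_stats *s, unsigned int consumed_last)
{
	unsigned int budget;

	/* avg = 7/8 * avg + 1/8 * sample, stored as avg * 8. */
	s->avg_comp_x8 += consumed_last - (s->avg_comp_x8 >> 3);
	budget = s->avg_comp_x8 >> 3;
	return budget ? budget : 1; /* always reap at least one CQE */
}

The completion loop would then stop after that many valid CQEs instead of
after exactly one.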
  
Yongseok Koh July 25, 2017, 7:43 a.m. UTC | #4
On Sun, Jul 23, 2017 at 12:49:36PM +0300, Sagi Grimberg wrote:
> > > > mlx5_tx_complete() polls completion queue multiple times until it
> > > > encounters an invalid entry. As Tx completions are suppressed by
> > > > MLX5_TX_COMP_THRESH, it is a waste of cycles to expect multiple completions
> > > > in a poll. And freeing too many buffers in a call can cause high jitter.
> > > > This patch improves throughput a little.
> > > 
> > > What if the device generates a burst of completions?
> > The mlx5 PMD suppresses completions anyway. It requests a completion for every
> > MLX5_TX_COMP_THRESH Tx mbufs, not for every single mbuf. So, the size of the
> > completion queue is effectively much smaller.
> 
> Yes I realize that, but can't the device still complete in a burst (of
> unsuppressed completions)? I mean it's not guaranteed that for every
> txq_complete a signaled completion is pending right? What happens if
> the device has inconsistent completion pacing? Can't the sw grow a
> batch of completions if txq_complete will process a single completion
> unconditionally?
Speculation. First of all, the device doesn't delay completion notifications for
no reason. An ASIC is not SW running on top of an OS. If a completion comes up
late, it means the device really can't keep up with the rate of posting
descriptors. If so, tx_burst() should generate back-pressure by returning a
partial Tx count; the app can then decide between drop and retry. Retry on Tx
means back-pressuring the Rx side if the app is forwarding packets.
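
For context, the usual application-side pattern when tx_burst() returns a
partial count looks roughly like this (a generic DPDK sketch, not
mlx5-specific; the helper name is made up):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Send a burst; retry the leftovers once, then drop them (back-pressure). */
static void
send_or_drop(uint16_t port_id, uint16_t queue_id,
	     struct rte_mbuf **pkts, uint16_t nb_pkts)
{
	uint16_t sent = rte_eth_tx_burst(port_id, queue_id, pkts, nb_pkts);

	if (sent < nb_pkts) {
		/* One retry to absorb a short stall in the Tx queue... */
		sent += rte_eth_tx_burst(port_id, queue_id,
					 pkts + sent, nb_pkts - sent);
		/* ...then drop the rest, propagating the back-pressure. */
		while (sent < nb_pkts)
			rte_pktmbuf_free(pkts[sent++]);
	}
}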

A more serious problem I expected was the case where the THRESH is smaller than
the burst size. In that case, txq->elts[] will be short of slots all the time.
But fortunately, in the MLX PMD, we request at most one completion per burst,
not one every THRESH packets.

If there's some SW jitter on Tx processing, the Tx CQ can grow for sure. The
question to myself was "when does it shrink?". It shrinks when the Tx burst is
light (burst size smaller than THRESH) because mlx5_tx_complete() is always
called every time tx_burst() is called. What if it keeps growing? Then dropping
is necessary and natural, as I mentioned above.

It doesn't make sense for SW to absorb any possible SW jitters. Cost is high.
It is usually done by increasing queue depth. Keeping steady state is more
important. 

Rather, this patch is helpful for reducing jitter. When I run a profiler, the
most cycle-consuming part on Tx is still freeing buffers. If we allowed looping
over valid CQEs, many buffers could be freed in a single call of
mlx5_tx_complete() at some point, which would cause a long delay. This would
aggravate jitter.

> > > Holding these completions un-reaped can theoretically cause resource stress on
> > > the corresponding mempool(s).
> > Can you make your point clearer? Do you think the "stress" can impact
> > performance? I think stress doesn't matter unless the mempool is depleted. And the app is
> > responsible for supplying enough mbufs considering the depth of all queues (max
> > # of outstanding mbufs).
> 
> I might be missing something, but # of outstanding mbufs should be
> relatively small as the pmd reaps every MLX5_TX_COMP_THRESH mbufs right?
> Why should the pool account for the entire TX queue depth (which can
> be very large)?
The reason is simple for the Rx queue. If the number of mbufs in the provisioned
mempool is less than the rxq depth, the PMD can't even successfully initialize
the device. The PMD doesn't keep a private mempool. So, it is nonsensical to
provision fewer mbufs than the queue depth even if it isn't documented. It is
obvious.

No mempool is assigned for Tx. And in this case, the app isn't forced to prepare
enough mbufs to cover all the Tx queues. But the downside is significant
performance degradation. From the PMD's perspective, it just needs to avoid any
deadlock condition due to depletion. Even if freeing mbufs in bulk causes some
resource depletion on the app side, it is a fair trade-off for higher
performance as long as there's no deadlock. And as far as I can tell, most PMDs
free mbufs in bulk, not one by one, which is also good for cache locality.

Anyway, there are many examples depending on the packet processing mode -
fwd/rxonly/txonly. But I won't explain them all one by one.

> Is there a hard requirement documented somewhere that the application
> needs to account for the entire TX queue depths for sizing its mbuf
> pool?
If needed, we should document it, and this would be a good start for you to
contribute to the DPDK community. But think about the definition of Tx queue
depth: doesn't it mean that a queue can hold that many descriptors? Then the app
should prepare more mbufs than the queue depth it has configured. In my
understanding, there's no point in having fewer mbufs than the total number of
queue entries. If resources are scarce, what's the point of having a larger
queue depth? It should use a smaller queue.
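
To make the sizing argument concrete, a rough rule-of-thumb sketch; the numbers
and the helper are illustrative, not a documented requirement:

#include <rte_mbuf.h>
#include <rte_lcore.h>

/* Rule of thumb: cover every descriptor that can hold an mbuf, plus per-core
 * staging (burst buffers) and the mempool per-lcore cache. */
static struct rte_mempool *
create_mbuf_pool(unsigned int nb_ports, unsigned int nb_lcores,
		 unsigned int rxd, unsigned int txd)
{
	const unsigned int burst = 32, cache = 256;
	unsigned int n = nb_ports * (rxd + txd) + nb_lcores * (burst + cache);

	return rte_pktmbuf_pool_create("mbuf_pool", n, cache, 0,
				       RTE_MBUF_DEFAULT_BUF_SIZE,
				       rte_socket_id());
}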

> My question is: with the proposed change, doesn't this mean that the
> application might need to allocate a bigger TX mbuf pool? Because the
> pmd can theoretically consume completions slower (as in multiple TX
> burst calls)?
No. Explained above.

[...]
> > > Perhaps an adaptive budget (based on online stats) would perform better?
> > Please bring up any suggestion or submit a patch if any.
> 
> I was simply providing a review for the patch. I don't have the time
> to come up with a better patch, unfortunately, but I still think it's
> fair to raise a point.
Of course. I appreciate your time reviewing. And keep in mind that nothing is
impossible in an open source community. I always like to discuss ideas with
anyone. But I was just asking to hear more details about your suggestion if you
wanted me to implement it, rather than giving me a one-sentence question :-)

> > Does "budget" mean the
> > threshold? If so, calculation of stats for adaptive threshold can impact single
> > core performance. With multiple cores, adjusting threshold doesn't affect much.
> 
> If you look at mlx5e driver in the kernel, it maintains online stats on
> its RX and TX queues. It maintains these stats mostly for adaptive
> interrupt moderation control (but not only).
> 
> I was suggesting maintaining per TX queue stats on average completions
> consumed for each TX burst call, and adjusting the stopping condition
> according to a calculated stat.
In the case of interrupt mitigation, it could be beneficial because interrupt
handling is very costly. But the beauty of DPDK is polling, isn't it?


And please remember to ack at the end of this discussion if you are okay, so
that this patch can get merged. One data point: single-core forwarding
performance of the vectorized PMD improves by more than 6% with this patch.
6% is never small.

Thanks for your review again.

Yongseok
  
Sagi Grimberg July 27, 2017, 11:12 a.m. UTC | #5
>> Yes I realize that, but can't the device still complete in a burst (of
>> unsuppressed completions)? I mean it's not guaranteed that for every
>> txq_complete a signaled completion is pending right? What happens if
>> the device has inconsistent completion pacing? Can't the sw grow a
>> batch of completions if txq_complete will process a single completion
>> unconditionally?
> Speculation. First of all, the device doesn't delay completion notifications
> for no reason. An ASIC is not SW running on top of an OS.

I'm sorry but this statement is not correct. It might be correct in a
lab environment, but in practice, there are lots of things that can
affect the device timing.

> If a completion comes up late,
> it means the device really can't keep up with the rate of posting descriptors.
> If so, tx_burst() should generate back-pressure by returning a partial Tx
> count; the app can then decide between drop and retry. Retry on Tx means
> back-pressuring the Rx side if the app is forwarding packets.

Not arguing with that; I was simply suggesting that better heuristics
could be applied than "process one completion unconditionally".

> A more serious problem I expected was the case where the THRESH is smaller than
> the burst size. In that case, txq->elts[] will be short of slots all the time.
> But fortunately, in the MLX PMD, we request at most one completion per burst,
> not one every THRESH packets.
> 
> If there's some SW jitter on Tx processing, the Tx CQ can grow for sure. The
> question to myself was "when does it shrink?". It shrinks when the Tx burst is
> light (burst size smaller than THRESH) because mlx5_tx_complete() is always
> called every time tx_burst() is called. What if it keeps growing? Then dropping
> is necessary and natural, as I mentioned above.
> 
> It doesn't make sense for SW to absorb any possible SW jitters. Cost is high.
> It is usually done by increasing queue depth. Keeping steady state is more
> important.

Again, I agree jitters are bad, but with proper heuristics in place mlx5
can still keep a low jitter _and_ consume completions faster than
consecutive tx_burst invocations.

> Rather, this patch is helpful for reducing jitter. When I run a profiler, the
> most cycle-consuming part on Tx is still freeing buffers. If we allowed looping
> over valid CQEs, many buffers could be freed in a single call of
> mlx5_tx_complete() at some point, which would cause a long delay. This would
> aggravate jitter.

I wasn't disputing that this patch addresses an issue, but mlx5 is
a driver that is designed to serve applications that can behave differently
from your test case.

> Of course. I appreciate your time reviewing. And keep in mind that nothing is
> impossible in an open source community. I always like to discuss ideas with
> anyone. But I was just asking to hear more details about your suggestion if you
> wanted me to implement it, rather than giving me a one-sentence question :-)

Good to know.

>>> Does "budget" mean the
>>> threshold? If so, calculation of stats for adaptive threshold can impact single
>>> core performance. With multiple cores, adjusting threshold doesn't affect much.
>>
>> If you look at mlx5e driver in the kernel, it maintains online stats on
>> its RX and TX queues. It maintains these stats mostly for adaptive
>> interrupt moderation control (but not only).
>>
>> I was suggesting maintaining per TX queue stats on average completions
>> consumed for each TX burst call, and adjusting the stopping condition
>> according to a calculated stat.
> In the case of interrupt mitigation, it could be beneficial because interrupt
> handling is very costly. But the beauty of DPDK is polling, isn't it?

If you read my comment again, I didn't suggest applying stats for interrupt
moderation; I just gave an example of a use case. I was suggesting maintaining
online stats for adjusting a threshold of how many completions to process in a
tx burst call (instead of processing one unconditionally).

> And please remember to ack at the end of this discussion if you are okay, so
> that this patch can get merged. One data point: single-core forwarding
> performance of the vectorized PMD improves by more than 6% with this patch.
> 6% is never small.

Yea, I don't mind merging it in given that I don't have time to come
up with anything better (or worse :))

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
  
Yongseok Koh July 28, 2017, 12:26 a.m. UTC | #6
> On Jul 27, 2017, at 4:12 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
> 
> 
>>> Yes I realize that, but can't the device still complete in a burst (of
>>> unsuppressed completions)? I mean it's not guaranteed that for every
>>> txq_complete a signaled completion is pending right? What happens if
>>> the device has inconsistent completion pacing? Can't the sw grow a
>>> batch of completions if txq_complete will process a single completion
>>> unconditionally?
>> Speculation. First of all, the device doesn't delay completion notifications
>> for no reason. An ASIC is not SW running on top of an OS.
> 
> I'm sorry but this statement is not correct. It might be correct in a
> lab environment, but in practice, there are lots of things that can
> affect the device timing.
Disagree.

[...]
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Thanks for ack!

Yongseok
  
Ferruh Yigit July 31, 2017, 4:12 p.m. UTC | #7
On 7/20/2017 4:48 PM, Yongseok Koh wrote:
> mlx5_tx_complete() polls completion queue multiple times until it
> encounters an invalid entry. As Tx completions are suppressed by
> MLX5_TX_COMP_THRESH, it is a waste of cycles to expect multiple completions
> in a poll. And freeing too many buffers in a call can cause high jitter.
> This patch improves throughput a little.
> 
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

Applied to dpdk-next-net/master, thanks.
  

Patch

diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index 534aaeb46..7fd59a4b1 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -480,30 +480,18 @@  mlx5_tx_complete(struct txq *txq)
 	struct rte_mempool *pool = NULL;
 	unsigned int blk_n = 0;
 
-	do {
-		volatile struct mlx5_cqe *tmp;
-
-		tmp = &(*txq->cqes)[cq_ci & cqe_cnt];
-		if (check_cqe(tmp, cqe_n, cq_ci))
-			break;
-		cqe = tmp;
+	cqe = &(*txq->cqes)[cq_ci & cqe_cnt];
+	if (unlikely(check_cqe(cqe, cqe_n, cq_ci)))
+		return;
 #ifndef NDEBUG
-		if (MLX5_CQE_FORMAT(cqe->op_own) == MLX5_COMPRESSED) {
-			if (!check_cqe_seen(cqe))
-				ERROR("unexpected compressed CQE, TX stopped");
-			return;
-		}
-		if ((MLX5_CQE_OPCODE(cqe->op_own) == MLX5_CQE_RESP_ERR) ||
-		    (MLX5_CQE_OPCODE(cqe->op_own) == MLX5_CQE_REQ_ERR)) {
-			if (!check_cqe_seen(cqe))
-				ERROR("unexpected error CQE, TX stopped");
-			return;
-		}
-#endif /* NDEBUG */
-		++cq_ci;
-	} while (1);
-	if (unlikely(cqe == NULL))
+	if ((MLX5_CQE_OPCODE(cqe->op_own) == MLX5_CQE_RESP_ERR) ||
+	    (MLX5_CQE_OPCODE(cqe->op_own) == MLX5_CQE_REQ_ERR)) {
+		if (!check_cqe_seen(cqe))
+			ERROR("unexpected error CQE, TX stopped");
 		return;
+	}
+#endif /* NDEBUG */
+	++cq_ci;
 	txq->wqe_pi = ntohs(cqe->wqe_counter);
 	ctrl = (volatile struct mlx5_wqe_ctrl *)
 		tx_mlx5_wqe(txq, txq->wqe_pi);