net/mlx5: fix decreasing the reference count of a Tx queue

Message ID 20250703150252.145065-1-bingz@nvidia.com (mailing list archive)
State Awaiting Upstream
Delegated to: Raslan Darawsheh
Series net/mlx5: fix decreasing the reference count of a Tx queue

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/loongarch-compilation success Compilation OK
ci/loongarch-unit-testing success Unit Testing PASS
ci/Intel-compilation success Compilation OK
ci/iol-sample-apps-testing fail Testing issues
ci/github-robot: build success github build: passed
ci/iol-unit-arm64-testing pending Testing pending
ci/iol-unit-amd64-testing pending Testing pending
ci/iol-compile-amd64-testing success Testing PASS
ci/aws-unit-testing success Unit Testing PASS
ci/intel-Testing success Testing PASS
ci/intel-Functional success Functional PASS

Commit Message

Bing Zhao July 3, 2025, 3:02 p.m. UTC
When the startup order of the Tx queues was changed, a queue depth
comparison was added. If the depth of a queue is not equal to the log2
value currently being processed, the queue is skipped until a later
iteration and the next queue is checked.

mlx5_txq_get() increases the reference count of the queue. A depth
mismatch is not an error, so the startup continues and does not jump
to the error roll-back label. The reference taken by mlx5_txq_get()
must therefore be dropped on this path. Otherwise, at the device close
stage, the queue cannot be released in the FW, and the TIS and PD are
leaked as well.

Calling mlx5_txq_release() before the continue restores the reference
count to its initial state and fixes the leak.

Fixes: 6f356d3840e6 ("net/mlx5: pass DevX object info in Tx queue start")

Signed-off-by: Bing Zhao <bingz@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/net/mlx5/mlx5_trigger.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)
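
The reference count imbalance described in the commit message can be
illustrated with a small standalone sketch. This is not the mlx5 driver
code: struct toy_txq, toy_txq_get(), toy_txq_release(), toy_start_all(),
the queue depths and MAX_LOG2_DEPTH are all hypothetical names and values
chosen for illustration. The sketch also assumes that a queue started in
its matching pass keeps the reference taken by the get call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NB_QUEUES 4
#define MAX_LOG2_DEPTH 4 /* hypothetical bound, not the real mlx5 value */

/* Hypothetical stand-in for a Tx queue control structure. */
struct toy_txq {
	unsigned int log2_depth; /* plays the role of elts_n in the patch */
	int refcnt;              /* references currently held on the queue */
};

static struct toy_txq queues[NB_QUEUES];

/* Return the queue and take one reference, or NULL if idx is invalid. */
static struct toy_txq *
toy_txq_get(size_t idx)
{
	if (idx >= NB_QUEUES)
		return NULL;
	queues[idx].refcnt++;
	return &queues[idx];
}

/* Drop one reference previously taken by toy_txq_get(). */
static void
toy_txq_release(size_t idx)
{
	assert(idx < NB_QUEUES && queues[idx].refcnt > 0);
	queues[idx].refcnt--;
}

/*
 * Walk the queues once per depth value, deepest first, and "start" only
 * the queues whose depth matches the current pass. Returns the total
 * number of references held on all queues afterwards.
 */
int
toy_start_all(bool release_on_skip)
{
	static const unsigned int depths[NB_QUEUES] = { 2, 4, 2, 3 };
	unsigned int cnt;
	int total = 0;
	size_t i;

	for (i = 0; i < NB_QUEUES; i++) {
		queues[i].log2_depth = depths[i];
		queues[i].refcnt = 0;
	}
	for (cnt = MAX_LOG2_DEPTH; cnt > 0; cnt--) {
		for (i = 0; i < NB_QUEUES; i++) {
			struct toy_txq *q = toy_txq_get(i);

			if (q == NULL)
				continue; /* no reference was taken */
			if (q->log2_depth != cnt) {
				/* Skipped, not failed: give the reference back. */
				if (release_on_skip)
					toy_txq_release(i);
				continue;
			}
			/* Matching pass: the started queue keeps its reference. */
		}
	}
	for (i = 0; i < NB_QUEUES; i++)
		total += queues[i].refcnt;
	return total;
}
```

Under these assumptions, toy_start_all(false) leaves 16 references on 4
queues, because every queue is fetched in all 4 passes and never released
on the skip path. toy_start_all(true) leaves 4 references, one per started
queue, which is the balanced state the patch aims to restore.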
  

Comments

Raslan Darawsheh July 6, 2025, 2:17 p.m. UTC | #1
Hi,


Patch applied to next-net-mlx,

Kindest regards
Raslan Darawsheh
  

Patch

diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index 90287a1b75..6c6f228afd 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -61,8 +61,12 @@  mlx5_txq_start(struct rte_eth_dev *dev)
 			struct mlx5_txq_ctrl *txq_ctrl = mlx5_txq_get(dev, i);
 			struct mlx5_txq_data *txq_data = &txq_ctrl->txq;
 
-			if (!txq_ctrl || txq_data->elts_n != cnt)
+			if (!txq_ctrl)
+				continue;
+			if (txq_data->elts_n != cnt) {
+				mlx5_txq_release(dev, i);
 				continue;
+			}
 			if (!txq_ctrl->is_hairpin)
 				txq_alloc_elts(txq_ctrl);
 			MLX5_ASSERT(!txq_ctrl->obj);