eal: prevent socket closure before MP sync

Message ID 20250314103638.2198-1-ming.1.yang@nokia-sbell.com (mailing list archive)
State Superseded
Delegated to: David Marchand
Headers
Series eal: prevent socket closure before MP sync |

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/loongarch-compilation success Compilation OK
ci/loongarch-unit-testing success Unit Testing PASS
ci/Intel-compilation success Compilation OK
ci/intel-Testing success Testing PASS
ci/intel-Functional success Functional PASS
ci/iol-marvell-Functional success Functional Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-mellanox-Performance success Performance Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/iol-broadcom-Performance success Performance Testing PASS
ci/iol-unit-arm64-testing success Testing PASS
ci/iol-unit-amd64-testing fail Testing issues
ci/iol-sample-apps-testing success Testing PASS
ci/iol-abi-testing success Testing PASS
ci/iol-compile-amd64-testing success Testing PASS
ci/iol-compile-arm64-testing success Testing PASS
ci/github-robot: build success github build: passed

Commit Message

Yang Ming March 14, 2025, 10:36 a.m. UTC
The secordary process should not close socket file for MP
channel before performing MP request synchronization.
This prevents error logs when the secondary process exits
without any operation on the crypto device while the primary
process starts the device.

Case situation:
eal_bus_cleanup has been added in rte_eal_cleanup. But for the
secondary process, rte_eal_cleanup firstly performs
rte_mp_channel_cleanup, which closes socket file for the MP
channel, making mp_fd invalid. Subsequently, eal_bus_cleanup
triggers vdev_cleanup, which calls mp_request_sync to send a
message via the MP channel. Since mp_fd is invalid, error logs
occur.

Error logs occur as below when the secordary process exit:
EAL: failed to send to (/tmp/dpdk/l2hicu/mp_socket) due to Bad
file descriptor
EAL: Fail to send request /tmp/dpdk/l2hicu/mp_socket:
ipsec_mb_mp_msg
USER1: Create MR request to primary process failed.

Function call trace:
1. rte_eal_cleanup->rte_mp_channel_cleanup->close_socket_fd
2. rte_eal_cleanup->eal_bus_cleanup->vdev_cleanup->
rte_vdev_driver->ipsec_mb_remove->ipsec_mb_qp_release->
ipsec_mb_secondary_qp_op->rte_mp_request_sync->mp_request_sync->
send_msg->sendmsg(mp_fd, &msgh, 0);

Fixes: 1cab1a40ea9b ("bus: cleanup devices on shutdown")
Cc: kevin.laatz@intel.com
Cc: stable@dpdk.org

Signed-off-by: Yang Ming <ming.1.yang@nokia-sbell.com>
---
 lib/eal/linux/eal.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
  

Comments

Stephen Hemminger March 17, 2025, 1:56 p.m. UTC | #1
On Fri, 14 Mar 2025 18:36:38 +0800
Yang Ming <ming.1.yang@nokia-sbell.com> wrote:

> The secordary process should not close socket file for MP
> channel before performing MP request synchronization.
> This prevents error logs when the secondary process exits
> without any operation on the crypto device while the primary
> process starts the device.
> 
> Case situation:
> eal_bus_cleanup has been added in rte_eal_cleanup. But for the
> secondary process, rte_eal_cleanup firstly performs
> rte_mp_channel_cleanup, which closes socket file for the MP
> channel, making mp_fd invalid. Subsequently, eal_bus_cleanup
> triggers vdev_cleanup, which calls mp_request_sync to send a
> message via the MP channel. Since mp_fd is invalid, error logs
> occur.
> 
> Error logs occur as below when the secordary process exit:
> EAL: failed to send to (/tmp/dpdk/l2hicu/mp_socket) due to Bad
> file descriptor
> EAL: Fail to send request /tmp/dpdk/l2hicu/mp_socket:
> ipsec_mb_mp_msg
> USER1: Create MR request to primary process failed.
> 
> Function call trace:
> 1. rte_eal_cleanup->rte_mp_channel_cleanup->close_socket_fd
> 2. rte_eal_cleanup->eal_bus_cleanup->vdev_cleanup->
> rte_vdev_driver->ipsec_mb_remove->ipsec_mb_qp_release->
> ipsec_mb_secondary_qp_op->rte_mp_request_sync->mp_request_sync->
> send_msg->sendmsg(mp_fd, &msgh, 0);
> 
> Fixes: 1cab1a40ea9b ("bus: cleanup devices on shutdown")
> Cc: kevin.laatz@intel.com
> Cc: stable@dpdk.org
> 
> Signed-off-by: Yang Ming <ming.1.yang@nokia-sbell.com>

Looks good, could there be a test?

Acked-by: Stephen Hemminger <stephen@networkplumber.org>
  
Yang Ming March 27, 2025, 9:28 a.m. UTC | #2
On 2025/3/17 21:56, Stephen Hemminger wrote:
> Caution: This is an external email. Please be very careful when clicking links or opening attachments. See http://nok.it/nsb for additional information.
>
> On Fri, 14 Mar 2025 18:36:38 +0800
> Yang Ming <ming.1.yang@nokia-sbell.com> wrote:
>
>> The secordary process should not close socket file for MP
>> channel before performing MP request synchronization.
>> This prevents error logs when the secondary process exits
>> without any operation on the crypto device while the primary
>> process starts the device.
>>
>> Case situation:
>> eal_bus_cleanup has been added in rte_eal_cleanup. But for the
>> secondary process, rte_eal_cleanup firstly performs
>> rte_mp_channel_cleanup, which closes socket file for the MP
>> channel, making mp_fd invalid. Subsequently, eal_bus_cleanup
>> triggers vdev_cleanup, which calls mp_request_sync to send a
>> message via the MP channel. Since mp_fd is invalid, error logs
>> occur.
>>
>> Error logs occur as below when the secordary process exit:
>> EAL: failed to send to (/tmp/dpdk/l2hicu/mp_socket) due to Bad
>> file descriptor
>> EAL: Fail to send request /tmp/dpdk/l2hicu/mp_socket:
>> ipsec_mb_mp_msg
>> USER1: Create MR request to primary process failed.
>>
>> Function call trace:
>> 1. rte_eal_cleanup->rte_mp_channel_cleanup->close_socket_fd
>> 2. rte_eal_cleanup->eal_bus_cleanup->vdev_cleanup->
>> rte_vdev_driver->ipsec_mb_remove->ipsec_mb_qp_release->
>> ipsec_mb_secondary_qp_op->rte_mp_request_sync->mp_request_sync->
>> send_msg->sendmsg(mp_fd, &msgh, 0);
>>
>> Fixes: 1cab1a40ea9b ("bus: cleanup devices on shutdown")
>> Cc: kevin.laatz@intel.com
>> Cc: stable@dpdk.org
>>
>> Signed-off-by: Yang Ming <ming.1.yang@nokia-sbell.com>
> Looks good, could there be a test?
>
> Acked-by: Stephen Hemminger <stephen@networkplumber.org>
>
Hi Thanks Stephen,

We have some findings for this change in our product test line. I will 
do more investigation for this patch.

Brs,

Yang Ming
  
Yang Ming April 7, 2025, 3:52 a.m. UTC | #3
On 2025/3/27 17:28, Yang Ming wrote:
>
> On 2025/3/17 21:56, Stephen Hemminger wrote:
>> Caution: This is an external email. Please be very careful when 
>> clicking links or opening attachments. See http://nok.it/nsb for 
>> additional information.
>>
>> On Fri, 14 Mar 2025 18:36:38 +0800
>> Yang Ming <ming.1.yang@nokia-sbell.com> wrote:
>>
>>> The secordary process should not close socket file for MP
>>> channel before performing MP request synchronization.
>>> This prevents error logs when the secondary process exits
>>> without any operation on the crypto device while the primary
>>> process starts the device.
>>>
>>> Case situation:
>>> eal_bus_cleanup has been added in rte_eal_cleanup. But for the
>>> secondary process, rte_eal_cleanup firstly performs
>>> rte_mp_channel_cleanup, which closes socket file for the MP
>>> channel, making mp_fd invalid. Subsequently, eal_bus_cleanup
>>> triggers vdev_cleanup, which calls mp_request_sync to send a
>>> message via the MP channel. Since mp_fd is invalid, error logs
>>> occur.
>>>
>>> Error logs occur as below when the secordary process exit:
>>> EAL: failed to send to (/tmp/dpdk/l2hicu/mp_socket) due to Bad
>>> file descriptor
>>> EAL: Fail to send request /tmp/dpdk/l2hicu/mp_socket:
>>> ipsec_mb_mp_msg
>>> USER1: Create MR request to primary process failed.
>>>
>>> Function call trace:
>>> 1. rte_eal_cleanup->rte_mp_channel_cleanup->close_socket_fd
>>> 2. rte_eal_cleanup->eal_bus_cleanup->vdev_cleanup->
>>> rte_vdev_driver->ipsec_mb_remove->ipsec_mb_qp_release->
>>> ipsec_mb_secondary_qp_op->rte_mp_request_sync->mp_request_sync->
>>> send_msg->sendmsg(mp_fd, &msgh, 0);
>>>
>>> Fixes: 1cab1a40ea9b ("bus: cleanup devices on shutdown")
>>> Cc: kevin.laatz@intel.com
>>> Cc: stable@dpdk.org
>>>
>>> Signed-off-by: Yang Ming <ming.1.yang@nokia-sbell.com>
>> Looks good, could there be a test?
>>
>> Acked-by: Stephen Hemminger <stephen@networkplumber.org>
>>
> Hi Thanks Stephen,
>
> We have some findings for this change in our product test line. I will 
> do more investigation for this patch.
>
> Brs,
>
> Yang Ming
>
>
Hi,

This patch has been tested and functions correctly under normal 
conditions. However, during testing, we observed some new error logs in 
specific cases. Upon investigation, we identified that these logs 
originate from a separate issue, which will be addressed in the next 
version of this patch. Additionally, a similar issue may affect FreeBSD, 
and I plan to include a fix for that as well in the upcoming patch 
series. Please note that the entire patch series has been thoroughly tested.

Brs,

Yang Ming
  

Patch

diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index b1e63e37fc..73ea47b12d 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -1318,11 +1318,11 @@  rte_eal_cleanup(void)
 		rte_memseg_walk(mark_freeable, NULL);
 
 	rte_service_finalize();
+	eal_bus_cleanup();
 #ifdef VFIO_PRESENT
 	vfio_mp_sync_cleanup();
 #endif
 	rte_mp_channel_cleanup();
-	eal_bus_cleanup();
 	rte_trace_save();
 	eal_trace_fini();
 	eal_mp_dev_hotplug_cleanup();