eal/linux: enhanced error handling for affinity

Message ID: 20240423030243.59895-1-wujianyue000@163.com
State: Superseded
Delegated to: Thomas Monjalon
Series: eal/linux: enhanced error handling for affinity

Checks

Context                         Check    Description
ci/checkpatch                   success  coding style OK
ci/loongarch-compilation        success  Compilation OK
ci/loongarch-unit-testing       success  Unit Testing PASS
ci/Intel-compilation            fail     Compilation issues
ci/iol-intel-Performance        success  Performance Testing PASS
ci/iol-mellanox-Performance     success  Performance Testing PASS
ci/iol-unit-amd64-testing       fail     Testing issues
ci/iol-abi-testing              success  Testing PASS
ci/iol-compile-arm64-testing    fail     Testing issues
ci/iol-sample-apps-testing      warning  Testing issues
ci/iol-compile-amd64-testing    fail     Testing issues
ci/iol-unit-arm64-testing       fail     Testing issues
ci/iol-intel-Functional         success  Functional Testing PASS
ci/iol-broadcom-Functional      success  Functional Testing PASS
ci/iol-broadcom-Performance     success  Performance Testing PASS

Commit Message

Jianyue Wu April 23, 2024, 3:02 a.m. UTC
  Improve the robustness of setting thread affinity in DPDK
by adding detailed error logging.

Changes:
1. Check the return value of pthread_setaffinity_np() and log an error
if the call fails.
2. Include the current thread name, the intended CPU set, and a detailed
error message in the log.

Sample prints:
EAL: Cannot set affinity for thread dpdk-test with cpus 0,
ret: 22, errno: 0, error description: Success
EAL: Cannot set affinity for thread dpdk-worker1 with cpus 1,
ret: 22, errno: 0, error description: Success

Signed-off-by: Jianyue Wu <wujianyue000@163.com>
---
 lib/eal/unix/rte_thread.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)
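
A note on the sample prints above: they show `ret: 22` next to `errno: 0` and `Success`. That matches how the pthread calls report failure: `pthread_setaffinity_np()` returns the error number and does not set `errno`, so the description has to come from `strerror(ret)`. The following is only a minimal sketch, assuming Linux with glibc; the helper name `try_set_affinity` is hypothetical and is not part of DPDK or of this patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not DPDK API): apply `set` to the calling thread.
 * On failure, describe the error using the return value of
 * pthread_setaffinity_np(), because pthread calls do not set errno.
 */
static int
try_set_affinity(const cpu_set_t *set, char *msg, size_t msg_len)
{
	int ret;

	assert(set != NULL && msg != NULL && msg_len > 0);

	ret = pthread_setaffinity_np(pthread_self(), sizeof(*set), set);
	if (ret != 0)
		snprintf(msg, msg_len, "ret: %d (%s)", ret, strerror(ret));
	else
		snprintf(msg, msg_len, "ok");

	return ret;
}
```

One way to provoke the failure on Linux is to pass a set that was only initialised with `CPU_ZERO()`: the kernel rejects a mask with no usable CPUs with `EINVAL` (22), which is the value seen in the sample prints.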
  

Comments

Stephen Hemminger April 24, 2024, 3:50 p.m. UTC | #1
On Tue, 23 Apr 2024 11:02:43 +0800
Jianyue Wu <wujianyue000@163.com> wrote:

> Improve the robustness of setting thread affinity in DPDK
> by adding detailed error logging.

Is this an error you saw in your application or something inside DPDK?

> Changes:
> 1. Check the return value of pthread_setaffinity_np() and log an error
> if the call fails.

Not sure this is necessary. The rte_thread functions are intended to
be OS-independent wrappers for threads. Does it need to be this chatty?

> 2. Include the current thread name, the intended CPU set, and a detailed
> error message in the log.

This introduces more code and ends up being Linux/BSD specific, only
for the case where the application did something wrong.
  
Jianyue Wu April 25, 2024, 1:08 a.m. UTC | #2
Hello, Stephen,

Good day.
The issue is not caused by DPDK itself; it arises when the DPDK worker process attempts to set affinity to a cpuset that exceeds the limits set by the cgroup cpuset settings.
The original error output was:
     PANIC in rte_eal_init():
     Cannot set affinity
     # Callstacks.


Finding the detailed reason for the failure was challenging, so I added extra print statements to help diagnose the issue.
I understand your concern about maintaining OS independence with the rte_thread functions. This change aims to provide more context when errors occur, facilitating quicker troubleshooting. I agree that this introduces more code and could be seen as platform-specific. Perhaps we could implement this conditionally, only for platforms where such detailed logging is supported and useful.
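
The cgroup case described above can also be checked before the affinity call is made. The sketch below is only an illustration, assuming Linux with glibc and assuming the calling thread's current mask (as returned by `sched_getaffinity()`) has not already been narrowed below what the cpuset cgroup allows. The helper names `cpuset_overlaps` and `affinity_request_is_plausible` are hypothetical and not part of DPDK.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not DPDK API): true if at least one CPU in `want`
 * is also present in `allowed`.
 */
static bool
cpuset_overlaps(const cpu_set_t *want, const cpu_set_t *allowed)
{
	cpu_set_t both;

	assert(want != NULL && allowed != NULL);

	CPU_AND(&both, want, allowed);
	return CPU_COUNT(&both) > 0;
}

/*
 * Hypothetical pre-check: compare `want` with the calling thread's current
 * affinity mask, which the kernel keeps within the cpuset cgroup limits.
 * Returns 1 if the sets overlap, 0 if they do not, -1 if the current mask
 * could not be read.
 */
static int
affinity_request_is_plausible(const cpu_set_t *want)
{
	cpu_set_t allowed;

	CPU_ZERO(&allowed);
	if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
		return -1;

	return cpuset_overlaps(want, &allowed) ? 1 : 0;
}
```

A result of 0 would point at the kind of mismatch described here, where the requested CPUs fall outside what the process is allowed to use, without having to format the whole cpuset into the error log.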


  
Jianyue Wu April 25, 2024, 5:40 a.m. UTC | #3
After reviewing the code, I believe that the combination of the __linux__ and _GNU_SOURCE macros effectively confirms whether the pthread_getname_np() API can be utilized. I will proceed with adding them. Thank you~
#if defined(__linux__) && defined(_GNU_SOURCE)
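
A guard of this kind could be wrapped around the name lookup roughly as follows. This is a sketch under stated assumptions, not the code from the follow-up patch: it assumes glibc, and the names `get_thread_name` and `THREAD_NAME_LEN` are hypothetical. When the guarded call is unavailable or fails, the buffer falls back to a fixed placeholder.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Linux thread names fit in 16 bytes including the terminating NUL. */
#define THREAD_NAME_LEN 16

/*
 * Hypothetical helper (not DPDK API): copy the name of thread `tid` into
 * `buf`, or the placeholder "unknown" if the name cannot be obtained.
 */
static void
get_thread_name(pthread_t tid, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);

#if defined(__linux__) && defined(_GNU_SOURCE)
	/* Fails with ERANGE if len is smaller than 16 bytes. */
	if (pthread_getname_np(tid, buf, len) == 0)
		return;
#else
	(void)tid;
#endif
	snprintf(buf, len, "%s", "unknown");
}
```

Keeping the fallback outside the `#if` means callers never see an uninitialised buffer, whichever branch is compiled.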


  
Stephen Hemminger April 25, 2024, 3:04 p.m. UTC | #4
On Thu, 25 Apr 2024 13:40:21 +0800 (CST)
吴剑跃 <wujianyue000@163.com> wrote:

> After reviewing the code, I believe that the combination of the __linux__ and _GNU_SOURCE macros effectively confirms whether the pthread_getname_np() API can be utilized. I will proceed with adding them. Thank you~
> #if defined(__linux__) && defined(_GNU_SOURCE)

My point is that just giving the kernel error should be sufficient, rather than having
to reformat the incoming arguments. The arguments are coming from the command line, and what I
would do is look at the error and the command line arguments to the application, as well as
any kernel logs.
  
Jianyue Wu April 26, 2024, 3:14 a.m. UTC | #5
Hello, Stephen,

Understood. Yesterday I added new changes to the patch; how do I recall that patch?

Thank you~

  

Patch

diff --git a/lib/eal/unix/rte_thread.c b/lib/eal/unix/rte_thread.c
index 1b4c73f58e..8f9eaf0dcf 100644
--- a/lib/eal/unix/rte_thread.c
+++ b/lib/eal/unix/rte_thread.c
@@ -369,8 +369,26 @@  int
 rte_thread_set_affinity_by_id(rte_thread_t thread_id,
 		const rte_cpuset_t *cpuset)
 {
-	return pthread_setaffinity_np((pthread_t)thread_id.opaque_id,
-		sizeof(*cpuset), cpuset);
+	int ret;
+	char cpus_str[RTE_CPU_AFFINITY_STR_LEN] = {'\0'};
+	char thread_name[RTE_MAX_THREAD_NAME_LEN] = {'\0'};
+
+	errno = 0;
+	ret = pthread_setaffinity_np((pthread_t)thread_id.opaque_id,
+				sizeof(*cpuset), cpuset);
+	if (ret != 0) {
+		if (pthread_getname_np((pthread_t)thread_id.opaque_id,
+					thread_name, sizeof(thread_name)) != 0)
+			EAL_LOG(ERR, "pthread_getname_np failed!");
+		if (eal_thread_dump_affinity(cpuset, cpus_str, RTE_CPU_AFFINITY_STR_LEN) != 0)
+			EAL_LOG(ERR, "eal_thread_dump_affinity failed!");
+		EAL_LOG(ERR, "Cannot set affinity for thread %s with cpus %s, "
+			"ret: %d, errno: %d, error description: %s",
+			thread_name, cpus_str,
+			ret, errno, strerror(errno));
+	}
+
+	return ret;
 }
 
 int