eal/linux: enhanced error handling for affinity
Commit Message
Improve the robustness of setting thread affinity in DPDK
by adding detailed error logging.
Changes:
1. Check the return value of pthread_setaffinity_np() and log an error
if the call fails.
2. Include the current thread name, the intended CPU set, and a detailed
error message in the log.
Sample prints:
EAL: Cannot set affinity for thread dpdk-test with cpus 0,
ret: 22, errno: 0, error description: Success
EAL: Cannot set affinity for thread dpdk-worker1 with cpus 1,
ret: 22, errno: 0, error description: Success
Signed-off-by: Jianyue Wu <wujianyue000@163.com>
---
lib/eal/unix/rte_thread.c | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)
Comments
On Tue, 23 Apr 2024 11:02:43 +0800
Jianyue Wu <wujianyue000@163.com> wrote:
> Improve the robustness of setting thread affinity in DPDK
> by adding detailed error logging.
Is this an error you saw in your application or something inside DPDK?
> Changes:
> 1. Check the return value of pthread_setaffinity_np() and log an error
> if the call fails.
Not sure this is necessary. The rte_thread functions are intended to
be an OS-independent wrapper for threads. Does it need to be this chatty?
> 2. Include the current thread name, the intended CPU set, and a detailed
> error message in the log.
This introduces more code and ends up being Linux/BSD-specific, only
for the case where the application did something wrong.
Hello, Stephen,
Good day
The issue is not caused by DPDK itself, but arises when the DPDK worker process attempts to set affinity to a cpuset that exceeds the limits set by the cgroup cpuset settings.
Original error prints are:
PANIC in rte_eal_init():
Cannot set affinity
# Callstacks.
Finding the detailed reason for the failure was challenging, so I added extra print statements to help diagnose the issue.
I understand your concern about maintaining OS independence with the rte_thread functions. This change aims to provide more context when errors occur, facilitating quicker troubleshooting. I agree that this introduces more code and could be seen as platform-specific. Perhaps we could implement this conditionally, only for platforms where such detailed logging is supported and useful.
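As an illustration of the failure mode described in this message, the sketch below checks a requested CPU set against the calling thread's current affinity mask before any affinity call is made. It is not part of the patch: `cpuset_overlaps_allowed()` is a hypothetical helper, and it assumes Linux with glibc. According to the sched_setaffinity(2) man page, EINVAL is returned when the requested mask contains no CPU that the thread is permitted to run on, including restrictions imposed by cpuset cgroups.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not a DPDK API): returns true when at least one CPU
 * in 'wanted' is also in the calling thread's current affinity mask.
 * The current mask is only an approximation of the cgroup limit, because
 * an earlier affinity call (e.g. via taskset) may have narrowed it further.
 */
static bool
cpuset_overlaps_allowed(const cpu_set_t *wanted)
{
	cpu_set_t allowed, common;

	/* pid 0 means "the calling thread". */
	if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
		return false;

	/* common = wanted AND allowed */
	CPU_AND(&common, wanted, &allowed);
	return CPU_COUNT(&common) > 0;
}
```

An application could call such a helper on the set derived from its core list and, when it returns false, print both sets before attempting to start worker threads.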
After reviewing the code, I believe that the combination of the __linux__ and _GNU_SOURCE macros effectively confirms whether the pthread_getname_np() API can be utilized. I will proceed with adding them. Thank you~
#if defined(__linux__) && defined(_GNU_SOURCE)
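A minimal sketch of how such a guard might wrap the name lookup is shown below. It is an assumption about the intended shape, not the code that was posted: `thread_name_or_fallback()` and `THREAD_NAME_LEN` are hypothetical names, and the fallback string is arbitrary. Note that `_GNU_SOURCE` only has an effect if it is defined before the first system header is included.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Linux thread names fit in 16 bytes including the terminating NUL. */
#define THREAD_NAME_LEN 16

/*
 * Hypothetical helper: fill 'buf' with the thread name where the GNU
 * extension is available, otherwise (or on failure) with a placeholder.
 */
static void
thread_name_or_fallback(pthread_t tid, char *buf, size_t len)
{
#if defined(__linux__) && defined(_GNU_SOURCE)
	/* Returns 0 on success, an error number otherwise. */
	if (pthread_getname_np(tid, buf, len) == 0)
		return;
#else
	(void)tid;
#endif
	snprintf(buf, len, "%s", "unknown");
}
```

With this shape the platform-specific call stays in one place and callers never see an uninitialized name buffer.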
On Thu, 25 Apr 2024 13:40:21 +0800 (CST)
吴剑跃 <wujianyue000@163.com> wrote:
> After reviewing the code, I believe that the combination of the __linux__ and _GNU_SOURCE macros effectively confirms whether the pthread_getname_np() API can be utilized. I will proceed with adding them. Thank you~
> #if defined(__linux__) && defined(_GNU_SOURCE)
My point is that just giving the kernel error should be sufficient, rather than having
to reformat the incoming arguments. The arguments are coming from the command line, and what I
would do is look at the error and the command line arguments to the application, as well as
any kernel logs.
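The approach suggested here, reporting only the kernel error, could look roughly like the sketch below. It is an illustration under stated assumptions rather than the DPDK implementation: `set_affinity_report_error()` is a hypothetical wrapper, and plain `fprintf()` stands in for the EAL logging macro.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical wrapper: set the affinity and, on failure, log only the
 * error description without reformatting the arguments.
 */
static int
set_affinity_report_error(pthread_t tid, const cpu_set_t *set)
{
	/* pthread_* calls return the error number instead of setting errno. */
	int ret = pthread_setaffinity_np(tid, sizeof(*set), set);

	if (ret != 0)
		fprintf(stderr, "Cannot set affinity: %s\n", strerror(ret));
	return ret;
}
```

The requested CPU list is still available from the application's command line, so the log line carries only what the caller cannot reconstruct: the reason for the failure.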
Hello, Stephen,
Understood. Yesterday I added new changes to the patch; how can I recall that patch?
Thank you~
@@ -369,8 +369,26 @@ int
 rte_thread_set_affinity_by_id(rte_thread_t thread_id,
 		const rte_cpuset_t *cpuset)
 {
-	return pthread_setaffinity_np((pthread_t)thread_id.opaque_id,
-		sizeof(*cpuset), cpuset);
+	int ret;
+	char cpus_str[RTE_CPU_AFFINITY_STR_LEN] = {'\0'};
+	char thread_name[RTE_MAX_THREAD_NAME_LEN] = {'\0'};
+
+	errno = 0;
+	ret = pthread_setaffinity_np((pthread_t)thread_id.opaque_id,
+		sizeof(*cpuset), cpuset);
+	if (ret != 0) {
+		if (pthread_getname_np((pthread_t)thread_id.opaque_id,
+				thread_name, sizeof(thread_name)) != 0)
+			EAL_LOG(ERR, "pthread_getname_np failed!");
+		if (eal_thread_dump_affinity(cpuset, cpus_str, RTE_CPU_AFFINITY_STR_LEN) != 0)
+			EAL_LOG(ERR, "eal_thread_dump_affinity failed!");
+		EAL_LOG(ERR, "Cannot set affinity for thread %s with cpus %s, "
+			"ret: %d, errno: %d, error description: %s",
+			thread_name, cpus_str,
+			ret, errno, strerror(errno));
+	}
+
+	return ret;
 }
int
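A note on the sample prints in the commit message, which show "ret: 22, errno: 0, error description: Success": pthread functions report failure through their return value and do not set errno, so strerror(errno) describes errno 0 rather than the actual failure. The sketch below is a standalone probe, not DPDK code; `probe_affinity_failure()` is a hypothetical name, and it assumes Linux with glibc, where an empty CPU set is expected to be rejected.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical probe: provoke a failure with an empty CPU set and record
 * both the return value and errno so the two can be compared.
 */
static int
probe_affinity_failure(int *saved_errno)
{
	cpu_set_t empty;
	int ret;

	CPU_ZERO(&empty);
	errno = 0;
	ret = pthread_setaffinity_np(pthread_self(), sizeof(empty), &empty);
	*saved_errno = errno;

	/* Separate calls, since strerror() may reuse its buffer. */
	printf("ret: %d, description: %s\n", ret, strerror(ret));
	printf("errno: %d, description: %s\n",
	       *saved_errno, strerror(*saved_errno));
	return ret;
}
```

If the probe behaves as assumed, passing ret to strerror() yields the meaningful description, which suggests the log call in the patch would be more informative with strerror(ret).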