[RFC,1/2] eal: add llc aware functions

Message ID: 20240827151014.201-2-vipin.varghese@amd.com (mailing list archive)
State: Changes Requested
Delegated to: Thomas Monjalon
Series: introduce LLC aware functions

Checks

Context        Check    Description
ci/checkpatch  warning  coding style issues

Commit Message

Varghese, Vipin Aug. 27, 2024, 3:10 p.m. UTC
Introduce lcore functions which operate on the Last Level Cache (LLC)
for core complexes or chiplet cores. On non-chiplet core complexes,
the functions will iterate over all the available DPDK lcores.

Functions added:
 - rte_get_llc_first_lcores
 - rte_get_llc_lcore
 - rte_get_llc_n_lcore

Signed-off-by: Vipin Varghese <vipin.varghese@amd.com>
---
 lib/eal/common/eal_common_lcore.c | 279 ++++++++++++++++++++++++++++--
 1 file changed, 267 insertions(+), 12 deletions(-)
  

Comments

Stephen Hemminger Aug. 27, 2024, 5:36 p.m. UTC | #1
On Tue, 27 Aug 2024 20:40:13 +0530
Vipin Varghese <vipin.varghese@amd.com> wrote:

> +		"ls -d /sys/bus/cpu/devices/cpu%u/cache/index[0-9] | sort  -r  | grep -m1 index[0-9] | awk -F '[x]' '{print $2}' "

NAK
Running shell commands from EAL is non-portable and likely to be flagged by security scanning tools.

Do it in C please.
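
For illustration only, a minimal sketch of what the quoted pipeline does, written in plain C with no shell involved. It assumes the same sysfs directory the patch uses (/sys/bus/cpu/devices/cpu%u/cache); the helper names parse_index_name and highest_cache_index are hypothetical and not part of EAL. Like the pipeline, it only finds the highest indexN entry and does not prove that this entry is the LLC.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>

/*
 * Hypothetical helper: return N for a directory entry named exactly
 * "index<N>", or -1 for any other name.
 */
static int
parse_index_name(const char *name)
{
	int idx;
	char trailing;

	/* A return value of 1 means a number was read with nothing after it. */
	if (sscanf(name, "index%d%c", &idx, &trailing) != 1 || idx < 0)
		return -1;
	return idx;
}

/*
 * Hypothetical helper: return the highest N for which <cache_dir>/indexN
 * exists, or -1 if the directory cannot be opened or has no such entry.
 * cache_dir is expected to look like /sys/bus/cpu/devices/cpu0/cache.
 */
static int
highest_cache_index(const char *cache_dir)
{
	assert(cache_dir != NULL);

	DIR *dir = opendir(cache_dir);
	if (dir == NULL)
		return -1;

	int best = -1;
	struct dirent *ent;
	while ((ent = readdir(dir)) != NULL) {
		int idx = parse_index_name(ent->d_name);
		if (idx > best)
			best = idx;
	}
	closedir(dir);
	return best;
}
```

A caller would format the per-CPU directory with snprintf() and pass it to highest_cache_index(); a result of -1 means no cache information was found for that CPU.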
  
Wathsala Wathawana Vithanage Aug. 27, 2024, 8:56 p.m. UTC | #2
> -unsigned int rte_get_next_lcore(unsigned int i, int skip_main, int wrap)
> +#define LCORE_GET_LLC   \
> +		"ls -d /sys/bus/cpu/devices/cpu%u/cache/index[0-9] | sort  -r
> | grep -m1 index[0-9] | awk -F '[x]' '{print $2}' "
> 

This won't work for some SoCs.
How do you ensure that the index you got is for an LLC? Some SoCs may only show upper-level caches here, so it cannot be used blindly without knowing the SoC.
Also, executing a shell script is unacceptable; consider implementing this in C.

--wathsala
  
Feifei Wang Aug. 29, 2024, 3:21 a.m. UTC | #3
Hi,

> -----Original Message-----
> From: Wathsala Wathawana Vithanage <wathsala.vithanage@arm.com>
> Sent: August 28, 2024 4:56
> To: Vipin Varghese <vipin.varghese@amd.com>; ferruh.yigit@amd.com;
> dev@dpdk.org
> Cc: nd <nd@arm.com>; nd <nd@arm.com>
> Subject: RE: [RFC 1/2] eal: add llc aware functions
> 
> > -unsigned int rte_get_next_lcore(unsigned int i, int skip_main, int wrap)
> > +#define LCORE_GET_LLC   \
> > +		"ls -d /sys/bus/cpu/devices/cpu%u/cache/index[0-9] | sort  -r
> > | grep -m1 index[0-9] | awk -F '[x]' '{print $2}' "
> >
> 
> This won't work for some SOCs.
> How to ensure the index you got is for an LLC? Some SOCs may only show
> upper-level caches here, therefore cannot be use blindly without knowing the
> SOC.
> Also, unacceptable to execute a shell script, consider implementing in C.

Maybe:
For Arm, we could read the MPIDR_EL1 register to obtain the CPU cluster topology.
MPIDR_EL1 affinity fields:
[39:32]	Aff3	(Level 3 affinity)
[23:16]	Aff2	(Level 2 affinity)
[15:8]	Aff1	(Level 1 affinity)
[7:0]	Aff0	(Level 0 affinity)

For x86, we can use the APIC ID:
the APIC ID includes the cluster ID, die ID, SMT ID and core ID.
 
This avoids executing a shell script; Arm and x86 would each take a different path to implement this.
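
As a rough sketch of the Arm half of this idea, the affinity fields can be pulled out of a raw MPIDR_EL1 value with shifts and masks. This assumes the Armv8-A layout (Aff0 in bits [7:0], Aff1 in [15:8], Aff2 in [23:16], Aff3 in [39:32]). How the raw value is obtained is left out. The names mpidr_decode, mpidr_same_group and struct mpidr_affinity are hypothetical, and the grouping rule (all levels above Aff0 match) is only an assumption, since which affinity level corresponds to a cluster depends on the SoC.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the four MPIDR_EL1 affinity fields. */
struct mpidr_affinity {
	uint8_t aff0;
	uint8_t aff1;
	uint8_t aff2;
	uint8_t aff3;
};

/* Split a raw MPIDR_EL1 value into its affinity fields. */
static struct mpidr_affinity
mpidr_decode(uint64_t mpidr)
{
	struct mpidr_affinity a = {
		.aff0 = (uint8_t)(mpidr & 0xff),
		.aff1 = (uint8_t)((mpidr >> 8) & 0xff),
		.aff2 = (uint8_t)((mpidr >> 16) & 0xff),
		.aff3 = (uint8_t)((mpidr >> 32) & 0xff),
	};
	return a;
}

/*
 * Assumed grouping rule: two CPUs belong to the same group when
 * Aff1, Aff2 and Aff3 all match. Real SoCs may need a different level.
 */
static int
mpidr_same_group(uint64_t a, uint64_t b)
{
	const uint64_t upper_mask = 0xff00ffff00ULL; /* Aff3 | Aff2 | Aff1 */

	return (a & upper_mask) == (b & upper_mask);
}

/* Worked example with a made-up register value. */
static void
mpidr_decode_example(void)
{
	/* Aff3 = 2, bit 31 set, Aff2 = 3, Aff1 = 2, Aff0 = 1 */
	struct mpidr_affinity a = mpidr_decode(0x0000000280030201ULL);

	assert(a.aff0 == 1 && a.aff1 == 2 && a.aff2 == 3 && a.aff3 == 2);
}
```

The x86 side would need CPUID-based APIC ID decoding and is not sketched here.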

Best Regards
Feifei
> --wathsala
>
  
Varghese, Vipin Sept. 2, 2024, 12:27 a.m. UTC | #4
<snipped>
>> +             "ls -d /sys/bus/cpu/devices/cpu%u/cache/index[0-9] | sort  -r  | grep -m1 index[0-9] | awk -F '[x]' '{print $2}' "
> NAK
> Running shell commands from EAL is non-portable and likely to be flagged by security scanning tools.
>
> Do it in C please.
Thank you, Stephen, for pointing this out. We will certainly convert this
to C rather than shell execution.
  
Varghese, Vipin Sept. 2, 2024, 1:20 a.m. UTC | #5
<Snipped>
>> -unsigned int rte_get_next_lcore(unsigned int i, int skip_main, int wrap)
>> +#define LCORE_GET_LLC   \
>> +             "ls -d /sys/bus/cpu/devices/cpu%u/cache/index[0-9] | sort  -r
>> | grep -m1 index[0-9] | awk -F '[x]' '{print $2}' "
>>
> This won't work for some SOCs.

Thank you for your response. Please find our response and queries below.

> How to ensure the index you got is for an LLC?

We referred to:

 - "How CPU topology info is exported via sysfs", The Linux Kernel
   documentation
   <https://www.kernel.org/doc/html/latest/admin-guide/cputopology.html>
 - linux/Documentation/ABI/stable/sysfs-devices-system-cpu at master,
   torvalds/linux (github.com)
   <https://github.com/torvalds/linux/blob/master/Documentation/ABI/stable/sysfs-devices-system-cpu>
 - "Get Cache Info in Linux on ARMv8 64-bit Platform" (zhiyisun.github.io)
   <https://zhiyisun.github.io/2016/06/25/Get-Cache-Info-in-Linux-on-ARMv8-64-bit-Platform.html>

Based on my current understanding, on a bare-metal 64-bit Linux OS (which
most distros support), the cache topology is populated in sysfs.

>   Some SOCs may only show upper-level caches here, therefore cannot be use blindly without knowing the SOC.
Can you please help us understand:

1. Are there specific SoCs which do not populate the information at
all? If yes, are they in DTS?

2. Are there specific SoCs which do not export it to a hypervisor like
QEMU or Xen?


We can work together to make it compatible.

> Also, unacceptable to execute a shell script, consider implementing in C.
As the intention of the RFC is to share a possible API and macros, we
welcome suggestions on the implementation, as agreed with Stephen.
>
> --wathsala
>
>
  
Wathsala Wathawana Vithanage Sept. 3, 2024, 5:54 p.m. UTC | #6
> 	 Some SOCs may only show upper-level caches here, therefore cannot
> be use blindly without knowing the SOC.
> 
> Can you please help us understand
> 

For instance, on Neoverse N1 the use of the SLC as an LLC can be disabled (a BIOS setting).
If the SLC is not used as an LLC, then your script would report the unified L2 as the LLC.
I don't think that's what you are interested in.

> 1. if there are specific SoC which do not populate the information at all? If yes
> are they in DTS?

This information is populated correctly for all SoCs; my comment was on the script.
  
Bruce Richardson Sept. 4, 2024, 8:18 a.m. UTC | #7
On Tue, Sep 03, 2024 at 05:54:22PM +0000, Wathsala Wathawana Vithanage wrote:
> > 	 Some SOCs may only show upper-level caches here, therefore cannot
> > be use blindly without knowing the SOC.
> > 
> > Can you please help us understand
> > 
> 
> For instance, in Neoverse N1 can disable the use of SLC as LLC (a BIOS setting)
> If SLC is not used as LLC, then your script would report the unified L2 as an LLC.
> I don't think that's what you are interested in.
> 
> > 1. if there are specific SoC which do not populate the information at all? If yes
> > are they in DTS?
> 
> This information is populated correctly for all SOCs, comment was on the script.
> 
Given all the complexities around topologies, do we want this covered by
DPDK at all? Are we better to just recommend, to any applications that
need it, that they get the info straight from the kernel via sysfs? Why
have DPDK play the middle-man here, proxying the info from sysfs to the
app?
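
To make the "straight from sysfs" option concrete, here is a minimal sketch of what an application could do on its own. It assumes the per-cache shared_cpu_list file that the patch reads, whose content is a CPU list such as "0-7,64-71". The helper names read_shared_cpu_list and parse_cpu_list, and the caller-supplied present[] array, are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Read the first line of shared_cpu_list for one CPU and cache index. */
static int
read_shared_cpu_list(unsigned int cpu, unsigned int index, char *buf, size_t len)
{
	char path[128];

	snprintf(path, sizeof(path),
		"/sys/bus/cpu/devices/cpu%u/cache/index%u/shared_cpu_list",
		cpu, index);
	FILE *fp = fopen(path, "r");
	if (fp == NULL)
		return -1;
	char *line = fgets(buf, (int)len, fp);
	fclose(fp);
	return line == NULL ? -1 : 0;
}

/*
 * Parse a CPU list such as "0-7,64-71\n" and mark each listed CPU in
 * present[]. Returns the number of CPUs newly marked, or -1 on malformed
 * input or when a CPU id does not fit in max_cpus.
 */
static int
parse_cpu_list(const char *list, bool *present, unsigned int max_cpus)
{
	assert(list != NULL && present != NULL);

	int count = 0;
	const char *p = list;

	while (*p != '\0' && *p != '\n') {
		char *end;

		errno = 0;
		unsigned long first = strtoul(p, &end, 10);
		if (end == p || errno != 0)
			return -1;
		unsigned long last = first;
		if (*end == '-') {
			/* Range form "first-last". */
			p = end + 1;
			last = strtoul(p, &end, 10);
			if (end == p || errno != 0)
				return -1;
		}
		if (last < first || last >= max_cpus)
			return -1;
		for (unsigned long cpu = first; cpu <= last; cpu++) {
			if (!present[cpu]) {
				present[cpu] = true;
				count++;
			}
		}
		if (*end == ',')
			end++;
		else if (*end != '\0' && *end != '\n')
			return -1;
		p = end;
	}
	return count;
}
```

Unlike a fixed two-token split, this loop accepts any number of comma-separated ranges, which matters on parts where SMT siblings are numbered far apart.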

/Bruce
  
Varghese, Vipin Sept. 6, 2024, 11:59 a.m. UTC | #8


<snipped>

> >        Some SOCs may only show upper-level caches here, therefore
> > cannot be use blindly without knowing the SOC.
> >
> > Can you please help us understand
> >
>
> For instance, in Neoverse N1 can disable the use of SLC as LLC (a BIOS setting)
> If SLC is not used as LLC, then your script would report the unified L2 as an LLC.

Does `disabling SLC as LLC` disable L3? I think not. What you are implying is that ` ls -d /sys/bus/cpu/devices/cpu%u/cache/index[0-9] | sort -r ... ` will return index2 and not index3. Is this understanding correct?


> I don't think that's what you are interested in.
My intention, as shared, is this: whether the BIOS setting for CPU NUMA is enabled or not, I would like to allow the end customer to get the core complexes (tiles) which are under one group.
So, whether the `Last Level Cache` seen by the OS is L3 or L2, the API allows the end user to get the DPDK lcores sharing the last level cache.

But as per the earlier communication, specific SoCs behave differently when some settings are changed. For the AMD SoC case, we are trying to help the end user with the right settings through tuning guides, as pointed out in ` 12. How to get best performance on AMD platform, Data Plane Development Kit 24.11.0-rc0 documentation (dpdk.org)<https://doc.dpdk.org/guides/linux_gsg/amd_platform.html>`

Can you please confirm whether such tuning guides or recommended settings are shared? If not, can you please allow me to set up a technical call to sync on the same?

>
> > 1. if there are specific SoC which do not populate the information at
> > all? If yes are they in DTS?
>
> This information is populated correctly for all SOCs, comment was on the
> script.

Please note, I am not running any script. The command LCORE_GET_LLC is executed using the C function `popen`. As Stephen suggested, and as we have replied, we will change to C function logic to get the details.
I hope there is no longer any confusion on this.
  
Wathsala Wathawana Vithanage Sept. 12, 2024, 4:58 p.m. UTC | #9
<snipped>
> >
> > For instance, in Neoverse N1 can disable the use of SLC as LLC (a BIOS
> > setting) If SLC is not used as LLC, then your script would report the unified L2
> as an LLC.
> 
> Does `disabling SLC as LLC` disable L3? I think not, and what you are implying is
> the ` ls -d /sys/bus/cpu/devices/cpu%u/cache/index[0-9] | sort -r …… `  will
> return index2 and not index3. Is this the understanding?
> 
It disables the use of the SLC as an LLC for the CPUs, and the command will return index2.
Disabling the SLC as L3 is a feature of Arm CMN interconnects (SFONLY mode).
When the SLC is disabled as L3, firmware sets up the ACPI PPTT to reflect this change.
Using the PPTT, the kernel correctly enumerates the cache IDs and does not show an L3.

> 
> > I don't think that's what you are interested in.
> My intention as shared is to `whether BIOS setting for CPU NUMA is enabled
> or not, I would like to allow the end customer get the core complexes (tile)
> which are under one group`.
> So, if the `Last Level Cache` is L3 or L2 seen by OS, API allows the end user to
> get DPDK lcores sharing the last level cache.
> 
> But as per the earlier communication, specific SoC does not behave when
> some setting are done different. For AMD SoC case we are trying to help end
> user with right setting with tuning guides as pointed by ` 12. How to get best
> performance on AMD platform — Data Plane Development Kit 24.11.0-rc0
> documentation (dpdk.org)
> <https://doc.dpdk.org/guides/linux_gsg/amd_platform.html> `
> 
> Can you please confirm if such tuning guides or recommended settings are
> shared ? If not, can you please allow me to setup a technical call to sync on the
> same?
> 

Currently there is no such document for Arm, but we would like to have one. There are
some complexities too: not all SoC vendors use Arm's CMN interconnect.
I would be happy to sync over a call.

> >
> > > 1. if there are specific SoC which do not populate the information
> > > at all? If yes are they in DTS?
> >
> > This information is populated correctly for all SOCs, comment was on
> > the script.
> 
> Please note, I am not running any script. The command LCORE_GET_LLC is
> executed using the C function `popen`. As Stephen suggested, and as we have
> replied, we will change to C function logic to get the details.
> I hope there is no longer any confusion on this.
> 
If this is implemented using sysfs, then it needs to handle caveats like SFONLY mode.
Perhaps consulting /sys/bus/cpu/devices/cpu%u/cache/index[0-9]/type would help.
However, I prefer using hwloc to get this information accurately.
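
As a sketch of the sysfs route only (not the hwloc one), the per-index `type` file can be read together with the `level` file from the same directory, so that the caller sees which level was actually chosen. The names struct cache_desc, read_cache_attr and pick_llc are hypothetical. This does not solve the SFONLY case by itself: it only makes the reported level explicit, so a caller can decide whether an L2 result is acceptable.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical description of one <cache_dir>/indexN entry. */
struct cache_desc {
	int index;     /* N in indexN */
	int level;     /* value read from indexN/level */
	char type[16]; /* "Data", "Instruction" or "Unified" from indexN/type */
};

/*
 * Read the first line of <cache_dir>/index<N>/<attr> into buf and strip
 * the trailing newline. Returns 0 on success, -1 on any failure.
 */
static int
read_cache_attr(const char *cache_dir, int index, const char *attr,
		char *buf, size_t len)
{
	char path[256];
	int n = snprintf(path, sizeof(path), "%s/index%d/%s",
			cache_dir, index, attr);

	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	FILE *fp = fopen(path, "r");
	if (fp == NULL)
		return -1;
	char *line = fgets(buf, (int)len, fp);
	fclose(fp);
	if (line == NULL)
		return -1;
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/*
 * Pick the deepest cache that can hold data: skip instruction caches and
 * keep the entry with the highest level. Returns NULL if none qualifies.
 */
static const struct cache_desc *
pick_llc(const struct cache_desc *caches, size_t n)
{
	assert(caches != NULL || n == 0);

	const struct cache_desc *best = NULL;
	for (size_t i = 0; i < n; i++) {
		if (strcmp(caches[i].type, "Instruction") == 0)
			continue;
		if (best == NULL || caches[i].level > best->level)
			best = &caches[i];
	}
	return best;
}
```

A caller would fill one struct cache_desc per indexN directory using read_cache_attr() for "level" and "type", then check the level of the entry returned by pick_llc() before treating it as an LLC.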

Thanks

--wathsala
  

Patch

diff --git a/lib/eal/common/eal_common_lcore.c b/lib/eal/common/eal_common_lcore.c
index 2ff9252c52..4ff8b9e116 100644
--- a/lib/eal/common/eal_common_lcore.c
+++ b/lib/eal/common/eal_common_lcore.c
@@ -14,6 +14,7 @@ 
 #ifndef RTE_EXEC_ENV_WINDOWS
 #include <rte_telemetry.h>
 #endif
+#include <rte_string_fns.h>
 
 #include "eal_private.h"
 #include "eal_thread.h"
@@ -93,25 +94,279 @@  int rte_lcore_is_enabled(unsigned int lcore_id)
 	return cfg->lcore_role[lcore_id] == ROLE_RTE;
 }
 
-unsigned int rte_get_next_lcore(unsigned int i, int skip_main, int wrap)
+#define LCORE_GET_LLC   \
+		"ls -d /sys/bus/cpu/devices/cpu%u/cache/index[0-9] | sort  -r  | grep -m1 index[0-9] | awk -F '[x]' '{print $2}' "
+#define LCORE_GET_SHAREDLLC   \
+		"grep [0-9] /sys/bus/cpu/devices/cpu%u/cache/index%u/shared_cpu_list"
+
+unsigned int rte_get_llc_first_lcores (rte_cpuset_t *llc_cpu)
 {
-	i++;
-	if (wrap)
-		i %= RTE_MAX_LCORE;
+	CPU_ZERO((rte_cpuset_t *)llc_cpu);
 
-	while (i < RTE_MAX_LCORE) {
-		if (!rte_lcore_is_enabled(i) ||
-		    (skip_main && (i == rte_get_main_lcore()))) {
-			i++;
-			if (wrap)
-				i %= RTE_MAX_LCORE;
+	char cmdline[2048] = {'\0'};
+	char output_llc[8] = {'\0'};
+	char output_threads[16] = {'\0'};
+
+	for (unsigned int lcore =0; lcore < RTE_MAX_LCORE; lcore++)
+	{
+		if (!rte_lcore_is_enabled (lcore))
 			continue;
+
+		/* get sysfs llc index */
+		snprintf(cmdline, 2047, LCORE_GET_LLC, lcore);
+		FILE *fp = popen (cmdline, "r");
+		if (fp == NULL) {
+			return -1;
 		}
-		break;
+		if (fgets(output_llc, sizeof(output_llc) - 1, fp) == NULL) {
+			pclose(fp);
+			return -1;
+		}
+		pclose(fp);
+		int llc_index = atoi (output_llc);
+
+		/* get sysfs core group of the same core index*/
+		snprintf(cmdline, 2047, LCORE_GET_SHAREDLLC, lcore, llc_index);
+		fp = popen (cmdline, "r");
+		if (fp == NULL) {
+			return -1;
+		}
+		if (fgets(output_threads, sizeof(output_threads) - 1, fp) == NULL) {
+			pclose(fp);
+			return -1;
+		}
+		pclose(fp);
+
+		output_threads [strlen(output_threads) - 1] = '\0';
+	        char *smt_thrds[2];
+		int smt_threads = rte_strsplit(output_threads, sizeof(output_threads), smt_thrds, 2, ',');
+
+		for (int  index = 0; index < smt_threads; index++) {
+			char *llc[2] = {'\0'};
+			int smt_cpu = rte_strsplit(smt_thrds[index], sizeof(smt_thrds[index]), llc, 2, '-');
+			RTE_SET_USED(smt_cpu);
+	
+			unsigned int first_cpu = atoi (llc[0]);
+			unsigned int last_cpu = (NULL == llc[1]) ? atoi (llc[0]) : atoi (llc[1]);
+		
+	
+			for (unsigned int temp_cpu = first_cpu; temp_cpu <= last_cpu; temp_cpu++) {
+				if (rte_lcore_is_enabled(temp_cpu)) {
+					CPU_SET (temp_cpu, (rte_cpuset_t *) llc_cpu);
+					lcore = last_cpu;
+					break;
+				}
+			}
+		}
+	}
+
+	return CPU_COUNT((rte_cpuset_t *)llc_cpu);
+}
+
+unsigned int
+rte_get_llc_lcore (unsigned int lcore, rte_cpuset_t *llc_cpu,
+		unsigned int *first_cpu, unsigned int * last_cpu)
+{
+	CPU_ZERO((rte_cpuset_t *)llc_cpu);
+
+	char cmdline[2048] = {'\0'};
+	char output_llc[8] = {'\0'};
+	char output_threads[16] = {'\0'};
+
+	*first_cpu = *last_cpu = RTE_MAX_LCORE;
+
+	/* get sysfs llc index */
+	snprintf(cmdline, 2047, LCORE_GET_LLC, lcore);
+	FILE *fp = popen (cmdline, "r");
+	if (fp == NULL) {
+		return -1;
+	}
+	if (fgets(output_llc, sizeof(output_llc) - 1, fp) == NULL) {
+		pclose(fp);
+		return -1;
+	}
+	pclose(fp);
+	int llc_index = atoi (output_llc);
+
+	/* get sysfs core group of the same core index*/
+	snprintf(cmdline, 2047, LCORE_GET_SHAREDLLC, lcore, llc_index);
+	fp = popen (cmdline, "r");
+	if (fp == NULL) {
+		return -1;
+	}
+
+	if (fgets(output_threads, sizeof(output_threads) - 1, fp) == NULL) {
+		pclose(fp);
+		return -1;
 	}
-	return i;
+	pclose(fp);
+
+	output_threads [strlen(output_threads) - 1] = '\0';
+        char *smt_thrds[2];
+	int smt_threads = rte_strsplit(output_threads, sizeof(output_threads), smt_thrds, 2, ',');
+
+	bool found_first_cpu = false;
+	unsigned int first_lcore_cpu = RTE_MAX_LCORE;
+	unsigned int last_lcore_cpu = RTE_MAX_LCORE;
+
+	for (int  index = 0; index < smt_threads; index++) {
+		char *llc[2] = {'\0'};
+		int smt_cpu = rte_strsplit(smt_thrds[index], sizeof(smt_thrds[index]), llc, 2, '-');
+		RTE_SET_USED(smt_cpu);
+
+		char *end = NULL;
+		*first_cpu = strtoul (llc[0], end, 10);
+		*last_cpu = (1 == smt_cpu) ? strtoul (llc[0], end, 10) : strtoul (llc[1], end, 10);
+
+		unsigned int temp_cpu = RTE_MAX_LCORE;
+		RTE_LCORE_FOREACH(temp_cpu) {
+			if ((temp_cpu >= *first_cpu) && (temp_cpu <= *last_cpu)) {
+				CPU_SET (temp_cpu, (rte_cpuset_t *) llc_cpu);
+				//printf ("rte_get_llc_lcore: temp_cpu %u count %u \n", temp_cpu, CPU_COUNT(llc_cpu));
+
+				if (false == found_first_cpu) {
+				 	first_lcore_cpu = temp_cpu;
+					found_first_cpu = true;
+				}
+				last_lcore_cpu = temp_cpu;
+			}
+			//printf ("rte_get_llc_lcore: first %u last %u \n", first_lcore_cpu, last_lcore_cpu);
+		}
+	}
+
+	*first_cpu = first_lcore_cpu;
+	*last_cpu = last_lcore_cpu;
+
+	//printf ("rte_get_llc_lcore: first %u last %u count %u \n", *first_cpu, *last_cpu, CPU_COUNT(llc_cpu));
+	return CPU_COUNT((rte_cpuset_t *)llc_cpu);
+}
+
+unsigned int
+rte_get_llc_n_lcore (unsigned int lcore, rte_cpuset_t *llc_cpu,
+		unsigned int *first_cpu, unsigned int * last_cpu,
+		unsigned int n, bool skip)
+{
+	bool found_first_cpu = false;
+	bool found_last_cpu = false;
+	unsigned int first_lcore_cpu = RTE_MAX_LCORE;
+	unsigned int last_lcore_cpu = RTE_MAX_LCORE;
+
+	unsigned int temp_count = n;
+	unsigned int count = rte_get_llc_lcore (lcore, llc_cpu, first_cpu, last_cpu);
+
+	//printf ("rte_get_llc_n_lcore: first %u last %u count %u \n", *first_cpu, *last_cpu, CPU_COUNT(llc_cpu));
+
+	unsigned int temp_cpu = RTE_MAX_LCORE;
+	unsigned int temp_last_cpu = RTE_MAX_LCORE;
+	if (false == skip) {
+		if (count < n)
+			return 0;
+
+		RTE_LCORE_FOREACH(temp_cpu) {
+			if ((temp_cpu >= *first_cpu) && (temp_cpu <= *last_cpu)) {
+				if (CPU_ISSET(temp_cpu, llc_cpu) && (temp_count)) {
+					//printf ("rte_get_llc_n_lcore: temp - count %d cpu %u skip %u first %u last %u \n", temp_count, temp_cpu, skip, *first_cpu, *last_cpu);
+					if (false == found_first_cpu) {
+						*first_cpu = temp_cpu;
+						found_first_cpu = true;
+					}
+					temp_last_cpu = temp_cpu;
+
+					temp_count -= 1;
+					continue;
+				} 
+			}
+			CPU_CLR(temp_cpu, llc_cpu);
+		}
+		*last_cpu = temp_last_cpu;
+		//printf ("rte_get_llc_n_lcore: start %u last %u count %u\n", *first_cpu, *last_cpu, CPU_COUNT(llc_cpu));
+		return n;
+	}
+
+	int total_core = CPU_COUNT(llc_cpu) - n;
+	if (total_core <= 0)
+		return 0;
+
+	RTE_LCORE_FOREACH(temp_cpu) {
+		if ((temp_cpu >= *first_cpu) && (temp_cpu <= *last_cpu)) {
+			if (CPU_ISSET(temp_cpu, llc_cpu) && (temp_count)) {
+				if (temp_count) {
+					CPU_CLR(temp_cpu, llc_cpu);
+					temp_count -= 1;
+					continue;
+				}
+
+				if (false == found_first_cpu) {
+					*first_cpu = temp_cpu;
+					found_first_cpu = true;
+				}
+				*last_cpu = temp_cpu;
+			}
+		}
+	}
+
+	//printf ("rte_get_llc_n_lcore: start %u last %u count %u\n", *first_cpu, *last_cpu, total_core);
+	return total_core;
+#if 0
+	if (false == skip) {
+		unsigned int start = *first_cpu, end = *last_cpu, temp_last_cpu = *last_cpu;
+		for (; (start <= end); start++)
+		{
+			if (CPU_ISSET(start, llc_cpu) && (temp_count)) {
+				temp_count -= 1;
+				continue;
+			} else if (CPU_ISSET(start, llc_cpu)) {
+				temp_last_cpu = (false == is_last_cpu) ? (start - 1) : temp_last_cpu;
+				is_last_cpu = true;
+
+				CPU_CLR(start, llc_cpu);
+			}
+		}
+		*last_cpu = temp_last_cpu;
+		return n;
+	}
+
+	int total_core = CPU_COUNT(llc_cpu) - n;
+	if (total_core <= 0)
+		return 0;
+
+	bool is_first_cpu = false;
+	unsigned int temp_last_cpu = *last_cpu;
+	for (unsigned int start = *first_cpu, end = *last_cpu; (start <= end) && (temp_count); start++)
+	{
+		if (CPU_ISSET(start, llc_cpu) && (temp_count)) {
+			*first_cpu = (is_first_cpu == false) ? start : *first_cpu;
+			temp_last_cpu = start;
+			CPU_CLR(start, llc_cpu);
+			temp_count -= 1;
+		}
+	}
+
+	*last_cpu = temp_last_cpu;
+	return total_core;
+#endif
+}
+
+unsigned int rte_get_next_lcore(unsigned int i, int skip_main, int wrap)
+{
+        i++;
+        if (wrap)
+                i %= RTE_MAX_LCORE;
+
+        while (i < RTE_MAX_LCORE) {
+                if (!rte_lcore_is_enabled(i) ||
+                    (skip_main && (i == rte_get_main_lcore()))) {
+                        i++;
+                        if (wrap)
+                                i %= RTE_MAX_LCORE;
+                        continue;
+                }
+                break;
+        }
+        return i;
 }
 
+
 unsigned int
 rte_lcore_to_socket_id(unsigned int lcore_id)
 {