[RFC,0/2] introduce LLC aware functions

Message ID 20240827151014.201-1-vipin.varghese@amd.com (mailing list archive)

Message

Vipin Varghese Aug. 27, 2024, 3:10 p.m. UTC
As core density continues to increase, chiplet-based
core packing has become a key trend. In AMD EPYC SoC
architectures, core complexes within the same chiplet
share a Last-Level Cache (LLC). By packing logical cores
within the same LLC, we can enhance pipeline processing
stages due to reduced latency and improved data locality.

To leverage these benefits, DPDK libraries and examples
can utilize localized lcores. This approach ensures more
consistent latencies by minimizing the dispersion of lcores
across different chiplet complexes and enhances packet
processing by ensuring that data for subsequent pipeline
stages is likely to reside within the LLC.

< Function: Purpose >
---------------------
 - rte_get_llc_first_lcores: Retrieves the first lcore of each shared LLC.
 - rte_get_llc_lcore: Retrieves all lcores that share the same LLC.
 - rte_get_llc_n_lcore: Retrieves the first n lcores, or skips the first n lcores, in the shared LLC.

< MACRO: Purpose >
------------------
RTE_LCORE_FOREACH_LLC_FIRST: iterates through the first lcore of each LLC.
RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through the first worker lcore of each LLC.
RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from the LLC based on a hint (lcore id).
RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from the LLC while skipping the first worker.
RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through the first `n` lcores from each LLC.
RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skips the first `n` lcores, then iterates through the remaining lcores in each LLC.
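
The intended usage is sketched below (illustrative only; the exact macro
arguments are defined in patch 2/2, and llc_leader_fn() is a placeholder):

  /* Hypothetical worker; the single-iterator macro form is an assumption. */
  static int
  llc_leader_fn(void *arg)
  {
      RTE_SET_USED(arg);
      /* per-LLC work, e.g. service one RX queue per LLC */
      return 0;
  }

  static void
  launch_one_worker_per_llc(void)
  {
      unsigned int lcore;

      /* launch work on the first worker lcore of every LLC */
      RTE_LCORE_FOREACH_LLC_FIRST_WORKER(lcore)
          rte_eal_remote_launch(llc_leader_fn, NULL, lcore);
  }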

Vipin Varghese (2):
  eal: add llc aware functions
  eal/lcore: add llc aware for each macro

 lib/eal/common/eal_common_lcore.c | 279 ++++++++++++++++++++++++++++--
 lib/eal/include/rte_lcore.h       |  89 ++++++++++
 2 files changed, 356 insertions(+), 12 deletions(-)
  

Comments

Mattias Rönnblom Aug. 27, 2024, 9:23 p.m. UTC | #1
On 2024-08-27 17:10, Vipin Varghese wrote:
> As core density continues to increase, chiplet-based
> core packing has become a key trend. In AMD SoC EPYC
> architectures, core complexes within the same chiplet
> share a Last-Level Cache (LLC). By packing logical cores
> within the same LLC, we can enhance pipeline processing
> stages due to reduced latency and improved data locality.
> 
> To leverage these benefits, DPDK libraries and examples
> can utilize localized lcores. This approach ensures more
> consistent latencies by minimizing the dispersion of lcores
> across different chiplet complexes and enhances packet
> processing by ensuring that data for subsequent pipeline
> stages is likely to reside within the LLC.
> 

Shouldn't we have a separate CPU/cache hierarchy API instead?

Could potentially be built on the 'hwloc' library.

I much agree cache/core topology may be of interest to the application
(or a work scheduler, like a DPDK event device), but it's not limited to
LLC. It may well be worthwhile to care about which cores share an L2
cache, for example. Not sure the RTE_LCORE_FOREACH_* approach scales.

> < Function: Purpose >
> ---------------------
>   - rte_get_llc_first_lcores: Retrieves all the first lcores in the shared LLC.
>   - rte_get_llc_lcore: Retrieves all lcores that share the LLC.
>   - rte_get_llc_n_lcore: Retrieves the first n or skips the first n lcores in the shared LLC.
> 
> < MACRO: Purpose >
> ------------------
> RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from each LLC.
> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first worker lcore from each LLC.
> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on hint (lcore id).
> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC while skipping first worker.
> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores from each LLC.
> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then iterates through reaming lcores in each LLC.
> 
> Vipin Varghese (2):
>    eal: add llc aware functions
>    eal/lcore: add llc aware for each macro
> 
>   lib/eal/common/eal_common_lcore.c | 279 ++++++++++++++++++++++++++++--
>   lib/eal/include/rte_lcore.h       |  89 ++++++++++
>   2 files changed, 356 insertions(+), 12 deletions(-)
>
  
Burakov, Anatoly Aug. 28, 2024, 8:38 a.m. UTC | #2
On 8/27/2024 5:10 PM, Vipin Varghese wrote:
> As core density continues to increase, chiplet-based
> core packing has become a key trend. In AMD SoC EPYC
> architectures, core complexes within the same chiplet
> share a Last-Level Cache (LLC). By packing logical cores
> within the same LLC, we can enhance pipeline processing
> stages due to reduced latency and improved data locality.
> 
> To leverage these benefits, DPDK libraries and examples
> can utilize localized lcores. This approach ensures more
> consistent latencies by minimizing the dispersion of lcores
> across different chiplet complexes and enhances packet
> processing by ensuring that data for subsequent pipeline
> stages is likely to reside within the LLC.
> 
> < Function: Purpose >
> ---------------------
>   - rte_get_llc_first_lcores: Retrieves all the first lcores in the shared LLC.
>   - rte_get_llc_lcore: Retrieves all lcores that share the LLC.
>   - rte_get_llc_n_lcore: Retrieves the first n or skips the first n lcores in the shared LLC.
> 
> < MACRO: Purpose >
> ------------------
> RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from each LLC.
> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first worker lcore from each LLC.
> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on hint (lcore id).
> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC while skipping first worker.
> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores from each LLC.
> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then iterates through reaming lcores in each LLC.
> 

Hi Vipin,

I recently looked into how Intel's Sub-NUMA Clustering would work within 
DPDK, and found that I actually didn't have to do anything, because the 
SNC "clusters" present themselves as NUMA nodes, which DPDK already 
supports natively.

Does AMD's implementation of chiplets not report themselves as separate 
NUMA nodes? Because if it does, I don't really think any changes are 
required because NUMA nodes would give you the same thing, would it not?
  
Vipin Varghese Sept. 2, 2024, 12:39 a.m. UTC | #3
<snipped>

Thank you Mattias for the comments and questions; please let me try to
explain them below.

> We shouldn't have a separate CPU/cache hierarchy API instead?

Based on the intention to bring in CPU lcores which share the same L3 (for
better cache hits and less noisy neighbours), the current API focuses on
using the Last Level Cache. But if the suggestion is `there are SoCs where
the L2 cache is also shared, and the new API should provision for that`, I
am also comfortable with the thought.

>
> Could potentially be built on the 'hwloc' library.

There are 3 reasons we did not explore this path on the AMD SoC:

1. depending on the hwloc version and the kernel version, certain SoC
hierarchies are not available

2. CPU NUMA and IO (memory & PCIe) NUMA are independent on the AMD EPYC SoC.

3. it adds an extra library dependency that has to be made available for
this to work.


Hence we have tried to use the Linux-documented generic layer of `sysfs CPU
cache`.
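
For reference, the sysfs interface we rely on can be read as below (a
minimal sketch assuming the standard Linux layout; index3 is typically the
unified L3 on x86, a robust implementation walks every indexN entry and
checks its "level" and "type" files):

  #include <stdio.h>

  static void
  print_llc_siblings(unsigned int cpu)
  {
      char path[128];
      char buf[256];
      FILE *f;

      snprintf(path, sizeof(path),
               "/sys/devices/system/cpu/cpu%u/cache/index3/shared_cpu_list",
               cpu);
      f = fopen(path, "r");
      if (f == NULL)
          return;
      /* shared_cpu_list holds a range list such as "0-7,128-135" */
      if (fgets(buf, sizeof(buf), f) != NULL)
          printf("cpu%u LLC siblings: %s", cpu, buf);
      fclose(f);
  }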

I will try to explore hwloc more and check whether other libraries within
DPDK leverage it.

>
> I much agree cache/core topology may be of interest of the application
> (or a work scheduler, like a DPDK event device), but it's not limited to
> LLC. It may well be worthwhile to care about which cores shares L2
> cache, for example. Not sure the RTE_LCORE_FOREACH_* approach scales.

Yes, totally understood; on some SoCs multiple lcores share the same L2 cache.


Can we rework the API to be rte_get_cache_<function>, where the user
argument is the desired cache-level index (a possible prototype is
sketched after the list)?

1. index-1: SMT threads

2. index-2: threads sharing the same L2 cache

3. index-3: threads sharing the same L3 cache

4. index-MAX: threads sharing the last level cache
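
A hypothetical prototype, only to illustrate the idea (the name, parameters
and return convention are my assumptions, not part of this RFC):

  /*
   * level 1: SMT siblings of lcore_id
   * level 2: lcores sharing the same L2 cache
   * level 3: lcores sharing the same L3 cache
   * level UINT32_MAX: lcores sharing the last level cache
   *
   * Returns the number of lcore ids written into lcores (at most max).
   */
  unsigned int
  rte_get_cache_lcores(unsigned int lcore_id, unsigned int level,
                       unsigned int *lcores, unsigned int max);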

>
>> < Function: Purpose >
>> ---------------------
>>   - rte_get_llc_first_lcores: Retrieves all the first lcores in the 
>> shared LLC.
>>   - rte_get_llc_lcore: Retrieves all lcores that share the LLC.
>>   - rte_get_llc_n_lcore: Retrieves the first n or skips the first n 
>> lcores in the shared LLC.
>>
>> < MACRO: Purpose >
>> ------------------
>> RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from 
>> each LLC.
>> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first worker 
>> lcore from each LLC.
>> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on hint 
>> (lcore id).
>> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC 
>> while skipping first worker.
>> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores 
>> from each LLC.
>> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then 
>> iterates through reaming lcores in each LLC.
>>
While the MACROs are simple wrappers invoking the appropriate API, can this
be worked out in this fashion?

<snipped>
  
Vipin Varghese Sept. 2, 2024, 1:08 a.m. UTC | #4
<Snipped>

Thank you Anatoly for the response. Let me try to share my understanding.

> I recently looked into how Intel's Sub-NUMA Clustering would work within
> DPDK, and found that I actually didn't have to do anything, because the
> SNC "clusters" present themselves as NUMA nodes, which DPDK already
> supports natively.

Yes, this is correct. In the Intel Xeon Platinum BIOS one can enable
`Cluster per NUMA` as `1, 2 or 4`.

This divides the tiles into Sub-NUMA partitions, each having separate
lcores, memory controllers, PCIe and accelerators.

>
> Does AMD's implementation of chiplets not report themselves as separate
> NUMA nodes? 

In the AMD EPYC SoC this is different. There are 2 BIOS settings, namely

1. NPS: `NUMA Per Socket`, which allows the IO tile (memory, PCIe and
Accelerator) to be partitioned as NPS 0, 1, 2 or 4.

2. L3 as NUMA: `L3 cache of CPU tiles as individual NUMA`. This allows
all CPU tiles to be independent NUMA nodes.


The above settings are possible because the CPU is independent from the IO
tile, thus allowing 4 combinations to be available for use.

These are covered in the tuning guide for the SoC, section 12. How to get
best performance on AMD platform — Data Plane Development Kit 24.07.0
documentation (dpdk.org)
<https://doc.dpdk.org/guides/linux_gsg/amd_platform.html>.


> Because if it does, I don't really think any changes are
> required because NUMA nodes would give you the same thing, would it not?

I have a different opinion on this outlook. An end user can

1. identify the lcores and their NUMA node using `usertools/cpu-layout.py`

2. but it is the core mask in the EAL arguments which makes the threads
available to be used in a process.

3. there is no API which distinguishes the L3 NUMA domain. The function
`rte_socket_id
<https://doc.dpdk.org/api/rte__lcore_8h.html#a7c8da4664df26a64cf05dc508a4f26df>`
for CPU tiles like the AMD SoC will return the physical socket.


Example: in AMD EPYC Genoa there are a total of 13 tiles, 12 CPU tiles
and 1 IO tile. Setting

1. NPS to 4 will divide the memory, PCIe and accelerators into 4 domains,
while all the CPUs will appear as a single NUMA node, with each of the 12
tiles having independent L3 caches.

2. `L3 as NUMA` allows each tile to appear as a separate L3 cluster.


Hence, adding an API which allows selecting the available lcores based on
the split L3 is essential, irrespective of the BIOS setting.


>
> -- 
> Thanks,
> Anatoly
>
  
Burakov, Anatoly Sept. 2, 2024, 2:17 p.m. UTC | #5
On 9/2/2024 3:08 AM, Varghese, Vipin wrote:
> <Snipped>
> 
> Thank you Antaloy for the response. Let me try to share my understanding.
> 
>> I recently looked into how Intel's Sub-NUMA Clustering would work within
>> DPDK, and found that I actually didn't have to do anything, because the
>> SNC "clusters" present themselves as NUMA nodes, which DPDK already
>> supports natively.
> 
> yes, this is correct. In Intel Xeon Platinum BIOS one can enable 
> `Cluster per NUMA` as `1,2 or4`.
> 
> This divides the tiles into Sub-Numa parition, each having separate 
> lcores,memory controllers, PCIe
> 
> and accelerator.
> 
>>
>> Does AMD's implementation of chiplets not report themselves as separate
>> NUMA nodes? 
> 
> In AMD EPYC Soc, this is different. There are 2 BIOS settings, namely
> 
> 1. NPS: `Numa Per Socket` which allows the IO tile (memory, PCIe and 
> Accelerator) to be partitioned as Numa 0, 1, 2 or 4.
> 
> 2. L3 as NUMA: `L3 cache of CPU tiles as individual NUMA`. This allows 
> all CPU tiles to be independent NUMA cores.
> 
> 
> The above settings are possible because CPU is independent from IO tile. 
> Thus allowing 4 combinations be available for use.

Sure, but presumably if the user wants to distinguish this, they have to 
configure their system appropriately. If user wants to take advantage of 
L3 as NUMA (which is what your patch proposes), then they can enable the 
BIOS knob and get that functionality for free. DPDK already supports this.

> 
> These are covered in the tuning gudie for the SoC in 12. How to get best 
> performance on AMD platform — Data Plane Development Kit 24.07.0 
> documentation (dpdk.org) 
> <https://doc.dpdk.org/guides/linux_gsg/amd_platform.html>.
> 
> 
>> Because if it does, I don't really think any changes are
>> required because NUMA nodes would give you the same thing, would it not?
> 
> I have a different opinion to this outlook. An end user can
> 
> 1. Identify the lcores and it's NUMA user `usertools/cpu-layout.py`

I recently submitted an enhancement for the CPU layout script to print out
NUMA separately from physical socket [1].

[1] 
https://patches.dpdk.org/project/dpdk/patch/40cf4ee32f15952457ac5526cfce64728bd13d32.1724323106.git.anatoly.burakov@intel.com/

I believe when "L3 as NUMA" is enabled in BIOS, the script will display 
both physical package ID as well as NUMA nodes reported by the system, 
which will be different from physical package ID, and which will display 
information you were looking for.

> 
> 2. But it is core mask in eal arguments which makes the threads 
> available to be used in a process.

See above: if the OS already reports NUMA information, this is not a 
problem to be solved, CPU layout script can give this information to the 
user.

> 
> 3. there are no API which distinguish L3 numa domain. Function 
> `rte_socket_id 
> <https://doc.dpdk.org/api/rte__lcore_8h.html#a7c8da4664df26a64cf05dc508a4f26df>` for CPU tiles like AMD SoC will return physical socket.

Sure, but I would think the answer to that would be to introduce an API 
to distinguish between NUMA (socket ID in DPDK parlance) and package 
(physical socket ID in the "traditional NUMA" sense). Once we can 
distinguish between those, DPDK can just rely on NUMA information 
provided by the OS, while still being capable of identifying physical 
sockets if the user so desires.

I am actually going to introduce API to get *physical socket* (as 
opposed to NUMA node) in the next few days.

> 
> 
> Example: In AMD EPYC Genoa, there are total of 13 tiles. 12 CPU tiles 
> and 1 IO tile. Setting
> 
> 1. NPS to 4 will divide the memory, PCIe and accelerator into 4 domain. 
> While the all CPU will appear as single NUMA but each 12 tile having 
> independent L3 caches.
> 
> 2. Setting `L3 as NUMA` allows each tile to appear as separate L3 clusters.
> 
> 
> Hence, adding an API which allows to select available lcores based on 
> Split L3 is essential irrespective of the BIOS setting.
> 

I think the crucial issue here is the "irrespective of BIOS setting" 
bit. If EAL is getting into the game of figuring out exact intricacies 
of physical layout of the system, then there's a lot more work to be 
done as there are lots of different topologies, as other people have 
already commented, and such an API needs *a lot* of thought put into it.

If, on the other hand, we leave this issue to the kernel, and only 
gather NUMA information provided by the kernel, then nothing has to be 
done - DPDK already supports all of this natively, provided the user has 
configured the system correctly.

Moreover, arguably DPDK already works that way: technically you can get 
physical socket information even absent of NUMA support in BIOS, but 
DPDK does not do that. Instead, if OS reports NUMA node as 0, that's 
what we're going with (even if we could detect multiple sockets from 
sysfs), and IMO it should stay that way unless there is a strong 
argument otherwise. We force the user to configure their system 
correctly as it is, and I see no reason to second-guess user's BIOS 
configuration otherwise.
  
Vipin Varghese Sept. 2, 2024, 3:33 p.m. UTC | #6
<snipped>
>>
>>> I recently looked into how Intel's Sub-NUMA Clustering would work 
>>> within
>>> DPDK, and found that I actually didn't have to do anything, because the
>>> SNC "clusters" present themselves as NUMA nodes, which DPDK already
>>> supports natively.
>>
>> yes, this is correct. In Intel Xeon Platinum BIOS one can enable
>> `Cluster per NUMA` as `1,2 or4`.
>>
>> This divides the tiles into Sub-Numa parition, each having separate
>> lcores,memory controllers, PCIe
>>
>> and accelerator.
>>
>>>
>>> Does AMD's implementation of chiplets not report themselves as separate
>>> NUMA nodes?
>>
>> In AMD EPYC Soc, this is different. There are 2 BIOS settings, namely
>>
>> 1. NPS: `Numa Per Socket` which allows the IO tile (memory, PCIe and
>> Accelerator) to be partitioned as Numa 0, 1, 2 or 4.
>>
>> 2. L3 as NUMA: `L3 cache of CPU tiles as individual NUMA`. This allows
>> all CPU tiles to be independent NUMA cores.
>>
>>
>> The above settings are possible because CPU is independent from IO tile.
>> Thus allowing 4 combinations be available for use.
>
> Sure, but presumably if the user wants to distinguish this, they have to
> configure their system appropriately. If user wants to take advantage of
> L3 as NUMA (which is what your patch proposes), then they can enable the
> BIOS knob and get that functionality for free. DPDK already supports 
> this.
>
The intent of the RFC is to introduce the ability to select lcores within
the same L3 cache whether the BIOS is set or unset for `L3 as NUMA`. This
is also achieved and tested on platforms where the OS kernel advertises
this via sysfs, thus eliminating the dependency on hwloc and libnuma,
which can be different versions in different distros.


>>
>> These are covered in the tuning gudie for the SoC in 12. How to get best
>> performance on AMD platform — Data Plane Development Kit 24.07.0
>> documentation (dpdk.org)
>> <https://doc.dpdk.org/guides/linux_gsg/amd_platform.html>.
>>
>>
>>> Because if it does, I don't really think any changes are
>>> required because NUMA nodes would give you the same thing, would it 
>>> not?
>>
>> I have a different opinion to this outlook. An end user can
>>
>> 1. Identify the lcores and it's NUMA user `usertools/cpu-layout.py`
>
> I recently submitted an enhacement for CPU layout script to print out
> NUMA separately from physical socket [1].
>
> [1]
> https://patches.dpdk.org/project/dpdk/patch/40cf4ee32f15952457ac5526cfce64728bd13d32.1724323106.git.anatoly.burakov@intel.com/ 
>
>
> I believe when "L3 as NUMA" is enabled in BIOS, the script will display
> both physical package ID as well as NUMA nodes reported by the system,
> which will be different from physical package ID, and which will display
> information you were looking for.

At AMD we had submitted earlier work on the same via usertools: enhance
logic to display NUMA - Patchwork (dpdk.org)
<https://patchwork.dpdk.org/project/dpdk/patch/20220326073207.489694-1-vipin.varghese@amd.com/>.

This clearly distinguished NUMA from the physical socket.

>
>>
>> 2. But it is core mask in eal arguments which makes the threads
>> available to be used in a process.
>
> See above: if the OS already reports NUMA information, this is not a
> problem to be solved, CPU layout script can give this information to the
> user.

Agreed, but as pointed out, in the case of Intel Xeon Platinum SPR the
tile consists of CPU, memory, PCIe and accelerator.

Hence, on setting the BIOS option `Cluster per NUMA`, the OS kernel &
libnuma display the appropriate domain with memory, PCIe and CPU.


In the case of the AMD SoC, the libnuma view of the CPUs is different from
the memory NUMA per socket.

>
>>
>> 3. there are no API which distinguish L3 numa domain. Function
>> `rte_socket_id
>> <https://doc.dpdk.org/api/rte__lcore_8h.html#a7c8da4664df26a64cf05dc508a4f26df>` 
>> for CPU tiles like AMD SoC will return physical socket.
>
> Sure, but I would think the answer to that would be to introduce an API
> to distinguish between NUMA (socket ID in DPDK parlance) and package
> (physical socket ID in the "traditional NUMA" sense). Once we can
> distinguish between those, DPDK can just rely on NUMA information
> provided by the OS, while still being capable of identifying physical
> sockets if the user so desires.
Agreed, +1 for the idea of the physical socket and changes in the library
to exploit the same.
>
> I am actually going to introduce API to get *physical socket* (as
> opposed to NUMA node) in the next few days.
>
But how does it solve the end customer issues?

1. if there are multiple NICs or accelerators on multiple sockets, but the
IO tile is partitioned into sub-domains.

2. if RTE_FLOW steering is applied on a NIC whose traffic needs to be
processed under the same L3 - reducing noisy neighbours and giving better
cache hits.

3. for the packet distributor library, which needs to run
RX-Distributor-TX within the same worker lcore set.


The current RFC addresses the above by helping the end users identify the
lcores within the same L3 domain under a NUMA|physical socket,
irrespective of the BIOS setting.

>>
>>
>> Example: In AMD EPYC Genoa, there are total of 13 tiles. 12 CPU tiles
>> and 1 IO tile. Setting
>>
>> 1. NPS to 4 will divide the memory, PCIe and accelerator into 4 domain.
>> While the all CPU will appear as single NUMA but each 12 tile having
>> independent L3 caches.
>>
>> 2. Setting `L3 as NUMA` allows each tile to appear as separate L3 
>> clusters.
>>
>>
>> Hence, adding an API which allows to select available lcores based on
>> Split L3 is essential irrespective of the BIOS setting.
>>
>
> I think the crucial issue here is the "irrespective of BIOS setting"
> bit.

That is what the current RFC achieves.

> If EAL is getting into the game of figuring out exact intricacies
> of physical layout of the system, then there's a lot more work to be
> done as there are lots of different topologies, as other people have
> already commented, and such an API needs *a lot* of thought put into it.

There is a standard sysfs interface for CPU cache topology (OS kernel),
as mentioned earlier.

The problem with hwloc and libnuma is that different distros have
different versions, and there are solutions only for specific SoC
architectures as per the latest comment.


But we can always limit the API to selected SoCs, while on all other SoCs
an invocation will fall back to rte_get_next_lcore.


>
> If, on the other hand, we leave this issue to the kernel, and only
> gather NUMA information provided by the kernel, then nothing has to be
> done - DPDK already supports all of this natively, provided the user has
> configured the system correctly.

As shared above, we tried to bring this in via usertools: enhance logic
to display NUMA - Patchwork (dpdk.org)
<https://patchwork.dpdk.org/project/dpdk/patch/20220326073207.489694-1-vipin.varghese@amd.com/>.


DPDK support for lcores is getting enhanced, allowing the user to use the
more favourable lcores within the same tile.


>
> Moreover, arguably DPDK already works that way: technically you can get
> physical socket information even absent of NUMA support in BIOS, but
> DPDK does not do that. Instead, if OS reports NUMA node as 0, that's
> what we're going with (even if we could detect multiple sockets from
> sysfs), 

In the above argument it is shared that the OS kernel detects the NUMA
domains, which are then used by DPDK, right?

The suggested RFC also adheres to the same: what the OS sees. Can you
please explain, for better understanding, what the RFC is doing
differently?


> and IMO it should stay that way unless there is a strong
> argument otherwise.

Totally agree, that is what the RFC is also doing: based on what the OS
sees as NUMA, we are using it.

The only addition is that, within a NUMA node with a split LLC, we allow
selection of those lcores, rather than blindly choosing lcores using
rte_get_next_lcore.


> We force the user to configure their system
> correctly as it is, and I see no reason to second-guess user's BIOS
> configuration otherwise.

Again iterating, the changes suggested in the RFC are agnostic to which
BIOS options are used.

It was in answer to the earlier question `is the AMD configuration the
same as the Intel tile` that I explained it is not; the RFC does not rely
on the BIOS setting.


>
> -- 
> Thanks,
> Anatoly
>
  
Burakov, Anatoly Sept. 3, 2024, 8:50 a.m. UTC | #7
On 9/2/2024 5:33 PM, Varghese, Vipin wrote:
> <snipped>
>>>
>>>> I recently looked into how Intel's Sub-NUMA Clustering would work 
>>>> within
>>>> DPDK, and found that I actually didn't have to do anything, because the
>>>> SNC "clusters" present themselves as NUMA nodes, which DPDK already
>>>> supports natively.
>>>
>>> yes, this is correct. In Intel Xeon Platinum BIOS one can enable
>>> `Cluster per NUMA` as `1,2 or4`.
>>>
>>> This divides the tiles into Sub-Numa parition, each having separate
>>> lcores,memory controllers, PCIe
>>>
>>> and accelerator.
>>>
>>>>
>>>> Does AMD's implementation of chiplets not report themselves as separate
>>>> NUMA nodes?
>>>
>>> In AMD EPYC Soc, this is different. There are 2 BIOS settings, namely
>>>
>>> 1. NPS: `Numa Per Socket` which allows the IO tile (memory, PCIe and
>>> Accelerator) to be partitioned as Numa 0, 1, 2 or 4.
>>>
>>> 2. L3 as NUMA: `L3 cache of CPU tiles as individual NUMA`. This allows
>>> all CPU tiles to be independent NUMA cores.
>>>
>>>
>>> The above settings are possible because CPU is independent from IO tile.
>>> Thus allowing 4 combinations be available for use.
>>
>> Sure, but presumably if the user wants to distinguish this, they have to
>> configure their system appropriately. If user wants to take advantage of
>> L3 as NUMA (which is what your patch proposes), then they can enable the
>> BIOS knob and get that functionality for free. DPDK already supports 
>> this.
>>
> The intend of the RFC is to introduce the ability to select lcore within 
> the same
> 
> L3 cache whether the BIOS is set or unset for `L3 as NUMA`. This is also 
> achieved
> 
> and tested on platforms which advertises via sysfs by OS kernel. Thus 
> eliminating
> 
> the dependency on hwloc and libuma which can be different versions in 
> different distros.

But we do depend on libnuma, so we might as well depend on it? Are there 
different versions of libnuma that interfere with what you're trying to 
do? You keep coming back to this "whether the BIOS is set or unset" for 
L3 as NUMA, but I'm still unclear as to what issues your patch is 
solving assuming "knob is set". When the system is configured correctly, 
it already works and reports cores as part of NUMA nodes (as L3) 
correctly. It is only when the system is configured *not* to do that 
that issues arise, is it not? In which case IMO the easier solution 
would be to just tell the user to enable that knob in BIOS?

> 
> 
>>>
>>> These are covered in the tuning gudie for the SoC in 12. How to get best
>>> performance on AMD platform — Data Plane Development Kit 24.07.0
>>> documentation (dpdk.org)
>>> <https://doc.dpdk.org/guides/linux_gsg/amd_platform.html>.
>>>
>>>
>>>> Because if it does, I don't really think any changes are
>>>> required because NUMA nodes would give you the same thing, would it 
>>>> not?
>>>
>>> I have a different opinion to this outlook. An end user can
>>>
>>> 1. Identify the lcores and it's NUMA user `usertools/cpu-layout.py`
>>
>> I recently submitted an enhacement for CPU layout script to print out
>> NUMA separately from physical socket [1].
>>
>> [1]
>> https://patches.dpdk.org/project/dpdk/patch/40cf4ee32f15952457ac5526cfce64728bd13d32.1724323106.git.anatoly.burakov@intel.com/
>>
>> I believe when "L3 as NUMA" is enabled in BIOS, the script will display
>> both physical package ID as well as NUMA nodes reported by the system,
>> which will be different from physical package ID, and which will display
>> information you were looking for.
> 
> As AMD we had submitted earlier work on the same via usertools: enhance 
> logic to display NUMA - Patchwork (dpdk.org) 
> <https://patchwork.dpdk.org/project/dpdk/patch/20220326073207.489694-1-vipin.varghese@amd.com/>.
> 
> this clearly were distinguishing NUMA and Physical socket.

Oh, cool, I didn't see that patch. I would argue my visual format is 
more readable though, so perhaps we can get that in :)

> Agreed, but as pointed out in case of Intel Xeon Platinum SPR, the tile 
> consists of cpu, memory, pcie and accelerator.
> 
> hence setting the BIOS option `Cluster per NUMA` the OS kernel & libnuma 
> display appropriate Domain with memory, pcie and cpu.
> 
> 
> In case of AMD SoC, libnuma for CPU is different from memory NUMA per 
> socket.

I'm curious how does the kernel handle this then, and what are you 
getting from libnuma. You seem to be implying that there are two 
different NUMA nodes on your SoC, and either kernel or libnuma are in 
conflict as to what belongs to what NUMA node?

> 
>>
>>>
>>> 3. there are no API which distinguish L3 numa domain. Function
>>> `rte_socket_id
>>> <https://doc.dpdk.org/api/rte__lcore_8h.html#a7c8da4664df26a64cf05dc508a4f26df>` for CPU tiles like AMD SoC will return physical socket.
>>
>> Sure, but I would think the answer to that would be to introduce an API
>> to distinguish between NUMA (socket ID in DPDK parlance) and package
>> (physical socket ID in the "traditional NUMA" sense). Once we can
>> distinguish between those, DPDK can just rely on NUMA information
>> provided by the OS, while still being capable of identifying physical
>> sockets if the user so desires.
> Agreed, +1 for the idea for physcial socket and changes in library to 
> exploit the same.
>>
>> I am actually going to introduce API to get *physical socket* (as
>> opposed to NUMA node) in the next few days.
>>
> But how does it solve the end customer issues
> 
> 1. if there are multiple NIC or Accelerator on multiple socket, but IO 
> tile is partitioned to Sub Domain.

At least on Intel platforms, NUMA node gets assigned correctly - that 
is, if my Xeon with SNC enabled has NUMA nodes 3,4 on socket 1, and 
there's a NIC connected to socket 1, it's going to show up as being on 
NUMA node 3 or 4 depending on where exactly I plugged it in. Everything 
already works as expected, and there is no need for any changes for 
Intel platforms (at least none that I can see).

My proposed API is really for those users who wish to explicitly allow 
for reserving memory/cores on "the same physical socket", as "on the 
same tile" is already taken care of by NUMA nodes.

> 
> 2. If RTE_FLOW steering is applied on NIC which needs to processed under 
> same L3 - reduces noisy neighbor and better cache hits
> 
> 3, for PKT-distribute library which needs to run within same worker 
> lcore set as RX-Distributor-TX.
> 

Same as above: on Intel platforms, NUMA nodes already solve this.

<snip>

> Totally agree, that is what the RFC is also doing, based on what OS sees 
> as NUMA we are using it.
> 
> Only addition is within the NUMA if there are split LLC, allow selection 
> of those lcores. Rather than blindly choosing lcore using
> 
> rte_lcore_get_next.

It feels like we're working around a problem that shouldn't exist in the 
first place, because kernel should already report this information. 
Within NUMA subsystem, there is sysfs node "distance" that, at least on 
Intel platforms and in certain BIOS configuration, reports distance 
between NUMA nodes, from which one can make inferences about how far a 
specific NUMA node is from any other NUMA node. This could have been 
used to encode L3 cache information. Do AMD platforms not do that? In 
that case, "lcore next" for a particular socket ID (NUMA node, in 
reality) should already get us any cores that are close to each other, 
because all of this information is already encoded in NUMA nodes by the 
system.

I feel like there's a disconnect between my understanding of the problem 
space, and yours, so I'm going to ask a very basic question:

Assuming the user has configured their AMD system correctly (i.e. 
enabled L3 as NUMA), are there any problem to be solved by adding a new 
API? Does the system not report each L3 as a separate NUMA node?

> 
> 
>> We force the user to configure their system
>> correctly as it is, and I see no reason to second-guess user's BIOS
>> configuration otherwise.
> 
> Again iterating, the changes suggested in RFC are agnostic to what BIOS 
> options are used,

But that is exactly my contention: are we not effectively working around 
users' misconfiguration of a system then?
  
Mattias Rönnblom Sept. 4, 2024, 9:30 a.m. UTC | #8
On 2024-09-02 02:39, Varghese, Vipin wrote:
> <snipped>
> 
> Thank you Mattias for the comments and question, please let me try to 
> explain the same below
> 
>> We shouldn't have a separate CPU/cache hierarchy API instead?
> 
> Based on the intention to bring in CPU lcores which share same L3 (for 
> better cache hits and less noisy neighbor) current API focuses on using
> 
> Last Level Cache. But if the suggestion is `there are SoC where L2 cache 
> are also shared, and the new API should be provisioned`, I am also
> 
> comfortable with the thought.
> 

Rather than some AMD special case API hacked into <rte_lcore.h>, I think 
we are better off with no DPDK API at all for this kind of functionality.

A DPDK CPU/memory hierarchy topology API very much makes sense, but it 
should be reasonably generic and complete from the start.

>>
>> Could potentially be built on the 'hwloc' library.
> 
> There are 3 reason on AMD SoC we did not explore this path, reasons are
> 
> 1. depending n hwloc version and kernel version certain SoC hierarchies 
> are not available
> 
> 2. CPU NUMA and IO (memory & PCIe) NUMA are independent on AMD Epyc Soc.
> 
> 3. adds the extra dependency layer of library layer to be made available 
> to work.
> 
> 
> hence we have tried to use Linux Documented generic layer of `sysfs CPU 
> cache`.
> 
> I will try to explore more on hwloc and check if other libraries within 
> DPDK leverages the same.
> 
>>
>> I much agree cache/core topology may be of interest of the application
>> (or a work scheduler, like a DPDK event device), but it's not limited to
>> LLC. It may well be worthwhile to care about which cores shares L2
>> cache, for example. Not sure the RTE_LCORE_FOREACH_* approach scales.
> 
> yes, totally understand as some SoC, multiple lcores shares same L2 cache.
> 
> 
> Can we rework the API to be rte_get_cache_<function> where user argument 
> is desired lcore index.
> 
> 1. index-1: SMT threads
> 
> 2. index-2: threads sharing same L2 cache
> 
> 3. index-3: threads sharing same L3 cache
> 
> 4. index-MAX: identify the threads sharing last level cache.
> 
>>
>>> < Function: Purpose >
>>> ---------------------
>>>   - rte_get_llc_first_lcores: Retrieves all the first lcores in the 
>>> shared LLC.
>>>   - rte_get_llc_lcore: Retrieves all lcores that share the LLC.
>>>   - rte_get_llc_n_lcore: Retrieves the first n or skips the first n 
>>> lcores in the shared LLC.
>>>
>>> < MACRO: Purpose >
>>> ------------------
>>> RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from 
>>> each LLC.
>>> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first worker 
>>> lcore from each LLC.
>>> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on hint 
>>> (lcore id).
>>> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC 
>>> while skipping first worker.
>>> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores 
>>> from each LLC.
>>> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then 
>>> iterates through reaming lcores in each LLC.
>>>
> While the MACRO are simple wrapper invoking appropriate API. can this be 
> worked out in this fashion?
> 
> <snipped>
  
Stephen Hemminger Sept. 4, 2024, 2:37 p.m. UTC | #9
On Wed, 4 Sep 2024 11:30:59 +0200
Mattias Rönnblom <hofors@lysator.liu.se> wrote:

> On 2024-09-02 02:39, Varghese, Vipin wrote:
> > <snipped>
> > 
> > Thank you Mattias for the comments and question, please let me try to 
> > explain the same below
> >   
> >> We shouldn't have a separate CPU/cache hierarchy API instead?  
> > 
> > Based on the intention to bring in CPU lcores which share same L3 (for 
> > better cache hits and less noisy neighbor) current API focuses on using
> > 
> > Last Level Cache. But if the suggestion is `there are SoC where L2 cache 
> > are also shared, and the new API should be provisioned`, I am also
> > 
> > comfortable with the thought.
> >   
> 
> Rather than some AMD special case API hacked into <rte_lcore.h>, I think 
> we are better off with no DPDK API at all for this kind of functionality.
> 
> A DPDK CPU/memory hierarchy topology API very much makes sense, but it 
> should be reasonably generic and complete from the start.

Agreed. This is one of those cases where the existing hwloc project, which
is part of Open MPI, is more complete and better supported. It supports
multiple OSes and can deal with more quirks.

https://github.com/open-mpi/hwloc
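
For comparison, enumerating the cores under each L3 with hwloc is roughly
the following (a sketch assuming hwloc 2.x; not part of the RFC):

  #include <stdio.h>
  #include <hwloc.h>

  int main(void)
  {
      hwloc_topology_t topo;
      int nb_l3, i;
      unsigned int cpu;

      hwloc_topology_init(&topo);
      hwloc_topology_load(topo);

      nb_l3 = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L3CACHE);
      for (i = 0; i < nb_l3; i++) {
          hwloc_obj_t l3 = hwloc_get_obj_by_type(topo, HWLOC_OBJ_L3CACHE, i);

          printf("L3 #%d:", i);
          /* cpuset lists the PUs (logical CPUs) below this L3 */
          hwloc_bitmap_foreach_begin(cpu, l3->cpuset)
              printf(" %u", cpu);
          hwloc_bitmap_foreach_end();
          printf("\n");
      }

      hwloc_topology_destroy(topo);
      return 0;
  }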
  
Ferruh Yigit Sept. 5, 2024, 1:05 p.m. UTC | #10
On 9/3/2024 9:50 AM, Burakov, Anatoly wrote:
> On 9/2/2024 5:33 PM, Varghese, Vipin wrote:
>> <snipped>
>>>>
>>>>> I recently looked into how Intel's Sub-NUMA Clustering would work
>>>>> within
>>>>> DPDK, and found that I actually didn't have to do anything, because
>>>>> the
>>>>> SNC "clusters" present themselves as NUMA nodes, which DPDK already
>>>>> supports natively.
>>>>
>>>> yes, this is correct. In Intel Xeon Platinum BIOS one can enable
>>>> `Cluster per NUMA` as `1,2 or4`.
>>>>
>>>> This divides the tiles into Sub-Numa parition, each having separate
>>>> lcores,memory controllers, PCIe
>>>>
>>>> and accelerator.
>>>>
>>>>>
>>>>> Does AMD's implementation of chiplets not report themselves as
>>>>> separate
>>>>> NUMA nodes?
>>>>
>>>> In AMD EPYC Soc, this is different. There are 2 BIOS settings, namely
>>>>
>>>> 1. NPS: `Numa Per Socket` which allows the IO tile (memory, PCIe and
>>>> Accelerator) to be partitioned as Numa 0, 1, 2 or 4.
>>>>
>>>> 2. L3 as NUMA: `L3 cache of CPU tiles as individual NUMA`. This allows
>>>> all CPU tiles to be independent NUMA cores.
>>>>
>>>>
>>>> The above settings are possible because CPU is independent from IO
>>>> tile.
>>>> Thus allowing 4 combinations be available for use.
>>>
>>> Sure, but presumably if the user wants to distinguish this, they have to
>>> configure their system appropriately. If user wants to take advantage of
>>> L3 as NUMA (which is what your patch proposes), then they can enable the
>>> BIOS knob and get that functionality for free. DPDK already supports
>>> this.
>>>
>> The intend of the RFC is to introduce the ability to select lcore
>> within the same
>>
>> L3 cache whether the BIOS is set or unset for `L3 as NUMA`. This is
>> also achieved
>>
>> and tested on platforms which advertises via sysfs by OS kernel. Thus
>> eliminating
>>
>> the dependency on hwloc and libuma which can be different versions in
>> different distros.
> 
> But we do depend on libnuma, so we might as well depend on it? Are there
> different versions of libnuma that interfere with what you're trying to
> do? You keep coming back to this "whether the BIOS is set or unset" for
> L3 as NUMA, but I'm still unclear as to what issues your patch is
> solving assuming "knob is set". When the system is configured correctly,
> it already works and reports cores as part of NUMA nodes (as L3)
> correctly. It is only when the system is configured *not* to do that
> that issues arise, is it not? In which case IMO the easier solution
> would be to just tell the user to enable that knob in BIOS?
> 
>>
>>
>>>>
>>>> These are covered in the tuning gudie for the SoC in 12. How to get
>>>> best
>>>> performance on AMD platform — Data Plane Development Kit 24.07.0
>>>> documentation (dpdk.org)
>>>> <https://doc.dpdk.org/guides/linux_gsg/amd_platform.html>.
>>>>
>>>>
>>>>> Because if it does, I don't really think any changes are
>>>>> required because NUMA nodes would give you the same thing, would it
>>>>> not?
>>>>
>>>> I have a different opinion to this outlook. An end user can
>>>>
>>>> 1. Identify the lcores and it's NUMA user `usertools/cpu-layout.py`
>>>
>>> I recently submitted an enhacement for CPU layout script to print out
>>> NUMA separately from physical socket [1].
>>>
>>> [1]
>>> https://patches.dpdk.org/project/dpdk/
>>> patch/40cf4ee32f15952457ac5526cfce64728bd13d32.1724323106.git.anatoly.burakov@intel.com/
>>>
>>> I believe when "L3 as NUMA" is enabled in BIOS, the script will display
>>> both physical package ID as well as NUMA nodes reported by the system,
>>> which will be different from physical package ID, and which will display
>>> information you were looking for.
>>
>> As AMD we had submitted earlier work on the same via usertools:
>> enhance logic to display NUMA - Patchwork (dpdk.org) <https://
>> patchwork.dpdk.org/project/dpdk/patch/20220326073207.489694-1-
>> vipin.varghese@amd.com/>.
>>
>> this clearly were distinguishing NUMA and Physical socket.
> 
> Oh, cool, I didn't see that patch. I would argue my visual format is
> more readable though, so perhaps we can get that in :)
> 
>> Agreed, but as pointed out in case of Intel Xeon Platinum SPR, the
>> tile consists of cpu, memory, pcie and accelerator.
>>
>> hence setting the BIOS option `Cluster per NUMA` the OS kernel &
>> libnuma display appropriate Domain with memory, pcie and cpu.
>>
>>
>> In case of AMD SoC, libnuma for CPU is different from memory NUMA per
>> socket.
> 
> I'm curious how does the kernel handle this then, and what are you
> getting from libnuma. You seem to be implying that there are two
> different NUMA nodes on your SoC, and either kernel or libnuma are in
> conflict as to what belongs to what NUMA node?
> 
>>
>>>
>>>>
>>>> 3. there are no API which distinguish L3 numa domain. Function
>>>> `rte_socket_id
>>>> <https://doc.dpdk.org/api/
>>>> rte__lcore_8h.html#a7c8da4664df26a64cf05dc508a4f26df>` for CPU tiles
>>>> like AMD SoC will return physical socket.
>>>
>>> Sure, but I would think the answer to that would be to introduce an API
>>> to distinguish between NUMA (socket ID in DPDK parlance) and package
>>> (physical socket ID in the "traditional NUMA" sense). Once we can
>>> distinguish between those, DPDK can just rely on NUMA information
>>> provided by the OS, while still being capable of identifying physical
>>> sockets if the user so desires.
>> Agreed, +1 for the idea for physcial socket and changes in library to
>> exploit the same.
>>>
>>> I am actually going to introduce API to get *physical socket* (as
>>> opposed to NUMA node) in the next few days.
>>>
>> But how does it solve the end customer issues
>>
>> 1. if there are multiple NIC or Accelerator on multiple socket, but IO
>> tile is partitioned to Sub Domain.
> 
> At least on Intel platforms, NUMA node gets assigned correctly - that
> is, if my Xeon with SNC enabled has NUMA nodes 3,4 on socket 1, and
> there's a NIC connected to socket 1, it's going to show up as being on
> NUMA node 3 or 4 depending on where exactly I plugged it in. Everything
> already works as expected, and there is no need for any changes for
> Intel platforms (at least none that I can see).
> 
> My proposed API is really for those users who wish to explicitly allow
> for reserving memory/cores on "the same physical socket", as "on the
> same tile" is already taken care of by NUMA nodes.
> 
>>
>> 2. If RTE_FLOW steering is applied on NIC which needs to processed
>> under same L3 - reduces noisy neighbor and better cache hits
>>
>> 3, for PKT-distribute library which needs to run within same worker
>> lcore set as RX-Distributor-TX.
>>
> 
> Same as above: on Intel platforms, NUMA nodes already solve this.
> 
> <snip>
> 
>> Totally agree, that is what the RFC is also doing, based on what OS
>> sees as NUMA we are using it.
>>
>> Only addition is within the NUMA if there are split LLC, allow
>> selection of those lcores. Rather than blindly choosing lcore using
>>
>> rte_lcore_get_next.
> 
> It feels like we're working around a problem that shouldn't exist in the
> first place, because kernel should already report this information.
> Within NUMA subsystem, there is sysfs node "distance" that, at least on
> Intel platforms and in certain BIOS configuration, reports distance
> between NUMA nodes, from which one can make inferences about how far a
> specific NUMA node is from any other NUMA node. This could have been
> used to encode L3 cache information. Do AMD platforms not do that? In
> that case, "lcore next" for a particular socket ID (NUMA node, in
> reality) should already get us any cores that are close to each other,
> because all of this information is already encoded in NUMA nodes by the
> system.
> 
> I feel like there's a disconnect between my understanding of the problem
> space, and yours, so I'm going to ask a very basic question:
> 
> Assuming the user has configured their AMD system correctly (i.e.
> enabled L3 as NUMA), are there any problem to be solved by adding a new
> API? Does the system not report each L3 as a separate NUMA node?
> 

Hi Anatoly,

Let me try to answer.

To start with, Intel "Sub-NUMA Clustering" and AMD NUMA are different; as
far as I understand, SNC is more similar to the classic physical-socket
based NUMA.

Following is the AMD CPU:
      ┌─────┐┌─────┐┌──────────┐┌─────┐┌─────┐
      │     ││     ││          ││     ││     │
      │     ││     ││          ││     ││     │
      │TILE1││TILE2││          ││TILE5││TILE6│
      │     ││     ││          ││     ││     │
      │     ││     ││          ││     ││     │
      │     ││     ││          ││     ││     │
      └─────┘└─────┘│    IO    │└─────┘└─────┘
      ┌─────┐┌─────┐│   TILE   │┌─────┐┌─────┐
      │     ││     ││          ││     ││     │
      │     ││     ││          ││     ││     │
      │TILE3││TILE4││          ││TILE7││TILE8│
      │     ││     ││          ││     ││     │
      │     ││     ││          ││     ││     │
      │     ││     ││          ││     ││     │
      └─────┘└─────┘└──────────┘└─────┘└─────┘

Each 'Tile' has multiple cores, and 'IO Tile' has memory controller, bus
controllers etc..

When NPS=x configured in bios, IO tile resources are split and each seen
as a NUMA node.

Following is NPS=4
      ┌─────┐┌─────┐┌──────────┐┌─────┐┌─────┐
      │     ││     ││     .    ││     ││     │
      │     ││     ││     .    ││     ││     │
      │TILE1││TILE2││     .    ││TILE5││TILE6│
      │     ││     ││NUMA .NUMA││     ││     │
      │     ││     ││ 0   . 1  ││     ││     │
      │     ││     ││     .    ││     ││     │
      └─────┘└─────┘│     .    │└─────┘└─────┘
      ┌─────┐┌─────┐│..........│┌─────┐┌─────┐
      │     ││     ││     .    ││     ││     │
      │     ││     ││NUMA .NUMA││     ││     │
      │TILE3││TILE4││ 2   . 3  ││TILE7││TILE8│
      │     ││     ││     .    ││     ││     │
      │     ││     ││     .    ││     ││     │
      │     ││     ││     .    ││     ││     │
      └─────┘└─────┘└─────.────┘└─────┘└─────┘

The benefit of this approach is that now all cores can access all NUMA
nodes without any penalty. For example, a DPDK application can use cores
from 'TILE1', 'TILE4' & 'TILE7' to access NUMA0 (or any NUMA node)
resources at high performance.
This is different from SNC, where cores accessing cross-NUMA resources
take a performance penalty.

Now, although which tile the cores come from doesn't matter from the NUMA
perspective, it may matter (based on the workload) to have them under the
same LLC.

One way to make sure all cores are under the same LLC is to enable the "L3
as NUMA" BIOS option, which will make each TILE show up as a different
NUMA node, so the user can select cores from one NUMA node.
This is sufficient up to some point, but not enough when the application
needs a number of cores that spans multiple tiles.

Assume each tile has 8 cores and the application needs 24 cores; when the
user provides all cores from TILE1, TILE2 & TILE3, in DPDK right now there
is no way for the application to figure out how to group/select these
cores to use them efficiently.

Indeed this is what Vipin is enabling: from a core, he is finding the list
of cores that will work efficiently with that core. In this perspective
this is nothing really related to NUMA configuration, and nothing really
specific to AMD, as the defined Linux sysfs interface is used for this.

There are other architectures around that have a similar NUMA
configuration and they can also use the same logic; at worst we can
introduce architecture-specific code so that all architectures have a way
to find the other cores that work more efficiently with a given core. This
is a useful feature for DPDK.

Let's look into another example: an application uses 24 cores in a graph
library like usage, where we want to group each set of three cores to
process a graph node. The application needs a way to select which three
cores work most efficiently with each other; that is what this patch
enables. In this case enabling "L3 as NUMA" does not help at all. With
this patch both BIOS configs work, but of course the user should select
the cores to provide to the application based on the configuration.


And we can even improve this effective core selection; as Mattias
suggested, we can select cores that share L2 caches with an expansion of
this patch. This is unrelated to NUMA, and again it is not introducing
architecture details to DPDK, as this implementation already relies on the
Linux sysfs interface.
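
To make the grouping concrete, below is a rough sketch of the kind of
sysfs based grouping described above, written at application level for
clarity (assumptions: Linux sysfs layout with index3 as the LLC;
rte_lcore_to_cpu_id() and RTE_LCORE_FOREACH_WORKER() are existing EAL APIs):

  #include <stdio.h>
  #include <string.h>
  #include <rte_lcore.h>

  #define MAX_LLC_GROUPS 64

  static int llc_group_of[RTE_MAX_LCORE];

  static void
  group_workers_by_llc(void)
  {
      char keys[MAX_LLC_GROUPS][128];
      int nb_groups = 0;
      unsigned int lcore;

      RTE_LCORE_FOREACH_WORKER(lcore) {
          char path[128], buf[128] = "";
          FILE *f;
          int g;

          /* lcores with an identical shared_cpu_list share the same LLC */
          snprintf(path, sizeof(path),
                   "/sys/devices/system/cpu/cpu%d/cache/index3/shared_cpu_list",
                   rte_lcore_to_cpu_id(lcore));
          f = fopen(path, "r");
          if (f != NULL) {
              if (fgets(buf, sizeof(buf), f) == NULL)
                  buf[0] = '\0';
              fclose(f);
          }
          for (g = 0; g < nb_groups; g++)
              if (strcmp(keys[g], buf) == 0)
                  break;
          if (g == nb_groups && nb_groups < MAX_LLC_GROUPS)
              strcpy(keys[nb_groups++], buf);
          /* e.g. assign three lcores of one group to a graph node */
          llc_group_of[lcore] = g;
      }
  }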

I hope it clarifies a little more.


Thanks,
ferruh
  
Burakov, Anatoly Sept. 5, 2024, 2:45 p.m. UTC | #11
On 9/5/2024 3:05 PM, Ferruh Yigit wrote:
> On 9/3/2024 9:50 AM, Burakov, Anatoly wrote:
>> On 9/2/2024 5:33 PM, Varghese, Vipin wrote:
>>> <snipped>
>>>>>

Hi Ferruh,

>>
>> I feel like there's a disconnect between my understanding of the problem
>> space, and yours, so I'm going to ask a very basic question:
>>
>> Assuming the user has configured their AMD system correctly (i.e.
>> enabled L3 as NUMA), are there any problem to be solved by adding a new
>> API? Does the system not report each L3 as a separate NUMA node?
>>
> 
> Hi Anatoly,
> 
> Let me try to answer.
> 
> To start with, Intel "Sub-NUMA Clustering" and AMD NUMA is different, as
> far as I understand SNC is more similar to more classic physical socket
> based NUMA.
> 
> Following is the AMD CPU:
>        ┌─────┐┌─────┐┌──────────┐┌─────┐┌─────┐
>        │     ││     ││          ││     ││     │
>        │     ││     ││          ││     ││     │
>        │TILE1││TILE2││          ││TILE5││TILE6│
>        │     ││     ││          ││     ││     │
>        │     ││     ││          ││     ││     │
>        │     ││     ││          ││     ││     │
>        └─────┘└─────┘│    IO    │└─────┘└─────┘
>        ┌─────┐┌─────┐│   TILE   │┌─────┐┌─────┐
>        │     ││     ││          ││     ││     │
>        │     ││     ││          ││     ││     │
>        │TILE3││TILE4││          ││TILE7││TILE8│
>        │     ││     ││          ││     ││     │
>        │     ││     ││          ││     ││     │
>        │     ││     ││          ││     ││     │
>        └─────┘└─────┘└──────────┘└─────┘└─────┘
> 
> Each 'Tile' has multiple cores, and 'IO Tile' has memory controller, bus
> controllers etc..
> 
> When NPS=x configured in bios, IO tile resources are split and each seen
> as a NUMA node.
> 
> Following is NPS=4
>        ┌─────┐┌─────┐┌──────────┐┌─────┐┌─────┐
>        │     ││     ││     .    ││     ││     │
>        │     ││     ││     .    ││     ││     │
>        │TILE1││TILE2││     .    ││TILE5││TILE6│
>        │     ││     ││NUMA .NUMA││     ││     │
>        │     ││     ││ 0   . 1  ││     ││     │
>        │     ││     ││     .    ││     ││     │
>        └─────┘└─────┘│     .    │└─────┘└─────┘
>        ┌─────┐┌─────┐│..........│┌─────┐┌─────┐
>        │     ││     ││     .    ││     ││     │
>        │     ││     ││NUMA .NUMA││     ││     │
>        │TILE3││TILE4││ 2   . 3  ││TILE7││TILE8│
>        │     ││     ││     .    ││     ││     │
>        │     ││     ││     .    ││     ││     │
>        │     ││     ││     .    ││     ││     │
>        └─────┘└─────┘└─────.────┘└─────┘└─────┘
> 
> Benefit of this is approach is now all cores has to access all NUMA
> without any penalty. Like a DPDK application can use cores from 'TILE1',
> 'TILE4' & 'TILE7' to access to NUMA0 (or any NUMA) resources in high
> performance.
> This is different than SNC where cores access to cross NUMA resources
> hit by performance penalty.
> 
> Now, although which tile cores come from doesn't matter from NUMA
> perspective, it may matter (based on workload) to have them under same LLC.
> 
> One way to make sure all cores are under same LLC, is to enable "L3 as
> NUMA" BIOS option, which will make each TILE shown as a different NUMA,
> and user select cores from one NUMA.
> This is sufficient up to some point, but not enough when application
> needs number of cores that uses multiple tiles.
> 
> Assume each tile has 8 cores, and application needs 24 cores, when user
> provide all cores from TILE1, TILE2 & TILE3, in DPDK right now there is
> now way for application to figure out how to group/select these cores to
> use cores efficiently.
> 
> Indeed this is what Vipin is enabling, from a core, he is finding list
> of cores that will work efficiently with this core. In this perspective
> this is nothing really related to NUMA configuration, and nothing really
> specific to AMD, as defined Linux sysfs interface is used for this.
> 
> There are other architectures around that has similar NUMA configuration
> and they can also use same logic, at worst we can introduce an
> architecture specific code that all architectures can have a way to find
> other cores that works more efficient with given core. This is a useful
> feature for DPDK.
> 
> Lets looks into another example, application uses 24 cores in an graph
> library like usage, that we want to group each three cores to process a
> graph node. Application needs to a way to select which three cores works
> most efficient with eachother, that is what this patch enables. In this
> case enabling "L3 as NUMA" does not help at all. With this patch both
> bios config works, but of course user should select cores to provide
> application based on configuration.
> 
> 
> And we can even improve this effective core selection, like as Mattias
> suggested we can select cores that share L2 caches, with expansion of
> this patch. This is unrelated to NUMA, and again it is not introducing
> architecture details to DPDK as this implementation already relies on
> Linux sysfs interface.
> 
> I hope it clarifies a little more.
> 
> 
> Thanks,
> ferruh
> 

Yes, this does help clarify things a lot as to why current NUMA support 
would be insufficient to express what you are describing.

However, in that case I would echo the sentiment others have already
expressed, as this kind of deep sysfs parsing doesn't seem like it would
be in scope for EAL; it sounds more like something a sysadmin/orchestration
layer (or the application itself) would do.

I mean, in principle I'm not opposed to having such an API, it just 
seems like the abstraction would perhaps need to be a bit more robust 
than directly referencing cache structure? Maybe something that 
degenerates into NUMA nodes would be better, so that applications 
wouldn't have to *specifically* worry about cache locality but instead 
have a more generic API they can use to group cores together?
  
Ferruh Yigit Sept. 5, 2024, 3:34 p.m. UTC | #12
On 9/5/2024 3:45 PM, Burakov, Anatoly wrote:
> On 9/5/2024 3:05 PM, Ferruh Yigit wrote:
>> On 9/3/2024 9:50 AM, Burakov, Anatoly wrote:
>>> On 9/2/2024 5:33 PM, Varghese, Vipin wrote:
>>>> <snipped>
>>>>>>
> 
> Hi Ferruh,
> 
>>>
>>> I feel like there's a disconnect between my understanding of the problem
>>> space, and yours, so I'm going to ask a very basic question:
>>>
>>> Assuming the user has configured their AMD system correctly (i.e.
>>> enabled L3 as NUMA), are there any problem to be solved by adding a new
>>> API? Does the system not report each L3 as a separate NUMA node?
>>>
>>
>> Hi Anatoly,
>>
>> Let me try to answer.
>>
>> To start with, Intel "Sub-NUMA Clustering" and AMD NUMA is different, as
>> far as I understand SNC is more similar to more classic physical socket
>> based NUMA.
>>
>> Following is the AMD CPU:
>>        ┌─────┐┌─────┐┌──────────┐┌─────┐┌─────┐
>>        │     ││     ││          ││     ││     │
>>        │     ││     ││          ││     ││     │
>>        │TILE1││TILE2││          ││TILE5││TILE6│
>>        │     ││     ││          ││     ││     │
>>        │     ││     ││          ││     ││     │
>>        │     ││     ││          ││     ││     │
>>        └─────┘└─────┘│    IO    │└─────┘└─────┘
>>        ┌─────┐┌─────┐│   TILE   │┌─────┐┌─────┐
>>        │     ││     ││          ││     ││     │
>>        │     ││     ││          ││     ││     │
>>        │TILE3││TILE4││          ││TILE7││TILE8│
>>        │     ││     ││          ││     ││     │
>>        │     ││     ││          ││     ││     │
>>        │     ││     ││          ││     ││     │
>>        └─────┘└─────┘└──────────┘└─────┘└─────┘
>>
>> Each 'Tile' has multiple cores, and 'IO Tile' has memory controller, bus
>> controllers etc..
>>
>> When NPS=x configured in bios, IO tile resources are split and each seen
>> as a NUMA node.
>>
>> Following is NPS=4
>>        ┌─────┐┌─────┐┌──────────┐┌─────┐┌─────┐
>>        │     ││     ││     .    ││     ││     │
>>        │     ││     ││     .    ││     ││     │
>>        │TILE1││TILE2││     .    ││TILE5││TILE6│
>>        │     ││     ││NUMA .NUMA││     ││     │
>>        │     ││     ││ 0   . 1  ││     ││     │
>>        │     ││     ││     .    ││     ││     │
>>        └─────┘└─────┘│     .    │└─────┘└─────┘
>>        ┌─────┐┌─────┐│..........│┌─────┐┌─────┐
>>        │     ││     ││     .    ││     ││     │
>>        │     ││     ││NUMA .NUMA││     ││     │
>>        │TILE3││TILE4││ 2   . 3  ││TILE7││TILE8│
>>        │     ││     ││     .    ││     ││     │
>>        │     ││     ││     .    ││     ││     │
>>        │     ││     ││     .    ││     ││     │
>>        └─────┘└─────┘└─────.────┘└─────┘└─────┘
>>
>> Benefit of this is approach is now all cores has to access all NUMA
>> without any penalty. Like a DPDK application can use cores from 'TILE1',
>> 'TILE4' & 'TILE7' to access to NUMA0 (or any NUMA) resources in high
>> performance.
>> This is different than SNC where cores access to cross NUMA resources
>> hit by performance penalty.
>>
>> Now, although which tile cores come from doesn't matter from NUMA
>> perspective, it may matter (based on workload) to have them under same
>> LLC.
>>
>> One way to make sure all cores are under same LLC, is to enable "L3 as
>> NUMA" BIOS option, which will make each TILE shown as a different NUMA,
>> and user select cores from one NUMA.
>> This is sufficient up to some point, but not enough when application
>> needs number of cores that uses multiple tiles.
>>
>> Assume each tile has 8 cores, and application needs 24 cores, when user
>> provide all cores from TILE1, TILE2 & TILE3, in DPDK right now there is
>> now way for application to figure out how to group/select these cores to
>> use cores efficiently.
>>
>> Indeed this is what Vipin is enabling, from a core, he is finding list
>> of cores that will work efficiently with this core. In this perspective
>> this is nothing really related to NUMA configuration, and nothing really
>> specific to AMD, as defined Linux sysfs interface is used for this.
>>
>> There are other architectures around that has similar NUMA configuration
>> and they can also use same logic, at worst we can introduce an
>> architecture specific code that all architectures can have a way to find
>> other cores that works more efficient with given core. This is a useful
>> feature for DPDK.
>>
>> Lets looks into another example, application uses 24 cores in an graph
>> library like usage, that we want to group each three cores to process a
>> graph node. Application needs to a way to select which three cores works
>> most efficient with eachother, that is what this patch enables. In this
>> case enabling "L3 as NUMA" does not help at all. With this patch both
>> bios config works, but of course user should select cores to provide
>> application based on configuration.
>>
>>
>> And we can even improve this effective core selection, like as Mattias
>> suggested we can select cores that share L2 caches, with expansion of
>> this patch. This is unrelated to NUMA, and again it is not introducing
>> architecture details to DPDK as this implementation already relies on
>> Linux sysfs interface.
>>
>> I hope it clarifies a little more.
>>
>>
>> Thanks,
>> ferruh
>>
> 
> Yes, this does help clarify things a lot as to why current NUMA support
> would be insufficient to express what you are describing.
> 
> However, in that case I would echo sentiment others have expressed
> already as this kind of deep sysfs parsing doesn't seem like it would be
> in scope for EAL, it sounds more like something a sysadmin/orchestration
> (or the application itself) would do.
> 
> I mean, in principle I'm not opposed to having such an API, it just
> seems like the abstraction would perhaps need to be a bit more robust
> than directly referencing cache structure? Maybe something that
> degenerates into NUMA nodes would be better, so that applications
> wouldn't have to *specifically* worry about cache locality but instead
> have a more generic API they can use to group cores together?
> 

Unfortunately sysadmin/orchestration can't cover all use cases (such as
the graph use case above), and it is definitely too much HW detail for the
application; that is why we need some programmatic way (APIs) for
applications.

And we are on the same page that the further we can get away from
architecture details in the abstraction (APIs), the better; the overall
intention is to provide a way for the application to find lcores that
work efficiently with each other.

For this, what do you think about a slightly different API *, like:
```
rte_get_next_lcore_ex(uint i, u32 flag)
```

Based on the flag, we can grab the next eligible lcore. For this patch
the flag can be `RTE_LCORE_LLC`, but the options are wide, and different
architectures can have different groupings to benefit most from the HW in
a vendor-agnostic way.
I like the idea; what do you think about this abstraction?
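
To make the proposal concrete, here is a minimal usage sketch; the
function name, the `RTE_LCORE_LLC` flag and its return semantics are only
the proposal above, not an existing DPDK interface:
```
#include <stdio.h>
#include <stdint.h>
#include <rte_lcore.h>

/* Hypothetical flag from the proposal above: group lcores sharing an LLC. */
#define RTE_LCORE_LLC 0x1

/* Assumed prototype of the proposed API (not in DPDK today): returns the
 * next enabled lcore sharing the selected topology level with 'i', or
 * RTE_MAX_LCORE once the group is exhausted. */
unsigned int rte_get_next_lcore_ex(unsigned int i, uint32_t flag);

/* Walk all lcores that share the LLC with 'first_lcore'. */
static void
walk_llc_group(unsigned int first_lcore)
{
	unsigned int lcore = first_lcore;

	while (lcore < RTE_MAX_LCORE) {
		printf("lcore %u shares the LLC with lcore %u\n",
		       lcore, first_lcore);
		lcore = rte_get_next_lcore_ex(lcore, RTE_LCORE_LLC);
	}
}
```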

* Kudos to Vipin 😉
  
Burakov, Anatoly Sept. 6, 2024, 8:44 a.m. UTC | #13
>> Yes, this does help clarify things a lot as to why current NUMA support
>> would be insufficient to express what you are describing.
>>
>> However, in that case I would echo sentiment others have expressed
>> already as this kind of deep sysfs parsing doesn't seem like it would be
>> in scope for EAL, it sounds more like something a sysadmin/orchestration
>> (or the application itself) would do.
>>
>> I mean, in principle I'm not opposed to having such an API, it just
>> seems like the abstraction would perhaps need to be a bit more robust
>> than directly referencing cache structure? Maybe something that
>> degenerates into NUMA nodes would be better, so that applications
>> wouldn't have to *specifically* worry about cache locality but instead
>> have a more generic API they can use to group cores together?
>>
> 
> Unfortunately can't cover all usecases by sysadmin/orchestration (as
> graph usecase one above), and definitely too much HW detail for the
> application, that is why we required some programmatic way (APIs) for
> applications.
> 
> And we are on the same page that, the more we can get away from
> architecture details in the abstraction (APIs) better it is, overall
> intention is to provide ways to application to find lcores works
> efficiently with each other.
> 
> For this what do you think about slightly different API *, like:
> ```
> rte_get_next_lcore_ex(uint i, u32 flag)
> ```
> 
> Based on the flag, we can grab the next eligible lcore, for this patch
> the flag can be `RTE_LCORE_LLC`, but options are wide and different
> architectures can have different grouping to benefit most from HW in a
> vendor agnostic way.
> I like the idea, what do you think about this abstraction?
> 
> * Kudos to Vipin 😉
> 

Hi Ferruh,

In principle, having flags for this sort of thing sounds like a better 
way to go. I do like this idea as well! It of course remains to be seen 
how it can work in practice but to me it certainly looks like a path 
worth exploring.
  
Vipin Varghese Sept. 9, 2024, 2:14 p.m. UTC | #14
[AMD Official Use Only - AMD Internal Distribution Only]


<snipped>

>
> >> Yes, this does help clarify things a lot as to why current NUMA
> >> support would be insufficient to express what you are describing.
> >>
> >> However, in that case I would echo sentiment others have expressed
> >> already as this kind of deep sysfs parsing doesn't seem like it would
> >> be in scope for EAL, it sounds more like something a
> >> sysadmin/orchestration (or the application itself) would do.
> >>
> >> I mean, in principle I'm not opposed to having such an API, it just
> >> seems like the abstraction would perhaps need to be a bit more robust
> >> than directly referencing cache structure? Maybe something that
> >> degenerates into NUMA nodes would be better, so that applications
> >> wouldn't have to *specifically* worry about cache locality but
> >> instead have a more generic API they can use to group cores together?
> >>
> >
> > Unfortunately can't cover all usecases by sysadmin/orchestration (as
> > graph usecase one above), and definitely too much HW detail for the
> > application, that is why we required some programmatic way (APIs) for
> > applications.
> >
> > And we are on the same page that, the more we can get away from
> > architecture details in the abstraction (APIs) better it is, overall
> > intention is to provide ways to application to find lcores works
> > efficiently with each other.
> >
> > For this what do you think about slightly different API *, like:
> > ```
> > rte_get_next_lcore_ex(uint i, u32 flag) ```
> >
> > Based on the flag, we can grab the next eligible lcore, for this patch
> > the flag can be `RTE_LCORE_LLC`, but options are wide and different
> > architectures can have different grouping to benefit most from HW in a
> > vendor agnostic way.
> > I like the idea, what do you think about this abstraction?
> >
> > * Kudos to Vipin 😉
> >
>
> Hi Ferruh,
>
> In principle, having flags for this sort of thing sounds like a better way to go. I
> do like this idea as well! It of course remains to be seen how it can work in
> practice but to me it certainly looks like a path worth exploring.
>

Sharing the new RFC shortly.

<snipped>
  
Vipin Varghese Sept. 9, 2024, 2:22 p.m. UTC | #15
[AMD Official Use Only - AMD Internal Distribution Only]

<snipped>

> > <snipped>
> >
> > Thank you Mattias for the comments and question, please let me try to
> > explain the same below
> >
> >> We shouldn't have a separate CPU/cache hierarchy API instead?
> >
> > Based on the intention to bring in CPU lcores which share same L3 (for
> > better cache hits and less noisy neighbor) current API focuses on
> > using
> >
> > Last Level Cache. But if the suggestion is `there are SoC where L2
> > cache are also shared, and the new API should be provisioned`, I am
> > also
> >
> > comfortable with the thought.
> >
>
> Rather than some AMD special case API hacked into <rte_lcore.h>, I think we
> are better off with no DPDK API at all for this kind of functionality.

Hi Mattias, as shared in the earlier email thread, this is not an AMD special case at all. Let me try to explain this one more time. One of the techniques used to increase core count in a cost-effective way is to go for tiles of compute complexes.
This introduces a bunch of cores sharing the same Last Level Cache (namely L2, L3 or even L4), depending upon the cache topology architecture.

The API suggested in the RFC is to help end users selectively use cores under the same Last Level Cache hierarchy as advertised by the OS (irrespective of the BIOS settings used). This is useful in both bare-metal and container environments.

As shared in the response to the cover letter, +1 to expanding it to more than just LLC cores. We have also confirmed the same at https://patchwork.dpdk.org/project/dpdk/cover/20240827151014.201-1-vipin.varghese@amd.com/

>
> A DPDK CPU/memory hierarchy topology API very much makes sense, but it
> should be reasonably generic and complete from the start.
>
> >>
> >> Could potentially be built on the 'hwloc' library.
> >
> > There are 3 reason on AMD SoC we did not explore this path, reasons
> > are
> >
> > 1. depending n hwloc version and kernel version certain SoC
> > hierarchies are not available
> >
> > 2. CPU NUMA and IO (memory & PCIe) NUMA are independent on AMD
> Epyc Soc.
> >
> > 3. adds the extra dependency layer of library layer to be made
> > available to work.
> >
> >
> > hence we have tried to use Linux Documented generic layer of `sysfs
> > CPU cache`.
> >
> > I will try to explore more on hwloc and check if other libraries
> > within DPDK leverages the same.
> >
> >>
> >> I much agree cache/core topology may be of interest of the
> >> application (or a work scheduler, like a DPDK event device), but it's
> >> not limited to LLC. It may well be worthwhile to care about which
> >> cores shares L2 cache, for example. Not sure the RTE_LCORE_FOREACH_*
> approach scales.
> >
> > yes, totally understand as some SoC, multiple lcores shares same L2 cache.
> >
> >
> > Can we rework the API to be rte_get_cache_<function> where user
> > argument is desired lcore index.
> >
> > 1. index-1: SMT threads
> >
> > 2. index-2: threads sharing same L2 cache
> >
> > 3. index-3: threads sharing same L3 cache
> >
> > 4. index-MAX: identify the threads sharing last level cache.
> >
> >>
> >>> < Function: Purpose >
> >>> ---------------------
> >>>   - rte_get_llc_first_lcores: Retrieves all the first lcores in the
> >>> shared LLC.
> >>>   - rte_get_llc_lcore: Retrieves all lcores that share the LLC.
> >>>   - rte_get_llc_n_lcore: Retrieves the first n or skips the first n
> >>> lcores in the shared LLC.
> >>>
> >>> < MACRO: Purpose >
> >>> ------------------
> >>> RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from
> >>> each LLC.
> >>> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first
> >>> worker lcore from each LLC.
> >>> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on
> hint
> >>> (lcore id).
> >>> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC
> >>> while skipping first worker.
> >>> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores
> >>> from each LLC.
> >>> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then
> >>> iterates through reaming lcores in each LLC.
> >>>
> > While the MACRO are simple wrapper invoking appropriate API. can this
> > be worked out in this fashion?
> >
> > <snipped>
  
Mattias Rönnblom Sept. 9, 2024, 2:52 p.m. UTC | #16
On 2024-09-09 16:22, Varghese, Vipin wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> <snipped>
> 
>>> <snipped>
>>>
>>> Thank you Mattias for the comments and question, please let me try to
>>> explain the same below
>>>
>>>> We shouldn't have a separate CPU/cache hierarchy API instead?
>>>
>>> Based on the intention to bring in CPU lcores which share same L3 (for
>>> better cache hits and less noisy neighbor) current API focuses on
>>> using
>>>
>>> Last Level Cache. But if the suggestion is `there are SoC where L2
>>> cache are also shared, and the new API should be provisioned`, I am
>>> also
>>>
>>> comfortable with the thought.
>>>
>>
>> Rather than some AMD special case API hacked into <rte_lcore.h>, I think we
>> are better off with no DPDK API at all for this kind of functionality.
> 
> Hi Mattias, as shared in the earlier email thread, this is not a AMD special case at all. Let me try to explain this one more time. One of techniques used to increase cores cost effective way to go for tiles of compute complexes.
> This introduces a bunch of cores in sharing same Last Level Cache (namely L2, L3 or even L4) depending upon cache topology architecture.
> 
> The API suggested in RFC is to help end users to selectively use cores under same Last Level Cache Hierarchy as advertised by OS (irrespective of the BIOS settings used). This is useful in both bare-metal and container environment.
> 

I'm pretty familiar with AMD CPUs and the use of tiles (including the 
challenges these kinds of non-uniformities pose for work scheduling).

To maximize performance, caring about core<->LLC relationship may well 
not be enough, and more HT/core/cache/memory topology information is 
required. That's what I meant by special case. A proper API should allow 
access to information about which lcores are SMT siblings, cores on the 
same L2, and cores on the same L3, to name a few things. Probably you 
want to fit NUMA into the same API as well, although that is available 
already in <rte_lcore.h>.

One can have a look at how scheduling domains work in the Linux kernel. 
They model this kind of thing.

> As shared in response for cover letter +1 to expand it to more than just LLC cores. We have also confirmed the same to https://patchwork.dpdk.org/project/dpdk/cover/20240827151014.201-1-vipin.varghese@amd.com/
> 
>>
>> A DPDK CPU/memory hierarchy topology API very much makes sense, but it
>> should be reasonably generic and complete from the start.
>>
>>>>
>>>> Could potentially be built on the 'hwloc' library.
>>>
>>> There are 3 reason on AMD SoC we did not explore this path, reasons
>>> are
>>>
>>> 1. depending n hwloc version and kernel version certain SoC
>>> hierarchies are not available
>>>
>>> 2. CPU NUMA and IO (memory & PCIe) NUMA are independent on AMD
>> Epyc Soc.
>>>
>>> 3. adds the extra dependency layer of library layer to be made
>>> available to work.
>>>
>>>
>>> hence we have tried to use Linux Documented generic layer of `sysfs
>>> CPU cache`.
>>>
>>> I will try to explore more on hwloc and check if other libraries
>>> within DPDK leverages the same.
>>>
>>>>
>>>> I much agree cache/core topology may be of interest of the
>>>> application (or a work scheduler, like a DPDK event device), but it's
>>>> not limited to LLC. It may well be worthwhile to care about which
>>>> cores shares L2 cache, for example. Not sure the RTE_LCORE_FOREACH_*
>> approach scales.
>>>
>>> yes, totally understand as some SoC, multiple lcores shares same L2 cache.
>>>
>>>
>>> Can we rework the API to be rte_get_cache_<function> where user
>>> argument is desired lcore index.
>>>
>>> 1. index-1: SMT threads
>>>
>>> 2. index-2: threads sharing same L2 cache
>>>
>>> 3. index-3: threads sharing same L3 cache
>>>
>>> 4. index-MAX: identify the threads sharing last level cache.
>>>
>>>>
>>>>> < Function: Purpose >
>>>>> ---------------------
>>>>>    - rte_get_llc_first_lcores: Retrieves all the first lcores in the
>>>>> shared LLC.
>>>>>    - rte_get_llc_lcore: Retrieves all lcores that share the LLC.
>>>>>    - rte_get_llc_n_lcore: Retrieves the first n or skips the first n
>>>>> lcores in the shared LLC.
>>>>>
>>>>> < MACRO: Purpose >
>>>>> ------------------
>>>>> RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from
>>>>> each LLC.
>>>>> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first
>>>>> worker lcore from each LLC.
>>>>> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on
>> hint
>>>>> (lcore id).
>>>>> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC
>>>>> while skipping first worker.
>>>>> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores
>>>>> from each LLC.
>>>>> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then
>>>>> iterates through reaming lcores in each LLC.
>>>>>
>>> While the MACRO are simple wrapper invoking appropriate API. can this
>>> be worked out in this fashion?
>>>
>>> <snipped>
  
Vipin Varghese Sept. 11, 2024, 3:13 a.m. UTC | #17
[AMD Official Use Only - AMD Internal Distribution Only]


<snipped>

> > >
> > > Thank you Mattias for the comments and question, please let me try
> > > to explain the same below
> > >
> > >> We shouldn't have a separate CPU/cache hierarchy API instead?
> > >
> > > Based on the intention to bring in CPU lcores which share same L3
> > > (for better cache hits and less noisy neighbor) current API focuses
> > > on using
> > >
> > > Last Level Cache. But if the suggestion is `there are SoC where L2
> > > cache are also shared, and the new API should be provisioned`, I am
> > > also
> > >
> > > comfortable with the thought.
> > >
> >
> > Rather than some AMD special case API hacked into <rte_lcore.h>, I
> > think we are better off with no DPDK API at all for this kind of functionality.
> >
> > A DPDK CPU/memory hierarchy topology API very much makes sense, but it
> > should be reasonably generic and complete from the start.
>
> Agreed. This one of those cases where the existing project hwloc which is part
> of open-mpi is more complete and well supported. It supports multiple OS's
> and can deal with more quirks.

Thank you Stephen for the inputs. Last year, when we checked the distro versions of hwloc, there were anomalies in NUMA and physical socket identification on AMD EPYC SoCs.
I will recheck the distro versions of hwloc; if these work out fine, I will rework this with the hwloc library, making it OS independent too. A rough sketch of the intended hwloc-based lookup is shown below.
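
Purely for illustration, a minimal sketch of such a lookup with hwloc
(assuming hwloc >= 2.x, where `HWLOC_OBJ_L3CACHE` is a first-class object
type); this is not code from the patch, just the idea being considered:
```
#include <stdio.h>
#include <hwloc.h>

/* List the OS CPU ids that share each L3 cache. */
int main(void)
{
	hwloc_topology_t topo;
	int i, n;

	hwloc_topology_init(&topo);
	hwloc_topology_load(topo);

	n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L3CACHE);
	for (i = 0; i < n; i++) {
		hwloc_obj_t l3 = hwloc_get_obj_by_type(topo, HWLOC_OBJ_L3CACHE, i);
		hwloc_obj_t pu = NULL;

		printf("L3 #%d:", i);
		while ((pu = hwloc_get_next_obj_inside_cpuset_by_type(topo,
				l3->cpuset, HWLOC_OBJ_PU, pu)) != NULL)
			printf(" %u", pu->os_index);
		printf("\n");
	}

	hwloc_topology_destroy(topo);
	return 0;
}
```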

>
> https://github.com/open-mpi/hwloc
  
Vipin Varghese Sept. 11, 2024, 3:26 a.m. UTC | #18
[AMD Official Use Only - AMD Internal Distribution Only]

<snipped>

>
> On 2024-09-09 16:22, Varghese, Vipin wrote:
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> > <snipped>
> >
> >>> <snipped>
> >>>
> >>> Thank you Mattias for the comments and question, please let me try
> >>> to explain the same below
> >>>
> >>>> We shouldn't have a separate CPU/cache hierarchy API instead?
> >>>
> >>> Based on the intention to bring in CPU lcores which share same L3
> >>> (for better cache hits and less noisy neighbor) current API focuses
> >>> on using
> >>>
> >>> Last Level Cache. But if the suggestion is `there are SoC where L2
> >>> cache are also shared, and the new API should be provisioned`, I am
> >>> also
> >>>
> >>> comfortable with the thought.
> >>>
> >>
> >> Rather than some AMD special case API hacked into <rte_lcore.h>, I
> >> think we are better off with no DPDK API at all for this kind of functionality.
> >
> > Hi Mattias, as shared in the earlier email thread, this is not a AMD special
> case at all. Let me try to explain this one more time. One of techniques used to
> increase cores cost effective way to go for tiles of compute complexes.
> > This introduces a bunch of cores in sharing same Last Level Cache (namely
> L2, L3 or even L4) depending upon cache topology architecture.
> >
> > The API suggested in RFC is to help end users to selectively use cores under
> same Last Level Cache Hierarchy as advertised by OS (irrespective of the BIOS
> settings used). This is useful in both bare-metal and container environment.
> >
>
> I'm pretty familiar with AMD CPUs and the use of tiles (including the
> challenges these kinds of non-uniformities pose for work scheduling).
>
> To maximize performance, caring about core<->LLC relationship may well not
> be enough, and more HT/core/cache/memory topology information is
> required. That's what I meant by special case. A proper API should allow
> access to information about which lcores are SMT siblings, cores on the same
> L2, and cores on the same L3, to name a few things. Probably you want to fit
> NUMA into the same API as well, although that is available already in
> <rte_lcore.h>.

Thank you Mattias for the information. As shared in the reply to Anatoly, we want to expose a new API `rte_get_next_lcore_ex` which takes an extra argument `u32 flags`.
The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED, RTE_GET_LCORE_BOOST_DISABLED.

This is AMD EPYC SoC agnostic and tries to address all generic cases.

Please do let us know if we (Ferruh & myself) can sync up via a call?

>
> One can have a look at how scheduling domains work in the Linux kernel.
> They model this kind of thing.
>
> > As shared in response for cover letter +1 to expand it to more than
> > just LLC cores. We have also confirmed the same to
> > https://patchwork.dpdk.org/project/dpdk/cover/20240827151014.201-
> 1-vip
> > in.varghese@amd.com/
> >
> >>
> >> A DPDK CPU/memory hierarchy topology API very much makes sense, but
> >> it should be reasonably generic and complete from the start.
> >>
> >>>>
> >>>> Could potentially be built on the 'hwloc' library.
> >>>
> >>> There are 3 reason on AMD SoC we did not explore this path, reasons
> >>> are
> >>>
> >>> 1. depending n hwloc version and kernel version certain SoC
> >>> hierarchies are not available
> >>>
> >>> 2. CPU NUMA and IO (memory & PCIe) NUMA are independent on AMD
> >> Epyc Soc.
> >>>
> >>> 3. adds the extra dependency layer of library layer to be made
> >>> available to work.
> >>>
> >>>
> >>> hence we have tried to use Linux Documented generic layer of `sysfs
> >>> CPU cache`.
> >>>
> >>> I will try to explore more on hwloc and check if other libraries
> >>> within DPDK leverages the same.
> >>>
> >>>>
> >>>> I much agree cache/core topology may be of interest of the
> >>>> application (or a work scheduler, like a DPDK event device), but
> >>>> it's not limited to LLC. It may well be worthwhile to care about
> >>>> which cores shares L2 cache, for example. Not sure the
> >>>> RTE_LCORE_FOREACH_*
> >> approach scales.
> >>>
> >>> yes, totally understand as some SoC, multiple lcores shares same L2 cache.
> >>>
> >>>
> >>> Can we rework the API to be rte_get_cache_<function> where user
> >>> argument is desired lcore index.
> >>>
> >>> 1. index-1: SMT threads
> >>>
> >>> 2. index-2: threads sharing same L2 cache
> >>>
> >>> 3. index-3: threads sharing same L3 cache
> >>>
> >>> 4. index-MAX: identify the threads sharing last level cache.
> >>>
> >>>>
> >>>>> < Function: Purpose >
> >>>>> ---------------------
> >>>>>    - rte_get_llc_first_lcores: Retrieves all the first lcores in
> >>>>> the shared LLC.
> >>>>>    - rte_get_llc_lcore: Retrieves all lcores that share the LLC.
> >>>>>    - rte_get_llc_n_lcore: Retrieves the first n or skips the first
> >>>>> n lcores in the shared LLC.
> >>>>>
> >>>>> < MACRO: Purpose >
> >>>>> ------------------
> >>>>> RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from
> >>>>> each LLC.
> >>>>> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first
> >>>>> worker lcore from each LLC.
> >>>>> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on
> >> hint
> >>>>> (lcore id).
> >>>>> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from
> LLC
> >>>>> while skipping first worker.
> >>>>> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n`
> lcores
> >>>>> from each LLC.
> >>>>> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then
> >>>>> iterates through reaming lcores in each LLC.
> >>>>>
> >>> While the MACRO are simple wrapper invoking appropriate API. can
> >>> this be worked out in this fashion?
> >>>
> >>> <snipped>
  
Stephen Hemminger Sept. 11, 2024, 3:53 a.m. UTC | #19
On Wed, 11 Sep 2024 03:13:14 +0000
"Varghese, Vipin" <Vipin.Varghese@amd.com> wrote:

> > Agreed. This one of those cases where the existing project hwloc which is part
> > of open-mpi is more complete and well supported. It supports multiple OS's
> > and can deal with more quirks.  
> 
> Thank you Stephen for the inputs, last year when checked hwloc for distros there were anomalies for NUMA and Physical socket Identification on AMD EPYC Soc.
> I will recheck the distros version of hwloc, if these work out fine I will re-work with hwloc libraries making it OS independent too.

DPDK doesn't exist to resolve problems with upstreaming hardware support
in other packages. If DPDK supports something only because it is harder, slower, or more painful
to deal with in another project, then you create long-term technical debt.
  
Mattias Rönnblom Sept. 11, 2024, 3:55 p.m. UTC | #20
On 2024-09-11 05:26, Varghese, Vipin wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> <snipped>
> 
>>
>> On 2024-09-09 16:22, Varghese, Vipin wrote:
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>> <snipped>
>>>
>>>>> <snipped>
>>>>>
>>>>> Thank you Mattias for the comments and question, please let me try
>>>>> to explain the same below
>>>>>
>>>>>> We shouldn't have a separate CPU/cache hierarchy API instead?
>>>>>
>>>>> Based on the intention to bring in CPU lcores which share same L3
>>>>> (for better cache hits and less noisy neighbor) current API focuses
>>>>> on using
>>>>>
>>>>> Last Level Cache. But if the suggestion is `there are SoC where L2
>>>>> cache are also shared, and the new API should be provisioned`, I am
>>>>> also
>>>>>
>>>>> comfortable with the thought.
>>>>>
>>>>
>>>> Rather than some AMD special case API hacked into <rte_lcore.h>, I
>>>> think we are better off with no DPDK API at all for this kind of functionality.
>>>
>>> Hi Mattias, as shared in the earlier email thread, this is not a AMD special
>> case at all. Let me try to explain this one more time. One of techniques used to
>> increase cores cost effective way to go for tiles of compute complexes.
>>> This introduces a bunch of cores in sharing same Last Level Cache (namely
>> L2, L3 or even L4) depending upon cache topology architecture.
>>>
>>> The API suggested in RFC is to help end users to selectively use cores under
>> same Last Level Cache Hierarchy as advertised by OS (irrespective of the BIOS
>> settings used). This is useful in both bare-metal and container environment.
>>>
>>
>> I'm pretty familiar with AMD CPUs and the use of tiles (including the
>> challenges these kinds of non-uniformities pose for work scheduling).
>>
>> To maximize performance, caring about core<->LLC relationship may well not
>> be enough, and more HT/core/cache/memory topology information is
>> required. That's what I meant by special case. A proper API should allow
>> access to information about which lcores are SMT siblings, cores on the same
>> L2, and cores on the same L3, to name a few things. Probably you want to fit
>> NUMA into the same API as well, although that is available already in
>> <rte_lcore.h>.
> 
> Thank you Mattias for the information, as shared by in the reply with Anatoly we want expose a new API `rte_get_next_lcore_ex` which intakes a extra argument `u32 flags`.
> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED, RTE_GET_LCORE_BOOST_DISABLED.
> 

Wouldn't that API be pretty awkward to use?

I mean, what you have is a topology, with nodes of different types and 
with different properties, and you want to present it to the user.

In a sense, it's similar to XCM and DOM versus SAX. The above is 
SAX-style, and what I have in mind is something DOM-like.

What use case do you have in mind? What's on top of my list is a 
scenario where a DPDK app gets a bunch of cores (e.g., -l <cores>) and 
tries to figure out how best to make use of them. It's not going to "skip" 
(ignore, leave unused) SMT siblings, or skip non-boosted cores; it would 
just try to be clever in regards to which cores to use for what purpose.

> This is AMD EPYC SoC agnostic and trying to address for all generic cases.
> 
> Please do let us know if we (Ferruh & myself) can sync up via call?
> 

Sure, I can do that.

>>
>> One can have a look at how scheduling domains work in the Linux kernel.
>> They model this kind of thing.
>>
>>> As shared in response for cover letter +1 to expand it to more than
>>> just LLC cores. We have also confirmed the same to
>>> https://patchwork.dpdk.org/project/dpdk/cover/20240827151014.201-
>> 1-vip
>>> in.varghese@amd.com/
>>>
>>>>
>>>> A DPDK CPU/memory hierarchy topology API very much makes sense, but
>>>> it should be reasonably generic and complete from the start.
>>>>
>>>>>>
>>>>>> Could potentially be built on the 'hwloc' library.
>>>>>
>>>>> There are 3 reason on AMD SoC we did not explore this path, reasons
>>>>> are
>>>>>
>>>>> 1. depending n hwloc version and kernel version certain SoC
>>>>> hierarchies are not available
>>>>>
>>>>> 2. CPU NUMA and IO (memory & PCIe) NUMA are independent on AMD
>>>> Epyc Soc.
>>>>>
>>>>> 3. adds the extra dependency layer of library layer to be made
>>>>> available to work.
>>>>>
>>>>>
>>>>> hence we have tried to use Linux Documented generic layer of `sysfs
>>>>> CPU cache`.
>>>>>
>>>>> I will try to explore more on hwloc and check if other libraries
>>>>> within DPDK leverages the same.
>>>>>
>>>>>>
>>>>>> I much agree cache/core topology may be of interest of the
>>>>>> application (or a work scheduler, like a DPDK event device), but
>>>>>> it's not limited to LLC. It may well be worthwhile to care about
>>>>>> which cores shares L2 cache, for example. Not sure the
>>>>>> RTE_LCORE_FOREACH_*
>>>> approach scales.
>>>>>
>>>>> yes, totally understand as some SoC, multiple lcores shares same L2 cache.
>>>>>
>>>>>
>>>>> Can we rework the API to be rte_get_cache_<function> where user
>>>>> argument is desired lcore index.
>>>>>
>>>>> 1. index-1: SMT threads
>>>>>
>>>>> 2. index-2: threads sharing same L2 cache
>>>>>
>>>>> 3. index-3: threads sharing same L3 cache
>>>>>
>>>>> 4. index-MAX: identify the threads sharing last level cache.
>>>>>
>>>>>>
>>>>>>> < Function: Purpose >
>>>>>>> ---------------------
>>>>>>>     - rte_get_llc_first_lcores: Retrieves all the first lcores in
>>>>>>> the shared LLC.
>>>>>>>     - rte_get_llc_lcore: Retrieves all lcores that share the LLC.
>>>>>>>     - rte_get_llc_n_lcore: Retrieves the first n or skips the first
>>>>>>> n lcores in the shared LLC.
>>>>>>>
>>>>>>> < MACRO: Purpose >
>>>>>>> ------------------
>>>>>>> RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from
>>>>>>> each LLC.
>>>>>>> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first
>>>>>>> worker lcore from each LLC.
>>>>>>> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on
>>>> hint
>>>>>>> (lcore id).
>>>>>>> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from
>> LLC
>>>>>>> while skipping first worker.
>>>>>>> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n`
>> lcores
>>>>>>> from each LLC.
>>>>>>> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then
>>>>>>> iterates through reaming lcores in each LLC.
>>>>>>>
>>>>> While the MACRO are simple wrapper invoking appropriate API. can
>>>>> this be worked out in this fashion?
>>>>>
>>>>> <snipped>
  
Bruce Richardson Sept. 11, 2024, 4:01 p.m. UTC | #21
On Wed, Sep 11, 2024 at 03:26:20AM +0000, Varghese, Vipin wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> <snipped>
> 
> >
> > On 2024-09-09 16:22, Varghese, Vipin wrote:
> > > [AMD Official Use Only - AMD Internal Distribution Only]
> > >
> > > <snipped>
> > >
> > >>> <snipped>
> > >>>
> > >>> Thank you Mattias for the comments and question, please let me try
> > >>> to explain the same below
> > >>>
> > >>>> We shouldn't have a separate CPU/cache hierarchy API instead?
> > >>>
> > >>> Based on the intention to bring in CPU lcores which share same L3
> > >>> (for better cache hits and less noisy neighbor) current API focuses
> > >>> on using
> > >>>
> > >>> Last Level Cache. But if the suggestion is `there are SoC where L2
> > >>> cache are also shared, and the new API should be provisioned`, I am
> > >>> also
> > >>>
> > >>> comfortable with the thought.
> > >>>
> > >>
> > >> Rather than some AMD special case API hacked into <rte_lcore.h>, I
> > >> think we are better off with no DPDK API at all for this kind of functionality.
> > >
> > > Hi Mattias, as shared in the earlier email thread, this is not a AMD special
> > case at all. Let me try to explain this one more time. One of techniques used to
> > increase cores cost effective way to go for tiles of compute complexes.
> > > This introduces a bunch of cores in sharing same Last Level Cache (namely
> > L2, L3 or even L4) depending upon cache topology architecture.
> > >
> > > The API suggested in RFC is to help end users to selectively use cores under
> > same Last Level Cache Hierarchy as advertised by OS (irrespective of the BIOS
> > settings used). This is useful in both bare-metal and container environment.
> > >
> >
> > I'm pretty familiar with AMD CPUs and the use of tiles (including the
> > challenges these kinds of non-uniformities pose for work scheduling).
> >
> > To maximize performance, caring about core<->LLC relationship may well not
> > be enough, and more HT/core/cache/memory topology information is
> > required. That's what I meant by special case. A proper API should allow
> > access to information about which lcores are SMT siblings, cores on the same
> > L2, and cores on the same L3, to name a few things. Probably you want to fit
> > NUMA into the same API as well, although that is available already in
> > <rte_lcore.h>.
> 
> Thank you Mattias for the information, as shared by in the reply with Anatoly we want expose a new API `rte_get_next_lcore_ex` which intakes a extra argument `u32 flags`.
> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED, RTE_GET_LCORE_BOOST_DISABLED.
> 

For the naming, would "rte_get_next_sibling_core" (or lcore if you prefer)
be a clearer name than just adding "ex" on to the end of the existing
function?

Looking logically, I'm not sure about the BOOST_ENABLED and BOOST_DISABLED
flags you propose - in a system with multiple possible standard and boost
frequencies what would those correspond to? What's also missing is a define
for getting actual NUMA siblings i.e. those sharing common memory but not
an L3 or anything else.

My suggestion would be to have the function take just an integer-type e.g.
uint16_t parameter which defines the memory/cache hierarchy level to use, 0
being lowest, 1 next, and so on. Different systems may have different
numbers of cache levels, so let's just make it a zero-based index of levels,
rather than giving explicit defines (except for memory which should
probably always be last). The zero-level will be for "closest neighbour"
whatever that happens to be, with as many levels as is necessary to express
the topology, e.g. without SMT, but with 3 cache levels, level 0 would be
an L2 neighbour, level 1 an L3 neighbour. If the L3 was split within a
memory NUMA node, then level 2 would give the NUMA siblings. We'd just need
an API to return the max number of levels along with the iterator.
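
As an illustration of that suggestion, one possible shape is sketched
below; the names `rte_lcore_topology_levels` and
`rte_get_next_lcore_at_level` are made up here purely to show the
zero-based level idea, and the termination semantics are assumed:
```
#include <stdio.h>
#include <stdint.h>
#include <rte_lcore.h>

/* Hypothetical prototypes following the zero-based level idea:
 * level 0 = closest neighbours, higher levels = wider sharing,
 * last level = same-memory (NUMA) siblings. Assumed to return
 * RTE_MAX_LCORE when the group at that level is exhausted. */
unsigned int rte_lcore_topology_levels(void);
unsigned int rte_get_next_lcore_at_level(unsigned int i, uint16_t level);

/* Walk every sharing level of a given lcore, narrowest first. */
static void
dump_neighbours(unsigned int lcore)
{
	uint16_t level;

	for (level = 0; level < rte_lcore_topology_levels(); level++) {
		unsigned int n = rte_get_next_lcore_at_level(lcore, level);

		printf("lcore %u, level %u neighbours:", lcore, level);
		while (n < RTE_MAX_LCORE) {
			printf(" %u", n);
			n = rte_get_next_lcore_at_level(n, level);
		}
		printf("\n");
	}
}
```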

Regards,
/Bruce
  
Honnappa Nagarahalli Sept. 11, 2024, 5:04 p.m. UTC | #22
> On Sep 11, 2024, at 10:55 AM, Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> 
> On 2024-09-11 05:26, Varghese, Vipin wrote:
>> [AMD Official Use Only - AMD Internal Distribution Only]
>> <snipped>
>>> 
>>> On 2024-09-09 16:22, Varghese, Vipin wrote:
>>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>> 
>>>> <snipped>
>>>> 
>>>>>> <snipped>
>>>>>> 
>>>>>> Thank you Mattias for the comments and question, please let me try
>>>>>> to explain the same below
>>>>>> 
>>>>>>> We shouldn't have a separate CPU/cache hierarchy API instead?
>>>>>> 
>>>>>> Based on the intention to bring in CPU lcores which share same L3
>>>>>> (for better cache hits and less noisy neighbor) current API focuses
>>>>>> on using
>>>>>> 
>>>>>> Last Level Cache. But if the suggestion is `there are SoC where L2
>>>>>> cache are also shared, and the new API should be provisioned`, I am
>>>>>> also
>>>>>> 
>>>>>> comfortable with the thought.
>>>>>> 
>>>>> 
>>>>> Rather than some AMD special case API hacked into <rte_lcore.h>, I
>>>>> think we are better off with no DPDK API at all for this kind of functionality.
>>>> 
>>>> Hi Mattias, as shared in the earlier email thread, this is not a AMD special
>>> case at all. Let me try to explain this one more time. One of techniques used to
>>> increase cores cost effective way to go for tiles of compute complexes.
>>>> This introduces a bunch of cores in sharing same Last Level Cache (namely
>>> L2, L3 or even L4) depending upon cache topology architecture.
>>>> 
>>>> The API suggested in RFC is to help end users to selectively use cores under
>>> same Last Level Cache Hierarchy as advertised by OS (irrespective of the BIOS
>>> settings used). This is useful in both bare-metal and container environment.
>>>> 
>>> 
>>> I'm pretty familiar with AMD CPUs and the use of tiles (including the
>>> challenges these kinds of non-uniformities pose for work scheduling).
>>> 
>>> To maximize performance, caring about core<->LLC relationship may well not
>>> be enough, and more HT/core/cache/memory topology information is
>>> required. That's what I meant by special case. A proper API should allow
>>> access to information about which lcores are SMT siblings, cores on the same
>>> L2, and cores on the same L3, to name a few things. Probably you want to fit
>>> NUMA into the same API as well, although that is available already in
>>> <rte_lcore.h>.
>> Thank you Mattias for the information, as shared by in the reply with Anatoly we want expose a new API `rte_get_next_lcore_ex` which intakes a extra argument `u32 flags`.
>> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED, RTE_GET_LCORE_BOOST_DISABLED.
> 
> Wouldn't using that API be pretty awkward to use?
> 
> I mean, what you have is a topology, with nodes of different types and with different properties, and you want to present it to the user.
> 
> In a sense, it's similar to XCM and DOM versus SAX. The above is SAX-style, and what I have in mind is something DOM-like.
> 
> What use case do you have in mind? What's on top of my list is a scenario where a DPDK app gets a bunch of cores (e.g., -l <cores>) and tries to figure out how best make use of them. It's not going to "skip" (ignore, leave unused) SMT siblings, or skip non-boosted cores, it would just try to be clever in regards to which cores to use for what purpose.
> 
>> This is AMD EPYC SoC agnostic and trying to address for all generic cases.
>> Please do let us know if we (Ferruh & myself) can sync up via call?
> 
> Sure, I can do that.
> 
Can this be opened to the rest of the community? This is a common problem that needs to be solved for multiple architectures. I would be interested in attending.

>>> 
>>> One can have a look at how scheduling domains work in the Linux kernel.
>>> They model this kind of thing.
>>> 
>>>> As shared in response for cover letter +1 to expand it to more than
>>>> just LLC cores. We have also confirmed the same to
>>>> https://patchwork.dpdk.org/project/dpdk/cover/20240827151014.201-
>>> 1-vip
>>>> in.varghese@amd.com/
>>>> 
>>>>> 
>>>>> A DPDK CPU/memory hierarchy topology API very much makes sense, but
>>>>> it should be reasonably generic and complete from the start.
>>>>> 
>>>>>>> 
>>>>>>> Could potentially be built on the 'hwloc' library.
>>>>>> 
>>>>>> There are 3 reason on AMD SoC we did not explore this path, reasons
>>>>>> are
>>>>>> 
>>>>>> 1. depending n hwloc version and kernel version certain SoC
>>>>>> hierarchies are not available
>>>>>> 
>>>>>> 2. CPU NUMA and IO (memory & PCIe) NUMA are independent on AMD
>>>>> Epyc Soc.
>>>>>> 
>>>>>> 3. adds the extra dependency layer of library layer to be made
>>>>>> available to work.
>>>>>> 
>>>>>> 
>>>>>> hence we have tried to use Linux Documented generic layer of `sysfs
>>>>>> CPU cache`.
>>>>>> 
>>>>>> I will try to explore more on hwloc and check if other libraries
>>>>>> within DPDK leverages the same.
>>>>>> 
>>>>>>> 
>>>>>>> I much agree cache/core topology may be of interest of the
>>>>>>> application (or a work scheduler, like a DPDK event device), but
>>>>>>> it's not limited to LLC. It may well be worthwhile to care about
>>>>>>> which cores shares L2 cache, for example. Not sure the
>>>>>>> RTE_LCORE_FOREACH_*
>>>>> approach scales.
>>>>>> 
>>>>>> yes, totally understand as some SoC, multiple lcores shares same L2 cache.
>>>>>> 
>>>>>> 
>>>>>> Can we rework the API to be rte_get_cache_<function> where user
>>>>>> argument is desired lcore index.
>>>>>> 
>>>>>> 1. index-1: SMT threads
>>>>>> 
>>>>>> 2. index-2: threads sharing same L2 cache
>>>>>> 
>>>>>> 3. index-3: threads sharing same L3 cache
>>>>>> 
>>>>>> 4. index-MAX: identify the threads sharing last level cache.
>>>>>> 
>>>>>>> 
>>>>>>>> < Function: Purpose >
>>>>>>>> ---------------------
>>>>>>>>    - rte_get_llc_first_lcores: Retrieves all the first lcores in
>>>>>>>> the shared LLC.
>>>>>>>>    - rte_get_llc_lcore: Retrieves all lcores that share the LLC.
>>>>>>>>    - rte_get_llc_n_lcore: Retrieves the first n or skips the first
>>>>>>>> n lcores in the shared LLC.
>>>>>>>> 
>>>>>>>> < MACRO: Purpose >
>>>>>>>> ------------------
>>>>>>>> RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from
>>>>>>>> each LLC.
>>>>>>>> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first
>>>>>>>> worker lcore from each LLC.
>>>>>>>> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on
>>>>> hint
>>>>>>>> (lcore id).
>>>>>>>> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from
>>> LLC
>>>>>>>> while skipping first worker.
>>>>>>>> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n`
>>> lcores
>>>>>>>> from each LLC.
>>>>>>>> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then
>>>>>>>> iterates through reaming lcores in each LLC.
>>>>>>>> 
>>>>>> While the MACRO are simple wrapper invoking appropriate API. can
>>>>>> this be worked out in this fashion?
>>>>>> 
>>>>>> <snipped>
  
Konstantin Ananyev Sept. 11, 2024, 10:25 p.m. UTC | #23
> > > >>> Thank you Mattias for the comments and question, please let me try
> > > >>> to explain the same below
> > > >>>
> > > >>>> We shouldn't have a separate CPU/cache hierarchy API instead?
> > > >>>
> > > >>> Based on the intention to bring in CPU lcores which share same L3
> > > >>> (for better cache hits and less noisy neighbor) current API focuses
> > > >>> on using
> > > >>>
> > > >>> Last Level Cache. But if the suggestion is `there are SoC where L2
> > > >>> cache are also shared, and the new API should be provisioned`, I am
> > > >>> also
> > > >>>
> > > >>> comfortable with the thought.
> > > >>>
> > > >>
> > > >> Rather than some AMD special case API hacked into <rte_lcore.h>, I
> > > >> think we are better off with no DPDK API at all for this kind of functionality.
> > > >
> > > > Hi Mattias, as shared in the earlier email thread, this is not a AMD special
> > > case at all. Let me try to explain this one more time. One of techniques used to
> > > increase cores cost effective way to go for tiles of compute complexes.
> > > > This introduces a bunch of cores in sharing same Last Level Cache (namely
> > > L2, L3 or even L4) depending upon cache topology architecture.
> > > >
> > > > The API suggested in RFC is to help end users to selectively use cores under
> > > same Last Level Cache Hierarchy as advertised by OS (irrespective of the BIOS
> > > settings used). This is useful in both bare-metal and container environment.
> > > >
> > >
> > > I'm pretty familiar with AMD CPUs and the use of tiles (including the
> > > challenges these kinds of non-uniformities pose for work scheduling).
> > >
> > > To maximize performance, caring about core<->LLC relationship may well not
> > > be enough, and more HT/core/cache/memory topology information is
> > > required. That's what I meant by special case. A proper API should allow
> > > access to information about which lcores are SMT siblings, cores on the same
> > > L2, and cores on the same L3, to name a few things. Probably you want to fit
> > > NUMA into the same API as well, although that is available already in
> > > <rte_lcore.h>.
> >
> > Thank you Mattias for the information, as shared by in the reply with Anatoly we want expose a new API `rte_get_next_lcore_ex`
> which intakes a extra argument `u32 flags`.
> > The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
> RTE_GET_LCORE_BOOST_DISABLED.
> >
> 
> For the naming, would "rte_get_next_sibling_core" (or lcore if you prefer)
> be a clearer name than just adding "ex" on to the end of the existing
> function?
> 
> Looking logically, I'm not sure about the BOOST_ENABLED and BOOST_DISABLED
> flags you propose - in a system with multiple possible standard and boost
> frequencies what would those correspond to? What's also missing is a define
> for getting actual NUMA siblings i.e. those sharing common memory but not
> an L3 or anything else.
> 
> My suggestion would be to have the function take just an integer-type e.g.
> uint16_t parameter which defines the memory/cache hierarchy level to use, 0
> being lowest, 1 next, and so on. Different systems may have different
> numbers of cache levels so lets just make it a zero-based index of levels,
> rather than giving explicit defines (except for memory which should
> probably always be last). The zero-level will be for "closest neighbour"
> whatever that happens to be, with as many levels as is necessary to express
> the topology, e.g. without SMT, but with 3 cache levels, level 0 would be
> an L2 neighbour, level 1 an L3 neighbour. If the L3 was split within a
> memory NUMA node, then level 2 would give the NUMA siblings. We'd just need
> an API to return the max number of levels along with the iterator.

Sounds like a neat idea to me.
  
Vipin Varghese Sept. 12, 2024, 1:11 a.m. UTC | #24
[AMD Official Use Only - AMD Internal Distribution Only]

<snipped>

>
> On Wed, 11 Sep 2024 03:13:14 +0000
> "Varghese, Vipin" <Vipin.Varghese@amd.com> wrote:
>
> > > Agreed. This one of those cases where the existing project hwloc
> > > which is part of open-mpi is more complete and well supported. It
> > > supports multiple OS's and can deal with more quirks.
> >
> > Thank you Stephen for the inputs, last year when checked hwloc for distros
> there were anomalies for NUMA and Physical socket Identification on AMD
> EPYC Soc.
> > I will recheck the distros version of hwloc, if these work out fine I will re-
> work with hwloc libraries making it OS independent too.
>
> DPDK doesn't exist to resolve problems with upstreaming hardware support in
> other packages.
Stephen, thank you for your comment. You asked in an earlier request to try using the hwloc library.
I have mentioned that, at least until last year, popular distros were still shipping an older version of hwloc; I will recheck whether that has changed.

I assumed this would mitigate the `sysfs` parsing and bring in `OS` portability.

> If DPDK supports something only because it is harder, slower, or more painful
> to deal with in another project, then you create long-term technical debt.

Now I am really confused; for any application or end user which wants to use the right set of lcores for performance, DPDK provides abstraction and ease of use.
Maybe I am misreading the above idea, so let me back off a bit and share version-2 of the RFC.
  
Vipin Varghese Sept. 12, 2024, 1:33 a.m. UTC | #25
[Public]

Snipped

> >>>>
> >>>> <snipped>
> >>>>
> >>>>>> <snipped>
> >>>>>>
> >>>>>> Thank you Mattias for the comments and question, please let me
> >>>>>> try to explain the same below
> >>>>>>
> >>>>>>> We shouldn't have a separate CPU/cache hierarchy API instead?
> >>>>>>
> >>>>>> Based on the intention to bring in CPU lcores which share same L3
> >>>>>> (for better cache hits and less noisy neighbor) current API
> >>>>>> focuses on using
> >>>>>>
> >>>>>> Last Level Cache. But if the suggestion is `there are SoC where
> >>>>>> L2 cache are also shared, and the new API should be provisioned`,
> >>>>>> I am also
> >>>>>>
> >>>>>> comfortable with the thought.
> >>>>>>
> >>>>>
> >>>>> Rather than some AMD special case API hacked into <rte_lcore.h>, I
> >>>>> think we are better off with no DPDK API at all for this kind of
> functionality.
> >>>>
> >>>> Hi Mattias, as shared in the earlier email thread, this is not a
> >>>> AMD special
> >>> case at all. Let me try to explain this one more time. One of
> >>> techniques used to increase cores cost effective way to go for tiles of
> compute complexes.
> >>>> This introduces a bunch of cores in sharing same Last Level Cache
> >>>> (namely
> >>> L2, L3 or even L4) depending upon cache topology architecture.
> >>>>
> >>>> The API suggested in RFC is to help end users to selectively use
> >>>> cores under
> >>> same Last Level Cache Hierarchy as advertised by OS (irrespective of
> >>> the BIOS settings used). This is useful in both bare-metal and container
> environment.
> >>>>
> >>>
> >>> I'm pretty familiar with AMD CPUs and the use of tiles (including
> >>> the challenges these kinds of non-uniformities pose for work scheduling).
> >>>
> >>> To maximize performance, caring about core<->LLC relationship may
> >>> well not be enough, and more HT/core/cache/memory topology
> >>> information is required. That's what I meant by special case. A
> >>> proper API should allow access to information about which lcores are
> >>> SMT siblings, cores on the same L2, and cores on the same L3, to
> >>> name a few things. Probably you want to fit NUMA into the same API
> >>> as well, although that is available already in <rte_lcore.h>.
> >> Thank you Mattias for the information, as shared by in the reply with
> Anatoly we want expose a new API `rte_get_next_lcore_ex` which intakes a
> extra argument `u32 flags`.
> >> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
> RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
> RTE_GET_LCORE_BOOST_DISABLED.
> >
> > Wouldn't using that API be pretty awkward to use?
The current API available in DPDK is `rte_get_next_lcore`, which is used within DPDK examples and in customer solutions.
Based on the comments from others, we responded with the idea of changing the new API from `rte_get_next_lcore_llc` to `rte_get_next_lcore_extnd`.

Can you please help us understand what is `awkward`?

> >
> > I mean, what you have is a topology, with nodes of different types and with
> different properties, and you want to present it to the user.
Let me be clear: what we want via DPDK is to help customers use a unified API which works across multiple platforms.
Example: let a vendor have 2 products, namely A and B. CPU-A has all cores within the same sub-NUMA domain, and CPU-B has cores split into 2 sub-NUMA domains based on a split LLC.
When `rte_get_next_lcore_extnd` is invoked with the `LLC` flag (see the sketch after this list):
1. on CPU-A it returns all cores, as there is no split;
2. on CPU-B it returns cores from the specific sub-NUMA domain which is partitioned by the L3.
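
A minimal sketch of the kind of portable usage being described; the name
`rte_get_next_lcore_extnd` and the `RTE_GET_LCORE_L3` flag are the
proposal under discussion (not an existing DPDK API), while
`rte_eal_remote_launch()` is the existing launch call:
```
#include <stdio.h>
#include <stdint.h>
#include <rte_lcore.h>
#include <rte_launch.h>

/* Hypothetical prototype and flag value from the v2 proposal. */
#define RTE_GET_LCORE_L3 0x4
unsigned int rte_get_next_lcore_extnd(unsigned int i, uint32_t flags);

static int
worker_main(void *arg)
{
	(void)arg;
	printf("pipeline stage running on lcore %u\n", rte_lcore_id());
	return 0;
}

/* Launch workers only on lcores sharing the L3 with 'first':
 * on CPU-A this covers every enabled lcore, on CPU-B only the
 * lcores of one L3 partition - the same code works on both parts. */
static void
launch_llc_local_workers(unsigned int first)
{
	unsigned int lcore = rte_get_next_lcore_extnd(first, RTE_GET_LCORE_L3);

	while (lcore < RTE_MAX_LCORE) {
		rte_eal_remote_launch(worker_main, NULL, lcore);
		lcore = rte_get_next_lcore_extnd(lcore, RTE_GET_LCORE_L3);
	}
}
```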

> >
> > In a sense, it's similar to XCM and DOM versus SAX. The above is SAX-style,
> and what I have in mind is something DOM-like.
> >
> > What use case do you have in mind? What's on top of my list is a scenario
> where a DPDK app gets a bunch of cores (e.g., -l <cores>) and tries to figure
> out how best make use of them.
Exactly.

> It's not going to "skip" (ignore, leave unused)
> SMT siblings, or skip non-boosted cores, it would just try to be clever in
> regards to which cores to use for what purpose.
Let me try to share my idea on SMT siblings. When `rte_get_next_lcore_extnd` is invoked with the `L1 | SMT` flag for a given `lcore`, the API first identifies whether the given lcore is part of the enabled core list.
If yes, it programmatically identifies the sibling thread and returns it, using either `sysfs` or the `hwloc` library (we shared the version concern on distros; will recheck again), as sketched below.
If there is no sibling thread available under DPDK, it will fetch the next lcore (probably lcore + 1).
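
Purely as an illustration of the sysfs route (not code from the patch),
the sibling lookup could read the kernel-documented topology files, e.g.
`/sys/devices/system/cpu/cpuN/topology/thread_siblings_list` for SMT
siblings or `.../cache/index3/shared_cpu_list` for the L3 group:
```
#include <stdio.h>

/* Print the SMT sibling list of a given CPU as reported by sysfs.
 * The file contains a human-readable list such as "0,128". */
static int
print_smt_siblings(unsigned int cpu)
{
	char path[128], buf[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%u/topology/thread_siblings_list",
		 cpu);
	f = fopen(path, "r");
	if (f == NULL)
		return -1;
	if (fgets(buf, sizeof(buf), f) != NULL)
		printf("cpu%u SMT siblings: %s", cpu, buf);
	fclose(f);
	return 0;
}
```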

> >
> >> This is AMD EPYC SoC agnostic and trying to address for all generic cases.
> >> Please do let us know if we (Ferruh & myself) can sync up via call?
> >
> > Sure, I can do that.

Let me sync with Ferruh and get a time slot for internal sync.

> >
> Can this be opened to the rest of the community? This is a common problem
> that needs to be solved for multiple architectures. I would be interested in
> attending.
Thank you Mattias. At the DPDK Bangkok summit 2024 we did bring this up, and as per the suggestion from Thomas and Jerrin we tried to bring in the RFC for discussion.
For DPDK Montreal 2024, Keesang and Ferruh (most likely) are travelling to the summit and presenting this as a talk to get things moving.

>
> >>>
<snipped>
  
Vipin Varghese Sept. 12, 2024, 2:19 a.m. UTC | #26
[Public]


<snipped>

> > > > <snipped>
> > > >
> > > >>> <snipped>
> > > >>>
> > > >>> Thank you Mattias for the comments and question, please let me
> > > >>> try to explain the same below
> > > >>>
> > > >>>> We shouldn't have a separate CPU/cache hierarchy API instead?
> > > >>>
> > > >>> Based on the intention to bring in CPU lcores which share same
> > > >>> L3 (for better cache hits and less noisy neighbor) current API
> > > >>> focuses on using
> > > >>>
> > > >>> Last Level Cache. But if the suggestion is `there are SoC where
> > > >>> L2 cache are also shared, and the new API should be
> > > >>> provisioned`, I am also
> > > >>>
> > > >>> comfortable with the thought.
> > > >>>
> > > >>
> > > >> Rather than some AMD special case API hacked into <rte_lcore.h>,
> > > >> I think we are better off with no DPDK API at all for this kind of
> functionality.
> > > >
> > > > Hi Mattias, as shared in the earlier email thread, this is not a
> > > > AMD special
> > > case at all. Let me try to explain this one more time. One of
> > > techniques used to increase cores cost effective way to go for tiles of
> compute complexes.
> > > > This introduces a bunch of cores in sharing same Last Level Cache
> > > > (namely
> > > L2, L3 or even L4) depending upon cache topology architecture.
> > > >
> > > > The API suggested in RFC is to help end users to selectively use
> > > > cores under
> > > same Last Level Cache Hierarchy as advertised by OS (irrespective of
> > > the BIOS settings used). This is useful in both bare-metal and container
> environment.
> > > >
> > >
> > > I'm pretty familiar with AMD CPUs and the use of tiles (including
> > > the challenges these kinds of non-uniformities pose for work scheduling).
> > >
> > > To maximize performance, caring about core<->LLC relationship may
> > > well not be enough, and more HT/core/cache/memory topology
> > > information is required. That's what I meant by special case. A
> > > proper API should allow access to information about which lcores are
> > > SMT siblings, cores on the same L2, and cores on the same L3, to
> > > name a few things. Probably you want to fit NUMA into the same API
> > > as well, although that is available already in <rte_lcore.h>.
> >
> > Thank you Mattias for the information, as shared by in the reply with
> Anatoly we want expose a new API `rte_get_next_lcore_ex` which intakes a
> extra argument `u32 flags`.
> > The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
> RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
> RTE_GET_LCORE_BOOST_DISABLED.
> >
>
> For the naming, would "rte_get_next_sibling_core" (or lcore if you prefer) be a
> clearer name than just adding "ex" on to the end of the existing function?
Thank you Bruce, please find my answer below.

Functions shared as per the RFC were
```
 - rte_get_llc_first_lcores: Retrieves all the first lcores in the shared LLC.
 - rte_get_llc_lcore: Retrieves all lcores that share the LLC.
 - rte_get_llc_n_lcore: Retrieves the first n or skips the first n lcores in the shared LLC.
```

MACROs extending the usability were
```
RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from each LLC.
RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first worker lcore from each LLC.
RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on hint (lcore id).
RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC while skipping first worker.
RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores from each LLC.
RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then iterates through remaining lcores in each LLC.
```

Based on the discussions, we agreed on sharing a version-2 RFC extending the API as `rte_get_next_lcore_extnd` with an extra argument `flags`.
As per my ideation, the API `rte_get_next_sibling_core` can easily be covered by the above API with the flag `RTE_GET_LCORE_L1` (SMT). Is this the right understanding?
We can easily have simple MACROs like `RTE_LCORE_FOREACH_L1` which allow iterating over SMT sibling threads, as sketched below.
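
For illustration, a purely hypothetical sketch of how this could look is below; `rte_get_next_lcore_extnd`, the flag value and the macro are not existing DPDK APIs, they only mirror the names discussed in this thread and follow the shape of the existing `rte_get_next_lcore(i, skip_main, wrap)` iterator.
```
#include <stdint.h>
#include <rte_lcore.h>
#include <rte_launch.h>

/* Hypothetical prototype and flag, mirroring the proposal in this thread. */
unsigned int rte_get_next_lcore_extnd(unsigned int i, int skip_main,
				      int wrap, uint32_t flags);
#define RTE_GET_LCORE_L3 (1u << 2)	/* illustrative value only */

/* Hypothetical convenience macro built on top of the extended iterator. */
#define RTE_LCORE_FOREACH_L3(i)						   \
	for ((i) = rte_get_next_lcore_extnd(-1, 0, 0, RTE_GET_LCORE_L3);  \
	     (i) < RTE_MAX_LCORE;					   \
	     (i) = rte_get_next_lcore_extnd((i), 0, 0, RTE_GET_LCORE_L3))

static int stage_main(void *arg) { (void)arg; return 0; }

static void
launch_within_llc(void)
{
	unsigned int lcore;

	/* Launch one worker on each lcore returned for the L3 grouping. */
	RTE_LCORE_FOREACH_L3(lcore)
		rte_eal_remote_launch(stage_main, NULL, lcore);
}
```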

>
> Looking logically, I'm not sure about the BOOST_ENABLED and
> BOOST_DISABLED flags you propose
The idea for BOOST_ENABLED & BOOST_DISABLED is based on the DPDK power library, which allows enabling boost.
It allows the user to select lcores where BOOST is enabled or disabled using a MACRO or API.

> - in a system with multiple possible
> standard and boost frequencies what would those correspond to?
I now understand the confusion; apologies for mixing the AMD EPYC SoC boost with Intel Turbo.

Thank you for pointing this out, we will use the terminology `RTE_GET_LCORE_TURBO`.

> What's also
> missing is a define for getting actual NUMA siblings i.e. those sharing common
> memory but not an L3 or anything else.
This can be extended in `rte_get_next_lcore_extnd` with the flag `RTE_GET_LCORE_NUMA`. This will allow grabbing all lcores under the same sub-memory NUMA domain as the given lcore.
If SMT is enabled and the DPDK lcore mask covers the sibling threads, then `RTE_GET_LCORE_NUMA` gets all lcores and sibling threads under the same memory NUMA domain as the given lcore.

>
> My suggestion would be to have the function take just an integer-type e.g.
> uint16_t parameter which defines the memory/cache hierarchy level to use, 0
> being lowest, 1 next, and so on. Different systems may have different numbers
> of cache levels so lets just make it a zero-based index of levels, rather than
> giving explicit defines (except for memory which should probably always be
> last). The zero-level will be for "closest neighbour"
Good idea, we did prototype this internally. But the issue is that it keeps adding to the number of APIs in the lcore library.
To keep the API count low, we are using the lcore id as a hint to the sub-NUMA domain.

> whatever that happens to be, with as many levels as is necessary to express
> the topology, e.g. without SMT, but with 3 cache levels, level 0 would be an L2
> neighbour, level 1 an L3 neighbour. If the L3 was split within a memory NUMA
> node, then level 2 would give the NUMA siblings. We'd just need an API to
> return the max number of levels along with the iterator.
We are using the lcore's NUMA domain as the hint.

>
> Regards,
> /Bruce
  
Vipin Varghese Sept. 12, 2024, 2:28 a.m. UTC | #27
[Public]

<snipped>


> > What use case do you have in mind? What's on top of my list is a scenario
> where a DPDK app gets a bunch of cores (e.g., -l <cores>) and tries to figure
> out how best make use of them. It's not going to "skip" (ignore, leave unused)
> SMT siblings, or skip non-boosted cores, it would just try to be clever in
> regards to which cores to use for what purpose.
> >
> >> This is AMD EPYC SoC agnostic and trying to address for all generic cases.
> >> Please do let us know if we (Ferruh & myself) can sync up via call?
> >
> > Sure, I can do that.
> >
> Can this be opened to the rest of the community? This is a common problem
> that needs to be solved for multiple architectures. I would be interested in
> attending.

Hi Honappa, we can accommodate the same. Let me work with Ferruh on how to set it up as a tech discussion.

<snipped>
  
Vipin Varghese Sept. 12, 2024, 2:38 a.m. UTC | #28
[AMD Official Use Only - AMD Internal Distribution Only]


<snipped>
> >
> > For the naming, would "rte_get_next_sibling_core" (or lcore if you
> > prefer) be a clearer name than just adding "ex" on to the end of the
> > existing function?
> >
> > Looking logically, I'm not sure about the BOOST_ENABLED and
> > BOOST_DISABLED flags you propose - in a system with multiple possible
> > standard and boost frequencies what would those correspond to? What's
> > also missing is a define for getting actual NUMA siblings i.e. those
> > sharing common memory but not an L3 or anything else.
> >
> > My suggestion would be to have the function take just an integer-type e.g.
> > uint16_t parameter which defines the memory/cache hierarchy level to
> > use, 0 being lowest, 1 next, and so on. Different systems may have
> > different numbers of cache levels so lets just make it a zero-based
> > index of levels, rather than giving explicit defines (except for
> > memory which should probably always be last). The zero-level will be for
> "closest neighbour"
> > whatever that happens to be, with as many levels as is necessary to
> > express the topology, e.g. without SMT, but with 3 cache levels, level
> > 0 would be an L2 neighbour, level 1 an L3 neighbour. If the L3 was
> > split within a memory NUMA node, then level 2 would give the NUMA
> > siblings. We'd just need an API to return the max number of levels along with
> the iterator.
>
> Sounds like a neat idea to me.

Hi Konstantin, I have tried my best to address Bruce's comment. Let me try to recap:
1. we want a vendor-agnostic API which allows end users to get a list of lcores
2. this can be based on L1 (SMT), L2, L3, NUMA, TURBO (as of now)
3. instead of creating multiple different APIs, we would like to add 1 API `rte_get_next_lcore_extnd` which can be controlled with `flags`
4. the flag can be a single value or a combination (like L3|TURBO_ENABLED or NUMA|TURBO_ENABLED)
5. as per my current idea, we can expand ease of use via MACROs and not APIs

I hope this justifies why we should have 1 extended API and wrap things with MACROs.
May I set up, or add to, a tech discussion call with Mattias and Honappa too?
  
Mattias Rönnblom Sept. 12, 2024, 6:38 a.m. UTC | #29
On 2024-09-12 03:33, Varghese, Vipin wrote:
> [Public]
> 
> Snipped
> 
>>>>>>
>>>>>> <snipped>
>>>>>>
>>>>>>>> <snipped>
>>>>>>>>
>>>>>>>> Thank you Mattias for the comments and question, please let me
>>>>>>>> try to explain the same below
>>>>>>>>
>>>>>>>>> We shouldn't have a separate CPU/cache hierarchy API instead?
>>>>>>>>
>>>>>>>> Based on the intention to bring in CPU lcores which share same L3
>>>>>>>> (for better cache hits and less noisy neighbor) current API
>>>>>>>> focuses on using
>>>>>>>>
>>>>>>>> Last Level Cache. But if the suggestion is `there are SoC where
>>>>>>>> L2 cache are also shared, and the new API should be provisioned`,
>>>>>>>> I am also
>>>>>>>>
>>>>>>>> comfortable with the thought.
>>>>>>>>
>>>>>>>
>>>>>>> Rather than some AMD special case API hacked into <rte_lcore.h>, I
>>>>>>> think we are better off with no DPDK API at all for this kind of
>> functionality.
>>>>>>
>>>>>> Hi Mattias, as shared in the earlier email thread, this is not a
>>>>>> AMD special
>>>>> case at all. Let me try to explain this one more time. One of
>>>>> techniques used to increase cores cost effective way to go for tiles of
>> compute complexes.
>>>>>> This introduces a bunch of cores in sharing same Last Level Cache
>>>>>> (namely
>>>>> L2, L3 or even L4) depending upon cache topology architecture.
>>>>>>
>>>>>> The API suggested in RFC is to help end users to selectively use
>>>>>> cores under
>>>>> same Last Level Cache Hierarchy as advertised by OS (irrespective of
>>>>> the BIOS settings used). This is useful in both bare-metal and container
>> environment.
>>>>>>
>>>>>
>>>>> I'm pretty familiar with AMD CPUs and the use of tiles (including
>>>>> the challenges these kinds of non-uniformities pose for work scheduling).
>>>>>
>>>>> To maximize performance, caring about core<->LLC relationship may
>>>>> well not be enough, and more HT/core/cache/memory topology
>>>>> information is required. That's what I meant by special case. A
>>>>> proper API should allow access to information about which lcores are
>>>>> SMT siblings, cores on the same L2, and cores on the same L3, to
>>>>> name a few things. Probably you want to fit NUMA into the same API
>>>>> as well, although that is available already in <rte_lcore.h>.
>>>> Thank you Mattias for the information, as shared by in the reply with
>> Anatoly we want expose a new API `rte_get_next_lcore_ex` which intakes a
>> extra argument `u32 flags`.
>>>> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
>> RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
>> RTE_GET_LCORE_BOOST_DISABLED.
>>>
>>> Wouldn't using that API be pretty awkward to use?
> Current API available under DPDK is ` rte_get_next_lcore`, which is used within DPDK example and in customer solution.
> Based on the comments from others we responded to the idea of changing the new Api from ` rte_get_next_lcore_llc` to ` rte_get_next_lcore_exntd`.
> 
> Can you please help us understand what is `awkward`.
> 

The awkwardness starts when you are trying to provide hwloc-type 
information over an API that was designed for iterating over lcores.

It seems to me that you should either have:
A) An API in similar to that of hwloc (or any DOM-like API), which would 
give a low-level description of the hardware in implementation terms. 
The topology would consist of nodes, with attributes, etc, where nodes 
are things like cores or instances of caches of some level and 
attributes are things like actual and nominal (and maybe max) CPU 
frequency, cache size, or memory size.
or
B) An API to be directly useful for a work scheduler, in which case you 
should abstract away things like "boost" (and fold them into some 
abstract capacity notion, together with core "size" [in 
big-little/heterogeneous systems]), and have an abstract notion of what 
core is "close" to some other core. This would be something like Linux's 
scheduling domains.

If you want B you probably need A as a part of its implementation, so 
you may just as well start with A, I suppose.

What you could do to explore the API design is to add support for, for 
example, boost core awareness or SMT affinity in the SW scheduler. You 
could also do an "lstopo" equivalent, since that's needed for debugging 
and exploration, if nothing else.

One question that will have to be answered in a work scheduling scenario 
is "are these two lcores SMT siblings," or "are these two cores on the 
same LLC", or "give me all lcores on a particular L2 cache".

>>>
>>> I mean, what you have is a topology, with nodes of different types and with
>> different properties, and you want to present it to the user.
> Let me be clear, what we want via DPDK to help customer to use an Unified API which works across multiple platforms.
> Example - let a vendor have 2 products namely A and B. CPU-A has all cores within same SUB-NUMA domain and CPU-B has cores split to 2 sub-NUMA domain based on split LLC.
> When `rte_get_next_lcore_extnd` is invoked for `LLC` on
> 1. CPU-A: it returns all cores as there is no split
> 2. CPU-B: it returns cores from specific sub-NUMA which is partitioned by L3
> 

I think the function name rte_get_next_lcore_extnd() alone makes clear 
this is an awkward API. :)

My gut feeling is to make it more explicit and forget about 
<rte_lcore.h>. <rte_hwtopo.h>? Could and should still be EAL.

>>>
>>> In a sense, it's similar to XCM and DOM versus SAX. The above is SAX-style,
>> and what I have in mind is something DOM-like.
>>>
>>> What use case do you have in mind? What's on top of my list is a scenario
>> where a DPDK app gets a bunch of cores (e.g., -l <cores>) and tries to figure
>> out how best make use of them.
> Exactly.
> 
>   It's not going to "skip" (ignore, leave unused)
>> SMT siblings, or skip non-boosted cores, it would just try to be clever in
>> regards to which cores to use for what purpose.
> Let me try to share my idea on SMT sibling. When user invoked for rte_get_next_lcore_extnd` is invoked for `L1 | SMT` flag with `lcore`; the API identifies first whether given lcore is part of enabled core list.
> If yes, it programmatically either using `sysfs` or `hwloc library (shared the version concern on distros. Will recheck again)` identify the sibling thread and return.
> If there is no sibling thread available under DPDK it will fetch next lcore (probably lcore +1 ).
> 

Distributions having old hwloc versions isn't an argument for a new DPDK 
library or new API. If only that was the issue, then it would be better 
to help the hwloc and/or distributions, rather than the DPDK project.

>>>
>>>> This is AMD EPYC SoC agnostic and trying to address for all generic cases.
>>>> Please do let us know if we (Ferruh & myself) can sync up via call?
>>>
>>> Sure, I can do that.
> 
> Let me sync with Ferruh and get a time slot for internal sync.
> 
>>>
>> Can this be opened to the rest of the community? This is a common problem
>> that needs to be solved for multiple architectures. I would be interested in
>> attending.
> Thank you Mattias, in DPDK Bangkok summit 2024 we did bring this up. As per the suggestion from Thomas and Jerrin we tried to bring the RFC for discussion.
> For DPDK Montreal 2024, Keesang and Ferruh (most likely) is travelling for the summit and presenting this as the talk to get things moving.
> 
>>
>>>>>
> <snipped>
  
Mattias Rönnblom Sept. 12, 2024, 7:02 a.m. UTC | #30
On 2024-09-12 08:38, Mattias Rönnblom wrote:
> On 2024-09-12 03:33, Varghese, Vipin wrote:
>> [Public]
>>
>> Snipped
>>
>>>>>>>
>>>>>>> <snipped>
>>>>>>>
>>>>>>>>> <snipped>
>>>>>>>>>
>>>>>>>>> Thank you Mattias for the comments and question, please let me
>>>>>>>>> try to explain the same below
>>>>>>>>>
>>>>>>>>>> We shouldn't have a separate CPU/cache hierarchy API instead?
>>>>>>>>>
>>>>>>>>> Based on the intention to bring in CPU lcores which share same L3
>>>>>>>>> (for better cache hits and less noisy neighbor) current API
>>>>>>>>> focuses on using
>>>>>>>>>
>>>>>>>>> Last Level Cache. But if the suggestion is `there are SoC where
>>>>>>>>> L2 cache are also shared, and the new API should be provisioned`,
>>>>>>>>> I am also
>>>>>>>>>
>>>>>>>>> comfortable with the thought.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Rather than some AMD special case API hacked into <rte_lcore.h>, I
>>>>>>>> think we are better off with no DPDK API at all for this kind of
>>> functionality.
>>>>>>>
>>>>>>> Hi Mattias, as shared in the earlier email thread, this is not a
>>>>>>> AMD special
>>>>>> case at all. Let me try to explain this one more time. One of
>>>>>> techniques used to increase cores cost effective way to go for 
>>>>>> tiles of
>>> compute complexes.
>>>>>>> This introduces a bunch of cores in sharing same Last Level Cache
>>>>>>> (namely
>>>>>> L2, L3 or even L4) depending upon cache topology architecture.
>>>>>>>
>>>>>>> The API suggested in RFC is to help end users to selectively use
>>>>>>> cores under
>>>>>> same Last Level Cache Hierarchy as advertised by OS (irrespective of
>>>>>> the BIOS settings used). This is useful in both bare-metal and 
>>>>>> container
>>> environment.
>>>>>>>
>>>>>>
>>>>>> I'm pretty familiar with AMD CPUs and the use of tiles (including
>>>>>> the challenges these kinds of non-uniformities pose for work 
>>>>>> scheduling).
>>>>>>
>>>>>> To maximize performance, caring about core<->LLC relationship may
>>>>>> well not be enough, and more HT/core/cache/memory topology
>>>>>> information is required. That's what I meant by special case. A
>>>>>> proper API should allow access to information about which lcores are
>>>>>> SMT siblings, cores on the same L2, and cores on the same L3, to
>>>>>> name a few things. Probably you want to fit NUMA into the same API
>>>>>> as well, although that is available already in <rte_lcore.h>.
>>>>> Thank you Mattias for the information, as shared by in the reply with
>>> Anatoly we want expose a new API `rte_get_next_lcore_ex` which intakes a
>>> extra argument `u32 flags`.
>>>>> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
>>> RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
>>> RTE_GET_LCORE_BOOST_DISABLED.
>>>>
>>>> Wouldn't using that API be pretty awkward to use?
>> Current API available under DPDK is ` rte_get_next_lcore`, which is 
>> used within DPDK example and in customer solution.
>> Based on the comments from others we responded to the idea of changing 
>> the new Api from ` rte_get_next_lcore_llc` to ` 
>> rte_get_next_lcore_exntd`.
>>
>> Can you please help us understand what is `awkward`.
>>
> 
> The awkwardness starts when you are trying to fit provide hwloc type 
> information over an API that was designed for iterating over lcores.
> 
> It seems to me that you should either have:
> A) An API in similar to that of hwloc (or any DOM-like API), which would 
> give a low-level description of the hardware in implementation terms. 
> The topology would consist of nodes, with attributes, etc, where nodes 
> are things like cores or instances of caches of some level and 
> attributes are things like CPU actual and nominal, and maybe max 
> frequency, cache size, or memory size.


To be clear; it's something like this I think of when I say 
"DOM-style" API.

#ifndef RTE_HWTOPO_H
#define RTE_HWTOPO_H

struct rte_hwtopo_node;

enum rte_hwtopo_node_type {
     RTE_HWTOPO_NODE_TYPE_CPU_CORE,
     RTE_HWTOPO_NODE_TYPE_CACHE,
     RTE_HWTOPO_NODE_TYPE_NUMA
};

int
rte_hwtopo_init(void);

struct rte_hwtopo_node *
rte_hwtopo_get_core_by_lcore(unsigned int lcore);

struct rte_hwtopo_node *
rte_hwtopo_get_core_by_id(unsigned int os_cpu_id);

struct rte_hwtopo_node *
rte_hwtopo_parent(struct rte_hwtopo_node *node);

struct rte_hwtopo_node *
rte_hwtopo_first_child(struct rte_hwtopo_node *node);

struct rte_hwtopo_node *
rte_hwtopo_next_child(struct rte_hwtopo_node *node,
		      struct rte_hwtopo_node *child);

struct rte_hwtopo_node *
rte_hwtopo_first_sibling(struct rte_hwtopo_node *node);

struct rte_hwtopo_node *
rte_hwtopo_next_sibling(struct rte_hwtopo_node *node,
			struct rte_hwtopo_node *child);

enum rte_hwtopo_node_type
rte_hwtopo_get_type(struct rte_hwtopo_node *node);

#define RTE_HWTOPO_NODE_ATTR_CORE_FREQUENCY_NOMINAL 0
#define RTE_HWTOPO_NODE_ATTR_CACHE_LEVEL 1
#define RTE_HWTOPO_NODE_ATTR_CACHE_SIZE 2

int
rte_hwtopo_get_attr_int64(struct rte_hwtopo_node *node,
			  unsigned int attr_name,
			  int64_t *attr_value);

int
rte_hwtopo_get_attr_str(struct rte_hwtopo_node *node,
			unsigned int attr_name,
			char *attr_value, size_t capacity);

#endif

Surely, this too would be awkward (or should I say cumbersome) to use in 
certain scenarios. You could have syntactic sugar/special case helpers 
which address common use cases. You would also build abstractions on top 
of this (like the B case below).

One could have node-type-specific functions instead of generic getters 
and setters. Anyway, this is not a counter-proposal, but rather just to 
make clear what I had in mind.
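
As a usage sketch against that hypothetical header (again, nothing here 
exists in DPDK), answering "under which L3 does this lcore sit" could 
look something like:

static struct rte_hwtopo_node *
llc_of_lcore(unsigned int lcore)
{
	struct rte_hwtopo_node *node = rte_hwtopo_get_core_by_lcore(lcore);

	/* Walk up the topology tree until a level-3 cache node is found. */
	while (node != NULL) {
		if (rte_hwtopo_get_type(node) == RTE_HWTOPO_NODE_TYPE_CACHE) {
			int64_t level;

			if (rte_hwtopo_get_attr_int64(node,
			    RTE_HWTOPO_NODE_ATTR_CACHE_LEVEL, &level) == 0 &&
			    level == 3)
				return node;
		}
		node = rte_hwtopo_parent(node);
	}

	return NULL;
}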

> or
> B) An API to be directly useful for a work scheduler, in which case you 
> should abstract away things like "boost" (and fold them into some 
> abstract capacity notion, together with core "size" [in 
> big-little/heterogeneous systems]), and have an abstract notion of what 
> core is "close" to some other core. This would something like Linux' 
> scheduling domains.
> 
> If you want B you probably need A as a part of its implementation, so 
> you may just as well start with A, I suppose.
> 
> What you could do to explore the API design is to add support for, for 
> example, boost core awareness or SMT affinity in the SW scheduler. You 
> could also do an "lstopo" equivalent, since that's needed for debugging 
> and exploration, if nothing else.
> 
> One question that will have to be answered in a work scheduling scenario 
> is "are these two lcores SMT siblings," or "are these two cores on the 
> same LLC", or "give me all lcores on a particular L2 cache".
> 
>>>>
>>>> I mean, what you have is a topology, with nodes of different types 
>>>> and with
>>> different properties, and you want to present it to the user.
>> Let me be clear, what we want via DPDK to help customer to use an 
>> Unified API which works across multiple platforms.
>> Example - let a vendor have 2 products namely A and B. CPU-A has all 
>> cores within same SUB-NUMA domain and CPU-B has cores split to 2 
>> sub-NUMA domain based on split LLC.
>> When `rte_get_next_lcore_extnd` is invoked for `LLC` on
>> 1. CPU-A: it returns all cores as there is no split
>> 2. CPU-B: it returns cores from specific sub-NUMA which is partitioned 
>> by L3
>>
> 
> I think the function name rte_get_next_lcore_extnd() alone makes clear 
> this is an awkward API. :)
> 
> My gut feeling is to make it more explicit and forget about 
> <rte_lcore.h>. <rte_hwtopo.h>? Could and should still be EAL.
> 
>>>>
>>>> In a sense, it's similar to XCM and DOM versus SAX. The above is 
>>>> SAX-style,
>>> and what I have in mind is something DOM-like.
>>>>
>>>> What use case do you have in mind? What's on top of my list is a 
>>>> scenario
>>> where a DPDK app gets a bunch of cores (e.g., -l <cores>) and tries 
>>> to figure
>>> out how best make use of them.
>> Exactly.
>>
>>   It's not going to "skip" (ignore, leave unused)
>>> SMT siblings, or skip non-boosted cores, it would just try to be 
>>> clever in
>>> regards to which cores to use for what purpose.
>> Let me try to share my idea on SMT sibling. When user invoked for 
>> rte_get_next_lcore_extnd` is invoked for `L1 | SMT` flag with `lcore`; 
>> the API identifies first whether given lcore is part of enabled core 
>> list.
>> If yes, it programmatically either using `sysfs` or `hwloc library 
>> (shared the version concern on distros. Will recheck again)` identify 
>> the sibling thread and return.
>> If there is no sibling thread available under DPDK it will fetch next 
>> lcore (probably lcore +1 ).
>>
> 
> Distributions having old hwloc versions isn't an argument for a new DPDK 
> library or new API. If only that was the issue, then it would be better 
> to help the hwloc and/or distributions, rather than the DPDK project.
> 
>>>>
>>>>> This is AMD EPYC SoC agnostic and trying to address for all generic 
>>>>> cases.
>>>>> Please do let us know if we (Ferruh & myself) can sync up via call?
>>>>
>>>> Sure, I can do that.
>>
>> Let me sync with Ferruh and get a time slot for internal sync.
>>
>>>>
>>> Can this be opened to the rest of the community? This is a common 
>>> problem
>>> that needs to be solved for multiple architectures. I would be 
>>> interested in
>>> attending.
>> Thank you Mattias, in DPDK Bangkok summit 2024 we did bring this up. 
>> As per the suggestion from Thomas and Jerrin we tried to bring the RFC 
>> for discussion.
>> For DPDK Montreal 2024, Keesang and Ferruh (most likely) is travelling 
>> for the summit and presenting this as the talk to get things moving.
>>
>>>
>>>>>>
>> <snipped>
  
Bruce Richardson Sept. 12, 2024, 9:17 a.m. UTC | #31
On Thu, Sep 12, 2024 at 02:19:07AM +0000, Varghese, Vipin wrote:
>    [Public]
> 
>    <snipped>
> 
> 
> 
>    > > > > <snipped>
> 
>    > > > >
> 
>    > > > >>> <snipped>
> 
>    > > > >>>
> 
>    > > > >>> Thank you Mattias for the comments and question, please let
>    me
> 
>    > > > >>> try to explain the same below
> 
>    > > > >>>
> 
>    > > > >>>> We shouldn't have a separate CPU/cache hierarchy API
>    instead?
> 
>    > > > >>>
> 
>    > > > >>> Based on the intention to bring in CPU lcores which share
>    same
> 
>    > > > >>> L3 (for better cache hits and less noisy neighbor) current
>    API
> 
>    > > > >>> focuses on using
> 
>    > > > >>>
> 
>    > > > >>> Last Level Cache. But if the suggestion is `there are SoC
>    where
> 
>    > > > >>> L2 cache are also shared, and the new API should be
> 
>    > > > >>> provisioned`, I am also
> 
>    > > > >>>
> 
>    > > > >>> comfortable with the thought.
> 
>    > > > >>>
> 
>    > > > >>
> 
>    > > > >> Rather than some AMD special case API hacked into
>    <rte_lcore.h>,
> 
>    > > > >> I think we are better off with no DPDK API at all for this
>    kind of
> 
>    > functionality.
> 
>    > > > >
> 
>    > > > > Hi Mattias, as shared in the earlier email thread, this is not
>    a
> 
>    > > > > AMD special
> 
>    > > > case at all. Let me try to explain this one more time. One of
> 
>    > > > techniques used to increase cores cost effective way to go for
>    tiles of
> 
>    > compute complexes.
> 
>    > > > > This introduces a bunch of cores in sharing same Last Level
>    Cache
> 
>    > > > > (namely
> 
>    > > > L2, L3 or even L4) depending upon cache topology architecture.
> 
>    > > > >
> 
>    > > > > The API suggested in RFC is to help end users to selectively
>    use
> 
>    > > > > cores under
> 
>    > > > same Last Level Cache Hierarchy as advertised by OS (irrespective
>    of
> 
>    > > > the BIOS settings used). This is useful in both bare-metal and
>    container
> 
>    > environment.
> 
>    > > > >
> 
>    > > >
> 
>    > > > I'm pretty familiar with AMD CPUs and the use of tiles (including
> 
>    > > > the challenges these kinds of non-uniformities pose for work
>    scheduling).
> 
>    > > >
> 
>    > > > To maximize performance, caring about core<->LLC relationship may
> 
>    > > > well not be enough, and more HT/core/cache/memory topology
> 
>    > > > information is required. That's what I meant by special case. A
> 
>    > > > proper API should allow access to information about which lcores
>    are
> 
>    > > > SMT siblings, cores on the same L2, and cores on the same L3, to
> 
>    > > > name a few things. Probably you want to fit NUMA into the same
>    API
> 
>    > > > as well, although that is available already in <rte_lcore.h>.
> 
>    > >
> 
>    > > Thank you Mattias for the information, as shared by in the reply
>    with
> 
>    > Anatoly we want expose a new API `rte_get_next_lcore_ex` which
>    intakes a
> 
>    > extra argument `u32 flags`.
> 
>    > > The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
> 
>    > RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
> 
>    > RTE_GET_LCORE_BOOST_DISABLED.
> 
>    > >
> 
>    >
> 
>    > For the naming, would "rte_get_next_sibling_core" (or lcore if you
>    prefer) be a
> 
>    > clearer name than just adding "ex" on to the end of the existing
>    function?
> 
>    Thank you Bruce, Please find my answer below
> 
> 
> 
>    Functions shared as per the RFC were
> 
>    ```
> 
>    - rte_get_llc_first_lcores: Retrieves all the first lcores in the
>    shared LLC.
> 
>    - rte_get_llc_lcore: Retrieves all lcores that share the LLC.
> 
>    - rte_get_llc_n_lcore: Retrieves the first n or skips the first n
>    lcores in the shared LLC.
> 
>    ```
> 
> 
> 
>    MACRO’s extending the usability were
> 
>    ```
> 
>    RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from each
>    LLC.
> 
>    RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first worker
>    lcore from each LLC.
> 
>    RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on hint
>    (lcore id).
> 
>    RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC while
>    skipping first worker.
> 
>    RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores from
>    each LLC.
> 
>    RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then
>    iterates through reaming lcores in each LLC.
> 
>    ```
> 
> 
> 
>    Based on the discussions we agreed on sharing version-2 FRC for
>    extending API as `rte_get_next_lcore_extnd` with extra argument as
>    `flags`.
> 
>    As per my ideation, for the API ` rte_get_next_sibling_core`, the above
>    API can easily with flag ` RTE_GET_LCORE_L1 (SMT)`. Is this right
>    understanding?
> 
>    We can easily have simple MACROs like `RTE_LCORE_FOREACH_L1` which
>    allows to iterate SMT sibling threads.
> 
> 

This seems like a lot of new macro and API additions! I'd really like to
cut that back and simplify the amount of new things we are adding to DPDK
for this. I tend to agree with others that external libs would be better
for apps that really want to deal with all this.

> 
>    >
> 
>    > Looking logically, I'm not sure about the BOOST_ENABLED and
> 
>    > BOOST_DISABLED flags you propose
> 
>    The idea for the BOOST_ENABLED & BOOST_DISABLED is based on DPDK power
>    library which allows to enable boost.
> 
>    Allow user to select lcores where BOOST is enabled|disabled using MACRO
>    or API.
> 
> 
> 
>    - in a system with multiple possible
> 
>    > standard and boost frequencies what would those correspond to?
> 
>    I now understand the confusion, apologies for mixing the AMD EPYC SoC
>    boost with Intel Turbo.
> 
> 
> 
>    Thank you for pointing out, we will use the terminology `
>    RTE_GET_LCORE_TURBO`.
> 
> 

That still doesn't clarify it for me. If you start mixing in power
management related functions in with topology ones things will turn into a
real headache. What does boost or turbo correspond to? Is it for cores that
have the feature enabled - whether or not it's currently in use - or is it
for finding cores that are currently boosted? Do we need additions for
cores that are boosted by 100Mhz vs say 300Mhz. What about cores that are
in lower frequencies for power-saving. Do we add macros for finding those?

> 
>    What's also
> 
>    > missing is a define for getting actual NUMA siblings i.e. those
>    sharing common
> 
>    > memory but not an L3 or anything else.
> 
>    This can be extended into `rte_get_next_lcore_extnd` with flag `
>    RTE_GET_LCORE_NUMA`. This will allow to grab all lcores under the same
>    sub-memory NUMA as shared by LCORE.
> 
>    If SMT sibling is enabled and DPDK Lcore mask covers the sibling
>    threads, then ` RTE_GET_LCORE_NUMA` get all lcore and sibling threads
>    under same memory NUMA of lcore shared.
> 
> 

Yes. That can work. But it means we are basing the implementation on a
fixed idea of what topologies there are or can exist. My suggestion below
is just to ignore the whole idea of L1 vs L2 vs NUMA - just give the app a
way to find its nearest nodes.

After all, the app doesn't want to know the topology just for the sake of
knowing it - it wants it to ensure best placement of work on cores! To that
end, it just needs to know what cores are near to each other and what are
far away.

> 
>    >
> 
>    > My suggestion would be to have the function take just an integer-type
>    e.g.
> 
>    > uint16_t parameter which defines the memory/cache hierarchy level to
>    use, 0
> 
>    > being lowest, 1 next, and so on. Different systems may have different
>    numbers
> 
>    > of cache levels so lets just make it a zero-based index of levels,
>    rather than
> 
>    > giving explicit defines (except for memory which should probably
>    always be
> 
>    > last). The zero-level will be for "closest neighbour"
> 
>    Good idea, we did prototype this internally. But issue it will keep on
>    adding the number of API into lcore library.
> 
>    To keep the API count less, we are using lcore id as hint to sub-NUMA.
> 

I'm unclear about this keeping the API count down - you are proposing a lot
of APIs and macros up above. My suggestion is basically to add two APIs and
no macros: one API to get the max number of topology-nearness levels, and a
second API to get the next sibling at a given nearness level from
0(nearest)..N(furthest). If we want, we can also add a FOREACH macro too.
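
To spell that out (the names below are only placeholders, nothing here
exists in DPDK), the whole surface could be roughly:

/* Number of topology-nearness levels available on this system. */
unsigned int rte_lcore_topology_levels(void);

/* Next lcore after "i" within the given nearness "level" of "ref";
 * level 0 is the closest neighbours, higher levels are progressively
 * further away, the last one being common memory. */
unsigned int rte_get_next_lcore_near(unsigned int i, unsigned int ref,
				     uint16_t level);

#define RTE_LCORE_FOREACH_NEAR(i, ref, level)				\
	for ((i) = rte_get_next_lcore_near(-1, (ref), (level));	\
	     (i) < RTE_MAX_LCORE;					\
	     (i) = rte_get_next_lcore_near((i), (ref), (level)))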

Overall, though, as I say above, let's focus on the problem the app
actually wants these APIs for, not how we think we should solve it. Apps
don't want to know the topology for knowledge sake, they want to use that
knowledge to improve performance by pinning tasks to cores. What is the
minimum that we need to provide to enable the app to do that? For example,
if there are no lcores that share an L1, then from an app topology
viewpoint that L1 level may as well not exist, because it provides us no
details on how to place our work.

For the rare app that does have some esoteric use-case that does actually
want to know some intricate details of the topology, then having that app
use an external lib is probably a better solution than us trying to cover
all possible options in DPDK.

My 2c. on this at this stage anyway.

/Bruce
  
Vipin Varghese Sept. 12, 2024, 11:17 a.m. UTC | #32
[AMD Official Use Only - AMD Internal Distribution Only]


<snipped>

> >>>> Thank you Mattias for the information, as shared by in the reply
> >>>> with
> >> Anatoly we want expose a new API `rte_get_next_lcore_ex` which
> >> intakes a extra argument `u32 flags`.
> >>>> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
> >> RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
> >> RTE_GET_LCORE_BOOST_DISABLED.
> >>>
> >>> Wouldn't using that API be pretty awkward to use?
> > Current API available under DPDK is ` rte_get_next_lcore`, which is used
> within DPDK example and in customer solution.
> > Based on the comments from others we responded to the idea of changing
> the new Api from ` rte_get_next_lcore_llc` to ` rte_get_next_lcore_exntd`.
> >
> > Can you please help us understand what is `awkward`.
> >
>
> The awkwardness starts when you are trying to fit provide hwloc type
> information over an API that was designed for iterating over lcores.

I disagree with this point; the current implementation of the lcore library is only focused on iterating through the list of enabled cores, the core-mask, and the lcore-map.
With ever-increasing core counts, memory, IO and accelerators on the SoC, sub-NUMA partitioning is common in various vendor SoCs. Enhancing or augmenting the lcore API to extract or provision NUMA and cache topology is not awkward.

If memory, IO and accelerators can have sub-NUMA domains, why is it awkward to have lcores in domains? Hence I do not agree with the awkwardness argument.

>
> It seems to me that you should either have:
> A) An API in similar to that of hwloc (or any DOM-like API), which would give a
> low-level description of the hardware in implementation terms.
> The topology would consist of nodes, with attributes, etc, where nodes are
> things like cores or instances of caches of some level and attributes are things
> like CPU actual and nominal, and maybe max frequency, cache size, or memory
> size.
Here is the catch: `rte_eal_init` internally invokes `get_cpu|lcores` and populates the thread (lcore) to physical CPU mapping. But there is more than just the CPU mapping, as we have seen in SoC architectures. The argument shared by many is `DPDK is not the place for such topology discovery`.

As per my current understanding, I have to disagree with the above because it
1. forces the user to use external libraries, for example hwloc
2. forces the user to create an internal mapping for lcore, core-mask, and lcore-map with topology-aware code.

My intention is to `enable the end user to leverage the API format or a similar API format (rte_get_next_lcore)` to get the best results on any SoC (vendor agnostic).
I fail to grasp why we are asking for the CPU topology to be exported via external libraries like hwloc, while NIC, PCIe and accelerator topology is not asked to be exported that way.

Hence let us set up a tech call on Slack or Teams to understand this better.

> or
> B) An API to be directly useful for a work scheduler, in which case you should
> abstract away things like "boost"

Please note, as shared in the earlier reply to Bruce, I made the mistake of calling it boost (AMD SoC terminology). Instead it should be DPDK_TURBO.
There are use cases and DPDK examples where crypto and compression are run on cores where TURBO is enabled. This allows end users to boost when there is more work and disable boost when there is less or no work.
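
For reference, a minimal sketch with the existing power library turbo hooks is below, assuming `rte_power_init()` has already succeeded for the lcore and the platform supports turbo.
```
#include <rte_power.h>

/* Sketch: raise the frequency of a worker lcore for a heavy
 * crypto/compression burst, then drop back when the burst is done. */
static void
burst_with_turbo(unsigned int lcore_id)
{
	if (rte_power_freq_enable_turbo(lcore_id) < 0)
		return;		/* turbo not available, run at base frequency */

	/* ... process the burst of work pinned to lcore_id ... */

	rte_power_freq_disable_turbo(lcore_id);
}
```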

>  (and fold them into some abstract capacity notion, together with core "size" [in big-little/heterogeneous systems]), and
> have an abstract notion of what core is "close" to some other core. This would
> something like Linux'
> scheduling domains.

We had a similar discussion with Jerrin on the last day of the Bangkok DPDK summit. This RFC was intended to help capture this relevant point. With my current understanding, on selected Arm SoCs the little cores share an L2 cache, while this analogy does not cover all cases. But this would be a good start.


>
> If you want B you probably need A as a part of its implementation, so you may
> just as well start with A, I suppose.
>
> What you could do to explore the API design is to add support for, for
> example, boost core awareness or SMT affinity in the SW scheduler. You could
> also do an "lstopo" equivalent, since that's needed for debugging and
> exploration, if nothing else.
I am not following this analogy; we will discuss it in detail in the tech talk.

>
> One question that will have to be answered in a work scheduling scenario is
> "are these two lcores SMT siblings," or "are these two cores on the same LLC",
> or "give me all lcores on a particular L2 cache".
>
Is that not what we have been trying to address, based on Anatoly's request to generalize beyond LLC? Hence we agreed on sharing version-2 of the RFC with `rte_get_next_lcore_extnd` and `flags`.
May I ask where the disconnect is?

> >>>
> >>> I mean, what you have is a topology, with nodes of different types
> >>> and with
> >> different properties, and you want to present it to the user.
> > Let me be clear, what we want via DPDK to help customer to use an Unified
> API which works across multiple platforms.
> > Example - let a vendor have 2 products namely A and B. CPU-A has all cores
> within same SUB-NUMA domain and CPU-B has cores split to 2 sub-NUMA
> domain based on split LLC.
> > When `rte_get_next_lcore_extnd` is invoked for `LLC` on 1. CPU-A: it
> > returns all cores as there is no split 2. CPU-B: it returns cores from
> > specific sub-NUMA which is partitioned by L3
> >
>
> I think the function name rte_get_next_lcore_extnd() alone makes clear this is an awkward API. :)
I humbly disagree with this statement, as explained above.


>
> My gut feeling is to make it more explicit and forget about <rte_lcore.h>.
> <rte_hwtopo.h>? Could and should still be EAL.
For me this is like adding a new level of library and more code, while the easiest way is to add an API similar to the existing `get_next_lcore` style for easy adoption.

>
> >>>
> >>> In a sense, it's similar to XCM and DOM versus SAX. The above is
> >>> SAX-style,
> >> and what I have in mind is something DOM-like.
> >>>
> >>> What use case do you have in mind? What's on top of my list is a scenario where a DPDK app gets a bunch of cores (e.g., -l <cores>) and tries to figure out how best make use of them.
> > Exactly.
> >
> >   It's not going to "skip" (ignore, leave unused)
> >> SMT siblings, or skip non-boosted cores, it would just try to be
> >> clever in regards to which cores to use for what purpose.
> > Let me try to share my idea on SMT sibling. When user invoked for
> rte_get_next_lcore_extnd` is invoked for `L1 | SMT` flag with `lcore`; the API
> identifies first whether given lcore is part of enabled core list.
> > If yes, it programmatically either using `sysfs` or `hwloc library (shared the
> version concern on distros. Will recheck again)` identify the sibling thread and
> return.
> > If there is no sibling thread available under DPDK it will fetch next lcore
> (probably lcore +1 ).
> >
>
> Distributions having old hwloc versions isn't an argument for a new DPDK library or new API. If only that was the issue, then it would be better to help the hwloc and/or distributions, rather than the DPDK project.
I do not agree with the statement `Distributions having old hwloc versions isn't an argument for a new DPDK library or new API`, because that is not my intention. Let me be clear: on Ampere & AMD there are 2 relevant BIOS settings

1. SLC or L3 as NUMA enabled
2. NUMA for IO|memory

When `NUMA for IO|memory` is set, the hwloc library works as expected. But when `L3 as NUMA` is set, it gives incorrect details. We have been fixing this and pushing it upstream. But as I clearly shared, the number of distros having the latest hwloc is almost nil.
Hence, to keep things simple, in the DPDK documentation we pointed to the AMD SoC tuning guide, where we have been recommending not to enable `L3 as NUMA`.

Now the end goal for me is to allow a vendor-agnostic API which is easy to understand and use, and works irrespective of BIOS settings. I have enabled parsing of the OS `sysfs` in this RFC. But if the comment is to use `hwloc`, as shared in the response to Stephen, I am open to trying this again.

<snipped>
  
Vipin Varghese Sept. 12, 2024, 11:23 a.m. UTC | #33
[Public]


Snipped

>
>
> To to be clear; it's something like this I think of when I say "DOM-style" API.
>
> #ifndef RTE_HWTOPO_H
> #define RTE_HWTOPO_H
>
> struct rte_hwtopo_node;
>
> enum rte_hwtopo_node_type {
>      RTE_HWTOPO_NODE_TYPE_CPU_CORE,
>      RTE_HWTOPO_NODE_TYPE_CACHE,
>      RTE_HWTOPO_NODE_TYPE_NUMA
> };
>
> int
> rte_hwtopo_init(void);
>
> struct rte_hwtopo_node *
> rte_hwtopo_get_core_by_lcore(unsigned int lcore);
>
> struct rte_hwtopo_node *
> rte_hwtopo_get_core_by_id(unsigned int os_cpu_id);
>
> struct rte_hwtopo_node *
> rte_hwtopo_parent(struct rte_hwtopo_node *node);
>
> struct rte_hwtopo_node *
> rte_hwtopo_first_child(struct rte_hwtopo_node *node);
>
> struct rte_hwtopo_node *
> rte_hwtopo_next_child(struct rte_hwtopo_node *node,
>                       struct rte_hwtopo_node *child);
>
> struct rte_hwtopo_node *
> rte_hwtopo_first_sibling(struct rte_hwtopo_node *node);
>
> struct rte_hwtopo_node *
> rte_hwtopo_next_sibling(struct rte_hwtopo_node *node,
>                         struct rte_hwtopo_node *child);
>
> enum rte_hwtopo_node_type
> rte_hwtopo_get_type(struct rte_hwtopo_node *node);
>
> #define RTE_HWTOPO_NODE_ATTR_CORE_FREQUENCY_NOMINAL 0
> #define RTE_HWTOPO_NODE_ATTR_CACHE_LEVEL 1
> #define RTE_HWTOPO_NODE_ATTR_CACHE_SIZE 2
>
> int
> rte_hwtopo_get_attr_int64(struct rte_hwtopo_node *node, unsigned int
> attr_name,
>                           int64_t *attr_value);
>
> int
> rte_hwtopo_get_attr_str(struct rte_hwtopo_node *node, unsigned int
> attr_name,
>                         char *attr_value, size_t capacity);
>
> #endif
>
> Surely, this too would be awkward (or should I say cumbersome) to use in certain scenarios.
This appears to be more like hwloc API calls; as shared in my earlier email, my intention with the API suggestion is not to introduce a new library.
I have certain reservations, and with my current understanding I am not able to map this onto certain DPDK core mappings. Let us discuss this in the technical call.

Snipped
  
Vipin Varghese Sept. 12, 2024, 11:50 a.m. UTC | #34
[Public]

Snipped

> >
> >
> >    Based on the discussions we agreed on sharing version-2 FRC for
> >    extending API as `rte_get_next_lcore_extnd` with extra argument as
> >    `flags`.
> >
> >    As per my ideation, for the API ` rte_get_next_sibling_core`, the above
> >    API can easily with flag ` RTE_GET_LCORE_L1 (SMT)`. Is this right
> >    understanding?
> >
> >    We can easily have simple MACROs like `RTE_LCORE_FOREACH_L1` which
> >    allows to iterate SMT sibling threads.
> >
> >
>
> This seems like a lot of new macro and API additions! I'd really like to cut that
> back and simplify the amount of new things we are adding to DPDK for this.
I disagree, Bruce; as per the new conversation with Anatoly and you, it has been shared that the new APIs are
```
1. rte_get_next_lcore_extnd
2. rte_get_next_n_lcore_extnd
```

While I mentioned that custom MACROs can augment these based on typical flag usage, similar to `RTE_LCORE_FOREACH` and `RTE_LCORE_FOREACH_WORKER`, as
```
RTE_LCORE_FOREACH_FLAG
RTE_LCORE_FOREACH_WORKER_FLAG

Or

RTE_LCORE_FOREACH_LLC
RTE_LCORE_FOREACH_WORKER_LLC
```

Please note I have not even shared version-2 of the RFC yet.

> I tend to agree with others that external libs would be better for apps that really want to deal with all this.
I have covered why this is not a good idea in the reply to Mattias' query.

>
> >
> >    >
> >
> >    > Looking logically, I'm not sure about the BOOST_ENABLED and BOOST_DISABLED flags you propose
> >    The idea for the BOOST_ENABLED & BOOST_DISABLED is based on DPDK power library which allows to enable boost.
> >    Allow user to select lcores where BOOST is enabled|disabled using MACRO or API.
Maybe there is confusion, so let me try to be explicit here. The intention of any `rte_get_next_lcore##` is to fetch lcores.
Hence the new proposed API `rte_get_next_lcore_extnd` with the `flag set for Boost` is to fetch lcores where boost is enabled.
There is no intention to enable or disable boost on an lcore with a `get` API.

> >
> >
> >
> >    - in a system with multiple possible
> >
> >    > standard and boost frequencies what would those correspond to?
> >
> >    I now understand the confusion, apologies for mixing the AMD EPYC SoC
> >    boost with Intel Turbo.
> >
> >
> >
> >    Thank you for pointing out, we will use the terminology `
> >    RTE_GET_LCORE_TURBO`.
> >
> >
>
> That still doesn't clarify it for me. If you start mixing in power management related functions in with topology ones things will turn into a real headache.
Can you please tell me what is not clarified. DPDK lcores as of today has no notion of Cache, Numa, Power, Turbo or any DPDK supported features.
The initial API introduced were to expose lcore sharing the same Last Level Cache. Based on interaction with Anatoly, extending this to support multiple features turned out to be possibility.
Hence, we said we can share v2 for RFC based on this idea.

But if the claim is not to include TURBO, I am also OK with this. Let us only keep the cache and NUMA-IO domains.

> What does boost or turbo correspond to? Is it for cores that have the feature enabled - whether or not it's currently in use - or is it for finding cores that are
> currently boosted?  Do we need additions for cores that are boosted by 100Mhz vs say 300Mhz. What about cores that are in lower frequencies for
> power-saving. Do we add macros for finding those?
Why are we talking about freq-up and freq-down? This was not even discussed in this RFC patch at all.

> >
> >    What's also
> >
> >    > missing is a define for getting actual NUMA siblings i.e. those
> >    sharing common memory but not an L3 or anything else.
> >
> >    This can be extended into `rte_get_next_lcore_extnd` with flag `
> >    RTE_GET_LCORE_NUMA`. This will allow to grab all lcores under the same
> >    sub-memory NUMA as shared by LCORE.
> >
> >    If SMT sibling is enabled and DPDK Lcore mask covers the sibling
> >    threads, then ` RTE_GET_LCORE_NUMA` get all lcore and sibling threads
> >    under same memory NUMA of lcore shared.
> >
> >
>
> Yes. That can work. But it means we are basing the implementation on a fixed idea of what topologies there are or can exist.
> My suggestion below is just to ignore the whole idea of L1 vs L2 vs NUMA - just give the app a way to find it's nearest nodes.
Bruce, for different vendor SoCs, the architecture implementation is different. Let me share what I know (see the sketch after this list)
1. using L1, we can fetch SMT threads
2. using L2, we can group cores on certain Arm, Intel and Power PC SoCs, which is similar to efficient cores
3. using L3, we can group cores on certain SoCs like AMD, AF64x and others which follow a chiplet or tile-split L3 domain.
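
As a sketch only (relying on Linux kernels that expose the cacheinfo `id` attribute, and assuming `index3` is the L3), checking whether two CPUs fall in the same L3 domain from `sysfs` could look like:
```
#include <stdio.h>

/* Sketch: read /sys/devices/system/cpu/cpuX/cache/indexN/id and compare
 * the ids of two CPUs; equal ids mean the same cache instance. */
static int
cache_id(unsigned int cpu, unsigned int index)
{
	char path[128];
	int id = -1;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%u/cache/index%u/id", cpu, index);
	f = fopen(path, "r");
	if (f == NULL)
		return -1;
	if (fscanf(f, "%d", &id) != 1)
		id = -1;
	fclose(f);
	return id;
}

static int
same_l3(unsigned int cpu_a, unsigned int cpu_b)
{
	int a = cache_id(cpu_a, 3);
	int b = cache_id(cpu_b, 3);

	return a >= 0 && a == b;
}
```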

>
> After all, the app doesn't want to know the topology just for the sake of knowing it - it wants it to ensure best placement of work on cores! To that end, it just needs to know what cores are near to each other and what are far away.
Exactly, that is why we want to minimize new libraries and limit ourselves to the format of the existing API `rte_get_next_lcore`. Otherwise the end user needs to deploy another, external library and then map it onto the DPDK lcore mapping to identify what is where.
So as an end user I prefer a simple API which gets my work done.

>
> >
> >    >
> >
> >    > My suggestion would be to have the function take just an integer-type
> >    e.g.
> >
> >    > uint16_t parameter which defines the memory/cache hierarchy level to
> >    use, 0
> >
> >    > being lowest, 1 next, and so on. Different systems may have different
> >    numbers
> >
> >    > of cache levels so lets just make it a zero-based index of levels,
> >    rather than
> >
> >    > giving explicit defines (except for memory which should probably
> >    always be
> >
> >    > last). The zero-level will be for "closest neighbour"
> >
> >    Good idea, we did prototype this internally. But issue it will keep on
> >    adding the number of API into lcore library.
> >
> >    To keep the API count less, we are using lcore id as hint to sub-NUMA.
> >
>
> I'm unclear about this keeping the API count down - you are proposing a lot of APIs and macros up above.
No, I am not. I have shared that, based on the last discussion with Anatoly, we will end up with only 2 APIs in lcore. This is explained in the above response.

> My suggestion is basically to add two APIs and no macros: one API to get the max number of topology-nearness levels, and a
> second API to get the next sibling a given nearness level from
> 0(nearest)..N(furthest). If we want, we can also add a FOREACH macro too.
>
> Overall, though, as I say above, let's focus on the problem the app actually
> wants these APIs for, not how we think we should solve it. Apps don't want to
> know the topology for knowledge sake, they want to use that knowledge to
> improve performance by pinning tasks to cores. What is the minimum that we
> need to provide to enable the app to do that? For example, if there are no
> lcores that share an L1, then from an app topology viewpoint that L1 level may
> as well not exist, because it provides us no details on how to place our work.
I have shared above why we need vendor-agnostic L1, L2, L3 and sub-NUMA-IO groupings.

Snipped
  
Mattias Rönnblom Sept. 12, 2024, 11:59 a.m. UTC | #35
On 2024-09-12 13:17, Varghese, Vipin wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
> 
> <snipped>
>> >>>> Thank you Mattias for the information, as shared by in the reply
>> >>>> with
>> >> Anatoly we want expose a new API `rte_get_next_lcore_ex` which
>> >> intakes a extra argument `u32 flags`.
>> >>>> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
>> >> RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
>> >> RTE_GET_LCORE_BOOST_DISABLED.
>> >>>
>> >>> Wouldn't using that API be pretty awkward to use?
>> > Current API available under DPDK is ` rte_get_next_lcore`, which is used
>> within DPDK example and in customer solution.
>> > Based on the comments from others we responded to the idea of changing
>> the new Api from ` rte_get_next_lcore_llc` to ` rte_get_next_lcore_exntd`.
>> >
>> > Can you please help us understand what is `awkward`.
>> >
>> 
>> The awkwardness starts when you are trying to fit provide hwloc type
>> information over an API that was designed for iterating over lcores.
> I disagree to this point, current implementation of lcore libraries is 
> only focused on iterating through list of enabled cores, core-mask, and 
> lcore-map.
> With ever increasing core count, memory, io and accelerators on SoC, 
> sub-numa partitioning is common in various vendor SoC. Enhancing or 
> Augumenting lcore API to extract or provision NUMA, Cache Topology is 
> not awkward.

DPDK providing an API for this information makes sense to me, as I've 
mentioned before. What I questioned was the way it was done (i.e., the 
API design) in your RFC, and the limited scope (which in part you have 
addressed).

> If memory, IO and accelerator can have sub-NUMA domain, why is it 
> awkward to have lcore in domains? Hence I do not agree on the 
> awkwardness argument.
>> 
>> It seems to me that you should either have:
>> A) An API in similar to that of hwloc (or any DOM-like API), which would give a
>> low-level description of the hardware in implementation terms.
>> The topology would consist of nodes, with attributes, etc, where nodes are
>> things like cores or instances of caches of some level and attributes are things
>> like CPU actual and nominal, and maybe max frequency, cache size, or memory
>> size.
> Here is the catch, `rte_eal_init` internally invokes `get_cpu|lcores` 
> and populates thread (lcore) to physical CPU. But there is more than 
> just CPU mapping, as we have seeing in SoC architecture. The argument 
> shared by many is `DPDK is not the place for such topology discovery`.
> As per my current understanding, I have to disagree to the abive because
> 1. forces user to use external libraries example like hwloc
> 2. forces user to creating internal mapping for lcore, core-mask, and 
> lcore-map with topology awareness code.
> My intention is to `enable end user to leverage the API format or 
> similar API format (rte_get_next_lcore)` to get best results on any SoC 
> (vendor agnostic).
> I fail to grasp why we are asking CPU topology to exported, while NIC, 
> PCIe and accelerators are not asked to be exported via external 
> libraries like hwloc.
> Hence let us setup tech call in slack or teams to understand this better.
>> or
>> B) An API to be directly useful for a work scheduler, in which case you should
>> abstract away things like "boost"
> Please note as shared in earlier reply to Bruce, I made a mistake of 
> calling it boost (AMD SoC terminology). Instead it should DPDK_TURBO.
> There are use cases and DPDK examples, where cypto and compression are 
> run on cores where TURBO is enabled. This allows end users to boost when 
> there is more work and disable boost when there is less or no work.
>>  (and fold them into some abstract capacity notion, together with core "size" [in big-little/heterogeneous systems]), and
>> have an abstract notion of what core is "close" to some other core. This would
>> something like Linux'
>> scheduling domains.
> We had a similar discussion with Jerrin on the last day of the Bangkok DPDK 
> summit. This RFC was intended to help capture that relevant point. With 
> my current understanding, on selected SoCs the little cores on ARM SoCs 
> share the L2 cache, although this analogy does not cover all cases. But this 
> would be a good start.
>> 
>> If you want B you probably need A as a part of its implementation, so you may
>> just as well start with A, I suppose.
>> 
>> What you could do to explore the API design is to add support for, for
>> example, boost core awareness or SMT affinity in the SW scheduler. You could
>> also do an "lstopo" equivalent, since that's needed for debugging and
>> exploration, if nothing else.
> I am not following this analogy; we can discuss it in detail in the tech talk.
>> 
>> One question that will have to be answered in a work scheduling scenario is
>> "are these two lcores SMT siblings," or "are these two cores on the same LLC",
>> or "give me all lcores on a particular L2 cache".
>> 
> Is that not what we have been trying to address, based on Anatoly's request 
> to generalize beyond LLC? Hence we agreed on sharing version 2 of the RFC with 
> `rte_get_next_lcore_extnd` taking `flags`.
> May I ask where the disconnect is?
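
For reference, the flags-based extension discussed here would look something
like the sketch below. The function and flag names are proposals from this
thread, not existing DPDK APIs, so treat it as illustrative only:

```
/*
 * Illustrative sketch only: rte_get_next_lcore_extnd() and the
 * RTE_GET_LCORE_* flags are names proposed in this thread, not
 * existing DPDK APIs.
 */
#include <rte_lcore.h>
#include <rte_launch.h>
#include <rte_bitops.h>

#define RTE_GET_LCORE_L1 RTE_BIT32(0) /* SMT siblings of the current lcore */
#define RTE_GET_LCORE_L2 RTE_BIT32(1) /* lcores sharing the same L2 */
#define RTE_GET_LCORE_L3 RTE_BIT32(2) /* lcores sharing the same LLC */

/* Proposed: same semantics as rte_get_next_lcore(), but restricted to
 * lcores matching 'flags' relative to the calling lcore. */
unsigned int rte_get_next_lcore_extnd(unsigned int i, int skip_main,
                                      int wrap, uint32_t flags);

/* Launch a worker on every enabled lcore sharing the caller's LLC. */
static void
launch_llc_local_workers(lcore_function_t *worker)
{
        unsigned int lcore;

        for (lcore = rte_get_next_lcore_extnd(-1, 1, 0, RTE_GET_LCORE_L3);
             lcore < RTE_MAX_LCORE;
             lcore = rte_get_next_lcore_extnd(lcore, 1, 0, RTE_GET_LCORE_L3))
                rte_eal_remote_launch(worker, NULL, lcore);
}
```
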
>> >>>
>> >>> I mean, what you have is a topology, with nodes of different types
>> >>> and with
>> >> different properties, and you want to present it to the user.
>> > Let me be clear: what we want is for DPDK to help customers use a unified
>> > API which works across multiple platforms.
>> > Example: let a vendor have 2 products, namely A and B. CPU-A has all cores
>> > within the same sub-NUMA domain, and CPU-B has cores split into 2 sub-NUMA
>> > domains based on a split LLC.
>> > When `rte_get_next_lcore_extnd` is invoked for `LLC` on 1. CPU-A: it
>> > returns all cores, as there is no split 2. CPU-B: it returns cores from the
>> > specific sub-NUMA domain which is partitioned by L3
>> >
>> 
>> I think the function name rte_get_next_lcore_extnd() alone makes clear this is an awkward API. :)
> I humbly disagree with this statement, as explained above.
>> 
>> My gut feeling is to make it more explicit and forget about <rte_lcore.h>.
>> <rte_hwtopo.h>? Could and should still be EAL.
> For me this is like adding a new layer of library and more code, while 
> the easiest way would be to add an API similar to the existing `get_next_lcore` 
> style for easy adoption.

A poorly designed, special-case API is not less work. It's just less 
work for *you* *now*, and much more work for someone in the future to 
clean it up.

>> 
>> >>>
>> >>> In a sense, it's similar to XCM and DOM versus SAX. The above is
>> >>> SAX-style,
>> >> and what I have in mind is something DOM-like.
>> >>>
>> >>> What use case do you have in mind? What's on top of my list is a scenario where a DPDK app gets a bunch of cores (e.g., -l <cores>) and tries to figure out how best make use of them.
>> > Exactly.
>> >
>> >   It's not going to "skip" (ignore, leave unused)
>> >> SMT siblings, or skip non-boosted cores, it would just try to be
>> >> clever in regards to which cores to use for what purpose.
>> > Let me try to share my idea on SMT siblings. When `rte_get_next_lcore_extnd`
>> > is invoked with the `L1 | SMT` flag and an `lcore`, the API first identifies
>> > whether the given lcore is part of the enabled core list.
>> > If yes, it programmatically identifies the sibling thread and returns it,
>> > using either `sysfs` or the hwloc library (I shared the version concern on
>> > distros; will recheck again).
>> > If there is no sibling thread available under DPDK, it will fetch the next
>> > lcore (probably lcore + 1).
>> >
>> 
>> Distributions having old hwloc versions isn't an argument for a new DPDK library or new API. If that were the only issue, then it would be better to help the hwloc project and/or the distributions, rather than the DPDK project.
> I do not agree with the statement `Distributions having old hwloc versions 
> isn't an argument for a new DPDK library or new API`, because that is not 
> my intention. Let me be clear: on Ampere & AMD there are two relevant BIOS 
> settings
> 1. SLC or L3 as NUMA enabled
> 2. NUMA for IO|memory
> With `NUMA for IO|memory` set, the hwloc library works as expected. But 
> when `L3 as NUMA` is set, it gives incorrect details. We have been fixing 
> this and pushing the fixes upstream. But as I clearly shared, the number of 
> distros shipping the latest hwloc is almost nil.
> Hence, to keep things simple, the DPDK documentation points to the AMD SoC 
> tuning guide, where we have been recommending not to enable `L3 as NUMA`.
> Now, the end goal for me is to allow a vendor-agnostic API which is easy to 
> understand and use, and works irrespective of BIOS settings. I have 
> enabled parsing of the OS `sysfs` in the RFC. But if the comment is to use 
> `hwloc`, as shared in the response to Stephen, I am open to trying this again.
> <snipped>
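
For reference, the sysfs parsing mentioned above boils down to reading the
per-CPU cache nodes that the kernel exposes. A minimal Linux-only sketch,
with error handling trimmed (this is not the RFC code itself):

```
#include <stdio.h>
#include <string.h>

/*
 * Return the shared_cpu_list string of the first cache of the requested
 * level for a given CPU, e.g. "0-7,128-135" for an LLC shared by one
 * chiplet and its SMT siblings.
 */
static int
cache_shared_cpus(unsigned int cpu, unsigned int level,
                  char *out, size_t out_len)
{
        unsigned int idx;

        for (idx = 0; ; idx++) {
                char path[128];
                unsigned int lvl = 0;
                FILE *f;

                snprintf(path, sizeof(path),
                         "/sys/devices/system/cpu/cpu%u/cache/index%u/level",
                         cpu, idx);
                f = fopen(path, "r");
                if (f == NULL)
                        return -1; /* no more cache indexes for this CPU */
                if (fscanf(f, "%u", &lvl) != 1)
                        lvl = 0;
                fclose(f);
                /* note: L1i and L1d both report level 1 */
                if (lvl != level)
                        continue;

                snprintf(path, sizeof(path),
                         "/sys/devices/system/cpu/cpu%u/cache/index%u/shared_cpu_list",
                         cpu, idx);
                f = fopen(path, "r");
                if (f == NULL)
                        return -1;
                if (fgets(out, out_len, f) == NULL)
                        out[0] = '\0';
                fclose(f);
                out[strcspn(out, "\n")] = '\0';
                return 0;
        }
}
```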
  
Mattias Rönnblom Sept. 12, 2024, 12:12 p.m. UTC | #36
On 2024-09-12 13:23, Varghese, Vipin wrote:
> [Public]
> 
> Snipped
>> 
>> 
>> To to be clear; it's something like this I think of when I say "DOM-style" API.
>> 
>> #ifndef RTE_HWTOPO_H
>> #define RTE_HWTOPO_H
>> 
>> struct rte_hwtopo_node;
>> 
>> enum rte_hwtopo_node_type {
>>      RTE_HWTOPO_NODE_TYPE_CPU_CORE,
>>      RTE_HWTOPO_NODE_TYPE_CACHE,
>>      RTE_HWTOPO_NODE_TYPE_NUMA
>> };
>> 
>> int
>> rte_hwtopo_init(void);
>> 
>> struct rte_hwtopo_node *
>> rte_hwtopo_get_core_by_lcore(unsigned int lcore);
>> 
>> struct rte_hwtopo_node *
>> rte_hwtopo_get_core_by_id(unsigned int os_cpu_id);
>> 
>> struct rte_hwtopo_node *
>> rte_hwtopo_parent(struct rte_hwtopo_node *node);
>> 
>> struct rte_hwtopo_node *
>> rte_hwtopo_first_child(struct rte_hwtopo_node *node);
>> 
>> struct rte_hwtopo_node *
>> rte_hwtopo_next_child(struct rte_hwtopo_node *node,
>>                       struct rte_hwtopo_node *child);
>> 
>> struct rte_hwtopo_node *
>> rte_hwtopo_first_sibling(struct rte_hwtopo_node *node);
>> 
>> struct rte_hwtopo_node *
>> rte_hwtopo_next_sibling(struct rte_hwtopo_node *node,
>>                         struct rte_hwtopo_node *child);
>> 
>> enum rte_hwtopo_node_type
>> rte_hwtopo_get_type(struct rte_hwtopo_node *node);
>> 
>> #define RTE_HWTOPO_NODE_ATTR_CORE_FREQUENCY_NOMINAL 0 #define
>> RTE_HWTOPO_NODE_ATTR_CACHE_LEVEL 1 #define
>> RTE_HWTOPO_NODE_ATTR_CACHE_SIZE 2
>> 
>> int
>> rte_hwtopo_get_attr_int64(struct rte_hwtopo_node *node, unsigned int
>> attr_name,
>>                           int64_t *attr_value);
>> 
>> int
>> rte_hwtopo_get_attr_str(struct rte_hwtopo_node *node, unsigned int
>> attr_name,
>>                         char *attr_value, size_t capacity);
>> 
>> #endif
>> 
>> Surely, this too would be awkward (or should I say cumbersome) to use in certain scenarios. 
> This appears to be more like hwloc API calls. As shared in my earlier 
> email, my intention with the API suggestion is not to introduce a new library.
> I have certain reservations, and with my current understanding I am not 
> able to map certain DPDK core mappings. Let us discuss this in a technical call.
> Snipped

It still would need to be a part of EAL (so not a new library), since 
EAL surely would depend on it (sooner rather than later).

If this functionality should be a new library, or a new API in an 
existing library, it doesn't really matter if your original intentions 
were something else, does it.
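
To show how such a DOM-style API might be consumed, here is a usage sketch
built on the illustrative rte_hwtopo_* names above (none of which exist in
DPDK today): walk upwards from an lcore's core node until the containing L3
cache node is found.

```
static struct rte_hwtopo_node *
find_l3_of_lcore(unsigned int lcore)
{
        /* start at the CPU core node backing this lcore */
        struct rte_hwtopo_node *node = rte_hwtopo_get_core_by_lcore(lcore);

        while (node != NULL) {
                int64_t level;

                /* stop at the first enclosing cache node of level 3 */
                if (rte_hwtopo_get_type(node) == RTE_HWTOPO_NODE_TYPE_CACHE &&
                    rte_hwtopo_get_attr_int64(node,
                                RTE_HWTOPO_NODE_ATTR_CACHE_LEVEL,
                                &level) == 0 &&
                    level == 3)
                        return node;

                node = rte_hwtopo_parent(node);
        }

        return NULL;
}
```

Two lcores would then share an LLC if find_l3_of_lcore() returns the same
node for both.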
  
Mattias Rönnblom Sept. 12, 2024, 1:18 p.m. UTC | #37
On 2024-09-12 11:17, Bruce Richardson wrote:
> On Thu, Sep 12, 2024 at 02:19:07AM +0000, Varghese, Vipin wrote:
>>     [Public]
>>
>>     <snipped>
>>
>>     > > > > <snipped>
>>     > > > >
>>     > > > >>> <snipped>
>>     > > > >>>
>>     > > > >>> Thank you Mattias for the comments and question, please let me
>>     > > > >>> try to explain the same below
>>     > > > >>>
>>     > > > >>>> We shouldn't have a separate CPU/cache hierarchy API instead?
>>     > > > >>>
>>     > > > >>> Based on the intention to bring in CPU lcores which share same
>>     > > > >>> L3 (for better cache hits and less noisy neighbor) current API
>>     > > > >>> focuses on using Last Level Cache. But if the suggestion is
>>     > > > >>> `there are SoC where L2 cache are also shared, and the new API
>>     > > > >>> should be provisioned`, I am also comfortable with the thought.
>>     > > > >>>
>>     > > > >> Rather than some AMD special case API hacked into <rte_lcore.h>,
>>     > > > >> I think we are better off with no DPDK API at all for this kind
>>     > > > >> of functionality.
>>     > > > >
>>     > > > > Hi Mattias, as shared in the earlier email thread, this is not a
>>     > > > > AMD special case at all. Let me try to explain this one more time.
>>     > > > > One of techniques used to increase cores cost effective way to go
>>     > > > > for tiles of compute complexes.
>>     > > > > This introduces a bunch of cores in sharing same Last Level Cache
>>     > > > > (namely L2, L3 or even L4) depending upon cache topology
>>     > > > > architecture.
>>     > > > >
>>     > > > > The API suggested in RFC is to help end users to selectively use
>>     > > > > cores under same Last Level Cache Hierarchy as advertised by OS
>>     > > > > (irrespective of the BIOS settings used). This is useful in both
>>     > > > > bare-metal and container environment.
>>     > > > >
>>     > > > I'm pretty familiar with AMD CPUs and the use of tiles (including
>>     > > > the challenges these kinds of non-uniformities pose for work
>>     > > > scheduling).
>>     > > >
>>     > > > To maximize performance, caring about core<->LLC relationship may
>>     > > > well not be enough, and more HT/core/cache/memory topology
>>     > > > information is required. That's what I meant by special case. A
>>     > > > proper API should allow access to information about which lcores are
>>     > > > SMT siblings, cores on the same L2, and cores on the same L3, to
>>     > > > name a few things. Probably you want to fit NUMA into the same API
>>     > > > as well, although that is available already in <rte_lcore.h>.
>>     > >
>>     > > Thank you Mattias for the information, as shared by in the reply with
>>     > > Anatoly we want expose a new API `rte_get_next_lcore_ex` which intakes
>>     > > a extra argument `u32 flags`.
>>     > > The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
>>     > > RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
>>     > > RTE_GET_LCORE_BOOST_DISABLED.
>>     > >
>>     > For the naming, would "rte_get_next_sibling_core" (or lcore if you
>>     > prefer) be a clearer name than just adding "ex" on to the end of the
>>     > existing function?
>>
>>     Thank you Bruce, Please find my answer below
>>
>>     Functions shared as per the RFC were
>>     ```
>>     - rte_get_llc_first_lcores: Retrieves all the first lcores in the shared LLC.
>>     - rte_get_llc_lcore: Retrieves all lcores that share the LLC.
>>     - rte_get_llc_n_lcore: Retrieves the first n or skips the first n lcores in the shared LLC.
>>     ```
>>
>>     MACRO’s extending the usability were
>>     ```
>>     RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from each LLC.
>>     RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first worker lcore from each LLC.
>>     RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on hint (lcore id).
>>     RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC while skipping first worker.
>>     RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores from each LLC.
>>     RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then iterates through reaming lcores in each LLC.
>>     ```
>>
>>     Based on the discussions we agreed on sharing version-2 FRC for
>>     extending API as `rte_get_next_lcore_extnd` with extra argument as
>>     `flags`.
>>
>>     As per my ideation, for the API ` rte_get_next_sibling_core`, the above
>>     API can easily with flag ` RTE_GET_LCORE_L1 (SMT)`. Is this right
>>     understanding?
>>
>>     We can easily have simple MACROs like `RTE_LCORE_FOREACH_L1` which
>>     allows to iterate SMT sibling threads.
>>
> 
> This seems like a lot of new macro and API additions! I'd really like to
> cut that back and simplify the amount of new things we are adding to DPDK
> for this. I tend to agree with others that external libs would be better
> for apps that really want to deal with all this.
> 

Conveying HW topology will require a fair bit of API verbiage. I think 
there's no way around it, other than giving the API user half of the 
story (or 1% of the story).

That's one of the reasons I think it should be in a separate header file 
in EAL.

>>
>>     > Looking logically, I'm not sure about the BOOST_ENABLED and
>>     > BOOST_DISABLED flags you propose
>>
>>     The idea for the BOOST_ENABLED & BOOST_DISABLED is based on DPDK power
>>     library which allows to enable boost.
>>     Allow user to select lcores where BOOST is enabled|disabled using MACRO
>>     or API.
>>
>>     > - in a system with multiple possible
>>     > standard and boost frequencies what would those correspond to?
>>
>>     I now understand the confusion, apologies for mixing the AMD EPYC SoC
>>     boost with Intel Turbo.
>>
>>     Thank you for pointing out, we will use the terminology
>>     `RTE_GET_LCORE_TURBO`.
>>
> That still doesn't clarify it for me. If you start mixing in power
> management related functions in with topology ones things will turn into a
> real headache. What does boost or turbo correspond to? Is it for cores that
> have the feature enabled - whether or not it's currently in use - or is it
> for finding cores that are currently boosted? Do we need additions for
> cores that are boosted by 100Mhz vs say 300Mhz. What about cores that are
> in lower frequencies for power-saving. Do we add macros for finding those?
> 

In my world, the operating frequency is a property of a CPU core node in 
the hardware topology.

lcore discrimination (or classification) shouldn't be built as a myriad 
of FOREACH macros, but rather generic iteration + app domain logic.

For example, the size of the L3 could be a factor. Should we have a 
FOREACH_BIG_L3. No.

>>
>>     > What's also missing is a define for getting actual NUMA siblings i.e.
>>     > those sharing common memory but not an L3 or anything else.
>>
>>     This can be extended into `rte_get_next_lcore_extnd` with flag
>>     `RTE_GET_LCORE_NUMA`. This will allow to grab all lcores under the same
>>     sub-memory NUMA as shared by LCORE.
>>     If SMT sibling is enabled and DPDK Lcore mask covers the sibling
>>     threads, then `RTE_GET_LCORE_NUMA` get all lcore and sibling threads
>>     under same memory NUMA of lcore shared.
>>
> 
> Yes. That can work. But it means we are basing the implementation on a
> fixed idea of what topologies there are or can exist. My suggestion below
> is just to ignore the whole idea of L1 vs L2 vs NUMA - just give the app a
> way to find it's nearest nodes.
> 

I think we need to agree on what the purpose of this API is. Is it to 
describe the hardware topology in some detail for general-purpose use 
(including informing the operator, lstopo-style), or just some abstract, 
simplified representation to be used purely for work scheduling?

> After all, the app doesn't want to know the topology just for the sake of
> knowing it - it wants it to ensure best placement of work on cores! To that
> end, it just needs to know what cores are near to each other and what are
> far away.
> 
>>
>>     > My suggestion would be to have the function take just an integer-type
>>     > e.g. uint16_t parameter which defines the memory/cache hierarchy level
>>     > to use, 0 being lowest, 1 next, and so on. Different systems may have
>>     > different numbers of cache levels so lets just make it a zero-based
>>     > index of levels, rather than giving explicit defines (except for memory
>>     > which should probably always be last). The zero-level will be for
>>     > "closest neighbour"
>>
>>     Good idea, we did prototype this internally. But issue it will keep on
>>     adding the number of API into lcore library.
>>     To keep the API count less, we are using lcore id as hint to sub-NUMA.
>>
> I'm unclear about this keeping the API count down - you are proposing a lot
> of APIs and macros up above. My suggestion is basically to add two APIs and
> no macros: one API to get the max number of topology-nearness levels, and a
> second API to get the next sibling a given nearness level from
> 0(nearest)..N(furthest). If we want, we can also add a FOREACH macro too.
> 
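
Expressed as code, that two-function suggestion could look something like the
sketch below; the names and exact semantics are invented for illustration and
are not an agreed or existing DPDK API.

```
#include <rte_lcore.h>

/* Proposed (illustrative): number of topology-nearness levels on this system. */
unsigned int rte_lcore_topo_num_levels(void);

/* Proposed (illustrative): next enabled lcore that is a sibling of 'i'
 * at the given nearness level, 0 being the closest neighbours. */
unsigned int rte_get_next_lcore_at_level(unsigned int i, int skip_main,
                                         int wrap, unsigned int level);

/* Pick the nearest available worker lcore relative to ref_lcore,
 * falling back to any enabled worker if no sibling is found. */
static unsigned int
nearest_worker(unsigned int ref_lcore)
{
        unsigned int level, lcore;

        for (level = 0; level < rte_lcore_topo_num_levels(); level++) {
                lcore = rte_get_next_lcore_at_level(ref_lcore, 1, 0, level);
                if (lcore < RTE_MAX_LCORE)
                        return lcore;
        }

        return rte_get_next_lcore(ref_lcore, 1, 0);
}
```
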
> Overall, though, as I say above, let's focus on the problem the app
> actually wants these APIs for, not how we think we should solve it. Apps
> don't want to know the topology for knowledge sake, they want to use that
> knowledge to improve performance by pinning tasks to cores. What is the
> minimum that we need to provide to enable the app to do that? For example,
> if there are no lcores that share an L1, then from an app topology
> viewpoint that L1 level may as well not exist, because it provides us no
> details on how to place our work.
> 
> For the rare app that does have some esoteric use-case that does actually
> want to know some intricate details of the topology, then having that app
> use an external lib is probably a better solution than us trying to cover
> all possible options in DPDK.
> 
> My 2c. on this at this stage anyway.
> 
> /Bruce
>
  
Bruce Richardson Sept. 12, 2024, 1:30 p.m. UTC | #38
On Thu, Sep 12, 2024 at 01:59:34PM +0200, Mattias Rönnblom wrote:
> On 2024-09-12 13:17, Varghese, Vipin wrote:
> > [AMD Official Use Only - AMD Internal Distribution Only]
> > 
> > <snipped>
> > > >>>> Thank you Mattias for the information, as shared by in the reply
> > > >>>> with
> > > >> Anatoly we want expose a new API `rte_get_next_lcore_ex` which
> > > >> intakes a extra argument `u32 flags`.
> > > >>>> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
> > > >> RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
> > > >> RTE_GET_LCORE_BOOST_DISABLED.
> > > >>>
> > > >>> Wouldn't using that API be pretty awkward to use?
> > > > Current API available under DPDK is ` rte_get_next_lcore`, which is used
> > > within DPDK example and in customer solution.
> > > > Based on the comments from others we responded to the idea of changing
> > > the new Api from ` rte_get_next_lcore_llc` to ` rte_get_next_lcore_exntd`.
> > > >
> > > > Can you please help us understand what is `awkward`.
> > > >
> > > 
> > > The awkwardness starts when you are trying to fit hwloc-type
> > > information into an API that was designed for iterating over lcores.
> > I disagree with this point. The current implementation of the lcore library
> > is only focused on iterating through the list of enabled cores, the core-mask,
> > and the lcore-map.
> > With ever-increasing core counts, memory, IO and accelerators on SoCs,
> > sub-NUMA partitioning is common in various vendor SoCs. Enhancing or
> > augmenting the lcore API to extract or provision NUMA and cache topology is
> > not awkward.
> 
> DPDK providing an API for this information makes sense to me, as I've
> mentioned before. What I questioned was the way it was done (i.e., the API
> design) in your RFC, and the limited scope (which in part you have
> addressed).
> 

Actually, I'd like to touch on this first item a little bit. What is the
main benefit of providing this information in EAL? To me, it seems like
something that is for apps to try and be super-smart and select particular
cores out of a set of cores to run on. However, is that not taking work
that should really be the job of the person deploying the app? The deployer
- if I can use that term - has already selected a set of cores and NICs for
a DPDK application to use. Should they not also be the one selecting - via
app argument, via --lcores flag to map one core id to another, or otherwise
- which part of an application should run on what particular piece of
hardware?

In summary, what is the final real-world intended usecase for this work?
DPDK already tries to be smart about cores and NUMA, and in some cases we
have hit issues where users have - for their own valid reasons - wanted to
run DPDK in a sub-optimal way, and they end up having to fight DPDK's
smarts in order to do so! Ref: [1]

/Bruce

[1] https://git.dpdk.org/dpdk/commit/?id=ed34d87d9cfbae8b908159f60df2008e45e4c39f
  
Stephen Hemminger Sept. 12, 2024, 3:50 p.m. UTC | #39
On Thu, 12 Sep 2024 14:12:55 +0200
Mattias Rönnblom <hofors@lysator.liu.se> wrote:

> On 2024-09-12 13:23, Varghese, Vipin wrote:
> > [Public]
> > 
> > Snipped  
> >> 
> >> 
> >> To to be clear; it's something like this I think of when I say "DOM-style" API.
> >> 
> >> #ifndef RTE_HWTOPO_H
> >> #define RTE_HWTOPO_H
> >> 
> >> struct rte_hwtopo_node;
> >> 
> >> enum rte_hwtopo_node_type {
> >>      RTE_HWTOPO_NODE_TYPE_CPU_CORE,
> >>      RTE_HWTOPO_NODE_TYPE_CACHE,
> >>      RTE_HWTOPO_NODE_TYPE_NUMA
> >> };
> >> 
> >> int
> >> rte_hwtopo_init(void);
> >> 
> >> struct rte_hwtopo_node *
> >> rte_hwtopo_get_core_by_lcore(unsigned int lcore);
> >> 
> >> struct rte_hwtopo_node *
> >> rte_hwtopo_get_core_by_id(unsigned int os_cpu_id);
> >> 
> >> struct rte_hwtopo_node *
> >> rte_hwtopo_parent(struct rte_hwtopo_node *node);
> >> 
> >> struct rte_hwtopo_node *
> >> rte_hwtopo_first_child(struct rte_hwtopo_node *node);
> >> 
> >> struct rte_hwtopo_node *
> >> rte_hwtopo_next_child(struct rte_hwtopo_node *node,
> >>                       struct rte_hwtopo_node *child);
> >> 
> >> struct rte_hwtopo_node *
> >> rte_hwtopo_first_sibling(struct rte_hwtopo_node *node);
> >> 
> >> struct rte_hwtopo_node *
> >> rte_hwtopo_next_sibling(struct rte_hwtopo_node *node,
> >>                         struct rte_hwtopo_node *child);
> >> 
> >> enum rte_hwtopo_node_type
> >> rte_hwtopo_get_type(struct rte_hwtopo_node *node);
> >> 
> >> #define RTE_HWTOPO_NODE_ATTR_CORE_FREQUENCY_NOMINAL 0 #define
> >> RTE_HWTOPO_NODE_ATTR_CACHE_LEVEL 1 #define
> >> RTE_HWTOPO_NODE_ATTR_CACHE_SIZE 2
> >> 
> >> int
> >> rte_hwtopo_get_attr_int64(struct rte_hwtopo_node *node, unsigned int
> >> attr_name,
> >>                           int64_t *attr_value);
> >> 
> >> int
> >> rte_hwtopo_get_attr_str(struct rte_hwtopo_node *node, unsigned int
> >> attr_name,
> >>                         char *attr_value, size_t capacity);
> >> 
> >> #endif
> >> 
> >> Surely, this too would be awkward (or should I say cumbersome) to use in certain scenarios.   
> > This appears to be more like hwloc API calls. As shared in my earlier 
> > email, my intention with the API suggestion is not to introduce a new library.
> > I have certain reservations, and with my current understanding I am not 
> > able to map certain DPDK core mappings. Let us discuss this in a technical call.
> > Snipped  
> 
> It still would need to be a part of EAL (so not a new library), since 
> EAL surely would depend on it (sooner rather than later).
> 
> If this functionality should be a new library, or a new API in an 
> existing library, it doesn't really matter if your original intentions 
> were something else, does it.
> 

Good discussion.
I wonder if the API would be cleaner if it just provided a tree representation
of the hardware in a data structure, instead of trying to provide FOREACH.

The other concern is that hardware will evolve and there are likely to be more
possibilities. It is impossible to totally future-proof APIs (YAGNI), but it is
worth thinking about now.

There is also the issue of cores with different clock speeds, which should
be represented as well.
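
For illustration, a plain tree representation could be as small as the
following; all names and fields here are hypothetical, not a proposed ABI,
and clock speed is carried as a per-node attribute rather than a grouping.

```
#include <stdint.h>

enum rte_topo_node_type {
        RTE_TOPO_NODE_CORE,
        RTE_TOPO_NODE_CACHE,
        RTE_TOPO_NODE_NUMA,
};

struct rte_topo_node {
        enum rte_topo_node_type type;
        unsigned int id;                /* OS id of the core/cache/NUMA node */
        uint64_t frequency_khz;         /* 0 if unknown or not applicable */
        struct rte_topo_node *parent;
        unsigned int nb_children;
        struct rte_topo_node **children;
};

/* The application fetches the root once and walks the tree itself,
 * instead of DPDK providing one FOREACH macro per grouping. */
const struct rte_topo_node *rte_topo_get_root(void);
```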
  
Mattias Rönnblom Sept. 12, 2024, 4:32 p.m. UTC | #40
On 2024-09-12 15:30, Bruce Richardson wrote:
> On Thu, Sep 12, 2024 at 01:59:34PM +0200, Mattias Rönnblom wrote:
>> On 2024-09-12 13:17, Varghese, Vipin wrote:
>>> [AMD Official Use Only - AMD Internal Distribution Only]
>>>
>>> <snipped>
>>>>>>>> Thank you Mattias for the information, as shared by in the reply
>>>>>>>> with
>>>>>> Anatoly we want expose a new API `rte_get_next_lcore_ex` which
>>>>>> intakes a extra argument `u32 flags`.
>>>>>>>> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
>>>>>> RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
>>>>>> RTE_GET_LCORE_BOOST_DISABLED.
>>>>>>>
>>>>>>> Wouldn't using that API be pretty awkward to use?
>>>>> Current API available under DPDK is ` rte_get_next_lcore`, which is used
>>>> within DPDK example and in customer solution.
>>>>> Based on the comments from others we responded to the idea of changing
>>>> the new Api from ` rte_get_next_lcore_llc` to ` rte_get_next_lcore_exntd`.
>>>>>
>>>>> Can you please help us understand what is `awkward`.
>>>>>
>>>>
>>>> The awkwardness starts when you are trying to fit hwloc-type
>>>> information into an API that was designed for iterating over lcores.
>>> I disagree with this point. The current implementation of the lcore library
>>> is only focused on iterating through the list of enabled cores, the core-mask,
>>> and the lcore-map.
>>> With ever-increasing core counts, memory, IO and accelerators on SoCs,
>>> sub-NUMA partitioning is common in various vendor SoCs. Enhancing or
>>> augmenting the lcore API to extract or provision NUMA and cache topology is
>>> not awkward.
>>
>> DPDK providing an API for this information makes sense to me, as I've
>> mentioned before. What I questioned was the way it was done (i.e., the API
>> design) in your RFC, and the limited scope (which in part you have
>> addressed).
>>
> 
> Actually, I'd like to touch on this first item a little bit. What is the
> main benefit of providing this information in EAL? To me, it seems like
> something that is for apps to try and be super-smart and select particular
> cores out of a set of cores to run on. However, is that not taking work
> that should really be the job of the person deploying the app? The deployer
> - if I can use that term - has already selected a set of cores and NICs for
> a DPDK application to use. Should they not also be the one selecting - via
> app argument, via --lcores flag to map one core id to another, or otherwise
> - which part of an application should run on what particular piece of
> hardware?
> 

Scheduling in one form or another will happen on a number of levels. One 
level is what you call the "deployer". Whether man or machine, it will 
allocate a bunch of lcores to the application - either statically by 
using -l <cores>, or dynamically, by giving a very large core mask, 
combined with having an agent in the app responsible to scale up or down 
the number of cores actually used (allowing coexistence with other 
non-DPDK, Linux process scheduler-scheduled processes, on the same set 
of cores, although not at the same time).

I think the "deployer" level should generally not be aware of the DPDK 
app internals, including how to assign different tasks to different 
cores. That is consistent with how things work in a general-purpose 
operating system, where you allocate cores, memory and I/O devices to an 
instance (e.g., a VM), but then OS' scheduler figures out how to best 
use them.

The app internals may be complicated, may change across software versions and 
traffic mixes/patterns, and, most of all, may not lend themselves to static 
at-start configuration at all.

> In summary, what is the final real-world intended usecase for this work?

One real-world example is an Eventdev app with some atomic processing 
stage, using DSW, and SMT. Hardware threading on Intel x86 generally 
improves performance by ~25%, which seems to hold true for data plane 
apps as well, in my experience. So that's a (not-so-)freebie you don't 
want to miss out on. To max out single-flow performance, the work 
scheduler may not only need to give 100% of an lcore to bottleneck stage 
atomic processing for that elephant flow, but a *full* physical core 
(i.e., assure that the SMT sibling is idle). But, DSW doesn't understand 
the CPU topology, so you have to choose between max multi-flow 
throughput or max single-flow throughput at the time of deployment. A 
RTE hwtopo API would certainly help in the implementation of SMT-aware 
scheduling.

Another example could be the use of bigger or turbo-capable cores to run 
CPU-hungry, singleton services (e.g., a Eventdev RX timer adapter core), 
or the use of a hardware thread to run the SW scheduler service (which 
needs to react quickly to incoming scheduling events, but maybe not need 
all the cycles of a full physical core).

Yet another example would be an event device which understand how to 
spread a particular flow across multiple cores, but use only cores 
sharing the same L2. Or, keep only processing of a certain kind (e.g., a 
certain Eventdev Queue) on cores with the same L2, improve L2 hit rates 
for instructions and data related to that processing stage.

> DPDK already tries to be smart about cores and NUMA, and in some cases we
> have hit issues where users have - for their own valid reasons - wanted to
> run DPDK in a sub-optimal way, and they end up having to fight DPDK's
> smarts in order to do so! Ref: [1]
> 
> /Bruce
> 
> [1] https://git.dpdk.org/dpdk/commit/?id=ed34d87d9cfbae8b908159f60df2008e45e4c39f
  
Burakov, Anatoly Sept. 13, 2024, 2:15 p.m. UTC | #41
On 9/12/2024 1:50 PM, Varghese, Vipin wrote:
> [Public]
> 
> Snipped
> 
>>>
>>>
>>>     Based on the discussions we agreed on sharing version-2 FRC for
>>>     extending API as `rte_get_next_lcore_extnd` with extra argument as
>>>     `flags`.
>>>
>>>     As per my ideation, for the API ` rte_get_next_sibling_core`, the above
>>>     API can easily with flag ` RTE_GET_LCORE_L1 (SMT)`. Is this right
>>>     understanding?
>>>
>>>     We can easily have simple MACROs like `RTE_LCORE_FOREACH_L1` which
>>>     allows to iterate SMT sibling threads.
>>>
>>>
>>
>> This seems like a lot of new macro and API additions! I'd really like to cut that
>> back and simplify the amount of new things we are adding to DPDK for this.
> I disagree, Bruce. As per the new conversation with Anatoly and you, it has been shared that the new APIs are
> ```
> 1. rte_get_next_lcore_exntd
> 2. rte_get_next_n_lcore_exntd
> ```
> 
> I also mentioned that custom macros can be added based on typical flag usage, similar to `RTE_LCORE_FOREACH` and `RTE_LCORE_FOREACH_WORKER`, such as
> ```
> RTE_LCORE_FOREACH_FLAG
> RTE_LCORE_FOREACH_WORKER_FLAG
> 
> Or
> 
> RTE_LCORE_FOREACH_LLC
> RTE_LCORE_FOREACH_WORKER_LLC
> ```
> 
> Please note I have not even shared version-2 of RFC yet.
> 
>> I tend to agree with others that external libs would be better for apps that really want to deal with all this.
> I have covered why this is not a good idea in my reply to Mattias's query.
> 
>>
>>>
>>>     >
>>>
>>>     > Looking logically, I'm not sure about the BOOST_ENABLED and BOOST_DISABLED flags you propose
>>>     The idea for the BOOST_ENABLED & BOOST_DISABLED is based on DPDK power library which allows to enable boost.
>>>     Allow user to select lcores where BOOST is enabled|disabled using MACRO or API.
> Maybe there is confusion, so let me try to be explicit here. The intention of any `rte_get_next_lcore##` API is to fetch lcores.
> Hence the new proposed API `rte_get_next_lcore_exntd` with the boost flag set is to fetch lcores where boost is enabled.
> There is no intention to enable or disable boost on an lcore with a `get` API.
> 
>>>
>>>     > - in a system with multiple possible
>>>     > standard and boost frequencies what would those correspond to?
>>>
>>>     I now understand the confusion, apologies for mixing the AMD EPYC SoC
>>>     boost with Intel Turbo.
>>>
>>>     Thank you for pointing out, we will use the terminology
>>>     `RTE_GET_LCORE_TURBO`.
>>>
>>>
>>
>> That still doesn't clarify it for me. If you start mixing in power management related functions in with topology ones things will turn into a real headache.
> Can you please tell me what is not clarified? DPDK lcores as of today have no notion of cache, NUMA, power, turbo or any other DPDK-supported feature.
> The initial APIs introduced were to expose lcores sharing the same Last Level Cache. Based on the interaction with Anatoly, extending this to support multiple features turned out to be a possibility.
> Hence we said we can share v2 of the RFC based on this idea.
> 
> But if the consensus is not to include TURBO, I am also OK with that. Let us keep only cache and the NUMA-IO domain.
> 
>> What does boost or turbo correspond to? Is it for cores that have the feature enabled - whether or not it's currently in use - or is it for finding cores that are
>> currently boosted?  Do we need additions for cores that are boosted by 100Mhz vs say 300Mhz. What about cores that are in lower frequencies for
>> power-saving. Do we add macros for finding those?
> Why are we talking about freq-up and freq-down? This was not even discussed in this RFC patch at all.
> 
>>>
>>>     > What's also missing is a define for getting actual NUMA siblings i.e.
>>>     > those sharing common memory but not an L3 or anything else.
>>>
>>>     This can be extended into `rte_get_next_lcore_extnd` with flag
>>>     `RTE_GET_LCORE_NUMA`. This will allow to grab all lcores under the same
>>>     sub-memory NUMA as shared by LCORE.
>>>     If SMT sibling is enabled and DPDK Lcore mask covers the sibling
>>>     threads, then `RTE_GET_LCORE_NUMA` get all lcore and sibling threads
>>>     under same memory NUMA of lcore shared.
>>>
>>
>> Yes. That can work. But it means we are basing the implementation on a fixed idea of what topologies there are or can exist.
>> My suggestion below is just to ignore the whole idea of L1 vs L2 vs NUMA - just give the app a way to find it's nearest nodes.
> Bruce, the architectural implementation differs between vendor SoCs. Let me share what I know
> 1. using L1, we can fetch SMT threads
> 2. using L2, we can group cores on certain Arm, Intel and Power PC SoCs, which is similar to efficiency cores
> 3. using L3, we can group cores on certain SoCs like AMD, AF64x and others which follow a chiplet or tile-split L3 domain.
> 
>>
>> After all, the app doesn't want to know the topology just for the sake of knowing it - it wants it to ensure best placement of work on cores! To that end, it just needs to know what cores are near to each other and what are far away.
> Exactly, and that is why we want to minimize new libraries and stay close to the format of the existing API `rte_get_next_lcore`. Otherwise the end user needs to deploy another, external library and then map its results onto the DPDK lcore mapping to identify what is where.
> So as an end user I prefer a simple API which gets my work done.
> 
>>
>>>
>>>     > My suggestion would be to have the function take just an integer-type
>>>     > e.g. uint16_t parameter which defines the memory/cache hierarchy level
>>>     > to use, 0 being lowest, 1 next, and so on. Different systems may have
>>>     > different numbers of cache levels so lets just make it a zero-based
>>>     > index of levels, rather than giving explicit defines (except for memory
>>>     > which should probably always be last). The zero-level will be for
>>>     > "closest neighbour"
>>>
>>>     Good idea, we did prototype this internally. But issue it will keep on
>>>     adding the number of API into lcore library.
>>>     To keep the API count less, we are using lcore id as hint to sub-NUMA.
>>>
>>
>> I'm unclear about this keeping the API count down - you are proposing a lot of APIs and macros up above.
> No, I am not. As I shared, based on the last discussion with Anatoly we will end up with only 2 APIs in the lcore library, as explained in the response above.
> 
>> My suggestion is basically to add two APIs and no macros: one API to get the max number of topology-nearness levels, and a
>> second API to get the next sibling a given nearness level from
>> 0(nearest)..N(furthest). If we want, we can also add a FOREACH macro too.
>>
>> Overall, though, as I say above, let's focus on the problem the app actually
>> wants these APIs for, not how we think we should solve it. Apps don't want to
>> know the topology for knowledge sake, they want to use that knowledge to
>> improve performance by pinning tasks to cores. What is the minimum that we
>> need to provide to enable the app to do that? For example, if there are no
>> lcores that share an L1, then from an app topology viewpoint that L1 level may
>> as well not exist, because it provides us no details on how to place our work.
> I have shared above why we need vendor agnostic L1, L2, L3 and sub-NUMA-IO.
> 
> Snipped

Just to add my 2c here, since my name is being thrown around a lot in 
this discussion :)

I tend to agree with Bruce here in the sense that if we want this API to be 
used to group cores together, then ideally we shouldn't really 
explicitly call out the principle by which we group them unless we have 
to. My main contention with the initial RFC *was* the fact that it was 
tied to specific HW arch stuff in the API.

Vipin has suggested using a "flags" value to discriminate between 
L1/L2/L3/NUMA/whatever ways of grouping cores, and I agree that it's 
better than what was initially proposed (at least from my vantage 
point), but what's even better is not to have any flags at all! As in, I 
think the thing we're presumably trying to achieve here just as well 
could be achieved simply by returning a number of "levels" we have in 
our hierarchy, and user then being able to iterate over nearest 
neighbours sitting on that "level" without explicitly specifying what 
that level is.

So, for some systems level 0 would be SMT, for others - L3, for some - 
NUMA, for yet others - efficiency/performance cores, etc. Bruce's 
suggestion is that we don't explicitly call out the thing we use to 
group the cores by, and instead rely on EAL to parse that information 
out for us into a set of "levels". I would agree that for anything more 
complicated an external library would be the way to go, because, well, 
we're DPDK, not Linux kernel.

But, just to be clear, this is not mutually exclusive with some kind of 
topology-style API. If we do go down that route, then the point about 
"attaching to specific architectural features" becomes moot, as by 
necessity any DOM-style API would have to represent topology in some 
way, which then gets used by DPDK.

The main question here (and other people have rightly asked this 
question) would be, do we want a topology API, or do we want an API to 
assist with scheduling. My impression so far has been that Vipin is 
looking for the latter rather than the former, as no topology-related 
use cases were mentioned in the discussion except as a proxy for scheduling.
  
Stephen Hemminger Oct. 7, 2024, 9:28 p.m. UTC | #42
On Tue, 27 Aug 2024 20:40:12 +0530
Vipin Varghese <vipin.varghese@amd.com> wrote:

> As core density continues to increase, chiplet-based
> core packing has become a key trend. In AMD SoC EPYC
> architectures, core complexes within the same chiplet
> share a Last-Level Cache (LLC). By packing logical cores
> within the same LLC, we can enhance pipeline processing
> stages due to reduced latency and improved data locality.
> 
> To leverage these benefits, DPDK libraries and examples
> can utilize localized lcores. This approach ensures more
> consistent latencies by minimizing the dispersion of lcores
> across different chiplet complexes and enhances packet
> processing by ensuring that data for subsequent pipeline
> stages is likely to reside within the LLC.
> 
> < Function: Purpose >
> ---------------------
>  - rte_get_llc_first_lcores: Retrieves all the first lcores in the shared LLC.
>  - rte_get_llc_lcore: Retrieves all lcores that share the LLC.
>  - rte_get_llc_n_lcore: Retrieves the first n or skips the first n lcores in the shared LLC.
> 
> < MACRO: Purpose >
> ------------------
> RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from each LLC.
> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first worker lcore from each LLC.
> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on hint (lcore id).
> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC while skipping first worker.
> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores from each LLC.
> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then iterates through reaming lcores in each LLC.
> 
> Vipin Varghese (2):
>   eal: add llc aware functions
>   eal/lcore: add llc aware for each macro
> 
>  lib/eal/common/eal_common_lcore.c | 279 ++++++++++++++++++++++++++++--
>  lib/eal/include/rte_lcore.h       |  89 ++++++++++

When are you going to send a new version?

Need:
  - new functions need to be marked experimental
  - has to build cleanly on all platforms.
  - need functional tests
  - address all the other review comments
  
Vipin Varghese Oct. 21, 2024, 8:17 a.m. UTC | #43
[AMD Official Use Only - AMD Internal Distribution Only]

>
> When are you going to send a new version?
We have been testing this on various Intel and AMD platforms and have completed testing over sub-NUMA domains on both SoCs.
We will be sharing the new patch (rfc-v2) before 22 Oct 2024.

>
> Need:
>   - new functions need to be marked experimental
>   - has to build cleanly on all platforms.
>   - need functional tests
>   - address all the other review comments