Message ID | 20240827151014.201-1-vipin.varghese@amd.com (mailing list archive) |
---|---|
Headers |
From: Vipin Varghese <vipin.varghese@amd.com> To: <ferruh.yigit@amd.com>, <dev@dpdk.org> Subject: [RFC 0/2] introduce LLC aware functions Date: Tue, 27 Aug 2024 20:40:12 +0530 Message-ID: <20240827151014.201-1-vipin.varghese@amd.com> List-Id: DPDK patches and discussions <dev.dpdk.org> |
Series | introduce LLC aware functions |
Message
Vipin Varghese
Aug. 27, 2024, 3:10 p.m. UTC
As core density continues to increase, chiplet-based core packing has become a key trend. In AMD EPYC SoC architectures, core complexes within the same chiplet share a Last-Level Cache (LLC). By packing logical cores within the same LLC, we can enhance pipeline processing stages through reduced latency and improved data locality.

To leverage these benefits, DPDK libraries and examples can utilize localized lcores. This approach ensures more consistent latencies by minimizing the dispersion of lcores across different chiplet complexes, and enhances packet processing by ensuring that data for subsequent pipeline stages is likely to reside within the LLC.

< Function: Purpose >
---------------------
- rte_get_llc_first_lcores: retrieves the first lcore of each shared LLC.
- rte_get_llc_lcore: retrieves all lcores that share the LLC.
- rte_get_llc_n_lcore: retrieves the first n lcores, or skips the first n lcores, in a shared LLC.

< MACRO: Purpose >
------------------
RTE_LCORE_FOREACH_LLC_FIRST: iterates through the first lcore of each LLC.
RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through the first worker lcore of each LLC.
RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores of an LLC based on a hint (lcore id).
RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores of an LLC while skipping the first worker.
RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through the first `n` lcores of each LLC.
RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skips the first `n` lcores, then iterates through the remaining lcores of each LLC.

Vipin Varghese (2):
  eal: add llc aware functions
  eal/lcore: add llc aware for each macro

 lib/eal/common/eal_common_lcore.c | 279 ++++++++++++++++++++++++++++--
 lib/eal/include/rte_lcore.h       |  89 ++++++++++
 2 files changed, 356 insertions(+), 12 deletions(-)
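To make the intended usage concrete, here is a minimal sketch of how the proposed iterators could be used from an application: launch one worker per shared LLC, e.g. one RX/distributor thread per chiplet complex. The RTE_LCORE_FOREACH_LLC_* macros are only proposed in this RFC, so their exact signatures may change during review; the single-lcore-id form below is inferred from the descriptions above.

#include <rte_eal.h>
#include <rte_lcore.h>
#include <rte_launch.h>

/* Per-lcore packet-processing loop; body elided for brevity. */
static int
worker_main(void *arg)
{
	(void)arg;
	return 0;
}

static void
launch_one_worker_per_llc(void)
{
	unsigned int lcore_id;

	/* Pin exactly one worker in each shared L3 domain
	 * (macro proposed by this RFC, not yet in mainline DPDK). */
	RTE_LCORE_FOREACH_LLC_FIRST_WORKER(lcore_id)
		rte_eal_remote_launch(worker_main, NULL, lcore_id);
}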
Comments
On 2024-08-27 17:10, Vipin Varghese wrote: > As core density continues to increase, chiplet-based > core packing has become a key trend. In AMD SoC EPYC > architectures, core complexes within the same chiplet > share a Last-Level Cache (LLC). By packing logical cores > within the same LLC, we can enhance pipeline processing > stages due to reduced latency and improved data locality. > > To leverage these benefits, DPDK libraries and examples > can utilize localized lcores. This approach ensures more > consistent latencies by minimizing the dispersion of lcores > across different chiplet complexes and enhances packet > processing by ensuring that data for subsequent pipeline > stages is likely to reside within the LLC. > We shouldn't have a separate CPU/cache hierarchy API instead? Could potentially be built on the 'hwloc' library. I much agree cache/core topology may be of interest of the application (or a work scheduler, like a DPDK event device), but it's not limited to LLC. It may well be worthwhile to care about which cores shares L2 cache, for example. Not sure the RTE_LCORE_FOREACH_* approach scales. > < Function: Purpose > > --------------------- > - rte_get_llc_first_lcores: Retrieves all the first lcores in the shared LLC. > - rte_get_llc_lcore: Retrieves all lcores that share the LLC. > - rte_get_llc_n_lcore: Retrieves the first n or skips the first n lcores in the shared LLC. > > < MACRO: Purpose > > ------------------ > RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from each LLC. > RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first worker lcore from each LLC. > RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on hint (lcore id). > RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC while skipping first worker. > RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores from each LLC. > RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then iterates through reaming lcores in each LLC. > > Vipin Varghese (2): > eal: add llc aware functions > eal/lcore: add llc aware for each macro > > lib/eal/common/eal_common_lcore.c | 279 ++++++++++++++++++++++++++++-- > lib/eal/include/rte_lcore.h | 89 ++++++++++ > 2 files changed, 356 insertions(+), 12 deletions(-) >
On 8/27/2024 5:10 PM, Vipin Varghese wrote: > As core density continues to increase, chiplet-based > core packing has become a key trend. In AMD SoC EPYC > architectures, core complexes within the same chiplet > share a Last-Level Cache (LLC). By packing logical cores > within the same LLC, we can enhance pipeline processing > stages due to reduced latency and improved data locality. > > To leverage these benefits, DPDK libraries and examples > can utilize localized lcores. This approach ensures more > consistent latencies by minimizing the dispersion of lcores > across different chiplet complexes and enhances packet > processing by ensuring that data for subsequent pipeline > stages is likely to reside within the LLC. > > < Function: Purpose > > --------------------- > - rte_get_llc_first_lcores: Retrieves all the first lcores in the shared LLC. > - rte_get_llc_lcore: Retrieves all lcores that share the LLC. > - rte_get_llc_n_lcore: Retrieves the first n or skips the first n lcores in the shared LLC. > > < MACRO: Purpose > > ------------------ > RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from each LLC. > RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first worker lcore from each LLC. > RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on hint (lcore id). > RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC while skipping first worker. > RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores from each LLC. > RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then iterates through reaming lcores in each LLC. > Hi Vipin, I recently looked into how Intel's Sub-NUMA Clustering would work within DPDK, and found that I actually didn't have to do anything, because the SNC "clusters" present themselves as NUMA nodes, which DPDK already supports natively. Does AMD's implementation of chiplets not report themselves as separate NUMA nodes? Because if it does, I don't really think any changes are required because NUMA nodes would give you the same thing, would it not?
<snipped>

Thank you Mattias for the comments and questions; please let me try to explain below.

> We shouldn't have a separate CPU/cache hierarchy API instead?

The intention is to bring together CPU lcores which share the same L3 (for better cache hits and fewer noisy neighbours), so the current API focuses on the Last-Level Cache. But if the suggestion is `there are SoCs where L2 caches are also shared, and the new API should provision for that`, I am also comfortable with the thought.

> Could potentially be built on the 'hwloc' library.

There are 3 reasons we did not explore this path on AMD SoCs:
1. Depending on the hwloc version and kernel version, certain SoC hierarchies are not available.
2. CPU NUMA and IO (memory & PCIe) NUMA are independent on AMD EPYC SoCs.
3. It adds an extra library dependency that has to be made available to work.

Hence we have tried to use the Linux-documented generic layer of `sysfs CPU cache`. I will explore hwloc further and check whether other libraries within DPDK leverage it.

> I much agree cache/core topology may be of interest of the application
> (or a work scheduler, like a DPDK event device), but it's not limited to
> LLC. It may well be worthwhile to care about which cores shares L2
> cache, for example. Not sure the RTE_LCORE_FOREACH_* approach scales.

Yes, totally understood: on some SoCs, multiple lcores share the same L2 cache. Can we rework the API to be rte_get_cache_<function>, where the user argument is the desired cache level (index)?
1. index-1: SMT threads
2. index-2: threads sharing the same L2 cache
3. index-3: threads sharing the same L3 cache
4. index-MAX: threads sharing the last-level cache

>> < Function: Purpose >
>> ---------------------
>> - rte_get_llc_first_lcores: Retrieves all the first lcores in the
>> shared LLC.
>> - rte_get_llc_lcore: Retrieves all lcores that share the LLC.
>> - rte_get_llc_n_lcore: Retrieves the first n or skips the first n
>> lcores in the shared LLC.
>>
>> < MACRO: Purpose >
>> ------------------
>> RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from
>> each LLC.
>> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first worker
>> lcore from each LLC.
>> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on hint
>> (lcore id).
>> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC
>> while skipping first worker.
>> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores
>> from each LLC.
>> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then
>> iterates through remaining lcores in each LLC.

The MACROs are simple wrappers invoking the appropriate API; can this be worked out in this fashion?

<snipped>
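To make the sysfs route mentioned above concrete, here is a minimal sketch (not the RFC implementation) of reading which CPUs share a last-level cache from the generic Linux cache-topology files. It assumes the LLC is exported as cache/index3, which holds on current x86 parts but should really be discovered via the per-index "level" file.

#include <stdio.h>

/* Returns 0 and fills buf with e.g. "0-7,128-135" on success, -1 on error. */
static int
llc_shared_cpu_list(unsigned int cpu, char *buf, int len)
{
	char path[128];
	FILE *f;
	int ret = -1;

	/* Generic Linux cache topology; index3 is assumed to be the LLC here. */
	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%u/cache/index3/shared_cpu_list",
		 cpu);
	f = fopen(path, "r");
	if (f == NULL)
		return -1;
	if (fgets(buf, len, f) != NULL)
		ret = 0;
	fclose(f);
	return ret;
}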
<Snipped>

Thank you Anatoly for the response. Let me try to share my understanding.

> I recently looked into how Intel's Sub-NUMA Clustering would work within
> DPDK, and found that I actually didn't have to do anything, because the
> SNC "clusters" present themselves as NUMA nodes, which DPDK already
> supports natively.

Yes, this is correct. In the Intel Xeon Platinum BIOS one can set `Cluster per NUMA` to `1, 2 or 4`. This divides the tiles into Sub-NUMA partitions, each having separate lcores, memory controllers, PCIe and accelerators.

> Does AMD's implementation of chiplets not report themselves as separate
> NUMA nodes?

On AMD EPYC SoCs this is different. There are 2 BIOS settings, namely:
1. NPS (`NUMA Per Socket`), which allows the IO tile (memory, PCIe and accelerator) to be partitioned as NUMA 0, 1, 2 or 4.
2. `L3 as NUMA` (`L3 cache of CPU tiles as individual NUMA`), which allows all CPU tiles to be independent NUMA nodes.

The above settings are possible because the CPU tiles are independent from the IO tile, allowing 4 combinations to be available for use. These are covered in the SoC tuning guide: "12. How to get best performance on AMD platform" in the Data Plane Development Kit 24.07.0 documentation <https://doc.dpdk.org/guides/linux_gsg/amd_platform.html>.

> Because if it does, I don't really think any changes are
> required because NUMA nodes would give you the same thing, would it not?

I have a different opinion on this outlook. An end user can:
1. Identify the lcores and their NUMA nodes using `usertools/cpu-layout.py`.
2. But it is the core mask in the EAL arguments which makes the threads available to be used in a process.
3. There is no API which distinguishes the L3 NUMA domain. The function `rte_socket_id` <https://doc.dpdk.org/api/rte__lcore_8h.html#a7c8da4664df26a64cf05dc508a4f26df> for CPU tiles like the AMD SoC will return the physical socket.

Example: In AMD EPYC Genoa there are a total of 13 tiles: 12 CPU tiles and 1 IO tile.
1. Setting NPS to 4 divides the memory, PCIe and accelerators into 4 domains, while all CPUs appear as a single NUMA node even though each of the 12 tiles has an independent L3 cache.
2. Setting `L3 as NUMA` allows each tile to appear as a separate L3 cluster.

Hence, adding an API which allows selecting the available lcores based on the split L3 is essential, irrespective of the BIOS setting.

>
> --
> Thanks,
> Anatoly
>
On 9/2/2024 3:08 AM, Varghese, Vipin wrote: > <Snipped> > > Thank you Antaloy for the response. Let me try to share my understanding. > >> I recently looked into how Intel's Sub-NUMA Clustering would work within >> DPDK, and found that I actually didn't have to do anything, because the >> SNC "clusters" present themselves as NUMA nodes, which DPDK already >> supports natively. > > yes, this is correct. In Intel Xeon Platinum BIOS one can enable > `Cluster per NUMA` as `1,2 or4`. > > This divides the tiles into Sub-Numa parition, each having separate > lcores,memory controllers, PCIe > > and accelerator. > >> >> Does AMD's implementation of chiplets not report themselves as separate >> NUMA nodes? > > In AMD EPYC Soc, this is different. There are 2 BIOS settings, namely > > 1. NPS: `Numa Per Socket` which allows the IO tile (memory, PCIe and > Accelerator) to be partitioned as Numa 0, 1, 2 or 4. > > 2. L3 as NUMA: `L3 cache of CPU tiles as individual NUMA`. This allows > all CPU tiles to be independent NUMA cores. > > > The above settings are possible because CPU is independent from IO tile. > Thus allowing 4 combinations be available for use. Sure, but presumably if the user wants to distinguish this, they have to configure their system appropriately. If user wants to take advantage of L3 as NUMA (which is what your patch proposes), then they can enable the BIOS knob and get that functionality for free. DPDK already supports this. > > These are covered in the tuning gudie for the SoC in 12. How to get best > performance on AMD platform — Data Plane Development Kit 24.07.0 > documentation (dpdk.org) > <https://doc.dpdk.org/guides/linux_gsg/amd_platform.html>. > > >> Because if it does, I don't really think any changes are >> required because NUMA nodes would give you the same thing, would it not? > > I have a different opinion to this outlook. An end user can > > 1. Identify the lcores and it's NUMA user `usertools/cpu-layout.py` I recently submitted an enhacement for CPU layout script to print out NUMA separately from physical socket [1]. [1] https://patches.dpdk.org/project/dpdk/patch/40cf4ee32f15952457ac5526cfce64728bd13d32.1724323106.git.anatoly.burakov@intel.com/ I believe when "L3 as NUMA" is enabled in BIOS, the script will display both physical package ID as well as NUMA nodes reported by the system, which will be different from physical package ID, and which will display information you were looking for. > > 2. But it is core mask in eal arguments which makes the threads > available to be used in a process. See above: if the OS already reports NUMA information, this is not a problem to be solved, CPU layout script can give this information to the user. > > 3. there are no API which distinguish L3 numa domain. Function > `rte_socket_id > <https://doc.dpdk.org/api/rte__lcore_8h.html#a7c8da4664df26a64cf05dc508a4f26df>` for CPU tiles like AMD SoC will return physical socket. Sure, but I would think the answer to that would be to introduce an API to distinguish between NUMA (socket ID in DPDK parlance) and package (physical socket ID in the "traditional NUMA" sense). Once we can distinguish between those, DPDK can just rely on NUMA information provided by the OS, while still being capable of identifying physical sockets if the user so desires. I am actually going to introduce API to get *physical socket* (as opposed to NUMA node) in the next few days. > > > Example: In AMD EPYC Genoa, there are total of 13 tiles. 12 CPU tiles > and 1 IO tile. Setting > > 1. 
NPS to 4 will divide the memory, PCIe and accelerator into 4 domain. > While the all CPU will appear as single NUMA but each 12 tile having > independent L3 caches. > > 2. Setting `L3 as NUMA` allows each tile to appear as separate L3 clusters. > > > Hence, adding an API which allows to select available lcores based on > Split L3 is essential irrespective of the BIOS setting. > I think the crucial issue here is the "irrespective of BIOS setting" bit. If EAL is getting into the game of figuring out exact intricacies of physical layout of the system, then there's a lot more work to be done as there are lots of different topologies, as other people have already commented, and such an API needs *a lot* of thought put into it. If, on the other hand, we leave this issue to the kernel, and only gather NUMA information provided by the kernel, then nothing has to be done - DPDK already supports all of this natively, provided the user has configured the system correctly. Moreover, arguably DPDK already works that way: technically you can get physical socket information even absent of NUMA support in BIOS, but DPDK does not do that. Instead, if OS reports NUMA node as 0, that's what we're going with (even if we could detect multiple sockets from sysfs), and IMO it should stay that way unless there is a strong argument otherwise. We force the user to configure their system correctly as it is, and I see no reason to second-guess user's BIOS configuration otherwise.
<snipped> >> >>> I recently looked into how Intel's Sub-NUMA Clustering would work >>> within >>> DPDK, and found that I actually didn't have to do anything, because the >>> SNC "clusters" present themselves as NUMA nodes, which DPDK already >>> supports natively. >> >> yes, this is correct. In Intel Xeon Platinum BIOS one can enable >> `Cluster per NUMA` as `1,2 or4`. >> >> This divides the tiles into Sub-Numa parition, each having separate >> lcores,memory controllers, PCIe >> >> and accelerator. >> >>> >>> Does AMD's implementation of chiplets not report themselves as separate >>> NUMA nodes? >> >> In AMD EPYC Soc, this is different. There are 2 BIOS settings, namely >> >> 1. NPS: `Numa Per Socket` which allows the IO tile (memory, PCIe and >> Accelerator) to be partitioned as Numa 0, 1, 2 or 4. >> >> 2. L3 as NUMA: `L3 cache of CPU tiles as individual NUMA`. This allows >> all CPU tiles to be independent NUMA cores. >> >> >> The above settings are possible because CPU is independent from IO tile. >> Thus allowing 4 combinations be available for use. > > Sure, but presumably if the user wants to distinguish this, they have to > configure their system appropriately. If user wants to take advantage of > L3 as NUMA (which is what your patch proposes), then they can enable the > BIOS knob and get that functionality for free. DPDK already supports > this. > The intend of the RFC is to introduce the ability to select lcore within the same L3 cache whether the BIOS is set or unset for `L3 as NUMA`. This is also achieved and tested on platforms which advertises via sysfs by OS kernel. Thus eliminating the dependency on hwloc and libuma which can be different versions in different distros. >> >> These are covered in the tuning gudie for the SoC in 12. How to get best >> performance on AMD platform — Data Plane Development Kit 24.07.0 >> documentation (dpdk.org) >> <https://doc.dpdk.org/guides/linux_gsg/amd_platform.html>. >> >> >>> Because if it does, I don't really think any changes are >>> required because NUMA nodes would give you the same thing, would it >>> not? >> >> I have a different opinion to this outlook. An end user can >> >> 1. Identify the lcores and it's NUMA user `usertools/cpu-layout.py` > > I recently submitted an enhacement for CPU layout script to print out > NUMA separately from physical socket [1]. > > [1] > https://patches.dpdk.org/project/dpdk/patch/40cf4ee32f15952457ac5526cfce64728bd13d32.1724323106.git.anatoly.burakov@intel.com/ > > > I believe when "L3 as NUMA" is enabled in BIOS, the script will display > both physical package ID as well as NUMA nodes reported by the system, > which will be different from physical package ID, and which will display > information you were looking for. As AMD we had submitted earlier work on the same via usertools: enhance logic to display NUMA - Patchwork (dpdk.org) <https://patchwork.dpdk.org/project/dpdk/patch/20220326073207.489694-1-vipin.varghese@amd.com/>. this clearly were distinguishing NUMA and Physical socket. > >> >> 2. But it is core mask in eal arguments which makes the threads >> available to be used in a process. > > See above: if the OS already reports NUMA information, this is not a > problem to be solved, CPU layout script can give this information to the > user. Agreed, but as pointed out in case of Intel Xeon Platinum SPR, the tile consists of cpu, memory, pcie and accelerator. hence setting the BIOS option `Cluster per NUMA` the OS kernel & libnuma display appropriate Domain with memory, pcie and cpu. 
In case of AMD SoC, libnuma for CPU is different from memory NUMA per socket. > >> >> 3. there are no API which distinguish L3 numa domain. Function >> `rte_socket_id >> <https://doc.dpdk.org/api/rte__lcore_8h.html#a7c8da4664df26a64cf05dc508a4f26df>` >> for CPU tiles like AMD SoC will return physical socket. > > Sure, but I would think the answer to that would be to introduce an API > to distinguish between NUMA (socket ID in DPDK parlance) and package > (physical socket ID in the "traditional NUMA" sense). Once we can > distinguish between those, DPDK can just rely on NUMA information > provided by the OS, while still being capable of identifying physical > sockets if the user so desires. Agreed, +1 for the idea for physcial socket and changes in library to exploit the same. > > I am actually going to introduce API to get *physical socket* (as > opposed to NUMA node) in the next few days. > But how does it solve the end customer issues 1. if there are multiple NIC or Accelerator on multiple socket, but IO tile is partitioned to Sub Domain. 2. If RTE_FLOW steering is applied on NIC which needs to processed under same L3 - reduces noisy neighbor and better cache hits 3, for PKT-distribute library which needs to run within same worker lcore set as RX-Distributor-TX. Current RFC suggested addresses the above, by helping the end users to identify the lcores withing same L3 domain under a NUMA|Physical socket irresepctive of BIOS setting. >> >> >> Example: In AMD EPYC Genoa, there are total of 13 tiles. 12 CPU tiles >> and 1 IO tile. Setting >> >> 1. NPS to 4 will divide the memory, PCIe and accelerator into 4 domain. >> While the all CPU will appear as single NUMA but each 12 tile having >> independent L3 caches. >> >> 2. Setting `L3 as NUMA` allows each tile to appear as separate L3 >> clusters. >> >> >> Hence, adding an API which allows to select available lcores based on >> Split L3 is essential irrespective of the BIOS setting. >> > > I think the crucial issue here is the "irrespective of BIOS setting" > bit. That is what the current RFC achieves. > If EAL is getting into the game of figuring out exact intricacies > of physical layout of the system, then there's a lot more work to be > done as there are lots of different topologies, as other people have > already commented, and such an API needs *a lot* of thought put into it. There is standard sysfs interfaces for CPU cache topology (OS kernel), as mentioned earlier problem with hwloc and libnuma is different distros has different versions. There are solutions for specific SoC architectures as per latest comment. But we always can limit the API to selected SoC, while all other SoC when invoked will invoke rte_get_next_lcore. > > If, on the other hand, we leave this issue to the kernel, and only > gather NUMA information provided by the kernel, then nothing has to be > done - DPDK already supports all of this natively, provided the user has > configured the system correctly. As shared above, we tried to bring this usertools: enhance logic to display NUMA - Patchwork (dpdk.org) <https://patchwork.dpdk.org/project/dpdk/patch/20220326073207.489694-1-vipin.varghese@amd.com/>. DPDK support for lcore is getting enhanced and allowing user to use more favorable lcores within same Tile. > > Moreover, arguably DPDK already works that way: technically you can get > physical socket information even absent of NUMA support in BIOS, but > DPDK does not do that. 
Instead, if OS reports NUMA node as 0, that's > what we're going with (even if we could detect multiple sockets from > sysfs), In the above argument, it is shared as OS kernel detects NUMA or domain, which is used by DPDK right? The RFC suggested also adheres to the same, what OS sees. can you please explain for better understanding what in the RFC is doing differently? > and IMO it should stay that way unless there is a strong > argument otherwise. Totally agree, that is what the RFC is also doing, based on what OS sees as NUMA we are using it. Only addition is within the NUMA if there are split LLC, allow selection of those lcores. Rather than blindly choosing lcore using rte_lcore_get_next. > We force the user to configure their system > correctly as it is, and I see no reason to second-guess user's BIOS > configuration otherwise. Again iterating, the changes suggested in RFC are agnostic to what BIOS options are used, It is to earlier question `is AMD configuration same as Intel tile` I have explained it is not using BIOS setting. > > -- > Thanks, > Anatoly >
On 9/2/2024 5:33 PM, Varghese, Vipin wrote: > <snipped> >>> >>>> I recently looked into how Intel's Sub-NUMA Clustering would work >>>> within >>>> DPDK, and found that I actually didn't have to do anything, because the >>>> SNC "clusters" present themselves as NUMA nodes, which DPDK already >>>> supports natively. >>> >>> yes, this is correct. In Intel Xeon Platinum BIOS one can enable >>> `Cluster per NUMA` as `1,2 or4`. >>> >>> This divides the tiles into Sub-Numa parition, each having separate >>> lcores,memory controllers, PCIe >>> >>> and accelerator. >>> >>>> >>>> Does AMD's implementation of chiplets not report themselves as separate >>>> NUMA nodes? >>> >>> In AMD EPYC Soc, this is different. There are 2 BIOS settings, namely >>> >>> 1. NPS: `Numa Per Socket` which allows the IO tile (memory, PCIe and >>> Accelerator) to be partitioned as Numa 0, 1, 2 or 4. >>> >>> 2. L3 as NUMA: `L3 cache of CPU tiles as individual NUMA`. This allows >>> all CPU tiles to be independent NUMA cores. >>> >>> >>> The above settings are possible because CPU is independent from IO tile. >>> Thus allowing 4 combinations be available for use. >> >> Sure, but presumably if the user wants to distinguish this, they have to >> configure their system appropriately. If user wants to take advantage of >> L3 as NUMA (which is what your patch proposes), then they can enable the >> BIOS knob and get that functionality for free. DPDK already supports >> this. >> > The intend of the RFC is to introduce the ability to select lcore within > the same > > L3 cache whether the BIOS is set or unset for `L3 as NUMA`. This is also > achieved > > and tested on platforms which advertises via sysfs by OS kernel. Thus > eliminating > > the dependency on hwloc and libuma which can be different versions in > different distros. But we do depend on libnuma, so we might as well depend on it? Are there different versions of libnuma that interfere with what you're trying to do? You keep coming back to this "whether the BIOS is set or unset" for L3 as NUMA, but I'm still unclear as to what issues your patch is solving assuming "knob is set". When the system is configured correctly, it already works and reports cores as part of NUMA nodes (as L3) correctly. It is only when the system is configured *not* to do that that issues arise, is it not? In which case IMO the easier solution would be to just tell the user to enable that knob in BIOS? > > >>> >>> These are covered in the tuning gudie for the SoC in 12. How to get best >>> performance on AMD platform — Data Plane Development Kit 24.07.0 >>> documentation (dpdk.org) >>> <https://doc.dpdk.org/guides/linux_gsg/amd_platform.html>. >>> >>> >>>> Because if it does, I don't really think any changes are >>>> required because NUMA nodes would give you the same thing, would it >>>> not? >>> >>> I have a different opinion to this outlook. An end user can >>> >>> 1. Identify the lcores and it's NUMA user `usertools/cpu-layout.py` >> >> I recently submitted an enhacement for CPU layout script to print out >> NUMA separately from physical socket [1]. >> >> [1] >> https://patches.dpdk.org/project/dpdk/patch/40cf4ee32f15952457ac5526cfce64728bd13d32.1724323106.git.anatoly.burakov@intel.com/ >> >> I believe when "L3 as NUMA" is enabled in BIOS, the script will display >> both physical package ID as well as NUMA nodes reported by the system, >> which will be different from physical package ID, and which will display >> information you were looking for. 
> > As AMD we had submitted earlier work on the same via usertools: enhance > logic to display NUMA - Patchwork (dpdk.org) > <https://patchwork.dpdk.org/project/dpdk/patch/20220326073207.489694-1-vipin.varghese@amd.com/>. > > this clearly were distinguishing NUMA and Physical socket. Oh, cool, I didn't see that patch. I would argue my visual format is more readable though, so perhaps we can get that in :) > Agreed, but as pointed out in case of Intel Xeon Platinum SPR, the tile > consists of cpu, memory, pcie and accelerator. > > hence setting the BIOS option `Cluster per NUMA` the OS kernel & libnuma > display appropriate Domain with memory, pcie and cpu. > > > In case of AMD SoC, libnuma for CPU is different from memory NUMA per > socket. I'm curious how does the kernel handle this then, and what are you getting from libnuma. You seem to be implying that there are two different NUMA nodes on your SoC, and either kernel or libnuma are in conflict as to what belongs to what NUMA node? > >> >>> >>> 3. there are no API which distinguish L3 numa domain. Function >>> `rte_socket_id >>> <https://doc.dpdk.org/api/rte__lcore_8h.html#a7c8da4664df26a64cf05dc508a4f26df>` for CPU tiles like AMD SoC will return physical socket. >> >> Sure, but I would think the answer to that would be to introduce an API >> to distinguish between NUMA (socket ID in DPDK parlance) and package >> (physical socket ID in the "traditional NUMA" sense). Once we can >> distinguish between those, DPDK can just rely on NUMA information >> provided by the OS, while still being capable of identifying physical >> sockets if the user so desires. > Agreed, +1 for the idea for physcial socket and changes in library to > exploit the same. >> >> I am actually going to introduce API to get *physical socket* (as >> opposed to NUMA node) in the next few days. >> > But how does it solve the end customer issues > > 1. if there are multiple NIC or Accelerator on multiple socket, but IO > tile is partitioned to Sub Domain. At least on Intel platforms, NUMA node gets assigned correctly - that is, if my Xeon with SNC enabled has NUMA nodes 3,4 on socket 1, and there's a NIC connected to socket 1, it's going to show up as being on NUMA node 3 or 4 depending on where exactly I plugged it in. Everything already works as expected, and there is no need for any changes for Intel platforms (at least none that I can see). My proposed API is really for those users who wish to explicitly allow for reserving memory/cores on "the same physical socket", as "on the same tile" is already taken care of by NUMA nodes. > > 2. If RTE_FLOW steering is applied on NIC which needs to processed under > same L3 - reduces noisy neighbor and better cache hits > > 3, for PKT-distribute library which needs to run within same worker > lcore set as RX-Distributor-TX. > Same as above: on Intel platforms, NUMA nodes already solve this. <snip> > Totally agree, that is what the RFC is also doing, based on what OS sees > as NUMA we are using it. > > Only addition is within the NUMA if there are split LLC, allow selection > of those lcores. Rather than blindly choosing lcore using > > rte_lcore_get_next. It feels like we're working around a problem that shouldn't exist in the first place, because kernel should already report this information. 
Within NUMA subsystem, there is sysfs node "distance" that, at least on Intel platforms and in certain BIOS configuration, reports distance between NUMA nodes, from which one can make inferences about how far a specific NUMA node is from any other NUMA node. This could have been used to encode L3 cache information. Do AMD platforms not do that? In that case, "lcore next" for a particular socket ID (NUMA node, in reality) should already get us any cores that are close to each other, because all of this information is already encoded in NUMA nodes by the system. I feel like there's a disconnect between my understanding of the problem space, and yours, so I'm going to ask a very basic question: Assuming the user has configured their AMD system correctly (i.e. enabled L3 as NUMA), are there any problem to be solved by adding a new API? Does the system not report each L3 as a separate NUMA node? > > >> We force the user to configure their system >> correctly as it is, and I see no reason to second-guess user's BIOS >> configuration otherwise. > > Again iterating, the changes suggested in RFC are agnostic to what BIOS > options are used, But that is exactly my contention: are we not effectively working around users' misconfiguration of a system then?
On 2024-09-02 02:39, Varghese, Vipin wrote: > <snipped> > > Thank you Mattias for the comments and question, please let me try to > explain the same below > >> We shouldn't have a separate CPU/cache hierarchy API instead? > > Based on the intention to bring in CPU lcores which share same L3 (for > better cache hits and less noisy neighbor) current API focuses on using > > Last Level Cache. But if the suggestion is `there are SoC where L2 cache > are also shared, and the new API should be provisioned`, I am also > > comfortable with the thought. > Rather than some AMD special case API hacked into <rte_lcore.h>, I think we are better off with no DPDK API at all for this kind of functionality. A DPDK CPU/memory hierarchy topology API very much makes sense, but it should be reasonably generic and complete from the start. >> >> Could potentially be built on the 'hwloc' library. > > There are 3 reason on AMD SoC we did not explore this path, reasons are > > 1. depending n hwloc version and kernel version certain SoC hierarchies > are not available > > 2. CPU NUMA and IO (memory & PCIe) NUMA are independent on AMD Epyc Soc. > > 3. adds the extra dependency layer of library layer to be made available > to work. > > > hence we have tried to use Linux Documented generic layer of `sysfs CPU > cache`. > > I will try to explore more on hwloc and check if other libraries within > DPDK leverages the same. > >> >> I much agree cache/core topology may be of interest of the application >> (or a work scheduler, like a DPDK event device), but it's not limited to >> LLC. It may well be worthwhile to care about which cores shares L2 >> cache, for example. Not sure the RTE_LCORE_FOREACH_* approach scales. > > yes, totally understand as some SoC, multiple lcores shares same L2 cache. > > > Can we rework the API to be rte_get_cache_<function> where user argument > is desired lcore index. > > 1. index-1: SMT threads > > 2. index-2: threads sharing same L2 cache > > 3. index-3: threads sharing same L3 cache > > 4. index-MAX: identify the threads sharing last level cache. > >> >>> < Function: Purpose > >>> --------------------- >>> - rte_get_llc_first_lcores: Retrieves all the first lcores in the >>> shared LLC. >>> - rte_get_llc_lcore: Retrieves all lcores that share the LLC. >>> - rte_get_llc_n_lcore: Retrieves the first n or skips the first n >>> lcores in the shared LLC. >>> >>> < MACRO: Purpose > >>> ------------------ >>> RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from >>> each LLC. >>> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first worker >>> lcore from each LLC. >>> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on hint >>> (lcore id). >>> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC >>> while skipping first worker. >>> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores >>> from each LLC. >>> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then >>> iterates through reaming lcores in each LLC. >>> > While the MACRO are simple wrapper invoking appropriate API. can this be > worked out in this fashion? > > <snipped>
On Wed, 4 Sep 2024 11:30:59 +0200 Mattias Rönnblom <hofors@lysator.liu.se> wrote: > On 2024-09-02 02:39, Varghese, Vipin wrote: > > <snipped> > > > > Thank you Mattias for the comments and question, please let me try to > > explain the same below > > > >> We shouldn't have a separate CPU/cache hierarchy API instead? > > > > Based on the intention to bring in CPU lcores which share same L3 (for > > better cache hits and less noisy neighbor) current API focuses on using > > > > Last Level Cache. But if the suggestion is `there are SoC where L2 cache > > are also shared, and the new API should be provisioned`, I am also > > > > comfortable with the thought. > > > > Rather than some AMD special case API hacked into <rte_lcore.h>, I think > we are better off with no DPDK API at all for this kind of functionality. > > A DPDK CPU/memory hierarchy topology API very much makes sense, but it > should be reasonably generic and complete from the start. Agreed. This one of those cases where the existing project hwloc which is part of open-mpi is more complete and well supported. It supports multiple OS's and can deal with more quirks. https://github.com/open-mpi/hwloc
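For comparison, a sketch of what the hwloc-based approach suggested here could look like, assuming hwloc >= 2.0 (where the L3 is a first-class HWLOC_OBJ_L3CACHE object type); it prints the OS CPU ids grouped per L3 cache.

#include <stdio.h>
#include <hwloc.h>

int
main(void)
{
	hwloc_topology_t topo;
	unsigned int cpu;
	int i, n;

	hwloc_topology_init(&topo);
	hwloc_topology_load(topo);

	/* One object per L3 cache; its cpuset lists the cores sharing it. */
	n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L3CACHE);
	for (i = 0; i < n; i++) {
		hwloc_obj_t l3 = hwloc_get_obj_by_type(topo, HWLOC_OBJ_L3CACHE, i);

		printf("L3 #%d:", i);
		hwloc_bitmap_foreach_begin(cpu, l3->cpuset)
			printf(" %u", cpu);
		hwloc_bitmap_foreach_end();
		printf("\n");
	}

	hwloc_topology_destroy(topo);
	return 0;
}

Whether pulling hwloc in as an EAL dependency is acceptable is exactly the trade-off being debated in this thread.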
On 9/3/2024 9:50 AM, Burakov, Anatoly wrote: > On 9/2/2024 5:33 PM, Varghese, Vipin wrote: >> <snipped> >>>> >>>>> I recently looked into how Intel's Sub-NUMA Clustering would work >>>>> within >>>>> DPDK, and found that I actually didn't have to do anything, because >>>>> the >>>>> SNC "clusters" present themselves as NUMA nodes, which DPDK already >>>>> supports natively. >>>> >>>> yes, this is correct. In Intel Xeon Platinum BIOS one can enable >>>> `Cluster per NUMA` as `1,2 or4`. >>>> >>>> This divides the tiles into Sub-Numa parition, each having separate >>>> lcores,memory controllers, PCIe >>>> >>>> and accelerator. >>>> >>>>> >>>>> Does AMD's implementation of chiplets not report themselves as >>>>> separate >>>>> NUMA nodes? >>>> >>>> In AMD EPYC Soc, this is different. There are 2 BIOS settings, namely >>>> >>>> 1. NPS: `Numa Per Socket` which allows the IO tile (memory, PCIe and >>>> Accelerator) to be partitioned as Numa 0, 1, 2 or 4. >>>> >>>> 2. L3 as NUMA: `L3 cache of CPU tiles as individual NUMA`. This allows >>>> all CPU tiles to be independent NUMA cores. >>>> >>>> >>>> The above settings are possible because CPU is independent from IO >>>> tile. >>>> Thus allowing 4 combinations be available for use. >>> >>> Sure, but presumably if the user wants to distinguish this, they have to >>> configure their system appropriately. If user wants to take advantage of >>> L3 as NUMA (which is what your patch proposes), then they can enable the >>> BIOS knob and get that functionality for free. DPDK already supports >>> this. >>> >> The intend of the RFC is to introduce the ability to select lcore >> within the same >> >> L3 cache whether the BIOS is set or unset for `L3 as NUMA`. This is >> also achieved >> >> and tested on platforms which advertises via sysfs by OS kernel. Thus >> eliminating >> >> the dependency on hwloc and libuma which can be different versions in >> different distros. > > But we do depend on libnuma, so we might as well depend on it? Are there > different versions of libnuma that interfere with what you're trying to > do? You keep coming back to this "whether the BIOS is set or unset" for > L3 as NUMA, but I'm still unclear as to what issues your patch is > solving assuming "knob is set". When the system is configured correctly, > it already works and reports cores as part of NUMA nodes (as L3) > correctly. It is only when the system is configured *not* to do that > that issues arise, is it not? In which case IMO the easier solution > would be to just tell the user to enable that knob in BIOS? > >> >> >>>> >>>> These are covered in the tuning gudie for the SoC in 12. How to get >>>> best >>>> performance on AMD platform — Data Plane Development Kit 24.07.0 >>>> documentation (dpdk.org) >>>> <https://doc.dpdk.org/guides/linux_gsg/amd_platform.html>. >>>> >>>> >>>>> Because if it does, I don't really think any changes are >>>>> required because NUMA nodes would give you the same thing, would it >>>>> not? >>>> >>>> I have a different opinion to this outlook. An end user can >>>> >>>> 1. Identify the lcores and it's NUMA user `usertools/cpu-layout.py` >>> >>> I recently submitted an enhacement for CPU layout script to print out >>> NUMA separately from physical socket [1]. 
>>> >>> [1] >>> https://patches.dpdk.org/project/dpdk/ >>> patch/40cf4ee32f15952457ac5526cfce64728bd13d32.1724323106.git.anatoly.burakov@intel.com/ >>> >>> I believe when "L3 as NUMA" is enabled in BIOS, the script will display >>> both physical package ID as well as NUMA nodes reported by the system, >>> which will be different from physical package ID, and which will display >>> information you were looking for. >> >> As AMD we had submitted earlier work on the same via usertools: >> enhance logic to display NUMA - Patchwork (dpdk.org) <https:// >> patchwork.dpdk.org/project/dpdk/patch/20220326073207.489694-1- >> vipin.varghese@amd.com/>. >> >> this clearly were distinguishing NUMA and Physical socket. > > Oh, cool, I didn't see that patch. I would argue my visual format is > more readable though, so perhaps we can get that in :) > >> Agreed, but as pointed out in case of Intel Xeon Platinum SPR, the >> tile consists of cpu, memory, pcie and accelerator. >> >> hence setting the BIOS option `Cluster per NUMA` the OS kernel & >> libnuma display appropriate Domain with memory, pcie and cpu. >> >> >> In case of AMD SoC, libnuma for CPU is different from memory NUMA per >> socket. > > I'm curious how does the kernel handle this then, and what are you > getting from libnuma. You seem to be implying that there are two > different NUMA nodes on your SoC, and either kernel or libnuma are in > conflict as to what belongs to what NUMA node? > >> >>> >>>> >>>> 3. there are no API which distinguish L3 numa domain. Function >>>> `rte_socket_id >>>> <https://doc.dpdk.org/api/ >>>> rte__lcore_8h.html#a7c8da4664df26a64cf05dc508a4f26df>` for CPU tiles >>>> like AMD SoC will return physical socket. >>> >>> Sure, but I would think the answer to that would be to introduce an API >>> to distinguish between NUMA (socket ID in DPDK parlance) and package >>> (physical socket ID in the "traditional NUMA" sense). Once we can >>> distinguish between those, DPDK can just rely on NUMA information >>> provided by the OS, while still being capable of identifying physical >>> sockets if the user so desires. >> Agreed, +1 for the idea for physcial socket and changes in library to >> exploit the same. >>> >>> I am actually going to introduce API to get *physical socket* (as >>> opposed to NUMA node) in the next few days. >>> >> But how does it solve the end customer issues >> >> 1. if there are multiple NIC or Accelerator on multiple socket, but IO >> tile is partitioned to Sub Domain. > > At least on Intel platforms, NUMA node gets assigned correctly - that > is, if my Xeon with SNC enabled has NUMA nodes 3,4 on socket 1, and > there's a NIC connected to socket 1, it's going to show up as being on > NUMA node 3 or 4 depending on where exactly I plugged it in. Everything > already works as expected, and there is no need for any changes for > Intel platforms (at least none that I can see). > > My proposed API is really for those users who wish to explicitly allow > for reserving memory/cores on "the same physical socket", as "on the > same tile" is already taken care of by NUMA nodes. > >> >> 2. If RTE_FLOW steering is applied on NIC which needs to processed >> under same L3 - reduces noisy neighbor and better cache hits >> >> 3, for PKT-distribute library which needs to run within same worker >> lcore set as RX-Distributor-TX. >> > > Same as above: on Intel platforms, NUMA nodes already solve this. > > <snip> > >> Totally agree, that is what the RFC is also doing, based on what OS >> sees as NUMA we are using it. 
>> >> Only addition is within the NUMA if there are split LLC, allow >> selection of those lcores. Rather than blindly choosing lcore using >> >> rte_lcore_get_next. > > It feels like we're working around a problem that shouldn't exist in the > first place, because kernel should already report this information. > Within NUMA subsystem, there is sysfs node "distance" that, at least on > Intel platforms and in certain BIOS configuration, reports distance > between NUMA nodes, from which one can make inferences about how far a > specific NUMA node is from any other NUMA node. This could have been > used to encode L3 cache information. Do AMD platforms not do that? In > that case, "lcore next" for a particular socket ID (NUMA node, in > reality) should already get us any cores that are close to each other, > because all of this information is already encoded in NUMA nodes by the > system. > > I feel like there's a disconnect between my understanding of the problem > space, and yours, so I'm going to ask a very basic question: > > Assuming the user has configured their AMD system correctly (i.e. > enabled L3 as NUMA), are there any problem to be solved by adding a new > API? Does the system not report each L3 as a separate NUMA node? > Hi Anatoly, Let me try to answer. To start with, Intel "Sub-NUMA Clustering" and AMD NUMA is different, as far as I understand SNC is more similar to more classic physical socket based NUMA. Following is the AMD CPU: ┌─────┐┌─────┐┌──────────┐┌─────┐┌─────┐ │ ││ ││ ││ ││ │ │ ││ ││ ││ ││ │ │TILE1││TILE2││ ││TILE5││TILE6│ │ ││ ││ ││ ││ │ │ ││ ││ ││ ││ │ │ ││ ││ ││ ││ │ └─────┘└─────┘│ IO │└─────┘└─────┘ ┌─────┐┌─────┐│ TILE │┌─────┐┌─────┐ │ ││ ││ ││ ││ │ │ ││ ││ ││ ││ │ │TILE3││TILE4││ ││TILE7││TILE8│ │ ││ ││ ││ ││ │ │ ││ ││ ││ ││ │ │ ││ ││ ││ ││ │ └─────┘└─────┘└──────────┘└─────┘└─────┘ Each 'Tile' has multiple cores, and 'IO Tile' has memory controller, bus controllers etc.. When NPS=x configured in bios, IO tile resources are split and each seen as a NUMA node. Following is NPS=4 ┌─────┐┌─────┐┌──────────┐┌─────┐┌─────┐ │ ││ ││ . ││ ││ │ │ ││ ││ . ││ ││ │ │TILE1││TILE2││ . ││TILE5││TILE6│ │ ││ ││NUMA .NUMA││ ││ │ │ ││ ││ 0 . 1 ││ ││ │ │ ││ ││ . ││ ││ │ └─────┘└─────┘│ . │└─────┘└─────┘ ┌─────┐┌─────┐│..........│┌─────┐┌─────┐ │ ││ ││ . ││ ││ │ │ ││ ││NUMA .NUMA││ ││ │ │TILE3││TILE4││ 2 . 3 ││TILE7││TILE8│ │ ││ ││ . ││ ││ │ │ ││ ││ . ││ ││ │ │ ││ ││ . ││ ││ │ └─────┘└─────┘└─────.────┘└─────┘└─────┘ Benefit of this is approach is now all cores has to access all NUMA without any penalty. Like a DPDK application can use cores from 'TILE1', 'TILE4' & 'TILE7' to access to NUMA0 (or any NUMA) resources in high performance. This is different than SNC where cores access to cross NUMA resources hit by performance penalty. Now, although which tile cores come from doesn't matter from NUMA perspective, it may matter (based on workload) to have them under same LLC. One way to make sure all cores are under same LLC, is to enable "L3 as NUMA" BIOS option, which will make each TILE shown as a different NUMA, and user select cores from one NUMA. This is sufficient up to some point, but not enough when application needs number of cores that uses multiple tiles. Assume each tile has 8 cores, and application needs 24 cores, when user provide all cores from TILE1, TILE2 & TILE3, in DPDK right now there is now way for application to figure out how to group/select these cores to use cores efficiently. 
Indeed, this is what Vipin is enabling: from a given core, he is finding the list of cores that will work efficiently with that core. In this perspective it is not really related to the NUMA configuration, and not really specific to AMD, as the defined Linux sysfs interface is used for this.

There are other architectures around that have similar NUMA configurations, and they can use the same logic; at worst we can introduce architecture-specific code so that all architectures have a way to find the other cores that work more efficiently with a given core. This is a useful feature for DPDK.

Let's look at another example: an application uses 24 cores in a graph-library-like usage, where we want to group each three cores to process a graph node. The application needs a way to select which three cores work most efficiently with each other; that is what this patch enables. In this case enabling "L3 as NUMA" does not help at all. With this patch both BIOS configurations work, but of course the user should select the cores to provide to the application based on the configuration.

And we can even improve this effective core selection; as Mattias suggested, we can select cores that share L2 caches with an expansion of this patch. This is unrelated to NUMA, and again it does not introduce architecture details to DPDK, as the implementation already relies on the Linux sysfs interface.

I hope it clarifies a little more.

Thanks,
ferruh
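A hypothetical sketch of the 24-core graph example above on top of the proposed iterator; the two-argument RTE_LCORE_FOREACH_LLC_WORKER form and its hint parameter are inferred from the cover-letter description and may not match the final patch. It carves the LLC-local workers into groups of three, one group per graph node.

#include <stdio.h>
#include <rte_lcore.h>

#define CORES_PER_NODE 3

/* Hypothetical: macro signature inferred from the RFC cover letter.
 * The first three lcores sharing the hint's LLC feed graph node 0,
 * the next three feed node 1, and so on. */
static void
map_workers_to_graph_nodes(unsigned int hint_lcore)
{
	unsigned int lcore_id, idx = 0;

	RTE_LCORE_FOREACH_LLC_WORKER(lcore_id, hint_lcore) {
		printf("graph node %u <- lcore %u\n",
		       idx / CORES_PER_NODE, lcore_id);
		idx++;
	}
}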
On 9/5/2024 3:05 PM, Ferruh Yigit wrote: > On 9/3/2024 9:50 AM, Burakov, Anatoly wrote: >> On 9/2/2024 5:33 PM, Varghese, Vipin wrote: >>> <snipped> >>>>> Hi Ferruh, >> >> I feel like there's a disconnect between my understanding of the problem >> space, and yours, so I'm going to ask a very basic question: >> >> Assuming the user has configured their AMD system correctly (i.e. >> enabled L3 as NUMA), are there any problem to be solved by adding a new >> API? Does the system not report each L3 as a separate NUMA node? >> > > Hi Anatoly, > > Let me try to answer. > > To start with, Intel "Sub-NUMA Clustering" and AMD NUMA is different, as > far as I understand SNC is more similar to more classic physical socket > based NUMA. > > Following is the AMD CPU: > ┌─────┐┌─────┐┌──────────┐┌─────┐┌─────┐ > │ ││ ││ ││ ││ │ > │ ││ ││ ││ ││ │ > │TILE1││TILE2││ ││TILE5││TILE6│ > │ ││ ││ ││ ││ │ > │ ││ ││ ││ ││ │ > │ ││ ││ ││ ││ │ > └─────┘└─────┘│ IO │└─────┘└─────┘ > ┌─────┐┌─────┐│ TILE │┌─────┐┌─────┐ > │ ││ ││ ││ ││ │ > │ ││ ││ ││ ││ │ > │TILE3││TILE4││ ││TILE7││TILE8│ > │ ││ ││ ││ ││ │ > │ ││ ││ ││ ││ │ > │ ││ ││ ││ ││ │ > └─────┘└─────┘└──────────┘└─────┘└─────┘ > > Each 'Tile' has multiple cores, and 'IO Tile' has memory controller, bus > controllers etc.. > > When NPS=x configured in bios, IO tile resources are split and each seen > as a NUMA node. > > Following is NPS=4 > ┌─────┐┌─────┐┌──────────┐┌─────┐┌─────┐ > │ ││ ││ . ││ ││ │ > │ ││ ││ . ││ ││ │ > │TILE1││TILE2││ . ││TILE5││TILE6│ > │ ││ ││NUMA .NUMA││ ││ │ > │ ││ ││ 0 . 1 ││ ││ │ > │ ││ ││ . ││ ││ │ > └─────┘└─────┘│ . │└─────┘└─────┘ > ┌─────┐┌─────┐│..........│┌─────┐┌─────┐ > │ ││ ││ . ││ ││ │ > │ ││ ││NUMA .NUMA││ ││ │ > │TILE3││TILE4││ 2 . 3 ││TILE7││TILE8│ > │ ││ ││ . ││ ││ │ > │ ││ ││ . ││ ││ │ > │ ││ ││ . ││ ││ │ > └─────┘└─────┘└─────.────┘└─────┘└─────┘ > > Benefit of this is approach is now all cores has to access all NUMA > without any penalty. Like a DPDK application can use cores from 'TILE1', > 'TILE4' & 'TILE7' to access to NUMA0 (or any NUMA) resources in high > performance. > This is different than SNC where cores access to cross NUMA resources > hit by performance penalty. > > Now, although which tile cores come from doesn't matter from NUMA > perspective, it may matter (based on workload) to have them under same LLC. > > One way to make sure all cores are under same LLC, is to enable "L3 as > NUMA" BIOS option, which will make each TILE shown as a different NUMA, > and user select cores from one NUMA. > This is sufficient up to some point, but not enough when application > needs number of cores that uses multiple tiles. > > Assume each tile has 8 cores, and application needs 24 cores, when user > provide all cores from TILE1, TILE2 & TILE3, in DPDK right now there is > now way for application to figure out how to group/select these cores to > use cores efficiently. > > Indeed this is what Vipin is enabling, from a core, he is finding list > of cores that will work efficiently with this core. In this perspective > this is nothing really related to NUMA configuration, and nothing really > specific to AMD, as defined Linux sysfs interface is used for this. > > There are other architectures around that has similar NUMA configuration > and they can also use same logic, at worst we can introduce an > architecture specific code that all architectures can have a way to find > other cores that works more efficient with given core. This is a useful > feature for DPDK. 
> > Lets looks into another example, application uses 24 cores in an graph > library like usage, that we want to group each three cores to process a > graph node. Application needs to a way to select which three cores works > most efficient with eachother, that is what this patch enables. In this > case enabling "L3 as NUMA" does not help at all. With this patch both > bios config works, but of course user should select cores to provide > application based on configuration. > > > And we can even improve this effective core selection, like as Mattias > suggested we can select cores that share L2 caches, with expansion of > this patch. This is unrelated to NUMA, and again it is not introducing > architecture details to DPDK as this implementation already relies on > Linux sysfs interface. > > I hope it clarifies a little more. > > > Thanks, > ferruh > Yes, this does help clarify things a lot as to why current NUMA support would be insufficient to express what you are describing. However, in that case I would echo sentiment others have expressed already as this kind of deep sysfs parsing doesn't seem like it would be in scope for EAL, it sounds more like something a sysadmin/orchestration (or the application itself) would do. I mean, in principle I'm not opposed to having such an API, it just seems like the abstraction would perhaps need to be a bit more robust than directly referencing cache structure? Maybe something that degenerates into NUMA nodes would be better, so that applications wouldn't have to *specifically* worry about cache locality but instead have a more generic API they can use to group cores together?
On 9/5/2024 3:45 PM, Burakov, Anatoly wrote: > On 9/5/2024 3:05 PM, Ferruh Yigit wrote: >> On 9/3/2024 9:50 AM, Burakov, Anatoly wrote: >>> On 9/2/2024 5:33 PM, Varghese, Vipin wrote: >>>> <snipped> >>>>>> > > Hi Ferruh, > >>> >>> I feel like there's a disconnect between my understanding of the problem >>> space, and yours, so I'm going to ask a very basic question: >>> >>> Assuming the user has configured their AMD system correctly (i.e. >>> enabled L3 as NUMA), are there any problem to be solved by adding a new >>> API? Does the system not report each L3 as a separate NUMA node? >>> >> >> Hi Anatoly, >> >> Let me try to answer. >> >> To start with, Intel "Sub-NUMA Clustering" and AMD NUMA is different, as >> far as I understand SNC is more similar to more classic physical socket >> based NUMA. >> >> Following is the AMD CPU: >> ┌─────┐┌─────┐┌──────────┐┌─────┐┌─────┐ >> │ ││ ││ ││ ││ │ >> │ ││ ││ ││ ││ │ >> │TILE1││TILE2││ ││TILE5││TILE6│ >> │ ││ ││ ││ ││ │ >> │ ││ ││ ││ ││ │ >> │ ││ ││ ││ ││ │ >> └─────┘└─────┘│ IO │└─────┘└─────┘ >> ┌─────┐┌─────┐│ TILE │┌─────┐┌─────┐ >> │ ││ ││ ││ ││ │ >> │ ││ ││ ││ ││ │ >> │TILE3││TILE4││ ││TILE7││TILE8│ >> │ ││ ││ ││ ││ │ >> │ ││ ││ ││ ││ │ >> │ ││ ││ ││ ││ │ >> └─────┘└─────┘└──────────┘└─────┘└─────┘ >> >> Each 'Tile' has multiple cores, and 'IO Tile' has memory controller, bus >> controllers etc.. >> >> When NPS=x configured in bios, IO tile resources are split and each seen >> as a NUMA node. >> >> Following is NPS=4 >> ┌─────┐┌─────┐┌──────────┐┌─────┐┌─────┐ >> │ ││ ││ . ││ ││ │ >> │ ││ ││ . ││ ││ │ >> │TILE1││TILE2││ . ││TILE5││TILE6│ >> │ ││ ││NUMA .NUMA││ ││ │ >> │ ││ ││ 0 . 1 ││ ││ │ >> │ ││ ││ . ││ ││ │ >> └─────┘└─────┘│ . │└─────┘└─────┘ >> ┌─────┐┌─────┐│..........│┌─────┐┌─────┐ >> │ ││ ││ . ││ ││ │ >> │ ││ ││NUMA .NUMA││ ││ │ >> │TILE3││TILE4││ 2 . 3 ││TILE7││TILE8│ >> │ ││ ││ . ││ ││ │ >> │ ││ ││ . ││ ││ │ >> │ ││ ││ . ││ ││ │ >> └─────┘└─────┘└─────.────┘└─────┘└─────┘ >> >> Benefit of this is approach is now all cores has to access all NUMA >> without any penalty. Like a DPDK application can use cores from 'TILE1', >> 'TILE4' & 'TILE7' to access to NUMA0 (or any NUMA) resources in high >> performance. >> This is different than SNC where cores access to cross NUMA resources >> hit by performance penalty. >> >> Now, although which tile cores come from doesn't matter from NUMA >> perspective, it may matter (based on workload) to have them under same >> LLC. >> >> One way to make sure all cores are under same LLC, is to enable "L3 as >> NUMA" BIOS option, which will make each TILE shown as a different NUMA, >> and user select cores from one NUMA. >> This is sufficient up to some point, but not enough when application >> needs number of cores that uses multiple tiles. >> >> Assume each tile has 8 cores, and application needs 24 cores, when user >> provide all cores from TILE1, TILE2 & TILE3, in DPDK right now there is >> now way for application to figure out how to group/select these cores to >> use cores efficiently. >> >> Indeed this is what Vipin is enabling, from a core, he is finding list >> of cores that will work efficiently with this core. In this perspective >> this is nothing really related to NUMA configuration, and nothing really >> specific to AMD, as defined Linux sysfs interface is used for this. 
>> >> There are other architectures around that has similar NUMA configuration >> and they can also use same logic, at worst we can introduce an >> architecture specific code that all architectures can have a way to find >> other cores that works more efficient with given core. This is a useful >> feature for DPDK. >> >> Lets looks into another example, application uses 24 cores in an graph >> library like usage, that we want to group each three cores to process a >> graph node. Application needs to a way to select which three cores works >> most efficient with eachother, that is what this patch enables. In this >> case enabling "L3 as NUMA" does not help at all. With this patch both >> bios config works, but of course user should select cores to provide >> application based on configuration. >> >> >> And we can even improve this effective core selection, like as Mattias >> suggested we can select cores that share L2 caches, with expansion of >> this patch. This is unrelated to NUMA, and again it is not introducing >> architecture details to DPDK as this implementation already relies on >> Linux sysfs interface. >> >> I hope it clarifies a little more. >> >> >> Thanks, >> ferruh >> > > Yes, this does help clarify things a lot as to why current NUMA support > would be insufficient to express what you are describing. > > However, in that case I would echo sentiment others have expressed > already as this kind of deep sysfs parsing doesn't seem like it would be > in scope for EAL, it sounds more like something a sysadmin/orchestration > (or the application itself) would do. > > I mean, in principle I'm not opposed to having such an API, it just > seems like the abstraction would perhaps need to be a bit more robust > than directly referencing cache structure? Maybe something that > degenerates into NUMA nodes would be better, so that applications > wouldn't have to *specifically* worry about cache locality but instead > have a more generic API they can use to group cores together? > Unfortunately can't cover all usecases by sysadmin/orchestration (as graph usecase one above), and definitely too much HW detail for the application, that is why we required some programmatic way (APIs) for applications. And we are on the same page that, the more we can get away from architecture details in the abstraction (APIs) better it is, overall intention is to provide ways to application to find lcores works efficiently with each other. For this what do you think about slightly different API *, like: ``` rte_get_next_lcore_ex(uint i, u32 flag) ``` Based on the flag, we can grab the next eligible lcore, for this patch the flag can be `RTE_LCORE_LLC`, but options are wide and different architectures can have different grouping to benefit most from HW in a vendor agnostic way. I like the idea, what do you think about this abstraction? * Kudos to Vipin 😉
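To make the proposal above concrete, here is a rough usage sketch of the flag-based iterator for the 24-core graph example. `rte_get_next_lcore_ex`, `RTE_LCORE_LLC` and `launch_graph_node` are placeholders from this discussion, not existing DPDK API; the iteration convention simply mirrors today's `rte_get_next_lcore`, and the sketch assumes the iterator walks lcores grouped by LLC.

```c
#include <rte_lcore.h>

#define GROUP_SIZE 3

/* Hypothetical: group worker lcores into trios that share an LLC and hand
 * each trio to one graph node. None of the *_ex names exist in DPDK today. */
static void assign_graph_groups(void)
{
	unsigned int group[GROUP_SIZE];
	int n = 0;
	unsigned int lcore = rte_get_next_lcore_ex((unsigned int)-1, RTE_LCORE_LLC);

	while (lcore < RTE_MAX_LCORE) {
		group[n++] = lcore;
		if (n == GROUP_SIZE) {
			/* placeholder helper: pin one graph node to this trio */
			launch_graph_node(group, GROUP_SIZE);
			n = 0;
		}
		lcore = rte_get_next_lcore_ex(lcore, RTE_LCORE_LLC);
	}
}
```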
>> Yes, this does help clarify things a lot as to why current NUMA support >> would be insufficient to express what you are describing. >> >> However, in that case I would echo sentiment others have expressed >> already as this kind of deep sysfs parsing doesn't seem like it would be >> in scope for EAL, it sounds more like something a sysadmin/orchestration >> (or the application itself) would do. >> >> I mean, in principle I'm not opposed to having such an API, it just >> seems like the abstraction would perhaps need to be a bit more robust >> than directly referencing cache structure? Maybe something that >> degenerates into NUMA nodes would be better, so that applications >> wouldn't have to *specifically* worry about cache locality but instead >> have a more generic API they can use to group cores together? >> > > Unfortunately can't cover all usecases by sysadmin/orchestration (as > graph usecase one above), and definitely too much HW detail for the > application, that is why we required some programmatic way (APIs) for > applications. > > And we are on the same page that, the more we can get away from > architecture details in the abstraction (APIs) better it is, overall > intention is to provide ways to application to find lcores works > efficiently with each other. > > For this what do you think about slightly different API *, like: > ``` > rte_get_next_lcore_ex(uint i, u32 flag) > ``` > > Based on the flag, we can grab the next eligible lcore, for this patch > the flag can be `RTE_LCORE_LLC`, but options are wide and different > architectures can have different grouping to benefit most from HW in a > vendor agnostic way. > I like the idea, what do you think about this abstraction? > > * Kudos to Vipin 😉 > Hi Ferruh, In principle, having flags for this sort of thing sounds like a better way to go. I do like this idea as well! It of course remains to be seen how it can work in practice but to me it certainly looks like a path worth exploring.
[AMD Official Use Only - AMD Internal Distribution Only] <snipped> > > >> Yes, this does help clarify things a lot as to why current NUMA > >> support would be insufficient to express what you are describing. > >> > >> However, in that case I would echo sentiment others have expressed > >> already as this kind of deep sysfs parsing doesn't seem like it would > >> be in scope for EAL, it sounds more like something a > >> sysadmin/orchestration (or the application itself) would do. > >> > >> I mean, in principle I'm not opposed to having such an API, it just > >> seems like the abstraction would perhaps need to be a bit more robust > >> than directly referencing cache structure? Maybe something that > >> degenerates into NUMA nodes would be better, so that applications > >> wouldn't have to *specifically* worry about cache locality but > >> instead have a more generic API they can use to group cores together? > >> > > > > Unfortunately can't cover all usecases by sysadmin/orchestration (as > > graph usecase one above), and definitely too much HW detail for the > > application, that is why we required some programmatic way (APIs) for > > applications. > > > > And we are on the same page that, the more we can get away from > > architecture details in the abstraction (APIs) better it is, overall > > intention is to provide ways to application to find lcores works > > efficiently with each other. > > > > For this what do you think about slightly different API *, like: > > ``` > > rte_get_next_lcore_ex(uint i, u32 flag) ``` > > > > Based on the flag, we can grab the next eligible lcore, for this patch > > the flag can be `RTE_LCORE_LLC`, but options are wide and different > > architectures can have different grouping to benefit most from HW in a > > vendor agnostic way. > > I like the idea, what do you think about this abstraction? > > > > * Kudos to Vipin 😉 > > > > Hi Ferruh, > > In principle, having flags for this sort of thing sounds like a better way to go. I > do like this idea as well! It of course remains to be seen how it can work in > practice but to me it certainly looks like a path worth exploring. > Sharing the new RFC shortly. <snipped>
[AMD Official Use Only - AMD Internal Distribution Only] <snipped> > > <snipped> > > > > Thank you Mattias for the comments and question, please let me try to > > explain the same below > > > >> We shouldn't have a separate CPU/cache hierarchy API instead? > > > > Based on the intention to bring in CPU lcores which share same L3 (for > > better cache hits and less noisy neighbor) current API focuses on > > using > > > > Last Level Cache. But if the suggestion is `there are SoC where L2 > > cache are also shared, and the new API should be provisioned`, I am > > also > > > > comfortable with the thought. > > > > Rather than some AMD special case API hacked into <rte_lcore.h>, I think we > are better off with no DPDK API at all for this kind of functionality. Hi Mattias, as shared in the earlier email thread, this is not a AMD special case at all. Let me try to explain this one more time. One of techniques used to increase cores cost effective way to go for tiles of compute complexes. This introduces a bunch of cores in sharing same Last Level Cache (namely L2, L3 or even L4) depending upon cache topology architecture. The API suggested in RFC is to help end users to selectively use cores under same Last Level Cache Hierarchy as advertised by OS (irrespective of the BIOS settings used). This is useful in both bare-metal and container environment. As shared in response for cover letter +1 to expand it to more than just LLC cores. We have also confirmed the same to https://patchwork.dpdk.org/project/dpdk/cover/20240827151014.201-1-vipin.varghese@amd.com/ > > A DPDK CPU/memory hierarchy topology API very much makes sense, but it > should be reasonably generic and complete from the start. > > >> > >> Could potentially be built on the 'hwloc' library. > > > > There are 3 reason on AMD SoC we did not explore this path, reasons > > are > > > > 1. depending n hwloc version and kernel version certain SoC > > hierarchies are not available > > > > 2. CPU NUMA and IO (memory & PCIe) NUMA are independent on AMD > Epyc Soc. > > > > 3. adds the extra dependency layer of library layer to be made > > available to work. > > > > > > hence we have tried to use Linux Documented generic layer of `sysfs > > CPU cache`. > > > > I will try to explore more on hwloc and check if other libraries > > within DPDK leverages the same. > > > >> > >> I much agree cache/core topology may be of interest of the > >> application (or a work scheduler, like a DPDK event device), but it's > >> not limited to LLC. It may well be worthwhile to care about which > >> cores shares L2 cache, for example. Not sure the RTE_LCORE_FOREACH_* > approach scales. > > > > yes, totally understand as some SoC, multiple lcores shares same L2 cache. > > > > > > Can we rework the API to be rte_get_cache_<function> where user > > argument is desired lcore index. > > > > 1. index-1: SMT threads > > > > 2. index-2: threads sharing same L2 cache > > > > 3. index-3: threads sharing same L3 cache > > > > 4. index-MAX: identify the threads sharing last level cache. > > > >> > >>> < Function: Purpose > > >>> --------------------- > >>> - rte_get_llc_first_lcores: Retrieves all the first lcores in the > >>> shared LLC. > >>> - rte_get_llc_lcore: Retrieves all lcores that share the LLC. > >>> - rte_get_llc_n_lcore: Retrieves the first n or skips the first n > >>> lcores in the shared LLC. > >>> > >>> < MACRO: Purpose > > >>> ------------------ > >>> RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from > >>> each LLC. 
> >>> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first > >>> worker lcore from each LLC. > >>> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on > hint > >>> (lcore id). > >>> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC > >>> while skipping first worker. > >>> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores > >>> from each LLC. > >>> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then > >>> iterates through reaming lcores in each LLC. > >>> > > While the MACRO are simple wrapper invoking appropriate API. can this > > be worked out in this fashion? > > > > <snipped>
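As a usage illustration of the iterators proposed in the RFC quoted above: none of these macros exist in DPDK today, `rx_main`/`worker_main` are made-up launch functions, and the macro semantics are taken as I read the RFC's one-line descriptions. An application could pin one leader per LLC and spread the remaining workers within each cache complex roughly like this:

```c
#include <rte_launch.h>
#include <rte_lcore.h>

static void launch_per_llc(void)
{
	unsigned int lcore_id;

	/* One "leader" (e.g. RX/distributor) on the first worker of each LLC. */
	RTE_LCORE_FOREACH_LLC_FIRST_WORKER(lcore_id)
		rte_eal_remote_launch(rx_main, NULL, lcore_id);

	/* Remaining lcores of each LLC process packets close to that cache. */
	RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER(lcore_id)
		rte_eal_remote_launch(worker_main, NULL, lcore_id);
}
```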
On 2024-09-09 16:22, Varghese, Vipin wrote: > [AMD Official Use Only - AMD Internal Distribution Only] > > <snipped> > >>> <snipped> >>> >>> Thank you Mattias for the comments and question, please let me try to >>> explain the same below >>> >>>> We shouldn't have a separate CPU/cache hierarchy API instead? >>> >>> Based on the intention to bring in CPU lcores which share same L3 (for >>> better cache hits and less noisy neighbor) current API focuses on >>> using >>> >>> Last Level Cache. But if the suggestion is `there are SoC where L2 >>> cache are also shared, and the new API should be provisioned`, I am >>> also >>> >>> comfortable with the thought. >>> >> >> Rather than some AMD special case API hacked into <rte_lcore.h>, I think we >> are better off with no DPDK API at all for this kind of functionality. > > Hi Mattias, as shared in the earlier email thread, this is not a AMD special case at all. Let me try to explain this one more time. One of techniques used to increase cores cost effective way to go for tiles of compute complexes. > This introduces a bunch of cores in sharing same Last Level Cache (namely L2, L3 or even L4) depending upon cache topology architecture. > > The API suggested in RFC is to help end users to selectively use cores under same Last Level Cache Hierarchy as advertised by OS (irrespective of the BIOS settings used). This is useful in both bare-metal and container environment. > I'm pretty familiar with AMD CPUs and the use of tiles (including the challenges these kinds of non-uniformities pose for work scheduling). To maximize performance, caring about core<->LLC relationship may well not be enough, and more HT/core/cache/memory topology information is required. That's what I meant by special case. A proper API should allow access to information about which lcores are SMT siblings, cores on the same L2, and cores on the same L3, to name a few things. Probably you want to fit NUMA into the same API as well, although that is available already in <rte_lcore.h>. One can have a look at how scheduling domains work in the Linux kernel. They model this kind of thing. > As shared in response for cover letter +1 to expand it to more than just LLC cores. We have also confirmed the same to https://patchwork.dpdk.org/project/dpdk/cover/20240827151014.201-1-vipin.varghese@amd.com/ > >> >> A DPDK CPU/memory hierarchy topology API very much makes sense, but it >> should be reasonably generic and complete from the start. >> >>>> >>>> Could potentially be built on the 'hwloc' library. >>> >>> There are 3 reason on AMD SoC we did not explore this path, reasons >>> are >>> >>> 1. depending n hwloc version and kernel version certain SoC >>> hierarchies are not available >>> >>> 2. CPU NUMA and IO (memory & PCIe) NUMA are independent on AMD >> Epyc Soc. >>> >>> 3. adds the extra dependency layer of library layer to be made >>> available to work. >>> >>> >>> hence we have tried to use Linux Documented generic layer of `sysfs >>> CPU cache`. >>> >>> I will try to explore more on hwloc and check if other libraries >>> within DPDK leverages the same. >>> >>>> >>>> I much agree cache/core topology may be of interest of the >>>> application (or a work scheduler, like a DPDK event device), but it's >>>> not limited to LLC. It may well be worthwhile to care about which >>>> cores shares L2 cache, for example. Not sure the RTE_LCORE_FOREACH_* >> approach scales. >>> >>> yes, totally understand as some SoC, multiple lcores shares same L2 cache. 
>>> >>> >>> Can we rework the API to be rte_get_cache_<function> where user >>> argument is desired lcore index. >>> >>> 1. index-1: SMT threads >>> >>> 2. index-2: threads sharing same L2 cache >>> >>> 3. index-3: threads sharing same L3 cache >>> >>> 4. index-MAX: identify the threads sharing last level cache. >>> >>>> >>>>> < Function: Purpose > >>>>> --------------------- >>>>> - rte_get_llc_first_lcores: Retrieves all the first lcores in the >>>>> shared LLC. >>>>> - rte_get_llc_lcore: Retrieves all lcores that share the LLC. >>>>> - rte_get_llc_n_lcore: Retrieves the first n or skips the first n >>>>> lcores in the shared LLC. >>>>> >>>>> < MACRO: Purpose > >>>>> ------------------ >>>>> RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from >>>>> each LLC. >>>>> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first >>>>> worker lcore from each LLC. >>>>> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on >> hint >>>>> (lcore id). >>>>> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC >>>>> while skipping first worker. >>>>> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores >>>>> from each LLC. >>>>> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then >>>>> iterates through reaming lcores in each LLC. >>>>> >>> While the MACRO are simple wrapper invoking appropriate API. can this >>> be worked out in this fashion? >>> >>> <snipped>
[AMD Official Use Only - AMD Internal Distribution Only] <snipped> > > > > > > Thank you Mattias for the comments and question, please let me try > > > to explain the same below > > > > > >> We shouldn't have a separate CPU/cache hierarchy API instead? > > > > > > Based on the intention to bring in CPU lcores which share same L3 > > > (for better cache hits and less noisy neighbor) current API focuses > > > on using > > > > > > Last Level Cache. But if the suggestion is `there are SoC where L2 > > > cache are also shared, and the new API should be provisioned`, I am > > > also > > > > > > comfortable with the thought. > > > > > > > Rather than some AMD special case API hacked into <rte_lcore.h>, I > > think we are better off with no DPDK API at all for this kind of functionality. > > > > A DPDK CPU/memory hierarchy topology API very much makes sense, but it > > should be reasonably generic and complete from the start. > > Agreed. This one of those cases where the existing project hwloc which is part > of open-mpi is more complete and well supported. It supports multiple OS's > and can deal with more quirks. Thank you Stephen for the inputs, last year when checked hwloc for distros there were anomalies for NUMA and Physical socket Identification on AMD EPYC Soc. I will recheck the distros version of hwloc, if these work out fine I will re-work with hwloc libraries making it OS independent too. > > https://github.com/open-mpi/hwloc
[AMD Official Use Only - AMD Internal Distribution Only] <snipped> > > On 2024-09-09 16:22, Varghese, Vipin wrote: > > [AMD Official Use Only - AMD Internal Distribution Only] > > > > <snipped> > > > >>> <snipped> > >>> > >>> Thank you Mattias for the comments and question, please let me try > >>> to explain the same below > >>> > >>>> We shouldn't have a separate CPU/cache hierarchy API instead? > >>> > >>> Based on the intention to bring in CPU lcores which share same L3 > >>> (for better cache hits and less noisy neighbor) current API focuses > >>> on using > >>> > >>> Last Level Cache. But if the suggestion is `there are SoC where L2 > >>> cache are also shared, and the new API should be provisioned`, I am > >>> also > >>> > >>> comfortable with the thought. > >>> > >> > >> Rather than some AMD special case API hacked into <rte_lcore.h>, I > >> think we are better off with no DPDK API at all for this kind of functionality. > > > > Hi Mattias, as shared in the earlier email thread, this is not a AMD special > case at all. Let me try to explain this one more time. One of techniques used to > increase cores cost effective way to go for tiles of compute complexes. > > This introduces a bunch of cores in sharing same Last Level Cache (namely > L2, L3 or even L4) depending upon cache topology architecture. > > > > The API suggested in RFC is to help end users to selectively use cores under > same Last Level Cache Hierarchy as advertised by OS (irrespective of the BIOS > settings used). This is useful in both bare-metal and container environment. > > > > I'm pretty familiar with AMD CPUs and the use of tiles (including the > challenges these kinds of non-uniformities pose for work scheduling). > > To maximize performance, caring about core<->LLC relationship may well not > be enough, and more HT/core/cache/memory topology information is > required. That's what I meant by special case. A proper API should allow > access to information about which lcores are SMT siblings, cores on the same > L2, and cores on the same L3, to name a few things. Probably you want to fit > NUMA into the same API as well, although that is available already in > <rte_lcore.h>. Thank you Mattias for the information, as shared by in the reply with Anatoly we want expose a new API `rte_get_next_lcore_ex` which intakes a extra argument `u32 flags`. The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED, RTE_GET_LCORE_BOOST_DISABLED. This is AMD EPYC SoC agnostic and trying to address for all generic cases. Please do let us know if we (Ferruh & myself) can sync up via call? > > One can have a look at how scheduling domains work in the Linux kernel. > They model this kind of thing. > > > As shared in response for cover letter +1 to expand it to more than > > just LLC cores. We have also confirmed the same to > > https://patchwork.dpdk.org/project/dpdk/cover/20240827151014.201- > 1-vip > > in.varghese@amd.com/ > > > >> > >> A DPDK CPU/memory hierarchy topology API very much makes sense, but > >> it should be reasonably generic and complete from the start. > >> > >>>> > >>>> Could potentially be built on the 'hwloc' library. > >>> > >>> There are 3 reason on AMD SoC we did not explore this path, reasons > >>> are > >>> > >>> 1. depending n hwloc version and kernel version certain SoC > >>> hierarchies are not available > >>> > >>> 2. CPU NUMA and IO (memory & PCIe) NUMA are independent on AMD > >> Epyc Soc. > >>> > >>> 3. 
adds the extra dependency layer of library layer to be made > >>> available to work. > >>> > >>> > >>> hence we have tried to use Linux Documented generic layer of `sysfs > >>> CPU cache`. > >>> > >>> I will try to explore more on hwloc and check if other libraries > >>> within DPDK leverages the same. > >>> > >>>> > >>>> I much agree cache/core topology may be of interest of the > >>>> application (or a work scheduler, like a DPDK event device), but > >>>> it's not limited to LLC. It may well be worthwhile to care about > >>>> which cores shares L2 cache, for example. Not sure the > >>>> RTE_LCORE_FOREACH_* > >> approach scales. > >>> > >>> yes, totally understand as some SoC, multiple lcores shares same L2 cache. > >>> > >>> > >>> Can we rework the API to be rte_get_cache_<function> where user > >>> argument is desired lcore index. > >>> > >>> 1. index-1: SMT threads > >>> > >>> 2. index-2: threads sharing same L2 cache > >>> > >>> 3. index-3: threads sharing same L3 cache > >>> > >>> 4. index-MAX: identify the threads sharing last level cache. > >>> > >>>> > >>>>> < Function: Purpose > > >>>>> --------------------- > >>>>> - rte_get_llc_first_lcores: Retrieves all the first lcores in > >>>>> the shared LLC. > >>>>> - rte_get_llc_lcore: Retrieves all lcores that share the LLC. > >>>>> - rte_get_llc_n_lcore: Retrieves the first n or skips the first > >>>>> n lcores in the shared LLC. > >>>>> > >>>>> < MACRO: Purpose > > >>>>> ------------------ > >>>>> RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from > >>>>> each LLC. > >>>>> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first > >>>>> worker lcore from each LLC. > >>>>> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on > >> hint > >>>>> (lcore id). > >>>>> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from > LLC > >>>>> while skipping first worker. > >>>>> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` > lcores > >>>>> from each LLC. > >>>>> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then > >>>>> iterates through reaming lcores in each LLC. > >>>>> > >>> While the MACRO are simple wrapper invoking appropriate API. can > >>> this be worked out in this fashion? > >>> > >>> <snipped>
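For clarity, one possible bit layout for the `flags` word mentioned in this reply could be sketched as below. The flag names come from the thread and are not part of DPDK; the assignment of bits is purely illustrative.

```c
#include <rte_bitops.h>

/* Illustrative only: candidate flag bits for the proposed selector. */
#define RTE_GET_LCORE_L1             RTE_BIT32(0) /* SMT siblings          */
#define RTE_GET_LCORE_L2             RTE_BIT32(1) /* lcores on the same L2 */
#define RTE_GET_LCORE_L3             RTE_BIT32(2) /* lcores on the same L3 */
#define RTE_GET_LCORE_BOOST_ENABLED  RTE_BIT32(3) /* boost-capable lcores  */
#define RTE_GET_LCORE_BOOST_DISABLED RTE_BIT32(4) /* base-frequency lcores */

/* e.g. next lcore sharing L3 with 'prev' and capable of boosting:
 * next = rte_get_next_lcore_ex(prev,
 *                              RTE_GET_LCORE_L3 | RTE_GET_LCORE_BOOST_ENABLED);
 */
```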
On Wed, 11 Sep 2024 03:13:14 +0000 "Varghese, Vipin" <Vipin.Varghese@amd.com> wrote: > > Agreed. This one of those cases where the existing project hwloc which is part > > of open-mpi is more complete and well supported. It supports multiple OS's > > and can deal with more quirks. > > Thank you Stephen for the inputs, last year when checked hwloc for distros there were anomalies for NUMA and Physical socket Identification on AMD EPYC Soc. > I will recheck the distros version of hwloc, if these work out fine I will re-work with hwloc libraries making it OS independent too. DPDK doesn't exist to resolve problems with upstreaming hardware support in other packages. If DPDK supports something only because it is harder, slower, or more painful to deal with in another project, then you create long-term technical debt.
On 2024-09-11 05:26, Varghese, Vipin wrote: > [AMD Official Use Only - AMD Internal Distribution Only] > > <snipped> > >> >> On 2024-09-09 16:22, Varghese, Vipin wrote: >>> [AMD Official Use Only - AMD Internal Distribution Only] >>> >>> <snipped> >>> >>>>> <snipped> >>>>> >>>>> Thank you Mattias for the comments and question, please let me try >>>>> to explain the same below >>>>> >>>>>> We shouldn't have a separate CPU/cache hierarchy API instead? >>>>> >>>>> Based on the intention to bring in CPU lcores which share same L3 >>>>> (for better cache hits and less noisy neighbor) current API focuses >>>>> on using >>>>> >>>>> Last Level Cache. But if the suggestion is `there are SoC where L2 >>>>> cache are also shared, and the new API should be provisioned`, I am >>>>> also >>>>> >>>>> comfortable with the thought. >>>>> >>>> >>>> Rather than some AMD special case API hacked into <rte_lcore.h>, I >>>> think we are better off with no DPDK API at all for this kind of functionality. >>> >>> Hi Mattias, as shared in the earlier email thread, this is not a AMD special >> case at all. Let me try to explain this one more time. One of techniques used to >> increase cores cost effective way to go for tiles of compute complexes. >>> This introduces a bunch of cores in sharing same Last Level Cache (namely >> L2, L3 or even L4) depending upon cache topology architecture. >>> >>> The API suggested in RFC is to help end users to selectively use cores under >> same Last Level Cache Hierarchy as advertised by OS (irrespective of the BIOS >> settings used). This is useful in both bare-metal and container environment. >>> >> >> I'm pretty familiar with AMD CPUs and the use of tiles (including the >> challenges these kinds of non-uniformities pose for work scheduling). >> >> To maximize performance, caring about core<->LLC relationship may well not >> be enough, and more HT/core/cache/memory topology information is >> required. That's what I meant by special case. A proper API should allow >> access to information about which lcores are SMT siblings, cores on the same >> L2, and cores on the same L3, to name a few things. Probably you want to fit >> NUMA into the same API as well, although that is available already in >> <rte_lcore.h>. > > Thank you Mattias for the information, as shared by in the reply with Anatoly we want expose a new API `rte_get_next_lcore_ex` which intakes a extra argument `u32 flags`. > The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED, RTE_GET_LCORE_BOOST_DISABLED. > Wouldn't using that API be pretty awkward to use? I mean, what you have is a topology, with nodes of different types and with different properties, and you want to present it to the user. In a sense, it's similar to XCM and DOM versus SAX. The above is SAX-style, and what I have in mind is something DOM-like. What use case do you have in mind? What's on top of my list is a scenario where a DPDK app gets a bunch of cores (e.g., -l <cores>) and tries to figure out how best make use of them. It's not going to "skip" (ignore, leave unused) SMT siblings, or skip non-boosted cores, it would just try to be clever in regards to which cores to use for what purpose. > This is AMD EPYC SoC agnostic and trying to address for all generic cases. > > Please do let us know if we (Ferruh & myself) can sync up via call? > Sure, I can do that. >> >> One can have a look at how scheduling domains work in the Linux kernel. >> They model this kind of thing. 
>> >>> As shared in response for cover letter +1 to expand it to more than >>> just LLC cores. We have also confirmed the same to >>> https://patchwork.dpdk.org/project/dpdk/cover/20240827151014.201- >> 1-vip >>> in.varghese@amd.com/ >>> >>>> >>>> A DPDK CPU/memory hierarchy topology API very much makes sense, but >>>> it should be reasonably generic and complete from the start. >>>> >>>>>> >>>>>> Could potentially be built on the 'hwloc' library. >>>>> >>>>> There are 3 reason on AMD SoC we did not explore this path, reasons >>>>> are >>>>> >>>>> 1. depending n hwloc version and kernel version certain SoC >>>>> hierarchies are not available >>>>> >>>>> 2. CPU NUMA and IO (memory & PCIe) NUMA are independent on AMD >>>> Epyc Soc. >>>>> >>>>> 3. adds the extra dependency layer of library layer to be made >>>>> available to work. >>>>> >>>>> >>>>> hence we have tried to use Linux Documented generic layer of `sysfs >>>>> CPU cache`. >>>>> >>>>> I will try to explore more on hwloc and check if other libraries >>>>> within DPDK leverages the same. >>>>> >>>>>> >>>>>> I much agree cache/core topology may be of interest of the >>>>>> application (or a work scheduler, like a DPDK event device), but >>>>>> it's not limited to LLC. It may well be worthwhile to care about >>>>>> which cores shares L2 cache, for example. Not sure the >>>>>> RTE_LCORE_FOREACH_* >>>> approach scales. >>>>> >>>>> yes, totally understand as some SoC, multiple lcores shares same L2 cache. >>>>> >>>>> >>>>> Can we rework the API to be rte_get_cache_<function> where user >>>>> argument is desired lcore index. >>>>> >>>>> 1. index-1: SMT threads >>>>> >>>>> 2. index-2: threads sharing same L2 cache >>>>> >>>>> 3. index-3: threads sharing same L3 cache >>>>> >>>>> 4. index-MAX: identify the threads sharing last level cache. >>>>> >>>>>> >>>>>>> < Function: Purpose > >>>>>>> --------------------- >>>>>>> - rte_get_llc_first_lcores: Retrieves all the first lcores in >>>>>>> the shared LLC. >>>>>>> - rte_get_llc_lcore: Retrieves all lcores that share the LLC. >>>>>>> - rte_get_llc_n_lcore: Retrieves the first n or skips the first >>>>>>> n lcores in the shared LLC. >>>>>>> >>>>>>> < MACRO: Purpose > >>>>>>> ------------------ >>>>>>> RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from >>>>>>> each LLC. >>>>>>> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first >>>>>>> worker lcore from each LLC. >>>>>>> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on >>>> hint >>>>>>> (lcore id). >>>>>>> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from >> LLC >>>>>>> while skipping first worker. >>>>>>> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` >> lcores >>>>>>> from each LLC. >>>>>>> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then >>>>>>> iterates through reaming lcores in each LLC. >>>>>>> >>>>> While the MACRO are simple wrapper invoking appropriate API. can >>>>> this be worked out in this fashion? >>>>> >>>>> <snipped>
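To illustrate the "DOM-like versus SAX-like" distinction drawn above: a tree-shaped topology API would hand the application a structure it can walk, rather than an iterator it queries one lcore at a time. The types below are invented purely for illustration and are not a proposal made in this thread.

```c
/* Invented for illustration only: a walkable topology tree. */
enum topo_kind { TOPO_MACHINE, TOPO_NUMA, TOPO_L3, TOPO_L2, TOPO_CORE, TOPO_SMT };

struct topo_node {
	enum topo_kind kind;
	unsigned int id;              /* NUMA id, cache id or lcore id */
	unsigned int nb_children;
	struct topo_node *children;   /* NULL for leaf nodes           */
};

/* A work scheduler can walk the tree, e.g. count lcores below any node. */
static unsigned int topo_count_lcores(const struct topo_node *node)
{
	unsigned int i, count = 0;

	if (node->nb_children == 0)
		return 1;
	for (i = 0; i < node->nb_children; i++)
		count += topo_count_lcores(&node->children[i]);
	return count;
}
```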
On Wed, Sep 11, 2024 at 03:26:20AM +0000, Varghese, Vipin wrote: > [AMD Official Use Only - AMD Internal Distribution Only] > > <snipped> > > > > > On 2024-09-09 16:22, Varghese, Vipin wrote: > > > [AMD Official Use Only - AMD Internal Distribution Only] > > > > > > <snipped> > > > > > >>> <snipped> > > >>> > > >>> Thank you Mattias for the comments and question, please let me try > > >>> to explain the same below > > >>> > > >>>> We shouldn't have a separate CPU/cache hierarchy API instead? > > >>> > > >>> Based on the intention to bring in CPU lcores which share same L3 > > >>> (for better cache hits and less noisy neighbor) current API focuses > > >>> on using > > >>> > > >>> Last Level Cache. But if the suggestion is `there are SoC where L2 > > >>> cache are also shared, and the new API should be provisioned`, I am > > >>> also > > >>> > > >>> comfortable with the thought. > > >>> > > >> > > >> Rather than some AMD special case API hacked into <rte_lcore.h>, I > > >> think we are better off with no DPDK API at all for this kind of functionality. > > > > > > Hi Mattias, as shared in the earlier email thread, this is not a AMD special > > case at all. Let me try to explain this one more time. One of techniques used to > > increase cores cost effective way to go for tiles of compute complexes. > > > This introduces a bunch of cores in sharing same Last Level Cache (namely > > L2, L3 or even L4) depending upon cache topology architecture. > > > > > > The API suggested in RFC is to help end users to selectively use cores under > > same Last Level Cache Hierarchy as advertised by OS (irrespective of the BIOS > > settings used). This is useful in both bare-metal and container environment. > > > > > > > I'm pretty familiar with AMD CPUs and the use of tiles (including the > > challenges these kinds of non-uniformities pose for work scheduling). > > > > To maximize performance, caring about core<->LLC relationship may well not > > be enough, and more HT/core/cache/memory topology information is > > required. That's what I meant by special case. A proper API should allow > > access to information about which lcores are SMT siblings, cores on the same > > L2, and cores on the same L3, to name a few things. Probably you want to fit > > NUMA into the same API as well, although that is available already in > > <rte_lcore.h>. > > Thank you Mattias for the information, as shared by in the reply with Anatoly we want expose a new API `rte_get_next_lcore_ex` which intakes a extra argument `u32 flags`. > The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED, RTE_GET_LCORE_BOOST_DISABLED. > For the naming, would "rte_get_next_sibling_core" (or lcore if you prefer) be a clearer name than just adding "ex" on to the end of the existing function? Looking logically, I'm not sure about the BOOST_ENABLED and BOOST_DISABLED flags you propose - in a system with multiple possible standard and boost frequencies what would those correspond to? What's also missing is a define for getting actual NUMA siblings i.e. those sharing common memory but not an L3 or anything else. My suggestion would be to have the function take just an integer-type e.g. uint16_t parameter which defines the memory/cache hierarchy level to use, 0 being lowest, 1 next, and so on. Different systems may have different numbers of cache levels so lets just make it a zero-based index of levels, rather than giving explicit defines (except for memory which should probably always be last). 
The zero-level will be for "closest neighbour" whatever that happens to be, with as many levels as is necessary to express the topology, e.g. without SMT, but with 3 cache levels, level 0 would be an L2 neighbour, level 1 an L3 neighbour. If the L3 was split within a memory NUMA node, then level 2 would give the NUMA siblings. We'd just need an API to return the max number of levels along with the iterator. Regards, /Bruce
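A sketch of the function shape described above might look like the following. Both names are placeholders invented here, with level 0 meaning the closest neighbourhood and the last level meaning plain memory/NUMA siblings, as in the description; the sketch assumes the iterator stays within the level-0 group of the starting lcore.

```c
#include <stdint.h>
#include <rte_lcore.h>

/* Placeholder prototypes for the level-indexed iterator idea. */
unsigned int rte_topology_level_count(void);
unsigned int rte_get_next_lcore_at_level(unsigned int prev, uint16_t level,
					 int skip_main, int wrap);

/* Example: visit every lcore sharing the closest level (level 0) with 'base'. */
static void visit_closest(unsigned int base)
{
	unsigned int lcore = rte_get_next_lcore_at_level(base, 0, 1, 0);

	while (lcore < RTE_MAX_LCORE) {
		/* ... schedule related work on 'lcore' ... */
		lcore = rte_get_next_lcore_at_level(lcore, 0, 1, 0);
	}
}
```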
> On Sep 11, 2024, at 10:55 AM, Mattias Rönnblom <hofors@lysator.liu.se> wrote: > > On 2024-09-11 05:26, Varghese, Vipin wrote: >> [AMD Official Use Only - AMD Internal Distribution Only] >> <snipped> >>> >>> On 2024-09-09 16:22, Varghese, Vipin wrote: >>>> [AMD Official Use Only - AMD Internal Distribution Only] >>>> >>>> <snipped> >>>> >>>>>> <snipped> >>>>>> >>>>>> Thank you Mattias for the comments and question, please let me try >>>>>> to explain the same below >>>>>> >>>>>>> We shouldn't have a separate CPU/cache hierarchy API instead? >>>>>> >>>>>> Based on the intention to bring in CPU lcores which share same L3 >>>>>> (for better cache hits and less noisy neighbor) current API focuses >>>>>> on using >>>>>> >>>>>> Last Level Cache. But if the suggestion is `there are SoC where L2 >>>>>> cache are also shared, and the new API should be provisioned`, I am >>>>>> also >>>>>> >>>>>> comfortable with the thought. >>>>>> >>>>> >>>>> Rather than some AMD special case API hacked into <rte_lcore.h>, I >>>>> think we are better off with no DPDK API at all for this kind of functionality. >>>> >>>> Hi Mattias, as shared in the earlier email thread, this is not a AMD special >>> case at all. Let me try to explain this one more time. One of techniques used to >>> increase cores cost effective way to go for tiles of compute complexes. >>>> This introduces a bunch of cores in sharing same Last Level Cache (namely >>> L2, L3 or even L4) depending upon cache topology architecture. >>>> >>>> The API suggested in RFC is to help end users to selectively use cores under >>> same Last Level Cache Hierarchy as advertised by OS (irrespective of the BIOS >>> settings used). This is useful in both bare-metal and container environment. >>>> >>> >>> I'm pretty familiar with AMD CPUs and the use of tiles (including the >>> challenges these kinds of non-uniformities pose for work scheduling). >>> >>> To maximize performance, caring about core<->LLC relationship may well not >>> be enough, and more HT/core/cache/memory topology information is >>> required. That's what I meant by special case. A proper API should allow >>> access to information about which lcores are SMT siblings, cores on the same >>> L2, and cores on the same L3, to name a few things. Probably you want to fit >>> NUMA into the same API as well, although that is available already in >>> <rte_lcore.h>. >> Thank you Mattias for the information, as shared by in the reply with Anatoly we want expose a new API `rte_get_next_lcore_ex` which intakes a extra argument `u32 flags`. >> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED, RTE_GET_LCORE_BOOST_DISABLED. > > Wouldn't using that API be pretty awkward to use? > > I mean, what you have is a topology, with nodes of different types and with different properties, and you want to present it to the user. > > In a sense, it's similar to XCM and DOM versus SAX. The above is SAX-style, and what I have in mind is something DOM-like. > > What use case do you have in mind? What's on top of my list is a scenario where a DPDK app gets a bunch of cores (e.g., -l <cores>) and tries to figure out how best make use of them. It's not going to "skip" (ignore, leave unused) SMT siblings, or skip non-boosted cores, it would just try to be clever in regards to which cores to use for what purpose. > >> This is AMD EPYC SoC agnostic and trying to address for all generic cases. >> Please do let us know if we (Ferruh & myself) can sync up via call? 
> > Sure, I can do that. > Can this be opened to the rest of the community? This is a common problem that needs to be solved for multiple architectures. I would be interested in attending. >>> >>> One can have a look at how scheduling domains work in the Linux kernel. >>> They model this kind of thing. >>> >>>> As shared in response for cover letter +1 to expand it to more than >>>> just LLC cores. We have also confirmed the same to >>>> https://patchwork.dpdk.org/project/dpdk/cover/20240827151014.201- >>> 1-vip >>>> in.varghese@amd.com/ >>>> >>>>> >>>>> A DPDK CPU/memory hierarchy topology API very much makes sense, but >>>>> it should be reasonably generic and complete from the start. >>>>> >>>>>>> >>>>>>> Could potentially be built on the 'hwloc' library. >>>>>> >>>>>> There are 3 reason on AMD SoC we did not explore this path, reasons >>>>>> are >>>>>> >>>>>> 1. depending n hwloc version and kernel version certain SoC >>>>>> hierarchies are not available >>>>>> >>>>>> 2. CPU NUMA and IO (memory & PCIe) NUMA are independent on AMD >>>>> Epyc Soc. >>>>>> >>>>>> 3. adds the extra dependency layer of library layer to be made >>>>>> available to work. >>>>>> >>>>>> >>>>>> hence we have tried to use Linux Documented generic layer of `sysfs >>>>>> CPU cache`. >>>>>> >>>>>> I will try to explore more on hwloc and check if other libraries >>>>>> within DPDK leverages the same. >>>>>> >>>>>>> >>>>>>> I much agree cache/core topology may be of interest of the >>>>>>> application (or a work scheduler, like a DPDK event device), but >>>>>>> it's not limited to LLC. It may well be worthwhile to care about >>>>>>> which cores shares L2 cache, for example. Not sure the >>>>>>> RTE_LCORE_FOREACH_* >>>>> approach scales. >>>>>> >>>>>> yes, totally understand as some SoC, multiple lcores shares same L2 cache. >>>>>> >>>>>> >>>>>> Can we rework the API to be rte_get_cache_<function> where user >>>>>> argument is desired lcore index. >>>>>> >>>>>> 1. index-1: SMT threads >>>>>> >>>>>> 2. index-2: threads sharing same L2 cache >>>>>> >>>>>> 3. index-3: threads sharing same L3 cache >>>>>> >>>>>> 4. index-MAX: identify the threads sharing last level cache. >>>>>> >>>>>>> >>>>>>>> < Function: Purpose > >>>>>>>> --------------------- >>>>>>>> - rte_get_llc_first_lcores: Retrieves all the first lcores in >>>>>>>> the shared LLC. >>>>>>>> - rte_get_llc_lcore: Retrieves all lcores that share the LLC. >>>>>>>> - rte_get_llc_n_lcore: Retrieves the first n or skips the first >>>>>>>> n lcores in the shared LLC. >>>>>>>> >>>>>>>> < MACRO: Purpose > >>>>>>>> ------------------ >>>>>>>> RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from >>>>>>>> each LLC. >>>>>>>> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first >>>>>>>> worker lcore from each LLC. >>>>>>>> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on >>>>> hint >>>>>>>> (lcore id). >>>>>>>> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from >>> LLC >>>>>>>> while skipping first worker. >>>>>>>> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` >>> lcores >>>>>>>> from each LLC. >>>>>>>> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then >>>>>>>> iterates through reaming lcores in each LLC. >>>>>>>> >>>>>> While the MACRO are simple wrapper invoking appropriate API. can >>>>>> this be worked out in this fashion? >>>>>> >>>>>> <snipped>
> > > >>> Thank you Mattias for the comments and question, please let me try > > > >>> to explain the same below > > > >>> > > > >>>> We shouldn't have a separate CPU/cache hierarchy API instead? > > > >>> > > > >>> Based on the intention to bring in CPU lcores which share same L3 > > > >>> (for better cache hits and less noisy neighbor) current API focuses > > > >>> on using > > > >>> > > > >>> Last Level Cache. But if the suggestion is `there are SoC where L2 > > > >>> cache are also shared, and the new API should be provisioned`, I am > > > >>> also > > > >>> > > > >>> comfortable with the thought. > > > >>> > > > >> > > > >> Rather than some AMD special case API hacked into <rte_lcore.h>, I > > > >> think we are better off with no DPDK API at all for this kind of functionality. > > > > > > > > Hi Mattias, as shared in the earlier email thread, this is not a AMD special > > > case at all. Let me try to explain this one more time. One of techniques used to > > > increase cores cost effective way to go for tiles of compute complexes. > > > > This introduces a bunch of cores in sharing same Last Level Cache (namely > > > L2, L3 or even L4) depending upon cache topology architecture. > > > > > > > > The API suggested in RFC is to help end users to selectively use cores under > > > same Last Level Cache Hierarchy as advertised by OS (irrespective of the BIOS > > > settings used). This is useful in both bare-metal and container environment. > > > > > > > > > > I'm pretty familiar with AMD CPUs and the use of tiles (including the > > > challenges these kinds of non-uniformities pose for work scheduling). > > > > > > To maximize performance, caring about core<->LLC relationship may well not > > > be enough, and more HT/core/cache/memory topology information is > > > required. That's what I meant by special case. A proper API should allow > > > access to information about which lcores are SMT siblings, cores on the same > > > L2, and cores on the same L3, to name a few things. Probably you want to fit > > > NUMA into the same API as well, although that is available already in > > > <rte_lcore.h>. > > > > Thank you Mattias for the information, as shared by in the reply with Anatoly we want expose a new API `rte_get_next_lcore_ex` > which intakes a extra argument `u32 flags`. > > The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED, > RTE_GET_LCORE_BOOST_DISABLED. > > > > For the naming, would "rte_get_next_sibling_core" (or lcore if you prefer) > be a clearer name than just adding "ex" on to the end of the existing > function? > > Looking logically, I'm not sure about the BOOST_ENABLED and BOOST_DISABLED > flags you propose - in a system with multiple possible standard and boost > frequencies what would those correspond to? What's also missing is a define > for getting actual NUMA siblings i.e. those sharing common memory but not > an L3 or anything else. > > My suggestion would be to have the function take just an integer-type e.g. > uint16_t parameter which defines the memory/cache hierarchy level to use, 0 > being lowest, 1 next, and so on. Different systems may have different > numbers of cache levels so lets just make it a zero-based index of levels, > rather than giving explicit defines (except for memory which should > probably always be last). The zero-level will be for "closest neighbour" > whatever that happens to be, with as many levels as is necessary to express > the topology, e.g. 
without SMT, but with 3 cache levels, level 0 would be > an L2 neighbour, level 1 an L3 neighbour. If the L3 was split within a > memory NUMA node, then level 2 would give the NUMA siblings. We'd just need > an API to return the max number of levels along with the iterator. Sounds like a neat idea to me.
[AMD Official Use Only - AMD Internal Distribution Only] <snipped> > > On Wed, 11 Sep 2024 03:13:14 +0000 > "Varghese, Vipin" <Vipin.Varghese@amd.com> wrote: > > > > Agreed. This one of those cases where the existing project hwloc > > > which is part of open-mpi is more complete and well supported. It > > > supports multiple OS's and can deal with more quirks. > > > > Thank you Stephen for the inputs, last year when checked hwloc for distros > there were anomalies for NUMA and Physical socket Identification on AMD > EPYC Soc. > > I will recheck the distros version of hwloc, if these work out fine I will re- > work with hwloc libraries making it OS independent too. > > DPDK doesn't exist to resolve problems with upstreaming hardware support in > other packages. Stephen, thank you for your comment. You asked in an earlier reply that we try the hwloc library. I mentioned that, at least until last year, popular distros were still shipping an older version of hwloc; I will recheck whether that has changed. I had assumed hwloc would mitigate the direct `sysfs` parsing and bring in `OS` portability. > If DPDK does supports something only because it is harder, > slower, more painful to deal with another project; then you create long term > technical debt. Now I am really confused: for any application or end user who wants to use the right set of lcores for performance, DPDK provides abstraction and ease of use. Maybe I am misreading the above idea, so let me back off a bit and share version 2 of the RFC.
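For reference, the hwloc route suggested above can express the same L3 grouping in a few calls. The sketch below assumes hwloc >= 2.0 (where HWLOC_OBJ_L3CACHE is a distinct object type) and a build linked with -lhwloc; it is only an illustration of the alternative, not part of the RFC.

```c
/* Enumerate the processing units under each L3 cache with hwloc. */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
	hwloc_topology_t topo;
	int i, n;

	hwloc_topology_init(&topo);
	hwloc_topology_load(topo);

	n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L3CACHE);
	for (i = 0; i < n; i++) {
		hwloc_obj_t l3 = hwloc_get_obj_by_type(topo, HWLOC_OBJ_L3CACHE, i);
		unsigned int pu;

		printf("L3 #%d:", i);
		hwloc_bitmap_foreach_begin(pu, l3->cpuset)
			printf(" %u", pu);
		hwloc_bitmap_foreach_end();
		printf("\n");
	}

	hwloc_topology_destroy(topo);
	return 0;
}
```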
[Public] Snipped > >>>> > >>>> <snipped> > >>>> > >>>>>> <snipped> > >>>>>> > >>>>>> Thank you Mattias for the comments and question, please let me > >>>>>> try to explain the same below > >>>>>> > >>>>>>> We shouldn't have a separate CPU/cache hierarchy API instead? > >>>>>> > >>>>>> Based on the intention to bring in CPU lcores which share same L3 > >>>>>> (for better cache hits and less noisy neighbor) current API > >>>>>> focuses on using > >>>>>> > >>>>>> Last Level Cache. But if the suggestion is `there are SoC where > >>>>>> L2 cache are also shared, and the new API should be provisioned`, > >>>>>> I am also > >>>>>> > >>>>>> comfortable with the thought. > >>>>>> > >>>>> > >>>>> Rather than some AMD special case API hacked into <rte_lcore.h>, I > >>>>> think we are better off with no DPDK API at all for this kind of > functionality. > >>>> > >>>> Hi Mattias, as shared in the earlier email thread, this is not a > >>>> AMD special > >>> case at all. Let me try to explain this one more time. One of > >>> techniques used to increase cores cost effective way to go for tiles of > compute complexes. > >>>> This introduces a bunch of cores in sharing same Last Level Cache > >>>> (namely > >>> L2, L3 or even L4) depending upon cache topology architecture. > >>>> > >>>> The API suggested in RFC is to help end users to selectively use > >>>> cores under > >>> same Last Level Cache Hierarchy as advertised by OS (irrespective of > >>> the BIOS settings used). This is useful in both bare-metal and container > environment. > >>>> > >>> > >>> I'm pretty familiar with AMD CPUs and the use of tiles (including > >>> the challenges these kinds of non-uniformities pose for work scheduling). > >>> > >>> To maximize performance, caring about core<->LLC relationship may > >>> well not be enough, and more HT/core/cache/memory topology > >>> information is required. That's what I meant by special case. A > >>> proper API should allow access to information about which lcores are > >>> SMT siblings, cores on the same L2, and cores on the same L3, to > >>> name a few things. Probably you want to fit NUMA into the same API > >>> as well, although that is available already in <rte_lcore.h>. > >> Thank you Mattias for the information, as shared by in the reply with > Anatoly we want expose a new API `rte_get_next_lcore_ex` which intakes a > extra argument `u32 flags`. > >> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, > RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED, > RTE_GET_LCORE_BOOST_DISABLED. > > > > Wouldn't using that API be pretty awkward to use? Current API available under DPDK is ` rte_get_next_lcore`, which is used within DPDK example and in customer solution. Based on the comments from others we responded to the idea of changing the new Api from ` rte_get_next_lcore_llc` to ` rte_get_next_lcore_exntd`. Can you please help us understand what is `awkward`. > > > > I mean, what you have is a topology, with nodes of different types and with > different properties, and you want to present it to the user. Let me be clear, what we want via DPDK to help customer to use an Unified API which works across multiple platforms. Example - let a vendor have 2 products namely A and B. CPU-A has all cores within same SUB-NUMA domain and CPU-B has cores split to 2 sub-NUMA domain based on split LLC. When `rte_get_next_lcore_extnd` is invoked for `LLC` on 1. CPU-A: it returns all cores as there is no split 2. 
CPU-B: it returns cores from specific sub-NUMA which is partitioned by L3 > > > > In a sense, it's similar to XCM and DOM versus SAX. The above is SAX-style, > and what I have in mind is something DOM-like. > > > > What use case do you have in mind? What's on top of my list is a scenario > where a DPDK app gets a bunch of cores (e.g., -l <cores>) and tries to figure > out how best make use of them. Exactly. It's not going to "skip" (ignore, leave unused) > SMT siblings, or skip non-boosted cores, it would just try to be clever in > regards to which cores to use for what purpose. Let me try to share my idea on SMT sibling. When user invoked for rte_get_next_lcore_extnd` is invoked for `L1 | SMT` flag with `lcore`; the API identifies first whether given lcore is part of enabled core list. If yes, it programmatically either using `sysfs` or `hwloc library (shared the version concern on distros. Will recheck again)` identify the sibling thread and return. If there is no sibling thread available under DPDK it will fetch next lcore (probably lcore +1 ). > > > >> This is AMD EPYC SoC agnostic and trying to address for all generic cases. > >> Please do let us know if we (Ferruh & myself) can sync up via call? > > > > Sure, I can do that. Let me sync with Ferruh and get a time slot for internal sync. > > > Can this be opened to the rest of the community? This is a common problem > that needs to be solved for multiple architectures. I would be interested in > attending. Thank you Mattias, in DPDK Bangkok summit 2024 we did bring this up. As per the suggestion from Thomas and Jerrin we tried to bring the RFC for discussion. For DPDK Montreal 2024, Keesang and Ferruh (most likely) is travelling for the summit and presenting this as the talk to get things moving. > > >>> <snipped>
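The SMT-sibling lookup step described above maps onto a documented sysfs file. A minimal sketch follows (error handling trimmed; newer kernels also expose the same data as core_cpus_list); it only illustrates the mechanism, not the RFC's implementation.

```c
/* Print the SMT siblings of a given CPU from the Linux topology files. */
#include <stdio.h>

static int print_smt_siblings(int cpu)
{
	char path[128], buf[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
		 cpu);
	f = fopen(path, "r");
	if (f == NULL)
		return -1;
	if (fgets(buf, sizeof(buf), f) != NULL)
		printf("cpu%d SMT siblings: %s", cpu, buf); /* e.g. "0,128" */
	fclose(f);
	return 0;
}

int main(void)
{
	return print_smt_siblings(0);
}
```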
[Public] <snipped> > > > > <snipped> > > > > > > > >>> <snipped> > > > >>> > > > >>> Thank you Mattias for the comments and question, please let me > > > >>> try to explain the same below > > > >>> > > > >>>> We shouldn't have a separate CPU/cache hierarchy API instead? > > > >>> > > > >>> Based on the intention to bring in CPU lcores which share same > > > >>> L3 (for better cache hits and less noisy neighbor) current API > > > >>> focuses on using > > > >>> > > > >>> Last Level Cache. But if the suggestion is `there are SoC where > > > >>> L2 cache are also shared, and the new API should be > > > >>> provisioned`, I am also > > > >>> > > > >>> comfortable with the thought. > > > >>> > > > >> > > > >> Rather than some AMD special case API hacked into <rte_lcore.h>, > > > >> I think we are better off with no DPDK API at all for this kind of > functionality. > > > > > > > > Hi Mattias, as shared in the earlier email thread, this is not a > > > > AMD special > > > case at all. Let me try to explain this one more time. One of > > > techniques used to increase cores cost effective way to go for tiles of > compute complexes. > > > > This introduces a bunch of cores in sharing same Last Level Cache > > > > (namely > > > L2, L3 or even L4) depending upon cache topology architecture. > > > > > > > > The API suggested in RFC is to help end users to selectively use > > > > cores under > > > same Last Level Cache Hierarchy as advertised by OS (irrespective of > > > the BIOS settings used). This is useful in both bare-metal and container > environment. > > > > > > > > > > I'm pretty familiar with AMD CPUs and the use of tiles (including > > > the challenges these kinds of non-uniformities pose for work scheduling). > > > > > > To maximize performance, caring about core<->LLC relationship may > > > well not be enough, and more HT/core/cache/memory topology > > > information is required. That's what I meant by special case. A > > > proper API should allow access to information about which lcores are > > > SMT siblings, cores on the same L2, and cores on the same L3, to > > > name a few things. Probably you want to fit NUMA into the same API > > > as well, although that is available already in <rte_lcore.h>. > > > > Thank you Mattias for the information, as shared by in the reply with > Anatoly we want expose a new API `rte_get_next_lcore_ex` which intakes a > extra argument `u32 flags`. > > The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, > RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED, > RTE_GET_LCORE_BOOST_DISABLED. > > > > For the naming, would "rte_get_next_sibling_core" (or lcore if you prefer) be a > clearer name than just adding "ex" on to the end of the existing function? Thank you Bruce, Please find my answer below Functions shared as per the RFC were ``` - rte_get_llc_first_lcores: Retrieves all the first lcores in the shared LLC. - rte_get_llc_lcore: Retrieves all lcores that share the LLC. - rte_get_llc_n_lcore: Retrieves the first n or skips the first n lcores in the shared LLC. ``` MACRO's extending the usability were ``` RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from each LLC. RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first worker lcore from each LLC. RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on hint (lcore id). RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC while skipping first worker. RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores from each LLC. 
RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then iterates through reaming lcores in each LLC. ``` Based on the discussions we agreed on sharing version-2 FRC for extending API as `rte_get_next_lcore_extnd` with extra argument as `flags`. As per my ideation, for the API ` rte_get_next_sibling_core`, the above API can easily with flag ` RTE_GET_LCORE_L1 (SMT)`. Is this right understanding? We can easily have simple MACROs like `RTE_LCORE_FOREACH_L1` which allows to iterate SMT sibling threads. > > Looking logically, I'm not sure about the BOOST_ENABLED and > BOOST_DISABLED flags you propose The idea for the BOOST_ENABLED & BOOST_DISABLED is based on DPDK power library which allows to enable boost. Allow user to select lcores where BOOST is enabled|disabled using MACRO or API. - in a system with multiple possible > standard and boost frequencies what would those correspond to? I now understand the confusion, apologies for mixing the AMD EPYC SoC boost with Intel Turbo. Thank you for pointing out, we will use the terminology ` RTE_GET_LCORE_TURBO`. What's also > missing is a define for getting actual NUMA siblings i.e. those sharing common > memory but not an L3 or anything else. This can be extended into `rte_get_next_lcore_extnd` with flag ` RTE_GET_LCORE_NUMA`. This will allow to grab all lcores under the same sub-memory NUMA as shared by LCORE. If SMT sibling is enabled and DPDK Lcore mask covers the sibling threads, then ` RTE_GET_LCORE_NUMA` get all lcore and sibling threads under same memory NUMA of lcore shared. > > My suggestion would be to have the function take just an integer-type e.g. > uint16_t parameter which defines the memory/cache hierarchy level to use, 0 > being lowest, 1 next, and so on. Different systems may have different numbers > of cache levels so lets just make it a zero-based index of levels, rather than > giving explicit defines (except for memory which should probably always be > last). The zero-level will be for "closest neighbour" Good idea, we did prototype this internally. But issue it will keep on adding the number of API into lcore library. To keep the API count less, we are using lcore id as hint to sub-NUMA. > whatever that happens to be, with as many levels as is necessary to express > the topology, e.g. without SMT, but with 3 cache levels, level 0 would be an L2 > neighbour, level 1 an L3 neighbour. If the L3 was split within a memory NUMA > node, then level 2 would give the NUMA siblings. We'd just need an API to > return the max number of levels along with the iterator. We are using lcore numa as the hint. > > Regards, > /Bruce
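For context, the `RTE_GET_LCORE_NUMA` grouping discussed above can already be approximated at socket granularity with today's EAL, since `rte_lcore_to_socket_id()` and `RTE_LCORE_FOREACH_WORKER` exist in <rte_lcore.h>; the proposal extends this idea down to cache and sub-NUMA domains. A small sketch using only current DPDK API:

```
#include <stdio.h>

#include <rte_eal.h>
#include <rte_lcore.h>

int
main(int argc, char **argv)
{
	if (rte_eal_init(argc, argv) < 0)
		return -1;

	unsigned int main_socket =
		rte_lcore_to_socket_id(rte_get_main_lcore());
	unsigned int lcore;

	/* Workers on the same memory (socket) domain as the main lcore. */
	RTE_LCORE_FOREACH_WORKER(lcore) {
		if (rte_lcore_to_socket_id(lcore) == main_socket)
			printf("lcore %u shares the main lcore's NUMA node\n",
			       lcore);
	}

	return rte_eal_cleanup();
}
```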
[Public] <snipped> > > What use case do you have in mind? What's on top of my list is a scenario > where a DPDK app gets a bunch of cores (e.g., -l <cores>) and tries to figure > out how best make use of them. It's not going to "skip" (ignore, leave unused) > SMT siblings, or skip non-boosted cores, it would just try to be clever in > regards to which cores to use for what purpose. > > > >> This is AMD EPYC SoC agnostic and trying to address for all generic cases. > >> Please do let us know if we (Ferruh & myself) can sync up via call? > > > > Sure, I can do that. > > > Can this be opened to the rest of the community? This is a common problem > that needs to be solved for multiple architectures. I would be interested in > attending. Hi Honnappa, we can accommodate that. Let me work with Ferruh on setting this up as a tech discussion. <snipped>
[AMD Official Use Only - AMD Internal Distribution Only] <snipped> > > > > For the naming, would "rte_get_next_sibling_core" (or lcore if you > > prefer) be a clearer name than just adding "ex" on to the end of the > > existing function? > > > > Looking logically, I'm not sure about the BOOST_ENABLED and > > BOOST_DISABLED flags you propose - in a system with multiple possible > > standard and boost frequencies what would those correspond to? What's > > also missing is a define for getting actual NUMA siblings i.e. those > > sharing common memory but not an L3 or anything else. > > > > My suggestion would be to have the function take just an integer-type e.g. > > uint16_t parameter which defines the memory/cache hierarchy level to > > use, 0 being lowest, 1 next, and so on. Different systems may have > > different numbers of cache levels so lets just make it a zero-based > > index of levels, rather than giving explicit defines (except for > > memory which should probably always be last). The zero-level will be for > "closest neighbour" > > whatever that happens to be, with as many levels as is necessary to > > express the topology, e.g. without SMT, but with 3 cache levels, level > > 0 would be an L2 neighbour, level 1 an L3 neighbour. If the L3 was > > split within a memory NUMA node, then level 2 would give the NUMA > > siblings. We'd just need an API to return the max number of levels along with > the iterator. > > Sounds like a neat idea to me. Hi Konstantin, I have tried my best to address Bruce's comments. Let me recap: 1. we want a vendor-agnostic API which allows end users to get a list of lcores; 2. this can be based on L1 (SMT), L2, L3, NUMA, TURBO (as of now); 3. instead of creating multiple different APIs, we would like to add one API `rte_get_next_lcore_extnd` which can be controlled with `flags`; 4. a flag can be a single value or a combination (like L3|TURBO_ENABLED or NUMA|TURBO_ENABLED); 5. as per my current idea, we can expand ease of use via MACROs and not APIs. I hope this justifies why we should have one extended API and wrap things with macros. May I set up, or be added to, a tech discussion call with Mattias and Honnappa too?
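A hypothetical sketch of point 5 above, wrapping the single flag-controlled getter in a convenience macro so that ease of use comes from macros rather than additional APIs. The flag values, the `RTE_LCORE_FOREACH_WORKER_FLAG` macro body, and the getter signature are all invented for illustration; the exact "hint lcore" semantics remain one of the open questions in the thread.

```
#include <stdint.h>

#include <rte_lcore.h>

/* Hypothetical flags and getter from the v2 idea above; not a DPDK API. */
#define RTE_GET_LCORE_L3    (1u << 2)
#define RTE_GET_LCORE_TURBO (1u << 4)
unsigned int rte_get_next_lcore_extnd(unsigned int i, int skip_main,
				      int wrap, uint32_t flags);

/* Hypothetical convenience macro wrapping the single getter. */
#define RTE_LCORE_FOREACH_WORKER_FLAG(i, flags)                          \
	for ((i) = rte_get_next_lcore_extnd(-1, 1, 0, (flags));          \
	     (i) < RTE_MAX_LCORE;                                        \
	     (i) = rte_get_next_lcore_extnd((i), 1, 0, (flags)))

void
place_offload_workers(void)
{
	unsigned int lcore;

	/* Workers in the caller's L3 domain that also report turbo. */
	RTE_LCORE_FOREACH_WORKER_FLAG(lcore,
			RTE_GET_LCORE_L3 | RTE_GET_LCORE_TURBO) {
		/* e.g. launch crypto/compression stages on these lcores */
	}
}
```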
On 2024-09-12 03:33, Varghese, Vipin wrote: > [Public] > > Snipped > >>>>>> >>>>>> <snipped> >>>>>> >>>>>>>> <snipped> >>>>>>>> >>>>>>>> Thank you Mattias for the comments and question, please let me >>>>>>>> try to explain the same below >>>>>>>> >>>>>>>>> We shouldn't have a separate CPU/cache hierarchy API instead? >>>>>>>> >>>>>>>> Based on the intention to bring in CPU lcores which share same L3 >>>>>>>> (for better cache hits and less noisy neighbor) current API >>>>>>>> focuses on using >>>>>>>> >>>>>>>> Last Level Cache. But if the suggestion is `there are SoC where >>>>>>>> L2 cache are also shared, and the new API should be provisioned`, >>>>>>>> I am also >>>>>>>> >>>>>>>> comfortable with the thought. >>>>>>>> >>>>>>> >>>>>>> Rather than some AMD special case API hacked into <rte_lcore.h>, I >>>>>>> think we are better off with no DPDK API at all for this kind of >> functionality. >>>>>> >>>>>> Hi Mattias, as shared in the earlier email thread, this is not a >>>>>> AMD special >>>>> case at all. Let me try to explain this one more time. One of >>>>> techniques used to increase cores cost effective way to go for tiles of >> compute complexes. >>>>>> This introduces a bunch of cores in sharing same Last Level Cache >>>>>> (namely >>>>> L2, L3 or even L4) depending upon cache topology architecture. >>>>>> >>>>>> The API suggested in RFC is to help end users to selectively use >>>>>> cores under >>>>> same Last Level Cache Hierarchy as advertised by OS (irrespective of >>>>> the BIOS settings used). This is useful in both bare-metal and container >> environment. >>>>>> >>>>> >>>>> I'm pretty familiar with AMD CPUs and the use of tiles (including >>>>> the challenges these kinds of non-uniformities pose for work scheduling). >>>>> >>>>> To maximize performance, caring about core<->LLC relationship may >>>>> well not be enough, and more HT/core/cache/memory topology >>>>> information is required. That's what I meant by special case. A >>>>> proper API should allow access to information about which lcores are >>>>> SMT siblings, cores on the same L2, and cores on the same L3, to >>>>> name a few things. Probably you want to fit NUMA into the same API >>>>> as well, although that is available already in <rte_lcore.h>. >>>> Thank you Mattias for the information, as shared by in the reply with >> Anatoly we want expose a new API `rte_get_next_lcore_ex` which intakes a >> extra argument `u32 flags`. >>>> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, >> RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED, >> RTE_GET_LCORE_BOOST_DISABLED. >>> >>> Wouldn't using that API be pretty awkward to use? > Current API available under DPDK is ` rte_get_next_lcore`, which is used within DPDK example and in customer solution. > Based on the comments from others we responded to the idea of changing the new Api from ` rte_get_next_lcore_llc` to ` rte_get_next_lcore_exntd`. > > Can you please help us understand what is `awkward`. > The awkwardness starts when you are trying to fit provide hwloc type information over an API that was designed for iterating over lcores. It seems to me that you should either have: A) An API in similar to that of hwloc (or any DOM-like API), which would give a low-level description of the hardware in implementation terms. The topology would consist of nodes, with attributes, etc, where nodes are things like cores or instances of caches of some level and attributes are things like CPU actual and nominal, and maybe max frequency, cache size, or memory size. 
or B) An API to be directly useful for a work scheduler, in which case you should abstract away things like "boost" (and fold them into some abstract capacity notion, together with core "size" [in big-little/heterogeneous systems]), and have an abstract notion of what core is "close" to some other core. This would something like Linux' scheduling domains. If you want B you probably need A as a part of its implementation, so you may just as well start with A, I suppose. What you could do to explore the API design is to add support for, for example, boost core awareness or SMT affinity in the SW scheduler. You could also do an "lstopo" equivalent, since that's needed for debugging and exploration, if nothing else. One question that will have to be answered in a work scheduling scenario is "are these two lcores SMT siblings," or "are these two cores on the same LLC", or "give me all lcores on a particular L2 cache". >>> >>> I mean, what you have is a topology, with nodes of different types and with >> different properties, and you want to present it to the user. > Let me be clear, what we want via DPDK to help customer to use an Unified API which works across multiple platforms. > Example - let a vendor have 2 products namely A and B. CPU-A has all cores within same SUB-NUMA domain and CPU-B has cores split to 2 sub-NUMA domain based on split LLC. > When `rte_get_next_lcore_extnd` is invoked for `LLC` on > 1. CPU-A: it returns all cores as there is no split > 2. CPU-B: it returns cores from specific sub-NUMA which is partitioned by L3 > I think the function name rte_get_next_lcore_extnd() alone makes clear this is an awkward API. :) My gut feeling is to make it more explicit and forget about <rte_lcore.h>. <rte_hwtopo.h>? Could and should still be EAL. >>> >>> In a sense, it's similar to XCM and DOM versus SAX. The above is SAX-style, >> and what I have in mind is something DOM-like. >>> >>> What use case do you have in mind? What's on top of my list is a scenario >> where a DPDK app gets a bunch of cores (e.g., -l <cores>) and tries to figure >> out how best make use of them. > Exactly. > > It's not going to "skip" (ignore, leave unused) >> SMT siblings, or skip non-boosted cores, it would just try to be clever in >> regards to which cores to use for what purpose. > Let me try to share my idea on SMT sibling. When user invoked for rte_get_next_lcore_extnd` is invoked for `L1 | SMT` flag with `lcore`; the API identifies first whether given lcore is part of enabled core list. > If yes, it programmatically either using `sysfs` or `hwloc library (shared the version concern on distros. Will recheck again)` identify the sibling thread and return. > If there is no sibling thread available under DPDK it will fetch next lcore (probably lcore +1 ). > Distributions having old hwloc versions isn't an argument for a new DPDK library or new API. If only that was the issue, then it would be better to help the hwloc and/or distributions, rather than the DPDK project. >>> >>>> This is AMD EPYC SoC agnostic and trying to address for all generic cases. >>>> Please do let us know if we (Ferruh & myself) can sync up via call? >>> >>> Sure, I can do that. > > Let me sync with Ferruh and get a time slot for internal sync. > >>> >> Can this be opened to the rest of the community? This is a common problem >> that needs to be solved for multiple architectures. I would be interested in >> attending. > Thank you Mattias, in DPDK Bangkok summit 2024 we did bring this up. 
As per the suggestion from Thomas and Jerrin we tried to bring the RFC for discussion. > For DPDK Montreal 2024, Keesang and Ferruh (most likely) is travelling for the summit and presenting this as the talk to get things moving. > >> >>>>> > <snipped>
On 2024-09-12 08:38, Mattias Rönnblom wrote: > On 2024-09-12 03:33, Varghese, Vipin wrote: >> [Public] >> >> Snipped >> >>>>>>> >>>>>>> <snipped> >>>>>>> >>>>>>>>> <snipped> >>>>>>>>> >>>>>>>>> Thank you Mattias for the comments and question, please let me >>>>>>>>> try to explain the same below >>>>>>>>> >>>>>>>>>> We shouldn't have a separate CPU/cache hierarchy API instead? >>>>>>>>> >>>>>>>>> Based on the intention to bring in CPU lcores which share same L3 >>>>>>>>> (for better cache hits and less noisy neighbor) current API >>>>>>>>> focuses on using >>>>>>>>> >>>>>>>>> Last Level Cache. But if the suggestion is `there are SoC where >>>>>>>>> L2 cache are also shared, and the new API should be provisioned`, >>>>>>>>> I am also >>>>>>>>> >>>>>>>>> comfortable with the thought. >>>>>>>>> >>>>>>>> >>>>>>>> Rather than some AMD special case API hacked into <rte_lcore.h>, I >>>>>>>> think we are better off with no DPDK API at all for this kind of >>> functionality. >>>>>>> >>>>>>> Hi Mattias, as shared in the earlier email thread, this is not a >>>>>>> AMD special >>>>>> case at all. Let me try to explain this one more time. One of >>>>>> techniques used to increase cores cost effective way to go for >>>>>> tiles of >>> compute complexes. >>>>>>> This introduces a bunch of cores in sharing same Last Level Cache >>>>>>> (namely >>>>>> L2, L3 or even L4) depending upon cache topology architecture. >>>>>>> >>>>>>> The API suggested in RFC is to help end users to selectively use >>>>>>> cores under >>>>>> same Last Level Cache Hierarchy as advertised by OS (irrespective of >>>>>> the BIOS settings used). This is useful in both bare-metal and >>>>>> container >>> environment. >>>>>>> >>>>>> >>>>>> I'm pretty familiar with AMD CPUs and the use of tiles (including >>>>>> the challenges these kinds of non-uniformities pose for work >>>>>> scheduling). >>>>>> >>>>>> To maximize performance, caring about core<->LLC relationship may >>>>>> well not be enough, and more HT/core/cache/memory topology >>>>>> information is required. That's what I meant by special case. A >>>>>> proper API should allow access to information about which lcores are >>>>>> SMT siblings, cores on the same L2, and cores on the same L3, to >>>>>> name a few things. Probably you want to fit NUMA into the same API >>>>>> as well, although that is available already in <rte_lcore.h>. >>>>> Thank you Mattias for the information, as shared by in the reply with >>> Anatoly we want expose a new API `rte_get_next_lcore_ex` which intakes a >>> extra argument `u32 flags`. >>>>> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, >>> RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED, >>> RTE_GET_LCORE_BOOST_DISABLED. >>>> >>>> Wouldn't using that API be pretty awkward to use? >> Current API available under DPDK is ` rte_get_next_lcore`, which is >> used within DPDK example and in customer solution. >> Based on the comments from others we responded to the idea of changing >> the new Api from ` rte_get_next_lcore_llc` to ` >> rte_get_next_lcore_exntd`. >> >> Can you please help us understand what is `awkward`. >> > > The awkwardness starts when you are trying to fit provide hwloc type > information over an API that was designed for iterating over lcores. > > It seems to me that you should either have: > A) An API in similar to that of hwloc (or any DOM-like API), which would > give a low-level description of the hardware in implementation terms. 
> The topology would consist of nodes, with attributes, etc, where nodes > are things like cores or instances of caches of some level and > attributes are things like CPU actual and nominal, and maybe max > frequency, cache size, or memory size. To to be clear; it's something like this I think of when I say "DOM-style" API. #ifndef RTE_HWTOPO_H #define RTE_HWTOPO_H struct rte_hwtopo_node; enum rte_hwtopo_node_type { RTE_HWTOPO_NODE_TYPE_CPU_CORE, RTE_HWTOPO_NODE_TYPE_CACHE, RTE_HWTOPO_NODE_TYPE_NUMA }; int rte_hwtopo_init(void); struct rte_hwtopo_node * rte_hwtopo_get_core_by_lcore(unsigned int lcore); struct rte_hwtopo_node * rte_hwtopo_get_core_by_id(unsigned int os_cpu_id); struct rte_hwtopo_node * rte_hwtopo_parent(struct rte_hwtopo_node *node); struct rte_hwtopo_node * rte_hwtopo_first_child(struct rte_hwtopo_node *node); struct rte_hwtopo_node * rte_hwtopo_next_child(struct rte_hwtopo_node *node, struct rte_hwtopo_node *child); struct rte_hwtopo_node * rte_hwtopo_first_sibling(struct rte_hwtopo_node *node); struct rte_hwtopo_node * rte_hwtopo_next_sibling(struct rte_hwtopo_node *node, struct rte_hwtopo_node *child); enum rte_hwtopo_node_type rte_hwtopo_get_type(struct rte_hwtopo_node *node); #define RTE_HWTOPO_NODE_ATTR_CORE_FREQUENCY_NOMINAL 0 #define RTE_HWTOPO_NODE_ATTR_CACHE_LEVEL 1 #define RTE_HWTOPO_NODE_ATTR_CACHE_SIZE 2 int rte_hwtopo_get_attr_int64(struct rte_hwtopo_node *node, unsigned int attr_name, int64_t *attr_value); int rte_hwtopo_get_attr_str(struct rte_hwtopo_node *node, unsigned int attr_name, char *attr_value, size_t capacity); #endif Surely, this too would be awkward (or should I say cumbersome) to use in certain scenarios. You could have syntactic sugar/special case helpers which address common use cases. You would also build abstractions on top of this (like the B case below). One could have node type specific functions instead of generic getter and setters. Anyway, this is not a counter-proposal, but rather just to make clear, what I had in mind. > or > B) An API to be directly useful for a work scheduler, in which case you > should abstract away things like "boost" (and fold them into some > abstract capacity notion, together with core "size" [in > big-little/heterogeneous systems]), and have an abstract notion of what > core is "close" to some other core. This would something like Linux' > scheduling domains. > > If you want B you probably need A as a part of its implementation, so > you may just as well start with A, I suppose. > > What you could do to explore the API design is to add support for, for > example, boost core awareness or SMT affinity in the SW scheduler. You > could also do an "lstopo" equivalent, since that's needed for debugging > and exploration, if nothing else. > > One question that will have to be answered in a work scheduling scenario > is "are these two lcores SMT siblings," or "are these two cores on the > same LLC", or "give me all lcores on a particular L2 cache". > >>>> >>>> I mean, what you have is a topology, with nodes of different types >>>> and with >>> different properties, and you want to present it to the user. >> Let me be clear, what we want via DPDK to help customer to use an >> Unified API which works across multiple platforms. >> Example - let a vendor have 2 products namely A and B. CPU-A has all >> cores within same SUB-NUMA domain and CPU-B has cores split to 2 >> sub-NUMA domain based on split LLC. >> When `rte_get_next_lcore_extnd` is invoked for `LLC` on >> 1. 
CPU-A: it returns all cores as there is no split >> 2. CPU-B: it returns cores from specific sub-NUMA which is partitioned >> by L3 >> > > I think the function name rte_get_next_lcore_extnd() alone makes clear > this is an awkward API. :) > > My gut feeling is to make it more explicit and forget about > <rte_lcore.h>. <rte_hwtopo.h>? Could and should still be EAL. > >>>> >>>> In a sense, it's similar to XCM and DOM versus SAX. The above is >>>> SAX-style, >>> and what I have in mind is something DOM-like. >>>> >>>> What use case do you have in mind? What's on top of my list is a >>>> scenario >>> where a DPDK app gets a bunch of cores (e.g., -l <cores>) and tries >>> to figure >>> out how best make use of them. >> Exactly. >> >> It's not going to "skip" (ignore, leave unused) >>> SMT siblings, or skip non-boosted cores, it would just try to be >>> clever in >>> regards to which cores to use for what purpose. >> Let me try to share my idea on SMT sibling. When user invoked for >> rte_get_next_lcore_extnd` is invoked for `L1 | SMT` flag with `lcore`; >> the API identifies first whether given lcore is part of enabled core >> list. >> If yes, it programmatically either using `sysfs` or `hwloc library >> (shared the version concern on distros. Will recheck again)` identify >> the sibling thread and return. >> If there is no sibling thread available under DPDK it will fetch next >> lcore (probably lcore +1 ). >> > > Distributions having old hwloc versions isn't an argument for a new DPDK > library or new API. If only that was the issue, then it would be better > to help the hwloc and/or distributions, rather than the DPDK project. > >>>> >>>>> This is AMD EPYC SoC agnostic and trying to address for all generic >>>>> cases. >>>>> Please do let us know if we (Ferruh & myself) can sync up via call? >>>> >>>> Sure, I can do that. >> >> Let me sync with Ferruh and get a time slot for internal sync. >> >>>> >>> Can this be opened to the rest of the community? This is a common >>> problem >>> that needs to be solved for multiple architectures. I would be >>> interested in >>> attending. >> Thank you Mattias, in DPDK Bangkok summit 2024 we did bring this up. >> As per the suggestion from Thomas and Jerrin we tried to bring the RFC >> for discussion. >> For DPDK Montreal 2024, Keesang and Ferruh (most likely) is travelling >> for the summit and presenting this as the talk to get things moving. >> >>> >>>>>> >> <snipped>
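Building only on the <rte_hwtopo.h> sketch quoted above (which is itself hypothetical and not part of DPDK), a DOM-style traversal answering one of the questions posed earlier, "how many cores share this lcore's L3?", might look as follows. The assumption that the attribute getter returns 0 on success is also invented.

```
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header from the sketch quoted above; not part of DPDK. */
#include <rte_hwtopo.h>

/* Count CPU core nodes in the subtree rooted at 'node'. */
static unsigned int
count_core_descendants(struct rte_hwtopo_node *node)
{
	if (rte_hwtopo_get_type(node) == RTE_HWTOPO_NODE_TYPE_CPU_CORE)
		return 1;

	unsigned int count = 0;
	struct rte_hwtopo_node *child;

	for (child = rte_hwtopo_first_child(node); child != NULL;
	     child = rte_hwtopo_next_child(node, child))
		count += count_core_descendants(child);

	return count;
}

/*
 * Walk up from an lcore's core node to the enclosing level-3 cache node
 * and count the core leaves below it. Assumes rte_hwtopo_init() has
 * already been called and that the attribute getter returns 0 on success.
 */
unsigned int
cores_sharing_llc(unsigned int lcore)
{
	struct rte_hwtopo_node *node = rte_hwtopo_get_core_by_lcore(lcore);

	while (node != NULL) {
		int64_t level;

		if (rte_hwtopo_get_type(node) == RTE_HWTOPO_NODE_TYPE_CACHE &&
		    rte_hwtopo_get_attr_int64(node,
			    RTE_HWTOPO_NODE_ATTR_CACHE_LEVEL, &level) == 0 &&
		    level == 3)
			return count_core_descendants(node);

		node = rte_hwtopo_parent(node);
	}

	return 0;
}
```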
On Thu, Sep 12, 2024 at 02:19:07AM +0000, Varghese, Vipin wrote: > [Public] > > <snipped> > > > > > > > > <snipped> > > > > > > > > > > > >>> <snipped> > > > > > >>> > > > > > >>> Thank you Mattias for the comments and question, please let > me > > > > > >>> try to explain the same below > > > > > >>> > > > > > >>>> We shouldn't have a separate CPU/cache hierarchy API > instead? > > > > > >>> > > > > > >>> Based on the intention to bring in CPU lcores which share > same > > > > > >>> L3 (for better cache hits and less noisy neighbor) current > API > > > > > >>> focuses on using > > > > > >>> > > > > > >>> Last Level Cache. But if the suggestion is `there are SoC > where > > > > > >>> L2 cache are also shared, and the new API should be > > > > > >>> provisioned`, I am also > > > > > >>> > > > > > >>> comfortable with the thought. > > > > > >>> > > > > > >> > > > > > >> Rather than some AMD special case API hacked into > <rte_lcore.h>, > > > > > >> I think we are better off with no DPDK API at all for this > kind of > > > functionality. > > > > > > > > > > > > Hi Mattias, as shared in the earlier email thread, this is not > a > > > > > > AMD special > > > > > case at all. Let me try to explain this one more time. One of > > > > > techniques used to increase cores cost effective way to go for > tiles of > > > compute complexes. > > > > > > This introduces a bunch of cores in sharing same Last Level > Cache > > > > > > (namely > > > > > L2, L3 or even L4) depending upon cache topology architecture. > > > > > > > > > > > > The API suggested in RFC is to help end users to selectively > use > > > > > > cores under > > > > > same Last Level Cache Hierarchy as advertised by OS (irrespective > of > > > > > the BIOS settings used). This is useful in both bare-metal and > container > > > environment. > > > > > > > > > > > > > > > > I'm pretty familiar with AMD CPUs and the use of tiles (including > > > > > the challenges these kinds of non-uniformities pose for work > scheduling). > > > > > > > > > > To maximize performance, caring about core<->LLC relationship may > > > > > well not be enough, and more HT/core/cache/memory topology > > > > > information is required. That's what I meant by special case. A > > > > > proper API should allow access to information about which lcores > are > > > > > SMT siblings, cores on the same L2, and cores on the same L3, to > > > > > name a few things. Probably you want to fit NUMA into the same > API > > > > > as well, although that is available already in <rte_lcore.h>. > > > > > > > > Thank you Mattias for the information, as shared by in the reply > with > > > Anatoly we want expose a new API `rte_get_next_lcore_ex` which > intakes a > > > extra argument `u32 flags`. > > > > The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, > > > RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED, > > > RTE_GET_LCORE_BOOST_DISABLED. > > > > > > > > > > For the naming, would "rte_get_next_sibling_core" (or lcore if you > prefer) be a > > > clearer name than just adding "ex" on to the end of the existing > function? > > Thank you Bruce, Please find my answer below > > > > Functions shared as per the RFC were > > ``` > > - rte_get_llc_first_lcores: Retrieves all the first lcores in the > shared LLC. > > - rte_get_llc_lcore: Retrieves all lcores that share the LLC. > > - rte_get_llc_n_lcore: Retrieves the first n or skips the first n > lcores in the shared LLC. 
> > ``` > > > > MACRO’s extending the usability were > > ``` > > RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from each > LLC. > > RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first worker > lcore from each LLC. > > RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on hint > (lcore id). > > RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC while > skipping first worker. > > RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores from > each LLC. > > RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then > iterates through reaming lcores in each LLC. > > ``` > > > > Based on the discussions we agreed on sharing version-2 FRC for > extending API as `rte_get_next_lcore_extnd` with extra argument as > `flags`. > > As per my ideation, for the API ` rte_get_next_sibling_core`, the above > API can easily with flag ` RTE_GET_LCORE_L1 (SMT)`. Is this right > understanding? > > We can easily have simple MACROs like `RTE_LCORE_FOREACH_L1` which > allows to iterate SMT sibling threads. > > This seems like a lot of new macro and API additions! I'd really like to cut that back and simplify the amount of new things we are adding to DPDK for this. I tend to agree with others that external libs would be better for apps that really want to deal with all this. > > > > > > Looking logically, I'm not sure about the BOOST_ENABLED and > > > BOOST_DISABLED flags you propose > > The idea for the BOOST_ENABLED & BOOST_DISABLED is based on DPDK power > library which allows to enable boost. > > Allow user to select lcores where BOOST is enabled|disabled using MACRO > or API. > > > > - in a system with multiple possible > > > standard and boost frequencies what would those correspond to? > > I now understand the confusion, apologies for mixing the AMD EPYC SoC > boost with Intel Turbo. > > > > Thank you for pointing out, we will use the terminology ` > RTE_GET_LCORE_TURBO`. > > That still doesn't clarify it for me. If you start mixing in power management related functions in with topology ones things will turn into a real headache. What does boost or turbo correspond to? Is it for cores that have the feature enabled - whether or not it's currently in use - or is it for finding cores that are currently boosted? Do we need additions for cores that are boosted by 100Mhz vs say 300Mhz. What about cores that are in lower frequencies for power-saving. Do we add macros for finding those? > > What's also > > > missing is a define for getting actual NUMA siblings i.e. those > sharing common > > > memory but not an L3 or anything else. > > This can be extended into `rte_get_next_lcore_extnd` with flag ` > RTE_GET_LCORE_NUMA`. This will allow to grab all lcores under the same > sub-memory NUMA as shared by LCORE. > > If SMT sibling is enabled and DPDK Lcore mask covers the sibling > threads, then ` RTE_GET_LCORE_NUMA` get all lcore and sibling threads > under same memory NUMA of lcore shared. > > Yes. That can work. But it means we are basing the implementation on a fixed idea of what topologies there are or can exist. My suggestion below is just to ignore the whole idea of L1 vs L2 vs NUMA - just give the app a way to find it's nearest nodes. After all, the app doesn't want to know the topology just for the sake of knowing it - it wants it to ensure best placement of work on cores! To that end, it just needs to know what cores are near to each other and what are far away. 
> > > > > > My suggestion would be to have the function take just an integer-type > e.g. > > > uint16_t parameter which defines the memory/cache hierarchy level to > use, 0 > > > being lowest, 1 next, and so on. Different systems may have different > numbers > > > of cache levels so lets just make it a zero-based index of levels, > rather than > > > giving explicit defines (except for memory which should probably > always be > > > last). The zero-level will be for "closest neighbour" > > Good idea, we did prototype this internally. But issue it will keep on > adding the number of API into lcore library. > > To keep the API count less, we are using lcore id as hint to sub-NUMA. > I'm unclear about this keeping the API count down - you are proposing a lot of APIs and macros up above. My suggestion is basically to add two APIs and no macros: one API to get the max number of topology-nearness levels, and a second API to get the next sibling a given nearness level from 0(nearest)..N(furthest). If we want, we can also add a FOREACH macro too. Overall, though, as I say above, let's focus on the problem the app actually wants these APIs for, not how we think we should solve it. Apps don't want to know the topology for knowledge sake, they want to use that knowledge to improve performance by pinning tasks to cores. What is the minimum that we need to provide to enable the app to do that? For example, if there are no lcores that share an L1, then from an app topology viewpoint that L1 level may as well not exist, because it provides us no details on how to place our work. For the rare app that does have some esoteric use-case that does actually want to know some intricate details of the topology, then having that app use an external lib is probably a better solution than us trying to cover all possible options in DPDK. My 2c. on this at this stage anyway. /Bruce
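A rough sketch of the two-call shape suggested above: one call returning the number of topology-nearness levels and one iterator over siblings at a given level. Both function names and signatures are invented for illustration, following the -1 start convention of `rte_get_next_lcore()`.

```
#include <stdio.h>

#include <rte_lcore.h>

/*
 * Hypothetical interface for the suggestion above; neither function
 * exists in DPDK. Level 0 is the closest neighbourhood (e.g. an L2
 * group), higher levels are progressively larger sharing domains,
 * with the last level covering a whole memory (NUMA) domain.
 */
unsigned int rte_lcore_topo_num_levels(void);
unsigned int rte_get_next_lcore_near(unsigned int i, unsigned int ref_lcore,
				     unsigned int level);

void
dump_neighbours(unsigned int ref_lcore)
{
	unsigned int num_levels = rte_lcore_topo_num_levels();
	unsigned int level, i;

	for (level = 0; level < num_levels; level++) {
		printf("lcore %u, level %u neighbours:", ref_lcore, level);
		/* Start from -1, as with rte_get_next_lcore(). */
		for (i = rte_get_next_lcore_near(-1, ref_lcore, level);
		     i < RTE_MAX_LCORE;
		     i = rte_get_next_lcore_near(i, ref_lcore, level))
			printf(" %u", i);
		printf("\n");
	}
}
```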
[AMD Official Use Only - AMD Internal Distribution Only] <snipped> > >>>> Thank you Mattias for the information, as shared by in the reply > >>>> with > >> Anatoly we want expose a new API `rte_get_next_lcore_ex` which > >> intakes a extra argument `u32 flags`. > >>>> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, > >> RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED, > >> RTE_GET_LCORE_BOOST_DISABLED. > >>> > >>> Wouldn't using that API be pretty awkward to use? > > Current API available under DPDK is ` rte_get_next_lcore`, which is used > within DPDK example and in customer solution. > > Based on the comments from others we responded to the idea of changing > the new Api from ` rte_get_next_lcore_llc` to ` rte_get_next_lcore_exntd`. > > > > Can you please help us understand what is `awkward`. > > > > The awkwardness starts when you are trying to fit provide hwloc type > information over an API that was designed for iterating over lcores. I disagree to this point, current implementation of lcore libraries is only focused on iterating through list of enabled cores, core-mask, and lcore-map. With ever increasing core count, memory, io and accelerators on SoC, sub-numa partitioning is common in various vendor SoC. Enhancing or Augumenting lcore API to extract or provision NUMA, Cache Topology is not awkward. If memory, IO and accelerator can have sub-NUMA domain, why is it awkward to have lcore in domains? Hence I do not agree on the awkwardness argument. > > It seems to me that you should either have: > A) An API in similar to that of hwloc (or any DOM-like API), which would give a > low-level description of the hardware in implementation terms. > The topology would consist of nodes, with attributes, etc, where nodes are > things like cores or instances of caches of some level and attributes are things > like CPU actual and nominal, and maybe max frequency, cache size, or memory > size. Here is the catch, `rte_eal_init` internally invokes `get_cpu|lcores` and populates thread (lcore) to physical CPU. But there is more than just CPU mapping, as we have seeing in SoC architecture. The argument shared by many is `DPDK is not the place for such topology discovery`. As per my current understanding, I have to disagree to the abive because 1. forces user to use external libraries example like hwloc 2. forces user to creating internal mapping for lcore, core-mask, and lcore-map with topology awareness code. My intention is to `enable end user to leverage the API format or similar API format (rte_get_next_lcore)` to get best results on any SoC (vendor agnostic). I fail to grasp why we are asking CPU topology to exported, while NIC, PCIe and accelerators are not asked to be exported via external libraries like hwloc. Hence let us setup tech call in slack or teams to understand this better. > or > B) An API to be directly useful for a work scheduler, in which case you should > abstract away things like "boost" Please note as shared in earlier reply to Bruce, I made a mistake of calling it boost (AMD SoC terminology). Instead it should DPDK_TURBO. There are use cases and DPDK examples, where cypto and compression are run on cores where TURBO is enabled. This allows end users to boost when there is more work and disable boost when there is less or no work. > (and fold them into some abstract capacity notion, together with core "size" [in big-little/heterogeneous systems]), and > have an abstract notion of what core is "close" to some other core. 
This would > something like Linux' > scheduling domains. We had similar discussion with Jerrin on the last day of Bangkok DPDK summit. This RFC was intended to help capture this relevant point. With my current understanding on selected SoC the little core on ARM Soc shares L2 cache, while this analogy does not cover all cases. But this would be good start. > > If you want B you probably need A as a part of its implementation, so you may > just as well start with A, I suppose. > > What you could do to explore the API design is to add support for, for > example, boost core awareness or SMT affinity in the SW scheduler. You could > also do an "lstopo" equivalent, since that's needed for debugging and > exploration, if nothing else. Not following on this analogy, will discuss in detail in tech talk > > One question that will have to be answered in a work scheduling scenario is > "are these two lcores SMT siblings," or "are these two cores on the same LLC", > or "give me all lcores on a particular L2 cache". > Is not that we have been trying to address based on Anatoly request to generalize than LLC. Hence we agreed on sharing version-2 of RFC with `rte_get_nex_lcore_extnd` with `flags`. May I ask where is the disconnect? > >>> > >>> I mean, what you have is a topology, with nodes of different types > >>> and with > >> different properties, and you want to present it to the user. > > Let me be clear, what we want via DPDK to help customer to use an Unified > API which works across multiple platforms. > > Example - let a vendor have 2 products namely A and B. CPU-A has all cores > within same SUB-NUMA domain and CPU-B has cores split to 2 sub-NUMA > domain based on split LLC. > > When `rte_get_next_lcore_extnd` is invoked for `LLC` on 1. CPU-A: it > > returns all cores as there is no split 2. CPU-B: it returns cores from > > specific sub-NUMA which is partitioned by L3 > > > > I think the function name rte_get_next_lcore_extnd() alone makes clear this is an awkward API. :) I humbly disagree to this statement, as explained above. > > My gut feeling is to make it more explicit and forget about <rte_lcore.h>. > <rte_hwtopo.h>? Could and should still be EAL. For me this is like adding a new level of library and more code. While the easiest way was to add an API similar to existing `get_next_lcore` style for easy adoption. > > >>> > >>> In a sense, it's similar to XCM and DOM versus SAX. The above is > >>> SAX-style, > >> and what I have in mind is something DOM-like. > >>> > >>> What use case do you have in mind? What's on top of my list is a scenario where a DPDK app gets a bunch of cores (e.g., -l <cores>) and tries to figure out how best make use of them. > > Exactly. > > > > It's not going to "skip" (ignore, leave unused) > >> SMT siblings, or skip non-boosted cores, it would just try to be > >> clever in regards to which cores to use for what purpose. > > Let me try to share my idea on SMT sibling. When user invoked for > rte_get_next_lcore_extnd` is invoked for `L1 | SMT` flag with `lcore`; the API > identifies first whether given lcore is part of enabled core list. > > If yes, it programmatically either using `sysfs` or `hwloc library (shared the > version concern on distros. Will recheck again)` identify the sibling thread and > return. > > If there is no sibling thread available under DPDK it will fetch next lcore > (probably lcore +1 ). > > > > Distributions having old hwloc versions isn't an argument for a new DPDK library or new API. 
If only that was the issue, then it would be better to help the hwloc and/or distributions, rather than the DPDK project. I do not agree with the statement `Distributions having old hwloc versions isn't an argument for a new DPDK library or new API`, because that is not my argument. Let me be clear: on Ampere and AMD there are two relevant BIOS settings, 1. SLC or L3 as NUMA, and 2. NUMA for IO|memory. When `NUMA for IO|memory` is set, the hwloc library works as expected. But when `L3 as NUMA` is set, it gives incorrect details. We have been fixing this and pushing the fixes upstream, but as I shared earlier, hardly any distro ships the latest hwloc. Hence, to keep things simple, the DPDK documentation points to the AMD SoC tuning guide, where we recommend not enabling `L3 as NUMA`. The end goal for me is a vendor-agnostic API which is easy to understand and use, and which works irrespective of BIOS settings. I have enabled parsing of the OS `sysfs` in the RFC, but if the recommendation is to use `hwloc`, as shared in the response to Stephen, I am open to trying that again. <snipped>
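For reference, the kind of `sysfs` parsing mentioned above could look roughly like the following. This is a generic Linux sketch based on the `shared_cpu_list` files under /sys/devices/system/cpu, not the RFC's actual implementation; a robust version would match the cache `level`/`type` files rather than hard-coding index3.

```
#include <stdio.h>

/*
 * Print the CPUs that share cpu_id's last level cache, as reported by
 * the Linux kernel. On most x86 systems cache/index3 is the unified L3;
 * a robust implementation should walk index0..indexN and check the
 * "level" and "type" files instead of hard-coding the index.
 */
int
print_llc_siblings(unsigned int cpu_id)
{
	char path[128];
	char cpu_list[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%u/cache/index3/shared_cpu_list",
		 cpu_id);

	f = fopen(path, "r");
	if (f == NULL)
		return -1;

	if (fgets(cpu_list, sizeof(cpu_list), f) == NULL) {
		fclose(f);
		return -1;
	}
	fclose(f);

	/* Typically a range list such as "0-7,128-135" on an SMT system. */
	printf("cpu%u LLC siblings: %s", cpu_id, cpu_list);
	return 0;
}
```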
[Public] Snipped > > > To to be clear; it's something like this I think of when I say "DOM-style" API. > > #ifndef RTE_HWTOPO_H > #define RTE_HWTOPO_H > > struct rte_hwtopo_node; > > enum rte_hwtopo_node_type { > RTE_HWTOPO_NODE_TYPE_CPU_CORE, > RTE_HWTOPO_NODE_TYPE_CACHE, > RTE_HWTOPO_NODE_TYPE_NUMA > }; > > int > rte_hwtopo_init(void); > > struct rte_hwtopo_node * > rte_hwtopo_get_core_by_lcore(unsigned int lcore); > > struct rte_hwtopo_node * > rte_hwtopo_get_core_by_id(unsigned int os_cpu_id); > > struct rte_hwtopo_node * > rte_hwtopo_parent(struct rte_hwtopo_node *node); > > struct rte_hwtopo_node * > rte_hwtopo_first_child(struct rte_hwtopo_node *node); > > struct rte_hwtopo_node * > rte_hwtopo_next_child(struct rte_hwtopo_node *node, > struct rte_hwtopo_node *child); > > struct rte_hwtopo_node * > rte_hwtopo_first_sibling(struct rte_hwtopo_node *node); > > struct rte_hwtopo_node * > rte_hwtopo_next_sibling(struct rte_hwtopo_node *node, > struct rte_hwtopo_node *child); > > enum rte_hwtopo_node_type > rte_hwtopo_get_type(struct rte_hwtopo_node *node); > > #define RTE_HWTOPO_NODE_ATTR_CORE_FREQUENCY_NOMINAL 0 #define > RTE_HWTOPO_NODE_ATTR_CACHE_LEVEL 1 #define > RTE_HWTOPO_NODE_ATTR_CACHE_SIZE 2 > > int > rte_hwtopo_get_attr_int64(struct rte_hwtopo_node *node, unsigned int > attr_name, > int64_t *attr_value); > > int > rte_hwtopo_get_attr_str(struct rte_hwtopo_node *node, unsigned int > attr_name, > char *attr_value, size_t capacity); > > #endif > > Surely, this too would be awkward (or should I say cumbersome) to use in certain scenarios. This appears to be more like hwloc api calls, as shared in my earlier email my intention with the API suggestion is not introduce new library. I have certain reservations and with my current understanding I am not able to map certain DPDK core mapping. Let discuss this in technical call. Snipped
[Public] Snipped > > > > > > Based on the discussions we agreed on sharing version-2 FRC for > > extending API as `rte_get_next_lcore_extnd` with extra argument as > > `flags`. > > > > As per my ideation, for the API ` rte_get_next_sibling_core`, the above > > API can easily with flag ` RTE_GET_LCORE_L1 (SMT)`. Is this right > > understanding? > > > > We can easily have simple MACROs like `RTE_LCORE_FOREACH_L1` which > > allows to iterate SMT sibling threads. > > > > > > This seems like a lot of new macro and API additions! I'd really like to cut that > back and simplify the amount of new things we are adding to DPDK for this. I disagree Bruce, as per the new conversation with Anatoly and you it has been shared the new API are ``` 1. rte_get_next_lcore_exntd 2. rte_get_next_n_lcore_exntd ``` While I mentioned custom Macro can augment based on typical flag usage similar to ` RTE_LCORE_FOREACH and RTE_LCORE_FOREACH_WORKER` as ``` RTE_LCORE_FOREACH_FLAG RTE_LCORE_FOREACH_WORKER_FLAG Or RTE_LCORE_FOREACH_LLC RTE_LCORE_FOREACH_WORKER_LLC ``` Please note I have not even shared version-2 of RFC yet. > I tend to agree with others that external libs would be better for apps that really want to deal with all this. I have covered why this is not a good idea for Mattias query. > > > > > > > > > > > Looking logically, I'm not sure about the BOOST_ENABLED and BOOST_DISABLED flags you propose > > The idea for the BOOST_ENABLED & BOOST_DISABLED is based on DPDK power library which allows to enable boost. > > Allow user to select lcores where BOOST is enabled|disabled using MACRO or API. May be there is confusion, so let me try to be explicit here. The intention of any `rte_get_next_lcore##` is fetch lcores. Hence with new proposed API `rte_get_next_lcore_exntd` with `flag set for Boost` is to fetch lcores where boost is enabled. There is no intention to enable or disable boost on lcore with `get` API. > > > > > > > > - in a system with multiple possible > > > > > standard and boost frequencies what would those correspond to? > > > > I now understand the confusion, apologies for mixing the AMD EPYC SoC > > boost with Intel Turbo. > > > > > > > > Thank you for pointing out, we will use the terminology ` > > RTE_GET_LCORE_TURBO`. > > > > > > That still doesn't clarify it for me. If you start mixing in power management related functions in with topology ones things will turn into a real headache. Can you please tell me what is not clarified. DPDK lcores as of today has no notion of Cache, Numa, Power, Turbo or any DPDK supported features. The initial API introduced were to expose lcore sharing the same Last Level Cache. Based on interaction with Anatoly, extending this to support multiple features turned out to be possibility. Hence, we said we can share v2 for RFC based on this idea. But if the claim is not to put TURBO I am also ok for this. Let only keep cache and NUMA-IO domain. > What does boost or turbo correspond to? Is it for cores that have the feature enabled - whether or not it's currently in use - or is it for finding cores that are > currently boosted? Do we need additions for cores that are boosted by 100Mhz vs say 300Mhz. What about cores that are in lower frequencies for > power-saving. Do we add macros for finding those? Why are we talking about feq-up and freq-down? This was not even discussed in this RFC patch at all. > > > > What's also > > > > > missing is a define for getting actual NUMA siblings i.e. those > > sharing common memory but not an L3 or anything else. 
> > > > This can be extended into `rte_get_next_lcore_extnd` with flag ` > > RTE_GET_LCORE_NUMA`. This will allow to grab all lcores under the same > > sub-memory NUMA as shared by LCORE. > > > > If SMT sibling is enabled and DPDK Lcore mask covers the sibling > > threads, then ` RTE_GET_LCORE_NUMA` get all lcore and sibling threads > > under same memory NUMA of lcore shared. > > > > > > Yes. That can work. But it means we are basing the implementation on a fixed idea of what topologies there are or can exist. > My suggestion below is just to ignore the whole idea of L1 vs L2 vs NUMA - just give the app a way to find it's nearest nodes. Bruce, for different vendor SoC, the implementation of architecture is different. Let me share what I know 1. using L1, we can fetch SMT threads 2. using L2 we can get certain SoC on Arm, Intel and power PC which is like efficient cores 3. using L3 we can get certain SoC like AMD, AF64x and others which follow chiplet or tile split L3 domain. > > After all, the app doesn't want to know the topology just for the sake of knowing it - it wants it to ensure best placement of work on cores! To that end, it just needs to know what cores are near to each other and what are far away. Exactly, that is why we want to minimize new libraries and limit to format of existing API `rte_get_next_lcore`. The end user need to deploy another library or external library then map to DPDK lcore mapping to identify what is where. So as end user I prefer simple API which get my work done. > > > > > > > > > > > My suggestion would be to have the function take just an integer-type > > e.g. > > > > > uint16_t parameter which defines the memory/cache hierarchy level to > > use, 0 > > > > > being lowest, 1 next, and so on. Different systems may have different > > numbers > > > > > of cache levels so lets just make it a zero-based index of levels, > > rather than > > > > > giving explicit defines (except for memory which should probably > > always be > > > > > last). The zero-level will be for "closest neighbour" > > > > Good idea, we did prototype this internally. But issue it will keep on > > adding the number of API into lcore library. > > > > To keep the API count less, we are using lcore id as hint to sub-NUMA. > > > > I'm unclear about this keeping the API count down - you are proposing a lot of APIs and macros up above. No, I am not. I have shared based on the last discussion with Anatoly we will end up with 2 API in lcore only. Explained in the above response > My suggestion is basically to add two APIs and no macros: one API to get the max number of topology-nearness levels, and a > second API to get the next sibling a given nearness level from > 0(nearest)..N(furthest). If we want, we can also add a FOREACH macro too. > > Overall, though, as I say above, let's focus on the problem the app actually > wants these APIs for, not how we think we should solve it. Apps don't want to > know the topology for knowledge sake, they want to use that knowledge to > improve performance by pinning tasks to cores. What is the minimum that we > need to provide to enable the app to do that? For example, if there are no > lcores that share an L1, then from an app topology viewpoint that L1 level may > as well not exist, because it provides us no details on how to place our work. I have shared above why we need vendor agnostic L1, L2, L3 and sub-NUMA-IO. Snipped
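As an illustration of the L1/SMT case described above, a hypothetical pairing of each worker with its SMT sibling; the flag value and getter signature are invented, as in the earlier sketches.

```
#include <stdint.h>
#include <stdio.h>

#include <rte_lcore.h>

/* Hypothetical flag and getter from this thread; not a DPDK API. */
#define RTE_GET_LCORE_L1 (1u << 0)
unsigned int rte_get_next_lcore_extnd(unsigned int i, int skip_main,
				      int wrap, uint32_t flags);

/*
 * Pair every worker with the lcore sharing its L1, i.e. its SMT sibling,
 * e.g. to put RX on one hyperthread and TX on the other. Per the thread,
 * if no sibling is in the enabled lcore set, the proposal falls back to
 * the next enabled lcore.
 */
void
pair_smt_siblings(void)
{
	unsigned int lcore;

	RTE_LCORE_FOREACH_WORKER(lcore) {
		unsigned int sibling =
			rte_get_next_lcore_extnd(lcore, 1, 0,
						 RTE_GET_LCORE_L1);

		if (sibling < RTE_MAX_LCORE)
			printf("lcore %u <-> sibling %u\n", lcore, sibling);
	}
}
```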
On 2024-09-12 13:17, Varghese, Vipin wrote: > [AMD Official Use Only - AMD Internal Distribution Only] > > <snipped> >> >>>> Thank you Mattias for the information, as shared by in the reply >> >>>> with >> >> Anatoly we want expose a new API `rte_get_next_lcore_ex` which >> >> intakes a extra argument `u32 flags`. >> >>>> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, >> >> RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED, >> >> RTE_GET_LCORE_BOOST_DISABLED. >> >>> >> >>> Wouldn't using that API be pretty awkward to use? >> > Current API available under DPDK is ` rte_get_next_lcore`, which is used >> within DPDK example and in customer solution. >> > Based on the comments from others we responded to the idea of changing >> the new Api from ` rte_get_next_lcore_llc` to ` rte_get_next_lcore_exntd`. >> > >> > Can you please help us understand what is `awkward`. >> > >> >> The awkwardness starts when you are trying to fit provide hwloc type >> information over an API that was designed for iterating over lcores. > I disagree to this point, current implementation of lcore libraries is > only focused on iterating through list of enabled cores, core-mask, and > lcore-map. > With ever increasing core count, memory, io and accelerators on SoC, > sub-numa partitioning is common in various vendor SoC. Enhancing or > Augumenting lcore API to extract or provision NUMA, Cache Topology is > not awkward. DPDK providing an API for this information makes sense to me, as I've mentioned before. What I questioned was the way it was done (i.e., the API design) in your RFC, and the limited scope (which in part you have addressed). > If memory, IO and accelerator can have sub-NUMA domain, why is it > awkward to have lcore in domains? Hence I do not agree on the > awkwardness argument. >> >> It seems to me that you should either have: >> A) An API in similar to that of hwloc (or any DOM-like API), which would give a >> low-level description of the hardware in implementation terms. >> The topology would consist of nodes, with attributes, etc, where nodes are >> things like cores or instances of caches of some level and attributes are things >> like CPU actual and nominal, and maybe max frequency, cache size, or memory >> size. > Here is the catch, `rte_eal_init` internally invokes `get_cpu|lcores` > and populates thread (lcore) to physical CPU. But there is more than > just CPU mapping, as we have seeing in SoC architecture. The argument > shared by many is `DPDK is not the place for such topology discovery`. > As per my current understanding, I have to disagree to the abive because > 1. forces user to use external libraries example like hwloc > 2. forces user to creating internal mapping for lcore, core-mask, and > lcore-map with topology awareness code. > My intention is to `enable end user to leverage the API format or > similar API format (rte_get_next_lcore)` to get best results on any SoC > (vendor agnostic). > I fail to grasp why we are asking CPU topology to exported, while NIC, > PCIe and accelerators are not asked to be exported via external > libraries like hwloc. > Hence let us setup tech call in slack or teams to understand this better. >> or >> B) An API to be directly useful for a work scheduler, in which case you should >> abstract away things like "boost" > Please note as shared in earlier reply to Bruce, I made a mistake of > calling it boost (AMD SoC terminology). Instead it should DPDK_TURBO. 
> There are use cases and DPDK examples, where cypto and compression are > run on cores where TURBO is enabled. This allows end users to boost when > there is more work and disable boost when there is less or no work. >> (and fold them into some abstract capacity notion, together with core "size" [in big-little/heterogeneous systems]), and >> have an abstract notion of what core is "close" to some other core. This would >> something like Linux' >> scheduling domains. > We had similar discussion with Jerrin on the last day of Bangkok DPDK > summit. This RFC was intended to help capture this relevant point. With > my current understanding on selected SoC the little core on ARM Soc > shares L2 cache, while this analogy does not cover all cases. But this > would be good start. >> >> If you want B you probably need A as a part of its implementation, so you may >> just as well start with A, I suppose. >> >> What you could do to explore the API design is to add support for, for >> example, boost core awareness or SMT affinity in the SW scheduler. You could >> also do an "lstopo" equivalent, since that's needed for debugging and >> exploration, if nothing else. > Not following on this analogy, will discuss in detail in tech talk >> >> One question that will have to be answered in a work scheduling scenario is >> "are these two lcores SMT siblings," or "are these two cores on the same LLC", >> or "give me all lcores on a particular L2 cache". >> > Is not that we have been trying to address based on Anatoly request to > generalize than LLC. Hence we agreed on sharing version-2 of RFC with > `rte_get_nex_lcore_extnd` with `flags`. > May I ask where is the disconnect? >> >>> >> >>> I mean, what you have is a topology, with nodes of different types >> >>> and with >> >> different properties, and you want to present it to the user. >> > Let me be clear, what we want via DPDK to help customer to use an Unified >> API which works across multiple platforms. >> > Example - let a vendor have 2 products namely A and B. CPU-A has all cores >> within same SUB-NUMA domain and CPU-B has cores split to 2 sub-NUMA >> domain based on split LLC. >> > When `rte_get_next_lcore_extnd` is invoked for `LLC` on 1. CPU-A: it >> > returns all cores as there is no split 2. CPU-B: it returns cores from >> > specific sub-NUMA which is partitioned by L3 >> > >> >> I think the function name rte_get_next_lcore_extnd() alone makes clear this is an awkward API. :) > I humbly disagree to this statement, as explained above. >> >> My gut feeling is to make it more explicit and forget about <rte_lcore.h>. >> <rte_hwtopo.h>? Could and should still be EAL. > For me this is like adding a new level of library and more code. While > the easiest way was to add an API similar to existing `get_next_lcore` > style for easy adoption. A poorly designed, special-case API is not less work. It's just less work for *you* *now*, and much more work for someone in the future to clean it up. >> >> >>> >> >>> In a sense, it's similar to XCM and DOM versus SAX. The above is >> >>> SAX-style, >> >> and what I have in mind is something DOM-like. >> >>> >> >>> What use case do you have in mind? What's on top of my list is a scenario where a DPDK app gets a bunch of cores (e.g., -l <cores>) and tries to figure out how best make use of them. >> > Exactly. >> > >> > It's not going to "skip" (ignore, leave unused) >> >> SMT siblings, or skip non-boosted cores, it would just try to be >> >> clever in regards to which cores to use for what purpose. 
>> > Let me try to share my idea on SMT sibling. When user invoked for >> rte_get_next_lcore_extnd` is invoked for `L1 | SMT` flag with `lcore`; the API >> identifies first whether given lcore is part of enabled core list. >> > If yes, it programmatically either using `sysfs` or `hwloc library (shared the >> version concern on distros. Will recheck again)` identify the sibling thread and >> return. >> > If there is no sibling thread available under DPDK it will fetch next lcore >> (probably lcore +1 ). >> > >> >> Distributions having old hwloc versions isn't an argument for a new DPDK library or new API. If only that was the issue, then it would be better to help the hwloc and/or distributions, rather than the DPDK project. > I do not agree to terms of ` Distributions having old hwloc versions > isn't an argument for a new DPDK library or new API.` Because this is > not what my intention is. Let me be clear on Ampere & AMD Bios settings > are 2 > 1. SLC or L3 as NUMA enable > 2. Numa for IO|memory > With `NUMA for IO|memory` is set hwloc library works as expected. But > when `L3 as NUMA` is set gives incorrect details. We have been fixing > this and pushing to upstream. But as I clearly shared, version of > distros having latest hwloc is almost nil. > Hence to keep things simple, in documentation of DPDK we pointed to AMD > SoC tuning guide we have been recommending not to enable `L3 as NUMA`. > Now end goal for me is to allow vendor agnostic API which is easy to > understand and use, and works irrespective of BIOS settings. I have > enabled parsing of OS `sysfs` as a RFC. But if the comment is to use > `hwloc` as shared with response for Stephen I am open to try this again. > <snipped>
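For reference, the sysfs lookup mentioned above needs no new DPDK or hwloc dependency; a minimal standalone sketch reading the standard Linux sysfs topology files (error handling trimmed for brevity):

```
/*
 * Standalone sketch (no DPDK dependency): print the SMT siblings of a
 * given CPU by reading the standard Linux sysfs topology files.
 */
#include <stdio.h>
#include <stdlib.h>

static int
print_smt_siblings(unsigned int cpu)
{
	char path[128];
	char buf[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%u/topology/thread_siblings_list",
		 cpu);
	f = fopen(path, "r");
	if (f == NULL)
		return -1;
	if (fgets(buf, sizeof(buf), f) != NULL)
		printf("cpu%u SMT siblings: %s", cpu, buf); /* e.g. "0,128" */
	fclose(f);
	return 0;
}

int
main(int argc, char **argv)
{
	unsigned int cpu = argc > 1 ? (unsigned int)atoi(argv[1]) : 0;

	return print_smt_siblings(cpu) == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```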
On 2024-09-12 13:23, Varghese, Vipin wrote: > [Public] > > Snipped >> >> >> To to be clear; it's something like this I think of when I say "DOM-style" API. >> >> #ifndef RTE_HWTOPO_H >> #define RTE_HWTOPO_H >> >> struct rte_hwtopo_node; >> >> enum rte_hwtopo_node_type { >> RTE_HWTOPO_NODE_TYPE_CPU_CORE, >> RTE_HWTOPO_NODE_TYPE_CACHE, >> RTE_HWTOPO_NODE_TYPE_NUMA >> }; >> >> int >> rte_hwtopo_init(void); >> >> struct rte_hwtopo_node * >> rte_hwtopo_get_core_by_lcore(unsigned int lcore); >> >> struct rte_hwtopo_node * >> rte_hwtopo_get_core_by_id(unsigned int os_cpu_id); >> >> struct rte_hwtopo_node * >> rte_hwtopo_parent(struct rte_hwtopo_node *node); >> >> struct rte_hwtopo_node * >> rte_hwtopo_first_child(struct rte_hwtopo_node *node); >> >> struct rte_hwtopo_node * >> rte_hwtopo_next_child(struct rte_hwtopo_node *node, >> struct rte_hwtopo_node *child); >> >> struct rte_hwtopo_node * >> rte_hwtopo_first_sibling(struct rte_hwtopo_node *node); >> >> struct rte_hwtopo_node * >> rte_hwtopo_next_sibling(struct rte_hwtopo_node *node, >> struct rte_hwtopo_node *child); >> >> enum rte_hwtopo_node_type >> rte_hwtopo_get_type(struct rte_hwtopo_node *node); >> >> #define RTE_HWTOPO_NODE_ATTR_CORE_FREQUENCY_NOMINAL 0 #define >> RTE_HWTOPO_NODE_ATTR_CACHE_LEVEL 1 #define >> RTE_HWTOPO_NODE_ATTR_CACHE_SIZE 2 >> >> int >> rte_hwtopo_get_attr_int64(struct rte_hwtopo_node *node, unsigned int >> attr_name, >> int64_t *attr_value); >> >> int >> rte_hwtopo_get_attr_str(struct rte_hwtopo_node *node, unsigned int >> attr_name, >> char *attr_value, size_t capacity); >> >> #endif >> >> Surely, this too would be awkward (or should I say cumbersome) to use in certain scenarios. > This appears to be more like hwloc api calls, as shared in my earlier > email my intention with the API suggestion is not introduce new library. > I have certain reservations and with my current understanding I am not > able to map certain DPDK core mapping. Let discuss this in technical call. > Snipped It still would need to be a part of EAL (so not a new library), since EAL surely would depend on it (sooner rather than later). If this functionality should be a new library, or a new API in an existing library, it doesn't really matter if your original intentions where something else, does it.
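To make the quoted proposal concrete, a short usage sketch written against the rte_hwtopo.h prototypes above; every rte_hwtopo_* symbol here is the proposed API from this thread, not an existing DPDK interface, and a 0 return from the attribute getter is assumed to mean success:

```
/*
 * Usage sketch against the proposed rte_hwtopo.h quoted above (not an
 * existing DPDK API): walk towards the root of the topology tree until
 * the L3 cache node covering a given lcore is found.
 */
#include <stdint.h>
#include <stddef.h>

static struct rte_hwtopo_node *
find_l3_of_lcore(unsigned int lcore)
{
	struct rte_hwtopo_node *node = rte_hwtopo_get_core_by_lcore(lcore);

	while (node != NULL) {
		if (rte_hwtopo_get_type(node) == RTE_HWTOPO_NODE_TYPE_CACHE) {
			int64_t level;

			if (rte_hwtopo_get_attr_int64(node,
					RTE_HWTOPO_NODE_ATTR_CACHE_LEVEL,
					&level) == 0 && level == 3)
				return node;
		}
		node = rte_hwtopo_parent(node);
	}
	return NULL; /* no L3 above this lcore's core node */
}
```

The lcores sharing that L3 would then be the CPU core descendants of the returned node, which is essentially the query both the LLC RFC and a topology-aware work scheduler need.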
On 2024-09-12 11:17, Bruce Richardson wrote: > On Thu, Sep 12, 2024 at 02:19:07AM +0000, Varghese, Vipin wrote: >> [Public] >> >> <snipped> >> >> >> >> > > > > <snipped> >> >> > > > > >> >> > > > >>> <snipped> >> >> > > > >>> >> >> > > > >>> Thank you Mattias for the comments and question, please let >> me >> >> > > > >>> try to explain the same below >> >> > > > >>> >> >> > > > >>>> We shouldn't have a separate CPU/cache hierarchy API >> instead? >> >> > > > >>> >> >> > > > >>> Based on the intention to bring in CPU lcores which share >> same >> >> > > > >>> L3 (for better cache hits and less noisy neighbor) current >> API >> >> > > > >>> focuses on using >> >> > > > >>> >> >> > > > >>> Last Level Cache. But if the suggestion is `there are SoC >> where >> >> > > > >>> L2 cache are also shared, and the new API should be >> >> > > > >>> provisioned`, I am also >> >> > > > >>> >> >> > > > >>> comfortable with the thought. >> >> > > > >>> >> >> > > > >> >> >> > > > >> Rather than some AMD special case API hacked into >> <rte_lcore.h>, >> >> > > > >> I think we are better off with no DPDK API at all for this >> kind of >> >> > functionality. >> >> > > > > >> >> > > > > Hi Mattias, as shared in the earlier email thread, this is not >> a >> >> > > > > AMD special >> >> > > > case at all. Let me try to explain this one more time. One of >> >> > > > techniques used to increase cores cost effective way to go for >> tiles of >> >> > compute complexes. >> >> > > > > This introduces a bunch of cores in sharing same Last Level >> Cache >> >> > > > > (namely >> >> > > > L2, L3 or even L4) depending upon cache topology architecture. >> >> > > > > >> >> > > > > The API suggested in RFC is to help end users to selectively >> use >> >> > > > > cores under >> >> > > > same Last Level Cache Hierarchy as advertised by OS (irrespective >> of >> >> > > > the BIOS settings used). This is useful in both bare-metal and >> container >> >> > environment. >> >> > > > > >> >> > > > >> >> > > > I'm pretty familiar with AMD CPUs and the use of tiles (including >> >> > > > the challenges these kinds of non-uniformities pose for work >> scheduling). >> >> > > > >> >> > > > To maximize performance, caring about core<->LLC relationship may >> >> > > > well not be enough, and more HT/core/cache/memory topology >> >> > > > information is required. That's what I meant by special case. A >> >> > > > proper API should allow access to information about which lcores >> are >> >> > > > SMT siblings, cores on the same L2, and cores on the same L3, to >> >> > > > name a few things. Probably you want to fit NUMA into the same >> API >> >> > > > as well, although that is available already in <rte_lcore.h>. >> >> > > >> >> > > Thank you Mattias for the information, as shared by in the reply >> with >> >> > Anatoly we want expose a new API `rte_get_next_lcore_ex` which >> intakes a >> >> > extra argument `u32 flags`. >> >> > > The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, >> >> > RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED, >> >> > RTE_GET_LCORE_BOOST_DISABLED. >> >> > > >> >> > >> >> > For the naming, would "rte_get_next_sibling_core" (or lcore if you >> prefer) be a >> >> > clearer name than just adding "ex" on to the end of the existing >> function? >> >> Thank you Bruce, Please find my answer below >> >> >> >> Functions shared as per the RFC were >> >> ``` >> >> - rte_get_llc_first_lcores: Retrieves all the first lcores in the >> shared LLC. >> >> - rte_get_llc_lcore: Retrieves all lcores that share the LLC. 
>> >> - rte_get_llc_n_lcore: Retrieves the first n or skips the first n >> lcores in the shared LLC. >> >> ``` >> >> >> >> MACRO’s extending the usability were >> >> ``` >> >> RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from each >> LLC. >> >> RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first worker >> lcore from each LLC. >> >> RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on hint >> (lcore id). >> >> RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC while >> skipping first worker. >> >> RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores from >> each LLC. >> >> RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then >> iterates through reaming lcores in each LLC. >> >> ``` >> >> >> >> Based on the discussions we agreed on sharing version-2 FRC for >> extending API as `rte_get_next_lcore_extnd` with extra argument as >> `flags`. >> >> As per my ideation, for the API ` rte_get_next_sibling_core`, the above >> API can easily with flag ` RTE_GET_LCORE_L1 (SMT)`. Is this right >> understanding? >> >> We can easily have simple MACROs like `RTE_LCORE_FOREACH_L1` which >> allows to iterate SMT sibling threads. >> >> > > This seems like a lot of new macro and API additions! I'd really like to > cut that back and simplify the amount of new things we are adding to DPDK > for this. I tend to agree with others that external libs would be better > for apps that really want to deal with all this. > Conveying HW topology will require a fair bit of API verbiage. I think there's no way around it, other than giving the API user half of the story (or 1% of the story). That's one of the reasons I think it should be in a separate header file in EAL. >> >> > >> >> > Looking logically, I'm not sure about the BOOST_ENABLED and >> >> > BOOST_DISABLED flags you propose >> >> The idea for the BOOST_ENABLED & BOOST_DISABLED is based on DPDK power >> library which allows to enable boost. >> >> Allow user to select lcores where BOOST is enabled|disabled using MACRO >> or API. >> >> >> >> - in a system with multiple possible >> >> > standard and boost frequencies what would those correspond to? >> >> I now understand the confusion, apologies for mixing the AMD EPYC SoC >> boost with Intel Turbo. >> >> >> >> Thank you for pointing out, we will use the terminology ` >> RTE_GET_LCORE_TURBO`. >> >> > > That still doesn't clarify it for me. If you start mixing in power > management related functions in with topology ones things will turn into a > real headache. What does boost or turbo correspond to? Is it for cores that > have the feature enabled - whether or not it's currently in use - or is it > for finding cores that are currently boosted? Do we need additions for > cores that are boosted by 100Mhz vs say 300Mhz. What about cores that are > in lower frequencies for power-saving. Do we add macros for finding those? > In my world, the operating frequency is a property of a CPU core node in the hardware topology. lcore discrimination (or classification) shouldn't be built as a myriad of FOREACH macros, but rather generic iteration + app domain logic. For example, the size of the L3 could be a factor. Should we have a FOREACH_BIG_L3. No. >> >> What's also >> >> > missing is a define for getting actual NUMA siblings i.e. those >> sharing common >> >> > memory but not an L3 or anything else. >> >> This can be extended into `rte_get_next_lcore_extnd` with flag ` >> RTE_GET_LCORE_NUMA`. 
This will allow to grab all lcores under the same >> sub-memory NUMA as shared by LCORE. >> >> If SMT sibling is enabled and DPDK Lcore mask covers the sibling >> threads, then ` RTE_GET_LCORE_NUMA` get all lcore and sibling threads >> under same memory NUMA of lcore shared. >> >> > > Yes. That can work. But it means we are basing the implementation on a > fixed idea of what topologies there are or can exist. My suggestion below > is just to ignore the whole idea of L1 vs L2 vs NUMA - just give the app a > way to find it's nearest nodes. > I think we need to agree what is the purpose of this API. Is it the to describe the hardware topology in some details for general-purpose use (including informing the operator, lstopo-style), or just some abstract, simplified representation to be use purely for work scheduling. > After all, the app doesn't want to know the topology just for the sake of > knowing it - it wants it to ensure best placement of work on cores! To that > end, it just needs to know what cores are near to each other and what are > far away. > >> >> > >> >> > My suggestion would be to have the function take just an integer-type >> e.g. >> >> > uint16_t parameter which defines the memory/cache hierarchy level to >> use, 0 >> >> > being lowest, 1 next, and so on. Different systems may have different >> numbers >> >> > of cache levels so lets just make it a zero-based index of levels, >> rather than >> >> > giving explicit defines (except for memory which should probably >> always be >> >> > last). The zero-level will be for "closest neighbour" >> >> Good idea, we did prototype this internally. But issue it will keep on >> adding the number of API into lcore library. >> >> To keep the API count less, we are using lcore id as hint to sub-NUMA. >> > > I'm unclear about this keeping the API count down - you are proposing a lot > of APIs and macros up above. My suggestion is basically to add two APIs and > no macros: one API to get the max number of topology-nearness levels, and a > second API to get the next sibling a given nearness level from > 0(nearest)..N(furthest). If we want, we can also add a FOREACH macro too. > > Overall, though, as I say above, let's focus on the problem the app > actually wants these APIs for, not how we think we should solve it. Apps > don't want to know the topology for knowledge sake, they want to use that > knowledge to improve performance by pinning tasks to cores. What is the > minimum that we need to provide to enable the app to do that? For example, > if there are no lcores that share an L1, then from an app topology > viewpoint that L1 level may as well not exist, because it provides us no > details on how to place our work. > > For the rare app that does have some esoteric use-case that does actually > want to know some intricate details of the topology, then having that app > use an external lib is probably a better solution than us trying to cover > all possible options in DPDK. > > My 2c. on this at this stage anyway. > > /Bruce >
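A minimal sketch of the two-call interface Bruce describes; the names and signatures below are illustrative assumptions, not existing DPDK APIs:

```
/*
 * Hypothetical shape of the two-call suggestion above; names and
 * signatures are illustrative only.
 */
#include <rte_lcore.h>

unsigned int rte_lcore_topo_num_levels(void);
unsigned int rte_get_next_lcore_near(unsigned int lcore,
				     unsigned int level, int wrap);

/*
 * Visit every enabled lcore that is a level-0 neighbour (closest
 * grouping, e.g. SMT siblings where they exist) of a given lcore.
 */
static void
visit_nearest(unsigned int lcore, void (*fn)(unsigned int))
{
	unsigned int next = rte_get_next_lcore_near(lcore, 0, 0);

	while (next < RTE_MAX_LCORE) {
		fn(next);
		next = rte_get_next_lcore_near(next, 0, 0);
	}
}
```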
On Thu, Sep 12, 2024 at 01:59:34PM +0200, Mattias Rönnblom wrote: > On 2024-09-12 13:17, Varghese, Vipin wrote: > > [AMD Official Use Only - AMD Internal Distribution Only] > > > > <snipped> > > > >>>> Thank you Mattias for the information, as shared by in the reply > > > >>>> with > > > >> Anatoly we want expose a new API `rte_get_next_lcore_ex` which > > > >> intakes a extra argument `u32 flags`. > > > >>>> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, > > > >> RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED, > > > >> RTE_GET_LCORE_BOOST_DISABLED. > > > >>> > > > >>> Wouldn't using that API be pretty awkward to use? > > > > Current API available under DPDK is ` rte_get_next_lcore`, which is used > > > within DPDK example and in customer solution. > > > > Based on the comments from others we responded to the idea of changing > > > the new Api from ` rte_get_next_lcore_llc` to ` rte_get_next_lcore_exntd`. > > > > > > > > Can you please help us understand what is `awkward`. > > > > > > > > > > The awkwardness starts when you are trying to fit provide hwloc type > > > information over an API that was designed for iterating over lcores. > > I disagree to this point, current implementation of lcore libraries is > > only focused on iterating through list of enabled cores, core-mask, and > > lcore-map. > > With ever increasing core count, memory, io and accelerators on SoC, > > sub-numa partitioning is common in various vendor SoC. Enhancing or > > Augumenting lcore API to extract or provision NUMA, Cache Topology is > > not awkward. > > DPDK providing an API for this information makes sense to me, as I've > mentioned before. What I questioned was the way it was done (i.e., the API > design) in your RFC, and the limited scope (which in part you have > addressed). > Actually, I'd like to touch on this first item a little bit. What is the main benefit of providing this information in EAL? To me, it seems like something that is for apps to try and be super-smart and select particular cores out of a set of cores to run on. However, is that not taking work that should really be the job of the person deploying the app? The deployer - if I can use that term - has already selected a set of cores and NICs for a DPDK application to use. Should they not also be the one selecting - via app argument, via --lcores flag to map one core id to another, or otherwise - which part of an application should run on what particular piece of hardware? In summary, what is the final real-world intended usecase for this work? DPDK already tries to be smart about cores and NUMA, and in some cases we have hit issues where users have - for their own valid reasons - wanted to run DPDK in a sub-optimal way, and they end up having to fight DPDK's smarts in order to do so! Ref: [1] /Bruce [1] https://git.dpdk.org/dpdk/commit/?id=ed34d87d9cfbae8b908159f60df2008e45e4c39f
On Thu, 12 Sep 2024 14:12:55 +0200 Mattias Rönnblom <hofors@lysator.liu.se> wrote: > On 2024-09-12 13:23, Varghese, Vipin wrote: > > [Public] > > > > Snipped > >> > >> > >> To to be clear; it's something like this I think of when I say "DOM-style" API. > >> > >> #ifndef RTE_HWTOPO_H > >> #define RTE_HWTOPO_H > >> > >> struct rte_hwtopo_node; > >> > >> enum rte_hwtopo_node_type { > >> RTE_HWTOPO_NODE_TYPE_CPU_CORE, > >> RTE_HWTOPO_NODE_TYPE_CACHE, > >> RTE_HWTOPO_NODE_TYPE_NUMA > >> }; > >> > >> int > >> rte_hwtopo_init(void); > >> > >> struct rte_hwtopo_node * > >> rte_hwtopo_get_core_by_lcore(unsigned int lcore); > >> > >> struct rte_hwtopo_node * > >> rte_hwtopo_get_core_by_id(unsigned int os_cpu_id); > >> > >> struct rte_hwtopo_node * > >> rte_hwtopo_parent(struct rte_hwtopo_node *node); > >> > >> struct rte_hwtopo_node * > >> rte_hwtopo_first_child(struct rte_hwtopo_node *node); > >> > >> struct rte_hwtopo_node * > >> rte_hwtopo_next_child(struct rte_hwtopo_node *node, > >> struct rte_hwtopo_node *child); > >> > >> struct rte_hwtopo_node * > >> rte_hwtopo_first_sibling(struct rte_hwtopo_node *node); > >> > >> struct rte_hwtopo_node * > >> rte_hwtopo_next_sibling(struct rte_hwtopo_node *node, > >> struct rte_hwtopo_node *child); > >> > >> enum rte_hwtopo_node_type > >> rte_hwtopo_get_type(struct rte_hwtopo_node *node); > >> > >> #define RTE_HWTOPO_NODE_ATTR_CORE_FREQUENCY_NOMINAL 0 #define > >> RTE_HWTOPO_NODE_ATTR_CACHE_LEVEL 1 #define > >> RTE_HWTOPO_NODE_ATTR_CACHE_SIZE 2 > >> > >> int > >> rte_hwtopo_get_attr_int64(struct rte_hwtopo_node *node, unsigned int > >> attr_name, > >> int64_t *attr_value); > >> > >> int > >> rte_hwtopo_get_attr_str(struct rte_hwtopo_node *node, unsigned int > >> attr_name, > >> char *attr_value, size_t capacity); > >> > >> #endif > >> > >> Surely, this too would be awkward (or should I say cumbersome) to use in certain scenarios. > > This appears to be more like hwloc api calls, as shared in my earlier > > email my intention with the API suggestion is not introduce new library. > > I have certain reservations and with my current understanding I am not > > able to map certain DPDK core mapping. Let discuss this in technical call. > > Snipped > > It still would need to be a part of EAL (so not a new library), since > EAL surely would depend on it (sooner rather than later). > > If this functionality should be a new library, or a new API in an > existing library, it doesn't really matter if your original intentions > where something else, does it. > Good discussion. I wonder if the API would be cleaner if it just provided a tree representation of the hardware in a data structure, instead of trying to provide FOREACH. The other concern is that hardware will evolve and there is likely to be more possibilities. It is impossible totally future proof API's (YAGNI) but worth thinking about it now. There is also the issue of core's with different clock speeds that should be represented as well.
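One way to picture the tree representation suggested above, as plain data plus a generic walk; the struct layout is illustrative only, not a proposed DPDK definition:

```
/*
 * Illustrative plain-data topology tree; not a proposed DPDK struct.
 */
#include <stdint.h>
#include <stddef.h>

enum topo_node_type {
	TOPO_PACKAGE, TOPO_NUMA, TOPO_CACHE, TOPO_CORE, TOPO_SMT
};

struct topo_node {
	enum topo_node_type type;
	uint32_t id;		/* OS id, cache level, NUMA node, ... */
	uint64_t attr;		/* cache size, nominal frequency, ... */
	struct topo_node *parent;
	struct topo_node *first_child;
	struct topo_node *next_sibling;
};

/* The application walks the tree itself; no FOREACH macros needed. */
static void
topo_walk(const struct topo_node *node,
	  void (*fn)(const struct topo_node *))
{
	const struct topo_node *child;

	if (node == NULL)
		return;
	fn(node);
	for (child = node->first_child; child != NULL;
			child = child->next_sibling)
		topo_walk(child, fn);
}
```

New node types or attributes (frequency classes, heterogeneous core sizes) would then extend the data rather than the API surface, which addresses part of the future-proofing concern.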
On 2024-09-12 15:30, Bruce Richardson wrote: > On Thu, Sep 12, 2024 at 01:59:34PM +0200, Mattias Rönnblom wrote: >> On 2024-09-12 13:17, Varghese, Vipin wrote: >>> [AMD Official Use Only - AMD Internal Distribution Only] >>> >>> <snipped> >>>>>>>> Thank you Mattias for the information, as shared by in the reply >>>>>>>> with >>>>>> Anatoly we want expose a new API `rte_get_next_lcore_ex` which >>>>>> intakes a extra argument `u32 flags`. >>>>>>>> The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2, >>>>>> RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED, >>>>>> RTE_GET_LCORE_BOOST_DISABLED. >>>>>>> >>>>>>> Wouldn't using that API be pretty awkward to use? >>>>> Current API available under DPDK is ` rte_get_next_lcore`, which is used >>>> within DPDK example and in customer solution. >>>>> Based on the comments from others we responded to the idea of changing >>>> the new Api from ` rte_get_next_lcore_llc` to ` rte_get_next_lcore_exntd`. >>>>> >>>>> Can you please help us understand what is `awkward`. >>>>> >>>> >>>> The awkwardness starts when you are trying to fit provide hwloc type >>>> information over an API that was designed for iterating over lcores. >>> I disagree to this point, current implementation of lcore libraries is >>> only focused on iterating through list of enabled cores, core-mask, and >>> lcore-map. >>> With ever increasing core count, memory, io and accelerators on SoC, >>> sub-numa partitioning is common in various vendor SoC. Enhancing or >>> Augumenting lcore API to extract or provision NUMA, Cache Topology is >>> not awkward. >> >> DPDK providing an API for this information makes sense to me, as I've >> mentioned before. What I questioned was the way it was done (i.e., the API >> design) in your RFC, and the limited scope (which in part you have >> addressed). >> > > Actually, I'd like to touch on this first item a little bit. What is the > main benefit of providing this information in EAL? To me, it seems like > something that is for apps to try and be super-smart and select particular > cores out of a set of cores to run on. However, is that not taking work > that should really be the job of the person deploying the app? The deployer > - if I can use that term - has already selected a set of cores and NICs for > a DPDK application to use. Should they not also be the one selecting - via > app argument, via --lcores flag to map one core id to another, or otherwise > - which part of an application should run on what particular piece of > hardware? > Scheduling in one form or another will happen on a number of levels. One level is what you call the "deployer". Whether man or machine, it will allocate a bunch of lcores to the application - either statically by using -l <cores>, or dynamically, by giving a very large core mask, combined with having an agent in the app responsible to scale up or down the number of cores actually used (allowing coexistence with other non-DPDK, Linux process scheduler-scheduled processes, on the same set of cores, although not at the same time). I think the "deployer" level should generally not be aware of the DPDK app internals, including how to assign different tasks to different cores. That is consistent with how things work in a general-purpose operating system, where you allocate cores, memory and I/O devices to an instance (e.g., a VM), but then OS' scheduler figures out how to best use them. 
The app internal may be complicated, change across software versions and traffic mixes/patterns, and most of all, not lend itself to static at-start configuration at all. > In summary, what is the final real-world intended usecase for this work? One real-world example is an Eventdev app with some atomic processing stage, using DSW, and SMT. Hardware threading on Intel x86 generally improves performance with ~25%, which seems to hold true for data plane apps as well, in my experience. So that's a (not-so-)freebie you don't want to miss out on. To max out single-flow performance, the work scheduler may not only need to give 100% of an lcore to bottleneck stage atomic processing for that elephant flow, but a *full* physical core (i.e., assure that the SMT sibling is idle). But, DSW doesn't understand the CPU topology, so you have to choose between max multi-flow throughput or max single-flow throughput at the time of deployment. A RTE hwtopo API would certainly help in the implementation of SMT-aware scheduling. Another example could be the use of bigger or turbo-capable cores to run CPU-hungry, singleton services (e.g., a Eventdev RX timer adapter core), or the use of a hardware thread to run the SW scheduler service (which needs to react quickly to incoming scheduling events, but maybe not need all the cycles of a full physical core). Yet another example would be an event device which understand how to spread a particular flow across multiple cores, but use only cores sharing the same L2. Or, keep only processing of a certain kind (e.g., a certain Eventdev Queue) on cores with the same L2, improve L2 hit rates for instructions and data related to that processing stage. > DPDK already tries to be smart about cores and NUMA, and in some cases we > have hit issues where users have - for their own valid reasons - wanted to > run DPDK in a sub-optimal way, and they end up having to fight DPDK's > smarts in order to do so! Ref: [1] > > /Bruce > > [1] https://git.dpdk.org/dpdk/commit/?id=ed34d87d9cfbae8b908159f60df2008e45e4c39f
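A sketch of the "full physical core" placement described above; lcore_smt_sibling() is a hypothetical helper (it could be built on sysfs or a future topology API), while the rest is the existing <rte_lcore.h> API:

```
/*
 * Pick a worker lcore whose SMT sibling is not part of the EAL core
 * set, so the sibling hardware thread stays idle and the bottleneck
 * stage gets a full physical core. lcore_smt_sibling() is hypothetical.
 */
#include <rte_lcore.h>

unsigned int lcore_smt_sibling(unsigned int lcore); /* hypothetical */

static unsigned int
pick_full_physical_core(void)
{
	unsigned int lcore;

	RTE_LCORE_FOREACH_WORKER(lcore) {
		unsigned int sib = lcore_smt_sibling(lcore);

		/* no sibling, or sibling not enabled as an lcore: the
		 * physical core is effectively dedicated to 'lcore' */
		if (sib >= RTE_MAX_LCORE || !rte_lcore_is_enabled(sib))
			return lcore;
	}
	return RTE_MAX_LCORE; /* nothing suitable in the core set */
}
```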
On 9/12/2024 1:50 PM, Varghese, Vipin wrote: > [Public] > > Snipped > >>> >>> >>> Based on the discussions we agreed on sharing version-2 FRC for >>> extending API as `rte_get_next_lcore_extnd` with extra argument as >>> `flags`. >>> >>> As per my ideation, for the API ` rte_get_next_sibling_core`, the above >>> API can easily with flag ` RTE_GET_LCORE_L1 (SMT)`. Is this right >>> understanding? >>> >>> We can easily have simple MACROs like `RTE_LCORE_FOREACH_L1` which >>> allows to iterate SMT sibling threads. >>> >>> >> >> This seems like a lot of new macro and API additions! I'd really like to cut that >> back and simplify the amount of new things we are adding to DPDK for this. > I disagree Bruce, as per the new conversation with Anatoly and you it has been shared the new API are > ``` > 1. rte_get_next_lcore_exntd > 2. rte_get_next_n_lcore_exntd > ``` > > While I mentioned custom Macro can augment based on typical flag usage similar to ` RTE_LCORE_FOREACH and RTE_LCORE_FOREACH_WORKER` as > ``` > RTE_LCORE_FOREACH_FLAG > RTE_LCORE_FOREACH_WORKER_FLAG > > Or > > RTE_LCORE_FOREACH_LLC > RTE_LCORE_FOREACH_WORKER_LLC > ``` > > Please note I have not even shared version-2 of RFC yet. > >> I tend to agree with others that external libs would be better for apps that really want to deal with all this. > I have covered why this is not a good idea for Mattias query. > >> >>> >>> > >>> >>> > Looking logically, I'm not sure about the BOOST_ENABLED and BOOST_DISABLED flags you propose >>> The idea for the BOOST_ENABLED & BOOST_DISABLED is based on DPDK power library which allows to enable boost. >>> Allow user to select lcores where BOOST is enabled|disabled using MACRO or API. > May be there is confusion, so let me try to be explicit here. The intention of any `rte_get_next_lcore##` is fetch lcores. > Hence with new proposed API `rte_get_next_lcore_exntd` with `flag set for Boost` is to fetch lcores where boost is enabled. > There is no intention to enable or disable boost on lcore with `get` API. > >>> >>> >>> >>> - in a system with multiple possible >>> >>> > standard and boost frequencies what would those correspond to? >>> >>> I now understand the confusion, apologies for mixing the AMD EPYC SoC >>> boost with Intel Turbo. >>> >>> >>> >>> Thank you for pointing out, we will use the terminology ` >>> RTE_GET_LCORE_TURBO`. >>> >>> >> >> That still doesn't clarify it for me. If you start mixing in power management related functions in with topology ones things will turn into a real headache. > Can you please tell me what is not clarified. DPDK lcores as of today has no notion of Cache, Numa, Power, Turbo or any DPDK supported features. > The initial API introduced were to expose lcore sharing the same Last Level Cache. Based on interaction with Anatoly, extending this to support multiple features turned out to be possibility. > Hence, we said we can share v2 for RFC based on this idea. > > But if the claim is not to put TURBO I am also ok for this. Let only keep cache and NUMA-IO domain. > >> What does boost or turbo correspond to? Is it for cores that have the feature enabled - whether or not it's currently in use - or is it for finding cores that are >> currently boosted? Do we need additions for cores that are boosted by 100Mhz vs say 300Mhz. What about cores that are in lower frequencies for >> power-saving. Do we add macros for finding those? > Why are we talking about feq-up and freq-down? This was not even discussed in this RFC patch at all. 
> >>> >>> What's also >>> >>> > missing is a define for getting actual NUMA siblings i.e. those >>> sharing common memory but not an L3 or anything else. >>> >>> This can be extended into `rte_get_next_lcore_extnd` with flag ` >>> RTE_GET_LCORE_NUMA`. This will allow to grab all lcores under the same >>> sub-memory NUMA as shared by LCORE. >>> >>> If SMT sibling is enabled and DPDK Lcore mask covers the sibling >>> threads, then ` RTE_GET_LCORE_NUMA` get all lcore and sibling threads >>> under same memory NUMA of lcore shared. >>> >>> >> >> Yes. That can work. But it means we are basing the implementation on a fixed idea of what topologies there are or can exist. >> My suggestion below is just to ignore the whole idea of L1 vs L2 vs NUMA - just give the app a way to find it's nearest nodes. > Bruce, for different vendor SoC, the implementation of architecture is different. Let me share what I know > 1. using L1, we can fetch SMT threads > 2. using L2 we can get certain SoC on Arm, Intel and power PC which is like efficient cores > 3. using L3 we can get certain SoC like AMD, AF64x and others which follow chiplet or tile split L3 domain. > >> >> After all, the app doesn't want to know the topology just for the sake of knowing it - it wants it to ensure best placement of work on cores! To that end, it just needs to know what cores are near to each other and what are far away. > Exactly, that is why we want to minimize new libraries and limit to format of existing API `rte_get_next_lcore`. The end user need to deploy another library or external library then map to DPDK lcore mapping to identify what is where. > So as end user I prefer simple API which get my work done. > >> >>> >>> > >>> >>> > My suggestion would be to have the function take just an integer-type >>> e.g. >>> >>> > uint16_t parameter which defines the memory/cache hierarchy level to >>> use, 0 >>> >>> > being lowest, 1 next, and so on. Different systems may have different >>> numbers >>> >>> > of cache levels so lets just make it a zero-based index of levels, >>> rather than >>> >>> > giving explicit defines (except for memory which should probably >>> always be >>> >>> > last). The zero-level will be for "closest neighbour" >>> >>> Good idea, we did prototype this internally. But issue it will keep on >>> adding the number of API into lcore library. >>> >>> To keep the API count less, we are using lcore id as hint to sub-NUMA. >>> >> >> I'm unclear about this keeping the API count down - you are proposing a lot of APIs and macros up above. > No, I am not. I have shared based on the last discussion with Anatoly we will end up with 2 API in lcore only. Explained in the above response > >> My suggestion is basically to add two APIs and no macros: one API to get the max number of topology-nearness levels, and a >> second API to get the next sibling a given nearness level from >> 0(nearest)..N(furthest). If we want, we can also add a FOREACH macro too. >> >> Overall, though, as I say above, let's focus on the problem the app actually >> wants these APIs for, not how we think we should solve it. Apps don't want to >> know the topology for knowledge sake, they want to use that knowledge to >> improve performance by pinning tasks to cores. What is the minimum that we >> need to provide to enable the app to do that? For example, if there are no >> lcores that share an L1, then from an app topology viewpoint that L1 level may >> as well not exist, because it provides us no details on how to place our work. 
> I have shared above why we need vendor agnostic L1, L2, L3 and sub-NUMA-IO. > > Snipped Just to add my 2c here, since my name is being thrown around a lot in this discussion :) I tend to agree with Bruce here in the sense that if we want this API be used to group cores together, then ideally we shouldn't really explicitly call out the principle by which we group them unless we have to. My main contention with the initial RFC *was* the fact that it was tied to specific HW arch stuff in the API. Vipin has suggested using a "flags" value to discriminate between L1/L2/L3/NUMA/whatever ways of grouping cores, and I agree that it's better than what was initially proposed (at least from my vantage point), but what's even better is not to have any flags at all! As in, I think the thing we're presumably trying to achieve here just as well could be achieved simply by returning a number of "levels" we have in our hierarchy, and user then being able to iterate over nearest neighbours sitting on that "level" without explicitly specifying what that level is. So, for some systems level 0 would be SMT, for others - L3, for some - NUMA, for yet others - efficiency/performance cores, etc. Bruce's suggestion is that we don't explicitly call out the thing we use to group the cores by, and instead rely on EAL to parse that information out for us into a set of "levels". I would agree that for anything more complicated an external library would be the way to go, because, well, we're DPDK, not Linux kernel. But, just to be clear, this is not mutually exclusive with some kind of topology-style API. If we do go down that route, then the point about "attaching to specific architectural features" becomes moot, as by necessity any DOM-style API would have to represent topology in some way, which then gets used by DPDK. The main question here (and other people have rightly asked this question) would be, do we want a topology API, or do we want an API to assist with scheduling. My impression so far has been that Vipin is looking for the latter rather than the former, as no topology-related use cases were mentioned in the discussion except as a proxy for scheduling.
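Reusing the hypothetical two-call interface sketched earlier in the thread (not a DPDK API), a level-agnostic report of how many enabled lcores sit near a given lcore at each nearness level, without the application ever naming SMT, L2, L3 or NUMA:

```
/*
 * Level-agnostic neighbour report; rte_lcore_topo_num_levels() and
 * rte_get_next_lcore_near() are hypothetical, as declared earlier.
 */
#include <stdio.h>
#include <rte_lcore.h>

unsigned int rte_lcore_topo_num_levels(void);
unsigned int rte_get_next_lcore_near(unsigned int lcore,
				     unsigned int level, int wrap);

static void
report_levels(unsigned int lcore)
{
	unsigned int level;

	for (level = 0; level < rte_lcore_topo_num_levels(); level++) {
		unsigned int count = 0;
		unsigned int next = rte_get_next_lcore_near(lcore, level, 0);

		while (next < RTE_MAX_LCORE) {
			count++;
			next = rte_get_next_lcore_near(next, level, 0);
		}
		printf("lcore %u: %u neighbour(s) at level %u\n",
		       lcore, count, level);
	}
}
```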
On Tue, 27 Aug 2024 20:40:12 +0530 Vipin Varghese <vipin.varghese@amd.com> wrote: > As core density continues to increase, chiplet-based > core packing has become a key trend. In AMD SoC EPYC > architectures, core complexes within the same chiplet > share a Last-Level Cache (LLC). By packing logical cores > within the same LLC, we can enhance pipeline processing > stages due to reduced latency and improved data locality. > > To leverage these benefits, DPDK libraries and examples > can utilize localized lcores. This approach ensures more > consistent latencies by minimizing the dispersion of lcores > across different chiplet complexes and enhances packet > processing by ensuring that data for subsequent pipeline > stages is likely to reside within the LLC. > > < Function: Purpose > > --------------------- > - rte_get_llc_first_lcores: Retrieves all the first lcores in the shared LLC. > - rte_get_llc_lcore: Retrieves all lcores that share the LLC. > - rte_get_llc_n_lcore: Retrieves the first n or skips the first n lcores in the shared LLC. > > < MACRO: Purpose > > ------------------ > RTE_LCORE_FOREACH_LLC_FIRST: iterates through all first lcore from each LLC. > RTE_LCORE_FOREACH_LLC_FIRST_WORKER: iterates through all first worker lcore from each LLC. > RTE_LCORE_FOREACH_LLC_WORKER: iterates lcores from LLC based on hint (lcore id). > RTE_LCORE_FOREACH_LLC_SKIP_FIRST_WORKER: iterates lcores from LLC while skipping first worker. > RTE_LCORE_FOREACH_LLC_FIRST_N_WORKER: iterates through `n` lcores from each LLC. > RTE_LCORE_FOREACH_LLC_SKIP_N_WORKER: skip first `n` lcores, then iterates through reaming lcores in each LLC. > > Vipin Varghese (2): > eal: add llc aware functions > eal/lcore: add llc aware for each macro > > lib/eal/common/eal_common_lcore.c | 279 ++++++++++++++++++++++++++++-- > lib/eal/include/rte_lcore.h | 89 ++++++++++ When are you going to send a new version? Need: - new functions need to be marked experimental - has to build cleanly on all platforms. - need functional tests - address all the other review comments
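For context, a usage sketch of how the RFC's proposed per-LLC iteration might look in an application; RTE_LCORE_FOREACH_LLC_FIRST_WORKER is the macro proposed in the cover letter above and is not a merged DPDK API:

```
/*
 * Launch one worker per shared-LLC group, so each chiplet runs a
 * single pipeline instance. The LLC macro is the RFC proposal only.
 */
#include <rte_launch.h>
#include <rte_lcore.h>

static int
pipeline_main(void *arg)
{
	(void)arg;
	/* per-LLC pipeline loop would run here */
	return 0;
}

static void
launch_one_worker_per_llc(void)
{
	unsigned int lcore;

	RTE_LCORE_FOREACH_LLC_FIRST_WORKER(lcore)	/* proposed in the RFC */
		rte_eal_remote_launch(pipeline_main, NULL, lcore);
}
```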
[AMD Official Use Only - AMD Internal Distribution Only] > > When are you going to send a new version? We have been testing this on various Intel and AMD platforms and have completed testing over sub-NUMA domains on both SoCs. We will share the new patch (rfc-v2) before 22 Oct 2024. > > Need: > - new functions need to be marked experimental > - has to build cleanly on all platforms. > - need functional tests > - address all the other review comments