From patchwork Tue Sep 14 10:34:55 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Dmitry Kozlyuk
X-Patchwork-Id: 98853
X-Patchwork-Delegate: david.marchand@redhat.com
From: Dmitry Kozlyuk
To:
CC: Anatoly Burakov, Viacheslav Ovsiienko
Date: Tue, 14 Sep 2021 13:34:55 +0300
Message-ID: <20210914103456.535427-3-dkozlyuk@nvidia.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20210914103456.535427-1-dkozlyuk@nvidia.com>
References: <20210716110806.2566788-1-dkozlyuk@nvidia.com>
 <20210914103456.535427-1-dkozlyuk@nvidia.com>
MIME-Version: 1.0
Subject: [dpdk-dev] [PATCH v3 2/3] eal: add memory pre-allocation from existing files
X-BeenThere: dev@dpdk.org
List-Id: DPDK patches and discussions
Errors-To: dev-bounces@dpdk.org

From: Viacheslav Ovsiienko

The primary DPDK process may take a long time to launch if the
initially allocated memory is large. In practice, allocating 1 TB of
memory over 1 GB hugepages on Linux takes tens of seconds. Fast restart
is highly desirable for some applications, and this launch delay is a
problem.

Most of the delay happens in the following call trace:

  rte_eal_init()
    rte_eal_memory_init()
      rte_eal_hugepage_init()
        eal_dynmem_hugepage_init()
          eal_memalloc_alloc_seg_bulk()
            alloc_seg()
              mmap()

The largest part of the time in mmap() is spent filling the memory with
zeros. The kernel does this to prevent data leakage from the process
that last used the page. However, in a controlled environment this may
not be an issue, while performance is. (The Linux-specific
MAP_UNINITIALIZED flag allows mapping without clearing, but it is
disabled in all popular distributions for the reason above.)

It is proposed to add a new EAL option:

  --mem-file FILE1,FILE2,...

to map hugepages "as is" from the specified FILEs in hugetlbfs.
Compared to using external memory for this task, the EAL option
requires no changes to application code, while allowing the
administrator to control hugepage sizes and their NUMA affinity.

Limitations of the feature:

* Linux-specific (only Linux maps hugepages from files).
* Incompatible with --legacy-mem (partially replaces it).
* Incompatible with --single-file-segments
  (--mem-file FILEs can contain as many segments as needed).
* Incompatible with --in-memory (logically).

A warning about possible security implications is printed when
--mem-file is used.

Until this patch, the DPDK allocator always cleared memory on freeing,
so that it did not have to do so on allocation, while new memory was
cleared by the kernel. When --mem-file is in use, DPDK clears memory
after allocation in rte_zmalloc() and does not clear it on freeing.
Effectively, the user trades fast startup for an occasional allocation
slowdown whenever zeroed memory is actually required. When memory is
recycled, it is cleared again, which is suboptimal per se, but avoids
complicating memory management.

Signed-off-by: Viacheslav Ovsiienko
Signed-off-by: Dmitry Kozlyuk
---
 doc/guides/linux_gsg/linux_eal_parameters.rst |  17 +
 lib/eal/common/eal_common_dynmem.c            |   6 +
 lib/eal/common/eal_common_options.c           |  23 ++
 lib/eal/common/eal_internal_cfg.h             |   4 +
 lib/eal/common/eal_memalloc.h                 |   8 +-
 lib/eal/common/eal_options.h                  |   2 +
 lib/eal/common/malloc_elem.c                  |   5 +
 lib/eal/common/malloc_heap.h                  |   8 +
 lib/eal/common/rte_malloc.c                   |  16 +-
 lib/eal/include/rte_memory.h                  |   4 +-
 lib/eal/linux/eal.c                           |  28 ++
 lib/eal/linux/eal_hugepage_info.c             |   5 +
 lib/eal/linux/eal_memalloc.c                  | 328 +++++++++++++++++-
 13 files changed, 441 insertions(+), 13 deletions(-)

diff --git a/doc/guides/linux_gsg/linux_eal_parameters.rst b/doc/guides/linux_gsg/linux_eal_parameters.rst
index bd3977cb3d..b465feaea8 100644
--- a/doc/guides/linux_gsg/linux_eal_parameters.rst
+++ b/doc/guides/linux_gsg/linux_eal_parameters.rst
@@ -92,6 +92,23 @@ Memory-related options

     Free hugepages back to system exactly as they were originally allocated.
+* ``--mem-file `` + + Use memory from pre-allocated files in ``hugetlbfs`` without clearing it; + when this memory is exhausted, switch to default dynamic allocation. + This speeds up startup compared to ``--legacy-mem`` while also avoiding + later delays for allocating new hugepages. One downside is slowdown + of all zeroed memory allocations. Security warning: an application + can access contents left by previous users of hugepages. Multiple files + can be pre-allocated in ``hugetlbfs`` with different page sizes, + on desired NUMA nodes, using ``mount`` options and ``numactl``: + + --mem-file /mnt/huge-1G/node0,/mnt/huge-1G/node1,/mnt/huge-2M/extra + + This option is incompatible with ``--legacy-mem``, ``--in-memory``, + and ``--single-file-segments``. Primary and secondary processes + must specify exactly the same list of files. + Other options ~~~~~~~~~~~~~ diff --git a/lib/eal/common/eal_common_dynmem.c b/lib/eal/common/eal_common_dynmem.c index 7c5437ddfa..abcf22f097 100644 --- a/lib/eal/common/eal_common_dynmem.c +++ b/lib/eal/common/eal_common_dynmem.c @@ -272,6 +272,12 @@ eal_dynmem_hugepage_init(void) internal_conf->num_hugepage_sizes) < 0) return -1; +#ifdef RTE_EXEC_ENV_LINUX + /* pre-allocate pages from --mem-file option files */ + if (eal_memalloc_memfile_alloc(used_hp) < 0) + return -1; +#endif + for (hp_sz_idx = 0; hp_sz_idx < (int)internal_conf->num_hugepage_sizes; hp_sz_idx++) { diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c index ff5861b5f3..c729c36630 100644 --- a/lib/eal/common/eal_common_options.c +++ b/lib/eal/common/eal_common_options.c @@ -86,6 +86,7 @@ eal_long_options[] = { {OPT_MASTER_LCORE, 1, NULL, OPT_MASTER_LCORE_NUM }, {OPT_MAIN_LCORE, 1, NULL, OPT_MAIN_LCORE_NUM }, {OPT_MBUF_POOL_OPS_NAME, 1, NULL, OPT_MBUF_POOL_OPS_NAME_NUM}, + {OPT_MEM_FILE, 1, NULL, OPT_MEM_FILE_NUM }, {OPT_NO_HPET, 0, NULL, OPT_NO_HPET_NUM }, {OPT_NO_HUGE, 0, NULL, OPT_NO_HUGE_NUM }, {OPT_NO_PCI, 0, NULL, OPT_NO_PCI_NUM }, @@ -1898,6 +1899,8 @@ eal_cleanup_config(struct internal_config *internal_cfg) free(internal_cfg->hugepage_dir); if (internal_cfg->user_mbuf_pool_ops_name != NULL) free(internal_cfg->user_mbuf_pool_ops_name); + if (internal_cfg->mem_file[0]) + free(internal_cfg->mem_file[0]); return 0; } @@ -2018,6 +2021,26 @@ eal_check_common_options(struct internal_config *internal_cfg) "amount of reserved memory can be adjusted with " "-m or --"OPT_SOCKET_MEM"\n"); } + if (internal_cfg->mem_file[0] && internal_conf->legacy_mem) { + RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible " + "with --"OPT_LEGACY_MEM"\n"); + return -1; + } + if (internal_cfg->mem_file[0] && internal_conf->no_hugetlbfs) { + RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible " + "with --"OPT_NO_HUGE"\n"); + return -1; + } + if (internal_cfg->mem_file[0] && internal_conf->in_memory) { + RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible " + "with --"OPT_IN_MEMORY"\n"); + return -1; + } + if (internal_cfg->mem_file[0] && internal_conf->single_file_segments) { + RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible " + "with --"OPT_SINGLE_FILE_SEGMENTS"\n"); + return -1; + } return 0; } diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h index d6c0470eb8..814d5c66e1 100644 --- a/lib/eal/common/eal_internal_cfg.h +++ b/lib/eal/common/eal_internal_cfg.h @@ -22,6 +22,9 @@ #define MAX_HUGEPAGE_SIZES 3 /**< support up to 3 page sizes */ #endif +#define MAX_MEMFILE_ITEMS (MAX_HUGEPAGE_SIZES * 
RTE_MAX_NUMA_NODES) +/**< Maximal number of mem-file parameters. */ + /* * internal configuration structure for the number, size and * mount points of hugepages @@ -83,6 +86,7 @@ struct internal_config { rte_uuid_t vfio_vf_token; char *hugefile_prefix; /**< the base filename of hugetlbfs files */ char *hugepage_dir; /**< specific hugetlbfs directory to use */ + char *mem_file[MAX_MEMFILE_ITEMS]; /**< pre-allocated memory files */ char *user_mbuf_pool_ops_name; /**< user defined mbuf pool ops name */ unsigned num_hugepage_sizes; /**< how many sizes on this system */ diff --git a/lib/eal/common/eal_memalloc.h b/lib/eal/common/eal_memalloc.h index ebc3a6f6c1..d92c9a167b 100644 --- a/lib/eal/common/eal_memalloc.h +++ b/lib/eal/common/eal_memalloc.h @@ -8,7 +8,7 @@ #include #include - +#include "eal_internal_cfg.h" /* * Allocate segment of specified page size. */ @@ -96,4 +96,10 @@ eal_memalloc_init(void); int eal_memalloc_cleanup(void); +int +eal_memalloc_memfile_init(void); + +int +eal_memalloc_memfile_alloc(struct hugepage_info *hpa); + #endif /* EAL_MEMALLOC_H */ diff --git a/lib/eal/common/eal_options.h b/lib/eal/common/eal_options.h index 7b348e707f..c6c634b2b2 100644 --- a/lib/eal/common/eal_options.h +++ b/lib/eal/common/eal_options.h @@ -93,6 +93,8 @@ enum { OPT_NO_TELEMETRY_NUM, #define OPT_FORCE_MAX_SIMD_BITWIDTH "force-max-simd-bitwidth" OPT_FORCE_MAX_SIMD_BITWIDTH_NUM, +#define OPT_MEM_FILE "mem-file" + OPT_MEM_FILE_NUM, /* legacy option that will be removed in future */ #define OPT_PCI_BLACKLIST "pci-blacklist" diff --git a/lib/eal/common/malloc_elem.c b/lib/eal/common/malloc_elem.c index c2c9461f1d..6e71029a3c 100644 --- a/lib/eal/common/malloc_elem.c +++ b/lib/eal/common/malloc_elem.c @@ -578,8 +578,13 @@ malloc_elem_free(struct malloc_elem *elem) /* decrease heap's count of allocated elements */ elem->heap->alloc_count--; +#ifdef MALLOC_DEBUG /* poison memory */ memset(ptr, MALLOC_POISON, data_len); +#else + if (!malloc_clear_on_alloc()) + memset(ptr, 0, data_len); +#endif return elem; } diff --git a/lib/eal/common/malloc_heap.h b/lib/eal/common/malloc_heap.h index 772736b53f..72b64d8052 100644 --- a/lib/eal/common/malloc_heap.h +++ b/lib/eal/common/malloc_heap.h @@ -10,6 +10,7 @@ #include #include +#include "eal_private.h" /* Number of free lists per heap, grouped by size. */ #define RTE_HEAP_NUM_FREELISTS 13 @@ -48,6 +49,13 @@ malloc_get_numa_socket(void) return socket_id; } +static inline bool +malloc_clear_on_alloc(void) +{ + const struct internal_config *cfg = eal_get_internal_configuration(); + return cfg->mem_file[0] != NULL; +} + void * malloc_heap_alloc(const char *type, size_t size, int socket, unsigned int flags, size_t align, size_t bound, bool contig); diff --git a/lib/eal/common/rte_malloc.c b/lib/eal/common/rte_malloc.c index 9d39e58c08..ce94268aca 100644 --- a/lib/eal/common/rte_malloc.c +++ b/lib/eal/common/rte_malloc.c @@ -113,17 +113,23 @@ rte_malloc(const char *type, size_t size, unsigned align) void * rte_zmalloc_socket(const char *type, size_t size, unsigned align, int socket) { + bool zero; void *ptr = rte_malloc_socket(type, size, align, socket); -#ifdef RTE_MALLOC_DEBUG /* * If DEBUG is enabled, then freed memory is marked with poison - * value and set to zero on allocation. - * If DEBUG is not enabled then memory is already zeroed. + * value and must be set to zero on allocation. + * If DEBUG is not enabled then it is configurable + * whether memory comes already set to zero by memalloc or on free + * or it must be set to zero here. 
*/ - if (ptr != NULL) - memset(ptr, 0, size); +#ifdef RTE_MALLOC_DEBUG + zero = true; +#else + zero = malloc_clear_on_alloc(); #endif + if (ptr != NULL && zero) + memset(ptr, 0, size); rte_eal_trace_mem_zmalloc(type, size, align, socket, ptr); return ptr; diff --git a/lib/eal/include/rte_memory.h b/lib/eal/include/rte_memory.h index bba9b5300a..9a2b191314 100644 --- a/lib/eal/include/rte_memory.h +++ b/lib/eal/include/rte_memory.h @@ -40,7 +40,9 @@ extern "C" { /** * Physical memory segment descriptor. */ -#define RTE_MEMSEG_FLAG_DO_NOT_FREE (1 << 0) +#define RTE_MEMSEG_FLAG_DO_NOT_FREE (1 << 0) +#define RTE_MEMSEG_FLAG_PRE_ALLOCATED (1 << 1) + /**< Prevent this segment from being freed back to the OS. */ struct rte_memseg { rte_iova_t iova; /**< Start IO address. */ diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c index 3577eaeaa4..d0afcd8326 100644 --- a/lib/eal/linux/eal.c +++ b/lib/eal/linux/eal.c @@ -548,6 +548,7 @@ eal_usage(const char *prgname) " --"OPT_LEGACY_MEM" Legacy memory mode (no dynamic allocation, contiguous segments)\n" " --"OPT_SINGLE_FILE_SEGMENTS" Put all hugepage memory in single files\n" " --"OPT_MATCH_ALLOCATIONS" Free hugepages exactly as allocated\n" + " --"OPT_MEM_FILE" Comma-separated list of files in hugetlbfs.\n" "\n"); /* Allow the application to print its usage message too if hook is set */ if (hook) { @@ -678,6 +679,22 @@ eal_log_level_parse(int argc, char **argv) optarg = old_optarg; } +static int +eal_parse_memfile_arg(const char *arg, char **mem_file) +{ + int ret; + + char *copy = strdup(arg); + if (copy == NULL) { + RTE_LOG(ERR, EAL, "Cannot store --"OPT_MEM_FILE" names\n"); + return -1; + } + + ret = rte_strsplit(copy, strlen(copy), mem_file, + MAX_MEMFILE_ITEMS, ','); + return ret <= 0 ? -1 : 0; +} + /* Parse the argument given in the command line of the application */ static int eal_parse_args(int argc, char **argv) @@ -819,6 +836,17 @@ eal_parse_args(int argc, char **argv) internal_conf->match_allocations = 1; break; + case OPT_MEM_FILE_NUM: + if (eal_parse_memfile_arg(optarg, + internal_conf->mem_file) < 0) { + RTE_LOG(ERR, EAL, "invalid parameters for --" + OPT_MEM_FILE "\n"); + eal_usage(prgname); + ret = -1; + goto out; + } + break; + default: if (opt < OPT_LONG_MIN_NUM && isprint(opt)) { RTE_LOG(ERR, EAL, "Option %c is not supported " diff --git a/lib/eal/linux/eal_hugepage_info.c b/lib/eal/linux/eal_hugepage_info.c index 726a086ab3..08dc0e5620 100644 --- a/lib/eal/linux/eal_hugepage_info.c +++ b/lib/eal/linux/eal_hugepage_info.c @@ -37,6 +37,7 @@ #include "eal_hugepages.h" #include "eal_hugepage_info.h" #include "eal_filesystem.h" +#include "eal_memalloc.h" static const char sys_dir_path[] = "/sys/kernel/mm/hugepages"; static const char sys_pages_numa_dir_path[] = "/sys/devices/system/node"; @@ -515,6 +516,10 @@ hugepage_info_init(void) qsort(&internal_conf->hugepage_info[0], num_sizes, sizeof(internal_conf->hugepage_info[0]), compare_hpi); + /* add pre-allocated pages with --mem-file option to available ones */ + if (eal_memalloc_memfile_init()) + return -1; + /* now we have all info, check we have at least one valid size */ for (i = 0; i < num_sizes; i++) { /* pages may no longer all be on socket 0, so check all */ diff --git a/lib/eal/linux/eal_memalloc.c b/lib/eal/linux/eal_memalloc.c index 0ec8542283..c2b3586204 100644 --- a/lib/eal/linux/eal_memalloc.c +++ b/lib/eal/linux/eal_memalloc.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include #include @@ -41,6 +42,7 @@ #include #include "eal_filesystem.h" +#include 
"eal_hugepage_info.h" #include "eal_internal_cfg.h" #include "eal_memalloc.h" #include "eal_memcfg.h" @@ -102,6 +104,19 @@ static struct { int count; /**< entries used in an array */ } fd_list[RTE_MAX_MEMSEG_LISTS]; +struct memfile { + char *fname; /**< file name */ + uint64_t hugepage_sz; /**< size of a huge page */ + uint32_t num_pages; /**< number of pages */ + uint32_t num_allocated; /**< number of already allocated pages */ + int socket_id; /**< Socket ID */ + int fd; /**< file descriptor */ +}; + +struct memfile mem_file[MAX_MEMFILE_ITEMS]; + +static int alloc_memfile; + /** local copy of a memory map, used to synchronize memory hotplug in MP */ static struct rte_memseg_list local_memsegs[RTE_MAX_MEMSEG_LISTS]; @@ -542,6 +557,26 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id, * stage. */ map_offset = 0; + } else if (alloc_memfile) { + uint32_t mf; + + for (mf = 0; mf < RTE_DIM(mem_file); mf++) { + if (alloc_sz == mem_file[mf].hugepage_sz && + socket_id == mem_file[mf].socket_id && + mem_file[mf].num_allocated < mem_file[mf].num_pages) + break; + } + if (mf >= RTE_DIM(mem_file)) { + RTE_LOG(ERR, EAL, + "%s() cannot allocate from memfile\n", + __func__); + return -1; + } + fd = mem_file[mf].fd; + fd_list[list_idx].fds[seg_idx] = fd; + map_offset = mem_file[mf].num_allocated * alloc_sz; + mmap_flags = MAP_SHARED | MAP_POPULATE | MAP_FIXED; + mem_file[mf].num_allocated++; } else { /* takes out a read lock on segment or segment list */ fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx); @@ -683,6 +718,10 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id, if (fd < 0) return -1; + /* don't cleanup pre-allocated files */ + if (alloc_memfile) + return -1; + if (internal_conf->single_file_segments) { resize_hugefile(fd, map_offset, alloc_sz, false); /* ignore failure, can't make it any worse */ @@ -712,8 +751,9 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi, const struct internal_config *internal_conf = eal_get_internal_configuration(); - /* erase page data */ - memset(ms->addr, 0, ms->len); + /* Erase page data unless it's pre-allocated files. */ + if (!alloc_memfile) + memset(ms->addr, 0, ms->len); if (mmap(ms->addr, ms->len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == @@ -724,8 +764,12 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi, eal_mem_set_dump(ms->addr, ms->len, false); - /* if we're using anonymous hugepages, nothing to be done */ - if (internal_conf->in_memory && !memfd_create_supported) { + /* + * if we're using anonymous hugepages or pre-allocated files, + * nothing to be done + */ + if ((internal_conf->in_memory && !memfd_create_supported) || + alloc_memfile) { memset(ms, 0, sizeof(*ms)); return 0; } @@ -838,7 +882,9 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg) * during init, we already hold a write lock, so don't try to take out * another one. 
*/ - if (wa->hi->lock_descriptor == -1 && !internal_conf->in_memory) { + if (wa->hi->lock_descriptor == -1 && + !internal_conf->in_memory && + !alloc_memfile) { dir_fd = open(wa->hi->hugedir, O_RDONLY); if (dir_fd < 0) { RTE_LOG(ERR, EAL, "%s(): Cannot open '%s': %s\n", @@ -868,7 +914,7 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg) need, i); /* if exact number wasn't requested, stop */ - if (!wa->exact) + if (!wa->exact || alloc_memfile) goto out; /* clean up */ @@ -1120,6 +1166,262 @@ eal_memalloc_free_seg(struct rte_memseg *ms) return eal_memalloc_free_seg_bulk(&ms, 1); } +static int +memfile_fill_socket_id(struct memfile *mf) +{ +#ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES + void *va; + int ret; + + va = mmap(NULL, mf->hugepage_sz, PROT_READ | PROT_WRITE, + MAP_SHARED | MAP_POPULATE, mf->fd, 0); + if (va == MAP_FAILED) { + RTE_LOG(ERR, EAL, "%s(): %s: mmap(): %s\n", + __func__, mf->fname, strerror(errno)); + return -1; + } + + ret = 0; + if (check_numa()) { + if (get_mempolicy(&mf->socket_id, NULL, 0, va, + MPOL_F_NODE | MPOL_F_ADDR) < 0) { + RTE_LOG(ERR, EAL, "%s(): %s: get_mempolicy(): %s\n", + __func__, mf->fname, strerror(errno)); + ret = -1; + } + } else + mf->socket_id = 0; + + munmap(va, mf->hugepage_sz); + return ret; +#else + mf->socket_id = 0; + return 0; +#endif +} + +struct match_memfile_path_arg { + const char *path; + uint64_t file_sz; + uint64_t hugepage_sz; + size_t best_len; +}; + +/* + * While it is unlikely for hugetlbfs, mount points can be nested. + * Find the deepest mount point that contains the file. + */ +static int +match_memfile_path(const char *path, uint64_t hugepage_sz, void *cb_arg) +{ + struct match_memfile_path_arg *arg = cb_arg; + size_t dir_len = strlen(path); + + if (dir_len < arg->best_len) + return 0; + if (strncmp(path, arg->path, dir_len) != 0) + return 0; + if (arg->file_sz % hugepage_sz != 0) + return 0; + + arg->hugepage_sz = hugepage_sz; + arg->best_len = dir_len; + return 0; +} + +/* Determine hugepage size from the path to a file in hugetlbfs. 
*/ +static int +memfile_fill_hugepage_sz(struct memfile *mf, uint64_t file_sz) +{ + char abspath[PATH_MAX]; + struct match_memfile_path_arg arg; + + if (realpath(mf->fname, abspath) == NULL) { + RTE_LOG(ERR, EAL, "%s(): realpath(): %s\n", + __func__, strerror(errno)); + return -1; + } + + memset(&arg, 0, sizeof(arg)); + arg.path = abspath; + arg.file_sz = file_sz; + if (eal_hugepage_mount_walk(match_memfile_path, &arg) == 0 && + arg.hugepage_sz != 0) { + mf->hugepage_sz = arg.hugepage_sz; + return 0; + } + return -1; +} + +int +eal_memalloc_memfile_init(void) +{ + struct internal_config *internal_conf = + eal_get_internal_configuration(); + int err = -1, fd; + uint32_t i; + + if (internal_conf->mem_file[0] == NULL) + return 0; + + for (i = 0; i < RTE_DIM(internal_conf->mem_file); i++) { + struct memfile *mf = &mem_file[i]; + uint64_t fsize; + + if (internal_conf->mem_file[i] == NULL) { + err = 0; + break; + } + mf->fname = internal_conf->mem_file[i]; + fd = open(mf->fname, O_RDWR, 0600); + mf->fd = fd; + if (fd < 0) { + RTE_LOG(ERR, EAL, "%s(): %s: open(): %s\n", + __func__, mf->fname, strerror(errno)); + break; + } + + /* take out a read lock and keep it indefinitely */ + if (lock(fd, LOCK_SH) != 1) { + RTE_LOG(ERR, EAL, "%s(): %s: cannot lock file\n", + __func__, mf->fname); + break; + } + + fsize = get_file_size(fd); + if (!fsize) { + RTE_LOG(ERR, EAL, "%s(): %s: zero file length\n", + __func__, mf->fname); + break; + } + + if (memfile_fill_hugepage_sz(mf, fsize) < 0) { + RTE_LOG(ERR, EAL, "%s(): %s: cannot detect page size\n", + __func__, mf->fname); + break; + } + mf->num_pages = fsize / mf->hugepage_sz; + + if (memfile_fill_socket_id(mf) < 0) { + RTE_LOG(ERR, EAL, "%s(): %s: cannot detect NUMA node\n", + __func__, mf->fname); + break; + } + } + + /* check if some problem happened */ + if (err && i < RTE_DIM(internal_conf->mem_file)) { + /* some error occurred, do rollback */ + do { + fd = mem_file[i].fd; + /* closing fd drops the lock */ + if (fd >= 0) + close(fd); + mem_file[i].fd = -1; + } while (i--); + return -1; + } + + /* update hugepage_info with pages allocated in files */ + for (i = 0; i < RTE_DIM(mem_file); i++) { + const struct memfile *mf = &mem_file[i]; + struct hugepage_info *hpi = NULL; + uint64_t sz; + + if (!mf->hugepage_sz) + break; + + for (sz = 0; sz < internal_conf->num_hugepage_sizes; sz++) { + hpi = &internal_conf->hugepage_info[sz]; + + if (mf->hugepage_sz == hpi->hugepage_sz) { + hpi->num_pages[mf->socket_id] += mf->num_pages; + break; + } + } + + /* it seems hugepage info is not socket aware yet */ + if (hpi != NULL && sz >= internal_conf->num_hugepage_sizes) + hpi->num_pages[0] += mf->num_pages; + } + return 0; +} + +int +eal_memalloc_memfile_alloc(struct hugepage_info *hpa) +{ + struct internal_config *internal_conf = + eal_get_internal_configuration(); + uint32_t i, sz; + + if (internal_conf->mem_file[0] == NULL || + rte_eal_process_type() != RTE_PROC_PRIMARY) + return 0; + + for (i = 0; i < RTE_DIM(mem_file); i++) { + struct memfile *mf = &mem_file[i]; + uint64_t hugepage_sz = mf->hugepage_sz; + int socket_id = mf->socket_id; + struct rte_memseg **pages; + + if (!hugepage_sz) + break; + + while (mf->num_allocated < mf->num_pages) { + int needed, allocated, j; + uint32_t prev; + + prev = mf->num_allocated; + needed = mf->num_pages - mf->num_allocated; + pages = malloc(sizeof(*pages) * needed); + if (pages == NULL) + return -1; + + /* memalloc is locked, it's safe to switch allocator */ + alloc_memfile = 1; + allocated = eal_memalloc_alloc_seg_bulk(pages, 
+ needed, hugepage_sz, socket_id, false); + /* switch allocator back */ + alloc_memfile = 0; + if (allocated <= 0) { + RTE_LOG(ERR, EAL, "%s(): %s: allocation failed\n", + __func__, mf->fname); + free(pages); + return -1; + } + + /* mark preallocated pages as unfreeable */ + for (j = 0; j < allocated; j++) { + struct rte_memseg *ms = pages[j]; + + ms->flags |= RTE_MEMSEG_FLAG_DO_NOT_FREE | + RTE_MEMSEG_FLAG_PRE_ALLOCATED; + } + + free(pages); + + /* check whether we allocated from expected file */ + if (prev + allocated != mf->num_allocated) { + RTE_LOG(ERR, EAL, "%s(): %s: incorrect allocation\n", + __func__, mf->fname); + return -1; + } + } + + /* reflect we pre-allocated some memory */ + for (sz = 0; sz < internal_conf->num_hugepage_sizes; sz++) { + struct hugepage_info *hpi = &hpa[sz]; + + if (hpi->hugepage_sz != hugepage_sz) + continue; + hpi->num_pages[socket_id] -= + RTE_MIN(hpi->num_pages[socket_id], + mf->num_allocated); + } + } + return 0; +} + static int sync_chunk(struct rte_memseg_list *primary_msl, struct rte_memseg_list *local_msl, struct hugepage_info *hi, @@ -1178,6 +1480,14 @@ sync_chunk(struct rte_memseg_list *primary_msl, if (l_ms == NULL || p_ms == NULL) return -1; + /* + * Switch allocator for this segment. + * This function is only called during init, + * so don't try to restore allocator on failure. + */ + if (p_ms->flags & RTE_MEMSEG_FLAG_PRE_ALLOCATED) + alloc_memfile = 1; + if (used) { ret = alloc_seg(l_ms, p_ms->addr, p_ms->socket_id, hi, @@ -1191,6 +1501,9 @@ sync_chunk(struct rte_memseg_list *primary_msl, if (ret < 0) return -1; } + + /* Reset the allocator. */ + alloc_memfile = 0; } /* if we just allocated memory, notify the application */ @@ -1392,6 +1705,9 @@ eal_memalloc_sync_with_primary(void) if (rte_eal_process_type() == RTE_PROC_PRIMARY) return 0; + if (eal_memalloc_memfile_init() < 0) + return -1; + /* memalloc is locked, so it's safe to call thread-unsafe version */ if (rte_memseg_list_walk_thread_unsafe(sync_walk, NULL)) return -1;
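
Not part of the patch: a minimal sketch of how an application could start EAL
with the new option and then allocate zeroed memory. The hugetlbfs paths are
hypothetical; the files must already exist with the desired page sizes and
NUMA placement, as described in the documentation change above.

/* Hypothetical usage sketch for --mem-file; paths are examples only. */
#include <stdio.h>
#include <rte_eal.h>
#include <rte_malloc.h>

int
main(void)
{
	char *argv[] = {
		"app",
		/* Map hugepages "as is" from pre-allocated hugetlbfs files. */
		"--mem-file", "/mnt/huge-1G/node0,/mnt/huge-1G/node1",
	};
	int argc = sizeof(argv) / sizeof(argv[0]);
	char *buf;

	if (rte_eal_init(argc, argv) < 0) {
		fprintf(stderr, "EAL init failed\n");
		return 1;
	}

	/*
	 * With --mem-file, rte_zmalloc() clears the buffer at allocation
	 * time (memory is no longer guaranteed to be zeroed on free),
	 * so the returned memory is still all-zero as before.
	 */
	buf = rte_zmalloc("example", 1 << 20, 0);
	if (buf == NULL) {
		fprintf(stderr, "allocation failed\n");
		return 1;
	}
	printf("first byte: %d\n", buf[0]); /* prints 0 */

	rte_free(buf);
	rte_eal_cleanup();
	return 0;
}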
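The documentation above mentions pre-allocating files in hugetlbfs with mount
options and numactl. The following standalone sketch (also not part of the
patch, path and size are examples) shows one way such a file could be created
and its hugepages faulted in before the DPDK application is launched,
typically under numactl to pin the pages to the desired NUMA node.

/* Hypothetical helper: pre-create a hugetlbfs file for --mem-file. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/mnt/huge-1G/node0";
	size_t size = (size_t)4 << 30; /* 4 GB, i.e. four 1 GB hugepages */
	void *va;
	int fd;

	fd = open(path, O_CREAT | O_RDWR, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Size must be a multiple of the hugepage size of the mount. */
	if (ftruncate(fd, size) < 0) {
		perror("ftruncate");
		return 1;
	}
	/* Touch the mapping so the hugepages are actually reserved now. */
	va = mmap(NULL, size, PROT_READ | PROT_WRITE,
		  MAP_SHARED | MAP_POPULATE, fd, 0);
	if (va == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(va, 0, size);
	munmap(va, size);
	close(fd);
	printf("pre-allocated %zu bytes in %s\n", size, path);
	return 0;
}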