From patchwork Mon Jan 17 08:07:56 2022
X-Patchwork-Submitter: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
X-Patchwork-Id: 105905
X-Patchwork-Delegate: david.marchand@redhat.com
From: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
To: <dev@dpdk.org>
CC: Anatoly Burakov
Subject: [PATCH v1 1/6] doc: add hugepage mapping details
Date: Mon, 17 Jan 2022 10:07:56 +0200
Message-ID: <20220117080801.481568-2-dkozlyuk@nvidia.com>
In-Reply-To: <20220117080801.481568-1-dkozlyuk@nvidia.com>
References: <20211230143744.3550098-1-dkozlyuk@nvidia.com>
 <20220117080801.481568-1-dkozlyuk@nvidia.com>
Hugepage mapping is a layer that EAL malloc builds upon.
There were implicit references to its details,
like mentions of segment file descriptors,
but no explicit description of its modes and operation.
Add an overview of the mechanics used on each supported OS.
Convert memory management subsections from list items
to level 4 headers: they are big and important enough.

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
---
 .../prog_guide/env_abstraction_layer.rst | 85 +++++++++++++++++--
 1 file changed, 76 insertions(+), 9 deletions(-)

diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index c6accce701..bfe4594bf1 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -86,7 +86,7 @@ See chapter
 Memory Mapping Discovery and Memory Reservation
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The allocation of large contiguous physical memory is done using the hugetlbfs kernel filesystem.
+The allocation of large contiguous physical memory is done using hugepages.
 The EAL provides an API to reserve named memory zones in this contiguous memory.
 The physical address of the reserved memory for that memory zone is also returned to the user by the memory zone reservation API.
@@ -95,11 +95,12 @@ and legacy mode. Both modes are explained below.
 
 .. note::
 
-    Memory reservations done using the APIs provided by rte_malloc are also backed by pages from the hugetlbfs filesystem.
+    Memory reservations done using the APIs provided by rte_malloc are also backed by hugepages.
 
-+ Dynamic memory mode
+Dynamic Memory Mode
+^^^^^^^^^^^^^^^^^^^
 
-Currently, this mode is only supported on Linux.
+Currently, this mode is only supported on Linux and Windows.
 
 In this mode, usage of hugepages by DPDK application will grow and shrink based
 on application's requests. Any memory allocation through ``rte_malloc()``,
@@ -155,7 +156,8 @@ of memory that can be used by DPDK application.
 :ref:`Multi-process Support ` for more details about DPDK IPC.
 
-+ Legacy memory mode
+Legacy Memory Mode
+^^^^^^^^^^^^^^^^^^
 
 This mode is enabled by specifying ``--legacy-mem`` command-line switch to the
 EAL. This switch will have no effect on FreeBSD as FreeBSD only supports
@@ -168,7 +170,8 @@ not allow acquiring or releasing hugepages from the system at runtime.
 If neither ``-m`` nor ``--socket-mem`` were specified, the entire available
 hugepage memory will be preallocated.
 
-+ Hugepage allocation matching
+Hugepage Allocation Matching
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 This behavior is enabled by specifying the ``--match-allocations`` command-line
 switch to the EAL. This switch is Linux-only and not supported with
@@ -182,7 +185,8 @@ matching can be used by these types of applications to satisfy both of these
 requirements. This can result in some increased memory usage which is very
 dependent on the memory allocation patterns of the application.
 
-+ 32-bit support
+32-bit Support
+^^^^^^^^^^^^^^
 
 Additional restrictions are present when running in 32-bit mode.
 In dynamic memory mode, by default maximum of 2 gigabytes of VA space will be preallocated,
@@ -192,7 +196,8 @@ used.
 In legacy mode, VA space will only be preallocated for segments that were
 requested (plus padding, to keep IOVA-contiguousness).
 
-+ Maximum amount of memory
+Maximum Amount of Memory
+^^^^^^^^^^^^^^^^^^^^^^^^
 
 All possible virtual memory space that can ever be used for hugepage mapping in a DPDK
 process is preallocated at startup, thereby placing an upper limit on how
@@ -222,7 +227,68 @@ Normally, these options do not need to be changed.
     can later be mapped into that preallocated VA space (if dynamic memory mode
     is enabled), and can optionally be mapped into it at startup.
 
-+ Segment file descriptors
+Hugepage Mapping
+^^^^^^^^^^^^^^^^
+
+Below is an overview of the methods used on each OS to obtain hugepages,
+explaining why certain limitations and options exist in EAL.
+See the user guide for a specific OS for configuration details.
+
+FreeBSD uses the ``contigmem`` kernel module
+to reserve a fixed number of hugepages at system start,
+which are mapped by EAL at initialization using a specific ``sysctl()``.
+
+Windows EAL allocates hugepages from the OS as needed using the Win32 API,
+so the available amount depends on the system load.
+It uses the ``virt2phys`` kernel module to obtain physical addresses,
+unless running in IOVA-as-VA mode (e.g. forced with ``--iova-mode=va``).
+
+Linux implements a variety of methods:
+
+* mapping each hugepage from its own file in hugetlbfs;
+* mapping multiple hugepages from a shared file in hugetlbfs;
+* anonymous mapping.
+
+Mapping hugepages from files in hugetlbfs is essential for multi-process,
+because secondary processes need to map the same hugepages.
+EAL creates files like ``rtemap_0``
+in directories specified with the ``--huge-dir`` option
+(or in the mount point for a specific hugepage size).
+The ``rtemap_`` prefix can be changed using ``--file-prefix``.
+This may be needed for running multiple primary processes
+that share a hugetlbfs mount point.
+Each backing file by default corresponds to one hugepage;
+it is opened and locked for the entire time the hugepage is used.
+See the :ref:`segment-file-descriptors` section
+on how the number of open backing file descriptors can be reduced.
+
+Backing files may persist after the corresponding hugepage is freed
+and even after the application terminates,
+reducing the number of hugepages available to other processes.
+EAL removes existing files at startup
+and can remove newly created files before mapping them with ``--huge-unlink``.
+However, since this option disables multi-process anyway,
+using anonymous mapping (``--in-memory``) is recommended instead.
+
+The :ref:`EAL memory allocator <malloc>` relies on hugepages being zero-filled.
+Hugepages are cleared by the kernel when a file in hugetlbfs or a part of it
+is mapped for the first time system-wide
+to prevent data leaks from previous users of the same hugepage.
+EAL ensures this behavior by removing existing backing files at startup
+and by recreating them before opening for mapping (as a precaution).
+
+Anonymous mapping does not allow a multi-process architecture,
+but it is free of filename conflicts and leftover files on hugetlbfs.
+If ``memfd_create(2)`` is supported both at build and run time,
+the DPDK memory manager can provide file descriptors for memory segments,
+which are required for VirtIO with the vhost-user backend.
+This means open file descriptor issues may also affect this mode,
+with the same solution.
+
+.. _segment-file-descriptors:
+
+Segment File Descriptors
+^^^^^^^^^^^^^^^^^^^^^^^^
 
 On Linux, in most cases, EAL will store segment file descriptors in EAL. This can
 become a problem when using smaller page sizes due to underlying limitations
@@ -731,6 +797,7 @@ We expect only 50% of CPU spend on packet IO.
     echo 100000 > pkt_io/cpu.cfs_period_us
     echo  50000 > pkt_io/cpu.cfs_quota_us
 
+.. _malloc:
 
 Malloc
 ------
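The mapping modes described in the new documentation can be made concrete with a short sketch. The snippet below is illustrative only, not DPDK code: it shows, with plain POSIX calls on Linux, the difference between the file-backed hugetlbfs mapping that multi-process operation needs and the anonymous mapping used by ``--in-memory``. The ``/dev/hugepages`` mount point, the ``rtemap_demo`` file name, and the 2 MB hugepage size are assumptions for the example.

    #define _GNU_SOURCE /* for MAP_HUGETLB and MAP_ANONYMOUS */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define HUGE_SZ (2UL << 20) /* assume 2 MB hugepages */

    int main(void)
    {
        /* File-backed, like EAL's rtemap_* files: another process can
         * open and mmap() the same file to share this hugepage.
         */
        int fd = open("/dev/hugepages/rtemap_demo", O_CREAT | O_RDWR, 0600);
        if (fd < 0 || ftruncate(fd, HUGE_SZ) < 0)
            return 1;
        void *shared = mmap(NULL, HUGE_SZ, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);

        /* Anonymous, like --in-memory: no file to leak or unlink,
         * but nothing for a secondary process to attach to.
         */
        void *priv = mmap(NULL, HUGE_SZ, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

        if (shared == MAP_FAILED || priv == MAP_FAILED)
            perror("mmap");

        if (shared != MAP_FAILED)
            munmap(shared, HUGE_SZ);
        if (priv != MAP_FAILED)
            munmap(priv, HUGE_SZ);
        close(fd);
        unlink("/dev/hugepages/rtemap_demo"); /* else the file persists */
        return 0;
    }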
From patchwork Mon Jan 17 08:07:57 2022
X-Patchwork-Submitter: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
X-Patchwork-Id: 105906
X-Patchwork-Delegate: david.marchand@redhat.com
From: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
To: <dev@dpdk.org>
CC: Aaron Conole, Viacheslav Ovsiienko
Subject: [PATCH v1 2/6] app/test: add allocator performance benchmark
Date: Mon, 17 Jan 2022 10:07:57 +0200
Message-ID: <20220117080801.481568-3-dkozlyuk@nvidia.com>
In-Reply-To: <20220117080801.481568-1-dkozlyuk@nvidia.com>
References: <20211230143744.3550098-1-dkozlyuk@nvidia.com>
 <20220117080801.481568-1-dkozlyuk@nvidia.com>
Memory allocator performance is crucial to applications that deal
with large amounts of memory or allocate frequently. DPDK allocator
performance is affected by EAL options, the API used and, not least,
the allocation size. The new autotest is intended to be run with
different EAL options. It measures performance with a range of sizes
for different APIs: rte_malloc, rte_zmalloc, and rte_memzone_reserve.

Work distribution between allocation and deallocation depends on EAL
options. The test prints both times and the total time to ease comparison.

Memory can be filled with zeroes at different points of the allocation
path, but it always takes a considerable fraction of the overall timing.
This is why the test measures the filling speed and prints how long
clearing takes for each size as a reference (for rte_memzone_reserve,
estimations are printed).

Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Reviewed-by: Viacheslav Ovsiienko
Acked-by: Aaron Conole
---
 app/test/meson.build        |   2 +
 app/test/test_malloc_perf.c | 174 ++++++++++++++++++++++++++++++++++++
 2 files changed, 176 insertions(+)
 create mode 100644 app/test/test_malloc_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 344a609a4d..50cf2602a9 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -88,6 +88,7 @@ test_sources = files(
         'test_lpm6_perf.c',
         'test_lpm_perf.c',
         'test_malloc.c',
+        'test_malloc_perf.c',
         'test_mbuf.c',
         'test_member.c',
         'test_member_perf.c',
@@ -295,6 +296,7 @@ extra_test_names = [
 
 perf_test_names = [
         'ring_perf_autotest',
+        'malloc_perf_autotest',
         'mempool_perf_autotest',
         'memcpy_perf_autotest',
         'hash_perf_autotest',
diff --git a/app/test/test_malloc_perf.c b/app/test/test_malloc_perf.c
new file mode 100644
index 0000000000..9686fc8af5
--- /dev/null
+++ b/app/test/test_malloc_perf.c
@@ -0,0 +1,174 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2021 NVIDIA Corporation & Affiliates
+ */
+
+#include <inttypes.h>
+#include <stdlib.h>
+#include <string.h>
+#include <rte_cycles.h>
+#include <rte_malloc.h>
+#include <rte_memzone.h>
+
+#include "test.h"
+
+#define TEST_LOG(level, ...) RTE_LOG(level, USER1, __VA_ARGS__)
+
+typedef void * (alloc_t)(const char *name, size_t size, unsigned int align);
+typedef void (free_t)(void *addr);
+typedef void * (memset_t)(void *addr, int value, size_t size);
+
+static const uint64_t KB = 1 << 10;
+static const uint64_t GB = 1 << 30;
+
+static double
+tsc_to_us(uint64_t tsc, size_t runs)
+{
+	return (double)tsc / rte_get_tsc_hz() * US_PER_S / runs;
+}
+
+static int
+test_memset_perf(double *us_per_gb)
+{
+	static const size_t RUNS = 20;
+
+	void *ptr;
+	size_t i;
+	uint64_t tsc;
+
+	TEST_LOG(INFO, "Reference: memset\n");
+
+	ptr = rte_malloc(NULL, GB, 0);
+	if (ptr == NULL) {
+		TEST_LOG(ERR, "rte_malloc(size=%"PRIx64") failed\n", GB);
+		return -1;
+	}
+
+	tsc = rte_rdtsc_precise();
+	for (i = 0; i < RUNS; i++)
+		memset(ptr, 0, GB);
+	tsc = rte_rdtsc_precise() - tsc;
+
+	*us_per_gb = tsc_to_us(tsc, RUNS);
+	TEST_LOG(INFO, "Result: %.3f GiB/s <=> %.2f us/MiB\n",
+			US_PER_S / *us_per_gb, *us_per_gb / KB);
+
+	rte_free(ptr);
+	TEST_LOG(INFO, "\n");
+	return 0;
+}
+
+static int
+test_alloc_perf(const char *name, alloc_t *alloc_fn, free_t *free_fn,
+		memset_t *memset_fn, double memset_gb_us, size_t max_runs)
+{
+	static const size_t SIZES[] = {
+			1 << 6, 1 << 7, 1 << 10, 1 << 12, 1 << 16, 1 << 20,
+			1 << 21, 1 << 22, 1 << 24, 1 << 30 };
+
+	size_t i, j;
+	void **ptrs;
+
+	TEST_LOG(INFO, "Performance: %s\n", name);
+
+	ptrs = calloc(max_runs, sizeof(ptrs[0]));
+	if (ptrs == NULL) {
+		TEST_LOG(ERR, "Cannot allocate memory for pointers\n");
+		return -1;
+	}
+
+	TEST_LOG(INFO, "%12s%8s%12s%12s%12s%17s\n", "Size (B)", "Runs",
+			"Alloc (us)", "Free (us)", "Total (us)",
+			memset_fn != NULL ? "memset (us)" : "est.memset (us)");
+	for (i = 0; i < RTE_DIM(SIZES); i++) {
+		size_t size = SIZES[i];
+		size_t runs_done;
+		uint64_t tsc_start, tsc_alloc, tsc_memset = 0, tsc_free;
+		double alloc_time, free_time, memset_time;
+
+		tsc_start = rte_rdtsc_precise();
+		for (j = 0; j < max_runs; j++) {
+			ptrs[j] = alloc_fn(NULL, size, 0);
+			if (ptrs[j] == NULL)
+				break;
+		}
+		tsc_alloc = rte_rdtsc_precise() - tsc_start;
+
+		if (j == 0) {
+			TEST_LOG(INFO, "%12zu Interrupted: out of memory.\n",
+					size);
+			break;
+		}
+		runs_done = j;
+
+		if (memset_fn != NULL) {
+			tsc_start = rte_rdtsc_precise();
+			for (j = 0; j < runs_done && ptrs[j] != NULL; j++)
+				memset_fn(ptrs[j], 0, size);
+			tsc_memset = rte_rdtsc_precise() - tsc_start;
+		}
+
+		tsc_start = rte_rdtsc_precise();
+		for (j = 0; j < runs_done && ptrs[j] != NULL; j++)
+			free_fn(ptrs[j]);
+		tsc_free = rte_rdtsc_precise() - tsc_start;
+
+		alloc_time = tsc_to_us(tsc_alloc, runs_done);
+		free_time = tsc_to_us(tsc_free, runs_done);
+		memset_time = memset_fn != NULL ?
+				tsc_to_us(tsc_memset, runs_done) :
+				memset_gb_us * size / GB;
+		TEST_LOG(INFO, "%12zu%8zu%12.2f%12.2f%12.2f%17.2f\n",
+				size, runs_done, alloc_time, free_time,
+				alloc_time + free_time, memset_time);
+
+		memset(ptrs, 0, max_runs * sizeof(ptrs[0]));
+	}
+
+	free(ptrs);
+	TEST_LOG(INFO, "\n");
+	return 0;
+}
+
+static void *
+memzone_alloc(const char *name __rte_unused, size_t size, unsigned int align)
+{
+	const struct rte_memzone *mz;
+	char gen_name[RTE_MEMZONE_NAMESIZE];
+
+	snprintf(gen_name, sizeof(gen_name), "test-mz-%"PRIx64, rte_rdtsc());
+	mz = rte_memzone_reserve_aligned(gen_name, size, SOCKET_ID_ANY,
+			RTE_MEMZONE_1GB | RTE_MEMZONE_SIZE_HINT_ONLY, align);
+	return (void *)(uintptr_t)mz;
+}
+
+static void
+memzone_free(void *addr)
+{
+	rte_memzone_free((struct rte_memzone *)addr);
+}
+
+static int
+test_malloc_perf(void)
+{
+	static const size_t MAX_RUNS = 10000;
+
+	double memset_us_gb;
+
+	if (test_memset_perf(&memset_us_gb) < 0)
+		return -1;
+
+	if (test_alloc_perf("rte_malloc", rte_malloc, rte_free, memset,
+			memset_us_gb, MAX_RUNS) < 0)
+		return -1;
+	if (test_alloc_perf("rte_zmalloc", rte_zmalloc, rte_free, memset,
+			memset_us_gb, MAX_RUNS) < 0)
+		return -1;
+
+	if (test_alloc_perf("rte_memzone_reserve", memzone_alloc, memzone_free,
+			NULL, memset_us_gb, RTE_MAX_MEMZONE - 1) < 0)
+		return -1;
+
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(malloc_perf_autotest, test_malloc_perf);
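For context (not part of the patch), the three APIs being timed differ mainly in their zero-fill contract, which is what ties this benchmark to the clearing cost discussed in the rest of the series. A minimal usage sketch, assuming EAL has already been initialized:

    #include <rte_malloc.h>
    #include <rte_memzone.h>

    static void
    alloc_contract_demo(void)
    {
        /* rte_malloc() makes no promise about the buffer contents. */
        void *raw = rte_malloc(NULL, 1 << 20, 0);
        /* rte_zmalloc() guarantees zero-filled memory. */
        void *zeroed = rte_zmalloc(NULL, 1 << 20, 0);
        /* rte_memzone_reserve() returns a named zone that other
         * processes can look up by name.
         */
        const struct rte_memzone *mz =
            rte_memzone_reserve("demo_mz", 1 << 20, SOCKET_ID_ANY, 0);

        rte_memzone_free(mz);
        rte_free(zeroed);
        rte_free(raw);
    }

The test itself is registered as ``malloc_perf_autotest``, so it can be run from the ``dpdk-test`` binary under the different EAL options mentioned in the commit message (for example ``--legacy-mem`` or ``--in-memory``) to compare how the work splits between allocation and free.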
From patchwork Mon Jan 17 08:07:58 2022
X-Patchwork-Submitter: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
X-Patchwork-Id: 105907
X-Patchwork-Delegate: david.marchand@redhat.com
From: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
To: <dev@dpdk.org>
CC: Anatoly Burakov
Subject: [PATCH v1 3/6] mem: add dirty malloc element support
Date: Mon, 17 Jan 2022 10:07:58 +0200
Message-ID: <20220117080801.481568-4-dkozlyuk@nvidia.com>
In-Reply-To: <20220117080801.481568-1-dkozlyuk@nvidia.com>
References: <20211230143744.3550098-1-dkozlyuk@nvidia.com>
 <20220117080801.481568-1-dkozlyuk@nvidia.com>
The EAL malloc layer assumed that the content of all free elements
is filled with zeros ("clean"), as opposed to uninitialized ("dirty").
This assumption was ensured in two ways:

1. The EAL memalloc layer always returned clean memory.
2. Freed memory was cleared before returning into the heap.

Clearing the memory can be as slow as around 14 GiB/s. To save doing so,
the memalloc layer is now allowed to return dirty memory; such segments
are marked with RTE_MEMSEG_FLAG_DIRTY. The allocator tracks elements
that contain dirty memory using the new flag in the element header.
When clean memory is requested via rte_zmalloc*() and the suitable
element is dirty, it is cleared on allocation. When memory is
deallocated, the freed element is joined with adjacent free elements,
and the dirty flag is updated:

    dirty + freed + dirty = dirty  =>  no need to clean
            freed + dirty = dirty      the freed memory
    clean + freed + clean = clean  =>  freed memory
    clean + freed         = clean      must be cleared
            freed + clean = clean
            freed         = clean

As a result, memory is either cleared on free, as before, or it will be
cleared on allocation if need be, but never twice.
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
---
 lib/eal/common/malloc_elem.c | 22 +++++++++++++++++++---
 lib/eal/common/malloc_elem.h | 11 +++++++++--
 lib/eal/common/malloc_heap.c | 18 ++++++++++++------
 lib/eal/common/rte_malloc.c  | 21 ++++++++++++++-------
 lib/eal/include/rte_memory.h |  8 ++++++--
 5 files changed, 60 insertions(+), 20 deletions(-)

diff --git a/lib/eal/common/malloc_elem.c b/lib/eal/common/malloc_elem.c
index bdd20a162e..e04e0890fb 100644
--- a/lib/eal/common/malloc_elem.c
+++ b/lib/eal/common/malloc_elem.c
@@ -129,7 +129,7 @@ malloc_elem_find_max_iova_contig(struct malloc_elem *elem, size_t align)
 void
 malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
 		struct rte_memseg_list *msl, size_t size,
-		struct malloc_elem *orig_elem, size_t orig_size)
+		struct malloc_elem *orig_elem, size_t orig_size, bool dirty)
 {
 	elem->heap = heap;
 	elem->msl = msl;
@@ -137,6 +137,7 @@ malloc_elem_init(struct malloc_elem *elem, struct malloc_heap *heap,
 	elem->next = NULL;
 	memset(&elem->free_list, 0, sizeof(elem->free_list));
 	elem->state = ELEM_FREE;
+	elem->dirty = dirty;
 	elem->size = size;
 	elem->pad = 0;
 	elem->orig_elem = orig_elem;
@@ -300,7 +301,7 @@ split_elem(struct malloc_elem *elem, struct malloc_elem *split_pt)
 	const size_t new_elem_size = elem->size - old_elem_size;
 
 	malloc_elem_init(split_pt, elem->heap, elem->msl, new_elem_size,
-			elem->orig_elem, elem->orig_size);
+			elem->orig_elem, elem->orig_size, elem->dirty);
 	split_pt->prev = elem;
 	split_pt->next = next_elem;
 	if (next_elem)
@@ -506,6 +507,7 @@ join_elem(struct malloc_elem *elem1, struct malloc_elem *elem2)
 	else
 		elem1->heap->last = elem1;
 	elem1->next = next;
+	elem1->dirty |= elem2->dirty;
 	if (elem1->pad) {
 		struct malloc_elem *inner = RTE_PTR_ADD(elem1, elem1->pad);
 		inner->size = elem1->size - elem1->pad;
@@ -579,6 +581,14 @@ malloc_elem_free(struct malloc_elem *elem)
 	ptr = RTE_PTR_ADD(elem, MALLOC_ELEM_HEADER_LEN);
 	data_len = elem->size - MALLOC_ELEM_OVERHEAD;
 
+	/*
+	 * Consider the element clean for the purposes of joining.
+	 * If both neighbors are clean or non-existent,
+	 * the joint element will be clean,
+	 * which means the memory should be cleared.
+	 * There is no need to clear the memory if the joint element is dirty.
+	 */
+	elem->dirty = false;
 	elem = malloc_elem_join_adjacent_free(elem);
 
 	malloc_elem_free_list_insert(elem);
@@ -588,8 +598,14 @@ malloc_elem_free(struct malloc_elem *elem)
 	/* decrease heap's count of allocated elements */
 	elem->heap->alloc_count--;
 
-	/* poison memory */
+#ifndef RTE_MALLOC_DEBUG
+	/* Normally clear the memory when needed. */
+	if (!elem->dirty)
+		memset(ptr, 0, data_len);
+#else
+	/* Always poison the memory in debug mode. */
 	memset(ptr, MALLOC_POISON, data_len);
+#endif
 
 	return elem;
 }
diff --git a/lib/eal/common/malloc_elem.h b/lib/eal/common/malloc_elem.h
index 15d8ba7af2..f2aa98821b 100644
--- a/lib/eal/common/malloc_elem.h
+++ b/lib/eal/common/malloc_elem.h
@@ -27,7 +27,13 @@ struct malloc_elem {
 	LIST_ENTRY(malloc_elem) free_list;
 	/**< list of free elements in heap */
 	struct rte_memseg_list *msl;
-	volatile enum elem_state state;
+	/** Element state, @c dirty and @c pad validity depends on it. */
+	/* An extra bit is needed to represent enum elem_state as signed int. */
+	enum elem_state state : 3;
+	/** If state == ELEM_FREE: the memory is not filled with zeroes. */
+	uint32_t dirty : 1;
+	/** Reserved for future use. */
+	uint32_t reserved : 28;
 	uint32_t pad;
 	size_t size;
 	struct malloc_elem *orig_elem;
@@ -320,7 +326,8 @@ malloc_elem_init(struct malloc_elem *elem,
 		struct rte_memseg_list *msl,
 		size_t size,
 		struct malloc_elem *orig_elem,
-		size_t orig_size);
+		size_t orig_size,
+		bool dirty);
 
 void
 malloc_elem_insert(struct malloc_elem *elem);
diff --git a/lib/eal/common/malloc_heap.c b/lib/eal/common/malloc_heap.c
index 55aad2711b..24080fc473 100644
--- a/lib/eal/common/malloc_heap.c
+++ b/lib/eal/common/malloc_heap.c
@@ -93,11 +93,11 @@ malloc_socket_to_heap_id(unsigned int socket_id)
  */
 static struct malloc_elem *
 malloc_heap_add_memory(struct malloc_heap *heap, struct rte_memseg_list *msl,
-		void *start, size_t len)
+		void *start, size_t len, bool dirty)
 {
 	struct malloc_elem *elem = start;
 
-	malloc_elem_init(elem, heap, msl, len, elem, len);
+	malloc_elem_init(elem, heap, msl, len, elem, len, dirty);
 
 	malloc_elem_insert(elem);
 
@@ -135,7 +135,8 @@ malloc_add_seg(const struct rte_memseg_list *msl,
 
 	found_msl = &mcfg->memsegs[msl_idx];
 
-	malloc_heap_add_memory(heap, found_msl, ms->addr, len);
+	malloc_heap_add_memory(heap, found_msl, ms->addr, len,
+			ms->flags & RTE_MEMSEG_FLAG_DIRTY);
 
 	heap->total_size += len;
 
@@ -303,7 +304,8 @@ alloc_pages_on_heap(struct malloc_heap *heap, uint64_t pg_sz, size_t elt_size,
 	struct rte_memseg_list *msl;
 	struct malloc_elem *elem = NULL;
 	size_t alloc_sz;
-	int allocd_pages;
+	int allocd_pages, i;
+	bool dirty = false;
 	void *ret, *map_addr;
 
 	alloc_sz = (size_t)pg_sz * n_segs;
@@ -372,8 +374,12 @@ alloc_pages_on_heap(struct malloc_heap *heap, uint64_t pg_sz, size_t elt_size,
 		goto fail;
 	}
 
+	/* Element is dirty if it contains at least one dirty page. */
+	for (i = 0; i < allocd_pages; i++)
+		dirty |= ms[i]->flags & RTE_MEMSEG_FLAG_DIRTY;
+
 	/* add newly minted memsegs to malloc heap */
-	elem = malloc_heap_add_memory(heap, msl, map_addr, alloc_sz);
+	elem = malloc_heap_add_memory(heap, msl, map_addr, alloc_sz, dirty);
 
 	/* try once more, as now we have allocated new memory */
 	ret = find_suitable_element(heap, elt_size, flags, align, bound,
@@ -1260,7 +1266,7 @@ malloc_heap_add_external_memory(struct malloc_heap *heap,
 	memset(msl->base_va, 0, msl->len);
 
 	/* now, add newly minted memory to the malloc heap */
-	malloc_heap_add_memory(heap, msl, msl->base_va, msl->len);
+	malloc_heap_add_memory(heap, msl, msl->base_va, msl->len, false);
 
 	heap->total_size += msl->len;
 
diff --git a/lib/eal/common/rte_malloc.c b/lib/eal/common/rte_malloc.c
index d0bec26920..71a3f7ecb4 100644
--- a/lib/eal/common/rte_malloc.c
+++ b/lib/eal/common/rte_malloc.c
@@ -115,15 +115,22 @@ rte_zmalloc_socket(const char *type, size_t size, unsigned align, int socket)
 {
 	void *ptr = rte_malloc_socket(type, size, align, socket);
 
+	if (ptr != NULL) {
+		struct malloc_elem *elem = malloc_elem_from_data(ptr);
+
+		if (elem->dirty) {
+			memset(ptr, 0, size);
+		} else {
 #ifdef RTE_MALLOC_DEBUG
-	/*
-	 * If DEBUG is enabled, then freed memory is marked with poison
-	 * value and set to zero on allocation.
-	 * If DEBUG is not enabled then memory is already zeroed.
-	 */
-	if (ptr != NULL)
-		memset(ptr, 0, size);
+			/*
+			 * If DEBUG is enabled, then freed memory is marked
+			 * with a poison value and set to zero on allocation.
+			 * If DEBUG is disabled then memory is already zeroed.
+			 */
+			memset(ptr, 0, size);
 #endif
+		}
+	}
 
 	rte_eal_trace_mem_zmalloc(type, size, align, socket, ptr);
 	return ptr;
diff --git a/lib/eal/include/rte_memory.h b/lib/eal/include/rte_memory.h
index 6d018629ae..68b069fd04 100644
--- a/lib/eal/include/rte_memory.h
+++ b/lib/eal/include/rte_memory.h
@@ -19,6 +19,7 @@ extern "C" {
 #endif
 
+#include <rte_bitops.h>
 #include <rte_common.h>
 #include <rte_compat.h>
 #include <rte_config.h>
@@ -37,11 +38,14 @@
 
 #define SOCKET_ID_ANY -1 /**< Any NUMA socket. */
 
+/** Prevent this segment from being freed back to the OS. */
+#define RTE_MEMSEG_FLAG_DO_NOT_FREE RTE_BIT32(0)
+/** This segment is not filled with zeros. */
+#define RTE_MEMSEG_FLAG_DIRTY RTE_BIT32(1)
+
 /**
  * Physical memory segment descriptor.
  */
-#define RTE_MEMSEG_FLAG_DO_NOT_FREE (1 << 0)
-/**< Prevent this segment from being freed back to the OS. */
 struct rte_memseg {
 	rte_iova_t iova; /**< Start IO address. */
 	RTE_STD_C11
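To see the new flag from the API consumer side, here is a small hypothetical sketch (not from the patch) that counts dirty segments with the existing rte_memseg_walk() iterator. Applications normally never need this, since rte_zmalloc*() now clears dirty memory transparently, but it illustrates what RTE_MEMSEG_FLAG_DIRTY expresses:

    #include <rte_common.h>
    #include <rte_memory.h>

    /* Callback for rte_memseg_walk(): count segments whose contents
     * are not guaranteed to be zero-filled.
     */
    static int
    count_dirty_cb(const struct rte_memseg_list *msl __rte_unused,
            const struct rte_memseg *ms, void *arg)
    {
        unsigned int *dirty = arg;

        if (ms->flags & RTE_MEMSEG_FLAG_DIRTY)
            (*dirty)++;
        return 0; /* continue the walk */
    }

    static unsigned int
    count_dirty_segments(void)
    {
        unsigned int dirty = 0;

        rte_memseg_walk(count_dirty_cb, &dirty);
        return dirty;
    }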
From patchwork Mon Jan 17 08:07:59 2022
X-Patchwork-Submitter: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
X-Patchwork-Id: 105908
X-Patchwork-Delegate: david.marchand@redhat.com
From: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
To: <dev@dpdk.org>
CC: Anatoly Burakov
Subject: [PATCH v1 4/6] eal: refactor --huge-unlink storage
Date: Mon, 17 Jan 2022 10:07:59 +0200
Message-ID: <20220117080801.481568-5-dkozlyuk@nvidia.com>
In-Reply-To: <20220117080801.481568-1-dkozlyuk@nvidia.com>
References: <20211230143744.3550098-1-dkozlyuk@nvidia.com>
 <20220117080801.481568-1-dkozlyuk@nvidia.com>
In preparation for extending the --huge-unlink option semantics,
refactor how it is stored in the internal configuration.
This makes future changes more isolated.
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
Acked-by: Thomas Monjalon
---
 lib/eal/common/eal_common_options.c | 9 +++++----
 lib/eal/common/eal_internal_cfg.h   | 8 +++++++-
 lib/eal/linux/eal_memalloc.c        | 7 ++++---
 lib/eal/linux/eal_memory.c          | 2 +-
 4 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
index 1cfdd75f3b..7520ebda8e 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -1737,7 +1737,7 @@ eal_parse_common_option(int opt, const char *optarg,
 
 	/* long options */
 	case OPT_HUGE_UNLINK_NUM:
-		conf->hugepage_unlink = 1;
+		conf->hugepage_file.unlink_before_mapping = true;
 		break;
 
 	case OPT_NO_HUGE_NUM:
@@ -1766,7 +1766,7 @@ eal_parse_common_option(int opt, const char *optarg,
 		conf->in_memory = 1;
 		/* in-memory is a superset of noshconf and huge-unlink */
 		conf->no_shconf = 1;
-		conf->hugepage_unlink = 1;
+		conf->hugepage_file.unlink_before_mapping = true;
 		break;
 
 	case OPT_PROC_TYPE_NUM:
@@ -2050,7 +2050,8 @@ eal_check_common_options(struct internal_config *internal_cfg)
 			"be specified together with --"OPT_NO_HUGE"\n");
 		return -1;
 	}
-	if (internal_cfg->no_hugetlbfs && internal_cfg->hugepage_unlink &&
+	if (internal_cfg->no_hugetlbfs &&
+			internal_cfg->hugepage_file.unlink_before_mapping &&
 			!internal_cfg->in_memory) {
 		RTE_LOG(ERR, EAL, "Option --"OPT_HUGE_UNLINK" cannot "
 			"be specified together with --"OPT_NO_HUGE"\n");
@@ -2061,7 +2062,7 @@ eal_check_common_options(struct internal_config *internal_cfg)
 			" is only supported in non-legacy memory mode\n");
 	}
 	if (internal_cfg->single_file_segments &&
-			internal_cfg->hugepage_unlink &&
+			internal_cfg->hugepage_file.unlink_before_mapping &&
 			!internal_cfg->in_memory) {
 		RTE_LOG(ERR, EAL, "Option --"OPT_SINGLE_FILE_SEGMENTS" is "
 			"not compatible with --"OPT_HUGE_UNLINK"\n");
diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h
index d6c0470eb8..b5e6942578 100644
--- a/lib/eal/common/eal_internal_cfg.h
+++ b/lib/eal/common/eal_internal_cfg.h
@@ -40,6 +40,12 @@ struct simd_bitwidth {
 	uint16_t bitwidth; /**< bitwidth value */
 };
 
+/** Hugepage backing files discipline. */
+struct hugepage_file_discipline {
+	/** Unlink files before mapping them to leave no trace in hugetlbfs. */
+	bool unlink_before_mapping;
+};
+
 /**
  * internal configuration
  */
@@ -48,7 +54,7 @@ struct internal_config {
 	volatile unsigned force_nchannel; /**< force number of channels */
 	volatile unsigned force_nrank;    /**< force number of ranks */
 	volatile unsigned no_hugetlbfs;   /**< true to disable hugetlbfs */
-	unsigned hugepage_unlink;         /**< true to unlink backing files */
+	struct hugepage_file_discipline hugepage_file;
 	volatile unsigned no_pci;         /**< true to disable PCI */
 	volatile unsigned no_hpet;        /**< true to disable HPET */
 	volatile unsigned vmware_tsc_map; /**< true to use VMware TSC mapping
diff --git a/lib/eal/linux/eal_memalloc.c b/lib/eal/linux/eal_memalloc.c
index 337f2bc739..abbe605e49 100644
--- a/lib/eal/linux/eal_memalloc.c
+++ b/lib/eal/linux/eal_memalloc.c
@@ -564,7 +564,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 			__func__, strerror(errno));
 		goto resized;
 	}
-	if (internal_conf->hugepage_unlink &&
+	if (internal_conf->hugepage_file.unlink_before_mapping &&
 			!internal_conf->in_memory) {
 		if (unlink(path)) {
 			RTE_LOG(DEBUG, EAL, "%s(): unlink() failed: %s\n",
@@ -697,7 +697,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		close_hugefile(fd, path, list_idx);
 	} else {
 		/* only remove file if we can take out a write lock */
-		if (internal_conf->hugepage_unlink == 0 &&
+		if (!internal_conf->hugepage_file.unlink_before_mapping &&
 				internal_conf->in_memory == 0 &&
 				lock(fd, LOCK_EX) == 1)
 			unlink(path);
@@ -756,7 +756,8 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 	/* if we're able to take out a write lock, we're the last one
	 * holding onto this page.
	 */
-	if (!internal_conf->in_memory && !internal_conf->hugepage_unlink) {
+	if (!internal_conf->in_memory &&
+			!internal_conf->hugepage_file.unlink_before_mapping) {
 		ret = lock(fd, LOCK_EX);
 		if (ret >= 0) {
 			/* no one else is using this page */
diff --git a/lib/eal/linux/eal_memory.c b/lib/eal/linux/eal_memory.c
index 03a4f2dd2d..83eec078a4 100644
--- a/lib/eal/linux/eal_memory.c
+++ b/lib/eal/linux/eal_memory.c
@@ -1428,7 +1428,7 @@ eal_legacy_hugepage_init(void)
 	}
 
 	/* free the hugepage backing files */
-	if (internal_conf->hugepage_unlink &&
+	if (internal_conf->hugepage_file.unlink_before_mapping &&
 		unlink_hugepage_files(tmp_hp,
 			internal_conf->num_hugepage_sizes) < 0) {
 		RTE_LOG(ERR, EAL, "Unlinking hugepage files failed!\n");
 		goto fail;
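The shape of the refactoring is easiest to see in isolation. A toy model (names mirror the patch; the ``keep_existing`` field is a hypothetical illustration of the kind of extension this grouping enables, not something this patch adds):

    #include <stdbool.h>
    #include <stdio.h>

    /* Grouping related flags into a struct keeps struct internal_config
     * stable as hugepage-file options are added later in the series.
     */
    struct hugepage_file_discipline {
        bool unlink_before_mapping; /* the old hugepage_unlink flag */
        bool keep_existing;         /* hypothetical future option */
    };

    struct toy_config {
        struct hugepage_file_discipline hugepage_file;
    };

    int main(void)
    {
        struct toy_config conf = {0};

        /* Call sites change mechanically:
         * before: conf->hugepage_unlink = 1;
         * after:  conf->hugepage_file.unlink_before_mapping = true;
         */
        conf.hugepage_file.unlink_before_mapping = true;
        printf("unlink before mapping: %d\n",
                conf.hugepage_file.unlink_before_mapping);
        return 0;
    }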
From: Dmitry Kozlyuk
To: dev@dpdk.org
CC: Anatoly Burakov
Subject: [PATCH v1 5/6] eal/linux: allow hugepage file reuse
Date: Mon, 17 Jan 2022 10:14:06 +0200
Message-ID: <20220117081406.482210-1-dkozlyuk@nvidia.com>
In-Reply-To: <20220117080801.481568-1-dkozlyuk@nvidia.com>
References: <20220117080801.481568-1-dkozlyuk@nvidia.com>

Linux EAL ensured that mapped hugepages are clean
by always mapping from newly created files:
existing hugepage backing files were always removed first.
The kernel clears a page the first time it is mapped from a new file
to prevent data leaks, because the memory may contain leftover data
from the previous process that was using it.
This clearing takes the bulk of the time spent in mmap(2)
and increases EAL initialization time.

Introduce a mode to keep existing files and reuse them
in order to speed up initial memory allocation in EAL.
Hugepages mapped from such files may contain data
left by the previous process that used this memory,
so RTE_MEMSEG_FLAG_DIRTY is set for their segments.
If multiple hugepages are mapped from the same file:

1. When fallocate(2) is used, all memory mapped from this file
   is considered dirty, because it is unknown
   which parts of the file are holes.

2. When ftruncate(2) is used, memory mapped from this file
   is considered dirty unless the file is extended
   to create a new mapping, which implies clean memory.

Signed-off-by: Dmitry Kozlyuk
---
Coverity complains that "path" may be uninitialized in get_seg_fd()
at line 327, but it is always initialized
with eal_get_hugefile_path() at lines 309-316.

 lib/eal/common/eal_internal_cfg.h |   2 +
 lib/eal/linux/eal_hugepage_info.c | 118 +++++++++++++++++----
 lib/eal/linux/eal_memalloc.c      | 164 ++++++++++++++++++------------
 3 files changed, 200 insertions(+), 84 deletions(-)
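Since dirty segments become possible with this mode, an application may
want to know at startup how much of its memory carries leftover data.
A hedged sketch using the public rte_memseg_walk() API and the
RTE_MEMSEG_FLAG_DIRTY flag introduced earlier in this series
(error handling trimmed for brevity):

#include <stdio.h>

#include <rte_common.h>
#include <rte_eal.h>
#include <rte_memory.h>

static int
count_dirty(const struct rte_memseg_list *msl __rte_unused,
		const struct rte_memseg *ms, void *arg)
{
	unsigned int *dirty = arg;

	if (ms->flags & RTE_MEMSEG_FLAG_DIRTY)
		(*dirty)++;
	return 0; /* 0 = continue the walk */
}

int
main(int argc, char **argv)
{
	unsigned int dirty = 0;

	if (rte_eal_init(argc, argv) < 0)
		return 1;
	rte_memseg_walk(count_dirty, &dirty);
	printf("%u segments were mapped from reused files\n", dirty);
	rte_eal_cleanup();
	return 0;
}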
diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h
index b5e6942578..3685aa7c52 100644
--- a/lib/eal/common/eal_internal_cfg.h
+++ b/lib/eal/common/eal_internal_cfg.h
@@ -44,6 +44,8 @@ struct simd_bitwidth {
 struct hugepage_file_discipline {
 	/** Unlink files before mapping them to leave no trace in hugetlbfs. */
 	bool unlink_before_mapping;
+	/** Reuse existing files, never delete or re-create them. */
+	bool keep_existing;
 };
 
 /**
diff --git a/lib/eal/linux/eal_hugepage_info.c b/lib/eal/linux/eal_hugepage_info.c
index 9fb0e968db..6607fe5906 100644
--- a/lib/eal/linux/eal_hugepage_info.c
+++ b/lib/eal/linux/eal_hugepage_info.c
@@ -84,7 +84,7 @@ static int get_hp_sysfs_value(const char *subdir, const char *file, unsigned lon
 /* this function is only called from eal_hugepage_info_init which itself
  * is only called from a primary process */
 static uint32_t
-get_num_hugepages(const char *subdir, size_t sz)
+get_num_hugepages(const char *subdir, size_t sz, unsigned int reusable_pages)
 {
 	unsigned long resv_pages, num_pages, over_pages, surplus_pages;
 	const char *nr_hp_file = "free_hugepages";
@@ -116,7 +116,7 @@ get_num_hugepages(const char *subdir, size_t sz)
 	else
 		over_pages = 0;
 
-	if (num_pages == 0 && over_pages == 0)
+	if (num_pages == 0 && over_pages == 0 && reusable_pages == 0)
 		RTE_LOG(WARNING, EAL, "No available %zu kB hugepages reported\n",
 				sz >> 10);
 
@@ -124,6 +124,10 @@ get_num_hugepages(const char *subdir, size_t sz)
 	if (num_pages < over_pages) /* overflow */
 		num_pages = UINT32_MAX;
 
+	num_pages += reusable_pages;
+	if (num_pages < reusable_pages) /* overflow */
+		num_pages = UINT32_MAX;
+
 	/* we want to return a uint32_t and more than this looks suspicious
 	 * anyway ... */
 	if (num_pages > UINT32_MAX)
@@ -297,20 +301,28 @@ get_hugepage_dir(uint64_t hugepage_sz, char *hugedir, int len)
 	return -1;
 }
 
+struct walk_hugedir_data {
+	int dir_fd;
+	int file_fd;
+	const char *file_name;
+	void *user_data;
+};
+
+typedef void (walk_hugedir_t)(const struct walk_hugedir_data *whd);
+
 /*
- * Clear the hugepage directory of whatever hugepage files
- * there are. Checks if the file is locked (i.e.
- * if it's in use by another DPDK process).
+ * Search the hugepage directory for whatever hugepage files there are.
+ * Check if the file is in use by another DPDK process.
+ * If not, execute a callback on it.
  */
 static int
-clear_hugedir(const char * hugedir)
+walk_hugedir(const char *hugedir, walk_hugedir_t *cb, void *user_data)
 {
 	DIR *dir;
 	struct dirent *dirent;
 	int dir_fd, fd, lck_result;
 	const char filter[] = "*map_*"; /* matches hugepage files */
 
-	/* open directory */
 	dir = opendir(hugedir);
 	if (!dir) {
 		RTE_LOG(ERR, EAL, "Unable to open hugepage directory %s\n",
@@ -326,7 +338,7 @@ clear_hugedir(const char * hugedir)
 		goto error;
 	}
 
-	while(dirent != NULL){
+	while (dirent != NULL) {
 		/* skip files that don't match the hugepage pattern */
 		if (fnmatch(filter, dirent->d_name, 0) > 0) {
 			dirent = readdir(dir);
@@ -345,9 +357,15 @@ clear_hugedir(const char * hugedir)
 		/* non-blocking lock */
 		lck_result = flock(fd, LOCK_EX | LOCK_NB);
 
-		/* if lock succeeds, remove the file */
+		/* if lock succeeds, execute callback */
 		if (lck_result != -1)
-			unlinkat(dir_fd, dirent->d_name, 0);
+			cb(&(struct walk_hugedir_data){
+				.dir_fd = dir_fd,
+				.file_fd = fd,
+				.file_name = dirent->d_name,
+				.user_data = user_data,
+			});
+
 		close(fd);
 		dirent = readdir(dir);
 	}
@@ -359,12 +377,48 @@ clear_hugedir(const char * hugedir)
 	if (dir)
 		closedir(dir);
 
-	RTE_LOG(ERR, EAL, "Error while clearing hugepage dir: %s\n",
+	RTE_LOG(ERR, EAL, "Error while walking hugepage dir: %s\n",
 			strerror(errno));
 
 	return -1;
 }
 
+static void
+clear_hugedir_cb(const struct walk_hugedir_data *whd)
+{
+	unlinkat(whd->dir_fd, whd->file_name, 0);
+}
+
+/* Remove hugepage files not used by other DPDK processes from a directory. */
+static int
+clear_hugedir(const char *hugedir)
+{
+	return walk_hugedir(hugedir, clear_hugedir_cb, NULL);
+}
+
+static void
+inspect_hugedir_cb(const struct walk_hugedir_data *whd)
+{
+	uint64_t *total_size = whd->user_data;
+	struct stat st;
+
+	if (fstat(whd->file_fd, &st) < 0)
+		RTE_LOG(DEBUG, EAL, "%s(): stat(\"%s\") failed: %s",
+				__func__, whd->file_name, strerror(errno));
+	else
+		(*total_size) += st.st_size;
+}
+
+/*
+ * Count the total size in bytes of all files in the directory
+ * not mapped by another DPDK process.
+ */
+static int
+inspect_hugedir(const char *hugedir, uint64_t *total_size)
+{
+	return walk_hugedir(hugedir, inspect_hugedir_cb, total_size);
+}
+
 static int
 compare_hpi(const void *a, const void *b)
 {
@@ -375,7 +429,8 @@ compare_hpi(const void *a, const void *b)
 }
 
 static void
-calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent)
+calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent,
+		unsigned int reusable_pages)
 {
 	uint64_t total_pages = 0;
 	unsigned int i;
@@ -388,8 +443,15 @@ calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent)
 	 * in one socket and sorting them later */
 	total_pages = 0;
 
-	/* we also don't want to do this for legacy init */
-	if (!internal_conf->legacy_mem)
+	/*
+	 * We also don't want to do this for legacy init.
+	 * When there are hugepage files to reuse it is unknown
+	 * what NUMA node the pages are on.
+	 * This could be determined by mapping,
+	 * but it is precisely what hugepage file reuse is trying to avoid.
+	 */
+	if (!internal_conf->legacy_mem && reusable_pages == 0)
 		for (i = 0; i < rte_socket_count(); i++) {
 			int socket = rte_socket_id_by_idx(i);
 			unsigned int num_pages =
@@ -405,7 +467,7 @@ calc_num_pages(struct hugepage_info *hpi, struct dirent *dirent)
 	 */
 	if (total_pages == 0) {
 		hpi->num_pages[0] = get_num_hugepages(dirent->d_name,
-				hpi->hugepage_sz);
+				hpi->hugepage_sz, reusable_pages);
 
 #ifndef RTE_ARCH_64
 		/* for 32-bit systems, limit number of hugepages to
@@ -421,6 +483,8 @@ hugepage_info_init(void)
 	const char dirent_start_text[] = "hugepages-";
 	const size_t dirent_start_len = sizeof(dirent_start_text) - 1;
 	unsigned int i, num_sizes = 0;
+	uint64_t reusable_bytes;
+	unsigned int reusable_pages;
 	DIR *dir;
 	struct dirent *dirent;
 	struct internal_config *internal_conf =
@@ -454,7 +518,7 @@ hugepage_info_init(void)
 			uint32_t num_pages;
 
 			num_pages = get_num_hugepages(dirent->d_name,
-					hpi->hugepage_sz);
+					hpi->hugepage_sz, 0);
 			if (num_pages > 0)
 				RTE_LOG(NOTICE, EAL,
 					"%" PRIu32 " hugepages of size "
@@ -473,7 +537,7 @@ hugepage_info_init(void)
 				"hugepages of size %" PRIu64 " bytes "
 				"will be allocated anonymously\n",
 				hpi->hugepage_sz);
-			calc_num_pages(hpi, dirent);
+			calc_num_pages(hpi, dirent, 0);
 			num_sizes++;
 		}
 #endif
@@ -489,11 +553,23 @@ hugepage_info_init(void)
 				"Failed to lock hugepage directory!\n");
 			break;
 		}
 
-		/* clear out the hugepages dir from unused pages */
-		if (clear_hugedir(hpi->hugedir) == -1)
-			break;
-
-		calc_num_pages(hpi, dirent);
+		/*
+		 * Check for existing hugepage files and either remove them
+		 * or count how many of them can be reused.
+		 */
+		reusable_pages = 0;
+		if (internal_conf->hugepage_file.keep_existing) {
+			reusable_bytes = 0;
+			if (inspect_hugedir(hpi->hugedir,
+					&reusable_bytes) < 0)
+				break;
+			RTE_ASSERT(reusable_bytes % hpi->hugepage_sz == 0);
+			reusable_pages = reusable_bytes / hpi->hugepage_sz;
+		} else if (clear_hugedir(hpi->hugedir) < 0) {
+			break;
+		}
+		calc_num_pages(hpi, dirent, reusable_pages);
 
 		num_sizes++;
 	}
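Note that walk_hugedir() consults no registry of processes: the in-use
test is the flock() convention itself, since every process that maps a
hugepage file keeps a shared lock on it for as long as the mapping
lives. A minimal sketch of that test, with a hypothetical helper name:

#include <sys/file.h>

/* Hypothetical helper restating the in-use check from walk_hugedir():
 * mappers hold LOCK_SH on the file, so a non-blocking LOCK_EX succeeds
 * only when no process has the file mapped.
 */
static int
hugefile_is_orphaned(int fd)
{
	if (flock(fd, LOCK_EX | LOCK_NB) == -1)
		return 0; /* a mapper still holds its shared lock */
	/* Drop the probe lock so the file can be reused or unlinked. */
	flock(fd, LOCK_UN);
	return 1;
}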
diff --git a/lib/eal/linux/eal_memalloc.c b/lib/eal/linux/eal_memalloc.c
index abbe605e49..e4cd10b195 100644
--- a/lib/eal/linux/eal_memalloc.c
+++ b/lib/eal/linux/eal_memalloc.c
@@ -287,12 +287,19 @@ get_seg_memfd(struct hugepage_info *hi __rte_unused,
 
 static int
 get_seg_fd(char *path, int buflen, struct hugepage_info *hi,
-		unsigned int list_idx, unsigned int seg_idx)
+		unsigned int list_idx, unsigned int seg_idx,
+		bool *dirty)
 {
 	int fd;
+	int *out_fd;
+	struct stat st;
+	int ret;
 	const struct internal_config *internal_conf =
 		eal_get_internal_configuration();
 
+	if (dirty != NULL)
+		*dirty = false;
+
 	/* for in-memory mode, we only make it here when we're sure we support
 	 * memfd, and this is a special case.
 	 */
@@ -300,66 +307,69 @@ get_seg_fd(char *path, int buflen, struct hugepage_info *hi,
 		return get_seg_memfd(hi, list_idx, seg_idx);
 
 	if (internal_conf->single_file_segments) {
-		/* create a hugepage file path */
+		out_fd = &fd_list[list_idx].memseg_list_fd;
 		eal_get_hugefile_path(path, buflen, hi->hugedir, list_idx);
-
-		fd = fd_list[list_idx].memseg_list_fd;
-
-		if (fd < 0) {
-			fd = open(path, O_CREAT | O_RDWR, 0600);
-			if (fd < 0) {
-				RTE_LOG(ERR, EAL, "%s(): open failed: %s\n",
-					__func__, strerror(errno));
-				return -1;
-			}
-			/* take out a read lock and keep it indefinitely */
-			if (lock(fd, LOCK_SH) < 0) {
-				RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n",
-					__func__, strerror(errno));
-				close(fd);
-				return -1;
-			}
-			fd_list[list_idx].memseg_list_fd = fd;
-		}
 	} else {
-		/* create a hugepage file path */
+		out_fd = &fd_list[list_idx].fds[seg_idx];
 		eal_get_hugefile_path(path, buflen, hi->hugedir,
 				list_idx * RTE_MAX_MEMSEG_PER_LIST + seg_idx);
+	}
+	fd = *out_fd;
+	if (fd >= 0)
+		return fd;
 
-		fd = fd_list[list_idx].fds[seg_idx];
-
-		if (fd < 0) {
-			/* A primary process is the only one creating these
-			 * files. If there is a leftover that was not cleaned
-			 * by clear_hugedir(), we must *now* make sure to drop
-			 * the file or we will remap old stuff while the rest
-			 * of the code is built on the assumption that a new
-			 * page is clean.
-			 */
-			if (rte_eal_process_type() == RTE_PROC_PRIMARY &&
-					unlink(path) == -1 &&
-					errno != ENOENT) {
-				RTE_LOG(DEBUG, EAL, "%s(): could not remove '%s': %s\n",
-					__func__, path, strerror(errno));
-				return -1;
-			}
+	/*
+	 * There is no TOCTOU between stat() and unlink()/open()
+	 * because the hugepage directory is locked.
+	 */
+	ret = stat(path, &st);
+	if (ret < 0 && errno != ENOENT) {
+		RTE_LOG(DEBUG, EAL, "%s(): stat() for '%s' failed: %s\n",
+			__func__, path, strerror(errno));
+		return -1;
+	}
+	if (internal_conf->hugepage_file.keep_existing && ret == 0 &&
+			dirty != NULL)
+		*dirty = true;
 
-			fd = open(path, O_CREAT | O_RDWR, 0600);
-			if (fd < 0) {
-				RTE_LOG(DEBUG, EAL, "%s(): open failed: %s\n",
-					__func__, strerror(errno));
-				return -1;
-			}
-			/* take out a read lock */
-			if (lock(fd, LOCK_SH) < 0) {
-				RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n",
-					__func__, strerror(errno));
-				close(fd);
-				return -1;
-			}
-			fd_list[list_idx].fds[seg_idx] = fd;
+	/*
+	 * The kernel clears a hugepage only when it is mapped
+	 * from a particular file for the first time.
+	 * If the file already exists, the old content will be mapped.
+	 * If the memory manager assumes all mapped pages to be clean,
+	 * the file must be removed and created anew.
+	 * Otherwise, the primary caller must be notified
+	 * that mapped pages will be dirty
+	 * (secondary callers receive the segment state from the primary one).
+	 * When multiple hugepages are mapped from the same file,
+	 * whether they will be dirty depends on the part that is mapped.
+	 */
+	if (!internal_conf->single_file_segments &&
+			!internal_conf->hugepage_file.keep_existing &&
+			rte_eal_process_type() == RTE_PROC_PRIMARY &&
+			ret == 0) {
+		/* coverity[toctou] */
+		if (unlink(path) < 0) {
+			RTE_LOG(DEBUG, EAL, "%s(): could not remove '%s': %s\n",
+				__func__, path, strerror(errno));
+			return -1;
 		}
 	}
+
+	/* coverity[toctou] */
+	fd = open(path, O_CREAT | O_RDWR, 0600);
+	if (fd < 0) {
+		RTE_LOG(DEBUG, EAL, "%s(): open failed: %s\n",
+			__func__, strerror(errno));
+		return -1;
+	}
+	/* take out a read lock */
+	if (lock(fd, LOCK_SH) < 0) {
+		RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n",
+			__func__, strerror(errno));
+		close(fd);
+		return -1;
+	}
+	*out_fd = fd;
 
 	return fd;
 }
@@ -385,8 +395,10 @@ resize_hugefile_in_memory(int fd, uint64_t fa_offset,
 
 static int
 resize_hugefile_in_filesystem(int fd, uint64_t fa_offset, uint64_t page_sz,
-		bool grow)
+		bool grow, bool *dirty)
 {
+	const struct internal_config *internal_conf =
+		eal_get_internal_configuration();
 	bool again = false;
 
 	do {
@@ -405,6 +417,8 @@ resize_hugefile_in_filesystem(int fd, uint64_t fa_offset, uint64_t page_sz,
 			uint64_t cur_size = get_file_size(fd);
 
 			/* fallocate isn't supported, fall back to ftruncate */
+			if (dirty != NULL)
+				*dirty = new_size <= cur_size;
 			if (new_size > cur_size &&
 					ftruncate(fd, new_size) < 0) {
 				RTE_LOG(DEBUG, EAL, "%s(): ftruncate() failed: %s\n",
@@ -447,8 +461,17 @@ resize_hugefile_in_filesystem(int fd, uint64_t fa_offset, uint64_t page_sz,
 						strerror(errno));
 					return -1;
 				}
-			} else
+			} else {
 				fallocate_supported = 1;
+				/*
+				 * It is unknown which portions of an existing
+				 * hugepage file were allocated previously,
+				 * so all pages within the file are considered
+				 * dirty, unless the file is a fresh one.
+				 */
+				if (dirty != NULL)
+					*dirty &= internal_conf->hugepage_file.keep_existing;
+			}
 		}
 	} while (again);
 
@@ -475,7 +498,8 @@ close_hugefile(int fd, char *path, int list_idx)
 }
 
 static int
-resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz, bool grow)
+resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz, bool grow,
+		bool *dirty)
 {
 	/* in-memory mode is a special case, because we can be sure that
 	 * fallocate() is supported.
@@ -483,12 +507,15 @@ resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz, bool grow)
 	 */
 	const struct internal_config *internal_conf =
 		eal_get_internal_configuration();
 
-	if (internal_conf->in_memory)
+	if (internal_conf->in_memory) {
+		if (dirty != NULL)
+			*dirty = false;
 		return resize_hugefile_in_memory(fd, fa_offset,
 				page_sz, grow);
+	}
 
 	return resize_hugefile_in_filesystem(fd, fa_offset, page_sz,
-			grow);
+			grow, dirty);
 }
 
 static int
@@ -505,6 +532,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	char path[PATH_MAX];
 	int ret = 0;
 	int fd;
+	bool dirty;
 	size_t alloc_sz;
 	int flags;
 	void *new_addr;
@@ -534,6 +562,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		pagesz_flag = pagesz_flags(alloc_sz);
 
 		fd = -1;
+		dirty = false;
 		mmap_flags = in_memory_flags | pagesz_flag;
 
 		/* single-file segments codepath will never be active
@@ -544,7 +573,8 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		map_offset = 0;
 	} else {
 		/* takes out a read lock on segment or segment list */
-		fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
+		fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx,
+				&dirty);
 		if (fd < 0) {
 			RTE_LOG(ERR, EAL, "Couldn't get fd on hugepage file\n");
 			return -1;
@@ -552,7 +582,8 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 
 		if (internal_conf->single_file_segments) {
 			map_offset = seg_idx * alloc_sz;
-			ret = resize_hugefile(fd, map_offset, alloc_sz, true);
+			ret = resize_hugefile(fd, map_offset, alloc_sz, true,
+					&dirty);
 			if (ret < 0)
 				goto resized;
 
@@ -662,6 +693,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	ms->nrank = rte_memory_get_nrank();
 	ms->iova = iova;
 	ms->socket_id = socket_id;
+	ms->flags = dirty ? RTE_MEMSEG_FLAG_DIRTY : 0;
 
 	return 0;
 
@@ -689,7 +721,7 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		return -1;
 
 	if (internal_conf->single_file_segments) {
-		resize_hugefile(fd, map_offset, alloc_sz, false);
+		resize_hugefile(fd, map_offset, alloc_sz, false, NULL);
 		/* ignore failure, can't make it any worse */
 
 		/* if refcount is at zero, close the file */
@@ -739,13 +771,13 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 	 * segment and thus drop the lock on original fd, but hugepage dir is
 	 * now locked so we can take out another one without races.
 	 */
-	fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
+	fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx, NULL);
 	if (fd < 0)
 		return -1;
 
 	if (internal_conf->single_file_segments) {
 		map_offset = seg_idx * ms->len;
-		if (resize_hugefile(fd, map_offset, ms->len, false))
+		if (resize_hugefile(fd, map_offset, ms->len, false, NULL))
 			return -1;
 
 		if (--(fd_list[list_idx].count) == 0)
@@ -1743,6 +1775,12 @@ eal_memalloc_init(void)
 			RTE_LOG(ERR, EAL, "Using anonymous memory is not supported\n");
 			return -1;
 		}
+		/* safety net, should be impossible to configure */
+		if (internal_conf->hugepage_file.unlink_before_mapping &&
+				internal_conf->hugepage_file.keep_existing) {
+			RTE_LOG(ERR, EAL, "Unable both to keep existing hugepage files and to unlink them.\n");
+			return -1;
+		}
 	}
 
 	/* initialize all of the fd lists */
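In the ftruncate() fallback above, the dirty test reduces to a size
comparison: only a mapping that extends the file is guaranteed to be
kernel-cleared. A hedged restatement of that rule as a standalone
helper (not part of the patch; the function name is invented):

#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: may mapping [offset, offset+len) of this
 * hugepage file expose leftover data? Per the rule used by
 * resize_hugefile_in_filesystem(), only growing the file guarantees
 * clean pages; stay conservative on stat errors.
 */
static bool
range_may_be_dirty(int fd, off_t offset, off_t len)
{
	struct stat st;

	if (fstat(fd, &st) < 0)
		return true; /* unknown, assume the worst */
	/* Not growing the file means the range already existed. */
	return offset + len <= st.st_size;
}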
From patchwork Mon Jan 17 08:14:40 2022
X-Patchwork-Submitter: Dmitry Kozlyuk
X-Patchwork-Id: 105914
X-Patchwork-Delegate: david.marchand@redhat.com
From: Dmitry Kozlyuk
To: dev@dpdk.org
CC: Anatoly Burakov
Subject: [PATCH v1 6/6] eal: extend --huge-unlink for hugepage file reuse
Date: Mon, 17 Jan 2022 10:14:40 +0200
Message-ID: <20220117081440.482410-1-dkozlyuk@nvidia.com>
In-Reply-To: <20220117080801.481568-1-dkozlyuk@nvidia.com>
References: <20220117080801.481568-1-dkozlyuk@nvidia.com>
Expose the Linux EAL ability to reuse existing hugepage files
via the --huge-unlink=never switch.
Default behavior is unchanged; it can also be requested explicitly
with --huge-unlink=existing for consistency.
The old --huge-unlink switch is kept as an alias
for --huge-unlink=always.

Signed-off-by: Dmitry Kozlyuk
Acked-by: Thomas Monjalon
---
 doc/guides/linux_gsg/linux_eal_parameters.rst | 21 ++++++++--
 .../prog_guide/env_abstraction_layer.rst      |  9 +++++
 doc/guides/rel_notes/release_22_03.rst        |  7 ++++
 lib/eal/common/eal_common_options.c           | 39 +++++++++++++++++--
 4 files changed, 69 insertions(+), 7 deletions(-)
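The switch is an ordinary EAL argument, so it can also be supplied
programmatically. A hedged sketch of that usage (the application name
and hand-built argument vector are invented for the example):

#include <rte_eal.h>

int
main(void)
{
	char arg0[] = "demo-app";            /* hypothetical program name */
	char arg1[] = "--huge-unlink=never"; /* opt into hugepage file reuse */
	char *eal_argv[] = { arg0, arg1, NULL };

	if (rte_eal_init(2, eal_argv) < 0)
		return 1;
	/* ... application work; initial mappings may reuse old files ... */
	rte_eal_cleanup();
	return 0;
}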
diff --git a/doc/guides/linux_gsg/linux_eal_parameters.rst b/doc/guides/linux_gsg/linux_eal_parameters.rst
index 74df2611b5..7586f15ce3 100644
--- a/doc/guides/linux_gsg/linux_eal_parameters.rst
+++ b/doc/guides/linux_gsg/linux_eal_parameters.rst
@@ -84,10 +84,23 @@ Memory-related options
     Use specified hugetlbfs directory instead of autodetected ones. This can be
     a sub-directory within a hugetlbfs mountpoint.
 
-*   ``--huge-unlink``
-
-    Unlink hugepage files after creating them (implies no secondary process
-    support).
+*   ``--huge-unlink[=existing|always|never]``
+
+    No ``--huge-unlink`` option or ``--huge-unlink=existing`` is the default:
+    existing hugepage files are removed and re-created
+    to ensure the kernel clears the memory and prevents any data leaks.
+
+    With ``--huge-unlink`` (no value) or ``--huge-unlink=always``,
+    hugepage files are also removed after creating them,
+    so that the application leaves no files in hugetlbfs.
+    This mode implies no multi-process support.
+
+    When ``--huge-unlink=never`` is specified, existing hugepage files
+    are not removed, either before or after mapping them.
+    This makes restart faster by skipping memory clearing at initialization,
+    but it may slow down zeroed allocations later.
+    Reused hugepages can contain data from previous processes that used them,
+    which may be a security concern.
 
 *   ``--match-allocations``
 
diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index bfe4594bf1..c7dc4a0e6a 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -277,6 +277,15 @@ to prevent data leaks from previous users of the same hugepage.
 EAL ensures this behavior by removing existing backing files at startup
 and by recreating them before opening for mapping (as a precaution).
 
+One exception is the ``--huge-unlink=never`` mode.
+It is used to speed up EAL initialization, usually on application restart.
+Clearing memory constitutes more than 95% of hugepage mapping time.
+EAL can save it by remapping existing backing files
+with all the data left in the mapped hugepages ("dirty" memory).
+Such segments are marked with ``RTE_MEMSEG_FLAG_DIRTY``.
+The memory allocator detects dirty segments and handles them accordingly;
+in particular, it clears memory requested with ``rte_zmalloc*()``.
+
 Anonymous mapping does not allow multi-process architecture,
 but it is free of filename conflicts and leftover files on hugetlbfs.
 If memfd_create(2) is supported both at build and run time,
diff --git a/doc/guides/rel_notes/release_22_03.rst b/doc/guides/rel_notes/release_22_03.rst
index 6d99d1eaa9..0b882362cf 100644
--- a/doc/guides/rel_notes/release_22_03.rst
+++ b/doc/guides/rel_notes/release_22_03.rst
@@ -55,6 +55,13 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added ability to reuse hugepages in Linux.**
+
+  It is possible to reuse files in hugetlbfs to speed up hugepage mapping,
+  which may be useful for fast restart and large allocations.
+  The new mode is activated with ``--huge-unlink=never``
+  and has security implications; refer to the user and programmer guides.
+
 
 Removed Items
 -------------
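The allocator guarantee documented above can be stated as an invariant:
memory from rte_zmalloc*() reads as zeroes even when the backing segment
is dirty. A small illustrative check, assuming EAL has already been
initialized (the allocation tag and size are arbitrary):

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#include <rte_malloc.h>

/* With --huge-unlink=never the underlying segment may carry leftover
 * data, but rte_zmalloc() still returns cleared memory: the allocator
 * zeroes elements taken from segments marked RTE_MEMSEG_FLAG_DIRTY.
 */
static void
check_zeroed_alloc(void)
{
	size_t i, len = 4096;
	uint8_t *buf = rte_zmalloc("dirty-demo", len, 0);

	assert(buf != NULL);
	for (i = 0; i < len; i++)
		assert(buf[i] == 0); /* holds regardless of segment dirtiness */
	rte_free(buf);
}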
diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
index 7520ebda8e..905a7769bd 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -74,7 +74,7 @@ eal_long_options[] = {
 	{OPT_FILE_PREFIX,       1, NULL, OPT_FILE_PREFIX_NUM      },
 	{OPT_HELP,              0, NULL, OPT_HELP_NUM             },
 	{OPT_HUGE_DIR,          1, NULL, OPT_HUGE_DIR_NUM         },
-	{OPT_HUGE_UNLINK,       0, NULL, OPT_HUGE_UNLINK_NUM      },
+	{OPT_HUGE_UNLINK,       2, NULL, OPT_HUGE_UNLINK_NUM      },
 	{OPT_IOVA_MODE,         1, NULL, OPT_IOVA_MODE_NUM        },
 	{OPT_LCORES,            1, NULL, OPT_LCORES_NUM           },
 	{OPT_LOG_LEVEL,         1, NULL, OPT_LOG_LEVEL_NUM        },
@@ -1596,6 +1596,28 @@ available_cores(void)
 	return str;
 }
 
+#define HUGE_UNLINK_NEVER "never"
+
+static int
+eal_parse_huge_unlink(const char *arg, struct hugepage_file_discipline *out)
+{
+	if (arg == NULL || strcmp(arg, "always") == 0) {
+		out->unlink_before_mapping = true;
+		return 0;
+	}
+	if (strcmp(arg, "existing") == 0) {
+		/* same as not specifying the option */
+		return 0;
+	}
+	if (strcmp(arg, HUGE_UNLINK_NEVER) == 0) {
+		RTE_LOG(WARNING, EAL, "Using --"OPT_HUGE_UNLINK"="
+			HUGE_UNLINK_NEVER" may create data leaks.\n");
+		out->keep_existing = true;
+		return 0;
+	}
+	return -1;
+}
+
 int
 eal_parse_common_option(int opt, const char *optarg,
 		struct internal_config *conf)
@@ -1737,7 +1759,10 @@ eal_parse_common_option(int opt, const char *optarg,
 
 	/* long options */
 	case OPT_HUGE_UNLINK_NUM:
-		conf->hugepage_file.unlink_before_mapping = true;
+		if (eal_parse_huge_unlink(optarg, &conf->hugepage_file) < 0) {
+			RTE_LOG(ERR, EAL, "invalid --"OPT_HUGE_UNLINK" option\n");
+			return -1;
+		}
 		break;
 
 	case OPT_NO_HUGE_NUM:
@@ -2068,6 +2093,12 @@ eal_check_common_options(struct internal_config *internal_cfg)
 			"not compatible with --"OPT_HUGE_UNLINK"\n");
 		return -1;
 	}
+	if (internal_cfg->hugepage_file.keep_existing &&
+			internal_cfg->in_memory) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_IN_MEMORY" is not compatible "
+			"with --"OPT_HUGE_UNLINK"="HUGE_UNLINK_NEVER"\n");
+		return -1;
+	}
 	if (internal_cfg->legacy_mem &&
 			internal_cfg->in_memory) {
 		RTE_LOG(ERR, EAL, "Option --"OPT_LEGACY_MEM" is not compatible "
@@ -2200,7 +2231,9 @@ eal_common_usage(void)
 	       "  --"OPT_NO_TELEMETRY"      Disable telemetry support\n"
 	       "  --"OPT_FORCE_MAX_SIMD_BITWIDTH" Force the max SIMD bitwidth\n"
 	       "\nEAL options for DEBUG use only:\n"
-	       "  --"OPT_HUGE_UNLINK"       Unlink hugepage files after init\n"
+	       "  --"OPT_HUGE_UNLINK"[=existing|always|never]\n"
+	       "                      When to unlink files in hugetlbfs\n"
+	       "                      ('existing' by default, no value means 'always')\n"
 	       "  --"OPT_NO_HUGE"           Use malloc instead of hugetlbfs\n"
 	       "  --"OPT_NO_PCI"            Disable PCI\n"
 	       "  --"OPT_NO_HPET"           Disable HPET\n"
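For reference, the mapping from option values to the two discipline
flags can be restated as a self-contained, test-style sketch; the
struct and helper below are local stand-ins, since the real
eal_parse_huge_unlink() is static to eal_common_options.c:

#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Local stand-in mirroring struct hugepage_file_discipline. */
struct discipline {
	bool unlink_before_mapping;
	bool keep_existing;
};

static int
parse_huge_unlink(const char *arg, struct discipline *out)
{
	if (arg == NULL || strcmp(arg, "always") == 0) {
		out->unlink_before_mapping = true; /* --huge-unlink[=always] */
		return 0;
	}
	if (strcmp(arg, "existing") == 0)
		return 0; /* the default discipline */
	if (strcmp(arg, "never") == 0) {
		out->keep_existing = true; /* reuse files, may expose old data */
		return 0;
	}
	return -1; /* unknown value */
}

int
main(void)
{
	struct discipline d = { false, false };

	assert(parse_huge_unlink(NULL, &d) == 0 && d.unlink_before_mapping);
	d = (struct discipline){ false, false };
	assert(parse_huge_unlink("never", &d) == 0 && d.keep_existing);
	assert(parse_huge_unlink("sometimes", &d) == -1);
	return 0;
}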