[v2,0/6] Fast restart with many hugepages

Message ID 20220119210917.765505-1-dkozlyuk@nvidia.com (mailing list archive)

Dmitry Kozlyuk Jan. 19, 2022, 9:09 p.m. UTC
  This patchset is a new design and implementation of [1].

v2:
  * Fix hugepage file removal when they are no longer used.
    Disable removal with --huge-unlink=never as intended.
    Document this behavior difference. (Bruce)
  * Improve documentation, commit messages, and naming. (Thomas)

# Problem Statement

Large allocations that involve mapping new hugepages are slow.
This is problematic, for example, in the following use case.
A single-process application allocates ~1TB of mempools at startup.
Sometimes the app needs to restart as quickly as possible.
Allocating the hugepages anew takes as long as 15 seconds,
while the new process could just pick up all the memory
left by the old one (reinitializing the contents as needed).

Almost all of the time mmap(2) spends in the kernel
goes to clearing the memory, i.e. filling it with zeros.
This is done when a file in hugetlbfs is mapped
for the first time system-wide, i.e. when a hugepage is committed,
to prevent data leaks from the previous users of the same hugepage.
For example, mapping 32 GB from a new file may take 2.16 seconds,
while mapping the same pages again takes only 0.3 ms.
Security aside, e.g. when the environment is controlled,
this effort is wasted on memory intended for DMA,
because its contents will be overwritten anyway.
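
For reference, the cost can be reproduced outside of DPDK with a small
sketch like the one below (assumptions: a hugetlbfs mount at
/dev/hugepages with 1 GB pages, enough free hugepages, and sufficient
privileges; the file name is arbitrary). The first run, which commits
the hugepage, is orders of magnitude slower than a rerun that remaps
the already-committed page, provided the file is not removed in between.

    /* Sketch: time mapping one 1 GB hugepage from a hugetlbfs file.
     * MAP_POPULATE faults the page in during mmap() itself, so the
     * kernel's clearing cost shows up in the measured interval.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 1UL << 30; /* one 1 GB hugepage */
        int fd = open("/dev/hugepages/reuse_demo", O_CREAT | O_RDWR, 0600);
        struct timespec a, b;
        void *p;

        if (fd < 0) { perror("open"); return 1; }
        clock_gettime(CLOCK_MONOTONIC, &a);
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_POPULATE, fd, 0);
        clock_gettime(CLOCK_MONOTONIC, &b);
        if (p == MAP_FAILED) { perror("mmap"); close(fd); return 1; }
        printf("mmap+fault: %.3f ms\n",
               (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6);
        munmap(p, len);
        close(fd);
        return 0;
    }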

Linux EAL explicitly removes hugetlbfs files at initialization
and before mapping to force the kernel to clear the memory.
This allows the memory allocator to clean memory only on freeing.

# Solution

Add a new mode allowing EAL to remap existing hugepage files.
While it is intended to make restarts faster in the first place,
it makes any startup faster except the cold one
(with no existing files).

It is the administrator who accepts security risks
implied by reusing hugepages.
The new mode is opt-in, and a warning is logged when it is enabled.

The feature is Linux-only, because only Linux maps hugepages from files.
It is inherently incompatible with --in-memory;
for the interaction with --huge-unlink, see below.

There is formally no breakage of API contract,
but there is a behavior change in the new mode:
rte_malloc*() and rte_memzone_reserve*() may return dirty memory
(previously they were returning clean memory from free heap elements).
Their contract has always explicitly allowed this,
but there may still be users relying on the traditional behavior.
Such users will need to fix their code before they can use the new mode.
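
For example, code that implicitly relied on rte_malloc() returning
zeroed memory would need to switch to rte_zmalloc(), which guarantees
cleared memory in every mode. A minimal sketch (the structure and the
allocation tag are made up for illustration):

    #include <stdint.h>
    #include <rte_malloc.h>

    struct app_state {
        uint64_t counters[64];
    };

    static struct app_state *
    app_state_new(void)
    {
        /* Before: rte_malloc() happened to return zero-filled memory
         * when it came from a freshly mapped hugepage; with
         * --huge-unlink=never it may now return dirty memory:
         *
         *     return rte_malloc("app_state", sizeof(struct app_state), 0);
         *
         * After: rte_zmalloc() guarantees zeroed memory in every mode,
         * and with this series the element is cleared only if dirty.
         */
        return rte_zmalloc("app_state", sizeof(struct app_state), 0);
    }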

# Implementation

## User Interface

There is a --huge-unlink switch in the same area that removes hugepage files
before mapping them. It cannot be combined with the new mode,
because the point is to keep hugepage files for fast future restarts.
Extend the --huge-unlink option to express only the valid combinations:

* --huge-unlink=existing OR no option (for compatibility):
  unlink files at initialization
  and before opening them as a precaution.

* --huge-unlink=always OR just --huge-unlink (for compatibility):
  same as above + unlink created files before mapping.

* --huge-unlink=never:
  the new mode; do not unlink hugepage files, reuse them instead.

This option was always Linux-only, but it is kept as common
in case there are users who expect it to be a no-op on other systems.
(Adding a separate --huge-reuse option was also considered,
but there is no obvious benefit and more combinations to test.)
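
For illustration, the new mode can be enabled either on the command
line (e.g. ./dpdk-testpmd --huge-unlink=never ...) or by passing the
flag to rte_eal_init(); a minimal sketch of the latter (the application
around it is hypothetical):

    #include <rte_eal.h>

    int main(int argc, char **argv)
    {
        /* Equivalent to running the binary with "--huge-unlink=never".
         * EAL logs a warning, since reusing hugepage files trades the
         * kernel's clearing (a security measure) for faster startup.
         */
        char *eal_argv[] = { argv[0], "--huge-unlink=never" };
        (void)argc;

        if (rte_eal_init(2, eal_argv) < 0)
            return -1;

        /* ... create mempools; warm restarts reuse existing files ... */

        rte_eal_cleanup();
        return 0;
    }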

## EAL

If a memseg is mapped dirty, it is marked with RTE_MEMSEG_FLAG_DIRTY
so that the memory allocator may clear the memory if need be.
See the patch 5/6 description for details on how this is done
in the different memory mapping modes.
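
As a sketch, an application or test could observe the flag after
initialization with the existing memseg walk API (the callback name is
arbitrary):

    #include <stdio.h>
    #include <rte_memory.h>

    static int
    count_dirty(const struct rte_memseg_list *msl,
                const struct rte_memseg *ms, void *arg)
    {
        unsigned int *dirty = arg;

        (void)msl;
        if (ms->flags & RTE_MEMSEG_FLAG_DIRTY)
            (*dirty)++;
        return 0; /* keep walking */
    }

    /* After rte_eal_init():
     *     unsigned int dirty = 0;
     *     rte_memseg_walk(count_dirty, &dirty);
     *     printf("%u memsegs mapped dirty\n", dirty);
     */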

The memory manager tracks whether an element is clean or dirty.
If rte_zmalloc*() allocates from a dirty element,
the memory is cleared before handing it to the user.
On freeing, the allocator joins adjacent free elements,
but in the new mode it may not be feasible to clear the freed memory
if the joined element is dirty (contains dirty parts).
In any case, memory will be cleared only once,
either on freeing or on allocation.
See patch 3/6 for details.
Patch 2/6 adds a benchmark to see how time is distributed
between allocation and freeing in different modes.
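
A purely conceptual sketch of the allocation-time decision (the names
are illustrative, not the actual malloc_elem code):

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Illustrative stand-in for a heap element; not struct malloc_elem. */
    struct elem {
        void *data;   /* start of the usable area */
        bool dirty;   /* backed by a reused hugepage, never cleared */
    };

    static void *
    take_element(struct elem *e, size_t size, bool want_zeroed)
    {
        /* Clear at most once: only the rte_zmalloc*() path pays,
         * and only if the element is dirty. */
        if (want_zeroed && e->dirty)
            memset(e->data, 0, size);
        return e->data;
    }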

Besides clearing memory, each mmap() call takes some time.
For example, 1024 calls for 1 TB may take ~300 ms.
The time of one call mapping N hugepages is O(N),
because inside the kernel hugepages are allocated one by one.
Syscall overhead is negligible even for one page.
Hence, it does not make sense to reduce the number of mmap() calls,
which would essentially move the loop over pages into the kernel.

[1]: http://inbox.dpdk.org/dev/20211011085644.2716490-3-dkozlyuk@nvidia.com/

Dmitry Kozlyuk (6):
  doc: add hugepage mapping details
  app/test: add allocator performance benchmark
  mem: add dirty malloc element support
  eal: refactor --huge-unlink storage
  eal/linux: allow hugepage file reuse
  eal: extend --huge-unlink for hugepage file reuse

 app/test/meson.build                          |   2 +
 app/test/test_eal_flags.c                     |  25 +++
 app/test/test_malloc_perf.c                   | 174 ++++++++++++++++++
 doc/guides/linux_gsg/linux_eal_parameters.rst |  24 ++-
 .../prog_guide/env_abstraction_layer.rst      | 107 ++++++++++-
 doc/guides/rel_notes/release_22_03.rst        |   7 +
 lib/eal/common/eal_common_options.c           |  48 ++++-
 lib/eal/common/eal_internal_cfg.h             |  10 +-
 lib/eal/common/malloc_elem.c                  |  22 ++-
 lib/eal/common/malloc_elem.h                  |  11 +-
 lib/eal/common/malloc_heap.c                  |  18 +-
 lib/eal/common/rte_malloc.c                   |  21 ++-
 lib/eal/include/rte_memory.h                  |   8 +-
 lib/eal/linux/eal.c                           |   3 +-
 lib/eal/linux/eal_hugepage_info.c             | 118 +++++++++---
 lib/eal/linux/eal_memalloc.c                  | 173 ++++++++++-------
 lib/eal/linux/eal_memory.c                    |   2 +-
 17 files changed, 644 insertions(+), 129 deletions(-)
 create mode 100644 app/test/test_malloc_perf.c
  

Comments

Bruce Richardson Jan. 27, 2022, 12:07 p.m. UTC | #1
On Wed, Jan 19, 2022 at 11:09:11PM +0200, Dmitry Kozlyuk wrote:
> This patchset is a new design and implementation of [1].
> 
> v2:
>   * Fix hugepage file removal when they are no longer used.
>     Disable removal with --huge-unlink=never as intended.
>     Document this behavior difference. (Bruce)
>   * Improve documentation, commit messages, and naming. (Thomas)
> 
Thanks for the v2, I now see the promised perf improvements when running
some quick tests with testpmd. Some quick numbers below; the summary is
that for testpmd with the default mempool size, startup/exit time drops
from 1.7s to 1.4s, and when I increase the mempool size to 4M mbufs, it
drops from 7.6s to 3.9s.

/Bruce

cmd: "time echo "quit" | sudo ./build/app/dpdk-testpmd -c F --no-pci -- -i"

Baseline (no patches) - 1.7 sec
Baseline (with patches) - 1.7 sec
Huge-unlink=never - 1.4 sec

Adding --total-num-mbufs=4096000

Baseline (with patches) - 7.6 sec
Huge-unlink=never - 3.9 sec
  
Thomas Monjalon Feb. 2, 2022, 2:12 p.m. UTC | #2
2 weeks passed without any new comment except a test by Bruce.
I would prefer to avoid a last-minute merge.
Anatoly, any comment?


19/01/2022 22:09, Dmitry Kozlyuk:
> [cover letter quoted in full; snipped]
  
David Marchand Feb. 2, 2022, 9:54 p.m. UTC | #3
Hello Dmitry,

On Wed, Jan 19, 2022 at 10:09 PM Dmitry Kozlyuk <dkozlyuk@nvidia.com> wrote:
> [cover letter quoted in full; snipped]

Thanks for the series, the documentation update and keeping the EAL
options count the same as before :-).

It passes my checks (per-patch compilation for Linux x86 native, and
arm64 and ppc cross-compilation), running unit tests, and running
malloc tests with ASan enabled.


I could not check all unit tests with RTE_MALLOC_DEBUG (I passed
-Dc_args=-DRTE_MALLOC_DEBUG to meson).
mbuf_autotest fails but I reproduced the same error before the series
so I'll report and investigate this separately.
Fwiw, the failure is:
1: [/home/dmarchan/builds/build-gcc-shared/app/test/../../lib/librte_eal.so.22(rte_dump_stack+0x1b)
[0x7f860c482dab]]
Test mbuf linearize API
mbuf test FAILED (l.2035): <test_pktmbuf_read_from_offset: Incorrect
data length!
>
mbuf test FAILED (l.2539): <test_pktmbuf_ext_pinned_buffer:
test_rte_pktmbuf_read_from_offset(pinned) failed
>
test_pktmbuf_ext_pinned_buffer() failed
Test Failed


I have one comment on documentation: we have a detailed description of
the internal malloc_elem structure and of the implementation of the
DPDK memory allocator.
https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html#internal-implementation
The addition of the "dirty/clean" notion should be described, as it
would help others who want to look into this subsystem.
  
David Marchand Feb. 3, 2022, 10:26 a.m. UTC | #4
On Wed, Feb 2, 2022 at 10:54 PM David Marchand
<david.marchand@redhat.com> wrote:
> I could not check all unit tests with RTE_MALLOC_DEBUG (I passed
> -Dc_args=-DRTE_MALLOC_DEBUG to meson).
> mbuf_autotest fails but I reproduced the same error before the series
> so I'll report and investigate this separately.
> Fwiw, the failure is:
> 1: [/home/dmarchan/builds/build-gcc-shared/app/test/../../lib/librte_eal.so.22(rte_dump_stack+0x1b)
> [0x7f860c482dab]]
> Test mbuf linearize API
> mbuf test FAILED (l.2035): <test_pktmbuf_read_from_offset: Incorrect
> data length!
> >
> mbuf test FAILED (l.2539): <test_pktmbuf_ext_pinned_buffer:
> test_rte_pktmbuf_read_from_offset(pinned) failed
> >
> test_pktmbuf_ext_pinned_buffer() failed
> Test Failed

This should be fixed with:
https://patchwork.dpdk.org/project/dpdk/patch/20220203093912.25032-1-david.marchand@redhat.com/