[v6,2/3] eal: add memory pre-allocation from existing files

Message ID 20211011085644.2716490-3-dkozlyuk@nvidia.com (mailing list archive)
State Not Applicable, archived
Delegated to: David Marchand
Series eal: add memory pre-allocation from existing files

Checks

Context        Check    Description
ci/checkpatch  success  coding style OK

Commit Message

Dmitry Kozlyuk Oct. 11, 2021, 8:56 a.m. UTC
  From: Viacheslav Ovsiienko <viacheslavo@nvidia.com>

The primary DPDK process launch might take a long time if the initially
allocated memory is large. In practice, allocating 1 TB of memory
over 1 GB hugepages on Linux takes tens of seconds. Fast restart
is highly desired for some applications, and the launch delay presents
a problem.

The primary delay happens in this call trace:
  rte_eal_init()
    rte_eal_memory_init()
      rte_eal_hugepage_init()
        eal_dynmem_hugepage_init()
          eal_memalloc_alloc_seg_bulk()
            alloc_seg()
              mmap()

The largest part of the time spent in mmap() is filling the memory
with zeros. Kernel does so to prevent data leakage from a process
that was last using the page. However, in a controlled environment
it may not be the issue, while performance is. (Linux-specific
MAP_UNINITIALIZED flag allows mapping without clearing, but it is
disabled in all popular distributions for the reason above.)

It is proposed to add a new EAL option: --mem-file FILE1,FILE2,...
to map hugepages "as is" from specified FILEs in hugetlbfs.
Compared to using external memory for the task, EAL option requires
no change to application code, while allowing administrator
to control hugepage sizes and their NUMA affinity.
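
As a rough illustration of the argument format only, the comma-separated
list could be split in place as sketched below. split_file_list() is
a hypothetical helper and not the patch's code (the patch calls
rte_strsplit()); note that strtok_r() skips empty items, which may differ
from what the real parser does.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: split "FILE1,FILE2,..." in place.
 * Stores up to max_out pointers into "list" and returns how many were stored.
 * The input buffer is modified (commas are replaced with NUL bytes).
 */
static int
split_file_list(char *list, char **out, int max_out)
{
	char *save = NULL;
	char *tok;
	int n = 0;

	assert(list != NULL && out != NULL && max_out > 0);

	for (tok = strtok_r(list, ",", &save);
	     tok != NULL && n < max_out;
	     tok = strtok_r(NULL, ",", &save))
		out[n++] = tok;

	return n;
}
```

Because the tokens point into the caller's buffer, that buffer has to stay
alive for as long as the file names are used, which is why the patch keeps
a strdup() copy of the argument.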

Limitations of the feature:

* Linux-specific (only Linux maps hugepages from files).
* Incompatible with --legacy-mem (partially replaces it).
* Incompatible with --single-file-segments
  (--mem-file FILEs can contain as many segments as needed).
* Incompatible with --in-memory (logically).

A warning about possible security implications is printed
when --mem-file is used.

Until this patch the DPDK allocator always cleared memory on freeing,
so that it did not have to do that on allocation, while new memory
was cleared by the kernel. When --mem-file is in use, DPDK clears memory
after allocation in rte_zmalloc() and does not clear it on freeing.
Effectively the user trades fast startup for an occasional allocation
slowdown, paid only when zeroed memory is actually requested. When memory
is recycled, it is cleared again, which is suboptimal per se, but saves
complicating the memory management.
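
The two clearing policies can be contrasted with a toy model. This is only
a sketch under simplifying assumptions: struct toy_pool and the toy_*()
functions are hypothetical, the pool hands out one fixed buffer, and the
"fill" byte stands in for whatever a reused hugepage file may contain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy pool: a single buffer handed out and returned whole. */
struct toy_pool {
	unsigned char buf[64];
	bool clear_on_alloc;	/* true models the --mem-file behaviour */
	bool in_use;
};

/* "fill" models the initial page contents: 0 for kernel-cleared pages. */
static void
toy_pool_init(struct toy_pool *p, bool clear_on_alloc, unsigned char fill)
{
	memset(p->buf, fill, sizeof(p->buf));
	p->clear_on_alloc = clear_on_alloc;
	p->in_use = false;
}

/* Plain allocation: contents are whatever the pool currently holds. */
static void *
toy_alloc(struct toy_pool *p)
{
	assert(!p->in_use);
	p->in_use = true;
	return p->buf;
}

/* Zeroed allocation: clears here only if free does not clear. */
static void *
toy_zalloc(struct toy_pool *p)
{
	void *ptr = toy_alloc(p);

	if (p->clear_on_alloc)
		memset(ptr, 0, sizeof(p->buf));
	return ptr;
}

/* Free: clears here only in the default (clear-on-free) mode. */
static void
toy_free(struct toy_pool *p)
{
	assert(p->in_use);
	if (!p->clear_on_alloc)
		memset(p->buf, 0, sizeof(p->buf));
	p->in_use = false;
}

/* Return true if all "len" bytes at "ptr" are zero. */
static bool
toy_is_zero(const void *ptr, size_t len)
{
	const unsigned char *b = ptr;
	size_t i;

	for (i = 0; i < len; i++)
		if (b[i] != 0)
			return false;
	return true;
}
```

In the clear-on-alloc mode a plain allocation can observe stale bytes,
which is the reason for the security warning mentioned above.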

Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
---
 doc/guides/linux_gsg/linux_eal_parameters.rst |  17 +
 lib/eal/common/eal_common_dynmem.c            |   6 +
 lib/eal/common/eal_common_options.c           |  23 ++
 lib/eal/common/eal_internal_cfg.h             |   4 +
 lib/eal/common/eal_memalloc.h                 |   8 +-
 lib/eal/common/eal_options.h                  |   2 +
 lib/eal/common/malloc_elem.c                  |   5 +
 lib/eal/common/malloc_heap.h                  |   8 +
 lib/eal/common/rte_malloc.c                   |  16 +-
 lib/eal/include/rte_memory.h                  |   4 +-
 lib/eal/linux/eal.c                           |  28 ++
 lib/eal/linux/eal_hugepage_info.c             |   5 +
 lib/eal/linux/eal_memalloc.c                  | 328 +++++++++++++++++-
 13 files changed, 441 insertions(+), 13 deletions(-)
  

Comments

David Marchand Oct. 12, 2021, 3:37 p.m. UTC | #1
Hello Dmitry, Slava,

On Mon, Oct 11, 2021 at 10:57 AM Dmitry Kozlyuk <dkozlyuk@oss.nvidia.com> wrote:
>
> From: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
>
> The primary DPDK process launch might take a long time if initially
> allocated memory is large. From practice allocation of 1 TB of memory
> over 1 GB hugepages on Linux takes tens of seconds. Fast restart
> is highly desired for some applications and launch delay presents
> a problem.
>
> The primary delay happens in this call trace:
>   rte_eal_init()
>     rte_eal_memory_init()
>       rte_eal_hugepage_init()
>         eal_dynmem_hugepage_init()
>           eal_memalloc_alloc_seg_bulk()
>             alloc_seg()
>               mmap()
>
> The largest part of the time spent in mmap() is filling the memory
> with zeros. Kernel does so to prevent data leakage from a process
> that was last using the page. However, in a controlled environment
> it may not be the issue, while performance is. (Linux-specific
> MAP_UNINITIALIZED flag allows mapping without clearing, but it is
> disabled in all popular distributions for the reason above.)
>
> It is proposed to add a new EAL option: --mem-file FILE1,FILE2,...
> to map hugepages "as is" from specified FILEs in hugetlbfs.
> Compared to using external memory for the task, EAL option requires
> no change to application code, while allowing administrator
> to control hugepage sizes and their NUMA affinity.
>
> Limitations of the feature:
>
> * Linux-specific (only Linux maps hugepages from files).
> * Incompatible with --legacy-mem (partially replaces it).
> * Incompatible with --single-file-segments
>   (--mem-file FILEs can contain as many segments as needed).
> * Incompatible with --in-memory (logically).
>
> A warning about possible security implications is printed
> when --mem-file is used.
>
> Until this patch DPDK allocator always cleared memory on freeing,
> so that it did not have to do that on allocation, while new memory
> was cleared by the kernel. When --mem-file is in use, DPDK clears memory
> after allocation in rte_zmalloc() and does not clean it on freeing.
> Effectively user trades fast startup for occasional allocation slowdown
> whenever it is absolutely necessary. When memory is recycled, it is
> cleared again, which is suboptimal per se, but saves complication
> of memory management.

I have some trouble figuring the need for the list of files.
Why not use a global knob --mem-clear-on-alloc for this behavior change?


>
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
> Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
  
Dmitry Kozlyuk Oct. 12, 2021, 3:55 p.m. UTC | #2
Hello David,

> I have some trouble figuring the need for the list of files.
> Why not use a global knob --mem-clear-on-alloc for this behavior change?

Moving memset() doesn't speed anything up, it's a forced step for the reasons below.
Currently, memory is cleared by the kernel when a page is mapped during an allocation.
This cannot be turned off in stock kernels. The issue is that initial allocations are longer
by the time needed to clear the pages, which is >90%. For the memory intended for DMA
this time is just wasted. If allocations are large, application startup and restart
take long. The only way to get hugepages mapped without the kernel clearing them
is to map existing files in hugetlbfs. However, rte_zmalloc() needs to return clean
memory, that's why we move memset() there. Memory intended for DMA is just never
cleared this way. But memory freed and allocated again will be cleared again, unfortunately.
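
The effect of mapping an existing file can be sketched without hugepages.
The code below is an illustration under stated assumptions, not the patch's
code: a regular file in /tmp stands in for a hugetlbfs file, and map_file(),
demo_reuse(), DEMO_LEN and DEMO_MARK are hypothetical names. On hugetlbfs
the mapping length would have to be a multiple of the hugepage size.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define DEMO_LEN 4096
#define DEMO_MARK "left by previous run"

/* Map an existing file as shared memory without truncating it. */
static char *
map_file(const char *path)
{
	char *va;
	int fd;

	assert(path != NULL);
	fd = open(path, O_RDWR);
	if (fd < 0)
		return NULL;
	va = mmap(NULL, DEMO_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping stays valid after close() */
	return va == MAP_FAILED ? NULL : va;
}

/* Return 1 if a second mapping sees the first one's data, -1 on error. */
static int
demo_reuse(void)
{
	char path[] = "/tmp/memfile-demo-XXXXXX";
	char *va;
	int fd, seen;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, DEMO_LEN) < 0) {
		close(fd);
		unlink(path);
		return -1;
	}
	close(fd);

	/* "First run": write through the mapping, then unmap. */
	va = map_file(path);
	if (va == NULL) {
		unlink(path);
		return -1;
	}
	strcpy(va, DEMO_MARK);
	munmap(va, DEMO_LEN);

	/* "Restart": the file was not unlinked, so old contents are visible. */
	va = map_file(path);
	if (va == NULL) {
		unlink(path);
		return -1;
	}
	seen = strcmp(va, DEMO_MARK) == 0;
	munmap(va, DEMO_LEN);
	unlink(path);
	return seen;
}
```

The same property that makes the restart fast (nothing rewrites the pages)
is what lets the new process read data left by the previous one.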
  
David Marchand Oct. 12, 2021, 5:32 p.m. UTC | #3
On Tue, Oct 12, 2021 at 5:55 PM Dmitry Kozlyuk <dkozlyuk@nvidia.com> wrote:
> > I have some trouble figuring the need for the list of files.
> > Why not use a global knob --mem-clear-on-alloc for this behavior change?
>
> Moving memset() doesn't speed anything up, it's a forced step for the reasons below.
> Currently, memory is cleared by the kernel when a page is mapped during an allocation.
> This cannot be turned off in stock kernels. The issue is that initial allocations are longer
> by the time needed to clear the pages, which is >90%. For the memory intended for DMA
> this time is just wasted. If allocations are large, application startup and restart
> take long. The only way to get hugepages mapped without the kernel clearing them
> is to map existing files in hugetlbfs. However, rte_zmalloc() needs to return clean
> memory, that's why we move memset() there. Memory intended for DMA is just never
> cleared this way. But memory freed and allocated again will be cleared again, unfortunately.

Writing my limited understanding, please correct me.

The --mem-file that is proposed does:
- preallocate files which is something close to --socket-mem with the
following differences
  - --mem-file lets user decide on dpdk hugepage files names, which I
think conflicts with --huge-dir and --file-prefix,
  - --mem-file lets user decide on hugepage size which I think could
be achieved with some --huge-dir option,
- bypasses unlink() of existing hugepage files which I had overlooked
but is the main painpoint,
- enforces "clear on alloc" in rte_malloc/rte_free.


From this, I see two parts in this patch:
- faster restart, reusing hugepage files as is (combination of not
calling unlink() and doing "clear on alloc"),
  This part is interesting, and I think a single knob for this would be enough.
- finegrained control of hugepage files, but it has the drawback of
imposing primary/secondary run with the same options.
  The second part seems complex to configure. I see conflicts with
existing options, so it seems a good way to get caught up in the
carpet (sorry if it translates badly from French :p).
  
Dmitry Kozlyuk Oct. 12, 2021, 9:09 p.m. UTC | #4
> -----Original Message-----
> From: David Marchand <david.marchand@redhat.com>
> Sent: October 12, 2021 20:33
> To: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> Cc: dev <dev@dpdk.org>; Slava Ovsiienko <viacheslavo@nvidia.com>; Anatoly
> Burakov <anatoly.burakov@intel.com>; NBU-Contact-Thomas Monjalon
> <thomas@monjalon.net>
> Subject: Re: [dpdk-dev] [PATCH v6 2/3] eal: add memory pre-allocation from
> existing files
> 
> On Tue, Oct 12, 2021 at 5:55 PM Dmitry Kozlyuk <dkozlyuk@nvidia.com>
> wrote:
> > > I have some trouble figuring the need for the list of files.
> > > Why not use a global knob --mem-clear-on-alloc for this behavior
> > > change?
> >
> > Moving memset() doesn't speed anything up, it's a forced step for the
> > reasons below.
> > Currently, memory is cleared by the kernel when a page is mapped during
> > an allocation.
> > This cannot be turned off in stock kernels. The issue is that initial
> > allocations are longer by the time needed to clear the pages, which
> > is >90%. For the memory intended for DMA this time is just wasted. If
> > allocations are large, application startup and restart take long. The
> > only way to get hugepages mapped without the kernel clearing them is to
> > map existing files in hugetlbfs. However, rte_zmalloc() needs to return
> > clean memory, that's why we move memset() there. Memory intended for DMA
> > is just never cleared this way. But memory freed and allocated again
> > will be cleared again, unfortunately.
> 
> Writing my limited understanding, please correct me.
> 
> The --mem-file that is proposed does:
> - preallocate files which is something close to --socket-mem with the
> following differences
>   - --mem-file lets user decide on dpdk hugepage files names, which I
> think conflicts with --huge-dir and --file-prefix,
>   - --mem-file lets user decide on hugepage size which I think could be
> achieved with some --huge-dir option,

The comparison to --socket-mem is valid, because preallocated files form the initial amount of memory allocated from the system. However, using --mem-file does not preclude DPDK from allocating more memory according to --huge-dir and --file-prefix when the application runs out of preallocated blocks.
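
A simplified sketch of how a page could be taken from the preallocated
files follows. It mirrors the matching rule in the patch's alloc_seg()
change (same page size, same socket, pages still left), but struct
memfile_slot and pick_memfile() are hypothetical names, and what the caller
does on -1 (for example, falling back to dynamic allocation) is left open.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down counterpart of the patch's struct memfile. */
struct memfile_slot {
	uint64_t hugepage_sz;	/* size of a huge page */
	uint32_t num_pages;	/* number of pages in the file */
	uint32_t num_allocated;	/* number of pages already handed out */
	int socket_id;		/* NUMA node the file's pages belong to */
};

/*
 * Pick the first file with a matching page size and socket that still has
 * free pages. On success, store the page's offset within the file in
 * *offset, mark the page as used and return the slot index.
 * Return -1 if no preallocated page fits.
 */
static int
pick_memfile(struct memfile_slot *slots, size_t n_slots,
		uint64_t page_sz, int socket_id, uint64_t *offset)
{
	size_t i;

	assert(slots != NULL && offset != NULL);

	for (i = 0; i < n_slots; i++) {
		struct memfile_slot *mf = &slots[i];

		if (mf->hugepage_sz != page_sz || mf->socket_id != socket_id)
			continue;
		if (mf->num_allocated >= mf->num_pages)
			continue;
		/* Widen before multiplying to avoid 32-bit overflow. */
		*offset = (uint64_t)mf->num_allocated * page_sz;
		mf->num_allocated++;
		return (int)i;
	}
	return -1;
}
```

Pages are consumed from each file in order, so the offset grows by one page
size per successful call until the file is exhausted.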

> - bypasses unlink() of existing hugepage files which I had overlooked but
> is the main painpoint,
> - enforces "clear on alloc" in rte_malloc/rte_free.
> 
> 
> From this, I see two parts in this patch:
> - faster restart, reusing hugepage files as is (combination of not calling
> unlink() and doing "clear on alloc"),
>   This part is interesting, and I think a single knob for this would be
> enough.

In combination with the rte_extmem* API this knob would indeed allow implementing the feature in the app. However, the drawback is that all the logic to select hugepage size, NUMA, and names would need to be done from the app, probably with its own options. OTOH, there is already hugetlbfs and numactl to avoid apps duplicating this logic. Also, it's not only the fast restart, but also the fast initial start on a prepared system.

> - finegrained control of hugepage files, but it has the drawback of
> imposing primary/secondary run with the same options.
>   The second part seems complex to configure. I see conflicts with
> existing options, so it seems a good way to get caught up in the carpet
> (sorry if it translates badly from French :p).

I don't see why synchronizing memory options is a big issue.
Primary and secondary processes are inherently interdependent.
  
David Marchand Oct. 13, 2021, 10:18 a.m. UTC | #5
On Tue, Oct 12, 2021 at 11:09 PM Dmitry Kozlyuk <dkozlyuk@nvidia.com> wrote:
> > From this, I see two parts in this patch:
> > - faster restart, reusing hugepage files as is (combination of not calling
> > unlink() and doing "clear on alloc"),
> >   This part is interesting, and I think a single knob for this would be
> > enough.
>
> In combination with the rte_extmem* API this knob would indeed allow implementing the feature in the app. However, the drawback is that all the logic to select hugepage size, NUMA, and names would need to be done from the app, probably with its own options. OTOH, there is already hugetlbfs and numactl to avoid apps duplicating this logic. Also, it's not only the fast restart, but also the fast initial start on a prepared system.

How do you "prepare" a system?


>
> > - finegrained control of hugepage files, but it has the drawback of
> > imposing primary/secondary run with the same options.
> >   The second part seems complex to configure. I see conflicts with
> > existing options, so it seems a good way to get caught up in the carpet
> > (sorry if it translates badly from French :p).
>
> I don't see why synchronizing memory options is a big issue.

We have too many options for the memory subsystem.

I mentioned --socket-mem, --huge-dir, --file-prefix.
But there is also --huge-unlink, --no-shconf, --in-memory,
--legacy-mem, --single-file-segments, --match-allocations and
--socket-limit.
Some of those do part of the job, others are incompatible with this
new option and probably some are orthogonal.

Sure we can add a new one that prepares your toast, coffee and wakes up
the kids (that's progress!).

Maybe you can provide an example on how this is used?

Thanks.
  
Dmitry Kozlyuk Nov. 8, 2021, 2:27 p.m. UTC | #6
Hi David,

> -----Original Message-----
> From: David Marchand <david.marchand@redhat.com>
[...]
> > > - finegrained control of hugepage files, but it has the drawback of
> > > imposing primary/secondary run with the same options.
> > >   The second part seems complex to configure. I see conflicts with
> > > existing options, so it seems a good way to get caught up in the
> > > carpet (sorry if it translates badly from French :p).
> >
> > I don't see why synchronizing memory options is a big issue.
> 
> We have too many options for the memory subsystem.
> 
> I mentioned --socket-mem, --huge-dir, --file-prefix.
> But there is also --huge-unlink, --no-shconf, --in-memory, --legacy-mem,
> --single-file-segments, --match-allocations and --socket-limit.
> Some of those do part of the job, others are incompatible with this new
> option and probably some are orthogonal.
> 
> Sure we can add a new one that prepare your toasts, coffee and wake up the
> kids (that's progress!).
>
> Maybe you can provide an example on how this is used?

Sorry for the late reply.

After more consideration offline with Thomas
we concluded that the --mem-file option is indeed too intrusive.
I'm going to propose a new solution for the slow restart issue for 22.02,
probably with a knob like you proposed,
only not just changing when the memory is zeroed,
but most importantly allowing EAL to reuse hugepages.
So that in the end the usage would be as follows,
and if it's a restart, memory clearing would be bypassed:

	./dpdk-app --huge-reuse -- ...

Refactoring and benchmark patches may still be useful,
so review efforts were hopefully not in vain.
Thank you for asking the right questions!

FWIW, I agree that memory options should be cleaned up independently.
  
David Marchand Nov. 8, 2021, 5:45 p.m. UTC | #7
On Mon, Nov 8, 2021 at 3:27 PM Dmitry Kozlyuk <dkozlyuk@nvidia.com> wrote:
>
> Hi David,
>
> > -----Original Message-----
> > From: David Marchand <david.marchand@redhat.com>
> [...]
> > > > - finegrained control of hugepage files, but it has the drawback of
> > > > imposing primary/secondary run with the same options.
> > > >   The second part seems complex to configure. I see conflicts with
> > > > existing options, so it seems a good way to get caught up in the
> > > > carpet (sorry if it translates badly from French :p).
> > >
> > > I don't see why synchronizing memory options is a big issue.
> >
> > We have too many options for the memory subsystem.
> >
> > I mentioned --socket-mem, --huge-dir, --file-prefix.
> > But there is also --huge-unlink, --no-shconf, --in-memory, --legacy-mem,
> > --single-file-segments, --match-allocations and --socket-limit.
> > Some of those do part of the job, others are incompatible with this new
> > option and probably some are orthogonal.
> >
> > Sure we can add a new one that prepare your toasts, coffee and wake up the
> > kids (that's progress!).
> >
> > Maybe you can provide an example on how this is used?
>
> Sorry for the late reply.

No problem.

>
> After more consideration offline with Thomas
> we concluded that the --mem-file option is indeed too intrusive.
> I'm going to propose a new solution for the slow restart issue for 22.02,
> probably with a knob like you proposed,
> only not just changing when the memory is zeroed,
> but most importantly allowing EAL to reuse hugepages.
> So that in the end the usage would be as follows,
> and if it's a restart, memory clearing would be bypassed:
>
>         ./dpdk-app --huge-reuse -- ...
>
> Refactoring and benchmark patches may still be useful,
> so review efforts were hopefully not in vain.
> Thank you for asking the right questions!
>
> FWIW, I agree that memory options should be cleaned up independently.

Looking forward to 22.02 :-).
Thanks Dmitry.
  

Patch

diff --git a/doc/guides/linux_gsg/linux_eal_parameters.rst b/doc/guides/linux_gsg/linux_eal_parameters.rst
index bd3977cb3d..b465feaea8 100644
--- a/doc/guides/linux_gsg/linux_eal_parameters.rst
+++ b/doc/guides/linux_gsg/linux_eal_parameters.rst
@@ -92,6 +92,23 @@  Memory-related options
 
     Free hugepages back to system exactly as they were originally allocated.
 
+*   ``--mem-file <pre-allocated files>``
+
+    Use memory from pre-allocated files in ``hugetlbfs`` without clearing it;
+    when this memory is exhausted, switch to default dynamic allocation.
+    This speeds up startup compared to ``--legacy-mem`` while also avoiding
+    later delays for allocating new hugepages. One downside is slowdown
+    of all zeroed memory allocations. Security warning: an application
+    can access contents left by previous users of hugepages. Multiple files
+    can be pre-allocated in ``hugetlbfs`` with different page sizes,
+    on desired NUMA nodes, using ``mount`` options and ``numactl``:
+
+        --mem-file /mnt/huge-1G/node0,/mnt/huge-1G/node1,/mnt/huge-2M/extra
+
+    This option is incompatible with ``--legacy-mem``, ``--in-memory``,
+    and ``--single-file-segments``. Primary and secondary processes
+    must specify exactly the same list of files.
+
 Other options
 ~~~~~~~~~~~~~
 
diff --git a/lib/eal/common/eal_common_dynmem.c b/lib/eal/common/eal_common_dynmem.c
index 7c5437ddfa..abcf22f097 100644
--- a/lib/eal/common/eal_common_dynmem.c
+++ b/lib/eal/common/eal_common_dynmem.c
@@ -272,6 +272,12 @@  eal_dynmem_hugepage_init(void)
 			internal_conf->num_hugepage_sizes) < 0)
 		return -1;
 
+#ifdef RTE_EXEC_ENV_LINUX
+	/* pre-allocate pages from --mem-file option files */
+	if (eal_memalloc_memfile_alloc(used_hp) < 0)
+		return -1;
+#endif
+
 	for (hp_sz_idx = 0;
 			hp_sz_idx < (int)internal_conf->num_hugepage_sizes;
 			hp_sz_idx++) {
diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
index 1802e3d9e1..1265720484 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -84,6 +84,7 @@  eal_long_options[] = {
 	{OPT_TRACE_MODE,        1, NULL, OPT_TRACE_MODE_NUM       },
 	{OPT_MAIN_LCORE,        1, NULL, OPT_MAIN_LCORE_NUM       },
 	{OPT_MBUF_POOL_OPS_NAME, 1, NULL, OPT_MBUF_POOL_OPS_NAME_NUM},
+	{OPT_MEM_FILE,          1, NULL, OPT_MEM_FILE_NUM         },
 	{OPT_NO_HPET,           0, NULL, OPT_NO_HPET_NUM          },
 	{OPT_NO_HUGE,           0, NULL, OPT_NO_HUGE_NUM          },
 	{OPT_NO_PCI,            0, NULL, OPT_NO_PCI_NUM           },
@@ -1879,6 +1880,8 @@  eal_cleanup_config(struct internal_config *internal_cfg)
 		free(internal_cfg->hugepage_dir);
 	if (internal_cfg->user_mbuf_pool_ops_name != NULL)
 		free(internal_cfg->user_mbuf_pool_ops_name);
+	if (internal_cfg->mem_file[0])
+		free(internal_cfg->mem_file[0]);
 
 	return 0;
 }
@@ -1999,6 +2002,26 @@  eal_check_common_options(struct internal_config *internal_cfg)
 			"amount of reserved memory can be adjusted with "
 			"-m or --"OPT_SOCKET_MEM"\n");
 	}
+	if (internal_cfg->mem_file[0] && internal_conf->legacy_mem) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_LEGACY_MEM"\n");
+		return -1;
+	}
+	if (internal_cfg->mem_file[0] && internal_conf->no_hugetlbfs) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_NO_HUGE"\n");
+		return -1;
+	}
+	if (internal_cfg->mem_file[0] && internal_conf->in_memory) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_IN_MEMORY"\n");
+		return -1;
+	}
+	if (internal_cfg->mem_file[0] && internal_conf->single_file_segments) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_SINGLE_FILE_SEGMENTS"\n");
+		return -1;
+	}
 
 	return 0;
 }
diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h
index d6c0470eb8..814d5c66e1 100644
--- a/lib/eal/common/eal_internal_cfg.h
+++ b/lib/eal/common/eal_internal_cfg.h
@@ -22,6 +22,9 @@ 
 #define MAX_HUGEPAGE_SIZES 3  /**< support up to 3 page sizes */
 #endif
 
+#define MAX_MEMFILE_ITEMS (MAX_HUGEPAGE_SIZES * RTE_MAX_NUMA_NODES)
+/**< Maximal number of mem-file parameters. */
+
 /*
  * internal configuration structure for the number, size and
  * mount points of hugepages
@@ -83,6 +86,7 @@  struct internal_config {
 	rte_uuid_t vfio_vf_token;
 	char *hugefile_prefix;      /**< the base filename of hugetlbfs files */
 	char *hugepage_dir;         /**< specific hugetlbfs directory to use */
+	char *mem_file[MAX_MEMFILE_ITEMS]; /**< pre-allocated memory files */
 	char *user_mbuf_pool_ops_name;
 			/**< user defined mbuf pool ops name */
 	unsigned num_hugepage_sizes;      /**< how many sizes on this system */
diff --git a/lib/eal/common/eal_memalloc.h b/lib/eal/common/eal_memalloc.h
index ebc3a6f6c1..d92c9a167b 100644
--- a/lib/eal/common/eal_memalloc.h
+++ b/lib/eal/common/eal_memalloc.h
@@ -8,7 +8,7 @@ 
 #include <stdbool.h>
 
 #include <rte_memory.h>
-
+#include "eal_internal_cfg.h"
 /*
  * Allocate segment of specified page size.
  */
@@ -96,4 +96,10 @@  eal_memalloc_init(void);
 int
 eal_memalloc_cleanup(void);
 
+int
+eal_memalloc_memfile_init(void);
+
+int
+eal_memalloc_memfile_alloc(struct hugepage_info *hpa);
+
 #endif /* EAL_MEMALLOC_H */
diff --git a/lib/eal/common/eal_options.h b/lib/eal/common/eal_options.h
index 8e4f7202a2..5c012c8125 100644
--- a/lib/eal/common/eal_options.h
+++ b/lib/eal/common/eal_options.h
@@ -87,6 +87,8 @@  enum {
 	OPT_NO_TELEMETRY_NUM,
 #define OPT_FORCE_MAX_SIMD_BITWIDTH  "force-max-simd-bitwidth"
 	OPT_FORCE_MAX_SIMD_BITWIDTH_NUM,
+#define OPT_MEM_FILE          "mem-file"
+	OPT_MEM_FILE_NUM,
 
 	OPT_LONG_MAX_NUM
 };
diff --git a/lib/eal/common/malloc_elem.c b/lib/eal/common/malloc_elem.c
index c2c9461f1d..6e71029a3c 100644
--- a/lib/eal/common/malloc_elem.c
+++ b/lib/eal/common/malloc_elem.c
@@ -578,8 +578,13 @@  malloc_elem_free(struct malloc_elem *elem)
 	/* decrease heap's count of allocated elements */
 	elem->heap->alloc_count--;
 
+#ifdef MALLOC_DEBUG
 	/* poison memory */
 	memset(ptr, MALLOC_POISON, data_len);
+#else
+	if (!malloc_clear_on_alloc())
+		memset(ptr, 0, data_len);
+#endif
 
 	return elem;
 }
diff --git a/lib/eal/common/malloc_heap.h b/lib/eal/common/malloc_heap.h
index 3a6ec6ecf0..cb1b5a5dd5 100644
--- a/lib/eal/common/malloc_heap.h
+++ b/lib/eal/common/malloc_heap.h
@@ -10,6 +10,7 @@ 
 
 #include <rte_malloc.h>
 #include <rte_spinlock.h>
+#include "eal_private.h"
 
 /* Number of free lists per heap, grouped by size. */
 #define RTE_HEAP_NUM_FREELISTS  13
@@ -44,6 +45,13 @@  malloc_get_numa_socket(void)
 	return socket_id;
 }
 
+static inline bool
+malloc_clear_on_alloc(void)
+{
+	const struct internal_config *cfg = eal_get_internal_configuration();
+	return cfg->mem_file[0] != NULL;
+}
+
 void *
 malloc_heap_alloc(const char *type, size_t size, int socket, unsigned int flags,
 		size_t align, size_t bound, bool contig);
diff --git a/lib/eal/common/rte_malloc.c b/lib/eal/common/rte_malloc.c
index 9d39e58c08..ce94268aca 100644
--- a/lib/eal/common/rte_malloc.c
+++ b/lib/eal/common/rte_malloc.c
@@ -113,17 +113,23 @@  rte_malloc(const char *type, size_t size, unsigned align)
 void *
 rte_zmalloc_socket(const char *type, size_t size, unsigned align, int socket)
 {
+	bool zero;
 	void *ptr = rte_malloc_socket(type, size, align, socket);
 
-#ifdef RTE_MALLOC_DEBUG
 	/*
 	 * If DEBUG is enabled, then freed memory is marked with poison
-	 * value and set to zero on allocation.
-	 * If DEBUG is not enabled then  memory is already zeroed.
+	 * value and must be set to zero on allocation.
+	 * If DEBUG is not enabled then it is configurable
+	 * whether memory comes already set to zero by memalloc or on free
+	 * or it must be set to zero here.
 	 */
-	if (ptr != NULL)
-		memset(ptr, 0, size);
+#ifdef RTE_MALLOC_DEBUG
+	zero = true;
+#else
+	zero = malloc_clear_on_alloc();
 #endif
+	if (ptr != NULL && zero)
+		memset(ptr, 0, size);
 
 	rte_eal_trace_mem_zmalloc(type, size, align, socket, ptr);
 	return ptr;
diff --git a/lib/eal/include/rte_memory.h b/lib/eal/include/rte_memory.h
index 6d018629ae..579358e29e 100644
--- a/lib/eal/include/rte_memory.h
+++ b/lib/eal/include/rte_memory.h
@@ -40,7 +40,9 @@  extern "C" {
 /**
  * Physical memory segment descriptor.
  */
-#define RTE_MEMSEG_FLAG_DO_NOT_FREE (1 << 0)
+#define RTE_MEMSEG_FLAG_DO_NOT_FREE   (1 << 0)
+#define RTE_MEMSEG_FLAG_PRE_ALLOCATED (1 << 1)
+
 /**< Prevent this segment from being freed back to the OS. */
 struct rte_memseg {
 	rte_iova_t iova;            /**< Start IO address. */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 3577eaeaa4..d0afcd8326 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -548,6 +548,7 @@  eal_usage(const char *prgname)
 	       "  --"OPT_LEGACY_MEM"        Legacy memory mode (no dynamic allocation, contiguous segments)\n"
 	       "  --"OPT_SINGLE_FILE_SEGMENTS" Put all hugepage memory in single files\n"
 	       "  --"OPT_MATCH_ALLOCATIONS" Free hugepages exactly as allocated\n"
+	       "  --"OPT_MEM_FILE"          Comma-separated list of files in hugetlbfs.\n"
 	       "\n");
 	/* Allow the application to print its usage message too if hook is set */
 	if (hook) {
@@ -678,6 +679,22 @@  eal_log_level_parse(int argc, char **argv)
 	optarg = old_optarg;
 }
 
+static int
+eal_parse_memfile_arg(const char *arg, char **mem_file)
+{
+	int ret;
+
+	char *copy = strdup(arg);
+	if (copy == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot store --"OPT_MEM_FILE" names\n");
+		return -1;
+	}
+
+	ret = rte_strsplit(copy, strlen(copy), mem_file,
+			MAX_MEMFILE_ITEMS, ',');
+	return ret <= 0 ? -1 : 0;
+}
+
 /* Parse the argument given in the command line of the application */
 static int
 eal_parse_args(int argc, char **argv)
@@ -819,6 +836,17 @@  eal_parse_args(int argc, char **argv)
 			internal_conf->match_allocations = 1;
 			break;
 
+		case OPT_MEM_FILE_NUM:
+			if (eal_parse_memfile_arg(optarg,
+					internal_conf->mem_file) < 0) {
+				RTE_LOG(ERR, EAL, "invalid parameters for --"
+						OPT_MEM_FILE "\n");
+				eal_usage(prgname);
+				ret = -1;
+				goto out;
+			}
+			break;
+
 		default:
 			if (opt < OPT_LONG_MIN_NUM && isprint(opt)) {
 				RTE_LOG(ERR, EAL, "Option %c is not supported "
diff --git a/lib/eal/linux/eal_hugepage_info.c b/lib/eal/linux/eal_hugepage_info.c
index 193282e779..dfbb49ada9 100644
--- a/lib/eal/linux/eal_hugepage_info.c
+++ b/lib/eal/linux/eal_hugepage_info.c
@@ -37,6 +37,7 @@ 
 #include "eal_hugepages.h"
 #include "eal_hugepage_info.h"
 #include "eal_filesystem.h"
+#include "eal_memalloc.h"
 
 static const char sys_dir_path[] = "/sys/kernel/mm/hugepages";
 static const char sys_pages_numa_dir_path[] = "/sys/devices/system/node";
@@ -515,6 +516,10 @@  hugepage_info_init(void)
 	qsort(&internal_conf->hugepage_info[0], num_sizes,
 	      sizeof(internal_conf->hugepage_info[0]), compare_hpi);
 
+	/* add pre-allocated pages with --mem-file option to available ones */
+	if (eal_memalloc_memfile_init())
+		return -1;
+
 	/* now we have all info, check we have at least one valid size */
 	for (i = 0; i < num_sizes; i++) {
 		/* pages may no longer all be on socket 0, so check all */
diff --git a/lib/eal/linux/eal_memalloc.c b/lib/eal/linux/eal_memalloc.c
index 0ec8542283..c2b3586204 100644
--- a/lib/eal/linux/eal_memalloc.c
+++ b/lib/eal/linux/eal_memalloc.c
@@ -18,6 +18,7 @@ 
 #include <unistd.h>
 #include <limits.h>
 #include <fcntl.h>
+#include <mntent.h>
 #include <sys/ioctl.h>
 #include <sys/time.h>
 #include <signal.h>
@@ -41,6 +42,7 @@ 
 #include <rte_spinlock.h>
 
 #include "eal_filesystem.h"
+#include "eal_hugepage_info.h"
 #include "eal_internal_cfg.h"
 #include "eal_memalloc.h"
 #include "eal_memcfg.h"
@@ -102,6 +104,19 @@  static struct {
 	int count; /**< entries used in an array */
 } fd_list[RTE_MAX_MEMSEG_LISTS];
 
+struct memfile {
+	char *fname;		/**< file name */
+	uint64_t hugepage_sz;	/**< size of a huge page */
+	uint32_t num_pages;	/**< number of pages */
+	uint32_t num_allocated;	/**< number of already allocated pages */
+	int socket_id;		/**< Socket ID  */
+	int fd;			/**< file descriptor */
+};
+
+struct memfile mem_file[MAX_MEMFILE_ITEMS];
+
+static int alloc_memfile;
+
 /** local copy of a memory map, used to synchronize memory hotplug in MP */
 static struct rte_memseg_list local_memsegs[RTE_MAX_MEMSEG_LISTS];
 
@@ -542,6 +557,26 @@  alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		 * stage.
 		 */
 		map_offset = 0;
+	} else if (alloc_memfile) {
+		uint32_t mf;
+
+		for (mf = 0; mf < RTE_DIM(mem_file); mf++) {
+			if (alloc_sz == mem_file[mf].hugepage_sz &&
+			    socket_id == mem_file[mf].socket_id &&
+			    mem_file[mf].num_allocated < mem_file[mf].num_pages)
+				break;
+		}
+		if (mf >= RTE_DIM(mem_file)) {
+			RTE_LOG(ERR, EAL,
+				"%s() cannot allocate from memfile\n",
+				__func__);
+			return -1;
+		}
+		fd = mem_file[mf].fd;
+		fd_list[list_idx].fds[seg_idx] = fd;
+		map_offset = mem_file[mf].num_allocated * alloc_sz;
+		mmap_flags = MAP_SHARED | MAP_POPULATE | MAP_FIXED;
+		mem_file[mf].num_allocated++;
 	} else {
 		/* takes out a read lock on segment or segment list */
 		fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
@@ -683,6 +718,10 @@  alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	if (fd < 0)
 		return -1;
 
+	/* don't cleanup pre-allocated files */
+	if (alloc_memfile)
+		return -1;
+
 	if (internal_conf->single_file_segments) {
 		resize_hugefile(fd, map_offset, alloc_sz, false);
 		/* ignore failure, can't make it any worse */
@@ -712,8 +751,9 @@  free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 	const struct internal_config *internal_conf =
 		eal_get_internal_configuration();
 
-	/* erase page data */
-	memset(ms->addr, 0, ms->len);
+	/* Erase page data unless it belongs to a pre-allocated file. */
+	if (!alloc_memfile)
+		memset(ms->addr, 0, ms->len);
 
 	if (mmap(ms->addr, ms->len, PROT_NONE,
 			MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) ==
@@ -724,8 +764,12 @@  free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 
 	eal_mem_set_dump(ms->addr, ms->len, false);
 
-	/* if we're using anonymous hugepages, nothing to be done */
-	if (internal_conf->in_memory && !memfd_create_supported) {
+	/*
+	 * if we're using anonymous hugepages or pre-allocated files,
+	 * nothing to be done
+	 */
+	if ((internal_conf->in_memory && !memfd_create_supported) ||
+			alloc_memfile) {
 		memset(ms, 0, sizeof(*ms));
 		return 0;
 	}
@@ -838,7 +882,9 @@  alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 	 * during init, we already hold a write lock, so don't try to take out
 	 * another one.
 	 */
-	if (wa->hi->lock_descriptor == -1 && !internal_conf->in_memory) {
+	if (wa->hi->lock_descriptor == -1 &&
+	    !internal_conf->in_memory &&
+	    !alloc_memfile) {
 		dir_fd = open(wa->hi->hugedir, O_RDONLY);
 		if (dir_fd < 0) {
 			RTE_LOG(ERR, EAL, "%s(): Cannot open '%s': %s\n",
@@ -868,7 +914,7 @@  alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 				need, i);
 
 			/* if exact number wasn't requested, stop */
-			if (!wa->exact)
+			if (!wa->exact || alloc_memfile)
 				goto out;
 
 			/* clean up */
@@ -1120,6 +1166,262 @@  eal_memalloc_free_seg(struct rte_memseg *ms)
 	return eal_memalloc_free_seg_bulk(&ms, 1);
 }
 
+static int
+memfile_fill_socket_id(struct memfile *mf)
+{
+#ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
+	void *va;
+	int ret;
+
+	va = mmap(NULL, mf->hugepage_sz, PROT_READ | PROT_WRITE,
+			MAP_SHARED | MAP_POPULATE, mf->fd, 0);
+	if (va == MAP_FAILED) {
+		RTE_LOG(ERR, EAL, "%s(): %s: mmap(): %s\n",
+				__func__, mf->fname, strerror(errno));
+		return -1;
+	}
+
+	ret = 0;
+	if (check_numa()) {
+		if (get_mempolicy(&mf->socket_id, NULL, 0, va,
+				MPOL_F_NODE | MPOL_F_ADDR) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: get_mempolicy(): %s\n",
+				__func__, mf->fname, strerror(errno));
+			ret = -1;
+		}
+	} else
+		mf->socket_id = 0;
+
+	munmap(va, mf->hugepage_sz);
+	return ret;
+#else
+	mf->socket_id = 0;
+	return 0;
+#endif
+}
+
+struct match_memfile_path_arg {
+	const char *path;
+	uint64_t file_sz;
+	uint64_t hugepage_sz;
+	size_t best_len;
+};
+
+/*
+ * While it is unlikely for hugetlbfs, mount points can be nested.
+ * Find the deepest mount point that contains the file.
+ */
+static int
+match_memfile_path(const char *path, uint64_t hugepage_sz, void *cb_arg)
+{
+	struct match_memfile_path_arg *arg = cb_arg;
+	size_t dir_len = strlen(path);
+
+	if (dir_len < arg->best_len)
+		return 0;
+	if (strncmp(path, arg->path, dir_len) != 0)
+		return 0;
+	if (arg->file_sz % hugepage_sz != 0)
+		return 0;
+
+	arg->hugepage_sz = hugepage_sz;
+	arg->best_len = dir_len;
+	return 0;
+}
+
+/* Determine hugepage size from the path to a file in hugetlbfs. */
+static int
+memfile_fill_hugepage_sz(struct memfile *mf, uint64_t file_sz)
+{
+	char abspath[PATH_MAX];
+	struct match_memfile_path_arg arg;
+
+	if (realpath(mf->fname, abspath) == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): realpath(): %s\n",
+				__func__, strerror(errno));
+		return -1;
+	}
+
+	memset(&arg, 0, sizeof(arg));
+	arg.path = abspath;
+	arg.file_sz = file_sz;
+	if (eal_hugepage_mount_walk(match_memfile_path, &arg) == 0 &&
+			arg.hugepage_sz != 0) {
+		mf->hugepage_sz = arg.hugepage_sz;
+		return 0;
+	}
+	return -1;
+}
+
+int
+eal_memalloc_memfile_init(void)
+{
+	struct internal_config *internal_conf =
+			eal_get_internal_configuration();
+	int err = -1, fd;
+	uint32_t i;
+
+	if (internal_conf->mem_file[0] == NULL)
+		return 0;
+
+	for (i = 0; i < RTE_DIM(internal_conf->mem_file); i++) {
+		struct memfile *mf = &mem_file[i];
+		uint64_t fsize;
+
+		if (internal_conf->mem_file[i] == NULL) {
+			err = 0;
+			break;
+		}
+		mf->fname = internal_conf->mem_file[i];
+		fd = open(mf->fname, O_RDWR, 0600);
+		mf->fd = fd;
+		if (fd < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: open(): %s\n",
+					__func__, mf->fname, strerror(errno));
+			break;
+		}
+
+		/* take out a read lock and keep it indefinitely */
+		if (lock(fd, LOCK_SH) != 1) {
+			RTE_LOG(ERR, EAL, "%s(): %s: cannot lock file\n",
+					__func__, mf->fname);
+			break;
+		}
+
+		fsize = get_file_size(fd);
+		if (!fsize) {
+			RTE_LOG(ERR, EAL, "%s(): %s: zero file length\n",
+					__func__, mf->fname);
+			break;
+		}
+
+		if (memfile_fill_hugepage_sz(mf, fsize) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: cannot detect page size\n",
+					__func__, mf->fname);
+			break;
+		}
+		mf->num_pages = fsize / mf->hugepage_sz;
+
+		if (memfile_fill_socket_id(mf) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: cannot detect NUMA node\n",
+					__func__, mf->fname);
+			break;
+		}
+	}
+
+	/* check if some problem happened */
+	if (err && i < RTE_DIM(internal_conf->mem_file)) {
+		/* some error occurred, do rollback */
+		do {
+			fd = mem_file[i].fd;
+			/* closing fd drops the lock */
+			if (fd >= 0)
+				close(fd);
+			mem_file[i].fd = -1;
+		} while (i--);
+		return -1;
+	}
+
+	/* update hugepage_info with pages allocated in files */
+	for (i = 0; i < RTE_DIM(mem_file); i++) {
+		const struct memfile *mf = &mem_file[i];
+		struct hugepage_info *hpi = NULL;
+		uint64_t sz;
+
+		if (!mf->hugepage_sz)
+			break;
+
+		for (sz = 0; sz < internal_conf->num_hugepage_sizes; sz++) {
+			hpi = &internal_conf->hugepage_info[sz];
+
+			if (mf->hugepage_sz == hpi->hugepage_sz) {
+				hpi->num_pages[mf->socket_id] += mf->num_pages;
+				break;
+			}
+		}
+
+		/* it seems hugepage info is not socket aware yet */
+		if (hpi != NULL && sz >= internal_conf->num_hugepage_sizes)
+			hpi->num_pages[0] += mf->num_pages;
+	}
+	return 0;
+}
+
+int
+eal_memalloc_memfile_alloc(struct hugepage_info *hpa)
+{
+	struct internal_config *internal_conf =
+			eal_get_internal_configuration();
+	uint32_t i, sz;
+
+	if (internal_conf->mem_file[0] == NULL ||
+			rte_eal_process_type() != RTE_PROC_PRIMARY)
+		return 0;
+
+	for (i = 0; i < RTE_DIM(mem_file); i++) {
+		struct memfile *mf = &mem_file[i];
+		uint64_t hugepage_sz = mf->hugepage_sz;
+		int socket_id = mf->socket_id;
+		struct rte_memseg **pages;
+
+		if (!hugepage_sz)
+			break;
+
+		while (mf->num_allocated < mf->num_pages) {
+			int needed, allocated, j;
+			uint32_t prev;
+
+			prev = mf->num_allocated;
+			needed = mf->num_pages - mf->num_allocated;
+			pages = malloc(sizeof(*pages) * needed);
+			if (pages == NULL)
+				return -1;
+
+			/* memalloc is locked, it's safe to switch allocator */
+			alloc_memfile = 1;
+			allocated = eal_memalloc_alloc_seg_bulk(pages,
+					needed, hugepage_sz, socket_id, false);
+			/* switch allocator back */
+			alloc_memfile = 0;
+			if (allocated <= 0) {
+				RTE_LOG(ERR, EAL, "%s(): %s: allocation failed\n",
+						__func__, mf->fname);
+				free(pages);
+				return -1;
+			}
+
+			/* mark preallocated pages as unfreeable */
+			for (j = 0; j < allocated; j++) {
+				struct rte_memseg *ms = pages[j];
+
+				ms->flags |= RTE_MEMSEG_FLAG_DO_NOT_FREE |
+					     RTE_MEMSEG_FLAG_PRE_ALLOCATED;
+			}
+
+			free(pages);
+
+			/* check whether we allocated from expected file */
+			if (prev + allocated != mf->num_allocated) {
+				RTE_LOG(ERR, EAL, "%s(): %s: incorrect allocation\n",
+						__func__, mf->fname);
+				return -1;
+			}
+		}
+
+		/* reflect that we pre-allocated some memory */
+		for (sz = 0; sz < internal_conf->num_hugepage_sizes; sz++) {
+			struct hugepage_info *hpi = &hpa[sz];
+
+			if (hpi->hugepage_sz != hugepage_sz)
+				continue;
+			hpi->num_pages[socket_id] -=
+					RTE_MIN(hpi->num_pages[socket_id],
+						mf->num_allocated);
+		}
+	}
+	return 0;
+}
+
 static int
 sync_chunk(struct rte_memseg_list *primary_msl,
 		struct rte_memseg_list *local_msl, struct hugepage_info *hi,
@@ -1178,6 +1480,14 @@  sync_chunk(struct rte_memseg_list *primary_msl,
 		if (l_ms == NULL || p_ms == NULL)
 			return -1;
 
+		/*
+		 * Switch allocator for this segment.
+		 * This function is only called during init,
+		 * so don't try to restore allocator on failure.
+		 */
+		if (p_ms->flags & RTE_MEMSEG_FLAG_PRE_ALLOCATED)
+			alloc_memfile = 1;
+
 		if (used) {
 			ret = alloc_seg(l_ms, p_ms->addr,
 					p_ms->socket_id, hi,
@@ -1191,6 +1501,9 @@  sync_chunk(struct rte_memseg_list *primary_msl,
 			if (ret < 0)
 				return -1;
 		}
+
+		/* Reset the allocator. */
+		alloc_memfile = 0;
 	}
 
 	/* if we just allocated memory, notify the application */
@@ -1392,6 +1705,9 @@  eal_memalloc_sync_with_primary(void)
 	if (rte_eal_process_type() == RTE_PROC_PRIMARY)
 		return 0;
 
+	if (eal_memalloc_memfile_init() < 0)
+		return -1;
+
 	/* memalloc is locked, so it's safe to call thread-unsafe version */
 	if (rte_memseg_list_walk_thread_unsafe(sync_walk, NULL))
 		return -1;