[21.11,v2,2/3] eal: add memory pre-allocation from existing files

Message ID: 20210716110806.2566788-3-dkozlyuk@nvidia.com (mailing list archive)
State: Superseded, archived
Delegated to: David Marchand
Series: eal: add memory pre-allocation from existing files

Checks

Context Check Description
ci/checkpatch success coding style OK

Commit Message

Dmitry Kozlyuk July 16, 2021, 11:08 a.m. UTC
  From: Viacheslav Ovsiienko <viacheslavo@nvidia.com>

The primary DPDK process launch might take a long time if initially
allocated memory is large. In practice, allocating 1 TB of memory
over 1 GB hugepages on Linux takes tens of seconds. Fast restart
is highly desired for some applications, and this launch delay
presents a problem.

The primary delay happens in this call trace:
  rte_eal_init()
    rte_eal_memory_init()
      rte_eal_hugepage_init()
        eal_dynmem_hugepage_init()
          eal_memalloc_alloc_seg_bulk()
            alloc_seg()
              mmap()

The largest part of the time spent in mmap() is filling the memory
with zeros. The kernel does so to prevent data leakage from a process
that last used the page. However, in a controlled environment this
may not be an issue, while performance is. (The Linux-specific
MAP_UNINITIALIZED flag allows mapping without clearing, but it is
disabled in all popular distributions for the reason above.)
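
For reference, here is a minimal sketch (not part of the patch) of the
mapping this flag would enable, assuming a kernel built with
CONFIG_MMAP_ALLOW_UNINITIALIZED; on any other kernel the flag is
silently ignored and pages are zeroed anyway:

  #include <stddef.h>
  #include <sys/mman.h>

  #ifndef MAP_UNINITIALIZED
  #define MAP_UNINITIALIZED 0x4000000 /* from <linux/mman.h> */
  #endif

  /* Ask for anonymous hugepages without kernel zeroing; honored only
   * with CONFIG_MMAP_ALLOW_UNINITIALIZED enabled at kernel build time.
   */
  static void *
  map_huge_uninitialized(size_t len)
  {
          return mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
                      MAP_UNINITIALIZED, -1, 0);
  }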

It is proposed to add a new EAL option, --mem-file FILE1,FILE2,...,
to map hugepages "as is" from the specified FILEs in hugetlbfs.
Compared to using external memory for this task, the EAL option
requires no changes to application code, while still allowing
the administrator to control hugepage sizes and their NUMA affinity.
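
For example, the FILEs might be prepared as follows (an illustrative
sketch, not part of the patch; the mount point, sizes, and application
name are placeholders):

  # Pre-allocate an 8 GB file on NUMA node 0 from a 1 GB hugetlbfs
  # mount. hugetlbfs does not support plain write(2), so the file is
  # populated with fallocate(1), supported on hugetlbfs since
  # Linux 4.3.
  mount -t hugetlbfs -o pagesize=1G none /mnt/huge-1G
  numactl --membind=0 fallocate -l 8G /mnt/huge-1G/node0

  # Every subsequent launch maps the still-populated file "as is":
  ./app --mem-file /mnt/huge-1G/node0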

Limitations of the feature:

* Linux-specific (only Linux maps hugepages from files).
* Incompatible with --legacy-mem (partially replaces it).
* Incompatible with --single-file-segments
  (--mem-file FILEs can contain as many segments as needed).
* Incompatible with --in-memory (logically).

A warning about possible security implications is printed
when --mem-file is used.

Until this patch, the DPDK allocator always cleared memory on freeing,
so that it did not have to do so on allocation, while new memory
was cleared by the kernel. When --mem-file is in use, DPDK clears memory
after allocation in rte_zmalloc() and does not clear it on freeing.
Effectively, the user trades fast startup for an occasional allocation
slowdown whenever clearing is actually necessary. When memory is
recycled, it is cleared again, which is suboptimal per se, but saves
the complication of memory management.
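
In outline, the new policy amounts to the following (a simplified
sketch with illustrative names; the real changes are in
malloc_elem_free() and rte_zmalloc_socket() in the patch below):

  #include <stdbool.h>
  #include <string.h>

  static bool clear_on_alloc; /* set when --mem-file is in use */

  /* On free: keep the pre-patch behavior of zeroing, unless zeroing
   * has been moved to allocation time by --mem-file.
   */
  static void
  zero_on_free(void *ptr, size_t len)
  {
          if (!clear_on_alloc)
                  memset(ptr, 0, len);
  }

  /* In rte_zmalloc(): pay the clearing cost only when neither the
   * kernel nor free() has already zeroed the memory.
   */
  static void
  zero_on_alloc(void *ptr, size_t len)
  {
          if (clear_on_alloc)
                  memset(ptr, 0, len);
  }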

Signed-off-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
Signed-off-by: Dmitry Kozlyuk <dkozlyuk@nvidia.com>
---
 doc/guides/linux_gsg/linux_eal_parameters.rst |  17 +
 lib/eal/common/eal_common_dynmem.c            |   6 +
 lib/eal/common/eal_common_options.c           |  23 ++
 lib/eal/common/eal_internal_cfg.h             |   4 +
 lib/eal/common/eal_memalloc.h                 |   8 +-
 lib/eal/common/eal_options.h                  |   2 +
 lib/eal/common/malloc_elem.c                  |   5 +
 lib/eal/common/malloc_heap.h                  |   8 +
 lib/eal/common/rte_malloc.c                   |  16 +-
 lib/eal/include/rte_memory.h                  |   4 +-
 lib/eal/linux/eal.c                           |  28 ++
 lib/eal/linux/eal_hugepage_info.c             |   5 +
 lib/eal/linux/eal_memalloc.c                  | 328 +++++++++++++++++-
 13 files changed, 441 insertions(+), 13 deletions(-)
  

Patch

diff --git a/doc/guides/linux_gsg/linux_eal_parameters.rst b/doc/guides/linux_gsg/linux_eal_parameters.rst
index bd3977cb3d..b465feaea8 100644
--- a/doc/guides/linux_gsg/linux_eal_parameters.rst
+++ b/doc/guides/linux_gsg/linux_eal_parameters.rst
@@ -92,6 +92,23 @@  Memory-related options
 
     Free hugepages back to system exactly as they were originally allocated.
 
+*   ``--mem-file <pre-allocated files>``
+
+    Use memory from pre-allocated files in ``hugetlbfs`` without clearing it;
+    when this memory is exhausted, switch to default dynamic allocation.
+    This speeds up startup compared to ``--legacy-mem`` while also avoiding
+    later delays for allocating new hugepages. One downside is the slowdown
+    of all zeroed-memory allocations. Security warning: an application
+    can access contents left by previous users of hugepages. Multiple files
+    can be pre-allocated in ``hugetlbfs`` with different page sizes,
+    on desired NUMA nodes, using ``mount`` options and ``numactl``:
+
+        --mem-file /mnt/huge-1G/node0,/mnt/huge-1G/node1,/mnt/huge-2M/extra
+
+    This option is incompatible with ``--legacy-mem``, ``--in-memory``,
+    and ``--single-file-segments``. Primary and secondary processes
+    must specify exactly the same list of files.
+
 Other options
 ~~~~~~~~~~~~~
 
diff --git a/lib/eal/common/eal_common_dynmem.c b/lib/eal/common/eal_common_dynmem.c
index 7c5437ddfa..abcf22f097 100644
--- a/lib/eal/common/eal_common_dynmem.c
+++ b/lib/eal/common/eal_common_dynmem.c
@@ -272,6 +272,12 @@  eal_dynmem_hugepage_init(void)
 			internal_conf->num_hugepage_sizes) < 0)
 		return -1;
 
+#ifdef RTE_EXEC_ENV_LINUX
+	/* pre-allocate pages from --mem-file option files */
+	if (eal_memalloc_memfile_alloc(used_hp) < 0)
+		return -1;
+#endif
+
 	for (hp_sz_idx = 0;
 			hp_sz_idx < (int)internal_conf->num_hugepage_sizes;
 			hp_sz_idx++) {
diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
index ff5861b5f3..c729c36630 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -86,6 +86,7 @@  eal_long_options[] = {
 	{OPT_MASTER_LCORE,      1, NULL, OPT_MASTER_LCORE_NUM     },
 	{OPT_MAIN_LCORE,        1, NULL, OPT_MAIN_LCORE_NUM       },
 	{OPT_MBUF_POOL_OPS_NAME, 1, NULL, OPT_MBUF_POOL_OPS_NAME_NUM},
+	{OPT_MEM_FILE,          1, NULL, OPT_MEM_FILE_NUM         },
 	{OPT_NO_HPET,           0, NULL, OPT_NO_HPET_NUM          },
 	{OPT_NO_HUGE,           0, NULL, OPT_NO_HUGE_NUM          },
 	{OPT_NO_PCI,            0, NULL, OPT_NO_PCI_NUM           },
@@ -1898,6 +1899,8 @@  eal_cleanup_config(struct internal_config *internal_cfg)
 		free(internal_cfg->hugepage_dir);
 	if (internal_cfg->user_mbuf_pool_ops_name != NULL)
 		free(internal_cfg->user_mbuf_pool_ops_name);
+	if (internal_cfg->mem_file[0])
+		free(internal_cfg->mem_file[0]);
 
 	return 0;
 }
@@ -2018,6 +2021,26 @@  eal_check_common_options(struct internal_config *internal_cfg)
 			"amount of reserved memory can be adjusted with "
 			"-m or --"OPT_SOCKET_MEM"\n");
 	}
+	if (internal_cfg->mem_file[0] && internal_cfg->legacy_mem) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_LEGACY_MEM"\n");
+		return -1;
+	}
+	if (internal_cfg->mem_file[0] && internal_cfg->no_hugetlbfs) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_NO_HUGE"\n");
+		return -1;
+	}
+	if (internal_cfg->mem_file[0] && internal_cfg->in_memory) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_IN_MEMORY"\n");
+		return -1;
+	}
+	if (internal_cfg->mem_file[0] && internal_cfg->single_file_segments) {
+		RTE_LOG(ERR, EAL, "Option --"OPT_MEM_FILE" is not compatible "
+				"with --"OPT_SINGLE_FILE_SEGMENTS"\n");
+		return -1;
+	}
 
 	return 0;
 }
diff --git a/lib/eal/common/eal_internal_cfg.h b/lib/eal/common/eal_internal_cfg.h
index d6c0470eb8..814d5c66e1 100644
--- a/lib/eal/common/eal_internal_cfg.h
+++ b/lib/eal/common/eal_internal_cfg.h
@@ -22,6 +22,9 @@ 
 #define MAX_HUGEPAGE_SIZES 3  /**< support up to 3 page sizes */
 #endif
 
+#define MAX_MEMFILE_ITEMS (MAX_HUGEPAGE_SIZES * RTE_MAX_NUMA_NODES)
+/**< Maximal number of mem-file parameters. */
+
 /*
  * internal configuration structure for the number, size and
  * mount points of hugepages
@@ -83,6 +86,7 @@  struct internal_config {
 	rte_uuid_t vfio_vf_token;
 	char *hugefile_prefix;      /**< the base filename of hugetlbfs files */
 	char *hugepage_dir;         /**< specific hugetlbfs directory to use */
+	char *mem_file[MAX_MEMFILE_ITEMS]; /**< pre-allocated memory files */
 	char *user_mbuf_pool_ops_name;
 			/**< user defined mbuf pool ops name */
 	unsigned num_hugepage_sizes;      /**< how many sizes on this system */
diff --git a/lib/eal/common/eal_memalloc.h b/lib/eal/common/eal_memalloc.h
index ebc3a6f6c1..d92c9a167b 100644
--- a/lib/eal/common/eal_memalloc.h
+++ b/lib/eal/common/eal_memalloc.h
@@ -8,7 +8,7 @@ 
 #include <stdbool.h>
 
 #include <rte_memory.h>
-
+#include "eal_internal_cfg.h"
 /*
  * Allocate segment of specified page size.
  */
@@ -96,4 +96,10 @@  eal_memalloc_init(void);
 int
 eal_memalloc_cleanup(void);
 
+int
+eal_memalloc_memfile_init(void);
+
+int
+eal_memalloc_memfile_alloc(struct hugepage_info *hpa);
+
 #endif /* EAL_MEMALLOC_H */
diff --git a/lib/eal/common/eal_options.h b/lib/eal/common/eal_options.h
index 7b348e707f..c6c634b2b2 100644
--- a/lib/eal/common/eal_options.h
+++ b/lib/eal/common/eal_options.h
@@ -93,6 +93,8 @@  enum {
 	OPT_NO_TELEMETRY_NUM,
 #define OPT_FORCE_MAX_SIMD_BITWIDTH  "force-max-simd-bitwidth"
 	OPT_FORCE_MAX_SIMD_BITWIDTH_NUM,
+#define OPT_MEM_FILE          "mem-file"
+	OPT_MEM_FILE_NUM,
 
 	/* legacy option that will be removed in future */
 #define OPT_PCI_BLACKLIST     "pci-blacklist"
diff --git a/lib/eal/common/malloc_elem.c b/lib/eal/common/malloc_elem.c
index c2c9461f1d..6e71029a3c 100644
--- a/lib/eal/common/malloc_elem.c
+++ b/lib/eal/common/malloc_elem.c
@@ -578,8 +578,13 @@  malloc_elem_free(struct malloc_elem *elem)
 	/* decrease heap's count of allocated elements */
 	elem->heap->alloc_count--;
 
+#ifdef RTE_MALLOC_DEBUG
 	/* poison memory */
 	memset(ptr, MALLOC_POISON, data_len);
+#else
+	if (!malloc_clear_on_alloc())
+		memset(ptr, 0, data_len);
+#endif
 
 	return elem;
 }
diff --git a/lib/eal/common/malloc_heap.h b/lib/eal/common/malloc_heap.h
index 772736b53f..72b64d8052 100644
--- a/lib/eal/common/malloc_heap.h
+++ b/lib/eal/common/malloc_heap.h
@@ -10,6 +10,7 @@ 
 
 #include <rte_malloc.h>
 #include <rte_spinlock.h>
+#include "eal_private.h"
 
 /* Number of free lists per heap, grouped by size. */
 #define RTE_HEAP_NUM_FREELISTS  13
@@ -48,6 +49,13 @@  malloc_get_numa_socket(void)
 	return socket_id;
 }
 
+static inline bool
+malloc_clear_on_alloc(void)
+{
+	const struct internal_config *cfg = eal_get_internal_configuration();
+	return cfg->mem_file[0] != NULL;
+}
+
 void *
 malloc_heap_alloc(const char *type, size_t size, int socket, unsigned int flags,
 		size_t align, size_t bound, bool contig);
diff --git a/lib/eal/common/rte_malloc.c b/lib/eal/common/rte_malloc.c
index 9d39e58c08..ce94268aca 100644
--- a/lib/eal/common/rte_malloc.c
+++ b/lib/eal/common/rte_malloc.c
@@ -113,17 +113,23 @@  rte_malloc(const char *type, size_t size, unsigned align)
 void *
 rte_zmalloc_socket(const char *type, size_t size, unsigned align, int socket)
 {
+	bool zero;
 	void *ptr = rte_malloc_socket(type, size, align, socket);
 
-#ifdef RTE_MALLOC_DEBUG
 	/*
 	 * If DEBUG is enabled, then freed memory is marked with poison
-	 * value and set to zero on allocation.
-	 * If DEBUG is not enabled then  memory is already zeroed.
+	 * value and must be set to zero on allocation.
+	 * If DEBUG is not enabled then it is configurable
+	 * whether memory comes already set to zero by memalloc or on free
+	 * or it must be set to zero here.
 	 */
-	if (ptr != NULL)
-		memset(ptr, 0, size);
+#ifdef RTE_MALLOC_DEBUG
+	zero = true;
+#else
+	zero = malloc_clear_on_alloc();
 #endif
+	if (ptr != NULL && zero)
+		memset(ptr, 0, size);
 
 	rte_eal_trace_mem_zmalloc(type, size, align, socket, ptr);
 	return ptr;
diff --git a/lib/eal/include/rte_memory.h b/lib/eal/include/rte_memory.h
index bba9b5300a..9a2b191314 100644
--- a/lib/eal/include/rte_memory.h
+++ b/lib/eal/include/rte_memory.h
@@ -40,7 +40,9 @@  extern "C" {
 /**
  * Physical memory segment descriptor.
  */
-#define RTE_MEMSEG_FLAG_DO_NOT_FREE (1 << 0)
+#define RTE_MEMSEG_FLAG_DO_NOT_FREE   (1 << 0)
 /**< Prevent this segment from being freed back to the OS. */
+#define RTE_MEMSEG_FLAG_PRE_ALLOCATED (1 << 1)
+/**< This segment was mapped from a pre-allocated file (--mem-file). */
 struct rte_memseg {
 	rte_iova_t iova;            /**< Start IO address. */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 3577eaeaa4..d0afcd8326 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -548,6 +548,7 @@  eal_usage(const char *prgname)
 	       "  --"OPT_LEGACY_MEM"        Legacy memory mode (no dynamic allocation, contiguous segments)\n"
 	       "  --"OPT_SINGLE_FILE_SEGMENTS" Put all hugepage memory in single files\n"
 	       "  --"OPT_MATCH_ALLOCATIONS" Free hugepages exactly as allocated\n"
+	       "  --"OPT_MEM_FILE"          Comma-separated list of files in hugetlbfs.\n"
 	       "\n");
 	/* Allow the application to print its usage message too if hook is set */
 	if (hook) {
@@ -678,6 +679,26 @@  eal_log_level_parse(int argc, char **argv)
 	optarg = old_optarg;
 }
 
+static int
+eal_parse_memfile_arg(const char *arg, char **mem_file)
+{
+	int ret;
+
+	char *copy = strdup(arg);
+	if (copy == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot store --"OPT_MEM_FILE" names\n");
+		return -1;
+	}
+
+	ret = rte_strsplit(copy, strlen(copy), mem_file,
+			MAX_MEMFILE_ITEMS, ',');
+	if (ret <= 0) {
+		free(copy);
+		return -1;
+	}
+	return 0;
+}
+
 /* Parse the argument given in the command line of the application */
 static int
 eal_parse_args(int argc, char **argv)
@@ -819,6 +836,17 @@  eal_parse_args(int argc, char **argv)
 			internal_conf->match_allocations = 1;
 			break;
 
+		case OPT_MEM_FILE_NUM:
+			if (eal_parse_memfile_arg(optarg,
+					internal_conf->mem_file) < 0) {
+				RTE_LOG(ERR, EAL, "invalid parameters for --"
+						OPT_MEM_FILE "\n");
+				eal_usage(prgname);
+				ret = -1;
+				goto out;
+			}
+			break;
+
 		default:
 			if (opt < OPT_LONG_MIN_NUM && isprint(opt)) {
 				RTE_LOG(ERR, EAL, "Option %c is not supported "
diff --git a/lib/eal/linux/eal_hugepage_info.c b/lib/eal/linux/eal_hugepage_info.c
index a090c0a5b5..0c262342a5 100644
--- a/lib/eal/linux/eal_hugepage_info.c
+++ b/lib/eal/linux/eal_hugepage_info.c
@@ -37,6 +37,7 @@ 
 #include "eal_hugepages.h"
 #include "eal_hugepage_info.h"
 #include "eal_filesystem.h"
+#include "eal_memalloc.h"
 
 static const char sys_dir_path[] = "/sys/kernel/mm/hugepages";
 static const char sys_pages_numa_dir_path[] = "/sys/devices/system/node";
@@ -515,6 +516,10 @@  hugepage_info_init(void)
 	qsort(&internal_conf->hugepage_info[0], num_sizes,
 	      sizeof(internal_conf->hugepage_info[0]), compare_hpi);
 
+	/* add pre-allocated pages with --mem-file option to available ones */
+	if (eal_memalloc_memfile_init())
+		return -1;
+
 	/* now we have all info, check we have at least one valid size */
 	for (i = 0; i < num_sizes; i++) {
 		/* pages may no longer all be on socket 0, so check all */
diff --git a/lib/eal/linux/eal_memalloc.c b/lib/eal/linux/eal_memalloc.c
index 0ec8542283..c2b3586204 100644
--- a/lib/eal/linux/eal_memalloc.c
+++ b/lib/eal/linux/eal_memalloc.c
@@ -18,6 +18,7 @@ 
 #include <unistd.h>
 #include <limits.h>
 #include <fcntl.h>
+#include <mntent.h>
 #include <sys/ioctl.h>
 #include <sys/time.h>
 #include <signal.h>
@@ -41,6 +42,7 @@ 
 #include <rte_spinlock.h>
 
 #include "eal_filesystem.h"
+#include "eal_hugepage_info.h"
 #include "eal_internal_cfg.h"
 #include "eal_memalloc.h"
 #include "eal_memcfg.h"
@@ -102,6 +104,19 @@  static struct {
 	int count; /**< entries used in an array */
 } fd_list[RTE_MAX_MEMSEG_LISTS];
 
+struct memfile {
+	char *fname;		/**< file name */
+	uint64_t hugepage_sz;	/**< size of a huge page */
+	uint32_t num_pages;	/**< number of pages */
+	uint32_t num_allocated;	/**< number of already allocated pages */
+	int socket_id;		/**< socket ID */
+	int fd;			/**< file descriptor */
+};
+
+static struct memfile mem_file[MAX_MEMFILE_ITEMS];
+
+static int alloc_memfile;
+
 /** local copy of a memory map, used to synchronize memory hotplug in MP */
 static struct rte_memseg_list local_memsegs[RTE_MAX_MEMSEG_LISTS];
 
@@ -542,6 +557,26 @@  alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 		 * stage.
 		 */
 		map_offset = 0;
+	} else if (alloc_memfile) {
+		uint32_t mf;
+
+		for (mf = 0; mf < RTE_DIM(mem_file); mf++) {
+			if (alloc_sz == mem_file[mf].hugepage_sz &&
+			    socket_id == mem_file[mf].socket_id &&
+			    mem_file[mf].num_allocated < mem_file[mf].num_pages)
+				break;
+		}
+		if (mf >= RTE_DIM(mem_file)) {
+			RTE_LOG(ERR, EAL,
+				"%s() cannot allocate from memfile\n",
+				__func__);
+			return -1;
+		}
+		fd = mem_file[mf].fd;
+		fd_list[list_idx].fds[seg_idx] = fd;
+		map_offset = mem_file[mf].num_allocated * alloc_sz;
+		mmap_flags = MAP_SHARED | MAP_POPULATE | MAP_FIXED;
+		mem_file[mf].num_allocated++;
 	} else {
 		/* takes out a read lock on segment or segment list */
 		fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
@@ -683,6 +718,10 @@  alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	if (fd < 0)
 		return -1;
 
+	/* don't cleanup pre-allocated files */
+	if (alloc_memfile)
+		return -1;
+
 	if (internal_conf->single_file_segments) {
 		resize_hugefile(fd, map_offset, alloc_sz, false);
 		/* ignore failure, can't make it any worse */
@@ -712,8 +751,9 @@  free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 	const struct internal_config *internal_conf =
 		eal_get_internal_configuration();
 
-	/* erase page data */
-	memset(ms->addr, 0, ms->len);
+	/* Erase page data unless it's pre-allocated files. */
+	if (!alloc_memfile)
+		memset(ms->addr, 0, ms->len);
 
 	if (mmap(ms->addr, ms->len, PROT_NONE,
 			MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) ==
@@ -724,8 +764,12 @@  free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 
 	eal_mem_set_dump(ms->addr, ms->len, false);
 
-	/* if we're using anonymous hugepages, nothing to be done */
-	if (internal_conf->in_memory && !memfd_create_supported) {
+	/*
+	 * if we're using anonymous hugepages or pre-allocated files,
+	 * nothing to be done
+	 */
+	if ((internal_conf->in_memory && !memfd_create_supported) ||
+			alloc_memfile) {
 		memset(ms, 0, sizeof(*ms));
 		return 0;
 	}
@@ -838,7 +882,9 @@  alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 	 * during init, we already hold a write lock, so don't try to take out
 	 * another one.
 	 */
-	if (wa->hi->lock_descriptor == -1 && !internal_conf->in_memory) {
+	if (wa->hi->lock_descriptor == -1 &&
+	    !internal_conf->in_memory &&
+	    !alloc_memfile) {
 		dir_fd = open(wa->hi->hugedir, O_RDONLY);
 		if (dir_fd < 0) {
 			RTE_LOG(ERR, EAL, "%s(): Cannot open '%s': %s\n",
@@ -868,7 +914,7 @@  alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 				need, i);
 
 			/* if exact number wasn't requested, stop */
-			if (!wa->exact)
+			if (!wa->exact || alloc_memfile)
 				goto out;
 
 			/* clean up */
@@ -1120,6 +1166,262 @@  eal_memalloc_free_seg(struct rte_memseg *ms)
 	return eal_memalloc_free_seg_bulk(&ms, 1);
 }
 
+static int
+memfile_fill_socket_id(struct memfile *mf)
+{
+#ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
+	void *va;
+	int ret;
+
+	va = mmap(NULL, mf->hugepage_sz, PROT_READ | PROT_WRITE,
+			MAP_SHARED | MAP_POPULATE, mf->fd, 0);
+	if (va == MAP_FAILED) {
+		RTE_LOG(ERR, EAL, "%s(): %s: mmap(): %s\n",
+				__func__, mf->fname, strerror(errno));
+		return -1;
+	}
+
+	ret = 0;
+	if (check_numa()) {
+		if (get_mempolicy(&mf->socket_id, NULL, 0, va,
+				MPOL_F_NODE | MPOL_F_ADDR) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: get_mempolicy(): %s\n",
+				__func__, mf->fname, strerror(errno));
+			ret = -1;
+		}
+	} else
+		mf->socket_id = 0;
+
+	munmap(va, mf->hugepage_sz);
+	return ret;
+#else
+	mf->socket_id = 0;
+	return 0;
+#endif
+}
+
+struct match_memfile_path_arg {
+	const char *path;
+	uint64_t file_sz;
+	uint64_t hugepage_sz;
+	size_t best_len;
+};
+
+/*
+ * While it is unlikely for hugetlbfs, mount points can be nested.
+ * Find the deepest mount point that contains the file.
+ */
+static int
+match_memfile_path(const char *path, uint64_t hugepage_sz, void *cb_arg)
+{
+	struct match_memfile_path_arg *arg = cb_arg;
+	size_t dir_len = strlen(path);
+
+	if (dir_len < arg->best_len)
+		return 0;
+	if (strncmp(path, arg->path, dir_len) != 0)
+		return 0;
+	if (arg->file_sz % hugepage_sz != 0)
+		return 0;
+
+	arg->hugepage_sz = hugepage_sz;
+	arg->best_len = dir_len;
+	return 0;
+}
+
+/* Determine hugepage size from the path to a file in hugetlbfs. */
+static int
+memfile_fill_hugepage_sz(struct memfile *mf, uint64_t file_sz)
+{
+	char abspath[PATH_MAX];
+	struct match_memfile_path_arg arg;
+
+	if (realpath(mf->fname, abspath) == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): realpath(): %s\n",
+				__func__, strerror(errno));
+		return -1;
+	}
+
+	memset(&arg, 0, sizeof(arg));
+	arg.path = abspath;
+	arg.file_sz = file_sz;
+	if (eal_hugepage_mount_walk(match_memfile_path, &arg) == 0 &&
+			arg.hugepage_sz != 0) {
+		mf->hugepage_sz = arg.hugepage_sz;
+		return 0;
+	}
+	return -1;
+}
+
+int
+eal_memalloc_memfile_init(void)
+{
+	struct internal_config *internal_conf =
+			eal_get_internal_configuration();
+	int err = -1, fd;
+	uint32_t i;
+
+	if (internal_conf->mem_file[0] == NULL)
+		return 0;
+
+	for (i = 0; i < RTE_DIM(internal_conf->mem_file); i++) {
+		struct memfile *mf = &mem_file[i];
+		uint64_t fsize;
+
+		if (internal_conf->mem_file[i] == NULL) {
+			err = 0;
+			break;
+		}
+		mf->fname = internal_conf->mem_file[i];
+		fd = open(mf->fname, O_RDWR);
+		mf->fd = fd;
+		if (fd < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: open(): %s\n",
+					__func__, mf->fname, strerror(errno));
+			break;
+		}
+
+		/* take out a read lock and keep it indefinitely */
+		if (lock(fd, LOCK_SH) != 1) {
+			RTE_LOG(ERR, EAL, "%s(): %s: cannot lock file\n",
+					__func__, mf->fname);
+			break;
+		}
+
+		fsize = get_file_size(fd);
+		if (!fsize) {
+			RTE_LOG(ERR, EAL, "%s(): %s: zero file length\n",
+					__func__, mf->fname);
+			break;
+		}
+
+		if (memfile_fill_hugepage_sz(mf, fsize) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: cannot detect page size\n",
+					__func__, mf->fname);
+			break;
+		}
+		mf->num_pages = fsize / mf->hugepage_sz;
+
+		if (memfile_fill_socket_id(mf) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): %s: cannot detect NUMA node\n",
+					__func__, mf->fname);
+			break;
+		}
+	}
+
+	/* check if some problem happened */
+	if (err && i < RTE_DIM(internal_conf->mem_file)) {
+		/* some error occurred, do rollback */
+		do {
+			fd = mem_file[i].fd;
+			/* closing fd drops the lock */
+			if (fd >= 0)
+				close(fd);
+			mem_file[i].fd = -1;
+		} while (i--);
+		return -1;
+	}
+
+	/* update hugepage_info with pages allocated in files */
+	for (i = 0; i < RTE_DIM(mem_file); i++) {
+		const struct memfile *mf = &mem_file[i];
+		struct hugepage_info *hpi = NULL;
+		uint64_t sz;
+
+		if (!mf->hugepage_sz)
+			break;
+
+		for (sz = 0; sz < internal_conf->num_hugepage_sizes; sz++) {
+			hpi = &internal_conf->hugepage_info[sz];
+
+			if (mf->hugepage_sz == hpi->hugepage_sz) {
+				hpi->num_pages[mf->socket_id] += mf->num_pages;
+				break;
+			}
+		}
+
+		/* it seems hugepage info is not socket aware yet */
+		if (hpi != NULL && sz >= internal_conf->num_hugepage_sizes)
+			hpi->num_pages[0] += mf->num_pages;
+	}
+	return 0;
+}
+
+int
+eal_memalloc_memfile_alloc(struct hugepage_info *hpa)
+{
+	struct internal_config *internal_conf =
+			eal_get_internal_configuration();
+	uint32_t i, sz;
+
+	if (internal_conf->mem_file[0] == NULL ||
+			rte_eal_process_type() != RTE_PROC_PRIMARY)
+		return 0;
+
+	for (i = 0; i < RTE_DIM(mem_file); i++) {
+		struct memfile *mf = &mem_file[i];
+		uint64_t hugepage_sz = mf->hugepage_sz;
+		int socket_id = mf->socket_id;
+		struct rte_memseg **pages;
+
+		if (!hugepage_sz)
+			break;
+
+		while (mf->num_allocated < mf->num_pages) {
+			int needed, allocated, j;
+			uint32_t prev;
+
+			prev = mf->num_allocated;
+			needed = mf->num_pages - mf->num_allocated;
+			pages = malloc(sizeof(*pages) * needed);
+			if (pages == NULL)
+				return -1;
+
+			/* memalloc is locked, it's safe to switch allocator */
+			alloc_memfile = 1;
+			allocated = eal_memalloc_alloc_seg_bulk(pages,
+					needed, hugepage_sz, socket_id, false);
+			/* switch allocator back */
+			alloc_memfile = 0;
+			if (allocated <= 0) {
+				RTE_LOG(ERR, EAL, "%s(): %s: allocation failed\n",
+						__func__, mf->fname);
+				free(pages);
+				return -1;
+			}
+
+			/* mark preallocated pages as unfreeable */
+			for (j = 0; j < allocated; j++) {
+				struct rte_memseg *ms = pages[j];
+
+				ms->flags |= RTE_MEMSEG_FLAG_DO_NOT_FREE |
+					     RTE_MEMSEG_FLAG_PRE_ALLOCATED;
+			}
+
+			free(pages);
+
+			/* check whether we allocated from expected file */
+			if (prev + allocated != mf->num_allocated) {
+				RTE_LOG(ERR, EAL, "%s(): %s: incorrect allocation\n",
+						__func__, mf->fname);
+				return -1;
+			}
+		}
+
+		/* reflect we pre-allocated some memory */
+		for (sz = 0; sz < internal_conf->num_hugepage_sizes; sz++) {
+			struct hugepage_info *hpi = &hpa[sz];
+
+			if (hpi->hugepage_sz != hugepage_sz)
+				continue;
+			hpi->num_pages[socket_id] -=
+					RTE_MIN(hpi->num_pages[socket_id],
+						mf->num_allocated);
+		}
+	}
+	return 0;
+}
+
 static int
 sync_chunk(struct rte_memseg_list *primary_msl,
 		struct rte_memseg_list *local_msl, struct hugepage_info *hi,
@@ -1178,6 +1480,14 @@  sync_chunk(struct rte_memseg_list *primary_msl,
 		if (l_ms == NULL || p_ms == NULL)
 			return -1;
 
+		/*
+		 * Switch allocator for this segment.
+		 * This function is only called during init,
+		 * so don't try to restore allocator on failure.
+		 */
+		if (p_ms->flags & RTE_MEMSEG_FLAG_PRE_ALLOCATED)
+			alloc_memfile = 1;
+
 		if (used) {
 			ret = alloc_seg(l_ms, p_ms->addr,
 					p_ms->socket_id, hi,
@@ -1191,6 +1501,9 @@  sync_chunk(struct rte_memseg_list *primary_msl,
 			if (ret < 0)
 				return -1;
 		}
+
+		/* Reset the allocator. */
+		alloc_memfile = 0;
 	}
 
 	/* if we just allocated memory, notify the application */
@@ -1392,6 +1705,9 @@  eal_memalloc_sync_with_primary(void)
 	if (rte_eal_process_type() == RTE_PROC_PRIMARY)
 		return 0;
 
+	if (eal_memalloc_memfile_init() < 0)
+		return -1;
+
 	/* memalloc is locked, so it's safe to call thread-unsafe version */
 	if (rte_memseg_list_walk_thread_unsafe(sync_walk, NULL))
 		return -1;