From patchwork Thu May 31 10:57:48 2018
X-Patchwork-Submitter: Anatoly Burakov
X-Patchwork-Id: 40536
X-Patchwork-Delegate: thomas@monjalon.net
From: Anatoly Burakov
To: dev@dpdk.org
Cc: thomas@monjalon.net, hemant.agrawal@nxp.com, bruce.richardson@intel.com,
 ferruh.yigit@intel.com, konstantin.ananyev@intel.com,
 jerin.jacob@caviumnetworks.com, olivier.matz@6wind.com,
 stephen@networkplumber.org, nhorman@tuxdriver.com, david.marchand@6wind.com,
 gowrishankar.m@linux.vnet.ibm.com
Date: Thu, 31 May 2018 11:57:48 +0100
Subject: [dpdk-dev] [RFC 1/3] fbarray: allow zero-sized elements

We need to keep usage of our memory area indexed, but we don't actually
need to store any data - we only need the indexing capabilities of
fbarray. Yet, fbarray currently disallows zero-sized elements. Fix this
by removing the check for zero-sized elements; the rest of the code
already works correctly with them.
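To illustrate the use case this enables - a minimal sketch, not part of
the patch itself; the array name and sizes here are made up for the
example:

    /* An fbarray with zero-sized elements acts as a pure occupancy
     * index: no data is stored, only per-slot used/free state. */
    struct rte_fbarray map;
    int idx, i;

    /* 1024 slots of 0 bytes each - legal once this patch is applied */
    if (rte_fbarray_init(&map, "occupancy", 1024, 0) < 0)
        rte_panic("cannot create occupancy index\n");

    /* reserve a run of 4 consecutive slots */
    idx = rte_fbarray_find_next_n_free(&map, 0, 4);
    if (idx >= 0) {
        for (i = idx; i < idx + 4; i++)
            rte_fbarray_set_used(&map, i);
    }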
Signed-off-by: Anatoly Burakov
---
 lib/librte_eal/common/eal_common_fbarray.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_fbarray.c b/lib/librte_eal/common/eal_common_fbarray.c
index 019f84c18..4a365e7ce 100644
--- a/lib/librte_eal/common/eal_common_fbarray.c
+++ b/lib/librte_eal/common/eal_common_fbarray.c
@@ -391,9 +391,9 @@ set_used(struct rte_fbarray *arr, unsigned int idx, bool used)
 }
 
 static int
-fully_validate(const char *name, unsigned int elt_sz, unsigned int len)
+fully_validate(const char *name, unsigned int len)
 {
-	if (name == NULL || elt_sz == 0 || len == 0 || len > INT_MAX) {
+	if (name == NULL || len == 0 || len > INT_MAX) {
 		rte_errno = EINVAL;
 		return -1;
 	}
@@ -420,7 +420,7 @@ rte_fbarray_init(struct rte_fbarray *arr, const char *name, unsigned int len,
 		return -1;
 	}
 
-	if (fully_validate(name, elt_sz, len))
+	if (fully_validate(name, len))
 		return -1;
 
 	page_sz = sysconf(_SC_PAGESIZE);
@@ -511,7 +511,7 @@ rte_fbarray_attach(struct rte_fbarray *arr)
 	 * the array, so the parts we care about will not race.
 	 */
 
-	if (fully_validate(arr->name, arr->elt_sz, arr->len))
+	if (fully_validate(arr->name, arr->len))
 		return -1;
 
 	page_sz = sysconf(_SC_PAGESIZE);
@@ -858,7 +858,7 @@ rte_fbarray_dump_metadata(struct rte_fbarray *arr, FILE *f)
 		return;
 	}
 
-	if (fully_validate(arr->name, arr->elt_sz, arr->len)) {
+	if (fully_validate(arr->name, arr->len)) {
 		fprintf(f, "Invalid file-backed array\n");
 		goto out;
 	}
From patchwork Thu May 31 10:57:49 2018
X-Patchwork-Submitter: Anatoly Burakov
X-Patchwork-Id: 40537
X-Patchwork-Delegate: thomas@monjalon.net
From: Anatoly Burakov
To: dev@dpdk.org
Cc: thomas@monjalon.net, hemant.agrawal@nxp.com, bruce.richardson@intel.com,
 ferruh.yigit@intel.com, konstantin.ananyev@intel.com,
 jerin.jacob@caviumnetworks.com, olivier.matz@6wind.com,
 stephen@networkplumber.org, nhorman@tuxdriver.com, david.marchand@6wind.com,
 gowrishankar.m@linux.vnet.ibm.com
Date: Thu, 31 May 2018 11:57:49 +0100
Message-Id: <5281a08a4d08db8692b38905ae1c6035343bca99.1527764061.git.anatoly.burakov@intel.com>
Subject: [dpdk-dev] [RFC 2/3] mem: add device memory reserve/free API

In order for hotplug in multiprocess to work reliably, we will need a
common shared memory area that is guaranteed to be accessible to all
processes at all times. This is accomplished by pre-reserving memory
that will be used for device mappings at startup, and managing it at
runtime.

Two new API calls are added: alloc and free of device memory. Once an
allocation is requested, the memory is considered reserved until it is
freed back using the same API. Which blocks are occupied is tracked in a
shared fbarray. This allows us to give out device memory piecemeal and
lessen fragmentation.

Naturally, this limits how much device memory DPDK can use. The limit is
currently set to 2 gigabytes, but will be made adjustable in later
revisions.
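As a usage sketch (illustrative only - the two-page size and the panic
calls are made up for the example; sizes passed to the API must be
multiples of the system page size):

    /* reserve two pages of device address space, use them, return them */
    size_t page_sz = sysconf(_SC_PAGESIZE);
    size_t len = 2 * page_sz;
    void *va;

    va = rte_mem_dev_memory_alloc(len, 0); /* no alignment requirement */
    if (va == NULL)
        rte_panic("cannot reserve device memory: %s\n",
                rte_strerror(rte_errno));

    /* ... map device resources at va ... */

    if (rte_mem_dev_memory_free(va, len) < 0)
        rte_panic("cannot free device memory: %s\n",
                rte_strerror(rte_errno));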
Signed-off-by: Anatoly Burakov
---
 lib/librte_eal/common/eal_common_memory.c          | 270 ++++++++++++++++--
 .../common/include/rte_eal_memconfig.h             |  18 ++
 lib/librte_eal/common/include/rte_memory.h         |  40 +++
 3 files changed, 312 insertions(+), 16 deletions(-)

diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index 4f0688f9d..8cae9b354 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -33,6 +33,7 @@
  */
 
 #define MEMSEG_LIST_FMT "memseg-%" PRIu64 "k-%i-%i"
+#define DEVICE_MEMORY_NAME "device_memory"
 
 static uint64_t baseaddr_offset;
 static uint64_t system_page_sz;
@@ -904,6 +905,227 @@ rte_memseg_list_walk(rte_memseg_list_walk_t func, void *arg)
 	return ret;
 }
 
+void * __rte_experimental
+rte_mem_dev_memory_alloc(size_t size, size_t align)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct rte_fbarray *arr = &mcfg->device_memory.mem_map_arr;
+	unsigned int n_pages, page_align;
+	int start_idx, cur_idx;
+	void *addr = NULL;
+
+	/* check parameters first */
+	if (size == 0 || (size & (system_page_sz - 1)) != 0) {
+		RTE_LOG(ERR, EAL, "%s(): size is not page-aligned\n",
+			__func__);
+		rte_errno = EINVAL;
+		return NULL;
+	}
+	if ((align & (system_page_sz - 1)) != 0) {
+		RTE_LOG(ERR, EAL, "%s(): alignment is not page-aligned\n",
+			__func__);
+		rte_errno = EINVAL;
+		return NULL;
+	}
+	/* PCI BAR sizes can only be powers of two, but this memory may be used
+	 * for more than just PCI BAR mappings, so only check if alignment is
+	 * power of two.
+	 */
+	if (align != 0 && !rte_is_power_of_2(align)) {
+		RTE_LOG(ERR, EAL, "%s(): alignment is not a power of two\n",
+			__func__);
+		rte_errno = EINVAL;
+		return NULL;
+	}
+	/* check if device memory map is uninitialized. */
+	if (mcfg->device_memory.base_va == NULL || arr->len == 0) {
+		RTE_LOG(ERR, EAL, "%s(): device memory map is not initialized\n",
+			__func__);
+		rte_errno = ENODEV;
+		return NULL;
+	}
+
+	n_pages = size / system_page_sz;
+	page_align = align / system_page_sz;
+
+	/* lock the device memory map */
+	rte_spinlock_lock(&mcfg->device_memory.lock);
+
+	start_idx = 0;
+	while (1) {
+		size_t offset;
+		int end;
+
+		cur_idx = rte_fbarray_find_next_n_free(arr, start_idx, n_pages);
+		if (cur_idx < 0)
+			break;
+
+		/* if there are alignment requirements, check if the offset we
+		 * found is aligned, and if not, align it and check if we still
+		 * have enough space.
+		 */
+		if (page_align != 0 && (cur_idx & (page_align - 1)) != 0) {
+			unsigned int aligned, len;
+
+			aligned = RTE_ALIGN_CEIL(cur_idx, page_align);
+			len = rte_fbarray_find_contig_free(arr, aligned);
+
+			/* if there's not enough space, keep looking */
+			if (len < n_pages) {
+				start_idx = aligned + len;
+				continue;
+			}
+
+			/* we've found space */
+			cur_idx = aligned;
+		}
+		end = cur_idx + n_pages;
+		offset = cur_idx * system_page_sz;
+		addr = RTE_PTR_ADD(mcfg->device_memory.base_va,
+				offset);
+
+		/* now, mark all space as occupied */
+		for (; cur_idx < end; cur_idx++)
+			rte_fbarray_set_used(arr, cur_idx);
+		break;
+	}
+	rte_spinlock_unlock(&mcfg->device_memory.lock);
+
+	if (addr != NULL)
+		RTE_LOG(DEBUG, EAL, "%s(): allocated %p-%p (%lu bytes) for hardware device usage\n",
+			__func__, addr, RTE_PTR_ADD(addr, size), size);
+
+	return addr;
+}
+
+int __rte_experimental
+rte_mem_dev_memory_free(void *addr, size_t size)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct rte_fbarray *arr = &mcfg->device_memory.mem_map_arr;
+	int cur_idx, end, ret, n_pages, len;
+	void *map_end, *mem_end;
+
+	/* check parameters first */
+	if (size == 0 || (size & (system_page_sz - 1)) != 0) {
+		RTE_LOG(ERR, EAL, "%s(): size is not page-aligned\n",
+			__func__);
+		rte_errno = EINVAL;
+		return -1;
+	}
+	/* check if device memory map is uninitialized. */
+	if (mcfg->device_memory.base_va == NULL || arr->len == 0) {
+		RTE_LOG(ERR, EAL, "%s(): device memory map is not initialized\n",
+			__func__);
+		rte_errno = ENODEV;
+		return -1;
+	}
+	map_end = RTE_PTR_ADD(mcfg->device_memory.base_va,
+			arr->len * system_page_sz);
+	mem_end = RTE_PTR_ADD(addr, size);
+
+	/* check if address is within the memory map */
+	if (addr < mcfg->device_memory.base_va || addr >= map_end ||
+			mem_end > map_end) {
+		RTE_LOG(ERR, EAL, "%s(): address is beyond device memory map range\n",
+			__func__);
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	rte_spinlock_lock(&mcfg->device_memory.lock);
+
+	n_pages = size / system_page_sz;
+	cur_idx = RTE_PTR_DIFF(addr, mcfg->device_memory.base_va) /
+			system_page_sz;
+	end = cur_idx + n_pages;
+
+	/* check all space we will be marking as free is currently occupied */
+	len = rte_fbarray_find_contig_used(arr, cur_idx);
+	if (len < n_pages) {
+		RTE_LOG(ERR, EAL, "%s(): attempting to free unoccupied space\n",
+			__func__);
+		rte_errno = EINVAL;
+		ret = -1;
+		goto unlock;
+	}
+	/* now, mark all space as free */
+	for (; cur_idx < end; cur_idx++)
+		rte_fbarray_set_free(arr, cur_idx);
+
+	/* success */
+	ret = 0;
+
+	RTE_LOG(DEBUG, EAL, "%s(): deallocated %p-%p (%lu bytes) for hardware device usage\n",
+		__func__, addr, RTE_PTR_ADD(addr, size), size);
+unlock:
+	rte_spinlock_unlock(&mcfg->device_memory.lock);
+	return ret;
+}
+
+static int
+dev_memory_init(void)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	size_t size;
+	void *addr;
+	int retval;
+
+	if (system_page_sz == 0)
+		system_page_sz = sysconf(_SC_PAGESIZE);
+
+	size = (size_t) 2 << 30;
+
+	addr = eal_get_virtual_area(NULL, &size, system_page_sz, 0, 0);
+	if (addr == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot reserve device memory\n");
+		return -1;
+	}
+
+	retval = rte_fbarray_init(&mcfg->device_memory.mem_map_arr,
+			DEVICE_MEMORY_NAME, size / system_page_sz, 0);
+	if (retval < 0) {
+		RTE_LOG(ERR, EAL, "Cannot initialize device memory map\n");
+		return -1;
+	}
+	mcfg->device_memory.base_va = addr;
+	rte_spinlock_init(&mcfg->device_memory.lock);
+	return 0;
+}
+
+static int
+dev_memory_attach(void)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	size_t size;
+	void *addr;
+	int retval;
+
+	rte_spinlock_lock(&mcfg->device_memory.lock);
+
+	if (system_page_sz == 0)
+		system_page_sz = sysconf(_SC_PAGESIZE);
+
+	size = mcfg->device_memory.mem_map_arr.len * system_page_sz;
+
+	addr = eal_get_virtual_area(mcfg->device_memory.base_va, &size,
+			system_page_sz, 0, 0);
+	if (addr == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot reserve device memory\n");
+		return -1;
+	}
+
+	retval = rte_fbarray_attach(&mcfg->device_memory.mem_map_arr);
+	if (retval < 0) {
+		RTE_LOG(ERR, EAL, "Cannot attach to device memory map\n");
+		return -1;
+	}
+
+	rte_spinlock_unlock(&mcfg->device_memory.lock);
+
+	return 0;
+}
+
 /* init memory subsystem */
 int
 rte_eal_memory_init(void)
@@ -918,25 +1140,41 @@ rte_eal_memory_init(void)
 	/* lock mem hotplug here, to prevent races while we init */
 	rte_rwlock_read_lock(&mcfg->memory_hotplug_lock);
 
-	retval = rte_eal_process_type() == RTE_PROC_PRIMARY ?
+	if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
+		retval = dev_memory_init();
+		if (retval < 0)
+			goto fail;
+
 #ifndef RTE_ARCH_64
-			memseg_primary_init_32() :
+		retval = memseg_primary_init_32();
 #else
-			memseg_primary_init() :
+		retval = memseg_primary_init();
 #endif
-			memseg_secondary_init();
-
-	if (retval < 0)
-		goto fail;
-
-	if (eal_memalloc_init() < 0)
-		goto fail;
-
-	retval = rte_eal_process_type() == RTE_PROC_PRIMARY ?
-			rte_eal_hugepage_init() :
-			rte_eal_hugepage_attach();
-	if (retval < 0)
-		goto fail;
+		if (retval < 0)
+			goto fail;
+
+		if (eal_memalloc_init() < 0)
+			goto fail;
+
+		retval = rte_eal_hugepage_init();
+		if (retval < 0)
+			goto fail;
+	} else {
+		retval = dev_memory_attach();
+		if (retval < 0)
+			goto fail;
+
+		retval = memseg_secondary_init();
+		if (retval < 0)
+			goto fail;
+
+		if (eal_memalloc_init() < 0)
+			goto fail;
+
+		retval = rte_eal_hugepage_attach();
+		if (retval < 0)
+			goto fail;
+	}
 
 	if (internal_config.no_shconf == 0 && rte_eal_memdevice_init() < 0)
 		goto fail;
diff --git a/lib/librte_eal/common/include/rte_eal_memconfig.h b/lib/librte_eal/common/include/rte_eal_memconfig.h
index aff0688dd..a8e7c39ff 100644
--- a/lib/librte_eal/common/include/rte_eal_memconfig.h
+++ b/lib/librte_eal/common/include/rte_eal_memconfig.h
@@ -36,6 +36,22 @@ struct rte_memseg_list {
 	struct rte_fbarray memseg_arr;
 };
 
+/**
+ * mem map is a special case because we need to store a bunch of other data
+ * together with the array itself.
+ */
+struct rte_mem_map {
+	RTE_STD_C11
+	union {
+		void *base_va;
+		/**< Base virtual address for this mem map. */
+		uint64_t addr_64;
+		/**< Makes sure addr is always 64-bits */
+	};
+	rte_spinlock_t lock;
+	struct rte_fbarray mem_map_arr;
+};
+
 /**
  * the structure for the memory configuration for the RTE.
 * Used by the rte_config structure. It is separated out, as for multi-process
@@ -68,6 +84,8 @@ struct rte_mem_config {
 	struct rte_memseg_list memsegs[RTE_MAX_MEMSEG_LISTS];
 	/**< list of dynamic arrays holding memsegs */
 
+	struct rte_mem_map device_memory; /**< Occupancy map of preallocated device memory */
+
 	struct rte_tailq_head tailq_head[RTE_MAX_TAILQ]; /**< Tailqs for objects */
 
 	/* Heaps of Malloc per socket */
diff --git a/lib/librte_eal/common/include/rte_memory.h b/lib/librte_eal/common/include/rte_memory.h
index aab9f6fe5..4cf58bd2a 100644
--- a/lib/librte_eal/common/include/rte_memory.h
+++ b/lib/librte_eal/common/include/rte_memory.h
@@ -445,6 +445,46 @@ rte_mem_alloc_validator_register(const char *name,
 int __rte_experimental
 rte_mem_alloc_validator_unregister(const char *name, int socket_id);
 
+/**
+ * @brief Request memory for device mapping.
+ *
+ * @note After this call, reserved memory will be marked as unavailable in all
+ *   processes until it is released, even if it goes unused.
+ *
+ * @param size
+ *   Size of memory to request.
+ *
+ * @param align
+ *   Alignment of memory to be returned.
+ *
+ * @return
+ *   Valid pointer on successful fulfillment of request.
+ *   NULL on unsuccessful fulfillment of request, with rte_errno indicating the
+ *   cause of error.
+ */
+void * __rte_experimental
+rte_mem_dev_memory_alloc(size_t size, size_t align);
+
+/**
+ * @brief Release memory for device mapping.
+ *
+ * @note By the time this call is made, the memory region being freed must no
+ *   longer be in use.
+ *
+ * @param addr
+ *   Address of previously requested block of memory.
+ *
+ * @param size
+ *   Size of memory to free.
+ *
+ * @return
+ *   0 on successful memory release.
+ *   -1 on unsuccessful memory release, with rte_errno indicating the cause of
+ *   error.
+ */
+int __rte_experimental
+rte_mem_dev_memory_free(void *addr, size_t size);
+
 #ifdef __cplusplus
 }
 #endif
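To make the alignment contract above concrete - a hypothetical sketch,
assuming 4 KiB system pages (RTE_PGSIZE_2M is the existing 2 MiB
constant from rte_memory.h; alignment is applied within the
pre-reserved device memory area):

    /* align must be 0, or a power of two that is a multiple of the
     * page size; here, reserve 8 pages on a 2 MiB boundary */
    size_t page_sz = sysconf(_SC_PAGESIZE);
    void *va = rte_mem_dev_memory_alloc(8 * page_sz, RTE_PGSIZE_2M);

    if (va == NULL) {
        /* rte_errno is EINVAL for bad parameters, ENODEV if the
         * device memory map was never initialized */
        RTE_LOG(ERR, EAL, "reservation failed: %s\n",
            rte_strerror(rte_errno));
    }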
From patchwork Thu May 31 10:57:50 2018
X-Patchwork-Submitter: Anatoly Burakov
X-Patchwork-Id: 40538
X-Patchwork-Delegate: thomas@monjalon.net
From: Anatoly Burakov
To: dev@dpdk.org
Cc: Ferruh Yigit, Gaetan Rivet, thomas@monjalon.net, hemant.agrawal@nxp.com,
 bruce.richardson@intel.com, konstantin.ananyev@intel.com,
 jerin.jacob@caviumnetworks.com, olivier.matz@6wind.com,
 stephen@networkplumber.org, nhorman@tuxdriver.com, david.marchand@6wind.com,
 gowrishankar.m@linux.vnet.ibm.com
Date: Thu, 31 May 2018 11:57:50 +0100
Subject: [dpdk-dev] [RFC 3/3] bus/pci: use the new device memory API for BAR
 mapping

Adjust the PCI infrastructure to reserve device memory through the new
device memory API. Any hotplug event will reserve memory, and any
hot-unplug event will release it back to the system.

This allows for more reliable PCI mappings in secondary processes, and
will be crucial for supporting multiprocess hotplug.
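The resulting mapping flow, in simplified form (a sketch only -
bar_size, bar_offset and dev_fd are illustrative placeholders; the real
logic lives in pci_map_resource() and pci_vfio_mmap_bar() below):

    /* carve address space out of the pre-reserved device memory region,
     * then map the BAR over it with MAP_FIXED, so that primary and
     * secondary processes end up with identical addresses */
    void *addr = rte_mem_dev_memory_alloc(bar_size, 0);
    if (addr == NULL)
        return -1;

    if (mmap(addr, bar_size, PROT_READ | PROT_WRITE,
            MAP_SHARED | MAP_FIXED, dev_fd, bar_offset) == MAP_FAILED) {
        rte_mem_dev_memory_free(addr, bar_size); /* hand the space back */
        return -1;
    }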
"%s(): cannot reserve space for device\n", + __func__); + return -1; + } + } if (bar_addr != MAP_FAILED) { void *map_addr = NULL; if (memreg[0].size) { @@ -469,7 +474,6 @@ pci_vfio_map_resource_primary(struct rte_pci_device *dev) for (i = 0; i < (int) vfio_res->nb_maps; i++) { struct vfio_region_info reg = { .argsz = sizeof(reg) }; - void *bar_addr; reg.index = i; @@ -494,19 +498,12 @@ pci_vfio_map_resource_primary(struct rte_pci_device *dev) if ((reg.flags & VFIO_REGION_INFO_FLAG_MMAP) == 0) continue; - /* try mapping somewhere close to the end of hugepages */ - if (pci_map_addr == NULL) - pci_map_addr = pci_find_max_end_va(); - - bar_addr = pci_map_addr; - pci_map_addr = RTE_PTR_ADD(bar_addr, (size_t) reg.size); - - maps[i].addr = bar_addr; + maps[i].addr = NULL; maps[i].offset = reg.offset; maps[i].size = reg.size; maps[i].path = NULL; /* vfio doesn't have per-resource paths */ - ret = pci_vfio_mmap_bar(vfio_dev_fd, vfio_res, i, 0); + ret = pci_vfio_mmap_bar(vfio_dev_fd, vfio_res, i); if (ret < 0) { RTE_LOG(ERR, EAL, " %s mapping BAR%i failed: %s\n", pci_addr, i, strerror(errno)); @@ -574,7 +571,7 @@ pci_vfio_map_resource_secondary(struct rte_pci_device *dev) maps = vfio_res->maps; for (i = 0; i < (int) vfio_res->nb_maps; i++) { - ret = pci_vfio_mmap_bar(vfio_dev_fd, vfio_res, i, MAP_FIXED); + ret = pci_vfio_mmap_bar(vfio_dev_fd, vfio_res, i); if (ret < 0) { RTE_LOG(ERR, EAL, " %s mapping BAR%i failed: %s\n", pci_addr, i, strerror(errno)); diff --git a/lib/librte_pci/Makefile b/lib/librte_pci/Makefile index 94a632670..f996fe33c 100644 --- a/lib/librte_pci/Makefile +++ b/lib/librte_pci/Makefile @@ -8,6 +8,7 @@ LIB = librte_pci.a CFLAGS := -I$(SRCDIR) $(CFLAGS) CFLAGS += $(WERROR_FLAGS) -O3 +CFLAGS += -DALLOW_EXPERIMENTAL_API LDLIBS += -lrte_eal EXPORT_MAP := rte_pci_version.map diff --git a/lib/librte_pci/rte_pci.c b/lib/librte_pci/rte_pci.c index 530738dbd..c425a624e 100644 --- a/lib/librte_pci/rte_pci.c +++ b/lib/librte_pci/rte_pci.c @@ -151,6 +151,16 @@ pci_map_resource(void *requested_addr, int fd, off_t offset, size_t size, { void *mapaddr; + if (requested_addr == NULL) { + requested_addr = rte_mem_dev_memory_alloc(size, 0); + if (requested_addr == NULL) { + RTE_LOG(ERR, EAL, "%s(): cannot reserve space for device\n", + __func__); + return MAP_FAILED; + } + } + additional_flags |= MAP_FIXED; + /* Map the PCI memory resource of device */ mapaddr = mmap(requested_addr, size, PROT_READ | PROT_WRITE, MAP_SHARED | additional_flags, fd, offset); @@ -170,15 +180,23 @@ pci_map_resource(void *requested_addr, int fd, off_t offset, size_t size, void pci_unmap_resource(void *requested_addr, size_t size) { + void *mapped; if (requested_addr == NULL) return; + mapped = mmap(requested_addr, size, PROT_READ, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + /* Unmap the PCI memory resource of device */ - if (munmap(requested_addr, size)) { + if (mapped == MAP_FAILED) { RTE_LOG(ERR, EAL, "%s(): cannot munmap(%p, %#zx): %s\n", __func__, requested_addr, size, strerror(errno)); } else RTE_LOG(DEBUG, EAL, " PCI memory unmapped at %p\n", requested_addr); + if (rte_mem_dev_memory_free(requested_addr, size)) + RTE_LOG(ERR, EAL, "%s(): cannot mark %p-%p as free\n", + __func__, requested_addr, + RTE_PTR_ADD(requested_addr, size)); }