[3/6] bus: introduce DMA memory mapping for external memory

Message ID 323319abdbdc238c3586dafe9ad49dab554d6e64.1550048188.git.shahafs@mellanox.com (mailing list archive)
State Superseded, archived
Series introduce DMA memory mapping for external memory

Checks

Context               Check    Description
ci/checkpatch         success  coding style OK
ci/Intel-compilation  success  Compilation OK

Commit Message

Shahaf Shuler Feb. 13, 2019, 9:10 a.m. UTC
  The DPDK APIs expose 3 different modes to work with memory used for DMA:

1. Use the DPDK owned memory (backed by the DPDK provided hugepages).
This memory is allocated by the DPDK libraries, included in the DPDK
memory system (memseg lists) and automatically DMA mapped by the DPDK
layers.

2. Use memory allocated by the user and registered to the DPDK memory
system. This is also referred to as external memory. Upon registration
of the external memory, the DPDK layers will DMA map it to all needed
devices.

3. Use memory allocated by the user and not registered to the DPDK memory
system. This is for users who want tight control over this memory. The
user needs to explicitly call the DMA map function in order to register
such memory to the different devices.

This patch focuses on #3 above.

Currently the only way to map external memory is through VFIO
(rte_vfio_dma_map). While VFIO is common, there are other vendors
which use different ways to map memory (e.g. Mellanox and NXP).

This patch moves the DMA mapping to vendor-agnostic APIs. New map and
unmap ops are added to the rte_bus structure. They are currently
implemented only on the PCI bus. The PCI implementation calls the
driver's map and unmap callbacks when they exist, bypassing the VFIO
mapping. If the PCI driver provides no specific map/unmap, the VFIO
mapping is used when possible.
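
A minimal sketch of that dispatch order, for illustration only. All
sketch_* names are hypothetical stand-ins and not DPDK definitions; the
VFIO call is a placeholder that only counts invocations, and
sketch_errno plays the role of rte_errno.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the DPDK types; not the real definitions. */
enum sketch_kdrv { SKETCH_KDRV_NONE, SKETCH_KDRV_VFIO };

struct sketch_pci_device;

typedef int (sketch_dma_map_t)(struct sketch_pci_device *dev, void *addr,
			       uint64_t iova, size_t len);

struct sketch_pci_driver {
	sketch_dma_map_t *dma_map; /* NULL: driver has no mapping of its own */
};

struct sketch_pci_device {
	struct sketch_pci_driver *driver;
	enum sketch_kdrv kdrv; /* kernel driver the device is bound to */
};

int sketch_errno;        /* stands in for rte_errno */
int sketch_driver_calls; /* counts calls into the driver callback */
int sketch_vfio_calls;   /* counts calls into the VFIO placeholder */

/* Example vendor callback; a real one would program the device. */
int
sketch_vendor_dma_map(struct sketch_pci_device *dev, void *addr,
		      uint64_t iova, size_t len)
{
	(void)dev; (void)addr; (void)iova; (void)len;
	sketch_driver_calls++;
	return 0;
}

/* Placeholder for the VFIO container mapping call. */
int
sketch_vfio_dma_map(void *addr, uint64_t iova, size_t len)
{
	(void)addr; (void)iova; (void)len;
	sketch_vfio_calls++;
	return 0;
}

/* Driver callback first, then the VFIO fallback, else ENOTSUP. */
int
sketch_pci_dma_map(struct sketch_pci_device *pdev, void *addr,
		   uint64_t iova, size_t len)
{
	if (pdev->driver->dma_map != NULL)
		return pdev->driver->dma_map(pdev, addr, iova, len);
	if (pdev->kdrv == SKETCH_KDRV_VFIO)
		return sketch_vfio_dma_map(addr, iova, len);
	sketch_errno = ENOTSUP;
	return -sketch_errno;
}
```

The unmap path would follow the same order with the unmap callback.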

Application usage of these APIs is simple:
* allocate memory
* take a device, and query its rte_device.
* call the bus map function for this device.
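
The three steps above can be sketched as follows. This is an
illustration under stated assumptions, not the patch's code: the
sketch_* types and the toy bus are hypothetical, plain malloc() stands
in for whatever allocator the application uses, and IOVA == VA is
assumed only to keep the example short.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct sketch_device;

/* Hypothetical bus ops table, reduced to the two DMA callbacks. */
struct sketch_bus {
	int (*dma_map)(struct sketch_device *dev, void *addr,
		       uint64_t iova, size_t len);
	int (*dma_unmap)(struct sketch_device *dev, void *addr,
			 uint64_t iova, size_t len);
};

struct sketch_device {
	const struct sketch_bus *bus;
	size_t mapped_len; /* bytes currently mapped, illustration only */
};

int sketch_dev_errno; /* stands in for rte_errno */

/* Wrappers with the same argument checks as the patch's wrappers. */
int
sketch_dev_dma_map(struct sketch_device *dev, void *addr, uint64_t iova,
		   size_t len)
{
	if (dev->bus->dma_map == NULL || len == 0) {
		sketch_dev_errno = EINVAL;
		return -sketch_dev_errno;
	}
	return dev->bus->dma_map(dev, addr, iova, len);
}

int
sketch_dev_dma_unmap(struct sketch_device *dev, void *addr, uint64_t iova,
		     size_t len)
{
	if (dev->bus->dma_unmap == NULL || len == 0) {
		sketch_dev_errno = EINVAL;
		return -sketch_dev_errno;
	}
	return dev->bus->dma_unmap(dev, addr, iova, len);
}

/* Toy bus callbacks that only track how many bytes are mapped. */
int
sketch_toy_map(struct sketch_device *dev, void *addr, uint64_t iova,
	       size_t len)
{
	(void)addr; (void)iova;
	dev->mapped_len += len;
	return 0;
}

int
sketch_toy_unmap(struct sketch_device *dev, void *addr, uint64_t iova,
		 size_t len)
{
	(void)addr; (void)iova;
	dev->mapped_len -= len;
	return 0;
}

const struct sketch_bus sketch_toy_bus = {
	.dma_map = sketch_toy_map,
	.dma_unmap = sketch_toy_unmap,
};

/* The application flow: allocate, take the device, map, use, unmap. */
int
sketch_app_run(struct sketch_device *dev, size_t len)
{
	void *buf = malloc(len); /* step 1: allocate memory */
	int ret;

	if (buf == NULL)
		return -ENOMEM;
	/* Step 2: dev is the device handle the application looked up. */
	/* Step 3: map it; IOVA == VA is an assumption of this sketch. */
	ret = sketch_dev_dma_map(dev, buf, (uint64_t)(uintptr_t)buf, len);
	if (ret == 0) {
		/* The device could DMA to and from buf at this point. */
		ret = sketch_dev_dma_unmap(dev, buf,
					   (uint64_t)(uintptr_t)buf, len);
	}
	free(buf);
	return ret;
}
```

The memory is unmapped before it is freed, so the device never holds a
mapping to released memory.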

Future work will deprecate the rte_vfio_dma_map and rte_vfio_dma_unmap
APIs, leaving the PCI device APIs as the preferred option for the user.

Signed-off-by: Shahaf Shuler <shahafs@mellanox.com>
---
 drivers/bus/pci/pci_common.c            | 78 ++++++++++++++++++++++++++++
 drivers/bus/pci/rte_bus_pci.h           | 14 +++++
 lib/librte_eal/common/eal_common_bus.c  | 22 ++++++++
 lib/librte_eal/common/include/rte_bus.h | 57 ++++++++++++++++++++
 lib/librte_eal/rte_eal_version.map      |  2 +
 5 files changed, 173 insertions(+)
  

Comments

Gaëtan Rivet Feb. 13, 2019, 11:17 a.m. UTC | #1
On Wed, Feb 13, 2019 at 11:10:23AM +0200, Shahaf Shuler wrote:
> The DPDK APIs expose 3 different modes to work with memory used for DMA:
> 
> 1. Use the DPDK owned memory (backed by the DPDK provided hugepages).
> This memory is allocated by the DPDK libraries, included in the DPDK
> memory system (memseg lists) and automatically DMA mapped by the DPDK
> layers.
> 
> 2. Use memory allocated by the user and register to the DPDK memory
> systems. This is also referred as external memory. Upon registration of
> the external memory, the DPDK layers will DMA map it to all needed
> devices.
> 
> 3. Use memory allocated by the user and not registered to the DPDK memory
> system. This is for users who wants to have tight control on this
> memory. The user will need to explicitly call DMA map function in order
> to register such memory to the different devices.
> 
> The scope of the patch focus on #3 above.
> 
> Currently the only way to map external memory is through VFIO
> (rte_vfio_dma_map). While VFIO is common, there are other vendors
> which use different ways to map memory (e.g. Mellanox and NXP).
> 

How are those other vendors' devices mapped initially right now? Are
they using #2 scheme instead? Then the user will remap everything using
#3?

Would it be interesting to be able to describe a mapping prior to
probing a device and refer to it upon hotplug?

> The work in this patch moves the DMA mapping to vendor agnostic APIs.
> A new map and unmap ops were added to rte_bus structure. Implementation
> of those was done currently only on the PCI bus. The implementation takes
> the driver map and umap implementation as bypass to the VFIO mapping.
> That is, in case of no specific map/unmap from the PCI driver,
> VFIO mapping, if possible, will be used.

This paragraph should be rewritten to better fit a commit log.

> 
> Application use with those APIs is quite simple:
> * allocate memory
> * take a device, and query its rte_device.
> * call the bus map function for this device.

Is the device already configured with the existing mappings? Should the
application stop it before attempting to map its allocated memory?

> 
> Future work will deprecate the rte_vfio_dma_map and rte_vfio_dma_unmap
> APIs, leaving the PCI device APIs as the preferred option for the user.
> 
> Signed-off-by: Shahaf Shuler <shahafs@mellanox.com>
> ---
>  drivers/bus/pci/pci_common.c            | 78 ++++++++++++++++++++++++++++
>  drivers/bus/pci/rte_bus_pci.h           | 14 +++++
>  lib/librte_eal/common/eal_common_bus.c  | 22 ++++++++
>  lib/librte_eal/common/include/rte_bus.h | 57 ++++++++++++++++++++
>  lib/librte_eal/rte_eal_version.map      |  2 +
>  5 files changed, 173 insertions(+)
> 
> diff --git a/drivers/bus/pci/pci_common.c b/drivers/bus/pci/pci_common.c
> index 6276e5d695..018080c48b 100644
> --- a/drivers/bus/pci/pci_common.c
> +++ b/drivers/bus/pci/pci_common.c
> @@ -528,6 +528,82 @@ pci_unplug(struct rte_device *dev)
>  	return ret;
>  }
>  
> +/**
> + * DMA Map memory segment to device. After a successful call the device
> + * will be able to read/write from/to this segment.
> + *
> + * @param dev
> + *   Pointer to the PCI device.
> + * @param addr
> + *   Starting virtual address of memory to be mapped.
> + * @param iova
> + *   Starting IOVA address of memory to be mapped.
> + * @param len
> + *   Length of memory segment being mapped.
> + * @return
> + *   - 0 On success.
> + *   - Negative value and rte_errno is set otherwise.
> + */

This doc should be on the callback typedef, not their implementation.
The rte_errno error spec should also be documented higher-up in the
abstraction pile, on the bus callback I think. Everyone should follow
the same error codes for applications to really be able to use any
implementation generically.

> +static int __rte_experimental

The __rte_experimental is not necessary in compilation units themselves,
only in the headers.

In any case, it would only be the publicly available API that must be
marked as such, so more the callback typedefs than their
implementations.

> +pci_dma_map(struct rte_device *dev, void *addr, uint64_t iova, size_t len)
> +{
> +	struct rte_pci_device *pdev = RTE_DEV_TO_PCI(dev);
> +
> +	if (!pdev || !pdev->driver) {

pdev cannot be null here, nor should its driver be.

> +		rte_errno = EINVAL;
> +		return -rte_errno;
> +	}
> +	if (pdev->driver->map)
> +		return pdev->driver->map(pdev, addr, iova, len);
> +	/**
> +	 *  In case driver don't provides any specific mapping
> +	 *  try fallback to VFIO.
> +	 */
> +	if (pdev->kdrv == RTE_KDRV_VFIO)
> +		return rte_vfio_container_dma_map(-1, (uintptr_t)addr, iova,
> +						  len);

Reiterating: RTE_VFIO_DEFAULT_CONTAINER_FD is more readable I think than
-1 here.

> +	rte_errno = ENOTSUP;
> +	return -rte_errno;
> +}
> +
> +/**
> + * Un-map memory segment to device. After a successful call the device
> + * will not be able to read/write from/to this segment.
> + *
> + * @param dev
> + *   Pointer to the PCI device.
> + * @param addr
> + *   Starting virtual address of memory to be unmapped.
> + * @param iova
> + *   Starting IOVA address of memory to be unmapped.
> + * @param len
> + *   Length of memory segment being unmapped.
> + * @return
> + *   - 0 On success.
> + *   - Negative value and rte_errno is set otherwise.
> + */
> +static int __rte_experimental

Same as before for __rte_experimental and doc.

> +pci_dma_unmap(struct rte_device *dev, void *addr, uint64_t iova, size_t len)
> +{
> +	struct rte_pci_device *pdev = RTE_DEV_TO_PCI(dev);
> +
> +	if (!pdev || !pdev->driver) {
> +		rte_errno = EINVAL;
> +		return -rte_errno;
> +	}
> +	if (pdev->driver->unmap)
> +		return pdev->driver->unmap(pdev, addr, iova, len);
> +	/**
> +	 *  In case driver don't provides any specific mapping
> +	 *  try fallback to VFIO.
> +	 */
> +	if (pdev->kdrv == RTE_KDRV_VFIO)
> +		return rte_vfio_container_dma_unmap(-1, (uintptr_t)addr, iova,
> +						    len);
> +	rte_errno = ENOTSUP;
> +	return -rte_errno;
> +}
> +
>  struct rte_pci_bus rte_pci_bus = {
>  	.bus = {
>  		.scan = rte_pci_scan,
> @@ -536,6 +612,8 @@ struct rte_pci_bus rte_pci_bus = {
>  		.plug = pci_plug,
>  		.unplug = pci_unplug,
>  		.parse = pci_parse,
> +		.map = pci_dma_map,
> +		.unmap = pci_dma_unmap,
>  		.get_iommu_class = rte_pci_get_iommu_class,
>  		.dev_iterate = rte_pci_dev_iterate,
>  		.hot_unplug_handler = pci_hot_unplug_handler,
> diff --git a/drivers/bus/pci/rte_bus_pci.h b/drivers/bus/pci/rte_bus_pci.h
> index f0d6d81c00..00b2d412c7 100644
> --- a/drivers/bus/pci/rte_bus_pci.h
> +++ b/drivers/bus/pci/rte_bus_pci.h
> @@ -114,6 +114,18 @@ typedef int (pci_probe_t)(struct rte_pci_driver *, struct rte_pci_device *);
>  typedef int (pci_remove_t)(struct rte_pci_device *);
>  
>  /**
> + * Driver-specific DMA mapping.
> + */
> +typedef int (pci_dma_map_t)(struct rte_pci_device *dev, void *addr,
> +			    uint64_t iova, size_t len);
> +
> +/**
> + * Driver-specific DMA unmapping.
> + */
> +typedef int (pci_dma_unmap_t)(struct rte_pci_device *dev, void *addr,
> +			      uint64_t iova, size_t len);
> +
> +/**
>   * A structure describing a PCI driver.
>   */
>  struct rte_pci_driver {
> @@ -122,6 +134,8 @@ struct rte_pci_driver {
>  	struct rte_pci_bus *bus;           /**< PCI bus reference. */
>  	pci_probe_t *probe;                /**< Device Probe function. */
>  	pci_remove_t *remove;              /**< Device Remove function. */
> +	pci_dma_map_t *map;		   /**< device dma map function. */
> +	pci_dma_unmap_t *unmap;		   /**< device dma unmap function. */

I'd call both callbacks dma_map and dma_unmap. It's clearer and more
consistent.

>  	const struct rte_pci_id *id_table; /**< ID table, NULL terminated. */
>  	uint32_t drv_flags;                /**< Flags RTE_PCI_DRV_*. */
>  };
> diff --git a/lib/librte_eal/common/eal_common_bus.c b/lib/librte_eal/common/eal_common_bus.c
> index c8f1901f0b..b7911d5ddd 100644
> --- a/lib/librte_eal/common/eal_common_bus.c
> +++ b/lib/librte_eal/common/eal_common_bus.c
> @@ -285,3 +285,25 @@ rte_bus_sigbus_handler(const void *failure_addr)
>  
>  	return ret;
>  }
> +
> +int __rte_experimental
> +rte_bus_dma_map(struct rte_device *dev, void *addr, uint64_t iova,
> +		size_t len)
> +{
> +	if (dev->bus->map == NULL || len == 0) {
> +		rte_errno = EINVAL;
> +		return -rte_errno;
> +	}
> +	return dev->bus->map(dev, addr, iova, len);
> +}
> +
> +int __rte_experimental
> +rte_bus_dma_unmap(struct rte_device *dev, void *addr, uint64_t iova,
> +		  size_t len)
> +{
> +	if (dev->bus->unmap == NULL || len == 0) {
> +		rte_errno = EINVAL;
> +		return -rte_errno;
> +	}
> +	return dev->bus->unmap(dev, addr, iova, len);
> +}

These functions should be called rte_dev_dma_{map,unmap} and be part of
eal_common_dev.c instead.

> diff --git a/lib/librte_eal/common/include/rte_bus.h b/lib/librte_eal/common/include/rte_bus.h
> index 6be4b5cabe..90e4bf51b2 100644
> --- a/lib/librte_eal/common/include/rte_bus.h
> +++ b/lib/librte_eal/common/include/rte_bus.h
> @@ -168,6 +168,48 @@ typedef int (*rte_bus_unplug_t)(struct rte_device *dev);
>  typedef int (*rte_bus_parse_t)(const char *name, void *addr);
>  
>  /**
> + * Bus specific DMA map function.
> + * After a successful call, the memory segment will be mapped to the
> + * given device.
> + *
> + * @param dev
> + *	Device pointer.
> + * @param addr
> + *	Virtual address to map.
> + * @param iova
> + *	IOVA address to map.
> + * @param len
> + *	Length of the memory segment being mapped.
> + *
> + * @return
> + *	0 if mapping was successful.
> + *	Negative value and rte_errno is set otherwise.
> + */
> +typedef int (*rte_bus_map_t)(struct rte_device *dev, void *addr,
> +			     uint64_t iova, size_t len);
> +
> +/**
> + * Bus specific DMA unmap function.
> + * After a successful call, the memory segment will no longer be
> + * accessible by the given device.
> + *
> + * @param dev
> + *	Device pointer.
> + * @param addr
> + *	Virtual address to unmap.
> + * @param iova
> + *	IOVA address to unmap.
> + * @param len
> + *	Length of the memory segment being mapped.
> + *
> + * @return
> + *	0 if un-mapping was successful.
> + *	Negative value and rte_errno is set otherwise.
> + */
> +typedef int (*rte_bus_unmap_t)(struct rte_device *dev, void *addr,
> +			       uint64_t iova, size_t len);
> +
> +/**
>   * Implement a specific hot-unplug handler, which is responsible for
>   * handle the failure when device be hot-unplugged. When the event of
>   * hot-unplug be detected, it could call this function to handle
> @@ -238,6 +280,8 @@ struct rte_bus {
>  	rte_bus_plug_t plug;         /**< Probe single device for drivers */
>  	rte_bus_unplug_t unplug;     /**< Remove single device from driver */
>  	rte_bus_parse_t parse;       /**< Parse a device name */
> +	rte_bus_map_t map;	     /**< DMA map for device in the bus */
> +	rte_bus_unmap_t unmap;	     /**< DMA unmap for device in the bus */

Same as for the driver callbacks, dma_map and dma_unmap seem a better
fit for the field names.

>  	struct rte_bus_conf conf;    /**< Bus configuration */
>  	rte_bus_get_iommu_class_t get_iommu_class; /**< Get iommu class */
>  	rte_dev_iterate_t dev_iterate; /**< Device iterator. */
> @@ -356,6 +400,19 @@ struct rte_bus *rte_bus_find_by_name(const char *busname);
>  enum rte_iova_mode rte_bus_get_iommu_class(void);
>  
>  /**
> + * Wrapper to call the bus specific DMA map function.
> + */
> +int __rte_experimental
> +rte_bus_dma_map(struct rte_device *dev, void *addr, uint64_t iova, size_t len);
> +
> +/**
> + * Wrapper to call the bus specific DMA unmap function.
> + */
> +int __rte_experimental
> +rte_bus_dma_unmap(struct rte_device *dev, void *addr, uint64_t iova,
> +		  size_t len);
> +
> +/**

Same as earlier -> these seem device-level functions, not bus-related.
You won't map those addresses to all devices on the bus.

>   * Helper for Bus registration.
>   * The constructor has higher priority than PMD constructors.
>   */
> diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
> index eb5f7b9cbd..23f3adb73a 100644
> --- a/lib/librte_eal/rte_eal_version.map
> +++ b/lib/librte_eal/rte_eal_version.map
> @@ -364,4 +364,6 @@ EXPERIMENTAL {
>  	rte_service_may_be_active;
>  	rte_socket_count;
>  	rte_socket_id_by_idx;
> +	rte_bus_dma_map;
> +	rte_bus_dma_unmap;
>  };
> -- 
> 2.12.0
>
  
Shahaf Shuler Feb. 13, 2019, 7:07 p.m. UTC | #2
Wednesday, February 13, 2019 1:17 PM, Gaëtan Rivet:
> Subject: Re: [PATCH 3/6] bus: introduce DMA memory mapping for external
> memory
> 
> On Wed, Feb 13, 2019 at 11:10:23AM +0200, Shahaf Shuler wrote:
> > The DPDK APIs expose 3 different modes to work with memory used for
> DMA:
> >
> > 1. Use the DPDK owned memory (backed by the DPDK provided
> hugepages).
> > This memory is allocated by the DPDK libraries, included in the DPDK
> > memory system (memseg lists) and automatically DMA mapped by the
> DPDK
> > layers.
> >
> > 2. Use memory allocated by the user and register to the DPDK memory
> > systems. This is also referred as external memory. Upon registration
> > of the external memory, the DPDK layers will DMA map it to all needed
> > devices.
> >
> > 3. Use memory allocated by the user and not registered to the DPDK
> > memory system. This is for users who wants to have tight control on
> > this memory. The user will need to explicitly call DMA map function in
> > order to register such memory to the different devices.
> >
> > The scope of the patch focus on #3 above.
> >
> > Currently the only way to map external memory is through VFIO
> > (rte_vfio_dma_map). While VFIO is common, there are other vendors
> > which use different ways to map memory (e.g. Mellanox and NXP).
> >
> 
> How are those other vendors' devices mapped initially right now? Are they
> using #2 scheme instead? Then the user will remap everything using #3?

It is not a remap; it is a completely different mode of memory management.
The first question to ask is: how does the application want to manage its memory?
If it is either #1 or #2 above, there is no problem making the mapping internal to the "other vendor devices", as they can register for the memory event callback, which triggers every time new memory is added to the DPDK memory management system.
For #3 the memory does not exist in the DPDK memory management system, so there are no memory events. Hence the application needs to explicitly call the DMA map function.
The change in this patch just makes that call more generic than calling only VFIO.
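
To illustrate the difference, here is a toy model of the two paths. It is a sketch with hypothetical sketch_* names, not the DPDK implementation: a small callback table stands in for the memory event mechanism, and counters stand in for the actual device mappings.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_CBS 4

typedef void (*sketch_mem_event_cb_t)(void *addr, size_t len);

sketch_mem_event_cb_t sketch_cbs[SKETCH_MAX_CBS];
int sketch_n_cbs;
int sketch_auto_maps;     /* mappings done from the event callback */
int sketch_explicit_maps; /* mappings requested by the application */

/* A driver subscribes to be told about new managed memory. */
int
sketch_mem_event_register(sketch_mem_event_cb_t cb)
{
	if (sketch_n_cbs == SKETCH_MAX_CBS)
		return -1;
	sketch_cbs[sketch_n_cbs++] = cb;
	return 0;
}

/* Modes #1 and #2: memory enters the managed system, events fire. */
void
sketch_managed_mem_add(void *addr, size_t len)
{
	for (int i = 0; i < sketch_n_cbs; i++)
		sketch_cbs[i](addr, len);
}

/* Driver-side event handler: map the new memory for the device. */
void
sketch_driver_on_mem_event(void *addr, size_t len)
{
	(void)addr; (void)len;
	sketch_auto_maps++;
}

/* Mode #3: no event ever fires, so the application maps by hand. */
int
sketch_explicit_dma_map(void *addr, size_t len)
{
	if (addr == NULL || len == 0)
		return -1;
	sketch_explicit_maps++;
	return 0;
}
```

Memory that never goes through sketch_managed_mem_add() is invisible to the registered drivers, which is why mode #3 needs the explicit call.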

> 
> Would it be interesting to be able to describe a mapping prior to probing a
> device and refer to it upon hotplug?

Not sure it is an interesting use case. I don't see the need to set up the application memory before probing the devices.
Regarding hotplug: this is a feature we can add on top of this series (for example, if a device was removed and then hotplugged back). This will require storing the mappings in some database, like VFIO does.
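
A rough sketch of what such a database could look like, assuming a small fixed-size table per device. Every name here is hypothetical and nothing in this series implements it; the replay function takes the map operation as a callback so the table stays independent of the bus.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_DB_MAX 8

struct sketch_mapping {
	void *addr;
	uint64_t iova;
	size_t len;
	int used;
};

struct sketch_map_db {
	struct sketch_mapping slots[SKETCH_DB_MAX];
};

typedef int (*sketch_map_fn_t)(void *addr, uint64_t iova, size_t len);

int sketch_replayed; /* counts mappings re-issued by the replay */

void
sketch_db_init(struct sketch_map_db *db)
{
	for (int i = 0; i < SKETCH_DB_MAX; i++)
		db->slots[i].used = 0;
}

/* Record a mapping after a successful map call. */
int
sketch_db_add(struct sketch_map_db *db, void *addr, uint64_t iova,
	      size_t len)
{
	for (int i = 0; i < SKETCH_DB_MAX; i++) {
		if (!db->slots[i].used) {
			db->slots[i].addr = addr;
			db->slots[i].iova = iova;
			db->slots[i].len = len;
			db->slots[i].used = 1;
			return 0;
		}
	}
	return -ENOSPC;
}

/* Forget a mapping after a successful unmap call. */
int
sketch_db_del(struct sketch_map_db *db, void *addr, uint64_t iova,
	      size_t len)
{
	for (int i = 0; i < SKETCH_DB_MAX; i++) {
		struct sketch_mapping *m = &db->slots[i];

		if (m->used && m->addr == addr && m->iova == iova &&
		    m->len == len) {
			m->used = 0;
			return 0;
		}
	}
	return -ENOENT;
}

/* On re-plug: re-issue every recorded mapping, stop at first error. */
int
sketch_db_replay(const struct sketch_map_db *db, sketch_map_fn_t map)
{
	for (int i = 0; i < SKETCH_DB_MAX; i++) {
		const struct sketch_mapping *m = &db->slots[i];
		int ret;

		if (!m->used)
			continue;
		ret = map(m->addr, m->iova, m->len);
		if (ret != 0)
			return ret;
	}
	return 0;
}

/* Toy map operation used to observe the replay. */
int
sketch_count_map(void *addr, uint64_t iova, size_t len)
{
	(void)addr; (void)iova; (void)len;
	sketch_replayed++;
	return 0;
}
```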

> 
> > The work in this patch moves the DMA mapping to vendor agnostic APIs.
> > A new map and unmap ops were added to rte_bus structure.
> > Implementation of those was done currently only on the PCI bus. The
> > implementation takes the driver map and umap implementation as bypass
> to the VFIO mapping.
> > That is, in case of no specific map/unmap from the PCI driver, VFIO
> > mapping, if possible, will be used.
> 
> This paragraph should be rewritten to better fit a commit log.
> 
> >
> > Application use with those APIs is quite simple:
> > * allocate memory
> > * take a device, and query its rte_device.
> > * call the bus map function for this device.
> 
> Is the device already configured with the existing mappings? Should the
> application stop it before attempting to map its allocated memory?

I am not following.
When the application wants to register new memory for DMA with this device, it calls map. When it wants to unregister it, it calls unmap. Without an explicit call to the map function, the memory cannot be used for DMA.

> 
> >
> > Future work will deprecate the rte_vfio_dma_map and
> rte_vfio_dma_unmap
> > APIs, leaving the PCI device APIs as the preferred option for the user.
> >
> > Signed-off-by: Shahaf Shuler <shahafs@mellanox.com>
> > ---
> >  drivers/bus/pci/pci_common.c            | 78
> ++++++++++++++++++++++++++++
> >  drivers/bus/pci/rte_bus_pci.h           | 14 +++++
> >  lib/librte_eal/common/eal_common_bus.c  | 22 ++++++++
> > lib/librte_eal/common/include/rte_bus.h | 57 ++++++++++++++++++++
> >  lib/librte_eal/rte_eal_version.map      |  2 +
> >  5 files changed, 173 insertions(+)
> >
> > diff --git a/drivers/bus/pci/pci_common.c
> > b/drivers/bus/pci/pci_common.c index 6276e5d695..018080c48b 100644
> > --- a/drivers/bus/pci/pci_common.c
> > +++ b/drivers/bus/pci/pci_common.c
> > @@ -528,6 +528,82 @@ pci_unplug(struct rte_device *dev)
> >  	return ret;
> >  }
> >
> > +/**
> > + * DMA Map memory segment to device. After a successful call the
> > +device
> > + * will be able to read/write from/to this segment.
> > + *
> > + * @param dev
> > + *   Pointer to the PCI device.
> > + * @param addr
> > + *   Starting virtual address of memory to be mapped.
> > + * @param iova
> > + *   Starting IOVA address of memory to be mapped.
> > + * @param len
> > + *   Length of memory segment being mapped.
> > + * @return
> > + *   - 0 On success.
> > + *   - Negative value and rte_errno is set otherwise.
> > + */
> 
> This doc should be on the callback typedef, not their implementation.
> The rte_errno error spec should also be documented higher-up in the
> abstraction pile, on the bus callback I think. Everyone should follow the same
> error codes for applications to really be able to use any implementation
> generically.

OK. 

> 
> > +static int __rte_experimental
> 
> The __rte_experimental is not necessary in compilation units themselves,
> only in the headers.
> 
> In any case, it would only be the publicly available API that must be marked
> as such, so more the callback typedefs than their implementations.

OK

> 
> > +pci_dma_map(struct rte_device *dev, void *addr, uint64_t iova, size_t
> > +len) {
> > +	struct rte_pci_device *pdev = RTE_DEV_TO_PCI(dev);
> > +
> > +	if (!pdev || !pdev->driver) {
> 
> pdev cannot be null here, nor should its driver be.

So you are saying to rely on that and drop the check?

> 
> > +		rte_errno = EINVAL;
> > +		return -rte_errno;
> > +	}
> > +	if (pdev->driver->map)
> > +		return pdev->driver->map(pdev, addr, iova, len);
> > +	/**
> > +	 *  In case driver don't provides any specific mapping
> > +	 *  try fallback to VFIO.
> > +	 */
> > +	if (pdev->kdrv == RTE_KDRV_VFIO)
> > +		return rte_vfio_container_dma_map(-1, (uintptr_t)addr,
> iova,
> > +						  len);
> 
> Reiterating: RTE_VFIO_DEFAULT_CONTAINER_FD is more readable I think
> than
> -1 here.
> 
> > +	rte_errno = ENOTSUP;
> > +	return -rte_errno;
> > +}
> > +
> > +/**
> > + * Un-map memory segment to device. After a successful call the
> > +device
> > + * will not be able to read/write from/to this segment.
> > + *
> > + * @param dev
> > + *   Pointer to the PCI device.
> > + * @param addr
> > + *   Starting virtual address of memory to be unmapped.
> > + * @param iova
> > + *   Starting IOVA address of memory to be unmapped.
> > + * @param len
> > + *   Length of memory segment being unmapped.
> > + * @return
> > + *   - 0 On success.
> > + *   - Negative value and rte_errno is set otherwise.
> > + */
> > +static int __rte_experimental
> 
> Same as before for __rte_experimental and doc.
> 
> > +pci_dma_unmap(struct rte_device *dev, void *addr, uint64_t iova,
> > +size_t len) {
> > +	struct rte_pci_device *pdev = RTE_DEV_TO_PCI(dev);
> > +
> > +	if (!pdev || !pdev->driver) {
> > +		rte_errno = EINVAL;
> > +		return -rte_errno;
> > +	}
> > +	if (pdev->driver->unmap)
> > +		return pdev->driver->unmap(pdev, addr, iova, len);
> > +	/**
> > +	 *  In case driver don't provides any specific mapping
> > +	 *  try fallback to VFIO.
> > +	 */
> > +	if (pdev->kdrv == RTE_KDRV_VFIO)
> > +		return rte_vfio_container_dma_unmap(-1, (uintptr_t)addr,
> iova,
> > +						    len);
> > +	rte_errno = ENOTSUP;
> > +	return -rte_errno;
> > +}
> > +
> >  struct rte_pci_bus rte_pci_bus = {
> >  	.bus = {
> >  		.scan = rte_pci_scan,
> > @@ -536,6 +612,8 @@ struct rte_pci_bus rte_pci_bus = {
> >  		.plug = pci_plug,
> >  		.unplug = pci_unplug,
> >  		.parse = pci_parse,
> > +		.map = pci_dma_map,
> > +		.unmap = pci_dma_unmap,
> >  		.get_iommu_class = rte_pci_get_iommu_class,
> >  		.dev_iterate = rte_pci_dev_iterate,
> >  		.hot_unplug_handler = pci_hot_unplug_handler, diff --git
> > a/drivers/bus/pci/rte_bus_pci.h b/drivers/bus/pci/rte_bus_pci.h index
> > f0d6d81c00..00b2d412c7 100644
> > --- a/drivers/bus/pci/rte_bus_pci.h
> > +++ b/drivers/bus/pci/rte_bus_pci.h
> > @@ -114,6 +114,18 @@ typedef int (pci_probe_t)(struct rte_pci_driver
> > *, struct rte_pci_device *);  typedef int (pci_remove_t)(struct
> > rte_pci_device *);
> >
> >  /**
> > + * Driver-specific DMA mapping.
> > + */
> > +typedef int (pci_dma_map_t)(struct rte_pci_device *dev, void *addr,
> > +			    uint64_t iova, size_t len);
> > +
> > +/**
> > + * Driver-specific DMA unmapping.
> > + */
> > +typedef int (pci_dma_unmap_t)(struct rte_pci_device *dev, void *addr,
> > +			      uint64_t iova, size_t len);
> > +
> > +/**
> >   * A structure describing a PCI driver.
> >   */
> >  struct rte_pci_driver {
> > @@ -122,6 +134,8 @@ struct rte_pci_driver {
> >  	struct rte_pci_bus *bus;           /**< PCI bus reference. */
> >  	pci_probe_t *probe;                /**< Device Probe function. */
> >  	pci_remove_t *remove;              /**< Device Remove function. */
> > +	pci_dma_map_t *map;		   /**< device dma map function. */
> > +	pci_dma_unmap_t *unmap;		   /**< device dma unmap
> function. */
> 
> I'd call both callbacks dma_map and dma_unmap. It's clearer and more
> consistent.

OK. 

> 
> >  	const struct rte_pci_id *id_table; /**< ID table, NULL terminated. */
> >  	uint32_t drv_flags;                /**< Flags RTE_PCI_DRV_*. */
> >  };
> > diff --git a/lib/librte_eal/common/eal_common_bus.c
> > b/lib/librte_eal/common/eal_common_bus.c
> > index c8f1901f0b..b7911d5ddd 100644
> > --- a/lib/librte_eal/common/eal_common_bus.c
> > +++ b/lib/librte_eal/common/eal_common_bus.c
> > @@ -285,3 +285,25 @@ rte_bus_sigbus_handler(const void *failure_addr)
> >
> >  	return ret;
> >  }
> > +
> > +int __rte_experimental
> > +rte_bus_dma_map(struct rte_device *dev, void *addr, uint64_t iova,
> > +		size_t len)
> > +{
> > +	if (dev->bus->map == NULL || len == 0) {
> > +		rte_errno = EINVAL;
> > +		return -rte_errno;
> > +	}
> > +	return dev->bus->map(dev, addr, iova, len); }
> > +
> > +int __rte_experimental
> > +rte_bus_dma_unmap(struct rte_device *dev, void *addr, uint64_t iova,
> > +		  size_t len)
> > +{
> > +	if (dev->bus->unmap == NULL || len == 0) {
> > +		rte_errno = EINVAL;
> > +		return -rte_errno;
> > +	}
> > +	return dev->bus->unmap(dev, addr, iova, len); }
> 
> These functions should be called rte_dev_dma_{map,unmap} and be part of
> eal_common_dev.c instead.

Will move. 

> 
> > diff --git a/lib/librte_eal/common/include/rte_bus.h
> > b/lib/librte_eal/common/include/rte_bus.h
> > index 6be4b5cabe..90e4bf51b2 100644
> > --- a/lib/librte_eal/common/include/rte_bus.h
> > +++ b/lib/librte_eal/common/include/rte_bus.h
> > @@ -168,6 +168,48 @@ typedef int (*rte_bus_unplug_t)(struct rte_device
> > *dev);  typedef int (*rte_bus_parse_t)(const char *name, void *addr);
> >
> >  /**
> > + * Bus specific DMA map function.
> > + * After a successful call, the memory segment will be mapped to the
> > + * given device.
> > + *
> > + * @param dev
> > + *	Device pointer.
> > + * @param addr
> > + *	Virtual address to map.
> > + * @param iova
> > + *	IOVA address to map.
> > + * @param len
> > + *	Length of the memory segment being mapped.
> > + *
> > + * @return
> > + *	0 if mapping was successful.
> > + *	Negative value and rte_errno is set otherwise.
> > + */
> > +typedef int (*rte_bus_map_t)(struct rte_device *dev, void *addr,
> > +			     uint64_t iova, size_t len);
> > +
> > +/**
> > + * Bus specific DMA unmap function.
> > + * After a successful call, the memory segment will no longer be
> > + * accessible by the given device.
> > + *
> > + * @param dev
> > + *	Device pointer.
> > + * @param addr
> > + *	Virtual address to unmap.
> > + * @param iova
> > + *	IOVA address to unmap.
> > + * @param len
> > + *	Length of the memory segment being mapped.
> > + *
> > + * @return
> > + *	0 if un-mapping was successful.
> > + *	Negative value and rte_errno is set otherwise.
> > + */
> > +typedef int (*rte_bus_unmap_t)(struct rte_device *dev, void *addr,
> > +			       uint64_t iova, size_t len);
> > +
> > +/**
> >   * Implement a specific hot-unplug handler, which is responsible for
> >   * handle the failure when device be hot-unplugged. When the event of
> >   * hot-unplug be detected, it could call this function to handle @@
> > -238,6 +280,8 @@ struct rte_bus {
> >  	rte_bus_plug_t plug;         /**< Probe single device for drivers */
> >  	rte_bus_unplug_t unplug;     /**< Remove single device from driver
> */
> >  	rte_bus_parse_t parse;       /**< Parse a device name */
> > +	rte_bus_map_t map;	     /**< DMA map for device in the bus */
> > +	rte_bus_unmap_t unmap;	     /**< DMA unmap for device in the
> bus */
> 
> Same as for the driver callbacks, dma_map and dma_unmap seem a better
> fit for the field names.

OK

> 
> >  	struct rte_bus_conf conf;    /**< Bus configuration */
> >  	rte_bus_get_iommu_class_t get_iommu_class; /**< Get iommu
> class */
> >  	rte_dev_iterate_t dev_iterate; /**< Device iterator. */ @@ -356,6
> > +400,19 @@ struct rte_bus *rte_bus_find_by_name(const char
> *busname);
> > enum rte_iova_mode rte_bus_get_iommu_class(void);
> >
> >  /**
> > + * Wrapper to call the bus specific DMA map function.
> > + */
> > +int __rte_experimental
> > +rte_bus_dma_map(struct rte_device *dev, void *addr, uint64_t iova,
> > +size_t len);
> > +
> > +/**
> > + * Wrapper to call the bus specific DMA unmap function.
> > + */
> > +int __rte_experimental
> > +rte_bus_dma_unmap(struct rte_device *dev, void *addr, uint64_t iova,
> > +		  size_t len);
> > +
> > +/**
> 
> Same as earlier -> these seem device-level functions, not bus-related.
> You won't map those addresses to all devices on the bus.
> 
> >   * Helper for Bus registration.
> >   * The constructor has higher priority than PMD constructors.
> >   */
> > diff --git a/lib/librte_eal/rte_eal_version.map
> > b/lib/librte_eal/rte_eal_version.map
> > index eb5f7b9cbd..23f3adb73a 100644
> > --- a/lib/librte_eal/rte_eal_version.map
> > +++ b/lib/librte_eal/rte_eal_version.map
> > @@ -364,4 +364,6 @@ EXPERIMENTAL {
> >  	rte_service_may_be_active;
> >  	rte_socket_count;
> >  	rte_socket_id_by_idx;
> > +	rte_bus_dma_map;
> > +	rte_bus_dma_unmap;
> >  };
> > --
> > 2.12.0
> >
> 
> --
> Gaëtan Rivet
> 6WIND
  
Gaëtan Rivet Feb. 14, 2019, 2 p.m. UTC | #3
On Wed, Feb 13, 2019 at 07:07:11PM +0000, Shahaf Shuler wrote:
> Wednesday, February 13, 2019 1:17 PM, Gaëtan Rivet:
> > Subject: Re: [PATCH 3/6] bus: introduce DMA memory mapping for external
> > memory
> > 
> > On Wed, Feb 13, 2019 at 11:10:23AM +0200, Shahaf Shuler wrote:
> > > The DPDK APIs expose 3 different modes to work with memory used for
> > DMA:
> > >
> > > 1. Use the DPDK owned memory (backed by the DPDK provided
> > hugepages).
> > > This memory is allocated by the DPDK libraries, included in the DPDK
> > > memory system (memseg lists) and automatically DMA mapped by the
> > DPDK
> > > layers.
> > >
> > > 2. Use memory allocated by the user and register to the DPDK memory
> > > systems. This is also referred as external memory. Upon registration
> > > of the external memory, the DPDK layers will DMA map it to all needed
> > > devices.
> > >
> > > 3. Use memory allocated by the user and not registered to the DPDK
> > > memory system. This is for users who wants to have tight control on
> > > this memory. The user will need to explicitly call DMA map function in
> > > order to register such memory to the different devices.
> > >
> > > The scope of the patch focus on #3 above.
> > >
> > > Currently the only way to map external memory is through VFIO
> > > (rte_vfio_dma_map). While VFIO is common, there are other vendors
> > > which use different ways to map memory (e.g. Mellanox and NXP).
> > >
> > 
> > How are those other vendors' devices mapped initially right now? Are they
> > using #2 scheme instead? Then the user will remap everything using #3?
> 
> It is not a re-map, it is a completely different mode for the memory management. 
> The first question to ask is "how the application wants to manage its memory" ?
> If it is either #1 or #2 above, no problem to make the mapping internal on the "other vendor devices" as they can register to the memory event callback which trigger every time new memory is added to the DPDK memory management system. 
> For #3 the memory does not exists in the DPDK memory management system, and no memory events. Hence the application needs to explicitly call the dma MAP. 
> The change on this patch is just to make it more generic than calling only VFIO. 
> 

Right! I mostly used #1 ports and never really thought about other kinds
of memory management or how they might follow a different logic.

Do you think this could be used with a lot of sequential
mappings/unmappings happening?

I'm thinking for example about a crypto app feeding crypto buffers,
being able to directly map the result instead of copying it within
buffers might be interesting. But then you'd have to unmap often.

- Is the unmap() simple from the app PoV?

- Must the mapping remain available for a long time?

- Does the app need to call tx_descriptor_status() a few times or
  does dma_unmap() verify that the mapping is not in use before unmapping?
  
Shahaf Shuler Feb. 17, 2019, 6:23 a.m. UTC | #4
Thursday, February 14, 2019 4:01 PM, Gaëtan Rivet:
> Subject: Re: [PATCH 3/6] bus: introduce DMA memory mapping for external
> memory
> 
> On Wed, Feb 13, 2019 at 07:07:11PM +0000, Shahaf Shuler wrote:
> > Wednesday, February 13, 2019 1:17 PM, Gaëtan Rivet:
> > > Subject: Re: [PATCH 3/6] bus: introduce DMA memory mapping for
> > > external memory

[...]

> > >
> > > How are those other vendors' devices mapped initially right now? Are
> > > they using #2 scheme instead? Then the user will remap everything using
> #3?
> >
> > It is not a re-map, it is a completely different mode for the memory
> management.
> > The first question to ask is "how the application wants to manage its
> memory" ?
> > If it is either #1 or #2 above, no problem to make the mapping internal on
> the "other vendor devices" as they can register to the memory event
> callback which trigger every time new memory is added to the DPDK memory
> management system.
> > For #3 the memory does not exists in the DPDK memory management
> system, and no memory events. Hence the application needs to explicitly call
> the dma MAP.
> > The change on this patch is just to make it more generic than calling only
> VFIO.
> >
> 
> Right! I mostly used #1 ports and never really thought about other kind of
> memory management or how they might follow a different logic.
> 
> Do you think this could be used with a lot of sequential mapping/unmappings
> happening?

It depends very much on how efficient the driver's mapping and unmapping are.
In most cases, mapping is a heavy operation.

> 
> I'm thinking for example about a crypto app feeding crypto buffers, being
> able to directly map the result instead of copying it within buffers might be
> interesting. But then you'd have to unmap often.
> 
> - Is the unmap() simple from the app PoV?

Yes, just call rte_bus_dma_unmap. 

> 
> - Must the mapping remain available for a long time?

It must remain for as long as you need the device to access the memory. In your example, it should remain until the crypto dev has finished writing the buffers.

> 
> - Does the app need to call tx_descriptor_status() a few times or
>   does dma_unmap() verify that the mapping is not in use before
> unmapping?

I think it is a matter of driver implementation.
In general, it is the application's responsibility to make sure the memory is no longer needed before unmapping, just as today one does not destroy a mempool that is still being used by some rxq. It can be done by any means the application has, not only tx_descriptor_status.
A driver can protect against a bad application by warning and failing the call in such a case, but it is not a must.
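
To make the ownership rule above concrete, here is a minimal, self-contained sketch of the application-side flow: map, submit, wait for completion, then unmap. It does not use DPDK; every demo_* name is a hypothetical stand-in, and demo_dma_map()/demo_dma_unmap() only model where rte_bus_dma_map()/rte_bus_dma_unmap() would be called. The -EBUSY refusal models the optional "protective driver" case and is an assumption, not something this patch implements.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a device; not a DPDK type. */
struct demo_dev {
	bool mapped;       /* is the buffer currently DMA mapped? */
	unsigned inflight; /* operations that still reference the buffer */
};

/* Stand-in for the map call: only records that a mapping exists. */
static int
demo_dma_map(struct demo_dev *dev, void *addr, uint64_t iova, size_t len)
{
	(void)addr;
	(void)iova;
	if (dev == NULL || len == 0 || dev->mapped)
		return -EINVAL;
	dev->mapped = true;
	return 0;
}

/*
 * Stand-in for the unmap call. A defensive driver may refuse to unmap
 * memory that is still in use (assumption: modelled here as -EBUSY).
 */
static int
demo_dma_unmap(struct demo_dev *dev, void *addr, uint64_t iova, size_t len)
{
	(void)addr;
	(void)iova;
	if (dev == NULL || len == 0 || !dev->mapped)
		return -EINVAL;
	if (dev->inflight != 0)
		return -EBUSY;
	dev->mapped = false;
	return 0;
}

/* Hand one operation on the mapped buffer to the device. */
static int
demo_submit(struct demo_dev *dev)
{
	if (dev == NULL || !dev->mapped)
		return -EINVAL;
	dev->inflight++;
	return 0;
}

/* Model the device finishing one operation. */
static void
demo_complete(struct demo_dev *dev)
{
	if (dev->inflight != 0)
		dev->inflight--;
}

/* Application flow: the mapping outlives every in-flight operation. */
static int
demo_process_buffer(struct demo_dev *dev, void *buf, uint64_t iova,
		    size_t len)
{
	int ret;
	int unmap_ret;

	ret = demo_dma_map(dev, buf, iova, len);
	if (ret != 0)
		return ret;

	ret = demo_submit(dev);
	if (ret == 0) {
		/* A real application would poll the device here. */
		while (dev->inflight != 0)
			demo_complete(dev);
	}

	/* Only now is it safe to remove the mapping. */
	unmap_ret = demo_dma_unmap(dev, buf, iova, len);
	return ret != 0 ? ret : unmap_ret;
}
```

Because mapping is usually heavy, an application that does this per buffer pays that cost on every iteration; the sketch shows only the ordering, not a recommended granularity.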

> 
> --
> Gaëtan Rivet
> 6WIND
  

Patch

diff --git a/drivers/bus/pci/pci_common.c b/drivers/bus/pci/pci_common.c
index 6276e5d695..018080c48b 100644
--- a/drivers/bus/pci/pci_common.c
+++ b/drivers/bus/pci/pci_common.c
@@ -528,6 +528,82 @@  pci_unplug(struct rte_device *dev)
 	return ret;
 }
 
+/**
+ * DMA Map memory segment to device. After a successful call the device
+ * will be able to read/write from/to this segment.
+ *
+ * @param dev
+ *   Pointer to the PCI device.
+ * @param addr
+ *   Starting virtual address of memory to be mapped.
+ * @param iova
+ *   Starting IOVA address of memory to be mapped.
+ * @param len
+ *   Length of memory segment being mapped.
+ * @return
+ *   - 0 On success.
+ *   - Negative value and rte_errno is set otherwise.
+ */
+static int __rte_experimental
+pci_dma_map(struct rte_device *dev, void *addr, uint64_t iova, size_t len)
+{
+	struct rte_pci_device *pdev = RTE_DEV_TO_PCI(dev);
+
+	if (!pdev || !pdev->driver) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+	if (pdev->driver->map)
+		return pdev->driver->map(pdev, addr, iova, len);
+	/**
+	 *  If the driver does not provide a specific mapping,
+	 *  fall back to VFIO.
+	 */
+	if (pdev->kdrv == RTE_KDRV_VFIO)
+		return rte_vfio_container_dma_map(-1, (uintptr_t)addr, iova,
+						  len);
+	rte_errno = ENOTSUP;
+	return -rte_errno;
+}
+
+/**
+ * Unmap memory segment from device. After a successful call the device
+ * will not be able to read/write from/to this segment.
+ *
+ * @param dev
+ *   Pointer to the PCI device.
+ * @param addr
+ *   Starting virtual address of memory to be unmapped.
+ * @param iova
+ *   Starting IOVA address of memory to be unmapped.
+ * @param len
+ *   Length of memory segment being unmapped.
+ * @return
+ *   - 0 On success.
+ *   - Negative value and rte_errno is set otherwise.
+ */
+static int __rte_experimental
+pci_dma_unmap(struct rte_device *dev, void *addr, uint64_t iova, size_t len)
+{
+	struct rte_pci_device *pdev = RTE_DEV_TO_PCI(dev);
+
+	if (!pdev || !pdev->driver) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+	if (pdev->driver->unmap)
+		return pdev->driver->unmap(pdev, addr, iova, len);
+	/**
+	 *  If the driver does not provide a specific unmapping,
+	 *  fall back to VFIO.
+	 */
+	if (pdev->kdrv == RTE_KDRV_VFIO)
+		return rte_vfio_container_dma_unmap(-1, (uintptr_t)addr, iova,
+						    len);
+	rte_errno = ENOTSUP;
+	return -rte_errno;
+}
+
 struct rte_pci_bus rte_pci_bus = {
 	.bus = {
 		.scan = rte_pci_scan,
@@ -536,6 +612,8 @@  struct rte_pci_bus rte_pci_bus = {
 		.plug = pci_plug,
 		.unplug = pci_unplug,
 		.parse = pci_parse,
+		.map = pci_dma_map,
+		.unmap = pci_dma_unmap,
 		.get_iommu_class = rte_pci_get_iommu_class,
 		.dev_iterate = rte_pci_dev_iterate,
 		.hot_unplug_handler = pci_hot_unplug_handler,
diff --git a/drivers/bus/pci/rte_bus_pci.h b/drivers/bus/pci/rte_bus_pci.h
index f0d6d81c00..00b2d412c7 100644
--- a/drivers/bus/pci/rte_bus_pci.h
+++ b/drivers/bus/pci/rte_bus_pci.h
@@ -114,6 +114,18 @@  typedef int (pci_probe_t)(struct rte_pci_driver *, struct rte_pci_device *);
 typedef int (pci_remove_t)(struct rte_pci_device *);
 
 /**
+ * Driver-specific DMA mapping.
+ */
+typedef int (pci_dma_map_t)(struct rte_pci_device *dev, void *addr,
+			    uint64_t iova, size_t len);
+
+/**
+ * Driver-specific DMA unmapping.
+ */
+typedef int (pci_dma_unmap_t)(struct rte_pci_device *dev, void *addr,
+			      uint64_t iova, size_t len);
+
+/**
  * A structure describing a PCI driver.
  */
 struct rte_pci_driver {
@@ -122,6 +134,8 @@  struct rte_pci_driver {
 	struct rte_pci_bus *bus;           /**< PCI bus reference. */
 	pci_probe_t *probe;                /**< Device Probe function. */
 	pci_remove_t *remove;              /**< Device Remove function. */
+	pci_dma_map_t *map;		   /**< device dma map function. */
+	pci_dma_unmap_t *unmap;		   /**< device dma unmap function. */
 	const struct rte_pci_id *id_table; /**< ID table, NULL terminated. */
 	uint32_t drv_flags;                /**< Flags RTE_PCI_DRV_*. */
 };
diff --git a/lib/librte_eal/common/eal_common_bus.c b/lib/librte_eal/common/eal_common_bus.c
index c8f1901f0b..b7911d5ddd 100644
--- a/lib/librte_eal/common/eal_common_bus.c
+++ b/lib/librte_eal/common/eal_common_bus.c
@@ -285,3 +285,25 @@  rte_bus_sigbus_handler(const void *failure_addr)
 
 	return ret;
 }
+
+int __rte_experimental
+rte_bus_dma_map(struct rte_device *dev, void *addr, uint64_t iova,
+		size_t len)
+{
+	if (dev->bus->map == NULL || len == 0) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+	return dev->bus->map(dev, addr, iova, len);
+}
+
+int __rte_experimental
+rte_bus_dma_unmap(struct rte_device *dev, void *addr, uint64_t iova,
+		  size_t len)
+{
+	if (dev->bus->unmap == NULL || len == 0) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+	return dev->bus->unmap(dev, addr, iova, len);
+}
diff --git a/lib/librte_eal/common/include/rte_bus.h b/lib/librte_eal/common/include/rte_bus.h
index 6be4b5cabe..90e4bf51b2 100644
--- a/lib/librte_eal/common/include/rte_bus.h
+++ b/lib/librte_eal/common/include/rte_bus.h
@@ -168,6 +168,48 @@  typedef int (*rte_bus_unplug_t)(struct rte_device *dev);
 typedef int (*rte_bus_parse_t)(const char *name, void *addr);
 
 /**
+ * Bus specific DMA map function.
+ * After a successful call, the memory segment will be mapped to the
+ * given device.
+ *
+ * @param dev
+ *	Device pointer.
+ * @param addr
+ *	Virtual address to map.
+ * @param iova
+ *	IOVA address to map.
+ * @param len
+ *	Length of the memory segment being mapped.
+ *
+ * @return
+ *	0 if mapping was successful.
+ *	Negative value and rte_errno is set otherwise.
+ */
+typedef int (*rte_bus_map_t)(struct rte_device *dev, void *addr,
+			     uint64_t iova, size_t len);
+
+/**
+ * Bus specific DMA unmap function.
+ * After a successful call, the memory segment will no longer be
+ * accessible by the given device.
+ *
+ * @param dev
+ *	Device pointer.
+ * @param addr
+ *	Virtual address to unmap.
+ * @param iova
+ *	IOVA address to unmap.
+ * @param len
+ *	Length of the memory segment being unmapped.
+ *
+ * @return
+ *	0 if un-mapping was successful.
+ *	Negative value and rte_errno is set otherwise.
+ */
+typedef int (*rte_bus_unmap_t)(struct rte_device *dev, void *addr,
+			       uint64_t iova, size_t len);
+
+/**
  * Implement a specific hot-unplug handler, which is responsible for
  * handle the failure when device be hot-unplugged. When the event of
  * hot-unplug be detected, it could call this function to handle
@@ -238,6 +280,8 @@  struct rte_bus {
 	rte_bus_plug_t plug;         /**< Probe single device for drivers */
 	rte_bus_unplug_t unplug;     /**< Remove single device from driver */
 	rte_bus_parse_t parse;       /**< Parse a device name */
+	rte_bus_map_t map;	     /**< DMA map for device in the bus */
+	rte_bus_unmap_t unmap;	     /**< DMA unmap for device in the bus */
 	struct rte_bus_conf conf;    /**< Bus configuration */
 	rte_bus_get_iommu_class_t get_iommu_class; /**< Get iommu class */
 	rte_dev_iterate_t dev_iterate; /**< Device iterator. */
@@ -356,6 +400,19 @@  struct rte_bus *rte_bus_find_by_name(const char *busname);
 enum rte_iova_mode rte_bus_get_iommu_class(void);
 
 /**
+ * Wrapper to call the bus specific DMA map function.
+ */
+int __rte_experimental
+rte_bus_dma_map(struct rte_device *dev, void *addr, uint64_t iova, size_t len);
+
+/**
+ * Wrapper to call the bus specific DMA unmap function.
+ */
+int __rte_experimental
+rte_bus_dma_unmap(struct rte_device *dev, void *addr, uint64_t iova,
+		  size_t len);
+
+/**
  * Helper for Bus registration.
  * The constructor has higher priority than PMD constructors.
  */
diff --git a/lib/librte_eal/rte_eal_version.map b/lib/librte_eal/rte_eal_version.map
index eb5f7b9cbd..23f3adb73a 100644
--- a/lib/librte_eal/rte_eal_version.map
+++ b/lib/librte_eal/rte_eal_version.map
@@ -364,4 +364,6 @@  EXPERIMENTAL {
 	rte_service_may_be_active;
 	rte_socket_count;
 	rte_socket_id_by_idx;
+	rte_bus_dma_map;
+	rte_bus_dma_unmap;
 };
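
The dispatch order used by rte_bus_dma_map() and pci_dma_map() in the patch above (bus op, then driver op, then VFIO fallback, otherwise ENOTSUP) can be sketched in isolation as follows. This is a simplified model, not the DPDK code: all demo_* types and functions are hypothetical stand-ins, the VFIO call is replaced by a counter, and errors are returned directly instead of through rte_errno. The NULL check on the device in demo_bus_dma_map() is an extra guard added for the sketch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel driver type of a device. */
enum demo_kdrv { DEMO_KDRV_NONE, DEMO_KDRV_VFIO };

struct demo_pci_device;

typedef int (*demo_map_t)(struct demo_pci_device *dev, void *addr,
			  uint64_t iova, size_t len);

/* Stand-in for a PCI driver with an optional map callback. */
struct demo_pci_driver {
	demo_map_t map; /* NULL when the driver has no specific mapping */
};

struct demo_pci_device {
	const struct demo_pci_driver *driver;
	enum demo_kdrv kdrv;
};

static int demo_vendor_calls; /* times the vendor callback ran */
static int demo_vfio_calls;   /* times the VFIO fallback ran */

/* Example vendor-specific mapping; only counts the call. */
static int
demo_vendor_map(struct demo_pci_device *dev, void *addr, uint64_t iova,
		size_t len)
{
	(void)dev;
	(void)addr;
	(void)iova;
	(void)len;
	demo_vendor_calls++;
	return 0;
}

/* Stand-in for the VFIO mapping call; only counts the call. */
static int
demo_vfio_map(void *addr, uint64_t iova, size_t len)
{
	(void)addr;
	(void)iova;
	(void)len;
	demo_vfio_calls++;
	return 0;
}

/* PCI bus level: the driver op takes precedence, VFIO is the fallback. */
static int
demo_pci_dma_map(struct demo_pci_device *pdev, void *addr, uint64_t iova,
		 size_t len)
{
	if (pdev == NULL || pdev->driver == NULL)
		return -EINVAL;
	if (pdev->driver->map != NULL)
		return pdev->driver->map(pdev, addr, iova, len);
	if (pdev->kdrv == DEMO_KDRV_VFIO)
		return demo_vfio_map(addr, iova, len);
	return -ENOTSUP;
}

/* Generic entry point: validate arguments, then call the bus op. */
static int
demo_bus_dma_map(struct demo_pci_device *dev, void *addr, uint64_t iova,
		 size_t len)
{
	if (dev == NULL || len == 0)
		return -EINVAL;
	return demo_pci_dma_map(dev, addr, iova, len);
}
```

A device whose driver sets the map callback never reaches the VFIO path, which is the "bypass" behaviour described in the commit message.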