[dpdk-dev,v3] vfio: fix sPAPR IOMMU DMA window size

Message ID 1502181667-17949-1-git-send-email-jpf@zurich.ibm.com (mailing list archive)
State Superseded, archived

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK

Commit Message

Jonas Pfefferle1 Aug. 8, 2017, 8:41 a.m. UTC
  The DMA window size needs to be big enough to span all memory segments'
physical addresses. We do not need multiple levels of IOMMU tables,
as we already span ~70TB of physical memory with 16MB hugepages.

Signed-off-by: Jonas Pfefferle <jpf@zurich.ibm.com>
---
v2:
* roundup to next power 2 function without loop.

v3:
* Replace roundup_next_pow2 with rte_align64pow2

 lib/librte_eal/linuxapp/eal/eal_vfio.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)
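
For context, a short editorial illustration of the two values the new code
computes; the example numbers are invented, and only rte_align64pow2() (from
rte_common.h) and __builtin_ctzll() are taken from the patch itself:

	#include <stdint.h>
	#include <stdio.h>
	#include <rte_common.h>	/* rte_align64pow2() */

	int main(void)
	{
		/* 16MB hugepages give a TCE page shift of 24 */
		uint64_t hugepage_sz = 1ULL << 24;
		unsigned int page_shift = __builtin_ctzll(hugepage_sz);	/* == 24 */

		/* sPAPR needs a power-of-two window: a highest segment ending
		 * at 0x980000000 (38GB) is rounded up to a 64GB window */
		uint64_t window_size = rte_align64pow2(0x980000000ULL);

		printf("page_shift=%u window_size=0x%llx\n",
			page_shift, (unsigned long long)window_size);
		return 0;
	}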
  

Comments

Anatoly Burakov Aug. 8, 2017, 9:15 a.m. UTC | #1
From: Jonas Pfefferle [mailto:jpf@zurich.ibm.com]
> Sent: Tuesday, August 8, 2017 9:41 AM
> To: Burakov, Anatoly <anatoly.burakov@intel.com>
> Cc: dev@dpdk.org; aik@ozlabs.ru; Jonas Pfefferle <jpf@zurich.ibm.com>
> Subject: [PATCH v3] vfio: fix sPAPR IOMMU DMA window size
> 
> DMA window size needs to be big enough to span all memory segment's
> physical addresses. We do not need multiple levels of IOMMU tables
> as we already span ~70TB of physical memory with 16MB hugepages.
> 
> Signed-off-by: Jonas Pfefferle <jpf@zurich.ibm.com>
> ---
> v2:
> * roundup to next power 2 function without loop.
> 
> v3:
> * Replace roundup_next_pow2 with rte_align64pow2
> 
>  lib/librte_eal/linuxapp/eal/eal_vfio.c | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c
> b/lib/librte_eal/linuxapp/eal/eal_vfio.c
> index 946df7e..550c41c 100644
> --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
> +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
> @@ -759,10 +759,12 @@ vfio_spapr_dma_map(int vfio_container_fd)
>  		return -1;
>  	}
> 
> -	/* calculate window size based on number of hugepages configured
> */
> -	create.window_size = rte_eal_get_physmem_size();
> +	/* physicaly pages are sorted descending i.e. ms[0].phys_addr is max
> */

Do we always expect that to be the case in the future? Maybe it would be safer to walk the memsegs list.

Thanks,
Anatoly

> +	/* create DMA window from 0 to max(phys_addr + len) */
> +	/* sPAPR requires window size to be a power of 2 */
> +	create.window_size = rte_align64pow2(ms[0].phys_addr +
> ms[0].len);
>  	create.page_shift = __builtin_ctzll(ms->hugepage_sz);
> -	create.levels = 2;
> +	create.levels = 1;
> 
>  	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE,
> &create);
>  	if (ret) {
> @@ -771,6 +773,11 @@ vfio_spapr_dma_map(int vfio_container_fd)
>  		return -1;
>  	}
> 
> +	if (create.start_addr != 0) {
> +		RTE_LOG(ERR, EAL, "  DMA window start address != 0\n");
> +		return -1;
> +	}
> +
>  	/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
>  	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
>  		struct vfio_iommu_type1_dma_map dma_map;
> --
> 2.7.4
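
A minimal sketch of the memseg walk suggested above, reusing the ms[] array
and the create struct from the quoted function; the NULL-addr termination is
assumed to match the existing mapping loop, and this is an illustration of the
idea rather than the actual follow-up patch:

	/* find the highest physical end address across all memsegs instead
	 * of assuming ms[0] holds it */
	uint64_t max_phys_end = 0;
	int i;

	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
		if (ms[i].addr == NULL)
			break;
		if (ms[i].phys_addr + ms[i].len > max_phys_end)
			max_phys_end = ms[i].phys_addr + ms[i].len;
	}

	/* sPAPR requires the DMA window size to be a power of two */
	create.window_size = rte_align64pow2(max_phys_end);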
  
Jonas Pfefferle1 Aug. 8, 2017, 9:29 a.m. UTC | #2
"Burakov, Anatoly" <anatoly.burakov@intel.com> wrote on 08/08/2017 11:15:24
AM:

> From: "Burakov, Anatoly" <anatoly.burakov@intel.com>
> To: Jonas Pfefferle <jpf@zurich.ibm.com>
> Cc: "dev@dpdk.org" <dev@dpdk.org>, "aik@ozlabs.ru" <aik@ozlabs.ru>
> Date: 08/08/2017 11:18 AM
> Subject: RE: [PATCH v3] vfio: fix sPAPR IOMMU DMA window size
>
> From: Jonas Pfefferle [mailto:jpf@zurich.ibm.com]
> > Sent: Tuesday, August 8, 2017 9:41 AM
> > To: Burakov, Anatoly <anatoly.burakov@intel.com>
> > Cc: dev@dpdk.org; aik@ozlabs.ru; Jonas Pfefferle <jpf@zurich.ibm.com>
> > Subject: [PATCH v3] vfio: fix sPAPR IOMMU DMA window size
> >
> > DMA window size needs to be big enough to span all memory segment's
> > physical addresses. We do not need multiple levels of IOMMU tables
> > as we already span ~70TB of physical memory with 16MB hugepages.
> >
> > Signed-off-by: Jonas Pfefferle <jpf@zurich.ibm.com>
> > ---
> > v2:
> > * roundup to next power 2 function without loop.
> >
> > v3:
> > * Replace roundup_next_pow2 with rte_align64pow2
> >
> >  lib/librte_eal/linuxapp/eal/eal_vfio.c | 13 ++++++++++---
> >  1 file changed, 10 insertions(+), 3 deletions(-)
> >
> > diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c
> > b/lib/librte_eal/linuxapp/eal/eal_vfio.c
> > index 946df7e..550c41c 100644
> > --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
> > +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
> > @@ -759,10 +759,12 @@ vfio_spapr_dma_map(int vfio_container_fd)
> >        return -1;
> >     }
> >
> > -   /* calculate window size based on number of hugepages configured */
> > -   create.window_size = rte_eal_get_physmem_size();
> > +   /* physicaly pages are sorted descending i.e. ms[0].phys_addr is max */
>
> Do we always expect that to be the case in the future? Maybe it
> would be safer to walk the memsegs list.
>
> Thanks,
> Anatoly

I had this loop in before but removed it in favor of simplicity.
If we believe that the ordering is going to change in the future
I'm happy to bring back the loop. Is there other code which is
relying on the fact that the memsegs are sorted by their physical
addresses?

>
> > +   /* create DMA window from 0 to max(phys_addr + len) */
> > +   /* sPAPR requires window size to be a power of 2 */
> > +   create.window_size = rte_align64pow2(ms[0].phys_addr +
> > ms[0].len);
> >     create.page_shift = __builtin_ctzll(ms->hugepage_sz);
> > -   create.levels = 2;
> > +   create.levels = 1;
> >
> >     ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE,
> > &create);
> >     if (ret) {
> > @@ -771,6 +773,11 @@ vfio_spapr_dma_map(int vfio_container_fd)
> >        return -1;
> >     }
> >
> > +   if (create.start_addr != 0) {
> > +      RTE_LOG(ERR, EAL, "  DMA window start address != 0\n");
> > +      return -1;
> > +   }
> > +
> >     /* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
> >     for (i = 0; i < RTE_MAX_MEMSEG; i++) {
> >        struct vfio_iommu_type1_dma_map dma_map;
> > --
> > 2.7.4
>
  
Anatoly Burakov Aug. 8, 2017, 9:43 a.m. UTC | #3
> From: Jonas Pfefferle1 [mailto:JPF@zurich.ibm.com]
> Sent: Tuesday, August 8, 2017 10:30 AM
> To: Burakov, Anatoly <anatoly.burakov@intel.com>
> Cc: aik@ozlabs.ru; dev@dpdk.org
> Subject: RE: [PATCH v3] vfio: fix sPAPR IOMMU DMA window size
> 
> "Burakov, Anatoly" <anatoly.burakov@intel.com> wrote on 08/08/2017
> 11:15:24 AM:
> 
> > From: "Burakov, Anatoly" <anatoly.burakov@intel.com>
> > To: Jonas Pfefferle <jpf@zurich.ibm.com>
> > Cc: "dev@dpdk.org" <dev@dpdk.org>, "aik@ozlabs.ru" <aik@ozlabs.ru>
> > Date: 08/08/2017 11:18 AM
> > Subject: RE: [PATCH v3] vfio: fix sPAPR IOMMU DMA window size
> >
> > From: Jonas Pfefferle [mailto:jpf@zurich.ibm.com]
> > > Sent: Tuesday, August 8, 2017 9:41 AM
> > > To: Burakov, Anatoly <anatoly.burakov@intel.com>
> > > Cc: dev@dpdk.org; aik@ozlabs.ru; Jonas Pfefferle <jpf@zurich.ibm.com>
> > > Subject: [PATCH v3] vfio: fix sPAPR IOMMU DMA window size
> > >
> > > DMA window size needs to be big enough to span all memory segment's
> > > physical addresses. We do not need multiple levels of IOMMU tables
> > > as we already span ~70TB of physical memory with 16MB hugepages.
> > >
> > > Signed-off-by: Jonas Pfefferle <jpf@zurich.ibm.com>
> > > ---
> > > v2:
> > > * roundup to next power 2 function without loop.
> > >
> > > v3:
> > > * Replace roundup_next_pow2 with rte_align64pow2
> > >
> > >  lib/librte_eal/linuxapp/eal/eal_vfio.c | 13 ++++++++++---
> > >  1 file changed, 10 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c
> > > b/lib/librte_eal/linuxapp/eal/eal_vfio.c
> > > index 946df7e..550c41c 100644
> > > --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
> > > +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
> > > @@ -759,10 +759,12 @@ vfio_spapr_dma_map(int vfio_container_fd)
> > >        return -1;
> > >     }
> > >
> > > -   /* calculate window size based on number of hugepages configured
> > > */
> > > -   create.window_size = rte_eal_get_physmem_size();
> > > +   /* physicaly pages are sorted descending i.e. ms[0].phys_addr is max
> > > */
> >
> > Do we always expect that to be the case in the future? Maybe it
> > would be safer to walk the memsegs list.
> >
> > Thanks,
> > Anatoly
> 
> I had this loop in before but removed it in favor of simplicity.
> If we believe that the ordering is going to change in the future
> I'm happy to bring back the loop. Is there other code which is
> relying on the fact that the memsegs are sorted by their physical
> addresses?

I don't think there is. In any case, I think making assumptions about particulars of memseg organization is not a very good practice.

I seem to recall us doing similar things in other places, so maybe down the line we could introduce a new API (or internal-only) function to get a memseg with min/max address. For now I think a loop will do. 

> 
> >
> > > +   /* create DMA window from 0 to max(phys_addr + len) */
> > > +   /* sPAPR requires window size to be a power of 2 */
> > > +   create.window_size = rte_align64pow2(ms[0].phys_addr +
> > > ms[0].len);
> > >     create.page_shift = __builtin_ctzll(ms->hugepage_sz);
> > > -   create.levels = 2;
> > > +   create.levels = 1;
> > >
> > >     ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE,
> > > &create);
> > >     if (ret) {
> > > @@ -771,6 +773,11 @@ vfio_spapr_dma_map(int vfio_container_fd)
> > >        return -1;
> > >     }
> > >
> > > +   if (create.start_addr != 0) {
> > > +      RTE_LOG(ERR, EAL, "  DMA window start address != 0\n");
> > > +      return -1;
> > > +   }
> > > +
> > >     /* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
> > >     for (i = 0; i < RTE_MAX_MEMSEG; i++) {
> > >        struct vfio_iommu_type1_dma_map dma_map;
> > > --
> > > 2.7.4
> >
  
Jonas Pfefferle1 Aug. 8, 2017, 11:01 a.m. UTC | #4
"Burakov, Anatoly" <anatoly.burakov@intel.com> wrote on 08/08/2017 11:43:43
AM:

> From: "Burakov, Anatoly" <anatoly.burakov@intel.com>
> To: Jonas Pfefferle1 <JPF@zurich.ibm.com>
> Cc: "aik@ozlabs.ru" <aik@ozlabs.ru>, "dev@dpdk.org" <dev@dpdk.org>
> Date: 08/08/2017 11:43 AM
> Subject: RE: [PATCH v3] vfio: fix sPAPR IOMMU DMA window size
>
> > From: Jonas Pfefferle1 [mailto:JPF@zurich.ibm.com]
> > Sent: Tuesday, August 8, 2017 10:30 AM
> > To: Burakov, Anatoly <anatoly.burakov@intel.com>
> > Cc: aik@ozlabs.ru; dev@dpdk.org
> > Subject: RE: [PATCH v3] vfio: fix sPAPR IOMMU DMA window size
> >
> > "Burakov, Anatoly" <anatoly.burakov@intel.com> wrote on 08/08/2017
> > 11:15:24 AM:
> >
> > > From: "Burakov, Anatoly" <anatoly.burakov@intel.com>
> > > To: Jonas Pfefferle <jpf@zurich.ibm.com>
> > > Cc: "dev@dpdk.org" <dev@dpdk.org>, "aik@ozlabs.ru" <aik@ozlabs.ru>
> > > Date: 08/08/2017 11:18 AM
> > > Subject: RE: [PATCH v3] vfio: fix sPAPR IOMMU DMA window size
> > >
> > > From: Jonas Pfefferle [mailto:jpf@zurich.ibm.com]
> > > > Sent: Tuesday, August 8, 2017 9:41 AM
> > > > To: Burakov, Anatoly <anatoly.burakov@intel.com>
> > > > Cc: dev@dpdk.org; aik@ozlabs.ru; Jonas Pfefferle <jpf@zurich.ibm.com>
> > > > Subject: [PATCH v3] vfio: fix sPAPR IOMMU DMA window size
> > > >
> > > > DMA window size needs to be big enough to span all memory segment's
> > > > physical addresses. We do not need multiple levels of IOMMU tables
> > > > as we already span ~70TB of physical memory with 16MB hugepages.
> > > >
> > > > Signed-off-by: Jonas Pfefferle <jpf@zurich.ibm.com>
> > > > ---
> > > > v2:
> > > > * roundup to next power 2 function without loop.
> > > >
> > > > v3:
> > > > * Replace roundup_next_pow2 with rte_align64pow2
> > > >
> > > >  lib/librte_eal/linuxapp/eal/eal_vfio.c | 13 ++++++++++---
> > > >  1 file changed, 10 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c
> > > > b/lib/librte_eal/linuxapp/eal/eal_vfio.c
> > > > index 946df7e..550c41c 100644
> > > > --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
> > > > +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
> > > > @@ -759,10 +759,12 @@ vfio_spapr_dma_map(int vfio_container_fd)
> > > >        return -1;
> > > >     }
> > > >
> > > > -   /* calculate window size based on number of hugepages configured */
> > > > -   create.window_size = rte_eal_get_physmem_size();
> > > > +   /* physicaly pages are sorted descending i.e. ms[0].phys_addr is max */
> > >
> > > Do we always expect that to be the case in the future? Maybe it
> > > would be safer to walk the memsegs list.
> > >
> > > Thanks,
> > > Anatoly
> >
> > I had this loop in before but removed it in favor of simplicity.
> > If we believe that the ordering is going to change in the future
> > I'm happy to bring back the loop. Is there other code which is
> > relying on the fact that the memsegs are sorted by their physical
> > addresses?
>
> I don't think there is. In any case, I think making assumptions
> about particulars of memseg organization is not a very good practice.
>
> I seem to recall us doing similar things in other places, so maybe
> down the line we could introduce a new API (or internal-only)
> function to get a memseg with min/max address. For now I think a
> loop will do.

Ok. Makes sense to me. Let me resubmit a new version with the loop.

>
> >
> > >
> > > > +   /* create DMA window from 0 to max(phys_addr + len) */
> > > > +   /* sPAPR requires window size to be a power of 2 */
> > > > +   create.window_size = rte_align64pow2(ms[0].phys_addr +
> > > > ms[0].len);
> > > >     create.page_shift = __builtin_ctzll(ms->hugepage_sz);
> > > > -   create.levels = 2;
> > > > +   create.levels = 1;
> > > >
> > > >     ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE,
> > > > &create);
> > > >     if (ret) {
> > > > @@ -771,6 +773,11 @@ vfio_spapr_dma_map(int vfio_container_fd)
> > > >        return -1;
> > > >     }
> > > >
> > > > +   if (create.start_addr != 0) {
> > > > +      RTE_LOG(ERR, EAL, "  DMA window start address != 0\n");
> > > > +      return -1;
> > > > +   }
> > > > +
> > > >     /* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
> > > >     for (i = 0; i < RTE_MAX_MEMSEG; i++) {
> > > >        struct vfio_iommu_type1_dma_map dma_map;
> > > > --
> > > > 2.7.4
> > >
>
  

Patch

diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c
index 946df7e..550c41c 100644
--- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
+++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
@@ -759,10 +759,12 @@  vfio_spapr_dma_map(int vfio_container_fd)
 		return -1;
 	}
 
-	/* calculate window size based on number of hugepages configured */
-	create.window_size = rte_eal_get_physmem_size();
+	/* physical pages are sorted descending i.e. ms[0].phys_addr is max */
+	/* create DMA window from 0 to max(phys_addr + len) */
+	/* sPAPR requires window size to be a power of 2 */
+	create.window_size = rte_align64pow2(ms[0].phys_addr + ms[0].len);
 	create.page_shift = __builtin_ctzll(ms->hugepage_sz);
-	create.levels = 2;
+	create.levels = 1;
 
 	ret = ioctl(vfio_container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
 	if (ret) {
@@ -771,6 +773,11 @@  vfio_spapr_dma_map(int vfio_container_fd)
 		return -1;
 	}
 
+	if (create.start_addr != 0) {
+		RTE_LOG(ERR, EAL, "  DMA window start address != 0\n");
+		return -1;
+	}
+
 	/* map all DPDK segments for DMA. use 1:1 PA to IOVA mapping */
 	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
 		struct vfio_iommu_type1_dma_map dma_map;
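
The hunk above is cut off by the archive right where the pre-existing 1:1
mapping loop begins. A simplified sketch of what that loop does, based on the
standard VFIO type1 UAPI rather than a verbatim copy of eal_vfio.c (the real
function also performs sPAPR-specific memory pre-registration, omitted here):

	/* program a 1:1 physical-address-to-IOVA mapping for each memseg */
	for (i = 0; i < RTE_MAX_MEMSEG; i++) {
		struct vfio_iommu_type1_dma_map dma_map;

		if (ms[i].addr == NULL)
			break;

		memset(&dma_map, 0, sizeof(dma_map));
		dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map);
		dma_map.vaddr = ms[i].addr_64;		/* process VA */
		dma_map.size = ms[i].len;
		dma_map.iova = ms[i].phys_addr;		/* IOVA == PA */
		dma_map.flags = VFIO_DMA_MAP_FLAG_READ |
				VFIO_DMA_MAP_FLAG_WRITE;

		if (ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &dma_map)) {
			RTE_LOG(ERR, EAL, "  cannot map DMA region\n");
			return -1;
		}
	}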