vfio: do not coalesce DMA mappings

Message ID 20221230095853.1323616-1-nipun.gupta@amd.com (mailing list archive)
State Superseded, archived
Delegated to: Thomas Monjalon
Headers
Series vfio: do not coalesce DMA mappings

Checks

Context Check Description
ci/checkpatch warning coding style issues
ci/loongarch-compilation success Compilation OK
ci/loongarch-unit-testing success Unit Testing PASS
ci/Intel-compilation success Compilation OK
ci/iol-broadcom-Performance success Performance Testing PASS
ci/intel-Testing success Testing PASS
ci/iol-intel-Functional success Functional Testing PASS
ci/iol-intel-Performance success Performance Testing PASS
ci/github-robot: build success github build: passed
ci/iol-abi-testing success Testing PASS
ci/iol-x86_64-compile-testing success Testing PASS
ci/iol-aarch64-unit-testing success Testing PASS
ci/iol-testing success Testing PASS
ci/iol-x86_64-unit-testing success Testing PASS
ci/iol-aarch64-compile-testing success Testing PASS

Commit Message

Gupta, Nipun Dec. 30, 2022, 9:58 a.m. UTC
At cleanup time, when the DMA unmap is done, the Linux kernel
does not allow unmapping of individual segments that were
coalesced together while creating the DMA map for type1 IOMMU
mappings. So, this change updates the mapping of the memory
segments (hugepages) to be done on a per-page basis.

Signed-off-by: Nipun Gupta <nipun.gupta@amd.com>
---

When hotplug of devices is used, multiple pages get coalesced
and a single mapping gets created for these pages (using the APIs
rte_memseg_contig_walk() and type1_map_contig()). At cleanup
time, when the memory is released, VFIO does not clean up that
memory, and the following error is observed in the EAL for 2MB
hugepages:
EAL: Unexpected size 0 of DMA remapping cleared instead of 2097152

This is because VFIO does not clear the DMA mapping (refer to the kernel
function vfio_dma_do_unmap() -
https://elixir.bootlin.com/linux/latest/source/drivers/vfio/vfio_iommu_type1.c#L1330),
which looks up the mapping to free by its exact IOVA:
https://elixir.bootlin.com/linux/latest/source/drivers/vfio/vfio_iommu_type1.c#L1418.
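
In simplified form, the kernel's type1 unmap path only accepts a range
whose start and end fall exactly on existing mapping boundaries. A rough
sketch of that check (paraphrased from vfio_dma_do_unmap(), not the
verbatim kernel source):

	/* the unmap must start exactly at a mapping's start ... */
	dma = vfio_find_dma(iommu, unmap->iova, 1);
	if (dma && dma->iova != unmap->iova)
		return -EINVAL;
	/* ... and end exactly at a mapping's end */
	dma = vfio_find_dma(iommu, unmap->iova + unmap->size - 1, 0);
	if (dma && dma->iova + dma->size != unmap->iova + unmap->size)
		return -EINVAL;

So unmapping a single hugepage that sits inside a larger coalesced
mapping is rejected, and the EAL sees 0 bytes cleared.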

Thus, this change updates the mappings to be created individually
instead of coalescing them.

 lib/eal/linux/eal_vfio.c | 29 -----------------------------
 1 file changed, 29 deletions(-)
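
Since the patch only deletes the coalescing path, the resulting mapping
routine is simply the page-by-page walk. A minimal sketch of the
post-patch code in lib/eal/linux/eal_vfio.c, reconstructed from the diff
quoted in the discussion below:

	/* after this patch: one VFIO DMA mapping per memseg, so each
	 * mapping can later be unmapped individually at cleanup time
	 */
	static int
	vfio_type1_dma_map(int vfio_container_fd)
	{
		return rte_memseg_walk(type1_map, &vfio_container_fd);
	}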
  

Comments

Ding, Xuan June 29, 2023, 8:21 a.m. UTC | #1
Hi Nipun,

Thank you for taking the time to read this email.

Our QA team found that since the commit "a399d7b5a994: do not coalesce DMA mappings" was introduced,
DPDK testpmd started with the "--no-huge" parameter fails, and shows "EAL: Cannot set up DMA remapping, error 28 (No space left on device)".
So they reported it on the DPDK Bugzilla: https://bugs.dpdk.org/show_bug.cgi?id=1235.

I understand this feature is to keep consistent with the kernel and not allow memory segments to be merged.
The side effect is that testpmd with the "--no-huge" parameter will not be able to start, because too many pages will exceed the capability of the IOMMU.
Is this expected? Should we remove "--no-huge" from our test case?

Regards,
Xuan

> -----Original Message-----
> From: Nipun Gupta <nipun.gupta@amd.com>
> Sent: Friday, December 30, 2022 5:59 PM
> To: dev@dpdk.org; thomas@monjalon.net; Burakov, Anatoly
> <anatoly.burakov@intel.com>; ferruh.yigit@amd.com
> Cc: nikhil.agarwal@amd.com; Nipun Gupta <nipun.gupta@amd.com>
> Subject: [PATCH] vfio: do not coalesce DMA mappings
> 
> [...]
> 
>  lib/eal/linux/eal_vfio.c | 29 -----------------------------
>  1 file changed, 29 deletions(-)
> 
> diff --git a/lib/eal/linux/eal_vfio.c b/lib/eal/linux/eal_vfio.c
> index 549b86ae1d..56edccb0db 100644
> --- a/lib/eal/linux/eal_vfio.c
> +++ b/lib/eal/linux/eal_vfio.c
> @@ -1369,19 +1369,6 @@ rte_vfio_get_group_num(const char *sysfs_base,
>  	return 1;
>  }
>  
> -static int
> -type1_map_contig(const struct rte_memseg_list *msl, const struct rte_memseg *ms,
> -		size_t len, void *arg)
> -{
> -	int *vfio_container_fd = arg;
> -
> -	if (msl->external)
> -		return 0;
> -
> -	return vfio_type1_dma_mem_map(*vfio_container_fd, ms->addr_64, ms->iova,
> -			len, 1);
> -}
> -
>  static int
>  type1_map(const struct rte_memseg_list *msl, const struct rte_memseg *ms,
>  		void *arg)
> @@ -1396,10 +1383,6 @@ type1_map(const struct rte_memseg_list *msl, const struct rte_memseg *ms,
>  	if (ms->iova == RTE_BAD_IOVA)
>  		return 0;
>  
> -	/* if IOVA mode is VA, we've already mapped the internal segments */
> -	if (!msl->external && rte_eal_iova_mode() == RTE_IOVA_VA)
> -		return 0;
> -
>  	return vfio_type1_dma_mem_map(*vfio_container_fd, ms->addr_64, ms->iova,
>  			ms->len, 1);
>  }
> @@ -1464,18 +1447,6 @@ vfio_type1_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
>  static int
>  vfio_type1_dma_map(int vfio_container_fd)
>  {
> -	if (rte_eal_iova_mode() == RTE_IOVA_VA) {
> -		/* with IOVA as VA mode, we can get away with mapping contiguous
> -		 * chunks rather than going page-by-page.
> -		 */
> -		int ret = rte_memseg_contig_walk(type1_map_contig,
> -				&vfio_container_fd);
> -		if (ret)
> -			return ret;
> -		/* we have to continue the walk because we've skipped the
> -		 * external segments during the config walk.
> -		 */
> -	}
>  	return rte_memseg_walk(type1_map, &vfio_container_fd);
>  }
> 
> --
> 2.25.1
  
Gupta, Nipun June 30, 2023, 1:45 a.m. UTC | #2
Hi Xuan,

Thanks for pointing out the issue and figuring out the patch that
introduced it. If you have answers to the queries below, please let me know:

Are there any other test cases that use "--no-huge" and pass?

Also, if we change the "-m" option to request less memory, does the
test pass?

When you mention that too many pages exceed the capability of the IOMMU,
are you referring to the HW capability to create multiple mappings? It seems
that with a 4K page size we need 256K pages, which is hitting a limit?

Regards,
Nipun

On 6/29/2023 1:51 PM, Ding, Xuan wrote:
> [...]
  
Ding, Xuan June 30, 2023, 5:58 a.m. UTC | #3
Hi Nipun,

Replies are inline.

> -----Original Message-----
> From: Nipun Gupta <nipun.gupta@amd.com>
> Sent: Friday, June 30, 2023 9:46 AM
> To: Ding, Xuan <xuan.ding@intel.com>; dev@dpdk.org;
> thomas@monjalon.net; Burakov, Anatoly <anatoly.burakov@intel.com>;
> ferruh.yigit@amd.com
> Cc: nikhil.agarwal@amd.com; He, Xingguang <xingguang.he@intel.com>; Ling,
> WeiX <weix.ling@intel.com>
> Subject: Re: [PATCH] vfio: do not coalesce DMA mappings
> 
> Hi Xuan,
> 
> Thanks for pointing out the issue and figuring out the patch that introduced
> it. If you have answers to the queries below, please let me know:
> 
> Are there any other test cases that use "--no-huge" and pass?

Yes, there are test cases that add the "--no-huge" option to validate 4K page size in async vhost.
Actually, the page size is decided by the front end, so I think this case can be removed.

Previously, testpmd could start with the "--no-huge" option (not sure if there are test cases).
Cmd: ./build/app/dpdk-testpmd -l 5-6 -n 4 --no-huge -m 1024 -- -i

> 
> Also, if we change the "-m" option to request less memory, does the test
> pass?

"-m" option is also added and does not work.

> 
> When you mention that too many pages exceed the capability of the IOMMU,
> are you referring to the HW capability to create multiple mappings? It seems
> that with a 4K page size we need 256K pages, which is hitting a limit?

Yes, this is the result of my initial debugging.
The direct impact is that these testpmd cases cannot start now.
If this is expected, I think we can close this defect and ignore the "--no-huge" option at startup.

Regards,
Xuan

> 
> Regards,
> Nipun
> 
> On 6/29/2023 1:51 PM, Ding, Xuan wrote:
> > [...]
  
Ding, Xuan July 4, 2023, 5:13 a.m. UTC | #4
Hi Nipun,

> [...]

Any insights? Should we just ignore the "--no-huge" option and close this defect?
For now we have done this as a workaround. It seems no one uses the "--no-huge" option in testpmd now.

Thanks,
Xuan

  
Gupta, Nipun July 4, 2023, 6:53 a.m. UTC | #5
Hi Xuan,

Please see inline.

> [...]
> 
> Any insights? Should we just ignore the "--no-huge" option and close this defect?
> For now we have done this as a workaround. It seems no one uses the "--no-huge"
> option in testpmd now.

VFIO supports dma_entry_limit as a module parameter, with a default value of
U16_MAX, i.e. 64K entries; this is most likely what prevents creating the 256K
entries needed for 4K pages here (with -m 1024, 1024 MB / 4 KB = 256K mappings).
The limit can be modified while inserting the vfio module:
        modprobe vfio_iommu_type1 dma_entry_limit=1000000

If you update this limit in the test setup, the test case should pass.
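
For reference, a minimal sketch of the sequence on a host where VFIO is built
as loadable modules (exact module unload order may vary with what is loaded):

    # unload the VFIO modules if already loaded (module users first)
    sudo rmmod vfio_pci vfio_iommu_type1
    # reload the type1 backend with a larger DMA entry limit
    sudo modprobe vfio_iommu_type1 dma_entry_limit=1000000
    sudo modprobe vfio_pci
    # verify the active limit
    cat /sys/module/vfio_iommu_type1/parameters/dma_entry_limit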

Thanks,
Nipun

  
Ding, Xuan July 4, 2023, 8:06 a.m. UTC | #6
Hi Nipun,

> [...]
> 
> VFIO supports dma_entry_limit as a module parameter, with a default
> value of U16_MAX, i.e. 64K entries; this is most likely what prevents
> creating the 256K entries needed for 4K pages here. The limit can be
> modified while inserting the vfio module:
>         modprobe vfio_iommu_type1 dma_entry_limit=1000000

Thanks for your suggestion. I tried it on Ubuntu 22.04, but it does not work.
The reason, I think, is that vfio-pci is built into the kernel (since 20.04), so it does not support dynamic insmod/rmmod.

Does this command require removing the vfio module first and then running modprobe again?

Thanks,
Xuan

  
Gupta, Nipun July 4, 2023, 9:23 a.m. UTC | #7
Hi Xuan,

On 7/4/2023 1:36 PM, Ding, Xuan wrote:
> [...]
> 
> Thanks for your suggestion. I tried it on Ubuntu 22.04, but it does not work.
> The reason, I think, is that vfio-pci is built into the kernel (since 20.04), so it does not support dynamic insmod/rmmod.
> 
> Does this command require removing the vfio module first and then running modprobe again?
> 

If it is inserted as a module, then you can remove it using rmmod and 
modprobe it again with the dma_entry_limit parameter. Also note, 
vfio_iommu_type1 is the module that limits the entries to 64K, so 
this is the module that needs to be inserted again with the 
dma_entry_limit module parameter.

In case the module is built in, you can provide the setting via a kernel 
command line parameter (ref: 
https://www.kernel.org/doc/html/v4.12/admin-guide/kernel-parameters.html). 
As per this reference, "vfio_iommu_type1.dma_entry_limit=1000000" 
should be added to the bootargs to set the module parameter.
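
For example, on a GRUB-based system this could look like the following (a
sketch; the grub config path and update command differ across distributions):

    # in /etc/default/grub, append the parameter to the kernel command line:
    #   GRUB_CMDLINE_LINUX="... vfio_iommu_type1.dma_entry_limit=1000000"
    sudo update-grub    # or: grub2-mkconfig -o /boot/grub2/grub.cfg
    sudo reboot
    # after reboot, verify the active limit:
    cat /sys/module/vfio_iommu_type1/parameters/dma_entry_limit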

FYI, the DPDK documentation also mentions this limitation at:
https://doc.dpdk.org/guides/linux_gsg/linux_drivers.html

Thanks,
Nipun
  
Thomas Monjalon July 4, 2023, 2:09 p.m. UTC | #8
04/07/2023 11:23, Gupta, Nipun:
> [...]
> 
> FYI, the DPDK documentation also mentions this limitation at:
> https://doc.dpdk.org/guides/linux_gsg/linux_drivers.html

Yes, the parameter is discussed in
https://doc.dpdk.org/guides/linux_gsg/linux_drivers.html#vfio-memory-mapping-limits
but it does not mention that we may need to increase it with --no-huge.
Could you please add this to the documentation?
  
Ding, Xuan July 4, 2023, 3:11 p.m. UTC | #9
Hi Nipun,

> [...]

Thanks for pointing out these references.

A supplement for configuring the built-in vfio module: besides adding
"vfio_iommu_type1.dma_entry_limit=1000000" to the bootargs, we can also
raise the limit at runtime through sysfs:
"echo 1000000 > /sys/module/vfio_iommu_type1/parameters/dma_entry_limit".

Both methods work, thanks again.
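For reference, a minimal sketch of both methods (the 1000000 value follows
this thread; the GRUB file location is an assumption for Ubuntu-style
systems). The arithmetic behind needing a larger limit: with "--no-huge
-m 1024" and 4 KiB pages, EAL needs 1024 MiB / 4 KiB = 262144 DMA entries,
well above the default of 65535 (U16_MAX).

    # Method 1: built-in module, set at boot (assumes a GRUB-based system)
    # in /etc/default/grub, append to GRUB_CMDLINE_LINUX:
    #   vfio_iommu_type1.dma_entry_limit=1000000
    sudo update-grub && sudo reboot

    # Method 2: raise the limit at runtime via sysfs (no reboot needed)
    echo 1000000 | sudo tee /sys/module/vfio_iommu_type1/parameters/dma_entry_limit

    # Verify the active limit either way
    cat /sys/module/vfio_iommu_type1/parameters/dma_entry_limit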

Regards,
Xuan

> 
> Thanks,
> Nipun
  
Gupta, Nipun July 4, 2023, 4:39 p.m. UTC | #10
On 7/4/2023 7:39 PM, Thomas Monjalon wrote:
> 04/07/2023 11:23, Gupta, Nipun:
>> [...]
>> In case the module is built in, you can set the parameter via the kernel
>> command line (ref:
>> https://www.kernel.org/doc/html/v4.12/admin-guide/kernel-parameters.html).
>> As per this document, "vfio_iommu_type1.dma_entry_limit=1000000"
>> should be added to the bootargs to set the module parameter.
>>
>> FYI.. DPDK documentation also mentions the limitation at:
>> https://doc.dpdk.org/guides/linux_gsg/linux_drivers.html
> 
> Yes the parameter is discussed in
> https://doc.dpdk.org/guides/linux_gsg/linux_drivers.html#vfio-memory-mapping-limits
> but it does not mention that we may need to increase it with --no-huge.
> Please could you add this to the documentation?

Sure! I'll send out a patch for this.
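A possible shape for that note in the linux_gsg guide (a hypothetical
sketch only, not the patch that was actually submitted):

    When running with ``--no-huge``, EAL maps memory with 4 KiB pages and,
    since DMA mappings are no longer coalesced, creates one VFIO DMA entry
    per page. The default ``dma_entry_limit`` of 65535 is then easily
    exceeded; raise it, for example with
    ``vfio_iommu_type1.dma_entry_limit=1000000``.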

> 
>
  
Gupta, Nipun July 4, 2023, 4:42 p.m. UTC | #11
On 7/4/2023 8:41 PM, Ding, Xuan wrote:
> Hi Nipun,
> 
>> [...]
> 
> Thanks for pointing out these references.
> 
> A supplement for configuring the built-in vfio module: besides adding
> "vfio_iommu_type1.dma_entry_limit=1000000" to the bootargs, we can also
> raise the limit at runtime through sysfs:
> "echo 1000000 > /sys/module/vfio_iommu_type1/parameters/dma_entry_limit".
> 
> Both methods work, thanks again.

Good to know this works!

> 
> Regards,
> Xuan
> 
>>
>> Thanks,
>> Nipun
  

Patch

diff --git a/lib/eal/linux/eal_vfio.c b/lib/eal/linux/eal_vfio.c
index 549b86ae1d..56edccb0db 100644
--- a/lib/eal/linux/eal_vfio.c
+++ b/lib/eal/linux/eal_vfio.c
@@ -1369,19 +1369,6 @@  rte_vfio_get_group_num(const char *sysfs_base,
 	return 1;
 }
 
-static int
-type1_map_contig(const struct rte_memseg_list *msl, const struct rte_memseg *ms,
-		size_t len, void *arg)
-{
-	int *vfio_container_fd = arg;
-
-	if (msl->external)
-		return 0;
-
-	return vfio_type1_dma_mem_map(*vfio_container_fd, ms->addr_64, ms->iova,
-			len, 1);
-}
-
 static int
 type1_map(const struct rte_memseg_list *msl, const struct rte_memseg *ms,
 		void *arg)
@@ -1396,10 +1383,6 @@  type1_map(const struct rte_memseg_list *msl, const struct rte_memseg *ms,
 	if (ms->iova == RTE_BAD_IOVA)
 		return 0;
 
-	/* if IOVA mode is VA, we've already mapped the internal segments */
-	if (!msl->external && rte_eal_iova_mode() == RTE_IOVA_VA)
-		return 0;
-
 	return vfio_type1_dma_mem_map(*vfio_container_fd, ms->addr_64, ms->iova,
 			ms->len, 1);
 }
@@ -1464,18 +1447,6 @@  vfio_type1_dma_mem_map(int vfio_container_fd, uint64_t vaddr, uint64_t iova,
 static int
 vfio_type1_dma_map(int vfio_container_fd)
 {
-	if (rte_eal_iova_mode() == RTE_IOVA_VA) {
-		/* with IOVA as VA mode, we can get away with mapping contiguous
-		 * chunks rather than going page-by-page.
-		 */
-		int ret = rte_memseg_contig_walk(type1_map_contig,
-				&vfio_container_fd);
-		if (ret)
-			return ret;
-		/* we have to continue the walk because we've skipped the
-		 * external segments during the config walk.
-		 */
-	}
 	return rte_memseg_walk(type1_map, &vfio_container_fd);
 }
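For context, here is a minimal sketch of the type1 map/unmap ioctls that the
per-segment walk above ends up issuing, using the standard linux/vfio.h UAPI
(illustrative only, not DPDK's exact code; container setup and error handling
are omitted). It shows why one entry per page matters: the kernel unmaps only
whole entries and reports back how many bytes it cleared, so asking to unmap
part of a coalesced entry clears nothing.

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Map one memory segment for DMA. With this patch each memseg gets its
     * own VFIO entry, so it can later be unmapped with the same iova/size. */
    static int
    map_one(int container_fd, uint64_t vaddr, uint64_t iova, uint64_t len)
    {
        struct vfio_iommu_type1_dma_map dma_map = {
            .argsz = sizeof(dma_map),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = vaddr,
            .iova = iova,
            .size = len,
        };

        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
    }

    /* Unmap one segment. The kernel writes the number of bytes it actually
     * unmapped back into dma_unmap.size; for a coalesced entry, a partial
     * unmap request clears nothing and this comes back as 0. */
    static int
    unmap_one(int container_fd, uint64_t iova, uint64_t len)
    {
        struct vfio_iommu_type1_dma_unmap dma_unmap = {
            .argsz = sizeof(dma_unmap),
            .iova = iova,
            .size = len,
        };

        if (ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &dma_unmap) != 0)
            return -1;

        return dma_unmap.size == len ? 0 : -1;
    }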