[v1,1/1] kernel/linux: introduce vfio_pf kernel module
diff mbox series

Message ID 20190906091230.13923-1-vattunuru@marvell.com
State New
Delegated to: Thomas Monjalon
Headers show
Series
  • [v1,1/1] kernel/linux: introduce vfio_pf kernel module
Related show

Checks

Context Check Description
ci/checkpatch success coding style OK
ci/Intel-compilation success Compilation OK
ci/iol-dpdk_compile_ovs success Compile Testing PASS
ci/iol-dpdk_compile success Compile Testing PASS
ci/iol-dpdk_compile_spdk success Compile Testing PASS
ci/intel-Performance success Performance Testing PASS
ci/mellanox-Performance success Performance Testing PASS

Commit Message

Vamsi Krishna Attunuru Sept. 6, 2019, 9:12 a.m. UTC
From: Vamsi Attunuru <vattunuru@marvell.com>

The DPDK use case such as VF representer or OVS offload etc
would call for PF and VF PCIe devices to bind vfio-pci
module to enable IOMMU protection.

In addition to vSwitch use case, unlike, other PCI class of
devices, Network class of PCIe devices would have additional
responsibility on the PF devices such as promiscuous mode support
etc.

The above use cases demand VFIO needs bound to PF and its
VF devices. This is use case is not supported in Linux kernel,
due to a security issue where it is possible to have
DoS in case if VF attached to guest over vfio-pci and netdev
kernel driver runs on it and which something VF representer
would like to enable it.

Since we can not differentiate, the vfio-pci bounded VF devices
runs DPDK application or netdev driver in guest, we can not
introduce any scheme to fix DoS case and therefore not have
proper support of this in the upstream kernel.

The igb_uio enables such PF and VF binding support for
non-iommu devices to make VF representer or OVS offload
run on non-iommu devices with DoS vulnerability for netdev driver
as VF.

This kernel module, facilitate to enable SRIOV on PF devices,
therefore, to run both PF and VF devices in VFIO mode knowing
its impacts like igb_uio driver functions of non-iommu devices.

Signed-off-by: Vamsi Attunuru <vattunuru@marvell.com>
Signed-off-by: Jerin Jacob <jerinj@marvell.com>
---
 MAINTAINERS                       |   6 +
 config/common_linux               |   1 +
 doc/guides/prog_guide/index.rst   |   1 +
 doc/guides/prog_guide/vfio_pf.rst |  90 ++++++++++
 kernel/linux/Makefile             |   1 +
 kernel/linux/meson.build          |   2 +-
 kernel/linux/vfio_pf/Kbuild       |   2 +
 kernel/linux/vfio_pf/Makefile     |  24 +++
 kernel/linux/vfio_pf/meson.build  |  20 +++
 kernel/linux/vfio_pf/vfio_pf.c    | 360 ++++++++++++++++++++++++++++++++++++++
 10 files changed, 506 insertions(+), 1 deletion(-)

Comments

Thomas Monjalon Sept. 6, 2019, 9:45 a.m. UTC | #1
06/09/2019 11:12, vattunuru@marvell.com:
> From: Vamsi Attunuru <vattunuru@marvell.com>
> 
> The DPDK use case such as VF representer or OVS offload etc
> would call for PF and VF PCIe devices to bind vfio-pci
> module to enable IOMMU protection.
> 
> In addition to vSwitch use case, unlike, other PCI class of
> devices, Network class of PCIe devices would have additional
> responsibility on the PF devices such as promiscuous mode support
> etc.
> 
> The above use cases demand VFIO needs bound to PF and its
> VF devices. This is use case is not supported in Linux kernel,
> due to a security issue where it is possible to have
> DoS in case if VF attached to guest over vfio-pci and netdev
> kernel driver runs on it and which something VF representer
> would like to enable it.
> 
> Since we can not differentiate, the vfio-pci bounded VF devices
> runs DPDK application or netdev driver in guest, we can not
> introduce any scheme to fix DoS case and therefore not have
> proper support of this in the upstream kernel.
> 
> The igb_uio enables such PF and VF binding support for
> non-iommu devices to make VF representer or OVS offload
> run on non-iommu devices with DoS vulnerability for netdev driver
> as VF.
> 
> This kernel module, facilitate to enable SRIOV on PF devices,
> therefore, to run both PF and VF devices in VFIO mode knowing
> its impacts like igb_uio driver functions of non-iommu devices.
> 
> Signed-off-by: Vamsi Attunuru <vattunuru@marvell.com>
> Signed-off-by: Jerin Jacob <jerinj@marvell.com>

Sorry I fail to properly understand the explanation above.
Please try to split in shorter sentences.

About the request to add an out-of-tree Linux kernel driver,
I guess Jerin is well aware that we don't want such anymore.
Jerin Jacob Kollanukkaran Sept. 6, 2019, 1:27 p.m. UTC | #2
> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Friday, September 6, 2019 3:15 PM
> To: Vamsi Krishna Attunuru <vattunuru@marvell.com>
> Cc: dev@dpdk.org; Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> Subject: Re: [dpdk-dev] [PATCH v1 1/1] kernel/linux: introduce vfio_pf kernel
> module
> 
> 06/09/2019 11:12, vattunuru@marvell.com:
> > From: Vamsi Attunuru <vattunuru@marvell.com>
> >
> > The DPDK use case such as VF representer or OVS offload etc would call
> > for PF and VF PCIe devices to bind vfio-pci module to enable IOMMU
> > protection.
> >
> > In addition to vSwitch use case, unlike, other PCI class of devices,
> > Network class of PCIe devices would have additional responsibility on
> > the PF devices such as promiscuous mode support etc.
> >
> > The above use cases demand VFIO needs bound to PF and its VF devices.
> > This is use case is not supported in Linux kernel, due to a security
> > issue where it is possible to have DoS in case if VF attached to guest
> > over vfio-pci and netdev kernel driver runs on it and which something
> > VF representer would like to enable it.
> >
> > Since we can not differentiate, the vfio-pci bounded VF devices runs
> > DPDK application or netdev driver in guest, we can not introduce any
> > scheme to fix DoS case and therefore not have proper support of this
> > in the upstream kernel.
> >
> > The igb_uio enables such PF and VF binding support for non-iommu
> > devices to make VF representer or OVS offload run on non-iommu devices
> > with DoS vulnerability for netdev driver as VF.
> >
> > This kernel module, facilitate to enable SRIOV on PF devices,
> > therefore, to run both PF and VF devices in VFIO mode knowing its
> > impacts like igb_uio driver functions of non-iommu devices.
> >
> > Signed-off-by: Vamsi Attunuru <vattunuru@marvell.com>
> > Signed-off-by: Jerin Jacob <jerinj@marvell.com>
> 
> Sorry I fail to properly understand the explanation above.
> Please try to split in shorter sentences.
> 
> About the request to add an out-of-tree Linux kernel driver, I guess Jerin is well
> aware that we don't want such anymore.

Yes. I am aware of it. I don't like the out of tree modules either. But, This case,
I suggested Vamsi to have out of tree module.

Let me describe the issue and let us discuss how to tackle the  problem:

# Linux kernel wont allow VFIO PF to have SRIOV enable.

Patches and on going discussion are here:
https://patchwork.kernel.org/patch/10522381/
https://lwn.net/Articles/748526/

Based on my understanding the reason for NOT allowing the
VFIO PF to have SRIOV enable is genuine from kernel point of
View but not from DPDK point of view.

Here is the sequence  to describe the problem
1) Consider Linux kernel allowed VFIO PCI SRIOV enable
2) PF bound to vfio-pci
3) using SRIOV infrastructure of vfio-pci  PF driver,
VFs  are created
4) DPDK application bound to PF and VF, No issue here.
5) Assume DPDK application bound to PF and VF bound
To netdev kernel driver. Now, there is a genuine  concern
From kernel point of view that, DPDK PF can intercept,
VF mailbox message or so and deny the Kernel request
Or what if DPDK PF application crashes?

To avoid the case (5), (3) is not allowed in stock kernel.
Which makes sense IMO.

Now, From DPDK PoV, step 5 is valid as we have
Rte_flow's VF action etc used to enable such case.
Where, user can program the PF's rte_flow to steer
Some traffic to VF, where VF can be, DPDK application or
Linux kernel netdev driver.

This patch enables the step (3) to enable step (5) from DPDK
PoV. i.e DPDK needs to allow PF to bind to DPDK with VFs.

Why this issue now:
- igb_uio kernel driver is used as enabling step (3)
See store_max_vfs() kernel/linux/igb_uio/igb_uio.c
 This is fine for non-iommu device, IOMMU devices
needs VFIO.
- We would like support VFIO for IOMMU protection
And enable step (5) as DPDK supports form the spec level.
i.e need to fix feature disparity between iommu vs
non-iommu based devices.

Note:
We may not need a  brand new kernel module, we could move
this logic to igb_uio if maintenance is concern.







 


 

 




- 
 






>
Vamsi Krishna Attunuru Sept. 25, 2019, 4:06 a.m. UTC | #3
Hi,

We would like to have these support in 19.11. Please let us know if there are any comments or suggestions, we can discuss and address accordingly.

-----Original Message-----
From: Jerin Jacob Kollanukkaran <jerinj@marvell.com> 
Sent: Friday, September 6, 2019 6:58 PM
To: Thomas Monjalon <thomas@monjalon.net>; Vamsi Krishna Attunuru <vattunuru@marvell.com>
Cc: dev@dpdk.org
Subject: RE: [dpdk-dev] [PATCH v1 1/1] kernel/linux: introduce vfio_pf kernel module

> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Friday, September 6, 2019 3:15 PM
> To: Vamsi Krishna Attunuru <vattunuru@marvell.com>
> Cc: dev@dpdk.org; Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> Subject: Re: [dpdk-dev] [PATCH v1 1/1] kernel/linux: introduce vfio_pf 
> kernel module
> 
> 06/09/2019 11:12, vattunuru@marvell.com:
> > From: Vamsi Attunuru <vattunuru@marvell.com>
> >
> > The DPDK use case such as VF representer or OVS offload etc would 
> > call for PF and VF PCIe devices to bind vfio-pci module to enable 
> > IOMMU protection.
> >
> > In addition to vSwitch use case, unlike, other PCI class of devices, 
> > Network class of PCIe devices would have additional responsibility 
> > on the PF devices such as promiscuous mode support etc.
> >
> > The above use cases demand VFIO needs bound to PF and its VF devices.
> > This is use case is not supported in Linux kernel, due to a security 
> > issue where it is possible to have DoS in case if VF attached to 
> > guest over vfio-pci and netdev kernel driver runs on it and which 
> > something VF representer would like to enable it.
> >
> > Since we can not differentiate, the vfio-pci bounded VF devices runs 
> > DPDK application or netdev driver in guest, we can not introduce any 
> > scheme to fix DoS case and therefore not have proper support of this 
> > in the upstream kernel.
> >
> > The igb_uio enables such PF and VF binding support for non-iommu 
> > devices to make VF representer or OVS offload run on non-iommu 
> > devices with DoS vulnerability for netdev driver as VF.
> >
> > This kernel module, facilitate to enable SRIOV on PF devices, 
> > therefore, to run both PF and VF devices in VFIO mode knowing its 
> > impacts like igb_uio driver functions of non-iommu devices.
> >
> > Signed-off-by: Vamsi Attunuru <vattunuru@marvell.com>
> > Signed-off-by: Jerin Jacob <jerinj@marvell.com>
> 
> Sorry I fail to properly understand the explanation above.
> Please try to split in shorter sentences.
> 
> About the request to add an out-of-tree Linux kernel driver, I guess 
> Jerin is well aware that we don't want such anymore.

Yes. I am aware of it. I don't like the out of tree modules either. But, This case, I suggested Vamsi to have out of tree module.

Let me describe the issue and let us discuss how to tackle the  problem:

# Linux kernel wont allow VFIO PF to have SRIOV enable.

Patches and on going discussion are here:
https://patchwork.kernel.org/patch/10522381/
https://lwn.net/Articles/748526/

Based on my understanding the reason for NOT allowing the VFIO PF to have SRIOV enable is genuine from kernel point of View but not from DPDK point of view.

Here is the sequence  to describe the problem
1) Consider Linux kernel allowed VFIO PCI SRIOV enable
2) PF bound to vfio-pci
3) using SRIOV infrastructure of vfio-pci  PF driver, VFs  are created
4) DPDK application bound to PF and VF, No issue here.
5) Assume DPDK application bound to PF and VF bound To netdev kernel driver. Now, there is a genuine  concern From kernel point of view that, DPDK PF can intercept, VF mailbox message or so and deny the Kernel request Or what if DPDK PF application crashes?

To avoid the case (5), (3) is not allowed in stock kernel.
Which makes sense IMO.

Now, From DPDK PoV, step 5 is valid as we have Rte_flow's VF action etc used to enable such case.
Where, user can program the PF's rte_flow to steer Some traffic to VF, where VF can be, DPDK application or Linux kernel netdev driver.

This patch enables the step (3) to enable step (5) from DPDK PoV. i.e DPDK needs to allow PF to bind to DPDK with VFs.

Why this issue now:
- igb_uio kernel driver is used as enabling step (3) See store_max_vfs() kernel/linux/igb_uio/igb_uio.c  This is fine for non-iommu device, IOMMU devices needs VFIO.
- We would like support VFIO for IOMMU protection And enable step (5) as DPDK supports form the spec level.
i.e need to fix feature disparity between iommu vs non-iommu based devices.

Note:
We may not need a  brand new kernel module, we could move this logic to igb_uio if maintenance is concern.







 


 

 




- 
 






>
Andrew Rybchenko Sept. 25, 2019, 7:18 a.m. UTC | #4
On 9/25/19 7:06 AM, Vamsi Krishna Attunuru wrote:
> Hi,
>
> We would like to have these support in 19.11. Please let us know if there are any comments or suggestions, we can discuss and address accordingly.
>
> -----Original Message-----
> From: Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> Sent: Friday, September 6, 2019 6:58 PM
> To: Thomas Monjalon <thomas@monjalon.net>; Vamsi Krishna Attunuru <vattunuru@marvell.com>
> Cc: dev@dpdk.org
> Subject: RE: [dpdk-dev] [PATCH v1 1/1] kernel/linux: introduce vfio_pf kernel module
>
>> -----Original Message-----
>> From: Thomas Monjalon <thomas@monjalon.net>
>> Sent: Friday, September 6, 2019 3:15 PM
>> To: Vamsi Krishna Attunuru <vattunuru@marvell.com>
>> Cc: dev@dpdk.org; Jerin Jacob Kollanukkaran <jerinj@marvell.com>
>> Subject: Re: [dpdk-dev] [PATCH v1 1/1] kernel/linux: introduce vfio_pf
>> kernel module
>>
>> 06/09/2019 11:12, vattunuru@marvell.com:
>>> From: Vamsi Attunuru <vattunuru@marvell.com>
>>>
>>> The DPDK use case such as VF representer or OVS offload etc would
>>> call for PF and VF PCIe devices to bind vfio-pci module to enable
>>> IOMMU protection.
>>>
>>> In addition to vSwitch use case, unlike, other PCI class of devices,
>>> Network class of PCIe devices would have additional responsibility
>>> on the PF devices such as promiscuous mode support etc.
>>>
>>> The above use cases demand VFIO needs bound to PF and its VF devices.
>>> This is use case is not supported in Linux kernel, due to a security
>>> issue where it is possible to have DoS in case if VF attached to
>>> guest over vfio-pci and netdev kernel driver runs on it and which
>>> something VF representer would like to enable it.
>>>
>>> Since we can not differentiate, the vfio-pci bounded VF devices runs
>>> DPDK application or netdev driver in guest, we can not introduce any
>>> scheme to fix DoS case and therefore not have proper support of this
>>> in the upstream kernel.
>>>
>>> The igb_uio enables such PF and VF binding support for non-iommu
>>> devices to make VF representer or OVS offload run on non-iommu
>>> devices with DoS vulnerability for netdev driver as VF.
>>>
>>> This kernel module, facilitate to enable SRIOV on PF devices,
>>> therefore, to run both PF and VF devices in VFIO mode knowing its
>>> impacts like igb_uio driver functions of non-iommu devices.
>>>
>>> Signed-off-by: Vamsi Attunuru <vattunuru@marvell.com>
>>> Signed-off-by: Jerin Jacob <jerinj@marvell.com>
>> Sorry I fail to properly understand the explanation above.
>> Please try to split in shorter sentences.
>>
>> About the request to add an out-of-tree Linux kernel driver, I guess
>> Jerin is well aware that we don't want such anymore.
> Yes. I am aware of it. I don't like the out of tree modules either. But, This case, I suggested Vamsi to have out of tree module.
>
> Let me describe the issue and let us discuss how to tackle the  problem:
>
> # Linux kernel wont allow VFIO PF to have SRIOV enable.
>
> Patches and on going discussion are here:
> https://patchwork.kernel.org/patch/10522381/
> https://lwn.net/Articles/748526/
>
> Based on my understanding the reason for NOT allowing the VFIO PF to have SRIOV enable is genuine from kernel point of View but not from DPDK point of view.
>
> Here is the sequence  to describe the problem
> 1) Consider Linux kernel allowed VFIO PCI SRIOV enable
> 2) PF bound to vfio-pci
> 3) using SRIOV infrastructure of vfio-pci  PF driver, VFs  are created
> 4) DPDK application bound to PF and VF, No issue here.
> 5) Assume DPDK application bound to PF and VF bound To netdev kernel driver. Now, there is a genuine  concern From kernel point of view that, DPDK PF can intercept, VF mailbox message or so and deny the Kernel request Or what if DPDK PF application crashes?
>
> To avoid the case (5), (3) is not allowed in stock kernel.
> Which makes sense IMO.
>
> Now, From DPDK PoV, step 5 is valid as we have Rte_flow's VF action etc used to enable such case.
> Where, user can program the PF's rte_flow to steer Some traffic to VF, where VF can be, DPDK application or Linux kernel netdev driver.
>
> This patch enables the step (3) to enable step (5) from DPDK PoV. i.e DPDK needs to allow PF to bind to DPDK with VFs.
>
> Why this issue now:
> - igb_uio kernel driver is used as enabling step (3) See store_max_vfs() kernel/linux/igb_uio/igb_uio.c  This is fine for non-iommu device, IOMMU devices needs VFIO.
> - We would like support VFIO for IOMMU protection And enable step (5) as DPDK supports form the spec level.
> i.e need to fix feature disparity between iommu vs non-iommu based devices.
>
> Note:
> We may not need a  brand new kernel module, we could move this logic to igb_uio if maintenance is concern.

It is really a problem for not bifurcated drivers and I agree with Jerin 
above.

One driver could be better, but I'm not sure and I'm afraid it will make too
many changes in igb_uio and result in instabilities/bugs. I have no strong
opinion if we can take the risk with igb_uio - simply have no enough
information if it is widely used or Linux in-kernel drivers are preferred.

Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
Vamsi Krishna Attunuru Oct. 8, 2019, 5:07 a.m. UTC | #5
> -----Original Message-----
> From: Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> Sent: Friday, September 6, 2019 6:58 PM
> To: Thomas Monjalon <thomas@monjalon.net>; Vamsi Krishna Attunuru
> <vattunuru@marvell.com>
> Cc: dev@dpdk.org
> Subject: RE: [dpdk-dev] [PATCH v1 1/1] kernel/linux: introduce vfio_pf kernel
> module
> 
> > -----Original Message-----
> > From: Thomas Monjalon <thomas@monjalon.net>
> > Sent: Friday, September 6, 2019 3:15 PM
> > To: Vamsi Krishna Attunuru <vattunuru@marvell.com>
> > Cc: dev@dpdk.org; Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> > Subject: Re: [dpdk-dev] [PATCH v1 1/1] kernel/linux: introduce vfio_pf
> > kernel module
> >
> > 06/09/2019 11:12, vattunuru@marvell.com:
> > > From: Vamsi Attunuru <vattunuru@marvell.com>
> > >
> > > The DPDK use case such as VF representer or OVS offload etc would
> > > call for PF and VF PCIe devices to bind vfio-pci module to enable
> > > IOMMU protection.
> > >
> > > In addition to vSwitch use case, unlike, other PCI class of devices,
> > > Network class of PCIe devices would have additional responsibility
> > > on the PF devices such as promiscuous mode support etc.
> > >
> > > The above use cases demand VFIO needs bound to PF and its VF devices.
> > > This is use case is not supported in Linux kernel, due to a security
> > > issue where it is possible to have DoS in case if VF attached to
> > > guest over vfio-pci and netdev kernel driver runs on it and which
> > > something VF representer would like to enable it.
> > >
> > > Since we can not differentiate, the vfio-pci bounded VF devices runs
> > > DPDK application or netdev driver in guest, we can not introduce any
> > > scheme to fix DoS case and therefore not have proper support of this
> > > in the upstream kernel.
> > >
> > > The igb_uio enables such PF and VF binding support for non-iommu
> > > devices to make VF representer or OVS offload run on non-iommu
> > > devices with DoS vulnerability for netdev driver as VF.
> > >
> > > This kernel module, facilitate to enable SRIOV on PF devices,
> > > therefore, to run both PF and VF devices in VFIO mode knowing its
> > > impacts like igb_uio driver functions of non-iommu devices.
> > >
> > > Signed-off-by: Vamsi Attunuru <vattunuru@marvell.com>
> > > Signed-off-by: Jerin Jacob <jerinj@marvell.com>
> >
> > Sorry I fail to properly understand the explanation above.
> > Please try to split in shorter sentences.
> >
> > About the request to add an out-of-tree Linux kernel driver, I guess
> > Jerin is well aware that we don't want such anymore.
> 
> Yes. I am aware of it. I don't like the out of tree modules either. But, This
> case, I suggested Vamsi to have out of tree module.
> 
> Let me describe the issue and let us discuss how to tackle the  problem:
> 
> # Linux kernel wont allow VFIO PF to have SRIOV enable.
> 
> Patches and on going discussion are here:
> https://patchwork.kernel.org/patch/10522381/
> https://lwn.net/Articles/748526/
> 
> Based on my understanding the reason for NOT allowing the VFIO PF to have
> SRIOV enable is genuine from kernel point of View but not from DPDK point
> of view.
> 
> Here is the sequence  to describe the problem
> 1) Consider Linux kernel allowed VFIO PCI SRIOV enable
> 2) PF bound to vfio-pci
> 3) using SRIOV infrastructure of vfio-pci  PF driver, VFs  are created
> 4) DPDK application bound to PF and VF, No issue here.
> 5) Assume DPDK application bound to PF and VF bound To netdev kernel
> driver. Now, there is a genuine  concern From kernel point of view that, DPDK
> PF can intercept, VF mailbox message or so and deny the Kernel request Or
> what if DPDK PF application crashes?
> 
> To avoid the case (5), (3) is not allowed in stock kernel.
> Which makes sense IMO.
> 
> Now, From DPDK PoV, step 5 is valid as we have Rte_flow's VF action etc
> used to enable such case.
> Where, user can program the PF's rte_flow to steer Some traffic to VF, where
> VF can be, DPDK application or Linux kernel netdev driver.
> 
> This patch enables the step (3) to enable step (5) from DPDK PoV. i.e DPDK
> needs to allow PF to bind to DPDK with VFs.
> 
> Why this issue now:
> - igb_uio kernel driver is used as enabling step (3) See store_max_vfs()
> kernel/linux/igb_uio/igb_uio.c  This is fine for non-iommu device, IOMMU
> devices needs VFIO.
> - We would like support VFIO for IOMMU protection And enable step (5) as
> DPDK supports form the spec level.
> i.e need to fix feature disparity between iommu vs non-iommu based
> devices.
> 
> Note:
> We may not need a  brand new kernel module, we could move this logic to
> igb_uio if maintenance is concern.
> 

@All, we are expecting to merge this in 19.11 release and if any one have comments please respond.

> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> -
> 
> 
> 
> 
> 
> 
> 
> >
Stephen Hemminger Oct. 8, 2019, 3:12 p.m. UTC | #6
On Fri, 6 Sep 2019 14:42:30 +0530
<vattunuru@marvell.com> wrote:

> From: Vamsi Attunuru <vattunuru@marvell.com>
> 
> The DPDK use case such as VF representer or OVS offload etc
> would call for PF and VF PCIe devices to bind vfio-pci
> module to enable IOMMU protection.
> 
> In addition to vSwitch use case, unlike, other PCI class of
> devices, Network class of PCIe devices would have additional
> responsibility on the PF devices such as promiscuous mode support
> etc.
> 
> The above use cases demand VFIO needs bound to PF and its
> VF devices. This is use case is not supported in Linux kernel,
> due to a security issue where it is possible to have
> DoS in case if VF attached to guest over vfio-pci and netdev
> kernel driver runs on it and which something VF representer
> would like to enable it.
> 
> Since we can not differentiate, the vfio-pci bounded VF devices
> runs DPDK application or netdev driver in guest, we can not
> introduce any scheme to fix DoS case and therefore not have
> proper support of this in the upstream kernel.
> 
> The igb_uio enables such PF and VF binding support for
> non-iommu devices to make VF representer or OVS offload
> run on non-iommu devices with DoS vulnerability for netdev driver
> as VF.
> 
> This kernel module, facilitate to enable SRIOV on PF devices,
> therefore, to run both PF and VF devices in VFIO mode knowing
> its impacts like igb_uio driver functions of non-iommu devices.
> 
> Signed-off-by: Vamsi Attunuru <vattunuru@marvell.com>
> Signed-off-by: Jerin Jacob <jerinj@marvell.com>

NAK
Having kernel drivers  not in upstream kernel is a long term
maintenance and security risk. Please work with upstream kernel
developers to get this merged there.
Jerin Jacob Oct. 8, 2019, 3:28 p.m. UTC | #7
On Tue, 8 Oct, 2019, 8:42 PM Stephen Hemminger, <stephen@networkplumber.org>
wrote:

> On Fri, 6 Sep 2019 14:42:30 +0530
> <vattunuru@marvell.com> wrote:
>
> > From: Vamsi Attunuru <vattunuru@marvell.com>
> >
> > The DPDK use case such as VF representer or OVS offload etc
> > would call for PF and VF PCIe devices to bind vfio-pci
> > module to enable IOMMU protection.
> >
> > In addition to vSwitch use case, unlike, other PCI class of
> > devices, Network class of PCIe devices would have additional
> > responsibility on the PF devices such as promiscuous mode support
> > etc.
> >
> > The above use cases demand VFIO needs bound to PF and its
> > VF devices. This is use case is not supported in Linux kernel,
> > due to a security issue where it is possible to have
> > DoS in case if VF attached to guest over vfio-pci and netdev
> > kernel driver runs on it and which something VF representer
> > would like to enable it.
> >
> > Since we can not differentiate, the vfio-pci bounded VF devices
> > runs DPDK application or netdev driver in guest, we can not
> > introduce any scheme to fix DoS case and therefore not have
> > proper support of this in the upstream kernel.
> >
> > The igb_uio enables such PF and VF binding support for
> > non-iommu devices to make VF representer or OVS offload
> > run on non-iommu devices with DoS vulnerability for netdev driver
> > as VF.
> >
> > This kernel module, facilitate to enable SRIOV on PF devices,
> > therefore, to run both PF and VF devices in VFIO mode knowing
> > its impacts like igb_uio driver functions of non-iommu devices.
> >
> > Signed-off-by: Vamsi Attunuru <vattunuru@marvell.com>
> > Signed-off-by: Jerin Jacob <jerinj@marvell.com>
>
> NAK
> Having kernel drivers  not in upstream kernel is a long term
> maintenance and security risk. Please work with upstream kernel
> developers to get this merged there.
>

There is security issue in attaching DPDK PF driver and netdev bind to VF.
So this scheme is not upsteamble to Linux kernel. Since rte_flow had VF
action. We need this scheme to support VF action with VFIO. So, Out of tree
is the only way as it is DPDK specific feature. Already sent patches to
Linux kernel, it make sense to not accept this in upstream.  We are already
exposing such features through igb-uio for non VFIO device. IMO, there
should not be any disparity between igb-uio and VFIO in DPDK.

If we are against out of tree module, let's remove igb-uio as well. We
can't have different treatment for similar issues.
Stephen Hemminger Oct. 9, 2019, 11:28 p.m. UTC | #8
On Tue, 8 Oct 2019 20:58:27 +0530
Jerin Jacob <jerinjacobk@gmail.com> wrote:

> On Tue, 8 Oct, 2019, 8:42 PM Stephen Hemminger, <stephen@networkplumber.org>
> wrote:
> 
> > On Fri, 6 Sep 2019 14:42:30 +0530
> > <vattunuru@marvell.com> wrote:
> >  
> > > From: Vamsi Attunuru <vattunuru@marvell.com>
> > >
> > > The DPDK use case such as VF representer or OVS offload etc
> > > would call for PF and VF PCIe devices to bind vfio-pci
> > > module to enable IOMMU protection.
> > >
> > > In addition to vSwitch use case, unlike, other PCI class of
> > > devices, Network class of PCIe devices would have additional
> > > responsibility on the PF devices such as promiscuous mode support
> > > etc.
> > >
> > > The above use cases demand VFIO needs bound to PF and its
> > > VF devices. This is use case is not supported in Linux kernel,
> > > due to a security issue where it is possible to have
> > > DoS in case if VF attached to guest over vfio-pci and netdev
> > > kernel driver runs on it and which something VF representer
> > > would like to enable it.
> > >
> > > Since we can not differentiate, the vfio-pci bounded VF devices
> > > runs DPDK application or netdev driver in guest, we can not
> > > introduce any scheme to fix DoS case and therefore not have
> > > proper support of this in the upstream kernel.
> > >
> > > The igb_uio enables such PF and VF binding support for
> > > non-iommu devices to make VF representer or OVS offload
> > > run on non-iommu devices with DoS vulnerability for netdev driver
> > > as VF.
> > >
> > > This kernel module, facilitate to enable SRIOV on PF devices,
> > > therefore, to run both PF and VF devices in VFIO mode knowing
> > > its impacts like igb_uio driver functions of non-iommu devices.
> > >
> > > Signed-off-by: Vamsi Attunuru <vattunuru@marvell.com>
> > > Signed-off-by: Jerin Jacob <jerinj@marvell.com>  
> >
> > NAK
> > Having kernel drivers  not in upstream kernel is a long term
> > maintenance and security risk. Please work with upstream kernel
> > developers to get this merged there.
> >  
> 
> There is security issue in attaching DPDK PF driver and netdev bind to VF.
> So this scheme is not upsteamble to Linux kernel. Since rte_flow had VF
> action. We need this scheme to support VF action with VFIO. So, Out of tree
> is the only way as it is DPDK specific feature. Already sent patches to
> Linux kernel, it make sense to not accept this in upstream.  We are already
> exposing such features through igb-uio for non VFIO device. IMO, there
> should not be any disparity between igb-uio and VFIO in DPDK.
> 
> If we are against out of tree module, let's remove igb-uio as well. We
> can't have different treatment for similar issues.

You are right igb-uio is legacy and only exists for users unwilling to
change (such as old kernels which don't support vfio noiommu mode).
Eventually it should go as well. And the same for KNI.

Understand the pain, but this feels like a child who gets a no
answer from one parent and therefore thinks they can ask the other parent
and get a yes.

Who is the person in the upstream kernel community that needs to understand
what is needed. Maybe another "I know what I am doing" kernel flag is needed
like no-iommu mode has.
Jerin Jacob Oct. 10, 2019, 6:02 a.m. UTC | #9
On Thu, 10 Oct, 2019, 4:58 AM Stephen Hemminger, <stephen@networkplumber.org>
wrote:

> On Tue, 8 Oct 2019 20:58:27 +0530
> Jerin Jacob <jerinjacobk@gmail.com> wrote:
>
> > On Tue, 8 Oct, 2019, 8:42 PM Stephen Hemminger, <
> stephen@networkplumber.org>
> > wrote:
> >
> > > On Fri, 6 Sep 2019 14:42:30 +0530
> > > <vattunuru@marvell.com> wrote:
> > >
> > > > From: Vamsi Attunuru <vattunuru@marvell.com>
> > > >
> > > > The DPDK use case such as VF representer or OVS offload etc
> > > > would call for PF and VF PCIe devices to bind vfio-pci
> > > > module to enable IOMMU protection.
> > > >
> > > > In addition to vSwitch use case, unlike, other PCI class of
> > > > devices, Network class of PCIe devices would have additional
> > > > responsibility on the PF devices such as promiscuous mode support
> > > > etc.
> > > >
> > > > The above use cases demand VFIO needs bound to PF and its
> > > > VF devices. This is use case is not supported in Linux kernel,
> > > > due to a security issue where it is possible to have
> > > > DoS in case if VF attached to guest over vfio-pci and netdev
> > > > kernel driver runs on it and which something VF representer
> > > > would like to enable it.
> > > >
> > > > Since we can not differentiate, the vfio-pci bounded VF devices
> > > > runs DPDK application or netdev driver in guest, we can not
> > > > introduce any scheme to fix DoS case and therefore not have
> > > > proper support of this in the upstream kernel.
> > > >
> > > > The igb_uio enables such PF and VF binding support for
> > > > non-iommu devices to make VF representer or OVS offload
> > > > run on non-iommu devices with DoS vulnerability for netdev driver
> > > > as VF.
> > > >
> > > > This kernel module, facilitate to enable SRIOV on PF devices,
> > > > therefore, to run both PF and VF devices in VFIO mode knowing
> > > > its impacts like igb_uio driver functions of non-iommu devices.
> > > >
> > > > Signed-off-by: Vamsi Attunuru <vattunuru@marvell.com>
> > > > Signed-off-by: Jerin Jacob <jerinj@marvell.com>
> > >
> > > NAK
> > > Having kernel drivers  not in upstream kernel is a long term
> > > maintenance and security risk. Please work with upstream kernel
> > > developers to get this merged there.
> > >
> >
> > There is security issue in attaching DPDK PF driver and netdev bind to
> VF.
> > So this scheme is not upsteamble to Linux kernel. Since rte_flow had VF
> > action. We need this scheme to support VF action with VFIO. So, Out of
> tree
> > is the only way as it is DPDK specific feature. Already sent patches to
> > Linux kernel, it make sense to not accept this in upstream.  We are
> already
> > exposing such features through igb-uio for non VFIO device. IMO, there
> > should not be any disparity between igb-uio and VFIO in DPDK.
> >
> > If we are against out of tree module, let's remove igb-uio as well. We
> > can't have different treatment for similar issues.
>
> You are right igb-uio is legacy and only exists for users unwilling to
> change (such as old kernels which don't support vfio noiommu mode).
>

It is NOT completely correct. This specific use case (allowing PF to bind
DPDK and VF to have netdev in non VFIO mode allowed by igb-uio) now. So all
non upsteamble features are going through igb-uio. That's is the main
reason why igb-uio exist.



Eventually it should go as well. And the same for KNI.
>


Nothing works without timeline. If that's the plan then decide the
timeline.Please...



> Understand the pain, but this feels like a child who gets a no
> answer from one parent and therefore thinks they can ask the other parent
> and get a yes.
>
> Who is the person in the upstream kernel community that needs to understand
> what is needed. Maybe another "I know what I am doing" kernel flag is
> needed
> like no-iommu mode has.
>

no-iommu case is different where we cannot screw Linux netdev driver, you
can create a damage to your self that's an acceptable compromise.

In this case, when DPDK PF bound application dies then it will impact
netdev VF driver as gets stalled and there is a security issues to VF
netdev driver that DPDK PF can intersect the netdev VF mailbox message.

So this case is different where from Kernel PoV there is damange to netdev
VF so this can not be accepted in Linux.

One option is to add this piece of code igb-uio instead to adding new
driver. What do you say?

It is the problem for all non bifurcated drivers in DPDK not specific to
Marvell.

The workaround is to use igb-uio with non VFIO.
We can not support UIO through performance reason hence we need a solution
that works for VFIO due to HW accelerated mempool architecture (applicable
for all NPU)

We can not have VFIO vs UIO specific features disparity in DPDK. Please
have same treatment. Either remove igb-uio hacks from Dpdk or enable non
upsteamble features through igb-uio.So that there is no disparity for a
specific use case / vendor.

>
Jerin Jacob Oct. 13, 2019, 7:20 a.m. UTC | #10
On Thu, Oct 10, 2019 at 11:32 AM Jerin Jacob <jerinjacobk@gmail.com> wrote:
>
>
>
> On Thu, 10 Oct, 2019, 4:58 AM Stephen Hemminger, <stephen@networkplumber.org> wrote:
>>
>> On Tue, 8 Oct 2019 20:58:27 +0530
>> Jerin Jacob <jerinjacobk@gmail.com> wrote:
>>
>> > On Tue, 8 Oct, 2019, 8:42 PM Stephen Hemminger, <stephen@networkplumber.org>
>> > wrote:
>> >
>> > > On Fri, 6 Sep 2019 14:42:30 +0530
>> > > <vattunuru@marvell.com> wrote:
>> > >
>> > > > From: Vamsi Attunuru <vattunuru@marvell.com>
>> > > >
>> > > > The DPDK use case such as VF representer or OVS offload etc
>> > > > would call for PF and VF PCIe devices to bind vfio-pci
>> > > > module to enable IOMMU protection.
>> > > >
>> > > > In addition to vSwitch use case, unlike, other PCI class of
>> > > > devices, Network class of PCIe devices would have additional
>> > > > responsibility on the PF devices such as promiscuous mode support
>> > > > etc.
>> > > >
>> > > > The above use cases demand VFIO needs bound to PF and its
>> > > > VF devices. This is use case is not supported in Linux kernel,
>> > > > due to a security issue where it is possible to have
>> > > > DoS in case if VF attached to guest over vfio-pci and netdev
>> > > > kernel driver runs on it and which something VF representer
>> > > > would like to enable it.
>> > > >
>> > > > Since we can not differentiate, the vfio-pci bounded VF devices
>> > > > runs DPDK application or netdev driver in guest, we can not
>> > > > introduce any scheme to fix DoS case and therefore not have
>> > > > proper support of this in the upstream kernel.
>> > > >
>> > > > The igb_uio enables such PF and VF binding support for
>> > > > non-iommu devices to make VF representer or OVS offload
>> > > > run on non-iommu devices with DoS vulnerability for netdev driver
>> > > > as VF.
>> > > >
>> > > > This kernel module, facilitate to enable SRIOV on PF devices,
>> > > > therefore, to run both PF and VF devices in VFIO mode knowing
>> > > > its impacts like igb_uio driver functions of non-iommu devices.
>> > > >
>> > > > Signed-off-by: Vamsi Attunuru <vattunuru@marvell.com>
>> > > > Signed-off-by: Jerin Jacob <jerinj@marvell.com>
>> > >
>> > > NAK
>> > > Having kernel drivers  not in upstream kernel is a long term
>> > > maintenance and security risk. Please work with upstream kernel
>> > > developers to get this merged there.
>> > >
>> >
>> > There is security issue in attaching DPDK PF driver and netdev bind to VF.
>> > So this scheme is not upsteamble to Linux kernel. Since rte_flow had VF
>> > action. We need this scheme to support VF action with VFIO. So, Out of tree
>> > is the only way as it is DPDK specific feature. Already sent patches to
>> > Linux kernel, it make sense to not accept this in upstream.  We are already
>> > exposing such features through igb-uio for non VFIO device. IMO, there
>> > should not be any disparity between igb-uio and VFIO in DPDK.
>> >
>> > If we are against out of tree module, let's remove igb-uio as well. We
>> > can't have different treatment for similar issues.
>>
>> You are right igb-uio is legacy and only exists for users unwilling to
>> change (such as old kernels which don't support vfio noiommu mode).
>
>
> It is NOT completely correct. This specific use case (allowing PF to bind DPDK and VF to have netdev in non VFIO mode allowed by igb-uio) now. So all non upsteamble features are going through igb-uio. That's is the main reason why igb-uio exist.
>
>
>
>> Eventually it should go as well. And the same for KNI.
>
>
>
> Nothing works without timeline. If that's the plan then decide the timeline.Please...
>
>
>>
>> Understand the pain, but this feels like a child who gets a no
>> answer from one parent and therefore thinks they can ask the other parent
>> and get a yes.
>>
>> Who is the person in the upstream kernel community that needs to understand
>> what is needed. Maybe another "I know what I am doing" kernel flag is needed
>> like no-iommu mode has.
>
>
> no-iommu case is different where we cannot screw Linux netdev driver, you can create a damage to your self that's an acceptable compromise.
>
> In this case, when DPDK PF bound application dies then it will impact netdev VF driver as gets stalled and there is a security issues to VF netdev driver that DPDK PF can intersect the netdev VF mailbox message.
>
> So this case is different where from Kernel PoV there is damange to netdev VF so this can not be accepted in Linux.
>
> One option is to add this piece of code igb-uio instead to adding new driver. What do you say?
>
> It is the problem for all non bifurcated drivers in DPDK not specific to Marvell.
>
> The workaround is to use igb-uio with non VFIO.
> We can not support UIO through performance reason hence we need a solution that works for VFIO due to HW accelerated mempool architecture (applicable for all NPU)
>
> We can not have VFIO vs UIO specific features disparity in DPDK. Please have same treatment. Either remove igb-uio hacks from Dpdk or enable non upsteamble features through igb-uio.So that there is no disparity for a specific use case / vendor.

Andrew Rybchenko already acked this patch and I provided enough data
on the need for this patch.
I hope there is no more confusion about the need for this patch.
Jerin Jacob Oct. 16, 2019, 11:37 a.m. UTC | #11
> >
> >
> > no-iommu case is different where we cannot screw Linux netdev driver, you can create a damage to your self that's an acceptable compromise.
> >
> > In this case, when DPDK PF bound application dies then it will impact netdev VF driver as gets stalled and there is a security issues to VF netdev driver that DPDK PF can intersect the netdev VF mailbox message.
> >
> > So this case is different where from Kernel PoV there is damange to netdev VF so this can not be accepted in Linux.
> >
> > One option is to add this piece of code igb-uio instead to adding new driver. What do you say?
> >
> > It is the problem for all non bifurcated drivers in DPDK not specific to Marvell.
> >
> > The workaround is to use igb-uio with non VFIO.
> > We can not support UIO through performance reason hence we need a solution that works for VFIO due to HW accelerated mempool architecture (applicable for all NPU)
> >
> > We can not have VFIO vs UIO specific features disparity in DPDK. Please have same treatment. Either remove igb-uio hacks from Dpdk or enable non upsteamble features through igb-uio.So that there is no disparity for a specific use case / vendor.
>
> Andrew Rybchenko already acked this patch and I provided enough data
> on the need for this patch.
> I hope there is no more confusion about the need for this patch.


I assume no more issue to merge this patch for RC1. If yes, please discuss.
Jerin Jacob Oct. 23, 2019, 5:08 p.m. UTC | #12
On Wed, Oct 16, 2019 at 5:07 PM Jerin Jacob <jerinjacobk@gmail.com> wrote:
>
> > >
> > >
> > > no-iommu case is different where we cannot screw Linux netdev driver, you can create a damage to your self that's an acceptable compromise.
> > >
> > > In this case, when DPDK PF bound application dies then it will impact netdev VF driver as gets stalled and there is a security issues to VF netdev driver that DPDK PF can intersect the netdev VF mailbox message.
> > >
> > > So this case is different where from Kernel PoV there is damange to netdev VF so this can not be accepted in Linux.
> > >
> > > One option is to add this piece of code igb-uio instead to adding new driver. What do you say?
> > >
> > > It is the problem for all non bifurcated drivers in DPDK not specific to Marvell.
> > >
> > > The workaround is to use igb-uio with non VFIO.
> > > We can not support UIO through performance reason hence we need a solution that works for VFIO due to HW accelerated mempool architecture (applicable for all NPU)
> > >
> > > We can not have VFIO vs UIO specific features disparity in DPDK. Please have same treatment. Either remove igb-uio hacks from Dpdk or enable non upsteamble features through igb-uio.So that there is no disparity for a specific use case / vendor.
> >
> > Andrew Rybchenko already acked this patch and I provided enough data
> > on the need for this patch.
> > I hope there is no more confusion about the need for this patch.
>
>
> I assume no more issue to merge this patch for RC1. If yes, please discuss.

Ping++
Jerin Jacob Oct. 24, 2019, 11:08 a.m. UTC | #13
> > >
> >
> > There is security issue in attaching DPDK PF driver and netdev bind to VF.
> > So this scheme is not upsteamble to Linux kernel. Since rte_flow had VF
> > action. We need this scheme to support VF action with VFIO. So, Out of tree
> > is the only way as it is DPDK specific feature. Already sent patches to
> > Linux kernel, it make sense to not accept this in upstream.  We are already
> > exposing such features through igb-uio for non VFIO device. IMO, there
> > should not be any disparity between igb-uio and VFIO in DPDK.
> >
> > If we are against out of tree module, let's remove igb-uio as well. We
> > can't have different treatment for similar issues.
>
> You are right igb-uio is legacy and only exists for users unwilling to
> change (such as old kernels which don't support vfio noiommu mode).
> Eventually it should go as well. And the same for KNI.

+ Techboard

I understand, Everyone would like not to have out of tree kernel
modules in dpdk repo.

Is it possible to
a) move existing or new out of tree module or
b) libraries can be used by another project or
c) DPDK device drivers binary FW repo like Linux kernel FW etc(The
ones connected to dpdk)

add it in https://www.dpdk.org/hosted-projects/

and use tools like "https://gerrit.googlesource.com/git-repo/"[1] for
development and distribution case
[1]
git-repo: tools used by android for a similar situation.
Thomas Monjalon Oct. 31, 2019, 5:03 p.m. UTC | #14
We don't get enough attention on this topic.
Let me rephrase the issue and the proposals with more people Cc'ed.

We are talking about SR-IOV VFs in VMs
with a PF managed on the host by DPDK.
The PF driver is either a (1) bifurcated (Mellanox case),
or (2) bound to UIO with igb_uio, or (3) bound to VFIO.

In case 1, the PF is still managed by a kernel driver, so no issue.

In case 2, the PF is managed by UIO.
There is no SR-IOV support in upstream UIO,
but the out-of-tree module igb_uio works.
However we would like to drop this legacy module from DPDK.
Some (most) Linux distributions do not package igb_uio anyway.
The other issue is that igb_uio is using physical addressing,
which is not acceptable with OCTEON TX2 for performance reason.

In case 3, the PF is managed by VFIO. This is the case we want to fix.
VFIO does not allow to create VFs.
The workaround is to create VFs before binding the PF to VFIO.
But since Linux 4.19, VFIO forbids any SR-IOV VF management.
There is a security concern about allowing userspace to manage SR-IOV
VF messages and taking the responsibility for VFs in the guest.

It is desired to allow the system admin deciding the security levels,
by adding a flag in VFIO "let me manage VFs, I know what I am doing".
Reference of "recent" discussion: https://lkml.org/lkml/2018/3/6/855
For now, there is no upstream solution merged.

This patch is proposing a solution using an out-of-tree module.
In this case, the admin will decide explicitly to bind the PF to vfio_pf.
Unfortunately this solution won't work in environments which
forbid any out-of-tree module.
Another concern is that it looks like DPDK-only solution.

We have an issue but we do not want to propose a half-solution
which would harm other projects and users.
So the question is:
Do we accept this patch as a temporary solution?
Or can we get an agreement soon for an upstream kernel solution?

Thanks for reading and giving your (clear) opinion.


06/09/2019 15:27, Jerin Jacob Kollanukkaran:
> From: Thomas Monjalon <thomas@monjalon.net>
> > 06/09/2019 11:12, vattunuru@marvell.com:
> > > From: Vamsi Attunuru <vattunuru@marvell.com>
> > >
> > > The DPDK use case such as VF representer or OVS offload etc would call
> > > for PF and VF PCIe devices to bind vfio-pci module to enable IOMMU
> > > protection.
> > >
> > > In addition to vSwitch use case, unlike, other PCI class of devices,
> > > Network class of PCIe devices would have additional responsibility on
> > > the PF devices such as promiscuous mode support etc.
> > >
> > > The above use cases demand VFIO needs bound to PF and its VF devices.
> > > This is use case is not supported in Linux kernel, due to a security
> > > issue where it is possible to have DoS in case if VF attached to guest
> > > over vfio-pci and netdev kernel driver runs on it and which something
> > > VF representer would like to enable it.
> > >
> > > Since we can not differentiate, the vfio-pci bounded VF devices runs
> > > DPDK application or netdev driver in guest, we can not introduce any
> > > scheme to fix DoS case and therefore not have proper support of this
> > > in the upstream kernel.
> > >
> > > The igb_uio enables such PF and VF binding support for non-iommu
> > > devices to make VF representer or OVS offload run on non-iommu devices
> > > with DoS vulnerability for netdev driver as VF.
> > >
> > > This kernel module, facilitate to enable SRIOV on PF devices,
> > > therefore, to run both PF and VF devices in VFIO mode knowing its
> > > impacts like igb_uio driver functions of non-iommu devices.
> > >
> > > Signed-off-by: Vamsi Attunuru <vattunuru@marvell.com>
> > > Signed-off-by: Jerin Jacob <jerinj@marvell.com>
> > 
> > Sorry I fail to properly understand the explanation above.
> > Please try to split in shorter sentences.
> > 
> > About the request to add an out-of-tree Linux kernel driver, I guess Jerin is well
> > aware that we don't want such anymore.
> 
> Yes. I am aware of it. I don't like the out of tree modules either. But, This case,
> I suggested Vamsi to have out of tree module.
> 
> Let me describe the issue and let us discuss how to tackle the  problem:
> 
> # Linux kernel wont allow VFIO PF to have SRIOV enable.
> 
> Patches and on going discussion are here:
> https://patchwork.kernel.org/patch/10522381/
> https://lwn.net/Articles/748526/
> 
> Based on my understanding the reason for NOT allowing the
> VFIO PF to have SRIOV enable is genuine from kernel point of
> View but not from DPDK point of view.
> 
> Here is the sequence  to describe the problem
> 1) Consider Linux kernel allowed VFIO PCI SRIOV enable
> 2) PF bound to vfio-pci
> 3) using SRIOV infrastructure of vfio-pci  PF driver,
> VFs  are created
> 4) DPDK application bound to PF and VF, No issue here.
> 5) Assume DPDK application bound to PF and VF bound
> To netdev kernel driver. Now, there is a genuine  concern
> From kernel point of view that, DPDK PF can intercept,
> VF mailbox message or so and deny the Kernel request
> Or what if DPDK PF application crashes?
> 
> To avoid the case (5), (3) is not allowed in stock kernel.
> Which makes sense IMO.
> 
> Now, From DPDK PoV, step 5 is valid as we have
> Rte_flow's VF action etc used to enable such case.
> Where, user can program the PF's rte_flow to steer
> Some traffic to VF, where VF can be, DPDK application or
> Linux kernel netdev driver.
> 
> This patch enables the step (3) to enable step (5) from DPDK
> PoV. i.e DPDK needs to allow PF to bind to DPDK with VFs.
> 
> Why this issue now:
> - igb_uio kernel driver is used as enabling step (3)
> See store_max_vfs() kernel/linux/igb_uio/igb_uio.c
>  This is fine for non-iommu device, IOMMU devices
> needs VFIO.
> - We would like support VFIO for IOMMU protection
> And enable step (5) as DPDK supports form the spec level.
> i.e need to fix feature disparity between iommu vs
> non-iommu based devices.
> 
> Note:
> We may not need a  brand new kernel module, we could move
> this logic to igb_uio if maintenance is concern.
Luca Boccassi Nov. 1, 2019, 11:54 a.m. UTC | #15
For distros, out-of-tree kernel modules are painful. From my POV, it
would be preferable to try and find a solution upstream, even if it is
going to be difficult and require a lot of negotiation and work.

On Thu, 2019-10-31 at 18:03 +0100, Thomas Monjalon wrote:
> We don't get enough attention on this topic.
> Let me rephrase the issue and the proposals with more people Cc'ed.
> 
> We are talking about SR-IOV VFs in VMs
> with a PF managed on the host by DPDK.
> The PF driver is either a (1) bifurcated (Mellanox case),
> or (2) bound to UIO with igb_uio, or (3) bound to VFIO.
> 
> In case 1, the PF is still managed by a kernel driver, so no issue.
> 
> In case 2, the PF is managed by UIO.
> There is no SR-IOV support in upstream UIO,
> but the out-of-tree module igb_uio works.
> However we would like to drop this legacy module from DPDK.
> Some (most) Linux distributions do not package igb_uio anyway.
> The other issue is that igb_uio is using physical addressing,
> which is not acceptable with OCTEON TX2 for performance reason.
> 
> In case 3, the PF is managed by VFIO. This is the case we want to
> fix.
> VFIO does not allow to create VFs.
> The workaround is to create VFs before binding the PF to VFIO.
> But since Linux 4.19, VFIO forbids any SR-IOV VF management.
> There is a security concern about allowing userspace to manage SR-IOV
> VF messages and taking the responsibility for VFs in the guest.
> 
> It is desired to allow the system admin deciding the security levels,
> by adding a flag in VFIO "let me manage VFs, I know what I am doing".
> Reference of "recent" discussion: 
> https://lkml.org/lkml/2018/3/6/855
> 
> For now, there is no upstream solution merged.
> 
> This patch is proposing a solution using an out-of-tree module.
> In this case, the admin will decide explicitly to bind the PF to
> vfio_pf.
> Unfortunately this solution won't work in environments which
> forbid any out-of-tree module.
> Another concern is that it looks like DPDK-only solution.
> 
> We have an issue but we do not want to propose a half-solution
> which would harm other projects and users.
> So the question is:
> Do we accept this patch as a temporary solution?
> Or can we get an agreement soon for an upstream kernel solution?
> 
> Thanks for reading and giving your (clear) opinion.
> 
> 
> 06/09/2019 15:27, Jerin Jacob Kollanukkaran:
> > From: Thomas Monjalon <
> > thomas@monjalon.net
> > >
> > > 06/09/2019 11:12, 
> > > vattunuru@marvell.com
> > > :
> > > > From: Vamsi Attunuru <
> > > > vattunuru@marvell.com
> > > > >
> > > > 
> > > > The DPDK use case such as VF representer or OVS offload etc
> > > > would call
> > > > for PF and VF PCIe devices to bind vfio-pci module to enable
> > > > IOMMU
> > > > protection.
> > > > 
> > > > In addition to vSwitch use case, unlike, other PCI class of
> > > > devices,
> > > > Network class of PCIe devices would have additional
> > > > responsibility on
> > > > the PF devices such as promiscuous mode support etc.
> > > > 
> > > > The above use cases demand VFIO needs bound to PF and its VF
> > > > devices.
> > > > This is use case is not supported in Linux kernel, due to a
> > > > security
> > > > issue where it is possible to have DoS in case if VF attached
> > > > to guest
> > > > over vfio-pci and netdev kernel driver runs on it and which
> > > > something
> > > > VF representer would like to enable it.
> > > > 
> > > > Since we can not differentiate, the vfio-pci bounded VF devices
> > > > runs
> > > > DPDK application or netdev driver in guest, we can not
> > > > introduce any
> > > > scheme to fix DoS case and therefore not have proper support of
> > > > this
> > > > in the upstream kernel.
> > > > 
> > > > The igb_uio enables such PF and VF binding support for non-
> > > > iommu
> > > > devices to make VF representer or OVS offload run on non-iommu
> > > > devices
> > > > with DoS vulnerability for netdev driver as VF.
> > > > 
> > > > This kernel module, facilitate to enable SRIOV on PF devices,
> > > > therefore, to run both PF and VF devices in VFIO mode knowing
> > > > its
> > > > impacts like igb_uio driver functions of non-iommu devices.
> > > > 
> > > > Signed-off-by: Vamsi Attunuru <
> > > > vattunuru@marvell.com
> > > > >
> > > > Signed-off-by: Jerin Jacob <
> > > > jerinj@marvell.com
> > > > >
> > > 
> > > Sorry I fail to properly understand the explanation above.
> > > Please try to split in shorter sentences.
> > > 
> > > About the request to add an out-of-tree Linux kernel driver, I
> > > guess Jerin is well
> > > aware that we don't want such anymore.
> > 
> > Yes. I am aware of it. I don't like the out of tree modules either.
> > But, This case,
> > I suggested Vamsi to have out of tree module.
> > 
> > Let me describe the issue and let us discuss how to tackle
> > the  problem:
> > 
> > # Linux kernel wont allow VFIO PF to have SRIOV enable.
> > 
> > Patches and on going discussion are here:
> > https://patchwork.kernel.org/patch/10522381/
> > 
> > https://lwn.net/Articles/748526/
> > 
> > 
> > Based on my understanding the reason for NOT allowing the
> > VFIO PF to have SRIOV enable is genuine from kernel point of
> > View but not from DPDK point of view.
> > 
> > Here is the sequence  to describe the problem
> > 1) Consider Linux kernel allowed VFIO PCI SRIOV enable
> > 2) PF bound to vfio-pci
> > 3) using SRIOV infrastructure of vfio-pci  PF driver,
> > VFs  are created
> > 4) DPDK application bound to PF and VF, No issue here.
> > 5) Assume DPDK application bound to PF and VF bound
> > To netdev kernel driver. Now, there is a genuine  concern
> > From kernel point of view that, DPDK PF can intercept,
> > VF mailbox message or so and deny the Kernel request
> > Or what if DPDK PF application crashes?
> > 
> > To avoid the case (5), (3) is not allowed in stock kernel.
> > Which makes sense IMO.
> > 
> > Now, From DPDK PoV, step 5 is valid as we have
> > Rte_flow's VF action etc used to enable such case.
> > Where, user can program the PF's rte_flow to steer
> > Some traffic to VF, where VF can be, DPDK application or
> > Linux kernel netdev driver.
> > 
> > This patch enables the step (3) to enable step (5) from DPDK
> > PoV. i.e DPDK needs to allow PF to bind to DPDK with VFs.
> > 
> > Why this issue now:
> > - igb_uio kernel driver is used as enabling step (3)
> > See store_max_vfs() kernel/linux/igb_uio/igb_uio.c
> >  This is fine for non-iommu device, IOMMU devices
> > needs VFIO.
> > - We would like support VFIO for IOMMU protection
> > And enable step (5) as DPDK supports form the spec level.
> > i.e need to fix feature disparity between iommu vs
> > non-iommu based devices.
> > 
> > Note:
> > We may not need a  brand new kernel module, we could move
> > this logic to igb_uio if maintenance is concern.
> 
> 
> 
>
Jerin Jacob Nov. 1, 2019, 12:12 p.m. UTC | #16
On Fri, Nov 1, 2019 at 5:24 PM Luca Boccassi <bluca@debian.org> wrote:
>
> For distros, out-of-tree kernel modules are painful. From my POV, it

I agree.

> would be preferable to try and find a solution upstream, even if it is
> going to be difficult and require a lot of negotiation and work.

I understand from RH, They are not packaging out of tree modules,
Would like to know which are the distributions
packaging existing KNI and IGB_UIO modules?

IMO, packaging 1 module vs N module is the same pain as it a matter of
kernel dependency in packaging.
If some reason, we decide to remove the IGB_UIO then we can remove
this module as well.



>
> On Thu, 2019-10-31 at 18:03 +0100, Thomas Monjalon wrote:
> > We don't get enough attention on this topic.
> > Let me rephrase the issue and the proposals with more people Cc'ed.
> >
> > We are talking about SR-IOV VFs in VMs
> > with a PF managed on the host by DPDK.
> > The PF driver is either a (1) bifurcated (Mellanox case),
> > or (2) bound to UIO with igb_uio, or (3) bound to VFIO.
> >
> > In case 1, the PF is still managed by a kernel driver, so no issue.
> >
> > In case 2, the PF is managed by UIO.
> > There is no SR-IOV support in upstream UIO,
> > but the out-of-tree module igb_uio works.
> > However we would like to drop this legacy module from DPDK.
> > Some (most) Linux distributions do not package igb_uio anyway.
> > The other issue is that igb_uio is using physical addressing,
> > which is not acceptable with OCTEON TX2 for performance reason.
> >
> > In case 3, the PF is managed by VFIO. This is the case we want to
> > fix.
> > VFIO does not allow to create VFs.
> > The workaround is to create VFs before binding the PF to VFIO.
> > But since Linux 4.19, VFIO forbids any SR-IOV VF management.
> > There is a security concern about allowing userspace to manage SR-IOV
> > VF messages and taking the responsibility for VFs in the guest.
> >
> > It is desired to allow the system admin deciding the security levels,
> > by adding a flag in VFIO "let me manage VFs, I know what I am doing".
> > Reference of "recent" discussion:
> > https://lkml.org/lkml/2018/3/6/855
> >
> > For now, there is no upstream solution merged.
> >
> > This patch is proposing a solution using an out-of-tree module.
> > In this case, the admin will decide explicitly to bind the PF to
> > vfio_pf.
> > Unfortunately this solution won't work in environments which
> > forbid any out-of-tree module.
> > Another concern is that it looks like DPDK-only solution.
> >
> > We have an issue but we do not want to propose a half-solution
> > which would harm other projects and users.
> > So the question is:
> > Do we accept this patch as a temporary solution?
> > Or can we get an agreement soon for an upstream kernel solution?
> >
> > Thanks for reading and giving your (clear) opinion.
> >
> >
> > 06/09/2019 15:27, Jerin Jacob Kollanukkaran:
> > > From: Thomas Monjalon <
> > > thomas@monjalon.net
> > > >
> > > > 06/09/2019 11:12,
> > > > vattunuru@marvell.com
> > > > :
> > > > > From: Vamsi Attunuru <
> > > > > vattunuru@marvell.com
> > > > > >
> > > > >
> > > > > The DPDK use case such as VF representer or OVS offload etc
> > > > > would call
> > > > > for PF and VF PCIe devices to bind vfio-pci module to enable
> > > > > IOMMU
> > > > > protection.
> > > > >
> > > > > In addition to vSwitch use case, unlike, other PCI class of
> > > > > devices,
> > > > > Network class of PCIe devices would have additional
> > > > > responsibility on
> > > > > the PF devices such as promiscuous mode support etc.
> > > > >
> > > > > The above use cases demand VFIO needs bound to PF and its VF
> > > > > devices.
> > > > > This is use case is not supported in Linux kernel, due to a
> > > > > security
> > > > > issue where it is possible to have DoS in case if VF attached
> > > > > to guest
> > > > > over vfio-pci and netdev kernel driver runs on it and which
> > > > > something
> > > > > VF representer would like to enable it.
> > > > >
> > > > > Since we can not differentiate, the vfio-pci bounded VF devices
> > > > > runs
> > > > > DPDK application or netdev driver in guest, we can not
> > > > > introduce any
> > > > > scheme to fix DoS case and therefore not have proper support of
> > > > > this
> > > > > in the upstream kernel.
> > > > >
> > > > > The igb_uio enables such PF and VF binding support for non-
> > > > > iommu
> > > > > devices to make VF representer or OVS offload run on non-iommu
> > > > > devices
> > > > > with DoS vulnerability for netdev driver as VF.
> > > > >
> > > > > This kernel module, facilitate to enable SRIOV on PF devices,
> > > > > therefore, to run both PF and VF devices in VFIO mode knowing
> > > > > its
> > > > > impacts like igb_uio driver functions of non-iommu devices.
> > > > >
> > > > > Signed-off-by: Vamsi Attunuru <
> > > > > vattunuru@marvell.com
> > > > > >
> > > > > Signed-off-by: Jerin Jacob <
> > > > > jerinj@marvell.com
> > > > > >
> > > >
> > > > Sorry I fail to properly understand the explanation above.
> > > > Please try to split in shorter sentences.
> > > >
> > > > About the request to add an out-of-tree Linux kernel driver, I
> > > > guess Jerin is well
> > > > aware that we don't want such anymore.
> > >
> > > Yes. I am aware of it. I don't like the out of tree modules either.
> > > But, This case,
> > > I suggested Vamsi to have out of tree module.
> > >
> > > Let me describe the issue and let us discuss how to tackle
> > > the  problem:
> > >
> > > # Linux kernel wont allow VFIO PF to have SRIOV enable.
> > >
> > > Patches and on going discussion are here:
> > > https://patchwork.kernel.org/patch/10522381/
> > >
> > > https://lwn.net/Articles/748526/
> > >
> > >
> > > Based on my understanding the reason for NOT allowing the
> > > VFIO PF to have SRIOV enable is genuine from kernel point of
> > > View but not from DPDK point of view.
> > >
> > > Here is the sequence  to describe the problem
> > > 1) Consider Linux kernel allowed VFIO PCI SRIOV enable
> > > 2) PF bound to vfio-pci
> > > 3) using SRIOV infrastructure of vfio-pci  PF driver,
> > > VFs  are created
> > > 4) DPDK application bound to PF and VF, No issue here.
> > > 5) Assume DPDK application bound to PF and VF bound
> > > To netdev kernel driver. Now, there is a genuine  concern
> > > From kernel point of view that, DPDK PF can intercept,
> > > VF mailbox message or so and deny the Kernel request
> > > Or what if DPDK PF application crashes?
> > >
> > > To avoid the case (5), (3) is not allowed in stock kernel.
> > > Which makes sense IMO.
> > >
> > > Now, From DPDK PoV, step 5 is valid as we have
> > > Rte_flow's VF action etc used to enable such case.
> > > Where, user can program the PF's rte_flow to steer
> > > Some traffic to VF, where VF can be, DPDK application or
> > > Linux kernel netdev driver.
> > >
> > > This patch enables the step (3) to enable step (5) from DPDK
> > > PoV. i.e DPDK needs to allow PF to bind to DPDK with VFs.
> > >
> > > Why this issue now:
> > > - igb_uio kernel driver is used as enabling step (3)
> > > See store_max_vfs() kernel/linux/igb_uio/igb_uio.c
> > >  This is fine for non-iommu device, IOMMU devices
> > > needs VFIO.
> > > - We would like support VFIO for IOMMU protection
> > > And enable step (5) as DPDK supports form the spec level.
> > > i.e need to fix feature disparity between iommu vs
> > > non-iommu based devices.
> > >
> > > Note:
> > > We may not need a  brand new kernel module, we could move
> > > this logic to igb_uio if maintenance is concern.
> >
> >
> >
> >
> --
> Kind regards,
> Luca Boccassi
Bruce Richardson Nov. 4, 2019, 11:16 a.m. UTC | #17
On Fri, Nov 01, 2019 at 11:54:45AM +0000, Luca Boccassi wrote:
> For distros, out-of-tree kernel modules are painful. From my POV, it
> would be preferable to try and find a solution upstream, even if it is
> going to be difficult and require a lot of negotiation and work.
> 

I don't think anyone would disagree that getting an up-stream in-kernel
solution is the desired end-state. However, even if we accept that, it
doesn't necessarily help us as we need to decide on this support on DPDK
right now. The factors to take into account are:

* we don't have definite line-of-sight to an in-kernel solution - it may
  never come
* even if it does eventually materialize, it will be months before it is in
  a released kernel, more months before it makes it to stable/LTS distros,
  and more months and years thereafter before it actually makes it into
  deployed systems from users. Having an out-of-tree module in DPDK makes
  it available to users much, much sooner.
* there seems to be a real need for this support.

For me, the key point seems to be the last one - if the feature is needed
and likely to be used by a reasonable number of users (i.e. not just 1 or 2).
If it is needed, then we need to have a path to support it, and, right now,
I'm not seeing any such path other than having support in an out-of-tree
module in DPDK itself.

My 2c.

/Bruce

> On Thu, 2019-10-31 at 18:03 +0100, Thomas Monjalon wrote:
> > We don't get enough attention on this topic.
> > Let me rephrase the issue and the proposals with more people Cc'ed.
> > 
> > We are talking about SR-IOV VFs in VMs
> > with a PF managed on the host by DPDK.
> > The PF driver is either a (1) bifurcated (Mellanox case),
> > or (2) bound to UIO with igb_uio, or (3) bound to VFIO.
> > 
> > In case 1, the PF is still managed by a kernel driver, so no issue.
> > 
> > In case 2, the PF is managed by UIO.
> > There is no SR-IOV support in upstream UIO,
> > but the out-of-tree module igb_uio works.
> > However we would like to drop this legacy module from DPDK.
> > Some (most) Linux distributions do not package igb_uio anyway.
> > The other issue is that igb_uio is using physical addressing,
> > which is not acceptable with OCTEON TX2 for performance reason.
> > 
> > In case 3, the PF is managed by VFIO. This is the case we want to
> > fix.
> > VFIO does not allow to create VFs.
> > The workaround is to create VFs before binding the PF to VFIO.
> > But since Linux 4.19, VFIO forbids any SR-IOV VF management.
> > There is a security concern about allowing userspace to manage SR-IOV
> > VF messages and taking the responsibility for VFs in the guest.
> > 
> > It is desired to allow the system admin deciding the security levels,
> > by adding a flag in VFIO "let me manage VFs, I know what I am doing".
> > Reference of "recent" discussion: 
> > https://lkml.org/lkml/2018/3/6/855
> > 
> > For now, there is no upstream solution merged.
> > 
> > This patch is proposing a solution using an out-of-tree module.
> > In this case, the admin will decide explicitly to bind the PF to
> > vfio_pf.
> > Unfortunately this solution won't work in environments which
> > forbid any out-of-tree module.
> > Another concern is that it looks like DPDK-only solution.
> > 
> > We have an issue but we do not want to propose a half-solution
> > which would harm other projects and users.
> > So the question is:
> > Do we accept this patch as a temporary solution?
> > Or can we get an agreement soon for an upstream kernel solution?
> > 
> > Thanks for reading and giving your (clear) opinion.
> > 
> > 
> > 06/09/2019 15:27, Jerin Jacob Kollanukkaran:
> > > From: Thomas Monjalon <
> > > thomas@monjalon.net
> > > >
> > > > 06/09/2019 11:12, 
> > > > vattunuru@marvell.com
> > > > :
> > > > > From: Vamsi Attunuru <
> > > > > vattunuru@marvell.com
> > > > > >
> > > > > 
> > > > > The DPDK use case such as VF representer or OVS offload etc
> > > > > would call
> > > > > for PF and VF PCIe devices to bind vfio-pci module to enable
> > > > > IOMMU
> > > > > protection.
> > > > > 
> > > > > In addition to vSwitch use case, unlike, other PCI class of
> > > > > devices,
> > > > > Network class of PCIe devices would have additional
> > > > > responsibility on
> > > > > the PF devices such as promiscuous mode support etc.
> > > > > 
> > > > > The above use cases demand VFIO needs bound to PF and its VF
> > > > > devices.
> > > > > This is use case is not supported in Linux kernel, due to a
> > > > > security
> > > > > issue where it is possible to have DoS in case if VF attached
> > > > > to guest
> > > > > over vfio-pci and netdev kernel driver runs on it and which
> > > > > something
> > > > > VF representer would like to enable it.
> > > > > 
> > > > > Since we can not differentiate, the vfio-pci bounded VF devices
> > > > > runs
> > > > > DPDK application or netdev driver in guest, we can not
> > > > > introduce any
> > > > > scheme to fix DoS case and therefore not have proper support of
> > > > > this
> > > > > in the upstream kernel.
> > > > > 
> > > > > The igb_uio enables such PF and VF binding support for non-
> > > > > iommu
> > > > > devices to make VF representer or OVS offload run on non-iommu
> > > > > devices
> > > > > with DoS vulnerability for netdev driver as VF.
> > > > > 
> > > > > This kernel module, facilitate to enable SRIOV on PF devices,
> > > > > therefore, to run both PF and VF devices in VFIO mode knowing
> > > > > its
> > > > > impacts like igb_uio driver functions of non-iommu devices.
> > > > > 
> > > > > Signed-off-by: Vamsi Attunuru <
> > > > > vattunuru@marvell.com
> > > > > >
> > > > > Signed-off-by: Jerin Jacob <
> > > > > jerinj@marvell.com
> > > > > >
> > > > 
> > > > Sorry I fail to properly understand the explanation above.
> > > > Please try to split in shorter sentences.
> > > > 
> > > > About the request to add an out-of-tree Linux kernel driver, I
> > > > guess Jerin is well
> > > > aware that we don't want such anymore.
> > > 
> > > Yes. I am aware of it. I don't like the out of tree modules either.
> > > But, This case,
> > > I suggested Vamsi to have out of tree module.
> > > 
> > > Let me describe the issue and let us discuss how to tackle
> > > the  problem:
> > > 
> > > # Linux kernel wont allow VFIO PF to have SRIOV enable.
> > > 
> > > Patches and on going discussion are here:
> > > https://patchwork.kernel.org/patch/10522381/
> > > 
> > > https://lwn.net/Articles/748526/
> > > 
> > > 
> > > Based on my understanding the reason for NOT allowing the
> > > VFIO PF to have SRIOV enable is genuine from kernel point of
> > > View but not from DPDK point of view.
> > > 
> > > Here is the sequence  to describe the problem
> > > 1) Consider Linux kernel allowed VFIO PCI SRIOV enable
> > > 2) PF bound to vfio-pci
> > > 3) using SRIOV infrastructure of vfio-pci  PF driver,
> > > VFs  are created
> > > 4) DPDK application bound to PF and VF, No issue here.
> > > 5) Assume DPDK application bound to PF and VF bound
> > > To netdev kernel driver. Now, there is a genuine  concern
> > > From kernel point of view that, DPDK PF can intercept,
> > > VF mailbox message or so and deny the Kernel request
> > > Or what if DPDK PF application crashes?
> > > 
> > > To avoid the case (5), (3) is not allowed in stock kernel.
> > > Which makes sense IMO.
> > > 
> > > Now, From DPDK PoV, step 5 is valid as we have
> > > Rte_flow's VF action etc used to enable such case.
> > > Where, user can program the PF's rte_flow to steer
> > > Some traffic to VF, where VF can be, DPDK application or
> > > Linux kernel netdev driver.
> > > 
> > > This patch enables the step (3) to enable step (5) from DPDK
> > > PoV. i.e DPDK needs to allow PF to bind to DPDK with VFs.
> > > 
> > > Why this issue now:
> > > - igb_uio kernel driver is used as enabling step (3)
> > > See store_max_vfs() kernel/linux/igb_uio/igb_uio.c
> > >  This is fine for non-iommu device, IOMMU devices
> > > needs VFIO.
> > > - We would like support VFIO for IOMMU protection
> > > And enable step (5) as DPDK supports form the spec level.
> > > i.e need to fix feature disparity between iommu vs
> > > non-iommu based devices.
> > > 
> > > Note:
> > > We may not need a  brand new kernel module, we could move
> > > this logic to igb_uio if maintenance is concern.
> > 
> > 
> > 
> > 
> -- 
> Kind regards,
> Luca Boccassi
Luca Boccassi Nov. 5, 2019, 10:09 a.m. UTC | #18
On Mon, 2019-11-04 at 11:16 +0000, Bruce Richardson wrote:
> On Fri, Nov 01, 2019 at 11:54:45AM +0000, Luca Boccassi wrote:
> > For distros, out-of-tree kernel modules are painful. From my POV,
> > it
> > would be preferable to try and find a solution upstream, even if it
> > is
> > going to be difficult and require a lot of negotiation and work.
> > 
> 
> I don't think anyone would disagree that getting an up-stream in-
> kernel
> solution is the desired end-state. However, even if we accept that,
> it
> doesn't necessarily help us as we need to decide on this support on
> DPDK
> right now. The factors to take into account are:
> 
> * we don't have definite line-of-sight to an in-kernel solution - it
> may
>   never come
> * even if it does eventually materialize, it will be months before it
> is in
>   a released kernel, more months before it makes it to stable/LTS
> distros,
>   and more months and years thereafter before it actually makes it
> into
>   deployed systems from users. Having an out-of-tree module in DPDK
> makes
>   it available to users much, much sooner.
> * there seems to be a real need for this support.
> 
> For me, the key point seems to be the last one - if the feature is
> needed
> and likely to be used by a reasonable number of users (i.e. not just
> 1 or 2).
> If it is needed, then we need to have a path to support it, and,
> right now,
> I'm not seeing any such path other than having support in an out-of-
> tree
> module in DPDK itself.
> 
> My 2c.
> 
> /Bruce

That's a fair point - what I'd like to see is, if as a project we want
to move toward having no out-of-tree modules, is a firm statement, a
concrete plan and commitment via roadmap to keep working to provide a
solution upstream, however hard it might be, and not just declare job
done and move on once the oot module is accepted. At that point having
a module in the interim is acceptable for me.

> > On Thu, 2019-10-31 at 18:03 +0100, Thomas Monjalon wrote:
> > > We don't get enough attention on this topic.
> > > Let me rephrase the issue and the proposals with more people
> > > Cc'ed.
> > > 
> > > We are talking about SR-IOV VFs in VMs
> > > with a PF managed on the host by DPDK.
> > > The PF driver is either a (1) bifurcated (Mellanox case),
> > > or (2) bound to UIO with igb_uio, or (3) bound to VFIO.
> > > 
> > > In case 1, the PF is still managed by a kernel driver, so no
> > > issue.
> > > 
> > > In case 2, the PF is managed by UIO.
> > > There is no SR-IOV support in upstream UIO,
> > > but the out-of-tree module igb_uio works.
> > > However we would like to drop this legacy module from DPDK.
> > > Some (most) Linux distributions do not package igb_uio anyway.
> > > The other issue is that igb_uio is using physical addressing,
> > > which is not acceptable with OCTEON TX2 for performance reason.
> > > 
> > > In case 3, the PF is managed by VFIO. This is the case we want to
> > > fix.
> > > VFIO does not allow to create VFs.
> > > The workaround is to create VFs before binding the PF to VFIO.
> > > But since Linux 4.19, VFIO forbids any SR-IOV VF management.
> > > There is a security concern about allowing userspace to manage
> > > SR-IOV
> > > VF messages and taking the responsibility for VFs in the guest.
> > > 
> > > It is desired to allow the system admin deciding the security
> > > levels,
> > > by adding a flag in VFIO "let me manage VFs, I know what I am
> > > doing".
> > > Reference of "recent" discussion: 
> > > https://lkml.org/lkml/2018/3/6/855
> > > 
> > > 
> > > For now, there is no upstream solution merged.
> > > 
> > > This patch is proposing a solution using an out-of-tree module.
> > > In this case, the admin will decide explicitly to bind the PF to
> > > vfio_pf.
> > > Unfortunately this solution won't work in environments which
> > > forbid any out-of-tree module.
> > > Another concern is that it looks like DPDK-only solution.
> > > 
> > > We have an issue but we do not want to propose a half-solution
> > > which would harm other projects and users.
> > > So the question is:
> > > Do we accept this patch as a temporary solution?
> > > Or can we get an agreement soon for an upstream kernel solution?
> > > 
> > > Thanks for reading and giving your (clear) opinion.
> > > 
> > > 
> > > 06/09/2019 15:27, Jerin Jacob Kollanukkaran:
> > > > From: Thomas Monjalon <
> > > > thomas@monjalon.net
> > > > 
> > > > > 06/09/2019 11:12, 
> > > > > vattunuru@marvell.com
> > > > > 
> > > > > :
> > > > > > From: Vamsi Attunuru <
> > > > > > vattunuru@marvell.com
> > > > > > 
> > > > > > 
> > > > > > The DPDK use case such as VF representer or OVS offload etc
> > > > > > would call
> > > > > > for PF and VF PCIe devices to bind vfio-pci module to
> > > > > > enable
> > > > > > IOMMU
> > > > > > protection.
> > > > > > 
> > > > > > In addition to vSwitch use case, unlike, other PCI class of
> > > > > > devices,
> > > > > > Network class of PCIe devices would have additional
> > > > > > responsibility on
> > > > > > the PF devices such as promiscuous mode support etc.
> > > > > > 
> > > > > > The above use cases demand VFIO needs bound to PF and its
> > > > > > VF
> > > > > > devices.
> > > > > > This is use case is not supported in Linux kernel, due to a
> > > > > > security
> > > > > > issue where it is possible to have DoS in case if VF
> > > > > > attached
> > > > > > to guest
> > > > > > over vfio-pci and netdev kernel driver runs on it and which
> > > > > > something
> > > > > > VF representer would like to enable it.
> > > > > > 
> > > > > > Since we can not differentiate, the vfio-pci bounded VF
> > > > > > devices
> > > > > > runs
> > > > > > DPDK application or netdev driver in guest, we can not
> > > > > > introduce any
> > > > > > scheme to fix DoS case and therefore not have proper
> > > > > > support of
> > > > > > this
> > > > > > in the upstream kernel.
> > > > > > 
> > > > > > The igb_uio enables such PF and VF binding support for non-
> > > > > > iommu
> > > > > > devices to make VF representer or OVS offload run on non-
> > > > > > iommu
> > > > > > devices
> > > > > > with DoS vulnerability for netdev driver as VF.
> > > > > > 
> > > > > > This kernel module, facilitate to enable SRIOV on PF
> > > > > > devices,
> > > > > > therefore, to run both PF and VF devices in VFIO mode
> > > > > > knowing
> > > > > > its
> > > > > > impacts like igb_uio driver functions of non-iommu devices.
> > > > > > 
> > > > > > Signed-off-by: Vamsi Attunuru <
> > > > > > vattunuru@marvell.com
> > > > > > 
> > > > > > 
> > > > > > Signed-off-by: Jerin Jacob <
> > > > > > jerinj@marvell.com
> > > > > > 
> > > > > 
> > > > > Sorry I fail to properly understand the explanation above.
> > > > > Please try to split in shorter sentences.
> > > > > 
> > > > > About the request to add an out-of-tree Linux kernel driver,
> > > > > I
> > > > > guess Jerin is well
> > > > > aware that we don't want such anymore.
> > > > 
> > > > Yes. I am aware of it. I don't like the out of tree modules
> > > > either.
> > > > But, This case,
> > > > I suggested Vamsi to have out of tree module.
> > > > 
> > > > Let me describe the issue and let us discuss how to tackle
> > > > the  problem:
> > > > 
> > > > # Linux kernel wont allow VFIO PF to have SRIOV enable.
> > > > 
> > > > Patches and on going discussion are here:
> > > > https://patchwork.kernel.org/patch/10522381/
> > > > 
> > > > 
> > > > https://lwn.net/Articles/748526/
> > > > 
> > > > 
> > > > 
> > > > Based on my understanding the reason for NOT allowing the
> > > > VFIO PF to have SRIOV enable is genuine from kernel point of
> > > > View but not from DPDK point of view.
> > > > 
> > > > Here is the sequence  to describe the problem
> > > > 1) Consider Linux kernel allowed VFIO PCI SRIOV enable
> > > > 2) PF bound to vfio-pci
> > > > 3) using SRIOV infrastructure of vfio-pci  PF driver,
> > > > VFs  are created
> > > > 4) DPDK application bound to PF and VF, No issue here.
> > > > 5) Assume DPDK application bound to PF and VF bound
> > > > To netdev kernel driver. Now, there is a genuine  concern
> > > > From kernel point of view that, DPDK PF can intercept,
> > > > VF mailbox message or so and deny the Kernel request
> > > > Or what if DPDK PF application crashes?
> > > > 
> > > > To avoid the case (5), (3) is not allowed in stock kernel.
> > > > Which makes sense IMO.
> > > > 
> > > > Now, From DPDK PoV, step 5 is valid as we have
> > > > Rte_flow's VF action etc used to enable such case.
> > > > Where, user can program the PF's rte_flow to steer
> > > > Some traffic to VF, where VF can be, DPDK application or
> > > > Linux kernel netdev driver.
> > > > 
> > > > This patch enables the step (3) to enable step (5) from DPDK
> > > > PoV. i.e DPDK needs to allow PF to bind to DPDK with VFs.
> > > > 
> > > > Why this issue now:
> > > > - igb_uio kernel driver is used as enabling step (3)
> > > > See store_max_vfs() kernel/linux/igb_uio/igb_uio.c
> > > >  This is fine for non-iommu device, IOMMU devices
> > > > needs VFIO.
> > > > - We would like support VFIO for IOMMU protection
> > > > And enable step (5) as DPDK supports form the spec level.
> > > > i.e need to fix feature disparity between iommu vs
> > > > non-iommu based devices.
> > > > 
> > > > Note:
> > > > We may not need a  brand new kernel module, we could move
> > > > this logic to igb_uio if maintenance is concern.
> > > 
> > > 
> > > 
> > > 
> > 
> > -- 
> > Kind regards,
> > Luca Boccassi
Alex Williamson Nov. 6, 2019, 10:32 p.m. UTC | #19
On Thu, 31 Oct 2019 18:03:53 +0100
Thomas Monjalon <thomas@monjalon.net> wrote:

> We don't get enough attention on this topic.
> Let me rephrase the issue and the proposals with more people Cc'ed.
> 
> We are talking about SR-IOV VFs in VMs
> with a PF managed on the host by DPDK.
> The PF driver is either a (1) bifurcated (Mellanox case),
> or (2) bound to UIO with igb_uio, or (3) bound to VFIO.
> 
> In case 1, the PF is still managed by a kernel driver, so no issue.
> 
> In case 2, the PF is managed by UIO.
> There is no SR-IOV support in upstream UIO,
> but the out-of-tree module igb_uio works.
> However we would like to drop this legacy module from DPDK.
> Some (most) Linux distributions do not package igb_uio anyway.
> The other issue is that igb_uio is using physical addressing,
> which is not acceptable with OCTEON TX2 for performance reason.
> 
> In case 3, the PF is managed by VFIO. This is the case we want to fix.
> VFIO does not allow to create VFs.
> The workaround is to create VFs before binding the PF to VFIO.
> But since Linux 4.19, VFIO forbids any SR-IOV VF management.
> There is a security concern about allowing userspace to manage SR-IOV
> VF messages and taking the responsibility for VFs in the guest.
> 
> It is desired to allow the system admin deciding the security levels,
> by adding a flag in VFIO "let me manage VFs, I know what I am doing".
> Reference of "recent" discussion: https://lkml.org/lkml/2018/3/6/855
> For now, there is no upstream solution merged.
> 
> This patch is proposing a solution using an out-of-tree module.
> In this case, the admin will decide explicitly to bind the PF to vfio_pf.
> Unfortunately this solution won't work in environments which
> forbid any out-of-tree module.
> Another concern is that it looks like DPDK-only solution.
> 
> We have an issue but we do not want to propose a half-solution
> which would harm other projects and users.
> So the question is:
> Do we accept this patch as a temporary solution?
> Or can we get an agreement soon for an upstream kernel solution?
> 
> Thanks for reading and giving your (clear) opinion.

I'm pretty appalled that anyone would consider shipping this module and
actually claiming that it's supported in some way.  Seriously, it's
disturbing to see a driver that intentionally circumvents a security
issue that we all seem to agree exists, but just hand wave that it
doesn't apply to dpdk configurations.  Ideas have been suggested
upstream for for quarantining VFs generated from user owned PFs such
that we require an opt-in to make use of them in this way.  Nobody
seems to be pursuing such ideas upstream.  I don't even see upstream
proposals to add a scary sounding module option to vfio-pci that would
taint the kernel, but make this available.  If nothing else, please
remove the vfio name from this abomination, it has nothing to do with
vfio other than to try to subvert the security and isolation that vfio
attempts to provide.  Thanks,

Alex
Jerin Jacob Nov. 7, 2019, 5:02 a.m. UTC | #20
On Thu, Nov 7, 2019 at 4:03 AM Alex Williamson
<alex.williamson@redhat.com> wrote:
>
> On Thu, 31 Oct 2019 18:03:53 +0100
> Thomas Monjalon <thomas@monjalon.net> wrote:
>
> > We don't get enough attention on this topic.
> > Let me rephrase the issue and the proposals with more people Cc'ed.
> >
> > We are talking about SR-IOV VFs in VMs
> > with a PF managed on the host by DPDK.
> > The PF driver is either a (1) bifurcated (Mellanox case),
> > or (2) bound to UIO with igb_uio, or (3) bound to VFIO.
> >
> > In case 1, the PF is still managed by a kernel driver, so no issue.
> >
> > In case 2, the PF is managed by UIO.
> > There is no SR-IOV support in upstream UIO,
> > but the out-of-tree module igb_uio works.
> > However we would like to drop this legacy module from DPDK.
> > Some (most) Linux distributions do not package igb_uio anyway.
> > The other issue is that igb_uio is using physical addressing,
> > which is not acceptable with OCTEON TX2 for performance reason.
> >
> > In case 3, the PF is managed by VFIO. This is the case we want to fix.
> > VFIO does not allow to create VFs.
> > The workaround is to create VFs before binding the PF to VFIO.
> > But since Linux 4.19, VFIO forbids any SR-IOV VF management.
> > There is a security concern about allowing userspace to manage SR-IOV
> > VF messages and taking the responsibility for VFs in the guest.
> >
> > It is desired to allow the system admin deciding the security levels,
> > by adding a flag in VFIO "let me manage VFs, I know what I am doing".
> > Reference of "recent" discussion: https://lkml.org/lkml/2018/3/6/855
> > For now, there is no upstream solution merged.
> >
> > This patch is proposing a solution using an out-of-tree module.
> > In this case, the admin will decide explicitly to bind the PF to vfio_pf.
> > Unfortunately this solution won't work in environments which
> > forbid any out-of-tree module.
> > Another concern is that it looks like DPDK-only solution.
> >
> > We have an issue but we do not want to propose a half-solution
> > which would harm other projects and users.
> > So the question is:
> > Do we accept this patch as a temporary solution?
> > Or can we get an agreement soon for an upstream kernel solution?
> >
> > Thanks for reading and giving your (clear) opinion.

Thanks, Alex for the feedback.

> I'm pretty appalled that anyone would consider shipping this module and
> actually claiming that it's supported in some way.  Seriously, it's

Actually DPDK already shipping with this hack using the igb_uio module for UIO.

https://git.dpdk.org/dpdk/tree/kernel/linux/igb_uio/igb_uio.c#n44.

> disturbing to see a driver that intentionally circumvents a security
> issue that we all seem to agree exists, but just hand wave that it
> doesn't apply to dpdk configurations.

Yes. There is a security issue wrt netdev VFs. That's the reason, I
was scared to
submit any patch in upstream on this front. Having said that, OVS-DPDK
kind of userland
programs would like to define the fate of the netdev VF packets of the guest
as it is vswitch application. So there is a perception change in who
is controlling the who.


> Ideas have been suggested
> upstream for for quarantining VFs generated from user owned PFs such
> that we require an opt-in to make use of them in this way.  Nobody
> seems to be pursuing such ideas upstream.  I don't even see upstream
> proposals to add a scary sounding module option to vfio-pci that would
> taint the kernel, but make this available.  If nothing else, please
> remove the vfio name from this abomination, it has nothing to do with
> vfio other than to try to subvert the security and isolation that vfio
> attempts to provide.

Thanks for the feedback. Let's hold on accepting this patch.

We would like to have an upstream solution so that DPDK needs to only focus
on userspace.

I will work on submitting a patch for the discussion in Linux upstream.
Let see how the discussion goes, Based on the that, We can revisit
fate of this module.


> Alex
>
Thomas Monjalon Nov. 15, 2019, 6:57 a.m. UTC | #21
07/11/2019 06:02, Jerin Jacob:
> On Thu, Nov 7, 2019 at 4:03 AM Alex Williamson
> <alex.williamson@redhat.com> wrote:
> > Ideas have been suggested
> > upstream for for quarantining VFs generated from user owned PFs such
> > that we require an opt-in to make use of them in this way.  Nobody
> > seems to be pursuing such ideas upstream.  I don't even see upstream
> > proposals to add a scary sounding module option to vfio-pci that would
> > taint the kernel, but make this available.  If nothing else, please
> > remove the vfio name from this abomination, it has nothing to do with
> > vfio other than to try to subvert the security and isolation that vfio
> > attempts to provide.
> 
> Thanks for the feedback. Let's hold on accepting this patch.
> 
> We would like to have an upstream solution so that DPDK needs to only focus
> on userspace.
> 
> I will work on submitting a patch for the discussion in Linux upstream.
> Let see how the discussion goes, Based on the that, We can revisit
> fate of this module.

Just to be clear, vfio_pf is on hold, i.e. not merged in DPDK 19.11.
Jerin Jacob Nov. 15, 2019, 7:01 a.m. UTC | #22
On Fri, Nov 15, 2019 at 12:27 PM Thomas Monjalon <thomas@monjalon.net> wrote:
>
> 07/11/2019 06:02, Jerin Jacob:
> > On Thu, Nov 7, 2019 at 4:03 AM Alex Williamson
> > <alex.williamson@redhat.com> wrote:
> > > Ideas have been suggested
> > > upstream for for quarantining VFs generated from user owned PFs such
> > > that we require an opt-in to make use of them in this way.  Nobody
> > > seems to be pursuing such ideas upstream.  I don't even see upstream
> > > proposals to add a scary sounding module option to vfio-pci that would
> > > taint the kernel, but make this available.  If nothing else, please
> > > remove the vfio name from this abomination, it has nothing to do with
> > > vfio other than to try to subvert the security and isolation that vfio
> > > attempts to provide.
> >
> > Thanks for the feedback. Let's hold on accepting this patch.
> >
> > We would like to have an upstream solution so that DPDK needs to only focus
> > on userspace.
> >
> > I will work on submitting a patch for the discussion in Linux upstream.
> > Let see how the discussion goes, Based on the that, We can revisit
> > fate of this module.
>
> Just to be clear, vfio_pf is on hold, i.e. not merged in DPDK 19.11.

+1

>
>

Patch
diff mbox series

diff --git a/MAINTAINERS b/MAINTAINERS
index 4100260..49178a5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -288,6 +288,12 @@  M: Anatoly Burakov <anatoly.burakov@intel.com>
 F: lib/librte_eal/linux/eal/*vfio*
 F: drivers/bus/pci/linux/*vfio*
 
+Linux VFIO_PF
+M: Vamsi Attunuru <vattunuru@marvell.com>
+M: Jerin Jacob <jerinj@marvell.com>
+F: kernel/linux/vfio_pf/
+F: doc/guides/prog_guide/vfio_pf.rst
+
 FreeBSD EAL (with overlaps)
 M: Bruce Richardson <bruce.richardson@intel.com>
 F: lib/librte_eal/freebsd/Makefile
diff --git a/config/common_linux b/config/common_linux
index 6e25255..70f1e79 100644
--- a/config/common_linux
+++ b/config/common_linux
@@ -11,6 +11,7 @@  CONFIG_RTE_EAL_NUMA_AWARE_HUGEPAGES=y
 CONFIG_RTE_EAL_IGB_UIO=y
 CONFIG_RTE_EAL_VFIO=y
 CONFIG_RTE_KNI_KMOD=y
+CONFIG_RTE_VFIO_PF_KMOD=y
 CONFIG_RTE_LIBRTE_KNI=y
 CONFIG_RTE_LIBRTE_PMD_KNI=y
 CONFIG_RTE_LIBRTE_VHOST=y
diff --git a/doc/guides/prog_guide/index.rst b/doc/guides/prog_guide/index.rst
index 692409a..601244c 100644
--- a/doc/guides/prog_guide/index.rst
+++ b/doc/guides/prog_guide/index.rst
@@ -43,6 +43,7 @@  Programmer's Guide
     pdump_lib
     multi_proc_support
     kernel_nic_interface
+    vfio_pf
     thread_safety_dpdk_functions
     eventdev
     event_ethernet_rx_adapter
diff --git a/doc/guides/prog_guide/vfio_pf.rst b/doc/guides/prog_guide/vfio_pf.rst
new file mode 100644
index 0000000..7e43cc7
--- /dev/null
+++ b/doc/guides/prog_guide/vfio_pf.rst
@@ -0,0 +1,90 @@ 
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(C) 2019 Marvell International Ltd.
+
+.. _vfio_pf:
+
+The DPDK VFIO_PF Kernel Module
+------------------------------
+
+The VFIO_PF kernel loadable module ``vfio_pf`` provides sysfs way of enabling
+PCI VF devices when the PCI PF is bound to VFIO driver.
+
+DPDK use case such as VF representer or OVS offload etc would call for PF
+and VF PCIe devices to bind vfio-pci module to enable IOMMU protection.
+
+In addition to vSwitch use case, unlike, other PCI class of devices, Network
+class of PCIe devices would have additional responsibility on the PF devices
+such as promiscuous mode support etc. The above use cases demand VFIO needs
+bound to PF and its VF devices.
+
+These use case is not supported in Linux kernel, due to a security issue
+where it is possible to have DoS in case if VF attached to guest over vfio-pci
+and netdev kernel driver runs on it and which something VF representer would
+like to enable it.
+
+The igb_uio enables such PF and VF binding support for non-iommu devices to
+make VF representer or OVS offload run on non-iommu devices with DoS
+vulnerability for netdev driver as VF.
+
+This kernel module facilitate to enable SRIOV on PF devices, therefore, to run
+both PF and VF devices in VFIO mode knowing its impacts like igb_uio driver
+functions of non-iommu devices.
+
+Example usage
+-------------
+
+Following example demonstrates enabling vfs on Marvell's Octeontx2 platform.
+RVU PF devices PF1 & PF2 are probed on BDFs "0002:02:00.0" and "0002:03:00.0"
+respectively. 2 VFs of each PF are enabled after followng the below procedure.
+
+.. code-block:: console
+
+    # echo "177d a063" >  /sys/bus/pci/drivers/vfio-pci/new_id
+    # echo 0002:02:00.0 > /sys/bus/pci/devices/0002:02:00.0/driver/unbind
+    # echo 0002:02:00.0 > /sys/bus/pci/drivers/vfio-pci/bind
+    # echo 0002:03:00.0 > /sys/bus/pci/devices/0002:03:00.0/driver/unbind
+    # echo 0002:03:00.0 > /sys/bus/pci/drivers/vfio-pci/bind
+
+    # lspci -k
+     0002:02:00.0 Ethernet controller: Cavium, Inc. Device a063 (rev 01)
+        Subsystem: Cavium, Inc. Device b200
+        Kernel driver in use: vfio-pci
+     0002:03:00.0 Ethernet controller: Cavium, Inc. Device a063 (rev 01)
+        Subsystem: Cavium, Inc. Device b200
+        Kernel driver in use: vfio-pci
+
+    # insmod build/kernel/linux/vfio_pf/vfio_pf.ko
+
+    # echo 0002:02:00.0 > /sys/vfio_pf/add_device
+    # echo 2 > /sys/vfio_pf/0002\:02\:00.0/num_vfs
+    # echo 0002:03:00.0 > /sys/vfio_pf/add_device
+    # echo 2 > /sys/vfio_pf/0002\:03\:00.0/num_vfs
+
+    # lspci -k
+     0002:02:00.0 Ethernet controller: Cavium, Inc. Device a063 (rev 01)
+        Subsystem: Cavium, Inc. Device b200
+        Kernel driver in use: vfio-pci
+     0002:02:00.1 Ethernet controller: Cavium, Inc. Device a064 (rev 01)
+        Subsystem: Cavium, Inc. Device b200
+     0002:02:00.2 Ethernet controller: Cavium, Inc. Device a064 (rev 01)
+        Subsystem: Cavium, Inc. Device b200
+     0002:03:00.0 Ethernet controller: Cavium, Inc. Device a063 (rev 01)
+        Subsystem: Cavium, Inc. Device b200
+        Kernel driver in use: vfio-pci
+     0002:03:00.1 Ethernet controller: Cavium, Inc. Device a064 (rev 01)
+        Subsystem: Cavium, Inc. Device b200
+     0002:03:00.2 Ethernet controller: Cavium, Inc. Device a064 (rev 01)
+        Subsystem: Cavium, Inc. Device b200
+
+    # echo 0 > /sys/vfio_pf/0002\:02\:00.0/num_vfs
+    # echo 0002:02:00.0 > /sys/vfio_pf/remove_device
+    # echo 0 > /sys/vfio_pf/0002\:03\:00.0/num_vfs
+    # echo 0002:03:00.0 > /sys/vfio_pf/remove_device
+
+    # rmmod build/kernel/linux/vfio_pf/vfio_pf.ko
+
+Prerequisite
+-------------
+
+PCI PF device needs to be bound to VFIO driver before enabling VFs using
+vfio_pf kernel module.
diff --git a/kernel/linux/Makefile b/kernel/linux/Makefile
index c2c45a3..ea202ec 100644
--- a/kernel/linux/Makefile
+++ b/kernel/linux/Makefile
@@ -5,5 +5,6 @@  include $(RTE_SDK)/mk/rte.vars.mk
 
 DIRS-$(CONFIG_RTE_EAL_IGB_UIO) += igb_uio
 DIRS-$(CONFIG_RTE_KNI_KMOD) += kni
+DIRS-$(CONFIG_RTE_VFIO_PF_KMOD) += vfio_pf
 
 include $(RTE_SDK)/mk/rte.subdir.mk
diff --git a/kernel/linux/meson.build b/kernel/linux/meson.build
index 1796cc6..50591dd 100644
--- a/kernel/linux/meson.build
+++ b/kernel/linux/meson.build
@@ -1,7 +1,7 @@ 
 # SPDX-License-Identifier: BSD-3-Clause
 # Copyright(c) 2018 Intel Corporation
 
-subdirs = ['igb_uio', 'kni']
+subdirs = ['igb_uio', 'kni', 'vfio_pf']
 
 # if we are cross-compiling we need kernel_dir specified
 if get_option('kernel_dir') == '' and meson.is_cross_build()
diff --git a/kernel/linux/vfio_pf/Kbuild b/kernel/linux/vfio_pf/Kbuild
new file mode 100644
index 0000000..18365c8
--- /dev/null
+++ b/kernel/linux/vfio_pf/Kbuild
@@ -0,0 +1,2 @@ 
+ccflags-y := $(MODULE_CFLAGS)
+obj-m := vfio_pf.o
diff --git a/kernel/linux/vfio_pf/Makefile b/kernel/linux/vfio_pf/Makefile
new file mode 100644
index 0000000..0ffbc6f
--- /dev/null
+++ b/kernel/linux/vfio_pf/Makefile
@@ -0,0 +1,24 @@ 
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(C) 2019 Marvell International Ltd.
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# module name and path
+#
+MODULE = vfio_pf
+
+#
+# CFLAGS
+#
+MODULE_CFLAGS += -I$(SRCDIR) --param max-inline-insns-single=100
+MODULE_CFLAGS += -I$(RTE_OUTPUT)/include
+MODULE_CFLAGS += -Winline -Wall -Werror
+MODULE_CFLAGS += -include $(RTE_OUTPUT)/include/rte_config.h
+
+#
+# all source are stored in SRCS-y
+#
+SRCS-y := vfio_pf.c
+
+include $(RTE_SDK)/mk/rte.module.mk
diff --git a/kernel/linux/vfio_pf/meson.build b/kernel/linux/vfio_pf/meson.build
new file mode 100644
index 0000000..e485e4c
--- /dev/null
+++ b/kernel/linux/vfio_pf/meson.build
@@ -0,0 +1,20 @@ 
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(C) 2019 Marvell International Ltd.
+
+mkfile = custom_target('vfio_pf_makefile',
+	output: 'Makefile',
+	command: ['touch', '@OUTPUT@'])
+
+custom_target('vfio_pf',
+	input: ['vfio_pf.c', 'Kbuild'],
+	output: 'vfio_pf.ko',
+	command: ['make', '-C', kernel_dir + '/build',
+		'M=' + meson.current_build_dir(),
+		'src=' + meson.current_source_dir(),
+		'EXTRA_CFLAGS=-I' + meson.current_source_dir() +
+			'/../../../lib/librte_eal/common/include',
+		'modules'],
+	depends: mkfile,
+	install: true,
+	install_dir: kernel_dir + '/extra/dpdk',
+	build_by_default: get_option('enable_kmods'))
diff --git a/kernel/linux/vfio_pf/vfio_pf.c b/kernel/linux/vfio_pf/vfio_pf.c
new file mode 100644
index 0000000..42f7f93
--- /dev/null
+++ b/kernel/linux/vfio_pf/vfio_pf.c
@@ -0,0 +1,360 @@ 
+/* SPDX-License-Identifier: GPL-2.0
+ * Copyright(C) 2019 Marvell International Ltd.
+ */
+
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/pci.h>
+
+struct vfio_pf_group {
+	struct kobject *kobj;
+	struct kobj_attribute add_pf;
+	struct kobj_attribute remove_pf;
+	struct mutex lock;
+	struct list_head head;
+};
+
+static struct vfio_pf_group *pf_group;
+
+#define FMT_NVAL 4
+#define PCI_STR_SIZE sizeof("XXXXXXXX:XX:XX.X")
+
+struct pci_addr {
+	char name[PCI_STR_SIZE + 1];
+	uint32_t domain;
+	uint8_t bus;
+	uint8_t devid;
+	uint8_t function;
+};
+
+struct pf_obj {
+	struct list_head node;
+	struct pci_dev *pdev;
+	struct kobject *kobj;
+	struct kobj_attribute sysfs;
+	struct pci_addr paddr;
+};
+
+static int
+str_split(char *string, int stringlen,
+	  char **tokens, int maxtokens, char delim)
+{
+	int tokstart = 1;
+	int i, tok = 0;
+
+	if (string == NULL || tokens == NULL)
+		goto error;
+
+	for (i = 0; i < stringlen; i++) {
+		if (string[i] == '\0' || tok >= maxtokens)
+			break;
+		if (tokstart) {
+			tokstart = 0;
+			tokens[tok++] = &string[i];
+		}
+		if (string[i] == delim) {
+			string[i] = '\0';
+			tokstart = 1;
+		}
+	}
+	return tok;
+
+error:
+	return -1;
+}
+
+static int
+parse_pci_addr(const char *buf, int bufsize, struct pci_addr *paddr)
+{
+	union splitaddr {
+		struct {
+			char *domain;
+			char *bus;
+			char *devid;
+			char *function;
+		};
+		char *str[FMT_NVAL];
+	} splitaddr;
+
+	char *buf_copy = kstrndup(buf, bufsize, GFP_KERNEL);
+	if (buf_copy == NULL)
+		return -ENOMEM;
+
+	if (str_split(buf_copy, bufsize, splitaddr.str, FMT_NVAL, ':')
+			!= FMT_NVAL - 1)
+		goto error;
+	/* final split is on '.' between devid and function */
+	splitaddr.function = strchr(splitaddr.devid, '.');
+	if (splitaddr.function == NULL)
+		goto error;
+	*splitaddr.function++ = '\0';
+
+	if (kstrtou32(splitaddr.domain, 16, &paddr->domain) ||
+		kstrtou8(splitaddr.bus, 16, &paddr->bus) ||
+		kstrtou8(splitaddr.devid, 16, &paddr->devid) ||
+		kstrtou8(splitaddr.function, 10, &paddr->function))
+		goto error;
+
+	snprintf(paddr->name, sizeof(paddr->name), "%.4x:%.2x:%.2x.%.x",
+		 paddr->domain, paddr->bus, paddr->devid, paddr->function);
+
+	kfree(buf_copy);
+	return 0;
+error:
+	kfree(buf_copy);
+	return -EINVAL;
+}
+
+static ssize_t
+show_num_vfs(struct kobject *kobj, struct kobj_attribute *attr, char *buf)
+{
+	struct pf_obj *obj;
+
+	obj = container_of(attr, struct pf_obj, sysfs);
+
+	return snprintf(buf, 10, "%u\n", pci_num_vf(obj->pdev));
+}
+
+static ssize_t
+store_num_vfs(struct kobject *kobj, struct kobj_attribute *attr,
+	      const char *buf, size_t count)
+{
+	struct pf_obj *obj;
+	int num_vfs, err = 0;
+
+	obj = container_of(attr, struct pf_obj, sysfs);
+
+	if (kstrtoint(buf, 0, &num_vfs)) {
+		pr_err("Invalid %s,  %s\n", attr->attr.name, buf);
+		err = -EIO;
+	}
+	if (num_vfs < 0) {
+		pr_err("Invalid %s,  %d < 0\n", attr->attr.name,
+			num_vfs);
+		err = -EIO;
+	}
+
+	if (num_vfs == 0)
+		pci_disable_sriov(obj->pdev);
+	else if (pci_num_vf(obj->pdev) == 0)
+		err = pci_enable_sriov(obj->pdev, num_vfs);
+	else
+		err = -EINVAL;
+
+	return err ? err : count;
+}
+
+static int
+pf_sysfs_create(char *name, struct pf_obj *obj)
+{
+	int err;
+
+	if (name == NULL || obj == NULL)
+		return -EINVAL;
+
+	obj->sysfs.show = show_num_vfs;
+	obj->sysfs.store = store_num_vfs;
+	obj->sysfs.attr.name = name;
+	obj->sysfs.attr.mode = 0644;
+
+	sysfs_attr_init(&obj->sysfs.attr);
+	err = sysfs_create_file(obj->kobj, &obj->sysfs.attr);
+	if (err) {
+		pr_err("Failed to create '%s' sysfs for '%s'\n",
+			name, kobject_name(obj->kobj));
+		return -EFAULT;
+	}
+
+	return 0;
+}
+
+static int
+probe_pf_dev(struct pf_obj *obj)
+{
+	struct pci_dev *pdev = NULL;
+	struct pci_addr *paddr;
+
+	paddr = &obj->paddr;
+
+	pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev);
+	if (pdev == NULL)
+		return -ENODEV;
+
+	while (pdev) {
+		if ((paddr->domain == pci_domain_nr(pdev->bus)) &&
+		    (pdev->bus->number == paddr->bus) &&
+		    (PCI_SLOT(pdev->devfn) == paddr->devid) &&
+		    (PCI_FUNC(pdev->devfn) == paddr->function))
+			break;
+
+		pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev);
+	};
+
+	if (pdev) {
+		obj->pdev = pdev;
+		return 0;
+	} else
+		return -ENODEV;
+}
+
+static ssize_t
+add_device(struct kobject *kobj, struct kobj_attribute *attr,
+	   const char *buf, size_t count)
+{
+	struct pf_obj *obj;
+	int err = 0;
+
+	obj = kzalloc(sizeof(struct pf_obj), GFP_KERNEL);
+	if (obj == NULL) {
+		err = -ENOMEM;
+		goto exit;
+	}
+
+	if (parse_pci_addr(buf, strlen(buf), &obj->paddr)) {
+		err = -EINVAL;
+		goto exit;
+	}
+
+	if (probe_pf_dev(obj)) {
+		err = -ENXIO;
+		goto exit;
+	}
+
+	obj->kobj = kobject_create_and_add(obj->paddr.name, pf_group->kobj);
+
+	if (pf_sysfs_create("num_vfs", obj)) {
+		pci_dev_put(obj->pdev);
+		pr_err("Failed to create the sysfs for pdev:%s\n",
+			obj->paddr.name);
+		return -EFAULT;
+	}
+
+	mutex_lock(&pf_group->lock);
+	list_add(&obj->node, &pf_group->head);
+	mutex_unlock(&pf_group->lock);
+
+exit:
+	if (err && obj)
+		kfree(obj);
+
+	return err ? err : count;
+}
+
+static void
+remove_pf_obj(struct pf_obj *obj)
+{
+	if (pci_num_vf(obj->pdev))
+		pci_disable_sriov(obj->pdev);
+	sysfs_remove_file(obj->kobj, &obj->sysfs.attr);
+	kobject_del(obj->kobj);
+	list_del(&obj->node);
+	pci_dev_put(obj->pdev);
+	kfree(obj);
+}
+
+static ssize_t
+remove_device(struct kobject *kobj, struct kobj_attribute *attr,
+	      const char *buf, size_t count)
+{
+	struct list_head *pos, *tmp;
+	struct pci_addr paddr;
+	struct pf_obj *obj;
+	int err = 0;
+
+	if (parse_pci_addr(buf, strlen(buf), &paddr)) {
+		err = -EINVAL;
+		goto exit;
+	}
+
+	list_for_each_safe(pos, tmp, &pf_group->head) {
+		obj = list_entry(pos, struct pf_obj, node);
+		if (!strncmp(obj->paddr.name, paddr.name, sizeof(paddr.name))) {
+			remove_pf_obj(obj);
+			break;
+		}
+	}
+
+exit:
+	return err ? err : count;
+}
+
+static void
+destroy_pf_objs(void)
+{
+	struct list_head *pos, *tmp;
+	struct pf_obj *obj;
+
+	list_for_each_safe(pos, tmp, &pf_group->head) {
+		obj = list_entry(pos, struct pf_obj, node);
+		remove_pf_obj(obj);
+	}
+}
+
+static void
+__exit vfio_pf_cleanup(void)
+{
+	if (pf_group == NULL)
+		return;
+
+	destroy_pf_objs();
+	sysfs_remove_file(pf_group->kobj, &pf_group->add_pf.attr);
+	sysfs_remove_file(pf_group->kobj, &pf_group->remove_pf.attr);
+	kobject_del(pf_group->kobj);
+	mutex_destroy(&pf_group->lock);
+	kfree(pf_group);
+}
+
+static int
+__init vfio_pf_init(void)
+{
+	int err;
+
+	pf_group = kzalloc(sizeof(struct vfio_pf_group), GFP_KERNEL);
+	if (pf_group == NULL)
+		return -ENOMEM;
+
+	pf_group->kobj = kobject_create_and_add("vfio_pf", NULL);
+	pf_group->add_pf.store = add_device;
+	pf_group->add_pf.attr.name = "add_device";
+	pf_group->add_pf.attr.mode = 0644;
+	pf_group->remove_pf.store = remove_device;
+	pf_group->remove_pf.attr.name = "remove_device";
+	pf_group->remove_pf.attr.mode = 0644;
+
+	sysfs_attr_init(&pf_group->add_pf.attr);
+	err = sysfs_create_file(pf_group->kobj, &pf_group->add_pf.attr);
+	if (err) {
+		pr_err("Failed to create sysfs '%s' for '%s'\n",
+			pf_group->add_pf.attr.name,
+			kobject_name(pf_group->kobj));
+		goto exit;
+	}
+
+	sysfs_attr_init(&pf_group->remove_pf.attr);
+	err = sysfs_create_file(pf_group->kobj, &pf_group->remove_pf.attr);
+	if (err) {
+		pr_err("Failed to create sysfs '%s' for '%s'\n",
+			pf_group->remove_pf.attr.name,
+			kobject_name(pf_group->kobj));
+		sysfs_remove_file(pf_group->kobj, &pf_group->add_pf.attr);
+		goto exit;
+	}
+
+	INIT_LIST_HEAD(&pf_group->head);
+
+	mutex_init(&pf_group->lock);
+
+	return 0;
+
+exit:
+	kfree(pf_group);
+	return -EFAULT;
+}
+
+module_init(vfio_pf_init);
+module_exit(vfio_pf_cleanup);
+
+MODULE_DESCRIPTION("Kernel module for enabling SRIOV");
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Marvell International Ltd");