Message ID | 1443652138-31782-3-git-send-email-stephen@networkplumber.org (mailing list archive) |
---|---|
State | Not Applicable, archived |
Headers | From: Stephen Hemminger <stephen@networkplumber.org>; To: hjk@hansjkoch.de, gregkh@linux-foundation.org; Cc: dev@dpdk.org, linux-kernel@vger.kernel.org; Date: Wed, 30 Sep 2015 15:28:58 -0700; Subject: [dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X; In-Reply-To: <1443652138-31782-1-git-send-email-stephen@networkplumber.org> |
Commit Message
Stephen Hemminger
Sept. 30, 2015, 10:28 p.m. UTC
This driver allows using PCI device with Message Signalled Interrupt
from userspace. The API is similar to the igb_uio driver used by the DPDK.
Via ioctl it provides a mechanism to map MSI-X interrupts into event
file descriptors similar to VFIO.
VFIO is a better choice if IOMMU is available, but often userspace drivers
have to work in environments where IOMMU support (real or emulated) is
not available. All UIO drivers that support DMA are not secure against
rogue userspace applications programming DMA hardware to access
private memory; this driver is no less secure than existing code.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
drivers/uio/Kconfig | 9 ++
drivers/uio/Makefile | 1 +
drivers/uio/uio_msi.c | 378 +++++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/uio_msi.h | 22 +++
5 files changed, 411 insertions(+)
create mode 100644 drivers/uio/uio_msi.c
create mode 100644 include/uapi/linux/uio_msi.h
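A minimal userspace sketch of how the ioctl described in the commit message might be used: bind an eventfd to one MSI-X vector, then block until the vector fires. The struct layout and ioctl number are copied from the proposed include/uapi/linux/uio_msi.h (with `u32` spelled `uint32_t`); they are repeated locally because that header exists only in this patch. The helper names `uio_msi_bind_vector` and `uio_msi_wait` are assumptions made for illustration, not part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Copied from the proposed uio_msi.h; u32 written as uint32_t for userspace. */
struct uio_msi_irq_set {
	uint32_t vec;
	int fd;
};

#define UIO_MSI_BASE	0x86
#define UIO_MSI_IRQ_SET	_IOW('I', UIO_MSI_BASE+1, struct uio_msi_irq_set)

/*
 * Hypothetical helper: route MSI-X vector `vec` of the UIO device open on
 * `uio_fd` to the eventfd `efd`. Passing efd = -1 removes the mapping.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int uio_msi_bind_vector(int uio_fd, uint32_t vec, int efd)
{
	struct uio_msi_irq_set req = { .vec = vec, .fd = efd };

	return ioctl(uio_fd, UIO_MSI_IRQ_SET, &req);
}

/*
 * Hypothetical helper: block until the eventfd has been signalled at least
 * once, and store the number of signals since the previous read in *count.
 * Returns 0 on success, -1 on failure.
 */
static int uio_msi_wait(int efd, uint64_t *count)
{
	ssize_t n;

	assert(count != NULL);
	do {
		n = read(efd, count, sizeof(*count));
	} while (n < 0 && errno == EINTR);

	return n == (ssize_t)sizeof(*count) ? 0 : -1;
}
```

A caller would open the UIO character device, create one eventfd per vector with `eventfd(0, 0)`, call `uio_msi_bind_vector()` for each, and then poll or read the eventfds from its event loop.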
Comments
On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> This driver allows using PCI device with Message Signalled Interrupt
> from userspace. The API is similar to the igb_uio driver used by the DPDK.
> Via ioctl it provides a mechanism to map MSI-X interrupts into event
> file descriptors similar to VFIO.
>
> VFIO is a better choice if IOMMU is available, but often userspace drivers
> have to work in environments where IOMMU support (real or emulated) is
> not available. All UIO drivers that support DMA are not secure against
> rogue userspace applications programming DMA hardware to access
> private memory; this driver is no less secure than existing code.
>
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>

I don't think copying the igb_uio interface is a good idea.
What DPDK is doing with igb_uio (and indeed uio_pci_generic)
is abusing the sysfs BAR access to provide unlimited
access to hardware.

MSI messages are memory writes so any generic device capable
of MSI is capable of corrupting kernel memory.
This means that a bug in userspace will lead to kernel memory corruption
and crashes. This is something distributions can't support.

uio_pci_generic is already abused like that, mostly
because when I wrote it, I didn't add enough protections
against using it with DMA capable devices,
and we can't go back and break working userspace.
But at least it does not bind to VFs which all of
them are capable of DMA.

The result of merging this driver will be userspace abusing the
sysfs BAR access with VFs as well, and we do not want that.


Just forwarding events is not enough to make a valid driver.
What is missing is a way to access the device in a safe way.

On a more positive note:

What would be a reasonable interface? One that does the following
in kernel:

1. initializes device rings (can be in pinned userspace memory,
   but can not be writeable by userspace), brings up interface link
2. pins userspace memory (unless using e.g. hugetlbfs)
3. gets request, make sure it's valid and belongs to
   the correct task, put it in the ring
4. in the reverse direction, notify userspace when buffers
   are available in the ring
5. notify userspace about MSI (what this driver does)

What userspace can be allowed to do:

	format requests (e.g. transmit, receive) in userspace
	read ring contents

What userspace can't be allowed to do:

	access BAR
	write rings

This means that the driver can not be a generic one,
and there will be a system call overhead when you
write the ring, but that's the price you have to
pay for ability to run on systems without an IOMMU.

> ---
>  drivers/uio/Kconfig          |   9 ++
>  drivers/uio/Makefile         |   1 +
>  drivers/uio/uio_msi.c        | 378 +++++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/Kbuild    |   1 +
>  include/uapi/linux/uio_msi.h |  22 +++
>  5 files changed, 411 insertions(+)
>  create mode 100644 drivers/uio/uio_msi.c
>  create mode 100644 include/uapi/linux/uio_msi.h
>
> diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig
> index 52c98ce..04adfa0 100644
> --- a/drivers/uio/Kconfig
> +++ b/drivers/uio/Kconfig
> @@ -93,6 +93,15 @@ config UIO_PCI_GENERIC
>  	  primarily, for virtualization scenarios.
>  	  If you compile this as a module, it will be called uio_pci_generic.
>
> +config UIO_PCI_MSI
> +	tristate "Generic driver supporting MSI-x on PCI Express cards"
> +	depends on PCI
> +	help
> +	  Generic driver that provides Message Signalled IRQ events
> +	  similar to VFIO. If IOMMMU is available please use VFIO
> +	  instead since it provides more security.
> +	  If you compile this as a module, it will be called uio_msi.
> +
>  config UIO_NETX
>  	tristate "Hilscher NetX Card driver"
>  	depends on PCI
> diff --git a/drivers/uio/Makefile b/drivers/uio/Makefile
> index 8560dad..62fc44b 100644
> --- a/drivers/uio/Makefile
> +++ b/drivers/uio/Makefile
> @@ -9,3 +9,4 @@ obj-$(CONFIG_UIO_NETX) += uio_netx.o
>  obj-$(CONFIG_UIO_PRUSS) += uio_pruss.o
>  obj-$(CONFIG_UIO_MF624) += uio_mf624.o
>  obj-$(CONFIG_UIO_FSL_ELBC_GPCM) += uio_fsl_elbc_gpcm.o
> +obj-$(CONFIG_UIO_PCI_MSI) += uio_msi.o
> diff --git a/drivers/uio/uio_msi.c b/drivers/uio/uio_msi.c
> new file mode 100644
> index 0000000..802b5c4
> --- /dev/null
> +++ b/drivers/uio/uio_msi.c
> @@ -0,0 +1,378 @@
> +/*-
> + *
> + * Copyright (c) 2015 by Brocade Communications Systems, Inc.
> + * Author: Stephen Hemminger <stephen@networkplumber.org>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 only.
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/device.h>
> +#include <linux/interrupt.h>
> +#include <linux/eventfd.h>
> +#include <linux/module.h>
> +#include <linux/pci.h>
> +#include <linux/uio_driver.h>
> +#include <linux/msi.h>
> +#include <linux/uio_msi.h>
> +
> +#define DRIVER_VERSION "0.1.1"
> +#define MAX_MSIX_VECTORS 64
> +
> +/* MSI-X vector information */
> +struct uio_msi_pci_dev {
> +	struct uio_info info; /* UIO driver info */
> +	struct pci_dev *pdev; /* PCI device */
> +	struct mutex mutex; /* open/release/ioctl mutex */
> +	int ref_cnt; /* references to device */
> +	unsigned int max_vectors; /* MSI-X slots available */
> +	struct msix_entry *msix; /* MSI-X vector table */
> +	struct uio_msi_irq_ctx {
> +		struct eventfd_ctx *trigger; /* vector to eventfd */
> +		char *name; /* name in /proc/interrupts */
> +	} *ctx;
> +};
> +
> +static irqreturn_t uio_intx_irqhandler(int irq, void *arg)
> +{
> +	struct uio_msi_pci_dev *udev = arg;
> +
> +	if (pci_check_and_mask_intx(udev->pdev)) {
> +		eventfd_signal(udev->ctx->trigger, 1);
> +		return IRQ_HANDLED;
> +	}
> +
> +	return IRQ_NONE;
> +}
> +
> +static irqreturn_t uio_msi_irqhandler(int irq, void *arg)
> +{
> +	struct eventfd_ctx *trigger = arg;
> +
> +	eventfd_signal(trigger, 1);
> +	return IRQ_HANDLED;
> +}
> +
> +/* set the mapping between vector # and existing eventfd. */
> +static int set_irq_eventfd(struct uio_msi_pci_dev *udev, u32 vec, int fd)
> +{
> +	struct eventfd_ctx *trigger;
> +	int irq, err;
> +
> +	if (vec >= udev->max_vectors) {
> +		dev_notice(&udev->pdev->dev, "vec %u >= num_vec %u\n",
> +			   vec, udev->max_vectors);
> +		return -ERANGE;
> +	}
> +
> +	irq = udev->msix[vec].vector;
> +	trigger = udev->ctx[vec].trigger;
> +	if (trigger) {
> +		/* Clearup existing irq mapping */
> +		free_irq(irq, trigger);
> +		eventfd_ctx_put(trigger);
> +		udev->ctx[vec].trigger = NULL;
> +	}
> +
> +	/* Passing -1 is used to disable interrupt */
> +	if (fd < 0)
> +		return 0;
> +
> +	trigger = eventfd_ctx_fdget(fd);
> +	if (IS_ERR(trigger)) {
> +		err = PTR_ERR(trigger);
> +		dev_notice(&udev->pdev->dev,
> +			   "eventfd ctx get failed: %d\n", err);
> +		return err;
> +	}
> +
> +	if (udev->msix)
> +		err = request_irq(irq, uio_msi_irqhandler, 0,
> +				  udev->ctx[vec].name, trigger);
> +	else
> +		err = request_irq(irq, uio_intx_irqhandler, IRQF_SHARED,
> +				  udev->ctx[vec].name, udev);
> +
> +	if (err) {
> +		dev_notice(&udev->pdev->dev,
> +			   "request irq failed: %d\n", err);
> +		eventfd_ctx_put(trigger);
> +		return err;
> +	}
> +
> +	udev->ctx[vec].trigger = trigger;
> +	return 0;
> +}
> +
> +static int
> +uio_msi_ioctl(struct uio_info *info, unsigned int cmd, unsigned long arg)
> +{
> +	struct uio_msi_pci_dev *udev
> +		= container_of(info, struct uio_msi_pci_dev, info);
> +	struct uio_msi_irq_set hdr;
> +	int err;
> +
> +	switch (cmd) {
> +	case UIO_MSI_IRQ_SET:
> +		if (copy_from_user(&hdr, (void __user *)arg, sizeof(hdr)))
> +			return -EFAULT;
> +
> +		mutex_lock(&udev->mutex);
> +		err = set_irq_eventfd(udev, hdr.vec, hdr.fd);
> +		mutex_unlock(&udev->mutex);
> +		break;
> +	default:
> +		err = -EOPNOTSUPP;
> +	}
> +	return err;
> +}
> +
> +/* Opening the UIO device for first time enables MSI-X */
> +static int
> +uio_msi_open(struct uio_info *info, struct inode *inode)
> +{
> +	struct uio_msi_pci_dev *udev
> +		= container_of(info, struct uio_msi_pci_dev, info);
> +	int err = 0;
> +
> +	mutex_lock(&udev->mutex);
> +	if (udev->ref_cnt++ == 0) {
> +		if (udev->msix)
> +			err = pci_enable_msix(udev->pdev, udev->msix,
> +					      udev->max_vectors);
> +	}
> +	mutex_unlock(&udev->mutex);
> +
> +	return err;
> +}
> +
> +/* Last close of the UIO device releases/disables all IRQ's */
> +static int
> +uio_msi_release(struct uio_info *info, struct inode *inode)
> +{
> +	struct uio_msi_pci_dev *udev
> +		= container_of(info, struct uio_msi_pci_dev, info);
> +	int i;
> +
> +	mutex_lock(&udev->mutex);
> +	if (--udev->ref_cnt == 0) {
> +		for (i = 0; i < udev->max_vectors; i++) {
> +			int irq = udev->msix[i].vector;
> +			struct eventfd_ctx *trigger = udev->ctx[i].trigger;
> +
> +			if (!trigger)
> +				continue;
> +
> +			free_irq(irq, trigger);
> +			eventfd_ctx_put(trigger);
> +			udev->ctx[i].trigger = NULL;
> +		}
> +
> +		if (udev->msix)
> +			pci_disable_msix(udev->pdev);
> +	}
> +	mutex_unlock(&udev->mutex);
> +
> +	return 0;
> +}
> +
> +/* Unmap previously ioremap'd resources */
> +static void
> +release_iomaps(struct uio_mem *mem)
> +{
> +	int i;
> +
> +	for (i = 0; i < MAX_UIO_MAPS; i++, mem++) {
> +		if (mem->internal_addr)
> +			iounmap(mem->internal_addr);
> +	}
> +}
> +
> +static int
> +setup_maps(struct pci_dev *pdev, struct uio_info *info)
> +{
> +	int i, m = 0, p = 0, err;
> +	static const char * const bar_names[] = {
> +		"BAR0", "BAR1", "BAR2", "BAR3", "BAR4", "BAR5",
> +	};
> +
> +	for (i = 0; i < ARRAY_SIZE(bar_names); i++) {
> +		unsigned long start = pci_resource_start(pdev, i);
> +		unsigned long flags = pci_resource_flags(pdev, i);
> +		unsigned long len = pci_resource_len(pdev, i);
> +
> +		if (start == 0 || len == 0)
> +			continue;
> +
> +		if (flags & IORESOURCE_MEM) {
> +			void __iomem *addr;
> +
> +			if (m >= MAX_UIO_MAPS)
> +				continue;
> +
> +			addr = ioremap(start, len);
> +			if (addr == NULL) {
> +				err = -EINVAL;
> +				goto fail;
> +			}
> +
> +			info->mem[m].name = bar_names[i];
> +			info->mem[m].addr = start;
> +			info->mem[m].internal_addr = addr;
> +			info->mem[m].size = len;
> +			info->mem[m].memtype = UIO_MEM_PHYS;
> +			++m;
> +		} else if (flags & IORESOURCE_IO) {
> +			if (p >= MAX_UIO_PORT_REGIONS)
> +				continue;
> +
> +			info->port[p].name = bar_names[i];
> +			info->port[p].start = start;
> +			info->port[p].size = len;
> +			info->port[p].porttype = UIO_PORT_X86;
> +			++p;
> +		}
> +	}
> +
> +	return 0;
> + fail:
> +	for (i = 0; i < m; i++)
> +		iounmap(info->mem[i].internal_addr);
> +	return err;
> +}
> +
> +static int uio_msi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> +{
> +	struct uio_msi_pci_dev *udev;
> +	int i, err, vectors;
> +
> +	udev = kzalloc(sizeof(struct uio_msi_pci_dev), GFP_KERNEL);
> +	if (!udev)
> +		return -ENOMEM;
> +
> +	err = pci_enable_device(pdev);
> +	if (err != 0) {
> +		dev_err(&pdev->dev, "cannot enable PCI device\n");
> +		goto fail_free;
> +	}
> +
> +	err = pci_request_regions(pdev, "uio_msi");
> +	if (err != 0) {
> +		dev_err(&pdev->dev, "Cannot request regions\n");
> +		goto fail_disable;
> +	}
> +
> +	pci_set_master(pdev);
> +
> +	/* remap resources */
> +	err = setup_maps(pdev, &udev->info);
> +	if (err)
> +		goto fail_release_iomem;
> +
> +	/* fill uio infos */
> +	udev->info.name = "uio_msi";
> +	udev->info.version = DRIVER_VERSION;
> +	udev->info.priv = udev;
> +	udev->pdev = pdev;
> +	udev->info.ioctl = uio_msi_ioctl;
> +	udev->info.open = uio_msi_open;
> +	udev->info.release = uio_msi_release;
> +	udev->info.irq = UIO_IRQ_CUSTOM;
> +	mutex_init(&udev->mutex);
> +
> +	vectors = pci_msix_vec_count(pdev);
> +	if (vectors > 0) {
> +		udev->max_vectors = min_t(u16, vectors, MAX_MSIX_VECTORS);
> +		dev_info(&pdev->dev, "using up to %u MSI-X vectors\n",
> +			 udev->max_vectors);
> +
> +		err = -ENOMEM;
> +		udev->msix = kcalloc(udev->max_vectors,
> +				     sizeof(struct msix_entry), GFP_KERNEL);
> +		if (!udev->msix)
> +			goto fail_release_iomem;
> +	} else if (!pci_intx_mask_supported(pdev)) {
> +		dev_err(&pdev->dev,
> +			"device does not support MSI-X or INTX\n");
> +		err = -EINVAL;
> +		goto fail_release_iomem;
> +	} else {
> +		dev_notice(&pdev->dev, "using INTX\n");
> +		udev->info.irq_flags = IRQF_SHARED;
> +		udev->max_vectors = 1;
> +	}
> +
> +	udev->ctx = kcalloc(udev->max_vectors,
> +			    sizeof(struct uio_msi_irq_ctx), GFP_KERNEL);
> +	if (!udev->ctx)
> +		goto fail_free_msix;
> +
> +	for (i = 0; i < udev->max_vectors; i++) {
> +		udev->msix[i].entry = i;
> +
> +		udev->ctx[i].name = kasprintf(GFP_KERNEL,
> +					      KBUILD_MODNAME "[%d](%s)",
> +					      i, pci_name(pdev));
> +		if (!udev->ctx[i].name)
> +			goto fail_free_ctx;
> +	}
> +
> +	/* register uio driver */
> +	err = uio_register_device(&pdev->dev, &udev->info);
> +	if (err != 0)
> +		goto fail_free_ctx;
> +
> +	pci_set_drvdata(pdev, udev);
> +	return 0;
> +
> +fail_free_ctx:
> +	for (i = 0; i < udev->max_vectors; i++)
> +		kfree(udev->ctx[i].name);
> +	kfree(udev->ctx);
> +fail_free_msix:
> +	kfree(udev->msix);
> +fail_release_iomem:
> +	release_iomaps(udev->info.mem);
> +	pci_release_regions(pdev);
> +fail_disable:
> +	pci_disable_device(pdev);
> +fail_free:
> +	kfree(udev);
> +
> +	pr_notice("%s ret %d\n", __func__, err);
> +	return err;
> +}
> +
> +static void uio_msi_remove(struct pci_dev *pdev)
> +{
> +	struct uio_info *info = pci_get_drvdata(pdev);
> +	struct uio_msi_pci_dev *udev
> +		= container_of(info, struct uio_msi_pci_dev, info);
> +	int i;
> +
> +	uio_unregister_device(info);
> +	release_iomaps(info->mem);
> +
> +	pci_release_regions(pdev);
> +	for (i = 0; i < udev->max_vectors; i++)
> +		kfree(udev->ctx[i].name);
> +	kfree(udev->ctx);
> +	kfree(udev->msix);
> +	pci_disable_device(pdev);
> +
> +	pci_set_drvdata(pdev, NULL);
> +	kfree(udev);
> +}
> +
> +static struct pci_driver uio_msi_pci_driver = {
> +	.name = "uio_msi",
> +	.probe = uio_msi_probe,
> +	.remove = uio_msi_remove,
> +};
> +
> +module_pci_driver(uio_msi_pci_driver);
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Stephen Hemminger <stephen@networkplumber.org>");
> +MODULE_DESCRIPTION("UIO driver for MSI PCI devices");
> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
> index f7b2db4..d9497691 100644
> --- a/include/uapi/linux/Kbuild
> +++ b/include/uapi/linux/Kbuild
> @@ -411,6 +411,7 @@ header-y += udp.h
>  header-y += uhid.h
>  header-y += uinput.h
>  header-y += uio.h
> +header-y += uio_msi.h
>  header-y += ultrasound.h
>  header-y += un.h
>  header-y += unistd.h
> diff --git a/include/uapi/linux/uio_msi.h b/include/uapi/linux/uio_msi.h
> new file mode 100644
> index 0000000..297de00
> --- /dev/null
> +++ b/include/uapi/linux/uio_msi.h
> @@ -0,0 +1,22 @@
> +/*
> + * UIO_MSI API definition
> + *
> + * Copyright (c) 2015 by Brocade Communications Systems, Inc.
> + * All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +#ifndef _UIO_PCI_MSI_H
> +#define _UIO_PCI_MSI_H
> +
> +struct uio_msi_irq_set {
> +	u32 vec;
> +	int fd;
> +};
> +
> +#define UIO_MSI_BASE 0x86
> +#define UIO_MSI_IRQ_SET _IOW('I', UIO_MSI_BASE+1, struct uio_msi_irq_set)
> +
> +#endif
> --
> 2.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
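Step 3 of the interface proposed in the review above (take a request, check that it is valid and belongs to the correct task, then place it in the ring) can be pictured with a small model. This is a sketch under stated assumptions, written as ordinary C that compiles in userspace rather than as kernel code. Every name in it (`struct ring`, `struct request`, `ring_submit`, the pinned-region fields) is hypothetical and does not come from the patch or from any existing driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 4	/* tiny, for illustration only */

/* A request as formatted by userspace: one buffer inside its pinned region. */
struct request {
	uint64_t addr;
	uint32_t len;
};

/* State that only the trusted side is allowed to write. */
struct ring {
	struct request slot[RING_SIZE];
	unsigned int head;	/* next slot to fill */
	unsigned int used;	/* slots currently handed to the device */
	uint64_t pin_base;	/* start of the owner's pinned memory */
	uint64_t pin_len;	/* length of the pinned memory */
	int owner;		/* task that pinned the memory */
};

/* True if [addr, addr + len) lies inside the pinned region, with no overflow. */
static bool request_in_bounds(const struct ring *r, const struct request *req)
{
	uint64_t off;

	if (req->len == 0 || req->addr < r->pin_base)
		return false;
	off = req->addr - r->pin_base;
	return off <= r->pin_len && req->len <= r->pin_len - off;
}

/*
 * Model of the "system call": validate the request, then copy it into the
 * ring. Returns 0 if the request was queued, -1 if it was rejected.
 */
static int ring_submit(struct ring *r, int task, const struct request *req)
{
	assert(r != NULL && req != NULL);

	if (task != r->owner)			/* wrong task */
		return -1;
	if (!request_in_bounds(r, req))		/* outside pinned memory */
		return -1;
	if (r->used == RING_SIZE)		/* ring full */
		return -1;

	r->slot[r->head] = *req;
	r->head = (r->head + 1) % RING_SIZE;
	r->used++;
	return 0;
}
```

The point of the model is that the device only ever sees descriptors that passed the checks, which is what lets such a driver run without an IOMMU; the cost is one validated call per submission.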
On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> > This driver allows using PCI device with Message Signalled Interrupt
> > from userspace. The API is similar to the igb_uio driver used by the DPDK.
> > Via ioctl it provides a mechanism to map MSI-X interrupts into event
> > file descriptors similar to VFIO.
> >
> > VFIO is a better choice if IOMMU is available, but often userspace drivers
> > have to work in environments where IOMMU support (real or emulated) is
> > not available. All UIO drivers that support DMA are not secure against
> > rogue userspace applications programming DMA hardware to access
> > private memory; this driver is no less secure than existing code.
> >
> > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
>
> I don't think copying the igb_uio interface is a good idea.
> What DPDK is doing with igb_uio (and indeed uio_pci_generic)
> is abusing the sysfs BAR access to provide unlimited
> access to hardware.
>
> MSI messages are memory writes so any generic device capable
> of MSI is capable of corrupting kernel memory.
> This means that a bug in userspace will lead to kernel memory corruption
> and crashes. This is something distributions can't support.
>
> uio_pci_generic is already abused like that, mostly
> because when I wrote it, I didn't add enough protections
> against using it with DMA capable devices,
> and we can't go back and break working userspace.
> But at least it does not bind to VFs which all of
> them are capable of DMA.
>
> The result of merging this driver will be userspace abusing the
> sysfs BAR access with VFs as well, and we do not want that.
>
>
> Just forwarding events is not enough to make a valid driver.
> What is missing is a way to access the device in a safe way.
>
> On a more positive note:
>
> What would be a reasonable interface? One that does the following
> in kernel:
>
> 1. initializes device rings (can be in pinned userspace memory,
>    but can not be writeable by userspace), brings up interface link
> 2. pins userspace memory (unless using e.g. hugetlbfs)
> 3. gets request, make sure it's valid and belongs to
>    the correct task, put it in the ring
> 4. in the reverse direction, notify userspace when buffers
>    are available in the ring
> 5. notify userspace about MSI (what this driver does)
>
> What userspace can be allowed to do:
>
> 	format requests (e.g. transmit, receive) in userspace
> 	read ring contents
>
> What userspace can't be allowed to do:
>
> 	access BAR
> 	write rings

Thinking about it some more, many devices
have separate rings for DMA: TX (device reads memory)
and RX (device writes memory).
With such devices, a mode where userspace can write TX ring
but not RX ring might make sense.

This will mean userspace might read kernel memory
through the device, but can not corrupt it.

That's already a big win!

And RX buffers do not have to be added one at a time.
If we assume 0.2usec per system call, batching some 100 buffers per
system call gives you 2 nano seconds overhead. That seems quite reasonable.

>
> This means that the driver can not be a generic one,
> and there will be a system call overhead when you
> write the ring, but that's the price you have to
> pay for ability to run on systems without an IOMMU.
>
>
> >
> > ---
> >  drivers/uio/Kconfig          |   9 ++
> >  drivers/uio/Makefile         |   1 +
> >  drivers/uio/uio_msi.c        | 378 +++++++++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/Kbuild    |   1 +
> >  include/uapi/linux/uio_msi.h |  22 +++
> >  5 files changed, 411 insertions(+)
> >  create mode 100644 drivers/uio/uio_msi.c
> >  create mode 100644 include/uapi/linux/uio_msi.h
> >
> > diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig
> > index 52c98ce..04adfa0 100644
> > --- a/drivers/uio/Kconfig
> > +++ b/drivers/uio/Kconfig
> > @@ -93,6 +93,15 @@ config UIO_PCI_GENERIC
> >  	  primarily, for virtualization scenarios.
On Thu, 1 Oct 2015 11:33:06 +0300 "Michael S. Tsirkin" <mst@redhat.com> wrote: > On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote: > > This driver allows using PCI device with Message Signalled Interrupt > > from userspace. The API is similar to the igb_uio driver used by the DPDK. > > Via ioctl it provides a mechanism to map MSI-X interrupts into event > > file descriptors similar to VFIO. > > > > VFIO is a better choice if IOMMU is available, but often userspace drivers > > have to work in environments where IOMMU support (real or emulated) is > > not available. All UIO drivers that support DMA are not secure against > > rogue userspace applications programming DMA hardware to access > > private memory; this driver is no less secure than existing code. > > > > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> > > I don't think copying the igb_uio interface is a good idea. > What DPDK is doing with igb_uio (and indeed uio_pci_generic) > is abusing the sysfs BAR access to provide unlimited > access to hardware. > > MSI messages are memory writes so any generic device capable > of MSI is capable of corrupting kernel memory. > This means that a bug in userspace will lead to kernel memory corruption > and crashes. This is something distributions can't support. > > uio_pci_generic is already abused like that, mostly > because when I wrote it, I didn't add enough protections > against using it with DMA capable devices, > and we can't go back and break working userspace. > But at least it does not bind to VFs which all of > them are capable of DMA. > > The result of merging this driver will be userspace abusing the > sysfs BAR access with VFs as well, and we do not want that. > > > Just forwarding events is not enough to make a valid driver. > What is missing is a way to access the device in a safe way. > > On a more positive note: > > What would be a reasonable interface? One that does the following > in kernel: > > 1. 
initializes device rings (can be in pinned userspace memory, > but can not be writeable by userspace), brings up interface link > 2. pins userspace memory (unless using e.g. hugetlbfs) > 3. gets request, make sure it's valid and belongs to > the correct task, put it in the ring > 4. in the reverse direction, notify userspace when buffers > are available in the ring > 5. notify userspace about MSI (what this driver does) > > What userspace can be allowed to do: > > format requests (e.g. transmit, receive) in userspace > read ring contents > > What userspace can't be allowed to do: > > access BAR > write rings > > > This means that the driver can not be a generic one, > and there will be a system call overhead when you > write the ring, but that's the price you have to > pay for ability to run on systems without an IOMMU.

I think I understand what you are proposing, but it really doesn't fit into the high speed userspace networking model.

1. Device rings are device specific, can't be in a generic driver.
2. DPDK uses huge memory.
3. Performance requires all ring requests be done in pure userspace (i.e. no syscalls).
4. Ditto, can't have kernel to userspace notification per packet.
On Thu, Oct 01, 2015 at 07:50:37AM -0700, Stephen Hemminger wrote: > On Thu, 1 Oct 2015 11:33:06 +0300 > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote: > > > This driver allows using PCI device with Message Signalled Interrupt > > > from userspace. The API is similar to the igb_uio driver used by the DPDK. > > > Via ioctl it provides a mechanism to map MSI-X interrupts into event > > > file descriptors similar to VFIO. > > > > > > VFIO is a better choice if IOMMU is available, but often userspace drivers > > > have to work in environments where IOMMU support (real or emulated) is > > > not available. All UIO drivers that support DMA are not secure against > > > rogue userspace applications programming DMA hardware to access > > > private memory; this driver is no less secure than existing code. > > > > > > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> > > > > I don't think copying the igb_uio interface is a good idea. > > What DPDK is doing with igb_uio (and indeed uio_pci_generic) > > is abusing the sysfs BAR access to provide unlimited > > access to hardware. > > > > MSI messages are memory writes so any generic device capable > > of MSI is capable of corrupting kernel memory. > > This means that a bug in userspace will lead to kernel memory corruption > > and crashes. This is something distributions can't support. > > > > uio_pci_generic is already abused like that, mostly > > because when I wrote it, I didn't add enough protections > > against using it with DMA capable devices, > > and we can't go back and break working userspace. > > But at least it does not bind to VFs which all of > > them are capable of DMA. > > > > The result of merging this driver will be userspace abusing the > > sysfs BAR access with VFs as well, and we do not want that. > > > > > > Just forwarding events is not enough to make a valid driver. 
> > What is missing is a way to access the device in a safe way. > > > > On a more positive note: > > > > What would be a reasonable interface? One that does the following > > in kernel: > > > > 1. initializes device rings (can be in pinned userspace memory, > > but can not be writeable by userspace), brings up interface link > > 2. pins userspace memory (unless using e.g. hugetlbfs) > > 3. gets request, make sure it's valid and belongs to > > the correct task, put it in the ring > > 4. in the reverse direction, notify userspace when buffers > > are available in the ring > > 5. notify userspace about MSI (what this driver does) > > > > What userspace can be allowed to do: > > > > format requests (e.g. transmit, receive) in userspace > > read ring contents > > > > What userspace can't be allowed to do: > > > > access BAR > > write rings > > > > > > This means that the driver can not be a generic one, > > and there will be a system call overhead when you > > write the ring, but that's the price you have to > > pay for ability to run on systems without an IOMMU. > > I think I understand what you are proposing, but it really doesn't > fit into the high speed userspace networking model.

I'm aware that the current model does everything, including bringing up the link, in userspace. But there's really no justification for this. Only data path things should be in userspace. A userspace bug should not be able to do things like overwriting the on-device EEPROM.

> 1. Device rings are device specific, can't be in a generic driver.

So that's more work, and it is not going to happen if people can get by with insecure hacks.

> 2. DPDK uses huge memory.

Hugetlbfs? Don't see why this is an issue. Might make things simpler.

> 3. Performance requires all ring requests be done in pure userspace, > (ie no syscalls)

Make only the TX ring writeable then. At least you won't be able to corrupt the kernel memory.

> 4. 
Ditto, can't have kernel to userspace notification per packet

RX ring can be read-only, so userspace can read it directly.
On Thu, Oct 01, 2015 at 01:37:12PM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote: > > On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote: > > > This driver allows using PCI device with Message Signalled Interrupt > > > from userspace. The API is similar to the igb_uio driver used by the DPDK. > > > Via ioctl it provides a mechanism to map MSI-X interrupts into event > > > file descriptors similar to VFIO. > > > > > > VFIO is a better choice if IOMMU is available, but often userspace drivers > > > have to work in environments where IOMMU support (real or emulated) is > > > not available. All UIO drivers that support DMA are not secure against > > > rogue userspace applications programming DMA hardware to access > > > private memory; this driver is no less secure than existing code. > > > > > > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> > > > > I don't think copying the igb_uio interface is a good idea. > > What DPDK is doing with igb_uio (and indeed uio_pci_generic) > > is abusing the sysfs BAR access to provide unlimited > > access to hardware. > > > > MSI messages are memory writes so any generic device capable > > of MSI is capable of corrupting kernel memory. > > This means that a bug in userspace will lead to kernel memory corruption > > and crashes. This is something distributions can't support. > > > > uio_pci_generic is already abused like that, mostly > > because when I wrote it, I didn't add enough protections > > against using it with DMA capable devices, > > and we can't go back and break working userspace. > > But at least it does not bind to VFs which all of > > them are capable of DMA. > > > > The result of merging this driver will be userspace abusing the > > sysfs BAR access with VFs as well, and we do not want that. > > > > > > Just forwarding events is not enough to make a valid driver. > > What is missing is a way to access the device in a safe way. 
> > > > On a more positive note: > > > > What would be a reasonable interface? One that does the following > > in kernel: > > > > 1. initializes device rings (can be in pinned userspace memory, > > but can not be writeable by userspace), brings up interface link > > 2. pins userspace memory (unless using e.g. hugetlbfs) > > 3. gets request, make sure it's valid and belongs to > > the correct task, put it in the ring > > 4. in the reverse direction, notify userspace when buffers > > are available in the ring > > 5. notify userspace about MSI (what this driver does) > > > > What userspace can be allowed to do: > > > > format requests (e.g. transmit, receive) in userspace > > read ring contents > > > > What userspace can't be allowed to do: > > > > access BAR > > write rings >
> Thinking about it some more, many devices > have separate rings for DMA: TX (device reads memory) > and RX (device writes memory). > With such devices, a mode where userspace can write TX ring > but not RX ring might make sense. > > This will mean userspace might read kernel memory > through the device, but can not corrupt it. > > That's already a big win! > > And RX buffers do not have to be added one at a time. > If we assume 0.2usec per system call, batching some 100 buffers per > system call gives you 2 nanoseconds of overhead per buffer. That seems quite > reasonable. >

To add to that, there's no reason to do this on the same core. Re-arming descriptors can happen in parallel with packet processing, so this overhead won't affect PPS or latency at all: only the CPU utilization.

> > > > > > > This means that the driver can not be a generic one, > > and there will be a system call overhead when you > > write the ring, but that's the price you have to > > pay for ability to run on systems without an IOMMU.
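The batching estimate in the message above (0.2 usec per system call, about 100 RX buffers posted per call) is simple amortization. A minimal sketch of the arithmetic follows; the function name is hypothetical and the figures are the ones assumed in the message, not measurements.

```c
#include <assert.h>

/*
 * Amortized system call cost per buffer, in nanoseconds, when `batch`
 * buffers are handed to the kernel in a single call.
 * Hypothetical helper for illustration only; the inputs are estimates.
 */
static double amortized_ns_per_buffer(double syscall_ns, unsigned int batch)
{
	assert(batch > 0);	/* a batch must contain at least one buffer */
	return syscall_ns / (double)batch;
}
```

With the numbers from the message, amortized_ns_per_buffer(200.0, 100) gives 2 ns per buffer, while posting one buffer per call would cost the full 200 ns each time.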
> > > > > > > > > > > --- > > > drivers/uio/Kconfig | 9 ++ > > > drivers/uio/Makefile | 1 + > > > drivers/uio/uio_msi.c | 378 +++++++++++++++++++++++++++++++++++++++++++ > > > include/uapi/linux/Kbuild | 1 + > > > include/uapi/linux/uio_msi.h | 22 +++ > > > 5 files changed, 411 insertions(+) > > > create mode 100644 drivers/uio/uio_msi.c > > > create mode 100644 include/uapi/linux/uio_msi.h > > > > > > diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig > > > index 52c98ce..04adfa0 100644 > > > --- a/drivers/uio/Kconfig > > > +++ b/drivers/uio/Kconfig > > > @@ -93,6 +93,15 @@ config UIO_PCI_GENERIC > > > primarily, for virtualization scenarios. > > > If you compile this as a module, it will be called uio_pci_generic. > > > > > > +config UIO_PCI_MSI > > > + tristate "Generic driver supporting MSI-x on PCI Express cards" > > > + depends on PCI > > > + help > > > + Generic driver that provides Message Signalled IRQ events > > > + similar to VFIO. If IOMMMU is available please use VFIO > > > + instead since it provides more security. > > > + If you compile this as a module, it will be called uio_msi. > > > + > > > config UIO_NETX > > > tristate "Hilscher NetX Card driver" > > > depends on PCI > > > diff --git a/drivers/uio/Makefile b/drivers/uio/Makefile > > > index 8560dad..62fc44b 100644 > > > --- a/drivers/uio/Makefile > > > +++ b/drivers/uio/Makefile > > > @@ -9,3 +9,4 @@ obj-$(CONFIG_UIO_NETX) += uio_netx.o > > > obj-$(CONFIG_UIO_PRUSS) += uio_pruss.o > > > obj-$(CONFIG_UIO_MF624) += uio_mf624.o > > > obj-$(CONFIG_UIO_FSL_ELBC_GPCM) += uio_fsl_elbc_gpcm.o > > > +obj-$(CONFIG_UIO_PCI_MSI) += uio_msi.o > > > diff --git a/drivers/uio/uio_msi.c b/drivers/uio/uio_msi.c > > > new file mode 100644 > > > index 0000000..802b5c4 > > > --- /dev/null > > > +++ b/drivers/uio/uio_msi.c > > > @@ -0,0 +1,378 @@ > > > +/*- > > > + * > > > + * Copyright (c) 2015 by Brocade Communications Systems, Inc. 
> > > + * Author: Stephen Hemminger <stephen@networkplumber.org> > > > + * > > > + * This work is licensed under the terms of the GNU GPL, version 2 only. > > > + */ > > > + > > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt > > > + > > > +#include <linux/device.h> > > > +#include <linux/interrupt.h> > > > +#include <linux/eventfd.h> > > > +#include <linux/module.h> > > > +#include <linux/pci.h> > > > +#include <linux/uio_driver.h> > > > +#include <linux/msi.h> > > > +#include <linux/uio_msi.h> > > > + > > > +#define DRIVER_VERSION "0.1.1" > > > +#define MAX_MSIX_VECTORS 64 > > > + > > > +/* MSI-X vector information */ > > > +struct uio_msi_pci_dev { > > > + struct uio_info info; /* UIO driver info */ > > > + struct pci_dev *pdev; /* PCI device */ > > > + struct mutex mutex; /* open/release/ioctl mutex */ > > > + int ref_cnt; /* references to device */ > > > + unsigned int max_vectors; /* MSI-X slots available */ > > > + struct msix_entry *msix; /* MSI-X vector table */ > > > + struct uio_msi_irq_ctx { > > > + struct eventfd_ctx *trigger; /* vector to eventfd */ > > > + char *name; /* name in /proc/interrupts */ > > > + } *ctx; > > > +}; > > > + > > > +static irqreturn_t uio_intx_irqhandler(int irq, void *arg) > > > +{ > > > + struct uio_msi_pci_dev *udev = arg; > > > + > > > + if (pci_check_and_mask_intx(udev->pdev)) { > > > + eventfd_signal(udev->ctx->trigger, 1); > > > + return IRQ_HANDLED; > > > + } > > > + > > > + return IRQ_NONE; > > > +} > > > + > > > +static irqreturn_t uio_msi_irqhandler(int irq, void *arg) > > > +{ > > > + struct eventfd_ctx *trigger = arg; > > > + > > > + eventfd_signal(trigger, 1); > > > + return IRQ_HANDLED; > > > +} > > > + > > > +/* set the mapping between vector # and existing eventfd. 
*/ > > > +static int set_irq_eventfd(struct uio_msi_pci_dev *udev, u32 vec, int fd) > > > +{ > > > + struct eventfd_ctx *trigger; > > > + int irq, err; > > > + > > > + if (vec >= udev->max_vectors) { > > > + dev_notice(&udev->pdev->dev, "vec %u >= num_vec %u\n", > > > + vec, udev->max_vectors); > > > + return -ERANGE; > > > + } > > > + > > > + irq = udev->msix[vec].vector; > > > + trigger = udev->ctx[vec].trigger; > > > + if (trigger) { > > > + /* Clearup existing irq mapping */ > > > + free_irq(irq, trigger); > > > + eventfd_ctx_put(trigger); > > > + udev->ctx[vec].trigger = NULL; > > > + } > > > + > > > + /* Passing -1 is used to disable interrupt */ > > > + if (fd < 0) > > > + return 0; > > > + > > > + trigger = eventfd_ctx_fdget(fd); > > > + if (IS_ERR(trigger)) { > > > + err = PTR_ERR(trigger); > > > + dev_notice(&udev->pdev->dev, > > > + "eventfd ctx get failed: %d\n", err); > > > + return err; > > > + } > > > + > > > + if (udev->msix) > > > + err = request_irq(irq, uio_msi_irqhandler, 0, > > > + udev->ctx[vec].name, trigger); > > > + else > > > + err = request_irq(irq, uio_intx_irqhandler, IRQF_SHARED, > > > + udev->ctx[vec].name, udev); > > > + > > > + if (err) { > > > + dev_notice(&udev->pdev->dev, > > > + "request irq failed: %d\n", err); > > > + eventfd_ctx_put(trigger); > > > + return err; > > > + } > > > + > > > + udev->ctx[vec].trigger = trigger; > > > + return 0; > > > +} > > > + > > > +static int > > > +uio_msi_ioctl(struct uio_info *info, unsigned int cmd, unsigned long arg) > > > +{ > > > + struct uio_msi_pci_dev *udev > > > + = container_of(info, struct uio_msi_pci_dev, info); > > > + struct uio_msi_irq_set hdr; > > > + int err; > > > + > > > + switch (cmd) { > > > + case UIO_MSI_IRQ_SET: > > > + if (copy_from_user(&hdr, (void __user *)arg, sizeof(hdr))) > > > + return -EFAULT; > > > + > > > + mutex_lock(&udev->mutex); > > > + err = set_irq_eventfd(udev, hdr.vec, hdr.fd); > > > + mutex_unlock(&udev->mutex); > > > + break; > > > + default: > > > + 
err = -EOPNOTSUPP; > > > + } > > > + return err; > > > +} > > > + > > > +/* Opening the UIO device for first time enables MSI-X */ > > > +static int > > > +uio_msi_open(struct uio_info *info, struct inode *inode) > > > +{ > > > + struct uio_msi_pci_dev *udev > > > + = container_of(info, struct uio_msi_pci_dev, info); > > > + int err = 0; > > > + > > > + mutex_lock(&udev->mutex); > > > + if (udev->ref_cnt++ == 0) { > > > + if (udev->msix) > > > + err = pci_enable_msix(udev->pdev, udev->msix, > > > + udev->max_vectors); > > > + } > > > + mutex_unlock(&udev->mutex); > > > + > > > + return err; > > > +} > > > + > > > +/* Last close of the UIO device releases/disables all IRQ's */ > > > +static int > > > +uio_msi_release(struct uio_info *info, struct inode *inode) > > > +{ > > > + struct uio_msi_pci_dev *udev > > > + = container_of(info, struct uio_msi_pci_dev, info); > > > + int i; > > > + > > > + mutex_lock(&udev->mutex); > > > + if (--udev->ref_cnt == 0) { > > > + for (i = 0; i < udev->max_vectors; i++) { > > > + int irq = udev->msix[i].vector; > > > + struct eventfd_ctx *trigger = udev->ctx[i].trigger; > > > + > > > + if (!trigger) > > > + continue; > > > + > > > + free_irq(irq, trigger); > > > + eventfd_ctx_put(trigger); > > > + udev->ctx[i].trigger = NULL; > > > + } > > > + > > > + if (udev->msix) > > > + pci_disable_msix(udev->pdev); > > > + } > > > + mutex_unlock(&udev->mutex); > > > + > > > + return 0; > > > +} > > > + > > > +/* Unmap previously ioremap'd resources */ > > > +static void > > > +release_iomaps(struct uio_mem *mem) > > > +{ > > > + int i; > > > + > > > + for (i = 0; i < MAX_UIO_MAPS; i++, mem++) { > > > + if (mem->internal_addr) > > > + iounmap(mem->internal_addr); > > > + } > > > +} > > > + > > > +static int > > > +setup_maps(struct pci_dev *pdev, struct uio_info *info) > > > +{ > > > + int i, m = 0, p = 0, err; > > > + static const char * const bar_names[] = { > > > + "BAR0", "BAR1", "BAR2", "BAR3", "BAR4", "BAR5", > > > + }; > > > + > > > + for 
(i = 0; i < ARRAY_SIZE(bar_names); i++) { > > > + unsigned long start = pci_resource_start(pdev, i); > > > + unsigned long flags = pci_resource_flags(pdev, i); > > > + unsigned long len = pci_resource_len(pdev, i); > > > + > > > + if (start == 0 || len == 0) > > > + continue; > > > + > > > + if (flags & IORESOURCE_MEM) { > > > + void __iomem *addr; > > > + > > > + if (m >= MAX_UIO_MAPS) > > > + continue; > > > + > > > + addr = ioremap(start, len); > > > + if (addr == NULL) { > > > + err = -EINVAL; > > > + goto fail; > > > + } > > > + > > > + info->mem[m].name = bar_names[i]; > > > + info->mem[m].addr = start; > > > + info->mem[m].internal_addr = addr; > > > + info->mem[m].size = len; > > > + info->mem[m].memtype = UIO_MEM_PHYS; > > > + ++m; > > > + } else if (flags & IORESOURCE_IO) { > > > + if (p >= MAX_UIO_PORT_REGIONS) > > > + continue; > > > + > > > + info->port[p].name = bar_names[i]; > > > + info->port[p].start = start; > > > + info->port[p].size = len; > > > + info->port[p].porttype = UIO_PORT_X86; > > > + ++p; > > > + } > > > + } > > > + > > > + return 0; > > > + fail: > > > + for (i = 0; i < m; i++) > > > + iounmap(info->mem[i].internal_addr); > > > + return err; > > > +} > > > + > > > +static int uio_msi_probe(struct pci_dev *pdev, const struct pci_device_id *id) > > > +{ > > > + struct uio_msi_pci_dev *udev; > > > + int i, err, vectors; > > > + > > > + udev = kzalloc(sizeof(struct uio_msi_pci_dev), GFP_KERNEL); > > > + if (!udev) > > > + return -ENOMEM; > > > + > > > + err = pci_enable_device(pdev); > > > + if (err != 0) { > > > + dev_err(&pdev->dev, "cannot enable PCI device\n"); > > > + goto fail_free; > > > + } > > > + > > > + err = pci_request_regions(pdev, "uio_msi"); > > > + if (err != 0) { > > > + dev_err(&pdev->dev, "Cannot request regions\n"); > > > + goto fail_disable; > > > + } > > > + > > > + pci_set_master(pdev); > > > + > > > + /* remap resources */ > > > + err = setup_maps(pdev, &udev->info); > > > + if (err) > > > + goto 
fail_release_iomem; > > > + > > > + /* fill uio infos */ > > > + udev->info.name = "uio_msi"; > > > + udev->info.version = DRIVER_VERSION; > > > + udev->info.priv = udev; > > > + udev->pdev = pdev; > > > + udev->info.ioctl = uio_msi_ioctl; > > > + udev->info.open = uio_msi_open; > > > + udev->info.release = uio_msi_release; > > > + udev->info.irq = UIO_IRQ_CUSTOM; > > > + mutex_init(&udev->mutex); > > > + > > > + vectors = pci_msix_vec_count(pdev); > > > + if (vectors > 0) { > > > + udev->max_vectors = min_t(u16, vectors, MAX_MSIX_VECTORS); > > > + dev_info(&pdev->dev, "using up to %u MSI-X vectors\n", > > > + udev->max_vectors); > > > + > > > + err = -ENOMEM; > > > + udev->msix = kcalloc(udev->max_vectors, > > > + sizeof(struct msix_entry), GFP_KERNEL); > > > + if (!udev->msix) > > > + goto fail_release_iomem; > > > + } else if (!pci_intx_mask_supported(pdev)) { > > > + dev_err(&pdev->dev, > > > + "device does not support MSI-X or INTX\n"); > > > + err = -EINVAL; > > > + goto fail_release_iomem; > > > + } else { > > > + dev_notice(&pdev->dev, "using INTX\n"); > > > + udev->info.irq_flags = IRQF_SHARED; > > > + udev->max_vectors = 1; > > > + } > > > + > > > + udev->ctx = kcalloc(udev->max_vectors, > > > + sizeof(struct uio_msi_irq_ctx), GFP_KERNEL); > > > + if (!udev->ctx) > > > + goto fail_free_msix; > > > + > > > + for (i = 0; i < udev->max_vectors; i++) { > > > + udev->msix[i].entry = i; > > > + > > > + udev->ctx[i].name = kasprintf(GFP_KERNEL, > > > + KBUILD_MODNAME "[%d](%s)", > > > + i, pci_name(pdev)); > > > + if (!udev->ctx[i].name) > > > + goto fail_free_ctx; > > > + } > > > + > > > + /* register uio driver */ > > > + err = uio_register_device(&pdev->dev, &udev->info); > > > + if (err != 0) > > > + goto fail_free_ctx; > > > + > > > + pci_set_drvdata(pdev, udev); > > > + return 0; > > > + > > > +fail_free_ctx: > > > + for (i = 0; i < udev->max_vectors; i++) > > > + kfree(udev->ctx[i].name); > > > + kfree(udev->ctx); > > > +fail_free_msix: > > > + 
kfree(udev->msix); > > > +fail_release_iomem: > > > + release_iomaps(udev->info.mem); > > > + pci_release_regions(pdev); > > > +fail_disable: > > > + pci_disable_device(pdev); > > > +fail_free: > > > + kfree(udev); > > > + > > > + pr_notice("%s ret %d\n", __func__, err); > > > + return err; > > > +} > > > + > > > +static void uio_msi_remove(struct pci_dev *pdev) > > > +{ > > > + struct uio_info *info = pci_get_drvdata(pdev); > > > + struct uio_msi_pci_dev *udev > > > + = container_of(info, struct uio_msi_pci_dev, info); > > > + int i; > > > + > > > + uio_unregister_device(info); > > > + release_iomaps(info->mem); > > > + > > > + pci_release_regions(pdev); > > > + for (i = 0; i < udev->max_vectors; i++) > > > + kfree(udev->ctx[i].name); > > > + kfree(udev->ctx); > > > + kfree(udev->msix); > > > + pci_disable_device(pdev); > > > + > > > + pci_set_drvdata(pdev, NULL); > > > + kfree(udev); > > > +} > > > + > > > +static struct pci_driver uio_msi_pci_driver = { > > > + .name = "uio_msi", > > > + .probe = uio_msi_probe, > > > + .remove = uio_msi_remove, > > > +}; > > > + > > > +module_pci_driver(uio_msi_pci_driver); > > > +MODULE_VERSION(DRIVER_VERSION); > > > +MODULE_LICENSE("GPL v2"); > > > +MODULE_AUTHOR("Stephen Hemminger <stephen@networkplumber.org>"); > > > +MODULE_DESCRIPTION("UIO driver for MSI PCI devices"); > > > diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild > > > index f7b2db4..d9497691 100644 > > > --- a/include/uapi/linux/Kbuild > > > +++ b/include/uapi/linux/Kbuild > > > @@ -411,6 +411,7 @@ header-y += udp.h > > > header-y += uhid.h > > > header-y += uinput.h > > > header-y += uio.h > > > +header-y += uio_msi.h > > > header-y += ultrasound.h > > > header-y += un.h > > > header-y += unistd.h > > > diff --git a/include/uapi/linux/uio_msi.h b/include/uapi/linux/uio_msi.h > > > new file mode 100644 > > > index 0000000..297de00 > > > --- /dev/null > > > +++ b/include/uapi/linux/uio_msi.h > > > @@ -0,0 +1,22 @@ > > > +/* > > > + * UIO_MSI API 
definition > > > + * > > > + * Copyright (c) 2015 by Brocade Communications Systems, Inc. > > > + * All rights reserved. > > > + * > > > + * This program is free software; you can redistribute it and/or modify > > > + * it under the terms of the GNU General Public License version 2 as > > > + * published by the Free Software Foundation. > > > + */ > > > +#ifndef _UIO_PCI_MSI_H > > > +#define _UIO_PCI_MSI_H > > > + > > > +struct uio_msi_irq_set { > > > + u32 vec; > > > + int fd; > > > +}; > > > + > > > +#define UIO_MSI_BASE 0x86 > > > +#define UIO_MSI_IRQ_SET _IOW('I', UIO_MSI_BASE+1, struct uio_msi_irq_set) > > > + > > > +#endif > > > -- > > > 2.1.4 > > > > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > Please read the FAQ at http://www.tux.org/lkml/
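For readers following the uAPI in the quoted patch, here is a minimal userspace sketch of how the proposed UIO_MSI_IRQ_SET ioctl might be used to attach an eventfd to one vector. Assumptions: the struct and ioctl number are copied locally from the patch's uio_msi.h (with u32 spelled uint32_t), since that header comes from this proposal and may not be installed; make_irq_set() and bind_vector() are hypothetical helper names; the caller has already opened the UIO device node.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Local copy of the uAPI proposed in the patch (include/uapi/linux/uio_msi.h). */
struct uio_msi_irq_set {
	uint32_t vec;	/* vector index, checked by the driver against max_vectors */
	int fd;		/* eventfd to signal; a negative value removes the mapping */
};

#define UIO_MSI_BASE	0x86
#define UIO_MSI_IRQ_SET	_IOW('I', UIO_MSI_BASE + 1, struct uio_msi_irq_set)

/* Hypothetical helper: build a request; fd == -1 asks the driver to unmap. */
static struct uio_msi_irq_set make_irq_set(uint32_t vec, int fd)
{
	struct uio_msi_irq_set req = { .vec = vec, .fd = fd };

	assert(fd >= -1);
	return req;
}

/*
 * Hypothetical helper: create an eventfd and ask the driver to signal it
 * for vector `vec` of the already opened UIO device `uio_fd`.
 * Returns the eventfd, or -1 with errno set.
 */
static int bind_vector(int uio_fd, uint32_t vec)
{
	struct uio_msi_irq_set req;
	int efd = eventfd(0, EFD_CLOEXEC);

	if (efd < 0)
		return -1;

	req = make_irq_set(vec, efd);
	if (ioctl(uio_fd, UIO_MSI_IRQ_SET, &req) < 0) {
		int saved = errno;	/* keep the ioctl error across close() */

		close(efd);
		errno = saved;
		return -1;
	}
	return efd;
}
```

After a successful bind_vector(), a blocking read() of 8 bytes from the returned eventfd yields the number of signals since the last read, and sending the same request with fd set to -1 takes the path the patch comments as "Passing -1 is used to disable interrupt".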
On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote: > On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote: > > This driver allows using PCI device with Message Signalled Interrupt > > from userspace. The API is similar to the igb_uio driver used by the DPDK. > > Via ioctl it provides a mechanism to map MSI-X interrupts into event > > file descriptors similar to VFIO. > > > > VFIO is a better choice if IOMMU is available, but often userspace drivers > > have to work in environments where IOMMU support (real or emulated) is > > not available. All UIO drivers that support DMA are not secure against > > rogue userspace applications programming DMA hardware to access > > private memory; this driver is no less secure than existing code. > > > > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> > > I don't think copying the igb_uio interface is a good idea. > What DPDK is doing with igb_uio (and indeed uio_pci_generic) > is abusing the sysfs BAR access to provide unlimited > access to hardware. > > MSI messages are memory writes so any generic device capable > of MSI is capable of corrupting kernel memory. > This means that a bug in userspace will lead to kernel memory corruption > and crashes. This is something distributions can't support. > > uio_pci_generic is already abused like that, mostly > because when I wrote it, I didn't add enough protections > against using it with DMA capable devices, > and we can't go back and break working userspace. > But at least it does not bind to VFs which all of > them are capable of DMA. > > The result of merging this driver will be userspace abusing the > sysfs BAR access with VFs as well, and we do not want that. > > > Just forwarding events is not enough to make a valid driver. > What is missing is a way to access the device in a safe way. > > On a more positive note: > > What would be a reasonable interface? One that does the following > in kernel: > > 1. 
initializes device rings (can be in pinned userspace memory, > but can not be writeable by userspace), brings up interface link > 2. pins userspace memory (unless using e.g. hugetlbfs) > 3. gets request, make sure it's valid and belongs to > the correct task, put it in the ring > 4. in the reverse direction, notify userspace when buffers > are available in the ring > 5. notify userspace about MSI (what this driver does) > > What userspace can be allowed to do: > > format requests (e.g. transmit, receive) in userspace > read ring contents > > What userspace can't be allowed to do: > > access BAR > write rings > > > This means that the driver can not be a generic one, > and there will be a system call overhead when you > write the ring, but that's the price you have to > pay for ability to run on systems without an IOMMU. >

The device specific parts can be taken from John Fastabend's patches BTW:
https://patchwork.ozlabs.org/patch/396713/

IIUC what was missing there was exactly the memory protection we are looking for here.

> > > > --- > > drivers/uio/Kconfig | 9 ++ > > drivers/uio/Makefile | 1 + > > drivers/uio/uio_msi.c | 378 +++++++++++++++++++++++++++++++++++++++++++ > > include/uapi/linux/Kbuild | 1 + > > include/uapi/linux/uio_msi.h | 22 +++ > > 5 files changed, 411 insertions(+) > > create mode 100644 drivers/uio/uio_msi.c > > create mode 100644 include/uapi/linux/uio_msi.h > > > > diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig > > index 52c98ce..04adfa0 100644 > > --- a/drivers/uio/Kconfig > > +++ b/drivers/uio/Kconfig > > @@ -93,6 +93,15 @@ config UIO_PCI_GENERIC > > primarily, for virtualization scenarios. > > If you compile this as a module, it will be called uio_pci_generic. > > > > +config UIO_PCI_MSI > > + tristate "Generic driver supporting MSI-x on PCI Express cards" > > + depends on PCI > > + help > > + Generic driver that provides Message Signalled IRQ events > > + similar to VFIO. 
If IOMMMU is available please use VFIO > > + instead since it provides more security. > > + If you compile this as a module, it will be called uio_msi. > > + > > config UIO_NETX > > tristate "Hilscher NetX Card driver" > > depends on PCI > > diff --git a/drivers/uio/Makefile b/drivers/uio/Makefile > > index 8560dad..62fc44b 100644 > > --- a/drivers/uio/Makefile > > +++ b/drivers/uio/Makefile > > @@ -9,3 +9,4 @@ obj-$(CONFIG_UIO_NETX) += uio_netx.o > > obj-$(CONFIG_UIO_PRUSS) += uio_pruss.o > > obj-$(CONFIG_UIO_MF624) += uio_mf624.o > > obj-$(CONFIG_UIO_FSL_ELBC_GPCM) += uio_fsl_elbc_gpcm.o > > +obj-$(CONFIG_UIO_PCI_MSI) += uio_msi.o > > diff --git a/drivers/uio/uio_msi.c b/drivers/uio/uio_msi.c > > new file mode 100644 > > index 0000000..802b5c4 > > --- /dev/null > > +++ b/drivers/uio/uio_msi.c > > @@ -0,0 +1,378 @@ > > +/*- > > + * > > + * Copyright (c) 2015 by Brocade Communications Systems, Inc. > > + * Author: Stephen Hemminger <stephen@networkplumber.org> > > + * > > + * This work is licensed under the terms of the GNU GPL, version 2 only. 
> > + */ > > + > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt > > + > > +#include <linux/device.h> > > +#include <linux/interrupt.h> > > +#include <linux/eventfd.h> > > +#include <linux/module.h> > > +#include <linux/pci.h> > > +#include <linux/uio_driver.h> > > +#include <linux/msi.h> > > +#include <linux/uio_msi.h> > > + > > +#define DRIVER_VERSION "0.1.1" > > +#define MAX_MSIX_VECTORS 64 > > + > > +/* MSI-X vector information */ > > +struct uio_msi_pci_dev { > > + struct uio_info info; /* UIO driver info */ > > + struct pci_dev *pdev; /* PCI device */ > > + struct mutex mutex; /* open/release/ioctl mutex */ > > + int ref_cnt; /* references to device */ > > + unsigned int max_vectors; /* MSI-X slots available */ > > + struct msix_entry *msix; /* MSI-X vector table */ > > + struct uio_msi_irq_ctx { > > + struct eventfd_ctx *trigger; /* vector to eventfd */ > > + char *name; /* name in /proc/interrupts */ > > + } *ctx; > > +}; > > + > > +static irqreturn_t uio_intx_irqhandler(int irq, void *arg) > > +{ > > + struct uio_msi_pci_dev *udev = arg; > > + > > + if (pci_check_and_mask_intx(udev->pdev)) { > > + eventfd_signal(udev->ctx->trigger, 1); > > + return IRQ_HANDLED; > > + } > > + > > + return IRQ_NONE; > > +} > > + > > +static irqreturn_t uio_msi_irqhandler(int irq, void *arg) > > +{ > > + struct eventfd_ctx *trigger = arg; > > + > > + eventfd_signal(trigger, 1); > > + return IRQ_HANDLED; > > +} > > + > > +/* set the mapping between vector # and existing eventfd. 
*/ > > +static int set_irq_eventfd(struct uio_msi_pci_dev *udev, u32 vec, int fd) > > +{ > > + struct eventfd_ctx *trigger; > > + int irq, err; > > + > > + if (vec >= udev->max_vectors) { > > + dev_notice(&udev->pdev->dev, "vec %u >= num_vec %u\n", > > + vec, udev->max_vectors); > > + return -ERANGE; > > + } > > + > > + irq = udev->msix[vec].vector; > > + trigger = udev->ctx[vec].trigger; > > + if (trigger) { > > + /* Clearup existing irq mapping */ > > + free_irq(irq, trigger); > > + eventfd_ctx_put(trigger); > > + udev->ctx[vec].trigger = NULL; > > + } > > + > > + /* Passing -1 is used to disable interrupt */ > > + if (fd < 0) > > + return 0; > > + > > + trigger = eventfd_ctx_fdget(fd); > > + if (IS_ERR(trigger)) { > > + err = PTR_ERR(trigger); > > + dev_notice(&udev->pdev->dev, > > + "eventfd ctx get failed: %d\n", err); > > + return err; > > + } > > + > > + if (udev->msix) > > + err = request_irq(irq, uio_msi_irqhandler, 0, > > + udev->ctx[vec].name, trigger); > > + else > > + err = request_irq(irq, uio_intx_irqhandler, IRQF_SHARED, > > + udev->ctx[vec].name, udev); > > + > > + if (err) { > > + dev_notice(&udev->pdev->dev, > > + "request irq failed: %d\n", err); > > + eventfd_ctx_put(trigger); > > + return err; > > + } > > + > > + udev->ctx[vec].trigger = trigger; > > + return 0; > > +} > > + > > +static int > > +uio_msi_ioctl(struct uio_info *info, unsigned int cmd, unsigned long arg) > > +{ > > + struct uio_msi_pci_dev *udev > > + = container_of(info, struct uio_msi_pci_dev, info); > > + struct uio_msi_irq_set hdr; > > + int err; > > + > > + switch (cmd) { > > + case UIO_MSI_IRQ_SET: > > + if (copy_from_user(&hdr, (void __user *)arg, sizeof(hdr))) > > + return -EFAULT; > > + > > + mutex_lock(&udev->mutex); > > + err = set_irq_eventfd(udev, hdr.vec, hdr.fd); > > + mutex_unlock(&udev->mutex); > > + break; > > + default: > > + err = -EOPNOTSUPP; > > + } > > + return err; > > +} > > + > > +/* Opening the UIO device for first time enables MSI-X */ > > +static int 
> > +uio_msi_open(struct uio_info *info, struct inode *inode) > > +{ > > + struct uio_msi_pci_dev *udev > > + = container_of(info, struct uio_msi_pci_dev, info); > > + int err = 0; > > + > > + mutex_lock(&udev->mutex); > > + if (udev->ref_cnt++ == 0) { > > + if (udev->msix) > > + err = pci_enable_msix(udev->pdev, udev->msix, > > + udev->max_vectors); > > + } > > + mutex_unlock(&udev->mutex); > > + > > + return err; > > +} > > + > > +/* Last close of the UIO device releases/disables all IRQ's */ > > +static int > > +uio_msi_release(struct uio_info *info, struct inode *inode) > > +{ > > + struct uio_msi_pci_dev *udev > > + = container_of(info, struct uio_msi_pci_dev, info); > > + int i; > > + > > + mutex_lock(&udev->mutex); > > + if (--udev->ref_cnt == 0) { > > + for (i = 0; i < udev->max_vectors; i++) { > > + int irq = udev->msix[i].vector; > > + struct eventfd_ctx *trigger = udev->ctx[i].trigger; > > + > > + if (!trigger) > > + continue; > > + > > + free_irq(irq, trigger); > > + eventfd_ctx_put(trigger); > > + udev->ctx[i].trigger = NULL; > > + } > > + > > + if (udev->msix) > > + pci_disable_msix(udev->pdev); > > + } > > + mutex_unlock(&udev->mutex); > > + > > + return 0; > > +} > > + > > +/* Unmap previously ioremap'd resources */ > > +static void > > +release_iomaps(struct uio_mem *mem) > > +{ > > + int i; > > + > > + for (i = 0; i < MAX_UIO_MAPS; i++, mem++) { > > + if (mem->internal_addr) > > + iounmap(mem->internal_addr); > > + } > > +} > > + > > +static int > > +setup_maps(struct pci_dev *pdev, struct uio_info *info) > > +{ > > + int i, m = 0, p = 0, err; > > + static const char * const bar_names[] = { > > + "BAR0", "BAR1", "BAR2", "BAR3", "BAR4", "BAR5", > > + }; > > + > > + for (i = 0; i < ARRAY_SIZE(bar_names); i++) { > > + unsigned long start = pci_resource_start(pdev, i); > > + unsigned long flags = pci_resource_flags(pdev, i); > > + unsigned long len = pci_resource_len(pdev, i); > > + > > + if (start == 0 || len == 0) > > + continue; > > + > > + if 
(flags & IORESOURCE_MEM) { > > + void __iomem *addr; > > + > > + if (m >= MAX_UIO_MAPS) > > + continue; > > + > > + addr = ioremap(start, len); > > + if (addr == NULL) { > > + err = -EINVAL; > > + goto fail; > > + } > > + > > + info->mem[m].name = bar_names[i]; > > + info->mem[m].addr = start; > > + info->mem[m].internal_addr = addr; > > + info->mem[m].size = len; > > + info->mem[m].memtype = UIO_MEM_PHYS; > > + ++m; > > + } else if (flags & IORESOURCE_IO) { > > + if (p >= MAX_UIO_PORT_REGIONS) > > + continue; > > + > > + info->port[p].name = bar_names[i]; > > + info->port[p].start = start; > > + info->port[p].size = len; > > + info->port[p].porttype = UIO_PORT_X86; > > + ++p; > > + } > > + } > > + > > + return 0; > > + fail: > > + for (i = 0; i < m; i++) > > + iounmap(info->mem[i].internal_addr); > > + return err; > > +} > > + > > +static int uio_msi_probe(struct pci_dev *pdev, const struct pci_device_id *id) > > +{ > > + struct uio_msi_pci_dev *udev; > > + int i, err, vectors; > > + > > + udev = kzalloc(sizeof(struct uio_msi_pci_dev), GFP_KERNEL); > > + if (!udev) > > + return -ENOMEM; > > + > > + err = pci_enable_device(pdev); > > + if (err != 0) { > > + dev_err(&pdev->dev, "cannot enable PCI device\n"); > > + goto fail_free; > > + } > > + > > + err = pci_request_regions(pdev, "uio_msi"); > > + if (err != 0) { > > + dev_err(&pdev->dev, "Cannot request regions\n"); > > + goto fail_disable; > > + } > > + > > + pci_set_master(pdev); > > + > > + /* remap resources */ > > + err = setup_maps(pdev, &udev->info); > > + if (err) > > + goto fail_release_iomem; > > + > > + /* fill uio infos */ > > + udev->info.name = "uio_msi"; > > + udev->info.version = DRIVER_VERSION; > > + udev->info.priv = udev; > > + udev->pdev = pdev; > > + udev->info.ioctl = uio_msi_ioctl; > > + udev->info.open = uio_msi_open; > > + udev->info.release = uio_msi_release; > > + udev->info.irq = UIO_IRQ_CUSTOM; > > + mutex_init(&udev->mutex); > > + > > + vectors = pci_msix_vec_count(pdev); > > + if 
(vectors > 0) { > > + udev->max_vectors = min_t(u16, vectors, MAX_MSIX_VECTORS); > > + dev_info(&pdev->dev, "using up to %u MSI-X vectors\n", > > + udev->max_vectors); > > + > > + err = -ENOMEM; > > + udev->msix = kcalloc(udev->max_vectors, > > + sizeof(struct msix_entry), GFP_KERNEL); > > + if (!udev->msix) > > + goto fail_release_iomem; > > + } else if (!pci_intx_mask_supported(pdev)) { > > + dev_err(&pdev->dev, > > + "device does not support MSI-X or INTX\n"); > > + err = -EINVAL; > > + goto fail_release_iomem; > > + } else { > > + dev_notice(&pdev->dev, "using INTX\n"); > > + udev->info.irq_flags = IRQF_SHARED; > > + udev->max_vectors = 1; > > + } > > + > > + udev->ctx = kcalloc(udev->max_vectors, > > + sizeof(struct uio_msi_irq_ctx), GFP_KERNEL); > > + if (!udev->ctx) > > + goto fail_free_msix; > > + > > + for (i = 0; i < udev->max_vectors; i++) { > > + udev->msix[i].entry = i; > > + > > + udev->ctx[i].name = kasprintf(GFP_KERNEL, > > + KBUILD_MODNAME "[%d](%s)", > > + i, pci_name(pdev)); > > + if (!udev->ctx[i].name) > > + goto fail_free_ctx; > > + } > > + > > + /* register uio driver */ > > + err = uio_register_device(&pdev->dev, &udev->info); > > + if (err != 0) > > + goto fail_free_ctx; > > + > > + pci_set_drvdata(pdev, udev); > > + return 0; > > + > > +fail_free_ctx: > > + for (i = 0; i < udev->max_vectors; i++) > > + kfree(udev->ctx[i].name); > > + kfree(udev->ctx); > > +fail_free_msix: > > + kfree(udev->msix); > > +fail_release_iomem: > > + release_iomaps(udev->info.mem); > > + pci_release_regions(pdev); > > +fail_disable: > > + pci_disable_device(pdev); > > +fail_free: > > + kfree(udev); > > + > > + pr_notice("%s ret %d\n", __func__, err); > > + return err; > > +} > > + > > +static void uio_msi_remove(struct pci_dev *pdev) > > +{ > > + struct uio_info *info = pci_get_drvdata(pdev); > > + struct uio_msi_pci_dev *udev > > + = container_of(info, struct uio_msi_pci_dev, info); > > + int i; > > + > > + uio_unregister_device(info); > > + 
release_iomaps(info->mem); > > + > > + pci_release_regions(pdev); > > + for (i = 0; i < udev->max_vectors; i++) > > + kfree(udev->ctx[i].name); > > + kfree(udev->ctx); > > + kfree(udev->msix); > > + pci_disable_device(pdev); > > + > > + pci_set_drvdata(pdev, NULL); > > + kfree(udev); > > +} > > + > > +static struct pci_driver uio_msi_pci_driver = { > > + .name = "uio_msi", > > + .probe = uio_msi_probe, > > + .remove = uio_msi_remove, > > +}; > > + > > +module_pci_driver(uio_msi_pci_driver); > > +MODULE_VERSION(DRIVER_VERSION); > > +MODULE_LICENSE("GPL v2"); > > +MODULE_AUTHOR("Stephen Hemminger <stephen@networkplumber.org>"); > > +MODULE_DESCRIPTION("UIO driver for MSI PCI devices"); > > diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild > > index f7b2db4..d9497691 100644 > > --- a/include/uapi/linux/Kbuild > > +++ b/include/uapi/linux/Kbuild > > @@ -411,6 +411,7 @@ header-y += udp.h > > header-y += uhid.h > > header-y += uinput.h > > header-y += uio.h > > +header-y += uio_msi.h > > header-y += ultrasound.h > > header-y += un.h > > header-y += unistd.h > > diff --git a/include/uapi/linux/uio_msi.h b/include/uapi/linux/uio_msi.h > > new file mode 100644 > > index 0000000..297de00 > > --- /dev/null > > +++ b/include/uapi/linux/uio_msi.h > > @@ -0,0 +1,22 @@ > > +/* > > + * UIO_MSI API definition > > + * > > + * Copyright (c) 2015 by Brocade Communications Systems, Inc. > > + * All rights reserved. > > + * > > + * This program is free software; you can redistribute it and/or modify > > + * it under the terms of the GNU General Public License version 2 as > > + * published by the Free Software Foundation. 
> > + */ > > +#ifndef _UIO_PCI_MSI_H > > +#define _UIO_PCI_MSI_H > > + > > +struct uio_msi_irq_set { > > + u32 vec; > > + int fd; > > +}; > > + > > +#define UIO_MSI_BASE 0x86 > > +#define UIO_MSI_IRQ_SET _IOW('I', UIO_MSI_BASE+1, struct uio_msi_irq_set) > > + > > +#endif > > -- > > 2.1.4 > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > Please read the FAQ at http://www.tux.org/lkml/
On Thu, 1 Oct 2015 19:31:08 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote:
> > On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> > > This driver allows using PCI device with Message Signalled Interrupt
> > > from userspace. The API is similar to the igb_uio driver used by the DPDK.
> > > Via ioctl it provides a mechanism to map MSI-X interrupts into event
> > > file descriptors similar to VFIO.
> > >
> > > VFIO is a better choice if IOMMU is available, but often userspace drivers
> > > have to work in environments where IOMMU support (real or emulated) is
> > > not available. All UIO drivers that support DMA are not secure against
> > > rogue userspace applications programming DMA hardware to access
> > > private memory; this driver is no less secure than existing code.
> > >
> > > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> >
> > I don't think copying the igb_uio interface is a good idea.
> > What DPDK is doing with igb_uio (and indeed uio_pci_generic)
> > is abusing the sysfs BAR access to provide unlimited
> > access to hardware.
> >
> > MSI messages are memory writes so any generic device capable
> > of MSI is capable of corrupting kernel memory.
> > This means that a bug in userspace will lead to kernel memory corruption
> > and crashes. This is something distributions can't support.
> >
> > uio_pci_generic is already abused like that, mostly
> > because when I wrote it, I didn't add enough protections
> > against using it with DMA capable devices,
> > and we can't go back and break working userspace.
> > But at least it does not bind to VFs which all of
> > them are capable of DMA.
> >
> > The result of merging this driver will be userspace abusing the
> > sysfs BAR access with VFs as well, and we do not want that.
> >
> >
> > Just forwarding events is not enough to make a valid driver.
> > What is missing is a way to access the device in a safe way.
> >
> > On a more positive note:
> >
> > What would be a reasonable interface? One that does the following
> > in kernel:
> >
> > 1. initializes device rings (can be in pinned userspace memory,
> >    but can not be writeable by userspace), brings up interface link
> > 2. pins userspace memory (unless using e.g. hugetlbfs)
> > 3. gets request, make sure it's valid and belongs to
> >    the correct task, put it in the ring
> > 4. in the reverse direction, notify userspace when buffers
> >    are available in the ring
> > 5. notify userspace about MSI (what this driver does)
> >
> > What userspace can be allowed to do:
> >
> >    format requests (e.g. transmit, receive) in userspace
> >    read ring contents
> >
> > What userspace can't be allowed to do:
> >
> >    access BAR
> >    write rings
> >
> >
> > This means that the driver can not be a generic one,
> > and there will be a system call overhead when you
> > write the ring, but that's the price you have to
> > pay for ability to run on systems without an IOMMU.
> >
>
>
> The device specific parts can be taken from John Fastabend's patches
> BTW:
>
> https://patchwork.ozlabs.org/patch/396713/
>
> IIUC what was missing there was exactly the memory protection
> we are looking for here.

The bifurcated drivers are interesting from an architecture
point of view, but do nothing to solve the immediate use case.
The problem is not on bare metal environments; most of those already have an IOMMU.
The issues are on environments like VMware with SR-IOV or vmxnet3;
neither of those is really helped by a bifurcated driver or VFIO.
On Thu, Oct 01, 2015 at 10:26:19AM -0700, Stephen Hemminger wrote:
> On Thu, 1 Oct 2015 19:31:08 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> > On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote:
> > > On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> > > > This driver allows using PCI device with Message Signalled Interrupt
> > > > from userspace. The API is similar to the igb_uio driver used by the DPDK.
> > > > Via ioctl it provides a mechanism to map MSI-X interrupts into event
> > > > file descriptors similar to VFIO.
> > > >
> > > > VFIO is a better choice if IOMMU is available, but often userspace drivers
> > > > have to work in environments where IOMMU support (real or emulated) is
> > > > not available. All UIO drivers that support DMA are not secure against
> > > > rogue userspace applications programming DMA hardware to access
> > > > private memory; this driver is no less secure than existing code.
> > > >
> > > > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> > >
> > > I don't think copying the igb_uio interface is a good idea.
> > > What DPDK is doing with igb_uio (and indeed uio_pci_generic)
> > > is abusing the sysfs BAR access to provide unlimited
> > > access to hardware.
> > >
> > > MSI messages are memory writes so any generic device capable
> > > of MSI is capable of corrupting kernel memory.
> > > This means that a bug in userspace will lead to kernel memory corruption
> > > and crashes. This is something distributions can't support.
> > >
> > > uio_pci_generic is already abused like that, mostly
> > > because when I wrote it, I didn't add enough protections
> > > against using it with DMA capable devices,
> > > and we can't go back and break working userspace.
> > > But at least it does not bind to VFs which all of
> > > them are capable of DMA.
> > >
> > > The result of merging this driver will be userspace abusing the
> > > sysfs BAR access with VFs as well, and we do not want that.
> > >
> > >
> > > Just forwarding events is not enough to make a valid driver.
> > > What is missing is a way to access the device in a safe way.
> > >
> > > On a more positive note:
> > >
> > > What would be a reasonable interface? One that does the following
> > > in kernel:
> > >
> > > 1. initializes device rings (can be in pinned userspace memory,
> > >    but can not be writeable by userspace), brings up interface link
> > > 2. pins userspace memory (unless using e.g. hugetlbfs)
> > > 3. gets request, make sure it's valid and belongs to
> > >    the correct task, put it in the ring
> > > 4. in the reverse direction, notify userspace when buffers
> > >    are available in the ring
> > > 5. notify userspace about MSI (what this driver does)
> > >
> > > What userspace can be allowed to do:
> > >
> > >    format requests (e.g. transmit, receive) in userspace
> > >    read ring contents
> > >
> > > What userspace can't be allowed to do:
> > >
> > >    access BAR
> > >    write rings
> > >
> > >
> > > This means that the driver can not be a generic one,
> > > and there will be a system call overhead when you
> > > write the ring, but that's the price you have to
> > > pay for ability to run on systems without an IOMMU.
> > >
> >
> >
> > The device specific parts can be taken from John Fastabend's patches
> > BTW:
> >
> > https://patchwork.ozlabs.org/patch/396713/
> >
> > IIUC what was missing there was exactly the memory protection
> > we are looking for here.
>
> The bifurcated drivers are interesting from an architecture
> point of view, but do nothing to solve the immediate use case.
> The problem is not on bare metal environment, most of those already have IOMMU.
> The issues are on environments like VMWare with SRIOV or vmxnet3,
> neither of those are really helped by bifurcated driver or VFIO.
Two points I tried to make (and apparently failed, so I'm trying
again, more verbosely):

- bifurcated drivers do DMA into both kernel and userspace
  memory from the same PCI address (bus/dev/fn). As the IOMMU uses this
  source address to validate accesses, there is no way to
  have the IOMMU prevent userspace from accessing kernel memory.
  If you are prepared to use dynamic mappings for kernel memory,
  it might be possible to limit the harm that userspace can do,
  but this will slow down kernel networking (changing IOMMU mappings
  is expensive) and userspace will likely be able to at least
  disrupt kernel networking.
  So what I am discussing might still have value there.

- bifurcated drivers have code to bring up the link and map rings into
  userspace (they also map other rings into the kernel, and
  tweak the rx filter in hardware, which might not be necessary for this
  use case). What I proposed above can use that code,
  with the twist that the RX ring is made RO for userspace, and
  a system call to safely copy from a userspace ring there
  is supported.

In other words, here's the device specific part in kernel that you
wanted to use, which will only need some tweaks.
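
The "validate, then copy into a ring userspace cannot write" step proposed
above might look roughly like the sketch below. It is a plain userspace C
model, not kernel code and not taken from the bifurcated-driver patches:
struct desc, struct safe_ring, ring_post() and RING_SIZE are hypothetical
names, and "belongs to the correct task" is reduced here to "the buffer
lies entirely inside the region that was pinned for this task".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8u	/* hypothetical ring depth */

/* A request as userspace would format it (hypothetical layout). */
struct desc {
	uint64_t addr;	/* DMA address of the buffer */
	uint32_t len;	/* buffer length in bytes */
};

/* Ring owned by the kernel side; userspace would only map it read-only. */
struct safe_ring {
	struct desc slots[RING_SIZE];
	uint32_t head;		/* next slot to fill */
	uint32_t tail;		/* next slot the device will complete */
	uint64_t pin_base;	/* start of the task's pinned region */
	uint64_t pin_len;	/* length of the pinned region */
};

/*
 * Validate a userspace-formatted request and, only if it is sane, copy it
 * into the ring. Returns 0, -EINVAL for a bad buffer, -ENOSPC if full.
 */
static int ring_post(struct safe_ring *r, const struct desc *req)
{
	uint64_t off;

	assert(r != NULL && req != NULL);

	if (req->len == 0 || req->addr < r->pin_base)
		return -EINVAL;

	/* Written so that neither subtraction can wrap around. */
	off = req->addr - r->pin_base;
	if (off > r->pin_len || req->len > r->pin_len - off)
		return -EINVAL;

	if (r->head - r->tail == RING_SIZE)
		return -ENOSPC;

	r->slots[r->head % RING_SIZE] = *req;
	r->head++;
	return 0;
}
```

The cost mentioned earlier in the thread shows up here as one call per
posted request; batching several descriptors per call would be one way to
amortize it, at the price of a slightly larger interface.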
On 09/30/2015 03:28 PM, Stephen Hemminger wrote: > This driver allows using PCI device with Message Signalled Interrupt > from userspace. The API is similar to the igb_uio driver used by the DPDK. > Via ioctl it provides a mechanism to map MSI-X interrupts into event > file descriptors similar to VFIO. > > VFIO is a better choice if IOMMU is available, but often userspace drivers > have to work in environments where IOMMU support (real or emulated) is > not available. All UIO drivers that support DMA are not secure against > rogue userspace applications programming DMA hardware to access > private memory; this driver is no less secure than existing code. > > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> > --- > drivers/uio/Kconfig | 9 ++ > drivers/uio/Makefile | 1 + > drivers/uio/uio_msi.c | 378 +++++++++++++++++++++++++++++++++++++++++++ > include/uapi/linux/Kbuild | 1 + > include/uapi/linux/uio_msi.h | 22 +++ > 5 files changed, 411 insertions(+) > create mode 100644 drivers/uio/uio_msi.c > create mode 100644 include/uapi/linux/uio_msi.h > > diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig > index 52c98ce..04adfa0 100644 > --- a/drivers/uio/Kconfig > +++ b/drivers/uio/Kconfig > @@ -93,6 +93,15 @@ config UIO_PCI_GENERIC > primarily, for virtualization scenarios. > If you compile this as a module, it will be called uio_pci_generic. > > +config UIO_PCI_MSI > + tristate "Generic driver supporting MSI-x on PCI Express cards" > + depends on PCI > + help > + Generic driver that provides Message Signalled IRQ events > + similar to VFIO. If IOMMMU is available please use VFIO > + instead since it provides more security. > + If you compile this as a module, it will be called uio_msi. > + > config UIO_NETX > tristate "Hilscher NetX Card driver" > depends on PCI Should you maybe instead depend on CONFIG_PCI_MSI. Without MSI this is essentially just uio_pci_generic with a bit more greedy mapping setup. 
> diff --git a/drivers/uio/Makefile b/drivers/uio/Makefile > index 8560dad..62fc44b 100644 > --- a/drivers/uio/Makefile > +++ b/drivers/uio/Makefile > @@ -9,3 +9,4 @@ obj-$(CONFIG_UIO_NETX) += uio_netx.o > obj-$(CONFIG_UIO_PRUSS) += uio_pruss.o > obj-$(CONFIG_UIO_MF624) += uio_mf624.o > obj-$(CONFIG_UIO_FSL_ELBC_GPCM) += uio_fsl_elbc_gpcm.o > +obj-$(CONFIG_UIO_PCI_MSI) += uio_msi.o > diff --git a/drivers/uio/uio_msi.c b/drivers/uio/uio_msi.c > new file mode 100644 > index 0000000..802b5c4 > --- /dev/null > +++ b/drivers/uio/uio_msi.c > @@ -0,0 +1,378 @@ > +/*- > + * > + * Copyright (c) 2015 by Brocade Communications Systems, Inc. > + * Author: Stephen Hemminger <stephen@networkplumber.org> > + * > + * This work is licensed under the terms of the GNU GPL, version 2 only. > + */ > + > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt > + > +#include <linux/device.h> > +#include <linux/interrupt.h> > +#include <linux/eventfd.h> > +#include <linux/module.h> > +#include <linux/pci.h> > +#include <linux/uio_driver.h> > +#include <linux/msi.h> > +#include <linux/uio_msi.h> > + > +#define DRIVER_VERSION "0.1.1" > +#define MAX_MSIX_VECTORS 64 > + > +/* MSI-X vector information */ > +struct uio_msi_pci_dev { > + struct uio_info info; /* UIO driver info */ > + struct pci_dev *pdev; /* PCI device */ > + struct mutex mutex; /* open/release/ioctl mutex */ > + int ref_cnt; /* references to device */ > + unsigned int max_vectors; /* MSI-X slots available */ > + struct msix_entry *msix; /* MSI-X vector table */ > + struct uio_msi_irq_ctx { > + struct eventfd_ctx *trigger; /* vector to eventfd */ > + char *name; /* name in /proc/interrupts */ > + } *ctx; > +}; > + I would move the definition of uio_msi_irq_ctx out of uio_msi_pci_dev. It would help to make this a bit more readable. 
> +static irqreturn_t uio_intx_irqhandler(int irq, void *arg) > +{ > + struct uio_msi_pci_dev *udev = arg; > + > + if (pci_check_and_mask_intx(udev->pdev)) { > + eventfd_signal(udev->ctx->trigger, 1); > + return IRQ_HANDLED; > + } > + > + return IRQ_NONE; > +} > + I would really prefer to see the intx handling dropped since there are already 2 different UIO drivers setup for handling INTx style interrupts. Lets focus on the parts from the last decade and drop support for INTx now in favor of MSI-X and maybe MSI. If we _REALLY_ need it we can always come back later and add it. > +static irqreturn_t uio_msi_irqhandler(int irq, void *arg) > +{ > + struct eventfd_ctx *trigger = arg; > + > + eventfd_signal(trigger, 1); > + return IRQ_HANDLED; > +} > + > +/* set the mapping between vector # and existing eventfd. */ > +static int set_irq_eventfd(struct uio_msi_pci_dev *udev, u32 vec, int fd) > +{ > + struct eventfd_ctx *trigger; > + int irq, err; > + > + if (vec >= udev->max_vectors) { > + dev_notice(&udev->pdev->dev, "vec %u >= num_vec %u\n", > + vec, udev->max_vectors); > + return -ERANGE; > + } > + > + irq = udev->msix[vec].vector; > + trigger = udev->ctx[vec].trigger; > + if (trigger) { > + /* Clearup existing irq mapping */ Minor spelling issue here, "Clear up" should be 2 words, "Cleanup" can be one. 
> + free_irq(irq, trigger); > + eventfd_ctx_put(trigger); > + udev->ctx[vec].trigger = NULL; > + } > + > + /* Passing -1 is used to disable interrupt */ > + if (fd < 0) > + return 0; > + > + trigger = eventfd_ctx_fdget(fd); > + if (IS_ERR(trigger)) { > + err = PTR_ERR(trigger); > + dev_notice(&udev->pdev->dev, > + "eventfd ctx get failed: %d\n", err); > + return err; > + } > + > + if (udev->msix) > + err = request_irq(irq, uio_msi_irqhandler, 0, > + udev->ctx[vec].name, trigger); > + else > + err = request_irq(irq, uio_intx_irqhandler, IRQF_SHARED, > + udev->ctx[vec].name, udev); > + > + if (err) { > + dev_notice(&udev->pdev->dev, > + "request irq failed: %d\n", err); > + eventfd_ctx_put(trigger); > + return err; > + } > + > + udev->ctx[vec].trigger = trigger; > + return 0; > +} > + > +static int > +uio_msi_ioctl(struct uio_info *info, unsigned int cmd, unsigned long arg) > +{ > + struct uio_msi_pci_dev *udev > + = container_of(info, struct uio_msi_pci_dev, info); > + struct uio_msi_irq_set hdr; > + int err; > + > + switch (cmd) { > + case UIO_MSI_IRQ_SET: > + if (copy_from_user(&hdr, (void __user *)arg, sizeof(hdr))) > + return -EFAULT; > + > + mutex_lock(&udev->mutex); > + err = set_irq_eventfd(udev, hdr.vec, hdr.fd); > + mutex_unlock(&udev->mutex); > + break; > + default: > + err = -EOPNOTSUPP; > + } > + return err; > +} > + > +/* Opening the UIO device for first time enables MSI-X */ > +static int > +uio_msi_open(struct uio_info *info, struct inode *inode) > +{ > + struct uio_msi_pci_dev *udev > + = container_of(info, struct uio_msi_pci_dev, info); > + int err = 0; > + > + mutex_lock(&udev->mutex); > + if (udev->ref_cnt++ == 0) { > + if (udev->msix) > + err = pci_enable_msix(udev->pdev, udev->msix, > + udev->max_vectors); > + } > + mutex_unlock(&udev->mutex); > + > + return err; > +} > + I agree with some other reviewers. Why call pci_enable_msix in open? It seems like it would make much more sense to do this on probe, and then disable MSI-X on free. 
I can only assume you are trying to do it to save on resources but the fact is this is a driver you have to explicitly force onto a device so you would probably be safe to assume that they plan to use it in the near future. > +/* Last close of the UIO device releases/disables all IRQ's */ > +static int > +uio_msi_release(struct uio_info *info, struct inode *inode) > +{ > + struct uio_msi_pci_dev *udev > + = container_of(info, struct uio_msi_pci_dev, info); > + int i; > + > + mutex_lock(&udev->mutex); > + if (--udev->ref_cnt == 0) { > + for (i = 0; i < udev->max_vectors; i++) { > + int irq = udev->msix[i].vector; > + struct eventfd_ctx *trigger = udev->ctx[i].trigger; > + > + if (!trigger) > + continue; > + > + free_irq(irq, trigger); > + eventfd_ctx_put(trigger); > + udev->ctx[i].trigger = NULL; > + } > + > + if (udev->msix) > + pci_disable_msix(udev->pdev); > + } > + mutex_unlock(&udev->mutex); > + > + return 0; > +} > + > +/* Unmap previously ioremap'd resources */ > +static void > +release_iomaps(struct uio_mem *mem) > +{ > + int i; > + > + for (i = 0; i < MAX_UIO_MAPS; i++, mem++) { > + if (mem->internal_addr) > + iounmap(mem->internal_addr); > + } > +} > + > +static int > +setup_maps(struct pci_dev *pdev, struct uio_info *info) > +{ > + int i, m = 0, p = 0, err; > + static const char * const bar_names[] = { > + "BAR0", "BAR1", "BAR2", "BAR3", "BAR4", "BAR5", > + }; > + > + for (i = 0; i < ARRAY_SIZE(bar_names); i++) { > + unsigned long start = pci_resource_start(pdev, i); > + unsigned long flags = pci_resource_flags(pdev, i); > + unsigned long len = pci_resource_len(pdev, i); > + > + if (start == 0 || len == 0) > + continue; > + > + if (flags & IORESOURCE_MEM) { > + void __iomem *addr; > + > + if (m >= MAX_UIO_MAPS) > + continue; > + > + addr = ioremap(start, len); > + if (addr == NULL) { > + err = -EINVAL; > + goto fail; > + } > + > + info->mem[m].name = bar_names[i]; > + info->mem[m].addr = start; > + info->mem[m].internal_addr = addr; > + info->mem[m].size 
= len; > + info->mem[m].memtype = UIO_MEM_PHYS; > + ++m; > + } else if (flags & IORESOURCE_IO) { > + if (p >= MAX_UIO_PORT_REGIONS) > + continue; > + > + info->port[p].name = bar_names[i]; > + info->port[p].start = start; > + info->port[p].size = len; > + info->port[p].porttype = UIO_PORT_X86; > + ++p; > + } > + } > + > + return 0; > + fail: > + for (i = 0; i < m; i++) > + iounmap(info->mem[i].internal_addr); > + return err; > +} > + Do you really need to map IORESOURCE bars? Most drivers I can think of don't use IO BARs anymore. Maybe we could look at just dropping the code and adding it back later if we have a use case that absolutely needs it. Also how many devices actually need resources beyond BAR 0? I'm just curious as I know BAR 2 on many of the Intel devices is the register space related to MSI-X so now we have both the PCIe subsystem and user space with access to this region. > +static int uio_msi_probe(struct pci_dev *pdev, const struct pci_device_id *id) > +{ > + struct uio_msi_pci_dev *udev; > + int i, err, vectors; > + > + udev = kzalloc(sizeof(struct uio_msi_pci_dev), GFP_KERNEL); > + if (!udev) > + return -ENOMEM; > + > + err = pci_enable_device(pdev); > + if (err != 0) { > + dev_err(&pdev->dev, "cannot enable PCI device\n"); > + goto fail_free; > + } > + > + err = pci_request_regions(pdev, "uio_msi"); > + if (err != 0) { > + dev_err(&pdev->dev, "Cannot request regions\n"); > + goto fail_disable; > + } > + > + pci_set_master(pdev); > + > + /* remap resources */ > + err = setup_maps(pdev, &udev->info); > + if (err) > + goto fail_release_iomem; > + > + /* fill uio infos */ > + udev->info.name = "uio_msi"; > + udev->info.version = DRIVER_VERSION; > + udev->info.priv = udev; > + udev->pdev = pdev; > + udev->info.ioctl = uio_msi_ioctl; > + udev->info.open = uio_msi_open; > + udev->info.release = uio_msi_release; > + udev->info.irq = UIO_IRQ_CUSTOM; > + mutex_init(&udev->mutex); > + > + vectors = pci_msix_vec_count(pdev); > + if (vectors > 0) { > + 
udev->max_vectors = min_t(u16, vectors, MAX_MSIX_VECTORS);
> +		dev_info(&pdev->dev, "using up to %u MSI-X vectors\n",
> +			udev->max_vectors);
> +
> +		err = -ENOMEM;
> +		udev->msix = kcalloc(udev->max_vectors,
> +				     sizeof(struct msix_entry), GFP_KERNEL);
> +		if (!udev->msix)
> +			goto fail_release_iomem;
> +	} else if (!pci_intx_mask_supported(pdev)) {
> +		dev_err(&pdev->dev,
> +			"device does not support MSI-X or INTX\n");
> +		err = -EINVAL;
> +		goto fail_release_iomem;
> +	} else {
> +		dev_notice(&pdev->dev, "using INTX\n");
> +		udev->info.irq_flags = IRQF_SHARED;
> +		udev->max_vectors = 1;
> +	}
> +
> +	udev->ctx = kcalloc(udev->max_vectors,
> +			    sizeof(struct uio_msi_irq_ctx), GFP_KERNEL);
> +	if (!udev->ctx)
> +		goto fail_free_msix;
> +
> +	for (i = 0; i < udev->max_vectors; i++) {
> +		udev->msix[i].entry = i;
> +
> +		udev->ctx[i].name = kasprintf(GFP_KERNEL,
> +					      KBUILD_MODNAME "[%d](%s)",
> +					      i, pci_name(pdev));
> +		if (!udev->ctx[i].name)
> +			goto fail_free_ctx;
> +	}
> +
> +	/* register uio driver */
> +	err = uio_register_device(&pdev->dev, &udev->info);
> +	if (err != 0)
> +		goto fail_free_ctx;
> +
> +	pci_set_drvdata(pdev, udev);
> +	return 0;
> +
> +fail_free_ctx:
> +	for (i = 0; i < udev->max_vectors; i++)
> +		kfree(udev->ctx[i].name);
> +	kfree(udev->ctx);
> +fail_free_msix:
> +	kfree(udev->msix);
> +fail_release_iomem:
> +	release_iomaps(udev->info.mem);
> +	pci_release_regions(pdev);
> +fail_disable:
> +	pci_disable_device(pdev);
> +fail_free:
> +	kfree(udev);
> +
> +	pr_notice("%s ret %d\n", __func__, err);
> +	return err;
> +}
> +
> +static void uio_msi_remove(struct pci_dev *pdev)
> +{
> +	struct uio_info *info = pci_get_drvdata(pdev);
> +	struct uio_msi_pci_dev *udev
> +		= container_of(info, struct uio_msi_pci_dev, info);
> +	int i;
> +
> +	uio_unregister_device(info);
> +	release_iomaps(info->mem);
> +
> +	pci_release_regions(pdev);
> +	for (i = 0; i < udev->max_vectors; i++)
> +		kfree(udev->ctx[i].name);
> +	kfree(udev->ctx);
> +	kfree(udev->msix);
> +	pci_disable_device(pdev);
> +
> +	pci_set_drvdata(pdev, NULL);
> +	kfree(udev);
> +}
> +
> +static struct pci_driver uio_msi_pci_driver = {
> +	.name = "uio_msi",
> +	.probe = uio_msi_probe,
> +	.remove = uio_msi_remove,
> +};
> +
> +module_pci_driver(uio_msi_pci_driver);
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Stephen Hemminger <stephen@networkplumber.org>");
> +MODULE_DESCRIPTION("UIO driver for MSI PCI devices");
> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
> index f7b2db4..d9497691 100644
> --- a/include/uapi/linux/Kbuild
> +++ b/include/uapi/linux/Kbuild
> @@ -411,6 +411,7 @@ header-y += udp.h
>  header-y += uhid.h
>  header-y += uinput.h
>  header-y += uio.h
> +header-y += uio_msi.h
>  header-y += ultrasound.h
>  header-y += un.h
>  header-y += unistd.h
> diff --git a/include/uapi/linux/uio_msi.h b/include/uapi/linux/uio_msi.h
> new file mode 100644
> index 0000000..297de00
> --- /dev/null
> +++ b/include/uapi/linux/uio_msi.h
> @@ -0,0 +1,22 @@
> +/*
> + * UIO_MSI API definition
> + *
> + * Copyright (c) 2015 by Brocade Communications Systems, Inc.
> + * All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +#ifndef _UIO_PCI_MSI_H
> +#define _UIO_PCI_MSI_H
> +
> +struct uio_msi_irq_set {
> +	u32 vec;
> +	int fd;
> +};
> +
> +#define UIO_MSI_BASE	0x86
> +#define UIO_MSI_IRQ_SET	_IOW('I', UIO_MSI_BASE+1, struct uio_msi_irq_set)
> +
> +#endif
>
On Thu, 1 Oct 2015 16:40:10 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> I agree with some other reviewers. Why call pci_enable_msix in open?
> It seems like it would make much more sense to do this on probe, and
> then disable MSI-X on free. I can only assume you are trying to do it
> to save on resources but the fact is this is a driver you have to
> explicitly force onto a device so you would probably be safe to assume
> that they plan to use it in the near future.

Because if interface is not up, the MSI handle doesn't have to be open.
This saves resources and avoids some races.
On Thu, 1 Oct 2015 16:40:10 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> Do you really need to map IORESOURCE bars? Most drivers I can think of
> don't use IO BARs anymore. Maybe we could look at just dropping the
> code and adding it back later if we have a use case that absolutely
> needs it.

Mapping is not strictly necessary, but for virtio it acts as a way to
communicate the regions.

> Also how many devices actually need resources beyond BAR 0? I'm just
> curious as I know BAR 2 on many of the Intel devices is the register
> space related to MSI-X so now we have both the PCIe subsystem and user
> space with access to this region.

VMXNet3 needs 2 bars. Most use only one.
On 10/01/2015 05:01 PM, Stephen Hemminger wrote:
> On Thu, 1 Oct 2015 16:40:10 -0700
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> I agree with some other reviewers. Why call pci_enable_msix in open?
>> It seems like it would make much more sense to do this on probe, and
>> then disable MSI-X on free. I can only assume you are trying to do it
>> to save on resources but the fact is this is a driver you have to
>> explicitly force onto a device so you would probably be safe to assume
>> that they plan to use it in the near future.
> Because if interface is not up, the MSI handle doesn't have to be open.
> This saves resources and avoids some races.

Yes, but it makes things a bit messier for the interrupts. Most drivers
take care of interrupts during probe so that if there are any allocation
problems they can take care of them then instead of leaving an interface
out that will later fail when it is brought up. It ends up being a way
to deal with the whole MSI-X fall-back issue.

- Alex
On 10/01/2015 05:04 PM, Stephen Hemminger wrote:
> On Thu, 1 Oct 2015 16:40:10 -0700
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> Do you really need to map IORESOURCE bars? Most drivers I can think of
>> don't use IO BARs anymore. Maybe we could look at just dropping the
>> code and adding it back later if we have a use case that absolutely
>> needs it.
> Mapping is not strictly necessary, but for virtio it acts as a way to
> communicate the regions.

I think I see what you are saying. I was hoping we could get away from
having to map any I/O ports but it looks like virtio is still using them
for BAR 0, or at least that is what I am seeing on my VM with
virtio_net. I was really hoping we could get away from that since a
16-bit address space is far too restrictive anyway.

>> Also how many devices actually need resources beyond BAR 0? I'm just
>> curious as I know BAR 2 on many of the Intel devices is the register
>> space related to MSI-X so now we have both the PCIe subsystem and user
>> space with access to this region.
> VMXNet3 needs 2 bars. Most use only one.

So essentially we are needing to make exceptions for the virtual
interfaces. I guess there isn't much we can do then and we probably
need to map any and all base address registers we can find for the given
device. I was hoping for something a bit more surgical since we are
opening a security hole of sorts, but I guess it can't be helped if we
want to support multiple devices and they all have such radically
different configurations.

- Alex
On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote:
> Just forwarding events is not enough to make a valid driver.
> What is missing is a way to access the device in a safe way.

Thinking about it some more, maybe some devices don't do DMA, and merely
signal events with MSI/MSI-X.

The fact you mention igb_uio in the cover letter seems to hint that this
isn't the case, and that the real intent is to abuse it for DMA-capable
devices, but still ...

If we assume such a simple device, we need to block userspace from
tweaking at least the MSI control and the MSI-X table.
And changing BARs might make someone else corrupt the MSI-X
table, so we need to block it from changing BARs, too.

Things like device reset will clear the table. I guess this means we
need to track access to reset, too, make sure we restore the
table to a sane config.

PM capability can be used to reset things too, I think. Better be
careful about that.

And a bunch of devices could be doing weird things that need
to be special-cased.

All of this is what VFIO is already dealing with.

Maybe extending VFIO for this usecase, or finding another way to share
code might be a better idea than duplicating the code within uio?
On Oct 6, 2015 12:55 AM, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote:
> > Just forwarding events is not enough to make a valid driver.
> > What is missing is a way to access the device in a safe way.
>
> Thinking about it some more, maybe some devices don't do DMA, and merely
> signal events with MSI/MSI-X.
>
> The fact you mention igb_uio in the cover letter seems to hint that this
> isn't the case, and that the real intent is to abuse it for DMA-capable
> devices, but still ...
>
> If we assume such a simple device, we need to block userspace from
> tweaking at least the MSI control and the MSI-X table.
> And changing BARs might make someone else corrupt the MSI-X
> table, so we need to block it from changing BARs, too.
>
> Things like device reset will clear the table. I guess this means we
> need to track access to reset, too, make sure we restore the
> table to a sane config.
>
> PM capability can be used to reset things too, I think. Better be
> careful about that.
>
> And a bunch of devices could be doing weird things that need
> to be special-cased.
>
> All of this is what VFIO is already dealing with.
>
> Maybe extending VFIO for this usecase, or finding another way to share
> code might be a better idea than duplicating the code within uio?

How about instead of trying to reinvent the wheel just go and attack the
problem directly just like I've proposed already a few times in the last
days: instead of limiting the UIO limit the users that are allowed to
use UIO to privileged users only (e.g. root). This would solve all
clearly unresolvable issues u are raising here all together, wouldn't it?

>
> --
> MST
On Tue, Oct 06, 2015 at 01:09:55AM +0300, Vladislav Zolotarov wrote:
> How about instead of trying to reinvent the wheel just go and attack the problem
> directly just like I've proposed already a few times in the last days: instead
> of limiting the UIO limit the users that are allowed to use UIO to privileged
> users only (e.g. root). This would solve all clearly unresolvable issues u are
> raising here all together, wouldn't it?

No - root or no root, if the user can modify the addresses in the MSI-X
table and make the chip corrupt random memory, this is IMHO a non-starter.

And tainting kernel is not a solution - your patch adds a pile of
code that either goes completely unused or taints the kernel.
Not just that - it's a dedicated userspace API that either
goes completely unused or taints the kernel.

> >
> > --
> > MST
>
Other than implementation objections, so far the two main arguments
against this reduce to:
  1. If you allow UIO ioctl then it opens an API hook for all the crap out
     of tree UIO drivers to do what they want.
  2. If you allow UIO MSI-X then you are expanding the usage of userspace
     device access in an insecure manner.

Another alternative which I explored was making a version of VFIO that
works without IOMMU. It solves #1 but actually increases the likely negative
response to argument #2. This would keep same API, and avoid having to
modify UIO. But we would still have the same (if not more resistance)
from IOMMU developers who believe all systems have to be secure against
root.
On 10/06/15 01:49, Michael S. Tsirkin wrote:
> On Tue, Oct 06, 2015 at 01:09:55AM +0300, Vladislav Zolotarov wrote:
>> How about instead of trying to reinvent the wheel just go and attack the problem
>> directly just like I've proposed already a few times in the last days: instead
>> of limiting the UIO limit the users that are allowed to use UIO to privileged
>> users only (e.g. root). This would solve all clearly unresolvable issues u are
>> raising here all together, wouldn't it?
> No - root or no root, if the user can modify the addresses in the MSI-X
> table and make the chip corrupt random memory, this is IMHO a non-starter.

Michael, how this or any other related patch is related to the problem
u r describing? The above ability is there for years and if memory
serves me well it was u who wrote uio_pci_generic with this "security
flaw". ;)

This patch in general only adds the ability to receive notifications per
MSI-X interrupt and it has nothing to do with the ability to reprogram
the MSI-X related registers from the user space which was always there.

>
> And tainting kernel is not a solution - your patch adds a pile of
> code that either goes completely unused or taints the kernel.
> Not just that - it's a dedicated userspace API that either
> goes completely unused or taints the kernel.
>
>>> --
>>> MST
On 10/06/2015 10:33 AM, Stephen Hemminger wrote:
> Other than implementation objections, so far the two main arguments
> against this reduce to:
>    1. If you allow UIO ioctl then it opens an API hook for all the crap out
>       of tree UIO drivers to do what they want.
>    2. If you allow UIO MSI-X then you are expanding the usage of userspace
>       device access in an insecure manner.
>
> Another alternative which I explored was making a version of VFIO that
> works without IOMMU. It solves #1 but actually increases the likely negative
> response to argument #2. This would keep same API, and avoid having to
> modify UIO. But we would still have the same (if not more resistance)
> from IOMMU developers who believe all systems have to be secure against
> root.

vfio's charter was explicitly aiming for modern setups with iommus.

This could be revisited, but I agree it will have even more resistance,
justified IMO.

btw, (2) doesn't really add any insecurity. The user could already poke
at the msix tables (as well as perform DMA); they just couldn't get a
useful interrupt out of them.

Maybe a module parameter "allow_insecure_dma" can be added to
uio_pci_generic. Without the parameter, bus mastering and msix is
disabled, with the parameter it is allowed. This requires the sysadmin
to take a positive step in order to make use of their hardware.
On Tue, Oct 06, 2015 at 08:33:56AM +0100, Stephen Hemminger wrote:
> Other than implementation objections, so far the two main arguments
> against this reduce to:
>   1. If you allow UIO ioctl then it opens an API hook for all the crap out
>      of tree UIO drivers to do what they want.
>   2. If you allow UIO MSI-X then you are expanding the usage of userspace
>      device access in an insecure manner.

That's not all. Without MSI one can detect insecure usage by detecting
userspace enabling bus mastering. This can be detected simply using
lspci. Or one can also imagine a configuration where this ability is
disabled, is logged, or taints kernel. This seems like something that
might be worth having for some locked-down systems.

OTOH enabling MSI requires enabling bus mastering so suddenly we have no
idea whether device can be/is used in a safe way.

>
> Another alternative which I explored was making a version of VFIO that
> works without IOMMU. It solves #1 but actually increases the likely negative
> response to argument #2.

No - because VFIO has limited protection against device misuse by
userspace, by limiting access to sub-ranges of device BARs and config
space. For a device that doesn't do DMA, that will be enough to make it
secure to use.

That's a pretty weak excuse to support userspace drivers for PCI devices
without an IOMMU, but it's the best I heard so far.

Is that worth the security trade-off? I'm still not sure.

> This would keep same API, and avoid having to
> modify UIO. But we would still have the same (if not more resistance)
> from IOMMU developers who believe all systems have to be secure against
> root.

"Secure against root" is a confusing way to put it IMHO. We are talking
about memory protection.

So that's not IOMMU developers IIUC. I believe most kernel developers
will agree it's not a good idea to let userspace corrupt kernel memory.
Otherwise, the driver can't be supported, and maintaining upstream
drivers that can't be supported serves no useful purpose. Anyone can
load out of tree ones just as well.

VFIO already supports MSI so VFIO developers already have a lot of
experience with these issues. Getting their input would be valuable.
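
The lspci check mentioned above comes down to a single bit in PCI configuration space: Bus Master Enable, bit 2 of the 16-bit Command register at offset 0x04. A minimal sketch, assuming the caller has already copied the start of the device's config space into a buffer (for instance from the device's sysfs `config` file); `bus_master_enabled` is a hypothetical helper name, not something from the patch or from pciutils:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_COMMAND		0x04	/* offset of the 16-bit Command register */
#define PCI_COMMAND_MASTER	0x0004	/* Bus Master Enable bit */

/*
 * Hypothetical helper: return 1 if bus mastering is enabled, 0 if not.
 * cfg is a byte-for-byte copy of the device's config space (registers
 * are little-endian); len is the number of valid bytes in cfg.
 */
static int bus_master_enabled(const uint8_t *cfg, size_t len)
{
	uint16_t cmd;

	assert(cfg != NULL && len >= PCI_COMMAND + 2);
	cmd = (uint16_t)(cfg[PCI_COMMAND] | (cfg[PCI_COMMAND + 1] << 8));
	return (cmd & PCI_COMMAND_MASTER) != 0;
}
```

A monitoring tool could call this on each device bound to a UIO driver and log or refuse devices for which it returns 1.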
On Tue, Oct 06, 2015 at 11:23:11AM +0300, Vlad Zolotarov wrote:
> Michael, how this or any other related patch is related to the problem u r
> describing?
> The above ability is there for years and if memory serves me
> well it was u who wrote uio_pci_generic with this "security flaw". ;)

I answered all this already.

This patch enables bus mastering, enables MSI or MSI-X, and requires
userspace to map the MSI-X table and read/write the config space.
This means that a single userspace bug is enough to corrupt kernel
memory.

uio_pci_generic does not enable bus mastering or MSI, and
it might be a good idea to have uio_pci_generic block
access to MSI/MSI-X config.
On Tue, Oct 06, 2015 at 03:15:57PM +0300, Avi Kivity wrote:
> btw, (2) doesn't really add any insecurity. The user could already poke at
> the msix tables (as well as perform DMA); they just couldn't get a useful
> interrupt out of them.

Poking at msix tables won't cause memory corruption unless msix and bus
mastering is enabled. It's true root can enable msix and bus mastering
through sysfs - but that's easy to block or detect. Even if you don't
buy a security story, it seems less likely to trigger as a result
of a userspace bug.
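
Whether MSI-X is enabled can be read from config space in the same way as bus mastering: walk the capability list to the MSI-X capability (ID 0x11) and test bit 15 of its Message Control word. A minimal sketch over a 256-byte copy of config space; the constant names follow the kernel's `pci_regs.h`, while `find_msix_cap`, `msix_enabled` and `msix_table_size` are hypothetical helpers written for this illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_STATUS		0x06	/* 16-bit Status register */
#define PCI_STATUS_CAP_LIST	0x0010	/* capability list is present */
#define PCI_CAPABILITY_LIST	0x34	/* offset of first capability pointer */
#define PCI_CAP_ID_MSIX		0x11	/* MSI-X capability ID */
#define PCI_MSIX_FLAGS		2	/* Message Control, relative to cap */
#define PCI_MSIX_FLAGS_QSIZE	0x07FF	/* table size, encoded as N - 1 */
#define PCI_MSIX_FLAGS_ENABLE	0x8000	/* MSI-X Enable */

static uint16_t cfg_read16(const uint8_t *cfg, size_t off)
{
	return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

/* Return the offset of the MSI-X capability, or 0 if there is none. */
static size_t find_msix_cap(const uint8_t cfg[256])
{
	size_t pos;
	int ttl = 48;	/* bound the walk in case the list loops */

	assert(cfg != NULL);
	if (!(cfg_read16(cfg, PCI_STATUS) & PCI_STATUS_CAP_LIST))
		return 0;

	pos = cfg[PCI_CAPABILITY_LIST] & 0xFC;
	while (pos >= 0x40 && ttl-- > 0) {
		if (cfg[pos] == PCI_CAP_ID_MSIX)
			return pos;
		pos = cfg[pos + 1] & 0xFC;	/* next capability pointer */
	}
	return 0;
}

/* Return 1 if the MSI-X Enable bit is set, 0 otherwise. */
static int msix_enabled(const uint8_t cfg[256])
{
	size_t pos = find_msix_cap(cfg);

	if (!pos)
		return 0;
	return (cfg_read16(cfg, pos + PCI_MSIX_FLAGS) &
		PCI_MSIX_FLAGS_ENABLE) != 0;
}

/* Number of MSI-X table entries the device advertises, 0 if none. */
static unsigned int msix_table_size(const uint8_t cfg[256])
{
	size_t pos = find_msix_cap(cfg);

	if (!pos)
		return 0;
	return (cfg_read16(cfg, pos + PCI_MSIX_FLAGS) &
		PCI_MSIX_FLAGS_QSIZE) + 1u;
}
```

The combination argued about in this thread is the case where both this enable bit and Bus Master Enable in the Command register are set.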
On 10/06/15 16:58, Michael S. Tsirkin wrote:
> On Tue, Oct 06, 2015 at 11:23:11AM +0300, Vlad Zolotarov wrote:
>> Michael, how this or any other related patch is related to the problem u r
>> describing?
>> The above ability is there for years and if memory serves me
>> well it was u who wrote uio_pci_generic with this "security flaw". ;)
> I answered all this already.
>
> This patch enables bus mastering, enables MSI or MSI-X

This may be done from the user space right now without this patch...

> , and requires
> userspace to map the MSI-X table

Hmmm... I must have missed this requirement. Could u, pls., clarify?
From what I see, MSI/MSI-X table is configured completely in the kernel
here...

> and read/write the config space.
> This means that a single userspace bug is enough to corrupt kernel
> memory.

Could u, pls., provide an example of this simple bug? Because it's
absolutely not obvious...

>
> uio_pci_generic does not enable bus mastering or MSI, and
> it might be a good idea to have uio_pci_generic block
> access to MSI/MSI-X config.

Since device bars may be mapped bypassing the UIO/uio_pci_generic - this
won't solve any issue.
On Tue, Oct 06, 2015 at 05:49:21PM +0300, Vlad Zolotarov wrote:
> >and read/write the config space.
> >This means that a single userspace bug is enough to corrupt kernel
> >memory.
>
> Could u, pls., provide an example of this simple bug? Because it's
> absolutely not obvious...

Stick a value that happens to match a kernel address in Msg Addr field
in an unmasked MSI-X entry.
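
For readers unfamiliar with the table layout behind this example: each MSI-X table entry is 16 bytes (message address low and high, message data, vector control), and a device signals a vector by writing the 32-bit message data to the message address, provided the entry's mask bit is clear. A minimal sketch of that layout follows; the struct, the helper names and the address range used in the check are all hypothetical, chosen only to illustrate the failure mode described above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MSIX_ENTRY_CTRL_MASKBIT	0x1	/* bit 0 of Vector Control */

/* One MSI-X table entry as it appears in the device BAR. */
struct msix_table_entry {
	uint32_t msg_addr_lo;	/* Message Address, low 32 bits */
	uint32_t msg_addr_hi;	/* Message Upper Address */
	uint32_t msg_data;	/* value the device writes */
	uint32_t vector_ctrl;	/* bit 0: per-vector mask */
};

static_assert(sizeof(struct msix_table_entry) == 16,
	      "MSI-X table entries are 16 bytes");

/* Bus address the device will write to when the vector fires. */
static uint64_t msix_entry_target(const struct msix_table_entry *e)
{
	assert(e != NULL);
	return ((uint64_t)e->msg_addr_hi << 32) | e->msg_addr_lo;
}

static int msix_entry_unmasked(const struct msix_table_entry *e)
{
	assert(e != NULL);
	return (e->vector_ctrl & MSIX_ENTRY_CTRL_MASKBIT) == 0;
}

/*
 * Hypothetical check: would firing this vector write into the
 * half-open bus address range [start, end)?
 */
static int msix_write_hits(const struct msix_table_entry *e,
			   uint64_t start, uint64_t end)
{
	uint64_t target = msix_entry_target(e);

	return msix_entry_unmasked(e) && target >= start && target < end;
}
```

Without an IOMMU nothing sits between the device and memory to perform a check like `msix_write_hits`, which is why the value userspace leaves in the address field matters.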
On 10/06/2015 05:07 PM, Michael S. Tsirkin wrote:
> On Tue, Oct 06, 2015 at 03:15:57PM +0300, Avi Kivity wrote:
>> btw, (2) doesn't really add any insecurity. The user could already poke at
>> the msix tables (as well as perform DMA); they just couldn't get a useful
>> interrupt out of them.
> Poking at msix tables won't cause memory corruption unless msix and bus
> mastering is enabled.

It's a given that bus mastering is enabled. It's true that msix is
unlikely to be enabled, unless msix support is added.

> It's true root can enable msix and bus mastering
> through sysfs - but that's easy to block or detect. Even if you don't
> buy a security story, it seems less likely to trigger as a result
> of a userspace bug.

If you're doing DMA, that's the least of your worries.

Still, zero-mapping the msix space seems reasonable, and can protect
userspace from silly stuff. It can't be considered to have anything to
do with security though, as long as users can simply DMA to every bit of
RAM in the system they want to.
On 10/06/15 18:00, Michael S. Tsirkin wrote:
> On Tue, Oct 06, 2015 at 05:49:21PM +0300, Vlad Zolotarov wrote:
>>> and read/write the config space.
>>> This means that a single userspace bug is enough to corrupt kernel
>>> memory.
>> Could u, pls., provide an example of this simple bug? Because it's
>> absolutely not obvious...
> Stick a value that happens to match a kernel address in Msg Addr field
> in an unmasked MSI-X entry.

This patch neither configures MSI-X entries in the user space nor
provides additional means to do so therefore this "sticking" would be a
matter of some extra code that is absolutely unrelated to this patch.
So, this example seems absolutely irrelevant to this particular
discussion.

thanks,
vlad

>
To sum it up,

We want to remove the need of the out-of-tree module igb_uio.
3 possible implementations were discussed so far:
- new UIO driver
- extend uio_pci_generic
- VFIO without IOMMU

It is preferred to avoid creating yet another module to support.
That's why the uio_pci_generic extension would be nice.

In my understanding, there are currently 2 issues with the patches
from Vlad and Stephen:
- IRQ must be mapped to a fd without using a new ioctl
- MSI-X handling in userspace breaks the memory protection

I'm confident the first issue can be fixed with something like sysfs.
About the "security" concern, mainly expressed by MST, I think the idea
of Avi (below) deserves to be discussed.

2015-10-06 15:15, Avi Kivity:
> On 10/06/2015 10:33 AM, Stephen Hemminger wrote:
> > Other than implementation objections, so far the two main arguments
> > against this reduce to:
> >    1. If you allow UIO ioctl then it opens an API hook for all the crap out
> >       of tree UIO drivers to do what they want.
> >    2. If you allow UIO MSI-X then you are expanding the usage of userspace
> >       device access in an insecure manner.
[...]
> btw, (2) doesn't really add any insecurity. The user could already poke
> at the msix tables (as well as perform DMA); they just couldn't get a
> useful interrupt out of them.
>
> Maybe a module parameter "allow_insecure_dma" can be added to
> uio_pci_generic. Without the parameter, bus mastering and msix is
> disabled, with the parameter it is allowed. This requires the sysadmin
> to take a positive step in order to make use of their hardware.

Giving the control of the memory protection level to the distribution or
the administrator looks like a good idea.
When allowing insecure DMA, a log will make clear how it is supported
-or not- by the system provider.

From another thread:

2015-10-01 14:09, Michael S. Tsirkin:
> If Linux keeps enabling hacks, no one will bother doing the right thing.
> Upstream inclusion is the only carrot Linux has to make people do the
> right thing.

The "right thing" should be guided by the users' needs at a given time.
The "carrot" for a better solution will be to have a well protected system.
On Fri, 16 Oct 2015 19:11:35 +0200
Thomas Monjalon <thomas.monjalon@6wind.com> wrote:

> To sum it up,
> We want to remove the need of the out-of-tree module igb_uio.
> 3 possible implementations were discussed so far:
> - new UIO driver
> - extend uio_pci_generic
> - VFIO without IOMMU

There is recent progress on VFIO without IOMMU. This looks the most
promising long term solution.
diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig
index 52c98ce..04adfa0 100644
--- a/drivers/uio/Kconfig
+++ b/drivers/uio/Kconfig
@@ -93,6 +93,15 @@ config UIO_PCI_GENERIC
 	  primarily, for virtualization scenarios.
 	  If you compile this as a module, it will be called uio_pci_generic.
 
+config UIO_PCI_MSI
+	tristate "Generic driver supporting MSI-x on PCI Express cards"
+	depends on PCI
+	help
+	  Generic driver that provides Message Signalled IRQ events
+	  similar to VFIO. If IOMMMU is available please use VFIO
+	  instead since it provides more security.
+	  If you compile this as a module, it will be called uio_msi.
+
 config UIO_NETX
 	tristate "Hilscher NetX Card driver"
 	depends on PCI
diff --git a/drivers/uio/Makefile b/drivers/uio/Makefile
index 8560dad..62fc44b 100644
--- a/drivers/uio/Makefile
+++ b/drivers/uio/Makefile
@@ -9,3 +9,4 @@ obj-$(CONFIG_UIO_NETX) += uio_netx.o
 obj-$(CONFIG_UIO_PRUSS) += uio_pruss.o
 obj-$(CONFIG_UIO_MF624) += uio_mf624.o
 obj-$(CONFIG_UIO_FSL_ELBC_GPCM) += uio_fsl_elbc_gpcm.o
+obj-$(CONFIG_UIO_PCI_MSI) += uio_msi.o
diff --git a/drivers/uio/uio_msi.c b/drivers/uio/uio_msi.c
new file mode 100644
index 0000000..802b5c4
--- /dev/null
+++ b/drivers/uio/uio_msi.c
@@ -0,0 +1,378 @@
+/*-
+ *
+ * Copyright (c) 2015 by Brocade Communications Systems, Inc.
+ * Author: Stephen Hemminger <stephen@networkplumber.org>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 only.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/device.h>
+#include <linux/interrupt.h>
+#include <linux/eventfd.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/uio_driver.h>
+#include <linux/msi.h>
+#include <linux/uio_msi.h>
+
+#define DRIVER_VERSION		"0.1.1"
+#define MAX_MSIX_VECTORS	64
+
+/* MSI-X vector information */
+struct uio_msi_pci_dev {
+	struct uio_info info;		/* UIO driver info */
+	struct pci_dev *pdev;		/* PCI device */
+	struct mutex	mutex;		/* open/release/ioctl mutex */
+	int		ref_cnt;	/* references to device */
+	unsigned int	max_vectors;	/* MSI-X slots available */
+	struct msix_entry *msix;	/* MSI-X vector table */
+	struct uio_msi_irq_ctx {
+		struct eventfd_ctx *trigger; /* vector to eventfd */
+		char *name;		/* name in /proc/interrupts */
+	} *ctx;
+};
+
+static irqreturn_t uio_intx_irqhandler(int irq, void *arg)
+{
+	struct uio_msi_pci_dev *udev = arg;
+
+	if (pci_check_and_mask_intx(udev->pdev)) {
+		eventfd_signal(udev->ctx->trigger, 1);
+		return IRQ_HANDLED;
+	}
+
+	return IRQ_NONE;
+}
+
+static irqreturn_t uio_msi_irqhandler(int irq, void *arg)
+{
+	struct eventfd_ctx *trigger = arg;
+
+	eventfd_signal(trigger, 1);
+	return IRQ_HANDLED;
+}
+
+/* set the mapping between vector # and existing eventfd. */
+static int set_irq_eventfd(struct uio_msi_pci_dev *udev, u32 vec, int fd)
+{
+	struct eventfd_ctx *trigger;
+	int irq, err;
+
+	if (vec >= udev->max_vectors) {
+		dev_notice(&udev->pdev->dev, "vec %u >= num_vec %u\n",
+			   vec, udev->max_vectors);
+		return -ERANGE;
+	}
+
+	irq = udev->msix[vec].vector;
+	trigger = udev->ctx[vec].trigger;
+	if (trigger) {
+		/* Clearup existing irq mapping */
+		free_irq(irq, trigger);
+		eventfd_ctx_put(trigger);
+		udev->ctx[vec].trigger = NULL;
+	}
+
+	/* Passing -1 is used to disable interrupt */
+	if (fd < 0)
+		return 0;
+
+	trigger = eventfd_ctx_fdget(fd);
+	if (IS_ERR(trigger)) {
+		err = PTR_ERR(trigger);
+		dev_notice(&udev->pdev->dev,
+			   "eventfd ctx get failed: %d\n", err);
+		return err;
+	}
+
+	if (udev->msix)
+		err = request_irq(irq, uio_msi_irqhandler, 0,
+				  udev->ctx[vec].name, trigger);
+	else
+		err = request_irq(irq, uio_intx_irqhandler, IRQF_SHARED,
+				  udev->ctx[vec].name, udev);
+
+	if (err) {
+		dev_notice(&udev->pdev->dev,
+			   "request irq failed: %d\n", err);
+		eventfd_ctx_put(trigger);
+		return err;
+	}
+
+	udev->ctx[vec].trigger = trigger;
+	return 0;
+}
+
+static int
+uio_msi_ioctl(struct uio_info *info, unsigned int cmd, unsigned long arg)
+{
+	struct uio_msi_pci_dev *udev
+		= container_of(info, struct uio_msi_pci_dev, info);
+	struct uio_msi_irq_set hdr;
+	int err;
+
+	switch (cmd) {
+	case UIO_MSI_IRQ_SET:
+		if (copy_from_user(&hdr, (void __user *)arg, sizeof(hdr)))
+			return -EFAULT;
+
+		mutex_lock(&udev->mutex);
+		err = set_irq_eventfd(udev, hdr.vec, hdr.fd);
+		mutex_unlock(&udev->mutex);
+		break;
+	default:
+		err = -EOPNOTSUPP;
+	}
+	return err;
+}
+
+/* Opening the UIO device for first time enables MSI-X */
+static int
+uio_msi_open(struct uio_info *info, struct inode *inode)
+{
+	struct uio_msi_pci_dev *udev
+		= container_of(info, struct uio_msi_pci_dev, info);
+	int err = 0;
+
+	mutex_lock(&udev->mutex);
+	if (udev->ref_cnt++ == 0) {
+		if (udev->msix)
+			err = pci_enable_msix(udev->pdev, udev->msix,
+					      udev->max_vectors);
+	}
+	mutex_unlock(&udev->mutex);
+
+	return err;
+}
+
+/* Last close of the UIO device releases/disables all IRQ's */
+static int
+uio_msi_release(struct uio_info *info, struct inode *inode)
+{
+	struct uio_msi_pci_dev *udev
+		= container_of(info, struct uio_msi_pci_dev, info);
+	int i;
+
+	mutex_lock(&udev->mutex);
+	if (--udev->ref_cnt == 0) {
+		for (i = 0; i < udev->max_vectors; i++) {
+			int irq = udev->msix[i].vector;
+			struct eventfd_ctx *trigger = udev->ctx[i].trigger;
+
+			if (!trigger)
+				continue;
+
+			free_irq(irq, trigger);
+			eventfd_ctx_put(trigger);
+			udev->ctx[i].trigger = NULL;
+		}
+
+		if (udev->msix)
+			pci_disable_msix(udev->pdev);
+	}
+	mutex_unlock(&udev->mutex);
+
+	return 0;
+}
+
+/* Unmap previously ioremap'd resources */
+static void
+release_iomaps(struct uio_mem *mem)
+{
+	int i;
+
+	for (i = 0; i < MAX_UIO_MAPS; i++, mem++) {
+		if (mem->internal_addr)
+			iounmap(mem->internal_addr);
+	}
+}
+
+static int
+setup_maps(struct pci_dev *pdev, struct uio_info *info)
+{
+	int i, m = 0, p = 0, err;
+	static const char * const bar_names[] = {
+		"BAR0",	"BAR1",	"BAR2",	"BAR3",	"BAR4",	"BAR5",
+	};
+
+	for (i = 0; i < ARRAY_SIZE(bar_names); i++) {
+		unsigned long start = pci_resource_start(pdev, i);
+		unsigned long flags = pci_resource_flags(pdev, i);
+		unsigned long len = pci_resource_len(pdev, i);
+
+		if (start == 0 || len == 0)
+			continue;
+
+		if (flags & IORESOURCE_MEM) {
+			void __iomem *addr;
+
+			if (m >= MAX_UIO_MAPS)
+				continue;
+
+			addr = ioremap(start, len);
+			if (addr == NULL) {
+				err = -EINVAL;
+				goto fail;
+			}
+
+			info->mem[m].name = bar_names[i];
+			info->mem[m].addr = start;
+			info->mem[m].internal_addr = addr;
+			info->mem[m].size = len;
+			info->mem[m].memtype = UIO_MEM_PHYS;
+			++m;
+		} else if (flags & IORESOURCE_IO) {
+			if (p >= MAX_UIO_PORT_REGIONS)
+				continue;
+
+			info->port[p].name = bar_names[i];
+			info->port[p].start = start;
+			info->port[p].size = len;
+			info->port[p].porttype = UIO_PORT_X86;
+			++p;
+		}
+	}
+
+	return 0;
+fail:
+	for (i = 0; i < m; i++)
+		iounmap(info->mem[i].internal_addr);
+	return err;
+}
+
+static int uio_msi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+	struct uio_msi_pci_dev *udev;
+	int i, err, vectors;
+
+	udev = kzalloc(sizeof(struct uio_msi_pci_dev), GFP_KERNEL);
+	if (!udev)
+		return -ENOMEM;
+
+	err = pci_enable_device(pdev);
+	if (err != 0) {
+		dev_err(&pdev->dev, "cannot enable PCI device\n");
+		goto fail_free;
+	}
+
+	err = pci_request_regions(pdev, "uio_msi");
+	if (err != 0) {
+		dev_err(&pdev->dev, "Cannot request regions\n");
+		goto fail_disable;
+	}
+
+	pci_set_master(pdev);
+
+	/* remap resources */
+	err = setup_maps(pdev, &udev->info);
+	if (err)
+		goto fail_release_iomem;
+
+	/* fill uio infos */
+	udev->info.name = "uio_msi";
+	udev->info.version = DRIVER_VERSION;
+	udev->info.priv = udev;
+	udev->pdev = pdev;
+	udev->info.ioctl = uio_msi_ioctl;
+	udev->info.open = uio_msi_open;
+	udev->info.release = uio_msi_release;
+	udev->info.irq = UIO_IRQ_CUSTOM;
+	mutex_init(&udev->mutex);
+
+	vectors = pci_msix_vec_count(pdev);
+	if (vectors > 0) {
+		udev->max_vectors = min_t(u16, vectors, MAX_MSIX_VECTORS);
+		dev_info(&pdev->dev, "using up to %u MSI-X vectors\n",
+			udev->max_vectors);
+
+		err = -ENOMEM;
+		udev->msix = kcalloc(udev->max_vectors,
+				     sizeof(struct msix_entry), GFP_KERNEL);
+		if (!udev->msix)
+			goto fail_release_iomem;
+	} else if (!pci_intx_mask_supported(pdev)) {
+		dev_err(&pdev->dev,
+			"device does not support MSI-X or INTX\n");
+		err = -EINVAL;
+		goto fail_release_iomem;
+	} else {
+		dev_notice(&pdev->dev, "using INTX\n");
+		udev->info.irq_flags = IRQF_SHARED;
+		udev->max_vectors = 1;
+	}
+
+	udev->ctx = kcalloc(udev->max_vectors,
+			    sizeof(struct uio_msi_irq_ctx), GFP_KERNEL);
+	if (!udev->ctx)
+		goto fail_free_msix;
+
+	for (i = 0; i < udev->max_vectors; i++) {
+		udev->msix[i].entry = i;
+
+		udev->ctx[i].name = kasprintf(GFP_KERNEL,
+					      KBUILD_MODNAME "[%d](%s)",
+					      i, pci_name(pdev));
+		if (!udev->ctx[i].name)
+			goto fail_free_ctx;
+	}
+
+	/* register uio driver */
+	err = uio_register_device(&pdev->dev, &udev->info);
+	if (err != 0)
+		goto fail_free_ctx;
+
+	pci_set_drvdata(pdev, udev);
+	return 0;
+
+fail_free_ctx:
+	for (i = 0; i < udev->max_vectors; i++)
+		kfree(udev->ctx[i].name);
+	kfree(udev->ctx);
+fail_free_msix:
+	kfree(udev->msix);
+fail_release_iomem:
+	release_iomaps(udev->info.mem);
+	pci_release_regions(pdev);
+fail_disable:
+	pci_disable_device(pdev);
+fail_free:
+	kfree(udev);
+
+	pr_notice("%s ret %d\n", __func__, err);
+	return err;
+}
+
+static void uio_msi_remove(struct pci_dev *pdev)
+{
+	struct uio_info *info = pci_get_drvdata(pdev);
+	struct uio_msi_pci_dev *udev
+		= container_of(info, struct uio_msi_pci_dev, info);
+	int i;
+
+	uio_unregister_device(info);
+	release_iomaps(info->mem);
+
+	pci_release_regions(pdev);
+	for (i = 0; i < udev->max_vectors; i++)
+		kfree(udev->ctx[i].name);
+	kfree(udev->ctx);
+	kfree(udev->msix);
+	pci_disable_device(pdev);
+
+	pci_set_drvdata(pdev, NULL);
+	kfree(udev);
+}
+
+static struct pci_driver uio_msi_pci_driver = {
+	.name = "uio_msi",
+	.probe = uio_msi_probe,
+	.remove = uio_msi_remove,
+};
+
+module_pci_driver(uio_msi_pci_driver);
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Stephen Hemminger <stephen@networkplumber.org>");
+MODULE_DESCRIPTION("UIO driver for MSI PCI devices");
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index f7b2db4..d9497691 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -411,6 +411,7 @@ header-y += udp.h
 header-y += uhid.h
 header-y += uinput.h
 header-y += uio.h
+header-y += uio_msi.h
 header-y += ultrasound.h
 header-y += un.h
 header-y += unistd.h
diff --git a/include/uapi/linux/uio_msi.h b/include/uapi/linux/uio_msi.h
new file mode 100644
index 0000000..297de00
--- /dev/null
+++ b/include/uapi/linux/uio_msi.h
@@ -0,0 +1,22 @@
+/*
+ * UIO_MSI API definition
+ *
+ * Copyright (c) 2015 by Brocade Communications Systems, Inc.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#ifndef _UIO_PCI_MSI_H
+#define _UIO_PCI_MSI_H
+
+struct uio_msi_irq_set {
+	u32 vec;
+	int fd;
+};
+
+#define UIO_MSI_BASE	0x86
+#define UIO_MSI_IRQ_SET	_IOW('I', UIO_MSI_BASE+1, struct uio_msi_irq_set)
+
+#endif
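
For illustration, the userspace side of the `UIO_MSI_IRQ_SET` ioctl proposed in this patch might look roughly like the sketch below. It assumes the `uio_msi.h` header is not installed, so the structure and ioctl number are copied locally with `uint32_t` standing in for the kernel's `u32`; `uio_msi_bind_vector` and `uio_msi_unbind_vector` are hypothetical wrapper names, and the UIO device is assumed to have been opened by the caller:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Local copy of the definitions from the proposed uio_msi.h. */
struct uio_msi_irq_set {
	uint32_t vec;	/* vector index, below the driver's max_vectors */
	int fd;		/* eventfd to signal, or -1 to disable the vector */
};

static_assert(sizeof(struct uio_msi_irq_set) == 8,
	      "layout must match the kernel's u32 + int");

#define UIO_MSI_BASE	0x86
#define UIO_MSI_IRQ_SET	_IOW('I', UIO_MSI_BASE + 1, struct uio_msi_irq_set)

/*
 * Create an eventfd and ask the driver to signal it for vector vec.
 * Returns the eventfd on success, -1 on failure (errno set by the
 * failing call).
 */
static int uio_msi_bind_vector(int uio_fd, uint32_t vec)
{
	struct uio_msi_irq_set req;
	int efd = eventfd(0, 0);

	if (efd < 0)
		return -1;

	req.vec = vec;
	req.fd = efd;
	if (ioctl(uio_fd, UIO_MSI_IRQ_SET, &req) < 0) {
		close(efd);
		return -1;
	}
	return efd;
}

/* Passing fd = -1 tells the driver to tear down the vector's mapping. */
static int uio_msi_unbind_vector(int uio_fd, uint32_t vec)
{
	struct uio_msi_irq_set req = { .vec = vec, .fd = -1 };

	return ioctl(uio_fd, UIO_MSI_IRQ_SET, &req);
}
```

After a successful bind, the application would block in `read()` (or poll) on the returned eventfd; each 8-byte read yields the number of interrupts signalled since the previous read.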