summaryrefslogtreecommitdiff
path: root/drivers/vfio/pci
AgeCommit message (Collapse)Author
2018-08-16Merge tag 'vfio-v4.19-rc1' of git://github.com/awilliam/linux-vfioLinus Torvalds
Pull VFIO updates from Alex Williamson: - mark switch fall-through cases (Gustavo A. R. Silva) - disable binding SR-IOV enabled PFs (Alex Williamson) * tag 'vfio-v4.19-rc1' of git://github.com/awilliam/linux-vfio: vfio-pci: Disable binding to PFs with SR-IOV enabled vfio: Mark expected switch fall-throughs
2018-08-16Merge tag 'pci-v4.19-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull pci updates from Bjorn Helgaas: - Decode AER errors with names similar to "lspci" (Tyler Baicar) - Expose AER statistics in sysfs (Rajat Jain) - Clear AER status bits selectively based on the type of recovery (Oza Pawandeep) - Honor "pcie_ports=native" even if HEST sets FIRMWARE_FIRST (Alexandru Gagniuc) - Don't clear AER status bits if we're using the "Firmware-First" strategy where firmware owns the registers (Alexandru Gagniuc) - Use sysfs_match_string() to simplify ASPM sysfs parsing (Andy Shevchenko) - Remove unnecessary includes of <linux/pci-aspm.h> (Bjorn Helgaas) - Defer DPC event handling to work queue (Keith Busch) - Use threaded IRQ for DPC bottom half (Keith Busch) - Print AER status while handling DPC events (Keith Busch) - Work around IDT switch ACS Source Validation erratum (James Puthukattukaran) - Emit diagnostics for all cases of PCIe Link downtraining (Links operating slower than they're capable of) (Alexandru Gagniuc) - Skip VFs when configuring Max Payload Size (Myron Stowe) - Reduce Root Port Max Payload Size if necessary when hot-adding a device below it (Myron Stowe) - Simplify SHPC existence/permission checks (Bjorn Helgaas) - Remove hotplug sample skeleton driver (Lukas Wunner) - Convert pciehp to threaded IRQ handling (Lukas Wunner) - Improve pciehp tolerance of missed events and initially unstable links (Lukas Wunner) - Clear spurious pciehp events on resume (Lukas Wunner) - Add pciehp runtime PM support, including for Thunderbolt controllers (Lukas Wunner) - Support interrupts from pciehp bridges in D3hot (Lukas Wunner) - Mark fall-through switch cases before enabling -Wimplicit-fallthrough (Gustavo A. R. Silva) - Move DMA-debug PCI init from arch code to PCI core (Christoph Hellwig) - Fix pci_request_irq() usage of IRQF_ONESHOT when no handler is supplied (Heiner Kallweit) - Unify PCI and DMA direction #defines (Shunyong Yang) - Add PCI_DEVICE_DATA() macro (Andy Shevchenko) - Check for VPD completion before checking for timeout (Bert Kenward) - Limit Netronome NFP5000 config space size to work around erratum (Jakub Kicinski) - Set IRQCHIP_ONESHOT_SAFE for PCI MSI irqchips (Heiner Kallweit) - Document ACPI description of PCI host bridges (Bjorn Helgaas) - Add "pci=disable_acs_redir=" parameter to disable ACS redirection for peer-to-peer DMA support (we don't have the peer-to-peer support yet; this is just one piece) (Logan Gunthorpe) - Clean up devm_of_pci_get_host_bridge_resources() resource allocation (Jan Kiszka) - Fixup resizable BARs after suspend/resume (Christian König) - Make "pci=earlydump" generic (Sinan Kaya) - Fix ROM BAR access routines to stay in bounds and check for signature correctly (Rex Zhu) - Add DMA alias quirk for Microsemi Switchtec NTB (Doug Meyer) - Expand documentation for pci_add_dma_alias() (Logan Gunthorpe) - To avoid bus errors, enable PASID only if entire path supports End-End TLP prefixes (Sinan Kaya) - Unify slot and bus reset functions and remove hotplug knowledge from callers (Sinan Kaya) - Add Function-Level Reset quirks for Intel and Samsung NVMe devices to fix guest reboot issues (Alex Williamson) - Add function 1 DMA alias quirk for Marvell 88SS9183 PCIe SSD Controller (Bjorn Helgaas) - Remove Xilinx AXI-PCIe host bridge arch dependency (Palmer Dabbelt) - Remove Aardvark outbound window configuration (Evan Wang) - Fix Aardvark bridge window sizing issue (Zachary Zhang) - Convert Aardvark to use pci_host_probe() to reduce code duplication (Thomas Petazzoni) - Correct the Cadence cdns_pcie_writel() signature (Alan Douglas) - Add Cadence support for optional generic PHYs (Alan Douglas) - Add Cadence power management ops (Alan Douglas) - Remove redundant variable from Cadence driver (Colin Ian King) - Add Kirin MSI support (Xiaowei Song) - Drop unnecessary root_bus_nr setting from exynos, imx6, keystone, armada8k, artpec6, designware-plat, histb, qcom, spear13xx (Shawn Guo) - Move link notification settings from DesignWare core to individual drivers (Gustavo Pimentel) - Add endpoint library MSI-X interfaces (Gustavo Pimentel) - Correct signature of endpoint library IRQ interfaces (Gustavo Pimentel) - Add DesignWare endpoint library MSI-X callbacks (Gustavo Pimentel) - Add endpoint library MSI-X test support (Gustavo Pimentel) - Remove unnecessary GFP_ATOMIC from Hyper-V "new child" allocation (Jia-Ju Bai) - Add more devices to Broadcom PAXC quirk (Ray Jui) - Work around corrupted Broadcom PAXC config space to enable SMMU and GICv3 ITS (Ray Jui) - Disable MSI parsing to work around broken Broadcom PAXC logic in some devices (Ray Jui) - Hide unconfigured functions to work around a Broadcom PAXC defect (Ray Jui) - Lower iproc log level to reduce console output during boot (Ray Jui) - Fix mobiveil iomem/phys_addr_t type usage (Lorenzo Pieralisi) - Fix mobiveil missing include file (Lorenzo Pieralisi) - Add mobiveil Kconfig/Makefile support (Lorenzo Pieralisi) - Fix mvebu I/O space remapping issues (Thomas Petazzoni) - Use generic pci_host_bridge in mvebu instead of ARM-specific API (Thomas Petazzoni) - Whitelist VMD devices with fast interrupt handlers to avoid sharing vectors with slow handlers (Keith Busch) * tag 'pci-v4.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (153 commits) PCI/AER: Don't clear AER bits if error handling is Firmware-First PCI: Limit config space size for Netronome NFP5000 PCI/MSI: Set IRQCHIP_ONESHOT_SAFE for PCI-MSI irqchips PCI/VPD: Check for VPD access completion before checking for timeout PCI: Add PCI_DEVICE_DATA() macro to fully describe device ID entry PCI: Match Root Port's MPS to endpoint's MPSS as necessary PCI: Skip MPS logic for Virtual Functions (VFs) PCI: Add function 1 DMA alias quirk for Marvell 88SS9183 PCI: Check for PCIe Link downtraining PCI: Add ACS Redirect disable quirk for Intel Sunrise Point PCI: Add device-specific ACS Redirect disable infrastructure PCI: Convert device-specific ACS quirks from NULL termination to ARRAY_SIZE PCI: Add "pci=disable_acs_redir=" parameter for peer-to-peer support PCI: Allow specifying devices using a base bus and path of devfns PCI: Make specifying PCI devices in kernel parameters reusable PCI: Hide ACS quirk declarations inside PCI core PCI: Delay after FLR of Intel DC P3700 NVMe PCI: Disable Samsung SM961/PM961 NVMe before FLR PCI: Export pcie_has_flr() PCI: mvebu: Drop bogus comment above mvebu_pcie_map_registers() ...
2018-08-06vfio-pci: Disable binding to PFs with SR-IOV enabledAlex Williamson
We expect to receive PFs with SR-IOV disabled, however some host drivers leave SR-IOV enabled at unbind. This puts us in a state where we can potentially assign both the PF and the VF, leading to both functionality as well as security concerns due to lack of managing the SR-IOV state as well as vendor dependent isolation from the PF to VF. If we were to attempt to actively disable SR-IOV on driver probe, we risk VF bound drivers blocking, potentially risking live lock scenarios. Therefore simply refuse to bind to PFs with SR-IOV enabled with a warning message indicating the issue. Users can resolve this by re-binding to the host driver and disabling SR-IOV before attempting to use the device with vfio-pci. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-08-06vfio: Mark expected switch fall-throughsGustavo A. R. Silva
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-07-19PCI: Rename pci_try_reset_bus() to pci_reset_bus()Sinan Kaya
Now that the old implementation of pci_reset_bus() is gone, replace pci_try_reset_bus() with pci_reset_bus(). Compared to the old implementation, new code will fail immmediately with -EAGAIN if object lock cannot be obtained. Signed-off-by: Sinan Kaya <okaya@codeaurora.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2018-07-19PCI: Unify try slot and bus reset APISinan Kaya
Drivers are expected to call pci_try_reset_slot() or pci_try_reset_bus() by querying if a system supports hotplug or not. A survey showed that most drivers don't do this and we are leaking hotplug capability to the user. Hide pci_try_slot_reset() from drivers and embed into pci_try_bus_reset(). Change pci_try_reset_bus() parameter from struct pci_bus to struct pci_dev. Signed-off-by: Sinan Kaya <okaya@codeaurora.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2018-07-18vfio/pci: Fix potential Spectre v1Gustavo A. R. Silva
info.index can be indirectly controlled by user-space, hence leading to a potential exploitation of the Spectre variant 1 vulnerability. This issue was detected with the help of Smatch: drivers/vfio/pci/vfio_pci.c:734 vfio_pci_ioctl() warn: potential spectre issue 'vdev->region' Fix this by sanitizing info.index before indirectly using it to index vdev->region Notice that given that speculation windows are large, the policy is to kill the speculation on the first load and not worry if it can be completed with a dependent load/store [1]. [1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2 Cc: stable@vger.kernel.org Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-06-18vfio/pci: Make IGD support a configurable optionAlex Williamson
Allow the code which provides extensions to support direct assignment of Intel IGD (GVT-d) to be compiled out of the kernel if desired. The config option for this was previously automatically enabled on X86, therefore the default remains Y. This simply provides the option to disable it even for X86. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-04-06Merge tag 'vfio-v4.17-rc1' of git://github.com/awilliam/linux-vfioLinus Torvalds
Pull VFIO updates from Alex Williamson: - Adopt iommu_unmap_fast() interface to type1 backend (Suravee Suthikulpanit) - mdev sample driver fixup (Shunyong Yang) - More efficient PFN mapping handling in type1 backend (Jason Cai) - VFIO device ioeventfd interface (Alex Williamson) - Tag new vfio-platform sub-maintainer (Alex Williamson) * tag 'vfio-v4.17-rc1' of git://github.com/awilliam/linux-vfio: MAINTAINERS: vfio/platform: Update sub-maintainer vfio/pci: Add ioeventfd support vfio/pci: Use endian neutral helpers vfio/pci: Pull BAR mapping setup from read-write path vfio/type1: Improve memory pinning process for raw PFN mapping vfio-mdev/samples: change RDI interrupt condition vfio/type1: Adopt fast IOTLB flush interface when unmap IOVAs
2018-03-26vfio/pci: Add ioeventfd supportAlex Williamson
The ioeventfd here is actually irqfd handling of an ioeventfd such as supported in KVM. A user is able to pre-program a device write to occur when the eventfd triggers. This is yet another instance of eventfd-irqfd triggering between KVM and vfio. The impetus for this is high frequency writes to pages which are virtualized in QEMU. Enabling this near-direct write path for selected registers within the virtualized page can improve performance and reduce overhead. Specifically this is initially targeted at NVIDIA graphics cards where the driver issues a write to an MMIO register within a virtualized region in order to allow the MSI interrupt to re-trigger. Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-03-26vfio/pci: Use endian neutral helpersAlex Williamson
The iowriteXX/ioreadXX functions assume little endian hardware and convert to little endian on a write and from little endian on a read. We currently do our own explicit conversion to negate this. Instead, add some endian dependent defines to avoid all byte swaps. There should be no functional change other than big endian systems aren't penalized with wasted swaps. Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-03-26vfio/pci: Pull BAR mapping setup from read-write pathAlex Williamson
This creates a common helper that we'll use for ioeventfd setup. Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-03-21Revert: "vfio-pci: Mask INTx if a device is not capabable of enabling it"Alex Williamson
This reverts commit 2170dd04316e0754cbbfa4892a25aead39d225f7 The intent of commit 2170dd04316e ("vfio-pci: Mask INTx if a device is not capabable of enabling it") was to disallow the user from seeing that the device supports INTx if the platform is incapable of enabling it. The detection of this case however incorrectly includes devices which natively do not support INTx, such as SR-IOV VFs, and further discussions reveal gaps even for the target use case. Reported-by: Arjun Vynipadath <arjun@chelsio.com> Fixes: 2170dd04316e ("vfio-pci: Mask INTx if a device is not capabable of enabling it") Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-12-20vfio-pci: Allow mapping MSIX BARAlexey Kardashevskiy
By default VFIO disables mapping of MSIX BAR to the userspace as the userspace may program it in a way allowing spurious interrupts; instead the userspace uses the VFIO_DEVICE_SET_IRQS ioctl. In order to eliminate guessing from the userspace about what is mmapable, VFIO also advertises a sparse list of regions allowed to mmap. This works fine as long as the system page size equals to the MSIX alignment requirement which is 4KB. However with a bigger page size the existing code prohibits mapping non-MSIX parts of a page with MSIX structures so these parts have to be emulated via slow reads/writes on a VFIO device fd. If these emulated bits are accessed often, this has serious impact on performance. This allows mmap of the entire BAR containing MSIX vector table. This removes the sparse capability for PCI devices as it becomes useless. As the userspace needs to know for sure whether mmapping of the MSIX vector containing data can succeed, this adds a new capability - VFIO_REGION_INFO_CAP_MSIX_MAPPABLE - which explicitly tells the userspace that the entire BAR can be mmapped. This does not touch the MSIX mangling in the BAR read/write handlers as we are doing this just to enable direct access to non MSIX registers. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [aw - fixup whitespace, trim function name] Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-12-20vfio: Simplify capability helperAlex Williamson
The vfio_info_add_capability() helper requires the caller to pass a capability ID, which it then uses to fill in header fields, assuming hard coded versions. This makes for an awkward and rigid interface. The only thing we want this helper to do is allocate sufficient space in the caps buffer and chain this capability into the list. Reduce it to that simple task. Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Zhenyu Wang <zhenyuw@linux.intel.com> Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-12-20vfio-pci: Mask INTx if a device is not capabable of enabling itAlexey Kardashevskiy
At the moment VFIO rightfully assumes that INTx is supported if the interrupt pin is not set to zero in the device config space. However if that is not the case (the pin is not zero but pdev->irq is), vfio_intx_enable() fails. In order to prevent the userspace from trying to enable INTx when we know that it cannot work, let's mask the PCI_INTERRUPT_PIN register. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-10-02vfio/pci: Virtualize Maximum Read Request SizeAlex Williamson
MRRS defines the maximum read request size a device is allowed to make. Drivers will often increase this to allow more data transfer with a single request. Completions to this request are bound by the MPS setting for the bus. Aside from device quirks (none known), it doesn't seem to make sense to set an MRRS value less than MPS, yet this is a likely scenario given that user drivers do not have a system-wide view of the PCI topology. Virtualize MRRS such that the user can set MRRS >= MPS, but use MPS as the floor value that we'll write to hardware. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-10-02vfio/pci: Virtualize Maximum Payload SizeAlex Williamson
With virtual PCI-Express chipsets, we now see userspace/guest drivers trying to match the physical MPS setting to a virtual downstream port. Of course a lone physical device surrounded by virtual interconnects cannot make a correct decision for a proper MPS setting. Instead, let's virtualize the MPS control register so that writes through to hardware are disallowed. Userspace drivers like QEMU assume they can write anything to the device and we'll filter out anything dangerous. Since mismatched MPS can lead to AER and other faults, let's add it to the kernel side rather than relying on userspace virtualization to handle it. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com>
2017-07-27vfio/pci: Fix handling of RC integrated endpoint PCIe capability sizeAlex Williamson
Root complex integrated endpoints do not have a link and therefore may use a smaller PCIe capability in config space than we expect when building our config map. Add a case for these to avoid reporting an erroneous overlap. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-07-26vfio/pci: Use pci_try_reset_function() on initial openAlex Williamson
Device lock bites again; if a device .remove() callback races a user calling ioctl(VFIO_GROUP_GET_DEVICE_FD), the unbind request will hold the device lock, but the user ioctl may have already taken a vfio_device reference. In the case of a PCI device, the initial open will attempt to reset the device, which again attempts to get the device lock, resulting in deadlock. Use the trylock PCI reset interface and return error on the open path if reset fails due to lock contention. Link: https://lkml.org/lkml/2017/7/25/381 Reported-by: Wen Congyang <wencongyang2@huawei.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-06-13vfio/pci: Add Intel XXV710 to hidden INTx devicesAlex Williamson
XXV710 has the same broken INTx behavior as the rest of the X/XL710 series, the interrupt status register is not wired to report pending INTx interrupts, thus we never associate the interrupt to the device. Extend the device IDs to include these so that we hide that the device supports INTx at all to the user. Reported-by: Stefan Assmann <sassmann@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
2017-01-04vfio-pci: Handle error from pci_iomapArvind Yadav
Here, pci_iomap can fail, handle this case release selected pci regions and return -ENOMEM. Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-12-30vfio-pci: use 32-bit comparisons for register address for gcc-4.5Arnd Bergmann
Using ancient compilers (gcc-4.5 or older) on ARM, we get a link failure with the vfio-pci driver: ERROR: "__aeabi_lcmp" [drivers/vfio/pci/vfio-pci.ko] undefined! The reason is that the compiler tries to do a comparison of a 64-bit range. This changes it to convert to a 32-bit number explicitly first, as newer compilers do for themselves. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-12-15Merge tag 'pci-v4.10-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI updates from Bjorn Helgaas: "PCI changes: - add support for PCI on ARM64 boxes with ACPI. We already had this for theoretical spec-compliant hardware; now we're adding quirks for the actual hardware (Cavium, HiSilicon, Qualcomm, X-Gene) - add runtime PM support for hotplug ports - enable runtime suspend for Intel UHCI that uses platform-specific wakeup signaling - add yet another host bridge registration interface. We hope this is extensible enough to subsume the others - expose device revision in sysfs for DRM - to avoid device conflicts, make sure any VF BAR updates are done before enabling the VF - avoid unnecessary link retrains for ASPM - allow INTx masking on Mellanox devices that support it - allow access to non-standard VPD for Chelsio devices - update Broadcom iProc support for PAXB v2, PAXC v2, inbound DMA, etc - update Rockchip support for max-link-speed - add NVIDIA Tegra210 support - add Layerscape LS1046a support - update R-Car compatibility strings - add Qualcomm MSM8996 support - remove some uninformative bootup messages" * tag 'pci-v4.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (115 commits) PCI: Enable access to non-standard VPD for Chelsio devices (cxgb3) PCI: Expand "VPD access disabled" quirk message PCI: pciehp: Remove loading message PCI: hotplug: Remove hotplug core message PCI: Remove service driver load/unload messages PCI/AER: Log AER IRQ when claiming Root Port PCI/AER: Log errors with PCI device, not PCIe service device PCI/AER: Remove unused version macros PCI/PME: Log PME IRQ when claiming Root Port PCI/PME: Drop unused support for PMEs from Root Complex Event Collectors PCI: Move config space size macros to pci_regs.h x86/platform/intel-mid: Constify mid_pci_platform_pm PCI/ASPM: Don't retrain link if ASPM not possible PCI: iproc: Skip check for legacy IRQ on PAXC buses PCI: pciehp: Leave power indicator on when enabling already-enabled slot PCI: pciehp: Prioritize data-link event over presence detect PCI: rcar: Add gen3 fallback compatibility string for pcie-rcar PCI: rcar: Use gen2 fallback compatibility last PCI: rcar-gen2: Use gen2 fallback compatibility last PCI: rockchip: Move the deassert of pm/aclk/pclk after phy_init() ..
2016-12-12PCI: Move config space size macros to pci_regs.hWang Sheng-Hui
Move PCI configuration space size macros (PCI_CFG_SPACE_SIZE and PCI_CFG_SPACE_EXP_SIZE) from drivers/pci/pci.h to include/uapi/linux/pci_regs.h so they can be used by more drivers and eliminate duplicate definitions. [bhelgaas: Expand comment to include PCI-X details] Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2016-11-18vfio/pci: Drop unnecessary pcibios_err_to_errno()Cao jin
As of commit d97ffe236894 ("PCI: Fix return value from pci_user_{read,write}_config_*()") it's unnecessary to call pcibios_err_to_errno() to fixup the return value from these functions. pcibios_err_to_errno() already does simple passthrough of -errno values, therefore no functional change is expected. [aw: changelog] Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17vfio_pci: Updated to use vfio_set_irqs_validate_and_prepare()Kirti Wankhede
Updated vfio_pci.c file to use vfio_set_irqs_validate_and_prepare() Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com> Signed-off-by: Neo Jia <cjia@nvidia.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-11-17vfio_pci: Update vfio_pci to use vfio_info_add_capability()Kirti Wankhede
Update msix_sparse_mmap_cap() to use vfio_info_add_capability() Update region type capability to use vfio_info_add_capability() Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com> Signed-off-by: Neo Jia <cjia@nvidia.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-10-26vfio/pci: Fix integer overflows, bitmask checkVlad Tsyrklevich
The VFIO_DEVICE_SET_IRQS ioctl did not sufficiently sanitize user-supplied integers, potentially allowing memory corruption. This patch adds appropriate integer overflow checks, checks the range bounds for VFIO_IRQ_SET_DATA_NONE, and also verifies that only single element in the VFIO_IRQ_SET_DATA_TYPE_MASK bitmask is set. VFIO_IRQ_SET_ACTION_TYPE_MASK is already correctly checked later in vfio_pci_set_irqs_ioctl(). Furthermore, a kzalloc is changed to a kcalloc because the use of a kzalloc with an integer multiplication allowed an integer overflow condition to be reached without this patch. kcalloc checks for overflow and should prevent a similar occurrence. Signed-off-by: Vlad Tsyrklevich <vlad@tsyrklevich.net> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-09-29vfio_pci: use pci_alloc_irq_vectorsChristoph Hellwig
Simplify the interrupt setup by using the new PCI layer helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-09-26vfio-pci: Disable INTx after MSI/X teardownAlex Williamson
The MSI/X shutdown path can gratuitously enable INTx, which is not something we want to happen if we're dealing with broken INTx device. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-09-26vfio-pci: Virtualize PCIe & AF FLRAlex Williamson
We use a BAR restore trick to try to detect when a user has performed a device reset, possibly through FLR or other backdoors, to put things back into a working state. This is important for backdoor resets, but we can actually just virtualize the "front door" resets provided via PCIe and AF FLR. Set these bits as virtualized + writable, allowing the default write to set them in vconfig, then we can simply check the bit, perform an FLR of our own, and clear the bit. We don't actually have the granularity in PCI to specify the type of reset we want to do, but generally devices don't implement both PCIe and AF FLR and we'll favor these over other types of reset, so we should generally lineup. We do test whether the device provides the requested FLR type to stay consistent with hardware capabilities though. This seems to fix several instance of devices getting into bad states with userspace drivers, like dpdk, running inside a VM. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Greg Rose <grose@lightfleet.com>
2016-08-29vfio/pci: Fix typos in commentsWei Jiangang
Signed-off-by: Wei Jiangang <weijg.fnst@cn.fujitsu.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-08-08vfio/pci: Fix NULL pointer oops in error interrupt setup handlingAlex Williamson
There are multiple cases in vfio_pci_set_ctx_trigger_single() where we assume we can safely read from our data pointer without actually checking whether the user has passed any data via the count field. VFIO_IRQ_SET_DATA_NONE in particular is entirely broken since we attempt to pull an int32_t file descriptor out before even checking the data type. The other data types assume the data pointer contains one element of their type as well. In part this is good news because we were previously restricted from doing much sanitization of parameters because it was missed in the past and we didn't want to break existing users. Clearly DATA_NONE is completely broken, so it must not have any users and we can fix it up completely. For DATA_BOOL and DATA_EVENTFD, we'll just protect ourselves, returning error when count is zero since we previously would have oopsed. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reported-by: Chris Thompson <the_cartographer@hotmail.com> Cc: stable@vger.kernel.org Reviewed-by: Eric Auger <eric.auger@redhat.com>
2016-07-08vfio-pci: Allow to mmap sub-page MMIO BARs if the mmio page is exclusiveYongji Xie
Current vfio-pci implementation disallows to mmap sub-page(size < PAGE_SIZE) MMIO BARs because these BARs' mmio page may be shared with other BARs. This will cause some performance issues when we passthrough a PCI device with this kind of BARs. Guest will be not able to handle the mmio accesses to the BARs which leads to mmio emulations in host. However, not all sub-page BARs will share page with other BARs. We should allow to mmap the sub-page MMIO BARs which we can make sure will not share page with other BARs. This patch adds support for this case. And we try to add a dummy resource to reserve the remainder of the page which hot-add device's BAR might be assigned into. But it's not necessary to handle the case when the BAR is not page aligned. Because we can't expect the BAR will be assigned into the same location in a page in guest when we passthrough the BAR. And it's hard to access this BAR in userspace because we have no way to get the BAR's location in a page. Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-05-31vfio/pci: Allow VPD short readAlex Williamson
The size of the VPD area is not necessarily 4-byte aligned, so a pci_vpd_read() might return less than 4 bytes. Zero our buffer and accept anything other than an error. Intel X710 NICs exercise this. Fixes: 4e1a635552d3 ("vfio/pci: Use kernel VPD access functions") Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-05-30vfio/pci: Fix ordering of eventfd vs virqfd shutdownAlex Williamson
Both the INTx and MSI/X disable paths do an eventfd_ctx_put() for the trigger eventfd before calling vfio_virqfd_disable() any potential mask and unmask eventfds. This opens a use-after-free race where an inopportune irqfd can reference the freed signalling eventfd. Reorder to avoid this possibility. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-05-19vfio_pci: Test for extended capabilities if config space > 256 bytesAlexey Kardashevskiy
PCI-Express spec says that reading 4 bytes at offset 100h should return zero if there is no extended capability so VFIO reads this dword to know if there are extended capabilities. However it is not always possible to access the extended space so generic PCI code in pci_cfg_space_size_ext() checks if pci_read_config_dword() can read beyond 100h and if the check fails, it sets the config space size to 100h. VFIO does its own extended capabilities check by reading at offset 100h which may produce 0xffffffff which VFIO treats as the extended config space presense and calls vfio_ecap_init() which fails to parse capabilities (which is expected) but right before the exit, it writes zero at offset 100h which is beyond the buffer allocated for vdev->vconfig (which is 256 bytes) which leads to random memory corruption. This makes VFIO only check for the extended capabilities if the discovered config size is more than 256 bytes. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-04-28vfio/pci: Add test for BAR restoreAlex Williamson
If a device is reset without the memory or i/o bits enabled in the command register we may not detect it, potentially leaving the device without valid BAR programming. Add an additional test to check the BARs on each write to the command register. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-04-28vfio/pci: Hide broken INTx support from userAlex Williamson
INTx masking has two components, the first is that we need the ability to prevent the device from continuing to assert INTx. This is provided via the DisINTx bit in the command register and is the only thing we can really probe for when testing if INTx masking is supported. The second component is that the device needs to indicate if INTx is asserted via the interrupt status bit in the device status register. With these two features we can generically determine if one of the devices we own is asserting INTx, signal the user, and mask the interrupt while the user services the device. Generally if one or both of these components is broken we resort to APIC level interrupt masking, which requires an exclusive interrupt since we have no way to determine the source of the interrupt in a shared configuration. This often makes it difficult or impossible to configure the system for userspace use of the device, for an interrupt mode that the user may not need. One possible configuration of broken INTx masking is that the DisINTx support is fully functional, but the interrupt status bit never signals interrupt assertion. In this case we do have the ability to prevent the device from asserting INTx, but lack the ability to identify the interrupt source. For this case we can simply pretend that the device lacks INTx support entirely, keeping DisINTx set on the physical device, virtualizing this bit for the user, and virtualizing the interrupt pin register to indicate no INTx support. We already support virtualization of the DisINTx bit and already virtualize the interrupt pin for platforms without INTx support. By tying these components together, setting DisINTx on open and reset, and identifying devices broken in this particular way, we can provide support for them w/o the handicap of APIC level INTx masking. Intel i40e (XL710/X710) 10/20/40GbE NICs have been identified as being broken in this specific way. We leave the vfio-pci.nointxmask option as a mechanism to bypass this support, enabling INTx on the device with all the requirements of APIC level masking. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Cc: John Ronciak <john.ronciak@intel.com> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
2016-03-17Merge tag 'vfio-v4.6-rc1' of git://github.com/awilliam/linux-vfioLinus Torvalds
Pull VFIO updates from Alex Williamson: "Various enablers for assignment of Intel graphics devices and future support of vGPU devices (Alex Williamson). This includes - Handling the vfio type1 interface as an API rather than a specific implementation, allowing multiple type1 providers. - Capability chains, similar to PCI device capabilities, that allow extending ioctls. Extensions here include device specific regions and sparse mmap descriptions. The former is used to expose non-PCI regions for IGD, including the OpRegion (particularly the Video BIOS Table), and read only PCI config access to the host and LPC bridge as drivers often depend on identifying those devices. Sparse mmaps here are used to describe the MSIx vector table, which vfio has always protected from mmap, but never had an API to explicitly define that protection. In future vGPU support this is expected to allow the description of PCI BARs that may mix direct access and emulated access within a single region. - The ability to expose the shadow ROM as an option ROM as IGD use cases may rely on the ROM even though the physical device does not make use of a PCI option ROM BAR" * tag 'vfio-v4.6-rc1' of git://github.com/awilliam/linux-vfio: vfio/pci: return -EFAULT if copy_to_user fails vfio/pci: Expose shadow ROM as PCI option ROM vfio/pci: Intel IGD host and LCP bridge config space access vfio/pci: Intel IGD OpRegion support vfio/pci: Enable virtual register in PCI config space vfio/pci: Add infrastructure for additional device specific regions vfio: Define device specific region type capability vfio/pci: Include sparse mmap capability for MSI-X table regions vfio: Define sparse mmap capability for regions vfio: Add capability chain helpers vfio: Define capability chains vfio: If an IOMMU backend fails, keep looking vfio/pci: Fix unsigned comparison overflow
2016-02-28vfio: fix ioctl error handlingMichael S. Tsirkin
Calling return copy_to_user(...) in an ioctl will not do the right thing if there's a pagefault: copy_to_user returns the number of bytes not copied in this case. Fix up vfio to do return copy_to_user(...)) ? -EFAULT : 0; everywhere. Cc: stable@vger.kernel.org Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-02-25vfio/pci: return -EFAULT if copy_to_user failsDan Carpenter
The copy_to_user() function returns the number of bytes that were not copied but we want to return -EFAULT on error here. Fixes: 188ad9d6cbbc ('vfio/pci: Include sparse mmap capability for MSI-X table regions') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-02-22vfio/pci: Expose shadow ROM as PCI option ROMAlex Williamson
Integrated graphics may have their ROM shadowed at 0xc0000 rather than implement a PCI option ROM. Make this ROM appear to the user using the ROM BAR. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-02-22vfio/pci: Intel IGD host and LCP bridge config space accessAlex Williamson
Provide read-only access to PCI config space of the PCI host bridge and LPC bridge through device specific regions. This may be used to configure a VM with matching register contents to satisfy driver requirements. Providing this through the vfio file descriptor removes an additional userspace requirement for access through pci-sysfs and removes the CAP_SYS_ADMIN requirement that doesn't appear to apply to the specific devices we're accessing. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-02-22vfio/pci: Intel IGD OpRegion supportAlex Williamson
This is the first consumer of vfio device specific resource support, providing read-only access to the OpRegion for Intel graphics devices. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-02-22vfio/pci: Enable virtual register in PCI config spaceAlex Williamson
Typically config space for a device is mapped out into capability specific handlers and unassigned space. The latter allows direct read/write access to config space. Sometimes we know about registers living in this void space and would like an easy way to virtualize them, similar to how BAR registers are managed. To do this, create one more pseudo (fake) PCI capability to be handled as purely virtual space. Reads and writes are serviced entirely from virtual config space. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-02-22vfio/pci: Add infrastructure for additional device specific regionsAlex Williamson
Add support for additional regions with indexes started after the already defined fixed regions. Device specific code can register these regions with the new vfio_pci_register_dev_region() function. The ops structure per region currently only includes read/write access and a release function, allowing automatic cleanup when the device is closed. mmap support is only missing here because it's not needed by the first user queued for this support. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-02-22vfio/pci: Include sparse mmap capability for MSI-X table regionsAlex Williamson
vfio-pci has never allowed the user to directly mmap the MSI-X vector table, but we've always relied on implicit knowledge of the user that they cannot do this. Now that we have capability chains that we can expose in the region info ioctl and a sparse mmap capability that represents the sub-areas within the region that can be mmap'd, we can make the mmap constraints more explicit. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-02-22vfio/pci: Fix unsigned comparison overflowAlex Williamson
Signed versus unsigned comparisons are implicitly cast to unsigned, which result in a couple possible overflows. For instance (start + count) might overflow and wrap, getting through our validation test. Also when unwinding setup, -1 being compared as unsigned doesn't produce the intended stop condition. Fix both of these and also fix vfio_msi_set_vector_signal() to validate parameters before using the vector index, though none of the callers should pass bad indexes anymore. Reported-by: Eric Auger <eric.auger@linaro.org> Reviewed-by: Eric Auger <eric.auger@linaro.org> Tested-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>