summaryrefslogtreecommitdiff
path: root/drivers/vfio
AgeCommit message (Collapse)Author
2017-07-13Merge tag 'vfio-v4.13-rc1' of git://github.com/awilliam/linux-vfioLinus Torvalds
Pull VFIO updates from Alex Williamson: - Include Intel XXV710 in INTx workaround (Alex Williamson) - Make use of ERR_CAST() for error return (Dan Carpenter) - Fix vfio_group release deadlock from iommu notifier (Alex Williamson) - Unset KVM-VFIO attributes only on group match (Alex Williamson) - Fix release path group/file matching with KVM-VFIO (Alex Williamson) - Remove unnecessary lock uses triggering lockdep splat (Alex Williamson) * tag 'vfio-v4.13-rc1' of git://github.com/awilliam/linux-vfio: vfio: Remove unnecessary uses of vfio_container.group_lock vfio: New external user group/file match kvm-vfio: Decouple only when we match a group vfio: Fix group release deadlock vfio: Use ERR_CAST() instead of open coding it vfio/pci: Add Intel XXV710 to hidden INTx devices
2017-07-07vfio: Remove unnecessary uses of vfio_container.group_lockAlex Williamson
The original intent of vfio_container.group_lock is to protect vfio_container.group_list, however over time it's become a crutch to prevent changes in container composition any time we call into the iommu driver backend. This introduces problems when we start to have more complex interactions, for example when a user's DMA unmap request triggers a notification to an mdev vendor driver, who responds by attempting to unpin mappings within that request, re-entering the iommu backend. We incorrectly assume that the use of read-locks here allow for this nested locking behavior, but a poorly timed write-lock could in fact trigger a deadlock. The current use of group_lock seems to fall into the trap of locking code, not data. Correct that by removing uses of group_lock that are not directly related to group_list. Note that the vfio type1 iommu backend has its own mutex, vfio_iommu.lock, which it uses to protect itself for each of these interfaces anyway. The group_lock appears to be a redundancy for these interfaces and type1 even goes so far as to release its mutex to allow for exactly the re-entrant code path above. Reported-by: Chuanxiao Dong <chuanxiao.dong@intel.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Alexey Kardashevskiy <aik@ozlabs.ru> Cc: stable@vger.kernel.org # v4.10+
2017-06-28vfio: New external user group/file matchAlex Williamson
At the point where the kvm-vfio pseudo device wants to release its vfio group reference, we can't always acquire a new reference to make that happen. The group can be in a state where we wouldn't allow a new reference to be added. This new helper function allows a caller to match a file to a group to facilitate this. Given a file and group, report if they match. Thus the caller needs to already have a group reference to match to the file. This allows the deletion of a group without acquiring a new reference. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Cc: stable@vger.kernel.org
2017-06-28vfio: Fix group release deadlockAlex Williamson
If vfio_iommu_group_notifier() acquires a group reference and that reference becomes the last reference to the group, then vfio_group_put introduces a deadlock code path where we're trying to unregister from the iommu notifier chain from within a callout of that chain. Use a work_struct to release this reference asynchronously. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Cc: stable@vger.kernel.org
2017-06-20sched/wait: Rename wait_queue_t => wait_queue_entry_tIngo Molnar
Rename: wait_queue_t => wait_queue_entry_t 'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue", but in reality it's a queue *entry*. The 'real' queue is the wait queue head, which had to carry the name. Start sorting this out by renaming it to 'wait_queue_entry_t'. This also allows the real structure name 'struct __wait_queue' to lose its double underscore and become 'struct wait_queue_entry', which is the more canonical nomenclature for such data types. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-13vfio: Use ERR_CAST() instead of open coding itDan Carpenter
It's a small cleanup to use ERR_CAST() here. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-06-13vfio/pci: Add Intel XXV710 to hidden INTx devicesAlex Williamson
XXV710 has the same broken INTx behavior as the rest of the X/XL710 series, the interrupt status register is not wired to report pending INTx interrupts, thus we never associate the interrupt to the device. Extend the device IDs to include these so that we hide that the device supports INTx at all to the user. Reported-by: Stefan Assmann <sassmann@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
2017-05-05Merge tag 'powerpc-4.12-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Highlights include: - Larger virtual address space on 64-bit server CPUs. By default we use a 128TB virtual address space, but a process can request access to the full 512TB by passing a hint to mmap(). - Support for the new Power9 "XIVE" interrupt controller. - TLB flushing optimisations for the radix MMU on Power9. - Support for CAPI cards on Power9, using the "Coherent Accelerator Interface Architecture 2.0". - The ability to configure the mmap randomisation limits at build and runtime. - Several small fixes and cleanups to the kprobes code, as well as support for KPROBES_ON_FTRACE. - Major improvements to handling of system reset interrupts, correctly treating them as NMIs, giving them a dedicated stack and using a new hypervisor call to trigger them, all of which should aid debugging and robustness. - Many fixes and other minor enhancements. Thanks to: Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anshuman Khandual, Anton Blanchard, Balbir Singh, Ben Hutchings, Benjamin Herrenschmidt, Bhupesh Sharma, Chris Packham, Christian Zigotzky, Christophe Leroy, Christophe Lombard, Daniel Axtens, David Gibson, Gautham R. Shenoy, Gavin Shan, Geert Uytterhoeven, Guilherme G. Piccoli, Hamish Martin, Hari Bathini, Kees Cook, Laurent Dufour, Madhavan Srinivasan, Mahesh J Salgaonkar, Mahesh Salgaonkar, Masami Hiramatsu, Matt Brown, Matthew R. Ochs, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Pan Xinhui, Paul Mackerras, Rashmica Gupta, Russell Currey, Sukadev Bhattiprolu, Thadeu Lima de Souza Cascardo, Tobin C. Harding, Tyrel Datwyler, Uma Krishnan, Vaibhav Jain, Vipin K Parashar, Yang Shi" * tag 'powerpc-4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (214 commits) powerpc/64s: Power9 has no LPCR[VRMASD] field so don't set it powerpc/powernv: Fix TCE kill on NVLink2 powerpc/mm/radix: Drop support for CPUs without lockless tlbie powerpc/book3s/mce: Move add_taint() later in virtual mode powerpc/sysfs: Move #ifdef CONFIG_HOTPLUG_CPU out of the function body powerpc/smp: Document irq enable/disable after migrating IRQs powerpc/mpc52xx: Don't select user-visible RTAS_PROC powerpc/powernv: Document cxl dependency on special case in pnv_eeh_reset() powerpc/eeh: Clean up and document event handling functions powerpc/eeh: Avoid use after free in eeh_handle_special_event() cxl: Mask slice error interrupts after first occurrence cxl: Route eeh events to all drivers in cxl_pci_error_detected() cxl: Force context lock during EEH flow powerpc/64: Allow CONFIG_RELOCATABLE if COMPILE_TEST powerpc/xmon: Teach xmon oops about radix vectors powerpc/mm/hash: Fix off-by-one in comment about kernel contexts ids powerpc/pseries: Enable VFIO powerpc/powernv: Fix iommu table size calculation hook for small tables powerpc/powernv: Check kzalloc() return value in pnv_pci_table_alloc powerpc: Add arch/powerpc/tools directory ...
2017-04-18vfio/type1: Reduce repetitive calls in vfio_pin_pages_remote()Alex Williamson
vfio_pin_pages_remote() is typically called to iterate over a range of memory. Testing CAP_IPC_LOCK is relatively expensive, so it makes sense to push it up to the caller, which can then repeatedly call vfio_pin_pages_remote() using that value. This can show nearly a 20% improvement on the worst case path through VFIO_IOMMU_MAP_DMA with contiguous page mapping disabled. Testing RLIMIT_MEMLOCK is much more lightweight, but we bring it along on the same principle and it does seem to show a marginal improvement. Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-04-18vfio/type1: Prune vfio_pin_page_external()Alex Williamson
With vfio_lock_acct() testing the locked memory limit under mmap_sem, it's redundant to do it here for a single page. We can also reorder our tests such that we can avoid testing for reserved pages if we're not doing accounting and let vfio_lock_acct() test the process CAP_IPC_LOCK. Finally, this function oddly returns 1 on success. Update to return zero on success, -errno on error. Since the function only pins a single page, there's no need to return the number of pages pinned. N.B. vfio_pin_pages_remote() can pin a large contiguous range of pages before calling vfio_lock_acct(). If we were to similarly remove the extra test there, a user could temporarily pin far more pages than they're allowed. Suggested-by: Kirti Wankhede <kwankhede@nvidia.com> Suggested-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-04-13vfio/type1: Remove locked page accounting workqueueAlex Williamson
If the mmap_sem is contented then the vfio type1 IOMMU backend will defer locked page accounting updates to a workqueue task. This has a few problems and depending on which side the user tries to play, they might be over-penalized for unmaps that haven't yet been accounted or race the workqueue to enter more mappings than they're allowed. The original intent of this workqueue mechanism seems to be focused on reducing latency through the ioctl, but we cannot do so at the cost of correctness. Remove this workqueue mechanism and update the callers to allow for failure. We can also now recheck the limit under write lock to make sure we don't exceed it. vfio_pin_pages_remote() also now necessarily includes an unwind path which we can jump to directly if the consecutive page pinning finds that we're exceeding the user's memory limits. This avoids the current lazy approach which does accounting and mapping up to the fault, only to return an error on the next iteration to unwind the entire vfio_dma. Cc: stable@vger.kernel.org Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-04-11vfio/spapr_tce: Check kzalloc() return when preregistering memoryAlexey Kardashevskiy
This adds missing checking for kzalloc() return value. Fixes: 4b6fad7097f8 ("powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-04-11vfio/powerpc/spapr_tce: Enforce IOMMU type compatibility checkAlexey Kardashevskiy
The existing SPAPR TCE driver advertises both VFIO_SPAPR_TCE_IOMMU and VFIO_SPAPR_TCE_v2_IOMMU types to the userspace and the userspace usually picks the v2. Normally the userspace would create a container, attach an IOMMU group to it and only then set the IOMMU type (which would normally be v2). However a specific IOMMU group may not support v2, in other words it may not implement set_window/unset_window/take_ownership/ release_ownership and such a group should not be attached to a v2 container. This adds extra checks that a new group can do what the selected IOMMU type suggests. The userspace can then test the return value from ioctl(VFIO_SET_IOMMU, VFIO_SPAPR_TCE_v2_IOMMU) and try VFIO_SPAPR_TCE_IOMMU. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-03-30powerpc/vfio_spapr_tce: Add reference counting to iommu_tableAlexey Kardashevskiy
So far iommu_table obejcts were only used in virtual mode and had a single owner. We are going to change this by implementing in-kernel acceleration of DMA mapping requests. The proposed acceleration will handle requests in real mode and KVM will keep references to tables. This adds a kref to iommu_table and defines new helpers to update it. This replaces iommu_free_table() with iommu_tce_table_put() and makes iommu_free_table() static. iommu_tce_table_get() is not used in this patch but it will be in the following patch. Since this touches prototypes, this also removes @node_name parameter as it has never been really useful on powernv and carrying it for the pseries platform code to iommu_free_table() seems to be quite useless as well. This should cause no behavioral change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-30powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposalAlexey Kardashevskiy
At the moment iommu_table can be disposed by either calling iommu_table_free() directly or it_ops::free(); the only implementation of free() is in IODA2 - pnv_ioda2_table_free() - and it calls iommu_table_free() anyway. As we are going to have reference counting on tables, we need an unified way of disposing tables. This moves it_ops::free() call into iommu_free_table() and makes use of the latter. The free() callback now handles only platform-specific data. As from now on the iommu_free_table() calls it_ops->free(), we need to have it_ops initialized before calling iommu_free_table() so this moves this initialization in pnv_pci_ioda2_create_table(). This should cause no behavioral change. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-03-24Merge tag 'vfio-v4.11-rc4' of git://github.com/awilliam/linux-vfioLinus Torvalds
Pull VFIO fix from Alex Williamson: "Rework sanity check for mdev driver group notifier de-registration (Alex Williamson)" * tag 'vfio-v4.11-rc4' of git://github.com/awilliam/linux-vfio: vfio: Rework group release notifier warning
2017-03-22iommu: Disambiguate MSI region typesRobin Murphy
The introduction of reserved regions has left a couple of rough edges which we could do with sorting out sooner rather than later. Since we are not yet addressing the potential dynamic aspect of software-managed reservations and presenting them at arbitrary fixed addresses, it is incongruous that we end up displaying hardware vs. software-managed MSI regions to userspace differently, especially since ARM-based systems may actually require one or the other, or even potentially both at once, (which iommu-dma currently has no hope of dealing with at all). Let's resolve the former user-visible inconsistency ASAP before the ABI has been baked into a kernel release, in a way that also lays the groundwork for the latter shortcoming to be addressed by follow-up patches. For clarity, rename the software-managed type to IOMMU_RESV_SW_MSI, use IOMMU_RESV_MSI to describe the hardware type, and document everything a little bit. Since the x86 MSI remapping hardware falls squarely under this meaning of IOMMU_RESV_MSI, apply that type to their regions as well, so that we tell the same story to userspace across all platforms. Secondly, as the various region types require quite different handling, and it really makes little sense to ever try combining them, convert the bitfield-esque #defines to a plain enum in the process before anyone gets the wrong impression. Fixes: d30ddcaa7b02 ("iommu: Add a new type field in iommu_resv_region") Reviewed-by: Eric Auger <eric.auger@redhat.com> CC: Alex Williamson <alex.williamson@redhat.com> CC: David Woodhouse <dwmw2@infradead.org> CC: kvm@vger.kernel.org Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
2017-03-21vfio: Rework group release notifier warningAlex Williamson
The intent of the original warning is make sure that the mdev vendor driver has removed any group notifiers at the point where the group is closed by the user. Theoretically this would be through an orderly shutdown where any devices are release prior to the group release. We can't always count on an orderly shutdown, the user can close the group before the notifier can be removed or the user task might be killed. We'd like to add this sanity test when the group is idle and the only references are from the devices within the group themselves, but we don't have a good way to do that. Instead check both when the group itself is removed and when the group is opened. A bit later than we'd prefer, but better than the current over aggressive approach. Fixes: ccd46dbae77d ("vfio: support notifier chain in vfio_group") Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Cc: <stable@vger.kernel.org> # v4.10 Cc: Jike Song <jike.song@intel.com>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar
<linux/sched/signal.h> We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/signal.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar
<linux/sched/mm.h> We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/mm.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. The APIs that are going to be moved first are: mm_alloc() __mmdrop() mmdrop() mmdrop_async_fn() mmdrop_async() mmget_not_zero() mmput() mmput_async() get_task_mm() mm_access() mm_release() Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-23Merge tag 'vfio-v4.11-rc1' of git://github.com/awilliam/linux-vfioLinus Torvalds
Pull VFIO updates from Alex Williamson: - Kconfig fixes for SPAPR_TCE_IOMMU=n (Michael Ellerman) - Module softdep rather than request_module to simplify usage from initrd (Alex Williamson) - Comment typo fix (Changbin Du) * tag 'vfio-v4.11-rc1' of git://github.com/awilliam/linux-vfio: vfio: fix a typo in comment of function vfio_pin_pages vfio: Replace module request with softdep vfio/mdev: Use a module softdep for vfio_mdev vfio: Fix build break when SPAPR_TCE_IOMMU=n
2017-02-22vfio: fix a typo in comment of function vfio_pin_pagesChangbin Du
Correct the description that 'unpinned' -> 'pinned'. Signed-off-by: Changbin Du <changbin.du@intel.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-02-20Merge tag 'iommu-updates-v4.11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull IOMMU UPDATES from Joerg Roedel: - KVM PCIe/MSI passthrough support on ARM/ARM64 - introduction of a core representation for individual hardware iommus - support for IOMMU privileged mappings as supported by some ARM IOMMUS - 16-bit SID support for ARM-SMMUv2 - stream table optimization for ARM-SMMUv3 - various fixes and other small improvements * tag 'iommu-updates-v4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (61 commits) vfio/type1: Fix error return code in vfio_iommu_type1_attach_group() iommu: Remove iommu_register_instance interface iommu/exynos: Make use of iommu_device_register interface iommu/mediatek: Make use of iommu_device_register interface iommu/msm: Make use of iommu_device_register interface iommu/arm-smmu: Make use of the iommu_register interface iommu: Add iommu_device_set_fwnode() interface iommu: Make iommu_device_link/unlink take a struct iommu_device iommu: Add sysfs bindings for struct iommu_device iommu: Introduce new 'struct iommu_device' iommu: Rename struct iommu_device iommu: Rename iommu_get_instance() iommu: Fix static checker warning in iommu_insert_device_resv_regions iommu: Avoid unnecessary assignment of dev->iommu_fwspec iommu/mediatek: Remove bogus 'select' statements iommu/dma: Remove bogus dma_supported() implementation iommu/ipmmu-vmsa: Restrict IOMMU Domain Geometry to 32-bit address space iommu/vt-d: Don't over-free page table directories iommu/vt-d: Tylersburg isoch identity map check is done too late. iommu/vt-d: Fix some macros that are incorrectly specified in intel-iommu ...
2017-02-10Merge branches 'iommu/fixes', 'arm/exynos', 'arm/renesas', 'arm/smmu', ↵Joerg Roedel
'arm/mediatek', 'arm/core', 'x86/vt-d' and 'core' into next
2017-02-10vfio/type1: Fix error return code in vfio_iommu_type1_attach_group()Wei Yongjun
Fix to return a negative error code from the error handling case instead of 0, as done elsewhere in this function. Fixes: 5d704992189f ("vfio/type1: Allow transparent MSI IOVA allocation") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
2017-02-09vfio: Replace module request with softdepAlex Williamson
Rather than doing a module request from within the init function, add a soft dependency on the available IOMMU backend drivers. This makes the dependency visible to userspace when picking modules for the ram disk. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-02-08vfio/mdev: Use a module softdep for vfio_mdevAlex Williamson
Use an explicit module softdep rather than a request module call such that the dependency is exposed to userspace. This allows us to more easily support modules loaded at initrd time. Reviewed by: Kirti Wankhede <kwankhede@nvidia.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-02-08vfio: Fix build break when SPAPR_TCE_IOMMU=nMichael Ellerman
Currently the kconfig logic for VFIO_IOMMU_SPAPR_TCE and VFIO_SPAPR_EEH is broken when SPAPR_TCE_IOMMU=n. Leading to: warning: (VFIO) selects VFIO_IOMMU_SPAPR_TCE which has unmet direct dependencies (VFIO && SPAPR_TCE_IOMMU) warning: (VFIO) selects VFIO_IOMMU_SPAPR_TCE which has unmet direct dependencies (VFIO && SPAPR_TCE_IOMMU) drivers/vfio/vfio_iommu_spapr_tce.c:113:8: error: implicit declaration of function 'mm_iommu_find' This stems from the fact that VFIO selects VFIO_IOMMU_SPAPR_TCE, and although it has an if clause, the condition is not correct. We could fix it by doing select VFIO_IOMMU_SPAPR_TCE if SPAPR_TCE_IOMMU, but the cleaner fix is to drop the selects and tie VFIO_IOMMU_SPAPR_TCE to the value of VFIO, and express the dependencies in only once place. Do the same for VFIO_SPAPR_EEH. The end result is that the values of VFIO_IOMMU_SPAPR_TCE and VFIO_SPAPR_EEH follow the value of VFIO, except when SPAPR_TCE_IOMMU=n and/or EEH=n. Which is exactly what we want to happen. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-02-07vfio/spapr_tce: Set window when adding additional groups to containerAlexey Kardashevskiy
If a container already has a group attached, attaching a new group should just program already created IOMMU tables to the hardware via the iommu_table_group_ops::set_window() callback. However commit 6f01cc692a16 ("vfio/spapr: Add a helper to create default DMA window") did not just simplify the code but also removed the set_window() calls in the case of attaching groups to a container which already has tables so it broke VFIO PCI hotplug. This reverts set_window() bits in tce_iommu_take_ownership_ddw(). Fixes: 6f01cc692a16 ("vfio/spapr: Add a helper to create default DMA window") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-02-01vfio/spapr: Fix missing mutex unlock when creating a windowAlexey Kardashevskiy
Commit d9c728949ddc ("vfio/spapr: Postpone default window creation") added an additional exit to the VFIO_IOMMU_SPAPR_TCE_CREATE case and made it possible to return from tce_iommu_ioctl() without unlocking container->lock; this fixes the issue. Fixes: d9c728949ddc ("vfio/spapr: Postpone default window creation") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-01-30Merge branch 'iommu/guest-msi' of ↵Joerg Roedel
git://git.kernel.org/pub/scm/linux/kernel/git/will/linux into arm/core
2017-01-24vfio/spapr: fail tce_iommu_attach_group() when iommu_data is nullGreg Kurz
The recently added mediated VFIO driver doesn't know about powerpc iommu. It thus doesn't register a struct iommu_table_group in the iommu group upon device creation. The iommu_data pointer hence remains null. This causes a kernel oops when userspace tries to set the iommu type of a container associated with a mediated device to VFIO_SPAPR_TCE_v2_IOMMU. [ 82.585440] mtty mtty: MDEV: Registered [ 87.655522] iommu: Adding device 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 to group 10 [ 87.655527] vfio_mdev 83b8f4f2-509f-382f-3c1e-e6bfe0fa1001: MDEV: group_id = 10 [ 116.297184] Unable to handle kernel paging request for data at address 0x00000030 [ 116.297389] Faulting instruction address: 0xd000000007870524 [ 116.297465] Oops: Kernel access of bad area, sig: 11 [#1] [ 116.297611] SMP NR_CPUS=2048 [ 116.297611] NUMA [ 116.297627] PowerNV ... [ 116.297954] CPU: 33 PID: 7067 Comm: qemu-system-ppc Not tainted 4.10.0-rc5-mdev-test #8 [ 116.297993] task: c000000e7718b680 task.stack: c000000e77214000 [ 116.298025] NIP: d000000007870524 LR: d000000007870518 CTR: 0000000000000000 [ 116.298064] REGS: c000000e77217990 TRAP: 0300 Not tainted (4.10.0-rc5-mdev-test) [ 116.298103] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> [ 116.298107] CR: 84004444 XER: 00000000 [ 116.298154] CFAR: c00000000000888c DAR: 0000000000000030 DSISR: 40000000 SOFTE: 1 GPR00: d000000007870518 c000000e77217c10 d00000000787b0ed c000000eed2103c0 GPR04: 0000000000000000 0000000000000000 c000000eed2103e0 0000000f24320000 GPR08: 0000000000000104 0000000000000001 0000000000000000 d0000000078729b0 GPR12: c00000000025b7e0 c00000000fe08400 0000000000000001 000001002d31d100 GPR16: 000001002c22c850 00003ffff315c750 0000000043145680 0000000043141bc0 GPR20: ffffffffffffffed fffffffffffff000 0000000020003b65 d000000007706018 GPR24: c000000f16cf0d98 d000000007706000 c000000003f42980 c000000003f42980 GPR28: c000000f1575ac00 c000000003f429c8 0000000000000000 c000000eed2103c0 [ 116.298504] NIP [d000000007870524] tce_iommu_attach_group+0x10c/0x360 [vfio_iommu_spapr_tce] [ 116.298555] LR [d000000007870518] tce_iommu_attach_group+0x100/0x360 [vfio_iommu_spapr_tce] [ 116.298601] Call Trace: [ 116.298610] [c000000e77217c10] [d000000007870518] tce_iommu_attach_group+0x100/0x360 [vfio_iommu_spapr_tce] (unreliable) [ 116.298671] [c000000e77217cb0] [d0000000077033a0] vfio_fops_unl_ioctl+0x278/0x3e0 [vfio] [ 116.298713] [c000000e77217d40] [c0000000002a3ebc] do_vfs_ioctl+0xcc/0x8b0 [ 116.298745] [c000000e77217de0] [c0000000002a4700] SyS_ioctl+0x60/0xc0 [ 116.298782] [c000000e77217e30] [c00000000000b220] system_call+0x38/0xfc [ 116.298812] Instruction dump: [ 116.298828] 7d3f4b78 409effc8 3d220000 e9298020 3c800140 38a00018 608480c0 e8690028 [ 116.298869] 4800249d e8410018 7c7f1b79 41820230 <e93e0030> 2fa90000 419e0114 e9090020 [ 116.298914] ---[ end trace 1e10b0ced08b9120 ]--- This patch fixes the oops. Reported-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com> Signed-off-by: Greg Kurz <groug@kaod.org> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-01-23vfio/type1: Check MSI remapping at irq domain levelEric Auger
In case the IOMMU translates MSI transactions (typical case on ARM), we check MSI remapping capability at IRQ domain level. Otherwise it is checked at IOMMU level. At this stage the arm-smmu-(v3) still advertise the IOMMU_CAP_INTR_REMAP capability at IOMMU level. This will be removed in subsequent patches. Signed-off-by: Eric Auger <eric.auger@redhat.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com> Tested-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com> Tested-by: Bharat Bhushan <bharat.bhushan@nxp.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-01-23vfio/type1: Allow transparent MSI IOVA allocationEric Auger
When attaching a group to the container, check the group's reserved regions and test whether the IOMMU translates MSI transactions. If yes, we initialize an IOVA allocator through the iommu_get_msi_cookie API. This will allow the MSI IOVAs to be transparently allocated on MSI controller's compose(). Signed-off-by: Eric Auger <eric.auger@redhat.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com> Tested-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com> Tested-by: Bharat Bhushan <bharat.bhushan@nxp.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2017-01-13vfio/type1: Remove pid_namespace.h includeAlex Williamson
Using has_capability() rather than ns_capable(), we're no longer using this header. Cc: Jike Song <jike.song@intel.com> Cc: Kirti Wankhede <kwankhede@nvidia.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-01-12vfio iommu type1: fix the testing of capability for remote taskJike Song
Before the mdev enhancement type1 iommu used capable() to test the capability of current task; in the course of mdev development a new requirement, testing for another task other than current, was raised. ns_capable() was used for this purpose, however it still tests current, the only difference is, in a specified namespace. Fix it by using has_capability() instead, which tests the cap for specified task in init_user_ns, the same namespace as capable(). Cc: Gerd Hoffmann <kraxel@redhat.com> Signed-off-by: Jike Song <jike.song@intel.com> Reviewed-by: James Morris <james.l.morris@oracle.com> Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2017-01-04vfio-pci: Handle error from pci_iomapArvind Yadav
Here, pci_iomap can fail, handle this case release selected pci regions and return -ENOMEM. Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-12-30vfio-pci: use 32-bit comparisons for register address for gcc-4.5Arnd Bergmann
Using ancient compilers (gcc-4.5 or older) on ARM, we get a link failure with the vfio-pci driver: ERROR: "__aeabi_lcmp" [drivers/vfio/pci/vfio-pci.ko] undefined! The reason is that the compiler tries to do a comparison of a 64-bit range. This changes it to convert to a 32-bit number explicitly first, as newer compilers do for themselves. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-12-30vfio-mdev: Make mdev_device private and abstract interfacesAlex Williamson
Abstract access to mdev_device so that we can define which interfaces are public rather than relying on comments in the structure. Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Cc: Zhi Wang <zhi.a.wang@intel.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Jike Song <jike.song@intel.com> Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
2016-12-30vfio-mdev: Make mdev_parent privateAlex Williamson
Rather than hoping for good behavior by marking some elements internal, enforce it by making the entire structure private and creating an accessor function for the one useful external field. Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Cc: Zhi Wang <zhi.a.wang@intel.com> Cc: Jike Song <jike.song@intel.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
2016-12-30vfio-mdev: de-polute the namespace, rename parent_device & parent_opsAlex Williamson
Add an mdev_ prefix so we're not poluting the namespace so much. Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Cc: Zhi Wang <zhi.a.wang@intel.com> Cc: Jike Song <jike.song@intel.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
2016-12-30vfio-mdev: Fix remove raceAlex Williamson
Using the mtty mdev sample driver we can generate a remove race by starting one shell that continuously creates mtty devices and several other shells all attempting to remove devices, in my case four remove shells. The fault occurs in mdev_remove_sysfs_files() where the passed type arg is NULL, which suggests we've received a struct device in mdev_device_remove() but it's in some sort of teardown state. The solution here is to make use of the accidentally unused list_head on the mdev_device such that the mdev core keeps a list of all the mdev devices. This allows us to validate that we have a valid mdev before we start removal, remove it from the list to prevent others from working on it, and if the vendor driver refuses to remove, we can re-add it to the list. Cc: Kirti Wankhede <kwankhede@nvidia.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-12-30vfio/type1: Restore mapping performance with mdev supportAlex Williamson
As part of the mdev support, type1 now gets a task reference per vfio_dma and uses that to get an mm reference for the task while working on accounting. That's correct, but it's not fast. For some paths, like vfio_pin_pages_remote(), we know we're only called from user context, so we can restore the lighter weight calls. In other cases, we're effectively already testing whether we're in the stored task context elsewhere, extend this vfio_lock_acct() as well. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
2016-12-16Merge tag 'powerpc-4.10-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Highlights include: - Support for the kexec_file_load() syscall, which is a prereq for secure and trusted boot. - Prevent kernel execution of userspace on P9 Radix (similar to SMEP/PXN). - Sort the exception tables at build time, to save time at boot, and store them as relative offsets to save space in the kernel image & memory. - Allow building the kernel with thin archives, which should allow us to build an allyesconfig once some other fixes land. - Build fixes to allow us to correctly rebuild when changing the kernel endian from big to little or vice versa. - Plumbing so that we can avoid doing a full mm TLB flush on P9 Radix. - Initial stack protector support (-fstack-protector). - Support for dumping the radix (aka. Linux) and hash page tables via debugfs. - Fix an oops in cxl coredump generation when cxl_get_fd() is used. - Freescale updates from Scott: "Highlights include 8xx hugepage support, qbman fixes/cleanup, device tree updates, and some misc cleanup." - Many and varied fixes and minor enhancements as always. Thanks to: Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Anshuman Khandual, Anton Blanchard, Balbir Singh, Bartlomiej Zolnierkiewicz, Christophe Jaillet, Christophe Leroy, Denis Kirjanov, Elimar Riesebieter, Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff Levand, Jack Miller, Johan Hovold, Lars-Peter Clausen, Libin, Madhavan Srinivasan, Michael Neuling, Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Pan Xinhui, Peter Senna Tschudin, Rashmica Gupta, Rui Teng, Russell Currey, Scott Wood, Simon Guo, Suraj Jitindar Singh, Thiago Jung Bauermann, Tobias Klauser, Vaibhav Jain" [ And thanks to Michael, who took time off from a new baby to get this pull request done. - Linus ] * tag 'powerpc-4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (174 commits) powerpc/fsl/dts: add FMan node for t1042d4rdb powerpc/fsl/dts: add sg_2500_aqr105_phy4 alias on t1024rdb powerpc/fsl/dts: add QMan and BMan nodes on t1024 powerpc/fsl/dts: add QMan and BMan nodes on t1023 soc/fsl/qman: test: use DEFINE_SPINLOCK() powerpc/fsl-lbc: use DEFINE_SPINLOCK() powerpc/8xx: Implement support of hugepages powerpc: get hugetlbpage handling more generic powerpc: port 64 bits pgtable_cache to 32 bits powerpc/boot: Request no dynamic linker for boot wrapper soc/fsl/bman: Use resource_size instead of computation soc/fsl/qe: use builtin_platform_driver powerpc/fsl_pmc: use builtin_platform_driver powerpc/83xx/suspend: use builtin_platform_driver powerpc/ftrace: Fix the comments for ftrace_modify_code powerpc/perf: macros for power9 format encoding powerpc/perf: power9 raw event format encoding powerpc/perf: update attribute_group data structure powerpc/perf: factor out the event format field powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown ...
2016-12-15Merge tag 'pci-v4.10-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI updates from Bjorn Helgaas: "PCI changes: - add support for PCI on ARM64 boxes with ACPI. We already had this for theoretical spec-compliant hardware; now we're adding quirks for the actual hardware (Cavium, HiSilicon, Qualcomm, X-Gene) - add runtime PM support for hotplug ports - enable runtime suspend for Intel UHCI that uses platform-specific wakeup signaling - add yet another host bridge registration interface. We hope this is extensible enough to subsume the others - expose device revision in sysfs for DRM - to avoid device conflicts, make sure any VF BAR updates are done before enabling the VF - avoid unnecessary link retrains for ASPM - allow INTx masking on Mellanox devices that support it - allow access to non-standard VPD for Chelsio devices - update Broadcom iProc support for PAXB v2, PAXC v2, inbound DMA, etc - update Rockchip support for max-link-speed - add NVIDIA Tegra210 support - add Layerscape LS1046a support - update R-Car compatibility strings - add Qualcomm MSM8996 support - remove some uninformative bootup messages" * tag 'pci-v4.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (115 commits) PCI: Enable access to non-standard VPD for Chelsio devices (cxgb3) PCI: Expand "VPD access disabled" quirk message PCI: pciehp: Remove loading message PCI: hotplug: Remove hotplug core message PCI: Remove service driver load/unload messages PCI/AER: Log AER IRQ when claiming Root Port PCI/AER: Log errors with PCI device, not PCIe service device PCI/AER: Remove unused version macros PCI/PME: Log PME IRQ when claiming Root Port PCI/PME: Drop unused support for PMEs from Root Complex Event Collectors PCI: Move config space size macros to pci_regs.h x86/platform/intel-mid: Constify mid_pci_platform_pm PCI/ASPM: Don't retrain link if ASPM not possible PCI: iproc: Skip check for legacy IRQ on PAXC buses PCI: pciehp: Leave power indicator on when enabling already-enabled slot PCI: pciehp: Prioritize data-link event over presence detect PCI: rcar: Add gen3 fallback compatibility string for pcie-rcar PCI: rcar: Use gen2 fallback compatibility last PCI: rcar-gen2: Use gen2 fallback compatibility last PCI: rockchip: Move the deassert of pm/aclk/pclk after phy_init() ..
2016-12-14mm: add locked parameter to get_user_pages_remote()Lorenzo Stoakes
Patch series "mm: unexport __get_user_pages_unlocked()". This patch series continues the cleanup of get_user_pages*() functions taking advantage of the fact we can now pass gup_flags as we please. It firstly adds an additional 'locked' parameter to get_user_pages_remote() to allow for its callers to utilise VM_FAULT_RETRY functionality. This is necessary as the invocation of __get_user_pages_unlocked() in process_vm_rw_single_vec() makes use of this and no other existing higher level function would allow it to do so. Secondly existing callers of __get_user_pages_unlocked() are replaced with the appropriate higher-level replacement - get_user_pages_unlocked() if the current task and memory descriptor are referenced, or get_user_pages_remote() if other task/memory descriptors are referenced (having acquiring mmap_sem.) This patch (of 2): Add a int *locked parameter to get_user_pages_remote() to allow VM_FAULT_RETRY faulting behaviour similar to get_user_pages_[un]locked(). Taking into account the previous adjustments to get_user_pages*() functions allowing for the passing of gup_flags, we are now in a position where __get_user_pages_unlocked() need only be exported for his ability to allow VM_FAULT_RETRY behaviour, this adjustment allows us to subsequently unexport __get_user_pages_unlocked() as well as allowing for future flexibility in the use of get_user_pages_remote(). [sfr@canb.auug.org.au: merge fix for get_user_pages_remote API change] Link: http://lkml.kernel.org/r/20161122210511.024ec341@canb.auug.org.au Link: http://lkml.kernel.org/r/20161027095141.2569-2-lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Jan Kara <jack@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-12PCI: Move config space size macros to pci_regs.hWang Sheng-Hui
Move PCI configuration space size macros (PCI_CFG_SPACE_SIZE and PCI_CFG_SPACE_EXP_SIZE) from drivers/pci/pci.h to include/uapi/linux/pci_regs.h so they can be used by more drivers and eliminate duplicate definitions. [bhelgaas: Expand comment to include PCI-X details] Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2016-12-06vfio iommu type1: Fix size argument to vfio_find_dma() in pin_pages/unpin_pagesKirti Wankhede
Passing zero for the size to vfio_find_dma() isn't compatible with matching the start address of an existing vfio_dma. Doing so triggers a corner case. In vfio_find_dma(), when the start address is equal to dma->iova and size is 0, check for the end of search range makes it to take wrong side of RB-tree. That fails the search even though the address is present in mapped dma ranges. In functions pin_pages and unpin_pages, the iova which is being searched is base address of page to be pinned or unpinned. So here size should be set to PAGE_SIZE, as argument to vfio_find_dma(). Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com> Signed-off-by: Neo Jia <cjia@nvidia.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-12-06vfio iommu type1: Fix size argument to vfio_find_dma() during DMA UNMAP.Kirti Wankhede
Passing zero for the size to vfio_find_dma() isn't compatible with matching the start address of an existing vfio_dma. Doing so triggers a corner case. In vfio_find_dma(), when the start address is equal to dma->iova and size is 0, check for the end of search range makes it to take wrong side of RB-tree. That fails the search even though the address is present in mapped dma ranges. Due to this, in vfio_dma_do_unmap(), while checking boundary conditions, size should be set to 1 for verifying start address of unmap range. vfio_find_dma() is also used to verify last address in unmap range with size = 0, but in that case address to be searched is calculated with start + size - 1 and so it works correctly. Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com> Signed-off-by: Neo Jia <cjia@nvidia.com> [aw: changelog tweak] Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-12-05vfio iommu type1: WARN_ON if notifier block is not unregisteredKirti Wankhede
mdev vendor driver should unregister the iommu notifier since the vfio iommu can persist beyond the attachment of the mdev group. WARN_ON will show warning if vendor driver doesn't unregister the notifier and is forced to follow the implementations steps. Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com> Signed-off-by: Neo Jia <cjia@nvidia.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>