summaryrefslogtreecommitdiff
path: root/drivers/infiniband
AgeCommit message (Collapse)Author
2019-06-18RDMA/efa: Handle mmap insertions overflowGal Pressman
When inserting a new mmap entry to the xarray we should check for 'mmap_page' overflow as it is limited to 32 bits. Fixes: 40909f664d27 ("RDMA/efa: Add EFA verbs implementation") Signed-off-by: Gal Pressman <galpress@amazon.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17RDMA/efa: Fix success return value in case of errorGal Pressman
Existing code would mistakenly return success in case of error instead of a proper return value. Fixes: e9c6c5373088 ("RDMA/efa: Add common command handlers") Reviewed-by: Firas JahJah <firasj@amazon.com> Reviewed-by: Yossi Leybovich <sleybo@amazon.com> Signed-off-by: Gal Pressman <galpress@amazon.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17IB/hfi1: Handle port down properly in pioMike Marciniszyn
The call to sc_buffer_alloc currently returns NULL (no buffer) or a buffer descriptor. There is a third case when the port is down. Currently that returns NULL and this prevents the caller from properly handling the sc_buffer_alloc() failure. A verbs code link test after the call is racy so the indication needs to come from the state check inside the allocation routine to be valid. Fix by encoding the ECOMM failure like SDMA. IS_ERR_OR_NULL() tests are added at all call sites. For verbs send, this needs to treat any error by returning a completion without any MMIO copy. Fixes: 7724105686e7 ("IB/hfi1: add driver files") Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17IB/hfi1: Handle wakeup of orphaned QPs for pioMike Marciniszyn
Once a send context is taken down due to a link failure, any QPs waiting for pio credits will stay on the waitlist indefinitely. Fix by wakeing up all QPs linked to piowait list. Fixes: 7724105686e7 ("IB/hfi1: add driver files") Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17IB/hfi1: Wakeup QPs orphaned on wait list after flushMike Marciniszyn
Once an SDMA engine is taken down due to a link failure, any waiting QPs that do not have outstanding descriptors in the ring will stay on the dmawait list as long as the port is down. Since there is no timer running, they will stay there for a long time. The fix is to wake up all iowaits linked to dmawait. The send engine will build and post packets that get flushed back. Fixes: 7724105686e7 ("IB/hfi1: add driver files") Reviewed-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17IB/hfi1: Use aborts to trigger RC throttlingMike Marciniszyn
SDMA and pio flushes will cause a lot of packets to be transmitted after a link has gone down, using a lot of CPU to retransmit packets. Fix for RC QPs by recognizing the flush status and: - Forcing a timer start - Putting the QP into a "send one" mode Fixes: 7724105686e7 ("IB/hfi1: add driver files") Reviewed-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17IB/hfi1: Create inline to get extended headersMike Marciniszyn
This paves the way for another patch that reacts to a flush sdma completion for RC. Fixes: 81cd3891f021 ("IB/hfi1: Add support for 16B Management Packets") Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17IB/hfi1: Silence txreq allocation warningsMike Marciniszyn
The following warning can happen when a memory shortage occurs during txreq allocation: [10220.939246] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC) [10220.939246] Hardware name: Intel Corporation S2600WT2R/S2600WT2R, BIOS SE5C610.86B.01.01.0018.C4.072020161249 07/20/2016 [10220.939247] cache: mnt_cache, object size: 384, buffer size: 384, default order: 2, min order: 0 [10220.939260] Workqueue: hfi0_0 _hfi1_do_send [hfi1] [10220.939261] node 0: slabs: 1026568, objs: 43115856, free: 0 [10220.939262] Call Trace: [10220.939262] node 1: slabs: 820872, objs: 34476624, free: 0 [10220.939263] dump_stack+0x5a/0x73 [10220.939265] warn_alloc+0x103/0x190 [10220.939267] ? wake_all_kswapds+0x54/0x8b [10220.939268] __alloc_pages_slowpath+0x86c/0xa2e [10220.939270] ? __alloc_pages_nodemask+0x2fe/0x320 [10220.939271] __alloc_pages_nodemask+0x2fe/0x320 [10220.939273] new_slab+0x475/0x550 [10220.939275] ___slab_alloc+0x36c/0x520 [10220.939287] ? hfi1_make_rc_req+0x90/0x18b0 [hfi1] [10220.939299] ? __get_txreq+0x54/0x160 [hfi1] [10220.939310] ? hfi1_make_rc_req+0x90/0x18b0 [hfi1] [10220.939312] __slab_alloc+0x40/0x61 [10220.939323] ? hfi1_make_rc_req+0x90/0x18b0 [hfi1] [10220.939325] kmem_cache_alloc+0x181/0x1b0 [10220.939336] hfi1_make_rc_req+0x90/0x18b0 [hfi1] [10220.939348] ? hfi1_verbs_send_dma+0x386/0xa10 [hfi1] [10220.939359] ? find_prev_entry+0xb0/0xb0 [hfi1] [10220.939371] hfi1_do_send+0x1d9/0x3f0 [hfi1] [10220.939372] process_one_work+0x171/0x380 [10220.939374] worker_thread+0x49/0x3f0 [10220.939375] kthread+0xf8/0x130 [10220.939377] ? max_active_store+0x80/0x80 [10220.939378] ? kthread_bind+0x10/0x10 [10220.939379] ret_from_fork+0x35/0x40 [10220.939381] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC) The shortage is handled properly so the message isn't needed. Silence by adding the no warn option to the slab allocation. Fixes: 45842abbb292 ("staging/rdma/hfi1: move txreq header code") Cc: <stable@vger.kernel.org> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-17IB/hfi1: Avoid hardlockup with flushlist_lockMike Marciniszyn
Heavy contention of the sde flushlist_lock can cause hard lockups at extreme scale when the flushing logic is under stress. Mitigate by replacing the item at a time copy to the local list with an O(1) list_splice_init() and using the high priority work queue to do the flushes. Fixes: 7724105686e7 ("IB/hfi1: add driver files") Cc: <stable@vger.kernel.org> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-06-11IB/hfi1: Correct tid qp rcd to match verbs contextMike Marciniszyn
The qp priv rcd pointer doesn't match the context being used for verbs causing issues when 9B and kdeth packets are processed by different receive contexts and hence different CPUs. When running on different CPUs the following panic can occur: WARNING: CPU: 3 PID: 2584 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0 list_del corruption. prev->next should be ffff9a7ac31f7a30, but was ffff9a7c3bc89230 CPU: 3 PID: 2584 Comm: z_wr_iss Kdump: loaded Tainted: P OE ------------ 3.10.0-862.2.3.el7_lustre.x86_64 #1 Call Trace: <IRQ> [<ffffffffb7b0d78e>] dump_stack+0x19/0x1b [<ffffffffb74916d8>] __warn+0xd8/0x100 [<ffffffffb749175f>] warn_slowpath_fmt+0x5f/0x80 [<ffffffffb7768671>] __list_del_entry+0xa1/0xd0 [<ffffffffc0c7a945>] process_rcv_qp_work+0xb5/0x160 [hfi1] [<ffffffffc0c7bc2b>] handle_receive_interrupt_nodma_rtail+0x20b/0x2b0 [hfi1] [<ffffffffc0c70683>] receive_context_interrupt+0x23/0x40 [hfi1] [<ffffffffb7540a94>] __handle_irq_event_percpu+0x44/0x1c0 [<ffffffffb7540c42>] handle_irq_event_percpu+0x32/0x80 [<ffffffffb7540ccc>] handle_irq_event+0x3c/0x60 [<ffffffffb7543a1f>] handle_edge_irq+0x7f/0x150 [<ffffffffb742d504>] handle_irq+0xe4/0x1a0 [<ffffffffb7b23f7d>] do_IRQ+0x4d/0xf0 [<ffffffffb7b16362>] common_interrupt+0x162/0x162 <EOI> [<ffffffffb775a326>] ? memcpy+0x6/0x110 [<ffffffffc109210d>] ? abd_copy_from_buf_off_cb+0x1d/0x30 [zfs] [<ffffffffc10920f0>] ? abd_copy_to_buf_off_cb+0x30/0x30 [zfs] [<ffffffffc1093257>] abd_iterate_func+0x97/0x120 [zfs] [<ffffffffc10934d9>] abd_copy_from_buf_off+0x39/0x60 [zfs] [<ffffffffc109b828>] arc_write_ready+0x178/0x300 [zfs] [<ffffffffb7b11032>] ? mutex_lock+0x12/0x2f [<ffffffffb7b11032>] ? mutex_lock+0x12/0x2f [<ffffffffc1164d05>] zio_ready+0x65/0x3d0 [zfs] [<ffffffffc04d725e>] ? tsd_get_by_thread+0x2e/0x50 [spl] [<ffffffffc04d1318>] ? taskq_member+0x18/0x30 [spl] [<ffffffffc115ef22>] zio_execute+0xa2/0x100 [zfs] [<ffffffffc04d1d2c>] taskq_thread+0x2ac/0x4f0 [spl] [<ffffffffb74cee80>] ? wake_up_state+0x20/0x20 [<ffffffffc115ee80>] ? zio_taskq_member.isra.7.constprop.10+0x80/0x80 [zfs] [<ffffffffc04d1a80>] ? taskq_thread_spawn+0x60/0x60 [spl] [<ffffffffb74bae31>] kthread+0xd1/0xe0 [<ffffffffb74bad60>] ? insert_kthread_work+0x40/0x40 [<ffffffffb7b1f5f7>] ret_from_fork_nospec_begin+0x21/0x21 [<ffffffffb74bad60>] ? insert_kthread_work+0x40/0x40 Fix by reading the map entry in the same manner as the hardware so that the kdeth and verbs contexts match. Cc: <stable@vger.kernel.org> Fixes: 5190f052a365 ("IB/hfi1: Allow the driver to initialize QP priv struct") Reviewed-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-11IB/hfi1: Close PSM sdma_progress sleep windowMike Marciniszyn
The call to sdma_progress() is called outside the wait lock. In this case, there is a race condition where sdma_progress() can return false and the sdma_engine can idle. If that happens, there will be no more sdma interrupts to cause the wakeup and the user_sdma xmit will hang. Fix by moving the lock to enclose the sdma_progress() call. Also, delete busycount. The need for this was removed by: commit bcad29137a97 ("IB/hfi1: Serve the most starved iowait entry first") Cc: <stable@vger.kernel.org> Fixes: 7724105686e7 ("IB/hfi1: add driver files") Reviewed-by: Gary Leshner <Gary.S.Leshner@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-11IB/hfi1: Validate fault injection opcode user inputKaike Wan
The opcode range for fault injection from user should be validated before it is applied to the fault->opcodes[] bitmap to avoid out-of-bound error. Cc: <stable@vger.kernel.org> Fixes: a74d5307caba ("IB/hfi1: Rework fault injection machinery") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-08Merge tag 'spdx-5.2-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull yet more SPDX updates from Greg KH: "Another round of SPDX header file fixes for 5.2-rc4 These are all more "GPL-2.0-or-later" or "GPL-2.0-only" tags being added, based on the text in the files. We are slowly chipping away at the 700+ different ways people tried to write the license text. All of these were reviewed on the spdx mailing list by a number of different people. We now have over 60% of the kernel files covered with SPDX tags: $ ./scripts/spdxcheck.py -v 2>&1 | grep Files Files checked: 64533 Files with SPDX: 40392 Files with errors: 0 I think the majority of the "easy" fixups are now done, it's now the start of the longer-tail of crazy variants to wade through" * tag 'spdx-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (159 commits) treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 450 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 449 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 448 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 446 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 445 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 444 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 443 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 442 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 440 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 438 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 437 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 436 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 435 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 434 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 433 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 432 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 431 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 430 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 429 ...
2019-06-07Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma fixes from Jason Gunthorpe: "Things are looking pretty quiet here in RDMA, not too many bug fixes rolling in right now. The usual driver bug fixes and fixes for a couple of regressions introduced in 5.2: - Fix a race on bootup with RDMA device renaming and srp. SRP also needs to rename its internal sys files - Fix a memory leak in hns - Don't leak resources in efa on certain error unwinds - Don't panic in certain error unwinds in ib_register_device - Various small user visible bug fix patches for the hfi and efa drivers - Fix the 32 bit compilation break" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: RDMA/efa: Remove MAYEXEC flag check from mmap flow mlx5: avoid 64-bit division IB/hfi1: Validate page aligned for a given virtual address IB/{qib, hfi1, rdmavt}: Correct ibv_devinfo max_mr value IB/hfi1: Insure freeze_work work_struct is canceled on shutdown IB/rdmavt: Fix alloc_qpn() WARN_ON() RDMA/core: Fix panic when port_data isn't initialized RDMA/uverbs: Pass udata on uverbs error unwind RDMA/core: Clear out the udata before error unwind RDMA/hns: Fix PD memory leak for internal allocation RDMA/srp: Rename SRP sysfs name after IB device rename trigger
2019-06-05treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 288Thomas Gleixner
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms and conditions of the gnu general public license version 2 as published by the free software foundation this program is distributed in the hope it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 263 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190529141901.208660670@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30treewide: Add SPDX license identifier - KbuildGreg Kroah-Hartman
Add SPDX license identifiers to all Make/Kconfig files which: - Have no license information of any form These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0 Reported-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 157Thomas Gleixner
Based on 3 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version [author] [kishon] [vijay] [abraham] [i] [kishon]@[ti] [com] this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version [author] [graeme] [gregory] [gg]@[slimlogic] [co] [uk] [author] [kishon] [vijay] [abraham] [i] [kishon]@[ti] [com] [based] [on] [twl6030]_[usb] [c] [author] [hema] [hk] [hemahk]@[ti] [com] this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 1105 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Richard Fontana <rfontana@redhat.com> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070033.202006027@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-29RDMA/efa: Remove MAYEXEC flag check from mmap flowGal Pressman
MAYEXEC test was mistakenly added, remove it. Checking MAYEXEC in the driver prevents it from working with userspace that uses things like EXEC STACK. (ie some Fortran and other runtimes) Fixes: 40909f664d27 ("RDMA/efa: Add EFA verbs implementation") Reported-by: Jason Gunthorpe <jgg@ziepe.ca> Reviewed-by: Firas JahJah <firasj@amazon.com> Reviewed-by: Yossi Leybovich <sleybo@amazon.com> Signed-off-by: Gal Pressman <galpress@amazon.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-29mlx5: avoid 64-bit divisionMichal Kubecek
Commit 25c13324d03d ("IB/mlx5: Add steering SW ICM device memory type") breaks i386 build by introducing three 64-bit divisions. As the divisor is MLX5_SW_ICM_BLOCK_SIZE() which is always a power of 2, we can replace the division with bit operations. Fixes: 25c13324d03d ("IB/mlx5: Add steering SW ICM device memory type") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-29IB/hfi1: Validate page aligned for a given virtual addressKamenee Arumugam
User applications can register memory regions for TID buffers that are not aligned on page boundaries. Hfi1 is expected to pin those pages in memory and cache the pages with mmu_rb. The rb tree will fail to insert pages that are not aligned correctly. Validate whether a given virtual address is page aligned before pinning. Fixes: 7e7a436ecb6e ("staging/hfi1: Add TID entry program function body") Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-29IB/{qib, hfi1, rdmavt}: Correct ibv_devinfo max_mr valueMike Marciniszyn
The command 'ibv_devinfo -v' reports 0 for max_mr. Fix by assigning the query values after the mr lkey_table has been built rather than early on in the driver. Fixes: 7b1e2099adc8 ("IB/rdmavt: Move memory registration into rdmavt") Reviewed-by: Josh Collier <josh.d.collier@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-29IB/hfi1: Insure freeze_work work_struct is canceled on shutdownMike Marciniszyn
By code inspection, the freeze_work is never canceled. Fix by adding a cancel_work_sync in the shutdown path to insure it is no longer running. Fixes: 7724105686e7 ("IB/hfi1: add driver files") Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-29IB/rdmavt: Fix alloc_qpn() WARN_ON()Mike Marciniszyn
The qpn allocation logic has a WARN_ON() that intends to detect the use of an index that will introduce bits in the lower order bits of the QOS bits in the QPN. Unfortunately, it has the following bugs: - it misfires when wrapping QPN allocation for non-QOS - it doesn't correctly detect low order QOS bits (despite the comment) The WARN_ON() should not be applied to non-QOS (qos_shift == 1). Additionally, it SHOULD test the qpn bits per the table below: 2 data VLs: [qp7, qp6, qp5, qp4, qp3, qp2, qp1] ^ [ 0, 0, 0, 0, 0, 0, sc0], qp bit 1 always 0* 3-4 data VLs: [qp7, qp6, qp5, qp4, qp3, qp2, qp1] ^ [ 0, 0, 0, 0, 0, sc1, sc0], qp bits [21] always 0 5-8 data VLs: [qp7, qp6, qp5, qp4, qp3, qp2, qp1] ^ [ 0, 0, 0, 0, sc2, sc1, sc0] qp bits [321] always 0 Fix by qualifying the warning for qos_shift > 1 and producing the correct mask to insure the above bits are zero without generating a superfluous warning. Fixes: 501edc42446e ("IB/rdmavt: Correct warning during QPN allocation") Reviewed-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-27RDMA/core: Fix panic when port_data isn't initializedKamal Heib
This happens if assign_name() returns failure when called from ib_register_device(), that will lead to the following panic in every time that someone touches the port_data's data members. BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0 PGD 0 P4D 0 Oops: 0002 [#1] SMP PTI CPU: 19 PID: 1994 Comm: systemd-udevd Not tainted 5.1.0-rc5+ #1 Hardware name: HP ProLiant DL360p Gen8, BIOS P71 12/20/2013 RIP: 0010:_raw_spin_lock_irqsave+0x1e/0x40 Code: 85 ff 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 53 9c 58 66 66 90 66 90 48 89 c3 fa 66 66 90 66 66 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 0f 94 c2 84 d2 74 05 48 89 d8 5b c3 89 c6 e8 b4 85 8a RSP: 0018:ffffa8d7079a7c08 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 0000000000000202 RCX: ffffa8d7079a7bf8 RDX: 0000000000000001 RSI: ffff93607c990000 RDI: 00000000000000c0 RBP: 0000000000000001 R08: 0000000000000000 R09: ffffffffc08c4dd8 R10: 0000000000000000 R11: 0000000000000001 R12: 00000000000000c0 R13: ffff93607c990000 R14: ffffffffc05a9740 R15: ffffa8d7079a7e98 FS: 00007f1c6ee438c0(0000) GS:ffff93609f6c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000c0 CR3: 0000000819fca002 CR4: 00000000000606e0 Call Trace: free_netdevs+0x4d/0xe0 [ib_core] ib_dealloc_device+0x51/0xb0 [ib_core] __mlx5_ib_add+0x5e/0x70 [mlx5_ib] mlx5_add_device+0x57/0xe0 [mlx5_core] mlx5_register_interface+0x85/0xc0 [mlx5_core] ? 0xffffffffc0474000 do_one_initcall+0x4e/0x1d4 ? _cond_resched+0x15/0x30 ? kmem_cache_alloc_trace+0x15f/0x1c0 do_init_module+0x5a/0x218 load_module+0x186b/0x1e40 ? m_show+0x1c0/0x1c0 __do_sys_finit_module+0x94/0xe0 do_syscall_64+0x5b/0x180 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: 8ceb1357b337 ("RDMA/device: Consolidate ib_device per_port data into one place") Signed-off-by: Kamal Heib <kamalheib1@gmail.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-27RDMA/uverbs: Pass udata on uverbs error unwindGal Pressman
When destroy_* is called as a result of uverbs create cleanup flow a cleared udata should be passed instead of NULL to indicate that it is called under user flow. Fixes: c4367a26357b ("IB: Pass uverbs_attr_bundle down ib_x destroy path") Signed-off-by: Gal Pressman <galpress@amazon.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-27RDMA/core: Clear out the udata before error unwindJason Gunthorpe
The core code should not pass a udata to the driver destroy function that contains the input from the create command. Otherwise the driver will attempt to interpret the create udata as destroy udata, and at least in the case of EFA, will leak resources. Zero this stuff out before invoking destroy. Reported-by: Leon Romanovsky <leonro@mellanox.com> Fixes: c4367a26357b ("IB: Pass uverbs_attr_bundle down ib_x destroy path") Reviewed-by: Gal Pressman <galpress@amazon.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-21Merge tag 'spdx-5.2-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull SPDX update from Greg KH: "Here is a series of patches that add SPDX tags to different kernel files, based on two different things: - SPDX entries are added to a bunch of files that we missed a year ago that do not have any license information at all. These were either missed because the tool saw the MODULE_LICENSE() tag, or some EXPORT_SYMBOL tags, and got confused and thought the file had a real license, or the files have been added since the last big sweep, or they were Makefile/Kconfig files, which we didn't touch last time. - Add GPL-2.0-only or GPL-2.0-or-later tags to files where our scan tools can determine the license text in the file itself. Where this happens, the license text is removed, in order to cut down on the 700+ different ways we have in the kernel today, in a quest to get rid of all of these. These patches have been out for review on the linux-spdx@vger mailing list, and while they were created by automatic tools, they were hand-verified by a bunch of different people, all whom names are on the patches are reviewers. The reason for these "large" patches is if we were to continue to progress at the current rate of change in the kernel, adding license tags to individual files in different subsystems, we would be finished in about 10 years at the earliest. There will be more series of these types of patches coming over the next few weeks as the tools and reviewers crunch through the more "odd" variants of how to say "GPLv2" that developers have come up with over the years, combined with other fun oddities (GPL + a BSD disclaimer?) that are being unearthed, with the goal for the whole kernel to be cleaned up. These diffstats are not small, 3840 files are touched, over 10k lines removed in just 24 patches" * tag 'spdx-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (24 commits) treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 25 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 24 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 23 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 22 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 21 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 20 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 19 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 18 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 17 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 15 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 14 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 13 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 12 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 11 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 10 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 9 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 7 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 5 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 4 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 3 ...
2019-05-21RDMA/hns: Fix PD memory leak for internal allocationLeon Romanovsky
free_pd is allocated internally by the driver hence needs to be freed internally too or it leaks. Fixes: 21a428a019c9 ("RDMA: Handle PD allocations by IB/core") Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Lijun Ou <oulijun@huawei.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-21RDMA/srp: Rename SRP sysfs name after IB device rename triggerLeon Romanovsky
SRP logic used device name and port index as symlink to relevant kobject. If the IB device is renamed then the prior name will be re-used by the next device plugged in and sysfs will panic as SRP will try to re-use the same name. mlx5_ib: Mellanox Connect-IB Infiniband driver v5.0-0 sysfs: cannot create duplicate filename '/class/infiniband_srp/srp-mlx5_0-1' CPU: 3 PID: 1107 Comm: modprobe Not tainted 5.1.0-for-upstream-perf-2019-05-12_15-09-52-87 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 Call Trace: dump_stack+0x5a/0x73 sysfs_warn_dup+0x58/0x70 sysfs_do_create_link_sd.isra.2+0xa3/0xb0 device_add+0x33f/0x660 srp_add_one+0x301/0x4f0 [ib_srp] add_client_context+0x99/0xe0 [ib_core] enable_device_and_get+0xd1/0x1b0 [ib_core] ib_register_device+0x533/0x710 [ib_core] ? mutex_lock+0xe/0x30 __mlx5_ib_add+0x23/0x70 [mlx5_ib] mlx5_add_device+0x4e/0xd0 [mlx5_core] mlx5_register_interface+0x85/0xc0 [mlx5_core] ? 0xffffffffa0791000 do_one_initcall+0x4b/0x1cb ? kmem_cache_alloc_trace+0xc6/0x1d0 ? do_init_module+0x22/0x21f do_init_module+0x5a/0x21f load_module+0x17f2/0x1ca0 ? m_show+0x1c0/0x1c0 __do_sys_finit_module+0x94/0xe0 do_syscall_64+0x48/0x120 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f157cce10d9 The module load/unload sequence was used to trigger such kernel panic: sudo modprobe ib_srp sudo modprobe -r mlx5_ib sudo modprobe -r mlx5_core sudo modprobe mlx5_core Have SRP track the name of the core device so that it can't have a name collision. Fixes: d21943dd19b5 ("RDMA/core: Implement IB device rename function") Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-21treewide: Add SPDX license identifier - Makefile/KconfigThomas Gleixner
Add SPDX license identifiers to all Make/Kconfig files which: - Have no license information of any form These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller:1) Use after free in __dev_map_entry_free(), from Eric Dumazet. 1) Use after free in __dev_map_entry_free(), from Eric Dumazet. 2) Fix TCP retransmission timestamps on passive Fast Open, from Yuchung Cheng. 3) Orphan NFC, we'll take the patches directly into my tree. From Johannes Berg. 4) We can't recycle cloned TCP skbs, from Eric Dumazet. 5) Some flow dissector bpf test fixes, from Stanislav Fomichev. 6) Fix RCU marking and warnings in rhashtable, from Herbert Xu. 7) Fix some potential fib6 leaks, from Eric Dumazet. 8) Fix a _decode_session4 uninitialized memory read bug fix that got lost in a merge. From Florian Westphal. 9) Fix ipv6 source address routing wrt. exception route entries, from Wei Wang. 10) The netdev_xmit_more() conversion was not done %100 properly in mlx5 driver, fix from Tariq Toukan. 11) Clean up botched merge on netfilter kselftest, from Florian Westphal. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (74 commits) of_net: fix of_get_mac_address retval if compiled without CONFIG_OF net: fix kernel-doc warnings for socket.c net: Treat sock->sk_drops as an unsigned int when printing kselftests: netfilter: fix leftover net/net-next merge conflict mlxsw: core: Prevent reading unsupported slave address from SFP EEPROM mlxsw: core: Prevent QSFP module initialization for old hardware vsock/virtio: Initialize core virtio vsock before registering the driver net/mlx5e: Fix possible modify header actions memory leak net/mlx5e: Fix no rewrite fields with the same match net/mlx5e: Additional check for flow destination comparison net/mlx5e: Add missing ethtool driver info for representors net/mlx5e: Fix number of vports for ingress ACL configuration net/mlx5e: Fix ethtool rxfh commands when CONFIG_MLX5_EN_RXNFC is disabled net/mlx5e: Fix wrong xmit_more application net/mlx5: Fix peer pf disable hca command net/mlx5: E-Switch, Correct type to u16 for vport_num and int for vport_index net/mlx5: Add meaningful return codes to status_to_err function net/mlx5: Imply MLXFW in mlx5_core Revert "tipc: fix modprobe tipc failed after switch order of device registration" vsock/virtio: free packets during the socket release ...
2019-05-17net/mlx5: E-Switch, Correct type to u16 for vport_num and int for vport_indexParav Pandit
To avoid any ambiguity between vport index and vport number, rename functions that had vport, to vport_num or vport_index appropriately. vport_num is u16 hence change mlx5_eswitch_index_to_vport_num() return type to u16. vport_index is an int in vport array. Hence change input type of vport index in mlx5_eswitch_index_to_vport_num() to int. Correct multiple eswitch representor interfaces use type u16 of rep->vport as type int vport_index. Send vport FW commands with correct eswitch u16 vport_num instead host int vport_index. Fixes: 5ae5162066d8 ("net/mlx5: E-Switch, Assign a different position for uplink rep and vport") Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Vu Pham <vuhuong@mellanox.com> Reviewed-by: Bodong Wang <bodong@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-05-14Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull more rdma updates from Jason Gunthorpe: "This is being sent to get a fix for the gcc 9.1 build warnings, and I've also pulled in some bug fix patches that were posted in the last two weeks. - Avoid the gcc 9.1 warning about overflowing a union member - Fix the wrong callback type for a single response netlink to doit - Bug fixes from more usage of the mlx5 devx interface" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: net/mlx5: Set completion EQs as shared resources IB/mlx5: Verify DEVX general object type correctly RDMA/core: Change system parameters callback from dumpit to doit RDMA: Directly cast the sockaddr union to sockaddr
2019-05-14mm/mmu_notifier: convert user range->blockable to helper functionJérôme Glisse
Use the mmu_notifier_range_blockable() helper function instead of directly dereferencing the range->blockable field. This is done to make it easier to change the mmu_notifier range field. This patch is the outcome of the following coccinelle patch: %<------------------------------------------------------------------- @@ identifier I1, FN; @@ FN(..., struct mmu_notifier_range *I1, ...) { <... -I1->blockable +mmu_notifier_range_blockable(I1) ...> } ------------------------------------------------------------------->% spatch --in-place --sp-file blockable.spatch --dir . Link: http://lkml.kernel.org/r/20190326164747.24405-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14IB/mthca: use the new FOLL_LONGTERM flag to get_user_pages_fast()Ira Weiny
Use the new FOLL_LONGTERM to get_user_pages_fast() to protect against FS DAX pages being mapped. Link: http://lkml.kernel.org/r/20190328084422.29911-8-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190317183438.2057-8-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14IB/qib: use the new FOLL_LONGTERM flag to get_user_pages_fast()Ira Weiny
Use the new FOLL_LONGTERM to get_user_pages_fast() to protect against FS DAX pages being mapped. Link: http://lkml.kernel.org/r/20190328084422.29911-7-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190317183438.2057-7-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14IB/hfi1: use the new FOLL_LONGTERM flag to get_user_pages_fast()Ira Weiny
Use the new FOLL_LONGTERM to get_user_pages_fast() to protect against FS DAX pages being mapped. [ira.weiny@intel.com: v3] Link: http://lkml.kernel.org/r/20190328084422.29911-6-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190328084422.29911-6-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190317183438.2057-6-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/gup: change GUP fast to use flags rather than a write 'bool'Ira Weiny
To facilitate additional options to get_user_pages_fast() change the singular write parameter to be gup_flags. This patch does not change any functionality. New functionality will follow in subsequent patches. Some of the get_user_pages_fast() call sites were unchanged because they already passed FOLL_WRITE or 0 for the write parameter. NOTE: It was suggested to change the ordering of the get_user_pages_fast() arguments to ensure that callers were converted. This breaks the current GUP call site convention of having the returned pages be the final parameter. So the suggestion was rejected. Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Mike Marshall <hubcap@omnibond.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERMIra Weiny
Pach series "Add FOLL_LONGTERM to GUP fast and use it". HFI1, qib, and mthca, use get_user_pages_fast() due to its performance advantages. These pages can be held for a significant time. But get_user_pages_fast() does not protect against mapping FS DAX pages. Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which retains the performance while also adding the FS DAX checks. XDP has also shown interest in using this functionality.[1] In addition we change get_user_pages() to use the new FOLL_LONGTERM flag and remove the specialized get_user_pages_longterm call. [1] https://lkml.org/lkml/2019/3/19/939 "longterm" is a relative thing and at this point is probably a misnomer. This is really flagging a pin which is going to be given to hardware and can't move. I've thought of a couple of alternative names but I think we have to settle on if we are going to use FL_LAYOUT or something else to solve the "longterm" problem. Then I think we can change the flag to a better name. Secondly, it depends on how often you are registering memory. I have spoken with some RDMA users who consider MR in the performance path... For the overall application performance. I don't have the numbers as the tests for HFI1 were done a long time ago. But there was a significant advantage. Some of which is probably due to the fact that you don't have to hold mmap_sem. Finally, architecturally I think it would be good for everyone to use *_fast. There are patches submitted to the RDMA list which would allow the use of *_fast (they reworking the use of mmap_sem) and as soon as they are accepted I'll submit a patch to convert the RDMA core as well. Also to this point others are looking to use *_fast. As an aside, Jasons pointed out in my previous submission that *_fast and *_unlocked look very much the same. I agree and I think further cleanup will be coming. But I'm focused on getting the final solution for DAX at the moment. This patch (of 7): This patch starts a series which aims to support FOLL_LONGTERM in get_user_pages_fast(). Some callers who would like to do a longterm (user controlled pin) of pages with the fast variant of GUP for performance purposes. Rather than have a separate get_user_pages_longterm() call, introduce FOLL_LONGTERM and change the longterm callers to use it. This patch does not change any functionality. In the short term "longterm" or user controlled pins are unsafe for Filesystems and FS DAX in particular has been blocked. However, callers of get_user_pages_fast() were not "protected". FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it requires vmas to determine if DAX is in use. NOTE: In merging with the CMA changes we opt to change the get_user_pages() call in check_and_migrate_cma_pages() to a call of __get_user_pages_locked() on the newly migrated pages. This makes the code read better in that we are calling __get_user_pages_locked() on the pages before and after a potential migration. As a side affect some of the interfaces are cleaned up but this is not the primary purpose of the series. In review[1] it was asked: <quote> > This I don't get - if you do lock down long term mappings performance > of the actual get_user_pages call shouldn't matter to start with. > > What do I miss? A couple of points. First "longterm" is a relative thing and at this point is probably a misnomer. This is really flagging a pin which is going to be given to hardware and can't move. I've thought of a couple of alternative names but I think we have to settle on if we are going to use FL_LAYOUT or something else to solve the "longterm" problem. Then I think we can change the flag to a better name. Second, It depends on how often you are registering memory. I have spoken with some RDMA users who consider MR in the performance path... For the overall application performance. I don't have the numbers as the tests for HFI1 were done a long time ago. But there was a significant advantage. Some of which is probably due to the fact that you don't have to hold mmap_sem. Finally, architecturally I think it would be good for everyone to use *_fast. There are patches submitted to the RDMA list which would allow the use of *_fast (they reworking the use of mmap_sem) and as soon as they are accepted I'll submit a patch to convert the RDMA core as well. Also to this point others are looking to use *_fast. As an asside, Jasons pointed out in my previous submission that *_fast and *_unlocked look very much the same. I agree and I think further cleanup will be coming. But I'm focused on getting the final solution for DAX at the moment. </quote> [1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965 [ira.weiny@intel.com: v3] Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Rich Felker <dalias@libc.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: James Hogan <jhogan@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14IB/mlx5: Verify DEVX general object type correctlyYishai Hadas
As the obj_id in the firmware is not globally unique in general_object, the object type must be considered upon checking for a valid object id. Fixes: 2351776e87a1 ("IB/mlx5: Verify DEVX object type") Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-13RDMA/core: Change system parameters callback from dumpit to doitParav Pandit
.dumpit() callback is used for returning same type of data in the loop, e.g. loop over ports, resources, devices. However system parameters are general and standalone for whole subsystem. It means that getting system parameters should be doit callback. Fixes: cb7e0e130503 ("RDMA/core: Add interface to read device namespace sharing mode") Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-13RDMA: Directly cast the sockaddr union to sockaddrJason Gunthorpe
gcc 9 now does allocation size tracking and thinks that passing the member of a union and then accessing beyond that member's bounds is an overflow. Instead of using the union member, use the entire union with a cast to get to the sockaddr. gcc will now know that the memory extends the full size of the union. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-09Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma updates from Jason Gunthorpe: "This has been a smaller cycle than normal. One new driver was accepted, which is unusual, and at least one more driver remains in review on the list. Summary: - Driver fixes for hns, hfi1, nes, rxe, i40iw, mlx5, cxgb4, vmw_pvrdma - Many patches from MatthewW converting radix tree and IDR users to use xarray - Introduction of tracepoints to the MAD layer - Build large SGLs at the start for DMA mapping and get the driver to split them - Generally clean SGL handling code throughout the subsystem - Support for restricting RDMA devices to net namespaces for containers - Progress to remove object allocation boilerplate code from drivers - Change in how the mlx5 driver shows representor ports linked to VFs - mlx5 uapi feature to access the on chip SW ICM memory - Add a new driver for 'EFA'. This is HW that supports user space packet processing through QPs in Amazon's cloud" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (186 commits) RDMA/ipoib: Allow user space differentiate between valid dev_port IB/core, ipoib: Do not overreact to SM LID change event RDMA/device: Don't fire uevent before device is fully initialized lib/scatterlist: Remove leftover from sg_page_iter comment RDMA/efa: Add driver to Kconfig/Makefile RDMA/efa: Add the efa module RDMA/efa: Add EFA verbs implementation RDMA/efa: Add common command handlers RDMA/efa: Implement functions that submit and complete admin commands RDMA/efa: Add the ABI definitions RDMA/efa: Add the com service API definitions RDMA/efa: Add the efa_com.h file RDMA/efa: Add the efa.h header file RDMA/efa: Add EFA device definitions RDMA: Add EFA related definitions RDMA/umem: Remove hugetlb flag RDMA/bnxt_re: Use core helpers to get aligned DMA address RDMA/i40iw: Use core helpers to get aligned DMA address within a supported page size RDMA/verbs: Add a DMA iterator to return aligned contiguous memory blocks RDMA/umem: Add API to find best driver supported page size in an MR ...
2019-05-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: "Highlights: 1) Support AES128-CCM ciphers in kTLS, from Vakul Garg. 2) Add fib_sync_mem to control the amount of dirty memory we allow to queue up between synchronize RCU calls, from David Ahern. 3) Make flow classifier more lockless, from Vlad Buslov. 4) Add PHY downshift support to aquantia driver, from Heiner Kallweit. 5) Add SKB cache for TCP rx and tx, from Eric Dumazet. This reduces contention on SLAB spinlocks in heavy RPC workloads. 6) Partial GSO offload support in XFRM, from Boris Pismenny. 7) Add fast link down support to ethtool, from Heiner Kallweit. 8) Use siphash for IP ID generator, from Eric Dumazet. 9) Pull nexthops even further out from ipv4/ipv6 routes and FIB entries, from David Ahern. 10) Move skb->xmit_more into a per-cpu variable, from Florian Westphal. 11) Improve eBPF verifier speed and increase maximum program size, from Alexei Starovoitov. 12) Eliminate per-bucket spinlocks in rhashtable, and instead use bit spinlocks. From Neil Brown. 13) Allow tunneling with GUE encap in ipvs, from Jacky Hu. 14) Improve link partner cap detection in generic PHY code, from Heiner Kallweit. 15) Add layer 2 encap support to bpf_skb_adjust_room(), from Alan Maguire. 16) Remove SKB list implementation assumptions in SCTP, your's truly. 17) Various cleanups, optimizations, and simplifications in r8169 driver. From Heiner Kallweit. 18) Add memory accounting on TX and RX path of SCTP, from Xin Long. 19) Switch PHY drivers over to use dynamic featue detection, from Heiner Kallweit. 20) Support flow steering without masking in dpaa2-eth, from Ioana Ciocoi. 21) Implement ndo_get_devlink_port in netdevsim driver, from Jiri Pirko. 22) Increase the strict parsing of current and future netlink attributes, also export such policies to userspace. From Johannes Berg. 23) Allow DSA tag drivers to be modular, from Andrew Lunn. 24) Remove legacy DSA probing support, also from Andrew Lunn. 25) Allow ll_temac driver to be used on non-x86 platforms, from Esben Haabendal. 26) Add a generic tracepoint for TX queue timeouts to ease debugging, from Cong Wang. 27) More indirect call optimizations, from Paolo Abeni" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1763 commits) cxgb4: Fix error path in cxgb4_init_module net: phy: improve pause mode reporting in phy_print_status dt-bindings: net: Fix a typo in the phy-mode list for ethernet bindings net: macb: Change interrupt and napi enable order in open net: ll_temac: Improve error message on error IRQ net/sched: remove block pointer from common offload structure net: ethernet: support of_get_mac_address new ERR_PTR error net: usb: smsc: fix warning reported by kbuild test robot staging: octeon-ethernet: Fix of_get_mac_address ERR_PTR check net: dsa: support of_get_mac_address new ERR_PTR error net: dsa: sja1105: Fix status initialization in sja1105_get_ethtool_stats vrf: sit mtu should not be updated when vrf netdev is the link net: dsa: Fix error cleanup path in dsa_init_module l2tp: Fix possible NULL pointer dereference taprio: add null check on sched_nest to avoid potential null pointer dereference net: mvpp2: cls: fix less than zero check on a u32 variable net_sched: sch_fq: handle non connected flows net_sched: sch_fq: do not assume EDT packets are ordered net: hns3: use devm_kcalloc when allocating desc_cb net: hns3: some cleanup for struct hns3_enet_ring ...
2019-05-07Merge tag 'pidfd-v5.2-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull pidfd updates from Christian Brauner: "This patchset makes it possible to retrieve pidfds at process creation time by introducing the new flag CLONE_PIDFD to the clone() system call. Linus originally suggested to implement this as a new flag to clone() instead of making it a separate system call. After a thorough review from Oleg CLONE_PIDFD returns pidfds in the parent_tidptr argument. This means we can give back the associated pid and the pidfd at the same time. Access to process metadata information thus becomes rather trivial. As has been agreed, CLONE_PIDFD creates file descriptors based on anonymous inodes similar to the new mount api. They are made unconditional by this patchset as they are now needed by core kernel code (vfs, pidfd) even more than they already were before (timerfd, signalfd, io_uring, epoll etc.). The core patchset is rather small. The bulky looking changelist is caused by David's very simple changes to Kconfig to make anon inodes unconditional. A pidfd comes with additional information in fdinfo if the kernel supports procfs. The fdinfo file contains the pid of the process in the callers pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d". To remove worries about missing metadata access this patchset comes with a sample/test program that illustrates how a combination of CLONE_PIDFD and pidfd_send_signal() can be used to gain race-free access to process metadata through /proc/<pid>. Further work based on this patchset has been done by Joel. His work makes pidfds pollable. It finished too late for this merge window. I would prefer to have it sitting in linux-next for a while and send it for inclusion during the 5.3 merge window" * tag 'pidfd-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: samples: show race-free pidfd metadata access signal: support CLONE_PIDFD with pidfd_send_signal clone: add CLONE_PIDFD Make anon_inodes unconditional
2019-05-07Merge tag 'stream_open-5.2' of https://lab.nexedi.com/kirr/linuxLinus Torvalds
Pull stream_open conversion from Kirill Smelkov: - remove unnecessary double nonseekable_open from drivers/char/dtlk.c as noticed by Pavel Machek while reviewing nonseekable_open -> stream_open mass conversion. - the mass conversion patch promised in commit 10dce8af3422 ("fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock") and is automatically generated by running $ make coccicheck MODE=patch COCCI=scripts/coccinelle/api/stream_open.cocci I've verified each generated change manually - that it is correct to convert - and each other nonseekable_open instance left - that it is either not correct to convert there, or that it is not converted due to current stream_open.cocci limitations. More details on this in the patch. - finally, change VFS to pass ppos=NULL into .read/.write for files that declare themselves streams. It was suggested by Rasmus Villemoes and makes sure that if ppos starts to be erroneously used in a stream file, such bug won't go unnoticed and will produce an oops instead of creating illusion of position change being taken into account. Note: this patch does not conflict with "fuse: Add FOPEN_STREAM to use stream_open()" that will be hopefully coming via FUSE tree, because fs/fuse/ uses new-style .read_iter/.write_iter, and for these accessors position is still passed as non-pointer kiocb.ki_pos . * tag 'stream_open-5.2' of https://lab.nexedi.com/kirr/linux: vfs: pass ppos=NULL to .read()/.write() of FMODE_STREAM files *: convert stream-like files from nonseekable_open -> stream_open dtlk: remove double call to nonseekable_open
2019-05-07RDMA/ipoib: Allow user space differentiate between valid dev_portLeon Romanovsky
Systemd triggers the following warning during IPoIB device load: mlx5_core 0000:00:0c.0 ib0: "systemd-udevd" wants to know my dev_id. Should it look at dev_port instead? See Documentation/ABI/testing/sysfs-class-net for more info. This is caused due to user space attempt to differentiate old systems without dev_port and new systems with dev_port. In case dev_port will be zero, the systemd will try to read dev_id instead. There is no need to print a warning in such case, because it is valid situation and it is needed to ensure systemd compatibility with old kernels. Link: https://github.com/systemd/systemd/blob/master/src/udev/udev-builtin-net_id.c#L358 Cc: <stable@vger.kernel.org> # 4.19 Fixes: f6350da41dc7 ("IB/ipoib: Log sysfs 'dev_id' accesses from userspace") Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-07IB/core, ipoib: Do not overreact to SM LID change eventDennis Dalessandro
When IPoIB receives an SM LID change event, it reacts by flushing its path record cache and rejoining multicast groups. This is the same behavior it performs when it receives a reregistration event. This behavior is unnecessary as an SM may have database backup or synchronization mechanisms which permit the SM location or LID to change without loss of multicast membership and without impact to path records. Both opensm and the OPA FM issue reregistration events if a new SM is started (or restarted with a new config) or an SM event occurs which results in loss of multicast membership records by the SM (such as opensm failover) or the SM encounters new nodes with Active ports (such as after joining 2 fabrics by connecting switches via ISLs). Hence this event can be depended on as the trigger for IPoIB cache and multicast flushing. It appears that some drivers, such as qib, and hfi1 issue the IB_EVENT_SM_CHANGE but other drivers such as mlx4 and mlx5 do not. Empirical testing on Mellanox EDR using ibv_asyncwatch has confirmed that Mellanox EDR HCAs do not generate SM change events and that opensm does generate reregistration. An SM LID change event is generated by the mentioned drivers to reflect that sm_lid and/or sm_sl in the local port info has changed. The intent of this event is to permit applications and ULPs which have a local copy of this information (or an address handle using it) to update their information. The intent is that the reregistration event (caused by the SM via a bit in Set(PortInfo)) be used to inform nodes that they need to rejoin multicast groups, resubscribe for notices and potentially update path records. When an SM migrates or fails over, a SM LID change event can occur. In response IPoIB discards path records and multicast membership and loses connectivity until these records are restored via SA requests. In very large fabrics, it may take minutes for the SM to be ready and for the SA responses to be supplied. This can result in undesirable and unnecessary IPoIB connectivity impacts. It also can result in an unnecessary storm of SA queries from all nodes in a cluster potentially followed by yet another storm if the SM issues the reregistration request. The fact the Mellanox HCAs do not even generate this event, is further evidence that on modern IB fabrics there will be no ill side effects from the proposed changes below to reduce the reaction by 3 kernel components to this event. So these changes should be benign for Mellanox IB fabrics and will benefit OPA fabrics while also making ib_core and ULP behavor "correct" as intended by the IBTA spec and kernel RDMA event APIs. Address these issues by removing IB_EVENT_SM_CHANGE handling from ipoib. IPoIB does not locally store sm_lid nor sm_sl, so it does not need to do anything on SM LID change. IPoIB makes use of other ib_core components to issue SA requests for it and those components correctly track SM LID and SM LID changes. Also in ib_core multicast handling, remove the test for IB_EVENT_SM_CHANGE. This code is moving all multicast groups to the error state, which will trigger rejoins. This code is used by IPoIB as well as the connection manager and other clients of multicast groups. This kernel module centralizes group membership status and joins since a node can only join a given group once but multiple ULPs or applications may want to join the same group. It makes use of the sa_query.c component in ib_core, which correctly trackes SM LID and SL. This component does not track SM LID nor SL itself and hence need not react to their changes. Similarly in the ib_core cache code remove the handling for the IB_EVENT_SM_CHANGE. In this function. The ib_cache_update function which is ultimately called is updating local copies of the pkey table, gid table and lmc. It does not update nor retain sm_lid nor sm_sl. As such it does not need to be called on an SM LID change. It technically also does not need to be called on a reregistration. The LID_CHANGE, PKEY_CHANGE, GID_CHANGE and port state change events (PORT_ERR, PORT_ACTICE) should be sufficient triggers. It is worth noting that the alternative of simply having the hfi1 and qib drivers not generate the SM LID change event was explored. While this would duplicate what Mellanox drivers do now, it is not the correct behavior and removes the ability for an SM to migrate without requiring reregistration. Since both opensm and OPA SM have mechanisms to backup or synchronize registration information, it is desirable to let them perform SM migrations (with LID or SL changes) without requiring reregistration when they deem it appropriate. Suggested-by: Todd Rimmer <todd.rimmer@intel.com> Tested-by: Michael Brooks <michael.brooks@intel.com> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Reviewed-by: Todd Rimmer <todd.rimmer@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-07RDMA/device: Don't fire uevent before device is fully initializedLeon Romanovsky
When the refcount is 0 the device is invisible to netlink. However in the patch below the refcount = 1 was moved to after the device_add(). This creates a race where userspace can issue a netlink query after the device_add() event and not see the device as visible. Ensure that no uevent is fired before device is fully registered. Fixes: d79af7242bb2 ("RDMA/device: Expose ib_device_try_get(()") Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-07RDMA/efa: Add driver to Kconfig/MakefileGal Pressman
Add EFA Makefile and Kconfig. Signed-off-by: Gal Pressman <galpress@amazon.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>