summaryrefslogtreecommitdiff
path: root/drivers/nvme/host
AgeCommit message (Collapse)Author
2019-11-08Merge tag 'for-linus-2019-11-08' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - Two NVMe device removal crash fixes, and a compat fixup for for an ioctl that was introduced in this release (Anton, Charles, Max - via Keith) - Missing error path mutex unlock for drbd (Dan) - cgroup writeback fixup on dead memcg (Tejun) - blkcg online stats print fix (Tejun) * tag 'for-linus-2019-11-08' of git://git.kernel.dk/linux-block: cgroup,writeback: don't switch wbs immediately on dead wbs if the memcg is dead block: drbd: remove a stray unlock in __drbd_send_protocol() blkcg: make blkcg_print_stat() print stats only for online blkgs nvme: change nvme_passthru_cmd64 to explicitly mark rsvd nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths nvme-rdma: fix a segmentation fault during module unload
2019-11-06nvme-multipath: fix crash in nvme_mpath_clear_ctrl_pathsAnton Eidelman
nvme_mpath_clear_ctrl_paths() iterates through the ctrl->namespaces list while holding ctrl->scan_lock. This does not seem to be the correct way of protecting from concurrent list modification. Specifically, nvme_scan_work() sorts ctrl->namespaces AFTER unlocking scan_lock. This may result in the following (rare) crash in ctrl disconnect during scan_work: BUG: kernel NULL pointer dereference, address: 0000000000000050 Oops: 0000 [#1] SMP PTI CPU: 0 PID: 3995 Comm: nvme 5.3.5-050305-generic RIP: 0010:nvme_mpath_clear_current_path+0xe/0x90 [nvme_core] ... Call Trace: nvme_mpath_clear_ctrl_paths+0x3c/0x70 [nvme_core] nvme_remove_namespaces+0x35/0xe0 [nvme_core] nvme_do_delete_ctrl+0x47/0x90 [nvme_core] nvme_sysfs_delete+0x49/0x60 [nvme_core] dev_attr_store+0x17/0x30 sysfs_kf_write+0x3e/0x50 kernfs_fop_write+0x11e/0x1a0 __vfs_write+0x1b/0x40 vfs_write+0xb9/0x1a0 ksys_write+0x67/0xe0 __x64_sys_write+0x1a/0x20 do_syscall_64+0x5a/0x130 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f8d02bfb154 Fix: After taking scan_lock in nvme_mpath_clear_ctrl_paths() down_read(&ctrl->namespaces_rwsem) as well to make list traversal safe. This will not cause deadlocks because taking scan_lock never happens while holding the namespaces_rwsem. Moreover, scan work downs namespaces_rwsem in the same order. Alternative: sort ctrl->namespaces in nvme_scan_work() while still holding the scan_lock. This would leave nvme_mpath_clear_ctrl_paths() without correct protection against ctrl->namespaces modification by anyone other than scan_work. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anton Eidelman <anton@lightbitslabs.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-11-06nvme-rdma: fix a segmentation fault during module unloadMax Gurtovoy
In case there are controllers that are not associated with any RDMA device (e.g. during unsuccessful reconnection) and the user will unload the module, these controllers will not be freed and will access already freed memory. The same logic appears in other fabric drivers as well. Fixes: 87fd125344d6 ("nvme-rdma: remove redundant reference between ib_device and tagset") Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-11-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix free/alloc races in batmanadv, from Sven Eckelmann. 2) Several leaks and other fixes in kTLS support of mlx5 driver, from Tariq Toukan. 3) BPF devmap_hash cost calculation can overflow on 32-bit, from Toke Høiland-Jørgensen. 4) Add an r8152 device ID, from Kazutoshi Noguchi. 5) Missing include in ipv6's addrconf.c, from Ben Dooks. 6) Use siphash in flow dissector, from Eric Dumazet. Attackers can easily infer the 32-bit secret otherwise etc. 7) Several netdevice nesting depth fixes from Taehee Yoo. 8) Fix several KCSAN reported errors, from Eric Dumazet. For example, when doing lockless skb_queue_empty() checks, and accessing sk_napi_id/sk_incoming_cpu lockless as well. 9) Fix jumbo packet handling in RXRPC, from David Howells. 10) Bump SOMAXCONN and tcp_max_syn_backlog values, from Eric Dumazet. 11) Fix DMA synchronization in gve driver, from Yangchun Fu. 12) Several bpf offload fixes, from Jakub Kicinski. 13) Fix sk_page_frag() recursion during memory reclaim, from Tejun Heo. 14) Fix ping latency during high traffic rates in hisilicon driver, from Jiangfent Xiao. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (146 commits) net: fix installing orphaned programs net: cls_bpf: fix NULL deref on offload filter removal selftests: bpf: Skip write only files in debugfs selftests: net: reuseport_dualstack: fix uninitalized parameter r8169: fix wrong PHY ID issue with RTL8168dp net: dsa: bcm_sf2: Fix IMP setup for port different than 8 net: phylink: Fix phylink_dbg() macro gve: Fixes DMA synchronization. inet: stop leaking jiffies on the wire ixgbe: Remove duplicate clear_bit() call Documentation: networking: device drivers: Remove stray asterisks e1000: fix memory leaks i40e: Fix receive buffer starvation for AF_XDP igb: Fix constant media auto sense switching when no cable is connected net: ethernet: arc: add the missed clk_disable_unprepare igb: Enable media autosense for the i350. igb/igc: Don't warn on fatal read failures when the device is removed tcp: increase tcp_max_syn_backlog max value net: increase SOMAXCONN to 4096 netdevsim: Fix use-after-free during device dismantle ...
2019-10-29nvme-multipath: remove unused groups_only mode in ana logAnton Eidelman
groups_only mode in nvme_read_ana_log() is no longer used: remove it. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anton Eidelman <anton@lightbitslabs.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-29nvme-multipath: fix possible io hang after ctrl reconnectAnton Eidelman
The following scenario results in an IO hang: 1) ctrl completes a request with NVME_SC_ANA_TRANSITION. NVME_NS_ANA_PENDING bit in ns->flags is set and ana_work is triggered. 2) ana_work: nvme_read_ana_log() tries to get the ANA log page from the ctrl. This fails because ctrl disconnects. Therefore nvme_update_ns_ana_state() is not called and NVME_NS_ANA_PENDING bit in ns->flags is not cleared. 3) ctrl reconnects: nvme_mpath_init(ctrl,...) calls nvme_read_ana_log(ctrl, groups_only=true). However, nvme_update_ana_state() does not update namespaces because nr_nsids = 0 (due to groups_only mode). 4) scan_work calls nvme_validate_ns() finds the ns and re-validates OK. Result: The ctrl is now live but NVME_NS_ANA_PENDING bit in ns->flags is still set. Consequently ctrl will never be considered a viable path by __nvme_find_path(). IO will hang if ctrl is the only or the last path to the namespace. More generally, while ctrl is reconnecting, its ANA state may change. And because nvme_mpath_init() requests ANA log in groups_only mode, these changes are not propagated to the existing ctrl namespaces. This may result in a mal-function or an IO hang. Solution: nvme_mpath_init() will nvme_read_ana_log() with groups_only set to false. This will not harm the new ctrl case (no namespaces present), and will make sure the ANA state of namespaces gets updated after reconnect. Note: Another option would be for nvme_mpath_init() to invoke nvme_parse_ana_log(..., nvme_set_ns_ana_state) for each existing namespace. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anton Eidelman <anton@lightbitslabs.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-28net: use skb_queue_empty_lockless() in busy poll contextsEric Dumazet
Busy polling usually runs without locks. Let's use skb_queue_empty_lockless() instead of skb_queue_empty() Also uses READ_ONCE() in __skb_try_recv_datagram() to address a similar potential problem. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-18nvme-pci: Set the prp2 correctly when using more than 4k pageKevin Hao
In the current code, the nvme is using a fixed 4k PRP entry size, but if the kernel use a page size which is more than 4k, we should consider the situation that the bv_offset may be larger than the dev->ctrl.page_size. Otherwise we may miss setting the prp2 and then cause the command can't be executed correctly. Fixes: dff824b2aadb ("nvme-pci: optimize mapping of small single segment requests") Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kevin Hao <haokexin@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-10-15nvme-tcp: fix possible leakage during error flowMax Gurtovoy
During nvme_tcp_setup_cmd_pdu error flow, one must call nvme_cleanup_cmd since it's symmetric to nvme_setup_cmd. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-10-14nvme-tcp: Initialize sk->sk_ll_usec only with NET_RX_BUSY_POLLSebastian Andrzej Siewior
The access to sk->sk_ll_usec should be hidden behind CONFIG_NET_RX_BUSY_POLL like the definition of sk_ll_usec. Put access to ->sk_ll_usec behind CONFIG_NET_RX_BUSY_POLL. Fixes: 1a9460cef5711 ("nvme-tcp: support simple polling") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-10-14nvme: Wait for reset state when requiredKeith Busch
Prevent simultaneous controller disabling/enabling tasks from interfering with each other through a function to wait until the task successfully transitioned the controller to the RESETTING state. This ensures disabling the controller will not be interrupted by another reset path, otherwise a concurrent reset may leave the controller in the wrong state. Tested-by: Edmund Nadolski <edmund.nadolski@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-10-14nvme: Prevent resets during paused controller stateKeith Busch
A paused controller is doing critical internal activation work in the background. Prevent subsequent controller resets from occurring during this period by setting the controller state to RESETTING first. A helper function, nvme_try_sched_reset_work(), is introduced for these paths so they may continue with scheduling the reset_work after they've completed their uninterruptible critical section. Tested-by: Edmund Nadolski <edmund.nadolski@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-10-14nvme: Restart request timers in resetting stateKeith Busch
A controller in the resetting state has not yet completed its recovery actions. The pci and fc transports were already handling this, so update the remaining transports to not attempt additional recovery in this state. Instead, just restart the request timer. Tested-by: Edmund Nadolski <edmund.nadolski@intel.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-10-14nvme: Remove ADMIN_ONLY stateKeith Busch
The admin only state was intended to fence off actions that don't apply to a non-IO capable controller. The only actual user of this is the scan_work, and pci was the only transport to ever set this state. The consequence of having this state is placing an additional burden on every other action that applies to both live and admin only controllers. Remove the admin only state and place the admin only burden on the only place that actually cares: scan_work. This also prepares to make it easier to temporarily pause a LIVE state so that we don't need to remember which state the controller had been in prior to the pause. Tested-by: Edmund Nadolski <edmund.nadolski@intel.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-10-14nvme-pci: Free tagset if no IO queuesKeith Busch
If a controller becomes degraded after a reset, we will not be able to perform any IO. We currently teardown previously created request queues and namespaces, but we had kept the unusable tagset. Free it after all queues using it have been released. Tested-by: Edmund Nadolski <edmund.nadolski@intel.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-10-04nvme: retain split access workaround for capability readsArd Biesheuvel
Commit 7fd8930f26be4 "nvme: add a common helper to read Identify Controller data" has re-introduced an issue that we have attempted to work around in the past, in commit a310acd7a7ea ("NVMe: use split lo_hi_{read,write}q"). The problem is that some PCIe NVMe controllers do not implement 64-bit outbound accesses correctly, which is why the commit above switched to using lo_hi_[read|write]q for all 64-bit BAR accesses occuring in the code. In the mean time, the NVMe subsystem has been refactored, and now calls into the PCIe support layer for NVMe via a .reg_read64() method, which fails to use lo_hi_readq(), and thus reintroduces the problem that the workaround above aimed to address. Given that, at the moment, .reg_read64() is only used to read the capability register [which is known to tolerate split reads], let's switch .reg_read64() to lo_hi_readq() as well. This fixes a boot issue on some ARM boxes with NVMe behind a Synopsys DesignWare PCIe host controller. Fixes: 7fd8930f26be4 ("nvme: add a common helper to read Identify Controller data") Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-10-04nvme: fix possible deadlock when nvme_update_formats failsSagi Grimberg
nvme_update_formats may fail to revalidate the namespace and attempt to remove the namespace. This may lead to a deadlock as nvme_ns_remove will attempt to acquire the subsystem lock which is already acquired by the passthru command with effects. Move the invalid namepsace removal to after the passthru command releases the subsystem lock. Reported-by: Judy Brock <judy.brock@samsung.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-27Merge branch 'nvme-5.4' of git://git.infradead.org/nvme into for-linusJens Axboe
Pull NVMe changes from Sagi: "This set consists of various fixes and cleanups: - controller removal race fix from Balbir - quirk additions from Gabriel and Jian-Hong - nvme-pci power state save fix from Mario - Add 64bit user commands (for 64bit registers) from Marta - nvme-rdma/nvme-tcp fixes from Max, Mark and Me - Minor cleanups and nits from James, Dan and John" * 'nvme-5.4' of git://git.infradead.org/nvme: nvme-rdma: fix possible use-after-free in connect timeout nvme: Move ctrl sqsize to generic space nvme: Add ctrl attributes for queue_count and sqsize nvme: allow 64-bit results in passthru commands nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T nvmet-tcp: remove superflous check on request sgl Added QUIRKs for ADATA XPG SX8200 Pro 512GB nvme-rdma: Fix max_hw_sectors calculation nvme: fix an error code in nvme_init_subsystem() nvme-pci: Save PCI state before putting drive into deepest state nvme-tcp: fix wrong stop condition in io_work nvme-pci: Fix a race in controller removal nvmet: change ppl to lpp
2019-09-27nvme-rdma: fix possible use-after-free in connect timeoutSagi Grimberg
If the connect times out, we may have already destroyed the queue in the timeout handler, so test if the queue is still allocated in the connect error handler. Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-26nvme: Move ctrl sqsize to generic spaceKeith Busch
This isn't specific to fabrics. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-25nvme: Add ctrl attributes for queue_count and sqsizeJames Smart
Current controller interrogation requires a lot of guesswork on how many io queues were created and what the io sq size is. The numbers are dependent upon core/fabric defaults, connect arguments, and target responses. Add sysfs attributes for queue_count and sqsize. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-25nvme: allow 64-bit results in passthru commandsMarta Rybczynska
It is not possible to get 64-bit results from the passthru commands, what prevents from getting for the Capabilities (CAP) property value. As a result, it is not possible to implement IOL's NVMe Conformance test 4.3 Case 1 for Fabrics targets [1] (page 123). This issue has been already discussed [2], but without a solution. This patch solves the problem by adding new ioctls with a new passthru structure, including 64-bit results. The older ioctls stay unchanged. [1] https://www.iol.unh.edu/sites/default/files/testsuites/nvme/UNH-IOL_NVMe_Conformance_Test_Suite_v11.0.pdf [2] http://lists.infradead.org/pipermail/linux-nvme/2018-June/018791.html Signed-off-by: Marta Rybczynska <marta.rybczynska@kalray.eu> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-25nvme: Add quirk for Kingston NVME SSD running FW E8FK11.TJian-Hong Pan
Kingston NVME SSD with firmware version E8FK11.T has no interrupt after resume with actions related to suspend to idle. This patch applied NVME_QUIRK_SIMPLE_SUSPEND quirk to fix this issue. Fixes: d916b1be94b6 ("nvme-pci: use host managed power state for suspend") Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=204887 Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-25Added QUIRKs for ADATA XPG SX8200 Pro 512GBGabriel Craciunescu
Booting with default_ps_max_latency_us >6000 makes the device fail. Also SUBNQN is NULL and gives a warning on each boot/resume. $ nvme id-ctrl /dev/nvme0 | grep ^subnqn subnqn : (null) I use this device with an Acer Nitro 5 (AN515-43-R8BF) Laptop. To be sure is not a Laptop issue only, I tested the device on my server board with the same results. ( with 2x,4x link on the board and 4x link on a PCI-E card ). Signed-off-by: Gabriel Craciunescu <nix.or.die@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-25nvme-rdma: Fix max_hw_sectors calculationMax Gurtovoy
By default, the NVMe/RDMA driver should support max io_size of 1MiB (or upto the maximum supported size by the HCA). Currently, one will see that /sys/class/block/<bdev>/queue/max_hw_sectors_kb is 1020 instead of 1024. A non power of 2 value can cause performance degradation due to unnecessary splitting of IO requests and unoptimized allocation units. The number of pages per MR has been fixed here, so there is no longer any need to reduce max_sectors by 1. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-25nvme: fix an error code in nvme_init_subsystem()Dan Carpenter
"ret" should be a negative error code here, but it's either success or possibly uninitialized. Fixes: 32fd90c40768 ("nvme: change locking for the per-subsystem controller list") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-25nvme-pci: Save PCI state before putting drive into deepest stateMario Limonciello
The action of saving the PCI state will cause numerous PCI configuration space reads which depending upon the vendor implementation may cause the drive to exit the deepest NVMe state. In these cases ASPM will typically resolve the PCIe link state and APST may resolve the NVMe power state. However it has also been observed that this register access after quiesced will cause PC10 failure on some device combinations. To resolve this, move the PCI state saving to before SetFeatures has been called. This has been proven to resolve the issue across a 5000 sample test on previously failing disk/system combinations. Signed-off-by: Mario Limonciello <mario.limonciello@dell.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-25nvme-tcp: fix wrong stop condition in io_workWunderlich, Mark
Allow the do/while statement to continue if current time is not after the proposed time 'deadline'. Intent is to allow loop to proceed for a specific time period. Currently the loop, as coded, will exit after first pass. Signed-off-by: Mark Wunderlich <mark.wunderlich@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-24Merge tag 'for-5.4/post-2019-09-24' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull more block updates from Jens Axboe: "Some later additions that weren't quite done for the first pull request, and also a few fixes that have arrived since. This contains: - Kill silly pktcdvd warning on attempting to register a non-scsi passthrough device (me) - Use symbolic constants for the block t10 protection types, and switch to handling it in core rather than in the drivers (Max) - libahci platform missing node put fix (Nishka) - Small series of fixes for BFQ (Paolo) - Fix possible nbd crash (Xiubo)" * tag 'for-5.4/post-2019-09-24' of git://git.kernel.dk/linux-block: block: drop device references in bsg_queue_rq() block: t10-pi: fix -Wswitch warning pktcdvd: remove warning on attempting to register non-passthrough dev ata: libahci_platform: Add of_node_put() before loop exit nbd: fix possible page fault for nbd disk nbd: rename the runtime flags as NBD_RT_ prefixed block, bfq: push up injection only after setting service time block, bfq: increase update frequency of inject limit block, bfq: reduce upper bound for inject limit to max_rq_in_driver+1 block, bfq: update inject limit only after injection occurred block: centralize PI remapping logic to the block layer block: use symbolic constants for t10_pi type
2019-09-23Merge tag 'pci-v5.4-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI updates from Bjorn Helgaas: "Enumeration: - Consolidate _HPP/_HPX stuff in pci-acpi.c and simplify it (Krzysztof Wilczynski) - Fix incorrect PCIe device types and remove dev->has_secondary_link to simplify code that deals with upstream/downstream ports (Mika Westerberg) - After suspend, restore Resizable BAR size bits correctly for 1MB BARs (Sumit Saxena) - Enable PCI_MSI_IRQ_DOMAIN support for RISC-V (Wesley Terpstra) Virtualization: - Add ACS quirks for iProc PAXB (Abhinav Ratna), Amazon Annapurna Labs (Ali Saidi) - Move sysfs SR-IOV functions to iov.c (Kelsey Skunberg) - Remove group write permissions from sysfs sriov_numvfs, sriov_drivers_autoprobe (Kelsey Skunberg) Hotplug: - Simplify pciehp indicator control (Denis Efremov) Peer-to-peer DMA: - Allow P2P DMA between root ports for whitelisted bridges (Logan Gunthorpe) - Whitelist some Intel host bridges for P2P DMA (Logan Gunthorpe) - DMA map P2P DMA requests that traverse host bridge (Logan Gunthorpe) Amazon Annapurna Labs host bridge driver: - Add DT binding and controller driver (Jonathan Chocron) Hyper-V host bridge driver: - Fix hv_pci_dev->pci_slot use-after-free (Dexuan Cui) - Fix PCI domain number collisions (Haiyang Zhang) - Use instance ID bytes 4 & 5 as PCI domain numbers (Haiyang Zhang) - Fix build errors on non-SYSFS config (Randy Dunlap) i.MX6 host bridge driver: - Limit DBI register length (Stefan Agner) Intel VMD host bridge driver: - Fix config addressing issues (Jon Derrick) Layerscape host bridge driver: - Add bar_fixed_64bit property to endpoint driver (Xiaowei Bao) - Add CONFIG_PCI_LAYERSCAPE_EP to build EP/RC drivers separately (Xiaowei Bao) Mediatek host bridge driver: - Add MT7629 controller support (Jianjun Wang) Mobiveil host bridge driver: - Fix CPU base address setup (Hou Zhiqiang) - Make "num-lanes" property optional (Hou Zhiqiang) Tegra host bridge driver: - Fix OF node reference leak (Nishka Dasgupta) - Disable MSI for root ports to work around design problem (Vidya Sagar) - Add Tegra194 DT binding and controller support (Vidya Sagar) - Add support for sideband pins and slot regulators (Vidya Sagar) - Add PIPE2UPHY support (Vidya Sagar) Misc: - Remove unused pci_block_cfg_access() et al (Kelsey Skunberg) - Unexport pci_bus_get(), etc (Kelsey Skunberg) - Hide PM, VC, link speed, ATS, ECRC, PTM constants and interfaces in the PCI core (Kelsey Skunberg) - Clean up sysfs DEVICE_ATTR() usage (Kelsey Skunberg) - Mark expected switch fall-through (Gustavo A. R. Silva) - Propagate errors for optional regulators and PHYs (Thierry Reding) - Fix kernel command line resource_alignment parameter issues (Logan Gunthorpe)" * tag 'pci-v5.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (112 commits) PCI: Add pci_irq_vector() and other stubs when !CONFIG_PCI arm64: tegra: Add PCIe slot supply information in p2972-0000 platform arm64: tegra: Add configuration for PCIe C5 sideband signals PCI: tegra: Add support to enable slot regulators PCI: tegra: Add support to configure sideband pins PCI: vmd: Fix shadow offsets to reflect spec changes PCI: vmd: Fix config addressing when using bus offsets PCI: dwc: Add validation that PCIe core is set to correct mode PCI: dwc: al: Add Amazon Annapurna Labs PCIe controller driver dt-bindings: PCI: Add Amazon's Annapurna Labs PCIe host bridge binding PCI: Add quirk to disable MSI-X support for Amazon's Annapurna Labs Root Port PCI/VPD: Prevent VPD access for Amazon's Annapurna Labs Root Port PCI: Add ACS quirk for Amazon Annapurna Labs root ports PCI: Add Amazon's Annapurna Labs vendor ID MAINTAINERS: Add PCI native host/endpoint controllers designated reviewer PCI: hv: Use bytes 4 and 5 from instance ID as the PCI domain numbers dt-bindings: PCI: tegra: Add PCIe slot supplies regulator entries dt-bindings: PCI: tegra: Add sideband pins configuration entries PCI: tegra: Add Tegra194 PCIe support PCI: Get rid of dev->has_secondary_link flag ...
2019-09-23nvme-pci: Fix a race in controller removalBalbir Singh
User space programs like udevd may try to read to partitions at the same time the driver detects a namespace is unusable, and may deadlock if revalidate_disk() is called while such a process is waiting to enter the frozen queue. On detecting a dead namespace, move the disk revalidate after unblocking dispatchers that may be holding bd_butex. changelog Suggested-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Balbir Singh <sblbir@amzn.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-17block: centralize PI remapping logic to the block layerMax Gurtovoy
Currently t10_pi_prepare/t10_pi_complete functions are called during the NVMe and SCSi layers command preparetion/completion, but their actual place should be the block layer since T10-PI is a general data integrity feature that is used by block storage protocols. Introduce .prepare_fn and .complete_fn callbacks within the integrity profile that each type can implement according to its needs. Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Suggested-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Fixed to not call queue integrity functions if BLK_DEV_INTEGRITY isn't defined in the config. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-17Merge tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: - Two NVMe pull requests: - ana log parse fix from Anton - nvme quirks support for Apple devices from Ben - fix missing bio completion tracing for multipath stack devices from Hannes and Mikhail - IP TOS settings for nvme rdma and tcp transports from Israel - rq_dma_dir cleanups from Israel - tracing for Get LBA Status command from Minwoo - Some nvme-tcp cleanups from Minwoo, Potnuri and Myself - Some consolidation between the fabrics transports for handling the CAP register - reset race with ns scanning fix for fabrics (move fabrics commands to a dedicated request queue with a different lifetime from the admin request queue)." - controller reset and namespace scan races fixes - nvme discovery log change uevent support - naming improvements from Keith - multiple discovery controllers reject fix from James - some regular cleanups from various people - Series fixing (and re-fixing) null_blk debug printing and nr_devices checks (André) - A few pull requests from Song, with fixes from Andy, Guoqing, Guilherme, Neil, Nigel, and Yufen. - REQ_OP_ZONE_RESET_ALL support (Chaitanya) - Bio merge handling unification (Christoph) - Pick default elevator correctly for devices with special needs (Damien) - Block stats fixes (Hou) - Timeout and support devices nbd fixes (Mike) - Series fixing races around elevator switching and device add/remove (Ming) - sed-opal cleanups (Revanth) - Per device weight support for BFQ (Fam) - Support for blk-iocost, a new model that can properly account cost of IO workloads. (Tejun) - blk-cgroup writeback fixes (Tejun) - paride queue init fixes (zhengbin) - blk_set_runtime_active() cleanup (Stanley) - Block segment mapping optimizations (Bart) - lightnvm fixes (Hans/Minwoo/YueHaibing) - Various little fixes and cleanups * tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block: (186 commits) null_blk: format pr_* logs with pr_fmt null_blk: match the type of parameter nr_devices null_blk: do not fail the module load with zero devices block: also check RQF_STATS in blk_mq_need_time_stamp() block: make rq sector size accessible for block stats bfq: Fix bfq linkage error raid5: use bio_end_sector in r5_next_bio raid5: remove STRIPE_OPS_REQ_PENDING md: add feature flag MD_FEATURE_RAID0_LAYOUT md/raid0: avoid RAID0 data corruption due to layout confusion. raid5: don't set STRIPE_HANDLE to stripe which is in batch list raid5: don't increment read_errors on EILSEQ return nvmet: fix a wrong error status returned in error log page nvme: send discovery log page change events to userspace nvme: add uevent variables for controller devices nvme: enable aen regardless of the presence of I/O queues nvme-fabrics: allow discovery subsystems accept a kato nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery() nvme: Remove redundant assignment of cq vector nvme: Assign subsys instance from first ctrl ...
2019-09-12nvme: send discovery log page change events to userspaceSagi Grimberg
If the controller supports discovery log page change events, we want to enable it. When we see a discovery log change event we will send it up to userspace and expect it to handle it. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12nvme: add uevent variables for controller devicesSagi Grimberg
When we send uevents to userspace, add controller specific environment variables to uniquly identify the controller beyond its device name. This will be useful to address discovery log change events by actually verifying that the discovery controller is indeed the same as the device that generated the event. Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12nvme: enable aen regardless of the presence of I/O queuesSagi Grimberg
AENs in general are not related to the presence of I/O queues, so enable them regardless. Note that the only exception is that discovery controller will not support any of the requested AENs and nvme_enable_aen will respect that and return, so it is still safe to enable regardless. Note it is safe to enable AENs even before the initial namespace scanning as we have the scan operation in a workqueue context. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12nvme-fabrics: allow discovery subsystems accept a katoSagi Grimberg
This modifies the behavior of discovery subsystems to accept a kato as a preparation to support discovery log change events. This also means that now every discovery controller will have a default kato value, and for non-persistent connections the host needs to pass in a zero kato value (keep_alive_tmo=0). Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12nvme: Remove redundant assignment of cq vectorIsrael Rukshin
The cq vector is already assigned with the correct value. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12nvme: Assign subsys instance from first ctrlKeith Busch
The namespace disk names must be unique for the lifetime of the subsystem. This was accomplished by using their parent subsystems' instances which were allocated independently from the controllers connected to that subsystem. This allowed name prefixes assigned to namespaces to match a controller from an unrelated subsystem, and has created confusion among users examining device nodes. Ensure a namespace's subsystem instance never clashes with a controller instance of another subsystem by transferring the instance ownership to the parent subsystem from the first controller discovered in that subsystem. Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Minwoo Im <minwoo.im@samsung.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12nvme: tcp: remove redundant assignment to variable retColin Ian King
The variable ret is being initialized with a value that is never read and is being re-assigned immediately afterwards. The assignment is redundant and hence can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12nvme: include admin_q sync with nvme_sync_queuesEdmund Nadolski
nvme_sync_queues currently syncs all namespace queues, but should also sync the admin queue, if present. Signed-off-by: Edmund Nadolski <edmund.nadolski@intel.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12nvme: Treat discovery subsystems as unique subsystemsJames Smart
Current code matches subnqn and collapses all controllers to the same subnqn to a single subsystem structure. This is good for recognizing multiple controllers for the same subsystem. But with the well-known discovery subnqn, the subsystems aren't truly the same subsystem. As such, subsystem specific rules, such as no overlap of controller id, do not apply. With today's behavior, the check for overlap of controller id can fail, preventing the new discovery controller from being created. When searching for like subsystem nqn, exclude the discovery nqn from matching. This will result in each discovery controller being attached to a unique subsystem structure. Signed-off-by: James Smart <jsmart2021@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12nvme: fix ns removal hang when failing to revalidate due to a transient errorSagi Grimberg
If a controller reset is racing with a namespace revalidation, the revalidation (admin) I/O will surely fail, but we should not remove the namespace as we will execute the I/O when the controller is back up. Same for spurious allocation errors (return -ENOMEM). Fix this by checking the specific error code in nvme_revalidate_disk and if it is a transient error (for example non DNR nvme statuses or a negative ENOMEM as allocation failure), do not remove the namespace as it will either recover when the controller is back up and schedule a subsequent scan, or the controller is going away and the namespaces will be removed anyways. This fixes a hang namespace scanning racing with a controller reset and also sporious I/O errors in path failover coditions where the controller reset is racing with the namespace scan work with multipath enabled. Reported-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12nvme: make nvme_report_ns_ids propagate error backSagi Grimberg
Make the callers check the return status and propagate back accordingly (casting to errno from a positive nvme status). Also print the return status in nvme_report_ns_ids. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12nvme: make nvme_identify_ns propagate errors backSagi Grimberg
right now callers of nvme_identify_ns only know that it failed, but don't know why. Make nvme_identify_ns propagate the error back. Because nvme_submit_sync_cmd may return a positive status code, we make nvme_identify_ns receive the id by reference and return that status up the call chain, but make sure not to leak positive nvme status codes to the upper layers. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12nvme: pass status to nvme_error_statusSagi Grimberg
No need for the full blown request structure. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12nvme-fc: Fail transport errors with NVME_SC_HOST_PATHJames Smart
NVME_SC_INTERNAL should indicate an internal controller errors and not host transport errors. These errors will propagate to upper layers (essentially nvme core) and be interpereted as transport errors which should not be taken into account for namespace state or condition. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12nvme-tcp: fail command with NVME_SC_HOST_PATH_ERROR send failedSagi Grimberg
This is a more appropriate error status for a transport error detected by us (the host). Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12nvme: fail cancelled commands with NVME_SC_HOST_PATH_ERRORSagi Grimberg
NVME_SC_ABORT_REQ means that the request was aborted due to an abort command received. In our case, this is a transport cancellation, so host pathing error is much more appropriate. Also, convert NVME_SC_HOST_PATH_ERROR to BLK_STS_TRANSPORT for such that callers can understand that the status is a transport related error. This will be used by the ns scanning code to understand if it got an error from the controller or that the controller happens to be unreachable by the transport. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29nvme-rdma: Use rq_dma_dir macroIsrael Rukshin
Remove code duplication. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>