path: root/drivers/nvme
Age  Commit message  Author
2018-01-09  treewide: Use DEVICE_ATTR_RO  (Joe Perches)
Convert DEVICE_ATTR uses to DEVICE_ATTR_RO where possible. Done with perl script: $ git grep -w --name-only DEVICE_ATTR | \ xargs perl -i -e 'local $/; while (<>) { s/\bDEVICE_ATTR\s*\(\s*(\w+)\s*,\s*\(?(?:\s*S_IRUGO\s*|\s*0444\s*)\)?\s*,\s*\1_show\s*,\s*NULL\s*\)/DEVICE_ATTR_RO(\1)/g; print;}' Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Robert Jarzmik <robert.jarzmik@free.fr> Acked-by: Sagi Grimberg <sagi@grimberg.me> Acked-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Harald Freudenberger <freude@linux.vnet.ibm.com> Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Corey Minyard <cminyard@mvista.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
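A minimal illustration of the conversion (the "model" attribute and its show routine are made up for the example; the script only rewrote attributes that were already read-only):

#include <linux/device.h>

static ssize_t model_show(struct device *dev, struct device_attribute *attr,
			  char *buf)
{
	return sprintf(buf, "example\n");
}

/* before: static DEVICE_ATTR(model, S_IRUGO, model_show, NULL); */
static DEVICE_ATTR_RO(model);	/* 0444 attribute wired to model_show() */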
2018-01-08  nvme: fix subsystem multiple controllers support check  (Israel Rukshin)
There is a problem when another module (e.g. nvmet) takes a reference on the nvme block device and the physical nvme drive is removed. In that case nvme_free_ctrl() will not be called and the controller state will be "deleting" or "dead" unless nvmet module releases the block device. Later on, the same nvme drive probes back and nvme_init_subsystem() will be called and fail due to duplicate subnqn (if the nvme device doesn't support subsystem with multiple controllers). This will cause a probe failure. This commit changes the check of multiple controllers support at nvme_init_subsystem() by not counting all the controllers at "dead" or "deleting" state (this is safe because controllers at this state will never be active again). Fixes: ab9e00cc72fa ("nvme: track subsystems") Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
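A sketch of the adjusted check (helper and field names follow the commit description and may not match the tree exactly): controllers in the "deleting" or "dead" state are simply skipped when counting, since they can never become active again.

/* caller is expected to hold the subsystem lock */
static int nvme_active_ctrls(struct nvme_subsystem *subsys)
{
	struct nvme_ctrl *ctrl;
	int count = 0;

	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
		if (ctrl->state == NVME_CTRL_DELETING ||
		    ctrl->state == NVME_CTRL_DEAD)
			continue;	/* will never be active again */
		count++;
	}
	return count;
}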
2018-01-08  nvme: take refcount on transport module  (Nitzan Carmi)
The block device is backed by the transport so we must ensure that the transport driver will not be removed until all references are released. Otherwise, we might end up referencing freed memory. Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Nitzan Carmi <nitzanc@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08  nvme-pci: fix NULL pointer dereference in nvme_alloc_ns  (Jianchao Wang)
When io queue setup or tagset allocation fails, ctrl.tagset is NULL, but the scan work will still be queued and executed; a panic then comes up due to a NULL pointer dereference of ctrl.tagset. To fix this, add a new ctrl state NVME_CTRL_ADMIN_ONLY to indicate that only the admin queue is live. When io queue setup or tagset allocation fails, the ctrl enters this state and the scan work is not started, but the async event work and the nvme dev ioctls remain available. This is helpful for further investigation and recovery. Suggested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
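A sketch of the resulting gating (state, workqueue and field names follow the nvme core; treat it as illustrative, not the exact patch):

/* Only a fully LIVE controller (I/O queues + tagset present) gets scanned;
 * an ADMIN_ONLY controller keeps its admin queue, AEN handling and ioctls. */
static void nvme_maybe_queue_scan(struct nvme_ctrl *ctrl)
{
	if (ctrl->state == NVME_CTRL_LIVE)
		queue_work(nvme_wq, &ctrl->scan_work);
}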
2018-01-08  nvme: modify the debug level for setting shutdown timeout  (Max Gurtovoy)
When an NVMe controller reports RTD3 Entry Latency larger than the value of shutdown_timeout module parameter, we update the shutdown_timeout accordingly to honor RTD3 Entry Latency. Use an informational debug level instead of a warning level for it. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08  nvme-pci: don't open-code nvme_reset_ctrl  (Sagi Grimberg)
Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08  nvmet: rearrange nvmet_ctrl_free()  (Israel Rukshin)
Make it symmetric to nvmet_alloc_ctrl(). Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08  nvmet: fix error flow in nvmet_alloc_ctrl()  (Israel Rukshin)
Remove the allocated id on error. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
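The pattern being fixed, as a generic sketch (structure and helper names are invented for illustration): an id handed out by ida_simple_get() must be returned on any later failure in the allocation path.

#include <linux/idr.h>
#include <linux/slab.h>

static DEFINE_IDA(example_ida);

struct example_ctrl {
	int cntlid;
};

static int example_setup(struct example_ctrl *ctrl);	/* stand-in for later init steps */

static struct example_ctrl *example_alloc_ctrl(void)
{
	struct example_ctrl *ctrl;
	int id;

	ctrl = kzalloc(sizeof(*ctrl), GFP_KERNEL);
	if (!ctrl)
		return NULL;

	id = ida_simple_get(&example_ida, 0, 0, GFP_KERNEL);
	if (id < 0)
		goto out_free_ctrl;
	ctrl->cntlid = id;

	if (example_setup(ctrl))		/* a later step fails... */
		goto out_release_id;		/* ...so give the id back */

	return ctrl;

out_release_id:
	ida_simple_remove(&example_ida, id);	/* the previously missing cleanup */
out_free_ctrl:
	kfree(ctrl);
	return NULL;
}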
2018-01-08  nvme-pci: remove an unnecessary initialization in HMB code  (Minwoo Im)
The local variable __size__ will be set a bit later in a for-loop. Remove the explicit initialization at the beginning of this function. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08  nvme-fabrics: protect against module unload during create_ctrl  (Roy Shterman)
NVMe transport driver module unload may (and usually does) trigger iteration over the active controllers and delete them all (sometimes under a mutex). However, a controller can be created concurrently with module unload, which can lead to leakage of resources (most importantly, char device node leakage) in case the controller creation occurred after the unload delete and drain sequence. To protect against this, we take a module reference to guarantee that the nvme transport driver is not unloaded while creating a controller. Signed-off-by: Roy Shterman <roys@lightbitslabs.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
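Roughly, the guard looks like the sketch below; the ops->module field is what the commit describes, while the wrapper function and its name are illustrative only.

static struct nvme_ctrl *create_ctrl_guarded(struct nvmf_transport_ops *ops,
					     struct device *dev,
					     struct nvmf_ctrl_options *opts)
{
	struct nvme_ctrl *ctrl;

	/* Pin the transport module so it cannot be unloaded underneath us. */
	if (!try_module_get(ops->module))
		return ERR_PTR(-EBUSY);

	ctrl = ops->create_ctrl(dev, opts);
	if (IS_ERR(ctrl))
		module_put(ops->module);	/* creation failed: drop the pin */
	/* on success the put is paired with the controller's release path */
	return ctrl;
}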
2018-01-08  nvmet-fc: cleanup nvmet add_port/remove_port  (James Smart)
The current fc transport add_port routine validates that there is a matching port to the target port config. It then takes a reference on the targetport. The del_port removes the reference. Unfortunately, if the LLDD undergoes a hw reset or driver unload and wants to unreg the targetport, the targetport effectively can't be removed due to the reference. It requires the admin to remove the port from the nvmet config first, which calls del_port. Note: it appears nvmetcli clear skips over the del_port call (I'm not attempting to change that). There's no real reason to take the reference. With FC there is nothing to enable or disable: the presence of the FC targetport implicitly means it's enabled, and removal of the targetport means it's disabled. Change add_port to simply validate, and change remove_port to a noop. No references are taken on the targetport. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08  nvme_fcloop: refactor host/target io job access  (James Smart)
The split between what the host accesses on its flows vs what the target side accesses was flawed. Abort handling didn't properly clear initiator vs target structure cross-reference and locks weren't used for synchronization. Thus, there were issues of freeing structures too soon and access after free. A couple of these existed pre the IN_ISR mods, but when the target upcalls were converted to work items, thus adding delays between the 2 sides of accesses, the problems became pronounced. Resolve by: - tracking io state mainly in the tgt-side io structure. - make the tgt-side io structure released by reference not by code flow. - when changing initiator structures, use locks for synchronization - aborts are clearly tracked for which side saw the abort, and after seeing the abort, cross-references are cleared under lock. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08  nvme_fcloop: rework to remove xxx_IN_ISR feature flags  (James Smart)
The existing fcloop driver expects the target side upcalls to the transport to context switch, thus the calls into the nvmet layer are not done in the calling context of the host/initiator down calls. The xxx_IN_ISR feature flags are used to select this logic. The xxx_IN_ISR feature flags should go away in the nvmet_fc transport as no other lldd utilizes them. Both Broadcom and Cavium lldds have their own non-ISR deferred handlers thus the nvmet calls can be made directly. This patch converts the paths that make the target upcalls (command receive, abort receive) such that they schedule a work item rather than expecting the transport to schedule the work item. The patch also cleans up the following: - The completion path from target to host scheduled a host work element called "work". Rename it "tio_done_work" for code clarity. - The abort io path called a iniwork item to call the host side io done. This is no longer needed as the abort routine can make the same call. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08  nvme_fcloop: disassociate local port structs  (James Smart)
The current fcloop driver gets its lport structure from the private area co-allocated with the fc_localport. All is fine except the teardown path, which wants to wait on the completion, which is marked complete by the delete_localport callback performed after unregister_localport. The issue is that the nvme_fc transport frees the localport structure immediately after delete_localport is called, meaning the original routine is trying to wait on a completion that was just freed. Change such that a lport struct is allocated coincident with the addition and registration of a localport. The private area of the localport now contains just a backpointer to the real lport struct. Now the completion can be waited for, and after completing, the new structure can be kfree'd. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08  nvme_fcloop: fix abort race condition  (James Smart)
A test case revealed a race condition of an i/o completing on a thread parallel to the delete_association generating the aborts for the outstanding ios on the controller. The i/o completion was freeing the target fcloop context, thus the abort task referenced the just-freed memory. Correct by clearing the target/initiator cross pointers in the io completion and abort tasks before calling the callbacks. On aborts that detect already finished io's, ensure the complete context is called. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08  nvmet: lower log level for each queue creation  (Sagi Grimberg)
It is a bit chatty to report on each queue, log it only for debug purposes. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08  nvmet-rdma: lowering log level for chatty debug messages  (Sagi Grimberg)
It is a bit chatty to report on every deleted queue, so keep it for debug purposes only. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08  nvmet-rdma: removed queue cleanup from module exit  (Sagi Grimberg)
We already do that when we are notified in device removal which is triggered when unregistering as an ib client. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08  nvme-fabrics: initialize default host->id in nvmf_host_default()  (Ewan D. Milne)
The field was uninitialized before use. Signed-off-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-06  nvmet/rdma: Use sgl_alloc() and sgl_free()  (Bart Van Assche)
Use the sgl_alloc() and sgl_free() functions instead of open coding these functions. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06  nvmet/fc: Use sgl_alloc() and sgl_free()  (Bart Van Assche)
Use the sgl_alloc() and sgl_free() functions instead of open coding these functions. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: James Smart <james.smart@broadcom.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
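Both the RDMA and FC target conversions follow the same pattern, roughly as sketched below; the request structure here is a made-up stand-in, and the sgl_alloc()/sgl_free() helpers are the lib/scatterlist ones the series introduces.

#include <linux/scatterlist.h>

struct example_req {
	struct scatterlist *sg;
	unsigned int sg_cnt;
};

static int example_alloc_sgl(struct example_req *req, u64 length)
{
	/* builds the sg table and its backing pages in one call */
	req->sg = sgl_alloc(length, GFP_KERNEL, &req->sg_cnt);
	if (!req->sg)
		return -ENOMEM;
	return 0;
}

static void example_free_sgl(struct example_req *req)
{
	sgl_free(req->sg);	/* frees the pages and the sg table */
	req->sg = NULL;
	req->sg_cnt = 0;
}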
2018-01-05  lightnvm: make geometry structures 2.0 ready  (Matias Bjørling)
Prepare for the 2.0 revision by adapting the geometry structures to coexist with the 1.2 revision. Signed-off-by: Matias Bjørling <m@bjorling.me> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-05  lightnvm: remove lower page tables  (Matias Bjørling)
The lower page table is unused. All page tables reported by 1.2 devices report a sequential 1:1 page mapping. This is also not used going forward with the 2.0 revision. Signed-off-by: Matias Bjørling <m@bjorling.me> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-05  lightnvm: remove hybrid ocssd 1.2 support  (Matias Bjørling)
Now that rrpc has been removed, also remove the hybrid 1.2 support from the core. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-12-29  nvme-fcloop: avoid possible uninitialized variable warning  (James Smart)
The kbuild test robot sent mail about a potential use of an uninitialized variable - "tport" in fcloop_delete_targetport(), which then calls __targetport_unreg(), which uses the variable. It will never actually be uninitialized, as the call to __targetport_unreg() only occurs if there is a valid nport pointer, and at the time the nport pointer is assigned, the tport variable is set. Remove the warning by assigning a NULL value initially. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-12-29  nvme-mpath: fix last path removal during traffic  (Sagi Grimberg)
In case our last path is removed during traffic, we can end up requeueing the bio(s) but never schedule the actual requeue work as upper layers still have open handles on the mpath device node. Fix this by scheduling requeue work if the namespace being removed is the last path in the ns_head path list. Fixes: 32acab3181c7 ("nvme: implement multipath access to nvme subsystems") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-12-29  nvme-rdma: fix concurrent reset and reconnect  (Sagi Grimberg)
The ctrl state machine now allows a transition from RESETTING to RECONNECTING. In nvme-rdma, when we receive an rdma cm DISCONNECTED event, we trigger nvme_rdma_error_recovery. This also happens when we execute a controller reset, issue a cm disconnect request and receive a cm disconnect reply; as a result, the reset work and the error recovery work can run concurrently. Until now the state machine prevented the error recovery work from running as a result of a controller reset (RESETTING -> RECONNECTING was not allowed). To fix this, we adopt the FC state machine approach: we always transition from LIVE to RESETTING and only then to RECONNECTING. We do this both for the error recovery work and the controller reset work: 1. transition to RESETTING 2. teardown the controller association 3. transition to RECONNECTING This restores the protection against the reset work and the error recovery work running concurrently. Fixes: 3cec7f9de448 ("nvme: allow controller RESETTING to RECONNECTING transition") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-12-29  nvme: fix sector units when going between formats  (Jeff Lien)
If you format a device with a 4k sector size back to 512 bytes, the queue limit values for physical block size and minimum IO size were not getting updated; only the logical block size was being updated. This patch adds code to update the physical block and IO minimum sizes. Signed-off-by: Jeff Lien <jeff.lien@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
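The shape of the change, as a hedged sketch (function name invented; the queue-limit setters are the standard block-layer ones): when the logical block size changes after a format, the related limits are refreshed alongside it.

#include <linux/blkdev.h>

static void example_update_block_limits(struct request_queue *q, unsigned int bs)
{
	blk_queue_logical_block_size(q, bs);
	blk_queue_physical_block_size(q, bs);	/* previously left stale */
	blk_queue_io_min(q, bs);		/* previously left stale */
}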
2017-12-29  nvme-pci: move use_sgl initialization to nvme_init_iod()  (Minwoo Im)
The "use_sgl" flag of "struct nvme_iod" was used in nvme_init_iod() without being set to any value; it was only set later, in either nvme_pci_setup_prps() or nvme_pci_setup_sgls(), both of which run after nvme_init_iod(). Set "iod->use_sgl" in the proper place, nvme_init_iod(), and move nvme_pci_use_sgls() up above nvme_init_iod() so it can be called from there. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-12-15  nvme: setup streams after initializing namespace head  (Keith Busch)
Fixes a NULL pointer dereference. Reported-by: Arnav Dawn <a.dawn@samsung.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-12-15  nvme: check hw sectors before setting chunk sectors  (Keith Busch)
Some devices with IDs matching the "stripe" quirk don't actually have this quirk, and don't have an MDTS value. When MDTS is not set, the driver sets the max sectors to UINT_MAX, which is not a power of 2, hitting a BUG_ON from blk_queue_chunk_sectors. This patch skips setting chunk sectors for such devices. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
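A sketch of the guard (wrapper name and exact placement are illustrative): the stripe boundary is only programmed when max_hw_sectors is a real power of two, so quirk-matched devices without MDTS (max sectors = UINT_MAX) no longer trip the BUG_ON.

#include <linux/log2.h>

static void example_set_chunk_size(struct request_queue *q,
				   struct nvme_ctrl *ctrl)
{
	if ((ctrl->quirks & NVME_QUIRK_STRIPE_SIZE) &&
	    is_power_of_2(ctrl->max_hw_sectors))
		blk_queue_chunk_sectors(q, ctrl->max_hw_sectors);
}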
2017-12-15  nvme: call blk_integrity_unregister after queue is cleaned up  (Ming Lei)
During IO complete path, bio_integrity_advance() is often called, and blk_get_integrity() is called in this function. But in blk_integrity_unregister, the buffer pointed by queue->integrity is cleared, and blk_integrity->profile becomes NULL, then blk_get_integrity returns NULL, and causes kernel oops[1] finally. This patch fixes this issue by calling blk_integrity_unregister() after blk_cleanup_queue(). [1] kernel oops log [ 122.068007] BUG: unable to handle kernel NULL pointer dereference at 000000000000000a [ 122.076760] IP: bio_integrity_advance+0x3d/0xf0 [ 122.081815] PGD 0 P4D 0 [ 122.084641] Oops: 0000 [#1] SMP [ 122.088142] Modules linked in: sunrpc ipmi_ssif intel_rapl vfat fat x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass mei_me ipmi_si crct10dif_pclmul crc32_pclmul sg mei ghash_clmulni_intel mxm_wmi ipmi_devintf iTCO_wdt intel_cstate intel_uncore pcspkr intel_rapl_perf iTCO_vendor_support dcdbas ipmi_msghandler lpc_ich acpi_power_meter shpchp wmi dm_multipath ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel ahci nvme tg3 libahci nvme_core i2c_core libata ptp megaraid_sas pps_core dm_mirror dm_region_hash dm_log dm_mod [ 122.149577] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.14.0-11.el7a.x86_64 #1 [ 122.157635] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017 [ 122.166179] task: ffff8802ff1e8000 task.stack: ffffc90000130000 [ 122.172785] RIP: 0010:bio_integrity_advance+0x3d/0xf0 [ 122.178419] RSP: 0018:ffff88047fc03d70 EFLAGS: 00010006 [ 122.184248] RAX: ffff880473b08000 RBX: ffff880458c71a80 RCX: ffff880473b08248 [ 122.192209] RDX: 0000000000000000 RSI: 000000000000003c RDI: ffffc900038d7ba0 [ 122.200171] RBP: ffff88047fc03d78 R08: 0000000000000001 R09: ffffffffa01a78b5 [ 122.208132] R10: ffff88047fc1eda0 R11: ffff880458c71ad0 R12: 0000000000007800 [ 122.216094] R13: 0000000000000000 R14: 0000000000007800 R15: ffff880473a39b40 [ 122.224056] FS: 0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000 [ 122.233083] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 122.239494] CR2: 000000000000000a CR3: 0000000001c09002 CR4: 00000000001606e0 [ 122.247455] Call Trace: [ 122.250183] <IRQ> [ 122.252429] bio_advance+0x28/0xf0 [ 122.256217] blk_update_request+0xa1/0x310 [ 122.260778] blk_mq_end_request+0x1e/0x70 [ 122.265256] nvme_complete_rq+0x1c/0xd0 [nvme_core] [ 122.270699] nvme_pci_complete_rq+0x85/0x130 [nvme] [ 122.276140] __blk_mq_complete_request+0x8d/0x140 [ 122.281387] blk_mq_complete_request+0x16/0x20 [ 122.286345] nvme_process_cq+0xdd/0x1c0 [nvme] [ 122.291301] nvme_irq+0x23/0x50 [nvme] [ 122.295485] __handle_irq_event_percpu+0x3c/0x190 [ 122.300725] handle_irq_event_percpu+0x32/0x80 [ 122.305683] handle_irq_event+0x3b/0x60 [ 122.309964] handle_edge_irq+0x8f/0x190 [ 122.314247] handle_irq+0xab/0x120 [ 122.318043] do_IRQ+0x48/0xd0 [ 122.321355] common_interrupt+0x9d/0x9d [ 122.325625] </IRQ> [ 122.327967] RIP: 0010:cpuidle_enter_state+0xe9/0x280 [ 122.333504] RSP: 0018:ffffc90000133e68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff35 [ 122.341952] RAX: ffff88047fc1b900 RBX: ffff88047fc24400 RCX: 000000000000001f [ 122.349913] RDX: 0000000000000000 RSI: fffffcf2e6007295 RDI: 0000000000000000 [ 122.357874] RBP: ffffc90000133ea0 R08: 000000000000062e R09: 0000000000000253 [ 122.365836] R10: 0000000000000225 R11: 0000000000000018 R12: 0000000000000002 [ 122.373797] R13: 0000000000000001 R14: ffff88047fc24400 R15: 0000001c6bd1d263 [ 
122.381762] ? cpuidle_enter_state+0xc5/0x280 [ 122.386623] cpuidle_enter+0x17/0x20 [ 122.390611] call_cpuidle+0x23/0x40 [ 122.394501] do_idle+0x17e/0x1f0 [ 122.398101] cpu_startup_entry+0x73/0x80 [ 122.402478] start_secondary+0x178/0x1c0 [ 122.406854] secondary_startup_64+0xa5/0xa5 [ 122.411520] Code: 48 8b 5f 68 48 8b 47 08 31 d2 4c 8b 5b 48 48 8b 80 d0 03 00 00 48 83 b8 48 02 00 00 00 48 8d 88 48 02 00 00 48 0f 45 d1 c1 ee 09 <0f> b6 4a 0a 0f b6 52 09 89 f0 48 01 73 08 83 e9 09 d3 e8 0f af [ 122.432604] RIP: bio_integrity_advance+0x3d/0xf0 RSP: ffff88047fc03d70 [ 122.439888] CR2: 000000000000000a Reported-by: Zhang Yi <yizhan@redhat.com> Tested-by: Zhang Yi <yizhan@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
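The fix is purely an ordering change in namespace removal, roughly as sketched below (simplified; the real path also handles sysfs and multipath teardown):

static void example_ns_remove(struct nvme_ns *ns)
{
	del_gendisk(ns->disk);
	blk_cleanup_queue(ns->queue);		/* no more completions after this */
	blk_integrity_unregister(ns->disk);	/* now safe to drop the profile */
}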
2017-12-15  nvme-fc: remove double put reference if admin connect fails  (James Smart)
There are two put references in the failure case of initial create_association. The first put actually frees the controller, thus the second put references freed memory. Remove the unnecessary 2nd put. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-12-15  nvme: set discard_alignment to zero  (David Disseldorp)
Similar to 7c084289795b ("rbd: set discard_alignment to zero"), NVMe devices are currently incorrectly initialised with the block queue discard_alignment set to the NVMe stream alignment. As per Documentation/ABI/testing/sysfs-block: The discard_alignment parameter indicates how many bytes the beginning of the device is offset from the internal allocation unit's natural alignment. Correcting the discard_alignment parameter to zero has no effect on how discard requests are propagated through the block layer - @alignment in __blkdev_issue_discard() remains zero. However, it does fix other consumers, such as LIO's Block Limits VPD response. Signed-off-by: David Disseldorp <ddiss@suse.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
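In queue-limit terms the change is tiny; the sketch below shows the intent (function name and the stream-size parameter are illustrative, not the exact driver code):

static void example_config_discard(struct nvme_ns *ns, unsigned int stream_size)
{
	/* discard_alignment describes how far the start of the device is
	 * offset from its internal allocation unit; it is not a granularity,
	 * so zero is the correct value (it was previously set to the stream
	 * alignment). */
	ns->queue->limits.discard_alignment = 0;
	ns->queue->limits.discard_granularity = stream_size;
}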
2017-11-28  nvme-pci: fix NULL pointer dereference in nvme_free_host_mem()  (Minwoo Im)
The following condition, which causes a NULL pointer dereference, can occur in nvme_free_host_mem() when removing the pci device via nvme_remove(), especially after a failure of host memory allocation for the HMB: "(host_mem_descs == NULL) && (nr_host_mem_descs != 0)". It happens because nr_host_mem_descs is not cleared to 0, unlike host_mem_descs. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-28  nvme-rdma: fix memory leak during queue allocation  (Max Gurtovoy)
In case the nvme_rdma_wait_for_cm timeout expires before we get an established or rejected event (rdma_connect succeeded) from rdma_cm, we end up leaking the ib transport resources for the dedicated queue. This scenario is easily reproduced using a traffic test during port toggling. Also, in order to protect against parallel ib queue destruction, which may be invoked from different contexts, introduce a new flag that stands for transport readiness. While we're here, also protect against receiving rdma_cm events during ib queue destruction. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-26  nvme-rdma: Use mr pool  (Israel Rukshin)
Currently, blk_mq_tagset_iter() iterates over the initial hctx tags only. If an I/O scheduler is used, it doesn't iterate the hctx scheduler tags, and the static requests aren't updated. For example, while using an NVMe over Fabrics RDMA host, this causes us not to reinit the scheduler requests and thus not re-register all the memory regions during the tagset re-initialization in the reconnect flow. This may lead to a memory registration error: "MEMREG for CQE 0xffff88044c14dce8 failed with status memory management operation error (6)" With this commit we don't need to reinit the requests, and thus fix this failure. Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-26  nvme-rdma: Check remotely invalidated rkey matches our expected rkey  (Sagi Grimberg)
If we got a remote invalidation on a bogus rkey, this is a protocol error. Fail the connection in this case. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-26  nvme-rdma: wait for local invalidation before completing a request  (Sagi Grimberg)
We must not complete a request before the host memory region is invalidated. Luckily we have send with invalidate protocol support so we usually don't need to execute it, but in case the target did not invalidate a memory region for us, we must wait for the invalidation to complete before unmapping host memory and completing the I/O. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-26  nvme-rdma: don't complete requests before a send work request has completed  (Sagi Grimberg)
In order to guarantee that the HCA will never get an access violation (either from invalidated rkey or from iommu) when retrying a send operation we must complete a request only when both send completion and the nvme cqe has arrived. We need to set the send/recv completions flags atomically because we might have more than a single context accessing the request concurrently (one is cq irq-poll context and the other is user-polling used in IOCB_HIPRI). Only then we are safe to invalidate the rkey (if needed), unmap the host buffers, and complete the IO. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-26  nvme-rdma: don't suppress send completions  (Sagi Grimberg)
The entire completions suppress mechanism is currently broken because the HCA might retry a send operation (due to dropped ack) after the nvme transaction has completed. In order to handle this, we signal all send completions and introduce a separate done handler for async events as they will be handled differently (as they don't include in-capsule data by definition). Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-24  nvme-fc: don't use bit masks for set/test_bit() numbers  (Jens Axboe)
So far harmless, but it's confusing and a bug waiting to happen if the shifts grow larger than 4. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
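The distinction, as a standalone sketch (flag names invented): the bitops helpers take bit numbers, so passing a mask only works by accident while the values stay small.

#include <linux/bitops.h>
#include <linux/types.h>

enum {
	EXAMPLE_ASSOC_ACTIVE = 0,	/* bit numbers, not masks */
	EXAMPLE_ASSOC_FAILED = 1,
};

static void example_mark_active(unsigned long *flags)
{
	set_bit(EXAMPLE_ASSOC_ACTIVE, flags);
	/* set_bit(1 << 4, flags) would actually set bit 16, not bit 4 */
}

static bool example_is_active(unsigned long *flags)
{
	return test_bit(EXAMPLE_ASSOC_ACTIVE, flags);
}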
2017-11-23  nvme-pci: add quirk for delay before CHK RDY for WDC SN200  (Jeff Lien)
And increase the existing delay to cover this device as well. Cc: stable@vger.kernel.org Signed-off-by: Jeff Lien <jeff.lien@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-20  nvmet-fc: correct ref counting error when deferred rcv used  (James Smart)
Whenever a cmd is received, a reference is taken while looking up the queue. The reference is removed after the cmd is done, as the iod is returned for reuse. The fod may be reused for a deferred (received but no job context) cmd. Existing code removes the reference only if the fod is not reused for another command. Given the fod may be used for one or more ios, and although a reference was taken per io, the references won't be matched on the frees. Remove the reference on every fod free. This pairs the references to each io. Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-20  nvme: Suppress static analysis warning  (Keith Busch)
The ns->head is always valid, so we don't need to check for NULL. Reported-by: Dan Carpenter <dan.caprenter@oracle.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-20  nvme: Fix NULL dereference on reservation request  (Keith Busch)
This fixes using the NULL 'head' before getting the reference. It is however possible the head will always be NULL, so this patch uses the struct nvme_ns to get the ns_id field. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-20  nvme: fix spelling mistake: "requeing" -> "requeuing"  (Colin Ian King)
Trivial fix to spelling mistake in dev_warn_ratelimited message text Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-20  nvme-pci: avoid hmb desc array idx out-of-bound when hmmaxd set  (Minwoo Im)
An hmb descriptor idx out-of-bounds access occurs under the following conditions: preferred = 128MiB, chunk_size = 4MiB, hmmaxd = 1. In that case the current code does not allow rmmod, which frees the hmb descriptors, to complete successfully: "descs[i]" is set in the for-loop without checking any condition related to "max_entries", even though only a single "descs" entry was allocated (max_entries = 1). Add a condition to the for-loop to check the descriptor index. Fixes: 044a9df1 ("nvme-pci: implement the HMB entry number and size limitations") Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
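The shape of the fix, in a simplified sketch (descriptor layout and names are illustrative): the fill loop now also stops at max_entries.

#include <linux/types.h>

struct example_desc {
	u64 size;
};

static int example_fill_descs(struct example_desc *descs, int max_entries,
			      u64 preferred, u64 chunk_size)
{
	u64 size;
	int i;

	/* the added "i < max_entries" bound keeps descs[i] in range even
	 * when hmmaxd limits the array to a single entry */
	for (size = 0, i = 0; size < preferred && i < max_entries; i++) {
		descs[i].size = chunk_size;
		size += chunk_size;
	}
	return i;	/* number of descriptors actually used */
}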
2017-11-20  nvme-pci: disable APST on Samsung SSD 960 EVO + ASUS PRIME B350M-A  (Kai-Heng Feng)
The NVMe device in question drops off the PCIe bus after system suspend. I've tried several approaches to work around this issue, but none of them works: - NVME_QUIRK_DELAY_BEFORE_CHK_RDY - NVME_QUIRK_NO_DEEPEST_PS - Disable APST before controller shutdown - Delay between controller shutdown and system suspend - Explicitly set power state to 0 before controller shutdown Fortunately it's a desktop, so disabling APST won't hurt the battery. Also, change the quirk function name to reflect that it's for vendor combination quirks. BugLink: https://bugs.launchpad.net/bugs/1705748 Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-20  nvme-loop: check if queue is ready in queue_rq  (Sagi Grimberg)
In case the queue is not LIVE (fully functional and connected at the nvmf level), we cannot allow any commands other than connect to pass through. Add a new queue state flag NVME_LOOP_Q_LIVE which is set after nvmf connect and cleared in queue teardown. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
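A sketch of the check (the helper, its signature, and the use of NVME_LOOP_Q_LIVE as a bit number are illustrative; the fabrics opcode/fctype constants come from the nvme headers): anything that is not a fabrics connect command is rejected until the queue is marked LIVE.

static blk_status_t example_check_ready(unsigned long *queue_flags,
					struct nvme_command *cmd)
{
	if (test_bit(NVME_LOOP_Q_LIVE, queue_flags))
		return BLK_STS_OK;

	/* queue not live yet: only the fabrics connect command may pass */
	if (cmd->common.opcode == nvme_fabrics_command &&
	    cmd->fabrics.fctype == nvme_fabrics_type_connect)
		return BLK_STS_OK;

	return BLK_STS_IOERR;
}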