Age | Commit message (Collapse) | Author |
|
If dm-mpath encounters an reservation conflict it should not fail the
path (as communication with the target is not affected) but should
rather retry on another path. However, in doing so we might be inducing
a ping-pong between paths, with no guarantee of any forward progress.
And arguably a reservation conflict is an unexpected error, so we should
be passing it upwards to allow the application to take appropriate
steps.
This change resolves a show-stopper problem seen with the pNFS SCSI
layout because it is trivial to hit reservation conflict based failover
loops without it.
Doubts were raised about the implications of this change relative to
products like IBM's SVC. But there is little point withholding a fix
for Linux because a proprietary product may or may not have some issues
in its implementation of how it interfaces with Linux. In the future,
if there is glaring evidence that this change is certainly problematic
we can revisit it.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com> # tweaked header
|
|
Return DM_MAPIO_DELAY_REQUEUE from .clone_and_map_rq. Also, return
false from .busy, if all paths are down, so that blk-mq requests get
mapped via .clone_and_map_rq -- which results in DM_MAPIO_DELAY_REQUEUE
being returned to dm-rq.
This change allows for a noticeable reduction in cpu utilization
(reduced kworker load) while all paths are down, e.g.:
system CPU idleness (as measured by fio's --idle-prof=system):
before: system: 86.58%
after: system: 98.60%
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
|
|
When reinstating a path the blk-mq request_queue's requeue_list should
get kicked. It makes sense to kick the requeue_list as part of the
existing hook (previously only used by bio-based support).
Rename process_queued_bios_list to process_queued_io_list.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
|
|
Use autoremove_wake_function() instead of default_wake_function()
to make the dm wait loops more similar to other wait loops in the
kernel. This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
If pg_init_retries is set and a request is queued against a multipath
device with all underlying block device request_queues in the "dying"
state then an infinite loop is triggered because activate_path() never
succeeds and hence never calls pg_init_done().
This change avoids that device removal triggers an infinite loop by
failing the activate_path() which causes the "dying" path to be failed.
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.
No intended functional changes in this commit.
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Multiple flags were being tested without locking. Protect against
non-atomic bit changes in m->flags by holding m->lock (while testing or
setting the queue_if_no_path related flags).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Allow a user to specify an optional feature 'queue_mode <mode>' where
<mode> may be "bio", "rq" or "mq" -- which corresponds to bio-based,
request_fn rq-based, and blk-mq rq-based respectively.
If the queue_mode feature isn't specified the default for the
"multipath" target is still "rq" but if dm_mod.use_blk_mq is set to Y
it'll default to mode "mq".
This new queue_mode feature introduces the ability for each multipath
device to have its own queue_mode (whereas before this feature all
multipath devices effectively had to have the same queue_mode).
This commit also goes a long way to eliminate the awkward (ab)use of
DM_TYPE_*, the associated filter_md_type() and other relatively fragile
and difficult to maintain code.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Add "multipath-bio" target that offers a bio-based multipath target as
an alternative to the request-based "multipath" target -- but in a
following commit "multipath-bio" will immediately be replaced by a new
"queue_mode" feature for the "multipath" target which will allow
bio-based mode to be selected.
When DM multipath was originally converted from bio-based to
request-based the motivation for the change was better dynamic load
balancing (by leveraging block core's request-based IO schedulers, for
merging and sorting, _before_ DM multipath would make the decision on
where to steer the IO -- based on path load and/or availability).
More background is available in this "Request-based Device-mapper
multipath and Dynamic load balancing" paper:
https://www.kernel.org/doc/ols/2007/ols2007v2-pages-235-244.pdf
But we've now come full circle where significantly faster storage
devices no longer need IOs to be made larger to drive optimal IO
performance. And even if they do there have been changes to the block
and filesystem layers that help ensure upper layers are constructing
larger IOs. In addition, SCSI's differentiated IO errors will propagate
through to bio-based IO completion hooks -- so that eliminates another
historic justiciation for request-based DM multipath. Lastly, the block
layer's immutable biovec changes have made bio cloning cheaper than it
has ever been; whereas request cloning is still relatively expensive
(both on a CPU usage and memory footprint level).
As such, bio-based DM multipath offers the promise of a more efficient
IO path for high IOPs devices that are, or will be, emerging.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Add some seperation between bio-based and request-based DM core code.
'struct mapped_device' and other DM core only structures and functions
have been moved to dm-core.h and all relevant DM core .c files have been
updated to include dm-core.h rather than dm.h
DM targets should _never_ include dm-core.h!
[block core merge conflict resolution from Stephen Rothwell]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
|
|
The primary motivation of this commit is to improve the scalability of
DM multipath on large NUMA systems where m->lock spinlock contention has
been proven to be a serious bottleneck on really fast storage.
The ability to atomically read a pointer, using lockless_dereference(),
is leveraged in this commit. But all pointer writes are still protected
by the m->lock spinlock (which is fine since these all now occur in the
slow-path).
The following functions no longer require the m->lock spinlock in their
fast-path: multipath_busy(), __multipath_map(), and do_end_io()
And choose_pgpath() is modified to _not_ update m->current_pgpath unless
it also switches the path-group. This is done to avoid needing to take
the m->lock everytime __multipath_map() calls choose_pgpath().
But m->current_pgpath will be reset if it is failed via fail_path().
Suggested-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Allows the 'work_mutex' member to no longer cross a cacheline.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
The use of atomic_t for nr_valid_paths, pg_init_in_progress and
pg_init_count will allow relaxing the use of the m->lock spinlock.
Suggested-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Mechanical change that doesn't make any real effort to reduce the use of
m->lock; that will come later (once atomics are used for counters, etc).
Suggested-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
fail_path() will print a "Failing path ..." message but reinstate_path()
doesn't print a "Reinstating path ...". Add that message to
reinstate_path() to add symmetry and aid system debugging.
Remove reinstate_path()'s check for the path_selector providing
.reinstate_path hook. All path selectors provide this and any future
ones must too.
activate_path() calls pg_init_done() with SCSI_DH_DEV_OFFLINED but
pg_init_done() doesn't expicitly handle it in its swicth statement. Add
SCSI_DH_DEV_OFFLINED to the default case.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
If a path selector has any use for a repeat_count it should be handled
locally and not depend on the dm-mpath core to be concerned with it.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Preparation for making __multipath_map() avoid taking the m->lock
spinlock -- in favor of using RCU locking.
repeat_count was primarily for bio-based DM multipath's benefit. There
is really no need for it anymore now that DM multipath is request-based.
As such, repeat_count > 1 is no longer honored and a warning is
displayed if the user attempts to use a value > 1. This is a temporary
change for the round-robin path-selector (as a later commit will restore
its support for repeat_count > 1).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
There isn't any need to support both old .request_fn and blk-mq paths
in the blk-mq specific portion of __multipath_map(). Call
blk_mq_alloc_request() directly rather than use blk_get_request().
Similarly, call blk_mq_free_request(), rather than blk_put_request(), in
multipath_release_clone().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Refactor and rename existing interfaces to be more specific and
self-documenting.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Allow the multipath target to avoid making small allocations for each
'struct dm_mpath_io' that is needed for each request.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Rename various methods to have either a "dm_old" or "dm_mq" prefix.
Improve code comments to assist with understanding the duality of code
that handles both "dm_old" and "dm_mq" cases.
It is no much easier to quickly look at the code and _know_ that a given
method is either 1) "dm_old" only 2) "dm_mq" only 3) common to both.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Remove all fiddley code that propped up this support for a blk-mq
request-queue ontop of all .request_fn devices.
Testing has proven this niche request-based dm-mq mode to be buggy, when
testing fault tolerance with DM multipath, and there is no point trying
to preserve it.
Should help improve efficiency of pure dm-mq code and make code
maintenance less delicate.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
DM multipath is the only dm-mq target. But that aside, request-based DM
only supports tables with a single target that is immutable. Leverage
this fact in dm_mq_queue_rq() by using the 'immutable_target' stored in
the mapped_device when the table was made active. This saves the need
to even take the read-side of the SRCU via dm_{get,put}_live_table.
If the active DM table does not have an immutable target (e.g. "error"
target was swapped in) then fallback to the slow-path where the target
is looked up from the live table.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
In multipath_prepare_ioctl(),
- pgpath is a path selected from available paths
- m->queue_io is true if we cannot send a request immediately to
paths, either because:
* there is no available path
* the path group needs activation (pg_init)
- pg_init is not started
- pg_init is still running
- m->queue_if_no_path is true if the device is configured to queue
I/O if there are no available paths
If !pgpath && !m->queue_if_no_path, the handler should return -EIO.
However in the course of refactoring the condition check has broken
and returns success in that case. Since bdev points to the dm device
itself, dm_blk_ioctl() calls __blk_dev_driver_ioctl() for itself and
recurses until crash.
You could reproduce the problem like this:
# dmsetup create mp --table '0 1024 multipath 0 0 0 0'
# sg_inq /dev/mapper/mp
<crash>
[ 172.648615] BUG: unable to handle kernel paging request at fffffffc81b10268
[ 172.662843] PGD 19dd067 PUD 0
[ 172.666269] Thread overran stack, or stack corrupted
[ 172.671808] Oops: 0000 [#1] SMP
...
Fix the condition check with some clarifications.
Fixes: e56f81e0b01e ("dm: refactor ioctl handling")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
dm-mpath retries ioctl, when no path is readily available and the device
is configured to queue I/O in such a case. If you want to stop the retry
before multipathd decides to turn off queueing mode, you could send
signal for the process to exit from the loop.
However the check of fatal signal has not carried along when commit
6c182cd88d17 ("dm mpath: fix ioctl deadlock when no paths") moved the
loop from dm-mpath to dm core. As a result, we can't terminate such
a process in the retry loop.
Easy reproducer of the situation is:
# dmsetup create mp --table '0 1024 multipath 0 0 0 0'
# dmsetup message mp 0 'queue_if_no_path'
# sg_inq /dev/mapper/mp
then you should be able to terminate sg_inq by pressing Ctrl+C.
Fixes: 6c182cd88d17 ("dm mpath: fix ioctl deadlock when no paths")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
This adds support to pass through persistent reservation requests
similar to the existing ioctl handling, and with the same limitations,
e.g. devices may only have a single target attached.
This is mostly intended for multipathing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
This moves the call to blkdev_ioctl and the argument checking to DM core
code, and only leaves a callout to find the block device to operate on
in the targets. This simplifies the code and allows us to pass through
ioctl-like command using other methods in the next patch.
Also split out a helper around calling the prepare_ioctl method that
will be reused for persistent reservation handling.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
This reverts commit a1989b330093578ea5470bea0a00f940c444c466.
That commit introduced a regression at least for the case of the SG_IO ioctl()
running without CAP_SYS_RAWIO capability (e.g., unprivileged users) when there
are no active paths: the ioctl() fails with the ENOTTY errno immediately rather
than blocking due to queue_if_no_path until a path becomes active, for example.
That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
(qemu "-device scsi-block" [1], libvirt "<disk type='block' device='lun'>" [2])
from multipath devices; which leads to SCSI/filesystem errors in such a guest.
More general scenarios can hit that regression too. The following demonstration
employs a SG_IO ioctl() with a standard SCSI INQUIRY command for this objective
(some output & user changes omitted for brevity and comments added for clarity).
Reverting that commit restores normal operation (queueing) in failing scenarios;
tested on linux-next (next-20151022).
1) Test-case is based on sg_simple0 [3] (just SG_IO; remove SG_GET_VERSION_NUM)
$ cat sg_simple0.c
... see [3] ...
$ sed '/SG_GET_VERSION_NUM/,/}/d' sg_simple0.c > sgio_inquiry.c
$ gcc sgio_inquiry.c -o sgio_inquiry
2) The ioctl() works fine with active paths present.
# multipath -l 85ag56
85ag56 (...) dm-19 IBM ,2145
size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=0 status=active
| |- 8:0:11:0 sdz 65:144 active undef running
| `- 9:0:9:0 sdbf 67:144 active undef running
`-+- policy='service-time 0' prio=0 status=enabled
|- 8:0:12:0 sdae 65:224 active undef running
`- 9:0:12:0 sdbo 68:32 active undef running
$ ./sgio_inquiry /dev/mapper/85ag56
Some of the INQUIRY command's response:
IBM 2145 0000
INQUIRY duration=0 millisecs, resid=0
3) The ioctl() fails with ENOTTY errno with _no_ active paths present,
for unprivileged users (rather than blocking due to queue_if_no_path).
# for path in $(multipath -l 85ag56 | grep -o 'sd[a-z]\+'); \
do multipathd -k"fail path $path"; done
# multipath -l 85ag56
85ag56 (...) dm-19 IBM ,2145
size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| |- 8:0:11:0 sdz 65:144 failed undef running
| `- 9:0:9:0 sdbf 67:144 failed undef running
`-+- policy='service-time 0' prio=0 status=enabled
|- 8:0:12:0 sdae 65:224 failed undef running
`- 9:0:12:0 sdbo 68:32 failed undef running
$ ./sgio_inquiry /dev/mapper/85ag56
sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device
4) dmesg shows that scsi_verify_blk_ioctl() failed for SG_IO (0x2285);
it returns -ENOIOCTLCMD, later replaced with -ENOTTY in vfs_ioctl().
$ dmesg
<...>
[] device-mapper: multipath: Failing path 65:144.
[] device-mapper: multipath: Failing path 67:144.
[] device-mapper: multipath: Failing path 65:224.
[] device-mapper: multipath: Failing path 68:32.
[] sgio_inquiry: sending ioctl 2285 to a partition!
5) The ioctl() only works if the SYS_CAP_RAWIO capability is present
(then queueing happens -- in this example, queue_if_no_path is set);
this is due to a conditional check in scsi_verify_blk_ioctl().
# capsh --drop=cap_sys_rawio -- -c './sgio_inquiry /dev/mapper/85ag56'
sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device
# ./sgio_inquiry /dev/mapper/85ag56 &
[1] 72830
# cat /proc/72830/stack
[<c00000171c0df700>] 0xc00000171c0df700
[<c000000000015934>] __switch_to+0x204/0x350
[<c000000000152d4c>] msleep+0x5c/0x80
[<c00000000077dfb0>] dm_blk_ioctl+0x70/0x170
[<c000000000487c40>] blkdev_ioctl+0x2b0/0x9b0
[<c0000000003128e4>] block_ioctl+0x64/0xd0
[<c0000000002dd3b0>] do_vfs_ioctl+0x490/0x780
[<c0000000002dd774>] SyS_ioctl+0xd4/0xf0
[<c000000000009358>] system_call+0x38/0xd0
6) This is the function call chain exercised in this analysis:
SYSCALL_DEFINE3(ioctl, <...>) @ fs/ioctl.c
-> do_vfs_ioctl()
-> vfs_ioctl()
...
error = filp->f_op->unlocked_ioctl(filp, cmd, arg);
...
-> dm_blk_ioctl() @ drivers/md/dm.c
-> multipath_ioctl() @ drivers/md/dm-mpath.c
...
(bdev = NULL, due to no active paths)
...
if (!bdev || <...>) {
int err = scsi_verify_blk_ioctl(NULL, cmd);
if (err)
r = err;
}
...
-> scsi_verify_blk_ioctl() @ block/scsi_ioctl.c
...
if (bd && bd == bd->bd_contains) // not taken (bd = NULL)
return 0;
...
if (capable(CAP_SYS_RAWIO)) // not taken (unprivileged user)
return 0;
...
printk_ratelimited(KERN_WARNING
"%s: sending ioctl %x to a partition!\n" <...>);
return -ENOIOCTLCMD;
<-
...
return r ? : <...>
<-
...
if (error == -ENOIOCTLCMD)
error = -ENOTTY;
out:
return error;
...
Links:
[1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
[2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device')
[3] http://tldp.org/HOWTO/SCSI-Generic-HOWTO/pexample.html (Revision 1.2, 2002-05-03)
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
|
|
This way we can reused the same code any attachment method, not just those
requested from dm-mpath.
[jejb: fixup checkpatch error]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: James Bottomley <JBottomley@Odin.com>
|
|
While allowing dm-mpath to attach device handlers is a functionality we need
for backwards compatibility reason there is no reason to reference count
them and detach them if dm-mpath stops using the device for some reason.
If the device handler works for the given device it can just stay attached,
and we can take the retain_hw_handler codepath.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Hannes Reinecke <hare@Suse.de>
Signed-off-by: James Bottomley <JBottomley@Odin.com>
|
|
Otherwise kmemleak reported:
unreferenced object 0xffff88009b14e2b0 (size 16):
comm "fio", pid 4274, jiffies 4294978034 (age 1253.210s)
hex dump (first 16 bytes):
40 12 f3 99 01 88 ff ff 00 10 00 00 00 00 00 00 @...............
backtrace:
[<ffffffff81600029>] kmemleak_alloc+0x49/0xb0
[<ffffffff811679a8>] kmem_cache_alloc+0xf8/0x160
[<ffffffff8111c950>] mempool_alloc_slab+0x10/0x20
[<ffffffff8111cb37>] mempool_alloc+0x57/0x150
[<ffffffffa04d2b61>] __multipath_map.isra.17+0xe1/0x220 [dm_multipath]
[<ffffffffa04d2cb5>] multipath_clone_and_map+0x15/0x20 [dm_multipath]
[<ffffffffa02889b5>] map_request.isra.39+0xd5/0x220 [dm_mod]
[<ffffffffa028b0e4>] dm_mq_queue_rq+0x134/0x240 [dm_mod]
[<ffffffff812cccb5>] __blk_mq_run_hw_queue+0x1d5/0x380
[<ffffffff812ccaa5>] blk_mq_run_hw_queue+0xc5/0x100
[<ffffffff812ce350>] blk_sq_make_request+0x240/0x300
[<ffffffff812c0f30>] generic_make_request+0xc0/0x110
[<ffffffff812c0ff2>] submit_bio+0x72/0x150
[<ffffffff811c07cb>] do_blockdev_direct_IO+0x1f3b/0x2da0
[<ffffffff811c166e>] __blockdev_direct_IO+0x3e/0x40
[<ffffffff8120aa1a>] ext4_direct_IO+0x1aa/0x390
Fixes: e5863d9ad ("dm: allocate requests in target when stacking on blk-mq devices")
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # 4.0+
|
|
dm_mq_queue_rq() is in atomic context so care must be taken to not
sleep -- as such GFP_ATOMIC is used for the md->bs bioset allocations
and dm-mpath's call to blk_get_request(). In the future the bioset
allocations will hopefully go away (by removing support for partial
completions of bios in a cloned request).
Also prepare for supporting DM blk-mq ontop of old-style request_fn
device(s) if a new dm-mod 'use_blk_mq' parameter is set. The kthread
will still be used to queue work if blk-mq is used ontop of old-style
request_fn device(s).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Commit e5863d9ad ("dm: allocate requests in target when stacking on
blk-mq devices") served as the first step toward fully utilizing blk-mq
in request-based DM -- it enabled stacking an old-style (request_fn)
request_queue ontop of the underlying blk-mq device(s). That first step
didn't improve performance of DM multipath ontop of fast blk-mq devices
(e.g. NVMe) because the top-level old-style request_queue was severely
limited by the queue_lock.
The second step offered here enables stacking a blk-mq request_queue
ontop of the underlying blk-mq device(s). This unlocks significant
performance gains on fast blk-mq devices, Keith Busch tested on his NVMe
testbed and offered this really positive news:
"Just providing a performance update. All my fio tests are getting
roughly equal performance whether accessed through the raw block
device or the multipath device mapper (~470k IOPS). I could only push
~20% of the raw iops through dm before this conversion, so this latest
tree is looking really solid from a performance standpoint."
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Tested-by: Keith Busch <keith.busch@intel.com>
|
|
There is no need for DM to export a wrapper around the already exported
blk_lld_busy().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Currently the cleanup of all error cases are open-coded. Introduce a
common exit path and labels.
Signed-off-by: Johannes Thumshirn <morbidrsa@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
For blk-mq request-based DM the responsibility of allocating a cloned
request is transfered from DM core to the target type. Doing so
enables the cloned request to be allocated from the appropriate
blk-mq request_queue's pool (only the DM target, e.g. multipath, can
know which block device to send a given cloned request to).
Care was taken to preserve compatibility with old-style block request
completion that requires request-based DM _not_ acquire the clone
request's queue lock in the completion path. As such, there are now 2
different request-based DM target_type interfaces:
1) the original .map_rq() interface will continue to be used for
non-blk-mq devices -- the preallocated clone request is passed in
from DM core.
2) a new .clone_and_map_rq() and .release_clone_rq() will be used for
blk-mq devices -- blk_get_request() and blk_put_request() are used
respectively from these hooks.
dm_table_set_type() was updated to detect if the request-based target is
being stacked on blk-mq devices, if so DM_TYPE_MQ_REQUEST_BASED is set.
DM core disallows switching the DM table's type after it is set. This
means that there is no mixing of non-blk-mq and blk-mq devices within
the same request-based DM table.
[This patch was started by Keith and later heavily modified by Mike]
Tested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Switch to having request-based DM enqueue all prep'ed requests into work
processed by another thread. This allows request-based DM to invoke
block APIs that assume interrupt enabled context (e.g. blk_get_request)
and is a prerequisite for adding blk-mq support to request-based DM.
The new kernel thread is only initialized for request-based DM devices.
multipath_map() is now always in irq enabled context so change multipath
spinlock (m->lock) locking to always disable interrupts.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
'queue_io' is set so that IO is queued while paths are being
initialized. Clear queue_io in __choose_pgpath if there are no valid
paths, since there are obviously no paths that can be initialized.
Otherwise IOs to the device will back up.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
pg_ready() is not comprehensive in its logic and only serves to
obfuscate code. Replace pg_ready() with the appropriate logic in
multipath_map().
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Commit e80991773 ("dm mpath: push back requests instead of queueing")
modified multipath_busy() to return true if !pg_ready(). pg_ready()
checks the current state of the multipath device and may return false
even if a new IO is needed to change the state.
Bart Van Assche reported that he had multipath IO lockup when he was
performing cable pull tests. Analysis showed that the multipath
device had a single path group with both paths active, but that the
path group itself was not active. During the multipath device state
transitions 'queue_io' got set but nothing could clear it. Clearing
'queue_io' only happens in __choose_pgpath(), but it won't be called
if multipath_busy() returns true due to pg_ready() returning false
when 'queue_io' is set.
As such the !pg_ready() check in multipath_busy() is wrong because new
IO will not be sent to multipath target and the multipath state change
won't happen. That results in multipath IO lockup.
The intent of multipath_busy() is to avoid unnecessary cycles of
dequeue + request_fn + requeue if it is known that the multipath
device will requeue.
Such "busy" situations would be:
- path group is being activated
- there is no path and the multipath is setup to requeue if no path
Fix multipath_busy() to return "busy" early only for these specific
situations.
Reported-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org # v3.15
|
|
Add DM core support for disabling WRITE SAME on first failure to both
request-based and bio-based targets. The need to disable WRITE SAME
stems from SCSI enabling it by default but then disabling it when it
fails. When SCSI does this it returns "permanent target failure, do
not retry" using -EREMOTEIO. Update DM core to only disable WRITE SAME
on failure if the returned error is -EREMOTEIO.
Commit f84cb8a4 ("dm mpath: disable WRITE SAME if it fails")
implemented multipath specific disabling of WRITE SAME if it fails.
However, as that commit detailed, the multipath-only solution doesn't go
far enough if bio-based DM targets are stacked ontop of the
request-based dm-multipath target (as is commonly done using dm-linear
to support partitions on multipath devices, via kpartx).
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Tested-by: Alex Chen <alex.chen@huawei.com>
|
|
lockdep complains about a circular locking. And indeed, we need to
release the lock before calling dm_table_run_md_queue_async().
As such, commit 4cdd2ad ("dm mpath: fix lock order inconsistency in
multipath_ioctl") must also be reverted in addition to fixing the
lock order in the other dm_table_run_md_queue_async() callers.
Reported-by: Bart van Assche <bvanassche@acm.org>
Tested-by: Bart van Assche <bvanassche@acm.org>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Commit 3e9f1be1b40 ("dm mpath: remove process_queued_ios()") did not
consistently take the multipath device's spinlock (m->lock) before
calling dm_table_run_md_queue_async() -- which takes the q->queue_lock.
Found with code inspection using hint from reported lockdep warning.
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
The warning message "Unrecognised multipath message received" is
displayed in two different situations in multipath_message(): when the
number of arguments passed is invalid and when the string passed in
argv[0] is not recognized.
Make it easier to identify where the problem is by making these warnings
more specific with additional context for each case.
Signed-off-by: Jose Castillo <jcastillo@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
activate_path() is run without a lock, so the path might be
set to failed before activate_path() had a chance to run.
This patch add a check for ->active in activate_path() to
avoid unnecessary overhead by calling functions which are known
to be failing.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
|
|
Return early for case when no path exists, and when the
pathgroup isn't ready. This eliminates the need for
extra nesting for the the common case.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
|