summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2016-11-05Merge tag 'md/4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds
Pull MD fixes from Shaohua Li: "There are several bug fixes queued: - fix raid5-cache recovery bugs - fix discard IO error handling for raid1/10 - fix array sync writes bogus position to superblock - fix IO error handling for raid array with external metadata" * tag 'md/4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md: be careful not lot leak internal curr_resync value into metadata. -- (all) raid1: handle read error also in readonly mode raid5-cache: correct condition for empty metadata write md: report 'write_pending' state when array in sync md/raid5: write an empty meta-block when creating log super-block md/raid5: initialize next_checkpoint field before use RAID10: ignore discard error RAID1: ignore discard error
2016-10-28md: be careful not lot leak internal curr_resync value into metadata. -- (all)NeilBrown
mddev->curr_resync usually records where the current resync is up to, but during the starting phase it has some "magic" values. 1 - means that the array is trying to start a resync, but has yielded to another array which shares physical devices, and also needs to start a resync 2 - means the array is trying to start resync, but has found another array which shares physical devices and has already started resync. 3 - means that resync has commensed, but it is possible that nothing has actually been resynced yet. It is important that this value not be visible to user-space and particularly that it doesn't get written to the metadata, as the resync or recovery checkpoint. In part, this is because it may be slightly higher than the correct value, though this is very rare. In part, because it is not a multiple of 4K, and some devices only support 4K aligned accesses. There are two places where this value is propagates into either ->curr_resync_completed or ->recovery_cp or ->recovery_offset. These currently avoid the propagation of values 1 and 3, but will allow 3 to leak through. Change them to only propagate the value if it is > 3. As this can cause an array to fail, the patch is suitable for -stable. Cc: stable@vger.kernel.org (v3.7+) Reported-by: Viswesh <viswesh.vichu@gmail.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-28raid1: handle read error also in readonly modeTomasz Majchrzak
If write is the first operation on a disk and it happens not to be aligned to page size, block layer sends read request first. If read operation fails, the disk is set as failed as no attempt to fix the error is made because array is in auto-readonly mode. Similarily, the disk is set as failed for read-only array. Take the same approach as in raid10. Don't fail the disk if array is in readonly or auto-readonly mode. Try to redirect the request first and if unsuccessful, return a read error. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-28raid5-cache: correct condition for empty metadata writeShaohua Li
As long as we recover one metadata block, we should write the empty metadata write. The original code could make recovery corrupted if only one meta is valid. Reported-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-28Merge tag 'dm-4.9-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - a couple DM raid and DM mirror fixes - a couple .request_fn request-based DM NULL pointer fixes - a fix for a DM target reference count leak, on target load error, that prevented associated DM target kernel module(s) from being removed * tag 'dm-4.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm table: fix missing dm_put_target_type() in dm_table_add_target() dm rq: clear kworker_task if kthread_run() returned an error dm: free io_barrier after blk_cleanup_queue call dm raid: fix activation of existing raid4/10 devices dm mirror: use all available legs on multiple failures dm mirror: fix read error on recovery after default leg failure dm raid: fix compat_features validation
2016-10-24md: report 'write_pending' state when array in syncTomasz Majchrzak
If there is a bad block on a disk and there is a recovery performed from this disk, the same bad block is reported for a new disk. It involves setting MD_CHANGE_PENDING flag in rdev_set_badblocks. For external metadata this flag is not being cleared as array state is reported as 'clean'. The read request to bad block in RAID5 array gets stuck as it is waiting for a flag to be cleared - as per commit c3cce6cda162 ("md/raid5: ensure device failure recorded before write request returns."). The meaning of MD_CHANGE_PENDING and MD_CHANGE_CLEAN flags has been clarified in commit 070dc6dd7103 ("md: resolve confusion of MD_CHANGE_CLEAN"), however MD_CHANGE_PENDING flag has been used in personality error handlers since and it doesn't fully comply with initial purpose. It was supposed to notify that write request is about to start, however now it is also used to request metadata update. Initially (in md_allow_write, md_write_start) MD_CHANGE_PENDING flag has been set and in_sync has been set to 0 at the same time. Error handlers just set the flag without modifying in_sync value. Sysfs array state is a single value so now it reports 'clean' when MD_CHANGE_PENDING flag is set and in_sync is set to 1. Userspace has no idea it is expected to take some action. Swap the order that array state is checked so 'write_pending' is reported ahead of 'clean' ('write_pending' is a misleading name but it is too late to rename it now). Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-24md/raid5: write an empty meta-block when creating log super-blockZhengyuan Liu
If superblock points to an invalid meta block, r5l_load_log will set create_super with true and create an new superblock, this runtime path would always happen if we do no writing I/O to this array since it was created. Writing an empty meta block could avoid this unnecessary action at the first time we created log superblock. Another reason is for the corretness of log recovery. Currently we have bellow code to guarantee log revocery to be correct. if (ctx.seq > log->last_cp_seq + 1) { int ret; ret = r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq + 10); if (ret) return ret; log->seq = ctx.seq + 11; log->log_start = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS); r5l_write_super(log, ctx.pos); } else { log->log_start = ctx.pos; log->seq = ctx.seq; } If we just created a array with a journal device, log->log_start and log->last_checkpoint should all be 0, then we write three meta block which are valid except mid one and supposed crash happened. The ctx.seq would equal to log->last_cp_seq + 1 and log->log_start would be set to position of mid invalid meta block after we did a recovery, this will lead to problems which could be avoided with this patch. Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-24md/raid5: initialize next_checkpoint field before useZhengyuan Liu
No initial operation was done to this field when we load/recovery the log, it got assignment only when IO to raid disk was finished. So r5l_quiesce may use wrong next_checkpoint to reclaim log space, that would make reclaimable space calculation confused. Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-24RAID10: ignore discard errorShaohua Li
This is the counterpart of raid10 fix. If a write error occurs, raid10 will try to rewrite the bio in small chunk size. If the rewrite fails, raid10 will record the error in bad block. narrow_write_error will always use WRITE for the bio, but actually it could be a discard. Since discard bio hasn't payload, write the bio will cause different issues. But discard error isn't fatal, we can safely ignore it. This is what this patch does. This issue should exist since discard is added, but only exposed with recent arbitrary bio size feature. Cc: Sitsofe Wheeler <sitsofe@gmail.com> Cc: stable@vger.kernel.org (v3.6) Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-24RAID1: ignore discard errorShaohua Li
If a write error occurs, raid1 will try to rewrite the bio in small chunk size. If the rewrite fails, raid1 will record the error in bad block. narrow_write_error will always use WRITE for the bio, but actually it could be a discard. Since discard bio hasn't payload, write the bio will cause different issues. But discard error isn't fatal, we can safely ignore it. This is what this patch does. This issue should exist since discard is added, but only exposed with recent arbitrary bio size feature. Reported-and-tested-by: Sitsofe Wheeler <sitsofe@gmail.com> Cc: stable@vger.kernel.org (v3.6) Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-24dm table: fix missing dm_put_target_type() in dm_table_add_target()tang.junhui
dm_get_target_type() was previously called so any error returned from dm_table_add_target() must first call dm_put_target_type(). Otherwise the DM target module's reference count will leak and the associated kernel module will be unable to be removed. Also, leverage the fact that r is already -EINVAL and remove an extra newline. Fixes: 36a0456 ("dm table: add immutable feature") Fixes: cc6cbe1 ("dm table: add always writeable feature") Fixes: 3791e2f ("dm table: add singleton feature") Cc: stable@vger.kernel.org # 3.2+ Signed-off-by: tang.junhui <tang.junhui@zte.com.cn> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-10-18dm rq: clear kworker_task if kthread_run() returned an errorMike Snitzer
cleanup_mapped_device() calls kthread_stop() if kworker_task is non-NULL. Currently the assigned value could be a valid task struct or an error code (e.g -ENOMEM). Reset md->kworker_task to NULL if kthread_run() returned an erorr. Fixes: 7193a9defc ("dm rq: check kthread_run return for .request_fn request-based DM") Cc: stable@vger.kernel.org # 4.8 Reported-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-10-18dm: free io_barrier after blk_cleanup_queue callTahsin Erdogan
dm_old_request_fn() has paths that access md->io_barrier. The party destroying io_barrier should ensure that no future execution of dm_old_request_fn() is possible. Move io_barrier destruction to below blk_cleanup_queue() to ensure this and avoid a NULL pointer crash during request-based DM device shutdown. Cc: stable@vger.kernel.org # 4.3+ Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-10-17dm raid: fix activation of existing raid4/10 devicesHeinz Mauelshagen
dm-raid 1.9.0 fails to activate existing RAID4/10 devices that have the old superblock format (which does not have takeover/reshaping support that was added via commit 33e53f06850f). Fix validation path for old superblocks by reverting to the old raid4 layout and basing checks on mddev->new_{level,layout,...} members in super_init_validation(). Cc: stable@vger.kernel.org # 4.8 Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-10-14dm mirror: use all available legs on multiple failuresHeinz Mauelshagen
When any leg(s) have failed, any read will cause a new operational default leg to be selected and the read is resubmitted to it. If that new default leg fails the read too, no other still accessible legs are used to resubmit the read again -- thus failing the io. Fix by allowing the read to get resubmitted until all operational legs have been exhausted. Also, remove any details.bi_dev use as a flag. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-10-14dm mirror: fix read error on recovery after default leg failureHeinz Mauelshagen
If a default leg has failed, any read will cause a new operational default leg to be selected and the read is resubmitted. But until now the read will return failure even though it was successful due to resubmission. The reason for this is bio->bi_error was not being cleared before resubmitting the bio. Fix by clearing bio->bi_error before resubmission. Fixes: 4246a0b63bd8 ("block: add a bi_error field to struct bio") Cc: stable@vger.kernel.org # 4.3+ Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-10-11kthread: kthread worker API cleanupPetr Mladek
A good practice is to prefix the names of functions by the name of the subsystem. The kthread worker API is a mix of classic kthreads and workqueues. Each worker has a dedicated kthread. It runs a generic function that process queued works. It is implemented as part of the kthread subsystem. This patch renames the existing kthread worker API to use the corresponding name from the workqueues API prefixed by kthread_: __init_kthread_worker() -> __kthread_init_worker() init_kthread_worker() -> kthread_init_worker() init_kthread_work() -> kthread_init_work() insert_kthread_work() -> kthread_insert_work() queue_kthread_work() -> kthread_queue_work() flush_kthread_work() -> kthread_flush_work() flush_kthread_worker() -> kthread_flush_worker() Note that the names of DEFINE_KTHREAD_WORK*() macros stay as they are. It is common that the "DEFINE_" prefix has precedence over the subsystem names. Note that INIT() macros and init() functions use different naming scheme. There is no good solution. There are several reasons for this solution: + "init" in the function names stands for the verb "initialize" aka "initialize worker". While "INIT" in the macro names stands for the noun "INITIALIZER" aka "worker initializer". + INIT() macros are used only in DEFINE() macros + init() functions are used close to the other kthread() functions. It looks much better if all the functions use the same scheme. + There will be also kthread_destroy_worker() that will be used close to kthread_cancel_work(). It is related to the init() function. Again it looks better if all functions use the same naming scheme. + there are several precedents for such init() function names, e.g. amd_iommu_init_device(), free_area_init_node(), jump_label_init_type(), regmap_init_mmio_clk(), + It is not an argument but it was inconsistent even before. [arnd@arndb.de: fix linux-next merge conflict] Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11dm raid: fix compat_features validationAndy Whitcroft
In ecbfb9f118bce4 ("dm raid: add raid level takeover support") a new compatible feature flag was added. Validation for these compat_features was added but this only passes for new raid mappings with this feature flag. This causes previously created raid mappings to be failed at import. Check compat_features for the only valid combination. Fixes: ecbfb9f118bce4 ("dm raid: add raid level takeover support") Cc: stable@vger.kernel.org # v4.8 Signed-off-by: Andy Whitcroft <apw@canonical.com> Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-10-09Merge branch 'for-4.9/block-irq' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull blk-mq irq/cpu mapping updates from Jens Axboe: "This is the block-irq topic branch for 4.9-rc. It's mostly from Christoph, and it allows drivers to specify their own mappings, and more importantly, to share the blk-mq mappings with the IRQ affinity mappings. It's a good step towards making this work better out of the box" * 'for-4.9/block-irq' of git://git.kernel.dk/linux-block: blk_mq: linux/blk-mq.h does not include all the headers it depends on blk-mq: kill unused blk_mq_create_mq_map() blk-mq: get rid of the cpumask in struct blk_mq_tags nvme: remove the post_scan callout nvme: switch to use pci_alloc_irq_vectors blk-mq: provide a default queue mapping for PCI device blk-mq: allow the driver to pass in a queue mapping blk-mq: remove ->map_queue blk-mq: only allocate a single mq_map per tag_set blk-mq: don't redistribute hardware queues on a CPU hotplug event
2016-10-09Merge tag 'dm-4.9-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - various fixes and cleanups for request-based DM core - add support for delaying the requeue of requests; used by DM multipath when all paths have failed and 'queue_if_no_path' is enabled - DM cache improvements to speedup the loading metadata and the writing of the hint array - fix potential for a dm-crypt crash on device teardown - remove dm_bufio_cond_resched() and just using cond_resched() - change DM multipath to return a reservation conflict error immediately; rather than failing the path and retrying (potentially indefinitely) * tag 'dm-4.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (24 commits) dm mpath: always return reservation conflict without failing over dm bufio: remove dm_bufio_cond_resched() dm crypt: fix crash on exit dm cache metadata: switch to using the new cursor api for loading metadata dm array: introduce cursor api dm btree: introduce cursor api dm cache policy smq: distribute entries to random levels when switching to smq dm cache: speed up writing of the hint array dm array: add dm_array_new() dm mpath: delay the requeue of blk-mq requests while all paths down dm mpath: use dm_mq_kick_requeue_list() dm rq: introduce dm_mq_kick_requeue_list() dm rq: reduce arguments passed to map_request() and dm_requeue_original_request() dm rq: add DM_MAPIO_DELAY_REQUEUE to delay requeue of blk-mq requests dm: convert wait loops to use autoremove_wake_function() dm: use signal_pending_state() in dm_wait_for_completion() dm: rename task state function arguments dm: add two lockdep_assert_held() statements dm rq: simplify dm_old_stop_queue() dm mpath: check if path's request_queue is dying in activate_path() ...
2016-10-07Merge branch 'for-4.9/block' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block layer updates from Jens Axboe: "This is the main pull request for block layer changes in 4.9. As mentioned at the last merge window, I've changed things up and now do just one branch for core block layer changes, and driver changes. This avoids dependencies between the two branches. Outside of this main pull request, there are two topical branches coming as well. This pull request contains: - A set of fixes, and a conversion to blk-mq, of nbd. From Josef. - Set of fixes and updates for lightnvm from Matias, Simon, and Arnd. Followup dependency fix from Geert. - General fixes from Bart, Baoyou, Guoqing, and Linus W. - CFQ async write starvation fix from Glauber. - Add supprot for delayed kick of the requeue list, from Mike. - Pull out the scalable bitmap code from blk-mq-tag.c and make it generally available under the name of sbitmap. Only blk-mq-tag uses it for now, but the blk-mq scheduling bits will use it as well. From Omar. - bdev thaw error progagation from Pierre. - Improve the blk polling statistics, and allow the user to clear them. From Stephen. - Set of minor cleanups from Christoph in block/blk-mq. - Set of cleanups and optimizations from me for block/blk-mq. - Various nvme/nvmet/nvmeof fixes from the various folks" * 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits) fs/block_dev.c: return the right error in thaw_bdev() nvme: Pass pointers, not dma addresses, to nvme_get/set_features() nvme/scsi: Remove power management support nvmet: Make dsm number of ranges zero based nvmet: Use direct IO for writes admin-cmd: Added smart-log command support. nvme-fabrics: Add host_traddr options field to host infrastructure nvme-fabrics: revise host transport option descriptions nvme-fabrics: rework nvmf_get_address() for variable options nbd: use BLK_MQ_F_BLOCKING blkcg: Annotate blkg_hint correctly cfq: fix starvation of asynchronous writes blk-mq: add flag for drivers wanting blocking ->queue_rq() blk-mq: remove non-blocking pass in blk_mq_map_request blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue() block: export bio_free_pages to other modules lightnvm: propagate device_add() error code lightnvm: expose device geometry through sysfs lightnvm: control life of nvm_dev in driver blk-mq: register device instead of disk ...
2016-10-07Merge tag 'md/4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds
Pull MD updates from Shaohua Li: "This update includes: - new AVX512 instruction based raid6 gen/recovery algorithm - a couple of md-cluster related bug fixes - fix a potential deadlock - set nonrotational bit for raid array with SSD - set correct max_hw_sectors for raid5/6, which hopefuly can improve performance a little bit - other minor fixes" * tag 'md/4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md: set rotational bit raid6/test/test.c: bug fix: Specify aligned(alignment) attributes to the char arrays raid5: handle register_shrinker failure raid5: fix to detect failure of register_shrinker md: fix a potential deadlock md/bitmap: fix wrong cleanup raid5: allow arbitrary max_hw_sectors lib/raid6: Add AVX512 optimized xor_syndrome functions lib/raid6/test/Makefile: Add avx512 gen_syndrome and recovery functions lib/raid6: Add AVX512 optimized recovery functions lib/raid6: Add AVX512 optimized gen_syndrome functions md-cluster: make resync lock also could be interruptted md-cluster: introduce dlm_lock_sync_interruptible to fix tasks hang md-cluster: convert the completion to wait queue md-cluster: protect md_find_rdev_nr_rcu with rcu lock md-cluster: clean related infos of cluster md: changes for MD_STILL_CLOSED flag md-cluster: remove some unnecessary dlm_unlock_sync md-cluster: use FORCEUNLOCK in lockres_free md-cluster: call md_kick_rdev_from_array once ack failed
2016-10-03Merge branch 'smp-hotplug-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull CPU hotplug updates from Thomas Gleixner: "Yet another batch of cpu hotplug core updates and conversions: - Provide core infrastructure for multi instance drivers so the drivers do not have to keep custom lists. - Convert custom lists to the new infrastructure. The block-mq custom list conversion comes through the block tree and makes the diffstat tip over to more lines removed than added. - Handle unbalanced hotplug enable/disable calls more gracefully. - Remove the obsolete CPU_STARTING/DYING notifier support. - Convert another batch of notifier users. The relayfs changes which conflicted with the conversion have been shipped to me by Andrew. The remaining lot is targeted for 4.10 so that we finally can remove the rest of the notifiers" * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits) cpufreq: Fix up conversion to hotplug state machine blk/mq: Reserve hotplug states for block multiqueue x86/apic/uv: Convert to hotplug state machine s390/mm/pfault: Convert to hotplug state machine mips/loongson/smp: Convert to hotplug state machine mips/octeon/smp: Convert to hotplug state machine fault-injection/cpu: Convert to hotplug state machine padata: Convert to hotplug state machine cpufreq: Convert to hotplug state machine ACPI/processor: Convert to hotplug state machine virtio scsi: Convert to hotplug state machine oprofile/timer: Convert to hotplug state machine block/softirq: Convert to hotplug state machine lib/irq_poll: Convert to hotplug state machine x86/microcode: Convert to hotplug state machine sh/SH-X3 SMP: Convert to hotplug state machine ia64/mca: Convert to hotplug state machine ARM/OMAP/wakeupgen: Convert to hotplug state machine ARM/shmobile: Convert to hotplug state machine arm64/FP/SIMD: Convert to hotplug state machine ...
2016-10-03md: set rotational bitShaohua Li
if all disks in an array are non-rotational, set the array non-rotational. This only works for array with all disks populated at startup. Support for disk hotadd/hotremove could be added later if necessary. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Shaohua Li <shli@fb.com>
2016-09-29dm mpath: always return reservation conflict without failing overHannes Reinecke
If dm-mpath encounters an reservation conflict it should not fail the path (as communication with the target is not affected) but should rather retry on another path. However, in doing so we might be inducing a ping-pong between paths, with no guarantee of any forward progress. And arguably a reservation conflict is an unexpected error, so we should be passing it upwards to allow the application to take appropriate steps. This change resolves a show-stopper problem seen with the pNFS SCSI layout because it is trivial to hit reservation conflict based failover loops without it. Doubts were raised about the implications of this change relative to products like IBM's SVC. But there is little point withholding a fix for Linux because a proprietary product may or may not have some issues in its implementation of how it interfaces with Linux. In the future, if there is glaring evidence that this change is certainly problematic we can revisit it. Signed-off-by: Hannes Reinecke <hare@suse.de> Acked-by: Christoph Hellwig <hch@lst.de> Tested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> # tweaked header
2016-09-22dm bufio: remove dm_bufio_cond_resched()Peter Zijlstra
Use cond_resched() like everybody else. Mikulas explained why dm_bufio_cond_resched() was introduced to begin with (hopefully cond_resched can be improved accordingly) here: https://www.redhat.com/archives/dm-devel/2016-September/msg00112.html Cc: Ingo Molnar <mingo@kernel.org> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Alasdair Kergon <agk@redhat.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> # added last comment in header
2016-09-22dm crypt: fix crash on exitRabin Vincent
As the documentation for kthread_stop() says, "if threadfn() may call do_exit() itself, the caller must ensure task_struct can't go away". dm-crypt does not ensure this and therefore crashes when crypt_dtr() calls kthread_stop(). The crash is trivially reproducible by adding a delay before the call to kthread_stop() and just opening and closing a dm-crypt device. general protection fault: 0000 [#1] PREEMPT SMP CPU: 0 PID: 533 Comm: cryptsetup Not tainted 4.8.0-rc7+ #7 task: ffff88003bd0df40 task.stack: ffff8800375b4000 RIP: 0010: kthread_stop+0x52/0x300 Call Trace: crypt_dtr+0x77/0x120 dm_table_destroy+0x6f/0x120 __dm_destroy+0x130/0x250 dm_destroy+0x13/0x20 dev_remove+0xe6/0x120 ? dev_suspend+0x250/0x250 ctl_ioctl+0x1fc/0x530 ? __lock_acquire+0x24f/0x1b10 dm_ctl_ioctl+0x13/0x20 do_vfs_ioctl+0x91/0x6a0 ? ____fput+0xe/0x10 ? entry_SYSCALL_64_fastpath+0x5/0xbd ? trace_hardirqs_on_caller+0x151/0x1e0 SyS_ioctl+0x41/0x70 entry_SYSCALL_64_fastpath+0x1f/0xbd This problem was introduced by bcbd94ff481e ("dm crypt: fix a possible hang due to race condition on exit"). Looking at the description of that patch (excerpted below), it seems like the problem it addresses can be solved by just using set_current_state instead of __set_current_state, since we obviously need the memory barrier. | dm crypt: fix a possible hang due to race condition on exit | | A kernel thread executes __set_current_state(TASK_INTERRUPTIBLE), | __add_wait_queue, spin_unlock_irq and then tests kthread_should_stop(). | It is possible that the processor reorders memory accesses so that | kthread_should_stop() is executed before __set_current_state(). If | such reordering happens, there is a possible race on thread | termination: [...] So this patch just reverts the aforementioned patch and changes the __set_current_state(TASK_INTERRUPTIBLE) to set_current_state(...). This fixes the crash and should also fix the potential hang. Fixes: bcbd94ff481e ("dm crypt: fix a possible hang due to race condition on exit") Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org # v4.0+ Signed-off-by: Rabin Vincent <rabinv@axis.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-09-22dm cache metadata: switch to using the new cursor api for loading metadataJoe Thornber
This change offers a pretty significant performance improvement. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-09-22dm array: introduce cursor apiJoe Thornber
More efficient way to iterate an array due to prefetching (makes use of the new dm_btree_cursor_* api). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-09-22dm btree: introduce cursor apiJoe Thornber
This uses prefetching to speed up iteration through a btree. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-09-22dm cache policy smq: distribute entries to random levels when switching to smqJoe Thornber
For smq the 32 bit 'hint' stores the multiqueue level that the entry should be stored in. If a different policy has been used previously, and then switched to smq, the hints will be invalid. In which case we used to put all entries in the bottom level of the multiqueue, and then redistribute. Redistribution is faster if we put entries with invalid hints in random levels initially. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-09-22dm cache: speed up writing of the hint arrayJoe Thornber
It's far quicker to always delete the hint array and recreate with dm_array_new() because we avoid the copying caused by mutation. Also simplifies the policy interface, replacing the walk_hints() with the simpler get_hint(). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-09-22dm array: add dm_array_new()Joe Thornber
dm_array_new() creates a new, populated array more efficiently than starting with an empty one and resizing. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-09-22block: export bio_free_pages to other modulesGuoqing Jiang
bio_free_pages is introduced in commit 1dfa0f68c040 ("block: add a helper to free bio bounce buffer pages"), we can reuse the func in other modules after it was imported. Cc: Christoph Hellwig <hch@infradead.org> Cc: Jens Axboe <axboe@fb.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Shaohua Li <shli@fb.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Acked-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-21raid5: handle register_shrinker failureShaohua Li
register_shrinker() now can fail. When it happens, shrinker.nr_deferred is null. We use it to determine if unregister_shrinker is required. Signed-off-by: Shaohua Li <shli@fb.com>
2016-09-21raid5: fix to detect failure of register_shrinkerChao Yu
register_shrinker can fail after commit 1d3d4437eae1 ("vmscan: per-node deferred work"), we should detect the failure of it, otherwise we may fail to register shrinker after raid5 configuration was setup successfully. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-09-21md: fix a potential deadlockShaohua Li
lockdep reports a potential deadlock. Fix this by droping the mutex before md_import_device [ 1137.126601] ====================================================== [ 1137.127013] [ INFO: possible circular locking dependency detected ] [ 1137.127013] 4.8.0-rc4+ #538 Not tainted [ 1137.127013] ------------------------------------------------------- [ 1137.127013] mdadm/16675 is trying to acquire lock: [ 1137.127013] (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff81243cf3>] __blkdev_get+0x63/0x450 [ 1137.127013] but task is already holding lock: [ 1137.127013] (detected_devices_mutex){+.+.+.}, at: [<ffffffff81a5138c>] md_ioctl+0x2ac/0x1f50 [ 1137.127013] which lock already depends on the new lock. [ 1137.127013] the existing dependency chain (in reverse order) is: [ 1137.127013] -> #1 (detected_devices_mutex){+.+.+.}: [ 1137.127013] [<ffffffff810b6f19>] lock_acquire+0xb9/0x220 [ 1137.127013] [<ffffffff81c51647>] mutex_lock_nested+0x67/0x3d0 [ 1137.127013] [<ffffffff81a4eeaf>] md_autodetect_dev+0x3f/0x90 [ 1137.127013] [<ffffffff81595be8>] rescan_partitions+0x1a8/0x2c0 [ 1137.127013] [<ffffffff81590081>] __blkdev_reread_part+0x71/0xb0 [ 1137.127013] [<ffffffff815900e5>] blkdev_reread_part+0x25/0x40 [ 1137.127013] [<ffffffff81590c4b>] blkdev_ioctl+0x51b/0xa30 [ 1137.127013] [<ffffffff81242bf1>] block_ioctl+0x41/0x50 [ 1137.127013] [<ffffffff81214c96>] do_vfs_ioctl+0x96/0x6e0 [ 1137.127013] [<ffffffff81215321>] SyS_ioctl+0x41/0x70 [ 1137.127013] [<ffffffff81c56825>] entry_SYSCALL_64_fastpath+0x18/0xa8 [ 1137.127013] -> #0 (&bdev->bd_mutex){+.+.+.}: [ 1137.127013] [<ffffffff810b6af2>] __lock_acquire+0x1662/0x1690 [ 1137.127013] [<ffffffff810b6f19>] lock_acquire+0xb9/0x220 [ 1137.127013] [<ffffffff81c51647>] mutex_lock_nested+0x67/0x3d0 [ 1137.127013] [<ffffffff81243cf3>] __blkdev_get+0x63/0x450 [ 1137.127013] [<ffffffff81244307>] blkdev_get+0x227/0x350 [ 1137.127013] [<ffffffff812444f6>] blkdev_get_by_dev+0x36/0x50 [ 1137.127013] [<ffffffff81a46d65>] lock_rdev+0x35/0x80 [ 1137.127013] [<ffffffff81a49bb4>] md_import_device+0xb4/0x1b0 [ 1137.127013] [<ffffffff81a513d6>] md_ioctl+0x2f6/0x1f50 [ 1137.127013] [<ffffffff815909b3>] blkdev_ioctl+0x283/0xa30 [ 1137.127013] [<ffffffff81242bf1>] block_ioctl+0x41/0x50 [ 1137.127013] [<ffffffff81214c96>] do_vfs_ioctl+0x96/0x6e0 [ 1137.127013] [<ffffffff81215321>] SyS_ioctl+0x41/0x70 [ 1137.127013] [<ffffffff81c56825>] entry_SYSCALL_64_fastpath+0x18/0xa8 [ 1137.127013] other info that might help us debug this: [ 1137.127013] Possible unsafe locking scenario: [ 1137.127013] CPU0 CPU1 [ 1137.127013] ---- ---- [ 1137.127013] lock(detected_devices_mutex); [ 1137.127013] lock(&bdev->bd_mutex); [ 1137.127013] lock(detected_devices_mutex); [ 1137.127013] lock(&bdev->bd_mutex); [ 1137.127013] *** DEADLOCK *** Cc: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-09-21md/bitmap: fix wrong cleanupShaohua Li
if bitmap_create fails, the bitmap is already cleaned up and the returned value is an error number. We can't do the cleanup again. Reported-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Shaohua Li <shli@fb.com>
2016-09-21raid5: allow arbitrary max_hw_sectorsShaohua Li
raid5 will split bio to proper size internally, there is no point to use underlayer disk's max_hw_sectors. In my qemu system, without the change, the raid5 only receives 128k size bio, which reduces the chance of bio merge sending to underlayer disks. Signed-off-by: Shaohua Li <shli@fb.com>
2016-09-21md-cluster: make resync lock also could be interrupttedGuoqing Jiang
When one node is perform resync or recovery, other nodes can't get resync lock and could block for a while before it holds the lock, so we can't stop array immediately for this scenario. To make array could be stop quickly, we check MD_CLOSING in dlm_lock_sync_interruptible to make us can interrupt the lock request. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-09-21md-cluster: introduce dlm_lock_sync_interruptible to fix tasks hangGuoqing Jiang
When some node leaves cluster, then it's bitmap need to be synced by another node, so "md*_recover" thread is triggered for the purpose. However, with below steps. we can find tasks hang happened either in B or C. 1. Node A create a resyncing cluster raid1, assemble it in other two nodes (B and C). 2. stop array in B and C. 3. stop array in A. linux44:~ # ps aux|grep md|grep D root 5938 0.0 0.1 19852 1964 pts/0 D+ 14:52 0:00 mdadm -S md0 root 5939 0.0 0.0 0 0 ? D 14:52 0:00 [md0_recover] linux44:~ # cat /proc/5939/stack [<ffffffffa04cf321>] dlm_lock_sync+0x71/0x90 [md_cluster] [<ffffffffa04d0705>] recover_bitmaps+0x125/0x220 [md_cluster] [<ffffffffa052105d>] md_thread+0x16d/0x180 [md_mod] [<ffffffff8107ad94>] kthread+0xb4/0xc0 [<ffffffff8152a518>] ret_from_fork+0x58/0x90 linux44:~ # cat /proc/5938/stack [<ffffffff8107afde>] kthread_stop+0x6e/0x120 [<ffffffffa0519da0>] md_unregister_thread+0x40/0x80 [md_mod] [<ffffffffa04cfd20>] leave+0x70/0x120 [md_cluster] [<ffffffffa0525e24>] md_cluster_stop+0x14/0x30 [md_mod] [<ffffffffa05269ab>] bitmap_free+0x14b/0x150 [md_mod] [<ffffffffa0523f3b>] do_md_stop+0x35b/0x5a0 [md_mod] [<ffffffffa0524e83>] md_ioctl+0x873/0x1590 [md_mod] [<ffffffff81288464>] blkdev_ioctl+0x214/0x7d0 [<ffffffff811dd3dd>] block_ioctl+0x3d/0x40 [<ffffffff811b92d4>] do_vfs_ioctl+0x2d4/0x4b0 [<ffffffff811b9538>] SyS_ioctl+0x88/0xa0 [<ffffffff8152a5c9>] system_call_fastpath+0x16/0x1b The problem is caused by recover_bitmaps can't reliably abort when the thread is unregistered. So dlm_lock_sync_interruptible is introduced to detect the thread's situation to fix the problem. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-09-21md-cluster: convert the completion to wait queueGuoqing Jiang
Previously, we used completion to sync between require dlm lock and sync_ast, however we will have to expose completion.wait and completion.done in dlm_lock_sync_interruptible (introduced later), it is not a common usage for completion, so convert related things to wait queue. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-09-21md-cluster: protect md_find_rdev_nr_rcu with rcu lockGuoqing Jiang
We need to use rcu_read_lock/unlock to avoid potential race. Reported-by: Shaohua Li <shli@fb.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-09-21md-cluster: clean related infos of clusterGuoqing Jiang
cluster_info and bitmap_info.nodes also need to be cleared when array is stopped. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-09-21md: changes for MD_STILL_CLOSED flagGuoqing Jiang
When stop clustered raid while it is pending on resync, MD_STILL_CLOSED flag could be cleared since udev rule is triggered to open the mddev. So obviously array can't be stopped soon and returns EBUSY. mdadm -Ss md-raid-arrays.rules set MD_STILL_CLOSED md_open() ... ... ... clear MD_STILL_CLOSED do_md_stop We make below changes to resolve this issue: 1. rename MD_STILL_CLOSED to MD_CLOSING since it is set when stop array and it means we are stopping array. 2. let md_open returns early if CLOSING is set, so no other threads will open array if one thread is trying to close it. 3. no need to clear CLOSING bit in md_open because 1 has ensure the bit is cleared, then we also don't need to test CLOSING bit in do_md_stop. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-09-21md-cluster: remove some unnecessary dlm_unlock_syncGuoqing Jiang
Since DLM_LKF_FORCEUNLOCK is used in lockres_free, we don't need to call dlm_unlock_sync before free lock resource. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-09-21md-cluster: use FORCEUNLOCK in lockres_freeGuoqing Jiang
For dlm_unlock, we need to pass flag to dlm_unlock as the third parameter instead of set res->flags. Also, DLM_LKF_FORCEUNLOCK is more suitable for dlm_unlock since it works even the lock is on waiting or convert queue. Acked-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-09-21md-cluster: call md_kick_rdev_from_array once ack failedGuoqing Jiang
The new_disk_ack could return failure if WAITING_FOR_NEWDISK is not set, so we need to kick the dev from array in case failure happened. And we missed to check err before call new_disk_ack othwise we could kick a rdev which isn't in array, thanks for the reminder from Shaohua. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-09-21blk-mq: register device instead of diskMatias Bjørling
Enable devices without a gendisk instance to register itself with blk-mq and expose the associated multi-queue sysfs entries. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-15dm mpath: delay the requeue of blk-mq requests while all paths downMike Snitzer
Return DM_MAPIO_DELAY_REQUEUE from .clone_and_map_rq. Also, return false from .busy, if all paths are down, so that blk-mq requests get mapped via .clone_and_map_rq -- which results in DM_MAPIO_DELAY_REQUEUE being returned to dm-rq. This change allows for a noticeable reduction in cpu utilization (reduced kworker load) while all paths are down, e.g.: system CPU idleness (as measured by fio's --idle-prof=system): before: system: 86.58% after: system: 98.60% Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com>