Age | Commit message (Collapse) | Author |
|
Add a missing newline when printing module parameter 'start_ro' by
sysfs.
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
Pointer to mddev is already available in private_data.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
In mddev_create_serial_pool(), memalloc scope APIs memalloc_noio_save()
and memalloc_noio_restore() are used when allocating memory by calling
mempool_create_kmalloc_pool(). After adding the memalloc scope APIs in
raid array suspend context, it is unncessary to explicitly call them
around mempool_create_kmalloc_pool() any longer.
This patch removes the redundant memalloc scope APIs in
mddev_create_serial_pool().
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
In raid5.c:resize_chunk(), scribble_alloc() is called with GFP_NOIO
flag, then it is sent into kvmalloc_array() inside scribble_alloc().
The problem is kvmalloc_array() eventually calls kvmalloc_node() which
does not accept non GFP_KERNEL compatible flag like GFP_NOIO, then
kmalloc_node() is called indeed to allocate physically continuous
pages. When system memory is under heavy pressure, and the requesting
size is large, there is high probability that allocating continueous
pages will fail.
But simply using GFP_KERNEL flag to call kvmalloc_array() is also
progblematic. In the code path where scribble_alloc() is called, the
raid array is suspended, if kvmalloc_node() triggers memory reclaim I/Os
and such I/Os go back to the suspend raid array, deadlock will happen.
What is desired here is to allocate non-physically (a.k.a virtually)
continuous pages and avoid memory reclaim I/Os. Michal Hocko suggests
to use the mmealloc sceope APIs to restrict memory reclaim I/O in
allocating context, specifically to call memalloc_noio_save() when
suspend the raid array and to call memalloc_noio_restore() when
resume the raid array.
This patch adds the memalloc scope APIs in mddev_suspend() and
mddev_resume(), to restrict memory reclaim I/Os during the raid array
is suspended. The benifit of adding the memalloc scope API in the
unified entry point mddev_suspend()/mddev_resume() is, no matter which
md raid array type (personality), we are sure the deadlock by recursive
memory reclaim I/O won't happen on the suspending context.
Please notice that the memalloc scope APIs only take effect on the raid
array suspending context, if the memory allocation is from another new
created kthread after raid array suspended, the recursive memory reclaim
I/Os won't be restricted. The mddev_suspend()/mddev_resume() entries are
used for the critical section where the raid metadata is modifying,
creating a kthread to allocate memory inside the critical section is
queer and very probably being buggy.
Fixes: b330e6a49dc3 ("md: convert to kvmalloc")
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
It is not not necessary to add a newline for them since they don't exceed
80 characters, and it is not intutive to distinguish ->hot_add_disk() from
hot_add_disk() too.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
Since rdev->kobj is removed asynchronously, it is possible that the
rdev->kobj still exists when try to add the rdev again after rdev
is removed. But this path md_ioctl (HOT_ADD_DISK) -> hot_add_disk
-> bind_rdev_to_array missed it.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
We need to check mddev->del_work before flush workqueu since the purpose
of flush is to ensure the previous md is disappeared. Otherwise the similar
deadlock appeared if LOCKDEP is enabled, it is due to md_open holds the
bdev->bd_mutex before flush workqueue.
kernel: [ 154.522645] ======================================================
kernel: [ 154.522647] WARNING: possible circular locking dependency detected
kernel: [ 154.522650] 5.6.0-rc7-lp151.27-default #25 Tainted: G O
kernel: [ 154.522651] ------------------------------------------------------
kernel: [ 154.522653] mdadm/2482 is trying to acquire lock:
kernel: [ 154.522655] ffff888078529128 ((wq_completion)md_misc){+.+.}, at: flush_workqueue+0x84/0x4b0
kernel: [ 154.522673]
kernel: [ 154.522673] but task is already holding lock:
kernel: [ 154.522675] ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
kernel: [ 154.522691]
kernel: [ 154.522691] which lock already depends on the new lock.
kernel: [ 154.522691]
kernel: [ 154.522694]
kernel: [ 154.522694] the existing dependency chain (in reverse order) is:
kernel: [ 154.522696]
kernel: [ 154.522696] -> #4 (&bdev->bd_mutex){+.+.}:
kernel: [ 154.522704] __mutex_lock+0x87/0x950
kernel: [ 154.522706] __blkdev_get+0x79/0x590
kernel: [ 154.522708] blkdev_get+0x65/0x140
kernel: [ 154.522709] blkdev_get_by_dev+0x2f/0x40
kernel: [ 154.522716] lock_rdev+0x3d/0x90 [md_mod]
kernel: [ 154.522719] md_import_device+0xd6/0x1b0 [md_mod]
kernel: [ 154.522723] new_dev_store+0x15e/0x210 [md_mod]
kernel: [ 154.522728] md_attr_store+0x7a/0xc0 [md_mod]
kernel: [ 154.522732] kernfs_fop_write+0x117/0x1b0
kernel: [ 154.522735] vfs_write+0xad/0x1a0
kernel: [ 154.522737] ksys_write+0xa4/0xe0
kernel: [ 154.522745] do_syscall_64+0x64/0x2b0
kernel: [ 154.522748] entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [ 154.522749]
kernel: [ 154.522749] -> #3 (&mddev->reconfig_mutex){+.+.}:
kernel: [ 154.522752] __mutex_lock+0x87/0x950
kernel: [ 154.522756] new_dev_store+0xc9/0x210 [md_mod]
kernel: [ 154.522759] md_attr_store+0x7a/0xc0 [md_mod]
kernel: [ 154.522761] kernfs_fop_write+0x117/0x1b0
kernel: [ 154.522763] vfs_write+0xad/0x1a0
kernel: [ 154.522765] ksys_write+0xa4/0xe0
kernel: [ 154.522767] do_syscall_64+0x64/0x2b0
kernel: [ 154.522769] entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [ 154.522770]
kernel: [ 154.522770] -> #2 (kn->count#253){++++}:
kernel: [ 154.522775] __kernfs_remove+0x253/0x2c0
kernel: [ 154.522778] kernfs_remove+0x1f/0x30
kernel: [ 154.522780] kobject_del+0x28/0x60
kernel: [ 154.522783] mddev_delayed_delete+0x24/0x30 [md_mod]
kernel: [ 154.522786] process_one_work+0x2a7/0x5f0
kernel: [ 154.522788] worker_thread+0x2d/0x3d0
kernel: [ 154.522793] kthread+0x117/0x130
kernel: [ 154.522795] ret_from_fork+0x3a/0x50
kernel: [ 154.522796]
kernel: [ 154.522796] -> #1 ((work_completion)(&mddev->del_work)){+.+.}:
kernel: [ 154.522800] process_one_work+0x27e/0x5f0
kernel: [ 154.522802] worker_thread+0x2d/0x3d0
kernel: [ 154.522804] kthread+0x117/0x130
kernel: [ 154.522806] ret_from_fork+0x3a/0x50
kernel: [ 154.522807]
kernel: [ 154.522807] -> #0 ((wq_completion)md_misc){+.+.}:
kernel: [ 154.522813] __lock_acquire+0x1392/0x1690
kernel: [ 154.522816] lock_acquire+0xb4/0x1a0
kernel: [ 154.522818] flush_workqueue+0xab/0x4b0
kernel: [ 154.522821] md_open+0xb6/0xc0 [md_mod]
kernel: [ 154.522823] __blkdev_get+0xea/0x590
kernel: [ 154.522825] blkdev_get+0x65/0x140
kernel: [ 154.522828] do_dentry_open+0x1d1/0x380
kernel: [ 154.522831] path_openat+0x567/0xcc0
kernel: [ 154.522834] do_filp_open+0x9b/0x110
kernel: [ 154.522836] do_sys_openat2+0x201/0x2a0
kernel: [ 154.522838] do_sys_open+0x57/0x80
kernel: [ 154.522840] do_syscall_64+0x64/0x2b0
kernel: [ 154.522842] entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [ 154.522844]
kernel: [ 154.522844] other info that might help us debug this:
kernel: [ 154.522844]
kernel: [ 154.522846] Chain exists of:
kernel: [ 154.522846] (wq_completion)md_misc --> &mddev->reconfig_mutex --> &bdev->bd_mutex
kernel: [ 154.522846]
kernel: [ 154.522850] Possible unsafe locking scenario:
kernel: [ 154.522850]
kernel: [ 154.522852] CPU0 CPU1
kernel: [ 154.522853] ---- ----
kernel: [ 154.522854] lock(&bdev->bd_mutex);
kernel: [ 154.522856] lock(&mddev->reconfig_mutex);
kernel: [ 154.522858] lock(&bdev->bd_mutex);
kernel: [ 154.522860] lock((wq_completion)md_misc);
kernel: [ 154.522861]
kernel: [ 154.522861] *** DEADLOCK ***
kernel: [ 154.522861]
kernel: [ 154.522864] 1 lock held by mdadm/2482:
kernel: [ 154.522865] #0: ffff88804efa9338 (&bdev->bd_mutex){+.+.}, at: __blkdev_get+0x79/0x590
kernel: [ 154.522868]
kernel: [ 154.522868] stack backtrace:
kernel: [ 154.522873] CPU: 1 PID: 2482 Comm: mdadm Tainted: G O 5.6.0-rc7-lp151.27-default #25
kernel: [ 154.522875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
kernel: [ 154.522878] Call Trace:
kernel: [ 154.522881] dump_stack+0x8f/0xcb
kernel: [ 154.522884] check_noncircular+0x194/0x1b0
kernel: [ 154.522888] ? __lock_acquire+0x1392/0x1690
kernel: [ 154.522890] __lock_acquire+0x1392/0x1690
kernel: [ 154.522893] lock_acquire+0xb4/0x1a0
kernel: [ 154.522895] ? flush_workqueue+0x84/0x4b0
kernel: [ 154.522898] flush_workqueue+0xab/0x4b0
kernel: [ 154.522900] ? flush_workqueue+0x84/0x4b0
kernel: [ 154.522905] ? md_open+0xb6/0xc0 [md_mod]
kernel: [ 154.522908] md_open+0xb6/0xc0 [md_mod]
kernel: [ 154.522910] __blkdev_get+0xea/0x590
kernel: [ 154.522912] ? bd_acquire+0xc0/0xc0
kernel: [ 154.522914] blkdev_get+0x65/0x140
kernel: [ 154.522916] ? bd_acquire+0xc0/0xc0
kernel: [ 154.522918] do_dentry_open+0x1d1/0x380
kernel: [ 154.522921] path_openat+0x567/0xcc0
kernel: [ 154.522923] ? __lock_acquire+0x380/0x1690
kernel: [ 154.522926] do_filp_open+0x9b/0x110
kernel: [ 154.522929] ? __alloc_fd+0xe5/0x1f0
kernel: [ 154.522935] ? kmem_cache_alloc+0x28c/0x630
kernel: [ 154.522939] ? do_sys_openat2+0x201/0x2a0
kernel: [ 154.522941] do_sys_openat2+0x201/0x2a0
kernel: [ 154.522944] do_sys_open+0x57/0x80
kernel: [ 154.522946] do_syscall_64+0x64/0x2b0
kernel: [ 154.522948] entry_SYSCALL_64_after_hwframe+0x49/0xbe
kernel: [ 154.522951] RIP: 0033:0x7f98d279d9ae
And md_alloc also flushed the same workqueue, but the thing is different
here. Because all the paths call md_alloc don't hold bdev->bd_mutex, and
the flush is necessary to avoid race condition, so leave it as it is.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
Since the purpose of call flush_workqueue in new_dev_store is to ensure
md_delayed_delete() has completed, so we should check rdev->del_work is
pending or not.
To suppress lockdep warning, we have to check mddev->del_work while
md_delayed_delete is attached to rdev->del_work, so it is not aligned
to the purpose of flush workquee. So a new workqueue is needed to avoid
the awkward situation, and introduce a new func flush_rdev_wq to flush
the new workqueue after check if there was pending work.
Also like new_dev_store, ADD_NEW_DISK ioctl has the same purpose to flush
workqueue while it holds bdev->bd_mutex, so make the same change applies
to the ioctl to avoid similar lock issue.
And md_delayed_delete actually wants to delete rdev, so rename the function
to rdev_delayed_delete.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
Coly reported possible circular locking dependencyi with LOCKDEP enabled,
quote the below info from the detailed report [1].
[ 1607.673903] Chain exists of:
[ 1607.673903] kn->count#256 --> (wq_completion)md_misc -->
(work_completion)(&rdev->del_work)
[ 1607.673903]
[ 1607.827946] Possible unsafe locking scenario:
[ 1607.827946]
[ 1607.898780] CPU0 CPU1
[ 1607.952980] ---- ----
[ 1608.007173] lock((work_completion)(&rdev->del_work));
[ 1608.069690] lock((wq_completion)md_misc);
[ 1608.149887] lock((work_completion)(&rdev->del_work));
[ 1608.242563] lock(kn->count#256);
[ 1608.283238]
[ 1608.283238] *** DEADLOCK ***
[ 1608.283238]
[ 1608.354078] 2 locks held by kworker/5:0/843:
[ 1608.405152] #0: ffff8889eecc9948 ((wq_completion)md_misc){+.+.}, at:
process_one_work+0x42b/0xb30
[ 1608.512399] #1: ffff888a1d3b7e10
((work_completion)(&rdev->del_work)){+.+.}, at: process_one_work+0x42b/0xb30
[ 1608.632130]
Since works (rdev->del_work and mddev->del_work) are queued in md_misc_wq,
then lockdep_map lock is held if either of them are running, then both of
them try to hold kernfs lock by call kobject_del. Then if new_dev_store
or array_state_store are triggered by write to the related sysfs node, so
the write operation gets kernfs lock, but need the lockdep_map because all
of them would trigger flush_workqueue(md_misc_wq) finally, then the same
lockdep_map lock is needed.
To suppress the lockdep warnning, we should flush the workqueue in case the
related work is pending. And several works are attached to md_misc_wq, so
we need to check which work should be checked:
1. for __md_stop_writes, the purpose of call flush workqueue is ensure sync
thread is started if it was starting, so check mddev->del_work is pending
or not since md_start_sync is attached to mddev->del_work.
2. __md_stop flushes md_misc_wq to ensure event_work is done, check the
event_work is enough. Assume raid_{ctr,dtr} -> md_stop -> __md_stop doesn't
need the kernfs lock.
3. both new_dev_store (holds kernfs lock) and ADD_NEW_DISK ioctl (holds the
bdev->bd_mutex) call flush_workqueue to ensure md_delayed_delete has
completed, this case will be handled in next patch.
4. md_open flushes workqueue to ensure the previous md is disappeared, but
it holds bdev->bd_mutex then try to flush workqueue, so it is better to
check mddev->del_work as well to avoid potential lock issue, this will be
done in another patch.
[1]: https://marc.info/?l=linux-raid&m=158518958031584&w=2
Cc: Coly Li <colyli@suse.de>
Reported-by: Coly Li <colyli@suse.de>
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
Pull block driver updates from Jens Axboe:
- floppy driver cleanup series from Willy
- NVMe updates and fixes (Various)
- null_blk trace improvements (Chaitanya)
- bcache fixes (Coly)
- md fixes (via Song)
- loop block size change optimizations (Martijn)
- scnprintf() use (Takashi)
* tag 'for-5.7/drivers-2020-03-29' of git://git.kernel.dk/linux-block: (81 commits)
null_blk: add trace in null_blk_zoned.c
null_blk: add tracepoint helpers for zoned mode
block: add a zone condition debug helper
nvme: cleanup namespace identifier reporting in nvme_init_ns_head
nvme: rename __nvme_find_ns_head to nvme_find_ns_head
nvme: refactor nvme_identify_ns_descs error handling
nvme-tcp: Add warning on state change failure at nvme_tcp_setup_ctrl
nvme-rdma: Add warning on state change failure at nvme_rdma_setup_ctrl
nvme: Fix controller creation races with teardown flow
nvme: Make nvme_uninit_ctrl symmetric to nvme_init_ctrl
nvme: Fix ctrl use-after-free during sysfs deletion
nvme-pci: Re-order nvme_pci_free_ctrl
nvme: Remove unused return code from nvme_delete_ctrl_sync
nvme: Use nvme_state_terminal helper
nvme: release ida resources
nvme: Add compat_ioctl handler for NVME_IOCTL_SUBMIT_IO
nvmet-tcp: optimize tcp stack TX when data digest is used
nvme-fabrics: Use scnprintf() for avoiding potential buffer overflow
nvme-multipath: do not reset on unknown status
nvmet-rdma: allocate RW ctxs according to mdts
...
|
|
Current make_request based drivers use either blk_alloc_queue_node or
blk_alloc_queue to allocate a queue, and then set up the make_request_fn
function pointer and a few parameters using the blk_queue_make_request
helper. Simplify this by passing the make_request pointer to
blk_alloc_queue, and while at it merge the _node variant into the main
helper by always passing a node_id, and remove the superfluous gfp_mask
parameter. A lower-level __blk_alloc_queue is kept for the blk-mq case.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
These macros are just used by a few files. Move them out of genhd.h,
which is included everywhere into a new standalone header.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a new include/linux/raid/detect.h header to declare the
md_autodetect_dev prototype which can be shared between md and
the partition code. Then use IS_BUILTIN to call it instead of the
ifdef magic.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There is no good reason for __bdevname to exist. Just open code
printing the string in the callers. For three of them the format
string can be trivially merged into existing printk statements,
and in init/do_mounts.c we can at least do the scnprintf once at
the start of the function, and unconditional of CONFIG_BLOCK to
make the output for tiny configfs a little more helpful.
Acked-by: Theodore Ts'o <tytso@mit.edu> # for ext4
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Don't call quiesce(1) and quiesce(0) if array is already suspended,
otherwise in level_store, the array is writable after mddev_detach
in below part though the intention is to make array writable after
resume.
mddev_suspend(mddev);
mddev_detach(mddev);
...
mddev_resume(mddev);
And it also causes calltrace as follows in [1].
[48005.653834] WARNING: CPU: 1 PID: 45380 at kernel/kthread.c:510 kthread_park+0x77/0x90
[...]
[48005.653976] CPU: 1 PID: 45380 Comm: mdadm Tainted: G OE 5.4.10-arch1-1 #1
[48005.653979] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J4105-ITX, BIOS P1.40 08/06/2018
[48005.653984] RIP: 0010:kthread_park+0x77/0x90
[48005.654015] Call Trace:
[48005.654039] r5l_quiesce+0x3c/0x70 [raid456]
[48005.654052] raid5_quiesce+0x228/0x2e0 [raid456]
[48005.654073] mddev_detach+0x30/0x70 [md_mod]
[48005.654090] level_store+0x202/0x670 [md_mod]
[48005.654099] ? security_capable+0x40/0x60
[48005.654114] md_attr_store+0x7b/0xc0 [md_mod]
[48005.654123] kernfs_fop_write+0xce/0x1b0
[48005.654132] vfs_write+0xb6/0x1a0
[48005.654138] ksys_write+0x67/0xe0
[48005.654146] do_syscall_64+0x4e/0x140
[48005.654155] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[48005.654161] RIP: 0033:0x7fa0c8737497
[1]: https://bugzilla.kernel.org/show_bug.cgi?id=206161
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
seq_file.h.
Conversion rule is:
llseek => proc_lseek
unlocked_ioctl => proc_ioctl
xxx => proc_xxx
delete ".owner = THIS_MODULE" line
[akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
[sfr@canb.auug.org.au: fix kernel/sched/psi.c]
Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Since raid1 had already used bucket based mechanism to reduce
the conflict between write IO and resync IO, it is possible to
speed up performance for io serialization with refer to the
same mechanism.
To align with the barrier bucket mechanism, we created arrays
(with the same number of BARRIER_BUCKETS_NR) for spinlock, rb
tree and waitqueue. Then we can reduce lock competition with
multiple spinlocks, boost search performance with multiple rb
trees and also reduce thundering herd problem with multiple
waitqueues.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
Obviously, IO serialization could cause the degradation of
performance a lot. In order to reduce the degradation, so a
rb interval tree is added in raid1 to speed up the check of
collision.
So, a rb root is needed in md_rdev, then abstract all the
serialize related members to a new struct (serial_in_rdev),
embed it into md_rdev.
Of course, we need to free the struct if it is not needed
anymore, so rdev/rdevs_uninit_serial are added accordingly.
And they should be called when destroty memory pool or can't
alloc memory.
And we need to consider to call mddev_destroy_serial_pool
in case serialize_policy/write-behind is disabled, bitmap
is destroyed or in __md_stop_writes.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
So far, IO serialization is used for two scenarios:
1. raid1 which enables write-behind mode, and there is rdev in the array
which is multi-queue device and flaged with writemostly.
2. IO serialization is enabled or disabled by change serialize_policy.
So introduce rdev_need_serial to check the first scenario. And for 1, IO
serialization is enabled automatically while 2 is controlled manually.
And it is possible that both scenarios are true, so for create serial pool,
rdev/rdevs_init_serial should be separate from check if the pool existed or
not. Then for destroy pool, we need to check if the pool is needed by other
rdevs due to the first scenario.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
With the new sysfs node, we can use it to control if raid1 array
wants io serialization or not. So mddev_create_serial_pool and
mddev_destroy_serial_pool are called in serialize_policy_store
to enable or disable the serialization.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
1. The related resources (spin_lock, list and waitqueue) are needed for
address raid1 reorder overlap issue too, in this case, rdev is set to
NULL for mddev_create/destroy_serial_pool which implies all rdevs need
to handle these resources.
And also add "is_suspend" to mddev_destroy_serial_pool since it will
be called under suspended situation, which also makes both create and
destroy pool have same arguments.
2. Introduce rdevs_init_serial which is called if raid1 io serialization
is enabled since all rdevs need to init related stuffs.
3. rdev_init_serial and clear_bit(CollisionCheck, &rdev->flags) should
be called between suspend and resume.
No need to export mddev_create_serial_pool since it is only called in
md-mod module.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
It actually means create here, so fix the typo.
Reported-by: Song Liu <liu.song.a23@gmail.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
Previously, wb_info_pool and wb_list stuffs are introduced
to address potential data inconsistence issue for write
behind device.
Now rename them to serial related name, since the same
mechanism will be used to address reorder overlap write
issue for raid1.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
For super_90_load, we need to make sure 'desc_nr' less
than MD_SB_DISKS, avoiding invalid memory access of 'sb->disks'.
Fixes: 228fc7d76db6 ("md: avoid invalid memory access for array sb->dev_roles")
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
we need to gurantee 'desc_nr' valid before access array
of sb->dev_roles.
In addition, we should avoid .load_super always return '0'
when level is LEVEL_MULTIPATH, which is not expected.
Reported-by: coverity-bot <keescook+coverity-bot@chromium.org>
Addresses-Coverity-ID: 1487373 ("Memory - illegal accesses")
Fixes: 6a5cb53aaa4e ("md: no longer compare spare disk superblock events in super_load")
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
We have a test case as follow:
mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] \
--assume-clean --bitmap=internal
mdadm -S /dev/md1
mdadm -A /dev/md1 /dev/sd[b-c] --run --force
mdadm --zero /dev/sda
mdadm /dev/md1 -a /dev/sda
echo offline > /sys/block/sdc/device/state
echo offline > /sys/block/sdb/device/state
sleep 5
mdadm -S /dev/md1
echo running > /sys/block/sdb/device/state
echo running > /sys/block/sdc/device/state
mdadm -A /dev/md1 /dev/sd[a-c] --run --force
When we readd /dev/sda to the array, it started to do recovery.
After offline the other two disks in md1, the recovery have
been interrupted and superblock update info cannot be written
to the offline disks. While the spare disk (/dev/sda) can continue
to update superblock info.
After stopping the array and assemble it, we found the array
run fail, with the follow kernel message:
[ 172.986064] md: kicking non-fresh sdb from array!
[ 173.004210] md: kicking non-fresh sdc from array!
[ 173.022383] md/raid1:md1: active with 0 out of 4 mirrors
[ 173.022406] md1: failed to create bitmap (-5)
[ 173.023466] md: md1 stopped.
Since both sdb and sdc have the value of 'sb->events' smaller than
that in sda, they have been kicked from the array. However, the only
remained disk sda is in 'spare' state before stop and it cannot be
added to conf->mirrors[] array. In the end, raid array assemble
and run fail.
In fact, we can use the older disk sdb or sdc to assemble the array.
That means we should not choose the 'spare' disk as the fresh disk in
analyze_sbs().
To fix the problem, we do not compare superblock events when it is
a spare disk, as same as validate_super.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
If pers->make_request fails in md_flush_request(), the bio is lost. To
fix this, pass back a bool to indicate if the original make_request call
should continue to handle the I/O and instead of assuming the flush logic
will push it to completion.
Convert md_flush_request to return a bool and no longer calls the raid
driver's make_request function. If the return is true, then the md flush
logic has or will complete the bio and the md make_request call is done.
If false, then the md make_request function needs to keep processing like
it is a normal bio. Let the original call to md_handle_request handle any
need to retry sending the bio to the raid driver's make_request function
should it be needed.
Also mark md_flush_request and the make_request function pointer as
__must_check to issue warnings should these critical return values be
ignored.
Fixes: 2bc13b83e629 ("md: batch flush requests.")
Cc: stable@vger.kernel.org # # v4.19+
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
Due to a bug introduced in Linux 3.14 we cannot determine the
correctly layout for a multi-zone RAID0 array - there are two
possibilities.
It is possible to tell the kernel which to chose using a module
parameter, but this can be clumsy to use. It would be best if
the choice were recorded in the metadata.
So add a feature flag for this purpose.
If it is set, then the 'layout' field of the superblock is used
to determine which layout to use.
If this flag is not set, then mddev->layout gets set to -1,
which causes the module parameter to be required.
Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
Currently md raid0/linear are not provided with any mechanism to validate
if an array member got removed or failed. The driver keeps sending BIOs
regardless of the state of array members, and kernel shows state 'clean'
in the 'array_state' sysfs attribute. This leads to the following
situation: if a raid0/linear array member is removed and the array is
mounted, some user writing to this array won't realize that errors are
happening unless they check dmesg or perform one fsync per written file.
Despite udev signaling the member device is gone, 'mdadm' cannot issue the
STOP_ARRAY ioctl successfully, given the array is mounted.
In other words, no -EIO is returned and writes (except direct ones) appear
normal. Meaning the user might think the wrote data is correctly stored in
the array, but instead garbage was written given that raid0 does stripping
(and so, it requires all its members to be working in order to not corrupt
data). For md/linear, writes to the available members will work fine, but
if the writes go to the missing member(s), it'll cause a file corruption
situation, whereas the portion of the writes to the missing devices aren't
written effectively.
This patch changes this behavior: we check if the block device's gendisk
is UP when submitting the BIO to the array member, and if it isn't, we flag
the md device as MD_BROKEN and fail subsequent I/Os to that device; a read
request to the array requiring data from a valid member is still completed.
While flagging the device as MD_BROKEN, we also show a rate-limited warning
in the kernel log.
A new array state 'broken' was added too: it mimics the state 'clean' in
every aspect, being useful only to distinguish if the array has some member
missing. We rely on the MD_BROKEN flag to put the array in the 'broken'
state. This state cannot be written in 'array_state' as it just shows
one or more members of the array are missing but acts like 'clean', it
wouldn't make sense to write it.
With this patch, the filesystem reacts much faster to the event of missing
array member: after some I/O errors, ext4 for instance aborts the journal
and prevents corruption. Without this change, we're able to keep writing
in the disk and after a machine reboot, e2fsck shows some severe fs errors
that demand fixing. This patch was tested in ext4 and xfs filesystems, and
requires a 'mdadm' counterpart to handle the 'broken' state.
Cc: Song Liu <songliubraving@fb.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
Until revalidate_disk() has completed, the size of a new md array will
appear to be zero.
So we shouldn't report, through array_state, that the array is active
until that time.
udev rules check array_state to see if the array is ready. As soon as
it appear to be zero, fsck can be run. If it find the size to be
zero, it will fail.
So add a new flag to provide an interlock between do_md_run() and
array_state_show(). This flag is set while do_md_run() is active and
it prevents array_state_show() from reporting that the array is
active.
Before do_md_run() is called, ->pers will be NULL so array is
definitely not active.
After do_md_run() is called, revalidate_disk() will have run and the
array will be completely ready.
We also move various sysfs_notify*() calls out of md_run() into
do_md_run() after MD_NOT_READY is cleared. This ensure the
information is ready before the notification is sent.
Prior to v4.12, array_state_show() was called with the
mddev->reconfig_mutex held, which provided exclusion with do_md_run().
Note that MD_NOT_READY cleared twice. This is deliberate to cover
both success and error paths with minimal noise.
Fixes: b7b17c9b67e5 ("md: remove mddev_lock() from md_attr_show()")
Cc: stable@vger.kernel.org (v4.12++)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
Since commit 4ad23a976413 ("MD: use per-cpu counter for
writes_pending"), set_in_sync() is substantially more expensive: it
can wait for a full RCU grace period which can be 10s of milliseconds.
So we should only call it when the cost is justified.
md_check_recovery() currently calls set_in_sync() every time it finds
anything to do (on non-external active arrays). For an array
performing resync or recovery, this will be quite often.
Each call will introduce a delay to the md thread, which can noticeable
affect IO submission latency.
In md_check_recovery() we only need to call set_in_sync() if
'safemode' was non-zero at entry, meaning that there has been not
recent IO. So we save this "safemode was nonzero" state, and only
call set_in_sync() if it was non-zero.
This measurably reduces mean and maximum IO submission latency during
resync/recovery.
Reported-and-tested-by: Jack Wang <jinpu.wang@cloud.ionos.com>
Fixes: 4ad23a976413 ("MD: use per-cpu counter for writes_pending")
Cc: stable@vger.kernel.org (v4.12+)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
can't work
When add one disk to array, the md_reap_sync_thread is responsible
to activate the spare and set In_sync flag for the new member in
spare_active().
But if raid1 has one member disk A, and disk B is added to the array.
Then we offline A before all the datas are synchronized from A to B,
obviously B doesn't have the latest data as A, but B is still marked
with In_sync flag.
So let's not call spare_active under the condition, otherwise B is
still showed with 'U' state which is not correct.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
When a disk is added to array, the following path is called in mdadm.
Manage_subdevs -> sysfs_freeze_array
-> Manage_add
-> sysfs_set_str(&info, NULL, "sync_action","idle")
Then from kernel side, Manage_add invokes the path (add_new_disk ->
validate_super = super_1_validate) to set In_sync flag.
Since In_sync means "device is in_sync with rest of array", and the new
added disk need to resync thread to help the synchronization of data.
And md_reap_sync_thread would call spare_active to set In_sync for the
new added disk finally. So don't set In_sync if array is in frozen.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
When the 'last' device in a RAID1 or RAID10 reports an error,
we do not mark it as failed. This would serve little purpose
as there is no risk of losing data beyond that which is obviously
lost (as there is with RAID5), and there could be other sectors
on the device which are readable, and only readable from this device.
This in general this maximises access to data.
However the current implementation also stops an admin from removing
the last device by direct action. This is rarely useful, but in many
case is not harmful and can make automation easier by removing special
cases.
Also, if an attempt to write metadata fails the device must be marked
as faulty, else an infinite loop will result, attempting to update
the metadata on all non-faulty devices.
So add 'fail_last_dev' member to 'struct mddev', then we can bypasses
the 'last disk' checks for RAID1 and RAID10, and control the behavior
per array by change sysfs node.
Signed-off-by: NeilBrown <neilb@suse.de>
[add sysfs node for fail_last_dev by Guoqing]
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
Instead of linear approach to calculate power of 10, use generic int_pow()
which does it better.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
Pull percpu updates from Dennis Zhou:
"This includes changes to let percpu_ref release the backing percpu
memory earlier after it has been switched to atomic in cases where the
percpu ref is not revived.
This will help recycle percpu memory earlier in cases where the
refcounts are pinned for prolonged periods of time"
* 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
percpu_ref: release percpu memory early without PERCPU_REF_ALLOW_REINIT
md: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT
io_uring: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT
percpu_ref: introduce PERCPU_REF_ALLOW_REINIT flag
|
|
Merge 5.2-rc6 into for-5.3/block, so we get the same page merge leak
fix. Otherwise we end up having conflicts with future patches between
for-5.3/block and master that touch this area. In particular, it makes
the bio_full() fix hard to backport to stable.
* tag 'v5.2-rc6': (482 commits)
Linux 5.2-rc6
Revert "iommu/vt-d: Fix lock inversion between iommu->lock and device_domain_lock"
Bluetooth: Fix regression with minimum encryption key size alignment
tcp: refine memory limit test in tcp_fragment()
x86/vdso: Prevent segfaults due to hoisted vclock reads
SUNRPC: Fix a credential refcount leak
Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE"
net :sunrpc :clnt :Fix xps refcount imbalance on the error path
NFS4: Only set creation opendata if O_CREAT
ARM: 8867/1: vdso: pass --be8 to linker if necessary
KVM: nVMX: reorganize initial steps of vmx_set_nested_state
KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries
habanalabs: use u64_to_user_ptr() for reading user pointers
nfsd: replace Jeff by Chuck as nfsd co-maintainer
inet: clear num_timeout reqsk_alloc()
PCI/P2PDMA: Ignore root complex whitelist when an IOMMU is present
net: mvpp2: debugfs: Add pmap to fs dump
ipv6: Default fib6_type to RTN_UNICAST when not set
net: hns3: Fix inconsistent indenting
net/af_iucv: always register net_device notifier
...
|
|
Now, there are two places need to consider about
the failure of destroy bitmap, so move the common
part between bitmap_abort and abort label.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
Previously, we called rdev_init_wb to avoid potential data
inconsistency when array is created.
Now, we need to call the function and create mempool if a
device is added or just be flaged as "writemostly". So
mddev_create_wb_pool is introduced and called accordingly.
And for safety reason, we mark implicit GFP_NOIO allocation
scope for create mempool during mddev_suspend/mddev_resume.
And mempool should be removed conversely after remove a
member device or its's "writemostly" flag, which is done
by call mddev_destroy_wb_pool.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
For write-behind mode, we think write IO is complete once it has
reached all the non-writemostly devices. It works fine for single
queue devices.
But for multiqueue device, if there are lots of IOs come from upper
layer, then the write-behind device could issue those IOs to different
queues, depends on the each queue's delay, so there is no guarantee
that those IOs can arrive in order.
To address the issue, we need to check the collision among write
behind IOs, we can only continue without collision, otherwise wait
for the completion of previous collisioned IO.
And WBCollision is introduced for multiqueue device which is worked
under write-behind mode.
But this patch doesn't handle below cases which could have the data
inconsistency issue as well, these cases will be handled in later
patches.
1. modify max_write_behind by write backlog node.
2. add or remove array's bitmap dynamically.
3. the change of member disk.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
Stopping external metadata arrays during resync/recovery causes
retries, loop of interrupting and starting reconstruction, until it
hit at good moment to stop completely. While these retries
curr_mark_cnt can be small- especially on HDD drives, so subtraction
result can be smaller than 0. However it is casted to uint without
checking. As a result of it the status bar in /proc/mdstat while stopping
is strange (it jumps between 0% and 99%).
The real problem occurs here after commit 72deb455b5ec ("block: remove
CONFIG_LBDAF"). Sector_div() macro has been changed, now the
divisor is casted to uint32. For db = -8 the divisior(db/32-1) becomes 0.
Check if db value can be really counted and replace these macro by
div64_u64() inline.
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
This patch fix a spelling typo and add necessary space for code.
In addition, the patch get rid of the unnecessary 'if'.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit c42d3240990814eec1e4b2b93fa0487fc4873aed
("md: return -ENODEV if rdev has no mddev assigned") changed
rdev_attr_store to return -ENODEV when rdev->mddev is NULL, now do the
same to rdev_attr_show.
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Based on 1 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 or at your option any
later version you should have received a copy of the gnu general
public license for example usr src linux copying if not write to the
free software foundation inc 675 mass ave cambridge ma 02139 usa
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 20 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520170858.552543146@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Percpu reference counters should now be initialized with the
PERCPU_REF_ALLOW_REINIT in order to allow switching them to the
percpu mode from the atomic mode.
To make percpu_ref_switch_to_percpu() call in set_in_sync()
succeed,let's initialize percpu refcounters with the
PERCU_REF_ALLOW_REINIT flag.
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
|
|
Mdadm expects that setting drive as faulty will fail with -EBUSY only if
this operation will cause RAID to be failed. If this happens, it will
try to stop the array. Currently -EBUSY might also be returned if rdev
is in the middle of the removal process - for example there is a race
with mdmon that already requested the drive to be failed/removed.
If rdev does not contain mddev, return -ENODEV instead, so the caller
can distinguish between those two cases and behave accordingly.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
Sparse complains that it has no external declaration, and it turns out
that it is never even used outside of md.c. So just mark it static
and drop the export.
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
If we want to convert from a little endian format we need to cast
to a little endian type, otherwise sparse will be unhappy.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
If we want to convert from a little endian format we need to cast
to a little endian type, otherwise sparse will be unhappy.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
|
|
The on-disk value is little endian and we need to convert it to
native endian before storing the value in the in-core structure.
Fixes: 7564beda19b36 ("md-cluster/raid10: support add disk under grow mode")
Cc: <stable@vger.kernel.org> # 4.20+
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
|