summaryrefslogtreecommitdiff
path: root/drivers/md/raid5.c
AgeCommit message (Collapse)Author
2020-08-10Merge tag 'locking-urgent-2020-08-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Thomas Gleixner: "A set of locking fixes and updates: - Untangle the header spaghetti which causes build failures in various situations caused by the lockdep additions to seqcount to validate that the write side critical sections are non-preemptible. - The seqcount associated lock debug addons which were blocked by the above fallout. seqcount writers contrary to seqlock writers must be externally serialized, which usually happens via locking - except for strict per CPU seqcounts. As the lock is not part of the seqcount, lockdep cannot validate that the lock is held. This new debug mechanism adds the concept of associated locks. sequence count has now lock type variants and corresponding initializers which take a pointer to the associated lock used for writer serialization. If lockdep is enabled the pointer is stored and write_seqcount_begin() has a lockdep assertion to validate that the lock is held. Aside of the type and the initializer no other code changes are required at the seqcount usage sites. The rest of the seqcount API is unchanged and determines the type at compile time with the help of _Generic which is possible now that the minimal GCC version has been moved up. Adding this lockdep coverage unearthed a handful of seqcount bugs which have been addressed already independent of this. While generally useful this comes with a Trojan Horse twist: On RT kernels the write side critical section can become preemtible if the writers are serialized by an associated lock, which leads to the well known reader preempts writer livelock. RT prevents this by storing the associated lock pointer independent of lockdep in the seqcount and changing the reader side to block on the lock when a reader detects that a writer is in the write side critical section. - Conversion of seqcount usage sites to associated types and initializers" * tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits) locking/seqlock, headers: Untangle the spaghetti monster locking, arch/ia64: Reduce <asm/smp.h> header dependencies by moving XTP bits into the new <asm/xtp.h> header x86/headers: Remove APIC headers from <asm/smp.h> seqcount: More consistent seqprop names seqcount: Compress SEQCNT_LOCKNAME_ZERO() seqlock: Fold seqcount_LOCKNAME_init() definition seqlock: Fold seqcount_LOCKNAME_t definition seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g hrtimer: Use sequence counter with associated raw spinlock kvm/eventfd: Use sequence counter with associated spinlock userfaultfd: Use sequence counter with associated spinlock NFSv4: Use sequence counter with associated spinlock iocost: Use sequence counter with associated spinlock raid5: Use sequence counter with associated spinlock vfs: Use sequence counter with associated spinlock timekeeping: Use sequence counter with associated raw spinlock xfrm: policy: Use sequence counters with associated lock netfilter: nft_set_rbtree: Use sequence counter with associated rwlock netfilter: conntrack: Use sequence counter with associated spinlock sched: tasks: Use sequence counter with associated spinlock ...
2020-08-05Merge tag 'for-5.9/drivers-20200803' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block driver updates from Jens Axboe: - NVMe: - ZNS support (Aravind, Keith, Matias, Niklas) - Misc cleanups, optimizations, fixes (Baolin, Chaitanya, David, Dongli, Max, Sagi) - null_blk zone capacity support (Aravind) - MD: - raid5/6 fixes (ChangSyun) - Warning fixes (Damien) - raid5 stripe fixes (Guoqing, Song, Yufen) - sysfs deadlock fix (Junxiao) - raid10 deadlock fix (Vitaly) - struct_size conversions (Gustavo) - Set of bcache updates/fixes (Coly) * tag 'for-5.9/drivers-20200803' of git://git.kernel.dk/linux-block: (117 commits) md/raid5: Allow degraded raid6 to do rmw md/raid5: Fix Force reconstruct-write io stuck in degraded raid5 raid5: don't duplicate code for different paths in handle_stripe raid5-cache: hold spinlock instead of mutex in r5c_journal_mode_show md: print errno in super_written md/raid5: remove the redundant setting of STRIPE_HANDLE md: register new md sysfs file 'uuid' read-only md: fix max sectors calculation for super 1.0 nvme-loop: remove extra variable in create ctrl nvme-loop: set ctrl state connecting after init nvme-multipath: do not fall back to __nvme_find_path() for non-optimized paths nvme-multipath: fix logic for non-optimized paths nvme-rdma: fix controller reset hang during traffic nvme-tcp: fix controller reset hang during traffic nvmet: introduce the passthru Kconfig option nvmet: introduce the passthru configfs interface nvmet: Add passthru enable/disable helpers nvmet: add passthru code to process commands nvme: export nvme_find_get_ns() and nvme_put_ns() nvme: introduce nvme_ctrl_get_by_path() ...
2020-08-04Merge tag 'uninit-macro-v5.9-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull uninitialized_var() macro removal from Kees Cook: "This is long overdue, and has hidden too many bugs over the years. The series has several "by hand" fixes, and then a trivial treewide replacement. - Clean up non-trivial uses of uninitialized_var() - Update documentation and checkpatch for uninitialized_var() removal - Treewide removal of uninitialized_var()" * tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: compiler: Remove uninitialized_var() macro treewide: Remove uninitialized_var() usage checkpatch: Remove awareness of uninitialized_var() macro mm/debug_vm_pgtable: Remove uninitialized_var() usage f2fs: Eliminate usage of uninitialized_var() macro media: sur40: Remove uninitialized_var() usage KVM: PPC: Book3S PR: Remove uninitialized_var() usage clk: spear: Remove uninitialized_var() usage clk: st: Remove uninitialized_var() usage spi: davinci: Remove uninitialized_var() usage ide: Remove uninitialized_var() usage rtlwifi: rtl8192cu: Remove uninitialized_var() usage b43: Remove uninitialized_var() usage drbd: Remove uninitialized_var() usage x86/mm/numa: Remove uninitialized_var() usage docs: deprecated.rst: Add uninitialized_var()
2020-08-02md/raid5: Allow degraded raid6 to do rmwChangSyun Peng
Degraded raid6 always do reconstruct-write now. With raid6 xor supported, we can do rmw in degraded raid6. This patch can reduce many read IOs to improve performance. If the failed disk is P, Q or the disk we want to write to, we may need to do reconstruct-write in max degraded raid6. In this situation we can not read enough data from handle_stripe_dirtying() so we have to set force_rcw in handle_stripe_fill() to read all data. Reviewed-by: Alex Wu <alexwu@synology.com> Reviewed-by: BingJing Chang <bingjingc@synology.com> Reviewed-by: Danny Shih <dannyshih@synology.com> Signed-off-by: ChangSyun Peng <allenpeng@synology.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-02md/raid5: Fix Force reconstruct-write io stuck in degraded raid5ChangSyun Peng
In degraded raid5, we need to read parity to do reconstruct-write when data disks fail. However, we can not read parity from handle_stripe_dirtying() in force reconstruct-write mode. Reproducible Steps: 1. Create degraded raid5 mdadm -C /dev/md2 --assume-clean -l5 -n3 /dev/sda2 /dev/sdb2 missing 2. Set rmw_level to 0 echo 0 > /sys/block/md2/md/rmw_level 3. IO to raid5 Now some io may be stuck in raid5. We can use handle_stripe_fill() to read the parity in this situation. Cc: <stable@vger.kernel.org> # v4.4+ Reviewed-by: Alex Wu <alexwu@synology.com> Reviewed-by: BingJing Chang <bingjingc@synology.com> Reviewed-by: Danny Shih <dannyshih@synology.com> Signed-off-by: ChangSyun Peng <allenpeng@synology.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-02raid5: don't duplicate code for different paths in handle_stripeGuoqing Jiang
As we can see, R5_LOCKED is set and s.locked is increased whether R5_ReWrite is set or not, so move it to common path. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-08-02md/raid5: remove the redundant setting of STRIPE_HANDLEGuoqing Jiang
The flag is already set before compare rcw with rmw, so it is not necessary to do it again. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-29raid5: Use sequence counter with associated spinlockAhmed S. Darwish
A sequence counter write side critical section must be protected by some form of locking to serialize writers. A plain seqcount_t does not contain the information of which lock must be held when entering a write side critical section. Use the new seqcount_spinlock_t data type, which allows to associate a spinlock with the sequence counter. This enables lockdep to verify that the spinlock used for writer serialization is held when the write side critical section is entered. If lockdep is disabled this lock association is compiled out and has neither storage size nor runtime overhead. Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Song Liu <song@kernel.org> Link: https://lkml.kernel.org/r/20200720155530.1173732-20-a.darwish@linutronix.de
2020-07-22md/raid5: use do_div() for 64 bit divisions in raid5_sync_requestYufen Yu
We get compilation error on 32-bit architectures (e.g. m68k), as: ERROR: modpost: "__udivdi3" [drivers/md/raid456.ko] undefined! Since 'sync_blocks' is defined as u64, use do_div() to fix this error. Fixes: c911c46c017c ("md/raid456: convert macro STRIPE_* to RAID5_STRIPE_*") Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-21md/raid5: support config stripe_size by sysfs entryYufen Yu
Adding a new 'stripe_size' sysfs entry to set and show stripe_size. stripe_size should not be bigger than PAGE_SIZE, and it requires to be multiple of 4096. We can adjust stripe_size by writing value into sysfs entry, likely, set stripe_size as 16KB: echo 16384 > /sys/block/md1/md/stripe_size Show current stripe_size value: cat /sys/block/md1/md/stripe_size For PAGE_SIZE is equal to 4096, 'stripe_size' can just be read. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-21md/raid5: set default stripe_size as 4096Yufen Yu
In RAID5, if issued bio size is bigger than stripe_size, it will be split in the unit of stripe_size and process them one by one. Even for size less then stripe_size, RAID5 also request data from disk at least of stripe_size. Nowdays, stripe_size is equal to the value of PAGE_SIZE. Since filesystem usually issue bio in the unit of 4KB, there is no problem for PAGE_SIZE as 4KB. But, for 64KB PAGE_SIZE, bio from filesystem requests 4KB data while RAID5 issue IO at least stripe_size (64KB) each time. That will waste resource of disk bandwidth and compute xor. To avoding the waste, we want to make stripe_size configurable. This patch just set default stripe_size as 4096. User can also set the value bigger than 4KB for some special requirements, such as we know the issued io size is more than 4KB. To evaluate the new feature, we create raid5 device '/dev/md5' with 4 SSD disk and test it on arm64 machine with 64KB PAGE_SIZE. 1) We format /dev/md5 with mkfs.ext4 and mount ext4 with default configure on /mnt directory. Then, trying to test it by dbench with command: dbench -D /mnt -t 1000 10. Result show as: 'stripe_size = 64KB' Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 9805011 0.021 64.728 Close 7202525 0.001 0.120 Rename 415213 0.051 44.681 Unlink 1980066 0.079 93.147 Deltree 240 1.793 6.516 Mkdir 120 0.004 0.007 Qpathinfo 8887512 0.007 37.114 Qfileinfo 1557262 0.001 0.030 Qfsinfo 1629582 0.012 0.152 Sfileinfo 798756 0.040 57.641 Find 3436004 0.019 57.782 WriteX 4887239 0.021 57.638 ReadX 15370483 0.005 37.818 LockX 31934 0.003 0.022 UnlockX 31933 0.001 0.021 Flush 687205 13.302 530.088 Throughput 307.799 MB/sec 10 clients 10 procs max_latency=530.091 ms ------------------------------------------------------- 'stripe_size = 4KB' Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 11999166 0.021 36.380 Close 8814128 0.001 0.122 Rename 508113 0.051 29.169 Unlink 2423242 0.070 38.141 Deltree 300 1.885 7.155 Mkdir 150 0.004 0.006 Qpathinfo 10875921 0.007 35.485 Qfileinfo 1905837 0.001 0.032 Qfsinfo 1994304 0.012 0.125 Sfileinfo 977450 0.029 26.489 Find 4204952 0.019 9.361 WriteX 5981890 0.019 27.804 ReadX 18809742 0.004 33.491 LockX 39074 0.003 0.025 UnlockX 39074 0.001 0.014 Flush 841022 10.712 458.848 Throughput 376.777 MB/sec 10 clients 10 procs max_latency=458.852 ms ------------------------------------------------------- It show that setting stripe_size as 4KB has higher thoughput, i.e. (376.777 vs 307.799) and has smaller latency than that setting as 64KB. 2) We try to evaluate IO throughput for /dev/md5 by fio with config: [4KB randwrite] direct=1 numjob=2 iodepth=64 ioengine=libaio filename=/dev/md5 bs=4KB rw=randwrite [64KB write] direct=1 numjob=2 iodepth=64 ioengine=libaio filename=/dev/md5 bs=1MB rw=write The result as follow: + + | stripe_size(64KB) | stripe_size(4KB) +----------------------------------------------------+ 4KB randwrite | 15MB/s | 100MB/s +----------------------------------------------------+ 1MB write | 1000MB/s | 700MB/s The result show that when size of io is bigger than 4KB (64KB), 64KB stripe_size has much higher IOPS. But for 4KB randwrite, that means, size of io issued to device are smaller, 4KB stripe_size have better performance. Normally, default value (4096) can get relatively good performance. But if each issued io is bigger than 4096, setting value more than 4096 may get better performance. Here, we just set default stripe_size as 4096, and we will try to support setting different stripe_size by sysfs interface in the following patch. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-21md/raid456: convert macro STRIPE_* to RAID5_STRIPE_*Yufen Yu
Convert macro STRIPE_SIZE, STRIPE_SECTORS and STRIPE_SHIFT to RAID5_STRIPE_SIZE(), RAID5_STRIPE_SECTORS() and RAID5_STRIPE_SHIFT(). This patch is prepare for the following adjustable stripe_size. It will not change any existing functionality. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-16treewide: Remove uninitialized_var() usageKees Cook
Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \ xargs perl -pi -e \ 's/\buninitialized_var\(([^\)]+)\)/\1/g; s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/ [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/ [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-16raid5: remove the meaningless check in raid5_make_requestGuoqing Jiang
We can't guarntee the batched stripe to be set with STRIPE_HANDLE since there are lots of functions could set the flag, such as sync_request, ops_complete_* and end_{read,write}_request etc. Also clear_batch_ready called in handle_stripe ensures the batched list can't continue to be handled by handle_stripe. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-16raid5: put the comment of clear_batch_ready to the right placeGuoqing Jiang
To make people understand the function well, let's put the comment to the right place. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-16raid5: call clear_batch_ready before set STRIPE_ACTIVEGuoqing Jiang
We tried to only put the head sh of batch list to handle_list, then the handle_stripe doesn't handle other members in the batch list. However, we still got the calltrace in break_stripe_batch_list. [593764.644269] stripe state: 2003 kernel: [593764.644299] ------------[ cut here ]------------ kernel: [593764.644308] WARNING: CPU: 12 PID: 856 at drivers/md/raid5.c:4625 break_stripe_batch_list+0x203/0x240 [raid456] [...] kernel: [593764.644363] Call Trace: kernel: [593764.644370] handle_stripe+0x907/0x20c0 [raid456] kernel: [593764.644376] ? __wake_up_common_lock+0x89/0xc0 kernel: [593764.644379] handle_active_stripes.isra.57+0x35f/0x570 [raid456] kernel: [593764.644382] ? raid5_wakeup_stripe_thread+0x96/0x1f0 [raid456] kernel: [593764.644385] raid5d+0x480/0x6a0 [raid456] kernel: [593764.644390] ? md_thread+0x11f/0x160 kernel: [593764.644392] md_thread+0x11f/0x160 kernel: [593764.644394] ? wait_woken+0x80/0x80 kernel: [593764.644396] kthread+0xfc/0x130 kernel: [593764.644398] ? find_pers+0x70/0x70 kernel: [593764.644399] ? kthread_create_on_node+0x70/0x70 kernel: [593764.644401] ret_from_fork+0x1f/0x30 As we can see, the stripe was set with STRIPE_ACTIVE and STRIPE_HANDLE, and only handle_stripe could set those flags then return. And since the stipe was already in the batch list, we need to return earlier before set the two flags. And after dig a little about git history especially commit 3664847d95e6 ("md/raid5: fix a race condition in stripe batch"), it seems the batched stipe still could be handled by handle_stipe, then handle_stipe needs to return earlier if clear_batch_ready to return true. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-15md: raid5: Fix compilation warningDamien Le Moal
Remove the if statement around the calls to sysfs_link_rdev() to avoid the compilation warning "suggest braces around empty body in an ‘if’ statement" when compiling with W=1. Also fix function description comments to avoid kdoc format warnings. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-14md: fix deadlock causing by sysfs_notifyJunxiao Bi
The following deadlock was captured. The first process is holding 'kernfs_mutex' and hung by io. The io was staging in 'r1conf.pending_bio_list' of raid1 device, this pending bio list would be flushed by second process 'md127_raid1', but it was hung by 'kernfs_mutex'. Using sysfs_notify_dirent_safe() to replace sysfs_notify() can fix it. There were other sysfs_notify() invoked from io path, removed all of them. PID: 40430 TASK: ffff8ee9c8c65c40 CPU: 29 COMMAND: "probe_file" #0 [ffffb87c4df37260] __schedule at ffffffff9a8678ec #1 [ffffb87c4df372f8] schedule at ffffffff9a867f06 #2 [ffffb87c4df37310] io_schedule at ffffffff9a0c73e6 #3 [ffffb87c4df37328] __dta___xfs_iunpin_wait_3443 at ffffffffc03a4057 [xfs] #4 [ffffb87c4df373a0] xfs_iunpin_wait at ffffffffc03a6c79 [xfs] #5 [ffffb87c4df373b0] __dta_xfs_reclaim_inode_3357 at ffffffffc039a46c [xfs] #6 [ffffb87c4df37400] xfs_reclaim_inodes_ag at ffffffffc039a8b6 [xfs] #7 [ffffb87c4df37590] xfs_reclaim_inodes_nr at ffffffffc039bb33 [xfs] #8 [ffffb87c4df375b0] xfs_fs_free_cached_objects at ffffffffc03af0e9 [xfs] #9 [ffffb87c4df375c0] super_cache_scan at ffffffff9a287ec7 #10 [ffffb87c4df37618] shrink_slab at ffffffff9a1efd93 #11 [ffffb87c4df37700] shrink_node at ffffffff9a1f5968 #12 [ffffb87c4df37788] do_try_to_free_pages at ffffffff9a1f5ea2 #13 [ffffb87c4df377f0] try_to_free_mem_cgroup_pages at ffffffff9a1f6445 #14 [ffffb87c4df37880] try_charge at ffffffff9a26cc5f #15 [ffffb87c4df37920] memcg_kmem_charge_memcg at ffffffff9a270f6a #16 [ffffb87c4df37958] new_slab at ffffffff9a251430 #17 [ffffb87c4df379c0] ___slab_alloc at ffffffff9a251c85 #18 [ffffb87c4df37a80] __slab_alloc at ffffffff9a25635d #19 [ffffb87c4df37ac0] kmem_cache_alloc at ffffffff9a251f89 #20 [ffffb87c4df37b00] alloc_inode at ffffffff9a2a2b10 #21 [ffffb87c4df37b20] iget_locked at ffffffff9a2a4854 #22 [ffffb87c4df37b60] kernfs_get_inode at ffffffff9a311377 #23 [ffffb87c4df37b80] kernfs_iop_lookup at ffffffff9a311e2b #24 [ffffb87c4df37ba8] lookup_slow at ffffffff9a290118 #25 [ffffb87c4df37c10] walk_component at ffffffff9a291e83 #26 [ffffb87c4df37c78] path_lookupat at ffffffff9a293619 #27 [ffffb87c4df37cd8] filename_lookup at ffffffff9a2953af #28 [ffffb87c4df37de8] user_path_at_empty at ffffffff9a295566 #29 [ffffb87c4df37e10] vfs_statx at ffffffff9a289787 #30 [ffffb87c4df37e70] SYSC_newlstat at ffffffff9a289d5d #31 [ffffb87c4df37f18] sys_newlstat at ffffffff9a28a60e #32 [ffffb87c4df37f28] do_syscall_64 at ffffffff9a003949 #33 [ffffb87c4df37f50] entry_SYSCALL_64_after_hwframe at ffffffff9aa001ad RIP: 00007f617a5f2905 RSP: 00007f607334f838 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 00007f6064044b20 RCX: 00007f617a5f2905 RDX: 00007f6064044b20 RSI: 00007f6064044b20 RDI: 00007f6064005890 RBP: 00007f6064044aa0 R8: 0000000000000030 R9: 000000000000011c R10: 0000000000000013 R11: 0000000000000246 R12: 00007f606417e6d0 R13: 00007f6064044aa0 R14: 00007f6064044b10 R15: 00000000ffffffff ORIG_RAX: 0000000000000006 CS: 0033 SS: 002b PID: 927 TASK: ffff8f15ac5dbd80 CPU: 42 COMMAND: "md127_raid1" #0 [ffffb87c4df07b28] __schedule at ffffffff9a8678ec #1 [ffffb87c4df07bc0] schedule at ffffffff9a867f06 #2 [ffffb87c4df07bd8] schedule_preempt_disabled at ffffffff9a86825e #3 [ffffb87c4df07be8] __mutex_lock at ffffffff9a869bcc #4 [ffffb87c4df07ca0] __mutex_lock_slowpath at ffffffff9a86a013 #5 [ffffb87c4df07cb0] mutex_lock at ffffffff9a86a04f #6 [ffffb87c4df07cc8] kernfs_find_and_get_ns at ffffffff9a311d83 #7 [ffffb87c4df07cf0] sysfs_notify at ffffffff9a314b3a #8 [ffffb87c4df07d18] md_update_sb at ffffffff9a688696 #9 [ffffb87c4df07d98] md_update_sb at ffffffff9a6886d5 #10 [ffffb87c4df07da8] md_check_recovery at ffffffff9a68ad9c #11 [ffffb87c4df07dd0] raid1d at ffffffffc01f0375 [raid1] #12 [ffffb87c4df07ea0] md_thread at ffffffff9a680348 #13 [ffffb87c4df07f08] kthread at ffffffff9a0b8005 #14 [ffffb87c4df07f50] ret_from_fork at ffffffff9aa00344 Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-08writeback: remove bdi->congested_fnChristoph Hellwig
Except for pktdvd, the only places setting congested bits are file systems that allocate their own backing_dev_info structures. And pktdvd is a deprecated driver that isn't useful in stack setup either. So remove the dead congested_fn stacking infrastructure. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Acked-by: David Sterba <dsterba@suse.com> [axboe: fixup unused variables in bcache/request.c] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01block: rename generic_make_request to submit_bio_noacctChristoph Hellwig
generic_make_request has always been very confusingly misnamed, so rename it to submit_bio_noacct to make it clear that it is submit_bio minus accounting and a few checks. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-13raid5: update code comment of scribble_alloc()Coly Li
Code comments of scribble_alloc() is outdated for a while. This patch update the comments in function header for the new parameter list. Suggested-by: Song Liu <songliubraving@fb.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-05-13raid5: remove gfp flags from scribble_alloc()Coly Li
Using GFP_NOIO flag to call scribble_alloc() from resize_chunk() does not have the expected behavior. kvmalloc_array() inside scribble_alloc() which receives the GFP_NOIO flag will eventually call kmalloc_node() to allocate physically continuous pages. Now we have memalloc scope APIs in mddev_suspend()/mddev_resume() to prevent memory reclaim I/Os during raid array suspend context, calling to kvmalloc_array() with GFP_KERNEL flag may avoid deadlock of recursive I/O as expected. This patch removes the useless gfp flags from parameters list of scribble_alloc(), and call kvmalloc_array() with GFP_KERNEL flag. The incorrect GFP_NOIO flag does not exist anymore. Fixes: b330e6a49dc3 ("md: convert to kvmalloc") Suggested-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-01-13raid5: remove worker_cnt_per_group argument from alloc_thread_groupsGuoqing Jiang
We can use "cnt" directly to update conf->worker_cnt_per_group if alloc_thread_groups returns 0. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-12-11raid5: need to set STRIPE_HANDLE for batch headGuoqing Jiang
With commit 6ce220dd2f8ea71d6afc29b9a7524c12e39f374a ("raid5: don't set STRIPE_HANDLE to stripe which is in batch list"), we don't want to set STRIPE_HANDLE flag for sh which is already in batch list. However, the stripe which is the head of batch list should set this flag, otherwise panic could happen inside init_stripe at BUG_ON(sh->batch_head), it is reproducible with raid5 on top of nvdimm devices per Xiao oberserved. Thanks for Xiao's effort to verify the change. Fixes: 6ce220dd2f8ea ("raid5: don't set STRIPE_HANDLE to stripe which is in batch list") Reported-by: Xiao Ni <xni@redhat.com> Tested-by: Xiao Ni <xni@redhat.com> Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-11-14drivers/md/raid5.c: use the new spelling of RWH_WRITE_LIFE_NOT_SETEugene Syromiatnikov
As it is consistent with prefixes of other write life time hints. Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-10-24md: improve handling of bio with REQ_PREFLUSH in md_flush_request()David Jeffery
If pers->make_request fails in md_flush_request(), the bio is lost. To fix this, pass back a bool to indicate if the original make_request call should continue to handle the I/O and instead of assuming the flush logic will push it to completion. Convert md_flush_request to return a bool and no longer calls the raid driver's make_request function. If the return is true, then the md flush logic has or will complete the bio and the md make_request call is done. If false, then the md make_request function needs to keep processing like it is a normal bio. Let the original call to md_handle_request handle any need to retry sending the bio to the raid driver's make_request function should it be needed. Also mark md_flush_request and the make_request function pointer as __must_check to issue warnings should these critical return values be ignored. Fixes: 2bc13b83e629 ("md: batch flush requests.") Cc: stable@vger.kernel.org # # v4.19+ Cc: NeilBrown <neilb@suse.com> Signed-off-by: David Jeffery <djeffery@redhat.com> Reviewed-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-13raid5: remove STRIPE_OPS_REQ_PENDINGGuoqing Jiang
This stripe state is not used anymore after commit 51acbcec6c42b24 ("md: remove CONFIG_MULTICORE_RAID456"), so remove the obsoleted state. gjiang@nb01257:~/md$ grep STRIPE_OPS_REQ_PENDING drivers/md/ -r drivers/md/raid5.c: (1 << STRIPE_OPS_REQ_PENDING) | drivers/md/raid5.h: STRIPE_OPS_REQ_PENDING, Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-13raid5: don't set STRIPE_HANDLE to stripe which is in batch listGuoqing Jiang
If stripe in batch list is set with STRIPE_HANDLE flag, then the stripe could be set with STRIPE_ACTIVE by the handle_stripe function. And if error happens to the batch_head at the same time, break_stripe_batch_list is called, then below warning could happen (the same report in [1]), it means a member of batch list was set with STRIPE_ACTIVE. [7028915.431770] stripe state: 2001 [7028915.431815] ------------[ cut here ]------------ [7028915.431828] WARNING: CPU: 18 PID: 29089 at drivers/md/raid5.c:4614 break_stripe_batch_list+0x203/0x240 [raid456] [...] [7028915.431879] CPU: 18 PID: 29089 Comm: kworker/u82:5 Tainted: G O 4.14.86-1-storage #4.14.86-1.2~deb9 [7028915.431881] Hardware name: Supermicro SSG-2028R-ACR24L/X10DRH-iT, BIOS 3.1 06/18/2018 [7028915.431888] Workqueue: raid5wq raid5_do_work [raid456] [7028915.431890] task: ffff9ab0ef36d7c0 task.stack: ffffb72926f84000 [7028915.431896] RIP: 0010:break_stripe_batch_list+0x203/0x240 [raid456] [7028915.431898] RSP: 0018:ffffb72926f87ba8 EFLAGS: 00010286 [7028915.431900] RAX: 0000000000000012 RBX: ffff9aaa84a98000 RCX: 0000000000000000 [7028915.431901] RDX: 0000000000000000 RSI: ffff9ab2bfa15458 RDI: ffff9ab2bfa15458 [7028915.431902] RBP: ffff9aaa8fb4e900 R08: 0000000000000001 R09: 0000000000002eb4 [7028915.431903] R10: 00000000ffffffff R11: 0000000000000000 R12: ffff9ab1736f1b00 [7028915.431904] R13: 0000000000000000 R14: ffff9aaa8fb4e900 R15: 0000000000000001 [7028915.431906] FS: 0000000000000000(0000) GS:ffff9ab2bfa00000(0000) knlGS:0000000000000000 [7028915.431907] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [7028915.431908] CR2: 00007ff953b9f5d8 CR3: 0000000bf4009002 CR4: 00000000003606e0 [7028915.431909] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [7028915.431910] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [7028915.431910] Call Trace: [7028915.431923] handle_stripe+0x8e7/0x2020 [raid456] [7028915.431930] ? __wake_up_common_lock+0x89/0xc0 [7028915.431935] handle_active_stripes.isra.58+0x35f/0x560 [raid456] [7028915.431939] raid5_do_work+0xc6/0x1f0 [raid456] Also commit 59fc630b8b5f9f ("RAID5: batch adjacent full stripe write") said "If a stripe is added to batch list, then only the first stripe of the list should be put to handle_list and run handle_stripe." So don't set STRIPE_HANDLE to stripe which is already in batch list, otherwise the stripe could be put to handle_list and run handle_stripe, then the above warning could be triggered. [1]. https://www.spinics.net/lists/raid/msg62552.html Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-13raid5: don't increment read_errors on EILSEQ returnNigel Croxon
While MD continues to count read errors returned by the lower layer. If those errors are -EILSEQ, instead of -EIO, it should NOT increase the read_errors count. When RAID6 is set up on dm-integrity target that detects massive corruption, the leg will be ejected from the array. Even if the issue is correctable with a sector re-write and the array has necessary redundancy to correct it. The leg is ejected because it runs up the rdev->read_errors beyond conf->max_nr_stripes. The return status in dm-drypt when there is a data integrity error is -EILSEQ (BLK_STS_PROTECTION). Signed-off-by: Nigel Croxon <ncroxon@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-09-03md/raid5: use bio_end_sector to calculate last_sectorGuoqing Jiang
Use the common way to get last_sector. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-27raid5 improve too many read errors msg by adding limitsNigel Croxon
Often limits can be changed by admin. When discussing such things it helps if you can provide "self-sustained" facts. Also sometimes the admin thinks he changed a limit, but it did not take effect for some reason or he changed the wrong thing. V3: Only pr_warn when Faulty is 0. V2: Add read_errors value to pr_warn. Signed-off-by: Nigel Croxon <ncroxon@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-08-07md/raid6: Set R5_ReadError when there is read failure on parity diskXiao Ni
7471fb77ce4d ("md/raid6: Fix anomily when recovering a single device in RAID6.") avoids rereading P when it can be computed from other members. However, this misses the chance to re-write the right data to P. This patch sets R5_ReadError if the re-read fails. Also, when re-read is skipped, we also missed the chance to reset rdev->read_errors to 0. It can fail the disk when there are many read errors on P member disk (other disks don't have read error) V2: upper layer read request don't read parity/Q data. So there is no need to consider such situation. This is Reported-by: kbuild test robot <lkp@intel.com> Fixes: 7471fb77ce4d ("md/raid6: Fix anomily when recovering a single device in RAID6.") Cc: <stable@vger.kernel.org> #4.4+ Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-06-20block: remove the bi_phys_segments field in struct bioChristoph Hellwig
We only need the number of segments in the blk-mq submission path. Remove the field from struct bio, and return it from a variant of blk_queue_split instead of that it can passed as an argument to those functions that need the value. This also means we stop recounting segments except for cloning and partial segments. To keep the number of arguments in this how path down remove pointless struct request_queue arguments from any of the functions that had it and grew a nr_segs argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-15raid5-cache: Need to do start() part job after adding journal deviceXiao Ni
commit d5d885fd514f ("md: introduce new personality funciton start()") splits the init job to two parts. The first part run() does the jobs that do not require the md threads. The second part start() does the jobs that require the md threads. Now it just does run() in adding new journal device. It needs to do the second part start() too. Fixes: d5d885fd514f ("md: introduce new personality funciton start()") Cc: stable@vger.kernel.org #v4.9+ Reported-by: Michal Soltys <soltys@ziu.info> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-24treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 47Thomas Gleixner
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 or at your option any later version you should have received a copy of the gnu general public license for example usr src linux copying if not write to the free software foundation inc 675 mass ave cambridge ma 02139 usa extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 20 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190520170858.552543146@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-04-16md/raid: raid5 preserve the writeback action after the parity checkNigel Croxon
The problem is that any 'uptodate' vs 'disks' check is not precise in this path. Put a "WARN_ON(!test_bit(R5_UPTODATE, &dev->flags)" on the device that might try to kick off writes and then skip the action. Better to prevent the raid driver from taking unexpected action *and* keep the system alive vs killing the machine with BUG_ON. Note: fixed warning reported by kbuild test robot <lkp@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Nigel Croxon <ncroxon@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-16Revert "Don't jump to compute_result state from check_result state"Song Liu
This reverts commit 4f4fd7c5798bbdd5a03a60f6269cf1177fbd11ef. Cc: Dan Williams <dan.j.williams@intel.com> Cc: Nigel Croxon <ncroxon@redhat.com> Cc: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-10md: add __acquires/__releases annotations to handle_active_stripesChristoph Hellwig
This tells sparse that we release and reacquire the device_lock and avoids a warning. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-10md: add __acquires/__releases annotations to (un)lock_two_stripesChristoph Hellwig
This tells sparse that we acquire/release the two stripe locks and avoids a warning. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-04-01Don't jump to compute_result state from check_result stateNigel Croxon
Changing state from check_state_check_result to check_state_compute_result not only is unsafe but also doesn't appear to serve a valid purpose. A raid6 check should only be pushing out extra writes if doing repair and a mis-match occurs. The stripe dev management will already try and do repair writes for failing sectors. This patch makes the raid6 check_state_check_result handling work more like raid5's. If somehow too many failures for a check, just quit the check operation for the stripe. When any checks pass, don't try and use check_state_compute_result for a purpose it isn't needed for and is unsafe for. Just mark the stripe as in sync for passing its parity checks and let the stripe dev read/write code and the bad blocks list do their job handling I/O errors. Repro steps from Xiao: These are the steps to reproduce this problem: 1. redefined OPT_MEDIUM_ERR_ADDR to 12000 in scsi_debug.c 2. insmod scsi_debug.ko dev_size_mb=11000 max_luns=1 num_tgts=1 3. mdadm --create /dev/md127 --level=6 --raid-devices=5 /dev/sde1 /dev/sde2 /dev/sde3 /dev/sde5 /dev/sde6 sde is the disk created by scsi_debug 4. echo "2" >/sys/module/scsi_debug/parameters/opts 5. raid-check It panic: [ 4854.730899] md: data-check of RAID array md127 [ 4854.857455] sd 5:0:0:0: [sdr] tag#80 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 4854.859246] sd 5:0:0:0: [sdr] tag#80 Sense Key : Medium Error [current] [ 4854.860694] sd 5:0:0:0: [sdr] tag#80 Add. Sense: Unrecovered read error [ 4854.862207] sd 5:0:0:0: [sdr] tag#80 CDB: Read(10) 28 00 00 00 2d 88 00 04 00 00 [ 4854.864196] print_req_error: critical medium error, dev sdr, sector 11656 flags 0 [ 4854.867409] sd 5:0:0:0: [sdr] tag#100 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 4854.869469] sd 5:0:0:0: [sdr] tag#100 Sense Key : Medium Error [current] [ 4854.871206] sd 5:0:0:0: [sdr] tag#100 Add. Sense: Unrecovered read error [ 4854.872858] sd 5:0:0:0: [sdr] tag#100 CDB: Read(10) 28 00 00 00 2e e0 00 00 08 00 [ 4854.874587] print_req_error: critical medium error, dev sdr, sector 12000 flags 4000 [ 4854.876456] sd 5:0:0:0: [sdr] tag#101 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 4854.878552] sd 5:0:0:0: [sdr] tag#101 Sense Key : Medium Error [current] [ 4854.880278] sd 5:0:0:0: [sdr] tag#101 Add. Sense: Unrecovered read error [ 4854.881846] sd 5:0:0:0: [sdr] tag#101 CDB: Read(10) 28 00 00 00 2e e8 00 00 08 00 [ 4854.883691] print_req_error: critical medium error, dev sdr, sector 12008 flags 4000 [ 4854.893927] sd 5:0:0:0: [sdr] tag#166 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ 4854.896002] sd 5:0:0:0: [sdr] tag#166 Sense Key : Medium Error [current] [ 4854.897561] sd 5:0:0:0: [sdr] tag#166 Add. Sense: Unrecovered read error [ 4854.899110] sd 5:0:0:0: [sdr] tag#166 CDB: Read(10) 28 00 00 00 2e e0 00 00 10 00 [ 4854.900989] print_req_error: critical medium error, dev sdr, sector 12000 flags 0 [ 4854.902757] md/raid:md127: read error NOT corrected!! (sector 9952 on sdr1). [ 4854.904375] md/raid:md127: read error NOT corrected!! (sector 9960 on sdr1). [ 4854.906201] ------------[ cut here ]------------ [ 4854.907341] kernel BUG at drivers/md/raid5.c:4190! raid5.c:4190 above is this BUG_ON: handle_parity_checks6() ... BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */ Cc: <stable@vger.kernel.org> # v3.16+ OriginalAuthor: David Jeffery <djeffery@redhat.com> Cc: Xiao Ni <xni@redhat.com> Tested-by: David Jeffery <djeffery@redhat.com> Signed-off-by: David Jeffy <djeffery@redhat.com> Signed-off-by: Nigel Croxon <ncroxon@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-16Merge tag 'for-5.1/block-post-20190315' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull more block layer changes from Jens Axboe: "This is a collection of both stragglers, and fixes that came in after I finalized the initial pull. This contains: - An MD pull request from Song, with a few minor fixes - Set of NVMe patches via Christoph - Pull request from Konrad, with a few fixes for xen/blkback - pblk fix IO calculation fix (Javier) - Segment calculation fix for pass-through (Ming) - Fallthrough annotation for blkcg (Mathieu)" * tag 'for-5.1/block-post-20190315' of git://git.kernel.dk/linux-block: (25 commits) blkcg: annotate implicit fall through nvme-tcp: support C2HData with SUCCESS flag nvmet: ignore EOPNOTSUPP for discard nvme: add proper write zeroes setup for the multipath device nvme: add proper discard setup for the multipath device nvme: remove nvme_ns_config_oncs nvme: disable Write Zeroes for qemu controllers nvmet-fc: bring Disconnect into compliance with FC-NVME spec nvmet-fc: fix issues with targetport assoc_list list walking nvme-fc: reject reconnect if io queue count is reduced to zero nvme-fc: fix numa_node when dev is null nvme-fc: use nr_phys_segments to determine existence of sgl nvme-loop: init nvmet_ctrl fatal_err_work when allocate nvme: update comment to make the code easier to read nvme: put ns_head ref if namespace fails allocation nvme-trace: fix cdw10 buffer overrun nvme: don't warn on block content change effects nvme: add get-feature to admin cmds tracer md: Fix failed allocation of md_register_thread It's wrong to add len to sector_nr in raid10 reshape twice ...
2019-03-12md: Fix failed allocation of md_register_threadAditya Pakki
mddev->sync_thread can be set to NULL on kzalloc failure downstream. The patch checks for such a scenario and frees allocated resources. Committer node: Added similar fix to raid5.c, as suggested by Guoqing. Cc: stable@vger.kernel.org # v3.16+ Acked-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Aditya Pakki <pakki001@umn.edu> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-03-12raid5: set write hint for PPLMariusz Dabrowski
When the Partial Parity Log is enabled, circular buffer is used to store PPL data. Each write to RAID device causes overwrite of data in this buffer so some write_hint can be set to those request to help drives handle garbage collection. This patch adds new sysfs attribute which can be used to specify which write_hint should be assigned to PPL. Acked-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Mariusz Dabrowski <mariusz.dabrowski@intel.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2019-03-12md: convert to kvmallocKent Overstreet
The code really just wants a big flat buffer, so just do that. Link: http://lkml.kernel.org/r/20181217131929.11727-3-kent.overstreet@gmail.com Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Cc: Shaohua Li <shli@kernel.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Eric Paris <eparis@parisplace.org> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Pravin B Shelar <pshelar@ovn.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-28md/raid5: fix 'out of memory' during raid cache recoveryAlexei Naberezhnov
This fixes the case when md array assembly fails because of raid cache recovery unable to allocate a stripe, despite attempts to replay stripes and increase cache size. This happens because stripes released by r5c_recovery_replay_stripes and raid5_set_cache_size don't become available for allocation immediately. Released stripes first are placed on conf->released_stripes list and require md thread to merge them on conf->inactive_list before they can be allocated. Patch allows final allocation attempt during cache recovery to wait for new stripes to become availabe for allocation. Cc: linux-raid@vger.kernel.org Cc: Shaohua Li <shli@kernel.org> Cc: linux-stable <stable@vger.kernel.org> # 4.10+ Fixes: b4c625c67362 ("md/r5cache: r5cache recovery: part 1") Signed-off-by: Alexei Naberezhnov <anaberezhnov@fb.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2018-09-28raid5: block failing device if raid will be failedMariusz Tkaczyk
Currently there is an inconsistency for failing the member drives for arrays with different RAID levels. For RAID456 - there is a possibility to fail all of the devices. However - for other RAID levels - kernel blocks removing the member drive, if the operation results in array's FAIL state (EBUSY is returned). For example - removing last drive from RAID1 is not possible. This kind of blocker was never implemented for raid456 and we cannot see the reason why. We had tested following patch and did not observe any regression, so do you have any comments/reasons for current approach, or we can send the proper patch for this? Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2018-08-31md/raid5-cache: disable reshape completelyShaohua Li
We don't support reshape yet if an array supports log device. Previously we determine the fact by checking ->log. However, ->log could be NULL after a log device is removed, but the array is still marked to support log device. Don't allow reshape in this case too. User can disable log device support by setting 'consistency_policy' to 'resync' then do reshape. Reported-by: Xiao Ni <xni@redhat.com> Tested-by: Xiao Ni <xni@redhat.com> Signed-off-by: Shaohua Li <shli@fb.com>
2018-08-18Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull input updates from Dmitry Torokhov: - a new driver for Rohm BU21029 touch controller - new bitmap APIs: bitmap_alloc, bitmap_zalloc and bitmap_free - updates to Atmel, eeti. pxrc and iforce drivers - assorted driver cleanups and fixes. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (57 commits) MAINTAINERS: Add PhoenixRC Flight Controller Adapter Input: do not use WARN() in input_alloc_absinfo() Input: mark expected switch fall-throughs Input: raydium_i2c_ts - use true and false for boolean values Input: evdev - switch to bitmap API Input: gpio-keys - switch to bitmap_zalloc() Input: elan_i2c_smbus - cast sizeof to int for comparison bitmap: Add bitmap_alloc(), bitmap_zalloc() and bitmap_free() md: Avoid namespace collision with bitmap API dm: Avoid namespace collision with bitmap API Input: pm8941-pwrkey - add resin entry Input: pm8941-pwrkey - abstract register offsets and event code Input: iforce - reorganize joystick configuration lists Input: atmel_mxt_ts - move completion to after config crc is updated Input: atmel_mxt_ts - don't report zero pressure from T9 Input: atmel_mxt_ts - zero terminate config firmware file Input: atmel_mxt_ts - refactor config update code to add context struct Input: atmel_mxt_ts - config CRC may start at T71 Input: atmel_mxt_ts - remove unnecessary debug on ENOMEM Input: atmel_mxt_ts - remove duplicate setup of ABS_MT_PRESSURE ...
2018-08-02md/raid5: fix data corruption of replacements after originals droppedBingJing Chang
During raid5 replacement, the stripes can be marked with R5_NeedReplace flag. Data can be read from being-replaced devices and written to replacing spares without reading all other devices. (It's 'replace' mode. s.replacing = 1) If a being-replaced device is dropped, the replacement progress will be interrupted and resumed with pure recovery mode. However, existing stripes before being interrupted cannot read from the dropped device anymore. It prints lots of WARN_ON messages. And it results in data corruption because existing stripes write problematic data into its replacement device and update the progress. \# Erase disks (1MB + 2GB) dd if=/dev/zero of=/dev/sda bs=1MB count=2049 dd if=/dev/zero of=/dev/sdb bs=1MB count=2049 dd if=/dev/zero of=/dev/sdc bs=1MB count=2049 dd if=/dev/zero of=/dev/sdd bs=1MB count=2049 mdadm -C /dev/md0 -amd -R -l5 -n3 -x0 /dev/sd[abc] -z 2097152 \# Ensure array stores non-zero data dd if=/root/data_4GB.iso of=/dev/md0 bs=1MB \# Start replacement mdadm /dev/md0 -a /dev/sdd mdadm /dev/md0 --replace /dev/sda Then, Hot-plug out /dev/sda during recovery, and wait for recovery done. echo check > /sys/block/md0/md/sync_action cat /sys/block/md0/md/mismatch_cnt # it will be greater than 0. Soon after you hot-plug out /dev/sda, you will see many WARN_ON messages. The replacement recovery will be interrupted shortly. After the recovery finishes, it will result in data corruption. Actually, it's just an unhandled case of replacement. In commit <f94c0b6658c7> (md/raid5: fix interaction of 'replace' and 'recovery'.), if a NeedReplace device is not UPTODATE then that is an error, the commit just simply print WARN_ON but also mark these corrupted stripes with R5_WantReplace. (it means it's ready for writes.) To fix this case, we can leverage 'sync and replace' mode mentioned in commit <9a3e1101b827> (md/raid5: detect and handle replacements during recovery.). We can add logics to detect and use 'sync and replace' mode for these stripes. Reported-by: Alex Chen <alexchen@synology.com> Reviewed-by: Alex Wu <alexwu@synology.com> Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com> Signed-off-by: BingJing Chang <bingjingc@synology.com> Signed-off-by: Shaohua Li <shli@fb.com>
2018-08-01md: Avoid namespace collision with bitmap APIAndy Shevchenko
bitmap API (include/linux/bitmap.h) has 'bitmap' prefix for its methods. On the other hand MD bitmap API is special case. Adding 'md' prefix to it to avoid name space collision. No functional changes intended. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Shaohua Li <shli@kernel.org> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>