summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2020-07-25bcache: avoid nr_stripes overflow in bcache_device_init()Coly Li
For some block devices which large capacity (e.g. 8TB) but small io_opt size (e.g. 8 sectors), in bcache_device_init() the stripes number calcu- lated by, DIV_ROUND_UP_ULL(sectors, d->stripe_size); might be overflow to the unsigned int bcache_device->nr_stripes. This patch uses the uint64_t variable to store DIV_ROUND_UP_ULL() and after the value is checked to be available in unsigned int range, sets it to bache_device->nr_stripes. Then the overflow is avoided. Reported-and-tested-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Link: https://bugzilla.redhat.com/show_bug.cgi?id=1783075 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25bcache: Use struct_size() in kzalloc()Gustavo A. R. Silva
Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes. This code was detected with the help of Coccinelle and, audited and fixed manually. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25bcache: movinggc: Use struct_size() helper in kzalloc()Gustavo A. R. Silva
Make use of the struct_size() helper instead of an open-coded version in order to avoid any potential type mistakes. This code was detected with the help of Coccinelle and, audited and fixed manually. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25bcache: writeback: Remove unneeded variable iXu Wang
Remove unneeded variable i in bch_dirty_init_thread(). Signed-off-by: Xu Wang <vulab@iscas.ac.cn> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25bcache: journel: use for_each_clear_bit() to simplify the codeXu Wang
Using for_each_clear_bit() to simplify the code. Signed-off-by: Xu Wang <vulab@iscas.ac.cn> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25bcache: allocate meta data pages as compound pagesColy Li
There are some meta data of bcache are allocated by multiple pages, and they are used as bio bv_page for I/Os to the cache device. for example cache_set->uuids, cache->disk_buckets, journal_write->data, bset_tree->data. For such meta data memory, all the allocated pages should be treated as a single memory block. Then the memory management and underlying I/O code can treat them more clearly. This patch adds __GFP_COMP flag to all the location allocating >0 order pages for the above mentioned meta data. Then their pages are treated as compound pages now. Signed-off-by: Coly Li <colyli@suse.de> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25bcache: Fix typo in Kconfig nameJean Delvare
registraion -> registration Fixes: 0c8d3fceade2 ("bcache: configure the asynchronous registertion to be experimental") Signed-off-by: Jean Delvare <jdelvare@suse.de> Reviewed-by: Coly Li <colyli@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-23dm integrity: fix integrity recalculation that is improperly skippedMikulas Patocka
Commit adc0daad366b62ca1bce3e2958a40b0b71a8b8b3 ("dm: report suspended device during destroy") broke integrity recalculation. The problem is dm_suspended() returns true not only during suspend, but also during resume. So this race condition could occur: 1. dm_integrity_resume calls queue_work(ic->recalc_wq, &ic->recalc_work) 2. integrity_recalc (&ic->recalc_work) preempts the current thread 3. integrity_recalc calls if (unlikely(dm_suspended(ic->ti))) goto unlock_ret; 4. integrity_recalc exits and no recalculating is done. To fix this race condition, add a function dm_post_suspending that is only true during the postsuspend phase and use it instead of dm_suspended(). Signed-off-by: Mikulas Patocka <mpatocka redhat com> Fixes: adc0daad366b ("dm: report suspended device during destroy") Cc: stable vger kernel org # v4.18+ Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-22md/raid5: use do_div() for 64 bit divisions in raid5_sync_requestYufen Yu
We get compilation error on 32-bit architectures (e.g. m68k), as: ERROR: modpost: "__udivdi3" [drivers/md/raid456.ko] undefined! Since 'sync_blocks' is defined as u64, use do_div() to fix this error. Fixes: c911c46c017c ("md/raid456: convert macro STRIPE_* to RAID5_STRIPE_*") Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-22md/raid10: avoid deadlock on recovery.Vitaly Mayatskikh
When disk failure happens and the array has a spare drive, resync thread kicks in and starts to refill the spare. However it may get blocked by a retry thread that resubmits failed IO to a mirror and itself can get blocked on a barrier raised by the resync thread. Acked-by: Nigel Croxon <ncroxon@redhat.com> Signed-off-by: Vitaly Mayatskikh <vmayatskikh@digitalocean.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-21md-cluster: fix rmmod issue when md_cluster convert bitmap to noneZhao Heming
update_array_info misses calling module_put when removing cluster bitmap. steps to reproduce: ``` node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda /dev/sdb mdadm: array /dev/md0 started. node1 # lsmod | egrep "dlm|md_|raid1" md_cluster 28672 1 dlm 212992 14 md_cluster configfs 57344 2 dlm raid1 53248 1 md_mod 176128 2 raid1,md_cluster node1 # mdadm -G /dev/md0 -b none node1 # lsmod | egrep "dlm|md_|raid1" md_cluster 28672 1 <== should be zero dlm 212992 9 md_cluster configfs 57344 2 dlm raid1 53248 1 md_mod 176128 2 raid1,md_cluster node1 # mdadm -G /dev/md0 -b clustered node1 # lsmod | egrep "dlm|md_|raid1" md_cluster 28672 2 <== increase dlm 212992 14 md_cluster configfs 57344 2 dlm raid1 53248 1 md_mod 176128 2 raid1,md_cluster node1 # mdadm -G /dev/md0 -b none node1 # mdadm -G /dev/md0 -b clustered node1 # lsmod | egrep "dlm|md_|raid1" md_cluster 28672 3 <== increase dlm 212992 14 md_cluster configfs 57344 2 dlm raid1 53248 1 md_mod 176128 2 raid1,md_cluster ``` Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-21md-cluster: fix safemode_delay value when converting to clustered bitmapZhao Heming
When array convert to clustered bitmap, the safe_mode_delay doesn't clean and vice versa. the /sys/block/mdX/md/safe_mode_delay keep original value after changing bitmap type. In safe_delay_store(), the code forbids setting mddev->safemode_delay when array is clustered. So in cluster-md env, the expected safemode_delay value should be 0. Reproducible steps: ``` node1 # mdadm --zero-superblock /dev/sd{b,c,d} node1 # mdadm -C /dev/md0 -b internal -e 1.2 -n 2 -l mirror /dev/sdb /dev/sdc node1 # cat /sys/block/md0/md/safe_mode_delay 0.204 node1 # mdadm -G /dev/md0 -b none node1 # mdadm --grow /dev/md0 --bitmap=clustered node1 # cat /sys/block/md0/md/safe_mode_delay 0.204 <== doesn't change, should ZERO for cluster-md node1 # mdadm --zero-superblock /dev/sd{b,c,d} node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sdb /dev/sdc node1 # cat /sys/block/md0/md/safe_mode_delay 0.000 node1 # mdadm -G /dev/md0 -b none node1 # cat /sys/block/md0/md/safe_mode_delay 0.000 <== doesn't change, should default value ``` Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-21md/raid5: support config stripe_size by sysfs entryYufen Yu
Adding a new 'stripe_size' sysfs entry to set and show stripe_size. stripe_size should not be bigger than PAGE_SIZE, and it requires to be multiple of 4096. We can adjust stripe_size by writing value into sysfs entry, likely, set stripe_size as 16KB: echo 16384 > /sys/block/md1/md/stripe_size Show current stripe_size value: cat /sys/block/md1/md/stripe_size For PAGE_SIZE is equal to 4096, 'stripe_size' can just be read. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-21md/raid5: set default stripe_size as 4096Yufen Yu
In RAID5, if issued bio size is bigger than stripe_size, it will be split in the unit of stripe_size and process them one by one. Even for size less then stripe_size, RAID5 also request data from disk at least of stripe_size. Nowdays, stripe_size is equal to the value of PAGE_SIZE. Since filesystem usually issue bio in the unit of 4KB, there is no problem for PAGE_SIZE as 4KB. But, for 64KB PAGE_SIZE, bio from filesystem requests 4KB data while RAID5 issue IO at least stripe_size (64KB) each time. That will waste resource of disk bandwidth and compute xor. To avoding the waste, we want to make stripe_size configurable. This patch just set default stripe_size as 4096. User can also set the value bigger than 4KB for some special requirements, such as we know the issued io size is more than 4KB. To evaluate the new feature, we create raid5 device '/dev/md5' with 4 SSD disk and test it on arm64 machine with 64KB PAGE_SIZE. 1) We format /dev/md5 with mkfs.ext4 and mount ext4 with default configure on /mnt directory. Then, trying to test it by dbench with command: dbench -D /mnt -t 1000 10. Result show as: 'stripe_size = 64KB' Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 9805011 0.021 64.728 Close 7202525 0.001 0.120 Rename 415213 0.051 44.681 Unlink 1980066 0.079 93.147 Deltree 240 1.793 6.516 Mkdir 120 0.004 0.007 Qpathinfo 8887512 0.007 37.114 Qfileinfo 1557262 0.001 0.030 Qfsinfo 1629582 0.012 0.152 Sfileinfo 798756 0.040 57.641 Find 3436004 0.019 57.782 WriteX 4887239 0.021 57.638 ReadX 15370483 0.005 37.818 LockX 31934 0.003 0.022 UnlockX 31933 0.001 0.021 Flush 687205 13.302 530.088 Throughput 307.799 MB/sec 10 clients 10 procs max_latency=530.091 ms ------------------------------------------------------- 'stripe_size = 4KB' Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 11999166 0.021 36.380 Close 8814128 0.001 0.122 Rename 508113 0.051 29.169 Unlink 2423242 0.070 38.141 Deltree 300 1.885 7.155 Mkdir 150 0.004 0.006 Qpathinfo 10875921 0.007 35.485 Qfileinfo 1905837 0.001 0.032 Qfsinfo 1994304 0.012 0.125 Sfileinfo 977450 0.029 26.489 Find 4204952 0.019 9.361 WriteX 5981890 0.019 27.804 ReadX 18809742 0.004 33.491 LockX 39074 0.003 0.025 UnlockX 39074 0.001 0.014 Flush 841022 10.712 458.848 Throughput 376.777 MB/sec 10 clients 10 procs max_latency=458.852 ms ------------------------------------------------------- It show that setting stripe_size as 4KB has higher thoughput, i.e. (376.777 vs 307.799) and has smaller latency than that setting as 64KB. 2) We try to evaluate IO throughput for /dev/md5 by fio with config: [4KB randwrite] direct=1 numjob=2 iodepth=64 ioengine=libaio filename=/dev/md5 bs=4KB rw=randwrite [64KB write] direct=1 numjob=2 iodepth=64 ioengine=libaio filename=/dev/md5 bs=1MB rw=write The result as follow: + + | stripe_size(64KB) | stripe_size(4KB) +----------------------------------------------------+ 4KB randwrite | 15MB/s | 100MB/s +----------------------------------------------------+ 1MB write | 1000MB/s | 700MB/s The result show that when size of io is bigger than 4KB (64KB), 64KB stripe_size has much higher IOPS. But for 4KB randwrite, that means, size of io issued to device are smaller, 4KB stripe_size have better performance. Normally, default value (4096) can get relatively good performance. But if each issued io is bigger than 4096, setting value more than 4096 may get better performance. Here, we just set default stripe_size as 4096, and we will try to support setting different stripe_size by sysfs interface in the following patch. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-21md/raid456: convert macro STRIPE_* to RAID5_STRIPE_*Yufen Yu
Convert macro STRIPE_SIZE, STRIPE_SECTORS and STRIPE_SHIFT to RAID5_STRIPE_SIZE(), RAID5_STRIPE_SECTORS() and RAID5_STRIPE_SHIFT(). This patch is prepare for the following adjustable stripe_size. It will not change any existing functionality. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-20block: remove bdev_stack_limitsChristoph Hellwig
This function is just a tiny wrapper around blk_stack_limit and has two callers. Simplify the stack a bit by open coding it in the two callers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Tested-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-20block: inherit the zoned characteristics in blk_stack_limitsChristoph Hellwig
Lift the code from device mapper into blk_stack_limits to inherity the stacking limitations. This ensures we do the right thing for all stacked zoned block devices. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Tested-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-20Merge branch 'for-5.9/drivers' into for-5.9/block-mergeJens Axboe
* for-5.9/drivers: (38 commits) block: add max_active_zones to blk-sysfs block: add max_open_zones to blk-sysfs s390/dasd: Use struct_size() helper s390/dasd: fix inability to use DASD with DIAG driver md-cluster: fix wild pointer of unlock_all_bitmaps() md/raid5-cache: clear MD_SB_CHANGE_PENDING before flushing stripes md: fix deadlock causing by sysfs_notify md: improve io stats accounting md: raid0/linear: fix dereference before null check on pointer mddev rsxx: switch from 'pci_free_consistent()' to 'dma_free_coherent()' nvme: remove ns->disk checks nvme-pci: use standard block status symbolic names nvme-pci: use the consistent return type of nvme_pci_iod_alloc_size() nvme-pci: add a blank line after declarations nvme-pci: fix some comments issues nvme-pci: remove redundant segment validation nvme: document quirked Intel models nvme: expose reconnect_delay and ctrl_loss_tmo via sysfs nvme: support for zoned namespaces nvme: support for multiple Command Sets Supported and Effects log pages ...
2020-07-20Merge branch 'for-5.9/block' into for-5.9/block-mergeJens Axboe
* for-5.9/block: (124 commits) blk-cgroup: show global disk stats in root cgroup io.stat blk-cgroup: make iostat functions visible to stat printing block: improve discard bio alignment in __blkdev_issue_discard() block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers block: defer flush request no matter whether we have elevator block: make blk_timeout_init() static block: remove retry loop in ioc_release_fn() block: remove unnecessary ioc nested locking block: integrate bd_start_claiming into __blkdev_get block: use bd_prepare_to_claim directly in the loop driver block: refactor bd_start_claiming block: simplify the restart case in __blkdev_get Revert "blk-rq-qos: remove redundant finish_wait to rq_qos_wait." block: always remove partitions from blk_drop_partitions() block: relax jiffies rounding for timeouts blk-mq: remove redundant validation in __blk_mq_end_request() blk-mq: Remove unnecessary local variable writeback: remove bdi->congested_fn writeback: remove struct bdi_writeback_congested writeback: remove {set,clear}_wb_congested ...
2020-07-20dm crypt: Enable zoned block device supportDamien Le Moal
Enable support for zoned block devices. This is done by: 1) implementing the target report_zones method. 2) adding the DM_TARGET_ZONED_HM flag to the target features. 3) setting DM_CRYPT_NO_WRITE_WORKQUEUE flag to avoid IO processing via workqueue. 4) Introducing inline write encryption completion to preserve write ordering. The last point is implemented by introducing the internal flag DM_CRYPT_WRITE_INLINE. When set, kcryptd_crypt_write_convert() always waits inline for the completion of a write request encryption if the request is not already completed once crypt_convert() returns. Completion of write request encryption is signaled using the restart completion by kcryptd_async_done(). This mechanism allows using ciphers that have an asynchronous implementation, isolating dm-crypt from any potential request completion reordering for these ciphers. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-20dm crypt: add flags to optionally bypass kcryptd workqueuesIgnat Korchagin
This is a follow up to [1] that detailed latency problems associated with dm-crypt's use of workqueues when processing IO. Current dm-crypt implementation creates a significant IO performance overhead (at least on small IO block sizes) for both latency and throughput. We suspect offloading IO request processing into workqueues and async threads is more harmful these days with the modern fast storage. I also did some digging into the dm-crypt git history and much of this async processing is not needed anymore, because the reasons it was added are mostly gone from the kernel. More details can be found in [2] (see "Git archeology" section). This change adds DM_CRYPT_NO_READ_WORKQUEUE and DM_CRYPT_NO_WRITE_WORKQUEUE flags for read and write BIOs, which direct dm-crypt to not offload crypto operations into kcryptd workqueues. In addition, writes are not buffered to be sorted in the dm-crypt red-black tree, but dispatched immediately. For cases, where crypto operations cannot happen (hard interrupt context, for example the read path of some NVME drivers), we offload the work to a tasklet rather than a workqueue. These flags only ensure no async BIO processing in the dm-crypt module. It is worth noting that some Crypto API implementations may offload encryption into their own workqueues, which are independent of the dm-crypt and its configuration. However upon enabling these new flags dm-crypt will instruct Crypto API not to backlog crypto requests. To give an idea of the performance gains for certain workloads, consider the script, and results when tested against various devices, detailed here: https://www.redhat.com/archives/dm-devel/2020-July/msg00138.html [1]: https://www.spinics.net/lists/dm-crypt/msg07516.html [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/ Signed-off-by: Ignat Korchagin <ignat@cloudflare.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-20dm bufio: do buffer cleanup from a workqueueMikulas Patocka
Until now, DM bufio's waiting for IO from reclaim context in its shrinker has caused kswapd to block; which results in systemic IO stalls and even deadlock, e.g.: https://www.redhat.com/archives/dm-devel/2020-March/msg00025.html Here is Dave Chinner's problem description that motivated this fix, from: https://lore.kernel.org/linux-fsdevel/20190809215733.GZ7777@dread.disaster.area/ "Waiting for IO in kswapd reclaim context is considered harmful - kswapd context shrinker reclaim should be as non-blocking as possible, and any back-off to wait for IO to complete should be done by the high level reclaim core once it's completed an entire reclaim scan cycle of everything.... What follows from that, and is pertinent in this situation, is that if you don't block kswapd, then other reclaim contexts are not going to get stuck waiting for it regardless of the reclaim context they use." Continued elsewhere: "The only way to fix this problem once and for all is to stop using the shrinker as a mechanism to issue and wait on IO. If you need background writeback of dirty buffers, do it from a WQ_MEM_RECLAIM workqueue that isn't directly in the memory reclaim path and so can issue writeback and block safely from a GFP_KERNEL context. Kick the workqueue from the shrinker context, but get rid of the IO submission and waiting from the shrinker and all the GFP_NOFS memory reclaim recursion problems go away." As such, this commit moves buffer cleanup to a workqueue. Suggested-by: Dave Chinner <dchinner@redhat.com> Reported-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Tested-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-20dm rq: don't call blk_mq_queue_stopped() in dm_stop_queue()Ming Lei
dm_stop_queue() only uses blk_mq_quiesce_queue() so it doesn't formally stop the blk-mq queue; therefore there is no point making the blk_mq_queue_stopped() check -- it will never be stopped. In addition, even though dm_stop_queue() actually tries to quiesce hw queues via blk_mq_quiesce_queue(), checking with blk_queue_quiesced() to avoid unnecessary queue quiesce isn't reliable because: the QUEUE_FLAG_QUIESCED flag is set before synchronize_rcu() and dm_stop_queue() may be called when synchronize_rcu() from another blk_mq_quiesce_queue() is in-progress. Fixes: 7b17c2f7292ba ("dm: Fix a race condition related to stopping and starting queues") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-20dm dust: add interface to list all badblocksyangerkun
This interface may help anyone who want to know all badblocks without querying for each block. [Bryan: DMEMIT message if no blocks are in the bad block list.] Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Bryan Gurney <bgurney@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-20dm dust: report some message results directly back to useryangerkun
Some messages (queryblock, countbadblocks, removebadblock) are best reported directly to user directly. Do so with DMEMIT. [Bryan: maintain __func__ output in DMEMIT messages] Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Bryan Gurney <bgurney@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-16treewide: Remove uninitialized_var() usageKees Cook
Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \ xargs perl -pi -e \ 's/\buninitialized_var\(([^\)]+)\)/\1/g; s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/ [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/ [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-16raid5: remove the meaningless check in raid5_make_requestGuoqing Jiang
We can't guarntee the batched stripe to be set with STRIPE_HANDLE since there are lots of functions could set the flag, such as sync_request, ops_complete_* and end_{read,write}_request etc. Also clear_batch_ready called in handle_stripe ensures the batched list can't continue to be handled by handle_stripe. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-16raid5: put the comment of clear_batch_ready to the right placeGuoqing Jiang
To make people understand the function well, let's put the comment to the right place. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-16raid5: call clear_batch_ready before set STRIPE_ACTIVEGuoqing Jiang
We tried to only put the head sh of batch list to handle_list, then the handle_stripe doesn't handle other members in the batch list. However, we still got the calltrace in break_stripe_batch_list. [593764.644269] stripe state: 2003 kernel: [593764.644299] ------------[ cut here ]------------ kernel: [593764.644308] WARNING: CPU: 12 PID: 856 at drivers/md/raid5.c:4625 break_stripe_batch_list+0x203/0x240 [raid456] [...] kernel: [593764.644363] Call Trace: kernel: [593764.644370] handle_stripe+0x907/0x20c0 [raid456] kernel: [593764.644376] ? __wake_up_common_lock+0x89/0xc0 kernel: [593764.644379] handle_active_stripes.isra.57+0x35f/0x570 [raid456] kernel: [593764.644382] ? raid5_wakeup_stripe_thread+0x96/0x1f0 [raid456] kernel: [593764.644385] raid5d+0x480/0x6a0 [raid456] kernel: [593764.644390] ? md_thread+0x11f/0x160 kernel: [593764.644392] md_thread+0x11f/0x160 kernel: [593764.644394] ? wait_woken+0x80/0x80 kernel: [593764.644396] kthread+0xfc/0x130 kernel: [593764.644398] ? find_pers+0x70/0x70 kernel: [593764.644399] ? kthread_create_on_node+0x70/0x70 kernel: [593764.644401] ret_from_fork+0x1f/0x30 As we can see, the stripe was set with STRIPE_ACTIVE and STRIPE_HANDLE, and only handle_stripe could set those flags then return. And since the stipe was already in the batch list, we need to return earlier before set the two flags. And after dig a little about git history especially commit 3664847d95e6 ("md/raid5: fix a race condition in stripe batch"), it seems the batched stipe still could be handled by handle_stipe, then handle_stipe needs to return earlier if clear_batch_ready to return true. Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-16md: rewrite md_setup_drive to avoid ioctlsChristoph Hellwig
md_setup_drive knows it works with md devices, so it is rather pointless to open a file descriptor and issue ioctls. Just call directly into the relevant low-level md routines after getting a handle to the device using blkdev_get_by_dev instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: NeilBrown <neilb@suse.de> Acked-by: Song Liu <song@kernel.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-16md: simplify md_setup_driveChristoph Hellwig
Move the loop over the possible arrays into the caller to remove a level of indentation for the whole function. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: NeilBrown <neilb@suse.de> Acked-by: Song Liu <song@kernel.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-16md: remove the kernel version of md_u.hChristoph Hellwig
mdp_major can just move to drivers/md/md.h. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-16md: remove the autoscan partition re-readChristoph Hellwig
devfs is long gone, and autoscan works just fine without this these days. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: NeilBrown <neilb@suse.de> Acked-by: Song Liu <song@kernel.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-16md: replace the RAID_AUTORUN ioctl with a direct function callChristoph Hellwig
Instead of using a spcial RAID_AUTORUN ioctl that only exists for non-modular builds and is only called from the early init code, just call the actual function directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: NeilBrown <neilb@suse.de> Acked-by: Song Liu <song@kernel.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-16md: move the early init autodetect code to drivers/md/Christoph Hellwig
Just like the NFS and CIFS root code this better lives with the driver it is tightly integrated with. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Song Liu <song@kernel.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-07-15md: raid10: Fix compilation warningDamien Le Moal
Remove the if statement around the call to sysfs_link_rdev() in raid10_start_reshape() to avoid the compilation warning: warning: suggest braces around empty body in an ‘if’ statement when compiling with W=1. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-15md: raid5: Fix compilation warningDamien Le Moal
Remove the if statement around the calls to sysfs_link_rdev() to avoid the compilation warning "suggest braces around empty body in an ‘if’ statement" when compiling with W=1. Also fix function description comments to avoid kdoc format warnings. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-15md: raid5-cache: Remove set but unused variableDamien Le Moal
Remove the variable offset in r5c_tree_index() to avoid a "set but not used" compilation warning when compiling with W=1. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-15md: Fix compilation warningDamien Le Moal
Remove the if statement around the calls to sysfs_link_rdev() to avoid the compilation warnings: warning: suggest braces around empty body in an ‘if’ statement when compiling with W=1. For the call to sysfs_create_link() generating the same warning, use the err variable to store the function result, avoiding triggering another warning as the function is declared as 'warn_unused_result'. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-16libnvdimm/nvdimm/flush: Allow architecture to override the flush barrierAneesh Kumar K.V
Architectures like ppc64 provide persistent memory specific barriers that will ensure that all stores for which the modifications are written to persistent storage by preceding dcbfps and dcbstps instructions have updated persistent storage before any data access or data transfer caused by subsequent instructions is initiated. This is in addition to the ordering done by wmb() Update nvdimm core such that architecture can use barriers other than wmb to ensure all previous writes are architecturally visible for the platform buffer flush. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200701072235.223558-5-aneesh.kumar@linux.ibm.com
2020-07-14md-cluster: fix wild pointer of unlock_all_bitmaps()Zhao Heming
reproduction steps: ``` node1 # mdadm -C /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda /dev/sdb node2 # mdadm -A /dev/md0 /dev/sda /dev/sdb node1 # mdadm -G /dev/md0 -b none mdadm: failed to remove clustered bitmap. node1 # mdadm -S --scan ^C <==== mdadm hung & kernel crash ``` kernel stack: ``` [ 335.230657] general protection fault: 0000 [#1] SMP NOPTI [...] [ 335.230848] Call Trace: [ 335.230873] ? unlock_all_bitmaps+0x5/0x70 [md_cluster] [ 335.230886] unlock_all_bitmaps+0x3d/0x70 [md_cluster] [ 335.230899] leave+0x10f/0x190 [md_cluster] [ 335.230932] ? md_super_wait+0x93/0xa0 [md_mod] [ 335.230947] ? leave+0x5/0x190 [md_cluster] [ 335.230973] md_cluster_stop+0x1a/0x30 [md_mod] [ 335.230999] md_bitmap_free+0x142/0x150 [md_mod] [ 335.231013] ? _cond_resched+0x15/0x40 [ 335.231025] ? mutex_lock+0xe/0x30 [ 335.231056] __md_stop+0x1c/0xa0 [md_mod] [ 335.231083] do_md_stop+0x160/0x580 [md_mod] [ 335.231119] ? 0xffffffffc05fb078 [ 335.231148] md_ioctl+0xa04/0x1930 [md_mod] [ 335.231165] ? filename_lookup+0xf2/0x190 [ 335.231179] blkdev_ioctl+0x93c/0xa10 [ 335.231205] ? _cond_resched+0x15/0x40 [ 335.231214] ? __check_object_size+0xd4/0x1a0 [ 335.231224] block_ioctl+0x39/0x40 [ 335.231243] do_vfs_ioctl+0xa0/0x680 [ 335.231253] ksys_ioctl+0x70/0x80 [ 335.231261] __x64_sys_ioctl+0x16/0x20 [ 335.231271] do_syscall_64+0x65/0x1f0 [ 335.231278] entry_SYSCALL_64_after_hwframe+0x44/0xa9 ``` Signed-off-by: Zhao Heming <heming.zhao@suse.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-14md/raid5-cache: clear MD_SB_CHANGE_PENDING before flushing stripesSong Liu
In recovery, if we process too much data, raid5-cache may set MD_SB_CHANGE_PENDING, which causes spinning in handle_stripe(). Fix this issue by clearing the bit before flushing data only stripes. This issue was initially discussed in [1]. [1] https://www.spinics.net/lists/raid/msg64409.html Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-14md: fix deadlock causing by sysfs_notifyJunxiao Bi
The following deadlock was captured. The first process is holding 'kernfs_mutex' and hung by io. The io was staging in 'r1conf.pending_bio_list' of raid1 device, this pending bio list would be flushed by second process 'md127_raid1', but it was hung by 'kernfs_mutex'. Using sysfs_notify_dirent_safe() to replace sysfs_notify() can fix it. There were other sysfs_notify() invoked from io path, removed all of them. PID: 40430 TASK: ffff8ee9c8c65c40 CPU: 29 COMMAND: "probe_file" #0 [ffffb87c4df37260] __schedule at ffffffff9a8678ec #1 [ffffb87c4df372f8] schedule at ffffffff9a867f06 #2 [ffffb87c4df37310] io_schedule at ffffffff9a0c73e6 #3 [ffffb87c4df37328] __dta___xfs_iunpin_wait_3443 at ffffffffc03a4057 [xfs] #4 [ffffb87c4df373a0] xfs_iunpin_wait at ffffffffc03a6c79 [xfs] #5 [ffffb87c4df373b0] __dta_xfs_reclaim_inode_3357 at ffffffffc039a46c [xfs] #6 [ffffb87c4df37400] xfs_reclaim_inodes_ag at ffffffffc039a8b6 [xfs] #7 [ffffb87c4df37590] xfs_reclaim_inodes_nr at ffffffffc039bb33 [xfs] #8 [ffffb87c4df375b0] xfs_fs_free_cached_objects at ffffffffc03af0e9 [xfs] #9 [ffffb87c4df375c0] super_cache_scan at ffffffff9a287ec7 #10 [ffffb87c4df37618] shrink_slab at ffffffff9a1efd93 #11 [ffffb87c4df37700] shrink_node at ffffffff9a1f5968 #12 [ffffb87c4df37788] do_try_to_free_pages at ffffffff9a1f5ea2 #13 [ffffb87c4df377f0] try_to_free_mem_cgroup_pages at ffffffff9a1f6445 #14 [ffffb87c4df37880] try_charge at ffffffff9a26cc5f #15 [ffffb87c4df37920] memcg_kmem_charge_memcg at ffffffff9a270f6a #16 [ffffb87c4df37958] new_slab at ffffffff9a251430 #17 [ffffb87c4df379c0] ___slab_alloc at ffffffff9a251c85 #18 [ffffb87c4df37a80] __slab_alloc at ffffffff9a25635d #19 [ffffb87c4df37ac0] kmem_cache_alloc at ffffffff9a251f89 #20 [ffffb87c4df37b00] alloc_inode at ffffffff9a2a2b10 #21 [ffffb87c4df37b20] iget_locked at ffffffff9a2a4854 #22 [ffffb87c4df37b60] kernfs_get_inode at ffffffff9a311377 #23 [ffffb87c4df37b80] kernfs_iop_lookup at ffffffff9a311e2b #24 [ffffb87c4df37ba8] lookup_slow at ffffffff9a290118 #25 [ffffb87c4df37c10] walk_component at ffffffff9a291e83 #26 [ffffb87c4df37c78] path_lookupat at ffffffff9a293619 #27 [ffffb87c4df37cd8] filename_lookup at ffffffff9a2953af #28 [ffffb87c4df37de8] user_path_at_empty at ffffffff9a295566 #29 [ffffb87c4df37e10] vfs_statx at ffffffff9a289787 #30 [ffffb87c4df37e70] SYSC_newlstat at ffffffff9a289d5d #31 [ffffb87c4df37f18] sys_newlstat at ffffffff9a28a60e #32 [ffffb87c4df37f28] do_syscall_64 at ffffffff9a003949 #33 [ffffb87c4df37f50] entry_SYSCALL_64_after_hwframe at ffffffff9aa001ad RIP: 00007f617a5f2905 RSP: 00007f607334f838 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 00007f6064044b20 RCX: 00007f617a5f2905 RDX: 00007f6064044b20 RSI: 00007f6064044b20 RDI: 00007f6064005890 RBP: 00007f6064044aa0 R8: 0000000000000030 R9: 000000000000011c R10: 0000000000000013 R11: 0000000000000246 R12: 00007f606417e6d0 R13: 00007f6064044aa0 R14: 00007f6064044b10 R15: 00000000ffffffff ORIG_RAX: 0000000000000006 CS: 0033 SS: 002b PID: 927 TASK: ffff8f15ac5dbd80 CPU: 42 COMMAND: "md127_raid1" #0 [ffffb87c4df07b28] __schedule at ffffffff9a8678ec #1 [ffffb87c4df07bc0] schedule at ffffffff9a867f06 #2 [ffffb87c4df07bd8] schedule_preempt_disabled at ffffffff9a86825e #3 [ffffb87c4df07be8] __mutex_lock at ffffffff9a869bcc #4 [ffffb87c4df07ca0] __mutex_lock_slowpath at ffffffff9a86a013 #5 [ffffb87c4df07cb0] mutex_lock at ffffffff9a86a04f #6 [ffffb87c4df07cc8] kernfs_find_and_get_ns at ffffffff9a311d83 #7 [ffffb87c4df07cf0] sysfs_notify at ffffffff9a314b3a #8 [ffffb87c4df07d18] md_update_sb at ffffffff9a688696 #9 [ffffb87c4df07d98] md_update_sb at ffffffff9a6886d5 #10 [ffffb87c4df07da8] md_check_recovery at ffffffff9a68ad9c #11 [ffffb87c4df07dd0] raid1d at ffffffffc01f0375 [raid1] #12 [ffffb87c4df07ea0] md_thread at ffffffff9a680348 #13 [ffffb87c4df07f08] kthread at ffffffff9a0b8005 #14 [ffffb87c4df07f50] ret_from_fork at ffffffff9aa00344 Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-13md: improve io stats accountingArtur Paszkiewicz
Use generic io accounting functions to manage io stats. There was an attempt to do this earlier in commit 18c0b223cf99 ("md: use generic io stats accounting functions to simplify io stat accounting"), but it did not include a call to generic_end_io_acct() and caused issues with tracking in-flight IOs, so it was later removed in commit 74672d069b29 ("md: fix md io stats accounting broken"). This patch attempts to fix this by using both disk_start_io_acct() and disk_end_io_acct(). To make it possible, a struct md_io is allocated for every new md bio, which includes the io start_time. A new mempool is introduced for this purpose. We override bio->bi_end_io with our own callback and call disk_start_io_acct() before passing the bio to md_handle_request(). When it completes, we call disk_end_io_acct() and the original bi_end_io callback. This adds correct statistics about in-flight IOs and IO processing time, interpreted e.g. in iostat as await, svctm, aqu-sz and %util. It also fixes a situation where too many IOs where reported if a bio was re-submitted to the mddev, because io accounting is now performed only on newly arriving bios. Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-13md: raid0/linear: fix dereference before null check on pointer mddevColin Ian King
Pointer mddev is being dereferenced with a test_bit call before mddev is being null checked, this may cause a null pointer dereference. Fix this by moving the null pointer checks to sanity check mddev before it is dereferenced. Addresses-Coverity: ("Dereference before null check") Fixes: 62f7b1989c02 ("md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Guilherme G. Piccoli <gpiccoli@canonical.com> Signed-off-by: Song Liu <songliubraving@fb.com>
2020-07-13dm verity: add "panic_on_corruption" error handling modeJeongHyeon Lee
Samsung smart phones may need the ability to panic on corruption. Not all devices provide the bootloader support needed to use the existing "restart_on_corruption" mode. Additional details for why Samsung needs this new mode can be found here: https://www.redhat.com/archives/dm-devel/2020-June/msg00235.html Signed-off-by: jhs2.lee <jhs2.lee@samsung.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-13dm mpath: use double checked locking in fast pathMike Snitzer
Fast-path code biased toward lazy acknowledgement of bit being set (primarily only for initialization). Multipath code is very retry oriented so even if state is missed it'll recover. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-13dm mpath: rename current_pgpath to pgpath in multipath_prepare_ioctlMike Snitzer
Makes consistent with __map_bio() and multipath_clone_and_map(). Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-13dm mpath: rework __map_bio()Mike Snitzer
so that it follows same pattern as request-based multipath_clone_and_map() Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2020-07-13dm mpath: factor out multipath_queue_bioMike Snitzer
Enables further cleanup. Signed-off-by: Mike Snitzer <snitzer@redhat.com>